Human vs. Spam Filter Detection
The Problem
Email marketing engagement metrics (opens and clicks) are contaminated by automated spam filter activity. Spam filters automatically open emails and click links during their scanning process, inflating engagement statistics. Customers requested a product that could separate genuine human engagement from spam filter activity to provide accurate campaign performance metrics.
Binary Classification Problem
Built a supervised learning model to classify email engagement events as either human or spam filter interactions. The model needed to handle event sequences where both human and spam filter activity could be present.
Feature Engineering
Worked with the Deliverability team to identify features that distinguished human from spam filter behavior. Engineered features from engagement event data (a sketch follows the list), including:
- Time between open and click events
- User agent strings parsed into separate fields: browser name, version, privacy protection status, and spoofing indicators
- IP address provider identification
- Interaction sequence characteristics
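A minimal sketch of this kind of feature extraction, assuming a hypothetical event table with message_id, event_type, event_ts, and user_agent columns (the real schema and user-agent parsing rules are not reproduced here):

```python
import pandas as pd

# Hypothetical schema: one row per engagement event with columns
# message_id, event_type ('open' / 'click'), event_ts (timestamp),
# and user_agent. The file name is illustrative.
events = pd.read_parquet("engagement_events.parquet")

# Feature: seconds between the first open and the first click per message.
firsts = (
    events.sort_values("event_ts")
    .groupby(["message_id", "event_type"])["event_ts"]
    .first()
    .unstack("event_type")
)
firsts["open_to_click_secs"] = (
    (firsts["click"] - firsts["open"]).dt.total_seconds()
)

# Crude user-agent split into separate fields; the production pipeline used
# richer parsing (privacy protection status, spoofing indicators).
ua_parts = events["user_agent"].str.extract(
    r"^(?P<browser_name>[A-Za-z]+)/(?P<browser_version>[\d.]+)"
)
events = events.join(ua_parts)
```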
Class Imbalance and Training Set Curation
The dataset exhibited complex class imbalance issues:
- Human interactions outnumbered spam filter interactions overall
- Privacy-protected browsers dominated human interactions
- Specific browser versions dominated spam filter interactions
Curated the training set to limit over-represented classes and ensure sufficient representation of all interaction types. This curation was necessary for the model to learn the subtle differences between human and spam filter behaviors.
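As an illustration of the curation step, the sketch below caps each over-represented (label, stratum) cell by downsampling; the column names and cap value are hypothetical:

```python
import pandas as pd

# Hypothetical labeled set: a 'label' column (human / spam_filter) plus a
# 'stratum' column combining browser family and privacy-protection status.
labeled = pd.read_parquet("labeled_events.parquet")

CAP = 5_000  # illustrative ceiling per (label, stratum) cell

# Downsample over-represented cells while keeping every rare interaction
# type, so the model still sees the subtle behavioral differences.
curated = (
    labeled.groupby(["label", "stratum"], group_keys=False)
    .apply(lambda g: g.sample(n=min(len(g), CAP), random_state=0))
)
```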
Model Development and Selection
Evaluated multiple classification algorithms, including Random Forest, XGBoost, and CatBoost, using F1 score as the primary evaluation metric because it balances precision and recall. The Deliverability team reviewed model results and approved the production release.
The initial production model was a Random Forest, which performed best on the original training data. After the data annotation process and training set curation were refined, CatBoost outperformed Random Forest, and the production model was updated accordingly.
The final model achieved an F1 score of 0.92.
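The model comparison followed the usual scikit-learn cross-validation pattern, roughly as sketched below; the synthetic data and hyperparameters are stand-ins for the curated features and tuned production settings:

```python
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Synthetic, imbalanced stand-in for the curated feature matrix and labels
# (1 = spam filter, 0 = human).
X, y = make_classification(n_samples=2000, weights=[0.8], random_state=0)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "xgboost": XGBClassifier(eval_metric="logloss", random_state=0),
    "catboost": CatBoostClassifier(verbose=0, random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```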
Data Annotation Quality Control
Managed the data annotation process, in which interns labeled training examples, and built Snowflake dashboards to monitor annotation quality:
- Class distribution monitoring to ensure rarer classes were adequately represented
- Common annotation error identification and correction tracking
Snowflake queries were split into separate pipelines: one for pulling unannotated data for labeling, and one for aggregating annotated results.
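A sketch of the two query paths using the Snowflake Python Connector; connection parameters, table names (engagement_events, annotations), and columns are placeholders rather than the production schema:

```python
import snowflake.connector

# Connection parameters are placeholders; real values came from team config.
conn = snowflake.connector.connect(
    account="...", user="...", password="...",
    warehouse="...", database="...", schema="...",
)

# Pipeline 1: pull a batch of events that still lack annotations.
UNANNOTATED_SQL = """
    SELECT e.event_id, e.event_ts, e.user_agent, e.ip_address
    FROM engagement_events e
    LEFT JOIN annotations a ON a.event_id = e.event_id
    WHERE a.event_id IS NULL
    LIMIT 1000
"""

# Pipeline 2: aggregate annotated results for the QC dashboard,
# e.g. class distribution per annotator.
DISTRIBUTION_SQL = """
    SELECT annotator, label, COUNT(*) AS n
    FROM annotations
    GROUP BY annotator, label
"""

with conn.cursor() as cur:
    unannotated_batch = cur.execute(UNANNOTATED_SQL).fetchall()
    label_distribution = cur.execute(DISTRIBUTION_SQL).fetchall()
conn.close()
```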
Production Deployment
Provided the trained model and deployment instructions to the Engineering team for integration into the email analytics infrastructure. The model became a customer-facing product feature for accurate email engagement reporting.
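Since MLflow was part of the development environment, the handoff could have looked like the sketch below, which logs the trained model as a versioned artifact; the experiment and run names are illustrative:

```python
import mlflow
import mlflow.catboost

# 'model' is the trained CatBoostClassifier from the selection step above;
# none of the names here are the production values.
mlflow.set_experiment("spam-filter-detection")
with mlflow.start_run(run_name="catboost-prod-candidate"):
    mlflow.log_metric("f1", 0.92)
    mlflow.catboost.log_model(model, artifact_path="model")
```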
Development Environment
- Python
- NumPy
- Pandas
- scikit-learn
- Random Forest
- XGBoost
- CatBoost
- Matplotlib
- Modin
- Ray
- MLflow
- Snowflake
- Snowflake Python Connector
- Git
- Bitbucket
- JupyterLab
- Visual Studio Code
- ChatGPT
- Claude