Human vs. Spam Filter Detection
The Problem
Email marketing engagement metrics (opens and clicks) are contaminated by automated spam filter activity. Spam filters automatically open emails and click links during their scanning process, inflating engagement statistics. Customers requested a product that could separate genuine human engagement from spam filter activity to provide accurate campaign performance metrics.
Binary Classification Problem
Built a supervised learning model to classify email engagement events as either human or spam filter interactions. The model needed to handle event sequences where both human and spam filter activity could be present.
Feature Engineering
Worked with the Deliverability team to identify features that distinguished human from spam filter behavior. Engineered features from engagement event data (a sketch follows the list), including:
- Time between open and click events
- User agent strings parsed into separate fields: browser name, version, privacy protection status, and spoofing indicators
- IP address provider identification
- Interaction sequence characteristics
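A minimal sketch of this kind of feature extraction, assuming a hypothetical event table with message_id, event_type, event_ts, and user_agent columns (the real schema and user-agent parsing rules are not reproduced here):

```python
import pandas as pd

# Hypothetical schema: one row per engagement event with columns
# message_id, event_type ('open' / 'click'), event_ts (timestamp),
# and user_agent. The file name is illustrative.
events = pd.read_parquet("engagement_events.parquet")

# Feature: seconds between the first open and the first click per message.
firsts = (
    events.sort_values("event_ts")
    .groupby(["message_id", "event_type"])["event_ts"]
    .first()
    .unstack("event_type")
)
firsts["open_to_click_secs"] = (
    (firsts["click"] - firsts["open"]).dt.total_seconds()
)

# Crude user-agent split into separate fields; the production pipeline used
# richer parsing (privacy protection status, spoofing indicators).
ua_parts = events["user_agent"].str.extract(
    r"^(?P<browser_name>[A-Za-z]+)/(?P<browser_version>[\d.]+)"
)
events = events.join(ua_parts)
```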
Class Imbalance and Training Set Curation
The dataset exhibited complex class imbalance issues:
- Human interactions outnumbered spam filter interactions overall
- Privacy-protected browsers dominated human interactions
- Specific browser versions dominated spam filter interactions
Curated the training set to limit over-represented classes and ensure sufficient representation of all interaction types. This curation was necessary for the model to learn the subtle differences between human and spam filter behaviors.
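As an illustration of the curation step, the sketch below caps each over-represented (label, stratum) cell by downsampling; the column names and cap value are hypothetical:

```python
import pandas as pd

# Hypothetical labeled set: a 'label' column (human / spam_filter) plus a
# 'stratum' column combining browser family and privacy-protection status.
labeled = pd.read_parquet("labeled_events.parquet")

CAP = 5_000  # illustrative ceiling per (label, stratum) cell

# Downsample over-represented cells while keeping every rare interaction
# type, so the model still sees the subtle behavioral differences.
curated = (
    labeled.groupby(["label", "stratum"], group_keys=False)
    .apply(lambda g: g.sample(n=min(len(g), CAP), random_state=0))
)
```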
Model Development and Selection
Evaluated multiple classification algorithms, including Random Forest, XGBoost, and CatBoost, using F1 score as the primary evaluation metric because it balances precision and recall. The Deliverability team reviewed model results and approved the production release.
The initial production model was a Random Forest, which performed best on the original training data. After the data annotation process and training set curation were refined, CatBoost outperformed Random Forest, and the production model was updated accordingly.
The final model achieved an F1 score of 0.92.
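The model comparison followed the usual scikit-learn cross-validation pattern, roughly as sketched below; the synthetic data and hyperparameters are stand-ins for the curated features and tuned production settings:

```python
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Synthetic, imbalanced stand-in for the curated feature matrix and labels
# (1 = spam filter, 0 = human).
X, y = make_classification(n_samples=2000, weights=[0.8], random_state=0)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "xgboost": XGBClassifier(eval_metric="logloss", random_state=0),
    "catboost": CatBoostClassifier(verbose=0, random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```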
Data Annotation Quality Control
Managed the data annotation process, in which interns labeled training examples, and built Snowflake dashboards to monitor annotation quality:
- Class distribution monitoring to ensure rarer classes were adequately represented
- Common annotation error identification and correction tracking
Snowflake queries were split into separate pipelines: one for pulling unannotated data for labeling, and one for aggregating annotated results.
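A sketch of the two query paths using the Snowflake Python Connector; connection parameters, table names (engagement_events, annotations), and columns are placeholders rather than the production schema:

```python
import snowflake.connector

# Connection parameters are placeholders; real values came from team config.
conn = snowflake.connector.connect(
    account="...", user="...", password="...",
    warehouse="...", database="...", schema="...",
)

# Pipeline 1: pull a batch of events that still lack annotations.
UNANNOTATED_SQL = """
    SELECT e.event_id, e.event_ts, e.user_agent, e.ip_address
    FROM engagement_events e
    LEFT JOIN annotations a ON a.event_id = e.event_id
    WHERE a.event_id IS NULL
    LIMIT 1000
"""

# Pipeline 2: aggregate annotated results for the QC dashboard,
# e.g. class distribution per annotator.
DISTRIBUTION_SQL = """
    SELECT annotator, label, COUNT(*) AS n
    FROM annotations
    GROUP BY annotator, label
"""

with conn.cursor() as cur:
    unannotated_batch = cur.execute(UNANNOTATED_SQL).fetchall()
    label_distribution = cur.execute(DISTRIBUTION_SQL).fetchall()
conn.close()
```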
Production Deployment
Provided the trained model and deployment instructions to the Engineering team for integration into the email analytics infrastructure. The model became a customer-facing product feature for accurate email engagement reporting.
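Since MLflow was part of the development environment, the handoff could have looked like the sketch below, which logs the trained model as a versioned artifact; the experiment and run names are illustrative:

```python
import mlflow
import mlflow.catboost

# 'model' is the trained CatBoostClassifier from the selection step above;
# none of the names here are the production values.
mlflow.set_experiment("spam-filter-detection")
with mlflow.start_run(run_name="catboost-prod-candidate"):
    mlflow.log_metric("f1", 0.92)
    mlflow.catboost.log_model(model, artifact_path="model")
```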
Development Environment
- Python
- NumPy
- Pandas
- scikit-learn
- Random Forest
- XGBoost
- CatBoost
- Matplotlib
- Modin
- Ray
- MLflow
- Snowflake
- Snowflake Python Connector
- Git
- Bitbucket
- JupyterLab
- Visual Studio Code
- ChatGPT
- Claude