Data Quality Investigation

Business Context

A startup analytics company was establishing itself in a competitive market by providing analysis services to customers. They had acquired the largest and most comprehensive publicly available dataset in their target area—approximately 2.1 million laboratory test results. Using this free dataset was critical for managing costs while competing against established players with proprietary data.

However, the dataset was unusable as delivered. Record volumes were severely inflated during early 2018, which skewed every time-based view of the data and rendered sales trends, waste generation patterns, and quarterly/annual comparisons wildly inaccurate. The company could not provide reliable analysis to customers and could not determine what was causing the contamination or how to correct it.

I was engaged to diagnose the root cause and determine if the dataset could be salvaged for analysis.

Root Cause Analysis

Conducted comprehensive exploratory analysis across multiple tables and fields to identify the source of the temporal anomaly. Examined distributions of various attributes over time, looking for patterns that could explain the concentrated spike in early 2018. The analysis revealed that the new tracking system had imported legacy data from a previous system, with all legacy records incorrectly timestamped to the migration date rather than their actual creation dates.
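
The spike itself was straightforward to surface once record volumes were grouped by creation month. A minimal sketch of that exploratory step, assuming a pandas DataFrame `df` loaded from the extract with a hypothetical `created_date` timestamp column:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the lab results extract; the file name and column name are assumptions.
df = pd.read_csv("lab_results.csv", parse_dates=["created_date"])

# Count records per month-start bin to expose the concentrated early-2018 spike.
monthly_counts = df.set_index("created_date").resample("MS").size()

monthly_counts.plot(figsize=(12, 4), title="Records created per month")
plt.tight_layout()
plt.show()
```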

The dataset provided no explicit field distinguishing legacy from current records. However, analysis of the `lab_results_id` field revealed inconsistent identifier patterns. Some records contained lab IDs with the prefix "WAL" followed by numbers (e.g., "WAL22.LRB9L"), while others used different prefixes such as "WAJ" (e.g., "WAJ412964.LR2").
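
The inconsistency is easy to quantify by tallying the alphabetic prefix of each identifier. A brief sketch of that check, under the same assumptions as above:

```python
# Pull the leading alphabetic prefix of each lab_results_id (e.g. "WAL", "WAJ").
prefixes = df["lab_results_id"].str.extract(r"^([A-Z]+)", expand=False)

# The frequency of each prefix shows how the identifier conventions split.
print(prefixes.value_counts(dropna=False))
```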

After a month of analyzing the data structure and patterns, I directed specific questions to stakeholders about the lab ID format. This revealed that the rollout of the new tracking system had changed the lab identifier numbering scheme: current system records used the "WAL##" format, while the legacy system had used different identifier conventions. This information had not been surfaced earlier in the investigation.

Solution Design and Implementation

Separation Strategy

Designed a separation approach based on the lab ID format difference. Implemented string parsing logic to extract lab identifiers from the `lab_results_id` field, searching for the "WAL##" pattern that characterized current system records. Records where extraction returned an empty string were classified as legacy data.
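
The separation reduces to a single vectorized extraction against that pattern. A minimal sketch, assuming the same DataFrame and using a regular expression for "WAL" followed by digits:

```python
# Extract the "WAL##" lab identifier embedded in lab_results_id; rows with no
# match come back as NaN, which is normalized to the empty string used for
# classifying legacy records.
lab_id = df["lab_results_id"].str.extract(r"(WAL\d+)", expand=False).fillna("")

current_df = df[lab_id != ""].copy()  # records created in the current tracking system
legacy_df = df[lab_id == ""].copy()   # records migrated from the legacy system

print(f"Current: {len(current_df):,}  Legacy: {len(legacy_df):,}")
```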

This approach partitioned the dataset into:

  • Current Data: ~593,000 records (28%) containing "WAL##" lab IDs
  • Legacy Data: ~1,535,000 records (72%) without the "WAL##" pattern

Validation and Quality Assessment

Validated the separation method through time series analysis. Created visualizations comparing record distributions by creation date for the separated datasets against the original combined dataset. The legacy data showed the expected concentration around the migration date, while the current data exhibited the temporal distribution typical of ongoing data collection. This close match between the separated current data and the expected collection pattern provided supporting evidence that the separation was accurate.
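
The validation plots followed the same monthly grouping. A hedged sketch, assuming the frames produced by the separation step above:

```python
import matplotlib.pyplot as plt

# Compare monthly record volumes for the combined, current, and legacy frames.
fig, axes = plt.subplots(3, 1, figsize=(12, 9), sharex=True)
for ax, (label, frame) in zip(
    axes, [("Combined", df), ("Current", current_df), ("Legacy", legacy_df)]
):
    frame.set_index("created_date").resample("MS").size().plot(ax=ax)
    ax.set_title(f"{label}: records created per month")

plt.tight_layout()
plt.show()
```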

Further analysis revealed systemic data quality issues throughout the dataset. Neither the legacy nor the current tracking system enforced data entry standards, resulting in pervasive inconsistencies across multiple fields. Definitive validation of the separation was therefore not feasible. I made the judgment call that the lab ID-based separation provided sufficient accuracy for the intended temporal analysis use cases, acknowledging that only ballpark-level precision would be achievable regardless of separation method.

Impact and Recommendations

Enabled Analysis Capabilities

The cleaned dataset enabled the analytics company to provide temporal analysis services to customers. Sales trends over time, waste generation patterns, and quarterly/annual comparisons became feasible where they had previously been impossible due to data contamination. This allowed the company to compete effectively using the free public dataset rather than requiring expensive proprietary data sources.

Data Quality Assessment and Limitations

Documented comprehensive findings about systemic data quality issues affecting the dataset. Both legacy and current tracking systems lacked data entry enforcement, creating pervasive inconsistencies that could not be retroactively corrected. Communicated clearly to stakeholders—through written documentation, verbal discussion, and notebook markdown—that temporal analysis would remain suspect and that only ballpark-level accuracy should be expected from any analysis using this dataset.

These limitations applied regardless of the legacy data separation. The fundamental data quality constraints meant that highly precise analysis was not achievable, but the separation enabled the level of accuracy that was feasible given the available data.

Deliverables

Provided separated datasets for legacy and current records, along with a documented Jupyter notebook containing the extraction logic, validation visualizations, and data quality assessment. The notebook enabled the analytics team to reproduce the separation process for future data extracts and understand both the methodology and its limitations.

Development Environment

  • Python
  • Jupyter Notebook
  • Pandas
  • NumPy
  • Matplotlib
  • Seaborn