Data Validation Automation

Context
Challenge
Solution
Automating data validation in NumPy and pandas for clinical data

Context

We worked with millions of rows of messy, user-input data. This data needed to be validated and cleaned before it entered our ETL software.

Challenge

The data validation and cleaning process was highly manual: changing values in Excel to match the expected input within our internal system. There was no universal validation standard, which left large room for error. Data validation took 3 business days to complete, which was an exceptionally long time based on team size and capacity.

Solution

We utilized Python to write functions to standardize and automate the data validation and cleaning process. We used the pandas and NumPy libraries to conduct the essential data analysis tasks. These included validating data for accurate totals, flagging anomalies associated with operating costs, and ensuring completeness in critical fields (e.g. ID numbers, department codes, dates, etc.) Scripts scanned selected files we received from clients, flagged discrepancies, and normalized the files into a format we could then review and load into our ETL system much sooner. Data validation time was cut by 50%.