Building a Data Integrity and Profiling Prototype in Two Days

Author: Mark Meschi

On August 8th and 9th, FINRA hosted a hackathon, dubbed Creatathon. Over 45 teams participated in various locations including Rockville, New York City, and Chicago. The only constraint was to create a prototype that uses data to drive decision making. I used this opportunity to delve into code again and deliver a great prototype.

My group’s idea was to run quality reports on incoming data. The goal was to deliver a data integrity and profiling tool that would alert operations and technology staff to potential data anomalies before crucial pattern outputs were delivered. The tool would provide insight into object ingestion and help the user, most likely operations staff, quickly take appropriate next steps, such as researching the potential issue with the development team or giving business owners a heads-up on a potential impact.

We wanted to create a tool capable of analyzing a namespace, an object, or a field within an object for potential data anomalies. That analysis capability was central to our ultimate goal: improving decision making in running patterns. The tool could help decision makers make the best use of technology resources while balancing business pattern Service Level Agreements (SLAs) and reporting requirements.

From idea to prototype

Once we had the idea, the next question was “how do we do this?” I worked a few weekends before the event to sharpen my coding skills, which, as a technology manager, had gotten a little rusty. I used FINRA’s Safari Books Online subscription to research languages and technologies, and found that R and Python were perfect for delivering a polished end product with interactive graphs that I thought would impress the judges.

Our first data quality check was on incoming object metadata by namespace: specifically, the size of the object and the number of rows. The Python script would loop through REST calls to herd, FINRA’s open source data management service, pulling object size and row count by namespace.

This metadata would be pulled for ‘n’ days (a script parameter) prior to the date the script was run, for a particular namespace (another parameter). The script would then compute the mean and standard deviation of object size and row count over those ‘n’ days. Any object whose size or row count fell more than three standard deviations above or below the mean would be considered anomalous and flagged for further research by operations.
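The flagging logic above can be sketched in a few lines of Python. The function name and sample numbers below are illustrative, not from the actual prototype; the real script pulled each object’s size and row count history from herd’s REST API.

```python
import statistics

def is_anomalous(history, value, n_sigma=3):
    """True if `value` lies more than `n_sigma` sample standard
    deviations from the mean of the prior days' `history` values."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(value - mean) > n_sigma * stdev

# Row counts for one object over the prior five days, then today's count.
row_counts = [1_000_000, 1_005_000, 995_000, 1_002_000, 998_000]
print(is_anomalous(row_counts, 1_020_000))  # True: a sudden jump is flagged
```

Note that the baseline window is the prior ‘n’ days, not the current day: including today’s value in the mean and standard deviation would dilute the very outlier you are trying to catch.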

[Chart: object size and row count by day, with horizontal bands for the mean and standard deviations over the ‘n’-day window]

The next step was to visualize the data. After extensive research, we decided to use R and extend it with packages from the Comprehensive R Archive Network (CRAN). Our R code read the output of the Python script and loaded the data into data frames.

After that, we used add-on packages written by R luminary Hadley Wickham to prepare the data for plotting. We wanted an interactive line chart displaying the row count and size by day, with horizontal lines marking the mean and the mean plus/minus one, two, and three standard deviations for each object. We did this with Ramnath Vaidyanathan’s open source rCharts package, which wraps the NVD3 JavaScript plotting library for use in R.

Finally, all of this was displayed using yet another open source R package, knitr. It allowed us to embed our R code and plots in a markdown document that rendered as a static HTML page displaying our interactive charts.

The sheer amount of quality open source development resources available was key to our working prototype: we wrote it with relatively few lines of code. The hardest part of the project was finding the right technology to realize our vision of profiling data for quality. Some data analysts in our office suggested we look into R, the open source versions of Pentaho and QlikView, and Talend. Only R offered both the data manipulation abilities and the polished reporting tools we needed in its free version.

Using Python was something of an accident: many in my group, myself included, wanted to learn it. Python is a great tool, a simple interpreted language that runs on any platform. Safari Books Online had extensive documentation and video courses on these technologies that helped me ramp up quickly. Once the learning process was complete, writing a relatively slick prototype was not that difficult.

Where it could go next

Our prototype was just that, a prototype. True data profiling requires more than just analyzing metadata. If the idea gets funded, the next step would be to profile the data itself. This would include a set of audit rules per object, with results displayed in a daily report, to determine whether ingested objects had the potential for data issues. By identifying data issues before consumers of that data run their jobs, we could cut AWS capacity costs and reduce the time analysts and technologists spend researching problems. There is nothing a developer can do to code around bad data; it’s better and easier to find any potential issue quickly.
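One possible shape for those per-object audit rules, sketched here with made-up rule names and metadata fields rather than anything from the actual design, is a table of predicates evaluated against each ingested object, with failures collected into the daily report:

```python
# Hypothetical audit rules; rule names and metadata fields are illustrative.
RULES = {
    "non_empty": lambda obj: obj["row_count"] > 0,
    "within_expected_size": lambda obj: obj["size_bytes"] < 10_000_000_000,
}

def daily_audit(objects, rules=RULES):
    """Map each failing object's name to the list of rules it broke."""
    report = {}
    for obj in objects:
        failed = [name for name, check in rules.items() if not check(obj)]
        if failed:
            report[obj["name"]] = failed
    return report

ingested = [
    {"name": "TRADES", "row_count": 0,       "size_bytes": 1_024},
    {"name": "QUOTES", "row_count": 500_000, "size_bytes": 64_000_000},
]
print(daily_audit(ingested))  # {'TRADES': ['non_empty']}
```

Keeping the rules as data rather than hard-coded checks would let operations add or tune them per object without a code change.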

Unfortunately, our prototype did not finish in the top three. However, when presented to upper management, it was well received by Tigran Khirman, Vice President of Data Platforms. Many ideas outside the top three are being discussed over the next quarter; hopefully our project will be funded in the near future.

This is just one example of the many innovative prototypes demonstrated as part of the FINRA hackathon. The event inspired me, as a manager, to keep looking for new technologies to improve the end product we deliver to our users, and it showed me ways we could improve pattern output with better data visualization tools.

It also improved my day-to-day work: now that I know R, I can hook into our Presto database by spinning up an On Demand Analytics Platform (ODAP), a standalone AWS EC2 instance with RStudio preinstalled and sized to the amount of data I want to research. Using R there, I can handle ad-hoc requests and research at a much quicker pace than running plain queries in our Aqua Data Studio client.