How do we process up to 75 billion market events daily?

As one of the self-regulatory organizations for the financial industry, FINRA monitors 99% of equities and 70% of options in the US, including trading on NASDAQ and the New York Stock Exchange (NYSE). This task involves collecting and parsing immense amounts of data. On an average day, these markets generate around 30 billion events. High-volume days can bring far more: we’ve processed 75 billion events from a single day of trading.

All this data needs to be sorted and used by various groups inside FINRA. The Extract, Transform, and Load (ETL) group is the team that brings in all this data and processes it in less than a day. Hundreds of people rely on the data ETL provides and look at it daily.

Here’s how they make it possible to use market data to protect investors.

Bringing the data in

To begin, exchanges send their trading data to FINRA. Each exchange delivers it differently, but most send it in batches throughout the day.

Many exchanges, like the Chicago Board Options Exchange (CBOE), send their data via API. This data goes directly to the cloud and is stored in Amazon S3. Because FINRA is a regulator, we must keep both the unprocessed and processed versions of the data for record-keeping purposes.

Processing the data

Once the data has been gathered, the ETL group needs to process it. Processing has three important stages: validation, transformation, and publication. With Amazon EMR Hadoop clusters, Oozie, and Hive, all three stages happen quickly in the cloud, often in 12 hours or less.

The ETL group begins by validating the data. At this stage, they focus on semantics, file names, and the details of the data. They also give markets feedback on whether the data was sent in the proper format. Details such as a file name can seem small, but when processing this much data, they must be correct.
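To make the validation stage concrete, here is a minimal sketch in Python. The file naming convention, the field names, and the rules are all hypothetical and chosen only for illustration; they are not FINRA’s actual checks, which run on Hadoop and Hive rather than in a single script.

```python
import re
from typing import Dict, List

# Hypothetical naming convention, for illustration only:
# <EXCHANGE>_<YYYYMMDD>_<BATCH>.csv, e.g. CBOE_20160301_001.csv
FILE_NAME_PATTERN = re.compile(r"[A-Z]+_\d{8}_\d{3}\.csv")

# Hypothetical required fields for one market event.
REQUIRED_FIELDS = ("symbol", "price", "quantity")


def validate_file_name(name: str) -> List[str]:
    """Return a list of problems with the file name (empty if valid)."""
    if FILE_NAME_PATTERN.fullmatch(name) is None:
        return [f"file name {name!r} does not match "
                "<EXCHANGE>_<YYYYMMDD>_<BATCH>.csv"]
    return []


def validate_record(record: Dict[str, str]) -> List[str]:
    """Return a list of problems with one raw record (empty if valid).

    Values are expected to be strings, as read from a delimited file.
    """
    missing = [f"missing field: {field}"
               for field in REQUIRED_FIELDS if not record.get(field)]
    if missing:
        return missing

    errors = []
    try:
        if float(record["price"]) <= 0:
            errors.append("price must be positive")
    except ValueError:
        errors.append("price is not a number")
    try:
        if int(record["quantity"]) <= 0:
            errors.append("quantity must be positive")
    except ValueError:
        errors.append("quantity is not an integer")
    return errors
```

Returning a list of messages instead of raising on the first problem matches the feedback step described above: everything wrong with a file can be reported back to the market at once.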

From there, the ETL group transforms the data. Even when the data passes validation, each market’s raw data has its own formatting and set of symbols. For easier use internally, the ETL group takes the various markets’ data and standardizes it into one or more flat objects.
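The idea of standardizing several market formats into one flat object can be sketched as follows. The two source formats, their field names, and the symbol aliases are invented for this example; the real feeds and the real target schema are not described here.

```python
from typing import Callable, Dict

# Hypothetical symbol aliases: two spellings mapped to one internal form.
SYMBOL_ALIASES = {"BRK.B": "BRK B", "BRK/B": "BRK B"}


def normalize_symbol(symbol: str) -> str:
    """Trim, upper-case, and map a market-specific symbol to a common one."""
    symbol = symbol.strip().upper()
    return SYMBOL_ALIASES.get(symbol, symbol)


def from_exchange_a(raw: Dict[str, str]) -> Dict[str, object]:
    # Hypothetical format A: short field names, price in dollars.
    return {
        "source": "EXCHANGE_A",
        "symbol": normalize_symbol(raw["sym"]),
        "price": float(raw["px"]),
        "quantity": int(raw["qty"]),
    }


def from_exchange_b(raw: Dict[str, str]) -> Dict[str, object]:
    # Hypothetical format B: long field names, price in cents.
    return {
        "source": "EXCHANGE_B",
        "symbol": normalize_symbol(raw["ticker"]),
        "price": int(raw["price_cents"]) / 100,
        "quantity": int(raw["size"]),
    }


# One parser per source; adding a market means adding one entry here.
PARSERS: Dict[str, Callable[[Dict[str, str]], Dict[str, object]]] = {
    "EXCHANGE_A": from_exchange_a,
    "EXCHANGE_B": from_exchange_b,
}


def standardize(source: str, raw: Dict[str, str]) -> Dict[str, object]:
    """Convert one raw record into the common flat record."""
    return PARSERS[source](raw)
```

After this step, downstream consumers see the same keys and units regardless of which market a record came from.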

Finally, the market data is published to internal consumers. There are two types of publication. One is the standard publication that is common to all groups in FINRA. The other is custom: some groups have specific requirements for their work, and these custom pieces make up about 25% of ETL publications.

Still, the work toward further standardization never stops. The ETL group continues to make the standard publication more robust. This keeps custom publication work low so the group can focus on its larger goal: moving from batch and scheduled processing to processing data as it arrives, without delay.

Faster processing for growing data

Trading on American financial markets continues to grow and become faster. As demand rises, the cloud has been our answer to the 60-70 terabytes of data we need to process each month. On premises, processing market data could take 24 hours, and sometimes even longer depending on the volume of trades.

In the cloud, we’ve been able to dramatically reduce that time. The ETL group begins processing at 9 pm and finishes by 9 am the next day. Spreading this work across up to 1,500 transient Hadoop clusters a month makes the quick turnaround possible. Although this is already faster than our previous on-premises processing, we’re still working to shift all market data processing to the cloud.
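As a rough back-of-the-envelope check on what that window implies, the snippet below converts daily event counts into a sustained rate. It assumes, for simplicity, that a full day’s events are processed evenly across the 12-hour window from 9 pm to 9 am; real workloads are burstier and spread across many clusters.

```python
# 9 pm to 9 am is a 12-hour processing window.
WINDOW_SECONDS = 12 * 60 * 60


def events_per_second(events_per_day: float,
                      window_seconds: int = WINDOW_SECONDS) -> float:
    """Average rate needed to process one day's events within the window."""
    return events_per_day / window_seconds


average_day = events_per_second(30e9)  # about 30 billion events
peak_day = events_per_second(75e9)     # about 75 billion events

print(f"average day: {average_day:,.0f} events/second")
print(f"peak day:    {peak_day:,.0f} events/second")
```

Under that assumption, an average day works out to roughly 690,000 events per second and a peak day to roughly 1.7 million.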

Challenges with the cloud

However, the shift to the cloud does come with hurdles. The new technologies we’re using, including open source tools such as Hive and cloud services from AWS, are younger than the legacy systems of the past. They may have bugs that affect the data, which can be a serious problem for a regulatory organization.

In addition, security must be part of the transition. The paradigm for how we think about security has changed radically in the cloud. Getting it right takes time, so that no individual or organization is put at risk.

Finally, processing this much data in the cloud requires a new way of thinking about cost. Whereas before we focused on how much we could fit into one proprietary processor, today we can have virtually unlimited processing at any time. Using spot pricing from AWS has been a great way to keep costs down, with nodes sometimes costing as little as 25 cents a day. Because spot prices change, however, processing costs can vary dramatically from day to day or quarter to quarter.
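A small sketch shows why that variability matters for planning. The cluster size, run length, and the higher price below are hypothetical; only the 25 cents a day figure comes from the text above.

```python
def cluster_cost(nodes: int, hours: float, price_per_node_day: float) -> float:
    """Cost in dollars of one transient cluster run at a given spot price."""
    return nodes * (hours / 24) * price_per_node_day


# Hypothetical cluster: 100 nodes running for a 12-hour window.
cheap_day = cluster_cost(nodes=100, hours=12, price_per_node_day=0.25)
pricey_day = cluster_cost(nodes=100, hours=12, price_per_node_day=1.00)

print(f"at $0.25 per node-day: ${cheap_day:.2f}")
print(f"at $1.00 per node-day: ${pricey_day:.2f}")
```

The same job costs four times as much on the second day, without any change to the code or the data. Across many clusters a month, swings like this are what budgeting has to absorb.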

This change is less a technology shift than a shift in how teams and project managers inside the ETL group work. Still, with a working cloud system in place, the transition continues, with more and more market data sent, processed, and stored in the cloud. By July 2016, we expect all ETL processing to occur in the cloud, continuing the work of processing data quickly to protect investors.

Your technical skills can make a difference: check out our job openings.