Part of FINRA’s mission is to analyze trillions of data points across a multitude of data sources to protect investors and ensure the integrity of US markets. As a regulator, FINRA has a heightened focus on data governance. In the past, we faced a number of problems common to on-premises data processing environments – data fragmentation across organizations, multiple data processing appliance systems, and storage constraints requiring costly upgrades. These fragmented data systems created operational issues. Performing data corrections was a lengthy exercise requiring coordination across multiple teams, and in some cases loading data from tape. The most complex data fixes could take weeks or even months to complete.
FINRA researched various architecture options and developed an innovative cloud-based architecture that separates compute from storage.
Cloud storage technologies like Amazon S3 and Glacier provide a simple, durable, and scalable storage platform at manageable cost. A multi-region architecture allowed FINRA to meet disaster-recovery requirements without tape backups. However, cloud storage lacks even basic data management features.
Cloud data processing combines a rich big data ecosystem with elastic compute resources. FINRA knew that our technology teams would deliver maximum business value by taking advantage of various processing tools such as Hive, Presto, and Redshift. We needed to design a system that fostered a heterogeneous data processing environment on top of cloud storage.
With core requirements around storage and compute fulfilled by cloud vendors, the only missing piece was the ability to manage the data and orchestrate processing between storage and compute.
With our data footprint growing steadily in the cloud, FINRA’s data management team saw a growing need to create a unified metadata repository and services to manage it all: from this need, the open source product herd was born.
herd's data catalog is driven by a set of REST APIs covering naming, organization, lifecycle, storage locations, and job processing. Once configured, these REST services provide interfaces for systems to integrate with herd and quickly register files, find data, or launch processing jobs.
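As a rough illustration of the registration flow, the sketch below builds a REST request that tells the catalog about a newly landed set of S3 files. The host, endpoint path, and field names here are illustrative assumptions, not the exact herd API; consult the herd documentation for the real contract.

```python
import json
import urllib.request

# Placeholder base URL -- a real deployment would point at its own herd instance.
HERD_BASE_URL = "https://herd.example.com/herd-app/rest"

# Illustrative registration payload: which dataset these files belong to,
# which partition they cover, and where they live in storage.
registration = {
    "namespace": "MARKET_DATA",                  # logical owner of the dataset
    "businessObjectDefinitionName": "TRADE",     # the dataset being registered
    "businessObjectFormatFileType": "ORC",
    "partitionValue": "2016-06-01",              # e.g. the trade-date partition
    "storageName": "S3_MANAGED",
    "storageFiles": [
        {"filePath": "trade/2016-06-01/part-00000.orc", "fileSizeBytes": 1048576}
    ],
}

# Build (but do not send) the POST request that would register the files.
request = urllib.request.Request(
    url=HERD_BASE_URL + "/businessObjectData",
    data=json.dumps(registration).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(request) would submit the registration; it is
# omitted here because the endpoint above is a placeholder.
print(request.full_url)
```

Once data is registered this way, downstream consumers can discover it through the same set of services instead of scanning S3 paths directly.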
herd is designed for highly regulated data environments. Version and lineage tracking vastly improves the management of data changes and makes assessing their impact faster and easier. If data becomes invalidated, herd can send an automatic notification to anyone using that specific dataset.
herd’s job orchestration and cluster management capabilities reduce the operational responsibilities for all the data processing teams while providing a centralized platform to monitor and reduce processing costs. Now teams focus on defining the business logic of their data processing job and delegate the operational responsibility to the herd job orchestration engine.
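The division of labor described above can be sketched as a small job-submission request: the team names the workflow and supplies business parameters, and everything operational is left to the orchestration engine. The endpoint, job name, and parameter names below are hypothetical placeholders, not the exact herd API.

```python
import json
import urllib.request

# Placeholder base URL for a herd deployment.
HERD_BASE_URL = "https://herd.example.com/herd-app/rest"

# Illustrative job submission: the team provides only business inputs.
# Cluster sizing, spot bidding, and retries are the orchestration
# engine's responsibility, not the caller's.
job_submission = {
    "namespace": "MARKET_DATA",
    "jobName": "DAILY_TRADE_VALIDATION",   # workflow defined once by the team
    "parameters": [
        {"name": "trade_date", "value": "2016-06-01"},
    ],
}

# Build (but do not send) the POST request that would launch the job.
request = urllib.request.Request(
    url=HERD_BASE_URL + "/jobs",
    data=json.dumps(job_submission).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(request) would launch the job; omitted because
# the host above is a placeholder.
print(request.full_url)
```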
Currently, we have eight teams managing 1.2 petabytes of data in S3 across 40 million objects with herd. With peak intraday processing of around 1.5 million objects totaling 110 terabytes, cost control is a real factor. herd keeps processing costs under control by taking advantage of spot pricing and using detailed metrics to optimize per-hour compute purchases. One team has seen savings of more than 80% on processing costs within days of adopting herd.
herd can be tailored to various enterprise storage needs. While FINRA created it for use with S3, herd can work with any organization's cloud storage and integrate with most cloud data processing tools. Knowing that many organizations are struggling with big data management, we decided to release herd to the open source community. Learn more about herd on our GitHub page.