Moving Your Production System to the Cloud?


6 Things You Need to Know

In the past two years, we’ve been moving many of our market regulation SQL batch analytics to the cloud. Collecting and analyzing up to 75 billion events a day, our production system runs hundreds of surveillance algorithms to protect investors from market manipulation. While massively parallel processing worked on fixed appliances, the data was spread across multiple appliances, which hindered efficiency and accessibility.

Adopting Hadoop, Amazon EMR, and S3 made our data more accessible and our analytics more affordable. Yet, like any transition, we had to resolve issues as we adopted new technology, and over the past two years we learned many lessons along the way. At Strata + Hadoop World 2015, Jaipaul Agonus shared those lessons. Here are six that will save you time and money in the long run.

1. Partition as much as possible

On legacy products, data was separated by device. At FINRA, we had data on appliances, physical storage spaces, and even tapes. Today, all of our data can live in Amazon’s S3 cloud. It’s easy to simply dump all your source data in the format you receive it. But does that actually make sense for your use cases? In ours, analyzing that data was slower and more costly without partitions. To get the best performance, organizing data from the beginning is key. You can arrive at the best organizational scheme for your needs by asking yourself two questions: How will your analytics access and filter the data? And which data sets will be joined together, and on what keys?

Questions like these ensure you partition data along natural query boundaries. Partitioning along those lines reduces I/O scans during joins and makes your batch analytics easier and more efficient in the long run.
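For example, if most queries filter and join on date, partitioning by date in Hive might look like the sketch below. This is a minimal illustration, not our actual schema; the bucket, table, and column names are hypothetical:

```sql
-- Source data laid out in S3 by partition:
--   s3://my-bucket/data/trades/trade_date=2015-06-01/...
CREATE EXTERNAL TABLE trades (
  trade_id BIGINT,
  symbol   STRING,
  price    DECIMAL(18,6),
  quantity BIGINT
)
PARTITIONED BY (trade_date STRING)
STORED AS ORC
LOCATION 's3://my-bucket/data/trades/';
-- New partitions are registered with ALTER TABLE ... ADD PARTITION or MSCK REPAIR TABLE.

-- A query that filters on the partition column scans only the matching S3 prefixes
-- instead of the entire data set.
SELECT symbol, SUM(quantity) AS total_quantity
FROM   trades
WHERE  trade_date = '2015-06-01'
GROUP  BY symbol;
```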

2. Measure and profile clusters

In the world of legacy hardware, the question production-system developers faced was: how do we avoid overloading the appliances? We focused on quantifying the number of processes running at any point in time. The limitation was capacity.

Today, with seemingly endless on-demand compute available, you have to ask a very different set of questions to run an optimal production system: How quickly does each batch need to finish? What will that speed cost? And what cluster configuration best fits the workload?

All of these questions force you to evaluate and understand the trade-off between the cost and the timeliness of your process, and to choose the right cluster configuration. Once again, testing is critical to ensure your assumptions are correct. For instance, if you think a batch will need a compute-intensive cluster with 10 machines, test that theory first. Did it run as you expected? Did you use the cluster to its full potential? If not, go back to the drawing board and decide what you really need.
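To make the trade-off concrete, consider a simple hypothetical (the node counts and runtimes are made up for illustration): if a 10-node cluster finishes a batch in four hours and a 20-node cluster finishes it in two, both consume roughly 40 node-hours and cost about the same, so the faster cluster is the obvious choice. But if the 20-node cluster only cuts the runtime to three hours, you are paying 60 node-hours for a modest speedup, and the smaller cluster may be the better fit.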

In the world of cloud technology, we have to rethink production systems. The variable is no longer the processes. The variable is now the machines.

3. Abstract the underlying execution framework in Hive

When using legacy systems, you knew the technology you would work with was relatively stable for the life of the device. Moving to Hadoop, Hive, and Amazon means tossing aside that assumption. These tools have changed rapidly over the past few years and will continue to do so.

While the same application code lives on in our production system, there are more and more backend options for executing it. From a customer’s perspective, these innovations are a great opportunity: you now have multiple engines available to process your data.

To harness these options, keep your application code independent of any one execution framework. Abstracting the execution framework from the code makes switching easy: as technologies change or different use cases arise, you simply choose the right backend to process your data.
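Hive itself illustrates the idea: the execution engine is a configuration setting rather than something baked into the queries. A minimal sketch (the table and query are hypothetical, and which engines are available depends on your Hive and EMR versions):

```sql
-- Choose the execution engine through configuration; the HiveQL below stays the same.
SET hive.execution.engine=tez;   -- alternatives: 'mr' (MapReduce) or 'spark', where supported

-- Hypothetical analytic query, unchanged regardless of the engine selected above.
SELECT trade_date, COUNT(*) AS trade_count
FROM   trades
GROUP  BY trade_date;
```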

4. Fill the gap with Hive UDFs

With hundreds of thousands of lines of legacy SQL code and many employees with a deep understanding of the language, it made sense for FINRA to continue using SQL even as we transitioned our production system to the cloud. Still, there are times when SQL isn’t a good fit for a problem. For instance, an analytic may need to sift through large data sets while holding intermediate values in memory. SQL can do this, but it makes the task more complicated than it needs to be.

To fill these gaps, Hive user-defined functions (UDFs) are a great alternative. They let you harness programming languages such as Python or Java, and the results flow straight back into your Hive queries. UDFs give you the power of a programming language while maintaining the SQL interface.

They also fill in functions that Hive lacks. For instance, Hive doesn’t have a built-in function to convert to the specific date formats we need, so at FINRA we use a UDF to convert dates instead. The UDF is a simple fix that lets us avoid writing a convoluted SQL workaround.
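On the Hive side, using such a UDF looks roughly like the sketch below. The JAR location, class name, function name, and table are all hypothetical placeholders, not FINRA’s actual code:

```sql
-- Register a custom Java UDF packaged in a JAR (path and class name are illustrative).
ADD JAR s3://my-bucket/udfs/date-utils.jar;
CREATE TEMPORARY FUNCTION convert_date_format
  AS 'com.example.hive.udf.ConvertDateFormat';

-- Use it like any built-in function inside HiveQL.
SELECT convert_date_format(trade_date, 'yyyyMMdd', 'yyyy-MM-dd') AS trade_date_iso,
       COUNT(*) AS trade_count
FROM   trades
GROUP  BY convert_date_format(trade_date, 'yyyyMMdd', 'yyyy-MM-dd');
```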

UDFs aren’t going to address all analytics use cases. However, they provide a fuller set of functionality to make our SQL batch analytics easier and faster.

5. Choose an optimal storage format

Previously, traditional data warehouses handled file storage behind the scenes; organizations would simply stream data into the appliance. With Hadoop, the decision is now up to you. How will you store your data? What format is best? How will you compress it? Different formats have different advantages. For instance, some of our data is massive and requires complex analytics, and each record can have a few dozen, if not over a hundred, attributes. In that case, we prefer a columnar format. For records with fewer attributes, however, a row-oriented format may be a better fit.

Similar issues come up with compression. Some codecs are fast but don’t reduce file size much. Some are highly space-efficient but slow to read and write. Some can be split across nodes, but not all. At FINRA, we look for a balance between speed and space, as well as the ability to split files across multiple nodes.
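In Hive, those choices show up directly in the table DDL. The sketch below shows one reasonable combination (ORC with Snappy for a wide analytic table, plain delimited text for a small, narrow one); imagine dozens more columns on the wide table, and note that the tables, columns, and S3 locations are hypothetical:

```sql
-- Wide, heavily analyzed data: columnar storage with a fast, splittable-friendly codec.
CREATE EXTERNAL TABLE trades_orc (
  trade_id BIGINT,
  symbol   STRING,
  price    DECIMAL(18,6),
  quantity BIGINT
)
PARTITIONED BY (trade_date STRING)
STORED AS ORC
LOCATION 's3://my-bucket/warehouse/trades_orc/'
TBLPROPERTIES ('orc.compress' = 'SNAPPY');

-- Narrow reference data: a simple row-oriented text format may be all you need.
CREATE EXTERNAL TABLE reference_data (
  ref_id   BIGINT,
  ref_name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://my-bucket/warehouse/reference_data/';
```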

Once again, knowing your use cases is critical. From there, it is much easier to choose the right file format, compression, and storage layout for your needs.

6. Prepare for risk with a mitigation strategy

In every transition, there will be bugs and errors. While new platforms and tools provide amazing benefits, they simply aren’t as mature as older systems. Oracle has been around for nearly forty years, with thousands of users finding and fixing bugs. Hadoop, on the other hand, is only now reaching its tenth year.

The best way to prepare for these risks is to have a mitigation strategy. For instance, during migration we ran our cloud analytics in parallel with our legacy system and compared the results for over a year before turning them over to production.
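A parallel-run comparison can be as simple as reconciling aggregates from both systems. A minimal sketch, assuming the legacy output has been loaded into a Hive table next to the cloud output (all table and column names are hypothetical):

```sql
-- Reconcile daily row counts between legacy and cloud outputs.
-- Any date where the counts disagree (or exist on only one side) is a candidate bug.
SELECT COALESCE(l.trade_date, c.trade_date) AS trade_date,
       l.row_count AS legacy_count,
       c.row_count AS cloud_count
FROM  (SELECT trade_date, COUNT(*) AS row_count FROM legacy_results GROUP BY trade_date) l
FULL OUTER JOIN
      (SELECT trade_date, COUNT(*) AS row_count FROM cloud_results GROUP BY trade_date) c
  ON  l.trade_date = c.trade_date
WHERE l.row_count IS NULL
   OR c.row_count IS NULL
   OR l.row_count <> c.row_count;
```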

These tests helped us find bugs and issues in the underlying Hadoop/Hive software. Only by comparing the data processed in the cloud against our legacy systems were we able to spot these issues and ensure that the bugs were fixed by the vendors.

We’ve maintained our legacy and cloud infrastructure in parallel during this migration, slowly shifting more and more to the cloud, with the full transition planned for mid-2016. While this may seem conservative, it has allowed us to ensure that the data we are processing is accurate, which is critical in our work as a self-regulatory organization.

Conclusion

Moving from legacy proprietary hardware to the cloud, especially for large organizations, is a big commitment. The cost reduction, flexibility, and data accessibility all make it a worthwhile investment. However, a production system in the cloud requires a new mindset. Focusing on your end use cases and creating a mitigation strategy will help you transition to the cloud and protect your data.

Want to learn more? Check out Agonus’ original SlideShare deck and presentation below.

Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud from Jaipaul Agonus