Unit Testing Hive Scripts with HiveQLUnit

Author: Marshall Peters


HiveQLUnit is a library of JUnit extensions for unit testing Hive scripts. In this tutorial we will explore how to create test cases for Hive scripts and then show how to implement those test cases using HiveQLUnit.

Creating an Example Test Case

Before writing a unit test, we need a test case to implement. Every good unit test case defines the expected output produced in response to a fixed input. Imagine we had a Hive script we wished to make a test case for:

    
    CREATE TABLE IF NOT EXISTS src (
        columnOne String,
        columnTwo String
    ) ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|' ESCAPED BY '\\'
    NULL DEFINED AS ''

This sample script is just a CREATE TABLE statement without any real logic. However, it does have functional behavior that can be tested. If we loaded a data file into the 'src' table, and this file contained a single line of data:

    
    Lorem\|Ipsum|Doler

We would expect the table to contain one record with 'Lorem|Ipsum' as the value of columnOne and 'Doler' as the value of columnTwo. This is a complete test with a well-defined input and expectations for the output. Any other observed behavior would mean the sample script had the wrong instructions for parsing data files. For example, this slightly mutated script could not pass the test, as it uses commas rather than pipes to delimit fields.

    
    CREATE TABLE IF NOT EXISTS src (
        columnOne String,
        columnTwo String
    ) ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ',' ESCAPED BY '\\'
    NULL DEFINED AS ''

Instead of the correct output, it would read the value of columnOne as 'Lorem|Ipsum|Doler' and the value of columnTwo as NULL.
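To make the difference concrete, here is a small, self-contained Java sketch of the field-splitting behavior described above. It is an illustrative approximation of Hive's FIELDS TERMINATED BY ... ESCAPED BY ... handling, not Hive's actual parser:

```java
import java.util.ArrayList;
import java.util.List;

public class DelimiterSketch {
    // Split a line on a delimiter while honoring a backslash escape,
    // roughly mimicking FIELDS TERMINATED BY '|' ESCAPED BY '\\'.
    public static List<String> split(String line, char delim) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '\\' && i + 1 < line.length()) {
                field.append(line.charAt(++i)); // escaped char is kept literally
            } else if (c == delim) {
                fields.add(field.toString());   // unescaped delimiter ends a field
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());
        return fields;
    }

    public static void main(String[] args) {
        // In Java source, "Lorem\\|Ipsum|Doler" is the file line Lorem\|Ipsum|Doler
        String line = "Lorem\\|Ipsum|Doler";
        System.out.println(split(line, '|')); // [Lorem|Ipsum, Doler]
        System.out.println(split(line, ',')); // [Lorem|Ipsum|Doler] -- one field
    }
}
```

With the pipe delimiter, the line splits into the two expected fields; with a comma, the whole line collapses into a single field, matching the failure described above.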

Implementing the Test Case as a Unit Test in HiveQLUnit

Now that we have a test case, we can use HiveQLUnit and Java to construct and execute it. As mentioned, HiveQLUnit is a JUnit extension, so the unit test is a Java class that fits into any suite of JUnit test classes:

    
    import org.apache.spark.sql.Row;
    import org.finra.hiveqlunit.resources.ResourceFolderResource;
    import org.finra.hiveqlunit.resources.TextLiteralResource;
    import org.finra.hiveqlunit.rules.SetUpHql;
    import org.finra.hiveqlunit.rules.TearDownHql;
    import org.finra.hiveqlunit.rules.TestDataLoader;
    import org.finra.hiveqlunit.rules.TestHiveServer;
    import org.finra.hiveqlunit.script.MultiExpressionScript;
    import org.finra.hiveqlunit.script.SingleExpressionScript;
    import org.junit.Assert;
    import org.junit.ClassRule;
    import org.junit.Rule;
    import org.junit.Test;

    public class SrcTableCreateTest {

        @ClassRule
        public static TestHiveServer hiveServer = new TestHiveServer();

        @Rule
        public TestDataLoader loader = new TestDataLoader(hiveServer);

        @Rule
        public SetUpHql prepSrc =
            new SetUpHql(
                hiveServer,
                    new MultiExpressionScript(
                        new TextLiteralResource("CREATE TABLE IF NOT EXISTS src (columnOne String, columnTwo String)\n"
                        + "ROW FORMAT DELIMITED\n"
                        + "FIELDS TERMINATED BY '|' ESCAPED BY '\\\\'\n"
                        + "NULL DEFINED AS ''")
                    )
            );

        @Rule
        public TearDownHql cleanSrc =
            new TearDownHql(
                    hiveServer,
                    new SingleExpressionScript(
                            new TextLiteralResource("DROP TABLE IF EXISTS src")
                    )
            );

        @Test
        public void testDelimiter() {
            loader.loadDataIntoTable("src", new ResourceFolderResource("/delimiterTest.txt"));

            Row[] results = hiveServer.getHiveContext().sql("SELECT columnOne from src").collect();

            Assert.assertEquals(1, results.length);
            Assert.assertEquals("Lorem|Ipsum", results[0].get(0));
        }

    }

This section presents only the complete test implementation; we will break it down into pieces for review in the next section.

This test class works with a file, 'delimiterTest.txt', in the resources folder of the test project. 'delimiterTest.txt' contains the single line of data required as input for the test:

    
    Lorem\|Ipsum|Doler

TestRules

Our JUnit test class 'SrcTableCreateTest' has a number of parts. Starting with version 4, the JUnit framework added a new extension hook: TestRules. TestRules are implementations of the TestRule interface and behave as more reusable versions of the functionality provided by the @Before and @After annotations. They are included in test classes as annotated fields: @ClassRule fields must be static, while @Rule fields must not be (a JUnit framework requirement). HiveQLUnit provides a number of TestRules, all demonstrated in the example 'SrcTableCreateTest' class.
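To see why TestRules are a natural home for setup and teardown logic, here is a minimal, dependency-free sketch of the wrapping mechanism. The Statement and TestRule interfaces below are simplified stand-ins for the real org.junit.rules types, and AroundRule is a hypothetical rule for illustration, not part of HiveQLUnit:

```java
// Simplified stand-ins for JUnit's Statement/TestRule pair (hypothetical
// types, not the real org.junit.rules interfaces), showing how a rule
// wraps a test body with before/after work, as SetUpHql and TearDownHql do.
public class RuleSketch {
    public interface Statement { void evaluate() throws Throwable; }
    public interface TestRule { Statement apply(Statement base); }

    // A rule that runs setup before the test and cleanup after it,
    // even when the test body throws.
    public static class AroundRule implements TestRule {
        private final Runnable before;
        private final Runnable after;

        public AroundRule(Runnable before, Runnable after) {
            this.before = before;
            this.after = after;
        }

        @Override
        public Statement apply(Statement base) {
            return () -> {
                before.run();
                try {
                    base.evaluate();   // the @Test method body
                } finally {
                    after.run();       // cleanup always runs
                }
            };
        }
    }

    public static void main(String[] args) throws Throwable {
        StringBuilder log = new StringBuilder();
        TestRule rule = new AroundRule(
                () -> log.append("CREATE TABLE; "),
                () -> log.append("DROP TABLE;"));
        rule.apply(() -> log.append("test body; ")).evaluate();
        System.out.println(log); // CREATE TABLE; test body; DROP TABLE;
    }
}
```

JUnit discovers annotated rule fields reflectively and chains their apply(...) calls around each test method; conceptually, HiveQLUnit's rules plug into that same hook.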

The first TestRule is TestHiveServer, which constructs an instance of HiveContext. The HiveContext provides access to a disposable Hive cluster, created just for testing, that runs on Apache Spark. TestHiveServer needs to be the first TestRule used in a test class, and must be annotated with the @ClassRule annotation:

    
    @ClassRule
    public static TestHiveServer hiveServer = new TestHiveServer();

The @ClassRule annotation tells the JUnit framework to create and execute the TestHiveServer only once for the entire test class; if we had multiple unit tests to run, they would all share the same testing server for efficiency. TestHiveServer needs to be the first TestRule because the TestRules that follow depend on it.
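The practical difference between the two annotations is how often their work runs. The toy sketch below (hypothetical, no JUnit dependency) models the lifecycle: class-level setup happens once, while per-test setup happens once per test, which is why the expensive Hive server is a @ClassRule and the cheaper rules are @Rules:

```java
// Toy model of JUnit rule lifecycles (illustrative only):
// a @ClassRule runs once per test class, a @Rule runs around every test.
public class LifecycleSketch {
    public static int classSetups = 0; // work done by a @ClassRule
    public static int testSetups = 0;  // work done by a @Rule

    public static void runSuite(int testCount) {
        classSetups++;                 // e.g. starting the Hive server
        for (int i = 0; i < testCount; i++) {
            testSetups++;              // e.g. running a SetUpHql script
            // ... a @Test method body would run here ...
        }
    }

    public static void main(String[] args) {
        runSuite(3);
        System.out.println(classSetups + " server start, "
                + testSetups + " per-test setups");
    }
}
```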

The TestDataLoader TestRule provides utility methods accessible from within test methods for loading data into a Hive table.

    
    @Rule
    public TestDataLoader loader = new TestDataLoader(hiveServer);

A SetUpHql TestRule takes in a Hive script and runs that script on the Hive server before every test method.

    
    @Rule
    public SetUpHql prepSrc =
        new SetUpHql(
            hiveServer,
                new MultiExpressionScript(
                    new TextLiteralResource("CREATE TABLE IF NOT EXISTS src (columnOne String, columnTwo String)\n"
                    + "ROW FORMAT DELIMITED\n"
                    + "FIELDS TERMINATED BY '|' ESCAPED BY '\\\\'\n"
                    + "NULL DEFINED AS ''")
                )
        );
    

In this test, we hardcoded the Hive script under test into the unit test for convenience. In a real project, the script under test would live in its own file and be loaded as a resource:

    
    @Rule
    public SetUpHql prepSrc =
        new SetUpHql(
            hiveServer,
                new MultiExpressionScript(
                    new ResourceFolderResource("/scriptUnderTest.hql")
                )
        );
    

TearDownHql is exactly like SetUpHql, except it runs after a @Test method executes.

    
    @Rule
    public TearDownHql cleanSrc =
        new TearDownHql(
            hiveServer,
                new SingleExpressionScript(
                    new TextLiteralResource("DROP TABLE IF EXISTS src")
                )
        );
    

Though the Hive server created by HiveQLUnit is disposable, state such as the metastore often persists between tests, so be sure to drop tables and otherwise clean up after every test to keep Hive clean. An unclean test environment will contaminate future executions with leftover test data from previous runs, giving invalid results.

The 'testDelimiter' Method

The 'testDelimiter' method is the core of the test:

    
    @Test
    public void testDelimiter() {
        ...
    }

It is in this method that we actually feed the input into Hive and validate the expected results. Our TestDataLoader loads the input data from 'delimiterTest.txt' into the 'src' table:

    
    loader.loadDataIntoTable("src", new ResourceFolderResource("/delimiterTest.txt"));

We then run a Hive query against the 'src' table:

    
    Row[] results = hiveServer.getHiveContext().sql("SELECT columnOne from src").collect();

Finally, we use the results of the query to validate the expected behavior:

    
    Assert.assertEquals(1, results.length);
    Assert.assertEquals("Lorem|Ipsum", results[0].get(0));

The 'src' table should contain one record, and the 'columnOne' value of that record should be 'Lorem|Ipsum'. Two Assert statements check this expected behavior, failing the test if either condition is not met. This completes the test we defined in 'Creating an Example Test Case'.

Conclusion

By working through this example, we have shown an approach for defining test cases as fixed inputs with expected outputs, and how HiveQLUnit can be used to execute those test cases as part of a JUnit suite. The example test case covered a simple script with simple logic, but the same strategy can be used to test more complicated queries and ETL processes as well.

Find general help at the HiveQLUnit project page and further readings in the HiveQLUnit user guides.