
HBase Case-Study: Using HBaseTestingUtility for Local Testing & Development

Motivation

As HBase matures, there is a growing demand for tools and methods that make the development process easier. Here at Sematext (@sematext) we've gone through our own per aspera ad astra learning process, in addition to Cloudera's Hadoop trainings and certifications. In this post we share what we've learned and show how one can use HBaseTestingUtility for this.

Suppose there is a system that processes data stored in HBase and displays the stored data via a reporting application. Data processing is done using Hadoop MapReduce jobs. During development, it would be desirable to be able to:

  • debug MapReduce jobs in an IDE
  • run the reporting application locally (on the developer's machine, without setting up a cluster), with the possibility of debugging it in an IDE
  • easily access data stored in HBase for debugging purposes ("easily" meaning "naturally", as if all rows were in a text file)

Disclaimer

The described use case and solution are just one option, one that makes use of HBaseTestingUtility and the underlying "mini" clusters. Depending on the context, this solution might not be optimal, but it is a good fit for presenting the ideas. This solution and this post should encourage developers to look at HBase's unit-test sources when constructing their own tests and/or when looking for ways to make debugging and development easier.

Problem Details

In our example there are two tables in HBase: one with raw data and another with processed data. Let's call them RawDataTable and ProcessedDataTable. We import data into RawDataTable via a simple import MapReduce job, which initially takes data from a log file. Subsequently, another MapReduce job processes the data in that table and stores the outcome in ProcessedDataTable. We use HBase Scan and Get operations to access the processed data from the client.

Solution

As stated in the javadocs, HBaseTestingUtility is a "facility for testing HBase". Its description adds a bit more explanation: "Create an instance and keep it around doing HBase testing. This class is meant to be your one-stop shop for anything you might need testing. Manages one cluster at a time only." In this post we describe one possible way to use it to achieve the goals described above.

Processing Data

Step 1: Init cluster.

The following code starts a "local" cluster and creates two tables:

private final HBaseTestingUtility testUtil = new HBaseTestingUtility();
private HTable rawDataTable;
private HTable processedDataTable;
…
void initCluster() throws Exception {
  testUtil.getConfiguration().addResource("hbase-site-local.xml");
  testUtil.getConfiguration().reloadConfiguration();
  // start mini hbase cluster
  testUtil.startMiniCluster(1);
  // create tables
  rawDataTable = testUtil.createTable(RAW_TABLE_NAME, RAW_TABLE_COLUMN_FAMILIES);
  processedDataTable = testUtil.createTable(PROCESSED_TABLE_NAME, PROCESSED_TABLE_COLUMN_FAMILIES);
  // start MR cluster
  testUtil.startMiniMapReduceCluster();
}

testUtil.startMiniCluster(1) starts a cluster with 1 datanode and 1 regionserver. You can start a cluster with a greater number of servers for test purposes.

Step 2: Import Data

We use a simple map-only job to import the data. Please refer to the org.apache.hadoop.hbase.mapreduce.ImportTsv class for an example of such a job. The following code runs the job on the newly created cluster, using locally stored files (e.g. a reasonably sized part of the log file) as input:

String[] importJobArgs = new String[] {RAW_TABLE_NAME, "file://" + inputFile};
if (!MyImportJob.createSubmittableJob(testUtil.getConfiguration(), importJobArgs).waitForCompletion(true)) {
  System.exit(1);
}

Step 3: Process Data

To process data in RawDataTable we run an appropriate MapReduce job in the same way as during the import:

if (!ProcessLogsJob.createSubmittableJob(testUtil.getConfiguration(), processLogsJobArgs).waitForCompletion(true)) {
  System.exit(1);
}

Step 4: Persist Processed Data

Since we need the processed data while developing and debugging our reporting application, we persist it in a local file. To have "easy" access to this data during debugging, it also makes sense to store the table data in a text file in a readable form (so that we can run "grep" and other handy commands on it). So we actually write to two files at once: a binary one and a readable one. The Result class implements the Writable interface, so there is a natural way to serialize its data.

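BufferedWriter bw = ...;
DataOutputStream dos = ...;
ResultScanner rs = processedDataTable.getScanner(new Scan());
try {
  Result next = rs.next();
  while (next != null) {
    // binary form, used later to load the data back into a table
    next.write(dos);
    // readable form, one row per line, for grep and similar tools
    bw.write(getHumanReadableString(next));
    bw.newLine();
    next = rs.next();
  }
} finally {
  // release the scanner's resources
  rs.close();
}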

After this step, the processed data is stored on the local disk and can be used for running the reporting application. Importing and processing of data is performed locally and is thus easier to debug.
In order to add extra processed data incrementally to the already stored data, instead of rewriting it from scratch, we need to load it from the file after cluster initialization as described in the following section.
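
The two-file idea can be tried out without a cluster. The following is a minimal sketch rather than the code used in this post: the Row class, the length-prefixed binary layout and the \xNN escaping are assumptions made for illustration, and toPrintable is only a hypothetical stand-in for a helper like getHumanReadableString, whose implementation is not shown here.

```java
import java.io.BufferedWriter;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

/** Illustrative only: a row is reduced to a key and a single value, both raw bytes. */
final class DualDumpSketch {

  /** Hypothetical stand-in for an HBase Result. */
  static final class Row {
    final byte[] key;
    final byte[] value;

    Row(byte[] key, byte[] value) {
      this.key = key;
      this.value = value;
    }
  }

  /** Renders bytes on one line: printable ASCII as-is, anything else (and the backslash) as \xNN. */
  static String toPrintable(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      int c = b & 0xFF;
      if (c >= 0x20 && c < 0x7F && c != '\\') {
        sb.append((char) c);
      } else {
        sb.append(String.format("\\x%02X", c));
      }
    }
    return sb.toString();
  }

  /** Writes every row twice: length-prefixed bytes for reloading, one text line for grep. */
  static void dump(List<Row> rows, DataOutputStream binary, BufferedWriter text) {
    try {
      for (Row row : rows) {
        binary.writeInt(row.key.length);
        binary.write(row.key);
        binary.writeInt(row.value.length);
        binary.write(row.value);
        text.write(toPrintable(row.key) + "\t" + toPrintable(row.value) + "\n");
      }
      binary.flush();
      text.flush();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Escaping tabs, newlines and other non-printable bytes keeps each row on exactly one line of the text file, which is what makes line-oriented tools usable on it.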

Fetching Data

To make the reporting application run on the "local" cluster instead of the "true" one, we create an alternative HTable factory. The reporting application code uses a single HTable object, instantiated by the factory, during its whole lifecycle. This is the best practice for minimizing the creation of HTable objects.
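
The factory idea can be sketched as follows. This is an illustration under assumptions, not the application's actual code: TableFactory and ReportingAppSketch are hypothetical names, and the type parameter T stands in for the table handle type (HTable in this post) so that the sketch compiles without HBase on the classpath.

```java
/** Hypothetical factory abstraction; T stands in for the table handle type. */
interface TableFactory<T> {
  T create(String tableName);
}

/** Reporting code receives a factory and keeps the one table it creates for its whole lifetime. */
final class ReportingAppSketch<T> {
  private final T table;

  ReportingAppSketch(TableFactory<T> factory, String tableName) {
    // The factory is called exactly once; all later reads reuse the same handle.
    this.table = factory.create(tableName);
  }

  T table() {
    return table;
  }
}
```

With such a seam, a production setup could pass a factory that opens the table on the real cluster, while a local setup could pass one that returns the table created on the mini cluster. The reporting code itself stays the same in both cases.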

Step 1: Init cluster.

This step is exactly the same as described previously.

Step 2: Load processed data.

We use the file created during the data processing stage to load the data back into the newly initialized cluster:

DataInputStream dis = ...;
while (true) {
  Result next = new Result();
  try {
    // read from the input stream (dis)
    next.readFields(dis);
  } catch (EOFException e) {
    // reached the end of the file
    break;
  }
  if (next.getRow() == null) {
    break;
  }
  // rebuild the row from its stored KeyValues
  Put put = new Put(next.getRow());
  for (KeyValue kv : next.raw()) {
    put.add(kv);
  }
  processedDataTable.put(put);
}

Once all the data is loaded, the constructed processedDataTable can be used by the reporting application code. The app can now also be started and debugged easily from an IDE.
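
The "read records until the file ends" pattern used in the load step can be shown in isolation. The sketch below is an illustration only: it assumes a simple length-prefixed record format (an int length followed by that many bytes), which is not the layout that Result's Writable serialization uses, and RecordReaderSketch is a hypothetical name.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: reads length-prefixed records until the input is exhausted. */
final class RecordReaderSketch {

  static List<byte[]> readAll(InputStream in) {
    List<byte[]> records = new ArrayList<>();
    try (DataInputStream dis = new DataInputStream(in)) {
      while (true) {
        int length;
        try {
          length = dis.readInt();
        } catch (EOFException e) {
          // No further length prefix: treated as the normal end of the file.
          break;
        }
        byte[] payload = new byte[length];
        // An EOFException here means a truncated record and is reported as an error.
        dis.readFully(payload);
        records.add(payload);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return records;
  }
}
```

Catching EOFException only around the first read of each record separates a clean end of file from a file that was cut off in the middle of a record, so an empty file yields an empty list while a damaged one fails loudly.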

Next Steps

Internally HBaseTestingUtility makes use of a whole bunch of "mini" clusters: MiniZooKeeperCluster, MiniDFSCluster, MiniHBaseCluster and MiniMRCluster. Refer to the unit-test implementations in the source code of respective projects to get more examples on how to use them.

Thank you for reading, we hope you found this useful. Follow @sematext on Twitter to be notified of new posts on Hadoop, HBase, Lucene, Solr, Mahout, and other related topics.

