In my two recent blogs, I talked about the Terasort and the TestDFSIO benchmarks for Hadoop and how using SanDisk® SSDs led to a significant performance improvement for these benchmarks. These benchmarks are great starting points to study Hadoop and the benchmarks gave us very encouraging results for using Flash within Hadoop clusters.
So, the next logical step in my Hadoop+Flash research was to test more real-life environments. Over the recent couple of years, the IT industry is increasingly adopting Hadoop in Big-Data environments. And many in the IT industry are transitioning to Hadoop infrastructures for their Business Intelligence and Data Analytics requirements. This presented me with the real-life environment to continue studying Hadoop+Flash environments.
So, at SanDisk Labs, I setup a Hadoop cluster using the Cloudera® Distribution of Hadoop (CDH). This cluster consisted of one NameNode and six DataNodes. SanDisk CloudSpeed Ascend™ SATA SSDs were used in the Hadoop cluster. I used the AMPLab Big-Data benchmark with Apache Hive on this cluster, with different storage configurations to study how SanDisk SSDs can help in the business intelligence and data analytics applications. This blog summarizes the results of the testing, after briefly introducing Apache Hive and the AMPLab benchmark.
Data Analytics and Business Intelligence on Hadoop
Business intelligence and Data Analytics involves a set of algorithms and processes which are used to inspect, filter, analyze and transform raw data into meaningful and relevant information used for businesses decision making in various departments ranging from marketing and sales to product development and customer relationship management. Business Intelligence systems typically parse through large volumes of data, available in a data warehouse system.
A data warehouse integrates data from a variety of sources within the business organization and creates a central repository of raw data, which can be analyzed and summarized into reports and used for various business activities. Data warehouses are also called Online Analytical Processing systems (OLAPs). Unlike Online Transaction Processing systems (OLTPs), these systems comprise of fewer but more complex queries operating on very large volumes of data. Both traditional OLAP and OLTP systems use structure query language (SQL) to query for information.
In the Hadoop ecosystem, data analytics and business intelligence can be provided by a variety of products, including (but not limited to) Apache Hive, Apache HBase, Cassandra, MongoDB and other NoSQL variants.
Apache Hive is a data warehouse based on Hadoop/HDFS which understands SQL-like queries. Hive uses a SQL-like language called HiveQL. Hive converts the HiveQL queries into corresponding map-reduce tasks to perform the necessary analytics on the data warehouse stored in HDFS. Hive does not provide OLTP (Online transaction processing) support, but functions exclusively as a data warehouse for batch oriented data analytics workloads.
AMPLab Big Data benchmark is a benchmark tool to analyze a variety of data analytics frameworks, including data warehouse / analytics systems based on Hadoop HDFS and MapReduce/Spark/Tez (example: Hive), systems which impose Massively Parallel Processing (MPP) functionality on Hadoop (example: Impala, HAWQ), or traditional MPP data analytics systems (example: RedShift). The benchmark measures response time for a set of relational queries, including scans, aggregations and joins.
The AMPLab Big-Data benchmark was used to test Apache Hive on the SanDisk labs Hadoop cluster. The ‘5nodes’ dataset (~270GB in size) provided by AMPLab was used for the testing. The AMPLab ‘scan’, ‘aggregate’ and ‘join’ query sets were run against Apache Hive. The response time for each of these query sets was collected.
Three different storage configurations were used on the Hadoop cluster:
- All-HDD configuration
- Hybrid configuration: HDDs were used for data node directories and SSDs were used for the MapReduce local directories, which are used in the shuffle/sort phase
- All-SSD configuration
The response time results are summarized in the following graph:
As seen from the graph above, the testing revealed that SanDisk SSDs can be deployed strategically with Apache Hive data warehouse on Apache Hadoop to provide 9-27% reduction in the Hive query response times (for the ‘aggregate’ and ‘join’ query sets).
The faster query response times translate to faster data analysis results, which in turn can help expedite important business decisions, thus presenting a compelling use of SanDisk SSDs in real-world customer use-cases.
These results are described with more detail in a technical paper – download the PDF here.