In our SanDisk® labs, I conducted a number of experiments with Apache Hadoop and SanDisk flash, mainly our CloudSpeed Ascend SATA Solid State Drives (SSDs). The initial experiments used two standard Hadoop benchmarks, Terasort and TestDFSIO. They showed me how SanDisk SSDs boost the performance of Terasort and TestDFSIO jobs. By focusing on the cost per job, I was also able to extrapolate those performance benefits to show a lower cost of ownership for SSDs in a Hadoop environment. These benchmarks were great starting points for studying flash within the Hadoop ecosystem.
Following these standard synthetic benchmarks, I moved on to testing something closer to a real-world workload: a data analytics workload using Apache Hive and Hadoop with SanDisk SSDs. This testing again showed me how SSDs can help reduce query response times and therefore improve the efficiency of business processes.
More Real-Life Datasets and Workloads
My continued search for Hadoop benchmarks that closely reflect real-world workloads then brought me to the SWIM (Statistical Workload Injector for MapReduce) benchmark. This benchmark provides a repository of real-life MapReduce datasets from production systems. It also provides tools to generate representative workloads that operate on these real-life datasets. The benchmark allows rigorous performance and stability testing of MapReduce systems.
An abbreviated version of the SWIM benchmark also forms the basis of the Cloudera Hardware certification test suite, which provides a strong statement of partnership and confirmed interoperability of technologies to the thousands of customers around the world running Cloudera.
Cloudera Hardware Certification
To continue my testing with real-life datasets and workloads, and to nurture the strong partnership between Cloudera and SanDisk, I ran the Cloudera Hardware certification test suite on the SanDisk labs cluster.
If you are not familiar with the program, “the Cloudera Certified Technology program was created to make it simpler for Apache Hadoop technology buyers to purchase the right components and software applications to extract the most value from their data.
Building a Hadoop cluster from the ground up can be challenging. There are numerous choices to be made at all levels of the stack and making those choices can be complex. The Cloudera Certified Technology program is designed to make choosing the right technology easier.” (You can learn more about Cloudera Certified Technology on the Cloudera website.)
The lab cluster ran the Cloudera® Distribution of Hadoop (CDH), version 5.1, and had one NameNode and eight DataNodes, which were populated with SanDisk CloudSpeed Ascend™ SATA SSDs. The cluster was set up according to the test suite requirements and recommendations. The certification test suite completed without any failures. The results, along with the necessary diagnostics, were submitted to Cloudera to obtain official certification for the CloudSpeed Ascend SATA SSDs.
With this testing complete and confirmed with Cloudera, I am proud to announce that our CloudSpeed Ascend SATA SSDs are a Cloudera Certified Technology Product!
For customers, Cloudera Certified Technologies such as SanDisk SSDs operate with lower risk and lower total cost of ownership (TCO), and they comply with Cloudera development guidelines for integration with Hadoop, ensuring better, trusted value.
If you have any questions about the Cloudera certification, you can reach me at firstname.lastname@example.org. You can also join our conversation on flash and big data on Twitter at @SanDiskDataCtr.