Hadoop means business. What started out as a highly technical analytics technology, designed to make large web services like Yahoo! run better, is now having a strong impact on a wide range of data-driven businesses.
Data scientists in banking, healthcare, retail, and telecommunications are among those leveraging Hadoop for better and faster analysis of extremely large datasets – and they described that work at Hadoop Summit 2014 in San Jose, June 3-5, 2014. These business users are applying Hadoop to detect credit-card fraud, determine which retail promotions work best, and optimize logistics and distribution networks. Increasingly, they are also using graphical tools that reveal visual patterns and trends in the data. That alone is a reason why Hadoop-style analysis is breaking new ground – turning raw data into useful information.
The first edition of Hadoop (Hadoop 1.0) worked very well for running MapReduce jobs in batch mode, but those workloads ran one at a time. With Hadoop 2.0, customers can run multiple workloads simultaneously – getting more types of work done on a single Hadoop cluster – which, in my view, is pragmatic and cost-efficient.
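To see what those batch MapReduce jobs actually do, here is a minimal sketch of the map/reduce programming model in plain Python – a word count, the canonical example. This illustrates the model only; it is not Hadoop's Java API, and the function names are my own.

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every input record."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce step: sum the counts for each distinct key."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["Hadoop runs batch jobs", "Hadoop scales batch analytics"]
result = reduce_phase(map_phase(docs))
print(result["hadoop"])  # 2
print(result["batch"])   # 2
```

In a real cluster, the map and reduce steps run in parallel across many nodes, with HDFS holding the input and output – the point of Hadoop 2.0 is that several such jobs (and non-MapReduce workloads) can now share one cluster.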
Now, with the seventh Hadoop Summit, a broadening of the Hadoop ecosystem is well underway as the technology enters its next phase: addressing business needs with stricter SLAs, stronger security, SQL query support and, of course, performance and cost-efficiency. Enterprises and cloud providers expect these things as Big Data and analytics grow in importance to their business users.
At the same time, it’s clear that the Hadoop ecosystem is expanding to include more industry vendors supporting Hadoop, and more data services surrounding this growing platform for data analysis. Apache, HortonWorks and Cloudera were there, of course – but so were longtime enterprise firms like BMC, Cisco, Compuware, Dell, HP, IBM, Intel, Informatica, Microsoft, Oracle, Red Hat, RedPoint, salesforce, SAP, SAS, SUSE, SyncSort, and VMware, among others.
Enter Hadoop 2.0
The emergence of Hadoop 2.0 allows the platform to support many workloads with more real-time processing, rather than just one workload at a time. The YARN resource manager makes Hadoop even more useful as a corporate resource, scheduling multiple analytics jobs on the same cluster at once. Hadoop 2.0 also adds a layer of high availability by providing a standby copy of the NameNode – the metadata service that tracks where all the data blocks are stored across the cluster.
At the same time, the number of Hadoop distributions is growing, including Apache Hadoop, and distros from HortonWorks, Cloudera, IBM, Intel and others – with each distro adding its own spin to the fundamental Hadoop value proposition.
Simply put: Hadoop and HDFS give businesses a software tool to sort through the terabytes, petabytes and exabytes of data confronting them – not only from their own operations, but also from the Internet of Things (IoT). What is already a data tsunami sweeping into data centers can be addressed, and analyzed, with Hadoop and its ecosystem of software utilities and programs.
Some of the companies and organizations that spoke on these topics included: Ancestry.com, AT&T, Bloomberg, Cardinal Health, Deutsche Telekom, Hulu, LinkedIn, Safeway Inc., Sprint, Thomas Cook Travel, TrueCar Inc., and Twitter – and speakers from academia, including those from UC Berkeley, UCLA, Stanford and University of Washington.
Opportunity for Flash Storage
Although it was understated at the conference, there is a very real opportunity here for flash storage to help users keep pace with the data flowing across their Hadoop clusters. At the moment, in-memory computing, centered on the use of DRAM memory, is being highlighted by some large customers as an acceleration mechanism for Hadoop.
While promising, in-memory compute (IMC) solutions at customer sites are usually limited to the few applications that need them most. In contrast, flash storage – available as solid-state drives (SSDs) for servers and storage arrays – will likely be seen as a ubiquitous enhancement for Hadoop processing. That’s because a range of form factors allows SSDs to replace hard-disk drives (HDDs) on a one-for-one basis, or to be installed alongside HDDs in a mixed, or “hybrid,” deployment within the same Hadoop cluster.
At SanDisk®, testing of Hadoop workloads has already shown that adding SSDs to Hadoop clusters – even on a small number of the server nodes within a cluster – brings processing benefits. In one test of a six-node Hadoop cluster with SSDs, processing times were 32% faster, and the solution cost 15% less, than a similar Hadoop cluster based on HDDs alone. (Learn more in this Hadoop Solution Brief.)
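Those two figures compound. As a hedged back-of-envelope sketch – assuming runtime and solution cost combine multiplicatively into a cost-per-job metric, which is my modeling assumption and not something the solution brief states – the savings per completed job would look like this:

```python
# Back-of-envelope: relative cost per completed job, using the two
# headline figures from the SanDisk test. Treating cost-per-job as
# (solution cost) x (runtime) is an assumption for illustration only.
hdd_time = 1.0                      # baseline runtime (normalized)
hdd_cost = 1.0                      # baseline solution cost (normalized)

ssd_time = hdd_time * (1 - 0.32)    # 32% faster processing
ssd_cost = hdd_cost * (1 - 0.15)    # 15% lower solution cost

relative_cost_per_job = (ssd_cost * ssd_time) / (hdd_cost * hdd_time)
print(round(relative_cost_per_job, 3))  # 0.578
```

Under that assumption, the SSD-equipped cluster would deliver each job at roughly 58% of the baseline cost – about a 42% saving per job, not just the 15% headline figure.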
Moving forward, Hadoop use will become more widespread – and use cases will grow. This means cluster resources should be used as efficiently as possible, while giving each Hadoop 2.0 task the scalable resources it needs.
But the latency built into 15K-RPM HDD technology is not going away anytime soon. HDDs account for the majority of disk drives installed today, and although many older drives will be replaced through technology refresh cycles, the unrelenting need to boost Hadoop performance will accelerate demand for fast SSD storage.
Without a storage alternative to HDDs, customers will face the prospect of building “walls and walls” of servers to provide enough capacity and processing power to run their Hadoop workloads. This suggests that flash storage could – and should – play an important role in saving data center space while scaling up Hadoop clusters – and in providing high-IOPS performance on individual servers within those clusters.
Madhura Limaye contributed to this blog post.