The Machine Generated Data 5V Challenge: Volume, Velocity, Veracity, Variety and Value
Machine Generated Data (MGD) is one of the fastest growing and most complex areas of Big Data. The reason is that all applications, systems and IT infrastructure generate data every millisecond of every day. That means a huge volume of data, a great variety of data types, and an unimaginable velocity (speed) at which all this data is generated. But the bottom line is value. Companies can extract great value from data that captures user transactions, customer behavior, sensor readings, security threats, fraudulent activity and more, but they can only do this if there’s data veracity: data of sufficient quality to produce credible results.
As such, making use of MGD presents many challenges. Traditional data analysis, management and monitoring solutions, such as data warehouses and business intelligence tools, were not engineered to deal with the 5V’s of Big Data. They are batch-oriented systems and require structured data for analysis.
For this reason, new platforms were created to deliver better ways of sifting, distilling and understanding machine data so that IT organizations can leverage valuable insights. In this blog I’ll be looking at one such platform, Splunk, and at how leveraging SanDisk® InfiniFlash and Tegile IntelliFlash for Splunk enables an Operational Intelligence Data Platform that delivers breakthrough performance, scale and TCO.
Splunk as an Operational Intelligence Data Platform
Splunk started as a log-structured analysis system and has since evolved into a full-blown, machine-generated data processing platform. In fact, it is described as “Google for visual analytics.”
Splunk has quickly moved from predictive analytics for IT operations to broader use cases such as Security Information and Event Management (SIEM), business analytics with HUNK using virtual indexes, and now also the Industrial Internet and the Internet of Things (IoT).
The IoT Opportunity
Although Splunk has established itself as a leader for other use cases, IoT is by far the largest market, and here are a few reasons why:
- Cisco estimates that 50 billion devices and objects will be connected to the internet by 2020. Yet today, more than 99% of things in the physical world remain unconnected.
- Gartner estimates that by 2020 IoT product and service suppliers will generate incremental revenue exceeding $300 billion, mostly in services.
- Infrastructure, volume, bandwidth, security and battery life are all going to change as IoT makes a significant impact on data centers.
- More companies are making strategic moves into this space. Google made a long-term bet with the $3.2 billion acquisition of Nest, bringing Google into the IoT revolution and into our smart homes.
- We at SanDisk are also working on innovation and technology that brings us closer to making the ‘Internet of Things’ (IoT) pervasive. We recently expanded our commitment to the connected device market with a strategic investment in Altair Semiconductor and I will share more in this blog about our Splunk solution with Tegile.
Splunk Architecture for an Operational Intelligence Platform, Source: Splunk
If you are not familiar with Splunk, its tiered architecture is built from various blocks that can be described as follows:
- Search Head – Searching and Reporting
- Indexers – Indexing and Search Services
- Forwarders – Data Collection and Forwarding
- Data Management:
  - Indexer Cluster Master, Search Head Cluster Deployer
  - Distributed Management / Deployment Server
  - License Master, Distributed Management Console
The search heads allow querying of data sets either with Splunk’s SPL (Search Processing Language) or through one of several applications from the rich ecosystem.
The Indexers serve 3 primary roles:
- Data Storage: processing and parsing at index time as well as indexing
- Data Management: rotation of data and data tiers (hot / warm / cold) and the aging and removal of data.
- Data Retrieval: performing searches upon request and returning data to search heads
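To make the data management role more concrete, here is a minimal sketch of age-based tiering. It is an illustration only: the thresholds and the `tier_for_age` function are hypothetical, and Splunk itself rolls data between tiers using more criteria than age alone (size, for example).

```python
# Hypothetical age thresholds, in days. These are assumptions for illustration;
# real deployments configure retention per index.
WARM_AFTER_DAYS = 1
COLD_AFTER_DAYS = 30
FROZEN_AFTER_DAYS = 365


def tier_for_age(age_days: float) -> str:
    """Return the tier that data of the given age falls into under the thresholds above."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days < WARM_AFTER_DAYS:
        return "hot"      # still being written to and searched most often
    if age_days < COLD_AFTER_DAYS:
        return "warm"     # recent, read-only
    if age_days < FROZEN_AFTER_DAYS:
        return "cold"     # older, searched less frequently
    return "frozen"       # aged out: archived or removed


# Example: classify data of a few different ages.
tiers = {age: tier_for_age(age) for age in (0.5, 10, 100, 400)}
```

The point of the sketch is simply that each tier has a different access pattern, which is why the storage underneath each one can be chosen differently.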
Both Indexers and Search heads can provide clustered deployments on-premises or in a geo-distributed configuration.
Splunk Enterprise can be deployed as a single instance or as a distributed deployment, and it offers very broad Forwarder support, ranging from network devices to IoT devices. It also has a variety of connectors that allow indexing data from a number of structured sources (like Enterprise Data Warehouse systems and Operational Systems), as well as from HDFS and S3 based data lakes using HUNK and virtual indexes. In addition, an HTTP Event Collector supports DevOps and IoT data analysis.
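As a rough illustration of how a device or script might hand an event to the HTTP Event Collector, the sketch below builds (but does not send) a request using only the Python standard library. The host name, token, sourcetype and event fields are placeholders I made up; the endpoint path and the `Authorization: Splunk <token>` header follow the collector's documented conventions, but check them against your own Splunk version.

```python
import json
import urllib.request


def build_hec_request(base_url: str, token: str, event: dict, sourcetype: str) -> urllib.request.Request:
    """Build a POST request carrying one JSON event for the HTTP Event Collector."""
    payload = {"event": event, "sourcetype": sourcetype}
    return urllib.request.Request(
        url=f"{base_url}/services/collector/event",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Splunk {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values: replace with your own collector URL and token.
req = build_hec_request(
    "https://splunk.example.com:8088",
    "00000000-0000-0000-0000-000000000000",
    {"device_id": "sensor-42", "temperature_c": 21.5},
    "iot:sensor",
)
# Passing req to urllib.request.urlopen would actually send the event.
```

Building the request separately from sending it keeps the example safe to run anywhere and makes the payload easy to inspect.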
InfiniFlash-Based Data Grids for Operational Intelligence Platforms
Operational Intelligence Platforms have significant system requirements for operation and require underlying systems that can deliver a scalable, resilient, distributed, enterprise-grade platform. Just as traditional data platforms weren’t designed for the needs of Big Data, traditional storage systems were not designed for the challenges of volume, variety and velocity at scale. New systems are required to take on the challenge of performance at scale while keeping costs at bay.
SanDisk’s InfiniFlash system, which IDC defined as “Big Data Flash,” delivers massive capacity with extreme performance and breakthrough economics. A single InfiniFlash system features up to 512 terabytes (TB) of flash using a new form factor in a 3U enclosure. The solution was designed to take on the massive capacity requirements of Big Data, and deliver accelerated performance at unprecedented economics.
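For a quick feel of what those figures mean in rack space, here is a back-of-the-envelope sketch. It uses only the two numbers quoted above (512 TB, 3U) and treats them as raw capacity; it ignores replication, data protection overhead and Splunk’s own compression, and the 2,000 TB target is just an example value.

```python
import math

ENCLOSURE_CAPACITY_TB = 512  # from the text: up to 512 TB per system
ENCLOSURE_HEIGHT_U = 3       # from the text: 3U enclosure


def density_tb_per_u() -> float:
    """Raw capacity per rack unit at maximum configuration."""
    return ENCLOSURE_CAPACITY_TB / ENCLOSURE_HEIGHT_U


def enclosures_needed(target_tb: float) -> int:
    """Whole enclosures required to reach a raw capacity target."""
    return math.ceil(target_tb / ENCLOSURE_CAPACITY_TB)


def rack_units_needed(target_tb: float) -> int:
    """Rack units occupied by the enclosures needed for the target."""
    return enclosures_needed(target_tb) * ENCLOSURE_HEIGHT_U


# Example target of 2,000 TB (an assumed figure, roughly 2 PB).
example_enclosures = enclosures_needed(2000)
example_rack_units = rack_units_needed(2000)
```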
InfiniFlash-based data grids can deliver dramatic advantages for building Splunk data platforms as they can support:
- Faster Ingest
With InfiniFlash, you can capture millions of events per second without losing events. New data is available for analysis in the shortest timeframe, and you can easily scale.
- Faster Query / Visualization
InfiniFlash makes critical reports available in record time. It supports real-time queries with millisecond latencies, for both sparse and dense queries.
- Indexing: Any Performance You Need
The various formats and availabilities of flash solutions can easily be matched to Splunk tiered pipelines for hot, warm, cold and frozen indexes, and for both sequential and random I/O. Flash can support high-throughput batch jobs and low-latency real-time queries while handling disparate data sources and bursty workloads, and it can store data in a schema-free way.
- Ease of Scalability and Peace of Mind Reliability
The superior performance of flash means IT requires far less hardware to deploy and manage, and you can scale easily from terabytes to petabytes with rackscale architectures that require only minimal investment in infrastructure.
With InfiniFlash you can safely store multi-terabyte data pools for long periods and get predictable performance with a very low annual failure rate (AFR).
Splunk Flash Tiering with SanDisk Big Data Flash and Tegile IntelliFlash
The best way to leverage flash is to deploy it with intelligent software that can take advantage of its benefits and maximize its advantages for the use case.
Our partner Tegile’s patented IntelliFlash OS accelerates metadata handling, which provides key differentiators with its ability to ingest Splunk data into a bucket of memory that extends to a large read-write bucket on SSDs. The cache pool is dynamically allocated in real-time as data is written to and read from the Tegile array. Metadata and cache can also be increased non-disruptively to meet scaling performance needs. Once the storage pools are attached to the Splunk buckets, no further administration is required and Splunk handles the placement of data.
Splunk and Tegile IntelliFlash automatically place hot, warm and cold data into the appropriate storage tier within the same storage array and filesystem based on the stage of the Splunk pipeline – Ingest, Search, Index, Query, Visualize. This unique and converged architecture meets Splunk’s application SLAs and negates the need to traverse data over a network between separate storage arrays or filesystems.
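To give a sense of what attaching storage to Splunk’s data tiers can look like on the Splunk side, the sketch below renders an `indexes.conf`-style stanza. This is not the documented Tegile configuration: the index name and mount points are hypothetical, and in the converged setup described above both paths could live on the same array. `homePath` (hot and warm data), `coldPath` and `thawedPath` are standard Splunk index settings.

```python
def render_index_stanza(index_name: str, fast_mount: str, capacity_mount: str) -> str:
    """Render an indexes.conf-style stanza that points each tier at a mount point."""
    lines = [
        f"[{index_name}]",
        # Hot and warm data share homePath.
        f"homePath = {fast_mount}/{index_name}/db",
        # Cold data goes to coldPath.
        f"coldPath = {capacity_mount}/{index_name}/colddb",
        # thawedPath holds data restored from an archive.
        f"thawedPath = {capacity_mount}/{index_name}/thaweddb",
    ]
    return "\n".join(lines)


# Hypothetical index name and mount points, for illustration only.
stanza = render_index_stanza("iot_sensors", "/mnt/flash_hot", "/mnt/flash_capacity")
print(stanza)
```

Once paths like these are in place, Splunk itself decides when data moves from one path to the next, which matches the hands-off behavior described above.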
Splunk and Tegile IntelliFlash HD Architecture Diagram
Best in Class Flash Stack: SanDisk InfiniFlash and Tegile IntelliFlash HD
By partnering to deliver a seamless joint solution, SanDisk InfiniFlash and Tegile IntelliFlash enable an Operational Intelligence Data Platform that delivers breakthrough performance, scale and TCO.
This solution delivers three key benefits:
- Optimize and Accelerate
By deploying our joint solution, customers can accelerate search, index and reporting performance using a flash-based architecture. They can optimally handle both sequential and random I/O requirements during the Ingest, Search, Index and Query phases, onboard and analyze larger datasets, optimize resource utilization and use vertical scaling to maximize the use of CPU power.
- Confidence of World-Wide Support
With worldwide support, tight SLAs and our FlashStart™ capabilities, we ensure a smooth installation and customer experience. Tegile’s IntelliCare delivers cloud-based analytics and support infrastructure, and our tight OEM relationship assures users of best-in-class service.
- Best Price Performance
As a manufacturer of flash memory, SanDisk brings fab economics and flash technology innovation that deliver highly cost-effective storage at under $1 per effective GB, making Big Data Flash a reality today.
By dramatically reducing TCO, hardware and energy footprint, SanDisk and Tegile make an Operational Intelligence Data Platform for Machine Generated Data (MGD) and the Internet of Things (IoT) an accessible reality for more organizations, helping them take advantage of new insights to transform their business.
Learn more about the SanDisk InfiniFlash on SanDisk.com and the Tegile IntelliFlash solution at Tegile.com. I welcome your questions in the comments section below.