Hadoop for the Enterprise

Hadoop 101

Apache Hadoop has come a long way in the ten-plus years since its inception. What was originally a powerful but narrowly targeted framework for helping Yahoo! engineers index the web has evolved into a comprehensive, enterprise-grade platform for large-scale data storage, processing, and analytics. It is not a stretch to say that Hadoop has fundamentally changed the way enterprises think about data and analytics and has ushered in the Era of Big Data.

So what exactly is Hadoop, and what makes it such a game-changing approach? Two capabilities set Hadoop apart from more traditional databases and data warehouses.

  1. At its heart, Hadoop was designed to store and process staggering volumes of data. It does this by distributing data across a cluster of commodity machines and processing it in parallel via a programming model called MapReduce. The largest Hadoop deployments to date store petabytes of data across tens of thousands of nodes.
  2. Most traditional databases support only structured data and require complex data models to be set up in advance. Hadoop, in contrast, can ingest and store virtually any variety of data you throw at it (structured, semi-structured, or unstructured) via the Hadoop Distributed File System (HDFS), with no predetermined data model required; the sketch following this list shows how simple ingestion can be.
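
To make that second point concrete, here is a minimal sketch of loading a raw file into HDFS with the Hadoop FileSystem Java API. The file and directory names are hypothetical, and the client is assumed to pick up a standard cluster configuration (core-site.xml) from its classpath; note that no schema or table definition is declared anywhere.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class IngestToHdfs {
        public static void main(String[] args) throws Exception {
            // Standard Hadoop client configuration, read from core-site.xml on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Copy a raw local file (JSON, CSV, server logs, binary media -- anything) into HDFS.
            // Nothing about the file's structure has to be declared up front.
            Path localFile = new Path("clickstream-sample.json");   // hypothetical source file
            Path targetDir = new Path("/data/raw/clickstream");     // hypothetical HDFS directory
            fs.copyFromLocalFile(localFile, targetDir);

            fs.close();
        }
    }

The same call works no matter what the payload is; structure is imposed later, at read time, by whichever processing framework consumes the data.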

In addition, Apache Hadoop introduced (or at least reintroduced) a key concept to data processing: bring the compute to the data. Rather than extracting data from its resting place and bringing it to an external data analytics tool or platform (bringing the data to the compute), Hadoop allows data scientists and business analysts to perform analytics on data directly where it is stored, in HDFS. This is particularly important when it comes to Big Data. Moving large volumes of data from one environment to another takes time and effort, and the job becomes harder as data volumes increase.
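
The classic illustration of this principle is a word-count MapReduce job, sketched below in the spirit of the standard Hadoop tutorial example. The framework schedules map tasks on (or near) the nodes that hold the input blocks in HDFS, so the computation travels to the data rather than the other way around; the HDFS input and output paths are passed in as arguments.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: runs in parallel on the nodes holding each block of input,
        // emitting a (word, 1) pair for every token it sees.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sums the counts for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Because each map task reads a block stored locally (or on the same rack), only the much smaller intermediate and final results move across the network.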

Hadoop also makes storing, processing, and analyzing huge volumes of data affordable for enterprises of virtually all sizes. Most traditional data warehouses rely on expensive, proprietary hardware. Hadoop leverages inexpensive commodity hardware. The result is that enterprises can store, process, and analyze massive volumes of data with Hadoop at a fraction of the cost of doing the same with a proprietary data warehouse.

As a result of its power and flexibility, Hadoop has been adopted by enterprises across industries. Many use Hadoop as the basis of what’s called a Data Lake: a repository for huge volumes of raw data from various sources, which can then be analyzed with any of a number of analytics tools. Hadoop-based Data Lakes support a variety of use cases across industries. Retailers use Hadoop to help develop personalized offers for customers; financial services firms use Hadoop to support fraud detection and prevention; healthcare organizations use Hadoop to help develop targeted treatments for diseases; and the list goes on.

While Hadoop is powerful and flexible, analyzing the data stored in it can be challenging for all but the most sophisticated data scientists. That’s why Pivotal developed Pivotal HDB. Pivotal HDB, powered by Apache HAWQ (incubating), is the Hadoop-native database for data science and machine learning workloads and is the key to unlocking the business value of Hadoop. Pivotal HDB runs natively on any ODPi-certified Hadoop distribution, including the Hortonworks Data Platform (HDP), and allows data scientists, business analysts, and even casual business users to analyze data in Hadoop using familiar SQL syntax.
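
As a rough sketch of what that looks like in practice: HAWQ comes from the PostgreSQL/Greenplum lineage and accepts connections from standard PostgreSQL client drivers, so an analyst or application can issue ordinary SQL against tables whose data lives in HDFS. The host, port, database, credentials, and the sales table below are purely illustrative.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HawqQueryExample {
        public static void main(String[] args) throws Exception {
            // Pivotal HDB / Apache HAWQ speaks the PostgreSQL wire protocol, so the
            // stock PostgreSQL JDBC driver is used here. Connection details and the
            // table being queried are hypothetical.
            String url = "jdbc:postgresql://hawq-master.example.com:5432/analytics";

            try (Connection conn = DriverManager.getConnection(url, "gpadmin", "changeme");
                 Statement stmt = conn.createStatement();
                 // Familiar ANSI SQL over data stored in HDFS.
                 ResultSet rs = stmt.executeQuery(
                         "SELECT region, SUM(amount) AS total_sales " +
                         "FROM sales GROUP BY region ORDER BY total_sales DESC")) {
                while (rs.next()) {
                    System.out.println(rs.getString("region") + "\t" + rs.getString("total_sales"));
                }
            }
        }
    }

The parallel planning and execution of the query across the cluster happens under the covers; the user just writes SQL.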

Apache Hadoop is a critical building block of modern data architecture. Together, Pivotal HDB and Apache Hadoop provide enterprises all the capabilities they need to extract massive value from massive volumes of data.