Over the past several years, enterprises have started to adopt Apache Hadoop as a core component of their big data analytics stack.
The original data processing framework for Hadoop, called MapReduce, served Hadoop’s early use cases well. But it has become apparent over time that MapReduce is not the easiest way for business analysts and business users to connect to data stored within the Hadoop ecosystem. To unlock the true value of the entire data stored within Hadoop, enterprises need a simpler way of connecting to the data to query, analyze and even perform deep data science.
SQL has a long history in the enterprise. It is the lingua franca of data, used by a wide range of users to connect with their data. SQL support on Hadoop began with Apache Hive, a SQL-like query engine that compiles a limited SQL dialect to MapReduce. However, Hive’s complete reliance on MapReduce results in large query latencies, making it useful primarily for batch-mode operations. Also, while Hive’s support for SQL is a good start, its limited support for SQL means that seasoned SQL users start often run into severe limitations as they work on enterprise-class use cases. This also means the large ecosystem of SQL-based tools cannot easily connect with Hive. Fortunately, progress has been made to provide better solutions for SQL on Hadoop.
What to Expect in a Best-in-Class SQL on Hadoop Implementation
The table below shows features demanded from best-in-class SQL on Hadoop implementations and how the features translate to business benefits for the enterprise. Most of these capabilities are traditionally found in analytical data warehouses.
|Rich and compliant SQL dialect||Powerful and portable SQL applications able to leverage large ecosystem of SQL-based data analysis and data visualization tools|
|TPC-DS specification compliance||TPC-DS helps ensure that all classes of SQL queries will be handled, enabling a wide range of use cases and avoiding surprises during enterprise-class implementation|
|Flexible and efficient joins||Ability to perform SQL joins, combining columns from one or more tables, for more complex queries.|
|Linear scalability||Off-load enterprise data warehouse (EDW) workloads at a significantly lower cost of ownership|
|Integrated deep analytics and machine learning capabilities||Enable use cases in SQL requiring statistical, mathematical and machine learning algorithms|
|Data federation capabilities||Reduce data refactoring costs in implementing end-to-end analytics use cases leveraging diverse enterprise and external data assets|
|High availability and fault tolerance||Ensure business continuity and off-load more business critical analytics from EDW|
|Native Hadoop file format support||Reduced ETL and data movement directly contributes to lower cost of ownership of the analytics solution|
|Native Hadoop management with Apache Ambari||Reduces the total cost of managing the complete Hadoop stack (including Pivotal HDB) and eliminates vendor lock-in issues with proprietary management interfaces|
|Support for Hortonworks Data Platform||Enables enterprises to leverage the benefits of Pivotal HDB on the prominent Hadoop distribution|
Traditional Approaches to SQL on Hadoop
Data analytics vendors and the open source community have taken various approaches to providing SQL on Hadoop. Some have invested in building distributed SQL on Hadoop engines from scratch. Other vendors have invested in optimizing Apache Hive to narrow the performance gap between Hive and traditional SQL engines. The open source community has added Apache Drill to provide interactive queries at lower latencies. A more recent entrant in open source is SparkSQL, which enables structured data in Spark to be queried using SQL.
Pivotal’s Approach to Enterprise SQL on Hadoop
Pivotal’s approach to SQL on Hadoop has been to leverage a decade’s worth of product development effort that was invested in Pivotal Greenplum, Pivotal’s flagship analytical data warehouse. Pivotal leveraged this code base and deep data management expertise to build the industry’s best enterprise SQL on Hadoop database, then enhanced it with the industry’s first cost-based query optimization framework tailored for HDFS.
This enterprise SQL on Hadoop product is called Pivotal HDB, and it is based on the open source Apache HAWQ (incubating) project. It enables enterprises to benefit from the hardened, MPP-based analytic features and query performance of Greenplum while leveraging the Hadoop stack for scale-out storage of multi-structured and unstructured data. Pivotal HDB can coexist with the other traditional SQL on Hadoop engines (shown in Figure 1) in an analytics stack and meets the demands of a best-in-class enterprise SQL on Hadoop implementations.
Rich and Compliant SQL Dialect for Hadoop
Pivotal HDB is 100% ANSI SQL compliant and supports SQL ‘92, ‘99, 2003 OLAP as well as PostgreSQL on Hadoop. It includes features such as correlated subqueries, window functions, rollups and cubes, as well as a broad range of scalar and aggregate functions. Users can connect to Pivotal HDB via ODBC and JDBC. The benefit for enterprises is that there is a large ecosystem of business intelligence, data analysis and data visualization tools that work with Pivotal HDB out of the box due to its fully compliant SQL support. Also, analytic applications written over Pivotal HDB are easily portable to other SQL compliant data engines (and vice-versa). This prevents vendor lock-in for the enterprise and fosters innovation, while at the same time containing business risk.
TPC-DS Compliance Specification
TPC-DS defines 99 templates for queries with various operational requirements and complexities (e.g., ad-hoc, reporting, iterative, OLAP, data mining). Mature SQL on Hadoop systems need to support and correctly execute most of these queries to address a diverse set of analytic workloads and use cases. Benchmarking tests were performed using 111 queries, generated from the 99 templates in TPC-DS. Figure 3 shows compliance levels of some of the prevalent SQL on Hadoop systems in terms of number of supported queries in two terms:
- The number of queries that each system can optimize (i.e., return a query plan)
- The number of queries that can finish execution and return query results
Pivotal HDB’s extensive SQL support leveraging Greenplum’s analytics warehouse code base enabled it to successfully complete all 111 queries. The details of these findings are published in the ACM Sigmod Paper on Modular Query Optimizer Architecture for Big Data.
Flexible and Efficient Joins with SQL on Hadoop
Pivotal HDB incorporates a cutting-edge, cost-based SQL query optimizer – an industry-first in SQL on Hadoop implementations. This query optimizer is based on research highlighted in Modular Query Optimizer Architecture for Big Data.
Pivotal HDB is able to produce execution plans, which optimally use the Hadoop cluster’s resources, regardless of query complexity or data size. The cost function within the optimizer can also be configured for the specific environment (build, hardware, CPU, IOPS, etc.).
Pivotal HDB has been verified to quickly find optimal query plans for the most demanding of queries involving more than 50 joins, making it the industry’s best data discovery and query engine for SQL on Hadoop. This enables enterprises to use Pivotal HDB for off-loading traditional EDW workloads at a significantly lower cost of ownership for the volume of data being analyzed.
Faster Hadoop Queries with Linear Scalability
Pivotal HDB has been designed for petabyte-scale SQL on Hadoop operations. Data is stored directly on HDFS and the cost function in the SQL query optimizer is fine-tuned to account for the filesystem performance profile of HDFS.
A key design goal for SQL on Hadoop implementations is to minimize data transport overhead in performing SQL joins on Hadoop. Pivotal HDB employs dynamic pipelining to address this critical requirement that makes HDFS-based data suitable for interactive queries. Dynamic pipelining is a parallel data flow framework which uniquely combines:
- An adaptive, high-speed UDP interconnect
- A runtime execution environment for operations that underlie all SQL queries and tuned for big data workloads
- A runtime resource management layer, which ensures that queries complete, even in the presence of other very demanding queries on heavily utilized clusters
- A seamless data partitioning mechanism, which groups together the parts of a data set that are often used in any given query
Performance studies highlighted in Modular Query Optimizer Architecture for Big Data have shown that Pivotal HDB is one to two orders of magnitude faster than existing Hadoop query engines for Hadoop-based analytics and data warehousing workloads. These performance improvements are primarily attributable to dynamic pipelining and cost-based query optimizer capabilities within Pivotal HDB. This enables Pivotal HDB to off-load EDW workloads at a significantly lower cost to the enterprise.
Integrated Deep Analytics and Machine Learning Capabilities
In addition to table joins and aggregations, data analytics often requires the use of statistical, mathematical and machine learning algorithms (such as regression, principal component analysis) refactored to run efficiently in a parallel environment. This is becoming a basic requirement for SQL on Hadoop implementations. Pivotal HDB provides these capabilities through Apache MADlib , an open source library for scalable in-database analytics extending the SQL capabilities on Hadoop through user-defined functions. MADLib also supports implementation of user-defined functions in PL/R, PL/Python and PL/Java for specifying customized machine learning capabilities. This enables normal analytics workloads to embed advanced machine learning processing for use cases that require these capabilities.
SQL on Hadoop Data Federation Capabilities
SQL on Hadoop implementations that can federate data from external data sources provide additional flexibility in combining data from varied sources to perform analytics. Data is usually federated across other analytical/enterprise data warehouses, HDFS, Hbase and Hive instances and needs to leverage the parallelism inherent in SQL on Hadoop implementations. Pivotal HDB provides data federation through a module called Pivotal eXtension Framework (PXF). In addition to the usual data federation features, PXF provides industry differentiating capabilities in data federation with SQL on Hadoop:
- Low latency on arbitrarily large data sets: PXF uses intelligent fetch with filter pushdown to Hive and Hbase. The query workload is pushed down to the federated data stack, minimizing data movement and improving query latency, especially for interactive queries.
- Extensible and customizable: PXF provides framework APIs for customers to develop new connectors to their own data stacks, thereby enhancing loose coupling of data engines and obviating data refactoring often needed to implement end-end analytics use cases.
- Efficient: PXF can gather statistics on external data with ANALYZE. This helps build more efficient queries for federated environments by accounting for cost models of federated data sources.
SQL on Hadoop High Availability and Fault Tolerance
Pivotal HDB supports transactions, a first for SQL on Hadoop. Transactions allow users to isolate concurrent activity on Hadoop and to rollback modifications when a fault occurs. Pivotal HDB’s fault tolerance, reliability and high-availability features tolerate disk-level and node-level failures. These capabilities ensure business continuity and enable more business-critical analytics to be off-loaded to Pivotal HDB.
Native Hadoop File Format Support
Pivotal HDB supports AVRO, Parquet and native HDFS file formats in Hadoop. This minimizes the need for ETL during data ingest and enables schema-on-read type processing using Pivotal HDB. Reduced need for ETL and data movement directly contributes to lower cost of ownership of the analytics solution.
Native Hadoop Management with Apache Ambari
Pivotal HDB uses Apache Ambari as the foundation for configuration and management. The appropriate Ambari plug-in enables Pivotal HDB to be modeled as another Hadoop service within Ambari. Therefore, IT operations do not need two management interfaces – one to drive Hadoop and one to drive Pivotal HDB. This enables enterprises to focus on building and deploying functional use cases and minimizing investment on support activities, such as configuration and management. Furthermore, Ambari is the open source native configuration and management interface for Hadoop, which eliminates vendor lock-in issues and minimizes business risk.
Support for Hortonworks Data Platform
As the next step toward executing on the ODPi, Pivotal HDB works seamlessly with Hortonworks Data Platform, providing all the benefits of the world’s most advanced SQL on Hadoop engine running on the industry’s leading Hadoop distribution.
Best-in-Class SQL on Hadoop
Pivotal’s significant investment in SQL on Hadoop enables enterprises to expand the use of Hadoop from a data lake, primarily used for storing structured and unstructured data and ETL workloads, to implementing more analytic data warehouse workloads over Hadoop. Read more about Pivotal HDB here and on the Big Data blog.
 CDH 4.4 and Impala 1.1.1 for Impala, Presto 0.52 and Hive 0.12 for Stinger