EMC Data Computing Appliance (DCA)

Accelerate Big Data Analysis

Meeting the challenges of a data-driven world requires changes in analytical technologies and a new approach to utilizing data. Yet your organization faces significant cost and time challenges when it comes to building a data warehouse to accelerate analysis of Big Data assets. Distributed database systems require the physical infrastructure—servers, networks and storage—to be configured to exacting standards. And setting up the system management and monitoring often requires additional software and resources.

The EMC Data Computing Appliance (DCA) is a purpose-built platform that serves as both a modular and a fully integrated appliance for Pivotal Greenplum, the world’s first open source massively parallel data warehouse. It accelerates Big Data analyses within a single appliance, delivering faster time to value, lower integration risk and lower total cost. With the EMC DCA, your organization can maximize security, availability and performance for Greenplum without the complexity and constraints of proprietary hardware.


Maximize Security, Availability and Performance for Greenplum
  • Maximize Security, Availability and Performance - Redundant design throughout the EMC DCA mitigates risks posed by system outages, eliminating single points of failure and providing a dependable repository for your Pivotal Greenplum mission-critical information. Additionally, the EMC DCA delivers high availability, reliability and multi-level, self-healing fault tolerance. To protect your data, the EMC DCA is hardened against known vulnerabilities and new threats are mitigated by ongoing software updates and patches. Enhanced availability and disaster recovery are available through integrated third-party solutions. You gain extreme performance and powerful analytics with elastic scalability because the appliance relies on Pivotal's shared-nothing, massively parallel processing (MPP) architecture optimized for BI and analytical processing. Customer Support Services provide the resources and services to quickly and proactively resolve your solution-related issues and questions, further assuring business continuity and a highly available data environment.
  • Minimize Integration Risk and Time to Value - The EMC DCA is designed to grow incrementally, so you can add resources as you need them without an expensive replacement project. For faster results, EMC DCAs are designed, optimized, tested, debugged and delivered ready to run, shortening the time needed to deliver analytics to business stakeholders. A core principle of the Pivotal vision is to co-locate processing and data, so you gain the greatest possible performance for analytical processing while alleviating computation and network bandwidth burdens on other data center infrastructures.
  • Reduce Total Cost - The EMC DCA is flexible. Your enterprise can tailor capabilities and capacity, both initially and over the lifetime of the appliance, to support data analytics plus ETL, BI, machine learning and data visualization. Unlike users of more rigid appliances, EMC DCA users specify the capacity and capabilities of their appliance. They can start with a modest initial configuration, knowing that they can easily add capacity or new capabilities later.


What Is the EMC Data Computing Appliance?

The EMC Data Computing Appliance (DCA) is a unified Big Data analytics appliance—a modular and flexible solution for analyzing data.

DCAs, coupled with Pivotal Greenplum, offer the power of a massively parallel processing (MPP) architecture without the constraints and complexity of proprietary hardware. This continues Pivotal’s history of delivering the fastest data-loading rates and the best price/performance ratios in the industry.

DCAs also offer flexibility and scalability not usually found among analytics appliances.

Pivotal Greenplum

Pivotal Greenplum offers industry-leading price-performance for SQL and analytical processing on the DCA.

Greenplum running on the DCA also provides redundant servers, automatic failover and automatic sparing of storage devices to assure availability and minimize downtime.

How Does EMC DCA Enable High Availability, Backup/Restore and Disaster Recovery?

The EMC DCA includes four key components for maximizing availability and minimizing disruptions:

High Availability

The DCA includes high-availability features such as hardware and data redundancy for Pivotal Greenplum, plus resource failover for the disks, switches, servers and master nodes.

Master Server Redundancy

The DCA includes a standby Master Server to serve as a warm standby backup in case the primary Master Server becomes inoperative. Should the primary Master Server fail, the standby is available to take over as the primary.

The Master Servers act as primary and secondary database masters. A transaction log replication process keeps the primary and standby masters synchronized. Should the primary Master Server fail, the log replication process ceases and the standby is automatically activated. Upon activation of the standby, the replicated logs are used to reconstruct the state of the primary Master Server at the time of the last successfully committed transaction.
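The replication and activation sequence described above can be sketched as a toy model in Python. This is an illustration only, not Greenplum's implementation: the names `LogRecord`, `StandbyMaster`, `receive`, `activate` and `replicate` are hypothetical, and the "state" is reduced to a simple key/value map.

```python
from dataclasses import dataclass, field


@dataclass
class LogRecord:
    txid: int          # transaction identifier
    key: str           # item changed by the transaction
    value: str         # new value written
    committed: bool    # True once the transaction has committed


@dataclass
class StandbyMaster:
    log: list = field(default_factory=list)    # replicated transaction log
    state: dict = field(default_factory=dict)  # rebuilt only on activation
    active: bool = False

    def receive(self, record: LogRecord) -> None:
        # While in warm standby, only store the replicated log records.
        self.log.append(record)

    def activate(self) -> dict:
        # Replay committed records in order to reconstruct the primary's
        # state as of the last successfully committed transaction.
        self.state = {}
        for record in self.log:
            if record.committed:
                self.state[record.key] = record.value
        self.active = True
        return self.state


def replicate(records, standby, fail_after):
    """Ship records to the standby until the (simulated) primary fails."""
    for shipped, record in enumerate(records):
        if shipped == fail_after:
            break  # primary failed: log replication ceases
        standby.receive(record)
    # The standby is activated and rebuilds state from the replicated log.
    return standby.activate()
```

In this sketch, records that were never shipped, or that were shipped but not committed, do not appear in the reconstructed state.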

User Data Redundancy in Pivotal Database

Information stored in Pivotal Greenplum has multiple layers of protection. Data is stored in RAID-protected disk drive groups—with hot spare drives ready should it be necessary to replace a primary disk.

The drive arrays are connected to database segment processors via redundant connections. All data segments are mirrored on two segment processors.

The system automatically fails over to the mirror segment whenever a primary segment becomes unavailable. A DCA can remain operational if a segment instance or segment server goes down, as long as all portions of data are available on the remaining active segments. Whenever the master cannot connect to a primary segment, it marks that primary segment as down in the Pivotal Database system catalog and brings up the corresponding mirror segment in its place. A failed segment can be recovered while the system is up and running. The recovery process only copies over the changes that were missed while the segment was out of operation.

In the event of a segment failure, the file replication process is stopped and the mirror segment is automatically brought up as the active segment instance. All database operations then continue using the mirror. While the mirror is active, it is also logging all transactional changes made to the data. When the failed segment is ready to be brought back online, your administrator initiates a recovery process to bring it back into operation.
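The failover and incremental recovery behavior described above can be illustrated with a small Python model. It is a sketch under simplifying assumptions: one primary/mirror pair, data reduced to a dictionary, and hypothetical names (`SegmentPair`, `fail_primary`, `recover_primary`) that do not correspond to any Greenplum utility.

```python
class SegmentPair:
    """Toy model of one primary segment and its mirror."""

    def __init__(self):
        self.primary = {}          # data held by the primary segment
        self.mirror = {}           # data held by the mirror segment
        self.primary_up = True
        self.missed_changes = []   # changes logged while the primary is down

    def write(self, key, value):
        if self.primary_up:
            # Normal operation: apply to the primary and replicate to the mirror.
            self.primary[key] = value
            self.mirror[key] = value
        else:
            # Failover: the mirror is active and logs every change it makes.
            self.mirror[key] = value
            self.missed_changes.append((key, value))

    def read(self, key):
        # Reads are served by whichever instance is currently active.
        active = self.primary if self.primary_up else self.mirror
        return active[key]

    def fail_primary(self):
        # The master marks the primary as down; the mirror takes over.
        self.primary_up = False

    def recover_primary(self):
        # Incremental recovery: copy only the changes missed during the outage.
        copied = len(self.missed_changes)
        for key, value in self.missed_changes:
            self.primary[key] = value
        self.missed_changes.clear()
        self.primary_up = True
        return copied
```

The value returned by `recover_primary` is the number of changes replayed, which stays small when the outage was short regardless of how much data the segment holds.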

Backup/Restore of Pivotal Database

EMC DCA backs up all user data in parallel directly to Data Domain, deduplicating the data using Data Domain Boost or “ddboost.” With this backup and restore method, ddboost dramatically reduces backup storage and backup network requirements, enabling the DCA to achieve effective backup rates up to 14TB/hour. The solution also supports both full and partial replication, with 1-to-1, 1-to-many and cascading topologies.
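As a rough planning aid, the quoted rate can be turned into a best-case backup window. The Python sketch below assumes the 14TB/hour figure is a ceiling ("up to") on the effective rate, so the result is a lower bound on elapsed time rather than a prediction; the function name and its parameters are hypothetical.

```python
MAX_BACKUP_RATE_TB_PER_HOUR = 14.0  # "up to" figure quoted above; treated as a ceiling


def min_backup_hours(data_tb: float,
                     rate_tb_per_hour: float = MAX_BACKUP_RATE_TB_PER_HOUR) -> float:
    """Return the shortest possible backup window, in hours, for data_tb of data."""
    if data_tb < 0 or rate_tb_per_hour <= 0:
        raise ValueError("data size must be >= 0 and rate must be > 0")
    return data_tb / rate_tb_per_hour
```

For example, `min_backup_hours(46)` gives roughly 3.3 hours for 46 TB of user data at the maximum quoted rate. Actual rates depend on how well the data deduplicates and on the backup network.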

What Is a DCA Pivotal Greenplum Database Module?

The DCA Pivotal Greenplum module is the basic building block of the EMC DCA. A DCA rack holds up to four modules, and users can scale by adding modules and racks. A module is a purpose-built, highly scalable data-analytics component that architecturally integrates database, computing, storage and network into an enterprise-class, easy-to-implement system.

Pivotal Database Modules Hardware Configurations

Size / Rack Configuration       8U (rack units)
Number of Servers               4 per module
Total Number of Cores           96 cores per module
Total Memory                    1024 GB per module
Number of Storage Drives        96 per module
Storage Type                    96 x 1.8 TB 10K RPM SAS drives per module
Usable Capacity - User Data     46 TB per module (uncompressed)
Usable Capacity - Compressed    184 TB per module (4x compression)

The module connects to the high-speed interconnect in the DCA as well as Command Center and the administration framework of the DCA.
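Because the DCA scales by adding modules, the per-module figures in the table above can be multiplied out to estimate a configuration. The Python sketch below assumes strictly linear scaling and the 4x compression ratio quoted in the table (real compression depends on the data); `dca_totals` is a hypothetical helper, and its server count covers segment servers only, not the master servers.

```python
# Per-module figures taken from the hardware configuration table.
SERVERS_PER_MODULE = 4
CORES_PER_MODULE = 96
MEMORY_GB_PER_MODULE = 1024
USABLE_TB_PER_MODULE = 46      # uncompressed user data
COMPRESSION_RATIO = 4          # the 4x ratio quoted in the table
MAX_MODULES_PER_RACK = 4       # a rack holds up to four modules


def dca_totals(modules: int) -> dict:
    """Return aggregate resources for a given number of database modules."""
    if modules < 1:
        raise ValueError("a DCA needs at least one database module")
    racks = -(-modules // MAX_MODULES_PER_RACK)  # ceiling division
    usable_tb = modules * USABLE_TB_PER_MODULE
    return {
        "racks": racks,
        "servers": modules * SERVERS_PER_MODULE,
        "cores": modules * CORES_PER_MODULE,
        "memory_gb": modules * MEMORY_GB_PER_MODULE,
        "usable_tb": usable_tb,
        "compressed_tb": usable_tb * COMPRESSION_RATIO,
    }
```

Under these assumptions, `dca_totals(6)` describes six modules spread over two racks, with 24 segment servers, 576 cores and 276 TB of uncompressed user capacity.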

What Is the Role of EMC DCA Software?

EMC DCA Software configures the physical components installed in the EMC DCA. It facilitates updates, monitoring, history tracking and notification functions. Your system administrators interact with EMC DCA Software primarily through Command Center, our web-based, comprehensive administration console for the DCA.

If abnormal indications are detected during operation, DCA Software can also provide proactive alerting. Alerts can be delivered via EMC Dial-Home functionality, via SNMP alerts, or directly to you via Pivotal Command Center.

EMC Automated Dial-Home Functionality

The DCA dials out to EMC for the following hardware-related events: single-disk failure, virtual disk or RAID degradation, issues with the RAID controller battery, PSU, power usage, memory, NIC, CPU, fan or 10Gb Ethernet switch (fan and PSU), or the SNMP daemon not working.

The following software issues also trigger dial-home events: Pivotal Database unavailable, segment database down, master failover notification, core file dump detected, kernel crash, or disk space over limit. The DCA also dials home weekly to test its connection with EMC.
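The two event lists above amount to a simple classification rule, sketched below in Python. The event identifiers (for example `single_disk_failure`) are hypothetical labels invented for this illustration; they are not names used by DCA Software, and the sketch does not call any EMC interface.

```python
# Hypothetical identifiers for the hardware-related events listed above.
HARDWARE_EVENTS = {
    "single_disk_failure", "virtual_disk_degraded", "raid_degraded",
    "raid_controller_battery", "psu", "power_usage", "memory", "nic",
    "cpu", "fan", "switch_fan", "switch_psu", "snmp_daemon_down",
}

# Hypothetical identifiers for the software issues listed above.
SOFTWARE_EVENTS = {
    "database_unavailable", "segment_down", "master_failover",
    "core_dump_detected", "kernel_crash", "disk_space_over_limit",
}


def classify(event: str):
    """Return 'hardware', 'software', or None for an unlisted event."""
    if event in HARDWARE_EVENTS:
        return "hardware"
    if event in SOFTWARE_EVENTS:
        return "software"
    return None


def should_dial_home(event: str) -> bool:
    """Only events in one of the two lists trigger a dial-home."""
    return classify(event) is not None
```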

DCA Software delivers weekly dial-home reports that outline disk usage, a license report, database table size, index size and overall capacity, the number and types of functional modules, the health state of the database, and the percentage of hard drive space used across all modules.

Hardware monitoring in the DCA Software reports condition and status information for approximately 150 different parameters for database modules. These highly detailed MIBs report items ranging from hardware identification to physical operating parameters, such as power supply status, voltages and currents, cooling device status (to warn of impending problems), I/O rates and CPU usage percentages.

Measured system parameters are available through Pivotal Command Center, through Simple Network Management Protocol (SNMP) reporting and through EMC’s Dial-Home functionality, making DCAs easy to integrate into most data center management frameworks.

What Are the Physical Architecture and Configurations of EMC DCA?

All DCAs begin with a single rack, which can be expanded by adding Pivotal Database modules. Modules are added in ¼-rack, or 8U (rack unit), increments. Your Pivotal and EMC teams will help you determine the racking arrangement for your particular needs.

Common to all DCAs are master servers, interconnect and administration switches.

Master Servers

All DCAs have two master servers, one primary and one standby, which are used for administrative and dial-home functions. Customers can connect the master servers directly to their networks to run SQL queries.

Interconnect Bus

The Interconnect Bus consists of software and hardware that work together to provide inter-process communication between master servers and segment servers, as well as among the segment servers themselves, over a 10 Gb/second Ethernet network.

The Interconnect Bus runs on a private network and is not usually connected to public or internal networks except when needed for high-speed parallel loading or unloading. The Interconnect Bus extends to all modules in the DCA, providing high-speed connectivity to Hadoop, ETL servers or analytics products running on DCAs as well as to externally integrated environments, including Data Domain and Isilon.

The Interconnect Bus consists of two 52-port 10Gb Ethernet switches and the cabling needed to connect all modules via bonded Ethernet for redundancy and higher bandwidth. To maximize redundancy, a primary segment and its corresponding mirror segment use different interconnect networks. With this configuration, the DCA can continue operations in the event of a single Interconnect Bus failure.
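The redundancy rule above (a primary and its mirror never share an interconnect network) can be checked with a short Python sketch. It is a simplified model: the network names are hypothetical, bonding is ignored, and each segment copy is assumed to use exactly one network.

```python
INTERCONNECTS = ("interconnect-1", "interconnect-2")  # hypothetical names


def assign_networks(segment_ids):
    """Place each primary and its mirror on different interconnect networks."""
    plan = {}
    for index, seg in enumerate(segment_ids):
        primary_net = INTERCONNECTS[index % 2]
        mirror_net = INTERCONNECTS[(index + 1) % 2]
        plan[seg] = {"primary": primary_net, "mirror": mirror_net}
    return plan


def survives_failure(plan, failed_net):
    """True if every segment keeps at least one copy on a working network."""
    return all(
        any(net != failed_net for net in copies.values())
        for copies in plan.values()
    )
```

Because every pair spans both networks, `survives_failure` holds for the loss of either one, which matches the single-failure tolerance described above.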

Administrative Switching

Present in all racks of all DCAs, administrative switches separate administrative traffic from data interconnect traffic. This ensures that administrative traffic never delays data interconnect traffic and vice versa. Redundant 1Gb Ethernet switches are used, connecting all modules and their servers to the master servers and all master servers to all racks in the DCA.

Physical Architecture: Pivotal Database DCAs

In Pivotal Database DCAs, the master database runs on the primary master server (replicated to the standby master server) and is the entry point into the Pivotal Database for your end users and applications. Through the master database, clients connect and submit SQL, MapReduce and other query statements.

The primary master server is also where the global system catalog resides. The catalog is the set of system tables that contain metadata about the entire Pivotal Database. The primary master server does not contain any user data; user data resides only on the Database Modules. The primary master server performs the following operations:

  • Authenticates client connections
  • Processes incoming SQL, MapReduce and other query commands
  • Distributes the workload among the segment instances
  • Presents aggregated final results to the client program

The primary master server mirrors logs to the standby master server so that it is available to take over in the event of a failure. The standby master server is kept up to date by a process that synchronizes the write-ahead-log (WAL) from the primary to the standby.

Database Modules and their Segment Servers

Each Database Module provides four segment servers, each with CPUs, memory and directly attached RAID storage. These segment servers process the database data.

Master nodes use cost-based optimization and workload management to determine how many queries to run at any given time to maximize throughput. The majority of query processing occurs within segment servers as they process database data to satisfy the queries. Multiple queries can be processed at any given time, depending on workload and query complexity.

To process data, the database divides all data into segment instances, and each segment server is responsible for processing a set of segment instances for every query that affects them. All segment instances contain roughly equal-sized data slices that include data from each table and index in the database. In this way, queries are executed in parallel by every segment server acting on one or more segment instances.
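One common way for an MPP database to obtain roughly equal-sized slices is to hash a distribution key and take the result modulo the number of segment instances. The Python sketch below assumes that approach for illustration; the hash function and the names `segment_for` and `distribute` are stand-ins, not Greenplum's actual hashing.

```python
import hashlib
from collections import Counter


def segment_for(key: str, num_segments: int) -> int:
    """Map a distribution key to a segment instance using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


def distribute(keys, num_segments: int) -> list:
    """Return the number of rows that land on each segment instance."""
    counts = Counter(segment_for(k, num_segments) for k in keys)
    return [counts.get(i, 0) for i in range(num_segments)]
```

With many distinct keys, the per-segment counts cluster around the average, so each segment server receives a similar share of every query's work. A key with few distinct values would skew the slices, which is why the choice of distribution key matters in practice.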

The existence of segment servers and segment instances is transparent to your users, who need not interact with them directly. All query requests are expressed to the master in SQL, with no knowledge that execution will be conducted in parallel. Behind the scenes, the master nodes decompose the query logic, recompiling it into a set of actions for each segment server to execute on one or more of the segment instances it handles. For some queries, data must move between segment servers. This happens transparently during the query over the high-speed interconnect (described above) that links all nodes, including masters and segment servers. Once all segment servers complete their work, the results are returned to the master node, marshaled into a single response and returned to the requesting client.
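The decompose, execute in parallel and marshal pattern described above can be sketched in Python for a single aggregate. This is an illustration only: threads stand in for segment servers, each inner list stands in for one segment instance's slice, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def segment_partial_sum(rows):
    """Work done on one segment instance: aggregate only its local slice."""
    return sum(rows), len(rows)


def master_average(segments):
    """Master: fan out the work, then combine partial results into one answer.

    Expects at least one segment and at least one row overall.
    """
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(segment_partial_sum, segments))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count
```

The client sees only the final average. The per-segment sums and counts are the intermediate results that the master marshals into a single response, and an average is computed from sums and counts because averaging the per-segment averages would be wrong when slices differ in size.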

Segment Instance Redundancy

There are two types of segment instances: primary segment instances (primary segments) and mirror segment instances (mirror segments). Each primary segment instance has a corresponding mirror segment instance. A mirror segment always resides on a different host and subnet than its corresponding primary segment. This ensures that if a segment server fails or becomes unreachable, the mirror counterpart of each affected primary instance, along with its data, is still available on another segment server, and the query task is executed there.
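The placement rule above can be expressed as a small Python sketch. It assumes a simple ring layout in which each host's mirrors live on the next host; the actual layout used by the DCA is not stated here, subnets are not modeled, and the host names are hypothetical.

```python
def place_mirrors(hosts, segments_per_host):
    """Return {segment_id: (primary_host, mirror_host)} with mirror != primary."""
    if len(hosts) < 2:
        raise ValueError("mirroring needs at least two segment servers")
    placement = {}
    seg_id = 0
    for h, host in enumerate(hosts):
        mirror_host = hosts[(h + 1) % len(hosts)]  # next host in the ring
        for _ in range(segments_per_host):
            placement[seg_id] = (host, mirror_host)
            seg_id += 1
    return placement


def serving_hosts_after_failure(placement, failed_host):
    """Map each segment to the host that serves it once failed_host is down."""
    return {
        seg: (mirror if primary == failed_host else primary)
        for seg, (primary, mirror) in placement.items()
    }
```

After a single host failure, every segment is still served by a surviving host: segments whose primary was on the failed host are served by their mirrors, and all others stay on their primaries.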