Empowering Data Scientists and Business Analysts with a Self-Service Big Data Platform
EMC is a global enterprise technology software and hardware provider based in Hopkinton, Mass. The company is the enterprise storage market leader with thousands of customers, including both large enterprises and small & medium-sized businesses, spanning virtually all industries and geographies. EMC’s mission is to help its corporate customers store, manage, secure and analyze their most valuable asset – information – and accelerate their digital transformation journeys. The company realized revenue of nearly $25 billion in 2015 and employs over 70,000 people around the globe.
Plenty of Data, Not Enough Agility or Insight
EMC is awash in data. Like any large enterprise, it creates large volumes of customer transaction data, internal operations data, financial performance data, machine data, and more. This includes a steady stream of “phone home” data - that is, log data its products create about their daily operations at customer sites that is sent back to EMC’s own data centers. In all, EMC’s global data warehouse contains 30 terabytes of mostly structured data, while various other databases and repositories store another 60 to 70 terabytes of unstructured and semi-structured data, according to Shahidul Mannan, Head of Delivery, Big Data and Analytics, at EMC.
What the company didn’t have, however, was an agile but powerful platform to analyze all that data and turn it into actionable insights. Not that it wasn’t trying. Again, like any large enterprise, EMC had a mature data warehouse and business intelligence practice that produced reports and dashboards for the business. But it was largely an IT-owned practice, meaning requests for new data sources, reports and other types of analytics from business users were routed through IT.
IT Bottleneck Stifles Experimentation and Innovation
It could take months to get a new data-related project off the ground, according to Mannan, and the slow and costly process discouraged business users and data scientists alike from experimenting with analytics. “Because what would happen is … you spend, say, half a million dollars in building some data capability only to find that this is not the right capability for whatever reason,” said Mannan. “Now, you have to do another ... expensive program.”
Mannan knew EMC needed to rethink the way it approached data and analytics. The company simply wasn’t making the most of its data with the existing approach, which, as Mannan put it, “contradicted the fail-fast concept.”
Ramesh Razdan, Vice President of Big Data and Analytics at EMC IT, agreed. Razdan and the EMC IT group set out “to create a flexible, scalable analytical platform that would enable EMC’s business users and data scientists to develop innovative analytics use cases at the speed of business,” said Razdan. The ultimate goal was to provide a self-service platform to users, allowing them to analyze huge volumes of data and uncover actionable insights - without involving IT - that would differentiate EMC from its competition, he said.
Collaboration with Pivotal Leads to Scalable Data Lake and Powerful Self-Service Analytics Capabilities
Recognizing the need for a scalable but cost-effective foundation for its data and analytics practice, EMC IT began by building a Hadoop-based data lake to store and process EMC’s growing data footprint. Hadoop is an open source framework for storing and processing large volumes of both structured and unstructured data. It not only runs on commodity hardware and scales linearly, making it significantly less expensive to scale than a traditional data warehouse appliance, but it also runs on EMC’s industry-leading and IT-proven technologies. These include XtremIO, ScaleIO, Isilon and Data Domain. These technologies enable enterprise capabilities such as built-in name node failover, replication, storage efficiency, disaster recovery, backup and recovery, snapshots, and the ability to scale out compute and storage separately.
Once the data lake was operational, Razdan, Mannan and team used Pivotal Spring XD, a data integration and pipelining tool, to orchestrate the batch and streaming data flows that land in the data lake. It didn’t take long for the data lake to exceed EMC’s global data warehouse in size, and it continues to grow. Today, the Hadoop-based data lake exceeds 500 terabytes in size, according to Mannan. That’s over ten times the size of EMC’s global data warehouse, illustrating just how much data EMC had been overlooking from an analytics perspective in the past.
With the foundation in place, EMC IT turned its attention to analytics. The most important requirement for Razdan and Mannan was that whatever tools and technologies they chose had to provide self-service capabilities, so that users could easily spin up new analytics projects without having to involve IT. More specifically, EMC IT set out to build analytic capabilities that allowed users themselves to easily identify the data sets they needed, to integrate new data sources when needed, to create analytical workspaces to blend and iteratively interrogate data, and to publish the results of analysis for sharing and collaboration with colleagues.
Creating these agile capabilities required a mix of technologies. Using Pivotal Cloud Foundry (PCF), developers are now capable of building data-driven applications running on the data lake platform. PCF allows for more agile and faster deployment of analytics applications across the enterprise than previously possible thanks to PCF’s continuous delivery and automated operations capabilities. EMC IT also developed a framework for data API/services based on PCF and the Pivotal Big Data Suite (BDS) that enables users to seamlessly interface with the data lake.
For the analytics itself, among the technologies EMC IT chose was Pivotal Greenplum, a massively parallel processing analytical database that is part of Pivotal’s BDS. Users bring their desired data sets into an analytics workspace powered by Greenplum, where they can run different styles of analytics - including machine learning, geospatial analytics and text analytics - and visualize results with the tools of their choice. EMC’s data scientists primarily use MADlib, R and SAS to develop and run algorithms and predictive models inside Greenplum, while business users mostly use business intelligence tools including Tableau and Business Objects, said Mannan.
Finally, users can publish the resulting analysis for others to access via a new collaborative environment that Razdan and Mannan call a data hub. This significantly shortens time to insight as users can build on one another’s work rather than constantly starting from scratch. “Anyone coming in can know off the shelf what’s available and where and who actually owns it, built it, what are the usages for it,” said Mannan. “In many ways, you don’t have to reinvent the wheel because someone else might have already done it.”
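To make the idea of a data hub concrete, here is a minimal Python sketch of a catalog that records what is available, where it lives, who owns it and how it is used, as Mannan describes. It is an illustration only: the case study does not describe how EMC's data hub is implemented, and the class names, field names and sample entries below are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    # Hypothetical metadata for one published data set or analysis.
    name: str
    owner: str
    location: str
    usages: list = field(default_factory=list)


class DataHub:
    """Toy in-memory catalog; not EMC's actual data hub."""

    def __init__(self):
        self._entries = {}

    def publish(self, entry):
        # Publishing under an existing name replaces the earlier entry.
        self._entries[entry.name] = entry

    def search(self, keyword):
        # Case-insensitive match against the entry name and its usages.
        kw = keyword.lower()
        return [
            e for e in self._entries.values()
            if kw in e.name.lower() or any(kw in u.lower() for u in e.usages)
        ]


hub = DataHub()
hub.publish(CatalogEntry("array_logs_daily", "support-analytics",
                         "/lake/logs/daily", ["failure prediction"]))
hub.publish(CatalogEntry("supplier_lead_times", "supply-chain",
                         "/lake/scm/lead_times", ["supply chain optimization"]))
matches = hub.search("failure")
```

A colleague searching for "failure" would find the log data set along with its owner and location, and could build on it rather than starting from scratch.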
Self-Service Analytics Tools Empower Users
While it is still early days, Razdan and Mannan said EMC is already reaping the benefits of its new data lake and analytics capabilities.
Namely, data scientists and business users no longer need to go through IT when they want to start new analytics projects. Rather, users simply log into the data hub and use self-service tools to identify potentially valuable data sets for analysis or import new data sets such as social media data or market data. “[Users] can bring in their own data, mesh it with our enterprise data or bring in outside vendor data easily and seamlessly with minimal IT intervention,” said Mannan.
With the IT bottleneck out of the way, “everyone’s very excited and they feel empowered,” said Mannan. “Everyone’s feeling like we have a new gold mine, so to speak, that now we have to harvest and tap into to identify and explore new opportunities.”
Analytics-Driven Services Boost Customer Satisfaction and Loyalty
One such opportunity involves log data created by EMC storage arrays and other products as they operate in the field. With its new Hadoop-based data lake, EMC is now equipped to ingest, store and process this log data, which is then analyzed to predict and prevent problems before they occur and impact the customer.
Analysis of log data might reveal, for example, that a particular component in a customer’s storage array is likely to fail in the next eight to 12 hours. With that insight in hand, EMC support can reach out to the customer and take steps to prevent the component failure before it disrupts important business processes.
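As a rough illustration of this kind of check, the Python sketch below flags components whose recent error rate looks abnormal. The case study does not describe EMC's actual models, so this is a simple stand-in heuristic rather than EMC's method; the record fields, the 12-hour window, the threshold and the function name are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LogEvent:
    # Hypothetical "phone home" record; field names are assumptions.
    array_id: str
    component: str
    hour: int          # hours since the start of the observation window
    error_count: int


def flag_at_risk(events, window_hours=12, threshold=5.0):
    """Return sorted (array_id, component) pairs whose mean hourly error
    count over the most recent `window_hours` exceeds `threshold`."""
    if not events:
        return []
    latest = max(e.hour for e in events)
    totals = defaultdict(int)
    for e in events:
        # Keep only events that fall inside the trailing window.
        if e.hour > latest - window_hours:
            totals[(e.array_id, e.component)] += e.error_count
    return sorted(key for key, total in totals.items()
                  if total / window_hours > threshold)


events = (
    [LogEvent("A1", "disk-3", h, 10) for h in range(12)]
    + [LogEvent("A1", "fan-1", h, 1) for h in range(12)]
)
at_risk = flag_at_risk(events)
```

In this made-up sample, only the disk that logs ten errors an hour is flagged; a production system would more plausibly rely on trained predictive models than on a fixed threshold.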
This preventative maintenance capability, which would not be possible without EMC’s new data lake and agile analytics capabilities, results in higher levels of customer satisfaction and customer loyalty, which has a direct impact on EMC’s bottom line. It also helps EMC’s engineers determine optimal product configurations for various scenarios and use cases, and provides valuable insights as they develop new products and services.
Driving Millions of Dollars in Potential New Revenue
In addition to preventative maintenance, other current analytics use cases leveraging EMC’s Hadoop-based data lake and Greenplum include supply chain optimization and customer credit collection analytics, with many more in the pipeline.
We are just “scratching the surface because there’s more data to be available and there’s more data to be harvested, and more analytics to be built out,” said Mannan. “We are working on almost 66 business use cases … that are projected to drive more than $40 million in terms of opportunity cost.”
With its new data lake and analytic capabilities, EMC now feels like it is on the cutting edge of Big Data analytics in the enterprise. More importantly, it now has the foundation and tools in place to use data and analytics to create sustainable, long-term competitive differentiation thanks to its partnership with Pivotal.
“The multi-latency and multi-tenancy capabilities of the Pivotal Big Data Suite differentiated it in the marketplace and we were able to use it successfully. We worked with Pivotal closely and the partnership and the opportunity to mature together is also another big reason that helped us along the way,” said Mannan.
“Pivotal certainly makes its mark in terms of the innovation and innovative capabilities that it brings and along with the great support and the team that provides the help with the implementation,” Mannan continued. “I would highly recommend using the capabilities and the great support team that Pivotal has.”
The successful data lake and analytics project also serves as a showcase for EMC’s approach to Big Data. “The EMC Federation technologies that support our new data lake and analytics capabilities are not only helping us to transform our big data and analytics practices internally here at EMC, but have become proof points for our customers and partners,” said Razdan. “In the era of digital transformation, IT has tremendous potential to drive competitive differentiation for the business and leveraging data is key to it. EMC has made tremendous strides on this journey and we are delighted by our progress and share it frequently with customers and partners.”
Most important of all, perhaps, is that EMC’s data scientists and business users alike are now empowered to experiment with data and analytics to identify new and powerful insights to drive the business. The days of fear of failure are over and a new era of data-driven innovation is underway at EMC.