Top Data Science Trends for 2015

With 2015 just around the corner, the Pivotal Data Science team has been challenged to point its predictive inclinations toward spotting emerging trends in Data Science. With a global team of 30, doing innovative work in almost every vertical market, Pivotal’s data scientists have a rich view into the underlying trends and shifts impacting their craft.

In this webcast, leaders from the team – Annika Jimenez, Kaushik Das and Hulya Farinas – will share their insights on the key Data Science industry trends for the coming year. Every angle of Data Science is fair game:

  • New use cases at the vertical level
  • Analytical tool usage trends
  • Implications of the shift in focus to model operationalization
  • Meta observations about maturity of the craft
  • Ethics evolution in Data Science
  • Venture capital activity

Join us for this lively discussion of top predictions for Data Science in 2015. The presentation will be followed by a Q&A session where attendees will have an opportunity to share their own thoughts and predictions.


Kaushik Kunal Das
Head of Data Science, Pivotal

Kaushik Kunal Das is the Head of Data Science at Pivotal. His job is to formulate data science problems and solve them using the Pivotal Big Data Platform. He leads a team of highly accomplished data scientists working in energy, telecommunications, retail and digital media. Kaushik has an engineering background focused on solving mathematical problems requiring large data sets. He is interested in questions such as, "How much can a company know their customer and customize their actions in a context-sensitive fashion?" "How can our living and working environments get smarter and how can we get there?" Kaushik studied engineering at the Institute of Technology of the Banaras Hindu University and the University of California at Berkeley.

Hulya Farinas
Senior Principal Data Scientist, Pivotal

Hulya has extensive experience in the application of algorithmic approaches to complex problems in multiple verticals. Before joining Pivotal, she held positions at IBM and M-Factor, where she helped her customers make optimal business decisions under uncertainty by marrying machine learning algorithms with optimization routines. She is currently a senior principal data scientist at Pivotal, where she is the lead for the health care vertical. She holds a Ph.D. in Operations Research from the University of Florida.

Annika Jimenez
Vice President, Pivotal Data Labs

Annika is a seasoned leader of analytics initiatives, coming to Pivotal after over six years in data leadership roles at Yahoo! At Pivotal, she has built the “Data Science Dream Team” – an industry-leading group of data scientists, representing a rich combination of vertical domain and horizontal analytical expertise – to facilitate Data Science-driven transformations for Pivotal customers. During her time at Yahoo!, she led Audience and International data solutions for Yahoo!’s central data organization, Strategic Data Solutions and led Insights Services – comprised of a team of 40 researchers covering Web analytics, satisfaction/brand health metrics and audience/ad measurement. Annika is a recognized evangelist for “applied data” and well known for her acute focus on action-enablement.

Host: Hello and welcome. Thank you for attending the webcast, Data Science Trends for 2015. I'm happy to introduce our presenters for today, leaders from the Pivotal Data Science team: Annika Jimenez, Kaushik Das and Hulya Farinas.

Please feel free to send in your questions anytime during the webcast using the chat console. We will do our best to answer them at the end or, if we can't get to them, we'll get back to you via email. Thanks again for being here. Now, I'll hand it off to Annika.

Annika Jimenez: Thanks, Katherine. Good morning, everybody. Depending on where you are, perhaps good afternoon or good evening as well. Welcome to this morning's webinar from Pivotal entitled Top Data Science Trends for 2015. My name is Annika Jimenez. For those of you who don't know me I'm the Vice President for an organization called Pivotal Data Labs, here within Pivotal.

Pivotal Data Labs is our internal expert data services team made up of best in class data scientists, data architects and data engineers. We thought this would be a really fun way to close out what has been a very interesting year of exciting data science project work for us. To be clear, that’s high caliber, cutting edge project work in the realm of predictive modeling and machine learning on Big Data and on Pivotal distributed MPP or Hadoop technologies.

While doing that project work, we have at the same time been looking towards the New Year, really wanting to identify the trends that we believe we're going to be seeing for data science in 2015. I'll explain a little bit more in the next slide why we think we're in a good position as a team to be this presumptuous and assume this role of the predictors of trends for 2015. Before I do that, what I want to do is some introductions.

Today I'm joined by Kaushik Das and Hulya Farinas. Let's actually allow each of them to start off with their backgrounds and their role on the team, and then we'll give you a little bit more information on how we pulled together the trends that we'll talk about in the session today. Kaushik, do you want to go ahead and get started?

Kaushik Das: Thanks, Annika. My name is Kaushik Das, very happy to be here today talking to all of you. I head the Data Science team within Annika's organization for the Americas. I was thinking as we were waiting for this to begin, that I actually started working with Big Data a long time ago – before the term Big Data was well known – as a geophysicist, working with data from seismic surveys creating images of the underground.

Since then, I think I've been very lucky to witness the advent of this new technology for Big Data which has made our work so interesting and I’m happy to be in the midst of the emergence and the spread of the community of data scientists that we are seeing today.

Annika Jimenez: Then, Hulya.

Hulya Farinas: Hi. My name is Hulya Farinas. I'm a master data scientist here at Pivotal Data Labs. I'm also the lead for Healthcare and Life Sciences, Retail and Logistics verticals. This means that I'm exposed to all different trends and different verticals.

Annika Jimenez: Then just a quick reminder of my background. I started at what used to be Greenplum over three years ago and had the privilege of beginning to start my collaboration with Kaushik and Hulya leading the build out of a global data science services capacity for Greenplum and then later Pivotal.

Prior to that, I actually spent a long time at Yahoo!, where I led a lot of internal analytic services teams as well as the build out of Big Data centered data capabilities globally. I've also had a very interesting career ride in watching the adoption of Big Data technologies, not just in digital media but now, of course, while we've been at Pivotal, across many other verticals. What we wanted to do today, now that you have a better understanding of the role that each of us plays and our respective perspectives, kind of the viewpoint that we bring into today's session, is talk a little bit about the methodology that we actually applied in order to come up with these predictions. They're not really cherry picked necessarily by each of us. We actually decided that we wanted to really bring in the viewpoints of the entire team.

That team, just as a reminder, is about 30 people. It's a global team (as I've mentioned) with groups in EMEA as well as Asia, in Singapore and Australia in particular, as well as, of course, coverage across the US. This is what we think to be, to toot our horn a little bit, one of the best data science teams out there. We've been doing work, as I alluded to in my original comments, across just about any sector that you can imagine, exploring many different use cases.

We're really on the front lines of the emergence of our craft and the industry that is being built around it. That gives us a very luxurious position to really have a sense of how the overall sector is being built. In a way you can think of us as a microcosm, reflecting and becoming a barometer of the maturity, the analytic maturity, of the enterprise class.

For today's session beyond the predictions themselves, we're not going to make this a big countdown and add drum rolls. I think what we'll do is we'll expose you to our prediction and the trends that we want to highlight. Then you can expect just a quick conversation amongst the three of us around that particular trend, where we think it's appropriate. Then we'll move on to the next trend. We're for sure going to be leaving time at the end of our ten predictions to hear any questions from you or to hear any specific predictions that you would actually like to pose to this audience that we now have participating in this morning's webinar.

With that I think we'll move on to the first prediction. You guys ready?

Kaushik Das: All right.

Annika Jimenez: Prediction number one, data science in 2015 as a craft will move to the cloud or to many clouds. In particular obviously you could expect us to make this prediction because we're employed at Pivotal and at the end of the day Pivotal is really becoming the embodiment of the convergence between cloud and data.

I want to be really clear about what we mean when we say data science will move to the cloud. What we specifically mean is that data science execution, being the ad-hoc discovery of insights, building a predictive model, etc., is actually going to move to a cloud environment. We're not really referencing the scenario where companies will leverage for example BI as a service business models from various startups. We specifically mean companies will be in essence leveraging a packaged suite of data science tools and libraries in a certified cloud performance capability on a distributed architecture.

They'll do that either on a public cloud or a private cloud. It's not just setting up an account on AWS. It's actually a little bit more involved than that. It's a natural extension of course of the build out of data services as a service. We have an eye towards companies like Altiscale that are actually extending managed services on top of Hadoop. We think a very natural next step is to then extend the suite of data science tools and enable much more agile ad-hoc data science execution on top of that infrastructure that is now made available either through smaller players like Altiscale and larger players like AWS, Microsoft Azure, Google etc.

As a result, what does this trend mean for the various stakeholders in 2015? Obviously as a practitioner this is actually almost a panacea. For our team, as an example, we would love nothing more than to be able to plug in in a much more agile fashion on behalf of our customers, tap into the underlying data lake, secure the right size compute required for the project quickly, drive discovery and the creation of insights in a temporary dev environment and actually drive the operationalization of these models much more easily than we can right now in a more fixed hardware environment in a private data center.

If you're IT, that's also a very attractive proposition because you can actually control your access rights, privacy, security and compliance while enabling, on a much more agile and ad-hoc basis, access to both data and a broader set of tools for your in-house practitioners. In essence, collectively between the two sides, the practitioners and IT teams will be reducing time to insights and time to action, and the overall innovation cycle will be sped up.

It's a really key transition. We believe we're seeing the emergence of this in 2014, but we think it's really going to take off in 2015. Kaushik or Hulya, any thoughts on this one?

Hulya Farinas: These are the kinds of demands that we see from across different domains. Pharmaceutical companies with very large R&D departments demand this because there is variation in the compute needs. Also, consumer packaged goods companies desire this kind of flexibility and agility.

Kaushik Das: One of the big attractions for this move is that the expenditure of Big Data then becomes OPEX instead of CAPEX. Most companies prefer that because then it's a variable cost and you're paying for what you use, rather than depending on some initial estimate.

Although there are of course some exceptions because some companies prefer to have CAPEX, like in a heavily regulated industry, like utilities out here in the United States. Because capital expenditure can be written off, they would still prefer to spend the money as CAPEX. For them, the cloud is also attractive because they are able to regulate how much of that they will use. They are able to turn the tap on and off as they need it.

Annika Jimenez: I think, like we said, it's kind of a natural extension of a lot of the trends that we saw in cloud adoption. It's also going to be the facilitator of really driving data science into the application world. We'll talk a little bit more about that in some additional predictions to come.

I think with that, we'll move to prediction number two.

Kaushik Das: This prediction is that data science driven apps will become much more prevalent next year. This is almost a natural corollary to data science moving to the cloud. As you know, apps which are lightweight applications as opposed to heavy enterprise software are very common now. Even data driven apps are very common now.

This is the next stage. This is data science driven apps. You're not just looking at your data, and some of the statistics and descriptive statistics of your data, but you're looking actually at recommendations based on predictive modeling. This year we saw a lot of different projects. This has already emerged this year. We think that this is going to become really prevalent next year.

One big example on the Internet of Things side is GE’s Predix platform. They have already created a platform and have started selling apps and software based on that platform to different players in oil and gas and in the power industry. The other side of this picture is that the moment you have a lot of different people, especially people out in the field with their tablets and smartphones, accessing the results of models (for example, when they're out there trying to diagnose a fault that has happened), it puts a lot of pressure on the architecture, especially the enterprise data architecture. Again, data science driven apps have emerged from Silicon Valley startups. Out here we are very good at building that infrastructure. As it becomes mainstream, enterprises will need to be very careful about the impact it's having on their architectures. There we have already seen advancements like Nathan Marz’s Lambda architecture, which combines a streaming layer with a Big Data layer (which is slower).

We expect a lot of such innovation next year. It's going to be pretty exciting.
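
[Editor's sketch] The Lambda pattern Kaushik mentions can be illustrated roughly as follows. This is a toy, in-memory sketch and not how any production system is built: the class name `LambdaCounter`, the sensor names, and the choice of "count events per key" as the workload are all assumptions made for illustration. It shows the core idea only: a slow batch view recomputed from all data, a fast view covering what arrived since, and a query that merges the two.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda-style store that counts events per key (hypothetical example)."""

    def __init__(self):
        self.master_log = []          # append-only record of every event
        self.batch_view = Counter()   # slow layer: rebuilt from the full log
        self.speed_view = Counter()   # fast layer: events since the last batch run

    def ingest(self, key):
        # Every event is appended to the log and counted in the speed layer at once.
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        # Periodic, expensive recomputation over all data. Once it finishes,
        # the speed layer is reset because the batch view has caught up.
        self.batch_view = Counter(self.master_log)
        self.speed_view = Counter()

    def query(self, key):
        # Serving step: merge the possibly stale batch view with the recent delta.
        return self.batch_view[key] + self.speed_view[key]


store = LambdaCounter()
for sensor in ["pump-1", "pump-2", "pump-1"]:
    store.ingest(sensor)
store.run_batch()
store.ingest("pump-1")           # arrives after the batch run
print(store.query("pump-1"))     # batch count plus speed-layer count
```

In a real deployment the two layers would typically be separate systems; the merge-at-query-time step is the part this sketch is meant to show.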

Annika Jimenez: You can imagine, it was one thing in 2014 to prove the value of a data science driven app. We were engaged in a lot of projects this year where we were doing in fact that. Of course we were partnering very closely with Pivotal Labs to explore how data science actually folds into the application development process, especially an extreme agile process around app dev. I think we're very much on the forefront of the ones that are figuring out how to do that well.

That said, for an enterprise to actually manage multiple predictive models – both keeping them fresh and up to date with the latest data that's coming in, as well as managing the scoring process, enabling the availability of the scores through APIs and then the downstream consumption of the scores, the modification of apps in response to those scores – that's a very new cycle of a value chain that the enterprise class is still getting hip to. I think as a result everything that Kaushik just summarized in terms of impact at architectural level is critical and we're going to see a very serious focus on that from some of the leading data players across many sectors.

Hulya Farinas: Even if an enterprise is used to building data driven applications, there are still some new considerations if they would like to make those applications smart and analytics-driven, because then there's a whole issue of where you keep your models, how often you refresh them, and how you make them available to your applications. The same goes for the additional features or profiles that you may build. How do you store them? How frequently do you refresh them? There are a lot of new things for enterprises to think about if they're trying to make their applications smarter.
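
[Editor's sketch] The questions Hulya raises (where models live, how often they are refreshed, how apps reach them) might be pictured with a minimal in-memory registry like the one below. Everything here is hypothetical: the `ModelRegistry` class, the "churn" model, its two features and its coefficients are invented for illustration, and a real system would persist models and expose scoring behind an API rather than a method call.

```python
import time


class ModelRegistry:
    """Hypothetical in-memory store for scoring functions and their training times."""

    def __init__(self, max_age_seconds):
        self.max_age_seconds = max_age_seconds
        self._models = {}   # name -> (scoring function, time the model was trained)

    def register(self, name, score_fn, trained_at=None):
        # Registering under an existing name replaces (refreshes) the model.
        stamp = time.time() if trained_at is None else trained_at
        self._models[name] = (score_fn, stamp)

    def is_stale(self, name, now=None):
        # A model is stale when it is older than the allowed refresh interval.
        _, trained_at = self._models[name]
        now = time.time() if now is None else now
        return (now - trained_at) > self.max_age_seconds

    def score(self, name, features):
        # What an application would call, directly or through an API layer.
        score_fn, _ = self._models[name]
        return score_fn(features)


ONE_DAY = 24 * 3600
registry = ModelRegistry(max_age_seconds=ONE_DAY)

# Made-up linear "churn" score on two made-up features, trained at time 0.
registry.register(
    "churn",
    lambda f: 0.5 * f["complaints"] + 0.1 * f["months_inactive"],
    trained_at=0,
)

print(registry.score("churn", {"complaints": 2, "months_inactive": 5}))
print(registry.is_stale("churn", now=2 * ONE_DAY))   # two days after training
```

Passing `now` and `trained_at` explicitly keeps the staleness check deterministic; in practice both would come from the clock.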

Kaushik Das: Right. APIs, as Annika mentioned, will become very important in this context. We will see much more widespread use of open standards like PMML, which are critical to sharing of data science models and results.

Annika Jimenez: In essence this entire prediction and the trends around it are really foreshadowing, I think, an even faster leap into the next gen data and app architectures. Obviously, that's what we do at Pivotal. We are exactly mimicking, I think, the enterprise readiness to make that leap. Let's move to prediction number three.

Hulya Farinas: As you know, many organizations are facing data science talent shortages. Back in 2011, McKinsey famously predicted that by 2018 the US alone could face a shortage of 200,000 people with deep analytical skills. A lot of universities, like Berkeley, Columbia, North Carolina, rose to the occasion and started their advanced degrees in data science. Then there are also some for profit technology companies, educational technology companies like Coursera and Udacity that make data science courses available to a very large group of people. But, the shortage of deep analytical skills is still very real for a lot of our customers.

Naturally in response to this demand, there's an increased focus on making machine learning available to practitioners beyond data scientists. Some of the companies we can think of – like Alpine, GraphLab, RapidMiner, Paxata and Trifacta – are working on this. Some have more holistic approaches: they worry about data extraction, transformation, model building and scoring in a single environment. Others cover only part of the analytics workflow, like data transformation.

Building analytics tools for people who do not have a background in data science is not a trivial task because, if the analyst doesn't have the academic rigor and industry experience, they may not know how to interrogate the data that they're using or the model that they're building. They can get themselves and the company they work for in big trouble. So, that's the danger lurking part.

We predict that the tools will continue maturing and that they will go beyond just estimating coefficients, because all data scientists know that that's not where it ends. You need to be able to understand what you're building and then tell a story with it. We predict that there will be safeguards, automatic ways to check assumptions, to see if the algorithm that you're using is the best one for the problem that you're studying, and to understand the shortcomings, the weaknesses and strengths of the data that you're working with, so that machine learning will be accessible to a larger group – because this is not trivial at all.

Annika Jimenez: In essence you're saying to actually embed into the tools themselves, not just an understanding of the predictive accuracy but measurement of the error and the risk associated with those predictions so that those can be factored into any sort of actual operationalization or downstream consumption by an app.

We'd love to see the tool space evolve and become more accessible to a broader set of folks. We also get worried when we see some of those tools almost gloss over some of the rigor that needs to be baked into the capabilities that we're seeing in the tools and where these tools are being pointed to use cases that aren't just in digital media, for example, but perhaps in other sectors that have a bigger impact on livelihood or whatnot. That becomes much more important.
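
[Editor's sketch] One very small instance of the kind of safeguard discussed here is a model that returns its error estimate and an extrapolation warning alongside every prediction. The example below is an assumption-laden illustration, not a description of any tool named in the webcast: it fits an ordinary least-squares line on made-up data, reports the residual standard error, and flags inputs outside the range seen in training.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Residual standard error, using n - 2 degrees of freedom.
    rse = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return {
        "slope": slope,
        "intercept": intercept,
        "rse": rse,
        "x_min": min(xs),
        "x_max": max(xs),
    }


def predict(model, x):
    """Return the prediction together with its error estimate and a range check."""
    return {
        "prediction": model["intercept"] + model["slope"] * x,
        "residual_std_error": model["rse"],
        # True when x lies outside the training range, where the fit is untested.
        "extrapolating": not (model["x_min"] <= x <= model["x_max"]),
    }


# Made-up training data for illustration only.
model = fit_line([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(predict(model, 3))    # inside the training range
print(predict(model, 50))   # far outside it, so the flag is set
```

A downstream app could refuse to act, or ask for review, whenever `extrapolating` is true or the error estimate is too large for the decision at hand.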

Hulya Farinas: You just predicted my next prediction!

Annika Jimenez: Don't worry, we still have lots to talk about. Did you have any other comments?

Kaushik Das: This reminds me a little bit of when the S statistical language was created at AT&T Bell Labs, and R of course (as you know) the open source language is based on that. That kind of took modeling on the computer to a much bigger group of people and expanded the modeling community because you didn't necessarily have to be good at Fortran or Pascal (to bring back memories) to actually write a program.

What we are seeing with tools like these, like Alpine and Trifacta, is a similar change. You're taking the modeling up one level of abstraction and with it comes the added convenience as well as the associated risks.

Annika Jimenez: With that, we're clearly alluding to our next prediction: prediction number four.

Hulya Farinas: We predict that 2015 will be the year of data science ethical failures. Uber, of course, jumped the gun on our prediction. The picture here is a bit obscure, but it references our friends at Uber, who got into some trouble for using their data to determine who is having an affair. They call it the “walk of shame rides”. Recent reporting also suggests that there is misuse of data and data science at that particular company, but they are by no means alone.

More and more companies have the ability to gather and collect data, and now they have access to technology to sift through these data sources in an efficient way. We like to say at Pivotal, with Big Data power comes Big Data responsibilities. There are important conversations happening. In August, for example, the attorney general, Eric Holder, warned that data driven criminal justice practices could adversely affect certain minority groups and that new efforts need to be studied further before they are used to sentence suspects.

All kinds of conversations are happening about how to use data and what to study. Some of the ethical failures that we've seen in 2014 will, of course, create a backlash, and people will demand to know where their data goes and how it's being used. In response we think that companies will start self-regulating and the government may weigh in.

Part of that self-regulation, the grassroots movement, is the Data Science Association – a non-profit organization. This professional group published a code of conduct that goes beyond what you study to how you present your findings (which Annika was mentioning a little while back). Are we being transparent about the strengths and weaknesses of the models that we build? Because we do not want models ever to be a black box. We want the consumers of data science to understand those strengths and weaknesses.

One of our data scientists, Woo Jung, is actually very passionate about that. Now that applications are at the center of enterprises, and are the way enterprises interact with their customers, partners and employees, this is becoming very important.

It's important for data scientists to be transparent about what they’ve built so that the consumers will not have to believe in it blindly but it's also a difficult task of course. It's not very trivial. You just want to be able to explain it simply.

Annika Jimenez: To be really clear, we're actually extending this specifically into the realm of data science. We all have played in the data space long enough to know that there've been sensitivities around the use of data for the last five years at least. I think what we're specifically calling out here as we anticipate 2015 is that, because there's been kind of a broad adoption of data science techniques in the enterprise sector, there's also kind of an overall immaturity in the understanding of the ethical requirements for the consumption of data science as a function.

To reemphasize Hulya's point, with great data power comes great data responsibility. Even as an example for us, the data science code of conduct that Hulya just mentioned is something that we're looking at very seriously for potential adoption as a policy for our team, so that we can actually state with certainty that we will be applying only the most stringent ethical procedures when we're working with our customers' data.

Kaushik Das: Right, because with any usage of Big Data you need to be respectful of the rights of individuals, whether you're in data science or not. With data science comes an added complexity. You need to be careful of the assumptions and limitations of your models when you draw conclusions from them and expect people to act upon them.

Annika Jimenez: Okay. I think we've beat that one. We'll move on with the predictions. Prediction number five.

Kaushik Das: We all feel very passionate about that.

Annika Jimenez: Prediction number five.

Kaushik Das: This prediction is that enterprises will shift from Hadoop-only to Hadoop++ to meet their data science goals. When I go to customers I still get asked the question, 'Do we need to really roll out Hadoop, or should it be something new like Amazon’s Dynamo?'

My answer is always that it's all of the above, because what you really need depends on how you're going to use it. Especially, as we mentioned earlier, with the advent of the data science apps, we will increasingly need a data architecture, or data lake if you will, that has capabilities at different latencies.

It is not very useful to think that there will be one technology or tool that will come in and meet all your needs. So, for instance, Spark has seen a lot of momentum. It's actually going to see a lot of increased adoption that we are seeing in our team as well as out there in the community. There are other interesting things, interesting technologies like Tachyon – which can make Spark even faster – coming out of AMPLab at Berkeley.

What we have seen is that there are certain things that are done well in memory, like real time applications, and certain things are done better in a relational layer. We have actually spent a lot of time and effort out here at Pivotal developing and contributing to this open source library called MADlib, which is a set of machine learning tools which work in parallel so that they can work with Big Data but in the relational layer.

Essentially what we are going to see is that there will be an emergence, that as enterprises go to Hadoop, they will actually install Hadoop++. They’ll realize that on Hadoop, they might need a relational layer like Pivotal’s HAWQ or Dynamo.

What will emerge is the realization that what people need is a layer for doing data science, which will essentially be what we are calling Hadoop++.

Annika Jimenez: I think what we see a lot is companies that have gotten caught up in the Big Data hype and then therefore the natural extension of that, the Hadoop hype. We are of course one of the purveyors of that hype. I don't want to deny that fact. However there's sometimes not an understanding of the true requirements for data science for machine learning and predictive analytics and Hadoop suitability in supporting that as Hadoop alone, as Apache Hadoop and the MapReduce paradigm that that brings.

As companies are able to move into the HAWQ context and leverage existing machine learning libraries like what we have with MADlib as well as explore further Spark and the new libraries – MLlib, etc. – there will be the buildup of an analytical layer that sits on top of Hadoop to specifically meet the data science requirements, not just BI and aggregating data requirements but the agility and the sophistication and complexity of machine learning and predictive modeling.

Any other comments on this one? No? Okay. Let's move on to prediction number six.

Kaushik Das: That's my favorite picture. Unlike the one ring, there is not going to be one algorithm to rule them all. Again this is similar to the previous prediction where we shouldn't really expect one tool to come and solve all our problems. In the same way, we cannot expect one algorithm to come and solve our problems.

In fact, the way that data science community has developed, we have seen that it has enabled people from pretty eclectic backgrounds. Not just the machine learning discipline within computer science departments, but people from mechanical, electrical engineering, econometrics to join this community, take advantage of our tools and bring in with them their ideas and their algorithms.

What we have seen here though is that there are a lot of advancements that happen once you bring your ideas into the realm of data science and Big Data. For instance, neural networks have been around for a long time. But, deep learning has been a remarkably big advancement to that same idea – which we are able to enable in parallel architecture.

We think that this is going to be very promising. It's not going to kill all other algorithms, but it will find increased usage. Hulya will talk more later about some specific usages. We're also seeing things like bagging and boosting on distributed computing platforms, for instance, as well as queue related approaches to capture context.

Although there is no magic bullet, we are seeing a remarkable momentum on working on different algorithms within academia and in the industry.

Annika Jimenez: I think it folds into the earlier sentiment, which is that we're seeing the build out of a rich analytic layer. As the complexity of problem solving approaches new areas and new use cases, and businesses are creatively thinking about how they're going to leverage their curated data, it's going to naturally broaden the algorithm requirements that come to bear against that.

I think in that sense we are also very enthusiastic about some of the latest technologies and approaches to problem solving. We fold them and leverage them as needed depending on the use cases.

Hulya Farinas: Advancement can come in the form of a brand new algorithm or just in the way you implement an existing one in a brand new way on a distributed computing platform. Bagging and boosting have been very successful because there are ways for you to turn the problem into a data parallel problem, run many instances of different algorithm flavors and then just build an ensemble model at the end. We see the evidence of that in the increased robustness of the models and in getting much more accurate models by using those methods.

We see that that trend is going to continue in 2015.
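
[Editor's sketch] The bagging half of what Hulya describes can be shown in a few lines. This is a minimal single-machine illustration under stated assumptions, not the team's implementation: the data set is made up, the base learner is a one-feature decision stump chosen for brevity, and boosting is not covered. The point is that each model is trained on its own bootstrap sample, independently of the others, which is what makes the approach easy to spread across a distributed platform before the votes are combined.

```python
import random
from collections import Counter


def train_stump(sample):
    """Pick the threshold on x that best separates labels 0 and 1 in the sample.

    The stump predicts 1 when x >= threshold and 0 otherwise.
    """
    best_threshold, best_correct = None, -1
    for threshold, _ in sample:
        correct = sum((x >= threshold) == bool(label) for x, label in sample)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def bagged_stumps(data, n_models, seed=0):
    """Train n_models stumps, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    # A bootstrap sample has the same size as the data and is drawn with
    # replacement. Each iteration is independent of the others.
    return [
        train_stump([rng.choice(data) for _ in data])
        for _ in range(n_models)
    ]


def predict(thresholds, x):
    """Ensemble prediction: majority vote over all stumps."""
    votes = Counter(int(x >= t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Made-up one-dimensional data: small x values are class 0, large ones class 1.
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
ensemble = bagged_stumps(data, n_models=25)
print(predict(ensemble, 0), predict(ensemble, 10))
```

On a cluster, the list comprehension in `bagged_stumps` is the step that would be farmed out to workers, with only the fitted thresholds sent back for voting.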

Kaushik Das: It’s great for the data scientists because essentially our toolbox is getting bigger and we are getting more options with those tools.

Annika Jimenez: Maybe building even further on that, let's move to prediction number seven. 2015 will be the beginning of the end of the venture capital driven proliferation of one-algorithm wonders. For those of you who are children of the '80s (not to actually age myself), you may or may not know the reference that you see here on the slide, which is a scene from the video for A-ha's Take On Me, which is probably one of the most well known one-hit wonders out there, at least for my generation.

I think the point that we want to make here really is that there is a huge round of enthusiasm on emergent technology build out that’s being financed by the VC community, by the VCs. That's very exciting for us as a team of practitioners because we can see first hand, as our craft is maturing. It goes back to the earlier prediction that we made about the tool evolution.

The challenge that that presents for the enterprise community is that in essence it's fragmenting the market. Now it's putting the onus on the enterprise to actually sort through that complexity to figure out how they're going to start piecing together the process that they actually have to go through in order to take raw data, explore it, transform it, build features, create models, and then operationalize those models and connect them to apps, or embed them into apps.

That's an extended set of steps that now has a ton of different tools attached to many different points along that process. In fact, for those of you who don't know her, Shivon Zilis, who works at Bloomberg Beta, has just done a machine intelligence landscape that's quite an interesting summary of this fragmentation. It’s a good view into just how rich and complex the market is.

That said, the downside, as I've mentioned, is that it's putting the onus on enterprises, and I think it's potentially complicating their evolution towards increased maturity. I think what we will see is the beginning of consolidation in 2015 and a renewed focus on driving tools down into a common platform or hub that supports the analytic process end to end.

Just to make this more concrete by attaching some examples: we might use Trifacta for data preparation, then maybe bring in Alpine for machine learning, maybe do some visualizations in D3 or even Tableau, and then extend to operationalization, figuring out the role of PMML in driving that back into a production environment.

Or you might be leveraging any one of a slew of vertical-centric one-algorithm wonders, as I call them. I think the challenge for the enterprise is understanding how to deeply embed that as a new capability into an organization, rather than just attacking a one-off use case.

We increasingly are going to want to see the broadening of that analytic layer by bringing together these new capabilities that are currently embodied in fragmented small startups. We'll see that naturally consolidate in 2015.

Kaushik Das: The one-algorithm wonders have played their role. They have solved specific problems and spread awareness of data science and the capability of data science. Now it's time to kind of consolidate those.

Annika Jimenez: Perfect. Prediction number eight. Got video?

Hulya Farinas: Got video? It's probably a very safe prediction. In the last year we have seen an increase in use cases that feature image and video data. That's evident in the number of blog posts we have that discuss image processing and video analytics. Of course, being able to implement deep learning in a distributed computing platform also allows the learning of the features and it creates brand new opportunities.

A lot of the time, image-processing algorithms are implemented in Python and in C++. Traditionally, even when images are processed, they're usually under-utilized. Whenever they're analyzed, it’s with single-threaded processing and they’re analyzed in isolation, not usually merged with other structured and unstructured data sources. But we see that in certain domains, especially health care, life sciences, security and media, there is an increased interest in bringing in these unstructured data sources, images and videos, and merging them with other data sources to increase the understanding of the situation, whatever you might be studying.

Our team has been working on this for a while now. We have implemented a lot of different algorithms in our environment. We have made image processing scalable, even when the algorithms are very complex and rely on multiple libraries. We think that in 2015 that demand is going to continue and the technology will respond to it.

Annika Jimenez: Again it's not necessarily just leveraging the video data by itself. It's drawing upon the same trends that we see underlying Big Data and data science as a whole, which is actually going back to the stored video files and bringing them together with other sources of data that perhaps haven't been brought together before. Now we have the platform to actually power that, as well as new techniques – including deep learning and others.

Hulya Farinas: Especially in healthcare, medical images are being analyzed, but it is very rare that you'll also marry them with medical history, prior diagnoses, procedures, genomics data, medications, lab results, etc. Just by bringing those data sources together, you have a much more complete understanding of the patient, whatever you're trying to predict, whether it's comparative effectiveness or whether a cancer therapy is going to be effective for a particular cohort. There is more information and value they can extract from images. We’re seeing that companies are realizing this.

Annika Jimenez: We expect to see that not just in healthcare but in security, media, etc. We'll move to prediction number nine. Prediction number nine is that 2015 is going to give rise to what we're calling data arbitration solutions, specifically to help address the growing number of proprietary data sets. You want to ...

Hulya Farinas: I want to give an example of that in healthcare. Say a fitness tracking device manufacturer may want to create a smart intervention. For that they would have to have an understanding of the medical history of the user. That won't be available to them of course. Similarly, physicians and providers would want to understand the development of the patient between office visits.

There is value to both organizations in sharing data, but it is often very difficult, especially in healthcare, to share those proprietary data sources. A lot of companies are starting to question that. They have a prerequisite to have a common data lake so that they'll be able to take advantage of analytics that run off of merged data.

There are different methods that we think are going to be prevalent in 2015. One is privacy-preserving data mining techniques: instead of physically bringing the data sources together, you have a federation of data sets and let the algorithms wander across them. Of course there are other solutions as well.
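The "let the algorithms wander" idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not a description of any product or of a complete privacy-preserving scheme: each party runs the computation locally and shares only an aggregate (a sum and a count), and the function names and sample values are hypothetical. Real deployments would typically add protections such as secure aggregation on top.

```python
def local_summary(values):
    """Run at each data owner's site: only the sum and count leave the site."""
    return sum(values), len(values)


def federated_mean(party_datasets):
    """Combine per-party summaries into a global mean without pooling raw rows."""
    summaries = [local_summary(values) for values in party_datasets]
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    if count == 0:
        raise ValueError("no observations across parties")
    return total / count


# Hypothetical example: two organizations holding separate readings
device_maker = [1, 2, 3]
provider = [4, 5]
overall = federated_mean([device_maker, provider])
```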

Annika Jimenez: This was a very common problem in the digital media space. Those of you who were in that world know that there are things like third-party data exchanges, like those provided by Experian and others, that in essence allow for double-blind anonymization and integration of the data for the mutual benefit of, perhaps, two parties.

We expect to see this because of the rise of proprietary data sets that companies have been curating, and because of the richness of these data sets and the potential they have for deep problem solving, maybe across a sector. We're going to see a lot more build out of these third-party offerings, both technologies, as Hulya alluded to, and services that enable the bringing together and anonymization of data in a secure environment to power data science, in a security and privacy compliant way.

Kaushik Das: We actually see new standards emerging as well, to facilitate the sharing of data. Some sectors are further ahead on this than others. In the utility industry there is the CIM standard, a common information model, and a lot of the data is already being stored in that model. That process takes a little bit of time because it brings many players together and there are teething problems.

Once you get the momentum going, a standard rapidly emerges and all software then has to comply with it. Actually you find it easy to do so.

Annika Jimenez: This is perhaps one that's a little bit more on the fringe and maybe wasn't on people's radar. We see these proprietary data sets in particular, and the need to leverage them outside of the enterprise, becoming an important trend. We actually think that opens up opportunities in the market that we'll see exploited in 2015.

I think for our last prediction, it's actually a set of predictions. I'm going to let Kaushik summarize those as they get into the vertical market.

Kaushik Das: Since our team actually works with companies all across the spectrum of verticals, we see a lot of interesting trends which are vertical specific. For instance, you could think of the Internet of Things, which actually encompasses many verticals. We are basically going to see two very important things happening next year.

One is that the Internet of Things will become smarter. This will come from applying data science to find value in the data and make it consumable. Particularly we're seeing the usage of anomaly detection algorithms to predict failures and lead to zero unplanned downtime regimes. We also see more innovation in the algorithmic space and the Internet of Things.
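To make the anomaly detection idea concrete, here is a minimal Python sketch of one simple approach, a rolling z-score over sensor readings. The technique, window size, threshold, and sample values are assumptions chosen for illustration; the speakers do not say which algorithms the team actually uses.

```python
from statistics import mean, stdev


def rolling_anomalies(readings, window=5, threshold=3.0):
    """Return indices whose reading deviates strongly from the preceding window."""
    flagged = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Skip flat windows to avoid dividing by zero
        if sigma > 0 and abs(readings[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical vibration readings from one machine, with a spike at index 7
vibration = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 4.0, 1.0, 1.1]
spikes = rolling_anomalies(vibration)
```

A flagged index is only a candidate for investigation; linking it to an actual failure would need labelled maintenance history.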

On the other hand, companies are also beginning to understand that they need a strategy to exploit the Internet of Things. The Internet of Things is going to become mainstream next year. This November, the Harvard Business Review had an article for instance by Michael Porter on how to leverage the Internet of Things.

This has now become mainstream. The Internet of Things has become an essential part of the strategy of most big companies and a lot of the media models.

What we've seen is that the mainstreaming of this will have an effect on different sectors. For instance, think of connected cars when it comes to automobiles. Think of connected homes. On the other hand, we have seen that oil and gas, for instance, has already realized the importance of Big Data and started leveraging it. Consumer electronics and utilities are still a little bit behind, with a couple of notable exceptions, but they're catching up.

Orthogonal to this is the impact of security. As the Internet of Things is becoming mainstream, what about the security of that data? We have seen a lot of movement and momentum in the security space, where we're noticing two big things in terms of the application of Big Data to security. One is the monitoring of lateral movement in networks: finding anomalies and detecting activities which could be harmful that way.

The other is detecting malware, especially malware beaconing. Here you don't necessarily know the signature of the malware; for known signatures you already have other software in place. This is a general supervision of your entire network.
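One signature-free heuristic for beaconing, offered here only as an illustrative assumption and not as the team's method, is to look for connections that recur at suspiciously regular intervals. The Python sketch below scores a host's outbound connection times by the coefficient of variation of the gaps between them; the function name and timestamps are hypothetical.

```python
from statistics import mean, pstdev


def beacon_score(timestamps):
    """Coefficient of variation of inter-connection gaps; near 0 means very regular."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2 or mean(gaps) == 0:
        return None  # not enough evidence to score
    return pstdev(gaps) / mean(gaps)


# Hypothetical connection times in seconds
regular = beacon_score([0, 60, 120, 180, 240])   # machine-like, fixed interval
irregular = beacon_score([0, 5, 200, 230, 900])  # bursty, human-like traffic
```

A low score alone is not proof of malware, since legitimate software also polls on a schedule, so in practice it would be one feature among several.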

Annika Jimenez: It's really important to mention that the leadership in that space is provided by Derek Lin and the team that we have here that does a lot of work in the security space. They're seeing a lot of adoption of problem solving for these two use cases in particular. They're also looking at graph-based modeling approaches to solve for both of these cases.

Kaushik Das: Derek has innovated and led the frontier of data science into this area. We have seen again, for instance in pharmaceuticals, rapid growth in genomics data sets which will require pharmaceuticals to actually leave HPC environments for Hadoop as the data grows.

In finance, we have seen increased application of scenario analysis to understand and compute risk on distributed environments, and also the increasing use of text, particularly in situations like e-compliance. Know your customer, or KYC, requirements are now legal requirements for most financial institutions.

We are seeing the movement from doing compliance manually, employing hundreds and even thousands of people, to using machine learning. This will accelerate a lot in the coming year. Then of course when it comes to areas like consumer packaged goods and retail, we have seen a lot of growth in location-aware apps and the data science to support that, doing dynamic pricing offline – like Kohl’s has experimented with over the last year.

We're seeing online and offline marketing and sales combined into one unified strategy using data science. Then we're also seeing the importance of data science in product design: understanding how people behave with products, both online and offline, and incorporating the behavior of the users into design.

So, that's a lot of different things.

Annika Jimenez: A round up of our market. I think the ones that you just called out really are the ones that we expect to see very strong adoption in 2015 based on the maturity of the various enterprises that are competing in these sectors that we've been working with in 2014 and 2013.

It's the natural extension of their maturity and the evolution of their adoption of data science. That rounds up our predictions. I think what we wanted to do now is actually move to this hype cycle from Gartner for Big Data. I think it's an interesting way for us to wind down our set of predictions for 2015, knowing that this actually I think was published in August of 2014.

Again it's Gartner's view into the various sub-components of the Big Data space and where they sit on their hype cycle, including of course those that are nearing the dreaded trough of disillusionment. To be really clear we're not necessarily highlighting this hype cycle because we agree with it 100%. In fact there are a bunch of elements of this particular hype cycle that we actually found ourselves disagreeing with, a fair amount. Maybe it's worthwhile commenting on a few of those things as well.

As we do that we'll share a few of our thoughts and then shortly after this we'll have a few more minutes left for questions. So, stay with us if you'd like to participate in that.

Turning to this visual, I think the thing that struck us at the outset was the placement, in particular of the Internet of Things at the peak of inflated expectations. That's one of the areas that I actually would disagree with. I think they're signaling that there's kind of a longer-term evolution for this particular item, five to ten years before it moves into this plateau of productivity.

I do think that that's true. I think that we're just at the beginning of an adoption of the Internet of Things and the understanding of the enterprise class of how to actually leverage the Internet of Things. If I had to disagree just generally, I think we're not so sure that there will be a trough of disillusionment for the Internet of Things. There are a lot of lessons already established generally around weblogs and how weblogs can be consumed, as well as other sources of data.

There's a lot of preexisting knowledge that players who have the opportunity to leverage the Internet of Things can actually base their strategies on. I think we're a little bit more optimistic and we would also put it a little bit earlier on the curve as well.

Kaushik Das: Right. The bottleneck there is not really the technology, it’s getting the data together from different players. We have seen a lot of movement toward solving that problem this year. We think that it is definitely on an upswing.

Annika Jimenez: Great. I think the same goes for, let's see here, some of the other ... We would agree with placing the as-a-service offerings, data as a service and some of the others, early on the curve. I think, to our own number one prediction with data science moving into the cloud, we're just beginning to see that emerge as the natural next step for the industry. But we do expect that to get big traction in 2015.

I don't know, Hulya, if there was an item in here that you wanted to ...

Hulya Farinas: I wanted to pull out the supply chain Big Data analytics one because in the '90s a lot of enterprises spent a lot of money on building supply chain management solutions. There are a few who are realizing that there's an increased amount of data now, much more granular, that will allow them to make better business decisions. It's not necessarily solving brand new problems but increased data availability makes these problems new and requires brand new approaches.

I think the placement of it sounds about right. I don't think the news has gotten to everyone yet.

Annika Jimenez: I think the way to take this is, if you think about all of our predictions, in essence we're signaling that there is a maturing in the understanding of how to leverage Big Data technologies and to apply them to the business needs based on the data that is available to that company.

Bringing distributed compute capabilities into their architecture which incorporates a lot of different open-source technology, proprietary technology etc. and understanding how to architect that into the platform and then apply it to problem solving across many different verticals and use cases, is kind of where we are in the overall process. We expect 2015 to be the one where there's true traction on that, on multiple points of the stack.

I think in a way that's what you're seeing in this hype cycle as well. I don't know if we have any other comments on this. It's a good way to summarize what some of the folks who are paid to opine on Big Data are thinking about Big Data. You now know our predictions from Pivotal reflecting the footprint that we have globally across many different sectors.

I think with that we want to take a couple of minutes and take some questions. I'm going to bring that back to Katherine and see if she has any she would like to pose that may have come up from the audience during the presentation.

Host: Great, thanks Annika. One of the questions that came in initially was that you haven't talked about the role of optimization in the data science process. Any thoughts on its role in 2015?

Hulya Farinas: That's an excellent question. My background is in operations research. I do believe in optimization being the last mile of predictive analytics. Once you understand the causalities, how do you act? You can inform the decision maker, or be the decision support system. There are also many instances where you would want to feed that into an optimization routine, and we have all kinds of capabilities on our technology stack to make that happen.

We do believe in the marriage of machine learning and optimization. We see an adoption of that in 2015.

Annika Jimenez: Any other questions?

Host: Yes. Another one came in that said the common threads through all these predictions are one, anomaly discovery, two, behavioral modeling, and three, personalization. Can you comment on that?

Kaushik Das: Sure. That's actually a very good question because a lot of data science does boil down to those approaches. They're by no means exhaustive because, for instance, there is the whole area of demand modeling in general, which enables us to understand all the factors that affect demand and therefore predict, both at the strategic and tactical level, how our actions can influence the outcome in different scenarios.

There is that. There are a few other things there. For instance we spend a lot of time, especially when it comes to the Internet of Things in understanding time series data and dealing with time series data and understanding the patterns there.

Sometimes that leads to anomaly detection, but in other times it just leads to a greater understanding of the levers that affect a particular system of machines.

Annika Jimenez: I think we are seeing, as part of this natural maturation at the enterprise level, a consistent theme, especially when you think about the concept of all of this being powered by a data lake: companies building a 360 degree view of the thing or event that matters to them.

That could be a patient, the quantified patient. It could be a consumer. It could be a product. We're seeing that in CPG sector where they're building that profile of the life cycle of a product, inclusive of even the manufacturing lines. I think that is raising the natural problem solving questions that can be asked once that 360 degree view of the product is getting assembled.

I think we're at the point now where, because that build out has happened, there's just a natural phase in 2015 to exploit what could be asked of all of the data that we're seeing.

Hulya Farinas: Anomaly detection especially is probably part of any Lab engagement we have anyway. A lot of companies believe that if their data is messy, there’s no value to get. We don't believe that at all, because there are all kinds of sophisticated methods that you can use to identify what is true and what's not, cross-validate the data, detect anomalies, and then apply imputation methods to improve data completeness so that it will be ready for data science.

It's almost a prerequisite to any Lab engagement. We've been educating our customers about the value of it and we see the increased appreciation of that method.
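As a small illustration of that clean-up step, the Python sketch below masks implausible values and then fills the gaps. The specific choices are assumptions made for the example, not the methods used in Lab engagements: outliers are detected with the median absolute deviation (MAD), missing values are represented as `None`, and gaps are filled with the median of the remaining observations.

```python
from statistics import median


def mask_outliers(values, k=5.0):
    """Replace values more than k MADs from the median with None."""
    observed = [v for v in values if v is not None]
    center = median(observed)
    mad = median(abs(v - center) for v in observed)
    if mad == 0:
        return list(values)  # no spread to judge outliers against
    return [None if v is not None and abs(v - center) > k * mad else v
            for v in values]


def impute_median(values):
    """Fill None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical heart-rate readings with one gap and one implausible value
raw = [70, 72, None, 71, 300, 69]
cleaned = impute_median(mask_outliers(raw))
```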

Annika Jimenez: Great, great question. The other ones?

Host: Maybe one more if you're able. I know we're a little over our time here. What are the skills which are going to be in demand in 2015 from a data science perspective?

Kaushik Das: When we hire data scientists, we have built up a very good team here, we look at skills in essentially three axes. The first one of course is having a very good understanding at the algorithmic level of the math that's needed to solve problems of data science. The base is of course in statistics and machine learning. Then people have often specific expertise in different areas.

The second thing is domain knowledge, which comes from the side of the business. The use cases that are formulated, and to which you then apply data science, require an understanding of what goes on in the domain. That is the non-mathematical aspect of data science. So, we really value expertise in different domains.

The third axis, along with the math on the first, is of course knowledge of technology. Here we're looking more for people who are capable of learning technology very quickly: a lot of comfort with programming and rapidly picking up new technologies because, as you are well aware, the specific tools that will become popular and important tomorrow are not here yet.

Annika Jimenez: I don't know that it is different. Do we think the profile of folks that we will be looking for in 2015 is different from the ones that we have hired until now? I don't think so.

Kaushik Das: No, I don't think so.

Annika Jimenez: There's going to be a specific drill down into some of these new techniques but we're hiring the kind of people who come in ready to dive into those. We're not specifically going out and trying to find, for example, somebody out of AI who is going to help us open up a new area. We've applied a pretty consistent profile – programming, stats, domain knowledge – and we've been pretty successful.

Hulya Farinas: Once the tools become so mature that you don't really need a lot of programming skills, maybe this storytelling part of it would ...

Kaushik Das: Yes, the last part which is very important is the storytelling part. Remember that we are not operating in a vacuum as data scientists. Our results need to be used by people, consumers, public, or employees of an organization. We need to build a narrative and explain what we have done, rather than just give some results out of a black box.

Annika Jimenez: Yes, I agree with that. I also think that signals other roles outside of just the data scientist that become also increasingly important, people that we used to call engagement managers whose job it was to actually engage with the business and help them consume the output of the analytics process.

I think it goes on both sides. I think we're still seeing the build out organizationally of the community around data and data science. We'll continue to see that mature into 2015 as well.

Katherine, I think that might do it for us. Perhaps we respond separately to any questions that we didn't get to.

Host: Sure. We'll send the question list to you guys and you can respond via email to the folks whose questions we weren't able to get to. I guess we'll bring this to a close. We'll be sending out the links to this webcast recording and the SlideShare deck of the session in the next few days.

There's also lots more information on the Pivotal website if you're interested. Thank you all again for joining us and have a great day.