White Paper

Digital Drivers

How Technology Enables the Data Geek in Life Sciences and Healthcare





In what is considered one of the earliest examples of an epidemiological study, the physician John Snow set out to identify the source of a cholera outbreak in London in 1854. By collecting data through interviews with local residents, he was able to triangulate the source – a well of contaminated water – and persuade the local authorities to prevent its further use. This early example illustrates how data can inform decisions and improve health outcomes.

At the time of his discovery, John Snow lamented that, with more data and a faster response, even more lives could have been saved. Today, we are in a position to collect terabytes of data from thousands of sources and to act upon it immediately. Cholera outbreaks in Haiti could have been detected well in advance using data from Twitter feeds rather than traditional epidemiological monitoring. Developments in the near future may see disease outbreaks preempted through continuous monitoring of DNA sequences from samples of sewer water.

What developments over the past 161 years have made it possible to collect such large volumes of data, analyze them, and react to them in meaningful time? This article discusses some of the advancements in biomedicine and technology that enable the creation and analysis of an ever-increasing volume, variety, and velocity of data, and describes how computational solutions are propelling the field of biomedical informatics forward. It also examines areas of the healthcare industry that are being transformed by these developments.

Figure 1. John Snow's map showing infected individuals clustered around the affected well

The Growth of Data in Biomedicine: The Promise and Perils

Due to advancements in data collection, we have a growing repertoire of sources from which we can build more complex models than we could before. Integrating such data requires systems that can perform rapid computations. Also key are environments that can handle enormous volumes of data and allow us to seamlessly bring them together for analysis. Before we discuss solutions to these challenges we need to understand the available datasets and how computational approaches can enhance the field of biomedicine.

The use of computing in biomedicine is certainly nothing new. The field of genomics provides one example: sequences of small DNA fragments generated by the Human Genome Project needed to be aligned and ultimately strung together to reconstruct the three billion base pairs, or letters, of the human reference genome. This task, impossible by manual means, required computer scientists and biologists to work together and implement algorithms to solve the problem computationally. The biomedical space also provided early applications of artificial intelligence as AI systems attempted to mimic physicians in an effort to improve the way patients are treated.

What is new is the orders-of-magnitude more data that scientists have to work with today. Consider, for example, predicting a cancer patient's prognosis. This involves analyzing the patient's genomic record, imaging data, and family history, among other things. The genomic record alone – swelled by contributions from the fields of metabolomics, proteomics, and transcriptomics – contains a staggering amount of information about patient health and potential treatments. With the widespread adoption of electronic medical records (EMRs), the volume of data will only continue to grow: comprehensive medical histories, including diagnoses and treatments, can now be studied alongside bedside monitor feeds and external data sources. The growing use of sensor data, both in hospitals and from wearable devices, means datasets are quickly becoming unmanageable with existing methods. New imaging modalities, each with its own technique, likewise produce large volumes of high-resolution images available for analysis. The challenge lies in managing the ever-increasing size and variety of these datasets.

Why biomedicine has a big data problem

The need for new architectures to deal with biomedical datasets is immediately apparent.

Data sizes are increasing rapidly. Beyond the commonly cited example of genomic sequencing, where some institutions are already approaching the 100-petabyte mark, a single image in radiology can be as large as one terabyte, and high-throughput techniques can produce petabytes of data annually. Deriving value from such volumes of data requires rapid processing, and storage cost becomes a significant consideration. While alternative long-term storage solutions, such as DNA-based storage, are being explored, it is possible that in the future raw data will not be stored at all: resequencing a sample may become cheaper than keeping its raw data on disk.

The speed at which data flows in, and the immediacy of the action required, also contribute to biomedicine's big data problem. Take sensor data: bedside monitors, Fitbits, RFID (radio-frequency identification) tags, and your smartphone are all potential sources, each generating a nearly continuous stream of data. Depending on the application, this data may require an immediate response. Predicting infant mortality from bedside monitors requires rapid responses to changes in physiology, as does predicting disease outbreaks from Facebook and Twitter feeds. RFID tags can alert surgeons if sponges are left behind after surgery. This concept – connected sensors in our surroundings, and even in our own bodies, whose data is analyzed to predict and respond to events – is known as the "Internet of Things." Its applications extend far beyond medicine, ranging from automatically ordering groceries to predicting failing jet engines.

This brings us to the final challenge: bringing these data sources together in a single place where they can be aggregated and analyzed simultaneously. Research institutions, hospitals, and pharmaceutical companies too often produce data that remains isolated in the silos where it is generated. Remedying this situation requires an environment capable of accommodating each dataset's particular storage and computational requirements. Designing such an environment involves:

  1. Architecting systems that can rapidly ingest and transform data from all relevant sources
  2. Building models that can accurately and precisely alert when action is needed
  3. Producing alerts in a timely fashion

New paradigms have been introduced to respond to these challenges.

Breakthroughs require the emergence of new technologies

Over the past few decades, computation has become the cornerstone of large-scale problem solving. Simulations and searches on a massive scale have been responsible for solving some of the hardest computational problems – beating a grandmaster at chess, for example. What these problems required most was pure computational horsepower to search an enormous, possibly infinite, space for an optimal solution, not the ability to crunch through huge volumes of data. They therefore cannot be considered big data problems in the modern sense.

As a result, these large-scale computing systems were designed with relatively little storage, focusing instead on the rapid communication of information across a massive number of CPUs and the transmission of results between multiple threads. The processing of terabytes, let alone petabytes, of data was hardly a consideration during their development. Though some problems in the biomedical sphere, such as protein folding and molecular dynamics, do lend themselves to this architecture, it is, in general, ill-suited to the large-scale data problems described above. In addition, algorithms implemented to work on gigabytes of data often cannot function at the terabyte-plus scale. Traditional programming languages were simply not designed with trillion-row matrices in mind, and while some languages have expanded their range of valid array sizes, it is not long before these new limits, too, are exceeded.

Nonetheless, data scientists persisted with these systems, choosing to work around their limitations rather than develop new solutions. Some progress was made, no doubt: graphics processing units (GPUs) were used for sequence alignment, and groups like the Pande Lab at Stanford harnessed idle cycles to achieve "cheap compute." But these solutions could still handle only relatively small datasets.

The result was a need to ship data from the locations where it was produced to CPU-rich environments where it could be processed, with the results then copied back to the original environments. Very soon, computation was not the only bottleneck; network traffic and I/O were factors too. Something needed to change.

The Massively Parallel Processing (MPP) database introduced a new paradigm: storing and processing data in distributed environments in order to minimize data movement whenever possible. The premise behind MPP databases is simple: take the traditional PostgreSQL database management system, store data across multiple machines, and perform queries by pushing the work to the nodes that store the data. Consider the example of counting elements in a table; each node counts elements in the portion of data it stores, loading only small amounts of data into memory at a time, and these counts are subsequently aggregated. Data movement is minimized by pushing only the local count over the network during aggregation, and parallelization is achieved without copying data. Pivotal’s Greenplum is one example of several early MPP technologies.
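To make the count example concrete, here is a minimal, purely illustrative Python sketch of the same push-the-work-to-the-data pattern (this is not Greenplum code; the partitions and row values are invented): each "node" counts only the rows in its own partition, and only those small per-node counts travel to a coordinator for the final sum.

```python
# Illustrative sketch of the MPP-style count: work is pushed to the data,
# and only tiny per-node counts move over the "network".
from multiprocessing import Pool

# Hypothetical partitions of a table, one list per "node"
partitions = [
    ["row1", "row2", "row3"],          # node 0's local data
    ["row4", "row5"],                  # node 1's local data
    ["row6", "row7", "row8", "row9"],  # node 2's local data
]

def local_count(partition):
    """Work done where the data lives: count the node's own rows."""
    return len(partition)

if __name__ == "__main__":
    with Pool(processes=len(partitions)) as pool:
        per_node_counts = pool.map(local_count, partitions)  # one integer per node
    total = sum(per_node_counts)  # coordinator aggregates the partial counts
    print(per_node_counts, total)  # [3, 2, 4] 9
```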

The MPP paradigm works well with data that is easily stored in tabular form. However, data does not always have an obvious tabular structure a priori. The solution? Hadoop. Hadoop's distributed file system enables the storage of massive amounts of unstructured data such as images and text documents, and its MapReduce framework provides the same distributed store and process-in-place functionality as MPP.

Conceptually, a MapReduce job consists of three phases: a processing phase called map; a handoff, or synchronization, phase called shuffle, in which data from the mappers is moved between nodes; and finally an aggregation phase called reduce. In a task such as counting words in a document, the map phase has each node count the occurrences of words within its local data. During the shuffle, the word counts are routed to specific nodes as prescribed by the framework. The reduce phase then aggregates the count for each word in parallel and writes the results back to the distributed file system (Figure 3).
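As a rough illustration of the pattern, the following single-process Python sketch simulates the three phases of a word count (the sample documents are invented, and a real Hadoop job would run the map and reduce functions on separate nodes):

```python
# Toy, single-process simulation of the MapReduce word-count pattern.
from collections import defaultdict
from itertools import chain

documents = [
    "data informs decisions",
    "data drives decisions in biomedicine",
]

# Map: each "node" emits (word, 1) pairs from its local document
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

mapped = [map_phase(doc) for doc in documents]

# Shuffle: route all pairs for the same word to the same "reducer"
shuffled = defaultdict(list)
for word, count in chain.from_iterable(mapped):
    shuffled[word].append(count)

# Reduce: aggregate the counts for each word (done in parallel on a real cluster)
word_counts = {word: sum(counts) for word, counts in shuffled.items()}
print(word_counts)  # e.g. {'data': 2, 'decisions': 2, 'informs': 1, ...}
```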

MapReduce solves the issues of data movement and parallel processing, and enables the execution of jobs on large volumes of data. One issue, however, is that it always works with data stored on disk. Complex analytic workloads often mean stringing together a sequence of MapReduce jobs, resulting in large latencies due to frequent disk access. This makes the framework less than ideally suited to some near-real-time jobs. Enter Spark, a fault-tolerant framework for distributed, in-memory computation.

At the core of the Spark framework is an abstraction called the Resilient Distributed Dataset (RDD). Baked into an RDD is the ability to rebuild a lost partition from lineage, enabling Spark’s fault tolerance. RDDs can also be cached in memory across several machines, allowing reuse in multiple workloads and avoiding frequent disk reads. This results in extremely low-latency processing. The Spark ecosystem also allows for both real-time stream and batch-driven computation to coexist in what is commonly referred to as the Lambda architecture.
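A minimal PySpark sketch of this idea follows, assuming a running Spark installation; the application name, file path, and record layout (a comma-separated value in the second field) are placeholders rather than details from this paper:

```python
# Minimal PySpark sketch: cache an RDD in memory and reuse it across jobs.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-caching-sketch")

# Load raw records from the distributed file system (placeholder path)
records = sc.textFile("hdfs:///data/sensor_readings.txt")

# Parse once and keep the result cached in memory across the cluster
readings = records.map(lambda line: float(line.split(",")[1])).cache()

# Both jobs below reuse the cached partitions instead of re-reading disk;
# if a partition is lost, Spark rebuilds it from the RDD's lineage.
total = readings.count()
above_threshold = readings.filter(lambda value: value > 38.0).count()

print(total, above_threshold)
sc.stop()
```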

The advancements in distributed storage and compute frameworks like MPP databases, Hadoop, and Spark have led to the increased adoption of these platforms for building predictive analytics applications involving sophisticated machine learning. This, in turn, has spurred a great deal of activity around the development of machine-learning libraries for these platforms; examples include MADlib, MLlib, and GraphX. Using these tools, models can be built that leverage billions, or even trillions, of rows of rapidly streaming data. Moreover, because each framework allows existing code written in languages like R and Python to execute locally on the data where it resides, old-school single-threaded applications still find a place in this new, distributed data paradigm.
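As one illustrative sketch of what these libraries look like in practice (using Spark's MLlib with invented toy data, not an example drawn from this paper), a logistic regression model can be trained directly on a distributed dataset:

```python
# Sketch: training a logistic regression model with Spark MLlib on an RDD.
from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="mllib-sketch")

# Toy labeled data: (label, [feature1, feature2]); in practice this RDD
# would be derived from records already resident on the cluster.
data = sc.parallelize([
    LabeledPoint(0.0, [0.1, 1.2]),
    LabeledPoint(1.0, [2.3, 0.4]),
    LabeledPoint(0.0, [0.2, 1.1]),
    LabeledPoint(1.0, [2.1, 0.3]),
])

model = LogisticRegressionWithLBFGS.train(data)
print(model.predict([2.0, 0.5]))  # predicted class for a new observation
sc.stop()
```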

The flexibility of these environments makes it possible to ingest and effectively analyze the vast amounts of data continuously produced. What is next? How does the world derive value from and act on this data?

Beyond data and architecture: Operationalized predictive models

Exciting opportunities lie ahead for academic researchers to take advantage of ever-growing data sources and develop a better understanding of diseases, risk factors, and the effectiveness of treatments. Even more exciting are the opportunities for the healthcare industry to use these insights to prevent undesirable events, improve outcomes, and reduce costs. Hospitals, pharmaceutical companies, and consumers all stand to benefit from operationalized predictive models. This section discusses the priorities of each group and the challenges they can expect to face on the way to becoming data-driven enterprises.


Hospitals

Through government incentives like Meaningful Use, nearly 70% of healthcare providers have adopted electronic health records. This enables government agencies such as the Centers for Medicare and Medicaid Services (CMS) to collect data from healthcare providers and provide transparent performance and quality measures. Hospitals themselves can improve outcomes by using electronic health records in predictive models; examples include hospital census forecasting, emergency room wait-time modeling, hospital readmission prediction, identification of gaps in care, and length-of-stay and throughput modeling. Two questions need to be considered for a model to be effective once operationalized:

  1. What data is available at a particular point in time?
  2. What data is actionable?

These two concepts may be illustrated using an example.

A hospital streamlining its discharge operations by predicting the duration of a patient's stay may perform a retrospective study using all available data, including diagnoses, the types of operations performed, and the operating surgeon. The study would allow the hospital to attribute duration of stay to a patient's particular condition and to predict length of stay for each future patient accordingly. However, a predictive model that supports discharge scheduling may not have all of these data elements available at the moment the decision must be made. For example, a precise diagnosis code is not available at the time of admission to an emergency room: is the patient having a panic attack or a heart attack? This will not be known until later.

As technology evolves, the goal is for hospitals to use more granular data in order to become "Connected Hospitals." A hospital already produces a lot of rich data that can be used effectively in models: patients' blood pressure, body temperature, respiratory rate, and oxygen saturation. For example, models exist that use a patient's medical history to predict whether their condition will severely deteriorate. But in what time frame? A model may be able to predict the outcome of the next hour very accurately, but at that point caregivers would not need a model to tell them the extent of physiological deterioration. Models must be built to generate alerts while there is still time to act.

In the coming years, hospitals will continue to collect more highly granular, real-time data and feed it to a central "brain" that drives predictive models and actions. Hospitals that require their caregivers to wear wristbands that read RFID tags at hand-washing and sanitizing stations demonstrate this trend: an accelerometer detects how long a caregiver spends washing their hands, and an alarm prompts them if the action is not done correctly. By collecting such data in a central database, a hospital can identify potential contamination patterns in real time and prevent a hospital-acquired infection from spreading.

Pharmaceutical Companies

R&D departments of pharmaceutical companies are excited about the increasing amount of rich data from new and interesting sources. Logs from fitness-tracking devices may allow a better understanding of patients' mobility and thus of the progression of debilitating diseases like multiple sclerosis. The merging of very diverse datasets, including various "-omics" data, high-resolution images, and chemical structure data, is driving new medical discoveries. While real-time decision making powered by predictive modeling is, for the most part, less crucial in such research environments, pharmaceutical companies can improve product quality and reduce costs by combining highly granular, sensor-generated data with predictive models.

An example is a model to predict vaccine potency using manufacturing data. The potency of a vaccine must be monitored to produce a high-quality product that meets FDA standards. Vaccines can be very expensive to manufacture, taking months to produce. During this process both machine-generated monitoring data and manually entered measurements are collected. Leveraging this data can yield a very accurate model that can prove useful in making critical business decisions.

The primary considerations when developing models in this particular case are:

  1. Can the model make predictions accurately enough for action to be taken?
  2. Is the prediction made in time either to take corrective action or to avoid wasting additional resources on a "throwaway" batch of vaccines? If the vaccine's potency can be predicted very accurately, but only on the last day of the manufacturing process, leaving only a yes-or-no decision on discarding the batch, is the model really saving money?
  3. Can root-cause analysis enable us to tune the manufacturing process?


Consumers

When a patient visits a doctor's office, numerous types of rich data are collected: the physician's assessment of the patient's health, lab results, MRIs, and CT scans. According to the Centers for Disease Control and Prevention (CDC), a patient visits a physician three times a year on average. As a result, the majority of healthcare decisions are made using very few, albeit well-documented, data points. But many patients, especially those with chronic illnesses like diabetes, need help between visits to make better health decisions and to adhere to the medication, exercise, and diet regimens prescribed to them. Care for people diagnosed with diabetes accounts for over 20% of healthcare dollars spent in the U.S., so this disease naturally receives a great deal of attention from healthcare companies. Solutions designed to help patients better manage their condition themselves include Sentry's wristwatch-style tool, which sounds an alert when perspiration occurs or body temperature drops – two markers of low blood sugar; contact lenses developed by Google and Novartis that measure glucose in tear fluid and send the data wirelessly to a mobile device; and Medtronic's artificial pancreas, which senses glucose levels continuously and pumps the optimal amount of insulin.

Consumer-facing solutions have the potential to make the biggest impact since they target one of the major decision makers: consumers themselves. However, these solutions are not without their challenges. First, the sensitivity of the data being transferred means additional scrutiny from regulatory agencies. Second, making the findings of a statistical model accessible to the general public is not always easy; oversimplified communication may have unintended consequences, such as alarming individuals into using healthcare resources unnecessarily or, conversely, giving them a false sense of security. Finally, companies that build data science-driven healthcare apps must consider whether the right cohort of patients will adopt them; sometimes the sections of the population that would benefit most from a particular solution are not reachable through technology.

After all, isn’t innovation useless if we cannot reach those most in need?

Author Bios

Sarah Aerni is the principal data scientist at Pivotal, where she focuses primarily on healthcare and life sciences. She received a Ph.D. in Biomedical Informatics from Stanford University.

Hulya Farinas is currently a senior principal data scientist at Pivotal and has previously held positions at IBM and M-Factor. She received a Ph.D. in Operations Research from the University of Florida.

Gautam Muralidhar received his Ph.D. in Biomedical Engineering from The University of Texas at Austin and his areas of expertise include machine learning and computer vision. He is currently a senior data scientist at Pivotal.