Big Data Has Changed the Practice of Healthcare Forever – and the Change is Just Beginning

Healthcare organizations, old and new, are investing heavily in Big Data applications.

Big Data projects process data measured in petabytes to deliver significant healthcare benefits. Only a small proportion of that data comes from traditional databases with well-structured data. Instead, most of the data comes from sources that are messy, inconsistent, and never intended for a computer to use. I’m talking about unstructured patient records. Accessing this unstructured data and making sense of it gives healthcare professionals and leaders insights they would never have otherwise. Those insights directly affect the way healthcare is delivered on a patient-by-patient basis.

I’ll give you four real-world examples of benefits the healthcare industry has already realized. We’ll take a quick look at Apixio, Fitbit, the Centers for Disease Control and Prevention (CDC), and IBM’s Watson Health.


Apixio

Medical research has traditionally relied on randomized trials of small populations. No one tried to conduct massive healthcare research using all the data on all patients because the work would have been overwhelming. Limiting the size of the data sets made the research manageable, but working with small sample sizes creates methodological flaws of its own. This is not to criticize those studies but to recognize that their outcomes were limited by what was feasible at the time they were conducted.

Apixio set out to change all that. Apixio developed mechanisms for conducting healthcare research based on actual patient healthcare records. Their mechanisms leverage both Big Data and machine learning. Further, they work with ALL the patient healthcare records a facility has to offer, not just a randomized subset. As new patients are treated, Apixio collects data about the symptoms, diagnoses, treatment plans, and actual outcomes. By integrating these new cases into the mix, the company can quickly determine what works and what doesn’t. The difference between judging the effectiveness of treatment programs from limited clinical research studies and judging it from the outcomes of ALL patients can be dramatic.

Only about 20% of patient healthcare records reside in well-ordered databases. The other 80% is messy, unstructured data: GPs’ notes, consultants’ notes, and forms prepared for Medicare reimbursement. Working with unstructured data used to be problematic. Institutions had to hire and train “coders” who would read free-form materials (handwritten notes, typed notes, etc.) and capture their meaning in a form suitable for computer processing. Apixio dealt with this issue quite differently. It used computer-based algorithms to scan and interpret this data. The company found that its computer-assisted techniques enable coders to process two to three more patient records per hour. Further, the coded data created this way can be as much as 20% more accurate than the manual-only approach.

This computer-assisted approach also finds gaps in the documentation. In one nine-month period, Apixio reviewed 25,000 patient records and found 5,000 records that either did not record a disease or didn’t label it correctly. Correcting the data can only improve diagnoses and treatment programs.
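To make the idea of a documentation gap concrete, here is a minimal Python sketch. It is not Apixio’s method: the keyword table, the record layout, and the function name are all hypothetical, and simple phrase matching stands in for the much richer language processing a real system would need.

```python
# Hypothetical mapping from a phrase in free-text notes to a coded condition label.
# Illustrative only; this is not a real medical coding table.
CONDITION_KEYWORDS = {
    "type 2 diabetes": "diabetes",
    "congestive heart failure": "heart_failure",
    "copd": "copd",
}


def find_documentation_gaps(record):
    """Return conditions mentioned in the notes but missing from the coded list."""
    notes = record["notes"].lower()
    coded = set(record["coded_conditions"])
    mentioned = {
        label for phrase, label in CONDITION_KEYWORDS.items() if phrase in notes
    }
    return sorted(mentioned - coded)


# Two made-up patient records: free-text notes plus the conditions already coded.
records = [
    {
        "id": 1,
        "notes": "Patient has type 2 diabetes, well controlled.",
        "coded_conditions": ["diabetes"],
    },
    {
        "id": 2,
        "notes": "History of COPD and congestive heart failure.",
        "coded_conditions": ["copd"],
    },
]

gaps = {r["id"]: find_documentation_gaps(r) for r in records}
flagged = [record_id for record_id, missing in gaps.items() if missing]

print(gaps)
print(f"{len(flagged)} of {len(records)} records flagged")
```

In this toy data the second record is flagged because the notes mention heart failure that was never coded. Plain phrase matching would also wrongly flag a note such as “no history of COPD”, which is one reason real systems go well beyond keywords. For scale, the 5,000 flagged records out of 25,000 reported above work out to 20% of the records reviewed.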

Apixio does far more than produce studies that physicians can use to inform their treatment plans. It takes the next step. It reviews the healthcare records of each patient and develops personalized treatment plans based on a combination of the data it has collected for that patient and the results of its analyses of practice-based clinical data. This enables physicians to order only the tests that are useful and avoid expensive but worthless procedures.

This pays off handsomely for insurance companies that cover patients enrolled in Medicare Advantage plans. Under these plans, Medicare pays a “capitated payment.” This is a fixed payment per patient based on that patient’s expected healthcare costs. By tailoring the diagnostic tests and treatment programs to the individual, the insurer is able to reduce its costs dramatically. Those savings drop directly to the bottom line.
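The arithmetic behind a capitated payment is simple enough to sketch in a few lines of Python. The dollar figures and the function name below are hypothetical, chosen only to show why every dollar of avoided cost goes straight to the insurer’s margin when the payment is fixed.

```python
def capitation_margin(capitated_payment, actual_cost):
    """Amount the insurer keeps (positive) or loses (negative) on one patient."""
    return capitated_payment - actual_cost


# Hypothetical annual figures in dollars, for illustration only.
payment = 10_000
cost_with_routine_testing = 9_500
cost_with_tailored_testing = 8_000

margin_before = capitation_margin(payment, cost_with_routine_testing)
margin_after = capitation_margin(payment, cost_with_tailored_testing)

print(f"Margin with routine testing:  ${margin_before:,}")
print(f"Margin with tailored testing: ${margin_after:,}")
print(f"Improvement per patient:      ${margin_after - margin_before:,}")
```

Because the payment does not change with the care delivered, the improvement in margin equals the reduction in cost.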

It’s not just the insurance companies that benefit, though. Patients benefit as well. Patients are not required to undergo inconvenient or painful procedures that would provide no benefit.


Fitbit

Fitbit is the leader in the sale of wearable devices that track fitness metrics, although Apple is hot on its heels with its Apple Watch. Fitbit sold 11 million devices between its founding in 2007 and March 2014. These devices track fitness metrics such as activity, exercise, sleep, and calorie intake. The data collected daily can be synchronized with a cumulative database that allows users to track their progress over time.

The driving principle here is that people can improve their health and fitness if they can measure their activity, their diet, and the resulting outcomes over time. In other words, people need to be informed in order to make better fitness decisions. Fitbit provides users with progress reports presented in a preformatted dashboard. This dashboard tracks body fat percentage, body mass index (BMI), and weight, among other metrics.

Patients can share their data with their physicians to give them an ongoing record of their key healthcare parameters. This means that doctors are not forced to rely on the results of tests that they order on an infrequent basis. To be fair, however, not all physicians are willing to treat the data their patients collect on their own as being as credible as data collected in a clinical setting.

Insurance companies are prepared to adjust their premiums based on the extent to which their policyholders look after themselves as measured by Fitbit. To qualify, policyholders must share their Fitbit or Apple Watch data with the company. John Hancock already offers discounts to those who wear Fitbit devices, and the trend is likely to spread to other insurance companies.

The fastest-growing sub-market for Fitbit is employers. Employers can provide their employees with Fitbit devices to monitor their health and activity levels (with the employees’ permission).

The CDC and NIH

The Centers for Disease Control and Prevention (CDC) and the National Institutes of Health (NIH) are leaders in applying Big Data to identify epidemics, track the spread of those epidemics, and, in some cases, project how they are likely to spread.

The CDC tracks the spread of public health threats, including epidemics, through analyses of social media such as Facebook posts.

The NIH launched a project in 2012 it calls Big Data to Knowledge or BD2K. This project encourages initiatives to improve healthcare innovation by applying data analytics. The NIH website says, “Overall, the focus of the BD2K program is to support the research and development of innovative and transforming approaches and tools to maximize and accelerate the integration of Big Data and data science into biomedical research.”

A couple of years ago the CDC used Big Data to track the likely spread of the Ebola virus. It used BigMosaic, a Big Data analytics program that the CDC coupled with HealthMap. HealthMap is a database that maps census data and migration patterns. It shows where immigrants from various countries are likely to live, right down to the county or even the community level. When the CDC identifies countries where there is a public health problem, like the Ebola virus, it can link that census data showing the distribution of expat communities with airline schedules to determine how the disease is likely to spread in the US or even in other countries. This allows the CDC to track the spread of disease in near real time. In some cases, it can even project how diseases are likely to spread.
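The idea of linking community distribution data with airline schedules can be illustrated with a small Python sketch. This is not how BigMosaic or HealthMap actually work; the county names, airports, population figures, seat counts, and the proportional weighting rule are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical inputs, for illustration only.
# Residents born in the affected country, by US county.
expat_population = {"County A": 12_000, "County B": 3_000, "County C": 500}
# Nearest major airport for each county.
county_airport = {
    "County A": "Airport X",
    "County B": "Airport Y",
    "County C": "Airport X",
}
# Weekly scheduled seats arriving from the affected region, by airport.
weekly_inbound_seats = {"Airport X": 2_000, "Airport Y": 1_500}


def rank_counties_by_exposure(population, airports, seats):
    """Split each airport's inbound seats across its counties by expat share."""
    # Total expat population served by each airport.
    airport_totals = defaultdict(int)
    for county, people in population.items():
        airport_totals[airports[county]] += people

    # Each county receives a share of its airport's seats proportional
    # to its share of that airport's expat population.
    exposure = {}
    for county, people in population.items():
        airport = airports[county]
        share = people / airport_totals[airport]
        exposure[county] = seats.get(airport, 0) * share

    return sorted(exposure.items(), key=lambda item: item[1], reverse=True)


ranking = rank_counties_by_exposure(
    expat_population, county_airport, weekly_inbound_seats
)
for county, score in ranking:
    print(f"{county}: about {score:.0f} expected arrivals per week")
```

The output is a ranked list of counties, which is the kind of result that lets an agency decide where to focus attention first. A real model would also need to account for connecting flights, travel seasonality, and how the disease itself spreads.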

These Big Data applications merge data about weather patterns, climate, and even the distribution of poultry and swine. They present this data in a graphic form that makes it easier for epidemiologists to visualize how diseases are spreading geographically. The benefit, of course, is that the CDC and the World Health Organization can deploy their scarce resources to the areas where they can do the most good. They can do that because Big Data provides the tools to chart the spread of diseases by international travellers.

The CDC now uses Big Data linked with social media to forecast the spread of communicable diseases. Historically, the CDC tracked the spread of diseases as cases were observed and reported; forecasting how diseases will spread is a new ball game. The CDC ran competitions for research groups to develop Big Data models that accurately forecast the spread of diseases. The CDC received proposals for 28 systems. The two most successful were both submitted by Carnegie Mellon’s Delphi research group. These models are not predetermined but, instead, leverage machine learning to develop tailored models to forecast the specific spread of each disease.

The model is by no means perfect. The CDC gave the Carnegie Mellon model a score of 0.451, where 1.000 would be a perfect model. The average score for all 28 models was 0.430. That means that the model the CDC will use is the best available and much better than nothing, but it still has considerable room for improvement.

The Delphi group is studying the spread of dengue fever. It has plans to study the spread of HIV, Ebola, and Zika.

IBM and Watson Health

IBM is particularly proud of Watson, its artificial intelligence system on steroids. Although Watson has produced some stunning results, such as winning the TV game show Jeopardy! against the two best Jeopardy! contestants, our interest today is in healthcare.

Watson is machine learning at its finest. In the healthcare field, its managers feed it an ongoing stream of peer-reviewed research papers from medical journals and pharmaceutical data. Drawing on that Big Data knowledge base, Watson reviews individual patient records to suggest the most effective treatment programs for cancer patients. Watson’s suggestions are personalized to each patient.

Watson’s handlers don’t program the software to deliver predetermined outcomes. Instead, they apply Big Data algorithms to enable Watson to learn for itself based on the research it reviews as well as the diagnoses, treatment programs, and observed outcomes for individual patients.

IBM is partnering with Apple, Johnson & Johnson, and Medtronic to build and deploy a cloud-based service to provide personalized, tailored guidance to hospitals, insurers, physicians, researchers and even individual patients. This IBM offering is based on Watson – its remarkably successful system that integrates Big Data with machine learning to enable personalized healthcare on a massive scale.

Until now, IBM has used Watson in leading-edge medical centers including the University of Texas MD Anderson Cancer Center, the Cleveland Clinic, and the Memorial Sloan Kettering Cancer Center in New York. Given its successes to date, IBM is now ready to take its system mainstream and make it broadly available.
