Skills in Data Science and Analytics: The Next Frontier for the Medical Laboratory Scientist?

By Tarush Kothari, Lynda Leopold, Tiffany Yu and Gaurav Sharma - November 16, 2022


Since its inception, the laboratory has been the beneficiary of innovation and advances in other fields, especially chemistry, biological sciences, and computer technology. While the complexity has grown many-fold, the work continues to be managed by just a handful of well-trained and dedicated medical laboratory scientists. They are able to do so thanks to a diverse set of skills and well-rounded training in the science and art of laboratory medicine.

The advent of data science and analytics is the next chapter in the growth of information technology and informatics beyond the traditional laboratory information system (LIS). Medical laboratory scientists find themselves surrounded by rapid advances in computing and informational sciences. Never before has it been possible to manage so much (with so little) in the laboratory.

The Promise of Data Science
The advent of web-based services and technologies has enabled an interconnected world that can create, capture, collate, catalogue, and exchange vast amounts of data across information systems. This capability has created entirely new sectors of industry (eg, social media and location-based services) that focus on mining this data for previously unattainable real-time business information.

However, we would posit that (at least in the healthcare sector) such capabilities to capture data and act on information are not a recent development. The medical laboratory has performed this function (albeit at a smaller, local level) for the benefit of patients and doctors for decades. The advent of the first fully functional database systems in the 1970s led to the development of electronic LISs and electronic medical record (EMR) systems that can now serve all sizes of laboratories and hospitals.

The Affordable Care Act of 2010 and other healthcare information laws (such as the HITECH Act) resulted in rapid and near-complete computerization of medical records across the nation. For the first time in the history of humankind, an entire (or almost entire) nation’s medical records have been made available in a format that allows them to be mined for actionable patient-level information and refined population-level knowledge. A large majority (some research says 70 percent) of these healthcare data are laboratory data.1

This unique set of circumstances (computerization of records, advances in artificial intelligence, and an urgent emphasis on decreasing costs while increasing quality) presents unprecedented opportunities for medical laboratory scientists and pathologists. Our goal is to present an abstracted view of the synergies between data science and laboratory science, and of how medical laboratory professionals can and should harness this combination.

Data Science and Analytics Fundamentals
To achieve mastery of data science and analytics in the laboratory context, one must understand a few fundamentals: big data, data mining and analytics, and how to choose the technology used to perform them.

Big Data
“Big data” is a term used to describe very large datasets that may be amenable to computational analysis to reveal patterns, trends, and associations. Such a dataset is said to possess three characteristics: volume, velocity, and variety.2 We submit that laboratory data are unquestionably a type of big data in the healthcare setting. From laboratory data, one can easily mine and retrieve longitudinal information (ie, volume) on clinical diagnoses, order sets, and reported results (ie, variety) as soon as they are reported (ie, velocity). The advantage of laboratory data is that they are often numerical, temporal, objective, and comparable.

Claims vs Laboratory Data
Even a single laboratory claim provides a vast amount of information: the specific test performed, when it was performed, charge and reimbursement data, assessment of costs, details of the ordering provider, and most importantly the insurance profile of the patient. However, such financially based information has inherent limitations. It provides no information on laboratory result values or granular procedural information. Another major limiting factor is timeliness of the claims data that insurance companies use. These data may be three to six months old and therefore have a low probability of impacting patient care.

In contrast, laboratory data can be retrieved in a matter of hours or days, provide a detailed clinical profile of the patient, and be acted on immediately. Thus, insurance companies are increasingly using laboratory results data in combination with claims data for population health management to risk-stratify patients, improve coordination of care, and reduce costs. With the move from fee-for-service to value-based delivery models, providers (including laboratory professionals) will be at increasing financial risk as they manage the health of populations while simultaneously improving quality and reducing the cost of care.

To stay relevant in the changing healthcare delivery landscape, medical laboratory scientists and pathologists must not only be generators of data but also have additional skills in data science and analytics to solve more complex healthcare challenges that are traditionally viewed as outside the laboratory domain.

For example, skills in data science will enable laboratory professionals to create algorithms that can link clinical laboratory data with other datasets, such as insurance claims or pharmacy data. Such combined datasets can provide a wealth of information that cannot be provided by laboratory results alone. Laboratory professionals will then be able to provide insight into clinical and financial activity at both the patient and population level. As the role of data science expands, laboratory professionals will be critical to the adoption of standardized vocabularies for laboratory test information (eg, LOINC, SNOMED CT). This will improve interoperability and open innovative opportunities for secondary use of laboratory data. They can also become knowledgeable in the reporting of important quality metrics (eg, HEDIS, pay-for-performance) that are based on laboratory results data and help identify cohorts for disease management and even clinical trial enrollment.
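As a rough illustration of the kind of linkage described above, the following Python sketch joins laboratory results to per-patient claim totals on a shared patient identifier. All records, field names (patient_id, billed_usd), and the function name link_by_patient are made up for illustration; a real linkage would also have to handle identity matching, de-identification, and data-use agreements.

```python
from collections import defaultdict

# Made-up records; every field name here is an assumption for illustration.
lab_results = [
    {"patient_id": "P001", "test": "HbA1c", "value": 9.4, "date": "2022-03-01"},
    {"patient_id": "P002", "test": "HbA1c", "value": 5.6, "date": "2022-03-02"},
    {"patient_id": "P003", "test": "creatinine", "value": 1.9, "date": "2022-03-05"},
]
claims = [
    {"patient_id": "P001", "payer": "Plan A", "billed_usd": 42.00},
    {"patient_id": "P001", "payer": "Plan A", "billed_usd": 18.50},
    {"patient_id": "P003", "payer": "Plan B", "billed_usd": 12.00},
]


def link_by_patient(labs, claim_rows):
    """Left-join lab results to per-patient claim totals on patient_id."""
    totals = defaultdict(float)
    for claim in claim_rows:
        totals[claim["patient_id"]] += claim["billed_usd"]

    linked = []
    for lab in labs:
        row = dict(lab)  # copy so the input records are not modified
        # None marks a patient who has lab data but no matching claim.
        row["total_billed_usd"] = totals.get(lab["patient_id"])
        linked.append(row)
    return linked


linked = link_by_patient(lab_results, claims)
for row in linked:
    print(row["patient_id"], row["test"], row["value"], row["total_billed_usd"])
```

Keeping unmatched patients in the output (rather than dropping them) is a deliberate choice in this sketch, since a result with no claim can itself be informative.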

Laboratory data, when normalized, aggregated, and analyzed can give potential insight into many disease conditions, allowing for opportunities for early targeted intervention at a population level. Listed below are some examples of how laboratory data analytics can be used to improve quality of care.

  • Serum creatinine and eGFR data can be aggregated to identify prevalence and severity of chronic kidney disease in ambulatory patients. When this information is tracked proactively and shared with ordering providers, an earlier and more targeted intervention is made possible. This in effect shifts the focus of laboratory testing from reactive to proactive prevention strategies.3
  • Pap smear and HPV genotyping results can be aggregated and used to identify a population of patients with abnormal cervical cancer screening results. The lab can then work with providers to schedule timely follow-up for repeat testing and track disease progression. Timely screening and treatment for cervical cancer is an important quality criterion in various pay-for-performance metrics, which directly impacts hospital reimbursement and quality ratings. Laboratories have access to this information and can work closely with clinicians to reduce gaps in care while simultaneously improving public health.
  • HbA1c results data, when aggregated by zip code or practice location, can identify populations of patients with poorly controlled diabetes and even prediabetes. Laboratories can play an important role in collating existing datasets on HbA1c and glucose values and bringing them to the attention of clinicians caring for these at-risk patients. In the long term, such lab-led community disease management efforts can help prevent progression of and complications from diabetes. HbA1c testing is also an important pay-for-performance measure for ambulatory practices and Accountable Care Organizations.
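The HbA1c example in the list above can be sketched in a few lines of Python. The data are invented, and the cut-offs are assumptions added here rather than values from this article: 6.5% and 5.7% are commonly cited diagnostic thresholds for diabetes and prediabetes, and 9.0% is a threshold often used for "poor control" in quality reporting. A laboratory would substitute its own locally agreed criteria.

```python
from collections import defaultdict


def classify_hba1c(percent):
    """Bucket an HbA1c value (%) using illustrative, assumed cut-offs."""
    if percent > 9.0:
        return "poor_control"
    if percent >= 6.5:
        return "diabetes_range"
    if percent >= 5.7:
        return "prediabetes_range"
    return "normal_range"


# Made-up (zip_code, HbA1c %) pairs, one per patient.
results = [
    ("10001", 9.8), ("10001", 7.1), ("10001", 5.4),
    ("11201", 5.9), ("11201", 10.2), ("11201", 9.3),
]


def summarize_by_zip(rows):
    """Count results per category within each zip code."""
    counts = defaultdict(lambda: defaultdict(int))
    for zip_code, value in rows:
        counts[zip_code][classify_hba1c(value)] += 1

    summary = {}
    for zip_code, cats in counts.items():
        total = sum(cats.values())
        summary[zip_code] = {
            "n": total,
            "poor_control_share": cats["poor_control"] / total,
        }
    return summary


summary = summarize_by_zip(results)
for zip_code, stats in sorted(summary.items()):
    print(zip_code, stats["n"], round(stats["poor_control_share"], 2))
```

In practice the input would first be reduced to one result per patient (for example, the most recent value), so that frequently tested patients do not dominate a zip code's summary.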

Many therapeutic agents used in hospitals rely on timely laboratory monitoring. Clinicians face the challenge of synthesizing and correlating vast amounts of laboratory result and drug data. Linking laboratory data (eg, toxicology reports) to pharmacological data (eg, opioid prescription rates) enables real-time clinical decision support for the frontline provider.
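One simple form such a link could take is a rule that flags patients on a monitored drug whose monitoring test is missing or out of date. The Python sketch below assumes a hypothetical drug-to-test map and an arbitrary 30-day window, neither of which comes from this article; actual monitoring rules would be set by local clinical policy.

```python
from datetime import date, timedelta

# Hypothetical drug-to-monitoring-test map and window (assumptions only).
MONITORING_TEST = {"warfarin": "INR", "vancomycin": "vancomycin level"}
MAX_AGE = timedelta(days=30)

# Made-up pharmacy and laboratory records.
prescriptions = [("P001", "warfarin"), ("P002", "warfarin"), ("P003", "vancomycin")]
lab_results = [
    ("P001", "INR", date(2022, 11, 10)),
    ("P002", "INR", date(2022, 9, 1)),
]


def overdue_monitoring(rx_rows, lab_rows, today):
    """Return (patient, drug, test) for each missing or stale monitoring test."""
    # Most recent collection date per (patient, test).
    latest = {}
    for patient_id, test, drawn in lab_rows:
        key = (patient_id, test)
        if key not in latest or drawn > latest[key]:
            latest[key] = drawn

    alerts = []
    for patient_id, drug in rx_rows:
        test = MONITORING_TEST.get(drug)
        if test is None:
            continue  # drug is not in the monitoring map
        last = latest.get((patient_id, test))
        if last is None or today - last > MAX_AGE:
            alerts.append((patient_id, drug, test))
    return alerts


alerts = overdue_monitoring(prescriptions, lab_results, today=date(2022, 11, 16))
for alert in alerts:
    print(alert)
```

With these invented records, P001 has a recent INR and is not flagged, while P002 (stale INR) and P003 (no level on file) are.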

Data Mining and Data Analytics
The terms “data mining” and “data analytics” are often used interchangeably, but they differ in purpose. Data mining processes data to discover previously unknown patterns and associations and so generate a hypothesis, whereas data analytics processes data to test a hypothesis and study known patterns and associations.
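The distinction can be illustrated with a toy Python example on made-up test orders. The first function "mines" the orders by counting every pair of tests ordered together, with no prior hypothesis; the second checks one stated hypothesis (that orders containing a given test usually also contain glucose). The data and function names are hypothetical, and real projects would use far larger datasets and formal statistical tests.

```python
from collections import Counter
from itertools import combinations

# Made-up test orders: one set of test names per order.
orders = [
    {"HbA1c", "glucose", "creatinine"},
    {"HbA1c", "glucose"},
    {"creatinine", "eGFR"},
    {"HbA1c", "glucose", "lipid panel"},
]


def mine_co_ordered_pairs(order_sets):
    """Mining: count every pair of co-ordered tests, with no prior hypothesis."""
    pairs = Counter()
    for tests in order_sets:
        # Sort so that each pair is always counted under the same key.
        pairs.update(combinations(sorted(tests), 2))
    return pairs


def share_with_glucose(order_sets, given_test):
    """Analytics: measure how often orders with given_test also include glucose."""
    subset = [tests for tests in order_sets if given_test in tests]
    if not subset:
        return None  # hypothesis cannot be evaluated without matching orders
    return sum(1 for tests in subset if "glucose" in tests) / len(subset)


pair_counts = mine_co_ordered_pairs(orders)
top_pair, top_count = pair_counts.most_common(1)[0]
print("most frequent pair:", top_pair, top_count)
print("share of HbA1c orders with glucose:", share_with_glucose(orders, "HbA1c"))
```

Here the mining step surfaces a pattern (HbA1c and glucose are frequently co-ordered), and the analytics step quantifies that specific, now-stated relationship.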

Data Science and Analytics Technology
To design, deploy, and maintain a robust data science and analytics solution, an organization (such as a laboratory) might consider three very different needs. The first is the need to identify the correct information technology platform. This may include the type of computers (single vs multiple), network (local vs remote), and data warehousing (local vs cloud). The organization must match hardware and technology to operational needs, confidentiality, available resources, and intended use.

Next, the organization may need to identify the type of database management system that runs its existing information systems and match it to the platforms that it intends to acquire for analytics. At a very basic level, flat database structures are fast, but very large in size and difficult to update over time. In contrast, relational databases are smaller in size and easier to maintain, but may be a bit more complex to program and operate.
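To make the contrast concrete, the Python sketch below loads the same made-up results first as flat rows (patient details repeated on every line) and then into two related tables using the standard-library sqlite3 module. Table and column names are hypothetical and are not taken from any particular LIS.

```python
import sqlite3

# Flat structure: every row repeats the patient's name and zip code.
flat_rows = [
    ("P001", "Ada Example", "10001", "HbA1c", 9.4),
    ("P001", "Ada Example", "10001", "creatinine", 1.1),
    ("P002", "Bo Sample", "11201", "HbA1c", 5.6),
]

# Relational structure: patient details are stored once and referenced by key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient (
        patient_id TEXT PRIMARY KEY,
        name       TEXT NOT NULL,
        zip_code   TEXT NOT NULL
    );
    CREATE TABLE result (
        result_id  INTEGER PRIMARY KEY,
        patient_id TEXT NOT NULL REFERENCES patient(patient_id),
        test       TEXT NOT NULL,
        value      REAL NOT NULL
    );
""")

for pid, name, zip_code, test, value in flat_rows:
    # OR IGNORE skips the insert when the patient row already exists.
    conn.execute("INSERT OR IGNORE INTO patient VALUES (?, ?, ?)",
                 (pid, name, zip_code))
    conn.execute("INSERT INTO result (patient_id, test, value) VALUES (?, ?, ?)",
                 (pid, test, value))

# An address change touches one patient row, not every result row.
conn.execute("UPDATE patient SET zip_code = ? WHERE patient_id = ?",
             ("10002", "P001"))

# The join is the extra step that makes the relational form more complex to query.
rows = conn.execute("""
    SELECT p.patient_id, p.zip_code, r.test, r.value
    FROM result AS r
    JOIN patient AS p ON p.patient_id = r.patient_id
    ORDER BY r.result_id
""").fetchall()
for row in rows:
    print(row)
```

The flat list can be read with no join at all, which is why it is quick to query, while the relational form avoids repeating patient details and makes updates cheaper at the cost of the join.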

Lastly, resources and attention must be directed to determining the visualization technique that would display information synthesized from data. It can range from simple printed tables to multicolored graphs with many variables; nevertheless, form should defer to functionality, and the latter should defer to the business need.

Career Considerations for the Laboratory Professional
A combination of factors, such as the closing of educational programs nationwide and hospital consolidation resulting in decreased internship opportunities, has created a workforce shortage that has become critical in some areas of the country. A logical response to managing more volume (stemming from an aging and more chronically ill population) with fewer staff is the use of automation combined with more robust informatics capabilities.

Medical laboratory scientists have always been in a unique position to use laboratory-generated data, as they are “on the front line” of reporting each patient’s results and able to detect variations as they arise. Increasingly, however, computer algorithms are assuming a prominent role in the way medical laboratories manage and assess the flow of information into and out of the laboratory. As the healthcare needs of the population change, the role of the medical laboratory scientist may move beyond analyzing and reporting individual results to employing data analytics to identify correlations and predict outcomes.


  1. Forsman RW. The value of the laboratory professional in the continuum of care. Clin Leadersh Manag Rev. 2002 Nov-Dec;16(6):370-3.
  2. Laney D. 3D data management: controlling data volume, velocity and variety. Gartner website. 2001.
  3. Crawford JM, Shotorbani K, Sharma G, et al. Improving American healthcare through “Clinical Lab 2.0”: a Project Santa Fe report. Acad Pathol. 2017 Apr 18;4:2374289517701067.