Using Claims Data

Administrative data from health insurance claims are a powerful resource for improving population health and addressing issues of cost, quality, and outcomes. Health care is a data-intensive industry. Information is collected routinely for clinical purposes as part of every health care encounter, and health care data are also created for a variety of other purposes, including payment via submitted claims. Claims data include encounter-level information on diagnoses, treatments, and billed and paid amounts. Clinical data from electronic health records (EHRs) are critical for analyses to improve health care delivery. Claims data can effectively complement EHR data by providing a broad view of a patient’s interactions across the continuum of the health care system, reducing selection bias, and providing access to large and diverse samples (Stein et al., 2014).

Despite their importance, claims data present many challenges. One is assessing data quality and accounting for incomplete or missing data. Others include integrating data from multiple sources and developing methods for describing utilization or appropriateness of care (Stein et al., 2014). Technical challenges in creating specific datasets from claims data include:

  • Converting claims into unique visits
  • Identifying incomplete claims data
  • Categorizing providers and locations of service
  • Selecting the most useful measures of utilization and expenditures (Tyree, Lind and Lafferty, 2006)
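
The first challenge in the list above, converting claims into unique visits, can be sketched in a few lines of Python. This is a minimal illustration, not a standard method: it assumes that one visit is every claim line sharing the same beneficiary, provider, and service date, and the field names (`bene_id`, `provider_id`, `service_date`, `paid_amount`) are hypothetical. Real projects often need different rules, for example for stays that span several days.

```python
from collections import defaultdict


def claims_to_visits(claim_lines):
    """Collapse claim lines into visits keyed by (beneficiary, provider, date).

    The visit definition used here is an assumption made for illustration.
    """
    visits = defaultdict(lambda: {"n_lines": 0, "paid": 0.0})
    for line in claim_lines:
        key = (line["bene_id"], line["provider_id"], line["service_date"])
        visits[key]["n_lines"] += 1
        visits[key]["paid"] += float(line["paid_amount"])
    return dict(visits)


# Made-up claim lines for illustration only (not DE-SynPUF records).
sample_lines = [
    {"bene_id": "A1", "provider_id": "P9", "service_date": "2009-03-02", "paid_amount": "40.00"},
    {"bene_id": "A1", "provider_id": "P9", "service_date": "2009-03-02", "paid_amount": "60.00"},
    {"bene_id": "A1", "provider_id": "P9", "service_date": "2009-04-15", "paid_amount": "25.00"},
]

visits = claims_to_visits(sample_lines)
for key, summary in sorted(visits.items()):
    print(key, summary)
```

Here three claim lines collapse into two visits, because the first two lines share a beneficiary, provider, and date.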

The purpose of this resource is to provide examples of analyzing claims data. Specifically, the resource provides explanations and videos on the use of synthetic claims data developed by the Centers for Medicare & Medicaid Services (CMS) and instructions on how to acquire and use the data.

DE-SynPUF Overview

The Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF) is a set of realistic claims data from 2008 through 2010 made available by CMS. The dataset is synthetic: it is derived from real Medicare claims but has been altered to protect beneficiaries' identities. It is intended to support training in data analysis and data mining, and the development of software that may increase the knowledge gained from claims data in practice.

The DE-SynPUF consists of five types of administrative data linked by a unique patient-level identifier: beneficiary summary, inpatient claims, outpatient claims, carrier claims, and prescription drug events. The dataset is based on a 5 percent sample of Medicare beneficiaries in 2008, and the total sample includes over 100 million records across the three years sampled.
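
Linking the file types through the patient-level identifier can be sketched as follows. This is an illustrative sketch only. It assumes the identifier column is named `DESYNPUF_ID`, and the other column names shown (`BENE_SEX_IDENT_CD`, `CLM_ID`, `CLM_PMT_AMT`) are likewise assumptions that should be checked against the CMS codebook for the files you download. The records are made up.

```python
def link_claims_to_beneficiaries(beneficiaries, claims, id_field="DESYNPUF_ID"):
    """Attach each claim to its beneficiary record using the shared identifier.

    Returns (linked, unmatched); unmatched holds claims with no beneficiary row.
    """
    by_id = {row[id_field]: row for row in beneficiaries}
    linked, unmatched = [], []
    for claim in claims:
        bene = by_id.get(claim[id_field])
        if bene is None:
            unmatched.append(claim)
        else:
            # Merge the two records; the identifier is the same in both.
            linked.append({**bene, **claim})
    return linked, unmatched


# Made-up rows for illustration; column names are assumed, not confirmed.
beneficiaries = [
    {"DESYNPUF_ID": "X1", "BENE_SEX_IDENT_CD": "1"},
    {"DESYNPUF_ID": "X2", "BENE_SEX_IDENT_CD": "2"},
]
outpatient_claims = [
    {"DESYNPUF_ID": "X1", "CLM_ID": "1001", "CLM_PMT_AMT": "50"},
    {"DESYNPUF_ID": "X1", "CLM_ID": "1002", "CLM_PMT_AMT": "70"},
    {"DESYNPUF_ID": "X3", "CLM_ID": "1003", "CLM_PMT_AMT": "20"},
]

linked, unmatched = link_claims_to_beneficiaries(beneficiaries, outpatient_claims)
print(len(linked), "linked;", len(unmatched), "unmatched")
```

Keeping the unmatched claims in a separate list, rather than silently dropping them, makes it easier to spot incomplete data early.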

To acquire the DE-SynPUF data, go to the DE-SynPUF website and choose the data you want to download. The data are segmented into 20 unique samples. When you click on a sample, you can choose to download all of the datasets for that sample of beneficiaries. The video below offers an example of how to interact with the website and download a sample of the data.
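
Once a sample has been downloaded, its rows can be read with Python's standard library. The sketch below assumes each download is a zip archive containing a comma-separated file with a header row; that packaging is an assumption and should be confirmed against the files you actually receive. The helper name `read_zipped_csv` is hypothetical.

```python
import csv
import io
import zipfile


def read_zipped_csv(zip_source):
    """Yield each row of the first CSV file in a zip archive as a dict.

    zip_source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("No CSV file found in the archive")
        with archive.open(csv_names[0]) as raw:
            # Decode bytes to text so csv.DictReader can parse the header row.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            yield from csv.DictReader(text)
```

A call such as `for row in read_zipped_csv("path/to/downloaded_sample.zip"):` (a placeholder path) then processes one claim at a time without loading the whole file into memory, which matters for the larger files.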

Analyzing Claims Data

Claims data are a rich source of information on diagnoses, procedures, and utilization. Numerous analyses can be conducted on claims data to inform decision making. Claims data can be used to compare prices of health care services at local, state, regional, or national levels, and to compare the services that specific providers or health care organizations deliver for specific diagnoses (or combinations of diagnoses). They can also be used to evaluate the quality of care provided by health care providers. According to the Pew Charitable Trusts, “claims data can reveal whether a doctor followed nationally recommended medical protocols for treating patients diagnosed with diabetes. How many received quarterly exams? Did they receive an eye exam? How many were admitted to a hospital?” (Vestal, 2014).

In the video below, the CMS DE-SynPUF claims data serve as an example of how claims data can be used for population health analytics. The exercise uses the outpatient claims data to identify high outpatient utilizers and examine their economic impact.
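
As a rough sketch of this kind of analysis (not the method shown in the video), the code below counts outpatient claims and sums payments per beneficiary, then flags beneficiaries at or above a claim-count threshold. The column names `DESYNPUF_ID` and `CLM_PMT_AMT` are assumed and should be checked against the CMS codebook. The threshold and the records are made up for illustration.

```python
from collections import defaultdict


def summarize_utilization(claims, id_field="DESYNPUF_ID", paid_field="CLM_PMT_AMT"):
    """Return {beneficiary_id: (claim_count, total_paid)}."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for claim in claims:
        bene_id = claim[id_field]
        counts[bene_id] += 1
        # Treat a blank payment amount as zero.
        totals[bene_id] += float(claim[paid_field] or 0)
    return {b: (counts[b], totals[b]) for b in counts}


def high_utilizers(summary, min_claims):
    """Beneficiaries with at least min_claims claims, highest total paid first."""
    flagged = [(b, n, paid) for b, (n, paid) in summary.items() if n >= min_claims]
    return sorted(flagged, key=lambda item: item[2], reverse=True)


# Made-up outpatient claims for illustration only.
claims = [
    {"DESYNPUF_ID": "A", "CLM_PMT_AMT": "100"},
    {"DESYNPUF_ID": "A", "CLM_PMT_AMT": "50"},
    {"DESYNPUF_ID": "A", "CLM_PMT_AMT": "50"},
    {"DESYNPUF_ID": "B", "CLM_PMT_AMT": "30"},
    {"DESYNPUF_ID": "C", "CLM_PMT_AMT": "500"},
    {"DESYNPUF_ID": "C", "CLM_PMT_AMT": "20"},
]

summary = summarize_utilization(claims)
flagged = high_utilizers(summary, min_claims=2)  # arbitrary threshold

# Economic impact: share of all payments attributable to flagged beneficiaries.
total_paid = sum(paid for _, paid in summary.values())
share = sum(paid for _, _, paid in flagged) / total_paid
print(flagged, round(share, 2))
```

What counts as a high utilizer is a design choice. A fixed claim count is the simplest rule, while a percentile cutoff or a payment-based threshold may suit other questions better.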

This project is/was supported by the Health Resources and Services Administration (HRSA) of the U.S. Department of Health and Human Services (HHS) under grant number UB1RH24206, Information Services to Rural Hospital Flexibility Program Grantees, $1,009,121 (0% financed with nongovernmental sources). This information or content and conclusions are those of the author and should not be construed as the official position or policy of, nor should any endorsements be inferred by HRSA, HHS or the U.S. Government.