Validating Values in a Medical Data Warehouse Using Statistical Tests

Authors: Jeff Claassen, MPAff

Primary Advisor: Dean F. Sittig, PhD (co-author)

Committee Members: Elmer Bernstam, MD, MSE (co-author)

Master's thesis, The University of Texas School of Biomedical Informatics at Houston.

Abstract:

Medical data warehouses are playing a larger role in health care organizations, and their data may contain errors introduced during data entry or the transfer process. Finding and correcting those errors is a crucial first step before the data can be used for analysis. A literature review found articles and books that mentioned the need for data validation but did not describe how the range or consistency checks were created and validated. To explore alternatives for identifying errors in medical data, six methods were applied to an extract of 2 million vital signs values from a medical data warehouse, and to a subset of those records that were flagged as errors. The two most successful error-identification methods were a delta check, which flagged weight changes of 10 percent or more, and a Tukey fence. Both methods produced a narrower range of acceptable values than the other four. The weight value errors were not normally distributed, which also favored these two methods. Future research should determine the typical distribution and other characteristics of errors in medical data, and test the six methods on a set of calculated errors, which would allow a wider range of performance metrics.
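
The two methods named above can be illustrated with a minimal Python sketch. It is not the thesis's implementation, and several details are assumptions the abstract does not specify: the Tukey fence uses the conventional multiplier k = 1.5 and the "inclusive" quartile method, the delta check compares each weight with the immediately preceding reading for the same patient, and the function names (`tukey_fence`, `flag_tukey`, `flag_delta`) are hypothetical.

```python
from statistics import quantiles


def tukey_fence(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR.

    k = 1.5 is the conventional choice and an assumption here;
    the abstract does not state which multiplier was used.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_tukey(values, k=1.5):
    """Flag each value that falls outside the Tukey fences."""
    lower, upper = tukey_fence(values, k)
    return [v < lower or v > upper for v in values]


def flag_delta(weights, threshold=0.10):
    """Flag each weight whose relative change from the previous
    reading is at least `threshold` (10 percent by default).

    Assumes `weights` is one patient's readings in time order.
    The first reading has no predecessor and is never flagged.
    A non-positive previous value cannot serve as a baseline,
    so the reading that follows it is flagged for review.
    """
    if not weights:
        return []
    flags = [False]
    for prev, curr in zip(weights, weights[1:]):
        if prev <= 0:
            flags.append(True)
        else:
            flags.append(abs(curr - prev) / prev >= threshold)
    return flags


if __name__ == "__main__":
    # Hypothetical weights in kg; 180.0 stands in for a data entry error.
    weights = [80.0, 81.0, 180.0, 80.5, 79.0]
    print(tukey_fence(weights))
    print(flag_tukey(weights))
    print(flag_delta(weights))
```

In this made-up series the delta check flags both the erroneous reading and the correct reading that follows it, because the return to a plausible weight is also a change of more than 10 percent. A Tukey fence looks only at the distribution of values, so it flags the outlier alone.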