Predicting Biomedical Document Access as a Function of Past Use

Author: Caleb Goodwin

Primary Advisor: Todd R. Johnson, PhD

Committee Members: Trevor Cohen, PhD; Jorge Herskovic, MD, PhD; Elmer V. Bernstam, MD, MSE, MS

Master's thesis, The University of Texas School of Biomedical Informatics at Houston.

Abstract:

Objective: To determine whether past access to biomedical documents can predict future document access.

Materials and Methods: We used one year of query logs from PubMed users in the Texas Medical Center, which is the largest medical center in the world. We evaluated two document access models based on the work of Anderson & Schooler. The first is based on how frequently a document was accessed. The second is based on both frequency and recency.

Results: The model based only on frequency of past access was highly correlated with the empirical data (0.932), whereas the model based on frequency and recency had a much lower correlation (0.668).

Discussion: The frequency-only model accurately predicted whether a document would be accessed based on its past use. Modeling accesses as a function of frequency requires storing only the number of accesses and the creation date for each document. The model therefore has low storage overhead and is computationally efficient, making it scalable to large corpora such as MEDLINE.

Conclusion: It is feasible to accurately model the probability of a document being accessed in the future based on past accesses.
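
Illustrative sketch: The storage claim in the Discussion (only an access count and a creation date per document) can be pictured with the short Python example below. It is a minimal sketch and not the fitted model from the thesis. The names DocumentStats, access_rate, and rank_by_past_use are hypothetical, and the score used here (accesses per day since creation) is an assumed stand-in for a frequency-based predictor.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DocumentStats:
    """Per-document state: a creation date and a running access count.

    Hypothetical structure for illustration; not taken from the thesis.
    """
    created: date
    access_count: int = 0

    def record_access(self) -> None:
        # Constant-time update: no per-access timestamps are stored.
        self.access_count += 1

    def access_rate(self, today: date) -> float:
        # Assumed frequency score: accesses per day since creation.
        # The age is clamped to one day to avoid division by zero.
        age_days = max((today - self.created).days, 1)
        return self.access_count / age_days


def rank_by_past_use(docs: dict[str, DocumentStats], today: date) -> list[str]:
    """Return document IDs ordered from highest to lowest access rate."""
    return sorted(
        docs,
        key=lambda doc_id: docs[doc_id].access_rate(today),
        reverse=True,
    )
```

Because each document carries only two values, the state grows linearly with the number of documents and each recorded access is a single increment. A model that also uses recency would need at least some record of when accesses occurred.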