Abstract

AUTOMATED DIAGNOSIS EXTRACTION FROM ELECTRONIC MEDICAL RECORDS WITH MACHINE LEARNING CLASSIFIERS

Full text
Background: While Electronic Medical Records (EMR) constitute a rich resource for research into various diseases, their unstructured format often poses practical challenges. For instance, retrieval of the records belonging to all patients with a particular outcome is often accomplished with naïve methods such as exact word matching. A more advanced alternative is to employ Machine Learning (ML) methods for text classification. Rather than requiring a hand-written set of rules, an ML-model learns these rules by itself, given sufficient example records with known annotations.
Objectives: To build a reliable classifier with machine learning techniques that can identify Rheumatoid Arthritis (RA) cases in provided EMR entries.
Methods: Data were acquired from the HiX-EMR database, which comprised 2,771 patients who visited the rheumatology outpatient clinic of the Leiden University Medical Centre between 2007 and 2018. This database featured a total of 38,216 entries. The first visit entry (if available) was selected per patient for annotation, resulting in a total of 1,361 entries. The annotated sample was then randomly split into equally sized training and test sets. Both sets were preprocessed and then classified with the following methods: exact word matching, Naïve Bayes (NB), Decision Tree, Gradient Boosting (GB), Neural Networks and Support Vector Machines (SVM), see table 1 for more information. Classification by the naïve word-matching model was based on the presence of the Dutch RA-defining terms ‘Reumatoïde Artritis’ and ‘RA’. Default Scikit-learn implementations [1] were used to create the ML-models. Finally, the performance of the models was evaluated with a receiver operating characteristic (ROC) curve analysis via the pROC R-package [2]. The DeLong test was used to determine the 95% confidence intervals (CI) and to test the difference in performance between the word-matching method and the ML-models.
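The comparison described in the Methods can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the Dutch entries and labels are invented toy examples, the bag-of-words step (CountVectorizer) stands in for the unspecified preprocessing, only two of the listed classifiers are shown, a fixed random seed is added for reproducibility, and Scikit-learn's roc_auc_score replaces the pROC and DeLong analysis, so no confidence intervals or p-values are produced.

```python
import re

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Naive baseline: flag an entry if it contains one of the RA-defining terms.
# Matching the abbreviation case-sensitively and as a whole word is an assumption.
FULL_TERM = re.compile(r"reumato[ïi]de\s+artritis", re.IGNORECASE)
ABBREVIATION = re.compile(r"\bRA\b")


def word_match(entry):
    """Return 1 if the entry mentions an RA-defining term, else 0."""
    return int(bool(FULL_TERM.search(entry) or ABBREVIATION.search(entry)))


# Invented toy entries (NOT study data); label 1 = annotated as RA.
entries = [
    "Patiënt met reumatoïde artritis, start methotrexaat.",
    "RA bevestigd, anti-CCP positief.",
    "Symmetrische polyartritis handen, past bij reumatoide artritis.",
    "Zwelling MCP en PIP gewrichten, anti-CCP positief, start methotrexaat.",
    "Erosieve artritis polsen, reumafactor positief, methotrexaat gestart.",
    "Ochtendstijfheid, zwelling MCP gewrichten, anti-CCP positief.",
    "Geen aanwijzingen voor RA, artrose van de knie.",
    "Jicht, start allopurinol.",
    "Fibromyalgie, geen artritis bij onderzoek.",
    "Artrose heup, verwijzing fysiotherapie.",
    "Artritis psoriatica, huidafwijkingen aanwezig.",
    "Pijn knie na trauma, geen zwelling, artrose.",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Random split into equally sized training and test sets.
train_x, test_x, train_y, test_y = train_test_split(
    entries, labels, test_size=0.5, stratify=labels, random_state=0
)

# The baseline needs no training; its 0/1 output is scored directly.
aucs = {"word matching": roc_auc_score(test_y, [word_match(e) for e in test_x])}

# Default Scikit-learn models, apart from the seed on Gradient Boosting.
models = {
    "NB": MultinomialNB(),
    "GB": GradientBoostingClassifier(random_state=0),
}
for name, classifier in models.items():
    pipeline = make_pipeline(CountVectorizer(), classifier)
    pipeline.fit(train_x, train_y)
    ra_probability = pipeline.predict_proba(test_x)[:, 1]
    aucs[name] = roc_auc_score(test_y, ra_probability)

for name, auc in aucs.items():
    print(f"{name}: AUC = {auc:.2f}")
```

With so few toy entries the printed AUC values carry no meaning; the sketch only shows how a rule-based baseline and trained classifiers can be scored on the same held-out set.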
Results: The exact word-matching approach resulted in an area under the curve (AUC) of 0.76 (CI: 0.7265-0.7783), see figure. Most ML-models achieved relatively high AUC-scores (CI): NB = 0.83 (0.80-0.86), SVM = 0.91 (0.89-0.93), Neural Networks = 0.92 (0.90-0.94) and GB = 0.94 (0.92-0.96). The Decision Tree showed the worst performance, with an AUC of only 0.51 (0.49-0.56). Compared with the exact word-matching ROC-curve, all the ML-models showed a significant difference: Decision Tree (p<2.2e-16), NB (p=0.004), Neural Networks (p<2.2e-16), GB (p<2.2e-16) and SVM (p=4.0e-16).
Conclusion: The Gradient Boosting, Neural Networks, SVM and Naïve Bayes models all showed significantly better performance than naïve exact word matching, which establishes these ML-methods as an efficient approach for data extraction from EMR.
REFERENCES: [1] Pedregosa, F. et al. JMLR (2011) 12: 2825-2830. [2] Robin, X. et al. BMC Bioinformatics (2011) 12: 77.
Disclosure of Interests: Tjardo Maarseveen: None declared, Thomas Huizinga Consultant for: Merck, UCB, Bristol Myers Squibb, Biotest AG, Pfizer, GSK, Novartis, Roche, Sanofi-Aventis, Abbott, Crescendo Bioscience Inc., Nycomed, Boehringer, Takeda, Zydus, Epirus, Eli Lilly, Marcel Reinders: None declared, Erik van den Akker: None declared, Rachel Knevel: None declared
DOI: 10.1136/annrheumdis-2019-eular.2408
Citation: Ann Rheum Dis, volume 78, supplement 2, year 2019, page A1388
Session: Epidemiology, risk factors for disease or disease progression (Scientific Abstracts)
