Natural Language Processing for Public Health
Coding OLIS Microbiology Data

Kevin Brown
Scientist, Public Health Ontario
Assistant Professor, University of Toronto
Adjunct Scientist, ICES

Ontario Lab Information System

Currently covers 2007 to fall 2015
  2,892 unique test result types
  202 Ontario labs
  1 billion test orders
  2.2 billion test results

Text result field

Comment:\.br\The presence of C. difficile DNA is not diagnostic of\.br\infection and should be interpreted in conjunction with\.br\clinical presentation.\.br\ \.br\ Duplicate specimens received within 7 days of a NEGATIVE\.br\test or within 14 days of a POSITIVE test will be rejected. C. difficile Cytotoxin (PCR) Positive by RT PCR\.br\ **Infection Control Additional Precautions - Contact**\.br\ \.br\ \.br\ Phoned to (416) 123-4567 \.br\ Date: Apr 3,2015 \.br\ At: 14:10\.br\ By: John Doe

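The example above shows why a plain keyword search is not enough: the boilerplate comment mentions both NEGATIVE and POSITIVE before the actual result appears. Below is a minimal Python sketch of one way to handle such a field, assuming that `\.br\` is the HL7 line-break escape and that the result is stated in the form "Positive by ..." or "Negative by ...". The function names and the rule itself are hypothetical illustrations, not the method used in the project.

```python
import re
from typing import List, Optional

# Assumption: "\.br\" is the HL7 v2 escape sequence for a line break.
HL7_LINE_BREAK = "\\.br\\"

# Hypothetical rule: the result is stated as "<Positive|Negative> by <method>".
RESULT_PATTERN = re.compile(r"\b(positive|negative)\s+by\b", re.IGNORECASE)


def split_result_text(raw: str) -> List[str]:
    """Split a raw text result on HL7 line breaks and drop empty lines."""
    parts = (part.strip() for part in raw.split(HL7_LINE_BREAK))
    return [part for part in parts if part]


def extract_result(raw: str) -> Optional[str]:
    """Return 'positive' or 'negative' if a line states the result, else None."""
    for line in split_result_text(raw):
        match = RESULT_PATTERN.search(line)
        if match:
            return match.group(1).lower()
    return None


# Shortened copy of the text result field shown on the slide.
example = (
    "Comment:\\.br\\The presence of C. difficile DNA is not diagnostic of\\.br\\"
    "infection and should be interpreted in conjunction with\\.br\\"
    "clinical presentation.\\.br\\ \\.br\\ "
    "Duplicate specimens received within 7 days of a NEGATIVE\\.br\\"
    "test or within 14 days of a POSITIVE test will be rejected. "
    "C. difficile Cytotoxin (PCR) Positive by RT PCR\\.br\\ "
    "**Infection Control Additional Precautions - Contact**"
)

result = extract_result(example)
lines = split_result_text(example)
```

Here `result` is `"positive"`: the words NEGATIVE and POSITIVE in the duplicate-specimen notice are skipped because neither is followed by "by".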
COMBAT-AMR

Comprehensive Ontario Microbiology LaBoratory Administrative data for AntiMicrobial Resistance
The stated purpose was to:
  Identify incidence and prevalence of AMR
  Measure the attributable mortality of each AMR
  Combine into a Clinical Antimicrobial Resistance Index

Objective

  Identify tests of interest
  Code test results
  Describe test results
  Validate

Identify

ICES provided us with a table of LOINC codes
  LOINC = Logical Observation Identifiers Names and Codes
  A description of a microbiology result field
LOINC -- Type
  C = Culture Test
  S = Susceptibility Test
  OST = Organism Specific Test
LOINC -- Priority
  1 = definitely of interest
  6 = definitely not of interest

Identify

LOINC    loincfullyspecifiedname                                           frequency
05-Dec   AMIKACIN:SUSC:PT:ISOLATE:ORDQN:MIC                                3201
10138-6  T' wave amplitude.lead V3:Elpot:Pt:Heart:Qn:EKG                   124
10219-4  Surgical operation note preoperative Dx:Imp:Pt:Patient:Nar        13646
10328-3  Lymphocytes/100 leukocytes:nfr:pt:csf:qn:manual count             5286
10329-1  Monocytes/100 leukocytes:nfr:pt:csf:qn:manual count               1145
10330-9  Monocytes/100 leukocytes:nfr:pt:body fld:qn:manual count          5227
10334-1  CANCER AG 125:ACNC:PT:SER/PLAS:QN                                 270790
10346-5  Hemoglobin A:ACnc:Pt:Bld:Qn:Electrophoresis                       1437
10352-3  Bacteria identified:prid:pt:gen:nom:aerobic culture               4503
10353-1  Bacteria identified:prid:pt:nose:nom:aerobic culture              1541
10355-6  MICROSCOPIC OBSERVATION:PRID:PT:BONE MAR:NOM:WRIGHT GIEMSA STAIN  1413
10357-2  MICROSCOPIC OBSERVATION:PRID:PT:WND:NOM:GRAM STAIN                80150
10434-9  COMPLEMENT C3 AG:ACNC:PT:TISS:ORD:IMMUNE STAIN                    3157
10466-1  ANION GAP 3:SCNC:PT:SER/PLAS:QN                                   1624478
10488-5  IGA AG:ACNC:PT:TISS:ORD:IMMUNE STAIN                              3156
10491-9  IGG AG:ACNC:PT:TISS:ORD:IMMUNE STAIN                              3156
10493-5  IGM AG:ACNC:PT:TISS:ORD:IMMUNE STAIN                              3156
10501-5  LUTROPIN:ACNC:PT:SER/PLAS:QN                                      890065
10524-7  Microscopic observation:prid:pt:cvx:nom:cyto stain                31989
10525-4  MICROSCOPIC OBSERVATION:PRID:PT:XXX:NOM:CYTO STAIN                313922

Identify

LOINC    loincfullyspecifiedname                              frequency  priority  testype
634-6    BACTERIA IDENTIFIED:PRID:PT:XXX:NOM:AEROBIC CULTURE  9216722    1         C
6463-4   BACTERIA IDENTIFIED:PRID:PT:XXX:NOM:CULTURE          7087153    1         C
43409-2  BACTERIA IDENTIFIED:PRID:PT:ISOLATE:NOM:CULTURE      2167222    1         C
626-2    BACTERIA IDENTIFIED:PRID:PT:THRT:NOM:CULTURE         1246629    1         C
630-4    BACTERIA IDENTIFIED:PRID:PT:URINE:NOM:CULTURE        940225     1         C
600-7    BACTERIA IDENTIFIED:PRID:PT:BLD:NOM:CULTURE          673178     1         C
625-4    BACTERIA IDENTIFIED:PRID:PT:STOOL:NOM:CULTURE        294157     1         C
17928-3  Bacteria identified:prid:pt:bld:nom:aerobic culture  284073     1         C
18998-5  TRIMETHOPRIM+SULFAMETHOXAZOLE:SUSC:PT:ISOLATE:ORDQN  2552324    1         S
18955-5  NITROFURANTOIN:SUSC:PT:ISOLATE:ORDQN                 2409756    1         S
18928-2  GENTAMICIN:SUSC:PT:ISOLATE:ORDQN                     2391134    1         S
18906-8  CIPROFLOXACIN:SUSC:PT:ISOLATE:ORDQN                  2324269    1         S
18864-9  AMPICILLIN:SUSC:PT:ISOLATE:ORDQN                     2211872    1         S
18900-1  CEPHALOTHIN:SUSC:PT:ISOLATE:ORDQN                    1572324    1         S
18878-9  CEFAZOLIN:SUSC:PT:ISOLATE:ORDQN                      1173398    1         S

Identify

Culture Tests
  N = 40 Priority 1 LOINCs
Susceptibility Tests
  N = 197 Priority 1 LOINCs

Describe

Culture Tests
  4,552,482 test result records in 2014
  63,312 unique values
  About 1 in every 70 records is unique
Susceptibility Tests
  3,823,864 test result records in 2014
  2,217 unique values
  About 1 in every 1,700 records is unique
Complexity
  Culture tests are about 25X more complex

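The complexity figure follows from the record and unique-value counts on this slide. A short Python sketch of the arithmetic is given here; the variable names are illustrative only.

```python
# Counts taken from the slide (2014 test result records).
culture_records, culture_unique = 4_552_482, 63_312
suscept_records, suscept_unique = 3_823_864, 2_217

# Records per unique value: higher means more repetition, so easier to code.
culture_ratio = culture_records / culture_unique    # about 72
suscept_ratio = suscept_records / suscept_unique    # about 1,725

# How many times more varied culture results are than susceptibility results.
relative_complexity = suscept_ratio / culture_ratio  # about 24

print(f"Culture: 1 unique value per {culture_ratio:.0f} records")
print(f"Susceptibility: 1 unique value per {suscept_ratio:.0f} records")
print(f"Culture tests are about {relative_complexity:.0f}x more complex")
```

With unrounded inputs the ratio comes out near 24; the slide's 25X comes from the rounded values of 70 and 1,700.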
Coding

Culture Tests
  List of 70 organisms (E. coli, Staph. aureus, etc.)
  Multiple organisms
  Not classified
Susceptibility Tests
  Susceptible
  Intermediate
  Resistant
  Other (MIC values, etc.)
  Not classified

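The susceptibility categories above lend themselves to an ordered list of rules. The Python sketch below shows the general shape of such a coder. The patterns and the example strings are assumptions made for illustration; the real OLIS result values and the rules used in the project are not shown in these slides.

```python
import re

# Ordered (category, pattern) rules. All patterns are hypothetical examples.
RULES = [
    ("Susceptible", re.compile(r"^\s*(s|sus|susceptible|sensitive)\s*$", re.IGNORECASE)),
    ("Intermediate", re.compile(r"^\s*(i|int|intermediate)\s*$", re.IGNORECASE)),
    ("Resistant", re.compile(r"^\s*(r|res|resistant)\s*$", re.IGNORECASE)),
    # MIC-style numeric values such as "<=0.25" or "> 32".
    ("Other", re.compile(r"^\s*[<>]?=?\s*\d+(\.\d+)?\s*$")),
]


def code_susceptibility(text: str) -> str:
    """Return the first matching category, or 'Not classified' if none match."""
    for category, pattern in RULES:
        if pattern.search(text):
            return category
    return "Not classified"


# Hypothetical raw values, not taken from OLIS.
samples = ["S", "resistant", "Intermediate", "<=0.25", "see comment"]
coded = [code_susceptibility(sample) for sample in samples]
```

Keeping "Not classified" as an explicit fallback makes it easy to count what the rules miss and to refine them iteratively.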
Coding

  Rule-based NLP (using regular expressions)
  Traditional machine learning
  Deep learning

Regular Expressions

Character groups
  Any character = .
  Class of characters = []
  E.g. lowercase letters [a-z], numbers [0-9], etc.
Repetition
  ? (absent or once)
  * (absent or any number)
  + (at least once)
Concatenation = implicit or ()
  ab = (a)(b)
Alternation (OR statement) = |

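Each construct on this slide can be checked with a small test table. The sketch below uses Python's `re` module, which follows Perl-style syntax rather than POSIX, although these basic constructs behave the same in both. The patterns and test strings are illustrative choices.

```python
import re

# (pattern, text, whether the whole text should match)
examples = [
    (r"e.coli", "e coli", True),        # . matches any single character
    (r"e.coli", "ecoli", False),        # . requires exactly one character
    (r"[a-z][0-9]", "b7", True),        # character classes
    (r"E\.? ?coli", "E. coli", True),   # ? = absent or once
    (r"E\.? ?coli", "Ecoli", True),
    (r"ab*", "a", True),                # * = absent or any number
    (r"ab+", "a", False),               # + = at least once
    (r"(ab)+", "abab", True),           # () groups a concatenation
    (r"staph|strep", "strep", True),    # | = alternation (OR)
]

# re.fullmatch succeeds only if the pattern matches the entire string.
results = [bool(re.fullmatch(pattern, text)) for pattern, text, _ in examples]
expected = [should_match for _, _, should_match in examples]
```

Note that a literal period must be escaped as `\.`; an unescaped `.` matches any character.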
Regular Expressions

R: uses the POSIX standard
  grep(regexp, textvector) returns the positions of matching elements
  grepl(regexp, textvector) returns whether or not each element matches
  grepl("e.coli", DESCRIPTION)
SAS (within a data step): uses the Perl standard
  PRXPARSE compiles the regular expression
  PRXMATCH finds the position of the pattern match
  PRXMATCH(PRXPARSE('/e.coli/'), textvariable) > 0

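For comparison, here is a rough Python analogue of R's `grepl`, applied to a few made-up descriptions. The helper name `grepl` is borrowed from R for readability and is not a Python built-in; the stricter pattern is one possible refinement, not the rule used in the project.

```python
import re
from typing import List


def grepl(pattern: str, texts: List[str], ignore_case: bool = False) -> List[bool]:
    """Return one boolean per text: does the pattern match anywhere in it?"""
    flags = re.IGNORECASE if ignore_case else 0
    compiled = re.compile(pattern, flags)
    return [compiled.search(text) is not None for text in texts]


# Hypothetical result descriptions, not taken from OLIS.
descriptions = ["Escherichia coli", "E. COLI isolated", "e.coli", "No growth"]

# The slide's pattern: "e", any one character, then "coli".
loose = grepl("e.coli", descriptions, ignore_case=True)

# A stricter variant covering the abbreviated and full names.
strict = grepl(
    r"\be\.?\s*coli\b|\bescherichia\s+coli\b",
    descriptions,
    ignore_case=True,
)
```

With these inputs `loose` is `[False, False, True, False]` and `strict` is `[True, True, True, False]`, which shows how quickly a simple pattern needs extending once spelling variants appear.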
Describing C & S coding

This is a first pass
  1. C coding
  2. S coding
  3. S x antibiotic

Describe C

Organism               N        %
No growth              976,245  21.4
Escherichia coli       525,543  11.5
Neisseria gonorrhoeae  263,926  5.8
Enterococcus sp.       149,788  3.3
Staphylococcus aureus  98,662   2.2
Klebsiella sp.         71,812   1.6
Enterococcus faecalis  63,852   1.4
Campylobacter sp.      52,292   1.1
Klebsiella pneumoniae  49,041   1.1
Shigella sp.           48,814   1.1
Yersinia sp.           47,199   1.0

* Note: 2.54 million coded records, ~55.7% coded

Describe S

Result          N          %
Susceptible     2,802,758  73.3
Resistant       707,113    18.5
Intermediate    187,274    4.9
Other           117,366    3.1
Not classified  9,353      0.2
Total           3,823,864

Describe S x Antibiotic

LOINC    Antibiotic      S       I      R       %R    Total
18998-5  TMP-SMX         415931  191    95954   18.7  512076
18955-5  Nitrofurantoin  389517  45470  50235   10.4  485222
18906-8  Ciprofloxacin   388648  7526   79949   16.8  476123
18928-2  Gentamicin      437798  3819   33263   7.0   474880
18864-9  Ampicillin      189159  4456   182741  48.6  376356
18900-1  Cephalothin     132223  71047  61366   23.2  264636
18878-9  Cefazolin       152923  5825   45233   22.2  203981
18996-9  Tobramycin      157200  14658  10370   5.7   182228
18862-3  Amoxiclav       86721   21391  19879   15.5  127991
18895-3  Ceftriaxone     77484   278    14776   16.0  92538
18880-5  Cefixime        46449   1382   5080    9.6   52911

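The %R column appears to be the resistant count divided by the sum of S, I and R, which is also what the Total column contains. The Python sketch below reproduces three rows of the table under that assumption; the function name is illustrative.

```python
# (S, I, R) counts copied from the first three rows of the table.
counts = {
    "TMP-SMX": (415931, 191, 95954),
    "Nitrofurantoin": (389517, 45470, 50235),
    "Ciprofloxacin": (388648, 7526, 79949),
}


def percent_resistant(s: int, i: int, r: int) -> float:
    """Resistant share of all classified S/I/R results, in percent."""
    total = s + i + r
    return round(100 * r / total, 1)


totals = {name: sum(sir) for name, sir in counts.items()}
pct_r = {name: percent_resistant(*sir) for name, sir in counts.items()}

for name in counts:
    print(f"{name}: total={totals[name]}, %R={pct_r[name]}")
```

The computed totals and percentages match the table, so "Other" and "Not classified" results seem to be excluded from the denominator.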
Cleaning OLIS microbiology data

2-pronged approach
Quick and dirty approach
  Subset of data we suspect (a priori) will be clean
  Outpatient urine cultures
  Rule-based NLP
Sustainable approach
  Comprehensive
  Deep learning
  Active learning

Conclusions

Susceptibility tests are easy to code; we can probably do this with rule-based NLP
Culture tests will be more difficult to code and will require more sophisticated NLP methods