Emerging Infectious Diseases
Volume 5, No. 3 / May - June 1999

Dispatches

Application of Data Mining to Intensive Care Unit Microbiologic Data(ft 1)

Stephen A. Moser, Warren T. Jones, and Stephen E. Brossette
The University of Alabama at Birmingham, Birmingham, Alabama, USA

---------------------------------------------------------------------------

We describe refinements to and new experimental applications of the Data Mining Surveillance System (DMSS), which uses a large electronic health-care database for monitoring emerging infections and antimicrobial resistance. For example, information from DMSS can indicate potentially important shifts in infection and antimicrobial resistance patterns in the intensive care units of a single health-care facility.

We have defined a new exploratory data mining process for automatically identifying new, unexpected, and potentially interesting patterns in hospital infection control and public health surveillance data. This process, and the system based on it, Data Mining Surveillance System (DMSS), use association rules to represent outcomes and association rule confidences to monitor changes in the incidence of those outcomes over time. Through experiments with infection control data from the University of Alabama at Birmingham Hospital, we have demonstrated that DMSS can identify potentially interesting and previously unknown patterns. Future work on prospective clinical studies to determine the usefulness of DMSS in hospital infection control is needed, as is improved event presentation for the user and strategies for handling larger datasets.

The statistical strategies developed for automatically detecting temporal patterns in surveillance data require that analysts explicitly define outcomes of interest before surveillance begins. The Data Mining Surveillance System (DMSS), on the other hand, is not constrained to monitoring changes in user-defined outcomes.
In DMSS, complex outcomes are represented by association rules, and outcome incidence is captured monthly. An early version of DMSS, along with association rules and early experiments with a single organism, has been described (1). We briefly describe a newer version of DMSS and experimental results obtained by using it to analyze 1 year's data from intensive care units (ICUs) at the University of Alabama at Birmingham Hospital.

DMSS uses the following definitions. An itemset is a subset of the set of all items. The support of an itemset x, sup(x), is the number of records that contain x. If sup(x) >/= FSST, where FSST is the frequent set support threshold, then x is a frequent set. An association rule, A ==> B, where A and B are frequent sets whose intersection is empty, is a statement about how often the items of B are found with the items of A. The incidence proportion of A ==> B, denoted ip(A ==> B), is equal to sup(union of A and B)/sup(A). The precondition support of association rule A ==> B is sup(A).

The incidence proportion of an association rule A ==> B in data partition p(sub i) describes the incidence of the outcome, B, in the group, A, during time t(sub i). A series of incidence proportions for A ==> B from partitions p(sub 1), p(sub 2), ..., p(sub n) describes the incidence of the outcome B in group A from t(sub 1) through t(sub n). Therefore, by analyzing the series of incidence proportions of an association rule A ==> B, it should be possible to detect important shifts or trends in the incidence of B in A over time. In this way, surveillance of B in A is possible.

Bacterial susceptibility and related demographic data of patients in the University of Alabama at Birmingham Hospital ICUs (medical, surgical [SICU], cardiac, neurologic [NICU]) during 1997 were extracted from the PathNet laboratory information system.
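The definitions of support, frequent set, and incidence proportion given above can be illustrated with a minimal Python sketch. It is not the DMSS implementation: the function names, the item labels, and the five records in the sample partition are hypothetical and were chosen only to make the arithmetic easy to follow.

```python
# Illustrative sketch only; names and records are hypothetical, not DMSS code.
FSST = 3  # assumed frequent set support threshold (support of at least 3)


def support(itemset, records):
    """sup(x): number of records that contain every item of x."""
    return sum(1 for record in records if itemset <= record)


def is_frequent(itemset, records, fsst=FSST):
    """x is a frequent set when sup(x) >/= FSST."""
    return support(itemset, records) >= fsst


def incidence_proportion(a, b, records):
    """ip(A ==> B) = sup(A union B) / sup(A), kept as (numerator, denominator)."""
    if a & b:
        raise ValueError("A and B must not share items")
    return support(a | b, records), support(a, records)


# Hypothetical one-month partition: each record is the item set of one isolate.
partition = [
    frozenset({"LocSICU", "NP_GNR", "R~Piperacillin"}),
    frozenset({"LocSICU", "NP_GNR"}),
    frozenset({"LocSICU", "NP_GNR", "NSNoso", "R~Piperacillin"}),
    frozenset({"LocNICU", "NP_GNR", "R~Piperacillin"}),
    frozenset({"LocSICU", "NP_GNR"}),
]

group = frozenset({"LocSICU", "NP_GNR"})   # A
outcome = frozenset({"R~Piperacillin"})    # B
numerator, denominator = incidence_proportion(group, outcome, partition)
print(f"ip(A ==> B) = {numerator}/{denominator}")
```

Returning the proportion as a numerator and denominator, rather than as a single fraction, keeps the counts available so that several partitions can later be pooled into one cumulative proportion.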
Each record describes a single isolate and contains the following data elements: date of admission, date of sample collection, date of results reported, source of isolate (e.g., sputum, blood), organism isolated, organism Gram stain and morphologic features, patient's location in the hospital, and resistant (R), intermediate (I), or susceptible (S) test results to relevant antibiotics, according to the National Committee for Clinical Laboratory Standards MIC breakpoints (2).

Duplicate records were removed so that for each patient, no more than one isolate per organism per month was included. In each remaining record, certain antimicrobial drug items were removed (only drugs to which the organism is historically susceptible at least 50% of the time remained). Additionally, items of the form S~Antimicrobial were removed so that only I~Antimicrobial and R~Antimicrobial items remained. Finally, data were divided into 1-month partitions (p(sub 1), ..., p(sub n)) before analysis.

For each partition p(sub i), all frequent sets with support of at least 3 (FSST = 3) and all association rules with precondition support greater than 5 were generated. The frequent set discovery and association rule-generating algorithms are beyond the scope of this report and are described elsewhere (3).

Each generated association rule must pass a set of rule templates that describe families of interesting and uninteresting rules. Each template is a construct of the form be(sub 1) ==> be(sub 2), where be(sub 1) and be(sub 2) are Boolean expressions over items and attributes. Association rule A ==> B satisfies rule template be(sub 1) ==> be(sub 2) if A satisfies be(sub 1) and B satisfies be(sub 2). Two types of association rule templates are used: include templates and exclude templates. An association rule A ==> B passes a set of rule templates if A ==> B satisfies at least one include template in the set and does not satisfy any exclude template in the set.
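The rule for passing a set of templates can be sketched in Python as follows. This is only one possible encoding, not the one used in DMSS: Boolean expressions are modeled as plain predicates over an item set, attributes are recognized by a hypothetical item-name prefix (for example, "R~" for resistance items and "Loc" for location items), and the three sample templates are simplified stand-ins.

```python
from typing import Callable, FrozenSet, List, NamedTuple

Itemset = FrozenSet[str]
Expr = Callable[[Itemset], bool]  # a Boolean expression over items


class Template(NamedTuple):
    kind: str    # "include" or "exclude"
    left: Expr   # be(sub 1), evaluated on A
    right: Expr  # be(sub 2), evaluated on B


def has(prefix: str) -> Expr:
    """True when any item carries the attribute named by `prefix` (assumed naming)."""
    return lambda items: any(item.startswith(prefix) for item in items)


def either(*exprs: Expr) -> Expr:
    """Logical OR of several expressions."""
    return lambda items: any(expr(items) for expr in exprs)


def anything(items: Itemset) -> bool:
    return True


def satisfies(a: Itemset, b: Itemset, template: Template) -> bool:
    """A ==> B satisfies the template if A satisfies be1 and B satisfies be2."""
    return template.left(a) and template.right(b)


def passes(a: Itemset, b: Itemset, templates: List[Template]) -> bool:
    """At least one include template matches and no exclude template matches."""
    included = any(satisfies(a, b, t) for t in templates if t.kind == "include")
    excluded = any(satisfies(a, b, t) for t in templates if t.kind == "exclude")
    return included and not excluded


# Hypothetical, simplified template set.
TEMPLATES = [
    Template("exclude", has("R~"), anything),       # resistance only on the right
    Template("exclude", anything, has("Source")),   # source is not an outcome
    Template("include", either(has("Org"), has("Loc")), has("R~")),
]

print(passes(frozenset({"LocSICU"}), frozenset({"R~Piperacillin"}), TEMPLATES))
```

Modeling each expression as a callable keeps the include and exclude logic independent of how a particular expression is written, so templates can be added or revised without touching `passes`.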
Rule templates are handcrafted by domain experts to eliminate inherently uninteresting or nonsense rules. This is accomplished through iterative experiments with representative data by initially using few templates and then creating and modifying templates on the basis of pattern review.

The history is a database that holds association rules and their incidence proportions for different data partitions. In DMSS, the user specifies a set of rule templates that contains any number of include and exclude templates (Table 1). Only association rules that pass the rule templates are included in the history. To establish a baseline for an association rule, the incidence proportions of the rule for the three previous partitions are obtained and stored in the history. Once stored in the history, a rule is updated for each new partition regardless of whether it is generated in that partition. Therefore, for every association rule, the history contains an up-to-date time series of incidence proportions.

Table 1. Templates used to filter association rules
--------------------------------------------------------------------------
Template  Left              Right              Explanation
type      (be(sub 1))       (be(sub 2))
--------------------------------------------------------------------------
Exclude   (R~Antibiotic)    (Anything)         Want antibiotic sensitivity
                                               info on the right only.
Exclude   (Anything)        (Source)           Source of infection is not
                                               an outcome. Therefore,
                                               exclude all rules with a
                                               source on the right.
Exclude   (NS OR Org        (NS OR Org         NS, Org, and GrMp are more
          OR GrMp)          OR GrMp)           informative if kept
                                               together in either a group
                                               or an outcome.
Exclude   (Loc)             (Org OR GrMp)      If the left contains
                            AND                location, then exclude
                            (R~Antibiotic)     rules that have Org and
                                               R~Antibiotic or GrMp and
                                               R~Antibiotic.
Include   (Org OR Loc)      (R~Antibiotic OR   Include rules whose groups
                            GrMp OR Org)       are Org- or Loc-specific
                            AND Not (Loc)      and whose outcomes are
                                               Antibiotic- or
                                               GrMp-specific.
--------------------------------------------------------------------------
be(sub 1) and be(sub 2), Boolean expressions; R, resistant; NS, nosocomial; OR, "or"; Org, organism; GrMp, Gram stain and morphology; Loc, location.

Table 2. A sample event generated by the Data Mining Surveillance System
-----------------------------------------------------------------------------
Association rule: {nosocomial, SICU(sup b), tracheal aspirate} ==>
                  {Acinetobacter baumannii}
-----------------------------------------------------------------------------
Partition          Incidence proportion   Window(sup c)
-----------------------------------------------------------------------------
P(sub c-5)         0/11                   w(sub p)
P(sub c-4)         0/10                   w(sub p)
P(sub c-3)         0/9                    w(sub p)
P(sub c-2)         0/13                   w(sub p)
P(sub c-1)         2/9                    w(sub c)
P(sub c)(sup a)    3/9                    w(sub c)
-----------------------------------------------------------------------------
(sup a)P(sub c), current pair.
(sup b)SICU, surgical intensive care unit.
(sup c)w(sub p), past window; w(sub c), current window.

By analyzing information stored in the history, DMSS generates alerts, each of which describes an extreme change in the incidence of an outcome B in a group A over time. For example, Table 2 describes the incidence of Acinetobacter baumannii among nosocomial tracheal aspirate isolates from the SICU over the past six partitions. Clearly, a shift in incidence occurs between the first 4 months and the most recent 2 months of the series. If we call months 1, 2, 3, and 4 the past window, w(sub p), and months 5 and 6 the current window, w(sub c), we can ask if there is an extreme change in the incidence between w(sub p) and w(sub c). We compute the cumulative incidence proportion for w(sub p) (0/43) and for w(sub c) (5/18) and compare the two by a statistical test of two proportions.

To generate an alert for an association rule r, DMSS first constructs a current window (w(sub c)) and a past window (w(sub p)) on the series of incidence proportions of r (w(sub c)[r,0], w(sub p)[r,0] from the algorithm in the Figure).
Second, it computes the cumulative incidence proportion for each window. Third, it compares the two cumulative incidence proportions by a test of two proportions. Finally, if the difference between the proportions is statistically extreme (p [...]

------------------------------------------------------------------------------
S. aureus, SourceTRACHASP(sup c) ==> R~Oxacillin(sup a,b), R~Clindamycin,
R~Erythromycin
    Incidence proportions: 0/10, 0/8, 7/14
    Increase in the incidence of oxacillin (ORSA), clindamycin, and
    erythromycin resistance in all S. aureus isolated from tracheal
    aspirates.

NSNoso(sup d) ==> R~Ceftazidime
    Incidence proportions: 3/88, 11/70
    Increase in incidence of ceftazidime resistance in all nosocomial
    isolates.

NP_GNR(sup e), LocSICU ==> R~Piperacillin
    Incidence proportions: 0/17, 6/14
    Increase in the incidence of piperacillin resistance in
    non-pseudomonas gram-negative bacilli isolated from NSNoso.

NP_GNR, LocSICU(sup f) ==> R~Piperacillin
    Incidence proportions: 1/12, 0/14, 4/11, 4/8
    Increase in the incidence of piperacillin resistance in
    non-pseudomonas, nosocomial gram-negative bacilli from the SICU.

NSNoso, LocNICU(sup g) ==> S. aureus
    Incidence proportions: 26, 3/26, 2/28, 6/27, 5/20, 3/11
    Increase in the incidence of nosocomial S. aureus in nosocomial
    isolates from the NICU.
------------------------------------------------------------------------------
(sup a)R, resistant.
(sup b)Oxacillin, resistance implies resistance to amoxycillin/clavulanic acid, cephalothin, and cefazolin.
(sup c)SourceTRACHASP, tracheal aspirates.
(sup d)NSNoso, nosocomial (3 days from admission).
(sup e)NP_GNR, non-pseudomonas gram-negative rod.
(sup f)LocSICU, location, surgical intensive care unit (SICU).
(sup g)LocNICU, location, neurologic intensive care unit (NICU).

We believe that this approach to surveillance will allow hospital infection control programs to focus their limited resources on issues of probable significance. We also believe that this approach is a step toward the public health surveillance system described by Dean, Fagan, and Panter-Conner (4).
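The window comparison described for Table 2 can be sketched in Python as follows. The text specifies only "a statistical test of two proportions", so the choice of a two-sided Fisher exact test here is an assumption made because the counts are small. The function names and the ALERT_THRESHOLD value are likewise hypothetical and are not taken from DMSS.

```python
from fractions import Fraction
from math import comb

ALERT_THRESHOLD = 0.01  # hypothetical cutoff, not taken from the article


def cumulative(window):
    """Pool (numerator, denominator) pairs over the partitions of a window."""
    return sum(n for n, _ in window), sum(d for _, d in window)


def fisher_exact_two_sided(a, n1, b, n2):
    """Two-sided Fisher exact p-value for a/n1 versus b/n2 (assumed test)."""
    k = a + b                      # total records with the outcome
    total = comb(n1 + n2, k)

    def prob(x):
        # Hypergeometric probability of x outcome records in the first window.
        return Fraction(comb(n1, x) * comb(n2, k - x), total)

    observed = prob(a)
    low, high = max(0, k - n2), min(k, n1)
    # Sum every table that is no more likely than the observed one.
    return float(sum(prob(x) for x in range(low, high + 1) if prob(x) <= observed))


# Series of incidence proportions from Table 2, oldest partition first.
series = [(0, 11), (0, 10), (0, 9), (0, 13), (2, 9), (3, 9)]
past_window, current_window = series[:4], series[4:]

a, n1 = cumulative(past_window)      # 0/43
b, n2 = cumulative(current_window)   # 5/18
p_value = fisher_exact_two_sided(a, n1, b, n2)
alert = p_value < ALERT_THRESHOLD
print(f"past {a}/{n1}, current {b}/{n2}, p = {p_value:.5f}, alert = {alert}")
```

Exact rational arithmetic is used for the table probabilities so that the comparison against the observed table is not affected by floating-point rounding. A library routine such as `scipy.stats.fisher_exact` could be used instead where SciPy is available.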
---------------------------------------------------------------------------

This work was supported in part by cooperative agreement U47-CCU411451 with the Centers for Disease Control and Prevention (SAM) and a predoctoral research fellowship LM-00057 from the National Library of Medicine (SEB).

Dr. Moser is associate professor, Department of Pathology, University of Alabama at Birmingham, and serves as director of Laboratory Information Services, associate director of Clinical Microbiology for University Hospital, and director of the Pathology Informatics Section. His research interests are applied research in diagnostic microbiology and the application of software as an aid to the intelligent analysis of medical information, especially that generated in laboratory medicine.

Address for correspondence: Stephen A. Moser, University of Alabama at Birmingham, Department of Pathology, P246, 619 19th St. South, Birmingham, AL 35233-7331, USA; fax: 205-975-4468; e-mail: moser@uab.edu.

(footnote 1)Presented in part at the International Conference on Emerging Infectious Diseases, March 8-11, 1998, Atlanta, Georgia.

References

1. Brossette SE, Sprague AP, Hardin JM, Waites KB, Jones WT, Moser SA. Association rules and data mining in hospital infection control and public health surveillance. J Am Med Inform Assoc 1998;5:373-81.
2. National Committee for Clinical Laboratory Standards. Methods for dilution antimicrobial susceptibility tests for bacteria that grow aerobically. 4th ed. Approved standard. NCCLS document M7-A4. Wayne (PA): The Committee; 1997.
3. Brossette SE. Data mining and epidemiologic surveillance [dissertation]. Birmingham (AL): University of Alabama at Birmingham; 1998.
4. Dean AG, Fagan RF, Panter-Conner BJ. Computerizing public health surveillance systems. In: Teutsch SM, Churchill RE, editors. Principles and practice of public health surveillance. New York: Oxford University Press; 1994. p. 200-17.
---------------------------------------------------------------------------

Emerging Infectious Diseases
National Center for Infectious Diseases
Centers for Disease Control and Prevention
Atlanta, GA

URL: ftp://ftp.cdc.gov/pub/EID/vol5no3/ascii/moser.txt

Please note that figures and equations are not available in ASCII format; their placement within the text is noted by [fig] and [eq], respectively. Greek symbols are spelled out. The following codes are used: (ft) for footnote; (sup) for superscript; (sub) for subscript; >/= for greater than or equal to.