[Emerging Infectious Diseases] [Volume 4 No. 2 / April - June 1998] Perspectives Accommodating Error Analysis in Comparison and Clustering of Molecular Fingerprints Hugh Salamon,* Mark R. Segal,* Alfredo Ponce de Leon,† and Peter M. Small‡ *University of California, San Francisco, California, USA; †Instituto Nacional Nutrición, Zubriran, Mexico City, Mexico; and ‡Stanford University Medical Center, Stanford, California, USA ------------------------------------------------------------------------- Molecular epidemiologic studies of infectious diseases rely on pathogen genotype comparisons, which usually yield patterns comprising sets of DNA fragments (DNA fingerprints). We use a highly developed genotyping system, IS6110-based restriction fragment length polymorphism analysis of Mycobacterium tuberculosis, to develop a computational method that automates comparison of large numbers of fingerprints. Because error in fragment length measurements is proportional to fragment length and is positively correlated for fragments within a lane, an align-and-count method that compensates for relative scaling of lanes reliably counts matching fragments between lanes. Results of a two-step method we developed to cluster identical fingerprints agree closely with 5 years of computer-assisted visual matching among 1,335 M. tuberculosis fingerprints. Fully documented and validated methods of automated comparison and clustering will greatly expand the scope of molecular epidemiology. The combination of conventional epidemiologic investigations with molecular techniques for genotyping pathogens has elucidated the epidemiology of many infectious diseases. The most frequently used genotyping techniques (e.g., pulsed-field gel electrophoresis, restriction fragment length polymorphism [RFLP], and randomly amplified polymorphic DNA) yield fragment-based data. Fewer than 100 patterns can be compared visually. For larger numbers, commercially available computer programs can be used to identify a manageable subset of potentially matching patterns, which are then compared visually. This approach is accurate but cumbersome and excessively labor-intensive as the number of isolates exceeds several hundred. Furthermore, the results of computer-assisted matching are not as reproducible as systematic computational methods. These limitations significantly constrain the size, scope, and standardization of molecular epidemiologic investigations. We present an approach by which identical patterns can be identified from large collections of DNA fingerprints. The number of IS6110 fingerprints continues to increase, with many studies across the globe producing IS6110 data to characterize Mycobacterium tuberculosis isolates. Such molecular epidemiologic studies provide information about tuberculosis (TB) transmission patterns. Studies in Ethiopia, Tunisia, and The Netherlands (1), South Africa (2), India (3), Denmark and Greenland (4,5), the United States (6), and Tanzania (7), among many others, exploit IS6110-based RFLP genetic fingerprints. We developed an automated computational system, in which a statistical analysis of the error in measuring fragment sizes provides a conceptual framework for comparing sets of fragment lengths. The computational approach to lane comparison—align-and-count method (ACM)—permits calculation of the number of fragments that match between two IS6110-based RFLP fingerprints. The parameters of the computational ACM are adjusted to provide the same high sensitivity as the labor-intensive visual inspection used over the last 5 years. We developed an approach to identifying a set of identical fingerprints when the identity of the fingerprints is nontransitive. We also explored improving the specificity among matched fingerprints to reveal additional information in RFLP lanes. Data Acquisition Genotyping M. tuberculosis isolates with IS6100-based RFLP fingerprinting was performed as described in van Embden et al., 1993 (8). Computer-assisted comparison and clustering were performed on the RFLP lanes as described in Woelffer et al., 1996 (9). Internal standards were used to quantitate fragment sizes visualized with a DNA probe to IS6110. Two films' exposures were scanned into Whole Band Analyzer software (BioImage, Ann Arbor, MI, USA); one was obtained when probing for the internal standard, the other when probing for IS6110. The resulting images were aligned with three registration marks that gave reference to the original nylon membrane. The Whole Band Analyzer software quantitated fragment lengths for the IS6110-visualized bands, which were inspected and edited manually in the software package. The resulting collections of fragment lengths for each lane (bacterial isolate or laboratory strain H37Rv) were exported to our ACM software and compared with other lanes. Mathematical Methods The following is a descriptive summary of the principles underlying the analysis (Appendix). Analysis of Error in Data The magnitude and characteristics of experimental error were empirically assessed by analyzing variation in the results obtained from a reference strain included in each experiment (gel). The absolute and proportional differences in the measured fragment lengths of biologically identical samples of this strain were calculated; results showed that the error in measurement was proportional to fragment length and greater between than within gels. Align-and-Count Matching Algorithm A method for matching was developed to accommodate the empirically defined error. The fragment length data from two lanes were scaled through a range of values, and the maximum number of mutually closest bands falling within a threshold tolerance was reported (Figure 1). The acceptable tolerance was smaller when lanes from the same gel were compared than when lanes from different gels were compared. An animated demonstration of this method is accessible on the Internet (URL for use with a graphics-capable Web browser: http://molepi.stanford.edu/hugh/acm/counting). A Graph-Theoretic Approach to Identical Fingerprints We considered pairs of patterns identical if all fragments matched. However, because the results might be nontransitive (A might be identical to B, B identical to C, but A not identical to C), the identification of groups of matched patterns is more complicated. A graph-theoretic approach was used to assemble clusters of matching fingerprints. Alignment and Analysis of Residual Error We further aligned collections of lanes determined to match according to the above algorithms. The optimal alignment was defined as that which minimized the proportional error between putatively identical fragments. This analytic step, which is comparable to the experimental approach of rerunning clustered strains in the same gel, improved the ability to distinguish similar, but nonidentical fingerprints. [fig 1] Figure 1. The align-and-count method finds the maximum number of mutually closest bands within a threshold deviation value[Delta], for a search across a ranges of scaling values. The two lanes are scaled incrementally, thus searching for the best alignment. Error in Analysis of H37Rv Data Investigating the absolute error for pairs of 116 H37Rv lanes (Figure 2), we found that the error was consistently higher in gel-to-gel comparisons than in comparisons of lanes from the same gel. Error was proportional to fragment length for the range of fragment lengths from 0.9 kbp up to at least 5 kbp (Figure 2), which included 90% of the bands in the fingerprint data from San Francisco. The latter empirical result was consistent with the fact that the distance migrated by a DNA fragment on an electrophoresis gel is typically proportional to the logarithm of the fragment length. Furthermore, the error found when we compared one band of a lane to that band in another lane was positively correlated to the error found for the other bands; in other words, if the first band in lane A was larger than the average measurement for that band, it was likely that the other bands in lane A would be larger than the average measurements. This positive correlation was intuitively evident in comparing lane maps (i.e., when comparing graphic representations of the fragment lengths). In Figure 3, the set of lane maps on the left are measurements of a genotype found in San Francisco. With the exception of the fourth lane from the right, they represent a set of identical patterns. Note that the error is mostly a scaling error and that if one fragment is larger than average for that lane, the others are very likely to be larger also. These two observations—that error is proportional to fragment length and positively correlated for bands within a given lane—suggests two classes of error: one is a property of each band; the second is a property of each lane. This analysis motivated us to develop a computational algorithm that scales fingerprints and measures the number of mutually closest bands within threshold sizes of each other for the best alignment, i.e., the scaling that maximizes the number of matching fragments. [fig 2] [fig 2] Figure 2. Means and two Figure 3. Additional alignment standard errors of the mean of very similar patterns can error bars for pairwise identify clearly distinct comparisons among 116 12-banded patterns. Measurement noise H37Rv lanes show that error is obscures the detailed consistently larger when relationships between 26 comparing lanes between gels patterns that were identified than when comparing lanes from from 1,335 as being very the same gel. The x-axis similar. However, after corresponds to w(b), and the alignment to a consensus y-axis to d(b), as presented in pattern, a clearly distinct the text. It is evident that pattern (an outlier from the error is proportional to other members of this fragment length in the range of autocluster) can be readily fragment lengths found in identified. Fragment lengths are H37Rv. The data exhibit 2% to given in kilobasepairs (kb). 3% error for between gel comparisons, but only approximately 1% error on average for within gel comparisons. Alignment and Residual Error for H37Rv Lanes By optimally aligning (i.e., minimizing the sum of proportional errors) pairs of H37Rv lanes, we find a distribution of scaling factors and of residual error. Table 1 shows 6,641 pairwise comparisons of 91 replicate lanes (members of each pair are taken from different gels). The mean value of s, the scaling factor, for these 6,641 alignments is 0.0212, and the standard deviation of s is 0.0189. The reduction in error due to alignment is approximately twofold, to approximately 1% with a standard deviation also of approximately 1% (Table 1); therefore, a search range, S, of 0.10 will allow for virtually every incidence of scaling error in the data. (Assuming normally distributed scaling and noting that 0.10 lies more than four standard deviations about the mean scaling error, we conclude that scaling error will not be compensated in fewer than 1 out of 10,000 independent pairwise comparisons.) Employing a deviation tolerance, [Delta], of 0.045 should ensure a sensitivity very close to 100% for matching individual bands, after alignment. (Assuming normally distributed differences between replicate band measurements and a deviation tolerance of 4.5%, which is approximately 3.5 standard deviations above the mean fragment length error after alignment, one should falsely conclude two identical bands do not match with an approximate probability of 0.0002.) These parameter values, together with the number of incremental searches, I, set to 100, empirically gave results agreeing closely with visual inspection by experienced researchers who matched entire lanes. Similarly, one may use analysis of within gel lane error to determine the parameter values for the ACM to match lanes from the same gel. Table 1. Pairwise comparisons (n=6,641) of lanes across gels characterize unaligned proportional error and residual error ----------------------------------------------------- Unaligned Aligned pairwise pairwise proportional proportional Mean error error H37Rv kilo ------------ ---------------- band bases mean s.d. mean s.d. b(sup *)w(sub b) r(b) r(sup a)(b) ---------------------------------------------------- 1 5.029 0.0253 0.0196 0.0099 0.0079 2 4.853 0.0249 0.0202 0.0084 0.0075 3 3.533 0.0253 0.0303 0.0109 0.0203 4 2.814 0.0237 0.0283 0.0105 0.0182 5 2.153 0.0202 0.0186 0.0086 0.0125 6 1.892 0.0210 0.0204 0.0100 0.0085 7 1.800 0.0226 0.0220 0.0092 0.0082 8 1.684 0.0227 0.0193 0.0083 0.0083 9 1.541 0.0221 0.0198 0.0079 0.0081 10 1.397 0.0225 0.0200 0.0093 0.0078 11 1.314 0.0231 0.0200 0.0111 0.0090 12 0.0936 0.0281 0.0249 0.0142 0.0157 mean 0.0235 0.0219 0.0099 0.0110 -------------------------------------------------- (sup *)Symbols as in Appendix. Both the range of scaling factors and the threshold deviations are derived from empirical investigation of the San Francisco data. An adjustment for larger error in measurement is included for the more rare larger fragment length bands. In applying the ACM to San Francisco bacterial fingerprints, [Delta] is allowed to increase at a rate of 0.005/kbp above a value of 7 kbp. San Francisco Bacterial Genetic Fingerprint Comparisons To evaluate the performance of the ACM, we analyzed (by computer-assisted visual inspection and by ACM) 125 lanes from bacterial isolates obtained in the first half of 1996. We evaluated ACM's performance by first determining whether a 1996 lane matched all bands to lanes in visually defined clusters (from previous years), matched other 1996 lanes, or did not find any identical matches at all. The automated matching of the 1996 lanes agreed nearly perfectly with the visual analysis; the few conflicting results were due to inconsistencies in the existing data (e.g., two bands of nearly identical size being edited sometimes as one band and sometimes as two). Humans often compensate for such inconsistencies, whereas a computational method would have to have such capabilities explicitly built in. Using ACM, we analyzed all 890,445 pairwise comparisons of isolates from 1,335 TB cases in San Francisco from 1991 to mid-1996. The autoclusters defined from the pairwise comparisons agreed closely with the clusters defined by computer-assisted visual inspection (Table 2). An example of one autocluster is shown in Figure 3. Without additional alignment, noise in fragment length measurements makes it difficult to determine if these putative clusters include individual patterns which, although similar, are distinct (outliers) or if the cluster contains identifiable subgroups of patterns (subclusters). Table 2. Preliminary autoclustering agrees closely with results obtained by computer-assisted visual inspection (CAVI) ----------------------------------------------------- Clustered Not clustered by CAVI by CAVI Total ----------------------------------------------------- Autoclustered 540 27 567 Not autoclustered 49 719 768 Total 589 746 1,335 ------------------------------------------------------- We then further analyzed autoclusters and identified more precisely nonidentical genotypes. Refinement of clusters begins with defining a consensus pattern for the cluster (consisting of the collection of mean fragment lengths for each band). Then the fragment lengths for each putative member of the autocluster are aligned to the consensus pattern. After alignment, a pattern can be easily identified as an outlier (Figure 3). If the analysis of a refined, aligned cluster shows multiple outliers, realignment of subsets of lanes can be used to reveal subclusters of identical patterns. In Figure 4 the distributions of fragment lengths before (a,b) and after (c,d) an initial alignment are presented for an autocluster of 84 2-banders from San Francisco. Given that the aligned bands are clearly split into two distributions, we split the autocluster into two subclusters. A set of 26 fingerprints (group 1) is aligned to its assembled mean-value lane (Figure 4 e,f), as is a set of 58 fingerprints (group 2, Figure 4 g,h). The contrast between the original fragment length data and the two well-aligned groups of fingerprints shows that the higher fragment length band is shifted between the two groups and no clear outliers exist after alignment. Figure 5 shows that alignment greatly improves a difficult-to-resolve clustering issue among four lanes. Preliminary investigation with a polymorphic GC-rich sequence (PGRS) fingerprinting method shows that the two subclusters exhibit distinct fingerprints, which further validates the increased specificity in IS6110 fingerprints. Of 81 PGRS genotyped autoclustered 2-banders, 63 fall into eight visually defined clusters, the remaining being unique PGRS patterns among the members of the IS6110 2-banded autocluster. Each cluster consists of isolates that all fall into one or the other IS6110-refined subcluster. Conclusions We have developed and validated a systematic approach to pairwise comparison and clustering of identical patterns in a large data set of DNA fragment-based genotypes. Incorporating a control pattern in each experiment allows the nature and magnitude of error in DNA fragment length measurements to be determined. An analysis of measurement error provides parameter values to use with algorithms that accommodate these errors. Relative scaling of entire lanes, an important characteristic of the error generated in quantitating fragment lengths from RFLP patterns used to type M. tuberculosis isolates, arises in part from aligning two images of a gel, one for internal lane size standards and one for data fragments; the image of internal standards and the image of data fragments are registered by three marks. Error in registration of the two images occurs, leading to the scaling effect in the fragment lengths reported. We strongly suggest that software for analyzing internal lane size standards and data fragments from separate images permit (and encourage) use of more than three registration marks. While internal standards compensate for idiosyncrasies in lane-specific fragment mobility, the limitations imposed by poor registration methods can result in increased scaling error. We have demonstrated that allowing for scaling, as in ACM, greatly assists in automating matching; incorporating alignment of pairs of lanes into the method has provided fully automated lane matching that agrees closely with results of the well-established method of computer-assisted visual comparison. We successfully address the nontransitivity of pairwise identity and once again use scaling of lanes to ensure the reliability of automated clustering of identical patterns. Alignment of DNA fragment patterns removes noise in clusters of fingerprints, showing further specificity in genotyping. This is analogous to the experimental approach of rerunning similar patterns on a single gel to reduce intergel noise. Mathematical transformation of the fragment length data yields similar information for far less cost in labor and materials. Further automated clustering within sets of aligned patterns could exploit the fact that we assume putative homology of fragments. As the aligned patterns have the same number of fragments, the residual error presents a multidimensional clustering problem; each pattern may deviate from the mean-value pattern in any of the fragment lengths. This clustering may prove more straightforward than the more general problem of clustering among patterns differing in numbers of fragments. [fig 4] Figure 4. Histograms of the fragment lengths for 84 two-banded patterns connected by identity (autoclustered with in-house software) exhibit enough spread in values to make detecting outliers and band shifts difficult (a,b). Aligning the 84 lanes to the mean-value lane for this collection reveals that the lanes do not align well, but instead shows bimodal distributions for the fragment lengths (c,d). Dividing the 84 fingerprints into two sets and separating the distinct distributions detected when aligning all 84 fingerprints show that 26 fingerprints align well to their mean-value lane (e,f), and the remaining 58 also align well to their respective mean value lane (g,h). The smaller fragment length fragment does not appear shifted between the two sets of 2-banders(comparing e to g), but the larger fragment is clearly shifted (comparing f to h). Numerous commercial packages of computer software are available to compare and match DNA fragment patterns. The availability of these systems fosters unquestioning application of turnkey pattern matching and clustering methods, which are not always fully documented or validated (for fragment-based genotypes) in the scientific literature. While possibly acceptable for small studies when visually validated, this approach to data analysis is risky for large studies. The methods for matching DNA fragment patterns presented in this paper should be an adjunct to software packages that quantitate fragments. We provide a systematic approach to analysis of fragment length estimates, appropriate whether the numerical data are generated by hand using a ruler and arithmetic or are output by a multithousand dollar gel analyzer. This approach can thus be used by molecular epidemiologists working with both large and small budgets around the world. We anticipate that our methods can be incorporated into existing commercial software packages with broad distribution and encourage similar documentation in peer-reviewed journals of other methods provided in software packages. In addition, our focus on the use of fragment length data, as opposed to the comparison of actual images, will fostercomparison of data generated in different laboratories that use different proprietary software. We are working on a World-Wide Web–based system to facilitate IS6110 genotype data sharing. The availability of a precise and validated method to count the number of matching fragments in a pairwise fashion among very large numbers of patterns now permits an assessment of the importance and usefulness of approaches that exploit fingerprint similarity. In conjunction with the growing understanding of the underlying biology of IS6110 instability and the relevant statistical issues, this may greatly expand the scope of TB molecular epidemiology. A general question arises when comparing molecular fingerprints: how many bands need to match to indicate a close biologic relationship? Aside from the issues of defining biologic "closeness"—too often ignored in epidemiologic studies—technical issues are relevant to ACM. The usefulness of the matched band information output by ACM depends on at least several factors: the underlying band size distribution from which fingerprint bands are sampled, the independence of sampled bands, measurement error (both scaling and independent band error), the length of the gel, and the number of fragments. At one extreme, where error is large and few bands are observed, even in the case of a perfect match, statistical analysis may fail to reject coincidental band matches. By using computer simulation and sets of assumptions regarding band size distributions, one may learn about the role of coincidental band matching. We are actively researching these issues for IS6110 fingerprint comparisons. Furthermore, we intend to provide a general computational framework in which one may assess error in laboratory measurement, the appropriateness of ACM for analyzing data (and appropriate parameter values to use with the method), and the role of coincidence in band matching. Information regarding computer programs for various tasks, including ACM matching itself, will be made available on the Internet at http://molepi.stanford.edu/hugh/acm. fig 5] Figure 5. Prior to alignment of two sets of 2-banders, lanes are difficult to cluster(lanes a-d are from the distributions in Figures 4a and 4b). Subsequent to alignment, lanes are much easier to cluster (lanes a' and b' are specific examples from the distributions in Figures 4e and 4f; lanes c' and d' likewise correspond to Figures 4g and 4h). Fragment lengths are given in kilobasepairs (kb). A study of M. tuberculosis isolates from northern Tanzania demonstrates the utility of partial matching (3). The study brings into focus difficulties inherent in employing one-parameter tolerance for DNA fragment-based genotype matching, a technical issue effectively addressed by ACM. Gillespie et al. (7) also call attention to the poor specificity of low copy number IS6110- based fingerprints, exacerbated by the use of the Dice coefficient. We are pursuing alternative similarity measures that use the numbers of matching fragments identified by ACM and are tailored to the needs of epidemiologic investigations. Dendogram clustering methods often provided in software targeted at DNA-fragment genotype management and analysis could in some instances fail to reconstruct the correct relationships among infectious organism isolates, even when presented perfect clock-like genetic distances. This may result in part from the fact that fast-evolving markers are characterized for isolates sampled over a period; samples are not contemporaneous. In conjunction with our efforts to define similarity measures, we are also working to modify phylogenetic inference tools used in clinical and molecular epidemiologic settings to better handle the data typically analyzed in molecular fingerprint management and analysis software. Appendix Analysis of Error in Data To characterize error in fragment length measurement, pairs of 12 band 7H37Rv lanes are compared; the difference between fragment lengths of each band is calculated as follows. Let w(sub i,b) be the measured fragment length of band b of lane i. In general, let B be the number of bands. To compare lanes i and j, we can calculate the absolute difference between the measured lengths, d(sub i,j)(b) = |w(sub i,b) - w(sub j,b)|. Let lower case DELTA(sub i,j) be an indicator that equals 1 when lanes i and j are from the same electrophoresis gel and 0 otherwise. Let n be the total number of replicate lanes. The mean absolute difference for a fragment over all pairwise comparisons of lanes from different gels is found by, (eq) d(b)(overscore)=1(over)t(sup i=n-1)EPSILON(sub i=1)(sup j=n)EPSILON(sub j>i) d(sub i,j)times (1-lower case delta(sup i,j)(sup 3) where (eq) t= EPSILON(sup i=n-1)(sub i=t) EPSILON(sup i=n-1) (sub i=t)(1-lower case delta(sup i,j) is the number of pairs of lanes from different gels. Similarly, we calculate the proportional difference between measurements for a fragment, (eq)r(sub i,j)(b)=|w(sub i,b)- w(sub j,b)| over (w(sub i,b)+ w(sub j,b))/2(sup 2) and its mean for comparisons between lanes from different gels, (eq) r(b)(overscore)=1(over)t(sup i=n-1)EPSILON(sub i=1)(sup j=n)EPSILON(sub j>i) r(sub i,j)times (1-lower case delta(sup i,j)(sub -) Calculating fragment measurement errors for the (eq) u= EPSILON(sup i=n-1)(sub i=1) EPSILON(sup j=n-1)(sub j>i)times lower case delta(sup i,j) pairwise comparisons of lanes from the same gel are also performed. Align-and-Count Matching Algorithm The Align-and-Count Method (ACM) for counting the matching bands between two lanes is defined as follows. Consider two lanes, lane A with m bands, w(sub A,x) , x = 1,...,m, and lane B with n bands, wB,y , y = 1,...,n. We count the mutually closest measured fragment lengths within a proportional deviation factor, [delta] , over a range of alignments (Figure 1). Alignments are searched by scaling the fragment lengths. Multiplying the fragment lengths in a lane by a scaling factor reflects the phenomenon that error in fragment lengths is proportional to fragment length and is positively correlated for fragments in a lane. Define (eq)Match [w(sub A,x),w(sub B,y)]={1 over 0 if |w (sub B,y)- w (sub A,x)|