Data on metabolic pathways were taken from the KEGG database. These data include definitions of metabolic reactions, reaction annotations for individual organisms, and the organization of reactions into metabolic pathways. Metabolic pathways were modeled as directed, node-labeled graphs. Distance measures were developed based on the theory of edit distances on graphs. The distance measures were proven to be metrics and, where appropriate, correspondences between the implemented edit-distance-based measures and previously published distance measures were shown. The comparative analysis approach comprises the following steps. First, pairwise distances are calculated between the pathway variants of the set of organisms to be analyzed. Then, the organisms are clustered based on these distances using several clustering approaches, which yields one dendrogram per clustering method. Subsequently, each dendrogram is cut at a certain height, which partitions the analyzed organisms into groups. The number of groups is chosen as the value that maximizes the cophenetic correlation coefficient between the cophenetic matrix of the partitioning and the distance matrix. Finally, the differential reaction content is calculated for each pair of groups and can either be presented in a table or visualized on KEGG’s metabolic pathway maps. The entire functionality is implemented as a web-based application called Comparative Pathway Analyzer, which is publicly accessible.
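The steps above can be illustrated with a minimal Python sketch. It is not the Comparative Pathway Analyzer implementation, and it rests on several assumptions that the text does not confirm: pathway variants are reduced to plain sets of reaction identifiers, the edit distance uses unit costs (so it equals the size of the symmetric difference), the cophenetic matrix of a partitioning is taken to be the 0/1 matrix marking whether two organisms fall into different groups, and the differential reaction content of a group pair is read as "present in every organism of one group and in none of the other". Organism names and reaction identifiers are hypothetical.

```python
from itertools import combinations
from math import sqrt

def reaction_distance(a, b):
    """Unit-cost edit distance between two reaction sets (insertions + deletions)."""
    return len(a ^ b)

def distance_matrix(variants):
    """Pairwise distances, keyed by alphabetically ordered organism pairs."""
    names = sorted(variants)
    dist = {(p, q): reaction_distance(variants[p], variants[q])
            for p, q in combinations(names, 2)}
    return names, dist

def average_linkage(names, dist):
    """Naive average-linkage clustering; returns the partition after every merge."""
    def d(c1, c2):
        values = [dist[tuple(sorted((x, y)))] for x in c1 for y in c2]
        return sum(values) / len(values)
    clusters = [frozenset([n]) for n in names]
    levels = [list(clusters)]
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: d(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        levels.append(list(clusters))
    return levels

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / sqrt(vx * vy)

def best_partition(names, dist, levels):
    """Pick the partition whose 0/1 group matrix correlates best with the distances."""
    pairs = list(combinations(names, 2))
    observed = [dist[p] for p in pairs]
    def score(partition):
        group_of = {n: i for i, group in enumerate(partition) for n in group}
        indicator = [0 if group_of[p] == group_of[q] else 1 for p, q in pairs]
        return pearson(indicator, observed)
    return max(levels, key=score)

def differential_content(variants, g1, g2):
    """Reactions found in every organism of g1 and in no organism of g2 (assumed reading)."""
    in_all_g1 = set.intersection(*(variants[o] for o in g1))
    in_any_g2 = set.union(*(variants[o] for o in g2))
    return in_all_g1 - in_any_g2

# Hypothetical pathway variants: organism -> set of annotated reaction IDs.
variants = {
    "orgA": {"R1", "R2", "R3"},
    "orgB": {"R1", "R2", "R3", "R4"},
    "orgC": {"R1", "R5", "R6"},
    "orgD": {"R1", "R5", "R6", "R7"},
}

names, dist = distance_matrix(variants)
levels = average_linkage(names, dist)
best = best_partition(names, dist, levels)
for g1, g2 in combinations(best, 2):
    print(sorted(g1), "only:", sorted(differential_content(variants, g1, g2)))
    print(sorted(g2), "only:", sorted(differential_content(variants, g2, g1)))
```

Enumerating the partition after every merge stands in for cutting the dendrogram at each possible height. A real analysis would more likely rely on an established clustering library than on this quadratic toy loop.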
Several distance measures were implemented: reaction-based distance measures, metabolite-based distance measures, combined reaction- and metabolite-based distance measures, and distance measures that take neighboring reactions into account when calculating the edit cost for the deletion or insertion of a reaction. All distance measures were evaluated against each other to find the one most adequate for the given data. Because no standard of truth existed, the evaluation was performed on two manually designed test scenarios. Three clustering techniques, namely average-linkage and complete-linkage agglomerative clustering and Ward clustering, were evaluated for their suitability to group organisms based on the distances between their pathway variants. Furthermore, as an application example, five Corynebacteria were compared against each other using the newly developed approach, and the results were discussed in light of their biological relevance.
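To make the difference between the families of distance measures concrete, the following Python sketch compares a reaction-based, a metabolite-based, and a combined measure on toy data. All definitions here are assumptions made for illustration: unit edit costs, the combined measure as a plain sum, and a hypothetical reaction table with invented identifiers. The neighbor-aware measures are left out because their cost function is not specified in the text. The final helper only spot-checks symmetry, zero self-distance, and the triangle inequality on the given examples; it is no substitute for the proofs mentioned above.

```python
from itertools import product

# Hypothetical reaction definitions: ID -> (substrates, products).
REACTIONS = {
    "R1": ({"glc"}, {"g6p"}),
    "R2": ({"g6p"}, {"f6p"}),
    "R3": ({"f6p"}, {"fbp"}),
    "R4": ({"g6p"}, {"6pg"}),
}

def metabolites(reactions):
    """All metabolites consumed or produced by a set of reactions."""
    found = set()
    for r in reactions:
        substrates, products = REACTIONS[r]
        found |= substrates | products
    return found

def reaction_distance(a, b):
    """Number of reactions that must be inserted or deleted to turn a into b."""
    return len(a ^ b)

def metabolite_distance(a, b):
    """Same idea, applied to the metabolite sets induced by the reactions."""
    return len(metabolites(a) ^ metabolites(b))

def combined_distance(a, b):
    """Assumed combination: the plain sum of both measures."""
    return reaction_distance(a, b) + metabolite_distance(a, b)

def passes_metric_spot_check(distance, samples):
    """Check d(x, x) = 0, symmetry, and the triangle inequality on the samples only."""
    for x, y, z in product(samples, repeat=3):
        if distance(x, x) != 0 or distance(x, y) != distance(y, x):
            return False
        if distance(x, z) > distance(x, y) + distance(y, z):
            return False
    return True

# Hypothetical pathway variants of three organisms.
full = {"R1", "R2", "R3", "R4"}
upper = {"R1", "R2", "R3"}
branch = {"R1", "R4"}

for name, measure in [("reaction", reaction_distance),
                      ("metabolite", metabolite_distance),
                      ("combined", combined_distance)]:
    print(name, measure(upper, branch),
          passes_metric_spot_check(measure, [full, upper, branch]))
```

On this toy data the reaction-based and metabolite-based measures happen to agree. They diverge as soon as two different reactions touch the same metabolites, which is one reason a comparison of the measures on designed test scenarios is informative.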