VF2++ – a subgraph isomophism algorithm for molecular pattern matching
In a previous blog entry, we introduced COIN-OR::LEMON (http://lemon.cs.elte.hu), the open source C++ graph and network modelling library. As we mentioned there, this library targets both academic and industrial applications. In this post a use-case for the latter is shown.
Last year, QuantumBio Inc.1 approached us with a simple request: they were using LEMON successfully in their product, but they needed a subgraph isomorphism algorithm implementation which was missing from the library at that time. It is a well known and well defined problem, and – although hard from theoretical point-of-view – there is a common algorithm called VF2 that is able to efficiently solve large instances in practice. The specific request was to implement this exact algorithm.
There were some special requests, though. For example, the method would be used for matching molecular substructures, thus a “node labeling” extension was needed to ensure that each atom is matched with a same one. Luckily, this extra feature could be added easily.
Needless to say, performance is a crucial issue in an application running this algorithm possibly thousands of times in a short time. Thus, we went further than just implementing the standard VF2, but we also made a couple of significant improvements on the original algorithm.
This new algorithm, called VF2++, runs significantly faster on most test cases and performs especially well on special graph classes stemming from biological questions. VF2++ handles graphs of thousands of nodes in practically near linear time including preprocessing. In fact, we can say that it is by far the fastest existing algorithm as far as biological graphs are concerned.
As a demonstration, as depicted in the figures below, we compared the performance of VF2++ to VF2Plus, another recent independent improvement to VF2. The input instances came from the Protein Data Bank (http://www.rcsb.org/pdb).
The reason why VF2++ is superior to VF2 is twofold. Firstly, taking into account the structure and the node labeling of the graph, VF2++ determines a state order in which most of the unfruitful branches (i.e. partial matchings) of the search space can be pruned immediately. Secondly, introducing more efficient – nevertheless still easier to compute – cutting rules reduces the chance of going astray in a wrong direction.
In addition to the labeled subgraph isomorphism, specialized versions for induced subgraph isomorphism and for graph isomorphism have also been designed and implemented. VF2++ has gained a runtime improvement of one order of magnitude respecting induced subgraph isomorphism and a better asymptotical behaviour in the case of graph isomorphism problem.
Finally, the best news is that – also by the courtesy of QuantumBio Inc. – all of these implementations are available open source (via the very permissive Boost license) as the part of the LEMON library, and the future improvements will also be continuously incorporated.
Alpár Jüttner, Péter Madarasi
Department of Operations Research
Eötvös Loránd University
- QuantumBio Inc., (http://www.quantumbioinc.com/) headquartered in State College, Pennsylvania has been an important vendor of computer-assisted molecular modeling and computer-assisted drug design solutions for pharmaceutical, biotechnology, and life science research organizations ever since 2002.