Tuesday, December 6, 2011

First Light

As promised, here are the first results for substructure searching c1ccccc1Cl on the first 100000 compounds from PubChem: select * from pubchem.compound where compound >= 'c1ccccc1Cl'::molecule

Substructure search speed


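| Rank | Build | Storage | Fingerprint | Hits (no index) | Time (no index) | Hits (with index) | Time (with index) |
|------|-------|---------|-------------|-----------------|-----------------|-------------------|-------------------|
| 1 | OpenBabel | binary+SMILES | FP2 | 8070 | 9236 ms | 8067 | 936 ms |
| 2 | Indigo | binary | ext+sub | 8049 | 24821 ms | 8049 | 2418 ms |
| 3 | OpenBabel | SMILES | FP2 | 8070 | 57971 ms | 8067 | 5432 ms |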

I've checked why the OpenBabel FP2 fingerprint eliminates three structures that would otherwise pass: with the VF2 OBIsomorphismMapper instead of OBSmartsPattern, the search also returns 8067 hits without index. But it's about four times slower: 90526 ms without index, 8798 ms with index.
The structures in question are 80944, 83450 and 99925, and I'm pretty sure the difference is caused by differences in aromaticity detection.
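
To put the timing table into perspective, the following small sketch computes how much the index speeds up each configuration. The numbers are copied from the table; the dictionary keys and the function name are only illustrative.

```python
# Timings in milliseconds, copied from the table: (without index, with index).
timings_ms = {
    "OpenBabel binary+SMILES / FP2": (9236, 936),
    "Indigo binary / ext+sub": (24821, 2418),
    "OpenBabel SMILES / FP2": (57971, 5432),
}


def index_speedup(no_index_ms, with_index_ms):
    """Return how many times faster the indexed search ran."""
    return no_index_ms / with_index_ms


speedups = {name: index_speedup(*pair) for name, pair in timings_ms.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x faster with index")
```

In all three configurations the index gives roughly a tenfold speedup for this single query.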

Stereochemistry in substructure searches


| Check | Query | Expected | Indigo binary | OpenBabel SMILES | OpenBabel binary |
|-------|-------|----------|---------------|------------------|------------------|
| R/S different | select 'C[C@H]([C@@H](C(=O)O)N)O'::molecule <= 'O=C(O)[C@H](N)[C@@H](O)C'::molecule | false | pass | fail | fail |
| R/S same | select 'C[C@H]([C@@H](C(=O)O)N)O'::molecule <= 'C[C@H]([C@@H](C(=O)O)N)O'::molecule | true | pass | pass | pass |
| E/Z different | select 'C(=C\Cl)\Cl'::molecule <= 'C(=C/Cl)\Cl'::molecule | false | pass | fail | fail |

OpenBabel fails both the 'E/Z different' and the 'R/S different' checks. These are known issues up to version 2.3.1, but the 'R/S different' failure is the more disturbing one. For SMILES I can reproduce it with obgrep, so it's not my code causing it.

Indigo has no obvious issues with matching R/S and E/Z stereochemistry. Speedwise, OpenBabel with binary storage would be the leader of the pack, but it behaves inconsistently compared with its SMILES storage sibling, so the serialization/deserialization code is clearly missing something. But I've found a simple workaround: if the query contains chirality information, the search uses SMILES, otherwise binary. With this workaround, OpenBabel with binary storage is now the leader of the pack speedwise.
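
The workaround can be sketched roughly as follows. This is not the actual pgchem code (which is C), only a minimal Python illustration; the function names are hypothetical, and treating the E/Z bond markers as "chirality information" alongside '@' is my assumption.

```python
# Characters that carry stereo information in a SMILES string:
# '@' marks tetrahedral centres (R/S), '/' and '\' mark double-bond
# geometry (E/Z). Counting '/' and '\' here is an assumption.
STEREO_MARKERS = ("@", "/", "\\")


def has_stereo_markers(query_smiles: str) -> bool:
    """Return True if the query SMILES contains any stereo descriptor."""
    return any(marker in query_smiles for marker in STEREO_MARKERS)


def choose_storage(query_smiles: str) -> str:
    """Pick the storage to match against (hypothetical helper).

    Queries with stereo information go to the slower SMILES storage,
    all other queries to the faster binary storage.
    """
    return "smiles" if has_stereo_markers(query_smiles) else "binary"


print(choose_storage("c1ccccc1Cl"))
print(choose_storage("C[C@H]([C@@H](C(=O)O)N)O"))
```

A plain character test is enough here because these three characters have no other meaning in SMILES.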

Fingerprint efficiency


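| Fingerprint type | Candidates screened | Hits matched | False positives | Efficiency |
|------------------|---------------------|--------------|-----------------|------------|
| ext+sub | 8090 | 8049 | 41 | 0.995 |
| FP2 | 8145 | 8067 | 78 | 0.99 |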

Pretty close.
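
For reference, the efficiency column is consistent with hits matched divided by candidates screened. A short sketch that reproduces the numbers from the table (the function and variable names are only illustrative):

```python
def screening_efficiency(candidates: int, hits: int) -> float:
    """Fraction of fingerprint-screened candidates that are real hits."""
    return hits / candidates


# (candidates screened, hits matched), copied from the table.
results = {
    "ext+sub": (8090, 8049),
    "FP2": (8145, 8067),
}

for name, (candidates, hits) in results.items():
    false_positives = candidates - hits
    efficiency = screening_efficiency(candidates, hits)
    print(f"{name}: {false_positives} false positives, efficiency {efficiency:.3f}")
```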

7 comments:

  1. Hello! Could you explain how you tested the "R/S same" check? I have checked it using Indigo and it works well: Indigo finds a match of C[C@H]([C@@H](C(=O)O)N)O in the same molecule.

  2. I'm currently checking this. I suspect it is a bug in my code...

  3. Yes, it was a bug on my side. Fixed it and updated the post.

  4. Hello!
    I have tested substructure searching c1ccccc1Cl on the first 100000 compounds from the PubChem snapshot dated May with the Bingo-SqlServer cartridge. It returned 8688 results, which differs from yours. I suppose we may be using different versions of PubChem. Can you publish the compounds used in your tests?
    Also I'm interested in efficiency of FP2 fingerprints. In Bingo 8705 compounds passed fingerprint test and 8688 were finally matched, so in this case fingerprints screening efficiency is 0.998. Can you provide such results for FP2?

    As for time measurements, I recommend you to install Bingo-PostgreSQL cartridge and compare it to pgchem::tigress. Indigo is simplified high level C API over low level C++ API, which is used in Bingo with code optimizations, so that Bingo should be faster.

    Michael Kvyatkovskiy, GGA Software.

  5. ergo, do you have plans to test some set of queries? Your single query is not very representative.

    Mikhail Rybalkin, GGA Software.

  6. "Also I'm interested in efficiency of FP2 fingerprints. In Bingo 8705 compounds passed fingerprint test and 8688 were finally matched, so in this case fingerprints screening efficiency is 0.998. Can you provide such results for FP2?"

    Done.

    "Can you publish compounds used in your tests?"

    It's a 55 MB archive. I'll see where I can host it...

    "As for time measurements, I recommend you to install Bingo-PostgreSQL cartridge and compare it to pgchem::tigress. Indigo is simplified high level C API over low level C++ API, which is used in Bingo with code optimizations, so that Bingo should be faster."

    This was not intended as a shootout between Bingo and pgchem :-), only as a comparison between Indigo and OpenBabel as alternative matching engines for pgchem. Since I cannot use C++ directly in PostgreSQL, OpenBabel is called through a C wrapper library, like Indigo, so I think from pgchem's perspective this is the correct way to compare them.

    "Your single query is not very representative."

    I agree, but the test data used is also not representative. The question is: what is 'representative'? I'd say this depends on the type of system; a catalogue of commercially available chemicals will contain different data than a medchem screening library, and the typical queries will be different too. My approach is sufficient to roughly show the differences between the various alternatives, I'd say, and that was the only intention, not an in-depth comparison between Indigo and OpenBabel.

    But I'm open to suggestions. What would a representative dataset for general benchmarking be like? And what queries should be used?

    best regards,
    Ergo

  7. "This was not intended as a shootout between Bingo and pgchem :-)"

    Indigo and Bingo share the same core code. The simplified Indigo API was designed for easy access to the underlying cheminformatics algorithms, but it is not heavily optimized. For Bingo we have also developed an optimized C API for sequential processing, and we are thinking about documenting this API as was done for Indigo. A comparison with Bingo could show us whether it is worth it or not.

    "What would a representative dataset for general benchmarking be like? And what queries should be used?"

    This is a very interesting question, and I do not know the answer to it. Just yesterday Andrew Dalke asked such a question on the BlueObelisk forum. At the least, I would suggest testing fingerprint efficiency on different queries, because I'm sure that the Indigo fingerprint is better than FP2.

    There is a small set of queries used in the article "Chemical substructure search in SQL" by Golovin and Henrick. These queries were later reused in other articles:
    GH1 ONC1CC(C(O)C1O)[n]2cnc3c(NC4CC4)ncnc23
    GH2 Nc1ncnc2[n]cnc12
    GH3 CNc1ncnc2[n](C)cnc12
    GH4 Nc1ncnc2[n](cnc12)C3CCCC3
    GH5 CC12CCC3C(CCC4=CC(O)CCC34C)C1CCC2
    GH6 OC2=CC(=O)c1c(cccc1)O2
    GH7 Nc1nnc(S)s1
    GH8 C1C2SCCN2C1
    GH9 CP(O)(O)=O
    GH10 CCCCCP(O)(O)=O
    GH11 N2CCC13CCCCC1C2Cc4c3cccc4
    GH12 s1cncc1
    GH13 C34CCC1C(CCC2CC(=O)CCC12)C3CCC4
    GH14 CCCCCCCCCCCP(O)(O)=O
    GH15 CC1CCCC1
    GH16 CCC1CCCC1
    GH17 CCCC1CCCC1

    I think that such a list is sufficient for a simple comparison.
