[Zinc-fans] multiple structures for same id

Pamela Seida pseida at adolor.com
Wed Oct 11 14:16:23 PDT 2006


I recently downloaded the SD files for the "everything" subset (subset 10).  I did some fairly extensive filtering of the single-representation set (10_p0) based on calculated properties and undesirable functional groups, and found something disturbing while checking for tautomers in the database.  I look for tautomers using Pipeline Pilot: I calculate the original SMILES string from each structure, calculate PP's "canonical tautomer", and then look for entries that share a canonical tautomer but have different original SMILES strings.  I did find some tautomers in ZINC, but that wasn't the disturbing part.
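
For anyone who wants to repeat the grouping step outside Pipeline Pilot, here is a minimal Python sketch. It assumes the original SMILES and the canonical-tautomer SMILES have already been computed and exported by whatever toolkit is in use; the function name, the record layout, and the IDs and tautomer key in the example are made up for illustration and are not real ZINC entries.

```python
from collections import defaultdict


def find_tautomer_groups(records):
    """Group records by canonical tautomer and keep only the groups that
    contain more than one distinct original SMILES string.

    records: iterable of (compound_id, original_smiles, canonical_tautomer)
    tuples, with both SMILES columns computed beforehand by a toolkit.
    Returns {canonical_tautomer: {original_smiles: [compound_id, ...]}}.
    """
    by_tautomer = defaultdict(dict)
    for compound_id, original_smiles, canonical_tautomer in records:
        ids = by_tautomer[canonical_tautomer].setdefault(original_smiles, [])
        ids.append(compound_id)
    # A group with a single original SMILES is just one structure (possibly
    # listed several times), not a set of tautomers.
    return {t: s for t, s in by_tautomer.items() if len(s) > 1}


# Hypothetical input: 2-hydroxypyridine and 2-pyridone share a tautomer key.
example = [
    ("ID_A", "Oc1ccccn1", "O=c1cccc[nH]1"),
    ("ID_B", "O=c1cccc[nH]1", "O=c1cccc[nH]1"),
    ("ID_C", "CCO", "CCO"),
]
groups = find_tautomer_groups(example)
```

In this example only the pyridone key is reported, with both original SMILES strings and the IDs that carry them; the ethanol record is dropped because it has no second form.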

I found 183 compounds with the same structure in the SD files of the everything subset!  There are two entries for each compound: one with the "common" structure and one with the correct structure (the one you see if you look up the IDs on the ZINC website).  These same compounds appear 2 or 3 times in the single-representation SMILES file (10_p0), but there the structures differ only by the presence or absence of specified double-bond stereochemistry in the SMILES string.  There may be more, but I haven't found them yet (I'm doing some additional analysis right now).
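
A rough way to flag the SMILES-file case (same ID, strings differing only in double-bond stereochemistry) is sketched below in Python. It relies on the assumption that all strings were written by the same tool with the same atom ordering, so that removing the directional bond marks `/` and `\` leaves identical text; a toolkit-based canonical comparison would be more robust. The function names and example IDs are hypothetical.

```python
from collections import defaultdict


def strip_double_bond_stereo(smiles):
    """Remove the '/' and '\\' directional bond marks from a SMILES string.

    Tetrahedral '@' marks are left untouched.
    """
    return smiles.replace("/", "").replace("\\", "")


def find_stereo_only_duplicates(entries):
    """Find IDs listed with several SMILES that differ only in '/' and '\\'.

    entries: iterable of (compound_id, smiles) pairs.
    Returns {(compound_id, stripped_smiles): [smiles variants, sorted]}.
    """
    variants = defaultdict(set)
    for compound_id, smiles in entries:
        key = (compound_id, strip_double_bond_stereo(smiles))
        variants[key].add(smiles)
    return {key: sorted(v) for key, v in variants.items() if len(v) > 1}


# Hypothetical entries: ID_1 appears with and without specified E/Z geometry.
entries = [
    ("ID_1", "C/C=C/C"),
    ("ID_1", "CC=CC"),
    ("ID_2", "CCO"),
]
duplicates = find_stereo_only_duplicates(entries)
```

Here `duplicates` contains a single key, `("ID_1", "CC=CC")`, listing both the stereo-specified and the unspecified string for that ID.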

I can send you a list of the affected IDs and the SD files they're found in if you'd like.

I found a lot of multiple entries in these files, even though there is supposed to be only one representation per compound (and I saw something in the archive about this problem).  I neutralize acids and bases before doing my filtering, and since I had assumed the multiple structures were probably just two tautomers that would both be present near pH 7, I figured it would be fine to take the first structure I found.  But I guess that wasn't a safe assumption.  I'm going to start over using the SMILES file, which doesn't seem to have this problem.

Any thoughts on what might have happened and how to fix it?

Thanks,
Pam

*********************************************
Pamela R. Seida, Ph.D.
Research Fellow, Computational Chemistry
Adolor Corporation
700 Pennsylvania Drive
Exton, PA 19341
Ph: 484-595-1078  Fax: 484-595-1551  
email: pseida at adolor.com
 