How to use RDKit to calculate molecular fingerprints and similarity for a list of SMILES structures?

Question 1

How to use RDKit to calculate molecular fingerprints and similarity for a list of SMILES structures?

python csv similarity fingerprint rdkit

Anna Zhou · Aug 4, 2018 · Viewed 7.4k times · Source

Answer

Edited the answer to address all comments.

RDKit has a bulk function for similarity, so you can compare one fingerprint against a whole list of fingerprints in a single call. To compare every pair, just loop over the list of fingerprints.
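As a rough illustration of what such a bulk call computes, the sketch below implements the Tanimoto coefficient on plain Python sets of on-bit positions and compares one query against a list. The names `tanimoto` and `bulk_tanimoto`, the toy bit sets, and the value returned for two empty sets are assumptions made for this example, not RDKit's implementation. In RDKit the corresponding call is `DataStructs.BulkTanimotoSimilarity(query_fp, list_of_fps)`.

```python
def tanimoto(a, b):
    """Tanimoto coefficient: shared on-bits divided by all distinct on-bits."""
    union = len(a | b)
    # Returning 0.0 for two empty sets is a convention chosen for this sketch.
    return len(a & b) / union if union else 0.0


def bulk_tanimoto(query, targets):
    """Compare one fingerprint against a list, giving one score per target."""
    return [tanimoto(query, t) for t in targets]


# Toy fingerprints: each set holds the positions of the bits that are on.
query = {1, 2, 3, 4}
targets = [{1, 2, 3, 4}, {3, 4, 5, 6}, {7, 8}]

scores = bulk_tanimoto(query, targets)
print(scores)
```

The result is a list with one score per target, in the same order as the target list, which is why the position in the result can be used to look up the matching SMILES.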

If the CSV files look like this:

First CSV, with an invalid SMILES:

smiles,value,value2
CCOCN(C)(C)C,0.25,A
CCO,1.12,B
COC,2.25,C

Second CSV, with correct SMILES:

smiles,value,value2
CCOCC,0.55,D
CCCO,2.58,E
CCCCO,5.01,F

This is how to read the SMILES, drop the invalid ones, compute the fingerprint similarity for every pair without duplicates, and save the sorted values.

from rdkit import Chem
from rdkit import DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols
import pandas as pd

# read and concatenate the CSV files
df_1 = pd.read_csv('first.csv')
df_2 = pd.read_csv('second.csv')
df_3 = pd.concat([df_1, df_2])

# validate the SMILES and make a list of the canonical forms
df_smiles = df_3['smiles']
c_smiles = []
for ds in df_smiles:
    try:
        # CanonSmiles raises if the SMILES cannot be parsed
        cs = Chem.CanonSmiles(ds)
        c_smiles.append(cs)
    except Exception:
        print('Invalid SMILES:', ds)
print()

# make a list of mols
ms = [Chem.MolFromSmiles(x) for x in c_smiles]

# make a list of fingerprints (fp)
fps = [FingerprintMols.FingerprintMol(x) for x in ms]

# lists that will become the dataframe columns
qu, ta, sim = [], [], []

# compare all fingerprints pairwise without duplicates
for n in range(len(fps)-1): # -1 because the last fp has nothing left to compare against
    s = DataStructs.BulkTanimotoSimilarity(fps[n], fps[n+1:]) # compare fp n with all later fps
    print(c_smiles[n], c_smiles[n+1:]) # which mol is compared with which group
    # collect the SMILES and values
    for m in range(len(s)):
        qu.append(c_smiles[n])
        ta.append(c_smiles[n+1+m])
        sim.append(s[m])
print()

# build the dataframe and sort it
d = {'query':qu, 'target':ta, 'Similarity':sim}
df_final = pd.DataFrame(data=d)
df_final = df_final.sort_values('Similarity', ascending=False)
print(df_final)

# save as csv
df_final.to_csv('third.csv', index=False, sep=',')

The printed output:

Invalid SMILES: CCOCN(C)(C)C

CCO ['COC', 'CCOCC', 'CCCO', 'CCCCO']
COC ['CCOCC', 'CCCO', 'CCCCO']
CCOCC ['CCCO', 'CCCCO']
CCCO ['CCCCO']

   query target  Similarity
9   CCCO  CCCCO    0.769231
2    CCO   CCCO    0.600000
1    CCO  CCOCC    0.500000
7  CCOCC   CCCO    0.466667
3    CCO  CCCCO    0.461538
8  CCOCC  CCCCO    0.388889
4    COC  CCOCC    0.333333
5    COC   CCCO    0.272727
0    CCO    COC    0.250000
6    COC  CCCCO    0.214286
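The loop above pools both files and compares each molecule with every later one, so each unordered pair appears exactly once. The same enumeration can be sketched with `itertools.combinations`. If only pairs that cross the two files are wanted (an assumption about what the question intends), `itertools.product` over the two lists gives those instead. The variable names and the split into `list_1` and `list_2` below are illustrative.

```python
from itertools import combinations, product

# The valid SMILES from the example, in the order they were read.
smiles = ['CCO', 'COC', 'CCOCC', 'CCCO', 'CCCCO']

# Every unordered pair once: the query always comes earlier in the list.
pairs = list(combinations(smiles, 2))
print(len(pairs))            # n * (n - 1) / 2 pairs for n molecules
print(pairs[0], pairs[-1])

# Only pairs that take one molecule from each file.
list_1 = ['CCO', 'COC']
list_2 = ['CCOCC', 'CCCO', 'CCCCO']
cross_pairs = list(product(list_1, list_2))
print(len(cross_pairs))      # len(list_1) * len(list_2) pairs
```

The pair count grows quadratically: 10,000 molecules pooled together give roughly 50 million pairs, so for large inputs it may be worth writing results out in chunks or keeping only pairs above a similarity threshold.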

Question 2

I'm using RDKit to calculate molecular similarity, based on the Tanimoto coefficient, between two lists of molecules given as SMILES structures. I'm already able to extract the SMILES structures from two separate CSV files. I'm wondering how to pass these structures to the fingerprint module in RDKit, and how to calculate the similarity pairwise, one by one, between the two lists of molecules.

from rdkit import Chem
from rdkit import DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols
ms = [Chem.MolFromSmiles('CCOC'), Chem.MolFromSmiles('CCO'), ... Chem.MolFromSmiles('COC')]
fps = [FingerprintMols.FingerprintMol(x) for x in ms]
DataStructs.FingerprintSimilarity(fps[0],fps[1])

I want to put all the SMILES structures I have (over 10,000) into the 'ms' list and get their fingerprints. Then I'll compare the similarity between each pair of molecules from the two lists. Maybe a for loop is needed here?

Thanks in advance!

I used a pandas dataframe to select and print the lists with my structures, and I saved them into list_1 and list_2. When execution reaches the ms1 line, it raises the following error:

TypeError: No registered converter was able to produce a C++ rvalue of type std::__cxx11::basic_string<wchar_t, 
std::char_traits<wchar_t>, std::allocator<wchar_t> > from this Python object of type float

Then I checked the files and there are only SMILES in the smiles column. But when I manually put some molecule structures into the lists for testing, there are still errors regarding fpArgs['minSize'].

For example, the SMILES for gadodiamide is "O=C1[O-][Gd+3]234567[O]=C(C[N]2(CC[N]3(CC([O-]4)=O)CC[N]5(CC(=[O]6)NC)CC(=O)[O-]7)C1)NC", and the error is as follows (when running the fps line):

ArgumentError: Python argument types in
rdkit.Chem.rdmolops.RDKFingerprint(NoneType, int, int, int, int, int, float, int)
did not match C++ signature:
RDKFingerprint(RDKit::ROMol mol, unsigned int minPath=1, 
unsigned int maxPath=7, unsigned int fpSize=2048, unsigned int nBitsPerHash=2, 
bool useHs=True, double tgtDensity=0.0, unsigned int minSize=128, bool branchedPaths=True, 
bool useBondOrder=True, boost::python::api::object atomInvariants=0, boost::python::api::object fromAtoms=0, 
boost::python::api::object atomBits=None, boost::python::api::object bitInfo=None).
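Both tracebacks are consistent with inputs that never became molecule objects. pandas reads empty CSV cells as NaN, which is a float, and `Chem.MolFromSmiles` returns `None` for a SMILES it cannot parse, so the fingerprint function ends up receiving `NoneType`. The sketch below filters out both cases before fingerprinting. The helper `parse_valid` and its `parse` parameter are hypothetical, and `stub_parse` is only a stand-in so the example runs without RDKit.

```python
def parse_valid(values, parse):
    """Return (kept, rejected); kept holds (smiles, mol) pairs that parsed."""
    kept, rejected = [], []
    for value in values:
        # Empty CSV cells arrive from pandas as NaN, a float rather than a str.
        if not isinstance(value, str) or not value.strip():
            rejected.append(value)
            continue
        mol = parse(value)
        # The parser signals failure by returning None instead of raising.
        if mol is None:
            rejected.append(value)
            continue
        kept.append((value, mol))
    return kept, rejected


def stub_parse(smiles):
    """Stand-in parser for this demo: rejects anything containing '!'."""
    return None if '!' in smiles else smiles.upper()


kept, rejected = parse_valid(['CCO', float('nan'), 'bad!', 'COC'], stub_parse)
print([s for s, _ in kept])
print(len(rejected))
```

With RDKit installed, passing `Chem.MolFromSmiles` as `parse` should apply the same filter to real molecules, and printing `rejected` shows which rows of the CSV need attention.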

How can I include the molecule names in the output file along with the similarity values if the original CSV file looks like this?

names,smiles,value,value2
molecule1,CCOCN(C)(C),0.25,A
molecule2,CCO,1.12,B
molecule3,COC,2.25,C

I added this code to include the molecule names in the output file, and there is an array value error regarding the names (particularly for d2):

name_1 = df_1['id1']
name_2 = df_2['id2']
name_3 = pd.concat([name_1, name_2])
# create a list for the dataframe
d1, qu, d2, ta, sim = [], [], [], [], []
for n in range(len(fps)-1): 
    s = DataStructs.BulkTanimotoSimilarity(fps[n], fps[n+1:]) 
    #print(c_smiles[n], c_smiles[n+1:])
    for m in range(len(s)):
        qu.append(c_smiles[n])
        ta.append(c_smiles[n+1:][m])
        sim.append(s[m])
        d1.append(name_3[n])
        d2.append(name_3[n+1:][m])
    #print()
d = {'ID_1':d1, 'query':qu, 'ID_2':d2, 'target':ta, 'Similarity':sim}
df_final = pd.DataFrame(data=d)
df_final = df_final.sort_values('Similarity', ascending=False)
for index, row in df.iterrows():
    print (row["ID_1"], row["query"], row["ID_2"], row["target"], row["Similarity"])
print(df_final)
# save as csv
df_final.to_csv('RESULT_3.csv', index=False, sep=',')
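A likely cause of the error, offered as an assumption since the full traceback is not shown: `pd.concat` keeps each file's original row labels, so a label lookup such as `name_3[n]` can match more than one row, and the names are not filtered when invalid SMILES are dropped, so the two lists drift out of step. The loop at the end also iterates over `df`, where `df_final` is presumably meant. One way around the alignment problem is to keep each name together with its SMILES and fingerprint as a single record. The sketch below uses hypothetical records and toy set-based fingerprints rather than RDKit output.

```python
from itertools import combinations


def pair_rows(records, similarity):
    """Build one output row per unordered pair of (name, smiles, fp) records."""
    rows = []
    for (name_a, smi_a, fp_a), (name_b, smi_b, fp_b) in combinations(records, 2):
        rows.append({
            'ID_1': name_a, 'query': smi_a,
            'ID_2': name_b, 'target': smi_b,
            'Similarity': similarity(fp_a, fp_b),
        })
    # Highest similarity first, matching the sorted output of the answer.
    rows.sort(key=lambda r: r['Similarity'], reverse=True)
    return rows


def toy_similarity(a, b):
    """Set-based Tanimoto, standing in for an RDKit similarity call."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Hypothetical records; invalid SMILES would be dropped before this point,
# together with their names.
records = [
    ('molecule2', 'CCO', {1, 2, 3}),
    ('molecule3', 'COC', {3, 4}),
    ('molecule4', 'CCCO', {1, 2, 3, 5}),
]

rows = pair_rows(records, toy_similarity)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)` and written out with `to_csv` as before.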
