R & CDK





     Chemical Data Mining
    Open Source & Reproducible


              Rajarshi Guha

NIH Center for Advancing Translational Science


            August 21, 2012
            Philadelphia PA

Background




   Been using R since 2003; developed a number of R
   packages, mostly public
   Make extensive use of R at NCGC for small molecule &
   RNAi screening and high content analysis
   In parallel, need to manipulate and process chemical
   structure data


   How R is enhanced by other Open Source software
   How R enables and supports reproducible science

What is R?




    R is an environment for modeling
        Contains many prepackaged statistical and mathematical
        functions
        No need to implement anything (if you don’t want to)
    R is a matrix programming language that is good for
    statistical computing
        Full-fledged, interpreted language
        Well integrated with statistical functionality
        Easy to integrate with C, C++, Fortran
        Good for prototyping
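
As a small illustration of these points, here is a sketch that uses only base R and the `mtcars` dataset bundled with R. The dataset and variable names are chosen for illustration and are not part of the talk.

```r
# Matrix arithmetic is built into the language.
A <- matrix(c(2, 0, 0, 2), nrow = 2)   # 2 x 2 scaling matrix
v <- c(1, 3)
Av <- A %*% v                          # matrix-vector product

# Prepackaged statistics: fit a linear model without implementing anything.
fit <- lm(mpg ~ wt, data = mtcars)     # mtcars ships with R
slope <- coef(fit)[["wt"]]             # change in mpg per 1000 lb of weight
```

The whole model fit is a single call, which is what makes R convenient for prototyping.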

Why cheminformatics in R?




  Much of cheminformatics is data
  modeling and mining
  But the numeric data is derived
  from chemical structure
  Thus we want to work with
      molecules and their parts
      files containing molecules
      databases of molecules

Why cheminformatics in R?




    In contrast to bioinformatics (cf. Bioconductor), not a
    whole lot of cheminformatics support for R
    For cheminformatics and chemistry, relevant packages
    include
        rcdk, rpubchem, chemblr, fingerprint
        bio3d, ChemmineR, caret
    A lot of cheminformatics employs various forms of
    statistics and machine learning - R is exactly the
    environment for that
    We just need to add some chemistry capabilities to it

What does the CDK provide?



    Fundamental chemical objects
        atoms
        bonds
        molecules
    More complex objects are also available
        Sequences
        Reactions
        Collections of molecules
    Input/Output for a wide variety of molecular file formats
    Fingerprints and fragment generation
    Rigid alignments, pharmacophore searching
    Substructure searching, SMARTS support
    Molecular descriptors

Using the CDK in R


    Based on the rJava package
    Two R packages to install (not counting the
    dependencies)
    Provides access to a variety of CDK classes and methods
    Idiomatic R


    [Figure: software stack. From top to bottom: rcdk; CDK, Jmol and rpubchem; rJava, fingerprint and XML; the R programming environment]

Reading in data




    The CDK supports a variety of file formats
    rcdk loads all recognized formats automatically
    Data can be local or remote


library(rcdk)

## load local and remote files in one call
mols <- load.molecules(c("data/io/set1.sdf",
                         "data/io/set2.smi",
                         "http://rguha.net/rcdk/remote.sdf"))


    For large SDF files, use an iterating reader
    The loaded molecules are references to Java objects, so
    you can only work with them via rcdk functions
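
A minimal sketch of loading a file and then querying the result through rcdk, assuming the rcdk package and a working Java installation are available. The temporary SMILES file and its two molecules are made up for illustration. For large files, the rcdk documentation describes `iload.molecules` as the iterating counterpart of `load.molecules`.

```r
library(rcdk)

# Write a tiny SMILES file so the example is self-contained
smi_file <- tempfile(fileext = ".smi")
writeLines(c("CCO", "c1ccccc1"), smi_file)

# load.molecules() returns a list of references to CDK molecule objects
mols <- load.molecules(smi_file)

# The objects are opaque to plain R, so query them with rcdk functions
n_atoms <- sapply(mols, get.atom.count)
```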

Working with molecules




    Currently you can access atoms and bonds, and get certain
    atom properties and 2D/3D coordinates
    Since rcdk doesn’t cover the entire CDK API, you might
    need to drop down to the rJava level and make calls to
    the Java code by hand
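
The two levels of access can be sketched as follows, assuming rcdk and rJava are installed. Ethanol is an arbitrary example molecule, and `getBondCount` is the CDK method called directly on the underlying Java object.

```r
library(rcdk)

mol <- parse.smiles("CCO")[[1]]          # ethanol, as a CDK molecule object

# rcdk level: element symbol of each atom
symbols <- sapply(get.atoms(mol), get.symbol)

# rJava level: call the CDK method getBondCount() by hand
# ("I" is the JNI code for a Java int return value)
n_bonds <- rJava::.jcall(mol, "I", "getBondCount")
```

The rJava route needs the Java method name and return type spelled out, which is why the rcdk wrappers are preferable wherever they exist.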

Accessing fingerprints




    CDK provides several fingerprints
        Path-based, MACCS, E-State, PubChem
    Access them via get.fingerprint(...)
    Works on one molecule at a time; use lapply to process a
    list of molecules
    This method works with the fingerprint package
        Separate package to represent and manipulate fingerprint
        data from various sources (CDK, BCI, MOE)
        Uses C to perform similarity calculations
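
A short sketch of this pattern, assuming the rcdk and fingerprint packages are installed. The three SMILES strings are arbitrary examples, and MACCS is just one of the fingerprint types listed above.

```r
library(rcdk)
library(fingerprint)

smiles <- c("CCO", "CCN", "c1ccccc1")
mols <- parse.smiles(smiles)

# get.fingerprint() handles one molecule at a time, so map it over the list
fps <- lapply(mols, get.fingerprint, type = "maccs")

# Pairwise Tanimoto similarities, computed by the fingerprint package
sims <- fp.sim.matrix(fps, method = "tanimoto")
```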

Working with fingerprints



    The fingerprint package implements 28 similarity and
    dissimilarity metrics
    Easy to run enrichment studies
    We can compare datasets in O(n) time, using the “bit
    spectrum”

    [Figure: bit spectrum; x-axis: Bit Position (0 to 150), y-axis: Frequency (0.0 to 1.0)]

                 Guha, R., J. Comp. Aid. Molec. Des., 2008, 22, 367–384
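
To make the idea concrete, here is a base-R sketch of a Tanimoto similarity and a bit spectrum. It assumes fingerprints are stored as integer vectors of "on" bit positions. The function names are hypothetical, and Euclidean distance between spectra is one possible comparison, not necessarily the one used in the cited paper. The fingerprint package provides its own implementations, including `bit.spectrum`.

```r
# Tanimoto similarity: shared on-bits divided by all on-bits
tanimoto <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Bit spectrum: fraction of fingerprints in which each bit is set
bit_spectrum <- function(fps, nbits) {
  tabulate(unlist(fps), nbins = nbits) / length(fps)
}

# Compare two datasets through their spectra. Each fingerprint is visited
# once, so the cost grows linearly with the number of molecules.
spectrum_distance <- function(fps1, fps2, nbits) {
  sqrt(sum((bit_spectrum(fps1, nbits) - bit_spectrum(fps2, nbits))^2))
}

# Two toy datasets of 4-bit fingerprints
set_a <- list(c(1, 2, 3), c(2, 3, 4))
set_b <- list(c(2, 3), c(2, 3, 4))
d <- spectrum_distance(set_a, set_b, nbits = 4)
```

A pairwise comparison of the two datasets would need one similarity per molecule pair, whereas the spectrum comparison needs only one pass over each dataset.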

Visualization




    rcdk supports visualization of 2D structure images in
    two ways
    First, you can bring up a Swing window
    Second, you can obtain the depiction as a raster image


mols <- load.molecules("data/dhfr_3d.sd")

## view a single molecule in a Swing window
view.molecule.2d(mols[[1]])

## view a table of molecules
view.molecule.2d(mols[1:10])
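
The second route, obtaining a raster image, might look like the sketch below. It assumes rcdk is installed and that `view.image.2d` is called with its default rendering settings, which can differ between rcdk versions. Benzoic acid is an arbitrary example.

```r
library(rcdk)

mol <- parse.smiles("c1ccccc1C(=O)O")[[1]]   # benzoic acid

# Render the 2D depiction to an in-memory raster (an array of pixel values)
img <- view.image.2d(mol)

# Draw it on an empty base-graphics canvas
plot(NA, xlim = c(0, 1), ylim = c(0, 1), axes = FALSE, xlab = "", ylab = "")
rasterImage(img, 0, 0, 1, 1)
```

Because the depiction is an ordinary R array, it can be placed inside any base-graphics plot, for example next to a data point.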

The QSAR workflow




   Before model development you’ll need to clean the
   molecules, evaluate descriptors, generate subsets
   With the numeric data in hand, we can proceed to
   modeling
   Before building predictive models, we’d probably explore
   the dataset
       Normality of the dependent variable
       Correlations between descriptors and dependent variable
       Similarity of subsets
   Go wild and build all the models that R supports
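
The steps above can be sketched in base R. The descriptor matrix here is synthetic and generated only for illustration; in a real workflow it would come from a descriptor calculation such as rcdk's `eval.desc`. A linear model stands in for whichever model you prefer.

```r
set.seed(1)                        # synthetic data, for illustration only
n <- 50
X <- data.frame(d1 = rnorm(n), d2 = rnorm(n), d3 = 1)   # d3 is constant
y <- 2 * X$d1 + rnorm(n, sd = 0.1)                      # activity depends on d1

# Clean: drop descriptors with zero variance
X <- X[, sapply(X, sd) > 0, drop = FALSE]

# Explore: normality of the dependent variable, descriptor/activity correlations
normality <- shapiro.test(y)
cors <- sapply(X, cor, y = y)

# Generate subsets: random 80/20 training/test split
train <- sample(n, size = 0.8 * n)

# Model: ordinary least squares on the training set, evaluated on the rest
fit <- lm(y ~ ., data = cbind(X, y = y)[train, ])
pred <- predict(fit, newdata = X[-train, , drop = FALSE])
rmse <- sqrt(mean((pred - y[-train])^2))
```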

Interacting with chemical databases




    A variety of databases containing structures, physical
    properties, biological activities
    Direct access within R lets us streamline our workflow
    Enabled by public APIs
        PubChem PUG and REST
        ChEMBL REST API (chemblr)
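
As a sketch of what such access looks like without any add-on package, the helper below builds a PubChem PUG REST URL that returns compound properties as CSV. The function name `pug_property_url` is hypothetical, and aspirin and the two property names are example inputs.

```r
# Build a PubChem PUG REST URL that returns compound properties as CSV
pug_property_url <- function(name, properties) {
  base <- "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
  paste(base, "compound", "name", utils::URLencode(name, reserved = TRUE),
        "property", paste(properties, collapse = ","), "CSV", sep = "/")
}

url <- pug_property_url("aspirin", c("MolecularFormula", "MolecularWeight"))

# Fetching requires network access, so it is left commented out here:
# props <- read.csv(url)
```

Since the result is plain CSV, it drops straight into a data frame and from there into the rest of the analysis.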

Reproducible chemical data mining




   The many toolkits and versions make reproducibility tough
   DB and HTTP access ensures that an analysis can always be
   up to date if required
   If the analysis is not based on a fixed snapshot of data,
   reproducibility cannot be guaranteed
   Might actually make all those published QSAR models
   reusable!

   [Figure: a "Reproducible Bundle" combining data (.Rda), code (.R) and Sweave / Knitr]
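
One way to freeze a snapshot is sketched below in base R. The data frame is a made-up stand-in for values fetched from a database, and the file name is arbitrary.

```r
# Stand-in for data pulled from a database or web service at analysis time
activity <- data.frame(id = 1:3, ic50 = c(0.5, 1.2, 3.4))   # made-up values

# Freeze the data together with a record of the R and package versions
snapshot_file <- file.path(tempdir(), "snapshot.Rda")
session <- sessionInfo()
save(activity, session, file = snapshot_file)

# Later, or on another machine: restore into a clean environment and
# rerun the analysis against exactly the same data
env <- new.env()
load(snapshot_file, envir = env)
restored <- env$activity
```

Storing the `sessionInfo()` output next to the data records which toolkit versions produced the results.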

Acknowledgements




   rcdk
       Steffen Neumann
       Miguel Rojas
       Johannes Ranke
   CDK
       Egon Willighagen
       Christoph Steinbeck
       ...




http://sourceforge.net/projects/cdk/

  http://github.com/rajarshi/cdk

              @rguha
