A REST API for
The IUPAC Solubility Data Series:
A ‘Skunkworks’ Project
Stuart J. Chalk
Department of Chemistry
University of North Florida
schalk@unf.edu
2014 Fall ACS Meeting
 Motivation
 What is Website ‘Scraping’?
 What are REST and API?
 Project Process
 NIST Website Analysis
 Database Definition
 Data Ingestion
 Project Website Design
 Using the Website
 Future Plans
 Conclusion
Outline
 Linked Open Data (LOD) is important for science
 Defining a process for grabbing high quality science
data and making it semantically available is useful
 Providing a REST API makes information easy to find
 Providing unique REST URLs for data allows linking
 A semantic description of data makes it more useful
 Increase value added -> link data to other available data
 SDS data is fundamentally important to chemistry
Motivation
(1) http://en.wikipedia.org/wiki/Linked_data
 Data in web pages is available for users to copy/paste
 When the amount of available data is large, automating
the capture with scripts is necessary
 ‘Scraping’ is the processing of web page data using a
scripting language
 Data can be captured and stored in any format
 Most useful to capture data in a relational database
so that it can be repurposed at another website
 This is usually done without the permission of the
authors of the ‘scraped’ web page(s) 
What is Website Scraping?
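As a rough illustration of the scraping idea above, the following minimal Python sketch pulls link targets and link text out of a page's HTML. The deck's own scripts were written in PHP; the class name, function name, and sample markup here are hypothetical and stand in for any real page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from the anchor tags of a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently being read
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def scrape_links(html):
    """Return every (href, text) pair found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# Hypothetical sample markup, not taken from any real site.
SAMPLE_PAGE = (
    '<ul><li><a href="page.aspx?id=1">First item</a></li>'
    '<li><a href="page.aspx?id=2">Second item</a></li></ul>'
)
```

Once the pairs are in hand they can be written to a file or a database table, which is the step the slide describes as capturing data for reuse.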
 Representational State Transfer (REST) is…
“a software architectural style consisting of a coordinated
set of architectural constraints applied to components,
connectors, and data elements, within a distributed
hypermedia system” (2)
 REST is applied to websites as a style for providing URL
access to information in a structured, human-readable way
 Application Programming Interface (API) is…
a standardized way for one computer/software system to
talk to another. For REST this is a set of remote (HTTP-based)
calls to pre-defined URLs
What are REST and API?
(2) http://en.wikipedia.org/wiki/Representational_state_transfer
(3) http://en.wikipedia.org/wiki/API
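To make "calls to pre-defined URLs" concrete, here is a small Python sketch that builds resource URLs following the pattern of the example URL given later in the deck (…/solubility/systems/view/20_135). The function name is hypothetical, and the assumption that a listing lives at the bare collection path is not confirmed by the slides.

```python
from urllib.parse import quote

# Base path taken from the example URL in the deck.
BASE_URL = "http://chalk.coas.unf.edu/solubility"


def resource_url(collection, identifier=None, base=BASE_URL):
    """Build a URL for a collection listing or for one item in it.

    Assumption: listings live at <base>/<collection> and single items at
    <base>/<collection>/view/<identifier>. Only the second form appears
    in the slides.
    """
    if identifier is None:
        return f"{base}/{collection}"
    return f"{base}/{collection}/view/{quote(str(identifier), safe='')}"
```

Because each item gets one stable URL, other sites can link straight to a system, chemical, or citation, which is the linking benefit the Motivation slide describes.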
 Analysis of current NIST Solubility Database website
 Definition of database tables needed
 Code generation to automate data scraping
 Data cleanup
 REST API definition and description
 REST API development
 Output file format generation
 Addition of bells and whistles (if there’s time)
Project Process
 http://srdata.nist.gov/solubility/dataSeries.aspx
contains links to all the volumes that are available => volID
 http://srdata.nist.gov/solubility/sys_category.aspx
contains all the system types as part of a select list => typeID
 http://srdata.nist.gov/solubility/sol_sys_lst.aspx?sysID=<typeID>&FROM=SSN
contains the different datasets for a specific system type => sysID
 http://srdata.nist.gov/solubility/sol_detail.aspx?sysID=<sysID>
contains details of system: citation, data tables, refs, preparer etc.
 http://srdata.nist.gov/solubility/sol_2casno.aspx?STR1=<CASRN1>&STR2=<CASRN2>&OPTION=CASNO
allows searching by chemical CASRN (also by name (OPTION=CHEM) or formula (OPTION=MOL))
 http://srdata.nist.gov/solubility/citation_detail.aspx?REF_NO=<?REFNO?>
allows searching system data by paper
NIST Website Analysis
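The URL templates above can be turned into small helper functions. This is a Python sketch only (the project itself used PHP); the function names are hypothetical, while the page names and query parameters are the ones listed on the slide.

```python
from urllib.parse import urlencode

NIST_BASE = "http://srdata.nist.gov/solubility"


def system_list_url(type_id):
    """Listing of datasets for one system type (note: parameter is sysID)."""
    query = urlencode({"sysID": type_id, "FROM": "SSN"})
    return f"{NIST_BASE}/sol_sys_lst.aspx?{query}"


def system_detail_url(sys_id):
    """Detail page for one system: citation, data tables, references."""
    return f"{NIST_BASE}/sol_detail.aspx?{urlencode({'sysID': sys_id})}"


def chemical_search_url(first, second="", option="CASNO"):
    """Search by CASRN (CASNO), name (CHEM), or formula (MOL)."""
    if option not in {"CASNO", "CHEM", "MOL"}:
        raise ValueError(f"unknown OPTION value: {option!r}")
    query = urlencode({"STR1": first, "STR2": second, "OPTION": option})
    return f"{NIST_BASE}/sol_2casno.aspx?{query}"
```

Using `urlencode` rather than string concatenation keeps names and formulas containing spaces or symbols safely escaped.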
 What types of data are available and how should they be
organized?
 By Volume => volID
 By System Type => typeID
 By System => sysID
 By Chemical => CASRN, name, formula
 By Citation => refNO
 By Author (new)
 Also added Tables and Variables during development
 Note: the actual NIST site uses the sysID parameter both for a
system type and for a particular set of data about a system type
Database Definition
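One possible shape for these tables is sketched below in Python with SQLite so that it runs without a server; the project itself used MySQL. Every table and column name is an assumption made for illustration, and the Tables and Variables entities added during development are left out.

```python
import sqlite3

# Hypothetical schema: one table per entity listed on the slide.
SCHEMA = """
CREATE TABLE volumes      (vol_id  TEXT PRIMARY KEY, title TEXT);
CREATE TABLE system_types (type_id TEXT PRIMARY KEY, name  TEXT);
CREATE TABLE citations    (ref_no  TEXT PRIMARY KEY, citation TEXT, doi TEXT);
CREATE TABLE authors      (author_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE chemicals    (casrn TEXT PRIMARY KEY, name TEXT, formula TEXT);
CREATE TABLE systems (
    sys_id  TEXT PRIMARY KEY,
    type_id TEXT REFERENCES system_types(type_id),
    vol_id  TEXT REFERENCES volumes(vol_id),
    ref_no  TEXT REFERENCES citations(ref_no)
);
"""


def create_database(path=":memory:"):
    """Create the sketch schema and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Keeping the system type and the system in separate tables avoids carrying over the NIST site's double use of sysID.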
 Data was imported into MySQL either from tab-delimited
text files or by insertion via PHP scripts
 Volume IDs were scraped from
http://srdata.nist.gov/solubility/dataSeries.aspx; the HTML was
cleaned up to generate a tab-delimited text file => 18 rows
 Similarly, the system types were scraped from
http://srdata.nist.gov/solubility/sys_category.aspx
into a tab-delimited text file => 2564 rows
Data Ingestion
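The select-list step can be sketched as follows: read the option elements of a select list and write one tab-delimited row per option. This Python version is illustrative only; the class and function names are hypothetical, and the exact markup of the NIST page is assumed rather than known.

```python
import csv
import io
from html.parser import HTMLParser


class OptionCollector(HTMLParser):
    """Collect (value, label) pairs from <option> elements."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._value = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self._flush()  # tolerate options without a closing tag
            self._value = dict(attrs).get("value")
            self._text = []

    def handle_data(self, data):
        if self._value is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("option", "select"):
            self._flush()

    def _flush(self):
        if self._value is not None:
            self.rows.append((self._value, "".join(self._text).strip()))
            self._value = None


def options_to_tsv(html):
    """Return the option list of an HTML fragment as tab-delimited text."""
    parser = OptionCollector()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(parser.rows)
    return out.getvalue()
```

The resulting text can be loaded into a database table with a bulk import, which matches the tab-delimited route described on the slide.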
 Individual systems with data were scraped using a PHP
script which involved
 Lookup of system type and retrieval of typeID
 Construction of system type page URL
http://srdata.nist.gov/solubility/sol_sys_lst.aspx?sysID=<typeID>&FROM=SSN
 Retrieval of the page content (HTML) into a PHP variable
 PCRE regular expression match for the sysID of each system
 Creation of a new entry in the system database table
 4817 rows
Data Ingestion
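The regular-expression step above might look like the following Python sketch (the original used PCRE in PHP). It assumes that the listing page links to each dataset through a sol_detail.aspx?sysID=… URL, as the URL analysis suggests; the actual pattern used in the project is not given in the slides.

```python
import re

# Assumed link shape: sol_detail.aspx?sysID=<sysID>
SYS_ID_LINK = re.compile(r"sol_detail\.aspx\?sysID=([\w.\-]+)", re.IGNORECASE)


def extract_sys_ids(html):
    """Return the distinct sysID values found in the page, in page order."""
    return list(dict.fromkeys(SYS_ID_LINK.findall(html)))
```

Each returned value would then become one new row in the systems table. Removing repeats guards against a dataset being linked more than once on the same page.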
 System details were scraped using a PHP script by
 Lookup of system and retrieval of sysID
 Construction of system detail page URL
http://srdata.nist.gov/solubility/sol_detail.aspx?sysID=<sysID>
 Retrieval of the page content (HTML) into a PHP variable
 Processing of HTML to retrieve
citation, variables, data analysis and tables, method, source,
errors, references
 Saving of details to systems table and related tables
Data Ingestion
 In addition to data extraction
 Chemical InChI strings were retrieved from NIH CIR (1)
 Citation DOIs were retrieved from CrossRef (2) and saved
(article titles and full author names were also added)
 Data tables were converted to JSON format for storage
and reproduction
 Table notes, sources, and additional refs were converted to
JSON for storage
Data Ingestion
(1) http://cactus.nci.nih.gov/chemical/structure
(2) http://www.crossref.org
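Two of the enrichment steps above lend themselves to a short Python sketch: serializing a scraped data table to JSON, and building a lookup URL for the NIH Chemical Identifier Resolver (CIR). The JSON layout and function names are assumptions, not the project's actual storage format. The CIR URL follows the resolver's documented structure/<identifier>/<representation> pattern with the stdinchi representation.

```python
import json
from urllib.parse import quote

CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure"


def table_to_json(headers, rows, notes=None):
    """Serialize one data table (hypothetical layout) to a JSON string."""
    record = {
        "headers": list(headers),
        "rows": [list(row) for row in rows],
        "notes": list(notes or []),
    }
    return json.dumps(record, ensure_ascii=False)


def cir_stdinchi_url(identifier):
    """URL asking CIR for the standard InChI of a name or CASRN."""
    return f"{CIR_BASE}/{quote(identifier, safe='')}/stdinchi"
```

Storing the whole table as one JSON string keeps the original row and column layout intact, so the table can be reproduced on a web page without a separate table for every cell.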
Database [two figure slides; images not reproduced]
 Constructed using the CakePHP framework (PHP)
 Index (listing) and view pages for each of
 Authors
 Chemicals
 Citations
 Systems
 System Types
 Volumes
 Search functionality provided via the homepage
 Example URL
http://chalk.coas.unf.edu/solubility/systems/view/20_135
Project Website Design
Project Website Design [two figure slides; images not reproduced]
 Get this project funded
 Clean up references and link to DOIs
 Clean up authors and link to ORCIDs
 Add procedural references
 Convert table data into searchable/linked format
 Add measurement type, unit, error, and variables
 Provide searching and plotting of data
 Automated calculation of additional parameters
e.g. solubility in different units, mole ratio
 Create solubility ontology => add RDF + searching
 Add microdata (1) to each web page
 Next phase? => Add the other volumes
Future Plans
(1) http://www.w3.org/TR/microdata/
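The planned automated calculation of additional parameters could start from standard conversions such as the ones sketched below in Python. For a binary system with solute mole fraction x, the mole ratio is x / (1 - x), and the molality is 1000 · x / ((1 - x) · M), with M the solvent molar mass in g/mol. The function names are hypothetical and nothing here reflects code from the project.

```python
def mole_ratio(x_solute):
    """Moles of solute per mole of solvent in a binary mixture."""
    if not 0.0 <= x_solute < 1.0:
        raise ValueError("mole fraction must satisfy 0 <= x < 1")
    return x_solute / (1.0 - x_solute)


def molality(x_solute, solvent_molar_mass_g_mol):
    """Moles of solute per kilogram of solvent (binary mixture only)."""
    if solvent_molar_mass_g_mol <= 0:
        raise ValueError("molar mass must be positive")
    return 1000.0 * mole_ratio(x_solute) / solvent_molar_mass_g_mol
```

Both functions assume a two-component system; ternary and higher systems would need the mole fractions of all components.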
 A RESTful version of the IUPAC-NIST Solubility Series
Database was successfully created and made available
 Metrics
 20 Volumes
 2564 System Types
 4817 Systems
 1484 Chemicals
 1247 References
 1968 Authors
 11 MB size of database
 One week’s worth of work
Conclusion
 schalk@unf.edu
 Phone: 904-620-5311
 Skype: stuartchalk
 LinkedIn/SlideShare: https://www.linkedin.com/in/stuchalk
 ORCID: http://orcid.org/0000-0002-0703-7776
 ResearcherID: http://www.researcherid.com/rid/D-8577-2013
Questions?

ACS 248th Paper 108 NIST-IUPAC Solubility Data
