Genomic Big Data
Management, Integration, and Mining
E. Weitschek1,2
1 Department of Engineering, Uninettuno International University, Italy
2 Institute of Systems Analysis and Computer Science, National Research Council, Italy
Joint work with P. Bertolazzi, G. Felici, F. Cumbo, G. Fiscon, E. Cappelli
Outline
• Growth of biological data
• Next generation sequencing
• Biological data sources
• Biological data management
• Biological data integration
• Big data bioinformatics
• Knowledge extraction
• Supervised Learning
• Biomedical applications
• Conclusions and future directions
Growth of biological data
• Advances in molecular biology, supported by computer science, have led to an exponential growth of biological data
‒ originated with the DNA sequencing method invented by Sanger in the late seventies
‒ in the late nineties, significant advances in sequence generation, e.g. the Human Genome Project
‒ currently, the number of genomic sequences doubles every 18 months
‒ GenBank: collection of all publicly available nucleotide sequences (160 M sequences)
Growth of biological data (continued)
• Today, next generation high-throughput data from modern parallel sequencing machines are collected, and huge amounts of biological data are available from public and private sources
‒ 10,000 Human Genomes project (3000 Mbp)
‒ Nowadays: $1,000 genome
• Very large data sets, generated by many different biological experiments, need to be processed and analyzed automatically with computer science methods
DNA Sequencing
• DNA (deoxyribonucleic acid) is the hereditary material in almost all organisms
• DNA sequencing is the process of determining the order of nucleotides within a DNA molecule
• It includes any method or technology used to determine the order of the four bases: adenine (A), cytosine (C), guanine (G), and thymine (T)
• Originated with the DNA sequencing method invented by Sanger in the late seventies
• In the late nineties, significant advances in sequence generation techniques, largely driven by massive projects such as the Human Genome Project
• High cost and long duration, e.g., $5 billion and 13 years for the Human Genome Project
Next Generation Sequencing (NGS)
• Today: next generation high-throughput data from modern parallel sequencing machines
‒ Roche 454, Illumina, Applied Biosystems SOLiD, Helicos HeliScope, Complete Genomics, Pacific Biosciences SMRT, Ion Torrent
‒ Next generation sequencing (NGS) machines output a large number of short DNA sequences, called reads (in FASTQ format)
‒ They cannot read an entire genome one nucleotide at a time from beginning to end
‒ Instead, they shred the genome and generate short reads
‒ Low cost per base ($1,000 for a whole human genome)
‒ High speed (24 h to sequence a whole human genome)
‒ Large number of reads
‒ Problems: data storage and analysis, high costs for IT infrastructure
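As a rough illustration of what the reads in a FASTQ file look like to a program, the sketch below parses the common four-line record layout (header, sequence, separator, quality string). The `Read` type, the function names, and the two example reads are made up for this illustration, and the quality decoding assumes the widespread Phred+33 encoding; it is not tied to any specific sequencing platform listed above.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Read(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(handle: TextIO) -> Iterator[Read]:
    """Yield reads from a FASTQ stream with four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield Read(header[1:], sequence, quality)


def mean_phred(read: Read, offset: int = 33) -> float:
    """Mean base quality, assuming Phred+33 ASCII encoding (an assumption)."""
    return sum(ord(c) - offset for c in read.quality) / len(read.quality)


# Two invented reads, for illustration only.
EXAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!II\n"
reads = list(parse_fastq(io.StringIO(EXAMPLE)))
```

Because the parser is a generator, it streams one read at a time, which matters when a single run produces tens of gigabytes of reads.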
Next Generation Sequencing (NGS)
• Data size, time, and cost of Next Generation Sequencing
Seq type | Data | Price ($) | Time
Human genome | 90 GB | 1000 | 1 day
Human gene expression | 9 GB | 500 | 12 h
Plant genome | 150 GB | 2000 | 5 days
Bacterial genome | 1 GB | 300 | 6 h
Biological data sources
• Several heterogeneous sources of biomedical data are available
• Sequence Read Archive
• The Gene Expression Omnibus
• NCBI
• ELIXIR
• The Cancer Genome Atlas (TCGA)
Biological data management

Biological data integration
• Challenge for the research community
• Allow everyone to store, organize, access, and analyze the information available on the web and/or in private repositories
• Integration of data: providing unified access to heterogeneous and independent data sources as if they were a single source
• Many solutions from the IT and bioinformatics communities, e.g.
− Heterogeneous database systems
− Distributed database systems
− SRS
− NCBI Entrez
− Federated databases (BioKleisli)
− Multi-databases (TAMBIS)
− Mediator-based (Bio-DataServer)
− Data warehousing (BioWarehouse)
• Integration of clinical and genomic data
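The idea of unified access over independent sources can be sketched in a few lines. The example below is closest in spirit to the data-warehousing approach, reduced to in-memory records: two sources with different layouts are merged into one per-patient view. The schema, the `PatientView` type, the field names, and all rows are hypothetical and do not reflect any of the systems listed above.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Tuple


@dataclass
class PatientView:
    """Unified record exposed to the user (hypothetical schema)."""
    patient_id: str
    clinical: Dict[str, object] = field(default_factory=dict)
    expression: Dict[str, float] = field(default_factory=dict)


# Source 1: a clinical registry with one row per patient (invented rows).
CLINICAL_ROWS = [
    {"id": "P001", "age": 61, "status": "tumoral"},
    {"id": "P002", "age": 48, "status": "normal"},
]

# Source 2: a genomic repository with one row per (patient, gene) pair.
EXPRESSION_ROWS = [
    ("P001", "GENE_A", 12.5),
    ("P001", "GENE_B", 0.7),
    ("P003", "GENE_A", 3.1),  # patient absent from the clinical source
]


def integrate(
    clinical_rows: Iterable[Dict[str, object]],
    expression_rows: Iterable[Tuple[str, str, float]],
) -> Dict[str, PatientView]:
    """Merge both sources into one view per patient, keyed by patient id."""
    views: Dict[str, PatientView] = {}
    for row in clinical_rows:
        view = views.setdefault(row["id"], PatientView(row["id"]))
        view.clinical = {k: v for k, v in row.items() if k != "id"}
    for patient_id, gene, value in expression_rows:
        view = views.setdefault(patient_id, PatientView(patient_id))
        view.expression[gene] = value
    return views


unified = integrate(CLINICAL_ROWS, EXPRESSION_ROWS)
```

Patients present in only one source still get a view with the missing part left empty, which is one simple way to handle incomplete overlap between clinical and genomic records.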
Bioinformatics
• New methods able to extract relevant information from biological data sets are in demand
• Effective and efficient computer science methods are needed to support the analysis of complex biological data sets
• Modern biology is frequently combined with computer science, leading to bioinformatics
• Bioinformatics is a discipline in which biology and computer science merge in order to design and develop efficient methods for analyzing biological data, for supporting in vivo, in vitro, and in silico experiments, and for automatically solving complex life science problems
• Bioinformatician: a computer scientist and biology domain expert who is able to deal with the computer-aided resolution of life science problems
Big Data Bioinformatics
• Attention to Big Data in bioinformatics is steadily increasing, in proportion to the growth of the amount of biological data obtained through sequencing
• Dealing with such an amount of data, recorded at different stages of a person's life and stored for dynamic analysis studies, requires scalable systems suitable for collection, management, and analysis
• Biological big databases
The Cancer Genome Atlas (TCGA)
• Comprehensive genomic characterization and analysis of more than 30 cancer types
• National Cancer Institute (NCI), National Human Genome Research Institute (NHGRI), and National Institutes of Health (NIH)
• Aim: improve the ability to diagnose, treat, and prevent cancer
• A freely available platform to search, download, and analyze data sets
• 33 tumor types with more than 10,000 patients
• Public data distributed under the open access paradigm
• Genomic experiments
– Copy Number Variation (CNV)
– DNA methylation
– DNA sequencing (whole genome, whole exome, mutations)
– Gene expression data (RNA-Seq V1, V2)
– MicroRNA sequencing
– Metadata (clinical and biospecimen)
• Contains more than 15 TB of genomic and clinical data, whose analysis and interpretation pose great challenges to the bioinformatics community
TCGA2BED
Data integration from external databases

Genomic data integration
Data set: DNA methylation
Data set: RNA sequencing
Typical problem in bioinformatics:
• More than 1000 samples (patients), 450 000 features (genes, sites, clinical variables, proteins, ...)
• Aim: distinguish healthy vs diseased samples
• Not addressable by a classic machine learning algorithm
• Big Data solutions
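A back-of-the-envelope calculation gives a feel for the scale quoted above. The sketch assumes a dense matrix of 8-byte floating-point values, which is an assumption of this example and not a statement about how the data are actually stored; the function name is likewise invented.

```python
import math


def dense_matrix_bytes(n_samples: int, n_features: int, bytes_per_value: int = 8) -> int:
    """Memory needed for a dense samples-by-features matrix."""
    return n_samples * n_features * bytes_per_value


# Figures from the slide: 1000 samples and 450 000 features.
size_bytes = dense_matrix_bytes(1_000, 450_000)
size_gb = size_bytes / 1e9  # decimal gigabytes

# Number of distinct feature pairs a method would face if it
# examined pairwise feature combinations exhaustively.
feature_pairs = math.comb(450_000, 2)
```

Under these assumptions the raw matrix alone is a few gigabytes, and the count of feature pairs is on the order of 10^11, so any method that enumerates feature combinations quickly becomes impractical on a single machine.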
Classification and supervised machine learning
• Aims: distinguish diseased from healthy samples and predict the class of new samples
• Input: a training set (reference library) containing samples with a priori known class membership
• Model building: based on this training set, the software computes the classification model
• The classification model can be applied to a test set (query set), which contains samples that require classification:
− query samples with unknown class membership, or
− samples that also have a priori known class membership, allowing verification of the classifications
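The training set, model building, and test set steps can be sketched with a deliberately simple classifier. The nearest-centroid model below is only a stand-in chosen for brevity, not the method used in the talk, and the feature values and function names are invented.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (feature vector, class label)


def fit_centroids(training_set: List[Sample]) -> Dict[str, List[float]]:
    """Model building: compute one mean feature vector per class."""
    grouped: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for features, label in training_set:
        grouped[label].append(features)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def predict(model: Dict[str, List[float]], features: Sequence[float]) -> str:
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def squared_distance(centroid: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


# Training set with known class membership (invented values).
training_set = [
    ([1.0, 0.9], "healthy"), ([1.2, 1.1], "healthy"),
    ([3.0, 3.2], "diseased"), ([2.8, 3.1], "diseased"),
]
# Test set whose labels are also known, so predictions can be verified.
test_set = [([1.1, 1.0], "healthy"), ([2.9, 3.0], "diseased")]

model = fit_centroids(training_set)
predictions = [predict(model, features) for features, _ in test_set]
accuracy = sum(p == y for p, (_, y) in zip(predictions, test_set)) / len(test_set)
```

Keeping the test samples out of `fit_centroids` is the point of the split: the accuracy is measured only on samples the model has not seen.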
Rule-based classification
A rule-based classifier classifies samples by using a collection of “if… then” rules, called logic formulas:
– Antecedent → Consequent
– (Condition_1) or (Condition_2) or … or (Condition_n) → Class
– Condition_i: (A_1 op v_1) and (A_2 op v_2) and … and (A_m op v_m)
– A = attribute; v = value; op = operator {=, ≠, <, >, ≤, ≥}
• Example of a logic classification formula: “IF Aph1b<0.507 then the experimental sample is CONTROL”
• The logic formulas, and the assignment of samples to the correct class, are evaluated according to:
– Percentage split or cross validation sampling
– Accuracy
– F-measure
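A logic formula of this shape (an "or" of "and"-joined conditions) is straightforward to evaluate in code. The sketch below encodes the Aph1b example from the slide and computes accuracy and F-measure on four invented samples. The `CASE` label for samples that do not match the rule, the sample values, and all function names are assumptions made for illustration.

```python
import operator
from typing import Dict, List, Tuple

OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       ">": operator.gt, "<=": operator.le, ">=": operator.ge}

Clause = Tuple[str, str, float]  # (attribute, operator, value)
Condition = List[Clause]         # clauses joined by "and"
Formula = List[Condition]        # conditions joined by "or"


def holds(formula: Formula, sample: Dict[str, float]) -> bool:
    """True if at least one condition has all of its clauses satisfied."""
    return any(
        all(OPS[op](sample[attr], value) for attr, op, value in condition)
        for condition in formula
    )


def classify(sample: Dict[str, float], formula: Formula,
             then_class: str, else_class: str) -> str:
    return then_class if holds(formula, sample) else else_class


def f_measure(predicted: List[str], actual: List[str], positive: str) -> float:
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Rule from the slide: IF Aph1b < 0.507 THEN CONTROL.
control_rule: Formula = [[("Aph1b", "<", 0.507)]]

# Invented samples and labels; "CASE" is a hypothetical fallback class.
samples = [{"Aph1b": 0.31}, {"Aph1b": 0.49}, {"Aph1b": 0.62}, {"Aph1b": 0.45}]
actual = ["CONTROL", "CONTROL", "CASE", "CASE"]

predicted = [classify(s, control_rule, "CONTROL", "CASE") for s in samples]
accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)
f1_control = f_measure(predicted, actual, "CONTROL")
```

In practice these measures would be computed on a held-out split or averaged over cross-validation folds, as the slide indicates, not on the data the rule was learned from.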
CAMUR
• Classifier with Alternative and Multiple Rule-based models (CAMUR)
• New method for classifying RNA-seq case-control samples that computes multiple human-readable classification models
• Aims of CAMUR:
1) To classify RNA-seq experiments
2) To extract several alternative and equivalent rule-based models, which represent relevant sets of genes related to the case and control samples
• CAMUR extracts multiple classification models by adopting a feature elimination technique and by iterating the classification procedure
• Prerequisite: gene expression normalization (RPKM or RSEM)
• Available at: http://dmb.iasi.cnr.it/camur.php
CAMUR: method
• CAMUR is based on:
1) a rule-based classifier (in this work, RIPPER)
2) an iterative feature elimination technique
3) a repeated classification procedure
4) an ad-hoc storage structure for the classification rules (CAMUR database)
• In brief, CAMUR:
‒ iteratively computes a rule-based classification model through the supervised RIPPER algorithm,
‒ calculates the power set (or a partial combination) of the features present in the rules,
‒ iteratively eliminates those combinations from the data set, and
‒ performs the classification procedure again until a stopping criterion is met:
   – F-measure < threshold
   – Maximum number of iterations reached
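The loop described above can be sketched as follows. This is a heavily simplified illustration and not the CAMUR implementation: RIPPER is replaced by a single-threshold rule learner, each iteration eliminates just the one feature used by the extracted rule instead of working through the power set, and the F-measure is computed on the training data itself. All names, thresholds, and data are invented.

```python
from typing import Dict, List, Optional, Set, Tuple

Sample = Dict[str, float]
Rule = Tuple[str, float]  # (feature, threshold): "feature >= threshold => case"


def f_measure(pred: List[bool], truth: List[bool]) -> float:
    tp = sum(p and t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    fn = sum(t and not p for p, t in zip(pred, truth))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def learn_stump(data: List[Sample], labels: List[bool],
                features: Set[str]) -> Optional[Tuple[Rule, float]]:
    """Stand-in for RIPPER: best single-threshold rule over `features`."""
    best: Optional[Tuple[Rule, float]] = None
    for feat in sorted(features):
        for threshold in sorted({s[feat] for s in data}):
            score = f_measure([s[feat] >= threshold for s in data], labels)
            if best is None or score > best[1]:
                best = ((feat, threshold), score)
    return best


def camur_like(data: List[Sample], labels: List[bool],
               min_f: float = 0.8, max_iter: int = 10) -> List[Tuple[Rule, float]]:
    """Extract alternative rules by repeated learning and feature elimination."""
    remaining = set(data[0])
    models: List[Tuple[Rule, float]] = []
    for _ in range(max_iter):          # stop: maximum number of iterations
        if not remaining:
            break
        found = learn_stump(data, labels, remaining)
        if found is None or found[1] < min_f:
            break                      # stop: F-measure below threshold
        models.append(found)
        remaining.discard(found[0][0])  # eliminate the feature just used
    return models


# Invented expression values; True marks a case sample.
data = [
    {"g1": 5.0, "g2": 4.0, "g3": 1.0},
    {"g1": 6.0, "g2": 5.0, "g3": 3.0},
    {"g1": 1.0, "g2": 2.0, "g3": 2.0},
    {"g1": 2.0, "g2": 1.0, "g3": 4.0},
]
labels = [True, True, False, False]
models = camur_like(data, labels)
```

On this toy data the loop finds two alternative rules of equal quality, one on `g1` and one on `g2`, and stops when the only remaining feature cannot reach the F-measure threshold.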
Experimentation and results

Supervised model extraction
Classification models for breast cancer
(MAMDC2_dMet >= 6.63) and (ACACB_rnaSeq >= 887.80) => class=normal (19.0/3.0)
[ ] => class=tumoral (1102.0/1.0)
Correctly Classified Instances: 98.11 %
Incorrectly Classified Instances: 1.88 %

CAMUR: occurrences
Gene | Occurrences
FIGF_rnaSeq | 44
SPRY2_dMet | 37
SCN3A_rnaSeq | 25
PAMR1_dMet | 20
MMP11_rnaSeq | 20

CAMUR: rules
Class | Rule | Accuracy
Normal | (FIGF_rnaSeq >= 184.15) and (CLEC5A_dMet <= 5.44) || (TSHZ2_rnaSeq >= 471.04) and (DLGAP2_dMet >= 10.06) | 9.800
Normal | (SPRY2_dMet >= 0.55) and (CD300LG_rnaSeq >= 454.24) || (PAMR1_rnaSeq >= 712.17) and (PARP8_dMet >= 2.17) | 9.700
Other applications on biomedical data
Aim: to extract relevant features from the ever-increasing amount of biological data and to apply supervised learning to classify them
Biology issue | Features | Software | Data source
Clinical patient classification | Clinical variables (blood, imaging, psychometric tests…) | DMB, Weka | Heterogeneous health care facilities
Gene expression analysis | Discretized gene expression profiles | GELA, CAMUR | TCGA, EBRI
DNA barcoding | Nucleotide sequences of DNA barcodes | BLOG, Fasta2Weka | Barcode of Life Consortium
Polyoma/Rhino viruses | Nucleotide sequences of Polyoma/Rhino viruses | DMB, MISSAL | Istituto Superiore di Sanità
EEG signal processing | Fourier coefficients extracted from EEG recordings | Matlab, Weka, DMB | IRCCS Centro Neurolesi “Bonino-Pulejo” of Messina
Biomedical image processing | Oriented FAST and Rotated BRIEF | Matlab, Weka, DMB | Alzheimer's Disease Neuroimaging Initiative
Conclusions and future directions
• Exponential growth of biomedical data
• Release of many public databases and of data collection and data management projects
• Data integration
• Supervised classification analysis
• Advanced systems for data integration
• New big data approaches
Acknowledgments
Emanuel Weitschek
Department of Engineering
Uninettuno International University
www.iasi.cnr.it/~eweitschek
emanuel@iasi.cnr.it

Genomic Big Data Management, Integration and Mining - Emanuel Weitschek

  • 1. Genomic Big Data Management, Integration, and Mining E. Weitschek1,2 1 Department of Engineering, Uninettuno International University, Italy 2 Institute of Systems Analysis and Computer Science, National Research Council, Italy Joint work with P. Bertolazzi, G. Felici , F. Cumbo, G. Fiscon, E. Cappelli
  • 2. 2 Outline • Growth of biological data • Next generation sequencing • Biological data sources • Biological data management • Biological data integration • Big data bioinformatics • Knowledge extraction • Supervised Learning • Biomedical applications • Conclusions and future directions
  • 3. 3 Growth of biological data • Advances in molecular biology lead to an exponential growth of biological data thanks to the support of computer science ‒ originated by the DNA sequencing method invented by Sanger in early eighties ‒ late nineties significant advances in sequence generation, e.g. Human Genome Project ‒ actually the genomic sequences are doubling every 18 months ‒ GenBank: collection of all publicly available nucleotide sequences (160 M seq)
  • 4. 4 Growth of biological data • Advances in molecular biology lead to an exponential growth of biological data thanks to the support of computer science ‒ Today next generation high throughput data from modern parallel sequencing machines, are collected and huge amounts of biological data are currently available on public and private sources ‒ 10000 Human Genomes project (3000 Mbp) ‒ Nowadays: 1000$ genome • Very large data sets, that are generated by several different biological experiments, need to be automatically processed and analyzed with computer science methods
  • 5. 5 DNA Sequencing • DNA (deoxyribonucleic acid) is the hereditary material in almost all organisms • DNA sequencing is the process of determining the order of nucleotides within a DNA molecule • It includes any method or technology that is used to determine the order of the four bases—adenine (A), cytosine (C), guanine (G), and thymine (T) • Originated by the DNA sequencing method invented by Sanger in early eighties • In late nineties significant advances in sequence generation techniques, largely inspired by massive projects such as the Human Genome Project • High costs and time, e.g., for the Human Genome Project 5 billions $ and 13 years
  • 6. 6 Next Generation Sequencing (NGS) • Today: next generation high throughput data from modern parallel sequencing machines ‒ Roche 454, Illumina, Applied Biosystems SOLiD, Helicos Heliscope, Complete Genomics, Pacific Biosciences SMRT, ION Torrent ‒ Next generation sequencing (NGS) machines output a large amount of short DNA sequences, called reads (in fastq format) ‒ Cannot read entire genome one nucleotides at a time from beginning to end ‒ shred the genome and generate shorts reads ‒ Low cost per base (1000$ for a whole human genome) ‒ High speed (24h to sequence a whole human genome ) ‒ Large number of reads ‒ Problems: data storage and analysis, high costs for IT infrastructure
  • 7. 7 Next Generation Sequencing (NGS) • Data dimension, time and cost of Next Generation Sequencing Seq type Data Price $ Time Human Genome 90 GB 1000 1 day Human Gene Expression 9 GB 500 12 h Plant Genome 150 GB 2000 5 days Bacterial Genome 1 GB 300 6 h
  • 8. 8 Biological data sources • Several heterogeneous sources of biomedical data are available • Sequence Read Archive • The Gene Expression Omnibus • NCBI • ELIXIR • The Cancer Genome Atlas (TCGA)
  • 10. 10 Biological data integration • Challenge for the research community • Allow everyone to store, organize, access, and analyze the information available on the web and/or on private repositories • Integration of data: providing a unified access to heterogeneous and independent data sources as a single source • Many solutions from the I.T. and from the bioinformatics community, e.g. − Heterogeneous Database Systems − Distributed Database Systems − SRS − NCBI Entrez − Federated databases (BioKleisli) − Multi-databases (TAMBIS), − Mediator-based (Bio-DataServer) − Data warehousing (BioWarehouse) • Integration of clinical and genomic data
  • 11. 11 Bioinformatics • New methods are demanded able to extract relevant information from biological data sets • Effective and efficient computer science methods are needed to support the analysis of complex biological data sets • Modern biology is frequently combined with computer science, leading to Bioinformatics • Bioinformatics is a discipline where biology and computer science merge together in order to design and develop efficient methods for analyzing biological data, for supporting in vivo, in vitro and in silicio experiments and for automatically solving complex life science problems • Bioinformatician: a computer scientist and biology domain expert, who is able to deal with the computer aided resolution of life science problems
  • 12. 12 Big Data Bioinformatics • The attention to Big Data in bioinformatics is steadily increasing, in proportion to the growth of the amount of biological data obtained through sequencing • Dealing with such an amount of data, recorded at different stages of a person's life and stored for dynamic analysis studies, requires scalable systems suitable for its collection, management, and analysis • Biological Big Data Bases
  • 13. 13 The Cancer Genome Atlas (TCGA) • Comprehensive genomic characterization and analysis of more than 30 cancer types • National Cancer Institute (NCI), National Human Genome Research Institute (NHGRI), and National Institutes of Health (NIH) • Aim: improve the ability to diagnose, treat and prevent cancer • A freely available platform to search, download, and analyze data sets • 33 tumor types with more than 10000 patients • Public data distributed with the open access paradigm • Genomic experiments – Copy Number Variation (CNV) – DNA-methylation – DNA-sequencing (whole genome, whole exome, mutations) – Gene expression data (RNA-Seq V1, V2) – MicroRNA sequencing – Meta data (Clinical and Biospecimen) • Contains more than 15 TB of genomic and clinical data, whose analysis and interpretation pose great challenges to the bioinformatics community
  • 15. 15 Genomic data integration • Data sets: DNA-methylation, RNA-sequencing • Typical problem in Bioinformatics: • More than 1000 samples (patients), 450 000 features (genes, sites, clinical variables, proteins) • Aim: distinguish healthy vs diseased samples • Not addressable by a classic machine learning algorithm • Big Data solutions
  • 16. 16 Classification and supervised machine learning • Aims: distinguish the diseased from the healthy samples and predict the class of new samples • Input: a training set (reference library) containing samples with a priori known class membership • Model building: based on this training set, the software computes the classification model • The classification model can be applied to a test set (query set), which contains samples that require classification: − query samples with unknown class membership, or − samples that also have a priori known class membership, allowing verification of the classifications
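The workflow on this slide (training set, model building, application to a test set with known classes for verification) can be sketched as follows. This is an illustrative sketch only: a nearest-centroid classifier stands in for whichever learning algorithm is actually used, the data are synthetic, and all function names are hypothetical.

```python
import random
from statistics import mean


def percentage_split(samples, labels, train_fraction=0.7, seed=0):
    """Shuffle the sample indices and split them into training and test sets."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    cut = round(len(indices) * train_fraction)
    train, test = indices[:cut], indices[cut:]
    return ([samples[i] for i in train], [labels[i] for i in train],
            [samples[i] for i in test], [labels[i] for i in test])


def build_model(samples, labels):
    """Model building: one centroid (per-feature mean) for each class."""
    model = {}
    for cls in sorted(set(labels)):
        members = [s for s, lab in zip(samples, labels) if lab == cls]
        model[cls] = [mean(column) for column in zip(*members)]
    return model


def classify(model, sample):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(sample, centroid))
    return min(model, key=lambda cls: distance(model[cls]))


# Synthetic two-feature data: "healthy" around (1, 1), "diseased" around (5, 5).
rng = random.Random(42)
samples = ([[rng.gauss(1, 0.5), rng.gauss(1, 0.5)] for _ in range(50)]
           + [[rng.gauss(5, 0.5), rng.gauss(5, 0.5)] for _ in range(50)])
labels = ["healthy"] * 50 + ["diseased"] * 50

x_train, y_train, x_test, y_test = percentage_split(samples, labels)
model = build_model(x_train, y_train)
predictions = [classify(model, s) for s in x_test]

# Verification on test samples whose class is known a priori.
accuracy = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
```

Keeping the test samples out of model building is what makes the final accuracy a fair estimate of how the model behaves on query samples of unknown class.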
  • 17. 17 Rule-based classification A rule-based classifier is a technique for classifying samples by using a collection of “if… then” rules, named logic formulas: – Antecedent → Consequent – (Condition1) or (Condition2) or … or (Conditionn) → Class – Conditioni: (A1 op v1) and (A2 op v2) and … and (Am op vm) – A = attribute; v = value; op = operator {=, ≠, <, >, ≤, ≥} • An example of a logic classification formula is: “IF Aph1b<0.507 then the experimental sample is CONTROL” • The evaluation of the logic formulas and the assignment of the samples to the right class are performed according to: – Percentage split or cross validation sampling – Accuracy – F-measure
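The rule structure above (a disjunction of conjunctions of attribute/operator/value tests) and its evaluation by accuracy and F-measure can be sketched as follows. The rule is the Aph1b example from the slide. The sample values, the default class "CASE", and all function names are assumptions made for illustration.

```python
import operator

# Operators allowed in a condition, as listed on the slide.
OPS = {"=": operator.eq, "≠": operator.ne, "<": operator.lt,
       ">": operator.gt, "≤": operator.le, "≥": operator.ge}


def condition_holds(condition, sample):
    """A condition is a conjunction of (attribute, op, value) tests."""
    return all(OPS[op](sample[attr], value) for attr, op, value in condition)


def rule_fires(rule, sample):
    """A rule fires if any of its conditions holds (disjunction)."""
    return any(condition_holds(c, sample) for c in rule["conditions"])


def classify(rules, sample, default):
    """Return the class of the first rule that fires, else the default class."""
    for rule in rules:
        if rule_fires(rule, sample):
            return rule["label"]
    return default


def accuracy(predicted, actual):
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def f_measure(predicted, actual, positive):
    """Harmonic mean of precision and recall for the given positive class."""
    tp = sum(p == positive and a == positive for p, a in zip(predicted, actual))
    fp = sum(p == positive and a != positive for p, a in zip(predicted, actual))
    fn = sum(p != positive and a == positive for p, a in zip(predicted, actual))
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


# "IF Aph1b<0.507 then the experimental sample is CONTROL"
rules = [{"conditions": [[("Aph1b", "<", 0.507)]], "label": "CONTROL"}]

# Made-up expression values, for illustration only.
samples = [{"Aph1b": v} for v in (0.31, 0.45, 0.62, 0.80, 0.95, 0.50)]
actual = ["CONTROL", "CONTROL", "CONTROL", "CASE", "CASE", "CASE"]

predicted = [classify(rules, s, default="CASE") for s in samples]
acc = accuracy(predicted, actual)
f1 = f_measure(predicted, actual, positive="CONTROL")
```

With these made-up values, two of the six samples are misclassified, so both the accuracy and the F-measure for CONTROL come out at 2/3.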
  • 18. 18 CAMUR • Classifier with Alternative and Multiple Rule-based models (CAMUR) • New method for classifying RNA-seq case-control samples, which is able to compute multiple human-readable classification models • Aims of CAMUR: 1) To classify RNA-seq experiments 2) To extract several alternative and equivalent rule-based models, which represent relevant sets of genes related to the case and control samples • CAMUR extracts multiple classification models by adopting a feature elimination technique and by iterating the classification procedure • Prerequisite: Gene expression normalization (RPKM or RSEM) • Available at: http://dmb.iasi.cnr.it/camur.php
  • 19. 19 CAMUR: method • CAMUR is based on: 1) a rule-based classifier (in this work, RIPPER) 2) an iterative feature elimination technique 3) a repeated classification procedure 4) an ad-hoc storage structure for the classification rules (CAMUR database) • In brief, CAMUR: • iteratively computes a rule-based classification model through the supervised RIPPER algorithm, • calculates the power set (or a partial combination) of the features present in the rules, • iteratively eliminates those combinations from the data set, and • performs the classification procedure again until a stopping criterion is met: ‒ F-measure < threshold ‒ Maximum number of iterations reached
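The iterate, eliminate, and reclassify loop described on this slide can be sketched as follows. This is a heavily simplified illustration and not the CAMUR implementation: a single-feature threshold rule ("decision stump") stands in for RIPPER, each iteration removes the feature used by the rule instead of combinations drawn from the power set, and a plain list replaces the CAMUR database. Gene names, expression values, and function names are hypothetical.

```python
def f_measure(predicted, actual):
    """F-measure for boolean predictions against boolean ground truth."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def fires(value, op, cut):
    """Evaluate a single threshold test; op is either '>=' or '<'."""
    return value >= cut if op == ">=" else value < cut


def learn_stump(samples, labels, features, positive):
    """Stand-in for RIPPER: best single-feature threshold rule by F-measure."""
    actual = [lab == positive for lab in labels]
    best = None
    for feature in features:
        for cut in sorted({s[feature] for s in samples}):
            for op in (">=", "<"):
                predicted = [fires(s[feature], op, cut) for s in samples]
                score = f_measure(predicted, actual)
                if best is None or score > best[0]:
                    best = (score, feature, op, cut)
    return best


def extract_models(samples, labels, positive, threshold=0.8, max_iterations=10):
    """Repeat classification, removing used features, until a stop criterion."""
    remaining = sorted(samples[0])  # feature names still in the data set
    models = []
    for _ in range(max_iterations):      # stop: maximum number of iterations
        if not remaining:
            break
        score, feature, op, cut = learn_stump(samples, labels, remaining, positive)
        if score < threshold:            # stop: F-measure < threshold
            break
        models.append((feature, op, cut, score))
        remaining.remove(feature)        # simplified feature elimination
    return models


# Made-up expression values for three hypothetical genes.
samples = [
    {"geneA": 9, "geneB": 8, "geneC": 3},
    {"geneA": 8, "geneB": 7, "geneC": 5},
    {"geneA": 7, "geneB": 9, "geneC": 2},
    {"geneA": 2, "geneB": 3, "geneC": 4},
    {"geneA": 3, "geneB": 2, "geneC": 3},
    {"geneA": 1, "geneB": 4, "geneC": 5},
]
labels = ["tumoral"] * 3 + ["normal"] * 3

models = extract_models(samples, labels, positive="tumoral")
```

On this toy data the loop returns two alternative models, one on geneA and one on geneB, and then stops because no rule on the remaining geneC reaches the F-measure threshold. This mirrors the idea of extracting several equivalent models that point to different sets of genes.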
  • 22. 22 Supervised model extraction • Classification models for breast cancer
      (MAMDC2_dMet >= 6.63) and (ACACB_rnaSeq >= 887.80) => class=normal (19.0/3.0)
      [ ] => class=tumoral (1102.0/1.0)
      Correctly Classified Instances   | 98.11 %
      Incorrectly Classified Instances | 1.88 %
    • CAMUR: occurrences
      Gene         | Occurrences
      FIGF_rnaSeq  | 44
      SPRY2_dMet   | 37
      SCN3A_rnaSeq | 25
      PAMR1_dMet   | 20
      MMP11_rnaSeq | 20
    • CAMUR: rules
      Class  | Rule | Accuracy
      Normal | (FIGF_rnaSeq >= 184.15) and (CLEC5A_dMet <= 5.44) || (TSHZ2_rnaSeq >= 471.04) and (DLGAP2_dMet >= 10.06) | 9.800
      Normal | (SPRY2_dMet >= 0.55) and (CD300LG_rnaSeq >= 454.24) || (PAMR1_rnaSeq >= 712.17) and (PARP8_dMet >= 2.17) | 9.700
  • 23. 23 Other applications on biomedical data • Aim: to extract relevant features from the ever-increasing amount of biological data and to apply supervised learning to classify them
      Biology issue                  | Features                                                  | Software          | Data source
      Clinical patient classification | Clinical variables (blood, imaging, psychometric tests…) | DMB, Weka         | Heterogeneous health care facilities
      Gene expression analysis       | Discretized gene expression profiles                      | Gela, CAMUR       | TCGA, EBRI
      DNA barcoding                  | Nucleotide sequences of DNA barcodes                      | Blog, Fasta2Weka  | Barcode of Life Consortium
      Polyoma/Rhino viruses          | Nucleotide sequences of Polyoma/Rhino viruses             | DMB, MISSAL       | Istituto Superiore di Sanità
      EEG signal processing          | Fourier coefficients extracted from EEG recordings        | Matlab, Weka, DMB | IRCCS Centro di Neurolesi “Bonino-Pulejo” of Messina
      Biomedical image processing    | Oriented FAST and Rotated BRIEF                           | Matlab, Weka, DMB | Alzheimer's Disease Neuroimaging Initiative
  • 24. 24 Conclusions and future directions • Exponential growth of biomedical data • Release of many public databases, data collection and data management projects • Data integration • Supervised classification analysis • Advanced systems for data integration • New big data approaches
  • 25. 25 Acknowledgments Emanuel Weitschek Department of Engineering Uninettuno International University www.iasi.cnr.it/~eweitschek [email protected]