The document is a comprehensive guide on bioinformatics and managing scientific data, covering topics such as biological data integration, challenges in information integration, and various data management strategies. It includes detailed discussions on specific systems and tools used for data integration, query formulation, and the complexities involved in managing gene expression data. The content is organized into chapters authored by various experts, each addressing different aspects of bioinformatics data management.

Bioinformatics Managing Scientific Data

Contents

Preface xix

1 Introduction
Zoé Lacroix and Terence Critchlow
1.1 Overview 1
1.2 Problem and Scope 2
1.3 Biological Data Integration 4
1.4 Developing a Biological Data Integration System 7
1.4.1 Specifications 7
1.4.2 Translating Specifications into a Technical Approach
1.4.3 Development Process 9
1.4.4 Evaluation of the System 9
References 10

2 Challenges Faced in the Integration of Biological Information 11
Su Yun Chung and John C. Wooley
2.1 The Life Science Discovery Process 12
2.2 An Information Integration Environment for Life Science Discovery 14
2.3 The Nature of Biological Data 15
2.3.1 Diversity 15
2.3.2 Variability 17
2.4 Data Sources in Life Science 17
2.4.1 Biological Databases Are Autonomous 18
2.4.2 Biological Databases Are Heterogeneous in Data Formats 18
2.4.3 Biological Data Sources Are Dynamic 18
2.4.4 Computational Analysis Tools Require Specific Input/Output Formats and Broad Domain Knowledge 19
2.5 Challenges in Information Integration 19
2.5.1 Data Integration 21
2.5.2 Meta-Data Specification 24
2.5.3 Data Provenance and Data Accuracy 25
2.5.4 Ontology 27
2.5.5 Web Presentations 30
Conclusion 31
References 32

3 A Practitioner's Guide to Data Management and Data Integration in Bioinformatics 35
Barbara A. Eckman
3.1 Introduction 35
3.2 Data Management in Bioinformatics 36
3.2.1 Data Management Basics 36
3.2.2 Two Popular Data Management Strategies and Their Limitations 39
3.2.3 Traditional Database Management 41
3.3 Dimensions Describing the Space of Integration Solutions 45
3.3.1 A Motivating Use Case for Integration 45
3.3.2 Browsing vs. Querying 46
3.3.3 Syntactic vs. Semantic Integration 48
3.3.4 Warehouse vs. Federation 49
3.3.5 Declarative vs. Procedural Access 49
3.3.6 Generic vs. Hard-Coded 49
3.3.7 Relational vs. Non-Relational Data Model 50
3.4 Use Cases of Integration Solutions 50
3.4.1 Browsing-Driven Solutions 50
3.4.2 Data Warehousing Solutions 52
3.4.3 Federated Database Systems Approach 54
3.4.4 Semantic Data Integration 58
3.5 Strengths and Weaknesses of the Various Approaches to Integration 60
3.5.1 Browsing and Querying: Strengths and Weaknesses 61
3.5.2 Warehousing and Federation: Strengths and Weaknesses 62
3.5.3 Procedural Code and Declarative Query Language: Strengths and Weaknesses 63
3.5.4 Generic and Hard-Coded Approaches: Strengths and Weaknesses 63
3.5.5 Relational and Non-Relational Data Models: Strengths and Weaknesses 64
3.5.6 Conclusion: A Hybrid Approach to Integration Is Ideal 64
3.6 Tough Problems in Bioinformatics Integration 65
3.6.1 Semantic Query Planning Over Web Data Sources 65
3.6.2 Schema Management 67
3.7 Summary 69
Acknowledgments 70
References 70

4 Issues to Address While Designing a Biological Information System 75
Zoé Lacroix
4.1 Legacy 78
4.1.1 Biological Data 78
4.1.2 Biological Tools and Workflows 79
4.2 A Domain in Constant Evolution 80
4.2.1 Traditional Database Management and Changes 80
4.2.2 Data Fusion 82
4.2.3 Fully Structured vs. Semi-Structured 82
4.2.4 Scientific Object Identity 84
4.2.5 Concepts and Ontologies 85
4.3 Biological Queries 86
4.3.1 Searching and Mining 87
4.3.2 Browsing 89
4.3.3 Semantics of Queries 90
4.3.4 Tool-Driven vs. Data-Driven Integration 91
4.4 Query Processing 92
4.4.1 Biological Resources 92
4.4.2 Query Planning 94
4.4.3 Query Optimization 95
4.5 Visualization 98
4.5.1 Multimedia Data 99
4.5.2 Browsing Scientific Objects 100
4.6 Conclusion 101
Acknowledgments 102
References 102

5 SRS: An Integration Platform for Databanks and Analysis Tools in Bioinformatics 109
Thure Etzold, Howard Harris, and Simon Beaulah
5.1 Integrating Flat File Databanks 112
5.1.1 The SRS Token Server 113
5.1.2 Subentry Libraries 116
5.2 Integration of XML Databases 116
5.2.1 What Makes XML Unique? 118
5.2.2 How Are XML Databanks Integrated into SRS? 120
5.2.3 Overview of XML Support Features 121
5.2.4 How Does SRS Meet the Challenges of XML? 122
5.3 Integrating Relational Databases 124
5.3.1 Whole Schema Integration 124
5.3.2 Capturing the Relational Schema 125
5.3.3 Selecting a Hub Table 126
5.3.4 Generation of SQL 127
5.3.5 Restricting Access to Parts of the Schema 128
5.3.6 Query Performance to Relational Databases 128
5.3.7 Viewing Entries from a Relational Databank 128
5.3.8 Summary 129
5.4 The SRS Query Language 129
5.4.1 SRS Fields 130
5.5 Linking Databanks 130
5.5.1 Constructing Links 131
5.5.2 The Link Operators 132
5.6 The Object Loader 133
5.6.1 Creating Complex and Nested Objects 134
5.6.2 Support for Loading from XML Databanks 135
5.6.3 Using Links to Create Composite Structures 136
5.6.4 Exporting Objects to XML 136
5.7 Scientific Analysis Tools 137
5.7.1 Processing of Input and Output 138
5.7.2 Batch Queues 139
5.8 Interfaces to SRS 139
5.8.1 The Web Interface 139
5.8.2 SRS Objects 140
5.8.3 SOAP and Web Services 141
5.9 Automated Server Maintenance with SRS Prisma 141
5.10 Conclusion 143
References 144

6 The Kleisli Query System as a Backbone for Bioinformatics Data Integration and Analysis 147
Jing Chen, Su Yun Chung, and Limsoon Wong
6.1 Motivating Example 149
6.2 Approach 151
6.3 Data Model and Representation 153
6.4 Query Capability 158
6.5 Warehousing Capability 163
6.6 Data Sources 165
6.7 Optimizations 167
6.7.1 Monadic Optimizations 169
6.7.2 Context-Sensitive Optimizations 171
6.7.3 Relational Optimizations 174
6.8 User Interfaces 175
6.8.1 Programming Language Interface 175
6.8.2 Graphical Interface 179
6.9 Other Data Integration Technologies 179
6.9.1 SRS 179
6.9.2 DiscoveryLink 181
6.9.3 Object-Protocol Model (OPM) 182
6.10 Conclusions 183
References 184

7 Complex Query Formulation Over Diverse Information Sources in TAMBIS 189
Robert Stevens, Carole Goble, Norman W. Paton,
Sean Bechhofer, Gary Ng, Patricia Baker, and Andy Brass
7.1 The Ontology 192
7.2 The User Interface 195
7.2.1 Exploring the Ontology 195
7.2.2 Constructing Queries 197
7.2.3 The Role of Reasoning in Query Formulation 202
7.3 The Query Processor 205
7.3.1 The Sources and Services Model 206
7.3.2 The Query Planner 208
7.3.3 The Wrappers 211
7.4 Related Work 213
7.4.1 Information Integration in Bioinformatics 213
7.4.2 Knowledge Based Information Integration 215
7.4.3 Biological Ontologies 216
7.5 Current and Future Developments in TAMBIS 217
7.5.1 Summary 219
Acknowledgments 220
References 220

8 The Information Integration System K2 225
Val Tannen, Susan B. Davidson, and Scott Harker
8.1 Approach 229
8.2 Data Model and Languages 232
8.3 An Example 235
8.4 Internal Language 239
8.5 Data Sources 240
8.6 Query Optimization 242
8.7 User Interfaces 243
8.8 Scalability 244
8.9 Impact 245
8.10 Summary 246
Acknowledgments 247
References 247

9 P/FDM Mediator for a Bioinformatics Database Federation 249
Graham J. L. Kemp and Peter M. D. Gray
9.1 Approach 250
9.1.1 Alternative Architectures for Integrating Databases 250
9.1.2 The Functional Data Model 252
9.1.3 Schemas in the Federation 254
9.1.4 Mediator Architecture 257
9.1.5 Example 261
9.1.6 Query Capabilities 264
9.1.7 Data Sources 265
9.2 Analysis 266
9.2.1 Optimization 267
9.2.2 User Interfaces 268
9.2.3 Scalability 271
9.3 Conclusions 272
Acknowledgment 272
References 272

10 Integration Challenges in Gene Expression Data Management 277
Victor M. Markowitz, John Campbell, I-Min A. Chen,
Anthony Kosky, Krishna Palaniappan,
and Thodoros Topaloglou
10.1 Gene Expression Data Management: Background 278
10.1.1 Gene Expression Data Spaces 278
10.1.2 Standards: Benefits and Limitations 281
10.2 The GeneExpress System 282
10.2.1 GeneExpress System Components 283
10.2.2 GeneExpress Deployment and Update Issues 283
10.3 Managing Gene Expression Data: Integration Challenges 285
10.3.1 Gene Expression Data: Array Versions 285
10.3.2 Gene Expression Data: Algorithms and Normalization 286
10.3.3 Gene Expression Data: Variability 287
10.3.4 Sample Data 288
10.3.5 Gene Annotations 289
10.4 Integrating Third-Party Gene Expression Data in GeneExpress 291
10.4.1 Data Exchange Formats 291
10.4.2 Structural Data Transformation Issues 293
10.4.3 Semantic Data Mapping Issues 293
10.4.4 Data Loading Issues 296
10.4.5 Update Issues 297
10.5 Summary 298
Acknowledgments 299
Trademarks 299
References 300

11 DiscoveryLink 303
Laura M. Haas, Barbara A. Eckman, Prasad Kodali,
Eileen T. Lin, Julia E. Rice, and Peter M. Schwarz
11.1 Approach 306
11.1.1 Architecture 309
11.1.2 Registration 313
11.2 Query Processing Overview 316
11.2.1 Query Optimization 317
11.2.2 An Example 319
11.2.3 Determining Costs 322
11.3 Ease of Use, Scalability, and Performance 327
11.4 Conclusions 329
References 331

12 A Model-Based Mediator System for Scientific Data Management 335
Bertram Ludäscher, Amarnath Gupta, and Maryann E. Martone
12.1 Background 336
12.2 Scientific Data Integration Across Multiple Worlds: Examples and Challenges from the Neurosciences 338
12.2.1 From Terminology and Static Knowledge to Process Context 340
12.3 Model-Based Mediation 343
12.3.1 Model-Based Mediation: The Protagonists 343
12.3.2 Conceptual Models and Registration of Sources at the Mediator 344
12.3.3 Interplay Between Mediator and Sources 349
12.4 Knowledge Representation for Model-Based Mediation 351
12.4.1 Domain Maps 352
12.4.2 Process Maps 357
12.5 Model-Based Mediator System and Tools 360
12.5.1 The KIND Mediator Prototype 360
12.5.2 The Cell-Centered Database and SMART Atlas: Retrieval and Navigation Through Multi-Scale Data 362
12.6 Related Work and Conclusion 364
12.6.1 Related Work 364
12.6.2 Summary: Model-Based Mediation and Reason-Able Meta-Data 365
Acknowledgments 366
References 366

13 Compared Evaluation of Scientific Data Management Systems 371
Zoé Lacroix and Terence Critchlow
13.1 Performance Model 371
13.1.1 Evaluation Matrix 372
13.1.2 Cost Model 372
13.1.3 Benchmarks 374
13.1.4 User Survey 375
13.2 Evaluation Criteria 376
13.2.1 The Implementation Perspective 377
13.2.2 The User Perspective 382
13.3 Tradeoffs 385
13.3.1 Materialized vs. Non-Materialized 385
13.3.2 Data Distribution and Heterogeneity 386
13.3.3 Semi-Structured Data vs. Fully Structured Data 387
13.3.4 Text Retrieval 388
13.3.5 Integrating Applications 389
13.4 Summary 389
References 390

Concluding Remarks 393
Summary 393
Looking Toward the Future 394

Appendix: Biological Resources 397

Glossary 407

System Information 425
SRS 425
Kleisli 425
TAMBIS 426
K2 426
P/FDM Mediator 427
GeneExpress 427
DiscoveryLink 428
KIND 428

Index 431
Contributors

Patricia Baker
Department of Computer Science
University of Manchester
Manchester, United Kingdom

Simon Beaulah
LION Bioscience Ltd.
Cambridge, United Kingdom

Sean Bechhofer
Department of Computer Science
University of Manchester
Manchester, United Kingdom

Andy Brass
Department of Computer Science
University of Manchester
Manchester, United Kingdom

John Campbell
Gene Logic Inc.
Data Management Systems
Berkeley, California

I-Min A. Chen
Gene Logic Inc.
Data Management Systems
Berkeley, California

Jing Chen
geneticXchange Inc.
Menlo Park, California

Su Yun Chung
The Center for Research on Biological Structure and Function
University of California, San Diego
La Jolla, California

Terence Critchlow
Lawrence Livermore National Laboratory
Livermore, California

Susan B. Davidson
Department of Computer and Information Science
University of Pennsylvania
Philadelphia, Pennsylvania

Barbara A. Eckman
IBM Life Sciences
West Chester, Pennsylvania

Thure Etzold
LION Bioscience Ltd.
Cambridge, United Kingdom

Carole Goble
Department of Computer Science
University of Manchester
Manchester, United Kingdom

Peter M. D. Gray
Department of Computing Science
University of Aberdeen
King's College
Aberdeen, Scotland, United Kingdom

Amarnath Gupta
San Diego Supercomputer Center
University of California, San Diego
San Diego, California

Laura M. Haas
IBM Silicon Valley Lab
San Jose, California

Howard Harris
LION Bioscience Ltd.
Cambridge, United Kingdom

Scott Harker
GlaxoSmithKline
King of Prussia, Pennsylvania

Graham J. L. Kemp
Department of Computing Science
Chalmers University of Technology
Göteborg, Sweden

Prasad Kodali
IBM Life Sciences
Somers, New York

Anthony Kosky
Gene Logic Inc.
Data Management Systems
Berkeley, California

Zoé Lacroix
Arizona State University
Tempe, Arizona

Eileen T. Lin
IBM Silicon Valley Lab
San Jose, California

Bertram Ludäscher
San Diego Supercomputer Center
University of California, San Diego
San Diego, California

Maryann E. Martone
Department of Neurosciences
University of California, San Diego
San Diego, California

Victor M. Markowitz
Gene Logic Inc.
Data Management Systems
Berkeley, California

Gary Ng
Network Inference Ltd.
London, United Kingdom

Krishna Palaniappan
Gene Logic Inc.
Data Management Systems
Berkeley, California

Norman W. Paton
Department of Computer Science
University of Manchester
Manchester, United Kingdom

Julia E. Rice
IBM Almaden Research Center
San Jose, California

Peter M. Schwarz
IBM Almaden Research Center
San Jose, California

Robert Stevens
Department of Computer Science
University of Manchester
Manchester, United Kingdom

Val Tannen
Department of Computer and Information Science
University of Pennsylvania
Philadelphia, Pennsylvania

Thodoros Topaloglou
Gene Logic Inc.
Data Management Systems
Berkeley, California

Limsoon Wong
Institute for Infocomm Research
Singapore

John C. Wooley
Center for Research on Biological Structure and Function
University of California, San Diego
La Jolla, California

About the Authors

Dr. Zoé Lacroix is currently a Research Assistant Professor at Arizona State University. She received a PhD in Computer Science in 1996 from the University of Paris XI (France). Her research interests cover various aspects of data management, and she has published more than 20 journal articles, conference papers, and book chapters. She also has served on numerous conference program committees, organized several panels and workshops, and was an active member in the working groups XML Query Language and XML Forms at the World Wide Web Consortium (W3C). Dr. Lacroix has been involved in bioinformatics for more than 7 years. She has interacted with the Center of Bioinformatics at the University of Pennsylvania and worked for two biotechnology companies, Gene Logic Inc. and SurroMed Inc. Her contributions in bioinformatics include publications, invited talks (Symposium on Bioinformatics organized at the National University of Singapore), and data integration middleware, such as the Object-Web Wrapper, which is currently used at GlaxoSmithKline.

Dr. Terence Critchlow is a computer scientist in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory (LLNL) and leads the DataFoundry project. His involvement in bioinformatics began more than 7 years ago as part of a collaboration between the University of Utah Computer Science department and the Utah Human Genome Center. Since completing his dissertation and joining LLNL in 1997, he has been an active member of the research community, publishing in both computer science and informatics forums, giving invited talks, participating in program committees, and organizing the XML Enabled Searches in Bioinformatics workshop.

Preface

Purpose and Goals


Bioinformatics can refer to almost any collaborative effort between biologists or geneticists and computer scientists and thus covers a wide variety of traditional computer science domains, including data modeling, data retrieval, data mining, data integration, data management, data warehousing, data cleaning, ontologies, simulation, parallel computing, agent-based technology, grid computing, and visualization. However, applying each of these domains to biomolecular and biomedical applications raises specific and unexpectedly challenging research issues.

In this book, we focus on data management, and in particular data integration, as it applies to genomics and microbiology. This is an important topic because data are spread across multiple sources, preventing scientists from efficiently obtaining the information required to perform their research (on average, a pharmaceutical company uses 40 data sources). In this environment, answering a single question may require accessing several data sources and calling on sophisticated analysis tools (e.g., sequence alignment, clustering, and modeling tools). While data integration is a dynamic research area in the database community, the specific needs of biologists have led to the development of numerous middleware systems that provide seamless data access in a results-driven environment (eight middleware systems are described in detail in this book).

The objective of the book is to provide life scientists and computer scientists with a complete view of biological data management by: (1) identifying specific issues in biological data management, (2) presenting existing solutions from both academia and industry, and (3) providing a framework in which to compare these systems.

Book Audience
This book is intended to be useful to a wide audience. Students, teachers, bioinformaticians, researchers, practitioners, and scientists from both academia and industry may all benefit from its material. It contains a comprehensive description of issues for biological data management and an overview of existing systems, making it appropriate for introductory and instructional purposes. Developers not yet familiar with bioinformatics will appreciate descriptions of the numerous challenges that need to be addressed and the various approaches that have been developed to solve them. Bioinformaticians may find the description of existing systems and the list of challenges that remain to be addressed useful. Decision makers will benefit from the evaluation framework, which will aid in their selection of the integration system that best fits the needs of their research laboratory or company. Finally, life scientists, the ultimate users of these systems, may be interested in understanding how they are designed and evaluated.

Topics and Organization


The book is organized as follows: Four introductory chapters are followed by eight chapters presenting systems, an evaluation chapter, a summary, a glossary, and an appendix.

The introduction further refines the focus of this book and provides a working definition of bioinformatics. It also presents the steps that lead to the development of an information system, from its design to its deployment. Chapter 2 introduces the challenges faced by the integration of biological information. Chapter 3 refines these challenges into use cases and provides life scientists with a translation of their needs into technical issues. Chapter 4 illustrates why traditional approaches often fail to meet life scientists' needs.

The following eight chapters each present an approach that was designed and developed to provide life scientists integrated access to data from a variety of distributed, heterogeneous data sources. The presented approaches provide a comprehensive overview of current technology. Each of these chapters is written by the main inventors of the presented system, specifies its requirements, and provides a description of both the chosen approach and its implementation. Because of the self-contained nature of these chapters, they may be read in any order. Chapter 13 provides users and developers with a methodology to evaluate presented systems. Such a methodology may be used to select the system most appropriate for an organization, to compare systems, or to evaluate a system developed in-house. The summary reiterates the state of the art, existing solutions, and new challenges that need to be addressed.

The appendix contains a list of useful biological resources (databases, organizations, and applications) organized in three tables. The acronyms commonly used to refer to them and used in the chapters of this book are spelled out, and current URLs are provided so that readers can access complete information.
