Large Scale Data Analytics
Studies in Computational Intelligence
Volume 806
Series editor
Amandeep S. Sidhu, Biological Mapping Research Institute, Perth, WA, Australia
e-mail: [email protected]
More information about this series at http://www.springer.com/series/11756
Chung Yik Cho • Rong Kun Jason Tan
Chung Yik Cho
Curtin Malaysia Research Institute, Curtin University, Miri, Sarawak, Malaysia
John A. Leong
Curtin Malaysia Research Institute, Curtin University, Miri, Sarawak, Malaysia
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Contents
1 Introduction
  1.1 Project Overview
  1.2 Research Background
  1.3 Problem Statement
  1.4 Objective
  1.5 Outline
2 Background
  2.1 Literature Reviews
    2.1.1 Process of Life Science Discovery
    2.1.2 The Biological Data Nature
    2.1.3 Constant Evolution of a Domain
    2.1.4 Data Integration Challenges
    2.1.5 Semantic Integration Challenges
    2.1.6 Biomedical Ontologies
    2.1.7 Creation of Ontology Methodologies
    2.1.8 Ontology-Based Approach for Semantic Integration
3 Large Scale Data Analytics
  3.1 Language Integrated Query
  3.2 Cloud Computing as a Platform
  3.3 Algebraic Operators for Biomedical Ontologies
    3.3.1 Select Operator
    3.3.2 Union Operator
    3.3.3 Intersection Operator
    3.3.4 Except Operator
4 Query Framework
  4.1 Functions for Querying RCSB Protein Data Bank (PDB)
    4.1.1 Make Query Function
    4.1.2 Do Search Function
Chapter 1
Introduction
In this modern technological age, data is growing larger and faster than in previous decades, and the existing methods used to process and analyse this overflowing amount of data are no longer sufficient. The term large scale data first surfaced in the article "Visually Exploring Gigabyte Datasets in Real Time" [1], published by the Association for Computing Machinery (ACM) in 1999, which noted that holding large scale data without a proper methodology to analyse it is both a huge challenge and an unfortunate situation. In 2000, Peter Lyman and Hal Varian [2] from the University of California at Berkeley (the latter now chief economist at Google) attempted to measure the available data volume and the rate of data growth. The two researchers concluded that 1.5 billion gigabytes of storage were required annually to contain the data produced on film, optical, magnetic and print media.
From 2001 onwards, large scale data has been defined as data of high volume, high velocity and high variety, a definition proposed by Douglas Laney, an industry analyst with Gartner [3]. High volume refers to the continuous growth of data into terabytes or petabytes of information [4]; for instance, existing social networking sites produce terabytes of data per day [5]. High velocity refers to the speed at which data flows in from different data sources [4]; for example, data streaming constantly from a sensor into database storage arrives both in large amounts and at high speed [5]. High variety means the data does not consist mainly of traditional data but also includes structured, semi-structured, unstructured and raw data. These data come from miscellaneous sources such as web pages, e-mails, sensor devices and social media sites, for example Facebook, Twitter, Outlook and Instagram [5].
Two additional elements need to be taken into consideration when it comes to large scale data: variability and complexity. Variability concerns the inconsistency of data flow, as data loads are becoming harder to manage [5]. With the increasing usage of social media (Facebook, for instance, generates over 40 petabytes of data daily), there are increasingly high peaks in the data loads hitting databases [6]. As for complexity, data from various sources are very difficult to relate, match, cleanse and transform across systems. It is important that data are associated with their relevant relationships, hierarchies and data linkages, otherwise they cannot be sorted accordingly [5].
Large scale data has kept growing ever since, and containing such vast information is difficult. To make use of large scale data, a proper methodology for retrieving and analysing these data is required. In this chapter, we discuss the research background, problem statement and objectives of this research.
1.2 Research Background
Faced with an enormous amount of data, traditional data analytic methodologies are no longer sufficient [7]. In this modern technological phase, data can be processed with statistical algorithms by loading it into large high-performance computing clusters to obtain results [7]. The processed data are then stored in different data sources and become useful in scientific and business applications such as the biosciences, market sales and other fields [8].
The term analytics refers to a method of transforming data for better decision making, whereas large scale data analytics is a process that extracts large amounts of information from complex datasets consisting of structured, semi-structured, unstructured and raw data [8]. Large scale data analytics is applicable to various fields, for example improving marketing strategies by analysing real consumer behaviour instead of predicting customer needs and making gut-based decisions [9]. Information extracted from data sources through data analytics can inform and improve the strategic decisions of business leaders simply by adding a feature to study telemetry and user data across multiple platforms, be it mobile applications, websites or desktop applications [10]. Retrieved data can also drive recommendation engines, for example the Netflix and YouTube video suggestions. Large scale analytics uses intensive data mining algorithms to produce accurate results, and high performance processors are required for the process [8]. Since large scale data analytics applications require huge amounts of computational power and data storage, the infrastructure offered by cloud computing can be used as a potent platform [8].
Ontology has been used in large scale analytics to provide a shared vocabulary for data mapping. The word ontology originates from a philosophical term referring to 'the object of existence'; from the perspective of the computer science community, it is known as a 'specification of conceptualization' for information sharing in artificial intelligence [11]. There is a conceptual framework which is presented using
of data querying and data management are easier compared to its predecessors. The query framework built using Language Integrated Query needs to be easily deployable on a cloud computing environment while ensuring that large scale data sources can be handled and queried smoothly.
1.4 Objective
1.5 Outline
Relations between the areas of gene expression profiling, systems biology, proteomics and genomics depend heavily on the integration of experimental procedures with searchable databases, computational algorithm applications and analysis tools [14]. Data from computational analysis and database searches are essential to the whole discovery procedure. Since the systems under study are complex, data derived from simulations and from computational models obtained from databases are combined with experimental data to support better interpretations. Studies on protein pathways, cellular and biochemical processes, simulation and modelling of protein-protein interactions, genetic regulatory networks, and normal and diseased physiologies are currently in their infancy, hence some changes are needed [14]. Quantitative details are missing in the process, and experimental observations are needed to fill in the missing pieces. The boundaries between these experimental datasets and computationally generated data are not sharply defined because the two interact closely; therefore, multidisciplinary groups are required to integrate these approaches and accelerate progress. With the continuing advances made in experimental methods, an information infrastructure can support the understanding of biology with ease [14].
Three varieties of traditional data management systems are widely used: relational, object-relational and object-oriented. According to Lacroix [16], data in relational database systems are represented in the form of relation tables; object-relational systems represent data through classes built on top of a basic relational representation; and object-oriented databases likewise organize data through classes, which makes the representation user-friendly. Traditional database systems are built to support their own data transactions; however, there is a limit to the data changes that the data organization of the database can support. For example, the changes are limited to renaming, adding or removing attributes and relations, and other particulars. Complex schema transactions are not supported by traditional database systems, as the initial designs did not take them into account. To define a new schema, a new database has to be constructed, which changes the data organization of the database. From a biological data source standpoint, this process is too troublesome and unacceptable when changes have to be made frequently [16].
Data fusion refers to combining data obtained from different types of sources. Scientific data are obtained from different instruments performing mass spectrometry, microarrays and other specific procedures [16]. These instruments rely on properly set calibration parameters for standardized data collection, and data collected from similar tasks performed on these instruments can be merged into the same dataset for analysis.
Using a traditional database approach, complete dataset measurements and parameters are required for the complex queries of the data analysis process [16]. If any information is missing or incomplete, the data are ignored and left unprocessed, which is unacceptable to life scientists.
The integration of datasets that are alike but disparate in the biological domain is not supported by existing traditional database methodologies. The solution to this problem is to adopt the structure offered by semi-structured methods [16]. The semi-structured approach introduces a data organization that tolerates new attributes and missing attributes. Semi-structured data are usually represented as a rooted, edge-labelled, directed graph. XML is one example of semi-structured data and has become the standard for storing, describing and interchanging data between many heterogeneous biological databases [16]. Facilities for defining XML content are provided by combining multiple XML schemas [16]. XML provides the flexibility and platform support that are ideal for capturing and representing the complicated data types of biological data.
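To make the point concrete, the following is a minimal sketch of how a semi-structured protein record tolerates missing and additional attributes; the record contents and element names are invented for illustration and are not drawn from any cited database.

```python
# Minimal sketch: a hypothetical semi-structured protein record in XML.
# One entry carries an extra organism attribute and another omits resolution;
# the parser tolerates both, which a fixed relational schema would not.
import xml.etree.ElementTree as ET

xml_data = """
<entries>
  <entry id="1WJ9">
    <method>X-RAY DIFFRACTION</method>
    <resolution>1.8</resolution>
  </entry>
  <entry id="2F5N" organism="Homo sapiens">
    <method>SOLUTION NMR</method>
  </entry>
</entries>
"""

root = ET.fromstring(xml_data)
for entry in root.findall("entry"):
    # findtext() returns a default for missing elements instead of failing
    print(entry.get("id"),
          entry.findtext("method"),
          entry.findtext("resolution", default="n/a"),
          entry.get("organism", "unspecified"))
```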
Data integration was never easy to begin with, and researchers are struggling to improve data integration processes so that data translation can be done quickly and efficiently. Kadadi et al. [17] conducted a survey on the challenges of data integration and interoperability in large scale data and summarized these challenges into seven parts: accommodating the scope of data, data inconsistency, query optimization, inadequate resources, scalability, implementing support systems, and Extract, Load, Transform (ELT) processes in big data. The challenge of accommodating the scope of large datasets and the addition of new domains in any organization can be overcome by integrating high performance computing (HPC) environments with high-performance data storage, for example hybrid storage devices that combine the functionality of a standard hard disk drive (HDD) and a solid state drive (SSD) to reduce data latency and provide fast data access. However, this method leads to the need to upgrade or purchase new equipment.
In their survey, Kadadi et al. [17] clarified that data from different sources lead to data inconsistency, so high computing resources are needed to process unstructured data from large data sources. Query operations are easier to perform on structured data when analysing and obtaining data for various uses, such as business decisions; in large datasets, however, there is normally a high volume of unstructured data. According to the survey, query optimization may affect the attributes when data integration takes place at any level or when data are mapped to an existing or new schema [17].
Furthermore, Kadadi et al. [17] observed that problems arise from inadequate resources in data integration implementations; these include insufficient financial resources and a shortage of personnel skilled in data integration. They also mentioned that highly skilled big data personnel are hard to find, and such personnel require a high level of experience in dealing with data integration modules. In addition, the process of obtaining from vendors the new licenses for tools and technologies required for data integration implementation is tedious.
Kadadi et al. [17] identified scalability issues in scenarios where new data are extracted and integrated from different sources alongside legacy system data. Attempting this heterogeneous integration may affect the performance of the system, which has to undergo updates and modifications to adapt to newer technologies. However, if the legacy systems meet the requirements and are compatible with newer technologies, the process is easier, as fewer updates and modifications are necessary in the ensuing integration process.
Support systems need to be implemented by organizations to handle updates and report errors in every step of the data integration process. In their survey, Kadadi et al. [17] found that implementing support systems requires a training module to train professionals in error report handling, which demands a large investment from organizations. However, through the implementation of support systems, organizations can determine the weaknesses in their system architecture.
Extract, Load, Transform (ELT) is an example of data integration. ELT processes every piece of data that passes through it and outputs these data as a single large dataset after the integration process. The ELT processes are identified after data integration to determine whether storing huge chunks of data would affect the functionality of the database storage [17]. To improve the load process, key constraints are disabled during loading and re-enabled after the process is done, a step that has to be performed manually, as suggested by the researchers [17].
The existing methodologies do not address the complex issues of biological data; recent efforts on ontologies are intended to provide a way to solve these complex problems. According to Gruber [20], the term ontology originates from a philosophical term referring to 'the object of existence'; from the computer science community's perspective, it is known as a 'specification of conceptualization' for sharing information in artificial intelligence. Ontologies deliver a conceptual framework for a meaningful structured view through the common vocabulary provided by biological or medical domains [21]. They can be used by either automated software agents or humans in the domain. The shared vocabulary includes the concepts, the relationships, the definitions of those relationships and, prospectively, ontology rules and axiom definitions, which control the terms introduced into the ontology and support logical inference [21]. Ontologies are slowly emerging as a common language in biomedicine for the more effective communication needed across multiple sources of information and biological data.
A 'skeletal model' was presented by Uschold and King [22] for the design and evaluation of ontologies. The skeletal model comprises several stages that are essential for any ontology engineering methodology. Uschold and Gruninger [23] designed several specific principles to be upheld in each phase: coherence (consistency), extensibility, clarity, minimal ontological commitment, and minimal encoding bias [20]. A semi-informal ontology named the Enterprise Ontology was created by Uschold et al. [24] by following the design principles mentioned above during the ontology capture phase.
Based on the experience of creating the TOVE (TOronto Virtual Enterprise) ontology, Gruninger and Fox [25] developed a new methodology for both the design and the evaluation of ontologies. However, this methodology was designed around a very rigid method and is therefore not suitable for less formal ontologies. Furthermore, the methodology presupposes a first-order logic language for the formulation of axioms, definitions and their justification, so it is not sufficient for ontology languages that are not based on first-order logic.
3. The process of retrieving and accessing knowledge from the concepts of the protein ontology using a query.
4. User goals are achieved through the use of the extracted knowledge.
There are several phases in the process of protein ontology development using the On-To-Knowledge methodology, as shown in Fig. 2.3 [29]:
1. Phase one of the process is the feasibility study. This phase is adopted from the CommonKADS methodology [30]. CommonKADS is a framework for developing knowledge-based systems (KBS) and supports the typical activities of a KBS development project, for example knowledge acquisition, problem identification, project management, knowledge modelling and analysis, analysis of system integration issues, capturing user requirements, and knowledge system design. The outcome of the feasibility study was that On-To-Knowledge should be used to construct the Protein Ontology, for maximum support of its development, maintenance and evaluation.
2. Phase two, the actual first phase of development, outputs the ontology requirement specification. The possibilities of integrating existing protein data sources into the ontology are analysed at this stage. In addition, a number of queries are generated to capture the protein ontology requirements against existing protein data and knowledge frameworks.
5. Phase five, the maintenance phase, is engaged after the protein ontology has been deployed. In this phase, all relevant changes that occur in the world are reflected in the protein ontology.
The focus of the architecture is the global ontology. Liu et al. [34] construct a global ontology by adopting a hybrid strategy; Fig. 2.5 shows the global ontology process.
The first phase of the global ontology process filters data from the different data sources, such as the entities of the data, relationships and attributes. The second phase generates local ontologies by retrieving schemas from databases and items from the synonym table. The global ontology process is completed through ontology evolution, mapping and the application of semantic constraints [34].
In summary, the existing methodologies focus on integrating multiple data sources into a single data source and applying ontology-based semantic integration as a solution to the problem of querying multiple data sources. Existing methodologies can be used to integrate small amounts of data, but not petabytes of data. Taking RCSB PDB as an example, the RCSB PDB databases are updated from time to time, and it is difficult and expensive for these methodologies to update their database live while mapping data at the same time. Multiple data integration challenges are not properly addressed even with semantic integration and ontology-based semantic integration approaches.
In this research, the focus is on querying data sources with different data structures without the need for data integration and data translation. Therefore, the implementation of a smart query system using Language Integrated Query is required to reach the research goal.
Chapter 3
Large Scale Data Analytics
The Select operator performs a projection over a sequence. It allocates and returns an enumerable object that captures the arguments passed to the operator; an argument-null exception is thrown if any argument is null [40].
The Select operator allows the user to highlight and select the portions of an ontology related to the user's query. It selects the instances that meet the given condition through the ontology structure and the selected concept. These instances, which meet the given condition, belong to a specific sub-tree or are a subset of the instances that belong to one or more sub-trees. The Select operator selects only those edges in the ontology that connect nodes in each set. The Select operator, OS, is defined as:
Definition 1
OS = σ(NS, ES, RS), where
NS = Nodes(condition = true)
ES = Edges(∀N ∈ NS)
N, E and R here represent the sets of nodes, edges and relationships of the ontology graph, while NS, ES and RS represent the nodes, edges and relationships of the selection. The join condition operator is not discussed here, as the Select operator can be used in the following forms:
• Simple-Condition: where the select condition is specified using simple content types, like Generic Concepts, in the ontology and the select operator is value-based;
• Complex-Condition: where the select condition is specified using complex content types, like Derived Concepts, in the ontology and the select operator is structure-based; and
• Pattern-Condition: where the select condition is specified using a mix of simple and/or complex content types in the hierarchy with additional constraints, such as an ordering defined using Sequence Relationships in the ontology, and the select operator is pattern-based.
Example 1
When the user requires all the available information in the Protein Ontology with respect to Protein Families, the details of every instance of the Family concept are displayed using the Select operator, as shown in Fig. 3.1.
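The behaviour of the Select operator can be sketched in Python over a toy in-memory graph; the select helper and the sample concept names below are assumptions made purely for illustration and are not part of the query framework itself.

```python
# Minimal sketch of the Select operator over an ontology graph, assuming a
# simple in-memory representation (node set, edge set, relationship map).
def select(nodes, edges, relationships, condition):
    """Keep nodes satisfying the condition and edges connecting kept nodes."""
    ns = {n for n in nodes if condition(n)}
    es = {(a, b) for (a, b) in edges if a in ns and b in ns}
    rs = {e: r for e, r in relationships.items() if e in es}
    return ns, es, rs

# Toy Protein Ontology fragment: select everything under the Family concept.
nodes = {"ProteinOntology", "Family", "Structure", "SuperFamily", "Chains"}
edges = {("ProteinOntology", "Family"), ("ProteinOntology", "Structure"),
         ("Family", "SuperFamily"), ("Structure", "Chains")}
relationships = {e: "subConceptOf" for e in edges}

ns, es, rs = select(nodes, edges, relationships,
                    condition=lambda n: n in {"Family", "SuperFamily"})
print(ns, es, rs)
```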
The Union operator produces the union set of two sequences. It allocates and returns an enumerable object that captures the arguments passed to the operator; an argument-null exception is thrown if any argument is null [40].
When Union returns the enumerated object, the first and second sequences are enumerated, in that order, and every element not previously yielded is yielded. Elements are compared using the non-null comparer argument if one is given; otherwise, the default equality comparer is used.
The union of two parts of the ontology, O1 = (N1, E1, R1) and O2 = (N2, E2, R2), with respect to the semantic relationships (SR) of the ontology is expressed as:
Definition 2
OU(1, 2) = O1 ∪SR O2 = (NU, EU, RU), where
NU = N1 ∪ N2 ∪ NI(1, 2)
EU = E1 ∪ E2 ∪ EI(1, 2), and
RU = R1 ∪ R2 ∪ RI(1, 2), where
OI(1, 2) = O1 ∩SR O2 = (NI(1, 2), EI(1, 2), RI(1, 2)) is the intersection of the two ontologies.
The union operation combines two parts of the ontology while retaining only one copy of the intersection concepts. N, E and R here represent the sets of nodes, edges and relationships of the ontology graph, while NU, EU and RU represent the nodes, edges and relationships of the union.
Example 2
When a user requires all the available information in the Protein Ontology with respect to Protein Structure and Protein Families, all of the information highlighted in Fig. 3.2 is output. That is how the Union operator is used (Family ∪ Structure).
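Continuing the same toy representation, a minimal sketch of the union in Definition 2 reduces to set unions over the node, edge and relationship sets; the helper name and the sample fragments are again illustrative only.

```python
# Minimal sketch of the Union operator: merge the node, edge and relationship
# sets of two ontology parts, keeping a single copy of anything they share.
def union(o1, o2):
    n1, e1, r1 = o1
    n2, e2, r2 = o2
    nu = n1 | n2              # N1 ∪ N2 (the shared part is kept once)
    eu = e1 | e2
    ru = {**r1, **r2}
    return nu, eu, ru

family = ({"Family", "SuperFamily"}, {("Family", "SuperFamily")},
          {("Family", "SuperFamily"): "subConceptOf"})
structure = ({"Structure", "Chains"}, {("Structure", "Chains")},
             {("Structure", "Chains"): "subConceptOf"})
print(union(family, structure))
```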
The Intersect operator produces the intersection set of two sequences. It allocates and returns an enumerable object that captures the arguments passed to the operator; an argument-null exception is thrown if any argument is null [40].
When Intersect returns the enumerated object, the first sequence is enumerated and all the distinct elements of that sequence are collected. The second sequence is then enumerated, and all elements that occur in both sequences are marked. The marked elements are yielded in the order in which they were collected. Elements are compared using the non-null comparer argument if one is given, or the default equality comparer otherwise.
Intersection is a particularly significant and interesting binary operation. There are two parts of the ontology, O1 = (N1, E1, R1) and O2 = (N2, E2, R2), and the composition of both parts provides the answer to the submitted query. N, E and R here represent the sets of nodes, edges and semantic relationships. The intersection of the two parts with respect to the semantic relationships of the ontology is:
Definition 3
OI(1, 2) = O1 ∩SR O2 = (NI, EI, RI), where
NI = Nodes(SR(O1, O2)),
EI = Edges(E1, NI ∩ N1) + Edges(E2, NI ∩ N2) + Edges(SR(O1, O2)), and
The Except operator produces the difference of two sequences. It allocates and returns an enumerable object that captures the arguments passed to the operator; an argument-null exception is thrown if any argument is null [40].
When Except returns the enumerated object, the first sequence is enumerated and all the distinct elements of that sequence are collected. The second sequence is then enumerated, and the elements that also reside in the first sequence are removed. The remaining elements are finally yielded in the order in which they were collected. Elements are compared using the non-null comparer argument if one is given; otherwise, the default equality comparer is used.
The difference between O1 and O2, the two parts of the ontology, is presented as O1 − O2, which includes the portions of the first part that are not common to the second part. The difference can also be represented as O1 − (O1 ∩SR O2): the nodes, edges and relationships that are not present in the intersection but exist in the first ontology.
Example 4
When a query requires all the available information on Protein Entry residing in the Protein Ontology, without the Protein Structure and Protein Entry descriptions, all the Protein Entry information that is not highlighted in Fig. 3.3 is displayed. As ChainsRef is the only concept common to both Protein Structure and Protein Entry, everything excluding ChainsRef is output.
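Using the same illustrative encoding, the intersection of Definition 3 and the difference O1 − (O1 ∩SR O2) can be sketched as plain set operations; here the semantic relationships are simplified to shared nodes, which is an assumption of the sketch rather than the operator's full definition.

```python
# Minimal sketch of the Intersection and Except operators over the same toy
# graph encoding; semantic relationships are reduced to shared nodes.
def intersect(o1, o2):
    n1, e1, r1 = o1
    n2, e2, r2 = o2
    ni = n1 & n2
    ei = {(a, b) for (a, b) in (e1 | e2) if a in ni and b in ni}
    ri = {e: r for e, r in {**r1, **r2}.items() if e in ei}
    return ni, ei, ri

def except_(o1, o2):
    """Everything in the first part that is not in the common portion."""
    ni, ei, ri = intersect(o1, o2)
    n1, e1, r1 = o1
    return n1 - ni, e1 - ei, {e: r for e, r in r1.items() if e not in ei}

entry = ({"Entry", "ChainsRef", "Description"},
         {("Entry", "ChainsRef"), ("Entry", "Description")},
         {("Entry", "ChainsRef"): "hasPart", ("Entry", "Description"): "hasPart"})
structure = ({"Structure", "ChainsRef"}, {("Structure", "ChainsRef")},
             {("Structure", "ChainsRef"): "hasPart"})
print(except_(entry, structure))   # the common node ChainsRef is excluded
```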
Chapter 4
Query Framework
The Protein Data Bank (PDB) has a vast amount of resources related to protein 3D models, complex assemblies and nucleic acids that can be utilized by both students and researchers learning the characteristics of biomedicine. Therefore, a framework is needed to retrieve information effectively from its database. The functions used to enable users to query RCSB PDB are explained in this chapter.
Figure 4.1 shows the structure and Python code, written in Visual Studio, for the make query function.
The make_query() function initiates a search based on a list of search terms and requirements and outputs a compiled dictionary object which users can search later on. Several query types can be used for the search, as follows:
HoldingsQuery: a normal search of the metadata of any related PDB IDs.
ExpTypeQuery: a search based on experimental method, for example 'X-RAY'.
AdvancedKeywordQuery: any matches that appear in either the title or the abstract.
StructureIdQuery: a normal search by the provided structure ID.
ModifiedStructuresQuery: a search based on structure relevancy.
AdvancedAuthorQuery: a search of entries based on the name of the author.
MotifQuery: a normal search for a motif.
NoLigandQuery: a search of every PDB ID that has no free ligands.
Figure 4.2 shows the Python code and structure used for the do search function.
The do_search() function converts the dictionary (dict()) object into XML format and then sends a request to obtain a list of matching IDs according to the search results from PDB. In this case, the results obtained from the make_query() function are converted to XML, and the XML is used to prompt PDB for a list of matching PDB IDs.
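A minimal usage sketch of these two functions, assuming the legacy pypdb API described in [41]; the import path and the exact return shapes may differ between library versions.

```python
# Usage sketch for make_query() and do_search(), assuming the legacy pypdb
# API [41]; signatures and return shapes may differ between versions.
from pypdb import make_query, do_search

query = make_query("crispr")        # build the compiled query dictionary
found_pdb_ids = do_search(query)    # XML request to PDB, matching IDs returned
print(found_pdb_ids[:5])
```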
Figure 4.3 shows the code and structure of the do protsym search function.
The do_protsym_search() function searches for identical entries from user-specified symmetry groups in the Protein Data Bank (PDB). The minimum and maximum allowed deviations, measured in Angstroms, are adjusted to determine which results are categorized as identical symmetry. For instance, when 'C9' is used as the point group, the results returned are '1KZU', '1NKZ', '2FKW', '3B8M' and '3B8N'.
Figure 4.4 shows the code used to construct the get all function.
The get_all() function lists all the PDB IDs currently available in the RCSB Protein Data Bank.
Figure 4.5 shows the code and structure of the get info function.
The get_info() function retrieves all information related to the supplied PDB ID. By combining the specific URL and the PDB ID, information about a specific protein can be retrieved.
Figure 4.6 shows the structure and code, in Visual Studio, of the get PDB file function.
The get_pdb_file() function allows users to retrieve the full PDB file for a given PDB ID. A few file types can be retrieved from PDB, namely pdb, cif, xml and structfact. The default selection is pdb; however, users can change the file type to the one they need. The compressed (gz) file can be retrieved from PDB as well in this process.
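A short usage sketch for get_info() and get_pdb_file(), again assuming the legacy pypdb API [41]; the filetype keyword follows the description above and may differ in other releases.

```python
# Usage sketch, assuming the legacy pypdb API [41]; exact keyword names and
# return types are assumptions based on the description in the text.
from pypdb import get_info, get_pdb_file

description = get_info("1WJ9")                    # metadata for one PDB ID
pdb_text = get_pdb_file("1WJ9", filetype="pdb")   # full PDB file as text
print(pdb_text.splitlines()[0])
```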
Figure 4.7 shows the Python code and structure of the get all info function.
Figure 4.8 shows the get raw blast code and structure, written in Visual Studio.
The purpose of the get_raw_blast() function is to fetch the full BLAST page for the supplied PDB ID. The BLAST page can be returned in XML, TXT or HTML format, depending on the preference of the user; the default is HTML.
Figure 4.9 shows the code and structure, written in Visual Studio, for the parse blast function.
The parse_blast() function is used to clean up the retrieved HTML BLAST selection. The BeautifulSoup and re modules are needed for this function to work. The function processes all the complicated results from the BLAST search function and compiles the matches into a list. A raw text file is produced to display the alignment of all matches. HTML input is much better suited to this function than the other formats.
Figure 4.10 shows the code for the get blast wrapper function.
The get_blast2() function is an alternative way of searching BLAST with the supplied PDB ID. It serves as a wrapper around get_raw_blast() and parse_blast().
Figure 4.11 shows the structure and code of the describe PDB function.
The describe_pdb() function retrieves the requested description and metadata for the input PDB ID. For example, the details shown in Fig. 4.12 for a search include the authors, deposition date, experimental method, keywords, nr atoms (number of atoms), release date, resolution and further related details.
Figure 4.13 shows the constructed code for the get entity info function.
The get_entity_info() function returns all information related to the PDB ID. The information returned to the user comprises the entity, type, chain, method, biological assemblies, release date, resolution and structure ID, as shown in Fig. 4.14.
Figure 4.17 shows the structure and code, constructed in Visual Studio, for the get ligands function.
Figure 4.21 shows the code construction of the get sequence cluster function in Visual Studio.
The get_seq_cluster() function retrieves the sequence cluster of the assigned PDB ID with a chain suffix. Instead of the normal 4-character PDB ID, a chain identifier is appended after a dot, giving the form XXXX.X. An example of the sequence cluster retrieved for the PDB ID chain 2F5N.A is shown in Fig. 4.22.
Figure 4.23 shows the code and structure of the get blast function.
The get_blast() function retrieves BLAST results for the user-supplied PDB ID. The search result is returned as a nested dictionary containing all the BLAST results and their metadata. For example, when 2F5N.A is entered as the PDB ID, the returned result is as shown in Fig. 4.24.
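A usage sketch for get_blast(), assuming the legacy pypdb API [41]; the chain_id keyword is an assumption inferred from the 2F5N.A chain notation above and may not match the actual signature.

```python
# Usage sketch, assuming the legacy pypdb API [41]; the chain_id keyword is an
# assumption and the nested-dictionary layout depends on the library version.
from pypdb import get_blast

blast_results = get_blast("2F5N", chain_id="A")   # nested dict of BLAST hits
print(type(blast_results))
```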
Figure 4.25 shows how the get PFAM function is constructed in Visual Studio.
The get_pfam() function returns the PFAM annotations for a PDB ID. The PFAM annotation result is shown in Fig. 4.26.
Figure 4.29 shows the structure and code for the find results generator function.
The find_results_gen() function outputs a generator for the results returned by any search of the Protein Data Bank conducted internally. A sample result is shown in Fig. 4.30.
Figure 4.31 shows the code and structure for the parse results generator function.
The parse_results_gen() function queries PDB with a specific search term and field without violating the existing limitations of the API. If the search result exceeds the limit, a warning message is displayed to notify the user that the results are returned in a timely manner but may be incomplete.
Figure 4.34 shows the constructed structure and code of the find authors function.
The purpose of the find_authors() function is the same as that of the find_papers function, except that it searches for top authors instead. It ranks authors by the number of PDB entries their name is linked with, not by author order or by the ranking of the entry. Therefore, if an author has published a significant number of papers related to the search term, their work will have priority over that of any other author who wrote fewer papers, even ones more closely related to the search term used. An example is shown in Fig. 4.35, where the title 'crispr' is used as the search term.
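The tallying logic described above can be illustrated with a short sketch (this is not the find_authors() implementation itself); the entries and author names are invented for the example.

```python
# Illustrative sketch of the tallying logic described above: each author is
# credited once per PDB entry, regardless of author order, so the most
# frequently occurring names come out on top.
from collections import Counter

entries = {                     # hypothetical search results: PDB ID -> authors
    "5XYZ": ["Doudna, J.A.", "Jiang, F."],
    "5ABC": ["Doudna, J.A.", "Chen, J.S."],
    "5DEF": ["Zhang, F."],
}
author_counts = Counter(name for authors in entries.values() for name in authors)
print(author_counts.most_common(2))
```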
Figure 4.37 shows the code and structure, built in Visual Studio, for the list taxonomy function.
The list_taxa() function examines and returns any taxonomy-related information provided within the descriptions of the search results returned by the get_all_info() function. Descriptions from the PDB website include the species name in each entry and occasionally carry information on body parts or organs. For example, if the user searches for 'crispr', the result returned is as shown in Fig. 4.38.
Figure 4.41 shows the code for the remove at sign function.
The remove_at_sign() function, as the name suggests, removes any '@' character from the start of key names in a dictionary.
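A minimal sketch of the described behaviour, assuming dictionary keys of the kind produced by XML-to-dictionary conversion; the sample record is invented for illustration.

```python
# Minimal sketch: strip a leading '@' from key names (attribute keys produced
# by XML-to-dict conversion often carry this prefix).
def remove_at_sign(record):
    return {key.lstrip("@"): value for key, value in record.items()}

print(remove_at_sign({"@structureId": "1WJ9", "title": "Cas protein"}))
```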
Figure 4.43 shows the structure and code, in Visual Studio, for the walk nested dictionary function.
A nested dictionary may contain huge lists of other dictionaries of unknown length. Therefore, a depth-first search is used to find out whether a key is present in any of the dictionaries. The maxdepth variable can be adjusted to set the maximum depth to which a nested dictionary is searched for the desired result.
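A minimal sketch of the depth-first search described above (not the library's own implementation); the nested sample record and the default maxdepth value are assumptions for illustration.

```python
# Minimal sketch of a depth-first search over a nested dict/list structure,
# giving up once maxdepth levels have been descended.
def walk_nested_dict(data, target_key, maxdepth=5, depth=0):
    if depth > maxdepth:
        return None
    if isinstance(data, dict):
        if target_key in data:
            return data[target_key]
        for value in data.values():
            found = walk_nested_dict(value, target_key, maxdepth, depth + 1)
            if found is not None:
                return found
    elif isinstance(data, list):
        for item in data:
            found = walk_nested_dict(item, target_key, maxdepth, depth + 1)
            if found is not None:
                return found
    return None

nested = {"polymer": [{"chain": {"id": "A"}},
                      {"macroMolecule": {"accession": {"id": "P04637"}}}]}
print(walk_nested_dict(nested, "accession"))
```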
In this research, the structure of the query framework explained in Chap. 4 is implemented on Microsoft Azure. The query framework can be accessed as a web portal through any web browser, for example Internet Explorer, Microsoft Edge, Google Chrome and others. The web portal is built to be user friendly and easy to navigate when retrieving data from RCSB PDB. The results of the query web portal are shown in this chapter.
Figure 5.1 displays the homepage of the query web portal. The web portal is built to enable users and researchers in Malaysia to access the system with ease for protein ontology query purposes.
Figure 5.2 shows the search page of the query web portal. The search function enables users to search the RCSB PDB with their desired keyword. For example, a search for data relevant to 'crispr' is entered in the search field, as shown above.
Figure 5.3 displays the search result for the keyword 'crispr'. As displayed in this figure, the search function works as intended. The search webpage displays all the relevant PDB IDs and information for the requested search.
Figure 5.4 shows the information related to Protein ID '1WJ9'. The full information for a PDB ID obtained from the search query can be further elaborated when it is selected. As shown in Fig. 5.4, the information that can be accessed comprises the protein description, molecule, journal, atom sequence, unit cell for cryst, unit cell for origx, unit cell for scale, helices, sequence residues and sheets.
Figure 5.5 shows the detailed information of protein ID '1WJ9'. Each of the PDB ID attributes can be further expanded through selection to display the full information for that attribute.
Figure 5.6 shows the contact page of the query web portal. The contact infor-
mation displayed on the webpage enables users or researchers to give feedback on
the query web portal.
5.2 Summary
6.1 Conclusion
This research highlights the difficulties faced by the current generation in database querying. Recent methodologies such as semantic integration focus on data integration, data mapping and data translation. These approaches work for small to medium data sources; however, when it comes to querying databases that are huge and constantly updated by users around the world, they are neither suitable nor cost effective.
To overcome these challenges from a different perspective, we presented a different querying method using Language Integrated Query in this research. Instead of integrating existing datasets from different data sources into a single source, we used Language Integrated Query to build a query framework capable of querying sources directly, without the need for data translation or integration. To ensure that there are no performance issues, the query framework is implemented on a cloud computing environment, Microsoft Azure, to utilize the vast computing resources available there. A user-friendly web portal was built and implemented on Microsoft Azure for users to search and query the RCSB PDB without any issue.
Through the construction and implementation of the query framework, the framework can perform thorough searches of RCSB PDB and return results as planned. A search may take longer depending on the keyword or query requested by the user, owing to certain limitations on both the client and the server side. These limitations are discussed in the next section of this chapter.
6.2 Limitations
There are several factors that limit the query framework's ability to function smoothly with minimal delays.
These issues can be mitigated through several methods:
1. Upgrading the existing RCSB PDB server infrastructure, mainly in terms of hardware, connectivity and software.
2. Increasing the resources of the Microsoft Azure virtual machine, which increases the expense of maintaining the existing cloud computing infrastructure of Curtin University Malaysia.
3. Changing the hosting location of the virtual machine to the hosting site nearest to RCSB PDB, in this case the United States of America.
However, the main issue remains that, with the technology currently available, it is still difficult to host large scale data while ensuring all operations run smoothly. Because of the large number of researchers and users relying on RCSB PDB, it is hard for the RCSB PDB server to cater to all these requests without latency issues. Therefore, the delay in querying RCSB PDB is due to both latency and hardware limitations.
The hardware on which the web portal is deployed plays a huge part as well. If the hardware performance is insufficient, the framework will crash, depending on the number of queries and users.
For future development, the infrastructure hosting the Microsoft Azure cloud computing platform can be improved to withstand the stress imposed by the query framework on the available hardware under heavy usage. However, this will increase the cost of the project.
Beyond that, the program can be further optimized to decrease the latency and the load imposed on the hosting server. The existing search functions in the program can also be fashioned into an advanced search, featured in the web portal, that searches for and returns only a very specific component of a protein's data from RCSB PDB.
Appendix
15. A.S. Sidhu, M. Bellgard, Protein data integration problem, in Biomedical Data and
Applications, ed. by A.S. Sidhu, T.S. Dillon (Springer, Berlin, 2009), pp. 55–69
16. Z. Lacroix, Issues to address while designing a biological information system, in
Bioinformatics: Managing Scientific Data, ed. by Z. Lacroix, T. Critchlow. The Morgan
Kaufmann Series in Multimedia Information and Systems, 2003, pp. 75–108
17. A. Kadadi, R. Agrawal, C. Nyamful, R. Atiq, Challenges of data integration and
interoperability in big data, in 2014 IEEE International Conference on Big Data (Big
Data), Washington, DC, 2014, pp. 38–40
18. K. Baclawski, M. Kokar, P. Kogut, L. Hart, J. Smith, W. Holmes, III et al., Extending UML
to support ontology engineering for the semantic web, in UML 2001—The Unified
Modeling Language. Modeling Languages, Concepts, and Tools, vol. 2185, ed. by M.
Gogolla, C. Kobryn (Springer, Berlin, 2001), pp. 342–360
19. A. Doan, A.Y. Halevy, Semantic integration research in the database community: a brief
survey. AI Mag. 26, 83 (2005)
20. T.R. Gruber, A translation approach to portable ontology specifications. Knowl. Acquis. 5
(2), 199–220 (1993)
21. A.S. Sidhu, M. Bellgard, T.S. Dillon, Classification of information about proteins, in
Bioinformatics: Tools and Applications, ed. by D. Edwards, J. Stajich, D. Hansen (Springer,
New York, 2009), pp. 243–258
22. M. Uschold, M. King, Towards a methodology for building ontologies. Workshop on Basic
Ontological Issues in Knowledge Sharing Held in Conjunction with IJCAI 1995 (Morgan
Kaufmann, 1995)
23. M. Uschold, M. Gruninger, Ontologies: principles methods and applications. Knowl. Eng.
Rev. 11(2), 93–155 (1996)
24. M. Uschold, M. King, S. Morale, Y. Zorgios, The enterprise ontology. Knowl. Eng. Rev. 13
(1), 31–89 (1998)
25. M. Gruninger, M.S. Fox, Methodology for design and evaluation of ontologies. Workshop
on Basic Ontological Issues in Knowledge Sharing Held in Conjunction with IJCAI 1995,
Montreal, Canada (Morgan Kaufmann, 1995)
26. S. Staab, R. Studer, H.P. Schnurr, Y. Sure, Knowledge processes and ontologies. IEEE Intell.
Syst. 16(1), 26–34 (2001)
27. M. Genesereth, Knowledge interchange format, in Second International Conference on
Principles of Knowledge Representation and Reasoning, Cambridge (Morgan Kaufmann,
1991)
28. M. Genesereth, R. Fikes, Knowledge Interchange Format Version 3 Reference Manual
(Stanford University Logic Group, Stanford, 1992)
29. Ontoweb, A survey on methodologies for developing, maintaining, evaluating and
reengineering ontologies, in Deliverable 1.4 of OntoWeb Project, ed. by M.
Fernández-López, Karlsruhe, Germany, AIFB Germany & VUB STAR Lab (2002).
Available: http://www.ontoweb.org/About/Deliverables/index.html
30. G. Schreiber, H. Akkermans, A. Anjewierden, R. Dehoog, N. Shadbolt, W. Vandevelde, B.
Wielinga, Knowledge Engineering and Management: The Common KADS Methodology
(MIT Press, Cambridge, 2000)
31. D.L. McGuinness, F. Harmelen (eds.), W3C-OWL, OWL web ontology language overview,
in W3C Recommendation 10 February 2004, McGuinness. World Wide Web Consortium
(2004)
32. N. Arch-int, S. Arch-int, Semantic information integration for electronic patient records
using ontology and web services model, in 2011 International Conference on Information
Science and Applications, Jeju Island, 2011, pp. 1–7
33. J. Carroll, I. Dickinson, C. Dollin, D. Reynolds, A. Seaborne, K. Wilkinson, Jena:
implementing the semantic web recommendations, in Proceedings of the 13th World Wide
Web Conference, New York City, USA, pp. 74–83, 17–22 May 2004
34. X. Liu, C. Hu, J. Huang, F. Liu, OPSDS: a semantic data integration and service system
based on domain ontology, in 2016 IEEE First International Conference on Data Science in
Cyberspace (DSC), Changsha, 2016, pp. 302–306
35. W. Yunxiao, Z. Xuecheng, The research of multi-source heterogeneous data integration
based on LINQ, in 2012 International Conference on Computer Science and Electronics
Engineering (ICCSEE), 2012, pp. 147–150
36. Querying Across Relationships (LINQ to SQL), Microsoft, 2017 (Online). Available: https://
msdn.microsoft.com/en-us/library/vstudio/bb386932(v=vs.100).aspx
37. What is Python? Executive Summary, Python.org, 2017 (Online). Available: https://www.
python.org/doc/essays/blurb/
38. E.J. Qaisar, Introduction to cloud computing for developers: key concepts, the players and
their offerings, in 2012 IEEE TCF Information Technology Professional Conference, Ewing,
NJ, 2012, pp. 1–6
39. M. Hamdaqa, L. Tahvildari, Cloud computing uncovered: a research landscape. Adv.
Comput., 41–85 (2012)
40. A. Hejlsberg, M. Torgersen, The .NET standard query operators, Microsoft, 2017 (Online).
Available: http://msdn.microsoft.com/en-us/library/bb394939.aspx
41. W. Gilpin, A python API for the RCSB protein data bank (PDB), Github, 2016 (Online).
Available: https://github.com/williamgilpin/pypdb