Big Data and Data Science: Case Studies

Priyanka Srivatsa

Abstract- Big data is a collection of large and complex data sets difficult to process using on-hand database management tools or traditional data processing applications. The three V's of Big Data (volume, variety, velocity) constitute a more comprehensive definition, busting the myth that big data is only about data volume. Data Volume is the primary attribute of big data. It can be quantified by counting records, transactions, tables or files. It can also be quantified in terms of time & in terms of terabytes or petabytes. Data Variety, the next significant attribute of big data, is quantified in terms of sources like logs, clickstream or social media. Data Velocity, another important attribute of big data, describes the frequency of data delivery & data generation. Analysis of big data is very complex & time consuming. An important tool that helps understand big data & its analysis is Data Science. Data Science is the study of the generalizable extraction of knowledge from data sets. The study of data science includes studying data processing architectures, data components & processes, data stores & data kind and the challenges of big data.

Keywords: big data, volume, variety, velocity, data science, data processing pipelines, data processing architectures, data components, data processes, data stores, data kind

...error: bugs in the computer system, assumptions of the models & results based on erroneous data.

Data Components provide access to data hosted within the boundaries of the system.

Data Processes are those processes that help in the collection & manipulation of meaningful data.

Data Store is a data repository of a set of integrated objects. These objects are modeled using classes defined in database schemas.

Data Kind refers to the variety of data available for analysis. It includes structured data, unstructured data & semi-structured data. Structured Data exists when the information is clearly broken down into fields that have an explicit meaning & are highly categorical, ordinal or numeric. Unstructured Data exists in the form of natural language text, images, audio & video. It requires pre-processing to identify & extract relevant features. Semi-Structured Data describes structured data that does not conform to the formal structure of data models associated with a relational database or other forms of data tables.
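As a minimal illustration of these three data kinds, the Python sketch below parses a structured CSV record, a semi-structured JSON document, and a fragment of unstructured text. The sample records and field names are hypothetical and are not drawn from the case studies.

    import csv
    import io
    import json
    import re

    # Structured: fields have explicit meaning and fixed positions.
    structured = "1001,2014-07-01,SmartphoneX,249.99"
    row = next(csv.reader(io.StringIO(structured)))
    record = dict(zip(["order_id", "date", "product", "price"], row))

    # Semi-structured: self-describing, but not bound to a fixed relational schema.
    semi_structured = '{"order_id": 1001, "tags": ["mobile", "promo"], "notes": null}'
    doc = json.loads(semi_structured)

    # Unstructured: natural-language text that needs pre-processing to extract features.
    unstructured = "Customer reported that the phone overheats after the latest update."
    tokens = re.findall(r"[a-z]+", unstructured.lower())

    print(record["product"], doc["tags"], tokens[:4])

Structured data can be queried directly, the JSON document must first be navigated by key, and the free text only becomes usable after feature extraction such as the tokenization shown here.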
...easier to navigate the physical world. To achieve this goal, Nokia needed to find a technology solution that would support the collection, storage and analysis of virtually unlimited data types and volumes. Effective collection and use of data has become central to Nokia's ability to understand and improve users' experience with their phones. The company leverages data processing and complex analyses in order to build maps with predictive traffic and layered elevation models, to source information about points of interest around the world & to understand the quality of phones. Cloudera helped Nokia in this endeavor when the company decided to employ APACHE HADOOP to manage & process huge volumes of data.

A. Data Processing Architecture

Data required for analysis is acquired from various resources like phones in use, services, log files, market research, discussion in forums, feedback etc. All this data is sent into a DATA COLLECTOR which collects & stores the various kinds of data required for analysis. After initial data collection, a cleaning process is conducted with sampling & conversion of data.
Then the data is aggregated & sent into a DATA PROCESSOR. This complete process is supervised by a DATA SUPERVISOR that appropriately pre- & post-processes the live data. The aggregated data is sent into a COMPUTE CLOUD component that consists of three parts, namely the Data Broker, the Data Analyzer & the Data Manager.
1) Data Broker collects & repackages information available in the public domain in a format readable & useful to the company.
2) Data Analyzers are tools that specialize in predictive modeling & text mining, thus analyzing the information available.
3) Data Manager is a tool that manages the processing of huge volumes of data by realizing the entities of applications & efficiently creating graphs & information snapshots that deliver the analysis in a presentable format.
There is a DATA REST unit that consists of:
a) QUERY unit used to query the database
b) REPORTER unit that reports the results of the related queries
c) CACHE unit that stores all the temporary information retrieved from the database
d) VISUALIZER unit that helps process the digital data & interpret results
e) AUDIT unit that keeps an account of the amount of data that is being processed & the effective time required to process this data
f) MONITOR unit which monitors the entire functioning of the DATA REST unit
The COMPUTE CLOUD component is supported by the DATA REST unit.
Finally the processed data is fed into the DATA SAAS wherein the various dimensions & prospects of the data are discussed & interpreted.

B. Components

The technology ecosystem consists of:
1) Teradata Enterprise Data Warehouse: It stores & manages data.
2) Oracle & MySQL Data Marts: These are simpler forms of data warehouses.
3) HBase: It is an extensible record store with a basic scalability model of splitting rows & columns into multiple nodes.
4) Scribe: It is used to log data directly into HBase.
5) Sqoop: It is a command-line interface application for transferring data between relational databases and Hadoop (HBase).

C. Process

1) Nokia has over 100 terabytes (TB) of structured data on Teradata and petabytes (PB) of multi-structured data on the Hadoop Distributed File System (HDFS).
2) The centralized Hadoop cluster which lies at the heart of Nokia's infrastructure contains 0.5 PB of data.
3) Nokia's data warehouses and marts continuously stream multi-structured data into a multi-tenant Hadoop environment, allowing the company's 60,000+ employees to access the data.
4) Nokia runs hundreds of thousands of Scribe processes each day to efficiently move data from, for example, servers in Singapore to a Hadoop cluster in the UK data center.
5) The company uses Sqoop to move data from HDFS to Oracle and/or Teradata.
6) Nokia serves data out of Hadoop through HBase.

D. Data Stores

1) Teradata Enterprise Data Warehouse: This data warehouse uses a "shared nothing" architecture, which means that each server node has its own memory and processing power. Adding more servers and nodes increases the amount of data that can be stored. The database software sits on top of the servers and spreads the workload among them.
2) Oracle & MySQL Data Marts: These are focused on a single subject (or functional area), such as Sales, Finance or Marketing. Data marts are often built and controlled by a single department within an organization. Given their single-subject focus, data marts usually draw data from only a few sources. The sources could be internal operational systems, a central data warehouse or external data.
3) HBase: HBase is an Apache project written in Java. It is patterned directly after BigTable:
• HBase uses the Hadoop Distributed File System; it puts updates into memory and periodically writes them out to files on disk (a simplified sketch of this write path is given below).
• The updates go to the end of a data file, to avoid seeks. The files are periodically compacted. Updates also go to the end of a write-ahead log, to enable recovery if a server crashes.
• Row operations are atomic, with row-level locking and transactions. There is optional support for transactions with wider scope. These use optimistic concurrency control, aborting the process if there is a conflict with other updates.
• Partitioning and distribution are transparent; there is no client-side hashing or fixed key space. There is multiple-master support, to avoid a single point of failure. MapReduce support allows operations to be distributed efficiently.
• HBase's B-trees allow fast range queries and sorting.
• There is a Java API, a Thrift API and a REST API; JDBC/ODBC support has recently been added.

E. Data Kind

...effectively as data marts concentrate on concrete, single subjects within one functional area.
d) Query Processing, Data Modelling & Analysis is a phase where general statistical patterns are drawn out of the hidden patterns in the data. HBase effectively derives these statistical patterns.
e) Interpretation is a phase wherein all the assumptions made need to be examined & the possible errors have to be removed. The Data SaaS available in the Hadoop framework is utilized to interpret the processed data results efficiently.
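The HBase write path summarised in the bullets above (updates buffered in memory, appended to a write-ahead log, and periodically flushed to files that are later compacted) can be illustrated with a short, self-contained Python sketch. This is a conceptual simplification, not HBase's actual implementation; the class, method and file names are hypothetical.

    import json
    import os

    class TinyRecordStore:
        """Toy log-structured store: write-ahead log + in-memory buffer + periodic flush."""

        def __init__(self, data_dir, flush_threshold=3):
            self.data_dir = data_dir
            self.flush_threshold = flush_threshold
            self.memstore = {}  # in-memory buffer of recent updates
            os.makedirs(data_dir, exist_ok=True)
            self.wal = open(os.path.join(data_dir, "wal.log"), "a")

        def put(self, row_key, column, value):
            # Append to the write-ahead log first, so the update survives a crash.
            self.wal.write(json.dumps([row_key, column, value]) + "\n")
            self.wal.flush()
            # Apply the update in memory.
            self.memstore.setdefault(row_key, {})[column] = value
            # Periodically write the buffered updates out to a file on disk.
            if len(self.memstore) >= self.flush_threshold:
                self.flush()

        def flush(self):
            file_name = "flush_%d.json" % len(os.listdir(self.data_dir))
            with open(os.path.join(self.data_dir, file_name), "w") as out:
                json.dump(self.memstore, out)
            self.memstore = {}  # the in-memory buffer is emptied after each flush

    if __name__ == "__main__":
        store = TinyRecordStore("tiny_store")
        for i in range(4):
            store.put("row%d" % i, "cf:qualifier", "value%d" % i)
        store.wal.close()

Because every update is only appended, either to the log or to a flushed file, the store avoids random seeks on writes, which is the design point the bullets above describe.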
...from frequent patterns & correlation analysis usually overpower individual fluctuations & often disclose more reliable hidden patterns & knowledge. Interconnected Big Data forms large heterogeneous information networks, with which information redundancy can be explored to compensate for missing data, to crosscheck conflicting cases, to validate trustworthy relationships, to disclose inherent clusters and to uncover hidden relationships and models. Mining requires integrated, cleaned, trustworthy and efficiently accessible data, declarative query and mining interfaces, scalable mining algorithms and big-data computing environments. Data mining itself is being used to help improve the quality and trustworthiness of the data, understand its semantics and provide intelligent querying functions.

e) Interpretation
Ultimately the results of analysis need to be interpreted by a decision maker. The process basically involves examining all the assumptions made & retracing the analysis. The errors have to be debugged & the assumptions at various levels need to be critically examined. Supplementary information that explains the derivation of each result & the inputs involved in the process needs to be mentioned & explained wherever necessary.

B. Components

Oracle deployed its MDM suite, which consists of the following components:
a) Oracle Metadata Manager: It acquires & records the continuous inflow of data into the database.
b) Data Relationship Manager: It consolidates, rationalizes, governs & shares the master reference data.
c) Data Warehouse Manager: It divides the acquired data into specific functional areas & stores them in data marts.
d) BI Publisher: It queries, monitors & reports on the master data.
e) Data Steward Component: It facilitates the UI component & also helps to set up the workbench.

1) Profile the master data. Understand all possible sources and the current state of data quality in each source.
2) Consolidate the master data into a central repository and link it to all participating applications.
3) Govern the master data. Clean it up, de-duplicate it, and enrich it with information from 3rd-party systems. Manage it according to business rules.
4) Share it. Synchronize the central master data with enterprise business processes and the connected applications. Ensure that data stays in sync across the IT landscape.
5) Leverage the fact that a single version of the truth exists for all master data objects by supporting business intelligence systems and reporting.

D. Data Stores

1) Siebel: It is used exclusively to store CRM data.
2) EBS: It provides persistent block-level storage volumes for use with Amazon EC2.
3) SAP: It serves as a storage location for consolidated & cleansed transaction data on an individual level.
4) JDE: It is used to provide periodic updates of the operational data changes required.
5) PSFT: It is used as a data store to manage entire business process relationships.

E. Data Kind

1) Transaction data are business transactions that are captured during business operations and processes, such as purchase records, inquiries, and payments.
2) Metadata, defined as "data about the data", is the description of the data.
3) Master data refers to the enterprise-level data entities that are of strategic value to an organization. They are typically non-volatile and non-transactional in nature.
4) Reference data are internally managed or externally sourced facts that support an organization's ability to effectively process transactions, manage master data, and provide decision support capabilities. Geo data and market data are among the most commonly used reference data.
5) Unstructured data make up over 70% of an organization's data and information assets. They include documents, digital images, geo-spatial data, and multi-media files.
6) Analytical data are derivations of the business operation and transaction data used to satisfy reporting and analytical needs. They reside in data warehouses, data marts, and other decision support applications.
7) Big data refer to large datasets that are challenging to store, search, share, visualize, and analyze. The growth of such data is mainly a result of the increasing channels of data in today's world. Examples include, but are not limited to, user-generated content through social media, web and software logs, cameras, information-sensing mobile devices, aerial sensory technologies, genomics and medical records.

1) SUPPLY CHAIN MANAGEMENT was crucial to this retail supplier, as he was facing losses because of unstructured data management.
2) The time he took to bring his products to market was very long, which led to other companies marketing similar products first.
3) His revenue decreased as his sales fell significantly due to his inability to manage the data.
4) SHIPPING & INVOICING ERRORS were huge, which led to economic & customer losses.
5) Distribution slowed down owing to inadequate management of required data.
6) Errors in acquiring orders resulted in dissatisfaction of the customers.

G. Comparative Study of the Use Case & the BIG DATA PIPELINE
a) Data Acquisition & Recording is a phase where data is acquired from the various sources. The Oracle Metadata Manager tool acquires the required data.
b) Information Extraction & Cleaning is a phase where the data is extracted according to the requirement & made analysis-ready. The Data Relationship Manager extracts the data as per the requirement & makes it analysis-ready.
c) Data Aggregation, Integration & Representation is a phase where relevant data for analysis is grouped considering the heterogeneity of the data acquired. The Data Warehouse Manager separates the relevant data effectively, as data marts concentrate on concrete, single subjects within one functional area.
d) Query Processing, Data Modelling & Analysis is a phase where general statistical patterns are drawn out of the hidden patterns in the data. BI Publisher effectively derives these statistical patterns.
e) Interpretation is a phase wherein all the assumptions made need to be examined & the possible errors have to be removed. The Data Steward Component is utilized to interpret the processed data results efficiently and present them on the workbench.
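The five phases compared above form a linear pipeline. The following Python sketch strings hypothetical phase functions together in that order; the function names and the sample records are illustrative placeholders, not the Oracle MDM or Cloudera components discussed in the case studies.

    import json

    def acquire():
        # a) Data Acquisition & Recording: pull raw records from the sources.
        return ['{"id": 1, "amount": "100"}', '{"id": 2, "amount": "n/a"}']

    def extract_and_clean(raw_records):
        # b) Information Extraction & Cleaning: parse the records and drop malformed values.
        parsed = [json.loads(r) for r in raw_records]
        return [r for r in parsed if r["amount"].isdigit()]

    def aggregate(records):
        # c) Data Aggregation, Integration & Representation: combine into a single view.
        return {"total": sum(int(r["amount"]) for r in records), "count": len(records)}

    def model(summary):
        # d) Query Processing, Data Modelling & Analysis: derive a simple statistic.
        return summary["total"] / summary["count"] if summary["count"] else 0.0

    def interpret(average):
        # e) Interpretation: state the result together with the assumptions made.
        return "average amount = %.2f (records with malformed amounts were discarded)" % average

    print(interpret(model(aggregate(extract_and_clean(acquire())))))

Each phase consumes the previous phase's output, which is why an error or an unexamined assumption introduced early in the pipeline propagates all the way to the interpreted result.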