CHAPTER 3: Big Data Adoption and Planning Considerations

• Organization Prerequisites
• Data Procurement
• Privacy
• Security
• Provenance
• Limited Realtime Support
• Distinct Performance Challenges
• Distinct Governance Requirements
• Distinct Methodology
• Clouds
• Big Data Analytics Lifecycle

Big Data initiatives are strategic in nature and should be business-driven. The adoption of Big Data can be
transformative but is more often innovative. Transformation activities are typically low-risk endeavors designed
to deliver increased efficiency and effectiveness.

Innovation requires a shift in mindset because it will fundamentally alter the structure of a business either in its
products, services or organization. This is the power of Big Data adoption; it can enable this sort of change.
Innovation management requires care—too many controlling forces can stifle the initiative and dampen the
results, and too little oversight can turn a best-intentioned project into a science experiment that never delivers
promised results. It is against this backdrop that Chapter 3 addresses Big Data adoption and planning
considerations.
Given the nature of Big Data and its analytic power, there are many issues that need to be considered and
planned for in the beginning. For example, with the adoption of any new technology, the means to secure it in
a way that conforms to existing corporate standards needs to be addressed. Issues related to tracking the provenance of a dataset from its procurement to its utilization are often new requirements for organizations.
Managing the privacy of constituents whose data is being handled or whose identity is revealed by analytic
processes must be planned for. Big Data even opens up additional opportunities to consider moving beyond on-
premise environments and into remotely-provisioned, scalable environments that are hosted in a cloud. In
fact, all of the above considerations require an organization to recognize and establish a set of distinct
governance processes and decision frameworks to ensure that responsible parties understand Big Data’s
nature, implications and management requirements.
Organizationally, the adoption of Big Data changes the approach to performing business analytics. For this
reason, a Big Data analytics lifecycle is introduced in this chapter. The lifecycle begins with the establishment
of a business case for the Big Data project and ends with ensuring that the analytic results are deployed to the
organization to generate maximal value. There are a number of stages in between that organize the steps of
identifying, procuring, filtering, extracting, cleansing and aggregating of data. This is all required before the
analysis even occurs. The execution of this lifecycle requires new competencies to be developed or hired into
the organization.
As demonstrated, there are many things to consider and account for when adopting Big Data. This chapter
explains the primary potential issues and considerations.

Organization Prerequisites
Big Data frameworks are not turn-key solutions. In order for data analysis and analytics to offer value,
enterprises need to have data management and Big Data governance frameworks. Sound processes and
sufficient skillsets for those who will be responsible for implementing, customizing, populating and using Big
Data solutions are also necessary. Additionally, the quality of the data targeted for processing by Big Data
solutions needs to be assessed.
Outdated, invalid, or poorly identified data will result in low-quality input which, regardless of how good the
Big Data solution is, will continue to produce low-quality results. The longevity of the Big Data environment
also needs to be planned for. A roadmap needs to be defined to ensure that any necessary expansion or
augmentation of the environment is planned out to stay in sync with the requirements of the enterprise.

Data Procurement
The acquisition of Big Data solutions themselves can be economical, due to the availability of open-source
platforms and tools and opportunities to leverage commodity hardware. However, a substantial budget may
still be required to obtain external data. The nature of the business may make external data very valuable. The
greater the volume and variety of data that can be supplied, the higher the chances are of finding hidden
insights from patterns.
External data sources include government data sources and commercial data markets. Government-provided
data, such as geo-spatial data, may be free. However, most commercially relevant data will need to be purchased and may involve ongoing subscription costs to ensure the delivery of updates to procured datasets.

Privacy
Performing analytics on datasets can reveal confidential information about organizations or individuals. Even
analyzing separate datasets that contain seemingly benign data can reveal private information when the
datasets are analyzed jointly. This can lead to intentional or inadvertent breaches of privacy.
Addressing these privacy concerns requires an understanding of the nature of data being accumulated and
relevant data privacy regulations, as well as special techniques for data tagging and anonymization. For
example, telemetry data, such as a car’s GPS log or smart meter data readings, collected over an extended
period of time can reveal an individual’s location and behavior, as shown in Figure 3.1.
Figure 3.1 Information gathered from running analytics on image files, relational data and textual data is
used to create John’s profile.
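
A minimal sketch, in Python, of the kind of data tagging and anonymization techniques mentioned above, applied to a single telemetry reading: the direct identifier is replaced with a salted hash and the GPS coordinates are coarsened. The field names, salt handling and precision rules are illustrative assumptions, and measures like these reduce rather than eliminate re-identification risk.

import hashlib

def anonymize_reading(reading, salt):
    """Pseudonymize the identifier and coarsen the location in one telemetry record."""
    anonymized = dict(reading)
    # Replace the direct identifier with a salted hash (pseudonymization).
    token = hashlib.sha256((salt + reading["vehicle_id"]).encode("utf-8")).hexdigest()
    anonymized["vehicle_id"] = token[:16]
    # Coarsen GPS coordinates to roughly 1 km precision to limit location inference.
    anonymized["lat"] = round(reading["lat"], 2)
    anonymized["lon"] = round(reading["lon"], 2)
    return anonymized

reading = {"vehicle_id": "VIN-1HGCM82633A004352", "lat": 40.741895, "lon": -73.989308, "speed_kmh": 54}
print(anonymize_reading(reading, salt="example-salt"))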

Security
Some of the components of Big Data solutions lack the robustness of traditional enterprise solution
environments when it comes to access control and data security. Securing Big Data involves ensuring that the
data networks and repositories are sufficiently secured via authentication and authorization mechanisms.
Big Data security further involves establishing data access levels for different categories of users. For
example, unlike traditional relational database management systems, NoSQL databases generally do not
provide robust built-in security mechanisms. They instead rely on simple HTTP-based APIs where data is
exchanged in plaintext, making the data prone to network-based attacks, as shown in Figure 3.2.

Figure 3.2 NoSQL databases can be susceptible to network-based attacks.
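
Where a datastore only exposes a simple HTTP API, access is commonly wrapped behind an authenticated, TLS-protected gateway as a compensating control. The sketch below assumes the third-party Python requests library and a hypothetical gateway endpoint; the host names, paths and credentials are placeholders rather than any specific product's API.

import requests

# Hypothetical endpoints; the hosts, ports and paths are placeholders.
INSECURE_URL = "http://nosql.internal:8098/buckets/customers/keys/1001"   # plaintext, no auth
SECURE_URL = "https://gateway.internal:8443/buckets/customers/keys/1001"  # TLS-terminated gateway

# A plaintext request would expose both credentials and data to anyone on the network path,
# so the same API is instead reached through an authenticated, encrypted gateway.
response = requests.get(
    SECURE_URL,
    auth=("analytics_svc", "example-password"),  # basic auth enforced by the gateway
    verify="/etc/ssl/certs/internal-ca.pem",     # validate the gateway's certificate
    timeout=5,
)
response.raise_for_status()
print(response.json())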


Provenance
Provenance refers to information about the source of the data and how it has been processed. Provenance
information helps determine the authenticity and quality of data, and it can be used for auditing purposes.
Maintaining provenance as large volumes of data are acquired, combined and put through multiple processing
stages can be a complex task. At different stages in the analytics lifecycle, data will be in different states because it may be in transit, in use or in storage. These states correspond to the notions of data-in-motion, data-in-use and data-at-rest. Importantly, whenever Big Data changes state, the change should trigger the
capture of provenance information that is recorded as metadata.
As data enters the analytic environment, its provenance record can be initialized with the recording of
information that captures the pedigree of the data. Ultimately, the goal of capturing provenance is to be able
to reason over the generated analytic results with the knowledge of the origin of the data and what steps or
algorithms were used to process the data that led to the result. Provenance information is essential to being
able to realize the value of the analytic result. Much like scientific research, if results cannot be justified and
repeated, they lack credibility. When provenance information is captured on the way to generating analytic
results as in Figure 3.3, the results can be more easily trusted and thereby used with confidence.

Figure 3.3 Data may also need to be annotated with source dataset attributes and processing step details
as it passes through the data transformation steps.
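
A minimal sketch of how provenance might be captured as metadata each time the data changes state, as described above. The record structure, step names and state labels are illustrative assumptions rather than a prescribed schema.

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    dataset_id: str
    source: str                       # pedigree: where the data originally came from
    steps: list = field(default_factory=list)

    def record_step(self, step_name, state, details=""):
        """Append a provenance entry whenever the data changes state."""
        self.steps.append({
            "step": step_name,
            "state": state,           # data-in-motion, data-in-use or data-at-rest
            "details": details,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

prov = ProvenanceRecord(dataset_id="meter_readings_2024Q1", source="utility smart meter feed")
prov.record_step("acquisition", "data-in-motion", "received via SFTP drop")
prov.record_step("filtering", "data-in-use", "removed corrupt records")
prov.record_step("storage", "data-at-rest", "persisted to analysis repository")
print(prov.steps)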
Limited Realtime Support
Dashboards and other applications that require streaming data and alerts often demand realtime or near-
realtime data transmissions. Many open source Big Data solutions and tools are batch-oriented; however, there
is a new generation of realtime capable open source tools that have support for streaming data analysis. Many
of the realtime data analysis solutions that do exist are proprietary. Approaches that achieve near-realtime
results often process transactional data as it arrives and combine it with previously summarized batch-
processed data.
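
A minimal sketch of the near-realtime approach described above, in which transactional records are combined with a previously computed batch summary as they arrive. The data shapes and store identifiers are illustrative assumptions.

# Batch layer: totals previously computed over historical data, e.g. by a nightly job.
batch_totals = {"store_001": 125_400.00, "store_002": 98_750.50}

def merge_with_stream(batch, new_transactions):
    """Combine batch-processed summaries with transactions as they arrive."""
    merged = dict(batch)
    for txn in new_transactions:
        merged[txn["store"]] = merged.get(txn["store"], 0.0) + txn["amount"]
    return merged

# Speed layer: transactions that arrived since the last batch run.
incoming = [
    {"store": "store_001", "amount": 42.10},
    {"store": "store_003", "amount": 15.75},
]
print(merge_with_stream(batch_totals, incoming))  # near-realtime view of the totals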

Distinct Performance Challenges


Due to the volumes of data that some Big Data solutions are required to process, performance is often a
concern. For example, large datasets coupled with complex search algorithms can lead to long query times.
Another performance challenge is related to network bandwidth. With increasing data volumes, the time to
transfer a unit of data can exceed its actual data processing time, as shown in Figure 3.4.

Figure 3.4 Transferring 1 PB of data via a 1-Gigabit LAN connection at 80% throughput will take
approximately 2,750 hours.
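
The figure's estimate can be reproduced with simple arithmetic. The short sketch below assumes 1 PB equals 10^15 bytes and an effective throughput of 80% of a 1 Gbit/s link.

data_bytes = 1e15                      # 1 PB, assuming decimal (10^15) bytes
effective_bits_per_s = 0.80 * 1e9      # 80% of a 1 Gigabit LAN connection

transfer_seconds = (data_bytes * 8) / effective_bits_per_s
transfer_hours = transfer_seconds / 3600
print(f"{transfer_hours:,.0f} hours")  # about 2,778 hours, in line with Figure 3.4's estimate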

Distinct Governance Requirements


Big Data solutions access data and generate data, all of which become assets of the business. A governance
framework is required to ensure that the data and the solution environment itself are regulated, standardized
and evolved in a controlled manner.
Examples of what a Big Data governance framework can encompass include:
• standardization of how data is tagged and the metadata used for tagging
• policies that regulate the kind of external data that may be acquired
• policies regarding the management of data privacy and data anonymization
• policies for the archiving of data sources and analysis results
• policies that establish guidelines for data cleansing and filtering
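
Parts of such a framework are often expressed in machine-readable form so that tooling can help enforce them. The structure below is an assumed, illustrative sketch of how the example policies above might be captured; it is not a standard policy format.

# Illustrative governance policy definitions (assumed structure and values).
governance_policies = {
    "metadata_tagging": {
        "required_tags": ["source", "acquired_on", "sensitivity", "retention_class"],
        "tag_format": "lower_snake_case",
    },
    "external_data": {
        "approved_provider_types": ["government_open_data", "licensed_data_market"],
        "requires_legal_review": True,
    },
    "privacy": {
        "fields_to_anonymize": ["name", "email", "gps_trace"],
    },
    "archiving": {
        "raw_source_retention_days": 730,
        "analysis_results_retention_days": 1825,
    },
    "cleansing": {
        "reject_record_if_missing": ["record_id"],
        "max_null_ratio": 0.2,
    },
}
print(sorted(governance_policies))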
Distinct Methodology
A methodology will be required to control how data flows into and out of Big Data solutions. It will need to
consider how feedback loops can be established to enable the processed data to undergo repeated refinement,
as shown in Figure 3.5. For example, an iterative approach may be used to enable business personnel to
provide IT personnel withfeedback on a periodic basis. Each feedback cycle provides opportunities for system
refinement by modifying data preparation or data analysis steps.
Figure 3.5 Each repetition can help fine-tune processing steps, algorithms and data models to improve the
accuracy of results and deliver greater value to the business.

Clouds
As mentioned in Chapter 2, clouds provide remote environments that can host IT infrastructure for large-scale
storage and processing, among other things. Regardless of whether an organization is already cloud-enabled,
the adoption of a Big Data environment may necessitate that some or all of that environment be hosted within a
cloud. For example, an enterprise that runs its CRM system in a cloud decides to add a Big Data solution in
the same cloud environment in order to run analytics on its CRM data. This data can then be shared with its
primary Big Data environment that resides within the enterprise boundaries.
Common justifications for incorporating a cloud environment in support of a Big Data solution include:
• inadequate in-house hardware resources
• upfront capital investment for system procurement is not available
• the project is to be isolated from the rest of the business so that existing business processes are not
impacted
• the Big Data initiative is a proof of concept
• datasets that need to be processed are already cloud resident
• the limits of available computing and storage resources used by an in-house Big Data solution are being
reached
Big Data Analytics Lifecycle
Big Data analysis differs from traditional data analysis primarily due to the volume, velocity and variety
characteristics of the data being processed. To address the distinct requirements for performing analysis on
Big Data, a step-by-step methodology is needed to organize the activities and tasks involved with acquiring,
processing, analyzing and repurposing data. The upcoming sections explore a specific data analytics lifecycle
that organizes and manages the tasks and activities associated with the analysis of Big Data. From a Big Data
adoption and planning perspective, it is important that in addition to the lifecycle, consideration be made for
issues of training, education, tooling and staffing of a data analytics team.
The Big Data analytics lifecycle can be divided into the following nine stages, as shown in Figure 3.6:
1. Business Case Evaluation
2. Data Identification
3. Data Acquisition & Filtering
4. Data Extraction
5. Data Validation & Cleansing
6. Data Aggregation & Representation
7. Data Analysis
8. Data Visualization
9. Utilization of Analysis Results

Figure 3.6 The nine stages of the Big Data Analytics Lifecycle.
Business Case Evaluation
Each Big Data analytics lifecycle must begin with a well-defined business case that presents a clear
understanding of the justification, motivation and goals of carrying out the analysis. The Business Case
Evaluation stage shown in Figure 3.7 requires that a business case be created, assessed and approved prior to
proceeding with the actual hands-on analysis tasks.

Figure 3.7 Stage 1 of the Big Data Analytics Lifecycle.


An evaluation of a Big Data analytics business case helps decision-makers understand the business resources
that will need to be utilized and which business challenges the analysis will tackle. The further identification of
KPIs during this stage can help determine assessment criteria and guidance for the evaluation of the analytic
results. If KPIs are not readily available, efforts should be made to make the goals of the analysis project
SMART, which stands for specific, measurable, attainable, relevant and timely.
Based on business requirements that are documented in the business case, it can be determined whether the
business problems being addressed are really Big Data problems. In order to qualify as a Big Data problem, a
business problem needs to be directly related to one or more of the Big Data characteristics of volume, velocity,
or variety.
Note also that another outcome of this stage is the determination of the underlying budget required to carry out
the analysis project. Any required purchase, such as tools, hardware and training, must be understood in
advance so that the anticipated investment can be weighed against the expected benefits of achieving the
goals. Initial iterations of the Big Data analytics lifecycle will require more up-front investment in Big Data technologies, products and training compared to later iterations, where these earlier investments can be
repeatedly leveraged.

Data Identification
The Data Identification stage shown in Figure 3.8 is dedicated to identifying the datasets required for the
analysis project and their sources.

Figure 3.8 Data Identification is stage 2 of the Big Data Analytics Lifecycle.
Identifying a wider variety of data sources may increase the probability of finding hidden patterns and
correlations. For example, to provide insight, it can be beneficial to identify as many types of related data
sources as possible, especially when it is unclear exactly what to look for.
Depending on the business scope of the analysis project and nature of the business problems being addressed,
the required datasets and their sources can be internal and/or external to the enterprise.

In the case of internal datasets, a list of available datasets from internal sources, such as data marts and operational systems, is typically compiled and matched against a pre-defined dataset specification.
In the case of external datasets, a list of possible third-party data providers, such as data markets and publicly available datasets, is compiled. Some forms of external data may be embedded within blogs or other types of
content-based web sites, in which case they may need to be harvested via automated tools.

Data Acquisition and Filtering


During the Data Acquisition and Filtering stage, shown in Figure 3.9, the data is gathered from all of the data
sources that were identified during the previous stage. The acquired data is then subjected to automated
filtering for the removal of corrupt data or data that has been deemed to have no value to the analysis
objectives.

Figure 3.9 Stage 3 of the Big Data Analytics Lifecycle.


Depending on the type of data source, data may come as a collection of files, such as data purchased from a
third-party data provider, or may require API integration, such as with Twitter. In many cases, especially
where external, unstructured data is concerned, some or most of the acquired data may be irrelevant (noise) and
can be discarded as part of the filtering process.
Data classified as "corrupt" can include records with missing or nonsensical values or invalid data types. Data
that is filtered out for one analysis may possibly be valuable for a different type of analysis. Therefore, it is
advisable to store a verbatim copy of the original dataset before proceeding with the filtering. To minimize the
required storage space, the verbatim copy can be compressed.
Both internal and external data needs to be persisted once it gets generated or enters the enterprise boundary.
For batch analytics, this data is persisted to disk prior to analysis. In the case of realtime analytics, the data is
analyzed first and then persisted to disk.
As evidenced in Figure 3.10, metadata can be added via automation to data from both internal and external
data sources to improve the classification and querying. Examples of appended metadata include dataset size
and structure, source information, date and time of creation or collection and language-specific information. It
is vital that metadata be machine-readable and passed forward along subsequent analysis stages. This helps
maintain data provenance throughout the Big Data analytics lifecycle, which helps to establish and preserve
data accuracy and quality.

Figure 3.10 Metadata is added to data from internal and external sources.
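
A minimal sketch of the acquisition-stage activities described above: a compressed verbatim copy of the original data is kept, corrupt records are filtered out, and machine-readable metadata is appended. The field names, validity rules and metadata keys are illustrative assumptions.

import gzip
import json
from datetime import datetime, timezone

raw_records = [
    {"id": "A1", "temp_c": 21.4, "ts": "2024-03-01T10:00:00Z"},
    {"id": "A2", "temp_c": None, "ts": "2024-03-01T10:05:00Z"},   # missing value: corrupt
    {"id": "A3", "temp_c": 9999, "ts": "2024-03-01T10:10:00Z"},   # nonsensical value: corrupt
]

# Store a compressed verbatim copy of the original dataset before filtering.
with gzip.open("raw_batch.json.gz", "wt") as f:
    json.dump(raw_records, f)

def is_valid(record):
    return record["temp_c"] is not None and -60 <= record["temp_c"] <= 60

filtered = [r for r in raw_records if is_valid(r)]

# Append machine-readable metadata so provenance can be maintained downstream.
batch = {
    "metadata": {
        "source": "external sensor feed",
        "acquired_at": datetime.now(timezone.utc).isoformat(),
        "record_count": len(filtered),
        "records_discarded": len(raw_records) - len(filtered),
    },
    "records": filtered,
}
print(batch["metadata"])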

Data Extraction
Some of the data identified as input for the analysis may arrive in a format incompatible with the Big Data
solution. The need to address disparate types of data is more likely with data from external sources. The Data
Extraction lifecycle stage, shown in Figure 3.11, is dedicated to extracting disparate data and transforming it
into a format that the underlying Big Data solution can use for the purpose of the data analysis.
Figure 3.11 Stage 4 of the Big Data Analytics Lifecycle.
The extent of extraction and transformation required depends on the types of analytics and capabilities of the
Big Data solution. For example, extracting the required fields from delimited textual data, such as with
webserver log files, may not be necessary if the underlying Big Data solution can already directly process
those files.
Similarly, extracting text for text analytics, which requires scans of whole documents, is simplified if the
underlying Big Data solution can directly read the document in its native format.
Figure 3.12 illustrates the extraction of comments and a user ID embedded within an XML document without
the need for further transformation.

Figure 3.12 Comments and user IDs are extracted from an XML document.
Figure 3.13 demonstrates the extraction of the latitude and longitude coordinates of a user from a single JSON
field.

Figure 3.13 The user ID and coordinates of a user are extracted from a single JSON field.
Further transformation is needed in order to separate the data into two separate fields as required by the Big
Data solution.
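
A minimal sketch of the extraction and transformation illustrated in Figure 3.13: a user ID and a coordinate pair embedded in a single JSON field are separated into the individual fields required downstream. The sample record layout is an assumption made for illustration.

import json

# Assumed sample record with coordinates embedded in a single field.
raw = '{"user_id": "u-1001", "location": "40.7415,-73.9897"}'

record = json.loads(raw)
lat_text, lon_text = record["location"].split(",")

# Transform the single field into the two separate fields required by the Big Data solution.
extracted = {
    "user_id": record["user_id"],
    "latitude": float(lat_text),
    "longitude": float(lon_text),
}
print(extracted)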

Data Validation and Cleansing


Invalid data can skew and falsify analysis results. Unlike traditional enterprise data, where the data structure is
pre-defined and data is pre-validated, data input into Big Data analyses can be unstructured without any
indication of validity. Its complexity can further make it difficult to arrive at a set of suitable validation
constraints.
The Data Validation and Cleansing stage shown in Figure 3.14 is dedicated to establishing often complex
validation rules and removing any known invalid data.
Figure 3.14 Stage 5 of the Big Data Analytics Lifecycle.
Big Data solutions often receive redundant data across different datasets. This redundancy can be exploited to
explore interconnected datasets in order to assemble validation parameters and fill in missing valid data.
For example, as illustrated in Figure 3.15:

• The first value in Dataset B is validated against its corresponding value in Dataset A.
• The second value in Dataset B is not validated against its corresponding value in Dataset A.
• If a value is missing, it is inserted from Dataset A.
Figure 3.15 Data validation can be used to examine interconnected datasets in order to fill in missing valid
data.
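
A minimal sketch, in the spirit of Figure 3.15, of exploiting a redundant, interconnected dataset to validate values and fill in missing valid data. The record layout and field names are illustrative assumptions.

# Dataset A is treated as the reference copy of the redundant field.
dataset_a = {"cust-1": {"postcode": "1050"}, "cust-2": {"postcode": "2100"}}
dataset_b = {"cust-1": {"postcode": "1050"}, "cust-2": {"postcode": None}}

def validate_and_fill(reference, target):
    """Check target values against the reference dataset and fill in missing ones."""
    report = {}
    for key, ref_rec in reference.items():
        tgt_rec = target.setdefault(key, {})
        if tgt_rec.get("postcode") is None:
            tgt_rec["postcode"] = ref_rec["postcode"]   # fill in missing valid data
            report[key] = "filled from reference"
        elif tgt_rec["postcode"] == ref_rec["postcode"]:
            report[key] = "validated"
        else:
            report[key] = "mismatch, flagged for review"
    return report

print(validate_and_fill(dataset_a, dataset_b))
print(dataset_b)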
For batch analytics, data validation and cleansing can be achieved via an offline ETL operation. For realtime
analytics, a more complex in-memory system is required to validate and cleanse the data as it arrives from the
source. Provenance can play an important role in determining the accuracy and quality of questionable data.
Data that appears to be invalid may still be valuable in that it may possess hidden patterns and trends, as
shown in Figure 3.16.

Figure 3.16 The presence of invalid data is resulting in spikes. Although the data appears abnormal, it
may be indicative of a new pattern.

Data Aggregation and Representation


Data may be spread across multiple datasets, requiring that datasets be joined together via common fields, for
example date or ID. In other cases, the same data fields may appear in multiple datasets, such as date of birth.
Either way, a method of data reconciliation is required or the dataset representing the correct value needs to be
determined.
The Data Aggregation and Representation stage, shown in Figure 3.17, is dedicated to integrating multiple
datasets together to arrive at a unified view.
Figure 3.17 Stage 6 of the Big Data Analytics Lifecycle.

Performing this stage can become complicated because of differences in:


• Data Structure – Although the data format may be the same, the data model may be different.
• Semantics – A value that is labeled differently in two different datasets may mean the same thing, for
example "surname" and "last name."
The large volumes processed by Big Data solutions can make data aggregation a time and effort-intensive
operation. Reconciling these differences can require complex logic that is executed automatically without the
need for human intervention.
Future data analysis requirements need to be considered during this stage to help foster data reusability.
Whether data aggregation is required or not, it is important to understand that the same data can be stored in
many different forms. One form may be better suited for a particular type of analysis than another. For
example, data stored as a BLOB would be of little use if the analysis requires access to individual data fields.
A data structure standardized by the Big Data solution can act as a common denominator that can be used for a
range of analysis techniques and projects. This can require establishing a central, standard analysis repository,
such as a NoSQL database, as shown in Figure 3.18.

Figure 3.18 A simple example of data aggregation where two datasets are aggregated together using the Id
field.
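
A minimal sketch of the aggregation shown in Figure 3.18, joining two datasets on a shared Id field to arrive at a unified view. The field names and values are illustrative assumptions.

customers = [
    {"id": 101, "name": "A. Reyes", "segment": "retail"},
    {"id": 102, "name": "B. Osei", "segment": "wholesale"},
]
orders = [
    {"id": 101, "order_total": 250.00},
    {"id": 102, "order_total": 1830.00},
]

# Index one dataset by the common Id field, then merge record by record.
orders_by_id = {order["id"]: order for order in orders}
unified = [{**customer, **orders_by_id.get(customer["id"], {})} for customer in customers]
print(unified)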
Figure 3.19 shows the same piece of data stored in two different formats. Dataset A contains the desired piece
of data, but it is part of a BLOB that is not readily accessible for querying. Dataset B contains the same piece of
data organized in column-based storage, enabling each field to be queried individually.

Figure 3.19 Dataset A and B can be combined to create a standardized data structure with a Big Data
solution.
Data Analysis
The Data Analysis stage shown in Figure 3.20 is dedicated to carrying out the actual analysis task, which
typically involves one or more types of analytics. This stage can be iterative in nature, especially if the data
analysis is exploratory, in which case analysis is repeated until the appropriate pattern or correlation is
uncovered. The exploratory analysis approach will be explained shortly, along with confirmatory analysis.

Figure 3.20 Stage 7 of the Big Data Analytics Lifecycle.


Depending on the type of analytic result required, this stage can be as simple as querying a dataset to compute
an aggregation for comparison. On the other hand, it can be as challenging as combining data mining and
complex statistical analysis techniques to discover patterns and anomalies or to generate a statistical or
mathematical model to depict relationships between variables.
Data analysis can be classified as confirmatory analysis or exploratory analysis, the latter of which is linked to
data mining, as shown in Figure 3.21.

Figure 3.21 Data analysis can be carried out as confirmatory or exploratory analysis.
Confirmatory data analysis is a deductive approach where the cause of the phenomenon being investigated is
proposed beforehand. The proposed cause or assumption is called a hypothesis. The data is then analyzed to
prove or disprove the hypothesis and provide definitive answers to specific questions. Data sampling
techniques are typically used. Unexpected findings or anomalies are usually ignored since a predetermined
cause was assumed.
Exploratory data analysis is an inductive approach that is closely associated with data mining. No hypothesis
or predetermined assumptions are generated. Instead, the data is explored through analysis to develop an
understanding of the cause of the phenomenon. Although it may not provide definitive answers, this method
provides a general direction that can facilitate the discovery of patterns or anomalies.
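
A minimal sketch contrasting the two approaches on a small, assumed sample: the confirmatory path evaluates a hypothesis stated up front (a real confirmatory analysis would apply a formal statistical test rather than a bare comparison of means), while the exploratory path scans the data for relationships without a predetermined assumption. The statistics.correlation function used here requires Python 3.10 or later.

from statistics import mean, correlation

# Confirmatory: the hypothesis "the campaign increased average daily sales" is proposed
# beforehand, and the data is used only to support or reject it.
sales_before = [120, 115, 130, 118, 125]
sales_after = [140, 138, 150, 132, 145]
print("hypothesis supported:", mean(sales_after) > mean(sales_before))

# Exploratory: no hypothesis; scan variable pairs for notable correlations.
metrics = {
    "sales": [120, 115, 130, 118, 125, 140, 138, 150, 132, 145],
    "web_visits": [900, 870, 950, 880, 920, 1100, 1080, 1180, 1020, 1120],
    "temperature": [12, 14, 11, 13, 15, 12, 16, 14, 13, 15],
}
for name in ("web_visits", "temperature"):
    r = correlation(metrics["sales"], metrics[name])
    print(f"sales vs {name}: r = {r:.2f}")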

Data Visualization
The ability to analyze massive amounts of data and find useful insights carries little value if the only ones that
can interpret the results are the analysts.
The Data Visualization stage, shown in Figure 3.22, is dedicated to using data visualization techniques and
tools to graphically communicate the analysis results for effective interpretation by business users.
Figure 3.22 Stage 8 of the Big Data Analytics Lifecycle.
Business users need to be able to understand the results in order to obtain value from the analysis and
subsequently have the ability to provide feedback, as indicated by the dashed line leading from stage 8 back to
stage 7.
The results of completing the Data Visualization stage provide users with the ability to perform visual
analysis, allowing for the discovery of answers to questions that users have not yet even formulated.
The same results may be presented in a number of different ways, which can influence the interpretation of the
results. Consequently, it is important to use the most suitable visualization technique by keeping the business
domain in context.
Another aspect to keep in mind is that providing a method of drilling down to comparatively simple statistics
is crucial, in order for users to understand how the rolled-up or aggregated results were generated.

Utilization of Analysis Results


Subsequent to analysis results being made available to business users to support business decision-making,
such as via dashboards, there may be further opportunities to utilize the analysis results. The Utilization of
Analysis Results stage, shown in Figure 3.23, is dedicated to determining how and where processed analysis
data can be further leveraged.

Figure 3.23 Stage 9 of the Big Data Analytics lifecycle.


Depending on the nature of the analysis problems being addressed, it is possible for the analysis results to
produce "models" that encapsulate new insights and understandings about the nature of the patterns and
relationships that exist within the data that was analyzed. A model may look like a mathematical equation or a
set of rules. Models can be used to improve business process logic and application system logic, and they can
form the basis of a new system or software program.
Common areas that are explored during this stage include the following:

• Input for Enterprise Systems – The data analysis results may be automatically or manually fed directly
into enterprise systems to enhance and optimize their behaviors and performance. For example, an
online store can be fed processed customer-related analysis results that may impact how it generates
product recommendations. New models may be used to improve the programming logic within existing
enterprise systems or may form the basis of new systems.
• Business Process Optimization – The identified patterns, correlations and anomalies discovered during
the data analysis are used to refine business processes. An example is consolidating transportation
routes as part of a supply chain process. Models may also lead to opportunities to improve business
process logic.
• Alerts – Data analysis results can be used as input for existing alerts or may form the basis of new alerts.
For example, alerts may be created to inform users via email or SMS text about an event that requires
them to take corrective action.
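
A minimal sketch of how analysis results might feed a simple alert rule, as in the last bullet above. The metric, threshold and notification mechanism are illustrative assumptions.

def check_alerts(analysis_results, threshold=0.05):
    """Return alert messages for products whose predicted defect rate exceeds the threshold."""
    alerts = []
    for product, defect_rate in analysis_results.items():
        if defect_rate > threshold:
            alerts.append(
                f"ALERT: {product} predicted defect rate {defect_rate:.1%} exceeds {threshold:.0%}"
            )
    return alerts

# Results produced by the Data Analysis stage (assumed values).
results = {"pump-a": 0.02, "valve-b": 0.08}
for message in check_alerts(results):
    print(message)   # in practice this would be dispatched via email or SMS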
