BDA-Lec4
01. Detailed Big Data Analytics Lifecycle with Case Study (cont.)
02. Pipeline
● The CEO and the directors are eager to see Big Data in action.
○ In response, the IT team, in partnership with the business
personnel, takes on ETI’s first Big Data project.
● The team then follows the step-by-step approach set forth by the Big
Data Analytics Lifecycle in pursuit of this objective.
Stage 5: Data Validation and Cleansing
● This stage is dedicated to establishing often complex
validation rules and removing any known
invalid data.
● Invalid data can skew and falsify analysis
results.
● Unlike traditional enterprise data (databases),
where the data structure is pre-defined
and the data is pre-validated, data input into
Big Data analyses can be unstructured,
without any indication of validity.
○ Its complexity can further make it
difficult to arrive at a set of suitable
validation constraints.
Stage 5: Data Validation and Cleansing
● Big Data solutions often receive redundant data
across different datasets.
○ This redundancy can be exploited to
explore interconnected datasets in order to
assemble validation parameters and fill in
missing valid data.
○ For example, as illustrated in this Figure (and in the sketch below):
■ The first value in Dataset B is validated against its
corresponding value in Dataset A.
■ The second value in Dataset B is not validated against its
corresponding value in Dataset A.
■ If a value is missing in Dataset B, it is inserted from
Dataset A.
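To make the idea concrete, here is a minimal sketch of redundancy-based validation and fill-in using pandas; the toy datasets and the "id" and "region" fields are assumptions for illustration, not values from the figure or the ETI case.

```python
# Minimal sketch: validate Dataset B against Dataset A and fill missing values.
# (Toy data; column names "id" and "region" are assumptions for illustration.)
import pandas as pd

dataset_a = pd.DataFrame({"id": [1, 2, 3], "region": ["North", "South", "East"]})
dataset_b = pd.DataFrame({"id": [1, 2, 3], "region": ["North", "West", None]})

merged = dataset_b.merge(dataset_a, on="id", suffixes=("_b", "_a"))

# Validate each value in Dataset B against its counterpart in Dataset A.
merged["valid"] = merged["region_b"] == merged["region_a"]

# If a value is missing in Dataset B, insert it from Dataset A.
merged["region_clean"] = merged["region_b"].fillna(merged["region_a"])
print(merged[["id", "region_b", "region_a", "valid", "region_clean"]])
```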
Stage 5: Data Validation and Cleansing
● For batch analytics,
○ data validation and cleansing can be
achieved via an offline ETL operation.
● For real-time analytics,
○ a more complex in-memory system is
required to validate and cleanse the
data as it arrives from the source.
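As a rough illustration of this contrast, the sketch below places an offline batch cleansing step next to the kind of per-record check an in-memory streaming system would apply; the file layout, field names and rules are assumptions, not ETI's pipeline.

```python
# Batch: cleanse a whole file offline (ETL-style). Streaming: validate each
# record in memory as it arrives. Field names and rules are illustrative.
import pandas as pd

def batch_cleanse(path: str) -> pd.DataFrame:
    """Offline pass over a complete dataset: drop rows that fail validation."""
    df = pd.read_csv(path)
    return df.dropna(subset=["policy_id"]).query("claim_value > 0")

def validate_record(record: dict) -> bool:
    """Per-record check applied as each event arrives from the source."""
    return record.get("policy_id") is not None and record.get("claim_value", 0) > 0
```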
Stage 6: Data Aggregation and Representation
● The large volumes processed by Big Data solutions can make data
aggregation a time- and effort-intensive operation.
Figure: a simple example of data aggregation where two datasets are aggregated together
using the Id field.
● This Figure shows the same piece of data stored in two different formats.
○ Dataset A contains the desired piece of data, but it is part of a BLOB (Binary
Large Object) that is not readily accessible for querying.
○ Dataset B contains the same piece of data organized in column-based
storage, enabling each field to be queried individually.
Datasets A and B can be combined to create a standardized data structure with a Big Data
solution, as in the sketch below.
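A minimal sketch of this combination with pandas, assuming the BLOB holds JSON; the field names and values are illustrative only.

```python
# Parse Dataset A's BLOB into individually queryable fields, then aggregate it
# with the column-based Dataset B using the Id field. (Toy data for illustration.)
import json
import pandas as pd

dataset_a = pd.DataFrame({
    "id": [1, 2],
    "blob": ['{"claim_value": 1200, "status": "open"}',
             '{"claim_value": 450, "status": "closed"}'],
})
dataset_b = pd.DataFrame({"id": [1, 2], "customer_age": [34, 57]})

# Expand the BLOB so each field can be queried individually.
parsed = dataset_a["blob"].apply(json.loads).apply(pd.Series)
dataset_a_columns = pd.concat([dataset_a[["id"]], parsed], axis=1)

# Aggregate the two datasets into one standardized structure via the Id field.
combined = dataset_a_columns.merge(dataset_b, on="id")
print(combined)
```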
Stage 6: Data Aggregation and Representation (Case Study)
● For meaningful analysis of data,
○ it is decided to join together policy data, claim data and call center
agent notes in a single dataset that is tabular in nature where each
field can be referenced via a data query.
○ It is thought that this will not only help with the current data analysis
task of detecting fraudulent claims but will also help with other data
analysis tasks, such as risk evaluation and speedy settlement of
claims.
○ The resulting dataset is stored in a NoSQL database.
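A minimal sketch of such a join, assuming pandas and (purely for illustration) MongoDB via pymongo as the NoSQL store; the file names, key columns and collection names are assumptions, not details from the case study.

```python
# Join policy data, claim data and call center agent notes into one tabular
# dataset, then store it in a NoSQL database. (Names below are illustrative.)
import pandas as pd
from pymongo import MongoClient

policies = pd.read_csv("policies.csv")        # assumed key: policy_id
claims = pd.read_csv("claims.csv")            # assumed keys: claim_id, policy_id
agent_notes = pd.read_csv("agent_notes.csv")  # assumed key: claim_id

# Single tabular dataset in which every field can be referenced via a query.
combined = (claims
            .merge(policies, on="policy_id", how="left")
            .merge(agent_notes, on="claim_id", how="left"))

# Persist the result to a NoSQL store (MongoDB chosen here only as an example).
client = MongoClient("mongodb://localhost:27017")
client["eti"]["claims_combined"].insert_many(combined.to_dict("records"))
```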
Stage 7: Data Analysis
● The Data Analysis stage is dedicated to carrying out the
actual analysis task, which typically involves one or more
types of analytics.
● This stage can be iterative in nature,
○ especially if the data analysis is predictive analytics,
in which case analysis is repeated until the
appropriate pattern or correlation is uncovered.
● Depending on the type of analytic result required,
○ this stage can be as simple as querying a dataset to
compute an aggregation for comparison.
○ On the other hand, it can be as challenging as
combining data mining and complex statistical
analysis techniques to discover patterns and
anomalies or to generate a statistical or
mathematical model to depict relationships between
variables.
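At the simple end of that spectrum, the sketch below queries a toy dataset to compute an aggregation for comparison; the columns and values are invented for illustration.

```python
# Querying a dataset to compute an aggregation for comparison.
import pandas as pd

claims = pd.DataFrame({
    "claim_type":  ["auto", "auto", "home", "home"],
    "claim_value": [1200, 800, 5000, 700],
})

# Average claim value per claim type, e.g. to compare segments side by side.
print(claims.groupby("claim_type")["claim_value"].mean())
```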
Stage 7: Data Analysis
● Before making inferences from data, it is essential to
examine all of your variables. Why?
○ By visualizing the data, we gain clearer knowledge about the data.
Stage 7: Data Analysis (Case Study)
● The IT team applies Exploratory Data Analysis (EDA) as follows:
○ By visualizing the data, the team gains clearer knowledge about the
data, for example through categorical data plotting (see the sketch below).
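A minimal EDA sketch of categorical data plotting with pandas and matplotlib; the column and its values are assumptions, not the team's actual data.

```python
# Categorical data plotting: frequency of each category as a bar chart.
import pandas as pd
import matplotlib.pyplot as plt

claims = pd.DataFrame({
    "claim_status": ["approved", "rejected", "approved", "pending", "approved"],
})

claims["claim_status"].value_counts().plot(kind="bar")
plt.xlabel("claim_status")
plt.ylabel("count")
plt.show()
```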
Stage 7: Data Analysis (Case Study)
● The IT team involves the data analysts at this stage as it does not have the right
skillset for analyzing data in support of detecting fraudulent claims.
● In order to be able to detect fraudulent transactions,
○ first the nature of fraudulent claims needs to be analyzed in order to find
which characteristics differentiate a fraudulent claim from a legitimate claim.
○ For this, the predictive data analysis approach is taken. As part of this
analysis, a range of analysis techniques are applied.
● This stage is repeated a number of times, as the results generated after the first
pass are not conclusive enough to comprehend what makes a
fraudulent claim different from a legitimate claim.
● As part of this exercise, attributes that are less indicative of a fraudulent claim are
dropped while attributes that carry a direct relationship are kept or added.
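As one way to picture this iterative, predictive exercise, the sketch below fits a small decision tree with scikit-learn and inspects attribute importances; the features, labels, toy values and choice of model are assumptions, not the analysts' actual technique.

```python
# Fit a simple classifier and inspect which attributes are indicative of fraud.
# Attributes with low importance are candidates to drop in the next iteration.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

claims = pd.DataFrame({
    "customer_age":     [25, 40, 33, 58, 47, 29],
    "policy_age_years": [1, 10, 2, 15, 7, 1],
    "num_prior_claims": [4, 0, 3, 1, 0, 5],
    "is_fraudulent":    [1, 0, 1, 0, 0, 1],
})

X = claims.drop(columns="is_fraudulent")
y = claims["is_fraudulent"]

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(dict(zip(X.columns, model.feature_importances_)))
```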
Stage 8: Data Visualization
● The ability to analyze massive amounts of data and
find useful insights carries little value if the only
ones who can interpret the results are the analysts.
● The Data Visualization stage is dedicated to using
data visualization techniques and tools to
graphically communicate the analysis results for
effective interpretation by business users.
● The results of completing the Data Visualization
stage provide users with the ability to perform
visual analysis, allowing for the discovery of
answers to questions that users have not yet even
formulated.
Stage 8: Data Visualization (Case Study)
● The team has discovered some interesting findings and now needs to
convey the results to the actuaries, underwriters and claim adjusters.
● Different visualization methods are used including bar and line graphs
and scatter plots.
○ Scatter plots are used to analyze groups of fraudulent and
legitimate claims in the light of different factors, such as customer
age, age of policy, number of claims made and value of claim.
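A minimal sketch of one such scatter plot with matplotlib; the column names and data points are invented for illustration.

```python
# Scatter plot of claims by customer age and value of claim, with fraudulent
# and legitimate claims plotted as separate groups. (Toy data.)
import pandas as pd
import matplotlib.pyplot as plt

claims = pd.DataFrame({
    "customer_age":  [25, 40, 33, 58, 47, 29],
    "claim_value":   [9000, 1200, 7500, 800, 1500, 8800],
    "is_fraudulent": [1, 0, 1, 0, 0, 1],
})

for label, name, marker in [(0, "legitimate", "o"), (1, "fraudulent", "x")]:
    subset = claims[claims["is_fraudulent"] == label]
    plt.scatter(subset["customer_age"], subset["claim_value"],
                marker=marker, label=name)

plt.xlabel("customer age")
plt.ylabel("value of claim")
plt.legend()
plt.show()
```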
Stage 9: Utilization of Analysis Results
● Subsequent to analysis results being made available to
business users to support business decision-making, such
as via dashboards, there may be further opportunities to
utilize the analysis results.
● This stage is dedicated to determining how and where
processed analysis data can be further leveraged.
● Depending on the nature of the analysis problems being
addressed, it is possible for the analysis results to produce
“models” that encapsulate new insights and understandings
about the nature of the patterns and relationships that exist
within the data that was analyzed.
○ A model may look like a mathematical equation or a set of rules.
○ Models can be used to improve business process logic and
application system logic, and they can form the basis of a new
system or software program.
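To illustrate how a "model as a set of rules" might be embedded in application system logic, here is a hedged sketch; the thresholds and field names are invented, not rules produced by the case study.

```python
# A rule-style model embedded in claims-settlement application logic.
# The rule below is purely illustrative.
def flag_for_fraud_review(claim: dict) -> bool:
    """Return True if the claim should be routed for manual fraud review."""
    return (claim.get("num_prior_claims", 0) >= 3
            and claim.get("claim_value", 0) > 5000)

# Example use inside business process logic:
print(flag_for_fraud_review({"num_prior_claims": 4, "claim_value": 9000}))  # True
```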
Stage 9: Utilization of Analysis Results (Case Study)
● Based on the data analysis results, the underwriting and the claims
settlement users have now developed an understanding of the nature of
fraudulent claims.
https://medium.com/@HassanFaheem/what-is-data-pipeline-in-big-data-6c0989cc4877
Why do we need a Pipeline?
● The main goal of using data pipelines is to simplify and automate the
process of extracting, transforming, and loading (ETL) data from various
sources into a central location for analysis.
● Data pipelines are often used in conjunction with big data technologies
like Apache Hadoop and Apache Spark, which are distributed systems for
processing large amounts of data.
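A minimal sketch of such a pipeline using PySpark; the input path, column names and output location are assumptions for illustration.

```python
# Extract raw data, transform (cleanse/reshape) it, and load it into a central
# location for analysis. Paths and columns are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-pipeline").getOrCreate()

# Extract: read raw data from a source.
raw = spark.read.csv("raw_claims.csv", header=True, inferSchema=True)

# Transform: drop invalid rows and normalize a field.
clean = (raw.dropna(subset=["policy_id"])
            .withColumn("claim_value", F.col("claim_value").cast("double")))

# Load: write the result to a central repository.
clean.write.mode("overwrite").parquet("warehouse/claims")
```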
Pipeline vs ETL
● You should think of an ETL pipeline as a subcategory of data
pipelines.
● ETL pipelines follow a specific sequence. As the abbreviation implies,
they extract data, transform data, and then load and store the data in a data
repository. Not all data pipelines need to follow this sequence.
● Finally, data pipelines as a whole do not necessarily need
to undergo data transformations, unlike ETL pipelines. In practice, though,
it is rare to see a data pipeline that does not utilize transformations to
facilitate data analysis.
Types of data pipelines
● There are several main types of data pipelines, each appropriate for
specific tasks on specific platforms.
○ Batch processing
○ Streaming data
○ Data integration pipelines
Types of data pipelines
● Batch pipeline
○ Batch pipelines process data in large, scheduled chunks via offline
operations, as with the batch analytics described in Stage 5.
Types of data pipelines
● Streaming pipeline
○ Streaming pipelines process data continuously as it arrives from the
source, supporting real-time analytics.
Types of data pipelines
● Data integration pipelines (ETL pipelines)
○ Data integration pipelines concentrate on merging data from multiple sources
into a single unified view.
○ These pipelines often involve extract, transform and load (ETL) processes that
clean, enrich, or otherwise modify raw data before storing it in a centralized
repository such as a data warehouse or data lake.
○ Data integration pipelines are essential for handling disparate systems that
generate incompatible formats or structures.
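A minimal data-integration sketch with pandas, merging two disparate source extracts into a single unified view and writing it to a central repository; the sources, columns and output path are assumptions.

```python
# Merge data from two sources with different formats into one unified view,
# then store it centrally (e.g. a data lake path). Names are illustrative.
import pandas as pd

crm = pd.read_csv("crm_customers.csv")         # e.g. customer_id, name
billing = pd.read_json("billing_export.json")  # e.g. customer_id, balance

unified = crm.merge(billing, on="customer_id", how="outer")
unified.to_parquet("data_lake/customers_unified.parquet")
```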
03. Exam-like Questions
Sample questions
● In Big Data, the term "Velocity" refers to:
• A) The number of different types of data generated