BDA-Lec4

The lecture covers the Big Data Analytics Lifecycle, focusing on a case study of ETI Insurance Company, which aims to detect fraudulent claims using Big Data solutions. It details stages such as Data Validation and Cleansing, Data Aggregation and Representation, Data Analysis, and Data Visualization, emphasizing the importance of each step in achieving accurate and actionable insights. Additionally, it introduces data pipelines and their role in automating data movement and transformation for analysis.
3rd grade

Big Data Analytics


Dr. Nesma Mahmoud
Lecture 4: More on
Big Data Analytics
What will we learn in this lecture?

01. Detailed Big Data Analytics Lifecycle with Case Study (cont.)

02. Pipeline

03. Exam-like Questions


01. Detailed Big Data Analytics Lifecycle with Case Study (cont.)
ETI Insurance Company (Case Study)
● ETI’s Big Data journey has reached the stage where its IT team
possesses the necessary skills and the management is convinced of the
potential benefits that a Big Data solution can bring in support of the
business goals.

● The CEO and the directors are eager to see Big Data in action.
○ In response to this, the IT team, in partnership with the business personnel, takes on ETI’s first Big Data project.
○ After a thorough evaluation process, the “detection of fraudulent claims” objective is chosen as the first Big Data solution.

● The team then follows a step-by-step approach as set forth by the Big
Data Analytics Lifecycle in pursuit of achieving this objective.
Stage 5: Data Validation and Cleansing
● It is dedicated to establishing often complex
validation rules and removing any known
invalid data.
● Invalid data can skew and falsify analysis
results.
● Unlike traditional enterprise data (databases), where the data structure is pre-defined and the data is pre-validated, data input into Big Data analyses can be unstructured, without any indication of validity.
○ Its complexity can further make it
difficult to arrive at a set of suitable
validation constraints.
Stage 5: Data Validation and Cleansing
● Big Data solutions often receive redundant data across different datasets.
○ This redundancy can be exploited to explore interconnected datasets in order to assemble validation parameters and fill in missing valid data.
○ For example, as illustrated in this Figure:
■ The first value in Dataset B is validated against its corresponding value in Dataset A.
■ The second value in Dataset B is not validated against its corresponding value in Dataset A.
■ If a value is missing, it is inserted from Dataset A (see the sketch below).
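The figure’s logic can be expressed concretely. Below is a minimal sketch in pandas, with hypothetical column names, of validating Dataset B against Dataset A and inserting missing values from Dataset A; it is illustrative only, not ETI’s actual implementation.

```python
# Sketch: cross-validating redundant datasets and filling gaps (hypothetical columns).
import pandas as pd

dataset_a = pd.DataFrame({"id": [1, 2, 3], "postcode": ["12345", "67890", "54321"]})
dataset_b = pd.DataFrame({"id": [1, 2, 3], "postcode": ["12345", "99999", None]})

merged = dataset_b.merge(dataset_a, on="id", suffixes=("_b", "_a"))

# Flag values in Dataset B that do not match their counterpart in Dataset A.
merged["validated"] = merged["postcode_b"] == merged["postcode_a"]

# Where Dataset B is missing a value, insert the value from Dataset A.
merged["postcode_b"] = merged["postcode_b"].fillna(merged["postcode_a"])
print(merged)
```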
Stage 5: Data Validation and Cleansing
● For batch analytics,
○ data validation and cleansing can be
achieved via an offline ETL operation.
● For real-time analytics,
○ a more complex in-memory system is
required to validate and cleanse the
data as it arrives from the source.

● Provenance can play an important role in determining the accuracy and quality of questionable data.
Stage 5: Data Validation and Cleansing (Case Study)
● To keep costs down,
○ ETI is currently using free versions of the weather and the census
datasets that are not guaranteed to be 100% accurate.
■ As a result, these datasets need to be validated and cleansed.
○ Based on the published field information,
■ the team is able to check the extracted fields for typographical errors and incorrect data, as well as perform data type and range validation.
■ A rule is established that a record will not be removed if it contains some meaningful level of information, even though some of its fields may contain invalid data (see the sketch below).
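As an illustration only, the following sketch (pandas, with hypothetical field names and ranges for the free weather dataset) shows the kind of type and range validation and the record-retention rule described above.

```python
# Sketch: range validation plus a "keep if any field is still meaningful" rule.
import pandas as pd

weather = pd.DataFrame({
    "station_id": ["S1", "S2", "S3"],
    "temperature_c": [21.5, 999.0, None],   # 999.0 is out of range
    "humidity_pct": [40, 55, 130],          # 130 is out of range
})

# Range validation: replace out-of-range readings with missing values.
weather.loc[~weather["temperature_c"].between(-60, 60), "temperature_c"] = None
weather.loc[~weather["humidity_pct"].between(0, 100), "humidity_pct"] = None

# Retention rule: drop a record only if all of its measurement fields are invalid.
weather = weather.dropna(subset=["temperature_c", "humidity_pct"], how="all")
print(weather)
```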
Stage 6: Data Aggregation and Representation
● This stage is dedicated to integrating
multiple datasets together to arrive at a
unified view.
● Data may be spread across multiple
datasets, requiring that datasets be joined
together via common fields, for example
date or ID.
○ In other cases, the same data fields
may appear in multiple datasets, such
as date of birth.
● Either way, a method of data reconciliation is required, or the dataset representing the correct value needs to be determined.
Stage 6: Data Aggregation and Representation
● Performing this stage can become complicated because of differences
in:
○ • Data Structure – Although the data format may be the same, the
data model may be different.
○ • Semantics – A value that is labeled differently in two different
datasets may mean the same thing, for example “surname” and
“last name.”

● The large volumes processed by Big Data solutions can make data
aggregation a time and effort-intensive operation.

● Reconciling these differences can require complex logic that is executed automatically, without the need for human intervention.
Stage 6: Data Aggregation and Representation
● A data structure standardized by the Big Data solution can act as a common
denominator that can be used for a range of analysis techniques and projects.
○ This can require establishing a central, standard analysis repository, such as
a NoSQL database, as shown in the following Figure.

A simple example of data aggregation where two datasets are aggregated together using
the Id field.
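A minimal sketch of the aggregation in the figure, using pandas and hypothetical columns: two datasets are joined on a shared Id field to produce a unified view.

```python
# Sketch: joining two datasets on a common Id field (hypothetical data).
import pandas as pd

policies = pd.DataFrame({"id": [101, 102], "policy_type": ["auto", "home"]})
claims = pd.DataFrame({"id": [101, 102], "claim_amount": [2500.0, 800.0]})

unified = policies.merge(claims, on="id", how="inner")
print(unified)
```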
Stage 6: Data Aggregation and Representation
● This Figure shows the same piece of data stored in two different formats.
○ Dataset A contains the desired piece of data, but it is part of a BLOB (Binary Large Object) that is not readily accessible for querying.
○ Dataset B contains the same piece of data organized in column-based
storage, enabling each field to be queried individually.

Dataset A and B can be combined to create a standardized data structure with a Big Data
solution.
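The contrast in the figure can be sketched as follows, assuming the BLOB holds JSON text (a hypothetical choice); the blob is expanded into columns so each field becomes individually queryable.

```python
# Sketch: converting a BLOB-style record into column-based storage (hypothetical data).
import json
import pandas as pd

dataset_a = pd.DataFrame({
    "id": [1, 2],
    "blob": ['{"surname": "Smith", "dob": "1980-01-01"}',
             '{"surname": "Jones", "dob": "1975-06-30"}'],
})

# Expand the JSON blob into one column per field (Dataset B style).
expanded = dataset_a["blob"].apply(json.loads).apply(pd.Series)
dataset_b = pd.concat([dataset_a[["id"]], expanded], axis=1)

# Each field can now be queried individually.
print(dataset_b[dataset_b["surname"] == "Smith"])
```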
Stage 6: Data Aggregation and Representation (Case
Study)
● For meaningful analysis of data,
○ it is decided to join together policy data, claim data and call center
agent notes in a single dataset that is tabular in nature where each
field can be referenced via a data query.
○ It is thought that this will not only help with the current data analysis
task of detecting fraudulent claims but will also help with other data
analysis tasks, such as risk evaluation and speedy settlement of
claims.
○ The resulting dataset is stored in a NoSQL database (see the sketch below).
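A minimal sketch of this case-study step, assuming a local MongoDB instance and hypothetical policy, claim and agent-note DataFrames; the exact schema and database product ETI used are not specified in the lecture.

```python
# Sketch: joining policy data, claim data and agent notes, then storing the
# result in a NoSQL (document) database.
import pandas as pd
from pymongo import MongoClient

policies = pd.DataFrame({"policy_id": [1], "customer_age": [42]})
claims = pd.DataFrame({"policy_id": [1], "claim_value": [3200.0]})
agent_notes = pd.DataFrame({"policy_id": [1], "note": ["caller unsure of incident date"]})

combined = (policies
            .merge(claims, on="policy_id", how="left")
            .merge(agent_notes, on="policy_id", how="left"))

# Each row becomes one queryable document.
client = MongoClient("mongodb://localhost:27017")
client["eti"]["claims_analysis"].insert_many(combined.to_dict("records"))
```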
Stage 7: Data Analysis
● The Data Analysis stage is dedicated to carrying out the
actual analysis task, which typically involves one or more
types of analytics.
● This stage can be iterative in nature,
○ especially if the data analysis is predictive analytics,
in which case analysis is repeated until the
appropriate pattern or correlation is uncovered.
● Depending on the type of analytic result required,
○ this stage can be as simple as querying a dataset to
compute an aggregation for comparison.
○ On the other hand, it can be as challenging as
combining data mining and complex statistical
analysis techniques to discover patterns and
anomalies or to generate a statistical or
mathematical model to depict relationships between
variables.
Stage 7: Data Analysis
● Before making inferences from data it is essential to
examine all your variables. Why?

● So, you need to go deep into the data by performing Exploratory Data Analysis (EDA) in order to produce good descriptive analytics or other types of analysis:
○ What is the maximum and minimum value?
○ How is the data distributed?
○ Are there different types of individuals represented
in the data?
Stage 7: Data Analysis (Case Study)
● The IT team applies Exploratory Data Analysis (EDA) as follows:
○ Determine the number of rows and columns.
○ Identify the data type of each column.
○ Check whether there are any missing values in the dataset.
○ Check for duplicate rows in the dataset.
○ Visualize the data to gain a clearer understanding of it (a sketch of these checks follows below).
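A minimal EDA sketch in pandas, assuming a hypothetical claims.csv file, covering the checks listed above: shape, data types, missing values, duplicate rows and summary statistics such as minimum and maximum values.

```python
# Sketch: basic exploratory data analysis checks (hypothetical input file).
import pandas as pd

claims = pd.read_csv("claims.csv")

print(claims.shape)              # number of rows and columns
print(claims.dtypes)             # data type of each column
print(claims.isnull().sum())     # missing values per column
print(claims.duplicated().sum()) # number of duplicate rows
print(claims.describe())         # min, max, mean and distribution summary
```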
Stage 7: Data Analysis (Case Study)
● The IT team applies Exploratory Data Analysis (EDA) as follows:
○ By visualizing the data, we gain a clearer understanding of it.
■ Categorical data plotting (see the sketch below)
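A minimal sketch of categorical data plotting, assuming a hypothetical claim_type column in the claims dataset; a bar chart of category counts is one common way to visualize categorical data.

```python
# Sketch: plotting the distribution of a categorical column (hypothetical data).
import pandas as pd
import matplotlib.pyplot as plt

claims = pd.read_csv("claims.csv")

claims["claim_type"].value_counts().plot(kind="bar")
plt.xlabel("Claim type")
plt.ylabel("Number of claims")
plt.title("Distribution of claims by type")
plt.show()
```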


Stage 7: Data Analysis (Case Study)
● The IT team applies Exploratory Data Analysis (EDA) as follows:
○ By visualizing the data, we gain a clearer understanding of it.
Stage 7: Data Analysis (Case Study)
● The IT team involves the data analysts at this stage as it does not have the right
skillset for analyzing data in support of detecting fraudulent claims.
● In order to be able to detect fraudulent transactions,
○ first the nature of fraudulent claims needs to be analyzed in order to find
which characteristics differentiate a fraudulent claim from a legitimate claim.
○ For this, the predictive data analysis approach is taken. As part of this
analysis, a range of analysis techniques are applied.
● This stage is repeated a number of times, as the results generated after the first pass are not conclusive enough to comprehend what makes a fraudulent claim different from a legitimate claim.
● As part of this exercise, attributes that are less indicative of a fraudulent claim are dropped, while attributes that carry a direct relationship are kept or added (a sketch of this kind of attribute filtering follows below).
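As a minimal sketch only, assuming hypothetical numeric attributes and a binary is_fraud label, attributes that correlate weakly with fraudulent claims could be identified and dropped like this; the analysts’ actual technique is not detailed in the lecture.

```python
# Sketch: dropping attributes that are less indicative of fraud (hypothetical data).
import pandas as pd

claims = pd.read_csv("claims_labelled.csv")

numeric = claims.select_dtypes(include="number")
correlations = numeric.corr()["is_fraud"].drop("is_fraud").abs()

weak_attributes = correlations[correlations < 0.05].index  # threshold is illustrative
claims = claims.drop(columns=weak_attributes)
print("Dropped:", list(weak_attributes))
```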
Stage 8: Data Visualization
● The ability to analyze massive amounts of data and
find useful insights carries little value if the only
ones that can interpret the results are the analysts.
● The Data Visualization stage is dedicated to using
data visualization techniques and tools to
graphically communicate the analysis results for
effective interpretation by business users.
● The results of completing the Data Visualization
stage provide users with the ability to perform
visual analysis, allowing for the discovery of
answers to questions that users have not yet even
formulated.
Stage 8: Data Visualization
Stage 8: Data Visualization (Case Study)
● The team has discovered some interesting findings and now needs to
convey the results to the actuaries, underwriters and claim adjusters.

● Different visualization methods are used including bar and line graphs
and scatter plots.
○ Scatter plots are used to analyze groups of fraudulent and legitimate claims in the light of different factors, such as customer age, age of policy, number of claims made and value of claim (see the sketch below).
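A minimal sketch of such a scatter plot, assuming a hypothetical claims dataset with customer_age, claim_value and a binary is_fraud column.

```python
# Sketch: scatter plot of fraudulent vs. legitimate claims (hypothetical data).
import pandas as pd
import matplotlib.pyplot as plt

claims = pd.read_csv("claims_labelled.csv")

for label, group in claims.groupby("is_fraud"):
    plt.scatter(group["customer_age"], group["claim_value"],
                label="fraudulent" if label else "legitimate", alpha=0.6)

plt.xlabel("Customer age")
plt.ylabel("Value of claim")
plt.legend()
plt.show()
```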
Stage 8: Data Visualization (Case Study)
Stage 9: Utilization of Analysis Results
● Subsequent to analysis results being made available to
business users to support business decision-making, such
as via dashboards, there may be further opportunities to
utilize the analysis results.
● This stage is dedicated to determining how and where
processed analysis data can be further leveraged.
● Depending on the nature of the analysis problems being
addressed, it is possible for the analysis results to produce
“models” that encapsulate new insights and understandings
about the nature of the patterns and relationships that exist
within the data that was analyzed.
○ A model may look like a mathematical equation or a set of rules.
○ Models can be used to improve business process logic and
application system logic, and they can form the basis of a new
system or software program.
Stage 9: Utilization of Analysis Results (Case Study)
● Based on the data analysis results, the underwriting and the claims
settlement users have now developed an understanding of the nature of
fraudulent claims.

● However, in order to realize tangible benefits from this data analysis exercise, a model based on a machine-learning technique is generated, which is then incorporated into the existing claim processing system to flag fraudulent claims (a sketch follows below).
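The lecture does not name the machine-learning technique ETI used, so the following is only a sketch with scikit-learn, a decision tree and hypothetical feature columns: a model is trained on labelled claims and then used to flag incoming claims.

```python
# Sketch: training a classifier and flagging incoming claims (hypothetical data).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

claims = pd.read_csv("claims_labelled.csv")
features = ["customer_age", "policy_age", "num_claims", "claim_value"]

X_train, X_test, y_train, y_test = train_test_split(
    claims[features], claims["is_fraud"], test_size=0.2, random_state=42)

model = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
print("Hold-out accuracy:", model.score(X_test, y_test))

# Flag new claims as part of the existing claim-processing flow.
new_claims = pd.read_csv("incoming_claims.csv")
new_claims["fraud_flag"] = model.predict(new_claims[features])
```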
02. Pipeline
What is Data Pipeline?
● A data pipeline is a set of tools and processes used to automate data movement
and transformation between a source and a target.
● Data pipelines are more commonly known as ETL pipelines or ETL workflows.
● Data pipelines usually consist of multiple components, depending on the complexity of the pipeline. A pipeline can be
○ as simple as two or three key components: a source, a processing step or steps, and a destination (as sketched below), or
○ as complex, and more commonly, as multiple sources and destinations plus processing, storage, monitoring, dataflow and workflow components.

https://medium.com/@HassanFaheem/what-is-data-pipeline-in-big-data-6c0989cc4877
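A minimal sketch of the simplest pipeline shape described above: a source, a processing step and a destination, chained so that data movement and transformation happen automatically. File names and cleaning steps are hypothetical.

```python
# Sketch: a three-component data pipeline (source -> processing -> destination).
import pandas as pd

def extract(path: str) -> pd.DataFrame:           # source
    return pd.read_csv(path)

def transform(df: pd.DataFrame) -> pd.DataFrame:  # processing step
    df = df.dropna()
    df.columns = [c.lower() for c in df.columns]
    return df

def load(df: pd.DataFrame, path: str) -> None:    # destination
    df.to_csv(path, index=False)

if __name__ == "__main__":
    load(transform(extract("raw_claims.csv")), "clean_claims.csv")
```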
Why we need a Pipeline?
● The main goal of using data pipelines is to simplify and automate the
process of extracting, transforming, and loading (ETL) data from various
sources into a central location for analysis.
● Data pipelines are often used in conjunction with big data technologies
like Apache Hadoop and Apache Spark, distributed systems for
processing large amounts of data.
Pipeline vs ETL
● You should think of an ETL pipeline as a subcategory of data pipelines.
● ETL pipelines follow a specific sequence. As the abbreviation implies,
they extract data, transform data, and then load and store data in a data
repository. Not all data pipelines need to follow this sequence.
● Finally, unlike ETL pipelines, data pipelines as a whole do not necessarily need to include data transformations, although it is rare to see a data pipeline that does not use transformations to facilitate data analysis.
Types of data pipelines
● There are several main types of data pipelines, each appropriate for
specific tasks on specific platforms.

○ Batch processing
○ Streaming data
○ Data integration pipelines
Types of data pipelines
● Batch pipeline
Types of data pipelines
● Streaming pipeline
Types of data pipelines
● Data integration pipelines (ETL pipelines)
○ Data integration pipelines concentrate on merging data from multiple sources
into a single unified view.
○ These pipelines often involve extract, transform and load (ETL) processes that
clean, enrich, or otherwise modify raw data before storing it in a centralized
repository such as a data warehouse or data lake.
○ Data integration pipelines are essential for handling disparate systems that generate incompatible formats or structures (see the sketch below).
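A minimal sketch of a data integration (ETL) pipeline, with hypothetical source files and SQLite standing in for a data warehouse: data is extracted from disparate sources, transformed into a common structure, and loaded into a central repository.

```python
# Sketch: extract from multiple sources, standardise, and load centrally.
import sqlite3
import pandas as pd

def extract() -> list:
    # Two sources with potentially incompatible structures.
    return [pd.read_csv("policies.csv"), pd.read_json("claims.json")]

def transform(frames: list) -> pd.DataFrame:
    # Standardise column names so the datasets can be merged into one view.
    frames = [f.rename(columns=str.lower) for f in frames]
    return pd.concat(frames, ignore_index=True)

def load(df: pd.DataFrame) -> None:
    with sqlite3.connect("warehouse.db") as conn:
        df.to_sql("integrated_claims", conn, if_exists="replace", index=False)

load(transform(extract()))
```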
03. Exam-like Questions
Sample questions
● In Big Data, the term "Velocity" refers to:
• A) The number of different types of data generated

• B) The speed at which data is generated and processed

• C) The trustworthiness of data sources

• D) The volume of data produced by a system

● Which of the following is an example of semi-structured data?


• a) A relational database table
• b) An XML document
• c) A video file
• d) A log file without any schema


Sample questions
● Consider the following data: "Excellent," "Good," "Average," "Poor." This
is an example of:
• A) Nominal data B) Ordinal data
• C) Interval data D) Ratio data
● A researcher wants to measure the weight of participants. Which type of
data would be most appropriate?
• A) Nominal B) Ordinal
• C) Interval D) Ratio/Continuous
Sample questions
● Which type of analytics answers the question, “What should be done?”
○ - A) Diagnostic Analytics - B) Prescriptive Analytics
○ - C) Predictive Analytics - D) Descriptive Analytics
● A clothing company discovers that its payment page is malfunctioning,
causing sales to decrease. What type of analytics is utilized to find this
issue?
● - A) Predictive Analytics B) Prescriptive Analytics
● - C) Descriptive Analytics - D) Diagnostic Analytics
Sample questions
● What is the significance of KPIs in the Business Case Evaluation stage?
• A) They help measure the success of the project.
• B) They determine the required data sources.
• C) They define the data extraction process.
• D) They help visualize the data.
Thanks!
Do you have any questions?

CREDITS: This presentation template was created by Slidesgo, and includes icons by Flaticon, and infographics & images by Freepik.
