Introductory Big Data
College of Engineering
Chapter -1-
Introductory Background
Working with Data
Learning Objectives
What is data?
• Comes from the Latin word “Datum” which means “Thing given”
• Data, in the information age, are a large set of digital bits encoding
numbers, texts, images, sounds, videos, and so on.
• Data consists of digital records
• Financial transactions
• Online trading and purchasing
• Social Network posts and interactions
• Medical images
• Sensory data
Data Sources
Different forms of data are continuously generated by every device, software application, and
electrical tool we use.
Different forms of data are generated from different
sources – e-forms produce structured, tabular digital
records.
Different forms of data are generated from different
sources – social media platforms produce unstructured
textual, image, and audio data, as well as semi-structured
platform data.
[Figure labels: Textual, Image, Interaction]
Types of Data Representations
Types of Data Representations – Structured Data
• Structured data: well organized, with well-defined attributes
• Databases, tables, lists, CSV files, etc.
• Example: patient records database, student grades database, sales database, etc.
• Can you think of more examples?
• Pros: Specific data can be retrieved easily through database operations. Usually clean and has fewer errors.
• Cons: Offers limited content and limited insights.
• In structured or tabular data:
• Rows: represent instances, also named objects; one instance per row
• Columns: represent attributes, also named features; one attribute per column
• Instances
Are examples of the concept we want to characterize
• Attributes
Are characteristics present in the instances
Example table (columns = features/attributes; rows = instances/objects/examples/data points):
Name       Age   Educational level   Company
Andrew     55    1                   Good
Bernhard   43    2                   Good
Carolina   37    5                   Bad
Dennis     82    3                   Good
Eve        23    3.2                 Bad
Fred       46    5                   Good
Gwyneth    38    4.2                 Bad
Hayden     50    4                   Bad
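The row/column view of structured data can be sketched in a few lines of Python. This is only an illustration using the standard library: the variable names (`columns`, `rows`, `instances`) are assumptions made for this sketch, and the values are copied from the example table on this slide.

```python
# Attributes (features/columns) of the structured data set.
columns = ["Name", "Age", "Educational level", "Company"]

# Instances (objects/rows): one tuple per person, copied from the slide's table.
rows = [
    ("Andrew", 55, 1.0, "Good"),
    ("Bernhard", 43, 2.0, "Good"),
    ("Carolina", 37, 5.0, "Bad"),
    ("Dennis", 82, 3.0, "Good"),
    ("Eve", 23, 3.2, "Bad"),
    ("Fred", 46, 5.0, "Good"),
    ("Gwyneth", 38, 4.2, "Bad"),
    ("Hayden", 50, 4.0, "Bad"),
]

# Pair each value with its attribute name so an instance can be read by attribute.
instances = [dict(zip(columns, row)) for row in rows]

# Retrieving specific data is easy: select the instances labelled "Good".
good_names = [inst["Name"] for inst in instances if inst["Company"] == "Good"]

# One attribute (column) across all instances.
ages = [inst["Age"] for inst in instances]

print(len(instances), "instances,", len(columns), "attributes")
print("Good:", good_names)
```

Because every instance has the same attributes, selecting rows or columns is a one-line operation, which is the main advantage listed under "Pros".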
Types of Data Representations – Unstructured Data
• Unstructured data: does not have well-defined attributes/features
• Chat messages, images, audio data, etc.
• Example: tweet text, Instagram images, music recordings, medical images, surveillance camera images, etc.
• Pros:
• Rich content
• Valuable insights
• High Availability
• Cons:
• Difficult to process.
• Need to define attributes.
• Contains noise and errors
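The point that attributes must be defined before unstructured data can be processed can be illustrated with a short Python sketch. The function name `extract_attributes`, the chosen attributes, and the sample message are all assumptions made for this illustration; real projects would choose attributes that suit their own problem.

```python
import re

def extract_attributes(text):
    """Turn a free-text message into a small set of hand-defined attributes."""
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "n_characters": len(text),                 # length of the raw text
        "n_words": len(words),                     # number of alphabetic tokens
        "hashtags": re.findall(r"#(\w+)", text),   # words that follow a '#'
        "has_question": "?" in text,               # does the message ask something?
    }

# Made-up chat message, for illustration only.
message = "Loving the new #BigData course! Anyone else taking it? #analytics"
features = extract_attributes(message)
print(features)
```

After this step the message is described by a fixed set of attributes, so it can be stored as one row of a structured table.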
Types of Data Representations – Semi-structured Data
• Semi-structured data: a combination of structured and unstructured data
• The current state of the internet
• Social media JSON objects, wikis, etc.
• Pros:
• Rich content
• Valuable insights
• High Availability
• Easier to process compared to unstructured data
• Cons:
• No standard attribute structure
• Might have a highly hierarchical structure
(See the sample JSON object on the next slide.)
Sample Twitter JSON Object
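As a rough illustration of what such an object looks like, the sketch below parses a small, made-up tweet-like JSON document in Python. The field names (`id`, `text`, `user`, `entities`, `lang`, `place`) are hypothetical and chosen only for this example; they are not claimed to match the actual Twitter schema.

```python
import json

# Hypothetical tweet-like object; field names are illustrative only.
raw = """
{
  "id": 1,
  "text": "Big data is everywhere #data",
  "user": {"name": "example_user", "followers": 120},
  "entities": {"hashtags": ["data"], "urls": []},
  "lang": "en"
}
"""

tweet = json.loads(raw)

# Top-level fields behave like structured attributes.
print(tweet["lang"])

# Nested fields show the hierarchical structure.
print(tweet["user"]["name"], tweet["entities"]["hashtags"])

# The free-text field is the unstructured part.
print(tweet["text"])

# There is no fixed attribute structure, so optional fields need a default.
location = tweet.get("place", None)
print(location)
```

The object mixes named fields (structured), nested objects (hierarchical), and free text (unstructured), which is why it counts as semi-structured.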
Data Growth
• What is the expected size of data in 2025?
• What can we do with all this data?
Source: https://www.statista.com/statistics/871513/worldwide-data-created/, accessed in February 2023
Activity 1: What can we do with data?
• Access Activity 1 Padlet and share your thoughts:
• Padlet URL: https://padlet.com/hebaismail20/x45w5khe92psiau2
What Can We Do With such Data?
• Source of information that can be transformed into new, useful, valid
and human-understandable knowledge
Data Analytics
• Definition
• The science that analyzes raw data to extract useful knowledge
(patterns and insights) from it.
• Analytics are produced using techniques from:
• Statistical methods
• Artificial intelligence models and algorithms
• Data Mining techniques
Types of Analytics
• Descriptive analytics
• Summarize or condense data to extract patterns
• The result of a given method or technique is obtained directly by applying an algorithm to the data
• Examples: relationship between height and weight, average grade in the class, students with similar study
interests, etc.
• Can you think of more examples?
• Predictive analytics
• Produce predictions based on predictive models.
• A predictive model is a generalization of the relationship between data and the desired output. It associates
the hidden relationships in data with a sought or target prediction
• Examples: predicting the possibility of a new patient getting cancer based on the genetic data history of
previous cancer patients.
• Can you think of more examples?
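The difference between the two types can be sketched with the height and weight example. The numbers below are made up for illustration, and the least-squares line is just one simple choice of predictive model assumed for this sketch; the slide does not prescribe a specific one.

```python
from statistics import mean

# Made-up (height in cm, weight in kg) observations, for illustration only.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
weights = [50.0, 58.0, 66.0, 74.0, 82.0]

# Descriptive analytics: summarize the data directly.
avg_height = mean(heights)
avg_weight = mean(weights)

# Predictive analytics: fit a simple model (a least-squares line) to the data.
sxy = sum((x - avg_height) * (y - avg_weight) for x, y in zip(heights, weights))
sxx = sum((x - avg_height) ** 2 for x in heights)
slope = sxy / sxx
intercept = avg_weight - slope * avg_height

def predict_weight(height_cm):
    """Predict the weight for an unseen height using the fitted line."""
    return intercept + slope * height_cm

print("Average height and weight:", avg_height, avg_weight)
print("Predicted weight at 175 cm:", predict_weight(175.0))
```

The averages describe the data we already have, while `predict_weight` generalizes the relationship so it can be applied to a new, unseen instance.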
Important terminologies in data analytics
• Algorithm
• A step-by-step set of instructions to solve a problem. Algorithms can be
small or large depending on the complexity of the problem.
• Method or technique
• A systematic procedure that allows us to achieve an intended goal
Data Science
• Data Science
• Data science extracts meaningful and useful knowledge from data, with the
support of suitable technologies
What about big data then?
Big Data in 5 Minutes
Characteristics of Big Data
What is Big Data?
• Big data are data sets that are too large to be managed by conventional
data-processing technologies
• This led to the development of new techniques and tools for data
storage, processing, and transmission
• Examples of such tools are MapReduce, Hadoop, and Spark
• Data science is the creation of models and methods able to extract patterns
from complex data and the use of these models in real-life problems
• For example, ChatGPT
Big data architectures
• Distributed systems
• The most popular big data processing technique using clusters of computers is
MapReduce
• Hadoop is the most famous implementation of MapReduce
• MapReduce is a programming model, or programming paradigm
• It has two steps: map and reduce
• Divide the data into small chunks and distribute them among the computers in the cluster, then
reassemble the outputs to produce the final sought outcome
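The map and reduce steps can be sketched with a word count, run here in a single Python process. This is a toy simulation under stated assumptions, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`) are hypothetical, each string in `chunks` stands in for the data sent to one computer, and the grouping step between map and reduce is written out explicitly even though a real framework performs it automatically.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    """Group all values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)

# Each chunk stands in for the part of the data sent to one computer in the cluster.
chunks = ["big data is big", "data is everywhere"]

# Map every chunk independently, then group by word and reduce each group.
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
grouped = shuffle(mapped)
word_counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(word_counts)
```

Because `map_phase` looks at only one chunk at a time, the chunks can be processed on different machines in parallel; only the grouped intermediate pairs need to be brought together for the reduce step.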
Hadoop in 5 Minutes
Data-driven Methodology
A data analytics project involves more than applying one or more specific methods or
techniques. It also involves:
• understanding the problem to be solved
• defining the objectives of the project
• looking for the necessary data
• preparing these data so that they can be used
• identifying suitable methods and choosing between them
• optimizing the outputs of each method
• analyzing and evaluating the results
• redoing the pre-processing tasks and repeating the experiments
• and so on.
• We need a data-driven methodology for the project.
Data-driven Methodology - CRISP
• The CRISP-DM methodology: CRoss-Industry Standard Process for Data Mining (CRISP-DM) is a
six-step methodology.
Data-driven Methodology - CRISP
1) Business understanding: This involves understanding the business domain, being able to define the problem from
the business domain perspective, and finally being able to translate such business problems into a data analytics
problem.
2) Data understanding: This involves collection of the necessary data and their initial visualization/summarization in
order to obtain the first insights, particularly but not exclusively, about data quality problems such as missing data
or outliers.
3) Data preparation: This involves preparing the data set for the modeling tool, and includes data transformation,
feature construction, outlier removal, filling in missing data, and removing incomplete instances.
4) Modeling: Typically there are several methods that can be used to solve the same problem in analytics, often
with specific data requirements. This implies that there may be a need for additional data preparation tasks that are
method-specific. In such case it is necessary to go back to the previous step. The modeling phase also includes
optimizing the chosen method(s).
5) Evaluation: Solving the problem from the data analytics point of view is not the end of the process. It is now
necessary to understand how its use is meaningful from the business perspective; in other words, whether the obtained
solution meets the business requirements.
6) Deployment: The integration of the data analytics solution in the business process is the main purpose of this
phase. Typically, it implies the integration of the obtained solution into a decision-support tool, website
maintenance process, reporting process or elsewhere.
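Two of the data preparation tasks in step 3, filling in missing data and removing outliers, can be sketched in Python. The values, the use of the median as the fill value, and the plausible age range (`MIN_AGE`, `MAX_AGE`) are all assumptions made for this illustration; CRISP-DM itself does not prescribe a particular technique.

```python
from statistics import median

# Made-up ages with one missing value (None) and one implausible outlier (500).
ages = [55, 43, None, 82, 23, 46, 38, 500]

# Filling in missing data: replace None with the median of the known values.
known = [a for a in ages if a is not None]
fill_value = median(known)
filled = [fill_value if a is None else a for a in ages]

# Outlier removal: keep only values inside an assumed plausible range for age.
MIN_AGE, MAX_AGE = 0, 120
prepared = [a for a in filled if MIN_AGE <= a <= MAX_AGE]

print("Fill value:", fill_value)
print("Prepared data:", prepared)
```

The median is used here rather than the mean because it is less affected by the outlier that is still present when the missing value is filled.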
Class Activity -2- Examine CRISP in a
real-life context
1. Predicting the number of airline sales in 2025
2. Understanding customer trends in the aviation industry
3. Predicting the deterioration rate for patients in the ICU
4. Predicting pipeline leakage
Recap
• Data, in the information age, are a large set of digital bits encoding
numbers, texts, images, sounds, videos, and so on.
• Data grows exponentially
• New technologies arise to help store, transfer, process, and
analyze data
• Analytics support decision making
• Algorithm, technique, methodology
• CRISP
Reading
• Textbook: Chapter -1-
• Moreira, João, André Carlos Ponce de Leon Ferreira de Carvalho, and Tomáš Horváth. A
General Introduction to Data Analytics. Wiley, 2019. ISBN: 9781119296263.