Introductory Big Data

This document provides an introduction to the concepts of data, big data, and data analytics. It defines key terms like data, data science, and big data. It describes different types of data like structured, unstructured, and semi-structured data. Examples of each type are given. It also discusses different data sources and formats. The document explains how to measure data in bits, bytes, kilobytes and larger units. It outlines how much data is expected to grow by 2025 and what can be done with such large volumes of data through data analytics.

Uploaded by

FrNuDN

COE 102

Introductory Big Data
College of Engineering

Chapter 1
Introductory Background
Working with Data

2
Learning Objectives

• Define Data, Data Science, Big Data, and Analytics
• Identify data sources and data types
• Describe technologies for Data Analytics
• Explain data science terminologies
• Explain the Data-driven Approach – CRISP

3
What is data?
• The word comes from the Latin “datum”, which means “thing given”
• Data, in the information age, are a large set of digital bits encoding
numbers, texts, images, sounds, videos, and so on.
• Data consists of digital records
• Financial transactions
• Online trading and purchasing
• Social Network posts and interactions
• Medical images
• Sensory data

• It can exist in various forms and types
• Boolean facts: True or False, Female or Male, Positive or Negative, etc.
• Text, numbers, or a mix of both
• Images, audio recordings, or videos

4
Data Sources
Different forms of data are continuously generated by every device, software application, and electrical tool we use.

5
Different forms of data are generated from different sources – e-forms produce structured, tabular digital records

6
Different forms of data are generated from different sources – social media platforms produce unstructured textual, image, and audio data, as well as semi-structured platform data.

[Figure labels: Textual, Image, Interaction]
7
Types of Data Representations

• Structured
• Unstructured
• Semi-structured

8
Types of Data Representations – Structured Data
• Structured data: well organized, with well-defined attributes
• Databases, tables, lists, CSV files, etc.
• Example: patient records database, student grades database, sales database, etc.
• Can you think of more examples?
• Pros: Specific data can be retrieved easily through database operations. Usually clean, with fewer errors.
• Cons: Offers limited content and limited insights.
• In structured or tabular data:
• Rows represent instances, also called objects; one instance per row
• Columns represent attributes, also called features; one attribute per column
• Instances are examples of the concept we want to characterize
• Attributes are characteristics present in the instances

Example table (columns are features/attributes; rows are instances/objects/examples/data points):

Name      Age  Educational level  Company
Andrew    55   1                  Good
Bernhard  43   2                  Good
Carolina  37   5                  Bad
Dennis    82   3                  Good
Eve       23   3.2                Bad
Fred      46   5                  Good
Gwyneth   38   4.2                Bad
Hayden    50   4                  Bad
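The idea that rows are instances and columns are attributes can be sketched in a few lines of Python. This is only an illustration built on the example table from this slide; the helper names select and column, and the key names such as "education", are assumptions made for the sketch and not part of any database API.

```python
# Each dict is one instance (row); each key is one attribute (column).
contacts = [
    {"name": "Andrew", "age": 55, "education": 1.0, "company": "Good"},
    {"name": "Bernhard", "age": 43, "education": 2.0, "company": "Good"},
    {"name": "Carolina", "age": 37, "education": 5.0, "company": "Bad"},
    {"name": "Dennis", "age": 82, "education": 3.0, "company": "Good"},
    {"name": "Eve", "age": 23, "education": 3.2, "company": "Bad"},
    {"name": "Fred", "age": 46, "education": 5.0, "company": "Good"},
    {"name": "Gwyneth", "age": 38, "education": 4.2, "company": "Bad"},
    {"name": "Hayden", "age": 50, "education": 4.0, "company": "Bad"},
]


def select(rows, predicate):
    """Return the instances (rows) for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def column(rows, attribute):
    """Return every value of one attribute (column)."""
    return [row[attribute] for row in rows]


# Retrieve specific data: contacts over 50 who are rated as good company.
good_over_50 = select(contacts, lambda r: r["company"] == "Good" and r["age"] > 50)
names = [r["name"] for r in good_over_50]
ages = column(contacts, "age")
print(names)
print(ages)
```

Because every instance has the same well-defined attributes, a query like this takes one line, which is the "easy retrieval" advantage listed in the pros.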
9
Types of Data Representations – Unstructured Data
• Unstructured data: does not have well-defined attributes/features
• Chat messages, images, audio data, etc.
• Example: tweet text, Instagram images, music recordings, medical images, surveillance camera images, etc.

• Can you think of more examples?

• Pros:
• Rich content
• Valuable insights
• High Availability

• Cons:
• Difficult to process.
• Need to define attributes.
• Contains noise and errors
10
Types of Data Representations – Semi-structured Data
• Semi-structured data: a combination of structured and unstructured data
• Reflects the current state of the internet
• Social media JSON objects, wikis, etc.

• Can you think of more examples?

• Pros:
• Rich content
• Valuable insights
• High availability
• Easier to process compared to unstructured data
• Cons:
• No standard attribute structure
• Might have a highly hierarchical structure
• See the JSON object on the next slide
11
Sample Twitter JSON Object

{
  "created_at": "Thu Apr 06 15:24:15 +0000 2017",
  "id_str": "850006245121695744",
  "text": "1\/ Today we\u2019re sharing our vision for the future of the Twitter API platform!\nhttps:\/\/t.co\/XweGngmxlP",
  "user": {
    "id": 2244994945,
    "name": "Twitter Dev",
    "screen_name": "TwitterDev",
    "location": "Internet",
    "url": "https:\/\/dev.twitter.com\/",
    "description": "Your official source for Twitter Platform news, updates & events. Need technical help? Visit https:\/\/twittercommunity.com\/ \u2328\ufe0f #TapIntoTwitter"
  },
  "place": { },
  "entities": {
    "hashtags": [ ],
    "urls": [
      {
        "url": "https:\/\/t.co\/XweGngmxlP",
        "unwound": {
          "url": "https:\/\/cards.twitter.com\/cards\/18ce53wgo4h\/3xo1c",
          "title": "Building the Future of the Twitter API Platform"
        }
      }
    ],
    "user_mentions": [ ]
  }
}

• Can you name some attributes?
• Can you identify the values of these attributes?
• Can you identify unstructured content?

Link for the example:
https://developer.twitter.com/en/docs/twitter-api/premium/data-dictionary/overview#:~:text=All%20Twitter%20APIs%20that%20return,JSON%2C%20including%20Tweets%20and%20Users
12
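One way to approach the questions on this slide is to load the object and list its attributes programmatically. The sketch below uses Python's standard json module on an abbreviated copy of the sample Tweet (several fields are left out to keep it short); the helper attribute_paths is a hypothetical name introduced only for this illustration.

```python
import json

# Abbreviated copy of the sample Tweet object; only a few fields are kept.
raw = r'''
{
  "created_at": "Thu Apr 06 15:24:15 +0000 2017",
  "id_str": "850006245121695744",
  "text": "1\/ Today we\u2019re sharing our vision for the future of the Twitter API platform!",
  "user": {
    "id": 2244994945,
    "name": "Twitter Dev",
    "screen_name": "TwitterDev",
    "location": "Internet"
  },
  "entities": {"hashtags": [], "user_mentions": []}
}
'''

tweet = json.loads(raw)


def attribute_paths(obj, prefix=""):
    """Yield dotted paths of the attributes in a nested JSON object.

    Nested objects are walked recursively; lists and plain values are leaves.
    """
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from attribute_paths(value, f"{prefix}{key}.")
    else:
        yield prefix.rstrip(".")


paths = list(attribute_paths(tweet))
print(paths)                         # structured part: attribute names
print(tweet["user"]["screen_name"])  # value of one nested attribute
print(tweet["text"])                 # unstructured part: free text
```

The attribute names and the nesting are the structured part of the object, while the value of "text" is free-form content, which is why the object as a whole counts as semi-structured.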
How to measure data?

Value        Symbol  Name       In real world
0 or 1       b       bit        1/8 of a character
8 bits       B       byte       1 B = 1 character
10^3 bytes   KB      kilobyte   1 KB = 2 Tweets
10^6 bytes   MB      megabyte   6 MB = this course’s book
10^9 bytes   GB      gigabyte   4 GB = 1 high-quality movie (1 DVD)
10^12 bytes  TB      terabyte   1 TB = 16 entry-level smartphones
10^15 bytes  PB      petabyte   50 PB = entire written text of mankind
10^18 bytes  EB      exabyte    1 EB = all of Netflix × 3,000
10^21 bytes  ZB      zettabyte  1 ZB = 350 billion DVDs
10^24 bytes  YB      yottabyte  1 YB = the entire Internet

Values follow the decimal (SI) convention; the binary convention uses powers of 1,024 (2^10) per step instead.
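The unit ladder above can be turned into a small conversion helper. This is a minimal sketch, assuming the decimal convention (1,000 bytes per step) by default, with the binary convention (1,024 per step) available through the base parameter; the function names are illustrative and not taken from any library.

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def bits_to_bytes(bits):
    """8 bits make 1 byte."""
    return bits / 8


def human_readable(num_bytes, base=1000):
    """Format a byte count with the largest unit that keeps the value below `base`."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit that fits, or at the last unit available.
        if value < base or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= base


print(human_readable(bits_to_bytes(64)))
print(human_readable(4_000_000_000))
print(human_readable(1536, base=1024))
```

Passing the base explicitly keeps the two conventions from being mixed silently, which is a common source of confusion when storage sizes are reported.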

13
Data Growth
• What is the expected size of data in 2025?

• What can we do with all this data?!

Source: https://www.statista.com/statistics/871513/worldwide-data-created/ (accessed February 2023)
14
15
Activity 1: What can we do with data?
• Access the Activity 1 Padlet and share your thoughts:
• Padlet URL: https://padlet.com/hebaismail20/x45w5khe92psiau2

16
What Can We Do With such Data?
• Data is a source of information that can be transformed into new, useful, valid, and human-understandable knowledge
• It supports decision making in a wide variety of fields
• Agriculture, commerce, education, environment, finance, government, industry, medicine, transport, and social care
• The analysis of data to extract knowledge is known as Data Analytics

17
Data Analytics
• Definition
• The science of analyzing raw data to extract useful knowledge (patterns and insights) from it.
• Analytics are produced using techniques from:
• Statistical methods
• Artificial intelligence models and algorithms
• Data Mining techniques

18
Types of Analytics
• Descriptive analytics
• Summarize or condense data to extract patterns
• The result of a given method or technique is obtained directly by applying an algorithm to the data
• Examples: the relationship between height and weight, the average grade in a class, students with similar study interests, etc.
• Can you think of more examples?

• Predictive analytics
• Produce predictions based on predictive models
• A predictive model is a generalization of the relationship between data and the desired output. It associates the hidden relationships in the data with a sought or target prediction.
• Examples: predicting the likelihood that a new patient will get cancer, based on the genetic data history of previous cancer patients.
• Can you think of more examples?
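The difference between the two types of analytics can be illustrated with the height and weight example. In the sketch below, the measurements are hypothetical and made up for illustration; the descriptive part summarizes the data we already have, and the predictive part fits a straight line by ordinary least squares (one simple choice of predictive model, assumed here) and applies it to a new, unseen height.

```python
from statistics import mean

# Hypothetical (height in cm, weight in kg) measurements, for illustration only.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
weights = [50.0, 58.0, 66.0, 74.0, 82.0]

# Descriptive analytics: summarize the existing data.
avg_height = mean(heights)
avg_weight = mean(weights)


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Predictive analytics: generalize the relationship, then predict a new case.
slope, intercept = fit_line(heights, weights)


def predict_weight(height):
    """Predict the weight of a person whose height was not in the data."""
    return slope * height + intercept


print(avg_height, avg_weight)
print(round(predict_weight(175.0), 1))
```

The averages only describe the five people measured, while the fitted line is a model that can be applied to a height that never appeared in the data.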

19
Important terminologies in data analytics
• Algorithm
• A step-by-step set of instructions to solve a problem. Algorithms can be small or large depending on the complexity of the problem.
• Method or technique
• A systematic procedure that allows one to achieve an intended goal

20
Data Science
• Data Science
• Data science extracts meaningful and useful knowledge from data, with the
support of suitable technologies

• Data science goes beyond merely producing analytics by providing knowledge of the algorithms, advanced statistical methods, and visualization techniques required to obtain interesting analytics.

21
What about big data then?

• Does “big” in this case only refer to the large amount of data?

• Big data primarily refers to data sets that are too large in size, complex in form, or fast in production to be dealt with by traditional data-processing application software.

22
Big Data in 5 Minutes

23
Characteristics of Big Data

• Big data is often characterized by the so-called Five Vs
• Volume: size of data
• Veracity: uncertainty in data
• Value: importance of the data
• Velocity: speed of data gathering/generation
• Variety: the number of sources the data comes from and the differences in data forms

24
What is Big Data?
• Big data are data sets that are too large to be managed by conventional data-processing technologies

• This led to the development of new techniques and tools for data storage, processing, and transmission
• Examples of such tools are MapReduce, Hadoop, and Spark

• Data science is the creation of models and methods able to extract patterns from complex data, and the use of these models in real-life problems
• For example, ChatGPT

25
Big data architectures
• Distributed systems
• The most popular big data processing technique using clusters of computers is MapReduce
• MapReduce is a programming model (programming paradigm)
• It has two steps: map and reduce
• The data is divided into small chunks that are distributed across the computers in the cluster; the outputs are then reassembled to produce the final sought outcome
• Hadoop is the most famous implementation of MapReduce
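The map and reduce steps can be imitated on a single machine to see how the pieces fit together. The sketch below counts words across chunks of text in plain Python; it is not Hadoop code, the chunks are processed one after another rather than on separate computers, and the grouping step between map and reduce (often called "shuffle") is written out explicitly even though the slide only names two steps. All function names are illustrative.

```python
from collections import defaultdict


def split_into_chunks(items, n_chunks):
    """Divide the data into roughly equal chunks, one per (simulated) computer."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def map_step(chunk):
    """Map: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Group the emitted values by key across all chunks."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_step(groups):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


lines = [
    "big data is big",
    "data science uses data",
    "hadoop implements mapreduce",
]
chunks = split_into_chunks(lines, n_chunks=2)
mapped = [map_step(chunk) for chunk in chunks]  # on a cluster, these run in parallel
counts = reduce_step(shuffle(mapped))
print(counts["data"], counts["big"])
```

Each call to map_step only needs its own chunk, which is what lets a real cluster run the map work on many computers at once before the results are reassembled.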

26
Hadoop in 5 Minutes

27
Data-driven Methodology
A data analytics project involves more than using one or more specific methods or techniques. It also implies:
• understanding the problem to be solved
• defining the objectives of the project
• looking for the necessary data
• preparing these data so that they can be used
• identifying suitable methods and choosing between them
• optimizing the outputs of each method
• analyzing and evaluating the results
• redoing the pre-processing tasks and repeating the experiments
• and so on.
• We need a data-driven methodology for the project.

28
Data-driven Methodology - CRISP
• The CRISP-DM methodology: CRoss-Industry Standard Process for Data Mining (CRISP-DM) is a six-step methodology.

• Despite its six sequential phases, CRISP-DM is seen as a perpetual process, used throughout the life of a company in successive iterations

29
Data-driven Methodology - CRISP
1) Business understanding: This involves understanding the business domain, defining the problem from the business domain perspective, and finally translating that business problem into a data analytics problem.
2) Data understanding: This involves collecting the necessary data and performing an initial visualization/summarization in order to obtain the first insights, particularly, but not exclusively, about data quality problems such as missing data or outliers.
3) Data preparation: This involves preparing the data set for the modeling tool, and includes data transformation, feature construction, outlier removal, filling in missing data, and removal of incomplete instances.
4) Modeling: Typically, several methods can be used to solve the same analytics problem, often with specific data requirements. This means additional, method-specific data preparation tasks may be needed, in which case it is necessary to go back to the previous step. The modeling phase also includes optimizing the chosen method(s).
5) Evaluation: Solving the problem from the data analytics point of view is not the end of the process. It is now necessary to understand how its use is meaningful from the business perspective; in other words, to check that the obtained solution meets the business requirements.
6) Deployment: The main purpose of this phase is to integrate the data analytics solution into the business process. Typically, this means integrating the obtained solution into a decision-support tool, website maintenance process, reporting process, or elsewhere.
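To make the data preparation phase (step 3) more concrete, here is a minimal sketch of three of the tasks it names: removing incomplete instances, removing outliers, and filling in missing data. The records, the plausible-value ranges, and the choice of the median as the fill value are all assumptions made for illustration; a real project would choose these rules during the data understanding phase.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
records = [
    {"age": 55, "grade": 3.1},
    {"age": None, "grade": 2.4},
    {"age": 37, "grade": None},
    {"age": None, "grade": None},  # nothing was recorded for this instance
    {"age": 460, "grade": 3.8},    # age is an obvious entry error
    {"age": 23, "grade": 3.2},
]

# Assumed plausible range for each attribute (a domain rule, not a statistic).
VALID_RANGES = {"age": (0, 120), "grade": (0.0, 4.0)}


def drop_empty(rows):
    """Remove instances in which every attribute is missing."""
    return [r for r in rows if any(v is not None for v in r.values())]


def drop_outliers(rows, ranges):
    """Remove instances that have a value outside its plausible range."""
    def within_ranges(row):
        return all(
            row[attr] is None or low <= row[attr] <= high
            for attr, (low, high) in ranges.items()
        )
    return [r for r in rows if within_ranges(r)]


def impute_median(rows):
    """Fill each missing value with the median of the observed values."""
    if not rows:
        return []
    filled = [dict(r) for r in rows]  # copy so the input stays unchanged
    for attr in filled[0]:
        observed = [r[attr] for r in rows if r[attr] is not None]
        fill_value = median(observed)
        for r in filled:
            if r[attr] is None:
                r[attr] = fill_value
    return filled


prepared = impute_median(drop_outliers(drop_empty(records), VALID_RANGES))
for row in prepared:
    print(row)
```

The order matters in this sketch: outliers are removed before imputation so that an entry error such as the age of 460 does not distort the value used to fill the gaps.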
30
Data-driven Methodology - CRISP

31
Class Activity 2: Examine CRISP in a real-life context
1. Predicting the number of airline sales in 2025
2. Understanding customer trends in the aviation industry
3. Predicting the deterioration rate for patients in the ICU
4. Predicting pipeline leakage

32
Recap
• Data, in the information age, are a large set of digital bits encoding
numbers, texts, images, sounds, videos, and so on.
• Data grows exponentially
• New technologies arise to help store, transfer, process, and analyze data
• Analytics support decision making
• Algorithm, technique, methodology
• CRISP

33
Reading
• Textbook: Chapter 1 of the textbook
• Moreira, João, André Carlos Ponce de Leon Ferreira, and Tomáš Horváth. A
general introduction to data analytics. Wiley, 2019. ISBN: 9781119296263.

34
