
Chapter 2

Data Science
Contents:

• Overview of Data Science
• Data Types and Their Representation
• Data Value Chain
• Basic Concepts of Big Data


Overview of Data Science

• Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.
• Data science is much more than simply analyzing data.
• It offers a range of roles and requires a range of skills.


What are data and information?
• Data can be defined as a representation of facts, concepts, or instructions in a
formalized manner.
• It can be described as unprocessed facts and figures.
• It is represented with the help of:
• alphabets (A-Z, a-z),
• digits (0-9) or
• special characters (+, -, /, *, <,>, =, etc.)
• Information is the processed data on which decisions and actions are based.
• It is interpreted data: created from organized, structured, and processed data in a particular context.
Data Processing Cycle
• Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose.
• Data processing consists of three basic steps: input, processing, and output. A minimal sketch of this cycle is shown below.
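As a small illustration (not part of the original slides; the figures are made up), the Python sketch below takes raw figures as input, processes them into a summary, and outputs the result:

```python
# Minimal sketch of the data processing cycle: input -> processing -> output.

raw_data = ["23", "19", "31", "27"]           # Input: unprocessed facts and figures (as text)

numbers = [int(value) for value in raw_data]  # Processing: convert and structure the data
average = sum(numbers) / len(numbers)

print(f"Average value: {average}")            # Output: information that supports a decision
```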

Data types and their representation

1. Data types from a computer programming perspective

• Integers (int): used to store whole numbers, mathematically known as integers.
• Booleans (bool): used to represent a value restricted to one of two values: true or false.
• Characters (char): used to store a single character.
• Floating-point numbers (float): used to store real numbers.
• Alphanumeric strings (string): used to store a combination of characters and numbers.
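As a brief illustration (Python is assumed here purely for convenience; the slides do not prescribe a language), the sketch below shows how these basic types might appear in code. Note that Python has no separate char type, so a single character is stored as a one-character string:

```python
# Basic data types from a programming perspective (illustrative Python sketch).

age = 25                  # integer (int): a whole number
is_student = True         # boolean (bool): restricted to True or False
grade = "A"               # character: Python stores this as a one-character string
gpa = 3.75                # floating-point number (float): a real number
student_id = "ETS0123/11" # alphanumeric string (str): characters and digits combined

print(type(age), type(is_student), type(grade), type(gpa), type(student_id))
```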
2. Data types from a Data Analytics perspective

Structured Data
• Has a pre-defined data model.
• Straightforward to analyze.
• Placed in a tabular format.
• Example: Excel files or SQL databases.

Unstructured Data
• Has no pre-defined data model.
• May contain data such as dates, numbers, and facts.
• Difficult to understand using traditional programs.
• Example: audio and video files.

Semi-structured Data
• Contains tags or other markers to separate semantic elements.
• Known as a self-describing structure.
• Example: JSON and XML.
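As a short, hedged illustration of semi-structured data (not part of the original slides; the field names are invented for the example), the sketch below parses a small JSON document in Python. The tags (keys) describe the values, which is why such data is called self-describing:

```python
import json

# A small JSON document: the keys act as markers that separate semantic elements.
record = '''
{
  "student": "Abebe",
  "courses": ["Data Science", "AI"],
  "enrolled": true
}
'''

data = json.loads(record)      # parse the JSON text into Python objects
print(data["student"])         # Abebe
print(len(data["courses"]))    # 2
```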
Metadata
• Metadata is not a separate data structure, but it is one of the most important elements for Big Data analysis and solutions.
• It is often described as "data about data".
• In a set of photographs, for example, metadata could describe when and where the photos were taken.
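A minimal sketch of the idea (the field names and values are hypothetical, not from the slides): the metadata below describes a photograph without being the photograph itself.

```python
# Metadata is "data about data": it describes the photo, not its pixel content.
photo_metadata = {
    "file_name": "photo_001.jpg",    # which data object this metadata describes
    "taken_at": "2021-05-14 09:30",  # when the photo was taken
    "location": "Addis Ababa",       # where it was taken
    "camera": "Phone camera",
    "resolution": "4032x3024",
}

for key, value in photo_metadata.items():
    print(f"{key}: {value}")
```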
Data Value Chain

• The Data Value Chain is introduced to describe the information flow within a big data system.
• It describes the full data lifecycle, from collection to analysis and usage.
• The Big Data Value Chain identifies the following key high-level activities: data acquisition, data analysis, data curation, data storage, and data usage.
Basic concepts of big data

• Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
• According to IBM, big data is characterized by the 3Vs and more:
• Volume (amount of data): dealing with large scales of data within data processing (e.g. global supply chains, global financial analysis, the Large Hadron Collider).
• Velocity (speed of data): dealing with streams of high-frequency, incoming real-time data (e.g. sensors, pervasive environments, electronic trading, the Internet of Things).
• Variety (range of data types/sources): dealing with data in differing syntactic formats (e.g. spreadsheets, XML, DBMS), schemas, and meanings (e.g. enterprise data integration).
• Veracity: can we trust the data? How accurate is it? etc.

Clustered Computing and Hadoop Ecosystem

Clustered Computing

• Cluster computing refers to many computers connected on a network that perform like a single entity.
• Because of the qualities of big data, individual computers are often inadequate for handling the data at most stages.
• To better address the high storage and computational needs of big data, computer clusters are a better fit.
• Big data clustering software combines the resources of many smaller machines, seeking to provide a number of benefits:
• Suppose you have a big file with more than 500 MB of data and you need to count the number of words in it, but your computer has only 100 MB of memory. How can you handle it? (A sketch of this chunked, divide-and-combine approach appears after the list of benefits below.)
• Resource Pooling: Combining the available storage space to hold data
is a clear benefit, but CPU and memory pooling are also extremely
important.
• Processing large datasets requires large amounts of all three of these
resources.
• High Availability: Clusters can provide varying levels of fault
tolerance and availability guarantees to prevent hardware or software
failures from affecting access to data and processing.
• Easy Scalability: Clusters make it easy to scale horizontally by
adding additional machines to the group.
• Cluster membership and resource allocation can be handled by software
like Hadoop’s YARN (which stands for Yet Another Resource Negotiator).
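As a hedged illustration of the word-count question posed above (a plain-Python sketch of the chunked, MapReduce-style idea, not Hadoop itself; the file name is hypothetical), the file is read in pieces small enough to fit in memory and the partial counts are combined:

```python
from collections import Counter

def count_words(path, chunk_size=50 * 1024 * 1024):
    """Count words in a file too large to load at once by reading it in chunks."""
    totals = Counter()
    leftover = ""                              # a word may be split across chunk boundaries
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)         # "map" step: process one manageable piece
            if not chunk:
                break
            words = (leftover + chunk).split()
            if chunk[-1].isspace():
                leftover = ""
            else:
                leftover = words.pop() if words else ""
            totals.update(words)               # "reduce" step: merge the partial counts
    if leftover:
        totals.update([leftover])
    return totals

# counts = count_words("big_file.txt")         # hypothetical 500 MB file
# print(sum(counts.values()), "words in total")
```

In a real cluster, each chunk would be counted on a different machine and only the partial counts would be combined, which is exactly the kind of coordination that MapReduce automates.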

• Hadoop is an open-source framework intended to make interaction with big data easier.
• It is a framework that allows for the distributed processing of large datasets across clusters of computers using simple programming models.
The four key characteristics of Hadoop are:

• Economical: Its systems are highly economical, as ordinary computers can be used for data processing.
• Reliable: It is reliable, as it stores copies of the data on different machines and is resistant to hardware failure.
• Scalable: It is easily scalable, both horizontally and vertically. A few extra nodes help in scaling up the framework.
• Flexible: It is flexible: you can store as much structured and unstructured data as you need and decide how to use it later.
• Hadoop has an ecosystem that has evolved from its four core
components: data management, data access, data processing, and data
storage.

Hadoop ecosystem
Big Data Life Cycle with Hadoop
1. Ingesting data into the system

• The first stage of Big Data processing is Ingest.
• The data is ingested or transferred to Hadoop from various sources such as relational databases, systems, or local files.
• Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers event data.
2. Processing the data in storage

• The second stage is Processing. In this stage, the data is stored and processed.
• The data is stored in the distributed file system HDFS and in the NoSQL distributed database HBase; Spark and MapReduce perform the data processing. (A small Spark sketch follows.)
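As a brief, hedged sketch of this processing step (it assumes a local PySpark installation; the record layout and values are invented for illustration, and in practice the input would come from HDFS via sc.textFile), a Spark job might sum sensor readings per device:

```python
from pyspark import SparkContext

# Start Spark locally; on a real cluster this would run under YARN.
sc = SparkContext(master="local[*]", appName="sensor-totals")

# Hypothetical input lines of the form "device_id,reading".
lines = sc.parallelize(["d1,10", "d2,5", "d1,7", "d3,2", "d2,1"])

totals = (
    lines.map(lambda line: line.split(","))    # split each record into fields
         .map(lambda f: (f[0], int(f[1])))     # build (device_id, reading) pairs
         .reduceByKey(lambda a, b: a + b)      # sum the readings per device
)

print(totals.collect())    # e.g. [('d1', 17), ('d2', 6), ('d3', 2)]
sc.stop()
```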
3. Computing and analyzing data

• The third stage is to Analyze. Here, the data is analyzed by processing frameworks such as Pig, Hive, and Impala.
• Pig converts the data using map and reduce operations and then analyzes it.
• Hive is also based on map and reduce programming and is most suitable for structured data. (An analogous SQL-style query is sketched below.)
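To give a flavour of the SQL-style analysis that Hive enables over structured data, here is a hedged sketch using Spark SQL rather than Hive itself (the table and column names are invented for the example):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("structured-analysis").getOrCreate()

# A tiny structured dataset standing in for a table of sales records.
sales = spark.createDataFrame(
    [("north", 120), ("south", 80), ("north", 60), ("east", 95)],
    ["region", "amount"],
)
sales.createOrReplaceTempView("sales")

# SQL-style query, similar in spirit to what HiveQL offers for structured data.
result = spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
result.show()

spark.stop()
```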
4. Visualizing the results
• The fourth stage is Access, which is performed by tools such as Hue
and Cloudera Search.
• In this stage, the analyzed data can be accessed by users.
Thank you!!!
