
Chapter II

Data Science

Chapter Outline
✓ An Overview of Data Science

✓ What are data and information

✓ Data types and their representation

✓ Data Value Chain

✓ Basic Concepts of Big Data

Overview of Data Science
❖ What are data science, data, information, and big data?
Data science is:-

❖ a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.

❖ the science that uses computer science, statistics, machine learning, visualization, and human-computer interaction to collect, clean, integrate, analyze, and visualize data, and to interact with data to create data products.

Cntd…
✓ As an academic discipline and profession, data science continues to evolve as one of the most promising and in-demand career paths for skilled professionals.

✓ Data professionals understand that they must advance beyond the traditional skills of analyzing large amounts of data, data mining, and programming.

✓ Data scientists need to be curious and result-oriented, with exceptional industry-specific knowledge and communication skills that allow them to explain highly technical results to their non-technical counterparts.
What are data and information?
⁎ Data can be defined as a representation of facts, figures, concepts, or instructions in a formalized manner, suitable for communication, interpretation, or processing by humans or electronic machines. It can be described as unprocessed facts and figures, represented with the help of characters such as alphabets (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
➢ Information is defined as processed data on which decisions and actions are based.
⁎ It is data that has been processed into a form that is meaningful to the recipient; interpreted data, created from organized, structured, and processed data in a particular context.
Data Processing Cycle
• Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose.

• The basic steps of data processing are:

I. input

II. processing and

III. output.

Question

1. Define the above terms (input, processing, and output) and discuss the main differences between data and information, with examples.
Cntd…
 Input - in this step, the input data is prepared in some convenient form for processing. The form will depend on the processing machine.

 Processing - in this step, the input data is changed to produce data in a more useful form. For example, interest can be calculated on deposits to a bank, or a summary of sales for the month can be calculated from the sales orders.

 Output - at this stage, the result of the preceding processing step is collected. The particular form of the output data depends on the use of the data. For example, output data may be payroll for employees.
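As a minimal sketch, the three steps can be illustrated in Python using the deposit-interest example above (the deposit figures and interest rate are hypothetical illustration values):

    deposits = [1000.0, 2500.0, 400.0]   # input: raw deposit figures
    annual_rate = 0.05                   # input: assumed interest rate

    # processing: transform the raw figures into a more useful form
    interest = [amount * annual_rate for amount in deposits]

    # output: collect the result in a form suited to its use
    for amount, earned in zip(deposits, interest):
        print(f"Deposit {amount:8.2f} earns {earned:6.2f} interest")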

Data types and their representation
Data types can be described from diverse perspectives. Some of these perspectives are:

1. Data types from a computer programming perspective:- different languages may use different terminology for the notion of data types. Common data types include:

➢ Integers (int) - used to store whole numbers, mathematically known as integers.
➢ Booleans (bool) - used to represent a value restricted to one of two values: true or false.
➢ Characters (char) - used to store a single character.
➢ Floating-point numbers (float) - used to store real numbers.
➢ Alphanumeric strings (string) - used to store a combination of characters and numbers.
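A short Python sketch of these types (variable names are illustrative; note that Python has no separate char type, so a one-character string stands in for it):

    age = 25              # int: whole number
    is_active = True      # bool: one of two values, true or false
    grade = "A"           # char: a single character (1-character string)
    price = 19.99         # float: real number
    user_id = "abc123"    # string: combination of characters and numbers

    for value in (age, is_active, grade, price, user_id):
        print(type(value).__name__, value)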
… Cntd
2. Data types from a data analytics perspective: from an analytics perspective, there are three common types of data (or data structures):

I. Structured Data:- data that adheres to a pre-defined data model and is therefore straightforward to analyze.

➢ It conforms to a tabular format with relationships between the different rows and columns.

➢ e.g. Excel files or SQL databases (which have structured rows and columns that can be sorted).
Cntd…
II. Semi-structured Data:- a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables.

➢ It is also known as a self-describing structure. (why?)

➢ e.g. JSON, XML, sensor data.
III. Unstructured Data:- information that either does not have a predefined data model or is not organized in a pre-defined manner.

✓ Unstructured information is typically text-heavy but may contain data such as dates, numbers, and facts as well.

Cntd…
✓ Do you think understanding unstructured data with traditional programs is as easy as working with data stored in structured databases?
e.g. audio, video files or NoSQL databases.
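To illustrate why semi-structured data is called self-describing, the sketch below (in Python, with hypothetical field names and values) contrasts a JSON record, whose keys travel with the values, with a structured row that depends on a fixed, pre-defined schema:

    import json

    # Semi-structured: the record carries its own field names.
    record = '{"id": 1, "name": "Abebe", "tags": ["sensor", "log"]}'
    data = json.loads(record)
    print(data["name"])              # no external schema needed

    # Structured: meaning comes from a fixed, pre-defined column layout.
    columns = ("id", "name")
    row = (1, "Abebe")
    print(dict(zip(columns, row)))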

Cntd…
3. Metadata:- simply defined as data about data.

From a technical point of view, metadata is not a separate data structure, but it is one of the most important elements for Big Data analysis and Big Data solutions.

⁎ It provides additional information about a specific set of data.

E.g., in a set of photographs, metadata describes when and where the photos were taken. It provides fields for dates and locations which, by themselves, can be considered structured data.

⁎ For this reason, metadata is frequently used by Big Data solutions for initial analysis.
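A minimal sketch of the idea in Python (all field values are hypothetical): the photo file is the data, and the dictionary below is metadata describing it.

    # Metadata: data about data. The photo bytes are the data; these
    # fields describe the photo.
    photo_metadata = {
        "filename": "trip_001.jpg",
        "taken_on": "2020-01-15",     # when the photo was taken
        "location": "Addis Ababa",    # where the photo was taken
    }

    # The date and location fields are structured data in their own
    # right, which is why Big Data solutions can use metadata for
    # initial analysis.
    for field, value in photo_metadata.items():
        print(f"{field}: {value}")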
Data Value Chain
➢ The Data Value Chain describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.

➢ The Big Data Value Chain identifies the following key high-level activities, each discussed below:

Cntd …
 Data Acquisition:- the process of gathering, filtering, and cleaning data before it is put in a data warehouse or any other storage solution on which data analysis can be carried out.

✓ It is one of the major big data challenges in terms of infrastructure requirements. (why? explain) The infrastructure must, for example, capture very high volumes of data at low latency, often from distributed sources.

 Data Analysis:- making the raw data acquired amenable to use in decision-making as well as domain-specific usage.

✓ It involves exploring, transforming, and modeling data with the goal of highlighting relevant data, synthesizing and extracting useful hidden information with high potential from a business point of view.
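As a hedged sketch of these two activities, the Python snippet below (assuming the pandas library; the sales figures and column names are hypothetical) cleans raw records during acquisition and then summarizes them as a simple analysis step:

    import pandas as pd

    # Acquisition: gather and clean raw records before storage/analysis.
    raw = pd.DataFrame({
        "region": ["North", "South", "North", None],
        "sales":  [120.0, 95.5, None, 40.0],
    })
    clean = raw.dropna()    # filtering/cleaning step

    # Analysis: transform and model to highlight relevant information.
    summary = clean.groupby("region")["sales"].sum()
    print(summary)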
Cntd…
 Data Curation:- the active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage.

✓ It includes activities such as content creation, selection, classification, transformation, validation, and preservation.

 Data Storage:- the persistence and management of data in a scalable way that satisfies the needs of applications that require fast access to the data.

 Data Usage:- covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity.
Basic Concepts of Big Data
• Big data is the term for a collection of data sets so large and complex that they become difficult to process using on-hand database management tools or traditional data processing applications.
 It is characterized by 3Vs and more:
▪ Volume: large amounts of raw data (zettabytes)
▪ Velocity: data is live, streaming, or in motion, and changes over time
▪ Variety: data comes in many different forms from diverse sources
▪ Veracity: can we trust the data? How accurate is it? (data quality)
▪ Value: information for decision making
Cntd…
Clustered Computing and Hadoop Ecosystem
Cluster Computing:

❑ A form of computing in which a group of computers (often called nodes) are connected through a LAN (Local Area Network) so that they behave like a single machine.

• Big data clustering software combines the resources of many smaller machines, seeking to provide a number of benefits:

⁕ Resource Pooling
⁕ High Availability
⁕ Easy Scalability
Cluster computing benefits
✓ Resource pooling: combining the available storage space, CPU, and memory is extremely important.
✓ Processing large datasets requires large amounts of all three of these resources (storage space, CPU, and memory).
✓ High availability: clusters provide varying levels of fault tolerance and availability guarantees to prevent hardware and software failures from affecting access to data and processing.
✓ This is increasingly important for real-time analytics of big data.
✓ Easy scalability: clusters make it easy to scale horizontally by adding more machines to the group. The system can react to changes in resource requirements without expanding the physical resources on a machine.
Hadoop and its Ecosystem
• Hadoop is an open-source framework intended to make interaction with big data easier.

✓ It is a framework that allows for the distributed processing of large datasets across clusters of computers using simple programming models.

Characteristics of Hadoop:-

i. Economical:- its systems are highly economical, since ordinary (commodity) computers can be used for data processing.

ii. Reliable:- it stores copies of the data on different machines and is resistant to hardware failure.

iii. Scalable:- it is easily scalable both horizontally and vertically.

iv. Flexible:- you can store as much structured and unstructured data as you need.
Cntd…
Hadoop has an ecosystem that has evolved from its four core components:
1. Data management 2. Access 3. Processing 4. Storage
It comprises the following components and many others:
▪ HDFS: Hadoop Distributed File System
▪ YARN: Yet Another Resource Negotiator
▪ MapReduce: programming-based data processing
▪ Spark: in-memory data processing
▪ PIG, HIVE: query-based processing
▪ HBase: NoSQL database
▪ Mahout, Spark MLlib: machine learning algorithm libraries
▪ Solr, Lucene: searching and indexing
▪ Zookeeper: managing the cluster
▪ Oozie: job scheduling
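To give a feel for the MapReduce component, here is a hedged, simplified word count in plain Python: map emits (word, 1) pairs and reduce sums them per word. A real Hadoop job distributes the same logic across the cluster's nodes; this local sketch is not the Hadoop API itself.

    from collections import defaultdict

    def mapper(line):
        # map phase: emit a (word, 1) pair for every word in the line
        for word in line.split():
            yield (word.lower(), 1)

    def reducer(pairs):
        # reduce phase: sum the counts for each distinct word
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return dict(counts)

    lines = ["big data big clusters", "data value chain"]
    pairs = [pair for line in lines for pair in mapper(line)]
    print(reducer(pairs))    # e.g. {'big': 2, 'data': 2, 'clusters': 1, ...}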
Big Data Life Cycle with Hadoop
There are different stages of Big Data processing. Some of them are:

I. Ingesting data into the system:- data is ingested or transferred to Hadoop from various sources such as relational databases, systems, or local files. Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers event data.

II. Processing the data in storage:- the data is stored and processed.

✓ The data is stored in the distributed file system, HDFS, and the NoSQL distributed database, HBase. Spark and MapReduce perform the data processing.
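A hedged sketch of this stage using PySpark (assuming Spark is installed; the HDFS path, file, and column name are hypothetical):

    from pyspark.sql import SparkSession

    # Read data already stored in HDFS and process it with Spark.
    spark = SparkSession.builder.appName("lifecycle-demo").getOrCreate()
    df = spark.read.csv("hdfs:///data/sales.csv",
                        header=True, inferSchema=True)
    df.groupBy("region").count().show()   # distributed processing step
    spark.stop()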

Cntd…
III. Computing and analyzing data:- data is analyzed by processing frameworks such as Pig, Hive, and Impala.

✓ Pig converts the data using map and reduce and then analyzes it.
✓ Hive is also based on map and reduce programming and is most suitable for structured data.

IV. Visualizing the results:- performed by tools such as Hue and Cloudera Search.

✓ In this stage, the analyzed data can be accessed by users.
