Chapter: Data Science
What are Data and Information?
Data is represented with the help of characters such as letters (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
cont. ..
Information is defined as:
the processed data on which decisions and actions are based;
data that has been processed into a form that is meaningful to the recipient; or
interpreted data, created from organized, structured, and processed data in a particular context.
Data Processing Cycle
• Data processing is the restructuring or reordering of data by people or machines to increase its usefulness and add value for a particular purpose.
• The basic steps of data processing are:
I. Input
II. Processing
III. Output
[Figure: Data → Processing → Output]
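A minimal sketch of the cycle in Python (the exam scores are hypothetical): raw characters come in as input, are restructured during processing, and come out as information.

```python
# Input: raw data captured as characters/digits (hypothetical exam scores)
raw_scores = ["70", "85", "90"]

# Processing: restructure the character data into integers and summarize
scores = [int(s) for s in raw_scores]
average = sum(scores) / len(scores)

# Output: information in a form meaningful to the recipient
print(f"Average score: {average:.1f}")   # -> Average score: 81.7
```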
Question
1. Define the above terms (input, processing, and output) and discuss the main differences between data and information, with examples.
Data types and their representation
• Data types can be described from diverse perspectives. Some of these perspectives are:
i. Data types from a computer programming perspective: different languages may use different terminology for the notion of data types. Common data types include:
• Integers (int): used to store whole numbers, mathematically known as integers
• Booleans (bool): used to represent values restricted to one of two options: true or false
• Characters (char): used to store a single character
• Floating-point numbers (float): used to store real numbers
• Alphanumeric strings (string): used to store a combination of characters and numbers
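As an illustration, here is how these common data types look in Python (note that Python has no separate char type; a single character is just a one-character string):

```python
count: int = 42            # integer: a whole number
is_ready: bool = True      # boolean: restricted to True or False
initial: str = "A"         # character: Python uses a 1-character string
price: float = 19.99       # floating-point: a real number
user_id: str = "user42"    # alphanumeric string: characters and digits combined

print(type(count), type(is_ready), type(initial), type(price), type(user_id))
```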
cont. ..
ii. Data types from a data analytics perspective: from an analytics perspective, there are three common data types or structures:
I. Structured data: data that adheres to a pre-defined data model and is therefore straightforward to analyze.
It conforms to a tabular format, with relationships between the different rows and columns.
e.g., Excel files or SQL databases (which have structured rows and columns that can be sorted)
II. Semi-structured data: a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables.
It is also known as a self-describing structure. (Why?)
e.g., JSON, XML, sensor data
III. Unstructured data: information that either does not have a pre-defined data model or is not organized in a pre-defined manner.
cont. ..
Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well.
Do you think that understanding unstructured data with traditional programs is easy, compared to data stored in structured databases?
e.g., audio files, video files, or NoSQL databases
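A small Python sketch contrasting the three structures (the student records are hypothetical; sqlite3 and json are standard-library modules):

```python
import json
import sqlite3

# Structured: a pre-defined tabular model with rows and columns
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER, name TEXT)")
conn.execute("INSERT INTO students VALUES (1, 'Abebe')")
print(conn.execute("SELECT * FROM students").fetchall())

# Semi-structured: JSON is self-describing -- each value carries its
# field name, but there is no fixed relational schema
record = json.loads('{"id": 1, "name": "Abebe", "courses": ["math", "cs"]}')
print(record["courses"])

# Unstructured: free text with no pre-defined model; a program must
# infer dates, names, and facts from the raw content itself
note = "Met Abebe on 2024-05-01 to discuss the math project."
print(len(note.split()), "words of raw text")
```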
cont. ..
iii. Metadata: simply defined as data about data.
From a technical point of view, this is not a separate data structure, but it is one of the most important elements for Big Data analysis and Big Data solutions.
It provides additional information about a specific set of data.
For example, in a set of photographs, metadata describes when and where the photos were taken; it provides fields for dates and locations which, by themselves, can be considered structured data.
For this reason, metadata is frequently used by Big Data solutions for initial analysis.
Compare metadata with structured, unstructured, and semi-structured data.
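An illustrative sketch (the photo bytes and field names are hypothetical): the image bytes are the data itself, while the dictionary holds data about that data.

```python
# The data itself: raw (unstructured) image bytes -- placeholder content here
photo_bytes = b"\x89PNG..."

# Metadata: data about the data; its fields (dates, locations) are themselves
# structured, which is why Big Data solutions often analyze metadata first
photo_metadata = {
    "taken_at": "2023-09-14T10:32:00",
    "location": "Addis Ababa",
    "camera": "ExampleCam X100",   # hypothetical camera model
}

print(photo_metadata["taken_at"], photo_metadata["location"])
```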
Data value Chain
• It is introduced to describe the information flow within a big data system as a
series of steps needed to generate value and useful insights from data.
• The Big Data Value Chain identifies the following key high-level activities:
cont. ..
Data Acquisition: the process of gathering, filtering, and cleaning data before it is put in a data warehouse or any other storage solution on which data analysis can be carried out.
It is one of the major big data challenges in terms of infrastructure requirements. Why? Explain.
Data Analysis: making the acquired raw data amenable to use in decision-making as well as domain-specific usage.
It involves exploring, transforming, and modeling data with the goal of highlighting relevant data, and synthesizing and extracting useful hidden information with high potential from a business point of view (see the sketch below).
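A minimal sketch of this exploring/transforming step, assuming the pandas library and hypothetical sales records:

```python
import pandas as pd

# Hypothetical raw records standing in for acquired data
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "amount": [120.0, 80.0, 150.0, 95.0],
})

# Transform and model: aggregate per region to highlight relevant patterns
summary = sales.groupby("region")["amount"].agg(["sum", "mean"])
print(summary)
```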
cont. ..
Data Curation: the active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage.
It involves different activities such as content creation, selection, classification, transformation, validation, and preservation.
Clustered Computing and Hadoop Ecosystem
Clustered computing:
• Big data clustering software combines the resources of many smaller machines, seeking to provide a number of benefits:
Resource Pooling: combining the available storage space, CPU, and memory of the member machines into a single pool that can be shared across jobs.
High Availability: clusters can provide varying levels of fault tolerance, so that hardware or software failures do not prevent access to data.
Easy Scalability: clusters make it easy to scale horizontally by adding more machines to the group rather than upgrading a single machine.
Hadoop and its Ecosystem
• Hadoop is an open-source framework intended to make interaction with big data easier.
It allows for the distributed processing of large datasets across clusters of computers using simple programming models.
Key characteristics of Hadoop are that it is economical, reliable, scalable, and flexible.
Big Data Life Cycle with Hadoop
• There are different stages of Big Data processing; some of them are:
I. Ingesting data into the system: data is ingested, i.e., transferred to Hadoop from various sources such as relational databases, systems, or local files.
Sqoop transfers data from an RDBMS to HDFS, whereas Flume transfers event data.
II. Processing the data in storage: the data is stored and processed.
The data is stored in the distributed file system HDFS and in the distributed NoSQL database HBase. Spark and MapReduce perform the data processing (a sketch of the map/reduce model follows this list).
cont. ..
III. Computing and analyzing data: data is analyzed by processing frameworks such as Pig, Hive, and Impala.
Pig converts the data using map and reduce and then analyzes it.
Hive is also based on map-and-reduce programming and is most suitable for structured data.
IV. Visualizing the results: performed by tools such as Hue and Cloudera Search.
In this stage, the analyzed data can be accessed by users.
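To make the map-and-reduce idea concrete, here is a minimal local simulation of the programming model in plain Python (a word count; this illustrates the model only and does not run on a Hadoop cluster):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (key, value) pair for every word -- here (word, 1)
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def reduce_phase(pairs):
    # Reduce: sum the values for each key to get per-word totals
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

lines = ["big data needs big clusters", "hadoop processes big data"]
print(reduce_phase(map_phase(lines)))
# -> {'big': 3, 'data': 2, 'needs': 1, 'clusters': 1, 'hadoop': 1, 'processes': 1}
```

On a real cluster, the framework runs many mapper and reducer tasks in parallel across machines and shuffles the intermediate (key, value) pairs between them; the programming model, however, is the same as in this sketch.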