CH 2 - Emerging

The document discusses key concepts in data science and big data. It defines data science as a multi-disciplinary field that uses scientific methods and systems to extract knowledge from various data sources. It also discusses the differences between data and information, types of data, data processing cycles, and concepts in big data like data acquisition, analysis, curation, storage, usage and basic concepts.

Uploaded by

Tofik mohammed

Compiled By: Teshager K.

Chapter-Two
Data Science
Overview of Data Science
▪ Data science is a multi-disciplinary field that uses:
• scientific methods
• processes
• algorithms and
• systems to extract knowledge from structured, semi-structured and unstructured sources.
▪ In academia, data science continues to evolve alongside related areas such as:
• data mining
• data warehousing
• data modeling
• big data, etc.
▪ A data scientist possesses a strong quantitative background in statistics and linear algebra, as well as programming skills.
Data Vs Information
▪ Data is a representation of unprocessed:
• raw facts
• figures
• concepts
• instructions, in a formalized manner that is suitable for processing, interpretation and communication by humans or machines.
▪ Data may convey a partial meaning but not a complete sense.
▪ Data can be represented using:
• figures
• shapes
• tables
• numeric characters
• alphanumeric characters
• non-alphanumeric characters
Cont. . .
▪ Information:
• is processed data on which decisions are based and
• conveys a complete meaning.
▪ Information is interpreted data, created from:
• organized
• structured and
• processed data in a particular context.
Data Processing Cycle
▪ Data processing is the restructuring or reordering of data in order:
• to increase its usefulness
• to add value
• to avoid ambiguity
• to deal with complexity and
• to represent it better.
▪ The data processing cycle has three main steps:
Input → Process → Output

Cont. . .
▪ Input
• is data prepared in a convenient form for further processing.
• The form depends on the purpose of the processing and on the processing machine.
• When a computer is used, the input can be:
  • obtained directly from users via input devices.
  • fetched from a hard disk, CD, flash disk, etc.
▪ Process
• In this step the input data is processed into a more useful form.
• In an electronic computer, software (an application) performs the processing.
▪ Output
• The result of the processing is produced as output.
• The output of a particular process may be the final information required, or it may be used as input to another process.
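The three steps of the cycle can be illustrated with a minimal Python sketch. The exam scores and the function names (`read_input`, `process`, `write_output`) are hypothetical and chosen only for illustration; they do not come from the slides.

```python
# Input -> Process -> Output, using made-up exam scores.

def read_input():
    """Input: data in a convenient form for processing (here, a hard-coded list)."""
    return [72, 85, 90, 64]

def process(scores):
    """Process: turn the raw scores into something more useful (their average)."""
    return sum(scores) / len(scores)

def write_output(average):
    """Output: present the result; it could also feed another process."""
    return f"Average score: {average:.2f}"

raw = read_input()
result = process(raw)
print(write_output(result))
```

The value returned by `process` could equally be passed to another function, which mirrors the point that the output of one process may become the input of the next.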
Data: types and representations
▪ Data types can be described from different perspectives.
• From computer programming: a data type is an attribute of data that tells the compiler or interpreter how the data is intended to be used.
• From data analytics: a data type simply describes the form in which the data exists.
▪ Data types from the computer programming perspective include:
• Integers: to store whole numbers.
• Booleans: to store true/false states.
• Characters: to store a single character.
• Alphanumeric strings: to store a combination of characters and numbers.
▪ Data types from the data analytics perspective are:
• Structured: obeys a pre-defined data model and is therefore straightforward to interpret. E.g. tabular data.
• Semi-structured (self-describing structure): a form of structured data that does not conform to the formal structure of a data model but instead contains tags or other markers for expressing semantic relations. E.g. XML.
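Both perspectives can be sketched in a few lines of Python. The student values, the CSV text and the XML snippet below are invented for illustration; only standard-library modules (`csv`, `io`, `xml.etree.ElementTree`) are used.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Programming perspective: the type tells the interpreter how a value is used.
age = 21                 # integer: whole number
is_enrolled = True       # Boolean: true/false state
grade = "A"              # single character (Python has no char type; this is a 1-character string)
student_id = "CS2023A"   # alphanumeric string

# Analytics perspective, structured: tabular data with a fixed set of columns.
table_text = "name,age\nAbebe,21\nSara,23\n"
rows = list(csv.DictReader(io.StringIO(table_text)))

# Analytics perspective, semi-structured: tags describe the data.
xml_text = "<students><student name='Abebe'><age>21</age></student></students>"
root = ET.fromstring(xml_text)
first = root.find("student")

print(type(age).__name__, type(is_enrolled).__name__, type(grade).__name__)
print(rows[0]["name"], rows[1]["age"])
print(first.get("name"), first.find("age").text)
```

Note that the CSV reader returns every field as a string, so the structure of the table is fixed but the values still need converting before numeric use.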
Cont. . .
▪ From a technical point of view: metadata
• Metadata is not a separate data structure.
• It provides additional information about a specific set of data.
• Metadata is data about data.
• E.g. for a photograph, the size, location, time, etc. are metadata.
• Metadata is widely used in the semantic web, big data, etc.
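As a small illustration of "data about data", the sketch below reads the metadata that the file system keeps for a file. It uses a temporary text file rather than a photograph so that it is self-contained; the file content and the dictionary keys are assumptions made for the example.

```python
import os
import tempfile
import time

# Create a small temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("hello metadata")
    path = f.name

info = os.stat(path)  # metadata kept by the file system, not the content itself
metadata = {
    "size_bytes": info.st_size,
    "modified": time.ctime(info.st_mtime),
}
print(metadata)

os.remove(path)  # clean up the temporary file
```

The size and modification time describe the file without being part of its content, which is the distinction the slide draws.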
Data Value Chain
▪ Big data is a set of strategies and technologies required to:
• gather
• organize
• process and
• gather insights from large datasets.
▪ The data value chain describes the flow of information within a big data system.
Data acquisition
▪ Data acquisition is the process of:
• gathering,
• filtering and
• cleaning data before it is put into a data warehouse or processed further.
▪ Data acquisition is a major challenge in big data.
▪ It is a challenge because the infrastructure:
• should support low, predictable latency in capturing data and executing queries.
• should support dynamic and flexible data structures.
• should handle very high transaction volumes.
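The gathering, filtering and cleaning steps can be sketched on a toy scale as follows. The sensor readings and the helper names (`gather`, `is_valid`, `clean`) are hypothetical; a real acquisition pipeline would read from live sources and handle far larger volumes.

```python
# Hypothetical raw sensor readings gathered from two sources.
source_a = [{"id": 1, "temp": " 21.5 "}, {"id": 2, "temp": "n/a"}]
source_b = [{"id": 3, "temp": "19.0"}, {"id": 4, "temp": ""}]

def gather(*sources):
    """Gathering: combine records from all sources into one list."""
    return [record for source in sources for record in source]

def is_valid(record):
    """Filtering: keep only records whose temperature parses as a number."""
    try:
        float(record["temp"])
        return True
    except ValueError:
        return False

def clean(record):
    """Cleaning: strip whitespace and convert the temperature to a float."""
    return {"id": record["id"], "temp": float(record["temp"].strip())}

acquired = [clean(r) for r in gather(source_a, source_b) if is_valid(r)]
print(acquired)
```

Only the cleaned records in `acquired` would then be loaded into a warehouse or passed on for further processing.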
Data analysis
▪ Data analysis involves:
• exploring
• transforming and
• modeling data in order to make the raw data amenable to use in decision making.
▪ The goals of data analysis are:
• highlighting relevant data
• synthesizing and
• extracting useful hidden information.
▪ Areas related to data analysis include:
• Data mining
• Business intelligence and
• Machine learning
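A minimal sketch of exploring, transforming and modeling, assuming a made-up dataset of study hours and exam scores. The model is an ordinary least-squares straight line, chosen here only as the simplest possible example of "modeling".

```python
import statistics

# Hypothetical raw data: hours studied and exam scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 65, 70, 78]

# Exploring: summarize the raw data.
mean_score = statistics.mean(scores)

# Transforming: centre both variables on their means.
mean_hours = statistics.mean(hours)
dx = [h - mean_hours for h in hours]
dy = [s - mean_score for s in scores]

# Modeling: fit a straight line score = slope * hours + intercept (least squares).
slope = sum(x * y for x, y in zip(dx, dy)) / sum(x * x for x in dx)
intercept = mean_score - slope * mean_hours

print(f"mean score = {mean_score}, slope = {slope}, intercept = {intercept}")
```

The fitted slope is the kind of "hidden information" the slide refers to: it summarizes how the score changes per extra hour of study in this invented sample.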
Data curation
▪ Data curation refers to the active management of data to ensure its quality.
▪ Data curation includes activities such as:
• content creation
• selection
• classification
• transformation
• validation and
• preservation of data.
▪ Data curation is done by data curators.
▪ Data curators are responsible for improving the accessibility and quality of the data.
▪ The goals of data curation are:
• ensuring trustworthiness
• making data discoverable
• easing accessibility
• improving data reusability and
• making data fit for its purpose.
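Two of the curation activities listed above, validation and classification, can be sketched as simple rules applied to records. The records, the validation rule and the category labels are all assumptions made for this example.

```python
# Hypothetical records awaiting curation.
records = [
    {"title": "Rainfall 2020", "year": 2020, "format": "csv"},
    {"title": "", "year": 2019, "format": "pdf"},
    {"title": "Census summary", "year": None, "format": "csv"},
]

def validate(record):
    """Validation: a record needs a non-empty title and a known year."""
    return bool(record["title"]) and record["year"] is not None

def classify(record):
    """Classification: label the record by its format (illustrative rule)."""
    return "tabular" if record["format"] == "csv" else "document"

curated = [dict(r, category=classify(r)) for r in records if validate(r)]
rejected = [r for r in records if not validate(r)]
print(len(curated), "curated,", len(rejected), "rejected")
```

In practice a curator would repair or document the rejected records rather than simply drop them.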
Data storage
▪ Data storage
• is the persistence and management of data in a scalable way.
• It gives applications fast access to the data.
▪ Relational Database Management Systems (RDBMS):
• RDBMS have been the main solution for data storage for almost 40 years.
• RDBMS guarantee the ACID properties (atomicity, consistency, isolation and durability).
• Systems built around ACID guarantees lack flexibility with regard to schema changes, fault tolerance, and growth in data volume and complexity.
• This lack of flexibility makes RDBMS less suitable for big data scenarios.
▪ NoSQL data storage technologies were designed as an alternative data model to support flexibility and scalability in data storage.
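Atomicity, the first of the ACID properties, can be demonstrated with Python's built-in `sqlite3` module. The `accounts` table, the names and the `transfer` helper are hypothetical; the sketch only shows that a failed transaction leaves no partial changes behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; either both updates happen or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # would make alice's balance negative
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the second call the credit to `bob` is executed first and then undone when the debit violates the `CHECK` constraint, so the balances after the failed transfer are the same as before it.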
Data usage
▪ Data usage covers the data-driven business activities that need:
• access to data
• its analysis and
• the tools needed to integrate the data analysis within the business activity.
▪ Data usage in business decision making can enhance competitiveness through:
• reduction of costs
• increased added value, or
• any other parameter that can be measured against existing performance criteria.
Big data: Basic concepts
▪ Big data is a set of strategies and technologies required to:
• gather
• organize
• process and
• gather insights from large datasets.
▪ Why big data? Big data arises because:
• the volume of data has increased drastically over time.
• the data sets in organizations become so large that it is difficult (almost impossible) to process them using:
  ✓ on-hand database management tools or
  ✓ traditional data processing applications.
Cont. . .
▪ Big data is characterized by the 3Vs and more:
• Volume: large amounts of data (zettabytes, massive datasets).
• Velocity: data is live-streaming or in motion.
• Variety: data comes in many different forms from diverse sources.
• Veracity: can we trust the data? How accurate is it?
Clustered computing and Hadoop
▪ Clustered computing:
• Because of the scale of big data, individual computers are often inadequate for processing it.
• Computer clusters therefore emerged to address the high computational and storage needs of big data.
• Big data clustering software combines the resources of many smaller machines.
▪ Advantages of clustered computing:
• Resource pooling: combining the available storage space, CPU and memory to process large datasets.
• High availability: clusters provide a fault-tolerant and robust computing environment, which increases availability.
• Easy scalability: a cluster can easily be scaled horizontally by adding machines to it.
Cont. . .
▪ Clustered computing requires:
• managing cluster membership
• coordinating resource sharing and
• scheduling actual work on individual nodes.
▪ Cluster membership and resource allocation can be handled by software like Hadoop's YARN.
▪ YARN stands for "Yet Another Resource Negotiator".
Hadoop and Its ecosystem
▪ Hadoop is an open-source framework.
▪ It is designed to make interaction with big data easier.
▪ Hadoop allows distributed, parallel processing of large datasets across clusters of computers.
▪ Hadoop has four key characteristics:
• Economical: it is economical because ordinary computers can be used for extensive computation.
• Reliable: it stores copies of the data on different machines.
• Scalable: it can be scaled simply by adding machines to the cluster.
• Flexible: it can store as much structured and unstructured data as needed.
▪ Hadoop has four key components:
• Data management
• Data access
• Data process
•
Hadoop ecosystem
▪ The Hadoop ecosystem evolved from the four components mentioned on the previous slide.
▪ Generally, the Hadoop ecosystem consists of:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming-based data processing
• Spark: In-memory data processing
• PIG, HIVE: Query-based processing of data services
• HBase: NoSQL database
• Mahout, Spark MLlib: Machine learning algorithm libraries
• Solr, Lucene: Searching and indexing
• Zookeeper: Managing cluster
• Oozie: Job scheduling
Big data lifecycle: with Hadoop
1. Ingesting data into the system
• Data ingestion is the first phase in big data processing.
• The data is transferred to Hadoop from different sources such as local files, databases or other systems.
• Sqoop transfers data from an RDBMS to Hadoop, whereas Flume transfers event data.
2. Processing the data in the storage
• Processing the stored data is the second phase.
• In this phase the data is stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase.
• Data processing is then done by MapReduce and Spark.
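The MapReduce programming model mentioned above can be illustrated with a word count written in plain Python. This is a single-machine sketch of the map, shuffle and reduce steps only; it does not use the Hadoop API, and the sample documents and function names are invented for the example.

```python
from collections import defaultdict

documents = ["big data needs big clusters", "hadoop stores big data"]

def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key (word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

On a real cluster the map calls would run in parallel on the nodes that hold each block of data, and the framework would perform the shuffle across machines before the reduce step.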
Cont. . .
3. Computing and analyzing data
• Computing and analysis is the third phase in the big data processing lifecycle.
• Data is analyzed by processing frameworks such as Pig, Hive, and Impala.
• Pig converts the data using map and reduce operations and then analyzes it.
• Hive is also based on map and reduce programming and is most suitable for structured data.
4. Visualizing the results
• Accessing (or visualizing) the results is the fourth phase.
• In this stage, the analyzed data can be accessed by users.
• Access is provided by tools such as Hue and Cloudera Search.
End of Chapter-Two
Thank you!
Reading assignment: List AI applications that you have encountered in your life.
