
Chapter II

Data Science

Chapter Outline
✓ An Overview of Data Science

✓ What are data and information

✓ Data types and their representation

✓ Data Value Chain

✓ Basic Concepts of Big Data

Overview of Data Science
❖ What are data science, data, information, and big data?
Data science is:-

❖ a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.

❖ the science that uses computer science, statistics, machine learning, visualization, and human-computer interaction to collect, clean, integrate, analyze, and visualize data, and to interact with data to create data products.

Cntd…
✓ As an academic discipline and profession, data science continues to evolve as one of the most promising and in-demand career paths for skilled professionals.

✓ Data professionals understand that they must advance beyond the traditional skills of analyzing large amounts of data, data mining, and programming.

✓ Data scientists need to be curious and result-oriented, with exceptional industry-specific knowledge and communication skills that allow them to explain highly technical results to their non-technical counterparts.
What are data and information?
⁎ Data can be defined as a representation of facts, figures, concepts, or instructions in a formalized manner, suitable for communication, interpretation, or processing by humans or electronic machines. It can be described as unprocessed facts and figures, represented with the help of characters such as alphabets (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
➢ Information is defined as processed data on which decisions and actions are based.
⁎ It is data that has been processed into a form that is meaningful to the recipient; interpreted data, created from organized, structured, and processed data in a particular context.
Data Processing Cycle
• Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose.

• The basic steps of data processing are:

I. input

II. processing and

III. output.

Question

1. Define the above terms (input, processing, and output) and discuss the main differences between data and information, with examples.
Cntd…
 Input - in this step, the input data is prepared in some convenient form for processing. The form will depend on the processing machine.

 Processing - in this step, the input data is changed to produce data in a more useful form. For example, interest can be calculated on deposits to a bank, or a summary of sales for the month can be calculated from the sales orders.

 Output - at this stage, the result of the preceding processing step is collected. The particular form of the output data depends on the use of the data. For example, output data may be payroll for employees.
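As a minimal sketch, the three steps can be illustrated in Python using the deposit-interest example above (the deposit figures and interest rate are hypothetical illustration values):

    deposits = [1000.0, 2500.0, 400.0]   # input: raw deposit figures
    annual_rate = 0.05                   # input: assumed interest rate

    # processing: transform the raw figures into a more useful form
    interest = [amount * annual_rate for amount in deposits]

    # output: collect the result in a form suited to its use
    for amount, earned in zip(deposits, interest):
        print(f"Deposit {amount:8.2f} earns {earned:6.2f} interest")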

Data types and their representation
Data types can be described from diverse perspectives. Some of these perspectives are:

1. Data types from a computer programming perspective:- different languages may use different terminology for the notion of data types. Common data types include:

➢ Integers (int) - used to store whole numbers, mathematically known as integers.
➢ Booleans (bool) - used to represent a value restricted to one of two values: true or false.
➢ Characters (char) - used to store a single character.
➢ Floating-point numbers (float) - used to store real numbers.
➢ Alphanumeric strings (string) - used to store a combination of characters and numbers.
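A short Python sketch of these types (variable names are illustrative; note that Python has no separate char type, so a one-character string stands in for it):

    age = 25              # int: whole number
    is_active = True      # bool: one of two values, true or false
    grade = "A"           # char: a single character (1-character string)
    price = 19.99         # float: real number
    user_id = "abc123"    # string: combination of characters and numbers

    for value in (age, is_active, grade, price, user_id):
        print(type(value).__name__, value)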
… Cntd
2. Data types from a data analytics perspective: from an analytics perspective, there are three common types of data (or data structures):

I. Structured Data:- data that adheres to a pre-defined data model and is therefore straightforward to analyze.

➢ It conforms to a tabular format with relationships between the different rows and columns.

➢ e.g. Excel files or SQL databases (which have structured rows and columns that can be sorted).
Cntd…
II. Semi-structured Data:- a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables.

➢ It is also known as a self-describing structure. (why?)

➢ e.g. JSON, XML, sensor data.
III. Unstructured Data:- information that either does not have a predefined data model or is not organized in a pre-defined manner.

✓ Unstructured information is typically text-heavy but may contain data such as dates, numbers, and facts as well.

Cntd…
✓ Do you think understanding unstructured data with traditional programs is as easy as working with data stored in structured databases?
e.g. audio, video files or NoSQL databases.
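To illustrate why semi-structured data is called self-describing, the sketch below (in Python, with hypothetical field names and values) contrasts a JSON record, whose keys travel with the values, with a structured row that depends on a fixed, pre-defined schema:

    import json

    # Semi-structured: the record carries its own field names.
    record = '{"id": 1, "name": "Abebe", "tags": ["sensor", "log"]}'
    data = json.loads(record)
    print(data["name"])              # no external schema needed

    # Structured: meaning comes from a fixed, pre-defined column layout.
    columns = ("id", "name")
    row = (1, "Abebe")
    print(dict(zip(columns, row)))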

Cntd…
3. Metadata:- simply defined as data about data.

From a technical point of view, metadata is not a separate data structure, but it is one of the most important elements for Big Data analysis and Big Data solutions.

⁎ It provides additional information about a specific set of data.

E.g., in a set of photographs, metadata describes when and where the photos were taken. It provides fields for dates and locations which, by themselves, can be considered structured data.

⁎ For this reason, metadata is frequently used by Big Data solutions for initial analysis.
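A minimal sketch of the idea in Python (all field values are hypothetical): the photo file is the data, and the dictionary below is metadata describing it.

    # Metadata: data about data. The photo bytes are the data; these
    # fields describe the photo.
    photo_metadata = {
        "filename": "trip_001.jpg",
        "taken_on": "2020-01-15",     # when the photo was taken
        "location": "Addis Ababa",    # where the photo was taken
    }

    # The date and location fields are structured data in their own
    # right, which is why Big Data solutions can use metadata for
    # initial analysis.
    for field, value in photo_metadata.items():
        print(f"{field}: {value}")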
Data Value Chain
➢ The Data Value Chain describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.

➢ The Big Data Value Chain identifies the following key high-level activities, each discussed below:

Cntd …
 Data Acquisition:- the process of gathering, filtering, and cleaning data before it is put in a data warehouse or any other storage solution on which data analysis can be carried out.

✓ It is one of the major big data challenges in terms of infrastructure requirements. (why? explain) The infrastructure must, for example, capture very high volumes of data at low latency, often from distributed sources.

 Data Analysis:- making the raw data acquired amenable to use in decision-making as well as domain-specific usage.

✓ It involves exploring, transforming, and modeling data with the goal of highlighting relevant data, synthesizing and extracting useful hidden information with high potential from a business point of view.
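As a hedged sketch of these two activities, the Python snippet below (assuming the pandas library; the sales figures and column names are hypothetical) cleans raw records during acquisition and then summarizes them as a simple analysis step:

    import pandas as pd

    # Acquisition: gather and clean raw records before storage/analysis.
    raw = pd.DataFrame({
        "region": ["North", "South", "North", None],
        "sales":  [120.0, 95.5, None, 40.0],
    })
    clean = raw.dropna()    # filtering/cleaning step

    # Analysis: transform and model to highlight relevant information.
    summary = clean.groupby("region")["sales"].sum()
    print(summary)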
Cntd…
 Data Curation:- the active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage.

✓ It includes activities such as content creation, selection, classification, transformation, validation, and preservation.

 Data Storage:- the persistence and management of data in a scalable way that satisfies the needs of applications that require fast access to the data.

 Data Usage:- covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity.
Basic Concepts of Big Data
• Big data is the term for a collection of data sets so large and complex that they become difficult to process using on-hand database management tools or traditional data processing applications.
 It is characterized by 3Vs and more:
▪ Volume: large amounts of raw data (zettabytes)
▪ Velocity: data is live, streaming, or in motion, and changes over time
▪ Variety: data comes in many different forms from diverse sources
▪ Veracity: can we trust the data? How accurate is it? (data quality)
▪ Value: information for decision making
Cntd…
Clustered Computing and Hadoop Ecosystem
Cluster Computing:

❑ A form of computing in which a group of computers (often called nodes) are connected through a LAN (Local Area Network) so that they behave like a single machine.

• Big data clustering software combines the resources of many smaller machines, seeking to provide a number of benefits:

⁕ Resource Pooling
⁕ High Availability
⁕ Easy Scalability
Cluster computing benefits
✓ Resource pooling: combining the available storage space, CPU, and memory is extremely important.
✓ Processing large datasets requires large amounts of all three of these resources (storage space, CPU, and memory).
✓ High availability: clusters provide varying levels of fault tolerance and availability guarantees to prevent hardware and software failures from affecting access to data and processing.
✓ This is increasingly important for real-time analytics of big data.
✓ Easy scalability: clusters make it easy to scale horizontally by adding more machines to the group. The system can react to changes in resource requirements without expanding the physical resources on a machine.
Hadoop and its Ecosystem
• Hadoop is an open-source framework intended to make interaction with big data easier.

✓ It is a framework that allows for the distributed processing of large datasets across clusters of computers using simple programming models.

Characteristics of Hadoop:-

i. Economical:- its systems are highly economical, since ordinary (commodity) computers can be used for data processing.

ii. Reliable:- it stores copies of the data on different machines and is resistant to hardware failure.

iii. Scalable:- it is easily scalable both horizontally and vertically.

iv. Flexible:- you can store as much structured and unstructured data as you need.
Cntd…
Hadoop has an ecosystem that has evolved from its four core components:
1. Data management 2. Access 3. Processing 4. Storage
It comprises the following components and many others:
▪ HDFS: Hadoop Distributed File System
▪ YARN: Yet Another Resource Negotiator
▪ MapReduce: programming-based data processing
▪ Spark: in-memory data processing
▪ PIG, HIVE: query-based processing
▪ HBase: NoSQL database
▪ Mahout, Spark MLlib: machine learning algorithm libraries
▪ Solr, Lucene: searching and indexing
▪ Zookeeper: managing the cluster
▪ Oozie: job scheduling
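To give a feel for the MapReduce component, here is a hedged, simplified word count in plain Python: map emits (word, 1) pairs and reduce sums them per word. A real Hadoop job distributes the same logic across the cluster's nodes; this local sketch is not the Hadoop API itself.

    from collections import defaultdict

    def mapper(line):
        # map phase: emit a (word, 1) pair for every word in the line
        for word in line.split():
            yield (word.lower(), 1)

    def reducer(pairs):
        # reduce phase: sum the counts for each distinct word
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return dict(counts)

    lines = ["big data big clusters", "data value chain"]
    pairs = [pair for line in lines for pair in mapper(line)]
    print(reducer(pairs))    # e.g. {'big': 2, 'data': 2, 'clusters': 1, ...}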
Big Data Life Cycle with Hadoop
There are different stages of Big Data processing. Some of them are:

I. Ingesting data into the system:- data is ingested or transferred to Hadoop from various sources such as relational databases, systems, or local files. Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers event data.

II. Processing the data in storage:- the data is stored and processed.

✓ The data is stored in the distributed file system, HDFS, and the NoSQL distributed database, HBase. Spark and MapReduce perform the data processing.
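A hedged sketch of this stage using PySpark (assuming Spark is installed; the HDFS path, file, and column name are hypothetical):

    from pyspark.sql import SparkSession

    # Read data already stored in HDFS and process it with Spark.
    spark = SparkSession.builder.appName("lifecycle-demo").getOrCreate()
    df = spark.read.csv("hdfs:///data/sales.csv",
                        header=True, inferSchema=True)
    df.groupBy("region").count().show()   # distributed processing step
    spark.stop()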

Cntd…
III. Computing and analyzing data:- data is analyzed by processing frameworks such as Pig, Hive, and Impala.

✓ Pig converts the data using map and reduce and then analyzes it.
✓ Hive is also based on map and reduce programming and is most suitable for structured data.

IV. Visualizing the results:- performed by tools such as Hue and Cloudera Search.

✓ In this stage, the analyzed data can be accessed by users.
