Chapter - 2 Data Sciences
Data Science
Chapter Outline
✓ An Overview of Data Science
Overview of Data Science
❖ What are data science, data, information, and big data?
Data science is:-
✓ As an academic discipline and profession, data science continues to
evolve as one of the most promising and in-demand career paths for
skilled professionals.
III. output.
Question
1. Define the above terms (input, processing, and output) and discuss the
main differences between data and information, with examples.
Input - in this step, the input data is prepared in some convenient form
for processing. The form will depend on the processing machine.
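The input, processing, and output steps can be illustrated with a minimal sketch. Only the input step is described here; the processing and output functions below, the sample scores, and all function names are assumptions chosen for illustration.

```python
# Hypothetical raw data: exam scores captured as text.
raw_scores = ["78", "85", "92", "61"]


def prepare_input(raw):
    """Input step: put the raw data into a form convenient for processing."""
    return [int(value) for value in raw]


def process(scores):
    """Processing step (assumed): turn the prepared data into a summary."""
    return {"count": len(scores), "average": sum(scores) / len(scores)}


def output(summary):
    """Output step (assumed): present the result as readable information."""
    return f"{summary['count']} students, average score {summary['average']:.1f}"


report = output(process(prepare_input(raw_scores)))
print(report)  # 4 students, average score 79.0
```

The list of strings is data; the sentence in `report` is information, because it has been processed into a form a person can act on.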
Data types and their representation
Data types can be described from diverse perspectives. Some of these
perspectives are:-
There are three common types of data, or data structures. These are:
➢ e.g. Excel files or SQL databases (which have structured rows and
columns that can be sorted).
II. Semi-structured Data:- is a form of structured data that does not
conform to the formal structure of data models associated with relational
databases or other forms of data tables.
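JSON is a widely used semi-structured format, and a short sketch can show why such data does not fit a fixed table. The records and field names below are hypothetical.

```python
import json

# Hypothetical records: each one labels its own fields, but the set of
# fields differs from record to record, so no single table schema fits.
raw = """
[
  {"name": "Abebe", "email": "abebe@example.com"},
  {"name": "Sara", "phones": ["0911000000", "0922000000"]},
  {"name": "Lemma", "address": {"city": "Addis Ababa"}}
]
"""

records = json.loads(raw)

# Every field name that appears in at least one record.
all_fields = sorted({key for record in records for key in record})

# Field names present in every record (what a fixed schema could rely on).
common_fields = sorted(set.intersection(*(set(record) for record in records)))

print(all_fields)     # ['address', 'email', 'name', 'phones']
print(common_fields)  # ['name']
```

Only `name` is shared by all three records, and two of the fields hold nested values (a list and an object), which a flat row-and-column table cannot store directly.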
✓ Do you think that understanding unstructured data with traditional
programs is as easy as working with data stored in structured databases?
e.g. audio and video files, or NoSQL databases.
3. Metadata:- is simply defined as data about data.
From a technical point of view, this is not a separate data structure, but
it is one of the most important elements for big data analysis and big
data solutions.
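As a small illustration of "data about data", the sketch below writes a temporary file and then reads what the file system records about it. The file content and the dictionary keys are assumptions made for the example.

```python
import os
import tempfile
from datetime import datetime, timezone

content = b"data about data"  # the data itself

# Write the data to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as handle:
    handle.write(content)
    path = handle.name

# The file system keeps metadata describing the file, separate from its content.
info = os.stat(path)
metadata = {
    "file_name": os.path.basename(path),
    "size_bytes": info.st_size,
    "modified_utc": datetime.fromtimestamp(
        info.st_mtime, tz=timezone.utc
    ).isoformat(),
}
print(metadata)

os.remove(path)  # clean up the temporary file
```

None of the values in `metadata` appear inside the file; they describe it, which is what makes them useful for locating and analysing data at scale.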
➢ The Big Data Value Chain identifies the following key high-level
activities:
Data Acquisition :- is the process of gathering, filtering, and cleaning
data before it is put in a data warehouse or any other storage solution on
which data analysis can be carried out.
Data Usage:- It covers the data-driven business activities that need access to
data, its analysis, and the tools needed to integrate the data analysis within
the business activity.
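The data acquisition activity defined above (gathering, filtering, and cleaning before storage) can be sketched in a few lines. The sensor records, the field names, and the `warehouse` list standing in for a storage solution are all hypothetical.

```python
def gather(sources):
    """Gather: combine raw records from every source into one list."""
    return [record for source in sources for record in source]


def keep(record):
    """Filter: keep only records that actually carry a reading."""
    return record.get("temp_c") not in (None, "")


def clean(record):
    """Clean: normalise the sensor name and convert the reading to a float."""
    return {
        "sensor": record["sensor"].strip().lower(),
        "temp_c": float(record["temp_c"]),
    }


def acquire(sources):
    """Run gather, filter, and clean in order, before any analysis."""
    return [clean(record) for record in gather(sources) if keep(record)]


# Hypothetical raw input from two sources, with missing and untidy values.
sensor_a = [{"sensor": " A1 ", "temp_c": "21.5"}, {"sensor": "A2", "temp_c": ""}]
sensor_b = [{"sensor": "b1", "temp_c": 19}, {"sensor": "B2", "temp_c": None}]

warehouse = acquire([sensor_a, sensor_b])  # stand-in for a data warehouse
print(warehouse)
```

Two of the four raw records are dropped by the filter, and the remaining two reach the store in a consistent shape.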
Basic Concepts of Big data
• Big data is the term for a collection of data sets so large and complex
that it becomes difficult to process using on-hand database management
tools or traditional data processing applications.
It is characterized by the 3 Vs, and more:
▪ Volume: large amounts of raw data (zettabytes)
▪ Velocity: data is live-streaming or in motion and changes over time
▪ Variety: data comes in many different forms from diverse sources
▪ Veracity: can we trust the data? How accurate is it? Data quality
▪ Value: Information for Decision Making
Clustered Computing and Hadoop Ecosystem
Cluster Computing :
⁕ Resource Pooling:
⁕ High Availability:
⁕ Easy Scalability:
Cluster computing benefits
✓ Resource pooling: combining the available storage space, CPU, and
memory is extremely important.
✓ Processing large datasets requires large amounts of all three
resources (storage space, CPU, and memory).
✓ High availability: clusters provide varying levels of fault tolerance
and availability guarantees to prevent hardware and software failures
from affecting access to data and processing.
✓ This is increasingly important for real-time analytics of big data.
✓ Easy scalability: clusters make it easy to scale horizontally by adding
more machines to the group. The system can react to changes in resource
requirements without expanding the physical resources on a machine.
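Resource pooling and horizontal scaling come down to simple arithmetic, sketched below. The `Node` class, the machine sizes, and the storage target are hypothetical, and the sketch ignores replication and other overheads that a real cluster would have.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    storage_tb: float
    cpu_cores: int
    memory_gb: int


def pooled(nodes):
    """Resource pooling: the cluster offers the sum of its machines' resources."""
    return Node(
        storage_tb=sum(node.storage_tb for node in nodes),
        cpu_cores=sum(node.cpu_cores for node in nodes),
        memory_gb=sum(node.memory_gb for node in nodes),
    )


def nodes_to_add(nodes, required_storage_tb, per_node_storage_tb):
    """Horizontal scaling: how many extra machines cover a storage shortfall."""
    shortfall = required_storage_tb - pooled(nodes).storage_tb
    return max(0, math.ceil(shortfall / per_node_storage_tb))


# Three identical hypothetical machines: 4 TB, 8 cores, 32 GB each.
cluster = [Node(4, 8, 32) for _ in range(3)]
total = pooled(cluster)
extra = nodes_to_add(cluster, required_storage_tb=20, per_node_storage_tb=4)

print(total)  # Node(storage_tb=12, cpu_cores=24, memory_gb=96)
print(extra)  # 2
```

Growing from 12 TB to 20 TB is met by adding two more machines to the group, with no change to any existing machine.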
Hadoop and its Ecosystem
• Hadoop is an open-source framework intended to make interaction with
big data easier.
Big Data Life Cycle with Hadoop
There are different stages of Big Data processing.
II. Processing the data in storage:- the data is stored and processed.
✓ The data is stored in the distributed file system, HDFS, and in the
distributed NoSQL database, HBase. Spark and MapReduce perform the data
processing.
III. Computing and analyzing data:- data is analyzed by processing
frameworks such as Pig, Hive, and Impala.
✓ Pig converts the data using map and reduce operations and then analyzes it.
✓ Hive is also based on the map and reduce programming model and is most
suitable for structured data.
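The map and reduce idea that Pig and Hive build on can be sketched as a word count in plain Python. This runs in a single process and does not use Hadoop, Pig, or Hive; the function names and sample lines are assumptions for illustration only.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


# Hypothetical input; on a cluster each line could live on a different machine.
lines = ["big data is big", "data is everywhere"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Because each call to `map_phase` depends only on its own line and each call to `reduce_phase` only on its own key, both phases can be spread across many machines, which is what makes the model suitable for cluster processing.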