Chapter: Data Science
What are Data and Information?
Data is represented with the help of characters such as letters (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
cont. ..
Information is defined as:
the processed data on which decisions and actions are based;
data that has been processed into a form that is meaningful to the recipient; or
interpreted data, created from organized, structured, and processed data in a particular context.
Data Processing Cycle
• Data processing is the restructuring or reordering of data by people or machines to increase its usefulness and add value for a particular purpose.
• The basic steps of data processing are:
I. Input
II. Processing
III. Output
[Figure: Data → Processing → Output]
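A minimal sketch of the cycle in Python (the exam scores are hypothetical): raw characters come in as input, are restructured during processing, and come out as information.

```python
# Input: raw data captured as characters/digits (hypothetical exam scores)
raw_scores = ["70", "85", "90"]

# Processing: restructure the character data into integers and summarize
scores = [int(s) for s in raw_scores]
average = sum(scores) / len(scores)

# Output: information in a form meaningful to the recipient
print(f"Average score: {average:.1f}")   # -> Average score: 81.7
```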
Question
1. Define the above terms (input, processing, and output) and discuss the main differences between data and information, with examples.
Data types and their representation
• Data types can be described from diverse perspectives. Some of these perspectives are:
i. Data types from a computer programming perspective: different languages may use different terminology for the notion of data types. Common data types include:
• Integers (int): used to store whole numbers, mathematically known as integers
• Booleans (bool): used to represent values restricted to one of two options: true or false
• Characters (char): used to store a single character
• Floating-point numbers (float): used to store real numbers
• Alphanumeric strings (string): used to store a combination of characters and numbers
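As an illustration, here is how these common data types look in Python (note that Python has no separate char type; a single character is just a one-character string):

```python
count: int = 42            # integer: a whole number
is_ready: bool = True      # boolean: restricted to True or False
initial: str = "A"         # character: Python uses a 1-character string
price: float = 19.99       # floating-point: a real number
user_id: str = "user42"    # alphanumeric string: characters and digits combined

print(type(count), type(is_ready), type(initial), type(price), type(user_id))
```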
cont. ..
ii. Data types from a data analytics perspective: from an analytics perspective, there are three common data types or structures:
I. Structured data: data that adheres to a pre-defined data model and is therefore straightforward to analyze.
It conforms to a tabular format, with relationships between the different rows and columns.
e.g., Excel files or SQL databases (which have structured rows and columns that can be sorted)
II. Semi-structured data: a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables.
It is also known as a self-describing structure. (Why?)
e.g., JSON, XML, sensor data
III. Unstructured data: information that either does not have a pre-defined data model or is not organized in a pre-defined manner.
cont. ..
Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well.
Do you think that understanding unstructured data with traditional programs is easy, compared to data stored in structured databases?
e.g., audio files, video files, or NoSQL databases
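A small Python sketch contrasting the three structures (the student records are hypothetical; sqlite3 and json are standard-library modules):

```python
import json
import sqlite3

# Structured: a pre-defined tabular model with rows and columns
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER, name TEXT)")
conn.execute("INSERT INTO students VALUES (1, 'Abebe')")
print(conn.execute("SELECT * FROM students").fetchall())

# Semi-structured: JSON is self-describing -- each value carries its
# field name, but there is no fixed relational schema
record = json.loads('{"id": 1, "name": "Abebe", "courses": ["math", "cs"]}')
print(record["courses"])

# Unstructured: free text with no pre-defined model; a program must
# infer dates, names, and facts from the raw content itself
note = "Met Abebe on 2024-05-01 to discuss the math project."
print(len(note.split()), "words of raw text")
```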
cont. ..
iii. Metadata: simply defined as data about data.
From a technical point of view, this is not a separate data structure, but it is one of the most important elements for Big Data analysis and Big Data solutions.
It provides additional information about a specific set of data.
For example, in a set of photographs, metadata describes when and where the photos were taken; it provides fields for dates and locations which, by themselves, can be considered structured data.
For this reason, metadata is frequently used by Big Data solutions for initial analysis.
Compare metadata with structured, unstructured, and semi-structured data.
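An illustrative sketch (the photo bytes and field names are hypothetical): the image bytes are the data itself, while the dictionary holds data about that data.

```python
# The data itself: raw (unstructured) image bytes -- placeholder content here
photo_bytes = b"\x89PNG..."

# Metadata: data about the data; its fields (dates, locations) are themselves
# structured, which is why Big Data solutions often analyze metadata first
photo_metadata = {
    "taken_at": "2023-09-14T10:32:00",
    "location": "Addis Ababa",
    "camera": "ExampleCam X100",   # hypothetical camera model
}

print(photo_metadata["taken_at"], photo_metadata["location"])
```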
Data value Chain
• It is introduced to describe the information flow within a big data system as a
series of steps needed to generate value and useful insights from data.
• The Big Data Value Chain identifies the following key high-level activities:
cont. ..
Data Acquisition: the process of gathering, filtering, and cleaning data before it is put in a data warehouse or any other storage solution on which data analysis can be carried out.
It is one of the major big data challenges in terms of infrastructure requirements. Why? Explain.
Data Analysis: making the acquired raw data amenable to use in decision-making as well as domain-specific usage.
It involves exploring, transforming, and modeling data with the goal of highlighting relevant data, and synthesizing and extracting useful hidden information with high potential from a business point of view (see the sketch below).
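A minimal sketch of this exploring/transforming step, assuming the pandas library and hypothetical sales records:

```python
import pandas as pd

# Hypothetical raw records standing in for acquired data
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "amount": [120.0, 80.0, 150.0, 95.0],
})

# Transform and model: aggregate per region to highlight relevant patterns
summary = sales.groupby("region")["amount"].agg(["sum", "mean"])
print(summary)
```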
cont. ..
Data Curation: the active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage.
It involves different activities such as content creation, selection, classification, transformation, validation, and preservation.
Clustered Computing and Hadoop Ecosystem
Clustered computing:
• Big data clustering software combines the resources of many smaller machines, seeking to provide a number of benefits:
Resource Pooling: combining the available storage space, CPU, and memory of the member machines into a single pool that can be shared across jobs.
High Availability: clusters can provide varying levels of fault tolerance, so that hardware or software failures do not prevent access to data.
Easy Scalability: clusters make it easy to scale horizontally by adding more machines to the group rather than upgrading a single machine.
Hadoop and its Ecosystem
• Hadoop is an open-source framework intended to make interaction with big data easier.
It allows for the distributed processing of large datasets across clusters of computers using simple programming models.
Key characteristics of Hadoop are that it is economical, reliable, scalable, and flexible.
Big Data Life Cycle with Hadoop
• There are different stages of Big Data processing; some of them are:
I. Ingesting data into the system: data is ingested, i.e., transferred to Hadoop from various sources such as relational databases, systems, or local files.
Sqoop transfers data from an RDBMS to HDFS, whereas Flume transfers event data.
II. Processing the data in storage: the data is stored and processed.
The data is stored in the distributed file system HDFS and in the distributed NoSQL database HBase. Spark and MapReduce perform the data processing (a sketch of the map/reduce model follows this list).
cont. ..
III. Computing and analyzing data: data is analyzed by processing frameworks such as Pig, Hive, and Impala.
Pig converts the data using map and reduce and then analyzes it.
Hive is also based on map-and-reduce programming and is most suitable for structured data.
IV. Visualizing the results: performed by tools such as Hue and Cloudera Search.
In this stage, the analyzed data can be accessed by users.
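To make the map-and-reduce idea concrete, here is a minimal local simulation of the programming model in plain Python (a word count; this illustrates the model only and does not run on a Hadoop cluster):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (key, value) pair for every word -- here (word, 1)
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def reduce_phase(pairs):
    # Reduce: sum the values for each key to get per-word totals
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

lines = ["big data needs big clusters", "hadoop processes big data"]
print(reduce_phase(map_phase(lines)))
# -> {'big': 3, 'data': 2, 'needs': 1, 'clusters': 1, 'hadoop': 1, 'processes': 1}
```

On a real cluster, the framework runs many mapper and reducer tasks in parallel across machines and shuffles the intermediate (key, value) pairs between them; the programming model, however, is the same as in this sketch.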