Chapter-2 Data Science
Data science is defined as a multi-disciplinary field that uses scientific methods, processes, algorithms, and
systems in order to extract knowledge and insights from structured, semi-structured and
unstructured data.
As an academic discipline and profession, data science continues to evolve as one of the most
promising and in-demand career paths for skilled professionals.
FCSE AMIT ARBAMINCH UNIVERSITY
cont. ..
Data professionals understand that they must advance past the traditional skills of
analyzing large amounts of data, data mining, and programming.
Data is represented with the help of characters such as letters (A-Z, a-z), digits (0-9),
or special characters (+, -, /, *, <, >, =, etc.).
cont. ..
Information is defined as:
Data that has been processed into a form that is meaningful to the recipient.
Or: interpreted data, created from organized, structured, and processed data in
a particular context.
I. input
Question
1. Define the above terms (input, processing, and output) and discuss the main differences
between data and information, with examples.
Cont..
Input - in this step, the input data is prepared in some convenient form for processing. The
form will depend on the processing machine.
Processing - in this step, the input data is changed to produce data in a more useful form. For
example, interest can be calculated on deposit to a bank, or a summary of sales for the month
can be calculated from the sales orders.
Output - at this stage, the result of the preceding processing step is collected. The particular
form of the output data depends on the use of the data. For example, output data may be
payroll for employees.
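The three steps above can be sketched in Python using the bank-interest example. This is only an illustrative sketch: the function names, the 5% rate, and the sample deposits are assumptions made for the example and do not come from the slides.

```python
def prepare_input(raw_deposits):
    """Input: convert raw text entries into numbers ready for processing."""
    return [float(value.strip()) for value in raw_deposits]


def process(deposits, annual_rate=0.05):
    """Processing: compute one year of simple interest on each deposit.

    The 5% rate is an assumed value chosen only for illustration.
    """
    return [round(amount * annual_rate, 2) for amount in deposits]


def produce_output(deposits, interests):
    """Output: collect the results in a form suited to the reader."""
    return [
        f"Deposit {amount:.2f} earns {interest:.2f}"
        for amount, interest in zip(deposits, interests)
    ]


raw = [" 1000", "250 "]  # hypothetical raw input, as typed by a clerk
deposits = prepare_input(raw)
interests = process(deposits)
report = produce_output(deposits, interests)
for line in report:
    print(line)
```

The raw strings are data; the printed report lines are information, because they have been processed into a form that is meaningful to the recipient.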
For example, in a set of photographs, metadata describes when and where the photos were
taken. It also provides fields for dates and locations which, by themselves, can
be considered structured data.
For this reason, metadata is frequently used by Big Data solutions for
initial analysis.
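A minimal Python sketch of the photograph example: the photo contents are unstructured, but the date and location fields can be queried directly. The file names, dates, and locations are hypothetical values invented for this illustration.

```python
from datetime import date

# Hypothetical photo records: the image bytes are unstructured,
# while the metadata fields (date, location) are structured.
photos = [
    {"file": "img_001.jpg", "taken_on": date(2023, 1, 15), "location": "Arba Minch"},
    {"file": "img_002.jpg", "taken_on": date(2023, 3, 2), "location": "Addis Ababa"},
    {"file": "img_003.jpg", "taken_on": date(2023, 3, 20), "location": "Arba Minch"},
]


def photos_taken_in(records, location):
    """Select photos using only the metadata, without opening any image."""
    return [r["file"] for r in records if r["location"] == location]


def photos_taken_between(records, start, end):
    """Filter photos by the structured date field (inclusive range)."""
    return [r["file"] for r in records if start <= r["taken_on"] <= end]


arba_minch_photos = photos_taken_in(photos, "Arba Minch")
march_photos = photos_taken_between(photos, date(2023, 3, 1), date(2023, 3, 31))
print(arba_minch_photos)
print(march_photos)
```

Both queries run without reading any image, which is why metadata is a convenient starting point for initial analysis.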
Compare metadata with structured, unstructured, and semi-structured data.
Data Value Chain
• It is introduced to describe the information flow within a big data system as a
series of steps needed to generate value and useful insights from data.
• The Big Data Value Chain identifies the following key high-level activities:
Data Acquisition is one of the major big data challenges in terms of infrastructure requirements. Why?
Explain.
Data Analysis :- It is concerned with making the raw data acquired amenable to use in decision-making as
well as domain-specific usage.
It involves exploring, transforming, and modeling data with the goal of highlighting
relevant data, and synthesizing and extracting useful hidden information with high potential
from a business point of view.
cont. ..
Data Curation :- It is the active management of data over its life cycle to ensure it
meets the necessary data quality requirements for its effective usage.
It includes activities such as content creation, selection, classification,
transformation, validation, and preservation.
Data Storage :- is the persistence and management of data in a scalable way that
satisfies the needs of applications that require fast access to the data.
Data Usage:- It covers the data-driven business activities that need access to data, its
analysis, and the tools needed to integrate the data analysis within the business activity.
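As a rough illustration of two of the curation activities listed above (validation and transformation), the following Python sketch filters and normalizes a few records. The record fields and the validation rules are hypothetical, chosen only to make the idea concrete.

```python
def is_valid(record):
    """Validation: keep records with a non-empty name and a non-negative amount."""
    return bool(record.get("name", "").strip()) and record.get("amount", -1) >= 0


def transform(record):
    """Transformation: normalize the name so that records are consistent."""
    return {"name": record["name"].strip().title(), "amount": record["amount"]}


def curate(records):
    """Apply validation, then transformation, and return the curated records."""
    return [transform(r) for r in records if is_valid(r)]


raw_records = [
    {"name": "  abebe ", "amount": 120},
    {"name": "", "amount": 50},       # rejected: missing name
    {"name": "sara", "amount": -5},   # rejected: negative amount
]
curated = curate(raw_records)
print(curated)
```

Only the curated records would then be passed on to storage and usage.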
Velocity: Data is live streaming or in motion
Figure 2.4 Characteristics of big data
Cluster computing:
• Big data clustering software combines the resources of many smaller machines,
seeking to provide a number of benefits:
Resource Pooling:
High Availability:
Easy Scalability:
Hive is also based on the map and reduce (MapReduce) programming model and is most suitable for
structured data.
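The map and reduce idea can be illustrated with a small word count in plain Python. This sketch does not use Hive or any Hadoop API; the function names and the sample lines are assumptions made for the example, and a real cluster would run the map and reduce phases in parallel across many machines.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


lines = ["big data is big", "data is everywhere"]  # hypothetical input
word_counts = reduce_phase(shuffle(map_phase(lines)))
print(word_counts)
```

Each phase touches only its own keys or lines, which is what allows the work to be split across the machines of a cluster.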
iv. Visualizing the results:- This is performed by tools such as Hue and Cloudera Search.
In this stage, users can access the analyzed data.