
Chapter 2

Data Science
Contents:

• Overview of Data Science
• Data Types and Their Representation
• Data Value Chain
• Basic Concepts of Big Data


Overview of Data Science

• Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.
• Data science is much more than simply analyzing data.
• It offers a range of roles and requires a range of skills.


What are data and information?
• Data can be defined as a representation of facts, concepts, or instructions in a
formalized manner.
• It can be described as unprocessed facts and figures.
• It is represented with the help of:
• alphabets (A-Z, a-z),
• digits (0-9) or
• special characters (+, -, /, *, <,>, =, etc.)
• Information is the processed data on which decisions and actions are based.
• It is interpreted data: created from organized, structured, and processed data in a particular context.
Data Processing Cycle
• Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose.
• Data processing consists of three basic steps: input, processing, and output. A minimal sketch of this cycle is shown below.
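As a small illustration (not part of the original slides; the figures are made up), the Python sketch below takes raw figures as input, processes them into a summary, and outputs the result:

```python
# Minimal sketch of the data processing cycle: input -> processing -> output.

raw_data = ["23", "19", "31", "27"]           # Input: unprocessed facts and figures (as text)

numbers = [int(value) for value in raw_data]  # Processing: convert and structure the data
average = sum(numbers) / len(numbers)

print(f"Average value: {average}")            # Output: information that supports a decision
```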

Data types and their representation

1. Data types from a computer programming perspective

• Integers (int): used to store whole numbers, mathematically known as integers.
• Booleans (bool): used to represent a value restricted to one of two values: true or false.
• Characters (char): used to store a single character.
• Floating-point numbers (float): used to store real numbers.
• Alphanumeric strings (string): used to store a combination of characters and numbers.
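As a brief illustration (Python is assumed here purely for convenience; the slides do not prescribe a language), the sketch below shows how these basic types might appear in code. Note that Python has no separate char type, so a single character is stored as a one-character string:

```python
# Basic data types from a programming perspective (illustrative Python sketch).

age = 25                  # integer (int): a whole number
is_student = True         # boolean (bool): restricted to True or False
grade = "A"               # character: Python stores this as a one-character string
gpa = 3.75                # floating-point number (float): a real number
student_id = "ETS0123/11" # alphanumeric string (str): characters and digits combined

print(type(age), type(is_student), type(grade), type(gpa), type(student_id))
```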
2. Data types from a Data Analytics perspective

Structured Data
• Has a pre-defined data model.
• Straightforward to analyze.
• Placed in a tabular format.
• Example: Excel files or SQL databases.

Unstructured Data
• Has no pre-defined data model.
• May contain data such as dates, numbers, and facts.
• Difficult to understand using traditional programs.
• Example: audio and video files.

Semi-structured Data
• Contains tags or other markers to separate semantic elements.
• Known as a self-describing structure.
• Example: JSON and XML.
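As a short, hedged illustration of semi-structured data (not part of the original slides; the field names are invented for the example), the sketch below parses a small JSON document in Python. The tags (keys) describe the values, which is why such data is called self-describing:

```python
import json

# A small JSON document: the keys act as markers that separate semantic elements.
record = '''
{
  "student": "Abebe",
  "courses": ["Data Science", "AI"],
  "enrolled": true
}
'''

data = json.loads(record)      # parse the JSON text into Python objects
print(data["student"])         # Abebe
print(len(data["courses"]))    # 2
```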
Metadata
• Metadata is not a separate data structure, but it is one of the most important elements for Big Data analysis and solutions.
• It is often described as "data about data".
• In a set of photographs, for example, metadata could describe when and where the photos were taken.
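A minimal sketch of the idea (the field names and values are hypothetical, not from the slides): the metadata below describes a photograph without being the photograph itself.

```python
# Metadata is "data about data": it describes the photo, not its pixel content.
photo_metadata = {
    "file_name": "photo_001.jpg",    # which data object this metadata describes
    "taken_at": "2021-05-14 09:30",  # when the photo was taken
    "location": "Addis Ababa",       # where it was taken
    "camera": "Phone camera",
    "resolution": "4032x3024",
}

for key, value in photo_metadata.items():
    print(f"{key}: {value}")
```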
Data Value Chain

• The Data Value Chain is introduced to describe the information flow within a big data system.
• It describes the full data lifecycle, from collection to analysis and usage.
• The Big Data Value Chain identifies the following key high-level activities: data acquisition, data analysis, data curation, data storage, and data usage.
Basic concepts of big data

• Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
• According to IBM, big data is characterized by the 3Vs and more:
• Volume (amount of data): dealing with large scales of data within data processing (e.g. global supply chains, global financial analysis, the Large Hadron Collider).
• Velocity (speed of data): dealing with streams of high-frequency, incoming real-time data (e.g. sensors, pervasive environments, electronic trading, the Internet of Things).
• Variety (range of data types/sources): dealing with data in differing syntactic formats (e.g. spreadsheets, XML, DBMS), schemas, and meanings (e.g. enterprise data integration).
• Veracity: can we trust the data? How accurate is it? etc.

Clustered Computing and Hadoop Ecosystem

Clustered Computing

• Cluster computing refers to many computers connected on a network that perform like a single entity.
• Because of the qualities of big data, individual computers are often inadequate for handling the data at most stages.
• To better address the high storage and computational needs of big data, computer clusters are a better fit.
• Big data clustering software combines the resources of many smaller machines, seeking to provide a number of benefits:
• Suppose you have a big file with more than 500 MB of data and you need to count the number of words in it, but your computer has only 100 MB of memory. How can you handle it? (A sketch of this chunked, divide-and-combine approach appears after the list of benefits below.)
• Resource Pooling: Combining the available storage space to hold data
is a clear benefit, but CPU and memory pooling are also extremely
important.
• Processing large datasets requires large amounts of all three of these
resources.
• High Availability: Clusters can provide varying levels of fault
tolerance and availability guarantees to prevent hardware or software
failures from affecting access to data and processing.
• Easy Scalability: Clusters make it easy to scale horizontally by
adding additional machines to the group.
• Cluster membership and resource allocation can be handled by software
like Hadoop’s YARN (which stands for Yet Another Resource Negotiator).
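As a hedged illustration of the word-count question posed above (a plain-Python sketch of the chunked, MapReduce-style idea, not Hadoop itself; the file name is hypothetical), the file is read in pieces small enough to fit in memory and the partial counts are combined:

```python
from collections import Counter

def count_words(path, chunk_size=50 * 1024 * 1024):
    """Count words in a file too large to load at once by reading it in chunks."""
    totals = Counter()
    leftover = ""                              # a word may be split across chunk boundaries
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)         # "map" step: process one manageable piece
            if not chunk:
                break
            words = (leftover + chunk).split()
            if chunk[-1].isspace():
                leftover = ""
            else:
                leftover = words.pop() if words else ""
            totals.update(words)               # "reduce" step: merge the partial counts
    if leftover:
        totals.update([leftover])
    return totals

# counts = count_words("big_file.txt")         # hypothetical 500 MB file
# print(sum(counts.values()), "words in total")
```

In a real cluster, each chunk would be counted on a different machine and only the partial counts would be combined, which is exactly the kind of coordination that MapReduce automates.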

• Hadoop is an open-source framework intended to make interaction with big data easier.
• It is a framework that allows for the distributed processing of large datasets across clusters of computers using simple programming models.
The four key characteristics of Hadoop are:

• Economical: Its systems are highly economical, as ordinary computers can be used for data processing.
• Reliable: It is reliable, as it stores copies of the data on different machines and is resistant to hardware failure.
• Scalable: It is easily scalable, both horizontally and vertically. A few extra nodes help in scaling up the framework.
• Flexible: It is flexible: you can store as much structured and unstructured data as you need and decide how to use it later.
• Hadoop has an ecosystem that has evolved from its four core
components: data management, data access, data processing, and data
storage.

Hadoop ecosystem
Big Data Life Cycle with Hadoop
1. Ingesting data into the system

• The first stage of Big Data processing is Ingest.
• The data is ingested or transferred to Hadoop from various sources such as relational databases, systems, or local files.
• Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers event data.
2. Processing the data in storage

• The second stage is Processing. In this stage, the data is stored and processed.
• The data is stored in the distributed file system HDFS and in the NoSQL distributed database HBase; Spark and MapReduce perform the data processing. (A small Spark sketch follows.)
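As a brief, hedged sketch of this processing step (it assumes a local PySpark installation; the record layout and values are invented for illustration, and in practice the input would come from HDFS via sc.textFile), a Spark job might sum sensor readings per device:

```python
from pyspark import SparkContext

# Start Spark locally; on a real cluster this would run under YARN.
sc = SparkContext(master="local[*]", appName="sensor-totals")

# Hypothetical input lines of the form "device_id,reading".
lines = sc.parallelize(["d1,10", "d2,5", "d1,7", "d3,2", "d2,1"])

totals = (
    lines.map(lambda line: line.split(","))    # split each record into fields
         .map(lambda f: (f[0], int(f[1])))     # build (device_id, reading) pairs
         .reduceByKey(lambda a, b: a + b)      # sum the readings per device
)

print(totals.collect())    # e.g. [('d1', 17), ('d2', 6), ('d3', 2)]
sc.stop()
```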
3. Computing and analyzing data

• The third stage is to Analyze. Here, the data is analyzed by processing frameworks such as Pig, Hive, and Impala.
• Pig converts the data using map and reduce operations and then analyzes it.
• Hive is also based on map and reduce programming and is most suitable for structured data. (An analogous SQL-style query is sketched below.)
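To give a flavour of the SQL-style analysis that Hive enables over structured data, here is a hedged sketch using Spark SQL rather than Hive itself (the table and column names are invented for the example):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("structured-analysis").getOrCreate()

# A tiny structured dataset standing in for a table of sales records.
sales = spark.createDataFrame(
    [("north", 120), ("south", 80), ("north", 60), ("east", 95)],
    ["region", "amount"],
)
sales.createOrReplaceTempView("sales")

# SQL-style query, similar in spirit to what HiveQL offers for structured data.
result = spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
result.show()

spark.stop()
```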
4. Visualizing the results
• The fourth stage is Access, which is performed by tools such as Hue
and Cloudera Search.
• In this stage, the analyzed data can be accessed by users.
Thank you!!!
