Chapter 2 Data Science
Contents
Overview of Data Science
Data and Information
Data Types and Their Representation
Data Value Chains
Basic Concepts of Big Data
2.1. Overview of Data Science
What are data and information?
Data can be defined as a representation of facts, concepts, or instructions in a formalized manner, suitable for communication, interpretation, or processing by humans or electronic machines.
It can also be described as unprocessed facts and figures.
Data Processing Cycle
Data processing is the restructuring or reordering of data by people or machines to increase its usefulness and add value for a particular purpose.
Data processing consists of three basic steps: input, processing, and output. Together, these steps constitute the data processing cycle.
Output: at this stage, the result of the preceding processing step is collected.
The particular form of the output depends on the intended use of the data.
For example, the output may be a payroll report for employees.
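To make the cycle concrete, here is a minimal Python sketch of the three steps for the payroll example above. The names, hours, and rates are made-up illustration values, not data from any real system.

    # Input: raw facts and figures collected for processing (hypothetical sample data)
    employees = [
        {"name": "Abebe", "hours": 160, "rate": 50.0},
        {"name": "Sara", "hours": 152, "rate": 65.0},
    ]

    # Processing: restructure the raw data to increase its usefulness
    payroll = [(e["name"], e["hours"] * e["rate"]) for e in employees]

    # Output: present the result in the form the user needs
    for name, pay in payroll:
        print(f"{name}: {pay:.2f}")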
Data Types and Their Representation
Data types can be described from diverse perspectives.
In computer science and computer programming, for
instance, a data type is simply an attribute of data that tells
the compiler or interpreter how the programmer intends to
use the data.
1. Data Types from the Computer Programming Perspective
Almost all programming languages explicitly include the notion of a data type, though different languages may use different terminology.
Common data types include the following (see the sketch after this list):
Integers (int): used to represent whole numbers, mathematically known as integers
Booleans (bool): used to represent values restricted to one of two options: true or false
Characters (char): used to represent a single character
Floating-point numbers (float): used to represent real numbers
Alphanumeric strings (string): used to represent a combination of characters and numbers
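As a minimal illustration, the following Python sketch declares one value of each of the common types listed above; the variable names are arbitrary. Note that Python has no separate character type, so a one-character string stands in for char.

    count = 42           # integer (int): a whole number
    is_valid = True      # Boolean (bool): one of two values, True or False
    grade = "A"          # character: a one-character string in Python
    temperature = 36.6   # floating-point number (float): a real number
    label = "Room 101"   # string (str): a combination of characters and numbers

    # type() reports the data type the interpreter assigns to each value
    for value in (count, is_valid, grade, temperature, label):
        print(type(value).__name__, value)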
2. Data Types from the Data Analytics Perspective
From a data analytics point of view, it is important to understand that there are three common types of data structures:
Structured,
Semi-structured, and
Unstructured data.
Structured Data
Structured data is data that adheres to a pre-defined data
model and is therefore straightforward to analyze.
Structured data conforms to a tabular format with a
relationship between the different rows and columns.
Common examples of structured data are Excel files or SQL
databases.
Each of these has structured rows and columns that can be
sorted.
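As a small illustration of structured data, the sketch below uses Python's built-in sqlite3 module to create a table with a pre-defined schema; the table and column names are invented for the example.

    import sqlite3

    # An in-memory SQL database: structured data with a pre-defined schema
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary REAL)")
    con.executemany(
        "INSERT INTO employees VALUES (?, ?, ?)",
        [("Abebe", "IT", 9000.0), ("Sara", "HR", 8500.0)],
    )

    # Rows and columns can be queried and sorted because the structure is known in advance
    for row in con.execute("SELECT name, salary FROM employees ORDER BY salary DESC"):
        print(row)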
Semi-structured Data
Semi-structured data is a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables.
Nonetheless, it contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
Therefore, it is also known as a self-describing structure. Common examples are JSON and XML documents.
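For instance, a JSON document is self-describing: the keys act as markers that label each field and nest records into a hierarchy. A minimal Python sketch, with made-up field names:

    import json

    # Semi-structured data: no fixed schema, but tags (keys) describe each value
    doc = '{"name": "Sara", "skills": ["SQL", "Python"], "manager": {"name": "Abebe"}}'

    record = json.loads(doc)
    print(record["skills"][1])        # fields can hold lists ...
    print(record["manager"]["name"])  # ... or nested records, forming a hierarchy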
Metadata – Data about Data
• The last category of data type is metadata.
• From a technical point of view, this is not a separate
data structure, but it is one of the most important
elements for Big Data analysis and big data solutions.
• Metadata is data about data.
• It provides additional information about a specific set
of data.
• In a set of photographs, for example, metadata could
describe when and where the photos were taken.
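As an illustration, the photo metadata described above might be represented as the following Python dictionary; the field names loosely mimic common EXIF tags but are simplified, made-up examples.

    # Metadata: data about data - it describes the photo, not the pixels themselves
    photo_metadata = {
        "file": "IMG_0042.jpg",
        "taken_at": "2025-03-31 09:15:00",
        "location": "Addis Ababa",
        "camera": "Canon EOS 90D",
        "resolution": "6960x4640",
    }

    print(f"{photo_metadata['file']} was taken at {photo_metadata['taken_at']} "
          f"in {photo_metadata['location']}")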
Data Value Chain
The Data Value Chain is introduced to describe the information flow within a big data system as a series of steps needed to generate value and useful insights from data. The Big Data Value Chain identifies the following key high-level activities: data acquisition, data analysis, data curation, data storage, and data usage.
Basic Concepts of Big Data
What Is Big Data?
• Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process using on-hand
database management tools or traditional data processing
applications.
• In this context, a “large dataset” means a dataset too large to
reasonably process or store with traditional tooling or on a single
computer.
• This means that the common scale of big datasets is constantly
shifting and may vary significantly from organization to
organization.
Big data is characterized by the 4 Vs (and more):
• Volume: large amounts of data, up to zettabytes; massive datasets
• Velocity: data is live-streamed or in motion
• Variety: data comes in many different forms from diverse sources
• Veracity: can we trust the data? How accurate is it?
Clustered computing offers the following benefits for big data:
• Resource Pooling: Combining the available storage space to hold data is a clear benefit, but CPU and memory pooling are also extremely important. Processing large datasets requires large amounts of all three of these resources.
• High Availability: Clusters can provide varying levels of fault tolerance and availability guarantees to prevent hardware or software failures from affecting access to data and processing. This becomes increasingly important as we continue to emphasize real-time analytics.
• Easy Scalability: Clusters make it easy to scale horizontally by adding machines to the group, so the system can react to changes in resource requirements without expanding the physical resources of a single machine.
THANK YOU