Data Science
Group Members                                    ID
1. Kssahun Alemneh ........................... 1196\21
2. Habtamu Teketel ........................... 0141\19
3. Demisew Bereded ........................... 1152\21
Data science is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to examine large amounts of data and extract insights from structured, semi-structured, and unstructured data.
Put simply, data science is analyzing data: dealing with huge amounts of data to find meaningful patterns.
• Data science combines: statistics, data expertise, data engineering, visualization, and advanced computing.
It is used to analyze large amounts of data and spot trends through formats such as data visualizations and predictive models.
Applications of data science
1. Healthcare
2. Transportation
3. Sport
4. Government
5. E-commerce
6. Social media
7. Logistics
8. Gaming
What are data and information?
Data can be defined as a representation of facts,
concepts, or instructions in a formalized manner
It can be described as unprocessed facts and figures.
It is represented with the help of characters such as:
• Alphabets (A-Z, a-z)
• Digits (0-9), or
• Special characters (+, -, /, *, <, >, =, etc.)
Measures of data in a file
1. Bit = 1/8 byte
2. Nibble = 4 bits = 1/2 byte
3. Byte = 8 bits
4. Kilobyte = 1024 bytes
5. Megabyte = 1024 KB
6. Gigabyte = 1024 MB
7. Terabyte = 1024 GB
8. Petabyte = 1024 TB
9. Exabyte = 1024 PB
10. Zettabyte = 1024 EB
11. Yottabyte = 1024 ZB
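The 1024-based progression in the table above can be sketched in a few lines of Python. This is an illustrative sketch only; the `to_bytes` helper and unit list are hypothetical names invented here:

```python
# Illustrative sketch: each unit in the table above is 1024x the previous one.
units = ["byte", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bytes(value, unit):
    """Convert a value in the given (binary, 1024-based) unit to bytes."""
    return value * 1024 ** units.index(unit)

print(to_bytes(1, "KB"))  # 1024
print(to_bytes(1, "GB"))  # 1073741824, i.e. 1024 ** 3
```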
What are data and information?
Information is processed data on which decisions and actions are based.
It is data that has been processed into a form that is meaningful to the recipient and is of real or perceived value in the current or prospective actions or decisions of the recipient.
Information is interpreted data: created from organized, structured, and processed data in a particular context.
Data Processing Cycle
Data processing is the restructuring or reordering of data by people or machines to increase its usefulness and add value for a particular purpose. Data processing consists of three basic steps: input, processing, and output. These three steps constitute the data processing cycle.
Input: data is prepared in some convenient form for processing. The form will depend on the processing machine.
Processing: the input data is changed to produce data in more useful forms.
Output: the results of processing are collected and delivered to the user.
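A minimal sketch of the three-step cycle described above, input -> processing -> output. The records and the averaging step are made-up examples, not part of any real system:

```python
# Toy sketch of the data processing cycle: input, processing, output.

def process_cycle(raw_records):
    # Input: prepare raw data in a convenient form (parse strings to numbers)
    prepared = [int(r) for r in raw_records]
    # Processing: change the input into a more useful form (here, an average)
    average = sum(prepared) / len(prepared)
    # Output: deliver the processed result
    return average

print(process_cycle(["10", "20", "30"]))  # 20.0
```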
Unstructured Data
Unstructured data is information that either does not have a predefined data model or is not organized in a pre-defined manner.
• It cannot be displayed in the rows and columns of a relational database.
• It requires more storage and is more difficult to manage and protect.
• Unstructured information is typically text-heavy but may contain data such as dates, numbers, and facts as well.
• This results in irregularities and ambiguities that make it difficult to understand using traditional programs, as compared to data stored in structured databases.
• Common examples of unstructured data include audio and video files, or data held in NoSQL stores.
Metadata –Data about Data
• The last category of data type is metadata. From a technical point of
view, this is not a separate data structure, but it is one of the most
important elements for Big Data analysis and big data solutions.
• Metadata is data about data. It provides additional information about a
specific set of data.
• In a set of photographs, for example, metadata could describe when
and where the photos were taken.
• The metadata then provides fields for dates and locations which, by
themselves, can be considered structured data. Because of this reason,
metadata is frequently used by Big Data solutions for initial analysis.
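The photograph example above can be sketched as structured fields attached to the raw data. The field names below are hypothetical illustrations, not a real EXIF or metadata schema:

```python
# Hypothetical sketch: the raw photo bytes are the data;
# the fields below are metadata, i.e. data about that data.
photo = b"...raw image bytes..."      # placeholder for the data itself
photo_metadata = {                    # structured fields describing the photo
    "taken_on": "2021-05-14",         # when the photo was taken
    "location": "Addis Ababa",        # where it was taken
    "size_bytes": len(photo),         # a fact about the file itself
}
print(photo_metadata["location"])     # Addis Ababa
```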
Data Value Chain
The Data Value Chain is introduced to describe the information flow within a big data system:
• a series of steps needed to generate value and useful insights from data.
• The Big Data Value Chain identifies the following key high-level activities:
Data Acquisition is the process of gathering, filtering, and cleaning data before it is put in a data warehouse.
Data acquisition is one of the major big data challenges in terms of infrastructure requirements.
The infrastructure required to support the acquisition of big data must deliver low, predictable latency both in capturing data and in executing queries. It must work in a distributed environment and support flexible and dynamic data structures.
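The gathering, filtering, and cleaning steps of acquisition might be sketched as follows. The records and field names are invented for illustration only:

```python
# Illustrative sketch of data acquisition: filter and clean gathered records
# before loading them into a warehouse. All data here is made up.

raw_records = [
    {"id": 1, "value": " 42 "},
    {"id": 2, "value": ""},        # empty value: filtered out
    {"id": 3, "value": "17"},
]

def acquire(records):
    # Filter: drop records with missing values
    kept = [r for r in records if r["value"].strip()]
    # Clean: normalize the remaining values to integers
    return [{"id": r["id"], "value": int(r["value"])} for r in kept]

print(acquire(raw_records))  # [{'id': 1, 'value': 42}, {'id': 3, 'value': 17}]
```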
Data Analysis is making the acquired raw data amenable to use in decision-making.
Data analysis involves exploring, transforming, and modeling data with the goal of highlighting relevant data, and synthesizing and extracting useful hidden information with high potential from a business point of view.
Related areas include data mining, business intelligence, and machine learning.
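A toy sketch of the exploring, transforming, and (very loosely) modeling steps named above, using only the standard library. The sales figures are made up:

```python
from statistics import mean

# Made-up monthly sales figures for illustration.
sales = [100, 120, 90, 150, 170]

# Explore: a summary statistic
avg = mean(sales)

# Transform: month-over-month change
changes = [b - a for a, b in zip(sales, sales[1:])]

# "Model" (very loosely): is the average change positive, i.e. an upward trend?
trend = "upward" if mean(changes) > 0 else "flat or downward"
print(avg, trend)  # 126 upward
```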
Data Curation is the active management of data over its life cycle.
Data curation processes can be categorized into different activities such as content creation, selection, classification, transformation, validation, and preservation.
Data curation is performed by expert curators who are responsible for improving the accessibility and quality of data.
Data curators (also known as scientific curators or data annotators) hold the responsibility of ensuring that data are trustworthy, discoverable, accessible, reusable, and fit for their purpose.
A key trend for the curation of big data utilizes community and crowdsourcing approaches.
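Two of the curation activities named above, validation and classification, can be sketched as follows. The rules and field names are hypothetical, chosen only to illustrate the idea:

```python
# Hypothetical sketch of two curation activities:
# validation (is the record fit for purpose?) and classification (tagging it).

def curate(record):
    # Validation: reject records whose required fields are missing
    if not record.get("title"):
        return None
    # Classification: tag the record so it is discoverable later
    record["category"] = "long" if len(record["title"]) > 10 else "short"
    return record

print(curate({"title": "Big Data Value Chain"}))  # tagged as 'long'
print(curate({"title": ""}))                      # None: fails validation
```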
Data Storage
It is the persistence and management of data in a scalable way that satisfies the needs of applications that require fast access to the data.
Relational Database Management Systems (RDBMS) have been the main, and almost the only, solution to the storage paradigm. They provide the ACID (Atomicity, Consistency, Isolation, and Durability) properties that guarantee database transactions.
NoSQL technologies have been designed with the scalability goal in mind and present a wide range of solutions based on alternative data models.
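The ACID transaction guarantee can be illustrated with SQLite, a small RDBMS bundled with Python's standard library. The accounts table is a made-up example:

```python
import sqlite3

# Sketch of an RDBMS transaction with ACID guarantees, using SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")

# Atomicity: the with-block is one transaction; it commits only if both
# statements succeed, and rolls back entirely if either one raises.
with conn:
    conn.execute("UPDATE accounts SET balance = balance - 40 WHERE name = 'a'")
    conn.execute("UPDATE accounts SET balance = balance + 40 WHERE name = 'b'")

print(dict(conn.execute("SELECT name, balance FROM accounts")))
# {'a': 60, 'b': 40}
```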
Data Usage
• It covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity.
• Data usage in business decision-making can enhance competitiveness through reduced costs and increased added value.
Basic concepts of big data
Big data refers to the non-traditional strategies and technologies needed to gather, organize, process, and gain insights from large, complex, and diverse data sets.
Big data is a collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications.
Here, a "large dataset" means a data set too large to reasonably process or store with traditional tooling or on a single computer.
Big data is characterized by 3Vs and more:
• Volume: large amounts of data / massive datasets are generated
• Velocity: data is live, streaming, or in motion
• Variety: data comes in many different forms from diverse sources
• Veracity: the quality, accuracy, or trustworthiness of the data
Clustered Computing
Because of the qualities of big data, individual computers are often
inadequate for handling the data at most stages.
To better address the high storage and computational needs of big data,
computer clusters are a better fit.
A cluster is a collection of loosely connected computers that work together and act as a single entity.
Big data clustering software combines the resources of many smaller
machines, seeking to provide a number of benefits:
Cluster computing benefits
Resource Pooling
High Availability
Easy Scalability
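As a loose analogy only (threads in one process, not a real cluster), resource pooling can be sketched by splitting a large dataset across a pool of workers and merging their partial results, much as cluster software presents many machines as a single entity:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy analogy of resource pooling: split a dataset into chunks, let a pool
# of workers process the chunks in parallel, then combine the results.
data = list(range(1_000))
chunks = [data[i:i + 250] for i in range(0, len(data), 250)]

with ThreadPoolExecutor(max_workers=4) as pool:   # 4 "nodes" in our pool
    partial_sums = list(pool.map(sum, chunks))    # each worker takes a chunk

print(sum(partial_sums))  # 499500, same as sum(data)
```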
THANK YOU