
ADMAS UNIVERSITY

DEPARTMENT OF COMPUTER SCIENCE


SELECTED TOPICS IN COMPUTER SCIENCE
GROUP ASSIGNMENT

Group Members                                                                                                  ID

1. Kssahun Alemneh…………...........................................................................................................1196\21
2. Habtamu Teketel…………………………………………………………………………………………….0141\19
3. Demisew Bereded…………………….............................................................................................1152\21

Presentation date……………………….NOV 20\2024


Submission date…………………..
Introduction to Data Science
 Overview of Data Science
 Definition of data and information
 Data types and representation
 Data Value Chain
 Data Acquisition
 Data Analysis
 Data Curation
 Data Storage
 Data Usage
 Basic concepts of Big data
 Characteristics of big data
 Clustered computing
 Benefits of clustered computing
 Hadoop and its ecosystem
Introduction to Data Science


What is data science?

 Data science is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to
examine large amounts of data.
 It extracts insights from structured, semi-structured, and unstructured data.
 In simple terms, data science is the analysis of data.
 Dealing with huge amounts of data to find meaningful patterns is what data science is about.
• Data science combines statistics, data expertise, data engineering, visualization, and advanced
computing.
 It is used to analyze large amounts of data and spot trends through formats such as data visualizations and predictive
models.
 Applications of data science
1. Healthcare
2. Transportation
3. Sports
4. Government
5. E-commerce
6. Social media
7. Logistics
8. Gaming
What are data and information?
 Data can be defined as a representation of facts,
concepts, or instructions in a formalized manner
 It can be described as unprocessed facts and figures.
 It is represented with the help of characters such as
 Alphabets (A-Z, a-z)
 Digits (0-9), or
 Special characters (+, -, /, *, <, >, =, etc.)
 Units of data measurement
1. Bit = 1/8 byte
2. Nibble = 4 bits = 1/2 byte
3. Byte = 8 bits
4. Kilobyte (KB) = 1024 bytes
5. Megabyte (MB) = 1024 KB
6. Gigabyte (GB) = 1024 MB
7. Terabyte (TB) = 1024 GB
8. Petabyte (PB) = 1024 TB
9. Exabyte (EB) = 1024 PB
10. Zettabyte (ZB) = 1024 EB
11. Yottabyte (YB) = 1024 ZB
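As a rough illustration of these units, here is a minimal Python sketch (the function name and sample value are made up) that converts a raw byte count into a human-readable form using the 1024-to-1 relationship above.

```python
# Minimal sketch: convert a raw byte count into a human-readable unit,
# using the "1 unit = 1024 x (previous unit)" relationship listed above.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes: float) -> str:
    for unit in UNITS:
        if num_bytes < 1024:
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024
    return f"{num_bytes:.2f} YB"

print(human_readable(5_368_709_120))  # -> 5.00 GB
```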
What are data and information?
Information is the processed data on which decisions and actions are
based.
 It is data that has been processed into a form that is meaningful to
the recipient and is of real or perceived value in the current or
prospective actions or decisions of the recipient.
 Information is interpreted data; it is created from organized, structured,
and processed data in a particular context.
Data Processing Cycle
Data processing is the re-structuring or re-ordering of data by people
or machines to increase its usefulness and add value for a
particular purpose. Data processing consists of three basic
steps: input, processing, and output.
These three steps constitute the data processing cycle.
 Input: data is prepared in some convenient form for processing.
 The form will depend on the processing machine.
 Processing: the input data is changed to produce data in more
useful forms.
 Output: the processed data is delivered as information to the user.

Input → Processing → Output
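A minimal Python sketch of the cycle, using a hypothetical list of exam scores as the input data: the data is read in, processed into a more useful form (an average), and the result is output.

```python
# Input: raw data prepared in a convenient form (here, a simple list).
scores = [72, 85, 90, 64, 78]

# Processing: the input data is transformed into a more useful form.
average = sum(scores) / len(scores)

# Output: the processed result is presented to the user.
print(f"Average score: {average:.1f}")
```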


Data types and their representation
Data types can be described from diverse perspectives.
• In computer science and computer programming, for instance, a data type is
simply an attribute of data that tells the compiler or interpreter how the
programmer intends to use the data.
Data types from Computer programming perspective
 Common data types include:
 Integers (int): used to store whole numbers, mathematically known as
integers
 Booleans (bool): used to represent values restricted to one of two states: true or
false
 Characters (char): used to store a single character
 Floating-point numbers (float): used to store real numbers
 Alphanumeric strings (string): used to store a combination of characters and
numbers
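As a hypothetical illustration, the short Python snippet below assigns one value of each of these types; note that Python has no separate char type, so a single character is simply stored as a one-character string.

```python
age = 25                  # integer (int): a whole number
is_enrolled = True        # Boolean (bool): restricted to True or False
grade = "A"               # character: a single character (a 1-character string in Python)
gpa = 3.75                # floating-point number (float): a real number
student_id = "CS1196-21"  # alphanumeric string (str): characters and digits combined

for value in (age, is_enrolled, grade, gpa, student_id):
    print(type(value).__name__, value)
```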
Data types from Data Analytics perspective
• A data type constrains the values that an expression, such as a variable or a function, might take.
• This data type defines the operations that can be done on the data, the meaning of the data,
and the way values of that type can be stored.
 Structured Data
 Structured data is data that adheres to a pre-defined data
model and is therefore straightforward to analyze.
 Structured data conforms to a tabular format with a
relationship between the different rows and columns.
 Common examples of structured data are Excel files or SQL
databases.
 Each of these has structured rows and columns that can be
sorted.
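A minimal sketch of structured data in Python, assuming the pandas library is available; the table contents are made up for illustration.

```python
import pandas as pd

# Structured data: a tabular format with a relationship between rows and
# columns, similar to an Excel sheet or an SQL table.
students = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Kssahun", "Habtamu", "Demisew"],
    "score": [85, 78, 92],
})

# Because the data follows a pre-defined model, it is straightforward to analyze,
# e.g. the rows can be sorted by any column:
print(students.sort_values("score", ascending=False))
```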
Data types from Data Analytics perspective
Semi-structured data

 Semi-structured data is a form of structured data that does not
conform to the formal structure of data models associated with
relational databases or other forms of data tables, but nonetheless
contains tags or other markers to separate semantic elements and enforce
hierarchies of records and fields within the data.
• Therefore, it is also known as a self-describing structure.
• Common examples of semi-structured data are JSON and XML.
• JSON (JavaScript Object Notation) is a data interchange format that uses
human-readable text to store and transmit data.
• XML (Extensible Markup Language) provides rules to define any data.
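A small hypothetical example of semi-structured data, parsed with Python's built-in json module; the keys act as the tags that mark the semantic elements and the hierarchy of records and fields.

```python
import json

# Semi-structured data: no fixed table schema, but tags (keys) mark the
# semantic elements and the hierarchy of records and fields.
raw = '''
{
  "student": {
    "name": "Habtamu",
    "courses": ["Data Science", "Databases"],
    "year": 2024
  }
}
'''

record = json.loads(raw)                   # parse the self-describing text
print(record["student"]["name"])           # -> Habtamu
print(len(record["student"]["courses"]))   # -> 2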
Data types from Data Analytics perspective

Unstructured Data
Unstructured data is information that either does not have a predefined data
model or is not organized in a pre-defined manner.
• It cannot be displayed in rows, columns, or a relational database.
• It requires more storage and is difficult to manage and protect.
• Unstructured information is typically text-heavy but may contain data such as
dates, numbers, and facts as well.
• This results in irregularities and ambiguities that make it difficult to understand
using traditional programs as compared to data stored in structured databases.
• Common examples of unstructured data include audio files, video files, or data held in NoSQL stores.
Metadata –Data about Data
• The last category of data type is metadata. From a technical point of
view, this is not a separate data structure, but it is one of the most
important elements for Big Data analysis and big data solutions.
• Metadata is data about data. It provides additional information about a
specific set of data.
• In a set of photographs, for example, metadata could describe when
and where the photos were taken.
• The metadata then provides fields for dates and locations which, by
themselves, can be considered structured data. Because of this reason,
metadata is frequently used by Big Data solutions for initial analysis.
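A tiny hypothetical illustration in Python of the photo example: the image file itself is unstructured, but its metadata fields are structured and can be searched or filtered.

```python
# Metadata: data about data. The photo itself is unstructured, but its
# metadata fields (dates, locations) are structured and searchable.
photo_metadata = {
    "file": "IMG_0042.jpg",
    "taken_on": "2024-11-20",
    "location": "Addis Ababa",
    "camera": "Phone camera",
}

# Big Data solutions can use such fields for an initial analysis,
# e.g. filtering photos by the year they were taken.
if photo_metadata["taken_on"].startswith("2024"):
    print("Photo taken in 2024:", photo_metadata["file"])
```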
Data Value Chain
The Data Value Chain is introduced to describe the information flow within a
big data system.
• It is a series of steps needed to generate value and useful insights from data.
• The Big Data Value Chain identifies the following key high-level
activities:
Data Acquisition: the process of gathering, filtering, and cleaning
data before it is put in a data warehouse.
Data acquisition is one of the major big data challenges in terms of
infrastructure requirements.
 The infrastructure required to support the acquisition of big data must
deliver low, predictable latency both in capturing data and in executing
queries.
 It is used in a distributed environment and must support flexible and dynamic data structures.
Data Value Chain
 Data Analysis: making the raw data acquired amenable to use in decision-
making.
 Data analysis involves exploring, transforming, and modeling data,
with the goal of highlighting relevant data, synthesizing and extracting useful hidden
information with high potential from a business point of view.
 Related areas include data mining, business intelligence, and machine learning.
 Data Curation: the active management of data over its life cycle.
 Data curation processes can be categorized into different activities such as content
creation, selection, classification, transformation, validation, and preservation.
 Data curation is performed by expert curators who are responsible for improving the
accessibility and quality of data.
 Data curators (also known as scientific curators or data annotators) hold the
responsibility of ensuring that data are trustworthy, discoverable, accessible, reusable,
and fit for their purpose.
 A key trend for the curation of big data utilizes community and crowdsourcing
approaches.
Data Value Chain
Data Storage
 It is the persistence and management of data in a scalable way that satisfies
the needs of applications that require fast access to the data.
 Relational Database Management Systems (RDBMS) have been the main,
and almost unique, solution to the storage paradigm.
 They rely on the ACID (Atomicity, Consistency, Isolation, and Durability) properties that
guarantee database transactions.
 NoSQL technologies have been designed with the scalability goal in mind and
present a wide range of solutions based on alternative data models.
Data Usage
• It covers the data-driven business activities that need access to data, its
analysis, and the tools needed to integrate the data analysis within the
business activity.
• Data usage in business decision-making can enhance competitiveness
through the reduction of costs and increased added value.
Basic concepts of big data
Big data refers to the non-traditional strategies and technologies needed to
gather, organize, process, and gain insights from large data sets.
It involves large, complex, and diverse data sets.
 Big data is a collection of data sets so large and complex that it
becomes difficult to process using traditional data processing
applications.
 A "large dataset" means a data set too large to reasonably process or
store with traditional tooling or on a single computer.
Big data is characterized by the 3Vs and more:
• Volume: large amounts of data or massive datasets are generated
• Velocity: data is live, streaming, or in motion
• Variety: data comes in many different forms from diverse sources
• Veracity: the quality, accuracy, or trustworthiness of the data
Clustered Computing
 Because of the qualities of big data, individual computers are often
inadequate for handling the data at most stages.
 To better address the high storage and computational needs of big data,
computer clusters are a better fit.
 A cluster is a collection of loosely connected computers that work together and
act as a single entity.
 Big data clustering software combines the resources of many smaller
machines, seeking to provide a number of benefits:
Cluster computing benefits
Resource Pooling
High Availability
Easy Scalability
Clustered Computing

 Cluster computing benefits


 Resource Pooling
 Combining the available storage space to hold data is a clear benefit, but CPU and memory
pooling are also extremely important.
 High Availability
 Clusters can provide varying levels of fault tolerance and availability guarantees to prevent
hardware or software failures from affecting access to data and processing.
 Easy Scalability
 Clusters make it easy to scale horizontally by adding additional machines to the group.
 Clusters require a solution for managing cluster membership, coordinating resource
sharing, and scheduling actual work on individual nodes.
 Cluster membership and resource allocation can be handled by software like Hadoop's YARN.
Hadoop and its Ecosystem
 Hadoop is an open-source Apache software framework intended to make
interaction with big data easier.
 A framework that allows for the distributed processing of large datasets across
clusters of computers using simple programming models.
 It is inspired by a technical document published by Google.
 Key characteristics of Hadoop
Economical: Its systems are highly economical as ordinary computers can be used
for data processing.
Reliable: It is reliable as it stores copies of the data on different machines and is
resistant to hardware failure
Scalable: It is easily scalable, both horizontally and vertically. A few extra nodes
help in scaling up the framework.
Flexible: It is flexible, and you can store as much structured and unstructured data as
you need and decide how to use it later.
Hadoop and its Ecosystem
 Hadoop ecosystem consists of a variety of components that work together to provide
a comprehensive framework for processing and managing large datasets.
 HDFS (Hadoop Distributed File System): designed to store large files across
multiple machines in a distributed manner.
YARN (Yet Another Resource Negotiator): the resource management layer of
Hadoop.
 MapReduce: the programming model used for processing large datasets in
parallel across a Hadoop cluster (a minimal sketch follows below).
MapReduce consists of two main tasks:
 the Map task, which processes input data and generates intermediate key-value pairs
 the Reduce task, which aggregates these intermediate results.
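The classic illustration of this model is a word count. The pure-Python sketch below imitates the Map, shuffle, and Reduce phases on a single machine (real MapReduce distributes the same steps across the cluster); the sample lines are made up.

```python
from collections import defaultdict

lines = ["big data needs big clusters", "hadoop processes big data"]

# Map task: process the input and emit intermediate (key, value) pairs.
intermediate = []
for line in lines:
    for word in line.split():
        intermediate.append((word, 1))

# Shuffle: group the intermediate pairs by key.
grouped = defaultdict(list)
for word, count in intermediate:
    grouped[word].append(count)

# Reduce task: aggregate the values for each key.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # e.g. {'big': 3, 'data': 2, ...}
```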
Spark: a fast, in-memory data processing engine that can run on top of Hadoop.
 It provides APIs for Java, Scala, Python, and R.
Pig: a high-level platform for creating programs that run on Hadoop.
Hive: a data warehousing and SQL-like query language interface built on top of Hadoop.
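For comparison, the same word count expressed with Spark's Python API (PySpark), assuming a local Spark installation; the HDFS path is hypothetical. Keeping intermediate data in memory is what makes Spark fast on top of Hadoop.

```python
from pyspark import SparkContext

sc = SparkContext("local", "WordCount")  # "local" mode; on a cluster this would run under YARN

# Read input (the HDFS path below is hypothetical).
lines = sc.textFile("hdfs:///user/demo/input.txt")

counts = (lines.flatMap(lambda line: line.split())   # map: split lines into words
               .map(lambda word: (word, 1))          # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # reduce: sum counts per word

print(counts.take(10))  # show the first 10 (word, count) pairs
sc.stop()
```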
Big Data Life Cycle with Hadoop
Ingesting data into the system
 The first stage of Big Data processing is Ingest.
 The data is ingested or transferred to Hadoop from various sources such as
relational databases, systems, or local files.
Processing the data in storage
 The second stage is Processing. In this stage, the data is stored and processed.
 The data is stored in the distributed file system.
Computing and analyzing data
 The third stage is Analyze. Here, the data is analyzed by processing
frameworks such as Pig, Hive, and Impala.
Visualizing the results
 The fourth stage is Access, which is performed by tools such as Hue and
Cloudera Search.
 In this stage, the analyzed data can be accessed by users.
Summary questions
1.What do you mean by data science?
2. What is structured data?
3. What is unstructured data?
4. What is the difference between BI (Business intelligence) and Data science?
5. What are benefits of Cluster computing?
6. ____ is the process of gathering, filtering, and cleaning data before it is put in a data
warehouse
7. _____provides additional information about a specific set of data
8. List the characteristics of big data
a. __________
b. __________
c. __________
d. __________
End of chapter

THANK YOU
