BTY587: DATA ANALYSIS AND SIMULATIONS

Data pre-processing and visualization

Dr. Awadhesh K Verma

Assistant Professor (LPU)
Former Assistant Professor (DSEU, Govt. of NCT Delhi)
Former Research Scientist, 3Nano (AIC-JNUFI)
Ph.D. NanoBioPhysics & Nano-bioinformatics (CIRBSc, JMI & IIT D)
M.Tech Nanoscience (SCNS, JNU)
M.Sc. Bioinformatics (JMI)
M.Sc. Biochemistry (JMI)
Unit 1
Data pre-processing and visualization
Data preprocessing and visualization: types of data, dealing with missing
data, scatter plot, histogram, group plots, box plots, dimensionality reduction
Data
• Data is a collection of facts, such as numbers, words, measurements,
observations or just descriptions of things, that is formatted in a
particular manner.
genes                   c1      c2    c3      s11      s12      s13
gene-PPP1CC_PPP1CC      3       0     22      4642     4213     4083
gene-ZNF556_ZNF556      154.17  19.7  101     97.76    202.69   182.75
gene-DHX57_DHX57        0       1     17      942      973      1048
gene-RPL7A_RPL7A        12.5    1.5   24.5    17975.5  10619    12520.5
gene-CD1B_CD1B          0       0     0       0        0        0
gene-TPCN2_TPCN2        1       0     27      688      498      568
gene-PPP1R10_PPP1R10    0       0     5.71    765.43   515.57   631
gene-RIMBP3_RIMBP3      0       0     0       2.88     2.56     0
gene-CARF_CARF          7.05    0     23.54   499.01   551.2    695.59
gene-HLA-DQA2_HLA-DQA2  0       0     0       0        0        0
gene-IFT140_IFT140      0       0     33      608      656      898
gene-CDKN2C_CDKN2C      0       0     0       385      213      282
gene-TAOK2_TAOK2        3       0     11      2481     2010     2561
gene-TRIM38_TRIM38      11.09   1     102.18  1105.03  1166     1386.33
Terms
• Data
• Programs- collections of instructions used to manipulate data
• Data science- the field that combines knowledge of mathematics,
programming skills, domain expertise, scientific methods, algorithms,
processes and systems to extract actionable knowledge and apply it to a
wide range of uses and domains.
• Information- classified and organized data that has some meaningful
value for the users. It is the processed data used to make decisions and
take actions.
Types of data
• Qualitative- words or sentences

• Quantitative-
1. Discrete – can only take certain values (whole numbers)
2. Continuous – can take any value (within a range)

- Discrete is counted
- Continuous is measured
• Qualitative data- These represent characteristics or attributes. They are
descriptions that can be observed but cannot be computed or calculated. They
are more exploratory than conclusive in nature.
Examples-1. Data on attributes such as intelligence, honesty, wisdom,
cleanliness, and creativity, collected using the students of your class as a
sample, would be classified as qualitative.
2. Reviews by customers

• Quantitative data- These can be measured and not simply observed. They can be
represented numerically, and calculations can be performed on them.
Examples-1. Data on the number of students playing different sports from your
class gives an estimate of how many of the total students play which sport.
2. A firm reports financial numbers for a particular quarter.
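• Discrete Data: These are data that can take only certain specific
values rather than a range of values. For example, the number of
students in a class is discrete. Data on the blood group of a certain
population or on their genders likewise take only specific values,
although these are category labels rather than counts.

• Continuous Data: These are data that can take any value within a range.
For example, the heights and weights of the students of your school can
be in decimals.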
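The distinction between the data types can be illustrated with a small sketch. It uses Python's built-in types as a rough stand-in: text for qualitative data, integers for discrete data and floating-point numbers for continuous data. The function name `describe` and the sample values are hypothetical, and the type check is only a heuristic (a whole-number measurement such as 170 cm would be labelled discrete).

```python
def describe(value):
    """Rough heuristic: map a Python value to one of the data types above."""
    if isinstance(value, str):
        return "qualitative"
    if isinstance(value, int):
        return "quantitative (discrete)"    # counted
    if isinstance(value, float):
        return "quantitative (continuous)"  # measured
    return "unknown"

# Hypothetical record for one student
record = {
    "review": "Very helpful course",  # words or sentences
    "sports_played": 2,               # a count
    "height_cm": 162.5,               # a measurement
}

labels = {field: describe(value) for field, value in record.items()}
for field, label in labels.items():
    print(f"{field}: {label}")
```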
Dealing with missing data
• What is a Missing Value?
Missing data are values that are not stored (or not present) for one or
more variables in the given dataset.
• Missing values are commonly represented by NaN (Not a Number).
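As a minimal sketch of how NaN markers can be located, the snippet below uses only Python's standard library. The list `readings` is hypothetical data. Note that NaN is not equal to itself, so `math.isnan` is used instead of `==`.

```python
import math

# Hypothetical measurements; math.nan marks the missing entries
readings = [3.0, math.nan, 22.0, math.nan, 4213.0]

# NaN never compares equal to anything, including itself
print(math.nan == math.nan)  # False

# Positions and count of the missing values
missing_positions = [i for i, x in enumerate(readings) if math.isnan(x)]
missing_count = len(missing_positions)

print("missing at positions:", missing_positions)
print("number of missing values:", missing_count)
```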
Why Is Data Missing From The Dataset?

• There can be multiple reasons why certain values are missing from the
data.
• The reason the data is missing affects the approach used to handle it.
• Some of the reasons are listed below:

1. Past data might get corrupted due to improper maintenance.
2. Observations are not recorded for certain fields, for example because
of a failure in recording the values due to human error.
3. The user has intentionally not provided the values.
Why Do We Need To Care About Handling Missing Values?

• It is important to handle the missing values appropriately.

1. Many machine learning algorithms fail if the dataset contains missing
values.
2. If the missing values are not handled properly, you may end up building
a biased machine learning model, which will lead to incorrect results.
3. Missing data can lead to a lack of precision in the statistical analysis.

Ways of handling missing data
• There are 2 primary ways of handling missing values:

1. Deleting the Missing values

2. Imputing the Missing Values
Deleting the Missing Values
• It is one of the quickest techniques for dealing with missing values.
• If the missing value is of the type Missing Not At Random (MNAR), where the
chance of a value being missing depends on the missing value itself, then it
should not be deleted.
• If the missing value is of the type Missing At Random (MAR), where
missingness depends only on other observed variables, or Missing Completely At
Random (MCAR), where missingness is unrelated to any of the data, then it can
be deleted.
• The disadvantage of this method is that one might end up deleting some
useful data from the dataset.
• There are 2 ways one can delete the missing values:
• Deleting the entire row
• Deleting the entire column
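The two deletion options can be sketched in plain Python as follows. The table, its column names and the helper functions are all hypothetical. The example is only meant to show the trade-off: dropping rows loses records, while dropping columns loses variables.

```python
import math

# Hypothetical table: one row per student, math.nan marks a missing value
columns = ["age", "height_cm", "weight_kg"]
rows = [
    [21, 170.0, 65.0],
    [22, math.nan, 70.0],
    [23, 165.0, 58.5],
    [24, math.nan, 72.0],
]

def is_missing(x):
    return isinstance(x, float) and math.isnan(x)

def drop_rows_with_missing(rows):
    """Keep only the rows that have no missing value."""
    return [r for r in rows if not any(is_missing(v) for v in r)]

def drop_columns_with_missing(columns, rows):
    """Keep only the columns that have no missing value in any row."""
    keep = [j for j in range(len(columns))
            if not any(is_missing(r[j]) for r in rows)]
    return [columns[j] for j in keep], [[r[j] for j in keep] for r in rows]

complete_rows = drop_rows_with_missing(rows)
kept_columns, trimmed_rows = drop_columns_with_missing(columns, rows)

print(len(complete_rows), "of", len(rows), "rows remain after row deletion")
print("columns remaining after column deletion:", kept_columns)
```

Here row deletion discards half of the records, whereas column deletion keeps every record but loses the height variable entirely.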
Imputing the Missing Values
• There are different ways of replacing the missing values:

• Replacing with an arbitrary value
• Replacing with the mean
• Replacing with the mode- the mode is the most frequently occurring value.
• Replacing with the median- the median is the middle value of the sorted data.
• Replacing with the previous value – forward fill
• Replacing with the next value – backward fill
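The replacement strategies listed above can be sketched with Python's standard library. The column `values` and the helper names are hypothetical. The mean, median and mode are computed from the observed (non-missing) entries only.

```python
import math
from statistics import mean, median, mode

# Hypothetical column with two missing entries
values = [4.0, math.nan, 6.0, 6.0, math.nan, 10.0]

def is_missing(x):
    return isinstance(x, float) and math.isnan(x)

observed = [x for x in values if not is_missing(x)]

def fill_constant(values, fill):
    """Replace every missing entry with the same value."""
    return [fill if is_missing(x) else x for x in values]

def forward_fill(values):
    """Replace each missing entry with the previous observed value."""
    filled, last = [], None
    for x in values:
        if is_missing(x) and last is not None:
            filled.append(last)
        else:
            filled.append(x)  # a leading gap has no previous value and stays missing
            if not is_missing(x):
                last = x
    return filled

def backward_fill(values):
    """Replace each missing entry with the next observed value."""
    return forward_fill(values[::-1])[::-1]

with_arbitrary = fill_constant(values, 0.0)
with_mean = fill_constant(values, mean(observed))
with_median = fill_constant(values, median(observed))
with_mode = fill_constant(values, mode(observed))
with_ffill = forward_fill(values)
with_bfill = backward_fill(values)

print("mean:", with_mean)
print("forward fill:", with_ffill)
print("backward fill:", with_bfill)
```

Forward and backward fill are mainly sensible when the rows have a meaningful order, such as measurements taken over time.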
Plots
Box plot
• A box and whisker plot (box plot) summarizes a set of data measured on
an interval scale using five values:

Minimum: The minimum value in the given dataset.

First Quartile (Q1): The first quartile is the median of the lower half of the data set.

Median: The median is the middle value of the dataset, which divides the given dataset
into two equal parts. The median is considered as the second quartile.

Third Quartile (Q3): The third quartile is the median of the upper half of the data.

Maximum: The maximum value in the given dataset.

Apart from these five terms, the other terms used in the box plot are:

Interquartile Range (IQR): The difference between the third quartile and first quartile is
known as the interquartile range, i.e. IQR = Q3 - Q1.

Outlier: Values at the far low or far high end of the ordered data are tested as possible
outliers. Generally, an outlier lies more than a specified distance below the first
quartile or above the third quartile. A common choice for this distance is 1.5 × IQR.

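Example:
Find the maximum, minimum, median, first quartile, and third quartile for the
given data set: 23, 26, 12, 10, 15, 14, 9, 96
Solution:
Given: 23, 26, 12, 10, 15, 14, 9, 96
Arrange the given dataset in ascending order.
9, 10, 12, 14, 15, 23, 26, 96.
The value 96 lies far from the rest of the data and is treated as an outlier, so
the summary below is computed from the remaining seven values: 9, 10, 12, 14, 15, 23, 26.
Hence,
Minimum = 9
Maximum = 26 (the largest value that is not an outlier; the largest value overall is 96)
Median = 14
Outlier = 96
First Quartile = 10 (Middle value of 9, 10, 12 is 10)
Third Quartile = 23 (Middle value of 15, 23, 26 is 23).
Note: if 96 is kept in the calculation, the median of all eight values is 14.5
(the mean of 14 and 15), the first quartile is 11 and the third quartile is 24.5.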
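The worked example can be checked numerically. The sketch below is plain Python and assumes two things that these notes do not fix explicitly: quartiles are taken as the medians of the lower and upper halves of the sorted data, and a value counts as an outlier when it lies more than 1.5 × IQR beyond Q1 or Q3 (a common convention). The function names are hypothetical.

```python
def median(sorted_values):
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum (median-of-halves quartiles)."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]              # values below the median position
    upper = data[len(data) - half:]  # values above the median position
    return {
        "min": data[0],
        "q1": median(lower),
        "median": median(data),
        "q3": median(upper),
        "max": data[-1],
    }

def find_outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3 (k = 1.5 is assumed)."""
    s = five_number_summary(values)
    iqr = s["q3"] - s["q1"]
    low, high = s["q1"] - k * iqr, s["q3"] + k * iqr
    return [v for v in values if v < low or v > high]

data = [23, 26, 12, 10, 15, 14, 9, 96]

full = five_number_summary(data)    # summary of all eight values
outliers = find_outliers(data)      # values flagged by the 1.5 * IQR rule
remaining = [v for v in data if v not in outliers]
trimmed = five_number_summary(remaining)  # summary with outliers set aside

print("all values:      ", full)
print("outliers:        ", outliers)
print("without outliers:", trimmed)
```

With all eight values the summary is 9, 11, 14.5, 24.5, 96 and only 96 is flagged. With 96 set aside, the remaining seven values give minimum 9, Q1 10, median 14, Q3 23 and maximum 26, which are the figures in the solution above.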
Scatter plot
• A scatter plot is also called a scatter chart, scattergram, scatter diagram, or XY graph.

• The scatter diagram graphs pairs of numerical data, with one variable on each axis,
to show their relationship.

• It is a mathematical diagram using Cartesian coordinates to display values for
typically two variables for a set of data.

• The line drawn in a scatter plot that lies close to almost all the points in the plot is
known as the "line of best fit" or "trend line".

No. of games   3   5   2   6   7   1   2   7   1   7
Scores         80  90  75  80  90  50  65  85  40  100
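A line of best fit for the games and scores data can be sketched as follows. The notes do not say how the trend line is obtained, so ordinary least squares is assumed here. The function name `least_squares_line` is hypothetical.

```python
# Data pairs from the table above: number of games and score
games = [3, 5, 2, 6, 7, 1, 2, 7, 1, 7]
scores = [80, 90, 75, 80, 90, 50, 65, 85, 40, 100]

def least_squares_line(xs, ys):
    """Slope and intercept of the ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = least_squares_line(games, scores)
print(f"score ~ {slope:.2f} * games + {intercept:.2f}")
```

The positive slope indicates that, in this sample, scores tend to increase with the number of games.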
Histogram
• A histogram is represented by a set of rectangles, adjacent to each
other, where each bar represents one class (a range of values).

• It is an area diagram and can be defined as a set of rectangles with
bases along the intervals between class boundaries and with areas
proportional to the frequencies in the corresponding classes.
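The frequencies behind a histogram can be sketched by counting how many values fall in each class. The example reuses the ten scores from the scatter plot table in this unit. The class boundaries in `edges` are an assumed choice, and the last class is taken to include its upper boundary so that the score 100 is counted.

```python
scores = [80, 90, 75, 80, 90, 50, 65, 85, 40, 100]
edges = [40, 60, 80, 100]  # assumed class boundaries: 40-60, 60-80, 80-100

def histogram_counts(values, edges):
    """Frequency of each class [edges[i], edges[i+1]); the last class is closed."""
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            lo, hi = edges[i], edges[i + 1]
            if lo <= v < hi or (i == last and v == hi):
                counts[i] += 1
                break
    return counts

counts = histogram_counts(scores, edges)

# Text rendering: one bar per class, bar length equal to the frequency
for i, count in enumerate(counts):
    print(f"{edges[i]:>3}-{edges[i + 1]:<3} {'#' * count} ({count})")
```

Because the classes here have equal width, the bar heights are proportional to the frequencies, and so are the areas.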
Types of Histogram
Group plot
• A grouped bar plot is a type of chart that displays quantities for
different variables, grouped by another variable.
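The data layout behind a grouped bar plot can be sketched as a nested mapping: the outer key is the grouping variable and the inner keys are the variables whose quantities are shown. The sections, sports and counts below are hypothetical, and the bars are rendered as text rather than drawn with a plotting library.

```python
# Hypothetical counts of students playing each sport, grouped by class section
counts = {
    "Section A": {"Cricket": 12, "Football": 8, "Badminton": 5},
    "Section B": {"Cricket": 9, "Football": 11, "Badminton": 7},
}

def grouped_bar_lines(counts):
    """One heading per group, followed by one text bar per variable."""
    lines = []
    for group, series in counts.items():
        lines.append(group)
        for name, quantity in series.items():
            lines.append(f"  {name:<10}{'#' * quantity} {quantity}")
    return lines

lines = grouped_bar_lines(counts)
for line in lines:
    print(line)
```

Placing the bars of one group next to each other makes it easy to compare sports within a section and the same sport across sections.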