Unit 1

Unit 1

Uploaded by

shyamvinay222
Copyright
© © All Rights Reserved

BTY587: DATA ANALYSIS AND SIMULATIONS

Data pre-processing and visualization

Dr. Awadhesh K Verma

Assistant Professor (LPU)
Former Assistant Professor (DSEU, Govt. of NCT Delhi)
Former Research Scientist, 3Nano (AIC-JNUFI)
Ph.D. NanoBioPhysics & Nano-bioinformatics (CIRBSc, JMI & IIT D)
M.Tech Nanoscience (SCNS, JNU)
M.Sc. Bioinformatics (JMI)
M.Sc. Biochemistry (JMI)
Unit 1
Data pre-processing and visualization
Data preprocessing and visualization: types of data, dealing with missing data, scatter plot, histogram, group plots, box plots, dimensionality reduction
Data
• Data is a collection of facts, such as numbers, words, measurements,
observations, or descriptions of things, that is formatted in a
particular manner.
genes                   c1       c2       c3       s11      s12      s13
gene-PPP1CC_PPP1CC      3        0        22       4642     4213     4083
gene-ZNF556_ZNF556      154.17   19.7     101      97.76    202.69   182.75
gene-DHX57_DHX57        0        1        17       942      973      1048
gene-RPL7A_RPL7A        12.5     1.5      24.5     17975.5  10619    12520.5
gene-CD1B_CD1B          0        0        0        0        0        0
gene-TPCN2_TPCN2        1        0        27       688      498      568
gene-PPP1R10_PPP1R10    0        0        5.71     765.43   515.57   631
gene-RIMBP3_RIMBP3      0        0        0        2.88     2.56     0
gene-CARF_CARF          7.05     0        23.54    499.01   551.2    695.59
gene-HLA-DQA2_HLA-DQA2  0        0        0        0        0        0
gene-IFT140_IFT140      0        0        33       608      656      898
gene-CDKN2C_CDKN2C      0        0        0        385      213      282
gene-TAOK2_TAOK2        3        0        11       2481     2010     2561
gene-TRIM38_TRIM38      11.09    1        102.18   1105.03  1166     1386.33
Terms
• Data
• Programs- collections of instructions used to manipulate data.
• Data science- the field that combines knowledge of mathematics,
programming skills, domain expertise, scientific methods, algorithms,
processes, and systems to extract actionable knowledge and apply it
across a wide range of uses and domains.
• Information- classified and organized data that has some meaningful
value for its users. It is the processed data used to make decisions
and take actions.


Types of data
• Qualitative- words or sentences
• Quantitative:
1. Discrete- can only take certain values (whole numbers)
2. Continuous- can take any value (within a range)

- Discrete is counted
- Continuous is measured
• Qualitative data- These represent characteristics or attributes. They are
descriptions that can be observed but cannot be computed or calculated, and they
are more exploratory than conclusive in nature.
Examples:
1. Data on attributes such as intelligence, honesty, wisdom, cleanliness, and
creativity, collected using the students of your class as a sample, would be
classified as qualitative.
2. Reviews by customers.

• Quantitative data- These can be measured, not simply observed. They are
represented numerically, and calculations can be performed on them.
Examples:
1. Data on the number of students from your class playing different sports
gives an estimate of how many of the total students play each sport.
2. A firm's reported financial figures for a particular quarter.
• Discrete Data: These are data that can take only certain specific
values rather than a range of values. For example, data on the blood
group of a certain population or on their genders is termed as
discrete data.
• Continuous Data: These are data that can take any value within a
range. For example, the heights and weights of the students of your
school can be in decimals.
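As a rough illustration of the distinctions above, the sketch below labels a column of values as qualitative, discrete, or continuous. The function name `classify_column` and the sample columns are hypothetical, and the rule is a simplification: it looks only at the Python values, so numeric codes that stand for categories would be mislabeled.

```python
def classify_column(values):
    """Label a column as qualitative, discrete, or continuous (simplified rule)."""
    # Any text value makes the column qualitative (words or sentences).
    if any(isinstance(v, str) for v in values):
        return "qualitative"
    # Whole numbers only: the values are counted, so treat them as discrete.
    if all(float(v).is_integer() for v in values):
        return "discrete"
    # Otherwise the values can fall anywhere in a range: continuous.
    return "continuous"


# Hypothetical sample columns.
reviews = ["good", "poor", "excellent"]   # customer reviews
players = [11, 7, 5]                       # students playing each sport
heights_cm = [152.4, 160.1, 171.8]         # measured heights

for name, column in [("reviews", reviews), ("players", players), ("heights_cm", heights_cm)]:
    print(name, "->", classify_column(column))
```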
Dealing with missing data
• What is a Missing Value?
Missing data are the values that are not stored (or not present) for
one or more variables in the given dataset.
• Missing values are commonly represented by NaN (Not a Number).
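A minimal sketch of how NaN behaves in Python, using only the standard library. The `row` values are hypothetical. The point is that NaN cannot be found with an equality test, so a dedicated check such as `math.isnan` is needed.

```python
import math

nan = float("nan")

# NaN is not equal to anything, including itself, so == cannot detect it.
print(nan == nan)        # False
print(math.isnan(nan))   # True

# Hypothetical expression values for one gene across six samples.
row = [3.0, nan, 22.0, 4642.0, nan, 4083.0]

# Positions of the missing values in the row.
missing_positions = [i for i, v in enumerate(row) if math.isnan(v)]
print(missing_positions)  # [1, 4]
```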
Why Is Data Missing From The Dataset?

• There can be multiple reasons why certain values are missing from the
data.
• The reason the data is missing affects the approach to handling it.
• Some of the reasons are listed below:

1. Past data might get corrupted due to improper maintenance.
2. Observations are not recorded for certain fields, for example because
of human error while recording the values.
3. The user has intentionally not provided the values.
Why Do We Need To Care About Handling Missing Values?

• It is important to handle the missing values appropriately.

1. Many machine learning algorithms fail if the dataset contains missing
values.
2. If the missing values are not handled properly, you may end up building
a biased machine learning model that gives incorrect results.
3. Missing data can lead to a lack of precision in the statistical analysis.


Ways of handling missing data
• There are 2 primary ways of handling missing values:

1. Deleting the missing values
2. Imputing the missing values
Deleting the Missing Values
• It is one of the quickest techniques for dealing with missing values.
• If the values are Missing Not At Random (MNAR), they should not be deleted.
• If the values are Missing At Random (MAR) or Missing Completely At
Random (MCAR), they can be deleted.
• The disadvantage of this method is that one might end up deleting some useful
data from the dataset.
• There are 2 ways one can delete the missing values:
  • Deleting the entire row
  • Deleting the entire column
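The two deletion options can be sketched in plain Python as follows. The small table and the helper `is_missing` are hypothetical; a data-frame library would normally do this in one call, but the logic is the same.

```python
import math

# Hypothetical table: each inner list is a row, each position a column.
columns = ["c1", "c2", "c3"]
rows = [
    [3.0, 0.0, 22.0],
    [154.17, float("nan"), 101.0],
    [0.0, 1.0, 17.0],
]


def is_missing(value):
    """True when the value is a NaN placeholder."""
    return isinstance(value, float) and math.isnan(value)


# Option 1: delete every row that contains at least one missing value.
rows_kept = [r for r in rows if not any(is_missing(v) for v in r)]

# Option 2: delete every column that contains at least one missing value.
keep = [j for j in range(len(columns)) if not any(is_missing(r[j]) for r in rows)]
columns_kept = [columns[j] for j in keep]
rows_without_bad_columns = [[r[j] for j in keep] for r in rows]

print(rows_kept)                 # two complete rows remain
print(columns_kept)              # ['c1', 'c3']
print(rows_without_bad_columns)  # all three rows, two columns each
```

Row deletion here loses one of three observations, while column deletion loses one of three variables; which loss is acceptable depends on the dataset.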
Imputing the Missing Values
• There are different ways of replacing the missing values:
  • Replacing with an arbitrary value
  • Replacing with the mean
  • Replacing with the mode- the mode is the most frequently occurring value.
  • Replacing with the median- the median is the middle value of the sorted data.
  • Replacing with the previous value- forward fill
  • Replacing with the next value- backward fill
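Each replacement strategy listed above can be sketched with the standard library alone. The column `values` and the helper names (`fill_with`, `forward_fill`, `backward_fill`) are hypothetical and chosen only for illustration.

```python
import math
import statistics

nan = float("nan")
values = [4.0, nan, 6.0, 6.0, nan, 10.0]   # hypothetical column with two gaps

# Statistics are computed from the observed (non-missing) values only.
observed = [v for v in values if not math.isnan(v)]   # [4.0, 6.0, 6.0, 10.0]


def fill_with(constant):
    """Replace every missing value with the same constant."""
    return [constant if math.isnan(v) else v for v in values]


def forward_fill(seq):
    """Replace each gap with the last observed value before it.

    A gap at the very start has no previous value and stays NaN.
    """
    filled, last = [], nan
    for v in seq:
        if not math.isnan(v):
            last = v
        filled.append(last)
    return filled


def backward_fill(seq):
    """Replace each gap with the next observed value after it."""
    return forward_fill(seq[::-1])[::-1]


with_arbitrary = fill_with(0.0)
with_mean = fill_with(statistics.mean(observed))      # mean is 6.5
with_median = fill_with(statistics.median(observed))  # median is 6.0
with_mode = fill_with(statistics.mode(observed))      # mode is 6.0
with_ffill = forward_fill(values)    # [4.0, 4.0, 6.0, 6.0, 6.0, 10.0]
with_bfill = backward_fill(values)   # [4.0, 6.0, 6.0, 6.0, 10.0, 10.0]

print(with_mean)
print(with_ffill)
print(with_bfill)
```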
Plots
Box plot
• A box and whisker plot (box plot) is a method of summarizing a set
of data measured on an interval scale. It is drawn from the
following five values:
Minimum: The minimum value in the given dataset.

First Quartile (Q1): The median of the lower half of the dataset.

Median: The middle value of the dataset, which divides it into two equal parts. The median is also the second quartile (Q2).

Third Quartile (Q3): The median of the upper half of the dataset.

Maximum: The maximum value in the given dataset.

Apart from these five terms, the other terms used in the box plot are:

Interquartile Range (IQR): The difference between the third quartile and the first quartile, i.e. IQR = Q3 - Q1.

Outlier: A value that falls on the far left or far right of the ordered data is tested as a possible outlier. Generally, outliers lie more than a specified distance (commonly 1.5 × IQR) below the first quartile or above the third quartile.


Example:
Find the minimum, maximum, median, first quartile, and third quartile for the
given data set: 23, 26, 12, 10, 15, 14, 9, 96
Solution:
Given: 23, 26, 12, 10, 15, 14, 9, 96
Arrange the given dataset in ascending order.
9, 10, 12, 14, 15, 23, 26, 96.
The value 96 lies far above the rest of the data and is treated as an outlier.
The summary below is computed on the remaining seven values: 9, 10, 12, 14, 15, 23, 26.
Hence,
Outlier = 96
Minimum = 9
Maximum = 26 (the largest value that is not an outlier)
Median = 14
First Quartile = 10 (Middle value of 9, 10, 12 is 10)
Third Quartile = 23 (Middle value of 15, 23, 26 is 23).
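The worked example can be checked with a short script. This is a sketch under two assumptions that the slides do not spell out: quartiles are taken as medians of the lower and upper halves (excluding the middle value when the count is odd), and outliers are flagged with the common 1.5 × IQR rule. The function names are hypothetical. The quartiles in the example (10, 14, 23) match the seven values left once 96 is set aside; with all eight values included, the same rule gives slightly different quartiles.

```python
import statistics


def five_number_summary(data):
    """Five-number summary using the median-of-halves rule for quartiles."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]                  # values below the median position
    upper = ordered[len(ordered) - half:]   # values above the median position
    return {
        "min": ordered[0],
        "q1": statistics.median(lower),
        "median": statistics.median(ordered),
        "q3": statistics.median(upper),
        "max": ordered[-1],
    }


def outliers(data, k=1.5):
    """Values more than k * IQR below Q1 or above Q3 (k = 1.5 is a convention)."""
    s = five_number_summary(data)
    iqr = s["q3"] - s["q1"]
    low, high = s["q1"] - k * iqr, s["q3"] + k * iqr
    return [x for x in data if x < low or x > high]


data = [23, 26, 12, 10, 15, 14, 9, 96]
flagged = outliers(data)
print(flagged)   # [96]

# Summary of the seven values that remain once 96 is set aside.
trimmed = [x for x in data if x not in flagged]
print(five_number_summary(trimmed))
# {'min': 9, 'q1': 10, 'median': 14, 'q3': 23, 'max': 26}

# Summary of all eight values, outlier included.
print(five_number_summary(data))
# {'min': 9, 'q1': 11.0, 'median': 14.5, 'q3': 24.5, 'max': 96}
```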
Scatter plot
• A scatter plot is also called a scatter chart, scattergram, scatter diagram, or XY graph.

• It graphs pairs of numerical data, with one variable on each axis, to
show their relationship.

• It is a mathematical diagram that uses Cartesian coordinates to display values
for, typically, two variables of a dataset.

• A line drawn in a scatter plot that lies close to almost all the points in the plot is
known as the "line of best fit" or "trend line".
No. of games   3    5    2    6    7    1    2    7    1    7
Scores         80   90   75   80   90   50   65   85   40   100
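One common way to obtain a line of best fit is ordinary least squares. The sketch below applies it to the games and scores table above; the slides do not say which fitting method is intended, so least squares is an assumption here, and the variable names are illustrative.

```python
games = [3, 5, 2, 6, 7, 1, 2, 7, 1, 7]
scores = [80, 90, 75, 80, 90, 50, 65, 85, 40, 100]

n = len(games)
mean_x = sum(games) / n
mean_y = sum(scores) / n

# Least-squares slope and intercept of the line: score = intercept + slope * games.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(games, scores))
sxx = sum((x - mean_x) ** 2 for x in games)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(f"score = {intercept:.2f} + {slope:.2f} * games")
```

The slope comes out positive (a little over 6 points per game), which matches the upward trend visible in the table: more games played tends to go with higher scores.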
Histogram
• A histogram is a set of adjacent rectangles (bars), where each bar
represents one class (interval) of the data.

• It is an area diagram: a set of rectangles with bases along the
intervals between class boundaries and with areas proportional to the
frequencies in the corresponding classes.
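The counting step behind a histogram can be sketched without a plotting library. The example reuses the scores from the scatter plot table and assumes three equal-width classes (40-60, 60-80, 80-100); the class boundaries are an assumption made for illustration. With equal widths, bar height and bar area are both proportional to frequency.

```python
scores = [80, 90, 75, 80, 90, 50, 65, 85, 40, 100]

low, high, width = 40, 100, 20     # assumed class boundaries: 40-60, 60-80, 80-100
n_bins = (high - low) // width
counts = [0] * n_bins

for s in scores:
    # Each value falls in [lower, upper); the top boundary (100) goes in the last class.
    index = min((s - low) // width, n_bins - 1)
    counts[index] += 1

# Text rendering: one row per class, one '#' per observation.
for i, c in enumerate(counts):
    lower = low + i * width
    print(f"{lower:>3}-{lower + width:<3} | {'#' * c} ({c})")
```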
Types of Histogram
Group plot
• A grouped bar plot is a type of chart that displays quantities for
different variables, grouped by another variable.
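The data preparation behind a grouped bar plot can be sketched with three rows from the gene table earlier in this unit. The grouping is an assumption: columns c1-c3 are treated as one group and s11-s13 as another, which the slides do not state. Each gene would then get one cluster of two bars.

```python
# Three rows copied from the gene table; column order is c1, c2, c3, s11, s12, s13.
table = {
    "gene-DHX57_DHX57": [0, 1, 17, 942, 973, 1048],
    "gene-TPCN2_TPCN2": [1, 0, 27, 688, 498, 568],
    "gene-TAOK2_TAOK2": [3, 0, 11, 2481, 2010, 2561],
}

# Assumption: the first three columns form group "c", the last three group "s".
grouped = {
    gene: {"c": sum(v[:3]) / 3, "s": sum(v[3:]) / 3}
    for gene, v in table.items()
}

# One cluster of bars per gene, one bar per group within the cluster.
for gene, means in grouped.items():
    print(gene)
    for group, mean in means.items():
        print(f"  {group}: {mean:8.1f}")
```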
