Unit 1
Assistant Professor(LPU)
Former Assistant Professor(DSEU, Govt. Of NCT Delhi)
Former Research Scientist,3Nano (AIC-JNUFI)
Ph.D. NanoBioPhysics & Nano-bioinformatics(CIRBSc, JMI & IIT D),
M.Tech Nanoscience (SCNS, JNU),
M.Sc. Bioinformatics(JMI),
M.Sc. Biochemistry(JMI),
Data pre-processing and visualization
Data preprocessing and visualization: types of data, dealing
with missing data, scatter plot,
histogram, group plots, box plots, dimensionality reduction
Data
• Data is a collection of facts, such as numbers, words, measurements,
observations, or just descriptions of things, that is formatted in a
particular manner.
genes c1 c2 c3 s11 s12 s13
gene-PPP1CC_PPP1CC 3 0 22 4642 4213 4083
gene-ZNF556_ZNF556 154.17 19.7 101 97.76 202.69 182.75
gene-DHX57_DHX57 0 1 17 942 973 1048
gene-RPL7A_RPL7A 12.5 1.5 24.5 17975.5 10619 12520.5
gene-CD1B_CD1B 0 0 0 0 0 0
gene-TPCN2_TPCN2 1 0 27 688 498 568
gene-PPP1R10_PPP1R10 0 0 5.71 765.43 515.57 631
gene-RIMBP3_RIMBP3 0 0 0 2.88 2.56 0
gene-CARF_CARF 7.05 0 23.54 499.01 551.2 695.59
gene-HLA-DQA2_HLA-DQA2 0 0 0 0 0 0
gene-IFT140_IFT140 0 0 33 608 656 898
gene-CDKN2C_CDKN2C 0 0 0 385 213 282
gene-TAOK2_TAOK2 3 0 11 2481 2010 2561
gene-TRIM38_TRIM38 11.09 1 102.18 1105.03 1166 1386.33
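A table like the one above only becomes usable once a program reads it into a structure. The sketch below is one minimal way to do that with plain Python. It assumes the columns are separated by whitespace and that the six numeric columns (c1 to c3, s11 to s13) are per-sample measurements; the slide does not say what they represent. Only three rows are copied here for brevity.

```python
# A few rows copied from the table above; the remaining rows use the same layout.
raw_table = """\
genes c1 c2 c3 s11 s12 s13
gene-PPP1CC_PPP1CC 3 0 22 4642 4213 4083
gene-ZNF556_ZNF556 154.17 19.7 101 97.76 202.69 182.75
gene-CD1B_CD1B 0 0 0 0 0 0
"""

lines = raw_table.strip().splitlines()
header = lines[0].split()      # first line holds the column names
sample_names = header[1:]      # every column except the gene name

# Map each gene name to a {column name: value} dictionary.
expression = {}
for line in lines[1:]:
    fields = line.split()
    gene = fields[0]
    values = [float(v) for v in fields[1:]]
    expression[gene] = dict(zip(sample_names, values))

print(expression["gene-PPP1CC_PPP1CC"]["s11"])  # 4642.0
```

Storing every value as a float is a simplification, since the table mixes whole numbers and decimals.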
Terms
• Data
• Programs- collections of instructions used to manipulate data
• Data science- the field that combines knowledge of mathematics,
programming skills, domain expertise, scientific methods, algorithms,
processes, and systems to extract actionable knowledge and apply it
across a wide range of uses and domains.
• Information- classified and organized data that has some meaningful
value for the users.
• It is the processed data used to make decisions and take actions.
Types of data
• Qualitative- words or sentences
• Quantitative- numbers
-Discrete is counted
-Continuous is measured
• Qualitative data- They represent some characteristics or attributes. They depict
descriptions that may be observed but cannot be computed or calculated. They
are more exploratory than conclusive in nature.
Example-1. Data on attributes such as intelligence, honesty, wisdom,
cleanliness, and creativity, collected using the students of your class as a sample,
would be classified as qualitative.
2. Reviews by customers
• Quantitative data- These can be measured and not simply observed. They can be
numerically represented and calculations can be performed on them.
Example-1. Data on the number of students from your class playing different sports
gives an estimate of how many students play each sport.
2. A firm reports financial numbers for a particular quarter.
• Discrete Data: These are data that can take only certain specific
values rather than a range of values. For example, data on the blood
group of a certain population or on their genders is termed as
discrete data.
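• Continuous Data: These are data that can take any value within a
range, including fractional values. For example, the heights and
weights of the students of your school can be in decimals.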
Dealing with missing data
• What is a Missing Value?
Missing data is defined as values that are not stored (or not
present) for one or more variables in the given dataset.
• Missing values are commonly represented by NaN (Not a Number).
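As a small illustration of how NaN markers can be located, the sketch below uses only the Python standard library. The height readings are hypothetical values made up for this example.

```python
import math

# Hypothetical readings; float("nan") marks values that were not recorded.
heights_cm = [162.5, float("nan"), 171.0, 158.2, float("nan")]

# NaN is not equal to anything, including itself, so test with math.isnan()
# rather than with ==.
missing_positions = [i for i, v in enumerate(heights_cm) if math.isnan(v)]

print(missing_positions)       # [1, 4]
print(len(missing_positions))  # 2
```

When the data is held in a pandas DataFrame, `DataFrame.isna()` performs the same check for every cell.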
Why Is Data Missing From The Dataset?
• There can be multiple reasons why certain values are missing from the
data.
• The reason the data is missing affects how the missing data should be
handled.
• Some of the reasons are listed below:
2. You may end up building a biased machine learning model which will
lead to incorrect results if the missing values are not handled properly.
• If the missing value is of the type Missing Not At Random (MNAR), then it should not be
deleted.
• The disadvantage of deleting missing values is that one might end up deleting some
useful data from the dataset.
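The cost of deletion can be seen in a small sketch. The records below are hypothetical, and dropping every row that contains a NaN is only one possible deletion strategy (often called listwise deletion).

```python
import math

# Hypothetical records: (student, height_cm, weight_kg); NaN marks a missing value.
records = [
    ("A", 162.5, 55.0),
    ("B", float("nan"), 61.2),
    ("C", 171.0, float("nan")),
    ("D", 158.2, 49.8),
]

def is_complete(record):
    """Return True when no numeric field in the record is NaN."""
    return not any(isinstance(v, float) and math.isnan(v) for v in record)

# Keep only the rows without any missing value.
complete_records = [r for r in records if is_complete(r)]
dropped = len(records) - len(complete_records)

print(complete_records)  # [('A', 162.5, 55.0), ('D', 158.2, 49.8)]
print(dropped)           # 2
```

Rows B and C are removed entirely, so the valid weight of B and the valid height of C are lost as well. In pandas, `DataFrame.dropna()` applies the same kind of row deletion.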
Box plots
First Quartile (Q1): The first quartile is the median of the lower half of the data set.
Median: The median is the middle value of the dataset, which divides the given dataset
into two equal parts. The median is considered as the second quartile.
Third Quartile (Q3): The third quartile is the median of the upper half of the data.
Apart from these five terms, the other terms used in the box plot are:
Interquartile Range (IQR): The difference between the third quartile and first quartile is
known as the interquartile range, i.e. IQR = Q3 - Q1.
Outlier: Data that falls on the far left or right side of the ordered data is tested as a
possible outlier. Generally, outliers fall more than a specified distance (commonly
1.5 × IQR) below the first quartile or above the third quartile.
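The quartile definitions above can be followed step by step in code. This is a minimal sketch using the scores from the games and scores table in this unit as sample data. It takes each quartile as the median of one half of the sorted data, as described above; other tools may use slightly different quartile conventions. The 1.5 × IQR cut-off for outliers is the common convention and is an assumption here.

```python
from statistics import median

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum, with quartiles as medians of the halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]          # lower half of the sorted data
    upper = data[half + n % 2:]  # upper half; the middle value is skipped when n is odd
    return min(data), median(lower), median(data), median(upper), max(data)

scores = [80, 90, 75, 80, 90, 50, 65, 85, 40, 100]
minimum, q1, med, q3, maximum = five_number_summary(scores)

iqr = q3 - q1                  # interquartile range
lower_fence = q1 - 1.5 * iqr   # assumed 1.5 x IQR rule
upper_fence = q3 + 1.5 * iqr
outliers = [s for s in scores if s < lower_fence or s > upper_fence]

print(q1, med, q3, iqr)  # 65 80.0 90 25
print(outliers)          # []
```

For these scores the fences are 27.5 and 127.5, so no value is flagged as an outlier.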
Scatter plot
• The scatter diagram graphs pairs of numerical data, with one variable on each axis,
to show their relationship.
• The line drawn in a scatter plot that lies close to almost all the points in the plot is
known as the "line of best fit" or "trend line".
No. of games 3 5 2 6 7 1 2 7 1 7
Scores 80 90 75 80 90 50 65 85 40 100
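One common way to obtain a trend line for the table above is an ordinary least-squares fit. The slide does not say how the line should be drawn, so least squares is an assumption in this sketch. The plotting helper is optional, requires matplotlib, and is not called here.

```python
games = [3, 5, 2, 6, 7, 1, 2, 7, 1, 7]
scores = [80, 90, 75, 80, 90, 50, 65, 85, 40, 100]

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = least_squares_line(games, scores)
print(f"score = {slope:.2f} * games + {intercept:.2f}")
# score = 6.27 * games + 49.78

def plot_scatter_with_trend(xs, ys, slope, intercept):
    """Optional helper: draw the points and the fitted line (needs matplotlib)."""
    import matplotlib.pyplot as plt
    plt.scatter(xs, ys)
    x_ends = [min(xs), max(xs)]
    plt.plot(x_ends, [slope * x + intercept for x in x_ends])
    plt.xlabel("No. of games")
    plt.ylabel("Scores")
    plt.show()
```

The positive slope suggests that, in this small sample, scores tend to rise with the number of games.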
Histogram
• The histogram is represented by a set of rectangles, adjacent to each
other, where each bar represents a range of data values and its height
shows how many observations fall in that range.
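The counting behind a histogram can be sketched in a few lines. The example below reuses the scores from the games and scores table. The bin width of 20 starting at 40 is an arbitrary choice for illustration, and the function assumes every value lies between the first and last bin edge.

```python
scores = [80, 90, 75, 80, 90, 50, 65, 85, 40, 100]

def histogram_counts(values, start, width, n_bins):
    """Count values per bin; the last bin also includes its right edge.

    Assumes start <= value <= start + width * n_bins for every value.
    """
    counts = [0] * n_bins
    for v in values:
        index = min(int((v - start) // width), n_bins - 1)
        counts[index] += 1
    return counts

counts = histogram_counts(scores, start=40, width=20, n_bins=3)
print(counts)  # [2, 2, 6]

# Text rendering: one row per bin, one '#' per observation.
for i, count in enumerate(counts):
    low = 40 + 20 * i
    print(f"{low}-{low + 20}: {'#' * count}")
```

Each row of the printout corresponds to one rectangle of the histogram, with the row length playing the role of the bar height.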