6 metrics data mining

Uploaded by

saharsh0812


Metrics

Metric scales

Scale type | Properties | Transformation | Examples
Interval | =, <, >; addition, subtraction; no ratios; arbitrary zero point | P = xP' + y | Temperatures, date and time
Ratio | Absolute zero point; all arithmetic operations possible | P = xP' | Weight, height, length
Absolute | Simple count values | P = P' | Number of faults encountered in testing
Analyzing the Metric Data
• The role of statistics is to function as a tool in
analyzing research data and drawing conclusions
from it.
• The research data must be suitably reduced to be
read easily and used for further analysis.
• Descriptive statistics concern development of
certain indices or measures to summarize data.
• Data can be summarized by using measures of
central tendency (mean, median and mode) and
measures of dispersion (standard deviation,
variance, quartile).
Measures of central tendency
• Measures of central tendency include mean,
median and mode.
• These measures are known as measures of central
tendency as they give us the idea about the central
values of the data around which all the other data
points have a tendency to gather.
• The mean is the arithmetic average of the data set: the sum of all values divided by the number of values N. It is given as:
$$\mathrm{Mean}(\mu) = \frac{\sum_{i=1}^{N} x_i}{N}$$
Analyzing the Metric Data
• The median is the middle value of the ordered data set: half of the data points lie below it and half lie above it. It is calculated as the ((n + 1)/2)th value of the ordered data set, where n is the number of data points in the data set; when n is even, the two middle values are averaged.
• The mode is the most frequently occurring value in the data set.
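As a minimal sketch, the three measures can be computed with Python's standard `statistics` module. The values used are the cyclomatic complexity figures from the table later in these notes; the helper `middle_value` is a hypothetical name that spells out the ((n + 1)/2)th-value rule for the median.

```python
from statistics import mean, median, mode

# Cyclomatic complexity values from the table later in these notes.
cyclomatic_complexity = [26, 20, 14, 10, 15, 11, 14, 10, 13, 7, 10, 6, 9, 10, 12, 8, 7]


def middle_value(values):
    """Median by position: the ((n + 1)/2)th ordered value for odd n,
    or the average of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


cc_mean = mean(cyclomatic_complexity)      # sum of the values divided by N
cc_median = median(cyclomatic_complexity)  # middle value of the ordered data
cc_mode = mode(cyclomatic_complexity)      # most frequently occurring value

print(round(cc_mean, 5), cc_median, cc_mode)
```

For this variable the median and the mode coincide, while the mean is pulled upward by the two largest values.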
Choice of Measures of Central Tendency
The choice of selecting a measure of central tendency
depends upon:
1. the scale type at which the data is measured
2. the distribution of the data (left skewed, symmetrical, right skewed)

Measure | Relevant scale type
Mean | Interval and ratio data which are not skewed.
Median | Ordinal, interval and ratio, but not useful for ordinal scales having few values.
Mode | All scale types, but not useful for scales having multiple values.
[Figure: graphs representing skewed and symmetrical distributions]
Measures of dispersion
• Measures of dispersion include standard deviation, variance and quartiles. Measures of dispersion tell us how the data is scattered or spread. The standard deviation measures the typical distance of the data points from the mean. If most of the data points are far away from the mean, then the standard deviation of the variable is large. The standard deviation is calculated as given below:
$$\sigma_x = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$
Measures of dispersion
• Variance is a measure of variability and is the square of the standard deviation.
• Quartiles divide the metric data into four equal parts. To calculate the quartiles, the data is first arranged in ascending order.
• Twenty-five percent of the metric data is below the lower quartile (25th percentile), fifty percent of the metric data is below the median value and seventy-five percent of the metric data is below the upper quartile (75th percentile).

[Figure: the ordered data divided into four parts (1st to 4th) by the lower quartile, the median and the upper quartile]
Measures of dispersion
• The lower quartile (Q1) is computed by:
– finding the median of the data set
– finding the median of the lower half of the data set
• The upper quartile (Q3) is computed by:
– finding the median of the data set
– finding the median of the upper half of the data set
Measures of dispersion
• Interquartile range = Q3 - Q1

Example: Consider the data set consisting of lines of source code (SLOC) given in example 8.2. Calculate the standard deviation, variance and quartiles for it.
107, 128, 186, 222, 322, 466, 657, 706, 790, 844, 1129, 1280, 1411, 1532, 1824, 1882, 3442
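The SLOC example can be worked through with a short Python sketch. It uses the divide-by-N form of the standard deviation given in these notes. The function names are hypothetical, and one assumption is made explicit: when n is odd, the overall median is counted in both halves, which is the convention that reproduces the quartiles (56 and 202) quoted in the LOC box plot example later in these notes.

```python
from math import sqrt
from statistics import median

sloc = [107, 128, 186, 222, 322, 466, 657, 706, 790, 844, 1129,
        1280, 1411, 1532, 1824, 1882, 3442]


def population_variance(values):
    """Mean squared distance from the mean (divide by N)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def quartiles(values):
    """Return (Q1, median, Q3) as medians of the lower and upper halves.
    Assumption: for odd n the overall median belongs to both halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2
    lower, upper = ordered[:half], ordered[n - half:]
    return median(lower), median(ordered), median(upper)


variance = population_variance(sloc)
std_dev = sqrt(variance)            # standard deviation is the square root of variance
q1, q2, q3 = quartiles(sloc)
iqr = q3 - q1                       # interquartile range = Q3 - Q1

print(f"std dev = {std_dev:.2f}, variance = {variance:.2f}")
print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
```

Other quartile conventions (for example, excluding the median from both halves) give slightly different Q1 and Q3 values for the same data, so the convention should be stated alongside the results.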
Metric Data Distribution
• To understand the metric data, the starting point is to analyze the shape of the distribution of the data.
• A number of methods are available to analyze the shape of the data. One of them is the histogram, through which a researcher can gain insight into the normality of the data.
• A histogram is a graphical representation of the frequency of occurrence of the values of a given variable.
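A histogram is just a frequency count per interval, which the following sketch makes concrete for the SLOC values of the earlier example. The bin width of 1000 is an assumption chosen for illustration, not the width used in the original figure, and the `histogram` helper is a hypothetical name.

```python
from collections import Counter

sloc = [107, 128, 186, 222, 322, 466, 657, 706, 790, 844, 1129,
        1280, 1411, 1532, 1824, 1882, 3442]

BIN_WIDTH = 1000  # assumed bin width, for illustration only


def histogram(values, bin_width):
    """Count how many (non-negative) values fall in each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    last_start = max(counts)
    # Include empty bins so gaps in the distribution stay visible.
    return {start: counts.get(start, 0)
            for start in range(0, last_start + bin_width, bin_width)}


freq = histogram(sloc, BIN_WIDTH)
for start, count in freq.items():
    print(f"{start:>4}-{start + BIN_WIDTH - 1:<4} {'#' * count} ({count})")
```

With this bin width most values fall in the first two bins and a single value sits far to the right, which is the right-skewed shape discussed in these notes.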

Metric Data Distribution
[Figure: histogram of SLOC (x-axis: SLOC, 0 to 4000; y-axis: Frequency)]
Metric Data Distribution
• The bars show the frequency of the values of the LOC metric.
• The normal curve is superimposed on the distribution of values to determine the normality of the LOC data values.
• In practice, most metric data is left skewed or right skewed. The normal curve is not applicable to discrete data (nominal or ordinal).
• For example, the classes may be faulty or non-faulty; thus the distribution will not be normal.
Metric Data Distribution
• The measures of central tendency (mean, median and mode) are all equal for a normal curve.
• A normal curve is bell shaped, and about 99.7% of the data lies within three standard deviations of the mean (about 95% lies within two).
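The share of a normal distribution that lies within k standard deviations of the mean can be checked numerically. The sketch below uses the standard identity P(|X - μ| ≤ kσ) = erf(k/√2); the function name `fraction_within` is hypothetical.

```python
from math import erf, sqrt


def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {fraction_within(k):.4f}")
```

This gives roughly 68%, 95% and 99.7% for one, two and three standard deviations respectively.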
Metric Data Distribution
• Consider the data sets given for three metric variables in the table below. Determine the normality of these variables.

Fault count | Cyclomatic complexity | Branch count
470.00 | 26.00 | 826.00
128.00 | 20.00 | 211.00
268.00 | 14.00 | 485.00
19.00 | 10.00 | 29.00
404.00 | 15.00 | 405.00
127.00 | 11.00 | 240.00
263.00 | 14.00 | 464.00
94.00 | 10.00 | 187.00
207.00 | 13.00 | 344.00
42.00 | 7.00 | 83.00
24.00 | 10.00 | 47.00
94.00 | 6.00 | 163.00
34.00 | 9.00 | 67.00
286.00 | 10.00 | 503.00
104.00 | 12.00 | 175.00
82.00 | 8.00 | 147.00
20.00 | 7.00 | 39.00
Metric Data Distribution
[Figure: histogram of Fault_count (x-axis: 0 to 500; y-axis: Frequency)]
[Figure: histogram of Cyclomatic_complexity (x-axis: 5 to 30; y-axis: Frequency)]
[Figure: histogram of Branch_count (x-axis: 0 to 1000; y-axis: Frequency)]
Metric Data Distribution

Metric | Mean | Median | Std. Deviation
Fault count | 156.8235 | 108 | 137.7599
Cyclomatic complexity | 11.88235 | 10 | 5.035901
Branch count | 259.7059 | 187 | 216.9976
Metric Data Distribution
• The normality of the data can also be checked by comparing the mean and the median.
• The mean, median and mode should be the same for a normal distribution.
• The median is noticeably less than the mean for the variables fault count and branch count.
• Thus, these variables are not normally distributed.
• However, the mean and median of cyclomatic complexity do not differ by a large amount.
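The summary statistics can be recomputed from the 17 rows of the data table with a short sketch. Two observations, offered with some caution: the tabulated standard deviations match `statistics.stdev`, which divides by n - 1 (the sample form) rather than by N; and for the fault count values as transcribed in these notes, the computed median is 104 rather than the 108 listed in the summary table.

```python
from statistics import mean, median, stdev

# Values transcribed from the data table in these notes (17 cases per variable).
metrics = {
    "Fault count": [470, 128, 268, 19, 404, 127, 263, 94, 207,
                    42, 24, 94, 34, 286, 104, 82, 20],
    "Cyclomatic complexity": [26, 20, 14, 10, 15, 11, 14, 10, 13,
                              7, 10, 6, 9, 10, 12, 8, 7],
    "Branch count": [826, 211, 485, 29, 405, 240, 464, 187, 344,
                     83, 47, 163, 67, 503, 175, 147, 39],
}

# stdev() is the sample standard deviation (divides by n - 1).
summary = {name: (mean(v), median(v), stdev(v)) for name, v in metrics.items()}

for name, (m, med, sd) in summary.items():
    # A mean well above the median suggests a right-skewed variable.
    print(f"{name}: mean={m:.4f} median={med} sd={sd:.4f} mean-median={m - med:.4f}")
```

The gap between mean and median is large relative to the spread for fault count and branch count, and small for cyclomatic complexity, which is the comparison the bullets above rely on.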
Outlier Analysis
• Data points, which are located in an empty part of the
sample space, are called outliers.
• These are the data values that are numerically distant
from the rest of the data.
• For example, suppose one calculates the average
weight of 10 students in a class, where most are
between 51 pounds and 61 pounds, but the weight of
one student is 210 pounds. In this case, the mean will
be 72 pounds and the median will be 58.
Outlier Analysis
• Hence, the median reflects the weight of the students better than the mean does.
• Thus, the data point with the value 210 is an outlier; that is, it is located away from the other values in the data set.
• Outlier analysis is done to find data points that are overly influential; identifying them, and removing them where justified, is essential.
Outlier Analysis
• Once the outliers are identified, a decision about the inclusion or exclusion of each outlier must be made.
• The decision depends upon the reason why the case is identified as an outlier.
• There are three types of outliers: univariate, bivariate and multivariate.
• Univariate outliers are exceptional values that occur within a single variable.
• Bivariate outliers occur within the combination of two variables, and multivariate outliers are present within the combination of more than two variables.
Outlier Analysis
• Box plots and scatter plots are two popular methods used for univariate and bivariate outlier detection.
• Box plots are based on the median and quartiles. The upper and lower quartile statistics are used to construct a box plot.
• The median is the middle value of the data set: half of the values are less than this value and half of the values are greater than this value.
Outlier Analysis

[Figure: anatomy of a box plot, showing the start of the tail, the lower quartile, the median, the upper quartile and the end of the tail]
Outlier Analysis
• The box starts at the lower quartile and ends at the upper quartile.
• The distance between the lower and the upper quartile is called the box length.
• The tails of the box plot specify the bounds between which all the data points must lie.
• The start of the tail is Q1 - 1.5 × IQR and the end of the tail is Q3 + 1.5 × IQR.
Outlier Analysis
• These values are truncated to the nearest values of the actual data points in order to avoid negative values.
• Thus, the actual start of the tail is the lowest value in the variable above (Q1 - 1.5 × IQR) and the actual end of the tail is the highest value below (Q3 + 1.5 × IQR).
• Any value outside the start of the tail and the end of the tail is an outlier. These data points must be identified, as they are unusual occurrences of data values that must be considered for inclusion or exclusion.
Outlier Analysis
• The box plots also tell us whether the data is skewed or
not.
• If the data is not skewed the median will lie in the center
of the box.
• If the data is left or right skewed, then the median will be
away from the center.
Outlier Analysis
For example, consider the LOC values for a sample project given below:
17, 25, 36, 48, 56, 62, 78, 82, 103, 140, 162, 181, 202, 251, 310, 335, 508
Outlier Analysis
• The median of the data set is 103, the lower quartile is 56 and the upper quartile is 202. The interquartile range is 202 - 56 = 146.
• The start of the tail is 56 - 1.5 × 146 = -163 and the end of the tail is 202 + 1.5 × 146 = 421.
• The actual start of the tail is the lowest value above -163, i.e. 17, and the actual end of the tail is the highest value below 421, i.e. 335.
• Thus, case number 17, with value 508, is above the end of the tail and hence is an outlier.
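The LOC calculation can be reproduced with a small sketch. The function name `box_plot_summary` and the returned keys are hypothetical, and the quartiles are taken as medians of the lower and upper halves with the overall median counted in both halves for odd n, an assumption that matches the quartiles 56 and 202 quoted above.

```python
from statistics import median

loc = [17, 25, 36, 48, 56, 62, 78, 82, 103, 140, 162, 181, 202, 251,
       310, 335, 508]


def box_plot_summary(values):
    """Quartiles, tail bounds and outliers for a box plot (1.5 x IQR rule)."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2                      # median is shared by both halves if n is odd
    q1 = median(ordered[:half])
    q3 = median(ordered[n - half:])
    iqr = q3 - q1
    start, end = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in ordered if start <= x <= end]
    return {
        "q1": q1, "median": median(ordered), "q3": q3, "iqr": iqr,
        "start": start, "end": end,
        # Tails are truncated to the nearest actual data points.
        "actual_start": inside[0], "actual_end": inside[-1],
        # (case number, value) pairs, with cases numbered from 1 in input order.
        "outliers": [(i + 1, x) for i, x in enumerate(values)
                     if x < start or x > end],
    }


result = box_plot_summary(loc)
print(result)
```

The sketch reports the same bounds as the worked example and flags case 17 (value 508) as the only outlier.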
Outlier Analysis
[Figure: box plot of LOC (axis: 0 to 600); case 17 is marked as an outlier]
Example:
Consider the data set of fault count, cyclomatic complexity and branch count given in the earlier example. Construct box plots and identify univariate outliers for all the variables in the data set.
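One way to approach this exercise is sketched below using `statistics.quantiles` from the Python standard library. The choice of `method="inclusive"` is an assumption: it is the setting that reproduces the quartiles of the LOC example (56 and 202), but other quartile conventions could move the tail bounds slightly.

```python
from statistics import quantiles

# Values transcribed from the data table in these notes (17 cases per variable).
metrics = {
    "Fault_count": [470, 128, 268, 19, 404, 127, 263, 94, 207,
                    42, 24, 94, 34, 286, 104, 82, 20],
    "Cyclomatic_complexity": [26, 20, 14, 10, 15, 11, 14, 10, 13,
                              7, 10, 6, 9, 10, 12, 8, 7],
    "Branch_count": [826, 211, 485, 29, 405, 240, 464, 187, 344,
                     83, 47, 163, 67, 503, 175, 147, 39],
}


def univariate_outliers(values):
    """Return (case number, value) pairs outside Q1 - 1.5*IQR .. Q3 + 1.5*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    start, end = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [(case, x) for case, x in enumerate(values, start=1)
            if x < start or x > end]


outliers = {name: univariate_outliers(v) for name, v in metrics.items()}
for name, found in outliers.items():
    print(name, found)
```

Under these assumptions only case 1 of cyclomatic complexity (value 26) falls outside the tails, which appears to agree with the box plots, where a single case is labeled for that variable and none for the other two.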
Outlier Analysis
[Figure: box plot of Fault_count (axis: 0 to 500)]
[Figure: box plot of Cyclomatic_complexity (axis: 5 to 30); case 1 is marked as an outlier]
[Figure: box plot of Branch_count (axis: 0 to 800)]
• The outliers should be analyzed by the researchers to make a decision about their inclusion or exclusion in the data analysis.
• There may be many reasons for an outlier:
– an error in the entry of the data
– some extra information that represents an extraordinary or unusual event
– an extraordinary event that is unexplained by the researcher.
• Outlier values may be present due to a combination of data values across more than one variable.
• These outliers are called multivariate outliers.
• A scatter plot is another visualization method to detect outliers.
• In a scatter plot, we simply represent all the data points graphically.
• The scatter plot allows us to examine more than one metric variable at a given time.
Exploring Analysis
• The metric variables can be of two types: independent variables and dependent (target) variables.
• The effect of the independent variables on the dependent variable can be explored by using various statistical and machine learning methods.
• The choice of method for analyzing the relationship between the independent and dependent variables depends upon the type of the dependent variable (continuous or discrete).
