6 metrics data mining
$$\sigma_x = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$
Metrics
Measures of dispersion
• Variance is a measure of variability and is the square of
the standard deviation.
• Quartiles divide the metric data into four equal parts.
To calculate quartiles, the data is first arranged in
ascending order.
• Twenty-five percent of the metric data lies below the
lower quartile (25th percentile), fifty percent lies below
the median, and seventy-five percent lies below the upper
quartile (75th percentile).
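As a minimal sketch, the dispersion measures above can be computed with Python's standard `statistics` module. The sample values are the LOC data from the outlier example later in this section. Using the population variance (division by N) and the `inclusive` quartile method is an assumption, since the slides do not say which convention they follow.

```python
import statistics

# LOC values from the outlier example later in this section
loc = [17, 25, 36, 48, 56, 62, 78, 82, 103, 140, 162, 181, 202, 251, 310, 335, 508]

# Population variance and standard deviation (divide by N, an assumed convention)
variance = statistics.pvariance(loc)
std_dev = statistics.pstdev(loc)

# quantiles() sorts the data itself; n=4 returns the three quartile cut points
q1, median, q3 = statistics.quantiles(loc, n=4, method="inclusive")

print(f"variance = {variance:.2f}, standard deviation = {std_dev:.2f}")
print(f"lower quartile = {q1}, median = {median}, upper quartile = {q3}")
```

Other quartile conventions (for example the `exclusive` method) can return slightly different cut points for the same data.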
[Figure: histogram of the SLOC metric; x-axis: SLOC (0 to 4000), y-axis: Frequency]
Metric Data Distribution
• The bars show the frequency of values of the LOC metric.
• A normal curve is superimposed on the distribution to
judge how close the LOC values are to normal.
• Most metric data is skewed to the left or to the right.
The normal curve does not apply to discrete (nominal or
ordinal) data.
• For example, a class is either faulty or non-faulty, so
the distribution of such a variable cannot be normal.
• The measures of central tendency (mean, median and
mode) are all equal for a normal curve.
• A normal curve is bell shaped; about 95% of the data
lies within two standard deviations of the mean and about
99.7% within three.
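The share of a normal distribution that falls within a given number of standard deviations of the mean can be computed rather than memorised. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `share_within` is only illustrative.

```python
from statistics import NormalDist

def share_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    standard = NormalDist(mu=0.0, sigma=1.0)
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```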
• Consider the data sets given for three metric variables
in the table below. Determine the normality of these
variables.

Fault count   Cyclomatic complexity   Branch count
470.00        26.00                   826.00
128.00        20.00                   211.00
268.00        14.00                   485.00
 19.00        10.00                    29.00
404.00        15.00                   405.00
127.00        11.00                   240.00
263.00        14.00                   464.00
 94.00        10.00                   187.00
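One informal way to approach this exercise is to compare the mean with the median and to compute the skewness of each variable. The sketch below does this in plain Python. The moment-based skewness formula and the function name `skewness` are choices made for this example, and with only eight observations per variable the result is indicative at best.

```python
import statistics

# Columns of the table above
data = {
    "Fault count": [470, 128, 268, 19, 404, 127, 263, 94],
    "Cyclomatic complexity": [26, 20, 14, 10, 15, 11, 14, 10],
    "Branch count": [826, 211, 485, 29, 405, 240, 464, 187],
}

def skewness(values):
    """Moment-based skewness: third central moment over (second central moment) ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

for name, values in data.items():
    mean = statistics.mean(values)
    median = statistics.median(values)
    # For a normal curve, mean and median coincide and the skewness is 0
    print(f"{name}: mean={mean:.3f}, median={median:.1f}, skewness={skewness(values):.3f}")
```

A mean well above the median together with a positive skewness points to a right tail rather than a symmetric bell shape.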
[Figures: histograms of Fault_count (x-axis 0.00 to 500.00), Cyclomatic_complexity (x-axis 5.00 to 30.00) and Branch_count (x-axis 0.00 to 1000.00); y-axis: Frequency]
[Figure: parts of a box plot, labelled from left to right: start of the tail, lower quartile, median, upper quartile, end of the tail]
Outlier Analysis
• The box starts at the lower quartile (Q1) and ends at the
upper quartile (Q3).
• The distance between the lower and the upper quartile is
called the box length, or interquartile range (IQR).
• The tails of the box plot specify the bounds between
which all data points other than outliers must lie.
• The start of the tail is Q1 - 1.5 × IQR and the end of the
tail is Q3 + 1.5 × IQR.
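The tail formula can be written as a small helper. This is a sketch; the function name `tail_bounds` is hypothetical, and the quartiles in the usage line (56 and 202) are the ones from the LOC example later in this section.

```python
def tail_bounds(q1, q3, k=1.5):
    """Return (start of the tail, end of the tail) before truncation to actual data points."""
    iqr = q3 - q1  # box length
    return q1 - k * iqr, q3 + k * iqr

start, end = tail_bounds(56, 202)
print(start, end)
```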
• These values are truncated to the nearest actual data
points, which avoids bounds (such as negative values) that
the metric cannot take.
• Thus, the actual start of the tail is the lowest value in
the variable above (Q1 - 1.5 × IQR), and the actual end of
the tail is the highest value below (Q3 + 1.5 × IQR).
• Any value outside the start and the end of the tail is an
outlier. Such data points must be identified because they
are unusual occurrences of data values, and a decision must
be made whether to include or exclude them.
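A sketch of the truncation step: given the computed bounds, pick the actual tail ends from the data and collect everything outside them. The function name `truncate_tails` and the small data set in the usage line are made up for illustration, and values exactly equal to a bound are treated as inside, which is an assumption.

```python
def truncate_tails(values, start, end):
    """Return (actual start of the tail, actual end of the tail, outliers)."""
    inside = [v for v in values if start <= v <= end]
    outliers = [v for v in values if v < start or v > end]
    return min(inside), max(inside), outliers

# Illustrative values only (not taken from a real project)
actual_start, actual_end, outliers = truncate_tails([3, 8, 9, 12, 40], start=-1.0, end=18.0)
print(actual_start, actual_end, outliers)
```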
• Box plots also show whether the data is skewed.
• If the data is not skewed, the median lies in the center
of the box.
• If the data is left or right skewed, the median lies away
from the center.
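The position of the median inside the box can be checked numerically. The sketch below compares the two halves of the box; the function name `box_skew` and its return labels are assumptions of this example. The usage line takes the quartiles 56, 103 and 202 of the LOC example later in this section.

```python
def box_skew(q1, median, q3):
    """Classify skew from where the median sits between the quartiles."""
    lower_half = median - q1   # distance from lower quartile to median
    upper_half = q3 - median   # distance from median to upper quartile
    if lower_half == upper_half:
        return "not skewed"
    # A longer upper half means the values stretch out towards the right
    return "right skewed" if upper_half > lower_half else "left skewed"

print(box_skew(56, 103, 202))
```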
For example, consider the LOC values for a sample project
given below:
17, 25, 36, 48, 56, 62, 78, 82, 103, 140, 162, 181, 202, 251, 310, 335, 508
• The median of the data set is 103, the lower quartile is 56
and the upper quartile is 202. The interquartile range is
202 - 56 = 146.
• The start of the tail is 56 - 1.5 × 146 = -163 and the end
of the tail is 202 + 1.5 × 146 = 421.
• The actual start of the tail is the lowest value above
-163, i.e. 17, and the actual end of the tail is the highest
value below 421, i.e. 335.
• Thus case number 17, with value 508, lies above the end
of the tail and hence is an outlier.
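The worked example can be reproduced end to end. The sketch below assumes the `inclusive` quartile method of Python's `statistics.quantiles`, which gives the same quartiles as the slide for this data set; other quartile conventions can give slightly different values.

```python
import statistics

loc = [17, 25, 36, 48, 56, 62, 78, 82, 103, 140, 162, 181, 202, 251, 310, 335, 508]

q1, median, q3 = statistics.quantiles(loc, n=4, method="inclusive")
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Truncate the computed bounds to actual data points
inside = [v for v in loc if low <= v <= high]
actual_start, actual_end = min(inside), max(inside)

# Case numbers are 1-based positions in the list, as on the slide
outlier_cases = [(case, v) for case, v in enumerate(loc, start=1) if v < low or v > high]

print(f"IQR = {iqr}, tail bounds = ({low}, {high})")
print(f"actual tail = ({actual_start}, {actual_end}), outliers = {outlier_cases}")
```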
[Figures: box plots of LOC, Fault_count, Cyclomatic_complexity (axis 5 to 30) and Branch_count; points labelled with case numbers 17 and 1 are marked outside the tails]