6 metrics data mining

Uploaded by

saharsh0812
Copyright
© © All Rights Reserved
Metrics

Metric scale types:
Scale | Characteristics | Transformation | Examples
Interval | =, <, >; no ratios; addition and subtraction; arbitrary zero point | P = xP’ + y | Temperatures, date and time
Ratio | Absolute zero point; all arithmetic operations possible | P = xP’ | Weight, height, length
Absolute | Simple count values | P = P’ | Number of faults encountered in testing
Analyzing the Metric Data
• The role of statistics is to function as a tool in
analyzing research data and drawing conclusions
from it.
• The research data must be suitably reduced to be
read easily and used for further analysis.
• Descriptive statistics concern the development of
certain indices or measures to summarize data.
• Data can be summarized by using measures of
central tendency (mean, median and mode) and
measures of dispersion (standard deviation,
variance and quartiles).
Measures of central tendency
• Measures of central tendency include mean,
median and mode.
• These measures are known as measures of central
tendency as they give us the idea about the central
values of the data around which all the other data
points have a tendency to gather.
• Mean can be computed by taking average values of
the data set and is given as:
\[ \text{Mean}(\mu) = \frac{1}{N} \sum_{i=1}^{N} x_i \]
• Median gives the middle value in the data set, which
means half of the data points are below the median
value and half are above it. It is calculated as the
((n + 1)/2)th value of the ordered data set, where n is
the number of data points in the data set.
• The most frequently occurring value in the data set
is called the mode.
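The three measures can be sketched in a few lines of Python. This is a minimal illustration, assuming the cyclomatic complexity values from the example table later in this section; the variable names are hypothetical.

```python
import statistics

# Cyclomatic complexity values from the example table in this section.
complexity = [26, 20, 14, 10, 15, 11, 14, 10, 13, 7, 10, 6, 9, 10, 12, 8, 7]

# Mean: sum of all values divided by the number of values N.
mean = sum(complexity) / len(complexity)

# Median: middle value of the ordered data.
ordered = sorted(complexity)
n = len(ordered)
if n % 2 == 1:
    median = ordered[(n + 1) // 2 - 1]          # the ((n + 1) / 2)th value
else:
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# Mode: multimode returns every value tied for the highest frequency.
modes = statistics.multimode(complexity)

print(f"mean = {mean:.4f}, median = {median}, mode(s) = {modes}")
```

Here the mean is pulled above the median by the few large values, which is why the choice of measure depends on the shape of the data.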
Choice of Measures of Central Tendency
The choice of a measure of central tendency depends upon:
1. the scale type at which the data is measured
2. the distribution of the data (left skewed, symmetrical, right
skewed)

Measure | Relevant scale type
Mean | Interval and ratio data which are not skewed.
Median | Ordinal, interval and ratio, but not useful for ordinal scales having few values.
Mode | All scale types, but not useful for scales having multiple values.
[Figure: graphs representing skewed and symmetrical distributions]
Measures of dispersion
• Measures of dispersion include standard deviation,
variance and quartiles. They tell us how the data is
scattered or spread. Standard deviation measures the
typical distance of the data points from the mean.
If most of the data points are far away from the mean,
then the standard deviation of the variable is large. The
standard deviation is calculated as given below:
\[ \sigma_x = \sqrt{\frac{\sum (x - \mu)^2}{N}} \]
• Variance is a measure of variability and is the square of
the standard deviation.
• The quartiles divide the metric data into four equal
parts. To calculate the quartiles, the data is first arranged
in ascending order.
• Twenty-five percent of the metric data is below the lower
quartile (25th percentile), fifty percent is below the
median and seventy-five percent is below the upper
quartile (75th percentile).
[Figure: the lower quartile, median and upper quartile divide the ordered data into four parts (1st to 4th)]
• The lower quartile (Q1) is computed by:
– finding the median of the data set
– finding the median of the lower half of the data set
• The upper quartile (Q3) is computed by:
– finding the median of the data set
– finding the median of the upper half of the data set
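The two steps can be sketched as a small helper. The function name `quartiles` is hypothetical, and the sketch assumes that for an odd number of values the median is counted in both halves; that assumption reproduces the quartiles 56 and 202 reported for the LOC example in the outlier analysis part of this section.

```python
import statistics

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method.

    Assumption: when the number of values is odd, the median itself
    belongs to both the lower and the upper half.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2              # size of each half
    lower = ordered[:half]           # smallest values, up to the median
    upper = ordered[n - half:]       # largest values, from the median on
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

# LOC values from the outlier analysis example in this section.
loc = [17, 25, 36, 48, 56, 62, 78, 82, 103, 140, 162, 181, 202, 251,
       310, 335, 508]
q1, median, q3 = quartiles(loc)
print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {q3 - q1}")
```

Other conventions (for example, leaving the median out of both halves) give slightly different quartiles for small data sets.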
• Interquartile range = Q3 - Q1

Example: Consider the data set consisting of lines of source
code (SLOC) given in Example 8.2. Calculate the standard
deviation, variance and quartiles for it.
107, 128, 186, 222, 322, 466, 657, 706, 790, 844, 1129,
1280, 1411, 1532, 1824, 1882, 3442
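One way to work this example is with Python's standard `statistics` module. The sketch below uses the population forms (division by N), matching the standard deviation formula given earlier; `method="inclusive"` is an assumption, chosen because it agrees with the median-of-halves quartiles used for the LOC example in this section.

```python
import statistics

# SLOC values from the example above.
sloc = [107, 128, 186, 222, 322, 466, 657, 706, 790, 844, 1129,
        1280, 1411, 1532, 1824, 1882, 3442]

mean = statistics.fmean(sloc)
std_dev = statistics.pstdev(sloc)      # population form: divides by N
variance = statistics.pvariance(sloc)  # square of the standard deviation

# Quartiles of the ordered data (assumed "inclusive" convention).
q1, q2, q3 = statistics.quantiles(sloc, n=4, method="inclusive")

print(f"mean = {mean:.2f}")
print(f"standard deviation = {std_dev:.2f}, variance = {variance:.2f}")
print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {q3 - q1}")
```

Statistical packages often report the sample form instead (division by N - 1), which `statistics.stdev` and `statistics.variance` compute.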
Metric Data Distribution
• In order to understand the metric data, the starting point
is to analyze the shape of the distribution of the data.
• There are a number of methods available to analyze the
shape of the data. One of them is the histogram, through
which a researcher can gain insight about the normality
of the data.
• A histogram is a graphical representation of the frequency
of occurrences of the values of a given variable.
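A histogram can be approximated in plain text by counting how many values fall into each bin. The sketch below uses the SLOC data from the earlier example; the bin width of 500 is a hypothetical choice, not taken from the source.

```python
from collections import Counter

# SLOC values from the earlier example.
sloc = [107, 128, 186, 222, 322, 466, 657, 706, 790, 844, 1129,
        1280, 1411, 1532, 1824, 1882, 3442]

bin_width = 500  # assumed bin width for illustration

# Map each value to the lower edge of its bin and count the occurrences.
counts = Counter((value // bin_width) * bin_width for value in sloc)

# Print one bar per bin, including empty bins, up to the largest value.
for start in range(0, max(sloc) + 1, bin_width):
    end = start + bin_width - 1
    print(f"{start:>4}-{end:<4} | {'#' * counts[start]}")
```

The long run of empty bins before the last bar shows the right skew of this data set.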
[Figure: histogram of SLOC (x-axis: SLOC, 0 to 4000; y-axis: frequency)]
• The bars show the frequency of values of the LOC metric.
• The normal curve is superimposed on the distribution of
values to determine the normality of the data values of
LOC.
• In practice, most metric data is either left skewed or right
skewed. These curves are not applicable to discrete data
(nominal or ordinal).
• For example, classes may be faulty or non-faulty,
so the distribution will not be normal.
• The measures of central tendency (mean, median and
mode) are all equal for a normal curve.
• The normal curve is bell shaped; about 99.7% of the data
lies within three standard deviations of the mean (and
about 95% within two).
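The share of a normal distribution that lies within k standard deviations of the mean can be checked numerically. This is a small sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the dictionary name `coverage` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Probability mass between -k and +k standard deviations.
coverage = {}
for k in (1, 2, 3):
    coverage[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} standard deviation(s): {coverage[k]:.4f}")
```

This gives roughly 68% within one standard deviation, 95% within two and 99.7% within three.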
• Consider the data sets given for three metric variables
in the table below. Determine the normality of these
variables.
Fault count | Cyclomatic complexity | Branch count
470.00 | 26.00 | 826.00
128.00 | 20.00 | 211.00
268.00 | 14.00 | 485.00
19.00 | 10.00 | 29.00
404.00 | 15.00 | 405.00
127.00 | 11.00 | 240.00
263.00 | 14.00 | 464.00
94.00 | 10.00 | 187.00
207.00 | 13.00 | 344.00
42.00 | 7.00 | 83.00
24.00 | 10.00 | 47.00
94.00 | 6.00 | 163.00
34.00 | 9.00 | 67.00
286.00 | 10.00 | 503.00
104.00 | 12.00 | 175.00
82.00 | 8.00 | 147.00
20.00 | 7.00 | 39.00
[Figure: histogram of Fault_count (x-axis: 0.00 to 500.00; y-axis: frequency)]
[Figure: histogram of Cyclomatic_complexity (x-axis: 5.00 to 30.00; y-axis: frequency)]
[Figure: histogram of Branch_count (x-axis: 0.00 to 1000.00; y-axis: frequency)]

Metric | Mean | Median | Std. deviation
Fault count | 156.8235 | 104 | 137.7599
Cyclomatic complexity | 11.88235 | 10 | 5.035901
Branch count | 259.7059 | 187 | 216.9976
• The normality of the data can also be checked by
calculating the mean and median.
• The mean, median and mode should be the same for a
normal distribution.
• We compare the values of the mean and median.
• The median is less than the mean for the variables
fault count and branch count.
• Thus, these variables are not normally distributed.
• However, the mean and median of cyclomatic
complexity do not differ by a large amount.
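The comparison can be reproduced from the example data as follows. This is a sketch, not the tool used to produce the table: `statistics.stdev` (the sample form, dividing by N - 1) is used because it matches the tabulated standard deviations, and the `gap` column, the mean minus the median expressed in standard deviations, is a hypothetical yardstick added for illustration.

```python
import statistics

# The three metric variables from the example table, in row order.
metrics = {
    "Fault count": [470, 128, 268, 19, 404, 127, 263, 94, 207,
                    42, 24, 94, 34, 286, 104, 82, 20],
    "Cyclomatic complexity": [26, 20, 14, 10, 15, 11, 14, 10, 13,
                              7, 10, 6, 9, 10, 12, 8, 7],
    "Branch count": [826, 211, 485, 29, 405, 240, 464, 187, 344,
                     83, 47, 163, 67, 503, 175, 147, 39],
}

summary = {}
for name, values in metrics.items():
    mean = statistics.fmean(values)
    median = statistics.median(values)
    std_dev = statistics.stdev(values)      # sample form (N - 1)
    gap = (mean - median) / std_dev         # illustrative, not from the source
    summary[name] = (mean, median, std_dev, gap)
    print(f"{name:<22} mean={mean:9.4f} median={median:6} "
          f"std={std_dev:9.4f} gap={gap:5.2f}")
```

A gap near zero is consistent with a symmetrical distribution; a clearly positive gap points to a right-skewed one.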
Outlier Analysis
• Data points, which are located in an empty part of the
sample space, are called outliers.
• These are the data values that are numerically distant
from the rest of the data.
• For example, suppose one calculates the average
weight of 10 students in a class, where most are
between 51 pounds and 61 pounds, but the weight of
one student is 210 pounds. In this case, the mean will
be 72 pounds and the median will be 58.
• Hence, the median better reflects the weight of the
students than the mean.
• Thus, the data point with the value 210 is an outlier, that
is, it is located away from the other values in the data set.
• Outlier analysis is done to find data points that are overly
influential; removing them is essential.
• Once the outliers are identified, a decision about the
inclusion or exclusion of each outlier must be made.
• The decision depends upon the reason why the case is
identified as an outlier.
• There are three types of outliers: univariate, bivariate
and multivariate.
• Univariate outliers are those exceptional values that
occur within a single variable.
• Bivariate outliers occur within the combination of two
variables and multivariate outliers are present within the
combination of more than two variables.
• Box plots and scatter plots are two popular methods
used for univariate and bivariate outlier detection.
• Box plots are based on the median and quartiles. The upper
and lower quartile statistics are used to construct a box
plot.
• The median is the middle value of the data set: half of
the values are less than this value and half are greater.
[Figure: box plot showing the start of the tail, lower quartile, median, upper quartile and end of the tail]
• The box starts at the lower quartile and ends at the
upper quartile.
• The distance between lower and the upper quartile is
called box length.
• The tails of the box plot specify the bounds between
which all the data points must lie.
• The start of the tail is Q1 - 1.5 × IQR and the end of the
tail is Q3 + 1.5 × IQR.
• These values are truncated to the nearest values of the
actual data points in order to avoid negative values.
• Thus, the actual start of the tail is the lowest value in the
variable above (Q1 - 1.5 × IQR) and the actual end of the
tail is the highest value below (Q3 + 1.5 × IQR).
• Any value outside the start and the end of the tail is an
outlier. Such data points must be identified, as they are
unusual occurrences of data values that must be
considered for inclusion or exclusion.
• The box plots also tell us whether the data is skewed or
not.
• If the data is not skewed the median will lie in the center
of the box.
• If the data is left or right skewed, then the median will be
away from the center.
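The position of the median inside the box can be checked numerically by comparing the two parts of the box. The sketch below uses the LOC values from the example that follows; `method="inclusive"` is an assumed quartile convention, and reading a longer upper part as right skew is a rule of thumb rather than a formal test.

```python
import statistics

# LOC values from the example in this section.
loc = [17, 25, 36, 48, 56, 62, 78, 82, 103, 140, 162, 181, 202, 251,
       310, 335, 508]

q1, median, q3 = statistics.quantiles(loc, n=4, method="inclusive")

lower_part = median - q1   # box length below the median
upper_part = q3 - median   # box length above the median

if upper_part > lower_part:
    shape = "right skewed"      # median sits closer to the lower quartile
elif upper_part < lower_part:
    shape = "left skewed"       # median sits closer to the upper quartile
else:
    shape = "not skewed"        # median is in the center of the box

print(f"lower part = {lower_part}, upper part = {upper_part}: {shape}")
```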
For example, consider the LOC values for a sample project
given below:
17, 25, 36, 48, 56, 62, 78, 82, 103, 140, 162, 181, 202, 251,
310, 335, 508
• The median of the data set is 103, the lower quartile is 56
and the upper quartile is 202. The interquartile range is 146.
• The start of the tail is 56 - 1.5 × 146 = -163 and the end of
the tail is 202 + 1.5 × 146 = 421.
• The actual start of the tail is the lowest value above -163,
i.e. 17, and the actual end of the tail is the highest value
below 421, i.e. 335.
• Thus case number 17, with value 508, is above the end
of the tail and hence is an outlier.
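The same calculation can be sketched in Python. The variable names are hypothetical, and `method="inclusive"` is an assumption chosen because it yields the quartiles 56 and 202 stated above.

```python
import statistics

# LOC values from the example; case numbers follow the listed order.
loc = [17, 25, 36, 48, 56, 62, 78, 82, 103, 140, 162, 181, 202, 251,
       310, 335, 508]

q1, _, q3 = statistics.quantiles(loc, n=4, method="inclusive")
iqr = q3 - q1

# Theoretical bounds of the tails.
tail_start_bound = q1 - 1.5 * iqr
tail_end_bound = q3 + 1.5 * iqr

# Actual tails: nearest data points that still lie inside the bounds.
inside = [v for v in loc if tail_start_bound <= v <= tail_end_bound]
tail_start, tail_end = min(inside), max(inside)

# Outliers as (case number, value) pairs, numbering cases from 1.
outliers = [(case, v) for case, v in enumerate(loc, start=1)
            if v < tail_start_bound or v > tail_end_bound]

print(f"IQR = {iqr}, bounds = ({tail_start_bound}, {tail_end_bound})")
print(f"tails = ({tail_start}, {tail_end}), outliers = {outliers}")
```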
[Figure: box plot of LOC (axis: 0.00 to 600.00); case 17 is marked as an outlier]
Example:
Consider the data set given in the example above. Construct
box plots and identify the univariate outliers for all the
variables in the data set.
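The outlier check behind the box plots can be sketched as follows. The helper `univariate_outliers` is a hypothetical name, cases are numbered in the order the rows appear in the data table, and `method="inclusive"` is an assumed quartile convention.

```python
import statistics

def univariate_outliers(values):
    """Return (case number, value) pairs outside Q1 - 1.5*IQR .. Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [(case, v) for case, v in enumerate(values, start=1)
            if v < low or v > high]

# The three metric variables from the example table, in row order.
metrics = {
    "Fault_count": [470, 128, 268, 19, 404, 127, 263, 94, 207,
                    42, 24, 94, 34, 286, 104, 82, 20],
    "Cyclomatic_complexity": [26, 20, 14, 10, 15, 11, 14, 10, 13,
                              7, 10, 6, 9, 10, 12, 8, 7],
    "Branch_count": [826, 211, 485, 29, 405, 240, 464, 187, 344,
                     83, 47, 163, 67, 503, 175, 147, 39],
}

results = {name: univariate_outliers(values) for name, values in metrics.items()}
for name, found in results.items():
    print(f"{name}: {found if found else 'no outliers'}")
```

Under these assumptions only the first case of cyclomatic complexity (value 26) falls outside the tails.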
[Figure: box plot of Fault_count (axis: 0 to 500)]
[Figure: box plot of Cyclomatic_complexity (axis: 5 to 30); case 1 is marked as an outlier]
[Figure: box plot of Branch_count (axis: 0 to 800)]
• The outliers should be analyzed by the researchers to
make a decision about their inclusion or exclusion in
the data analysis.
• There may be many reasons for an outlier:
– an error in the entry of the data
– some extra information that represents an extraordinary
or unusual event
– an extraordinary event that is unexplained by the
researcher.
• Outlier values may be present due to a combination of
data values across more than one variable.
• These outliers are called multivariate outliers.
• Scatter plot is another visualization method to detect
outliers.
• In scatter plots, we simply represent graphically all the
data points.
• The scatter plot allows us to examine more than one
metric variable at a given time.
Exploring Analysis
• The metric variables can be of two types: independent
variables and the dependent (target) variable.
• The effect of independent variables on the dependent
variable can be explored by using various statistical
and machine learning methods.
• The choice of the method for analyzing the
relationship between the independent and dependent
variables depends upon the type of the dependent
variable (continuous or discrete).
