Lecture 5: Descriptive Statistics

Data Analytics in Software Engineering (MSE 669)

Dr. Assad Abbas

Department of Computer Science


COMSATS Institute of Information Technology, Islamabad
[email protected]
Outline
• Descriptive Statistics

Data Analysis and Statistical Testing
• Research data can be analyzed using various statistical measures, and conclusions can be inferred from these measures.
• Research data should be reduced to a suitable form before they can be used for further analysis.
• Statistical techniques can be used to preprocess the attributes so that they can be analyzed and meaningful conclusions can be drawn from them.
• After preprocessing, the attributes are reduced so that the dimensionality of the data decreases and better results can be obtained. A model is then built and validated using statistical and/or machine learning techniques.
• The results obtained are analyzed and interpreted from every relevant aspect.
• Finally, hypotheses are tested and a decision about the accuracy of the model is made.


Analyzing the Metric Data
• After data collection, descriptive statistics can be used to summarize and analyze the nature of the data.
• Descriptive statistics are used to describe the data, for example, by identifying attributes with very few data points or by determining the spread of the data.


What are Descriptive Statistics?
• Descriptive statistics are the methods and techniques used to summarize and display data in a meaningful way. Descriptive statistics help us to:
  – Describe the main features and characteristics of the data.
  – Compare and contrast different groups or variables in the data.
  – Visualize the distribution and variation of the data.
• Some of the common measures of descriptive statistics are:
  – Measures of central tendency
  – Measures of dispersion


Why Descriptive Statistics in Software Engineering?
• Descriptive statistics can be used to analyze various aspects of software engineering, such as:
  – Software requirements
    · Descriptive statistics, including the mean, median, and mode, can be used to analyze software requirements, for example, to determine average user satisfaction scores for feature prioritization.
  – Software design
    · Descriptive statistics can help evaluate the quality and complexity of the software architecture and its components, such as the coupling and cohesion of the modules, the size and depth of the classes, and the number and type of the interfaces.
  – Software development
  – Software testing
  – Software maintenance


Analyzing the Metric Data
• Measures of Central Tendency
  – Measures of central tendency summarize the central values of the attributes. These measures include the mean, median, and mode. They are known as measures of central tendency because they indicate the central values of the data around which the other values tend to gather.
• Mean
  – The mean is computed by taking the average of the values in the data set.
  – The mean is typically the better choice when the data follow a symmetric distribution.


Analyzing the Metric Data
• Median
  – The median is the value that divides the data into two halves.
  – The data must first be arranged in ascending order.
  – For an odd number of data points, the median is the central value; for an even number of data points, it is the mean of the two central values.
  – The median is not useful if the number of categories in an ordinal scale is very low. In such cases, the mode is the preferred measure of central tendency.
  – The median is useful when the data are skewed, because the mean would be distorted by outliers.

  Odd: 15, 17, 18, 19, 45, 63, 64, 65, 71, 75, 79 (median: 63)
  Even: 15, 17, 18, 45, 63, 64, 65, 71, 75, 79 (median: (63 + 64)/2 = 63.5)
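
As a minimal sketch, both medians (and the mean) can be verified with Python's standard statistics module, using the data sets from the slide:

    import statistics

    odd = [15, 17, 18, 19, 45, 63, 64, 65, 71, 75, 79]
    even = [15, 17, 18, 45, 63, 64, 65, 71, 75, 79]

    # statistics.median sorts internally, mirroring the "arrange in
    # ascending order" step; for an even count it averages the two
    # central values.
    print(statistics.mean(odd))     # 48.27... (average of the values)
    print(statistics.median(odd))   # 63   (the central value)
    print(statistics.median(even))  # 63.5 (mean of 63 and 64)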



Analyzing the Metric Data
• Mode
  – The mode is the value that occurs with the highest frequency in the distribution.
  – Unlike the mean and median, the same distribution may have multiple modes.
  – The major disadvantage of the mode is that it does not produce useful results when applied to interval/ratio scales with many distinct values.

  15, 17, 18, 18, 45, 63, 64, 65, 71, 75, 79
  Mode: 18
  Is this a good measure to use here?
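
A minimal sketch of the mode computation; statistics.multimode (Python 3.8+) returns every value tied for the highest frequency, reflecting that a distribution may have several modes:

    import statistics

    data = [15, 17, 18, 18, 45, 63, 64, 65, 71, 75, 79]

    print(statistics.mode(data))       # 18, the most frequent value
    print(statistics.multimode(data))  # [18]; a list, since modes can tie
    # Note: 18 occurs only twice among 11 values, which is why the
    # slide questions whether the mode is a good measure here.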



Choice of Measure of Central Tendency
• The choice of a measure of central tendency depends on:
  – The scale type at which the data are measured.
  – The distribution of the data (left-skewed, symmetrical, or right-skewed).
    · If the data are symmetrical, all three measures (mean, median, and mode) have the same value; if the data are skewed, there will always be differences between these measures.
  – The symmetrical curve is a bell-shaped curve in which the data points are distributed evenly about the center.
  – When the data are skewed, the mean is usually a misleading measure of the central value.


Choice of Measure of Central Tendency

[Figure: a skewed distribution with mean = 531 and median = 265.]

Which measure is suitable to use here, the mean or the median?
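
The slide's underlying data are not given, but a hypothetical right-skewed sample illustrates the point: a few large values pull the mean far above the median, so the median is the more representative measure.

    import statistics

    # Hypothetical right-skewed data (not the lecture's data set):
    # most values are moderate, a few are very large.
    values = [10, 12, 15, 18, 20, 25, 30, 40, 55, 120, 600]

    print(statistics.mean(values))    # ~85.9, inflated by the extremes
    print(statistics.median(values))  # 25, unaffected by the extremes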



Measures of Dispersion
• Measures of dispersion indicate the spread, or the range, of the distributions in the data set.
• Measures of dispersion include the range, standard deviation, variance, and quartiles.
• The range is defined as the difference between the highest value and the lowest value in the distribution. It is the easiest measure to compute (see the sketch after this list).
  Range: 3,000 − 200 = 2,800
• The ranges of two distributions may be different even if the distributions have the same mean.
• The advantage of the range is that it is simple to compute; the disadvantage is that it takes into account only the extreme values of the distribution and, hence, does not represent the actual spread of the distribution.
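
A one-line sketch of the range computation, assuming a hypothetical data set whose extremes are 200 and 3,000 as in the slide:

    # Hypothetical data set with the slide's extremes (200 and 3,000).
    data = [200, 450, 700, 1200, 1800, 2400, 3000]

    data_range = max(data) - min(data)
    print(data_range)  # 2800 = 3000 - 200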



Measures of Dispersion
• The interquartile range (IQR) can be used to overcome the disadvantage of the simple range measure.
• The quartiles are used to compute the IQR of the distribution. The quartiles divide the metric data into four equal parts.
• To calculate the quartiles, the data must first be arranged in ascending order.
• 25% of the metric data lie below the lower quartile (25th percentile), 50% lie below the median, and 75% lie below the upper quartile (75th percentile).


Measures of Dispersion
• The lower quartile (Q1) is computed in the following steps:
  – Compute the median of the data set.
  – Compute the median of the lower half of the data set; this is Q1.
• The upper quartile (Q3) is computed in the following steps:
  – Compute the median of the data set.
  – Compute the median of the upper half of the data set; this is Q3.

  IQR = Q3 − Q1
  IQR = 300 − 240 = 60
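
A sketch of the median-of-halves method described above, on a hypothetical data set chosen so that Q1 = 240 and Q3 = 300 as in the slide's example (other quartile conventions, e.g., NumPy's interpolated percentiles, can give slightly different values):

    import statistics

    data = sorted([230, 235, 245, 250, 290, 295, 305, 310])

    half = len(data) // 2
    q1 = statistics.median(data[:half])   # median of the lower half
    q3 = statistics.median(data[-half:])  # median of the upper half
    iqr = q3 - q1

    print(q1, q3, iqr)  # 240 300 60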



Measures of Dispersion
• Standard Deviation
  – The standard deviation measures the average distance of the data points from the mean. It assesses the spread by calculating how far each data point lies from the mean.
  – A low standard deviation means the data are clustered around the mean; a high standard deviation indicates the data are more spread out.
  – A standard deviation close to zero indicates that the data points are close to the mean, whereas a large standard deviation indicates that the data points are spread far above and below the mean.


Measures of Dispersion
• Standard Deviation
  – For a sample x_1, ..., x_n with mean x̄, the sample standard deviation is s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} (the population version divides by n instead of n − 1).
• Variance
  – The variance is a measure of variability and is the square of the standard deviation (see the sketch below).
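
A minimal sketch with the standard library; statistics.stdev and statistics.variance use the sample (n − 1) formulas, while pstdev and pvariance are the population versions:

    import statistics

    data = [15, 17, 18, 18, 45, 63, 64, 65, 71, 75, 79]

    s = statistics.stdev(data)       # sample standard deviation
    var = statistics.variance(data)  # sample variance

    print(round(s, 2), round(var, 2))
    assert abs(var - s**2) < 1e-9    # the variance is the square of s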



Data Distributions
• The shape of the distribution of the data is used to describe and understand the metric data.
• The shape exhibits the pattern of the distribution of the data points in a given data set.
• A distribution can be either symmetrical (half of the data points lie to the left of the median and the other half lie to the right of it) or skewed (low and/or high data values are imbalanced).
• A bell-shaped curve is known as a normal curve and is defined as follows: “The normal curve is a smooth curve that is perfectly symmetrical. It has 68.3 percent of the area under the curve within one standard deviation of the mean.”
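
The 68.3 percent property can be checked empirically, as a sketch, by sampling from a normal distribution with NumPy:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=0.0, scale=1.0, size=100_000)

    within_one_sd = np.mean(np.abs(x - x.mean()) <= x.std())
    print(within_one_sd)  # ~0.683, the share within one standard deviation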



Histogram Analysis
• Normal curves can be used to understand data descriptions. A number of methods can be applied to analyze the normality of a data set; one of these is histogram analysis.
• A histogram is a graphical representation that depicts the frequency of occurrence of ranges of values.
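
A minimal histogram sketch with Matplotlib (the values here are a synthetic stand-in for a numeric attribute); a roughly bell-shaped histogram suggests the data are close to normal:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    values = rng.normal(loc=50, scale=10, size=1_000)  # stand-in attribute

    # Each bar counts how many values fall into one range (bin) of values.
    plt.hist(values, bins=20, edgecolor="black")
    plt.xlabel("Attribute value")
    plt.ylabel("Frequency")
    plt.title("Histogram for normality inspection")
    plt.show()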



Outlier Analysis
• Data points that lie away from the rest of the data values are known as outliers. These values are located in otherwise empty regions of the data space and are extreme or unusual values.
• The presence of outliers may adversely affect the results of data analysis.

  35, 45, 45, 55, 55, 55, 65, 65, 65, 65, 75, 75, 75, 75, 75, 85, 85, 85, 85, 95, 95, 95, 105, 105, 115, 300, 400
  What is abnormal here?

• Box plots, z-scores, and scatter plots can be used for detecting outliers.

[Scatter plot of the values above: the points at 300 and 400 lie far from the rest of the data.]

Z-Score
• The z-score is a method for identifying outliers; it depicts the relationship of a value to the mean and is given as follows:

  z = (x − μ) / σ, where μ is the mean and σ is the standard deviation.

• The z-score tells whether a value lies above or below the mean, and by how many standard deviations. It may be positive or negative.
• Data samples whose z-scores exceed the threshold of ±2.5 are considered to be outliers (see the sketch below).
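
A sketch applying the ±2.5 threshold to the data from the outlier-analysis slide, assuming the population standard deviation:

    import statistics

    data = [35, 45, 45, 55, 55, 55, 65, 65, 65, 65, 75, 75, 75, 75, 75,
            85, 85, 85, 85, 95, 95, 95, 105, 105, 115, 300, 400]

    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)  # population standard deviation

    outliers = [x for x in data if abs((x - mu) / sigma) > 2.5]
    print(outliers)  # [300, 400]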





Correlation Analysis
• Correlation analysis is a statistical method used to measure the strength of the linear relationship between two variables and to compute their association.
• Simply put, correlation analysis estimates how much one variable changes as the other variable changes.
• A high correlation points to a strong relationship between the two variables, while a low correlation means that the variables are weakly related.
• Correlation coefficients take values between −1 and 1. A value of 0 means there is no linear relationship between the variables at all, while −1 or 1 means there is a perfect negative or positive correlation.
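
A Pearson-correlation sketch with NumPy on hypothetical paired measurements; np.corrcoef returns the correlation matrix, and the off-diagonal entry is the coefficient:

    import numpy as np

    # Hypothetical pairs, e.g., module size vs. defect count.
    size = np.array([120, 200, 310, 400, 520, 610, 700])
    defects = np.array([3, 5, 9, 10, 14, 15, 19])

    r = np.corrcoef(size, defects)[0, 1]
    print(r)  # close to +1: a strong positive linear relationship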





Correlation Analysis
• Types of techniques/tests for correlation:
  – Pearson correlation
  – Kendall rank correlation
  – Spearman correlation
  – Point-biserial correlation


Attribute Reduction Methods
• The presence of a large number of attributes in an empirical study can reduce the efficiency of the prediction results produced by statistical and machine learning techniques.
• Reducing the dimensionality of the data reduces the size of the hypothesis space and allows the methods to operate faster and more effectively.
• Attribute reduction methods involve either selecting a subset of attributes (independent variables) by eliminating the attributes that carry little or no predictive information (known as attribute selection), or combining the relevant attributes into a new set of attributes (known as attribute extraction).


Attribute Reduction Methods
• It is also possible that more than one attribute captures the same concept and is therefore redundant.
• Irrelevant and redundant attributes only add noise to the data, increase the computational time, and may reduce the accuracy of the predicted models.
• To remove noise and correlation among the attributes, it is desirable to reduce the data dimensionality as a preprocessing step of data analysis.
• Benefits of attribute reduction:
  – Improved model interpretability
  – Faster training time
  – Reduced overfitting of the models
  – Reduced noise




Attribute Reduction Methods
• Attribute Selection
  – Attribute selection involves selecting a subset of attributes from a given set of attributes.
  – For example, univariate analysis and correlation-based feature selection (CFS) can be used for attribute subset selection.
  – Wrapper methods and filter methods can be used for metric selection.


Attribute Reduction Methods
• Attribute Selection
  – Wrapper Methods
    · In wrapper methods, the feature selection process is based on a specific machine learning algorithm that we are trying to fit to a given data set.
    · A wrapper method follows a greedy search approach, evaluating candidate combinations of features against the evaluation criterion (see the sketch after this list).
    · Examples of search techniques used in wrapper methods include hill climbing, genetic algorithms, simulated annealing, and tabu search.
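
A minimal wrapper-method sketch using scikit-learn's SequentialFeatureSelector, which greedily adds the feature that most improves the cross-validated score of the chosen learner; the data set is synthetic and all parameter values are illustrative:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression

    # Synthetic data: 10 features, only a few of which are informative.
    X, y = make_classification(n_samples=300, n_features=10,
                               n_informative=3, random_state=0)

    model = LogisticRegression(max_iter=1000)

    # Greedy forward search wrapped around the learner itself: at each
    # step, keep the feature whose addition best improves the
    # cross-validated accuracy of `model`.
    selector = SequentialFeatureSelector(model, n_features_to_select=3,
                                         direction="forward", cv=5)
    selector.fit(X, y)
    print(selector.get_support(indices=True))  # indices of selected features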



Attribute Reduction Methods
• Attribute Selection
  – Filter Methods
    · Filter methods are independent of the learning technique.
    · Filter methods compute attribute rankings on the basis of correlation-based and information-centric measures.
    · Features are selected on the basis of their scores in various statistical tests of their correlation with the outcome variable (a sketch follows after this list).
    · Examples of techniques used in filter methods include the correlation coefficient, mutual information, and information gain.
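
A filter-method sketch with scikit-learn's SelectKBest: each feature is scored against the outcome independently of any learner (mutual information here), and the k highest-scoring features are kept; data and parameters are illustrative:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    X, y = make_classification(n_samples=300, n_features=10,
                               n_informative=3, random_state=0)

    selector = SelectKBest(score_func=mutual_info_classif, k=3)
    X_reduced = selector.fit_transform(X, y)

    print(selector.scores_)                    # per-feature relevance scores
    print(selector.get_support(indices=True))  # indices of the kept features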



Attribute Reduction Methods
• Univariate Analysis
  – Univariate analysis is performed to find the individual effect of each independent variable on the dependent variable.
  – One purpose of univariate analysis is to screen out the independent variables that are not significantly related to the dependent variable.
  – For example, in regression analysis, only the independent variables that are significant at the 0.05 significance level may be considered in subsequent model prediction using multivariate analysis.
  – The primary goal is to preselect, for multivariate analysis, the independent variables that seem to be useful predictors. The choice of methods in univariate analysis depends on the type of dependent variable being used (a sketch of such screening follows after this list).
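
A sketch of univariate screening for a continuous dependent variable: each independent variable is tested on its own with a Pearson correlation and its p-value (SciPy), and only the variables significant at the 0.05 level are preselected. The data and variable names are illustrative.

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)
    n = 100
    y = rng.normal(size=n)                    # dependent variable
    X = {
        "loc": 2 * y + rng.normal(size=n),    # related to y
        "comment_ratio": rng.normal(size=n),  # unrelated noise
        "fan_out": -y + rng.normal(size=n),   # related to y
    }

    # Keep only the independent variables significant at the 0.05 level.
    selected = [name for name, x in X.items() if pearsonr(x, y)[1] < 0.05]
    print(selected)  # likely ["loc", "fan_out"]
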
Attribute Reduction Methods
• Correlation-Based Feature Selection (CFS)
  – CFS is a commonly used method for preselecting attributes for machine learning methods.
  – To incorporate the correlations among the independent variables, the CFS method is applied to select the best predictors from the independent variables in the data set.
  – The best combinations of independent variables are found by searching through the possible combinations of variables.
  – CFS evaluates the best subset of independent variables, such as software metrics, by considering the individual predictive ability of each attribute along with the degree of redundancy between the attributes.
  – CFS can drastically reduce the dimensionality of data sets while maintaining the performance of the machine learning methods (a simplified sketch follows after this list).
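
CFS is not shipped with scikit-learn. As a simplified sketch under the usual definition, the merit of a feature subset S with k features is Merit(S) = k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation; the code below scores subsets with Pearson correlations (Hall's original CFS uses symmetrical uncertainty and a best-first search):

    import numpy as np
    from itertools import combinations

    def cfs_merit(X, y, subset):
        """Merit = k*mean|corr(f, y)| / sqrt(k + k(k-1)*mean|corr(f, f')|)."""
        k = len(subset)
        r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
        if k == 1:
            return r_cf
        r_ff = np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                        for i, j in combinations(subset, 2)])
        return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)

    # Synthetic data: two relevant (and mutually redundant) features
    # plus one irrelevant feature.
    rng = np.random.default_rng(0)
    y = rng.normal(size=200)
    X = np.column_stack([y + rng.normal(size=200),
                         y + rng.normal(size=200),
                         rng.normal(size=200)])

    # Exhaustive search over subsets (illustrative; CFS normally uses
    # a heuristic search such as best-first).
    subsets = [s for r in (1, 2, 3) for s in combinations(range(3), r)]
    best = max(subsets, key=lambda s: cfs_merit(X, y, s))
    print(best)  # the subset with the highest merit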



Attribute Reduction Methods
• Attribute Extraction
  – Unlike attribute selection, which selects from the existing attributes according to their significance values or importance, attribute extraction transforms the existing attributes and produces new attributes by combining or aggregating the original ones, so that useful information for model building can be extracted from the attributes.


Attribute Reduction Methods
• Autoencoder methods
• Principal Component Analysis
• Bag of Words


Attribute Reduction Methods
• Autoencoders
  – Autoencoders can identify key features of the data.
  – The autoencoder concept focuses on learning an encoding of the original data set from which new, more powerful features can be derived.
  – An autoencoder achieves this by training a neural network to reconstruct its input, which forces the network to discover and exploit structure in the data.
  – Through this process, autoencoders reduce dimensionality and extract significant features from the data, contributing to more effective machine learning models (a minimal sketch follows after this list).
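
A minimal autoencoder sketch in PyTorch (assuming PyTorch is available; the layer sizes and training settings are arbitrary): the network learns to reconstruct its input, and the output of the low-dimensional bottleneck serves as the extracted features.

    import torch
    import torch.nn as nn

    # The encoder compresses 20 inputs to a 3-dimensional code;
    # the decoder tries to reconstruct the original 20 values.
    encoder = nn.Sequential(nn.Linear(20, 8), nn.ReLU(), nn.Linear(8, 3))
    decoder = nn.Sequential(nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 20))

    X = torch.randn(500, 20)  # stand-in attribute matrix
    opt = torch.optim.Adam(
        list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
    loss_fn = nn.MSELoss()

    for epoch in range(200):
        opt.zero_grad()
        loss = loss_fn(decoder(encoder(X)), X)  # reconstruction error forces
        loss.backward()                         # the bottleneck to capture
        opt.step()                              # structure in the data

    features = encoder(X).detach()  # reduced 3-dimensional representation
    print(features.shape)           # torch.Size([500, 3])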



Attribute Reduction Methods
• Attribute Extraction
  – Principal Component Analysis (PCA)
    · Principal component analysis (PCA) is a dimensionality-reduction method that is often used on large data sets. It transforms a large set of variables into a smaller one that still contains most of the information in the original set.
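
A PCA sketch with scikit-learn on synthetic data; n_components sets the size of the reduced attribute set, and explained_variance_ratio_ shows how much of the information (variance) each retained component keeps:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))  # stand-in data: 200 rows, 10 attributes

    pca = PCA(n_components=3)       # keep the 3 strongest directions
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                # (200, 3)
    print(pca.explained_variance_ratio_)  # variance retained per component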





Source
• Malhotra, Ruchika. Empirical Research in Software Engineering: Concepts, Analysis, and Applications. CRC Press, 2016.
