Ese Lab - Sanoj-159
Practical File
(2023-2024)
Submitted By
Sanoj (2K21/SE/159)
Under the Supervision of
Ms. Shweta Meena
OBJECTIVE
Perform a comparison of the following data analysis tools: WEKA, KEEL, SPSS, MATLAB, R
THEORY
1. Weka:
Ease of Use: It provides a graphical user interface (GUI), making it accessible to users without
extensive programming experience. However, some advanced features may require scripting.
Flexibility: Being open-source, Weka allows for customization and integration with other tools or
libraries.
Community Support: It has an active community with resources like forums, documentation, and
tutorials.
2. KEEL:
Features: KEEL is a Java-based software tool for a wide range of data mining tasks. It offers
algorithms for classification, regression, clustering, pattern mining, etc.
Ease of Use: It provides a user-friendly interface, but there may be a learning curve, especially for
users unfamiliar with Java.
Flexibility: KEEL allows for customization and supports the integration of new algorithms.
Community Support: While it has a user community, it might not be as extensive as other more
widely used tools.
3. SPSS:
Features: SPSS (Statistical Package for the Social Sciences) is a statistical software suite offering a
broad range of data analysis capabilities including descriptive statistics, hypothesis testing,
regression analysis, and more.
Flexibility: SPSS offers some customization options, but it might be limited compared to open-
source alternatives.
Community Support: It has a large user base, with extensive documentation and support available.
4. MATLAB:
Ease of Use: MATLAB provides an interactive development environment (IDE) with easy-to-use
functions and visualization tools. However, proficiency in MATLAB programming is required for
complex tasks.
Flexibility: MATLAB offers high flexibility and customization options, allowing users to create
custom algorithms and functions.
Community Support: MATLAB has a large user base and comprehensive documentation, with
active forums and support channels.
5. R:
Ease of Use: While R has a steep learning curve for beginners, it provides powerful functionalities
once mastered. Various IDEs and graphical interfaces like RStudio make it more user-friendly.
Flexibility: R is highly flexible, allowing users to write custom functions and packages. Its open-
source nature encourages community contributions and extensions.
Community Support: R has a large and active user community with extensive documentation,
numerous packages, and online resources.
OBJECTIVE
Collection of Empirical Studies
THEORY
EXPERIMENT 4 – Collection of Empirical Studies
OBJECTIVE
Collection of Empirical Studies
EXPERIMENT 5 – Feature Reduction Techniques
OBJECTIVE
Write a program to perform the following feature reduction techniques on the collected dataset:
a) Correlation-based feature evaluation
b) Relief attribute feature evaluation
c) Information gain feature evaluation
d) Principal component analysis
THEORY
a) Correlation-based feature evaluation: This approach evaluates the relationship between each
feature and the target variable by calculating their correlation coefficient. Features with high
correlation values with the target variable are considered important and are retained, while those with
low correlation values may be discarded. However, it's essential to note that correlation doesn't imply
causation, so this method might overlook certain important features that are not highly correlated but
still influential.
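The scoring step described above can be sketched as follows, using NumPy only; the toy data, the 0.9/0.3 cut-offs, and the function name are illustrative choices, not part of the original program:

```python
import numpy as np

def correlation_scores(X, y):
    """Absolute Pearson correlation of each feature column with the target."""
    scores = []
    for j in range(X.shape[1]):
        # np.corrcoef returns the 2x2 correlation matrix; [0, 1] is r(feature, y)
        r = np.corrcoef(X[:, j], y)[0, 1]
        scores.append(abs(r))
    return np.array(scores)

# Toy data: feature 0 tracks the target, feature 1 is pure noise
rng = np.random.default_rng(0)
y = rng.normal(size=200)
X = np.column_stack([y + 0.1 * rng.normal(size=200),   # strongly correlated
                     rng.normal(size=200)])            # unrelated
scores = correlation_scores(X, y)
# feature 0 receives a much higher score and would be retained
```

A threshold on these scores (or a top-k selection) then decides which features to keep.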
b) Relief attribute feature evaluation: The Relief algorithm estimates the importance of features by
considering their ability to distinguish between instances of the same and different classes. It works
by iteratively sampling instances and adjusting feature weights based on the differences in feature
values between the nearest instances of the same and different classes. Features with higher weights
are considered more relevant. This method is particularly useful for classification tasks and is robust
to noisy data.
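A minimal sketch of the basic Relief weight update described above (binary classes, one nearest hit and one nearest miss per sampled instance); the Manhattan distance, iteration count, and toy data are illustrative assumptions:

```python
import numpy as np

def relief(X, y, n_iter=200, seed=0):
    """Basic Relief for binary classification; features assumed on a [0, 1]-like scale."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(n)
        dist = np.abs(X - X[i]).sum(axis=1)   # Manhattan distance to all instances
        dist[i] = np.inf                      # exclude the sampled instance itself
        same = np.where(y == y[i])[0]
        diff = np.where(y != y[i])[0]
        hit = same[np.argmin(dist[same])]     # nearest instance of the same class
        miss = diff[np.argmin(dist[diff])]    # nearest instance of the other class
        # Differences to the near-miss raise a weight; to the near-hit, lower it
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n_iter

# Toy data: feature 0 separates the classes, feature 1 is random
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 50)
X = np.column_stack([y + 0.05 * rng.normal(size=100), rng.random(100)])
w = relief(X, y)
# w[0] ends up near 1, w[1] near 0: the discriminative feature wins
```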
c) Information gain feature evaluation: Information gain measures the reduction in entropy or
uncertainty about the target variable achieved by knowing the value of a particular feature. Features
that lead to significant reductions in entropy are considered more informative and are thus selected.
This method is commonly used in decision tree algorithms, where features with higher information
gain are preferred for splitting nodes. However, it may prioritize features with many distinct values
or categories.
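The entropy-reduction computation can be sketched directly from its definition; the categorical split and the tiny label vector below are illustrative:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label vector, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, y):
    """Entropy reduction in y from splitting on a categorical feature."""
    gain = entropy(y)
    for v in np.unique(feature):
        mask = feature == v
        gain -= mask.mean() * entropy(y[mask])   # weighted child entropy
    return gain

y = np.array([0, 0, 1, 1])
ig_perfect = information_gain(y.copy(), y)       # feature equals the label: gain = H(y) = 1 bit
ig_useless = information_gain(np.zeros(4), y)    # constant feature: gain = 0
```

This also illustrates the bias noted above: a feature with a distinct value per instance would likewise achieve the maximal gain H(y) without being genuinely informative.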
d) Principal component analysis (PCA): PCA is a dimensionality reduction technique that identifies
the directions (principal components) that capture the most variance in the data. These principal
components are linear combinations of the original features. By retaining only the most significant
principal components, PCA reduces the dimensionality of the data while preserving most of its
variance. This technique is particularly useful for visualizing high-dimensional data and for feature
extraction in scenarios where the original features are highly correlated or redundant.
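The projection described above can be sketched via SVD of the mean-centered data matrix (the principal components are the right singular vectors); the synthetic near-1D data is an illustrative assumption:

```python
import numpy as np

def pca(X, n_components):
    """PCA via SVD of the mean-centered data matrix."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]            # top principal directions
    scores = Xc @ components.T                # data projected onto them
    explained = (S ** 2) / (S ** 2).sum()     # variance ratio per component
    return scores, components, explained[:n_components]

# 3-D data that lies almost entirely along one direction
rng = np.random.default_rng(1)
t = rng.normal(size=(200, 1))
X = t @ np.array([[1.0, 2.0, -1.0]]) + 0.05 * rng.normal(size=(200, 3))
scores, components, explained = pca(X, n_components=2)
# the first component captures nearly all of the variance
```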
CODE AND OUTPUT
Importing Libraries and Dataset
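The collected dataset is not specified in this copy, so the setup step can be sketched with scikit-learn's bundled Iris data as a stand-in:

```python
# Stand-in setup: scikit-learn's bundled Iris data substitutes for the
# collected dataset, which is not specified in this file.
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
feature_names = load_iris().feature_names
# 150 samples, 4 features, 3 classes
```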
LEARNING