Question 1 Answers Complete
Filter methods evaluate the relevance of features by examining intrinsic properties of the
data, independent of any learning algorithm. They utilize statistical techniques to assess the
relationship between each feature and the target variable. Common techniques include
correlation coefficients, chi-square tests, and mutual information.
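As a brief illustration, a filter method can be applied with scikit-learn's SelectKBest; in the minimal sketch below, the breast-cancer dataset and k=5 are illustrative assumptions rather than values from the question.

# Filter-style selection: score each feature against the target with mutual
# information, independently of any downstream model.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

selector = SelectKBest(score_func=mutual_info_classif, k=5)  # keep the 5 highest-scoring features
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # (569, 5)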
Wrapper methods assess feature subsets by training and evaluating a specific machine
learning model on different combinations of features. They use performance metrics, such
as accuracy or F1 score, to determine the optimal feature subset. While this approach can
capture interactions between features and is often more accurate, it is computationally
intensive, especially with large feature sets, as it requires multiple model trainings.
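To make the contrast concrete, the sketch below uses forward sequential selection, a wrapper method in which each candidate subset is scored by the cross-validated accuracy of the chosen model; the logistic-regression estimator and n_features_to_select=5 are assumptions for illustration.

# Wrapper-style selection: feature choice is driven by the model's own
# cross-validated accuracy on candidate subsets.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),
    n_features_to_select=5,
    direction="forward",
    scoring="accuracy",
    cv=5,
)
X_selected = sfs.fit_transform(X, y)
print(sfs.get_support())  # boolean mask of the selected features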
Selecting k=10 is common due to its balance between bias and variance. With k=10, each
training set uses 90% of the data, and each test set uses 10%, providing a reliable estimate
of model performance. This setup reduces the variance of the performance estimate
compared to methods like leave-one-out cross-validation (k=n), which can result in high
variance due to training on nearly the entire dataset with minimal test data.
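A short sketch of 10-fold cross-validation with scikit-learn follows; the model and dataset are illustrative assumptions.

# Each of the 10 folds serves once as the 10% test split while the
# remaining 90% of the data is used for training.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=10)
print(scores.mean(), scores.std())  # average performance and its spread across folds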
2. Feature Engineering: Introduce new features that better represent the underlying data
patterns. This can involve creating interaction terms, polynomial features, or using domain
knowledge to derive meaningful attributes. Additionally, reducing regularization strength
allows the model to fit the training data more closely, potentially capturing more complex
patterns.
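For instance, interaction and polynomial terms can be generated with scikit-learn's PolynomialFeatures; the toy data and degree=2 below are illustrative assumptions.

# degree=2 adds squared terms and the pairwise interaction x0 * x1.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

poly = PolynomialFeatures(degree=2, include_bias=False)
X_expanded = poly.fit_transform(X)
print(poly.get_feature_names_out())  # ['x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']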
1. Dataset Size and Dimensionality: Algorithms vary in their ability to handle large datasets
and high-dimensional spaces. For example, kernel Support Vector Machines scale poorly to
very large datasets, whereas linear models trained with Stochastic Gradient Descent (e.g.,
scikit-learn's SGDClassifier) are designed to handle such scenarios efficiently.
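As a rough illustration of this trade-off, the sketch below trains an SGD-based linear classifier on a moderately large synthetic dataset; the dataset size and hyperparameters are assumptions, and a kernel SVC on the same data would become noticeably slower as the sample count grows.

# SGDClassifier scales roughly linearly with the number of samples,
# which makes it practical where a kernel SVM would be too slow.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=50_000, n_features=100, random_state=0)

sgd = SGDClassifier(loss="hinge", max_iter=1000, random_state=0)
sgd.fit(X, y)
print(sgd.score(X, y))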
Batch Gradient Descent computes the gradient of the loss function with respect to the entire
dataset before updating the model parameters. This yields a stable and accurate gradient
estimate, but it can be computationally expensive and slow on large datasets, since a single
parameter update requires a full pass over the data.
Stochastic Gradient Descent (SGD) updates the model parameters using the gradient
computed from a single data point at each iteration. This results in faster updates and the
ability to handle large datasets more efficiently. However, the parameter updates exhibit
higher variance, leading to a noisier optimization process that may cause the loss function
to fluctuate rather than converge smoothly.
In practice, a compromise between these two methods is often used, known as Mini-Batch
Gradient Descent, which updates parameters based on small batches of data, balancing
convergence stability and computational efficiency.
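A compact sketch of these update schemes for linear regression with squared loss is given below; the synthetic data, learning rate, and batch size are assumptions. Setting batch_size to the full dataset recovers batch gradient descent, and batch_size=1 recovers pure SGD.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

def gradient(w, Xb, yb):
    # Gradient of the mean squared error on the given batch.
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(3)
lr, batch_size = 0.05, 32  # mini-batch gradient descent
for epoch in range(100):
    idx = rng.permutation(len(X))  # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        w -= lr * gradient(w, X[batch], y[batch])
print(w)  # close to the true coefficients [1.5, -2.0, 0.5]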
1. Resampling Techniques:
- Oversampling the Minority Class: Increase the number of instances in the minority class
by duplicating existing samples or generating new synthetic samples using methods like
Synthetic Minority Over-sampling Technique (SMOTE).
- Undersampling the Majority Class: Reduce the number of instances in the majority class
to balance the class distribution. While this can lead to loss of information, it helps prevent
the model from being biased towards the majority class.
2. Algorithmic Approaches:
- Adjust Class Weights: Modify the algorithm to assign higher misclassification costs to the
minority class, for example by setting class weights inversely proportional to class
frequencies, which encourages the model to pay more attention to the minority class during
training (see the sketch after this list).
- Ensemble Methods: Utilize ensemble techniques like Balanced Random Forests or
EasyEnsemble, which are designed to handle imbalanced datasets by combining multiple
models trained on balanced subsets of the data.
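A brief sketch of two of these options follows: class weighting in scikit-learn and SMOTE oversampling via the imbalanced-learn package; the synthetic dataset and its 95/5 class split are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE  # requires the imbalanced-learn package

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Option 1: weight errors on the minority class more heavily during training.
clf_weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Option 2: synthesize new minority-class samples, then train on the balanced data.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
clf_smote = LogisticRegression(max_iter=1000).fit(X_res, y_res)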
A tree with fewer splits is less complex and captures the essential patterns in the data
without fitting noise or outliers. In contrast, a tree with more splits may overfit the training
data, capturing noise and leading to poor generalization on new data. Overfitting occurs
when a model learns the training data too closely, including its random fluctuations, which
do not represent the underlying data distribution.
Therefore, selecting the simpler tree with 3 splits reduces the risk of overfitting and
enhances the model's ability to generalize to new, unseen data.
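To make this concrete, the sketch below compares a small tree (4 leaves, i.e., 3 splits) with a much larger one by cross-validated accuracy; the dataset and the two complexity settings are illustrative assumptions.

# The shallow tree is constrained to 3 splits; the large tree is free to keep
# splitting and is therefore more prone to fitting noise.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for leaves in (4, 200):
    tree = DecisionTreeClassifier(max_leaf_nodes=leaves, random_state=0)
    print(leaves, cross_val_score(tree, X, y, cv=10).mean())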
Diversity can be achieved by training models on different subsets of the data, using
different algorithms, or introducing randomness into the training process. However, it is
important to balance diversity against the accuracy of the individual models: overly diverse
but weak models may lack sufficient accuracy, negating the benefits of combining them.
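One simple way to combine different algorithms is soft voting, sketched below; the three base models and the dataset are illustrative assumptions.

# Each base learner tends to make different kinds of errors;
# averaging their predicted probabilities can cancel some of them out.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=5000)),
        ("rf", RandomForestClassifier(random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="soft",
)
print(cross_val_score(ensemble, X, y, cv=5).mean())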
These models require substantial computational resources and large amounts of data to
train effectively, making them well-suited for scenarios with extensive datasets. While they
often operate as 'black boxes' with limited interpretability, their ability to model complex
relationships makes them ideal when predictive accuracy is the primary concern.
Ensemble methods such as Gradient Boosting Machines (e.g., XGBoost) can also be effective,
as they combine multiple weak learners to achieve strong predictive performance, and they
typically offer somewhat better interpretability than deep neural networks.
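A brief sketch of gradient boosting with scikit-learn's built-in implementation follows; the dataset and the default hyperparameters are illustrative assumptions (XGBoost exposes a very similar interface).

# Boosting fits each new tree to the residual errors of the current ensemble,
# turning many weak learners into a strong predictor.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

model = GradientBoostingClassifier(random_state=0)
print(cross_val_score(model, X, y, cv=5).mean())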