Modeling
Model Building
The choice of modelling approach, and the effort spent on improving model performance,
rested on several key considerations:
- Interpretability: Model accuracy must be balanced against interpretability. Some models,
such as Linear Regression, are easier to interpret and show directly which factors influence
the salary predictions.
- Scalability: The chosen approaches can efficiently handle large datasets, ensuring scalability
as the dataset size grows.
- Feature Importance: The models can provide insights into the key predictors affecting
salary, helping Delta Limited make informed decisions.
- Previous Success: The selected models have demonstrated effectiveness in similar projects,
providing confidence in their suitability for this salary prediction task.
4.3.1. Cross-Validation:
- Cross-validation was employed to ensure the models' robustness and generalization. This
technique helps in estimating how well the models will perform on unseen data.
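As an illustration of the idea, the following is a minimal sketch of k-fold cross-validation in plain Python. It is not the KNIME configuration used in the project: the data are synthetic, the single predictor, the choice of k = 5, and the helper names `fit_line`, `rmse`, and `k_fold_rmse` are all assumptions made for the example.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5

def k_fold_rmse(xs, ys, k=5, seed=0):
    """Return the held-out RMSE of a one-predictor linear model for each of k folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds of near-equal size
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds = [intercept + slope * xs[i] for i in held_out]
        scores.append(rmse([ys[i] for i in held_out], preds))
    return scores

# Synthetic, illustrative data only (not the Delta Limited dataset).
rng = random.Random(42)
experience = [rng.uniform(0, 20) for _ in range(100)]
expected_ctc = [5.0 + 1.5 * x + rng.gauss(0, 1.0) for x in experience]

fold_scores = k_fold_rmse(experience, expected_ctc, k=5)
print("RMSE per fold:", [round(s, 3) for s in fold_scores])
print("Mean RMSE:", round(mean(fold_scores), 3))
```

Each observation is held out exactly once, so the mean of the fold scores estimates the error on unseen data, and the spread across folds gives a rough sense of how stable that estimate is.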
4.3.3. Ensemble Methods:
- Ensemble methods, such as Random Forests and Gradient Boosting, were considered to
combine multiple models for improved accuracy. This approach leverages the strengths of
different algorithms.
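The principle of combining many weak models can be sketched with bagging, the resampling idea that Random Forests build on. The example below is a simplification and not the KNIME implementation: it averages one-split regression trees ("stumps") fitted to bootstrap samples of synthetic data, and the helper names and the choice of 25 models are assumptions. A real Random Forest also grows deeper trees and samples features at each split.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree; returns (threshold, left_mean, right_mean)."""
    best = None
    # Skip the smallest value so that both sides of every split are non-empty.
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        l_mean, r_mean = mean(left), mean(right)
        sse = sum((y - l_mean) ** 2 for y in left) + sum((y - r_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, l_mean, r_mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, l_mean, r_mean = stump
    return l_mean if x < threshold else r_mean

def bagged_stumps(xs, ys, n_models=25, seed=0):
    """Fit one stump per bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        sample = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in sample], [ys[i] for i in sample]))
    return models

def predict_ensemble(models, x):
    """Ensemble prediction: the average of the individual stump predictions."""
    return mean(predict_stump(m, x) for m in models)

# Synthetic, illustrative data only (not the Delta Limited dataset).
rng = random.Random(42)
experience = [rng.uniform(0, 20) for _ in range(100)]
expected_ctc = [5.0 + 1.5 * x + rng.gauss(0, 1.0) for x in experience]

models = bagged_stumps(experience, expected_ctc)
for x in (2.0, 10.0, 18.0):
    print(f"experience={x:>4}: predicted CTC={predict_ensemble(models, x):.2f}")
```

Because each stump sees a different resample, the split thresholds differ slightly, and averaging them smooths the single hard step that any one stump would produce.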
4.4. Challenges
4.4.2. Interpretability:
- Ensuring that the models provide meaningful insights into salary determinants was a priority.
Linear Regression, in particular, supports interpretability.
The KNIME workflow incorporates several nodes for data pre-processing and modelling:
- Excel Reader: This node imports and prepares data from Excel files.
- Missing Value Node: Handles missing data by replacing missing entries with appropriate values.
- Partitioning Node: Divides data into training and testing sets for model evaluation.
- Model Building Nodes: A Learner, a Predictor, and a Scorer node are combined to build and
evaluate each regression model: Linear Regression, Decision Trees, Gradient Boosted
Trees, and Random Forests.
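For readers who do not use KNIME, the same sequence of steps (impute, partition, learn, predict, score) can be sketched in plain Python. This is an assumed analogue rather than an export of the workflow: the data are synthetic, mean imputation and the 80/20 split are illustrative choices, and the function names are hypothetical.

```python
import random
from statistics import mean

def impute_mean(values):
    """Missing Value step: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def partition(rows, train_fraction=0.8, seed=0):
    """Partitioning step: shuffle the rows and split them into train and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def learn(train):
    """Learner step: least-squares fit of y on one predictor, as (intercept, slope)."""
    xs, ys = zip(*train)
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in train)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def predict(model, xs):
    """Predictor step: apply the fitted line to new predictor values."""
    intercept, slope = model
    return [intercept + slope * x for x in xs]

def r_squared(actual, predicted):
    """Scorer step: coefficient of determination on the supplied data."""
    y_bar = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - y_bar) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Synthetic, illustrative data only; every 20th predictor value is set to missing.
rng = random.Random(7)
current_ctc = [rng.uniform(3, 30) for _ in range(200)]
expected_ctc = [2.0 + 1.2 * x + rng.gauss(0, 1.0) for x in current_ctc]
raw_ctc = [None if i % 20 == 0 else x for i, x in enumerate(current_ctc)]

clean_ctc = impute_mean(raw_ctc)
train, test = partition(list(zip(clean_ctc, expected_ctc)))
model = learn(train)
test_x, test_y = zip(*test)
score = r_squared(test_y, predict(model, test_x))
print("Held-out R^2:", round(score, 3))
```

Scoring on the held-out partition, rather than on the training rows, is what makes the reported figure an estimate of performance on new data.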
The holistic approach outlined in this project is designed to address various challenges,
optimize model performance, and provide actionable insights into salary prediction for Delta
Limited.
Figure 5. KNIME models used for salary prediction
Total Experience, Current CTC, Number of Companies Worked, Number of Publications, and
Certifications are statistically significant predictors of Expected CTC at the 5% level:
each has a p-value below 0.05.
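For reference, the p-value of a regression coefficient is conventionally obtained from a t-test of the null hypothesis that the coefficient is zero. The notation below is assumed for illustration and does not come from the report: $\hat{\beta}_j$ is the estimated coefficient of predictor $j$, $n$ is the number of observations, $k$ is the number of predictors, and $F_{n-k-1}$ is the cumulative distribution function of Student's t distribution with $n-k-1$ degrees of freedom.

```latex
t_j = \frac{\hat{\beta}_j}{\operatorname{SE}(\hat{\beta}_j)}, \qquad
p_j = 2\,\bigl(1 - F_{n-k-1}(\lvert t_j \rvert)\bigr)
```

A predictor is called significant at the 5% level when $p_j < 0.05$, meaning that a coefficient this far from zero would be unlikely if the predictor had no linear effect on Expected CTC.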
Table 4. Regression Statistics for Model Validity and Reliability