Capstone Project Assignment

The capstone project focuses on advanced data analysis and visualization using Python, specifically with Pandas and NumPy for data manipulation and Matplotlib, Seaborn, Plotly, and Bokeh for visualizations. Participants will clean and analyze a dataset containing employee information, addressing tasks such as handling missing values, detecting outliers, and performing various analyses to extract insights. Deliverables include a well-documented Jupyter notebook, a summary report, and a presentation slide deck.

Uploaded by

nk2gv9dv5f
Copyright
© All Rights Reserved

Capstone Project Assignment: Advanced Data Analysis and Visualization

Objective

This capstone project is designed to challenge and enhance your skills in Python
programming, focusing on data preprocessing, cleaning, manipulation, and analysis using
Pandas and NumPy. It also evaluates your ability to create compelling, meaningful
visualizations with Matplotlib, Seaborn, Plotly, and Bokeh.

Project Context

The dataset contains detailed employee information, including demographics, job roles,
salaries, bonuses, performance scores, and other attributes. It has intentionally been
augmented with anomalies (e.g., typos, missing values, and outliers) to simulate real-world
data. Your task is to clean, analyze, and extract meaningful insights to guide business
decisions.

Project Tasks

1. Data Preprocessing and Exploration

1. Load and Inspect the Dataset:
o Load the dataset and display the first few rows.
o Understand the structure of the dataset: dimensions, column names, and data
types.
2. Handle Missing Values:
o Identify columns with missing values and analyze their proportion.
o Impute missing values appropriately, for example with the mean or median for
numerical data and the mode or a forward fill for categorical data.
3. Clean Incorrect Data Entries:
o Identify and correct data inconsistencies, such as typos (e.g., Femelle instead
of Female, Malle instead of Male).
o Standardize categorical columns (e.g., enforce consistent casing so that Yes
and yes in RemoteWork are treated as the same value).
4. Explore Categorical and Numerical Columns:
o Count unique values in categorical columns and analyze their distributions.
o Compute statistical summaries for numerical columns (mean, median, standard
deviation, range).
5. Detect and Handle Outliers:
o Use boxplots and statistical methods (e.g., z-scores or IQR) to identify outliers
in numerical columns.
o Decide on strategies to handle these outliers (e.g., capping, removal, or
retaining for analysis).
6. Check for Duplicates:
o Detect duplicate rows and decide whether to retain or remove them.
7. Create New Derived Columns:
o Add a column for the ratio of AnnualBonus to Salary.
o Add a column for ExperienceLevel (e.g., Junior: 0–5 years, Mid: 6–15 years,
Senior: 16+ years).
o Add a column for AgeDecade (e.g., 20s, 30s, etc.).
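
The preprocessing steps above could be sketched as follows. This is a minimal illustration, not a reference solution: the toy rows are hypothetical, the column name Age is assumed, YearsAtCompany is assumed to be the experience measure behind ExperienceLevel, and capping at the IQR fences is only one of the outlier strategies listed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the real dataset.
raw = pd.DataFrame({
    "Gender": ["Female", "Femelle", "Male", "Malle", "Male", "Male"],
    "RemoteWork": ["Yes", "yes", "No", None, "No", "No"],
    "Age": [25, 34, 41, 58, 29, 29],              # assumed column name
    "YearsAtCompany": [2, 8, 16, 30, 4, 4],
    "Salary": [50000.0, 62000.0, np.nan, 900000.0, 55000.0, 55000.0],
    "AnnualBonus": [5000.0, 6200.0, 7000.0, 9000.0, 2750.0, 2750.0],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply one possible cleaning strategy and add the derived columns."""
    # Drop exact duplicate rows (here the choice is to remove them).
    df = df.drop_duplicates().copy()

    # Correct known typos and unify casing in categorical columns.
    df["Gender"] = df["Gender"].replace({"Femelle": "Female", "Malle": "Male"})
    df["RemoteWork"] = df["RemoteWork"].str.strip().str.capitalize()

    # Impute: median for numerical data, mode for categorical data.
    df["Salary"] = df["Salary"].fillna(df["Salary"].median())
    df["RemoteWork"] = df["RemoteWork"].fillna(df["RemoteWork"].mode().iloc[0])

    # Flag outliers with the IQR rule and cap them at the fences.
    q1, q3 = df["Salary"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df["Salary"] = df["Salary"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

    # Derived columns.
    df["BonusToSalary"] = df["AnnualBonus"] / df["Salary"]
    df["ExperienceLevel"] = pd.cut(
        df["YearsAtCompany"],
        bins=[-1, 5, 15, np.inf],                 # 0-5, 6-15, 16+ years
        labels=["Junior", "Mid", "Senior"],
    )
    df["AgeDecade"] = (df["Age"] // 10 * 10).astype(str) + "s"
    return df


cleaned = clean(raw)
print(raw.isna().mean())   # proportion of missing values per column, before cleaning
print(cleaned)
```

The order of operations is a design choice: imputing with the median before capping keeps the extreme salary from distorting the imputed value, whereas a mean-based imputation would be pulled toward it.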

2. In-depth Analysis Questions

Using Pandas and NumPy, answer the following questions:

1. GroupBy Analysis:
o Calculate the average salary by Department and Gender.
o Identify the top 3 job roles in terms of average PerformanceScore.
2. Correlation and Relationships:
o Compute correlations between Salary, YearsAtCompany, and
PerformanceScore.
o Identify whether salary has a stronger correlation with PerformanceScore or
YearsAtCompany.
3. Crosstab Analysis:
o Analyze the relationship between RemoteWork and MaritalStatus using a
crosstab.
4. Filtering and Ranking:
o List the top 5 employees with the highest bonus-to-salary ratio.
o Identify the top 3 cities with the highest average salaries and their
corresponding average performance scores.
5. Departmental Analysis:
o Find the department with the most balanced gender ratio.
o Compare the average salaries of employees in Sales and Engineering across
countries.
6. NumPy Calculations:
o Calculate the median salary for each job role using NumPy.
o Standardize the PerformanceScore column using z-scores.
7. Performance and Age:
o Group employees by age group (e.g., the derived AgeDecade column) and
calculate average PerformanceScore and AnnualBonus.
o Explore how bonuses vary across different ExperienceLevel categories.
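
A few of the analysis questions above might be approached as in the sketch below. The data frame is a hypothetical toy sample, and Department stands in for the job-role column in the NumPy median step because the real column name is not given here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy sample; values are invented for illustration only.
df = pd.DataFrame({
    "Department": ["Sales", "Sales", "Engineering", "Engineering", "Sales", "Engineering"],
    "Gender": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "RemoteWork": ["Yes", "No", "Yes", "Yes", "No", "No"],
    "MaritalStatus": ["Single", "Married", "Married", "Single", "Single", "Married"],
    "YearsAtCompany": [1, 3, 5, 7, 9, 11],
    "PerformanceScore": [3.0, 4.0, 2.0, 5.0, 3.0, 4.0],
    "Salary": [40000.0, 45000.0, 60000.0, 70000.0, 50000.0, 80000.0],
    "AnnualBonus": [4000.0, 2250.0, 9000.0, 7000.0, 2500.0, 4000.0],
})

# GroupBy: average salary by Department and Gender.
avg_salary = df.groupby(["Department", "Gender"])["Salary"].mean()

# Correlation: which variable is more strongly correlated with Salary?
corr = df[["Salary", "YearsAtCompany", "PerformanceScore"]].corr()
stronger = corr.loc["Salary", ["YearsAtCompany", "PerformanceScore"]].abs().idxmax()

# Crosstab: RemoteWork against MaritalStatus.
remote_by_marital = pd.crosstab(df["RemoteWork"], df["MaritalStatus"])

# Ranking: top 5 employees by bonus-to-salary ratio.
top_ratio = (
    df.assign(BonusToSalary=df["AnnualBonus"] / df["Salary"])
      .nlargest(5, "BonusToSalary")
)

# NumPy: median salary per group (Department used as a stand-in grouping key).
median_by_group = {
    name: float(np.median(group["Salary"].to_numpy()))
    for name, group in df.groupby("Department")
}

# NumPy: z-score standardization of PerformanceScore (population std, ddof=0).
scores = df["PerformanceScore"].to_numpy()
z_scores = (scores - scores.mean()) / scores.std()

print(avg_salary, corr, remote_by_marital, top_ratio, median_by_group, z_scores, sep="\n\n")
```

Note that NumPy's std defaults to the population standard deviation, while pandas' Series.std uses the sample version; whichever is chosen should be stated in the notebook.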

3. Visualization Tasks

Create the following visualizations:

Using Matplotlib and Seaborn

1. A bar plot showing the average salary by Education level.
2. A heatmap visualizing correlations between numerical columns.
3. A boxplot showing the distribution of salaries segmented by Gender and Department.
4. A line plot showing the trend of average performance scores over YearsAtCompany.
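
As one illustration of the static plots above, the bar plot of average salary by Education could be sketched with Matplotlib alone. The sample values are hypothetical, and the headless Agg backend is an assumption made so that the snippet runs outside a notebook; inside Jupyter the figure renders inline without it.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a Jupyter notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample values for illustration.
df = pd.DataFrame({
    "Education": ["Bachelor", "Master", "Bachelor", "PhD", "Master"],
    "Salary": [50000.0, 65000.0, 54000.0, 90000.0, 75000.0],
})

# Aggregate first, then plot the aggregated values.
avg_by_education = df.groupby("Education")["Salary"].mean().sort_values()

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(list(avg_by_education.index), avg_by_education.to_numpy())
ax.set_xlabel("Education")
ax.set_ylabel("Average salary")
ax.set_title("Average salary by education level")
fig.tight_layout()
```

Labeling both axes and giving the figure a title addresses the clarity criterion directly; Seaborn's barplot can compute the same averages, but aggregating explicitly keeps the plotted numbers easy to check.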

Using Plotly and Bokeh

1. An interactive geographical map showing average salaries by Country.
2. An interactive scatter plot exploring the relationship between Salary and
PerformanceScore.
3. A parallel coordinates plot analyzing relationships between Salary, AnnualBonus,
and PerformanceScore.
4. An interactive histogram to explore the distribution of YearsAtCompany.

Deliverables

1. Code Notebook:
o A well-documented Jupyter notebook with clean, modular code and comments.
o Include analysis, visualizations, and insights.
2. Summary Report:
o A 2–3 page report summarizing:
▪ Key findings and insights.
▪ Embedded visuals with brief explanations.
▪ Recommendations based on the analysis.
3. Presentation Slides:
o A 7–10 slide deck summarizing the project approach, visuals, and actionable
insights.

Evaluation Criteria

1. Completeness: Have all tasks been addressed thoroughly?
2. Code Quality: Is the code clean, efficient, and well-documented?
3. Insightfulness: Are the insights logical and supported by analysis?
4. Visualization: Are visualizations clear, labeled, and insightful?
5. Advanced Work: Bonus points for completing advanced tasks creatively.

Good luck, and enjoy the challenge! 🚀🚀
