Python Module 5
Question-1
You've collected the annual salaries of five employees at Tech Solutions
Inc. and want to store this data in a NumPy array. Write Python code to
create a NumPy array called "salary_array" to store the following
salaries: [60000, 75000, 80000, 90000, 85000]. Print the data.
import numpy as np
salary_array = np.array([60000,75000,80000,90000,85000])
print(salary_array)
Question-2
OBG Tech Solutions Inc. has different departments, and you want to create
a 2D NumPy array to store employee salaries by department. Write Python
code to create a 2D NumPy array called "department_salaries" that stores
this data.
import numpy as np
print(department_salaries)
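The department salary table is not reproduced in this section, so the following is only a minimal sketch with hypothetical figures (two departments, three employees each). It shows how nested lists become a 2D array; the department names in the comments are assumptions.

```python
import numpy as np

# Hypothetical figures: one row per department, one column per employee.
department_salaries = np.array([
    [60000, 75000, 80000],   # e.g. Engineering (assumed)
    [52000, 58000, 61000],   # e.g. Marketing (assumed)
])

print(department_salaries)
print(department_salaries.shape)  # (rows, columns)
```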
Question-3
Employee number 3 at OBG Tech Solutions Inc. received a bonus of $5000.
Update the salary of the 3rd employee in the "salary_array" and print the
result with the message shown in the expected output.
import numpy as np
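One possible way to apply the bonus, assuming "salary_array" still holds the Question-1 values. The expected output is not shown in this section, so the message text below is a placeholder.

```python
import numpy as np

salary_array = np.array([60000, 75000, 80000, 90000, 85000])

# NumPy indexing is zero-based, so the 3rd employee sits at index 2.
salary_array[2] += 5000

# Placeholder message; adjust it to the expected output.
print("Updated salary of employee 3:", salary_array[2])
print(salary_array)
```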
Question-4
You are tasked with analyzing the total salary cost for a company based on
the salaries of its employees and their respective years of experience.
The company has a policy of increasing an employee's salary by a certain
percentage based on their years of experience.
The policy states that for every 1 year of experience, an employee's
salary should increase by 10%. You need to calculate the total salary cost
for the company after applying this policy to each employee's salary based
on their experience.
import numpy as np
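# employee_experience: NumPy array with each employee's years of experience (same order as salary_array)
adjusted_salaries = salary_array*((employee_experience/10) + 1)
total_salary_cost = adjusted_salaries.sum()
print("Total_Salary_Cost:", total_salary_cost)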
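The policy wording does not say whether the 10% is a simple increase per year or compounds year over year. The sketch below computes both readings so the difference is visible. The experience values are hypothetical, since the original data is not shown here.

```python
import numpy as np

salary_array = np.array([60000, 75000, 80000, 90000, 85000])
# Hypothetical years of experience, one entry per employee.
employee_experience = np.array([2, 5, 1, 3, 4])

# Simple reading: 10% of the base salary for each year of experience.
simple_salaries = salary_array * (1 + 0.10 * employee_experience)

# Compounding reading: a 10% raise applied once per year of experience.
compound_salaries = salary_array * 1.10 ** employee_experience

simple_total = simple_salaries.sum()
compound_total = compound_salaries.sum()

print("Total (simple):", simple_total)
print("Total (compound):", compound_total)
```

The expression used in the solution corresponds to the simple reading.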
Question-5
OBG Tech Solutions Inc. is expanding, and they hired three new employees
with salaries of [72000, 78000, 76000]. Create a new NumPy array called
"new_hires" to store their salaries and then concatenate it with the
"salary_array" to update the company's salary data as
"updated_salary_array". Print the msg "Updated Salary Array:" to match the
expected output.
import numpy as np
new_hires = np.array([72000, 78000, 76000])
updated_salary_array = np.concatenate((salary_array,new_hires),axis=0)
print("Updated Salary Array:", updated_salary_array)
Question-6
To better organize the salary data, you want to reshape the "salary_array"
(use this from the previous questions) into a 2D array with two rows and
three columns. How can you use NumPy to reshape an existing array? Save
the results in "reshaped_salary_array" and print the results to match the
expected output.
import numpy as np
reshaped_salary_array = salary_array.reshape(2,4)  # needs exactly 8 elements (2 * 4)
print(reshaped_salary_array)
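A reshape only succeeds when rows times columns equals the number of elements. The sketch below assumes an eight-element array (the five original salaries with the bonus applied, plus the three new hires) and shows a valid shape, an inferred dimension, and a shape that fails.

```python
import numpy as np

# Assumed contents: Question-1 salaries with the bonus, plus the new hires.
salaries = np.array([60000, 75000, 85000, 90000, 85000, 72000, 78000, 76000])

two_by_four = salaries.reshape(2, 4)

# -1 asks NumPy to infer that dimension from the array size.
inferred = salaries.reshape(2, -1)

# 2 * 3 = 6 does not match 8 elements, so this raises ValueError.
try:
    salaries.reshape(2, 3)
except ValueError as err:
    print("Cannot reshape:", err)

print(two_by_four)
```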
import numpy as np
salary_array = np.array([80000, 85000, 90000, 95000, 92000, 70000, 72000, 75000, 78000, 76000])
print("Salary array:", salary_array)
mean_salary = np.mean(salary_array)
median_salary = np.median(salary_array)
total_salary_cost = np.sum(salary_array)
# np.std computes the population standard deviation (ddof=0) by default
std_deviation_salary = np.std(salary_array)
seventy_fifth_percentile_salary = np.percentile(salary_array, 75)
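NumPy's np.std divides by n by default, while pandas' Series.std divides by n - 1. The sketch below reuses the salary values above to show how the ddof argument switches between the two.

```python
import numpy as np

salary_array = np.array([80000, 85000, 90000, 95000, 92000,
                         70000, 72000, 75000, 78000, 76000])

population_std = np.std(salary_array)        # ddof=0: divide by n
sample_std = np.std(salary_array, ddof=1)    # ddof=1: divide by n - 1

median_salary = np.median(salary_array)
seventy_fifth_percentile_salary = np.percentile(salary_array, 75)

print("Population std:", population_std)
print("Sample std:", sample_std)
print("Median:", median_salary)
print("75th percentile:", seventy_fifth_percentile_salary)
```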
import numpy as np
average_engineering_salary = np.average(engineering_salaries)
average_marketing_salary = np.average(marketing_salaries)
1. Convert the given data in the above table into a dataframe and print
the output.
2. Print the first two rows from the dataframe.
3. Print the name of the columns present in the dataframe.
4. Print the dimensions of the given dataframe.
5. Print the statistical summary of all the numerical columns present in
the dataframe.
6. Extract the mean and median of the Salary column.
import pandas as pd
data = {
employee_df = pd.DataFrame(data)
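print(employee_df)              # full dataframe
print(employee_df.head(2))      # first two rows
print(employee_df.columns)      # column names
print(employee_df.shape)        # dimensions as (rows, columns)
print(employee_df.describe())   # summary of the numerical columns
mean_salary = employee_df['Salary'].mean()
median_salary = employee_df['Salary'].median()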
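The source table is not reproduced in this section. The sketch below runs the six steps on a small hypothetical dataset. The column names follow those used elsewhere in this module; the rows themselves are made up.

```python
import pandas as pd

# Hypothetical rows; the original table is not available here.
data = {
    "EmployeeName": ["Asha", "Ben", "Chen"],
    "Age": [28, 35, 41],
    "Department": ["Engineering", "Marketing", "Finance"],
    "Salary": [70000, 65000, 90000],
}
employee_df = pd.DataFrame(data)

print(employee_df)                    # 1. full dataframe
print(employee_df.head(2))            # 2. first two rows
print(employee_df.columns.tolist())   # 3. column names
print(employee_df.shape)              # 4. (rows, columns)
print(employee_df.describe())         # 5. numerical summary

# 6. mean and median of the Salary column
mean_salary = employee_df["Salary"].mean()
median_salary = employee_df["Salary"].median()
print(mean_salary, median_salary)
```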
import pandas as pd
df = pd.read_csv("employee_data.csv")
mean_salary = df['Salary'].mean()
median_salary = df['Salary'].median()
Question-2
Write a Python program to import the data from the given XLSX file into
Python using pandas and also perform the following operations:
Note: The data is already uploaded for practice. Please import the file
"employee_data.xlsx" using pandas.
import pandas as pd
df = pd.read_excel("employee_data.xlsx")
emp_shape = df.shape
std_salary = df['Salary'].std()
sum_salary = df['Salary'].sum()
print(f"Dimension of Shape: {emp_shape}")
1. Check whether the dataset contains any missing values in the Salary column.
2. Check whether the dataset contains any missing values in the Age column.
3. Print whether the dataset contains any duplicated rows.
4. Drop the rows with missing values and print the cleaned dataframe.
import pandas as pd
employee_df = pd.read_csv("employee_data_nc.csv")
missing_Salary_data_mask = pd.isnull(employee_df['Salary'])
missing_Age_data_mask = employee_df['Age'].isnull()
duplicate_rows = employee_df[employee_df.duplicated()]
employee_df_dropped = employee_df.dropna(subset=["Salary"])
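The masks above are boolean Series with one entry per row. Reducing them with .any() answers the yes/no questions in the task list. The sketch below shows this on a small hypothetical dataframe (values invented for illustration) and drops rows with a missing value in any column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps and one repeated row.
employee_df = pd.DataFrame({
    "EmployeeName": ["Asha", "Ben", "Chen", "Asha"],
    "Age": [28, 35, np.nan, 28],
    "Salary": [70000, np.nan, 90000, 70000],
})

has_missing_salary = employee_df["Salary"].isnull().any()
has_missing_age = employee_df["Age"].isnull().any()
has_duplicates = employee_df.duplicated().any()

print("Missing salary values:", has_missing_salary)
print("Missing age values:", has_missing_age)
print("Duplicated rows:", has_duplicates)

# dropna() without arguments removes rows with a missing value in any column.
cleaned_df = employee_df.dropna()
print(cleaned_df)
```

Passing subset=["Salary"] to dropna, as in the solution above, restricts the check to that column only.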
1. Fill the missing data in the Salary column with the mean salary.
2. Change the datatype of the Salary column from float to int.
3. Change the values of the Department column:
Engineering: Engg
Marketing: Mktg
Finance: Fin
import pandas as pd
employee_df = pd.read_csv("employee_data_nc.csv")
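# Assign the result back instead of calling fillna(..., inplace=True) on a column selection
mean_salary = employee_df["Salary"].mean()
employee_df["Salary"] = employee_df["Salary"].fillna(mean_salary)
mean_age = employee_df["Age"].mean()
employee_df["Age"] = employee_df["Age"].fillna(mean_age)
# astype(int) truncates any fractional part left by the mean fill
employee_df["Salary"] = employee_df["Salary"].astype(int)
employee_df["Age"] = employee_df["Age"].astype(int)
print("Changed data type float to int for Age and Salary: \n",employee_df)
department_mapping = {"Engineering": "Engg", "Marketing": "Mktg", "Finance": "Fin"}
employee_df["Department"] = employee_df["Department"].replace(department_mapping)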
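A column that contains missing values is stored as float, and filling it with the mean can leave fractional values that astype(int) then truncates. The sketch below illustrates both effects, plus the department renaming, on hypothetical data.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the missing salary forces a float column.
employee_df = pd.DataFrame({
    "Department": ["Engineering", "Marketing", "Finance"],
    "Salary": [70000, np.nan, 90001],
})

mean_salary = employee_df["Salary"].mean()   # 80000.5
employee_df["Salary"] = employee_df["Salary"].fillna(mean_salary)

# Truncates 80000.5 to 80000; use .round() first if rounding is preferred.
employee_df["Salary"] = employee_df["Salary"].astype(int)

department_mapping = {"Engineering": "Engg", "Marketing": "Mktg", "Finance": "Fin"}
employee_df["Department"] = employee_df["Department"].replace(department_mapping)

print(employee_df)
```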
Note: The data is already uploaded for practice. Please import the file
"employee_data.csv" using pandas.
import pandas as pd
employee_df = pd.read_csv("employee_data.csv")
high_earners = employee_df[employee_df["Salary"]>80000]
engineering_employees=employee_df[employee_df["Department"]=="Engineering"]
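Both filters above use a boolean mask. Masks can also be combined with & (and) or | (or), provided each condition is wrapped in parentheses. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical rows for illustration.
employee_df = pd.DataFrame({
    "EmployeeName": ["Asha", "Ben", "Chen", "Dina"],
    "Department": ["Engineering", "Marketing", "Engineering", "Finance"],
    "Salary": [95000, 65000, 72000, 88000],
})

high_earners = employee_df[employee_df["Salary"] > 80000]
engineering_employees = employee_df[employee_df["Department"] == "Engineering"]

# Parentheses are required because & binds more tightly than > and ==.
high_paid_engineers = employee_df[
    (employee_df["Salary"] > 80000) & (employee_df["Department"] == "Engineering")
]

print(high_paid_engineers)
```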
Question-2
Write a Python program to import the data from the CSV file and also
perform the following operations on it:
1. Show the sum of the employee salaries.
2. Show the employee data of the youngest employee and the oldest employee.
3. Show the most common department in the employee data.
import pandas as pd
employee_df = pd.read_csv("employee_data.csv")
total_salary_cost=employee_df["Salary"].sum()
print("Total Salary:",total_salary_cost)
oldest_employee=employee_df.loc[employee_df["Age"].idxmax()]
youngest_employee=employee_df.loc[employee_df["Age"].idxmin()]
most_common_department = employee_df["Department"].mode()[0]
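idxmax and idxmin return the index label of the extreme value, which .loc then turns into the full row. The sketch below walks through this on hypothetical data, since the CSV file is not available here.

```python
import pandas as pd

# Hypothetical rows standing in for employee_data.csv.
employee_df = pd.DataFrame({
    "EmployeeName": ["Asha", "Ben", "Chen", "Dina"],
    "Age": [28, 35, 41, 30],
    "Department": ["Engineering", "Marketing", "Engineering", "Finance"],
    "Salary": [95000, 65000, 72000, 88000],
})

total_salary_cost = employee_df["Salary"].sum()

# idxmax/idxmin give the row label; .loc fetches that row as a Series.
oldest_employee = employee_df.loc[employee_df["Age"].idxmax()]
youngest_employee = employee_df.loc[employee_df["Age"].idxmin()]

# mode() returns a Series because ties are possible; [0] takes the first.
most_common_department = employee_df["Department"].mode()[0]

print("Total Salary:", total_salary_cost)
print("Oldest:", oldest_employee["EmployeeName"])
print("Youngest:", youngest_employee["EmployeeName"])
print("Most common department:", most_common_department)
```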
import pandas as pd
employee_df = pd.read_csv("employee_data.csv")
employee_df.set_index(["Department","EmployeeName"],inplace=True)
employee_df.sort_index(inplace=True)
print(employee_df)
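Once Department and EmployeeName form a sorted two-level index, rows can be selected by the outer level alone or by a full (Department, EmployeeName) pair. A small sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical rows for illustration.
employee_df = pd.DataFrame({
    "EmployeeName": ["Chen", "Asha", "Ben"],
    "Department": ["Engineering", "Engineering", "Marketing"],
    "Salary": [72000, 95000, 65000],
})

indexed_df = employee_df.set_index(["Department", "EmployeeName"]).sort_index()

# Outer level only: all employees in one department.
engineering_rows = indexed_df.loc["Engineering"]

# Full key: one cell, addressed by (Department, EmployeeName) and column.
asha_salary = indexed_df.loc[("Engineering", "Asha"), "Salary"]

print(engineering_rows)
print(asha_salary)
```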
Question-2
Write a Python program to import the data from the CSV file into Python
and sort the data by values using the Salary column.
import pandas as pd
employee_df = pd.read_csv('employee_data.csv')
employee_df = employee_df.sort_values(by='Salary')  # ascending by default
print("Sorted by Salary:\n",employee_df)
Question-3
Write a Python program to import the data from the CSV file into Python
and concatenate the existing data with the new data given above.
1. Make sure to also remove the duplicate records from the combined
dataframe.
2. Show the common data between both the new and existing data table using
EmployeeName as the common key.
import pandas as pd
employee_df = pd.read_csv('employee_data.csv')
data = {
    'EmployeeName': ['Karishma', 'Rena', 'Ragesh', 'Eve'],
    'Age': [30, 29, 30, 31],
    'Department': ['Content', 'Manager', 'Editor', 'HR'],
    'Salary': [95000, 90000, 90000, 72000],
}
new_employee_data_df = pd.DataFrame(data)
combined_df=pd.concat([employee_df,new_employee_data_df])
combined_df=combined_df.drop_duplicates()
print("Combined Datatable:\n",combined_df)
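The second task, showing the records common to both tables with EmployeeName as the key, can be approached with an inner merge. The sketch below is one possible way to do it. The existing-data rows are hypothetical stand-ins for the CSV file, and the suffix names are arbitrary choices.

```python
import pandas as pd

# Hypothetical stand-in for the rows in employee_data.csv.
existing_df = pd.DataFrame({
    "EmployeeName": ["Eve", "Asha"],
    "Age": [31, 28],
    "Department": ["HR", "Engineering"],
    "Salary": [72000, 95000],
})
new_df = pd.DataFrame({
    "EmployeeName": ["Karishma", "Eve"],
    "Age": [30, 31],
    "Department": ["Content", "HR"],
    "Salary": [95000, 72000],
})

# Stack the tables, then drop rows that are identical in every column.
combined_df = pd.concat([existing_df, new_df], ignore_index=True).drop_duplicates()

# An inner merge keeps only names present in both tables.
common_df = pd.merge(
    existing_df, new_df,
    on="EmployeeName", how="inner",
    suffixes=("_existing", "_new"),
)

print("Combined Datatable:\n", combined_df)
print("Common records:\n", common_df)
```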
1. The manager wants to know the average salary in each department. Create
a pivot table that displays the average salary for each department.
4. Management needs to find the total salary cost for each department. Use
aggregation functions to calculate the total salary cost per department.
import pandas as pd
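employee_df=pd.read_csv('employee_data.csv')
# Average salary per department
pivot_table_avg_salary = employee_df.pivot_table(values='Salary', index='Department', aggfunc='mean')
print(pivot_table_avg_salary)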
print(pivot_table_employee_count)
print(pivot_table_max_min_salary_by_age)
total_salary_cost_by_department = employee_df.groupby(['Department'])['Salary'].sum()
print(total_salary_cost_by_department)
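pivot_table and groupby are two routes to the same per-department aggregates. The sketch below compares them on hypothetical data, and uses agg to compute several statistics in one pass.

```python
import pandas as pd

# Hypothetical rows for illustration.
employee_df = pd.DataFrame({
    "EmployeeName": ["Asha", "Ben", "Chen", "Dina"],
    "Department": ["Engineering", "Marketing", "Engineering", "Marketing"],
    "Salary": [90000, 60000, 70000, 64000],
})

# Pivot table: one row per department, mean of Salary.
avg_salary = employee_df.pivot_table(
    values="Salary", index="Department", aggfunc="mean"
)

# groupby + agg: several aggregates side by side.
summary = employee_df.groupby("Department")["Salary"].agg(["sum", "mean", "count"])

print(avg_salary)
print(summary)
```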
Question-2
Write a Python program to import the data from the CSV file into Python
and concatenate the existing data with the new data given above.
1.Make sure to also remove the duplicate records from the combined
dataframe.
2.Show employee count in each department using the groupby function.
import pandas as pd
employee_df = pd.read_csv('employee_data.csv')
data = {
new_employee_data_df = pd.DataFrame(data)
combined_df = pd.concat([employee_df, new_employee_data_df])
combined_df = combined_df.drop_duplicates()
print("Combined Datatable:\n",combined_df)
count_by_department = combined_df.groupby(['Department'])['EmployeeName'].count()
print(count_by_department)
Capstone
Question-1
In this task, you are presented with an unclean car price dataset stored
in a CSV file named 'car-prices_unclean.csv'. Your primary objective is to
perform preliminary data exploration and cleaning to ensure that the
dataset is ready for further analysis or modeling.
import pandas as pd
import numpy as np
df = pd.read_csv('car-prices_unclean.csv')
print(df.head())
print(df.shape)
df.info()  # info() prints its report itself and returns None
print(df.isnull().sum())
print(df.duplicated().sum())
print(df.nunique())
print(df.describe())
Question-2
Write Python code that removes the unnecessary columns ('car_ID' and
'Untitled'), handles missing values in the 'symboling' and 'price'
columns, imputes missing values in the 'fueltype' column, and checks
for any remaining missing values.
Additionally, detect and report outliers in the 'price' column using
quartiles and the interquartile range (IQR). Treat the identified outliers
by applying a logarithmic transformation to 'price'.
import pandas as pd
import numpy as np
df = pd.read_csv('car-prices_unclean.csv')
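# Remove the columns named in the task
df = df.drop(columns=['car_ID', 'Untitled'])
# Assign the filled column back; fillna(..., inplace=True) on a column selection is deprecated
df['symboling'] = df['symboling'].fillna(df['symboling'].mean())
df['price'] = df['price'].fillna(df['price'].median())
most_frequent_fueltype = df['fueltype'].mode()[0]
df['fueltype'] = df['fueltype'].fillna(most_frequent_fueltype)
print(df.isnull().sum())
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
# Assumed rule: values beyond 1.5 * IQR from the quartiles count as outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['price'] < lower_bound) | (df['price'] > upper_bound)]
print("\nOutliers in price:")
print(outliers)
df['price'] = np.log1p(df['price'])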
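To make the outlier step concrete, the sketch below applies quartile-based detection to a short hypothetical price series. The 1.5 * IQR fences are the conventional choice and an assumption here, since the task does not state a multiplier.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one obviously extreme value.
prices = pd.Series([10000, 12000, 11000, 13000, 12500, 11500, 60000], name="price")

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Assumed fences: 1.5 * IQR below Q1 and above Q3.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = prices[(prices < lower_bound) | (prices > upper_bound)]
print("Outliers in price:", outliers.tolist())

# log1p(x) = log(1 + x) compresses large values and is undone by expm1.
log_prices = np.log1p(prices)
print(log_prices.round(3).tolist())
```

The transformation does not remove the extreme value; it shrinks its distance from the rest of the data.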
Question-3
This is a data preprocessing task for a dataset containing information
about cars and their prices. The code performs the following tasks:
import pandas as pd
import numpy as np
df = pd.read_csv('cars-prices_new.csv')
print(df['CarName'].head(25))
print(car_companies.head(25))
df['car_company'] = car_companies
print(df['car_company'].value_counts())
print(df['car_company'].value_counts())
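The step that builds "car_companies" is not shown above. One plausible approach, assuming each CarName starts with the company followed by a space and the model, is to split on the first space and then correct spelling variants with a mapping. The sample names and the misspelling below are hypothetical.

```python
import pandas as pd

# Hypothetical CarName values in "<company> <model>" form.
df = pd.DataFrame({
    "CarName": ["toyota corolla", "audi 100 ls", "Nissan versa", "toyouta tercel"],
})

# Take the first word and normalise the case.
car_companies = df["CarName"].str.split(" ").str[0].str.lower()
print(car_companies.value_counts())

# Hypothetical correction table for misspelled company names.
corrections = {"toyouta": "toyota"}
car_companies = car_companies.replace(corrections)

df["car_company"] = car_companies
print(df["car_company"].value_counts())
```

Comparing value_counts before and after the correction shows whether the variants were merged.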
Question-4
Your task is to analyze the dataset and identify factors that have a
significant impact on car prices. To achieve this, you need to perform the
following steps using Python and Pandas:
import pandas as pd
import numpy as np
df = pd.read_csv('cars-prices_final.csv')
numerical_columns = df.select_dtypes(include=[np.number])
correlation_matrix = numerical_columns.corr()
price_correlations = correlation_matrix['price']
print(price_correlations)
significant_factors = price_correlations[np.abs(price_correlations) > 0.5]
print(significant_factors)
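The same correlation filter can be tried on a small hypothetical dataframe. The column names and values below are invented for illustration. The sketch also drops 'price' from its own correlation list, since its correlation with itself is always 1.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the non-numeric column is ignored by select_dtypes.
df = pd.DataFrame({
    "enginesize": [100, 120, 150, 200, 250],
    "citympg": [35, 30, 26, 20, 16],
    "doornumber": [4, 2, 4, 2, 4],
    "fueltype": ["gas", "gas", "diesel", "gas", "diesel"],
    "price": [8000, 10000, 14000, 22000, 30000],
})

numerical_columns = df.select_dtypes(include=[np.number])
price_correlations = numerical_columns.corr()["price"].drop("price")

# Keep strong relationships in either direction (threshold 0.5 as above).
significant_factors = price_correlations[price_correlations.abs() > 0.5]

print(price_correlations)
print(significant_factors)
```

A strongly negative value is as informative as a strongly positive one, which is why the filter uses the absolute value.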
Question-5
Your task is to analyze the impact of categorical columns (those
containing object or category data types) on car prices. To achieve this,
you need to perform the following tasks using Python and Pandas:
Your final output should provide insights into how different categories
within each categorical column impact car prices. Ensure that your code is
well-documented and easy to understand, making it clear which categories
have a significant influence on car prices within each categorical column.
import pandas as pd
import numpy as np
df = pd.read_csv('cars-prices_final.csv')
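categorical_columns = df.select_dtypes(include=['object','category'])
# Summarize the price distribution for every category of every categorical column
for column in categorical_columns.columns:
    unique_categories = df[column].unique()
    for category in unique_categories:
        category_data = df[df[column] == category]
        price_summary = category_data['price'].describe()
        print(f"Category: {category}")
        print(price_summary)
        print('\n')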
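As a more compact alternative to looping over categories one by one, groupby can produce the price summary for all categories of a column in a single table. The sketch below uses hypothetical data and stores the columns as the category dtype explicitly.

```python
import pandas as pd

# Hypothetical data with two categorical columns.
df = pd.DataFrame({
    "fueltype": pd.Categorical(["gas", "gas", "diesel", "diesel"]),
    "carbody": pd.Categorical(["sedan", "hatchback", "sedan", "sedan"]),
    "price": [10000, 12000, 18000, 20000],
})

categorical_columns = df.select_dtypes(include=["object", "category"]).columns

summaries = {}
for column in categorical_columns:
    # One row per category, with count, mean, std, min, quartiles and max.
    summaries[column] = df.groupby(column, observed=True)["price"].describe()
    print(f"Column: {column}")
    print(summaries[column])
    print()
```

Sorting each summary by its "mean" column gives a quick ranking of categories by typical price.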