Python Module 5

This document provides examples of using NumPy and Pandas to store and analyze employee salary data. NumPy is used to create arrays to hold single and multi-dimensional salary data. Operations like updating salaries, calculating totals, and reshaping arrays are demonstrated. Pandas is used to import salary data from CSV and Excel files into DataFrames. Summary statistics like mean, median, and standard deviation are calculated on the DataFrames. Basic data cleaning checks for missing values and duplicates are also performed.

Uploaded by

surajmishraa24

NumPy - Storing Data in Arrays

Question-1
You've collected the annual salaries of five employees at Tech Solutions
Inc. and want to store this data in a NumPy array. Write Python code to
create a NumPy array called "salary_array" to store the following
salaries: [60000, 75000, 80000, 90000, 85000]. Print the data.

import numpy as np

salary_array = np.array([60000,75000,80000,90000,85000])

print(salary_array)

Question-2
OBG Tech Solutions Inc. has different departments, and you want to create
a 2D NumPy array to store employee salaries by department. Write Python
code to create a 2D NumPy array called "department_salaries" that stores
this data.

import numpy as np

department_salaries = np.array([[80000, 85000, 90000, 95000, 92000],
                                [70000, 72000, 75000, 78000, 76000]])

print(department_salaries)

Question-3
Employee Number 3 at OBG Tech Solutions Inc. received a bonus of $5000.
Update the salary of the 3rd employee in the "salary_array" and Print the
results with the msg shown in the expected output.

import numpy as np

salary_array = np.array([60000, 75000, 80000, 90000, 85000])

salary_array[2] = salary_array[2] + 5000

print("Updated Salary Array:", salary_array)

Question-4
You are tasked with analyzing the total salary cost for a company based on
the salaries of its employees and their respective years of experience.
The company has a policy of increasing an employee's salary by a certain
percentage based on their years of experience.
The policy states that for every 1 year of experience, an employee's
salary should increase by 10%. You need to calculate the total salary cost
for the company after applying this policy to each employee's salary based
on their experience.

import numpy as np

salary_array = np.array([60000, 75000, 80000, 90000, 85000])

employee_experience = np.array([3, 5, 2, 7, 4])

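# Simple (non-compounding) 10% raise per year of experience, per employee
adjusted_salaries = salary_array * (1 + employee_experience / 10)

# The question asks for the company-wide total, so sum the adjusted salaries
total_salary_cost = adjusted_salaries.sum()

print("Total_Salary_Cost:", total_salary_cost)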
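The "10% for every year of experience" policy can be read in two ways. The sketch below compares a simple reading (10% of the base salary per year) with a compounding reading (each raise applied on top of the previous one). The compounding reading is an assumption made here for comparison; the exercise does not state it.

```python
import numpy as np

salary_array = np.array([60000, 75000, 80000, 90000, 85000])
employee_experience = np.array([3, 5, 2, 7, 4])

# Simple reading: 10% of the base salary for each year of experience
simple_total = np.sum(salary_array * (1 + 0.10 * employee_experience))

# Compounding reading (assumption, not stated in the exercise):
# each yearly 10% raise is applied on top of the previous year's salary
compound_total = np.sum(salary_array * 1.10 ** employee_experience)

print(f"Simple total:   {simple_total:.2f}")
print(f"Compound total: {compound_total:.2f}")
```

Both expressions rely on NumPy broadcasting: the two arrays have the same length, so the arithmetic is applied element by element before `np.sum` collapses the result to a single number.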

Question-5
OBG Tech Solutions Inc. is expanding, and they hired three new employees
with salaries of [72000, 78000, 76000]. Create a new NumPy array called
"new_hires" to store their salaries and then concatenate it with the
"salary_array" to update the company's salary data as
"updated_salary_array". Print the msg "Updated Salary Array:" to match the
expected output.

import numpy as np

salary_array = np.array([60000, 75000, 80000, 90000, 85000])

new_hires = np.array([72000, 78000, 76000])

updated_salary_array = np.concatenate((salary_array,new_hires),axis=0)

print("Updated Salary Array:",updated_salary_array)

Question-6
To better organize the salary data, you want to reshape the "salary_array"
(use the updated eight-element array from the previous question) into a 2D array with two rows and
four columns. How can you use NumPy to reshape an existing array? Save
the results in "reshaped_salary_array" and print the results to match the
expected output.

import numpy as np

salary_array = np.array([60000, 75000, 80000, 90000, 85000, 72000, 78000, 76000])

reshaped_salary_array = salary_array.reshape(2,4)

print("Reshaped Salary Array:",reshaped_salary_array)
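A reshape only succeeds when rows times columns equals the number of elements. The following minimal sketch shows that an eight-element array fits a 2 x 4 layout but not a 2 x 3 one, and that passing `-1` lets NumPy infer one dimension.

```python
import numpy as np

salary_array = np.array([60000, 75000, 80000, 90000, 85000, 72000, 78000, 76000])

# -1 asks NumPy to infer that dimension from the array size (8 / 2 = 4)
reshaped = salary_array.reshape(2, -1)
print(reshaped.shape)

# 8 elements cannot fill a 2 x 3 grid, so this raises ValueError
try:
    salary_array.reshape(2, 3)
except ValueError as exc:
    print("Cannot reshape:", exc)
```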


NumPy Functions
Question-1
Your first task is to perform some basic operations
on the salary list provided by your manager. The manager has asked you to
do the following with the data:
1. Convert the salary list into an array in 'salary_array'.
2. Store the average employee salary in 'mean_salary'.
3. Print the median (center value) of the given salary list.
4. Calculate the total of all the employee salaries in the 'total_salary_cost'
variable.
5. Find the variation in the employee salaries using standard deviation in the
'std_deviation_salary' variable.
6. Find the 75th percentile of the employee salaries in the
'seventy_fifth_percentile_salary' variable.
7. Print the results to match the expected output below.

import numpy as np

salary_array = np.array([80000, 85000, 90000, 95000, 92000, 70000, 72000, 75000, 78000, 76000])

print("Salary array:",salary_array)

mean_salary = np.mean(salary_array)

print(f"The mean salary of employees is: ${mean_salary:0.2f}")

median_salary= np.median(salary_array)

print(f"The median salary of employees is: ${median_salary:0.2f}")

total_salary_cost = np.sum(salary_array)

print(f"The total salary cost for the company is: ${total_salary_cost:0.2f}")

std_deviation_salary = np.std(salary_array)

print(f"The standard deviation of salaries is: ${std_deviation_salary:0.2f}")

seventy_fifth_percentile_salary = np.percentile(salary_array,75)

print(f"The 75th percentile salary is: ${seventy_fifth_percentile_salary:0.2f}")
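One detail worth knowing when comparing results across libraries: `np.std` computes the population standard deviation by default (`ddof=0`), while the pandas `Series.std` method defaults to the sample standard deviation (`ddof=1`). A short sketch of the difference on the same salary data:

```python
import numpy as np

salary_array = np.array([80000, 85000, 90000, 95000, 92000,
                         70000, 72000, 75000, 78000, 76000])

# Population standard deviation: divides by n (NumPy default, ddof=0)
population_std = np.std(salary_array)

# Sample standard deviation: divides by n - 1 (what pandas uses by default)
sample_std = np.std(salary_array, ddof=1)

print(f"Population std: {population_std:.2f}")
print(f"Sample std:     {sample_std:.2f}")
```

Which one is appropriate depends on whether the ten salaries are the whole population or a sample from a larger workforce.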


Question-2
You have done a great job in giving insight into the salary data of all
the employees. But your manager is still curious about the salaries of the
two major departments which are Engineering and Marketing. The manager
wants to know which department consumes the most chunk of the company's
finances.
Your task is to calculate the mean salary for the Engineering and
Marketing departments separately using NumPy functions. These statistics
unveil departmental compensation trends, guiding decisions about salary
adjustments and resource allocation.
Additionally, find the department with the highest median salary.
1. Convert the given salary list into an array.
2. Calculate the average for each department’s salary.
3. Find the mid-value for each department’s salary and compare
which department’s mid-value is greater.

import numpy as np

engineering_salaries = np.array([80000, 85000, 90000, 95000, 92000])

marketing_salaries = np.array([70000, 72000, 75000, 78000, 76000])

average_engineering_salary = np.average(engineering_salaries)

print(f"Average Engineering Department Salary: ${average_engineering_salary:0.2f}")

average_marketing_salary = np.average(marketing_salaries)

print(f"Average Marketing Department Salary: ${average_marketing_salary:0.2f}")

if np.median(marketing_salaries) < np.median(engineering_salaries):
    print("The department with the highest median salary is Engineering")
else:
    print("The department with the highest median salary is Marketing")

Pandas - Storing Data in DataFrames


Question-1
The manager has provided you with the above table and asked you to perform
the following tasks on the data in the table using pandas in Python.

1. Convert the given data in the above table into a dataframe and print
the output.
2. Print the first two rows from the dataframe.
3. Print the name of the columns present in the dataframe.
4. Print the dimensions of the given dataframe.
5. Print the statistical summary of all the numerical columns present in
the dataframe
6. Extract the mean and median for the salary column

import pandas as pd

data = {
    'EmployeeName': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [28, 35, 24, 42, 31],
    'Department': ['Engineering', 'Marketing', 'Engineering', 'Finance', 'HR'],
    'Salary': [75000, 80000, 90000, 85000, 72000]
}

employee_df = pd.DataFrame(data)

print("The Employee Dataframe: \n", employee_df)

print(employee_df.head(2))

print("Columns in the given dataframe: \n",employee_df.columns)

num_rows, num_columns = employee_df.shape

print("Dimensions of the employee dataframe")

print(f"Number of Rows: {num_rows}")

print(f"Number of Columns: {num_columns}")

print("Statistics of numerical columns")

print(employee_df.describe())

mean_salary = employee_df['Salary'].mean()

median_salary = employee_df['Salary'].median()

print(f"Mean Salary: ${mean_salary:.2f}")

print(f"Median Salary: ${median_salary:.2f}")

Importing Data from Files


Question-1
Write a python program to import the data from the given CSV file into
python using pandas and also find out the mean and median for the “Salary”
column
Note: The data is already uploaded for the purpose of practice, Please
import the file "employee_data.csv" using Pandas.

import pandas as pd

df = pd.read_csv("employee_data.csv")

mean_salary = df['Salary'].mean()

median_salary = df['Salary'].median()

print("Mean and Median from CSV dataframe")

print(f"Mean Salary: ${mean_salary:0.2f}")

print(f"Median Salary: ${median_salary:0.2f}")

Question-2
Write a Python program to import the data from the given XLSX file into
Python using pandas and also perform the following operations:

1. Find the dimensions of the given dataframe
2. Find the standard deviation for the salary column
3. Find the sum of the salary column

Note: The data is already uploaded for the purpose of practice, Please
import the file "employee_data.xlsx" using Pandas.

import pandas as pd

df = pd.read_excel("employee_data.xlsx")

emp_shape = df.shape

std_salary = df['Salary'].std()

sum_salary = df['Salary'].sum()
print(f"Dimension of Shape: {emp_shape}")

print("Std and Sum from XLSX dataframe")

print(f"Std Salary: ${std_salary:0.2f}")

print(f"Sum Salary: ${sum_salary:0.2f}")

Basic Data Cleaning


Question-1
You need to write a Python program to import the data from the given CSV
file into Python and also perform some basic checks on the dataset like:

1. Check if the dataset contains any missing values in the salary column.
2. Check if the dataset contains any missing values in the age column.
3. Print if the dataset contains any duplicated value.
4. Drop the rows with missing values and print the cleaned data frame.

import pandas as pd

employee_df = pd.read_csv("employee_data_nc.csv")

missing_Salary_data_mask = pd.isnull(employee_df['Salary'])

print(f"Null Value for Salary:\n",missing_Salary_data_mask)

missing_Age_data_mask = employee_df['Age'].isnull()

print(f"Null Value for Age:\n",missing_Age_data_mask)

duplicate_rows = employee_df[employee_df.duplicated()]

print(f"List of Duplicated Rows: \n",duplicate_rows)

# Drop every row that has a missing value in any column, not only in Salary
employee_df_dropped = employee_df.dropna()

print(f"Cured Dataset: \n",employee_df_dropped)
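Printing a full boolean mask gets hard to read on larger files. As a minimal sketch, the same checks can be summarized as counts. The small DataFrame below is a hypothetical stand-in for "employee_data_nc.csv", whose contents are not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for employee_data_nc.csv (assumed columns and values)
employee_df = pd.DataFrame({
    "EmployeeName": ["Alice", "Bob", "Charlie", "David", "Bob"],
    "Age": [28, 35, np.nan, 42, 35],
    "Salary": [75000, 80000, 90000, np.nan, 80000],
})

# Number of missing values per column instead of a row-by-row mask
missing_counts = employee_df[["Salary", "Age"]].isnull().sum()
print(missing_counts)

# True if at least one row is an exact duplicate of an earlier row
has_duplicates = employee_df.duplicated().any()
print("Contains duplicates:", has_duplicates)

# Remove rows with any missing value, then remove repeated rows
cleaned_df = employee_df.dropna().drop_duplicates()
print(cleaned_df)
```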


Question-2
You have to create a python program where you need to import the data from
the given CSV file and perform the following operations as described
below:

1. Fill the missing data in the salary column with the mean value of the salary.
2. Change the datatype of the salary column from float to int.
3. Change the values of the department column:
Engineering : Engg
Marketing : Mktg
Finance : Fin

import pandas as pd

employee_df = pd.read_csv("employee_data_nc.csv")

mean_salary = employee_df["Salary"].mean()

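# Assign the result back; calling fillna(inplace=True) on a selected column
# is deprecated in recent pandas versions and may not update the DataFrame
employee_df["Salary"] = employee_df["Salary"].fillna(mean_salary)

mean_age = employee_df["Age"].mean()

employee_df["Age"] = employee_df["Age"].fillna(mean_age)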

print("Filled data with new values: \n",employee_df)

employee_df["Salary"] = employee_df["Salary"].astype(int)

employee_df["Age"] = employee_df["Age"].astype(int)

print("Changed data type float to int for Age and Salary: \n",employee_df)

department_mapping = {"Engineering": "Engg","Marketing": "Mktg","Finance": "Fin"}

employee_df["Department"] = employee_df["Department"].replace(department_mapping)

print("Changed Department name: \n",employee_df)
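The order of the steps matters: a numeric column that contains missing values is stored as float64, and it can only be cast to int once the gaps are filled. The sketch below walks through this on a small, hypothetical DataFrame (the values are assumptions, not the contents of the practice file).

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing salary
employee_df = pd.DataFrame({
    "Department": ["Engineering", "Marketing", "Finance", "HR"],
    "Salary": [75000, np.nan, 90000, 72000],
})

# The NaN forces the column to a floating-point dtype
print(employee_df["Salary"].dtype)

# Fill first, then cast; casting a column that still holds NaN would fail
employee_df["Salary"] = employee_df["Salary"].fillna(employee_df["Salary"].mean())
employee_df["Salary"] = employee_df["Salary"].astype(int)

# Values without an entry in the mapping (here "HR") are left unchanged
department_mapping = {"Engineering": "Engg", "Marketing": "Mktg", "Finance": "Fin"}
employee_df["Department"] = employee_df["Department"].replace(department_mapping)

print(employee_df)
```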


Data Manipulation with Pandas
Question-1
Write a Python program to import the data from the CSV file and also
perform the following operations on it:
1. Fetch the data of the first 3 rows for the columns EmployeeName and
Salary.
2. Show the data of the employees who earn more than $80,000.
3. Show the data of the employees who are from the Engineering department.

Note: The data is already uploaded for the purpose of practice, Please
import the file "employee_data.csv" using Pandas.

import pandas as pd

employee_df = pd.read_csv("employee_data.csv")

selected_data = employee_df.loc[0:2, ["EmployeeName", "Salary"]]

print("Select Employee Data: \n", selected_data)

high_earners = employee_df[employee_df["Salary"]>80000]

print("Highest Earning Employee: \n",high_earners)

engineering_employees=employee_df[employee_df["Department"]=="Engineering"]

print("Employee of Engineering Department: \n", engineering_employees)
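Two details of this kind of selection are easy to trip over: `.loc` slices by label and includes the end label, while `.iloc` slices by position and excludes it; and combining conditions needs `&` or `|` with each condition in parentheses. A small sketch, using the five-row employee table from the earlier DataFrame exercise as a stand-in for "employee_data.csv" (the file's actual contents are an assumption):

```python
import pandas as pd

# Stand-in for employee_data.csv (assumed to match the earlier table)
employee_df = pd.DataFrame({
    "EmployeeName": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [28, 35, 24, 42, 31],
    "Department": ["Engineering", "Marketing", "Engineering", "Finance", "HR"],
    "Salary": [75000, 80000, 90000, 85000, 72000],
})

# Label-based slice: rows 0, 1 and 2 (end label included)
first_three = employee_df.loc[0:2, ["EmployeeName", "Salary"]]

# Position-based slice: rows 0 and 1 only (end position excluded)
first_two = employee_df.iloc[0:2]

# Combine conditions with & and wrap each one in parentheses
high_paid_engineers = employee_df[
    (employee_df["Salary"] > 80000) & (employee_df["Department"] == "Engineering")
]
print(high_paid_engineers)
```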

Question-2
Write a Python program to import the data from the CSV file and also
perform the following operations on it:
1. Show the sum of the employee salary
2. Show the employee data of the youngest employee and oldest employee.
3. Show the most common department among the employees.

import pandas as pd

employee_df = pd.read_csv("employee_data.csv")

total_salary_cost=employee_df["Salary"].sum()

print("Total Salary:",total_salary_cost)
oldest_employee=employee_df.loc[employee_df["Age"].idxmax()]

print("Detail of Oldest Employee:\n",oldest_employee)

youngest_employee=employee_df.loc[employee_df["Age"].idxmin()]

print("Detail of Youngest Employee:\n",youngest_employee)

most_common_department = employee_df["Department"].mode()[0]

print("Most Common Department\n",most_common_department)

Data Manipulation : Indexing and Sorting


Question-1
Write a Python program to import the data from the CSV file into Python
and create a multi-level index (MultiIndex) from the columns "EmployeeName" and "Department".
Sort the data using the created index.

import pandas as pd

employee_df = pd.read_csv("employee_data.csv")

employee_df.set_index(["Department","EmployeeName"],inplace=True)

employee_df.sort_index(inplace=True)

print(employee_df)
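Once the two-level index is in place and sorted, rows can be looked up by department or by a (department, name) pair. A minimal sketch, again using the five-row table from the earlier exercise as an assumed stand-in for the CSV file:

```python
import pandas as pd

# Stand-in for employee_data.csv (assumed to match the earlier table)
employee_df = pd.DataFrame({
    "EmployeeName": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [28, 35, 24, 42, 31],
    "Department": ["Engineering", "Marketing", "Engineering", "Finance", "HR"],
    "Salary": [75000, 80000, 90000, 85000, 72000],
})

indexed_df = employee_df.set_index(["Department", "EmployeeName"]).sort_index()

# All rows of one department (outer index level)
engineering = indexed_df.loc["Engineering"]
print(engineering)

# A single value addressed by both index levels
charlie_salary = indexed_df.loc[("Engineering", "Charlie"), "Salary"]
print(charlie_salary)
```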

Question-2
Write a Python program to import the data from the CSV file into Python
and sort the data by values using the Salary column.

import pandas as pd

employee_df = pd.read_csv('employee_data.csv')

employee_df.sort_values(by='Salary', ascending=False, inplace=True)

print("Sorted by Salary:\n",employee_df)
Question-3
Write a Python program to import the data from the CSV file into Python
and concatenate the existing data with the new data given above.
1. Make sure to also remove the duplicate records from the combined
dataframe.
2. Show the common data between both the new and existing data table using
EmployeeName as the common key.

import pandas as pd

employee_df = pd.read_csv('employee_data.csv')

data = {
    'EmployeeName': ['Karishma', 'Rena', 'Ragesh', 'Eve'],
    'Age': [30, 29, 30, 31],
    'Department': ['Content', 'Manager', 'Editor', 'HR'],
    'Salary': [95000, 90000, 90000, 72000]
}

new_employee_data_df = pd.DataFrame(data)

combined_df=pd.concat([employee_df,new_employee_data_df])

combined_df=combined_df.drop_duplicates()

print("Combined Datatable:\n",combined_df)

merged_df=pd.merge(employee_df, new_employee_data_df, on='EmployeeName', how='inner')

print("Common records between the new and existing datatable:\n",merged_df)
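When two tables are merged on one key and share other column names, pandas appends the suffixes `_x` and `_y` to the overlapping columns. The sketch below passes explicit `suffixes` so the origin of each column is clearer. The existing table is an assumed stand-in for "employee_data.csv", and the suffix names are chosen here only for illustration.

```python
import pandas as pd

# Stand-in for employee_data.csv (assumed to match the earlier table)
employee_df = pd.DataFrame({
    "EmployeeName": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [28, 35, 24, 42, 31],
    "Department": ["Engineering", "Marketing", "Engineering", "Finance", "HR"],
    "Salary": [75000, 80000, 90000, 85000, 72000],
})

new_employee_data_df = pd.DataFrame({
    "EmployeeName": ["Karishma", "Rena", "Ragesh", "Eve"],
    "Age": [30, 29, 30, 31],
    "Department": ["Content", "Manager", "Editor", "HR"],
    "Salary": [95000, 90000, 90000, 72000],
})

# Inner join keeps only the names present in both tables
merged_df = pd.merge(
    employee_df,
    new_employee_data_df,
    on="EmployeeName",
    how="inner",
    suffixes=("_existing", "_new"),
)
print(merged_df)
```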

Data Manipulation : Grouping Data and Aggregation
Question-1
Write a Python program to import the data from the CSV file and perform
the following operations using pivot tables and aggregate functions in
pandas:

1. The manager wants to know the average salary in each department. Create
a pivot table that displays the average salary for each department.

2. HR is interested in the total number of employees in each department.
Use a pivot table to show the count of employees in each department.

3. The finance department wants to identify the highest and lowest
salaries in each age group. Create a pivot table that displays the maximum
and minimum salaries for each age group.

4. Management needs to find the total salary cost for each department. Use
aggregation functions to calculate the total salary cost per department.

import pandas as pd

employee_df=pd.read_csv('employee_data.csv')

pivot_table_avg_salary = pd.pivot_table(employee_df, values='Salary',
                                        index='Department', aggfunc='mean')

print(pivot_table_avg_salary)

pivot_table_employee_count = pd.pivot_table(employee_df, values='EmployeeName',
                                            index='Department', aggfunc='count')

print(pivot_table_employee_count)

pivot_table_max_min_salary_by_age = pd.pivot_table(employee_df, values='Salary',
                                                   index='Age',
                                                   aggfunc={'Salary': ['max', 'min']})

print(pivot_table_max_min_salary_by_age)

total_salary_cost_by_department = employee_df.groupby(['Department'])['Salary'].sum()

print(total_salary_cost_by_department)
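For a single grouping column, a pivot table and a groupby aggregation give the same numbers. As a rough sketch, several statistics can be collected in one `groupby(...).agg(...)` call. The data is the five-row table from the earlier exercise, used as an assumed stand-in for the CSV file.

```python
import pandas as pd

# Stand-in for employee_data.csv (assumed to match the earlier table)
employee_df = pd.DataFrame({
    "EmployeeName": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [28, 35, 24, 42, 31],
    "Department": ["Engineering", "Marketing", "Engineering", "Finance", "HR"],
    "Salary": [75000, 80000, 90000, 85000, 72000],
})

# Mean, head count and total salary per department in one table
summary = employee_df.groupby("Department")["Salary"].agg(["mean", "count", "sum"])
print(summary)

# The pivot table for the mean holds the same values
pivot_avg = pd.pivot_table(employee_df, values="Salary",
                           index="Department", aggfunc="mean")
print(pivot_avg)
```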

Question-2
Write a Python program to import the data from the CSV file into Python
and concatenate the existing data with the new data given above.
1. Make sure to also remove the duplicate records from the combined
dataframe.
2. Show employee count in each department using the groupby function.

import pandas as pd

employee_df = pd.read_csv('employee_data.csv')

data = {
    'EmployeeName': ['Karishma', 'Rena', 'Ragesh', 'Eve'],
    'Age': [30, 29, 30, 31],
    'Department': ['Content', 'Manager', 'Editor', 'HR'],
    'Salary': [95000, 90000, 90000, 72000]
}

new_employee_data_df = pd.DataFrame(data)

combined_df = pd.concat([employee_df, new_employee_data_df])

combined_df = combined_df.drop_duplicates()

print("Combined Datatable:\n",combined_df)

count_by_department =combined_df.groupby(['Department'])['EmployeeName'].count()

print(count_by_department)

Capstone
Question-1
In this task, you are presented with an unclean car price dataset stored
in a CSV file named 'car-prices_unclean.csv'. Your primary objective is to
perform preliminary data exploration and cleaning to ensure that the
dataset is ready for further analysis or modeling.

import pandas as pd
import numpy as np

df = pd.read_csv('car-prices_unclean.csv')

print('First Few Rows of the Dataset:')

print(df.head())

print('\nDimensions of the Dataset:')

print(df.shape)

print('\nSummary of the Dataset:')

df.info()  # info() prints its summary directly and returns None

print('\nMissing Values in the Dataset:')

print(df.isnull().sum())

print(df.duplicated().sum())

print(df.nunique())

print('\nSummary Statistics for Numerical Columns:')

print(df.describe())

Question-2
Write a Python code to include removing unnecessary columns ('car_ID' and
'Untitled'), handling missing values in the 'symboling' and 'price'
columns, imputing missing values in the 'fueltype' column, and checking
for any remaining missing values.

Additionally, you need to detect and report outliers in the 'price' column
using quartiles and the Interquartile Range (IQR). Identified outliers
will be treated by applying a logarithmic transformation to 'price.'
import pandas as pd

import numpy as np

df = pd.read_csv('car-prices_unclean.csv')

df.drop(['car_ID', 'Untitled'], axis = 1, inplace = True)

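# Assign the filled column back; fillna(inplace=True) on a selected column
# is deprecated in recent pandas versions and may not update the DataFrame
df['symboling'] = df['symboling'].fillna(df['symboling'].mean())

df['price'] = df['price'].fillna(df['price'].median())

most_frequent_fueltype = df['fueltype'].mode()[0]

df['fueltype'] = df['fueltype'].fillna(most_frequent_fueltype)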

print(df.isnull().sum())

Q1 = df['price'].quantile(0.25)

Q3 = df['price'].quantile(0.75)

IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR

upper_bound = Q3 + 1.5 * IQR

outliers = df[(df['price'] < lower_bound) | (df['price'] > upper_bound)]

print("\nOutliers in price:")

print(outliers)

df['price'] = np.log1p(df['price'])
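To make the IQR rule concrete, here is a small worked example on seven hypothetical prices (the values are invented for illustration and are not taken from the car dataset). Note that the logarithmic transformation does not remove the outlier; it only compresses the scale so the extreme value sits closer to the rest.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one obviously extreme value
prices = pd.Series([5000, 7000, 8000, 9000, 10000, 12000, 45000])

Q1 = prices.quantile(0.25)
Q3 = prices.quantile(0.75)
IQR = Q3 - Q1

# Values more than 1.5 * IQR outside the quartiles are flagged
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = prices[(prices < lower_bound) | (prices > upper_bound)]
print("Bounds:", lower_bound, upper_bound)
print("Outliers:", outliers.tolist())

# log1p(x) = log(1 + x); it shrinks large values far more than small ones
log_prices = np.log1p(prices)
print(log_prices.round(2).tolist())
```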

Question-3
This is a data preprocessing task for a dataset containing information about cars
and their prices. The code performs the following tasks:

1. Reads a CSV file named 'cars-prices_new.csv' into a Pandas DataFrame,
'df'.
2. Prints the first 25 rows of the 'CarName' column from the DataFrame.
3. Extracts the car company names from the 'CarName' column and creates a
new Pandas Series, 'car_companies', which contains the first word of each
'CarName'.
4. Drops the 'CarName' column from the DataFrame.
5. Adds a new column, 'car_company', to the DataFrame using the
'car_companies' Series.
6. Cleans and standardizes some car company names by replacing certain
misspelled or inconsistent names with correct ones.
7. Prints the counts of each unique value in the 'car_company' column after
the standardization.

import pandas as pd

import numpy as np

df = pd.read_csv('cars-prices_new.csv')

print(df['CarName'].head(25))

car_companies = pd.Series([car.split(" ")[0] for car in df['CarName']], index = df.index)

print(car_companies.head(25))

df.drop(columns=['CarName'], inplace=True)

df['car_company'] = car_companies

print(df['car_company'].value_counts())

df.loc[(df['car_company'] == "vw") | (df['car_company'] == "vokswagen"), 'car_company'] = 'volkswagen'

df.loc[df['car_company'] == "porcshce", 'car_company'] = 'porsche'

df.loc[df['car_company'] == "toyouta", 'car_company'] = 'toyota'

df.loc[df['car_company'] == "Nissan", 'car_company'] = 'nissan'

df.loc[df['car_company'] == "maxda", 'car_company'] = 'mazda'

print(df['car_company'].value_counts())
Question-4
Your task is to analyze the dataset and identify factors that have a
significant impact on car prices. To achieve this, you need to perform the
following steps using Python and Pandas:

1. Read the dataset from a CSV file named 'cars-prices_final.csv' into a
Pandas DataFrame, 'df'.
2. Extract the numerical columns from the DataFrame, as these are the
potential factors that could affect car prices.
3. Create a new DataFrame, 'numerical_columns', containing only numerical
columns from 'df'.
4. Calculate the correlation matrix for the numerical columns. The
correlation matrix will help you understand the relationships between
these factors and the 'price' of the cars.
5. Extract the correlation coefficients between the 'price' column and all
other numerical columns. Store these correlations in the
'price_correlations' Series.
6. Print the correlation coefficients with the 'price' column to understand
how each factor is related to car prices.
7. Identify factors that have a significant impact on car prices. Select
the factors with an absolute correlation coefficient greater than 0.5,
indicating a strong positive or negative correlation with car prices.
8. Print the factors with significant impacts on car prices.
9. Your final output should include a list of factors that significantly
affect car prices based on their correlation coefficients with the 'price'
column. Ensure that your code is well-documented and easy to understand,
making it clear which factors have the most influence on car prices.

import pandas as pd

import numpy as np

df = pd.read_csv('cars-prices_final.csv')

numerical_columns = df.select_dtypes(include=[np.number])

correlation_matrix = numerical_columns.corr()

price_correlations = correlation_matrix['price']

print("Correlation coefficients with 'price' ")

print(price_correlations)
significant_factors = price_correlations[np.abs(price_correlations) > 0.5]

print("\nFactors with significant impact on car prices:")

print(significant_factors)
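The 'price' column always correlates perfectly with itself, so it passes any threshold and shows up among the "significant" factors. A minimal sketch of dropping that self-correlation before filtering is given below. The column names and values are hypothetical and only stand in for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns standing in for cars-prices_final.csv
df = pd.DataFrame({
    "price": [6000, 7000, 10000, 15000, 25000],
    "enginesize": [90, 100, 120, 150, 200],
    "citympg": [38, 35, 30, 24, 18],
    "doornumber": [2, 4, 2, 4, 2],
})

numerical_columns = df.select_dtypes(include=[np.number])

# Remove the trivial correlation of price with itself
price_correlations = numerical_columns.corr()["price"].drop("price")

# Keep strong positive or negative relationships only
significant_factors = price_correlations[price_correlations.abs() > 0.5]
print(significant_factors)
```

A high absolute correlation only indicates a linear association in the data; it does not by itself show that the factor causes the price difference.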

Question-5
Your task is to analyze the impact of categorical columns (those
containing object or category data types) on car prices. To achieve this,
you need to perform the following tasks using Python and Pandas:

1. Read the dataset from a CSV file named 'cars-prices_final.csv' into a
Pandas DataFrame, 'df'.
2. Identify the categorical columns in the dataset. These columns contain
non-numeric data, and you will analyze how each category within these
columns affects the 'price' of the cars.
3. For each categorical column, perform the following steps:
a. Print a message indicating the impact of the current categorical column
on 'price'. For example, if the column is 'CarType', the message might be:
"Impact of 'CarType' on 'price':"
b. Find the unique categories within the current categorical column.
c. For each unique category, filter the data to include only rows where
the categorical column matches the current category.
d. Calculate summary statistics for the 'price' column within this
filtered data, including statistics such as mean, standard deviation,
minimum, maximum, and quartiles.
e. Print the category name and the summary statistics for the 'price'
column for that category.

After analyzing all unique categories within the current categorical
column, print a newline character ('\n') to separate the analysis for
different categorical columns.

Your final output should provide insights into how different categories
within each categorical column impact car prices. Ensure that your code is
well-documented and easy to understand, making it clear which categories
have a significant influence on car prices within each categorical column.

import pandas as pd

import numpy as np

df = pd.read_csv('cars-prices_final.csv')
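
# Keep only the non-numeric (object or category) columns
categorical_columns = df.select_dtypes(include=['object','category'])

for column in categorical_columns:
    print(f"Impact of '{column}' on 'price':")

    unique_categories = df[column].unique()

    for category in unique_categories:
        # Rows that belong to the current category
        category_data = df[df[column] == category]

        # count, mean, std, min, quartiles and max of price for this category
        price_summary = category_data['price'].describe()

        print(f"Category: {category}")
        print(price_summary)

    # Blank line between the analyses of different columns
    print('\n')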
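The same per-category statistics can also be produced without explicit loops by grouping on the categorical column. The sketch below uses a tiny hypothetical dataset (column names and values are assumptions) and returns one summary row per category.

```python
import pandas as pd

# Hypothetical data standing in for cars-prices_final.csv
df = pd.DataFrame({
    "fueltype": ["gas", "gas", "diesel", "gas", "diesel"],
    "price": [6000, 7000, 10000, 15000, 25000],
})

# One row per category with count, mean, std, min, quartiles and max
price_summary = df.groupby("fueltype")["price"].describe()
print(price_summary)
```

The loop version prints one block per category, while this form yields a single table that is easier to sort or compare across categories.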
