Lab1 for module3- Python code (1)

The document outlines an exploratory data analysis of the US Cars Dataset, which includes information on 28 car brands sold in the US. It details the process of data cleaning, visualization, and analysis using libraries like Pandas, Matplotlib, and Seaborn. Key findings include popular car models, price distributions, and relationships between car age and price.

Uploaded by

Sean Jing


Lab3 by Dr. Hoora Fakhrmoosavy

Exploratory Data Analysis of the US Cars Dataset

The US Cars Dataset contains scraped data from the online North
American Car auction. It contains information about 28 car brands for sale
in the US. In this post, we will perform exploratory data analysis on the US
Cars Dataset.

First, let’s import the Pandas and NumPy libraries (NumPy is used later for sorting, missing values, and logarithms):

import numpy as np
import pandas as pd

Next, let’s remove the default column display limit for Pandas data frames:

pd.set_option('display.max_columns', None)

Now, let’s read the data into a data frame:

df = pd.read_csv("USA_cars_datasets.csv")

Let’s print the list of columns in the data:

print(list(df.columns))

Let’s find the unique values in brand and year columns:

df.brand.unique()

years = df.year.unique()

Let's sort the years:

np.sort(years)

We can also take a look at the number of rows in the data:

print("Number of rows: ", len(df))

Next, let’s print the first five rows of data:

print(df.head())
df.describe()

Now, let’s look at the brands of white cars:

df_d1 = df[df['color'] =='white']
print(set(df_d1['brand']))

We can also look at the most common brands for white cars:

from collections import Counter
print(dict(Counter(df_d1['brand']).most_common(5)))
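The same kind of top-five count can also be obtained with Pandas alone. The sketch below is only an illustration: the data frame `toy` and its rows are made up and are not taken from USA_cars_datasets.csv. It relies on `Series.value_counts()`, which returns counts sorted from largest to smallest.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "brand": ["ford", "ford", "nissan", "ford", "dodge", "nissan"],
    "color": ["white", "white", "white", "black", "white", "white"],
})

# Keep only the white cars, as in the step above
white = toy[toy["color"] == "white"]

# value_counts() sorts by count (largest first); head(5) keeps at most five brands
top_white = white["brand"].value_counts().head(5)
print(top_white.to_dict())
```

Compared with `Counter`, this keeps the result as a Pandas Series, which can be plotted or joined with other columns directly.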

Dealing with missing values:

# Fill missing values with the column mean and assign the result back to the column
df['mileage'] = df['mileage'].fillna(df['mileage'].mean())

df['year'] = df['year'].fillna(df['year'].mean())

df.info()
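To see what mean imputation does, here is a minimal sketch on a made-up data frame (the name `toy` and its values are assumptions for illustration only). Each missing entry is replaced by the mean of the non-missing values in the same column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "mileage": [10000.0, np.nan, 30000.0],
    "year": [2015.0, 2017.0, np.nan],
})

# mean() skips NaN by default, so each gap is filled with the mean of the known values
toy["mileage"] = toy["mileage"].fillna(toy["mileage"].mean())
toy["year"] = toy["year"].fillna(toy["year"].mean())

# Count the remaining missing values per column
print(toy.isna().sum())
```

Note that the mean of a year column can be fractional, so for a column like year the median or the most frequent value may be a more natural fill value; which one fits best depends on the data.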

Let's begin by importing matplotlib.pyplot and seaborn.

import seaborn as sns

import matplotlib

import matplotlib.pyplot as plt

%matplotlib inline

sns.set_style('darkgrid')

matplotlib.rcParams['font.size'] = 14

matplotlib.rcParams['figure.figsize'] = (8, 6)

matplotlib.rcParams['figure.facecolor'] = '#00000000'

Let’s find popular models:

import plotly.express as px

models_df = df.dropna(subset = [ 'model'])

fig = px.treemap(models_df, path=['model'], title='Most Popular Models')

fig.show()

Relationship between Car's Release Year and Price

A better way to study this relationship is to consider the age of the car rather than the year in which it was released.
Let's add another column to the data frame for the age of the car. The age is the current year, obtained from the datetime
library in Python, minus the release year.

import datetime

df['age'] = datetime.datetime.now().year - df['year']

sns.scatterplot(x=df.age, y=df.price, s=40);
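Because the age is computed from the current year, the values change depending on when the code is run. The following sketch shows one possible way to make the calculation reproducible by passing the reference year explicitly. The helper `add_age`, its parameter `reference_year`, and the toy rows are hypothetical and are not part of the lab.

```python
import pandas as pd

def add_age(frame, reference_year):
    """Return a copy of frame with an 'age' column equal to reference_year - year."""
    out = frame.copy()
    out["age"] = reference_year - out["year"]
    return out

# Hypothetical toy data
toy = pd.DataFrame({"year": [2010, 2018, 2020], "price": [4000, 15000, 22000]})

aged = add_age(toy, reference_year=2020)
print(aged)
```

Passing `datetime.datetime.now().year` as `reference_year` gives the same behaviour as the code above.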

Adding Log Price Column:

df['Log Price'] = np.log(df['price'])

sns.scatterplot(x=df.age, y=df['Log Price'], s=40);

On the logarithmic scale, the visualization is much clearer than before, and the inverse relationship between age and price is
more obvious.
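One caveat with a log transform is that the logarithm of 0 is negative infinity. If the price column contains zeros (an assumption here, not something stated in the lab), one option is to keep only the strictly positive prices before taking the log. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that includes a zero price
toy = pd.DataFrame({"price": [0, 100, 10000]})

# Keep only strictly positive prices so that the log is finite
positive = toy[toy["price"] > 0].copy()
positive["Log Price"] = np.log(positive["price"])
print(positive)
```

Dropping rows is only one choice; whether zero-priced listings should be removed or treated separately depends on what they mean in the data.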

Popularity based on Model:

models= df.groupby('model')['model'].count()

models = pd.DataFrame(models)

models.columns = ['models Counts']

models.sort_values(by=['models Counts'], inplace=True, ascending=False)

models = models.head(5)

models.plot.bar();

plt.title('Preferred models')

plt.xlabel('models')

plt.ylabel('No. of Cars');

Finding the top brands in our dataset:

topbrands= df.groupby('brand')['brand'].count()

topbrands = pd.DataFrame(topbrands)

topbrands.columns = ['Top Brands']

topbrands.sort_values(by=['Top Brands'], inplace=True, ascending=False)

topbrands = topbrands.head(10)

topbrands.plot.bar();

plt.title('Famous Brands')

plt.xlabel('Brands')

plt.ylabel('No. of Cars');

Most Expensive Car Brands:

expensive= df.groupby('brand')['price'].mean()

expensive = pd.DataFrame(expensive)

expensive.columns = ['Average Prices']

expensive.sort_values(by=['Average Prices'], inplace=True, ascending=False)

expensive = expensive.head(10)

expensive.plot.bar();

plt.title('Expensive Brands')

plt.xlabel('Car Brands')

plt.ylabel('Average Price');
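The group, average, sort, and truncate steps can also be written more compactly. The sketch below uses a made-up data frame (the rows are illustrative only) and `Series.nlargest`, which returns the n largest values in descending order.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "brand": ["ford", "ford", "bmw", "bmw", "nissan"],
    "price": [10000, 20000, 40000, 50000, 12000],
})

# Mean price per brand, then the two brands with the highest mean
avg_price = toy.groupby("brand")["price"].mean()
top2 = avg_price.nlargest(2)
print(top2)
```

The result is a Series indexed by brand, so `top2.plot.bar()` would draw the same kind of chart as above.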
Let’s look at Distribution of Price:

cars_price_df = df[(df.price > 1000) & (df.price < 5000)]

plt.title('Distribution of Price')

plt.hist(cars_price_df.price, bins=np.arange(1000, 5001, 500));

plt.xlabel('Price')

plt.ylabel('No. of Samples')

plt.xlim(1000, 5000);
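When building bin edges with np.arange, keep in mind that the stop value is excluded, so np.arange(1000, 5000, 500) ends at 4500 and values above 4500 fall outside every bin. The sketch below illustrates this with np.histogram on a few made-up prices.

```python
import numpy as np

# np.arange excludes the stop value, so the last edge here is 4500
edges_short = np.arange(1000, 5000, 500)

# Moving the stop just past 5000 adds the final edge at 5000
edges_full = np.arange(1000, 5001, 500)

# Hypothetical prices, two of which lie between 4500 and 5000
prices = np.array([1200, 4400, 4700, 4990])

counts_short, _ = np.histogram(prices, bins=edges_short)
counts_full, _ = np.histogram(prices, bins=edges_full)

# Number of prices that landed in some bin for each set of edges
print(counts_short.sum(), counts_full.sum())  # prints: 2 4
```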

Let’s plot the histogram of price together with a density curve (sns.histplot replaces the deprecated sns.distplot):

plt.figure(figsize=(10,6))

sns.histplot(df['price'], kde=True).set_title('Distribution of Car Prices')

Finally, let’s create a boxplot of ‘price’ in the 5 most commonly occurring ‘brand’ categories:

import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

def get_boxplot_of_categories(data_frame, categorical_column, numerical_column, limit):
    # Collect the `limit` most frequent categories
    keys = [key for key, _ in Counter(data_frame[categorical_column].values).most_common(limit)]
    print(keys)

    # Keep only the rows that belong to those categories
    df_new = data_frame[data_frame[categorical_column].isin(keys)]

    sns.set()
    sns.boxplot(x=df_new[categorical_column], y=df_new[numerical_column])
    plt.show()

get_boxplot_of_categories(df, 'brand', 'price', 5)

We can also draw the boxplot for all brands:

plt.figure(figsize=(12,8))

sns.set(style='darkgrid')
sns.boxplot(x='brand', y='price', data=df).set_title("Price Distribution of Different Brands")

Please write code to answer the following questions.

Question 1: What is the price distribution of the top 3 brands in the dataset?

Question 2: What is the average price of Nissan, BMW, and Ford cars, or of the 3 most common car brands in the dataset?

Question 3: Among release years after 2000, which release years have the cheapest cars on average?

Question 4: Which brand's cars have covered the most mileage on the road?

Question 5: Which state has the highest number of registered Mercedes cars?

