Lab1 for module3- Python code (1)

The document outlines an exploratory data analysis of the US Cars Dataset, which includes information on 28 car brands sold in the US. It details the process of data cleaning, visualization, and analysis using libraries like Pandas, Matplotlib, and Seaborn. Key findings include popular car models, price distributions, and relationships between car age and price.

Uploaded by

Sean Jing

Lab3 by Dr. Hoora Fakhrmoosavy

Exploratory Data Analysis of the US Cars Dataset

The US Cars Dataset contains scraped data from the online North
American Car auction. It contains information about 28 car brands for sale
in the US. In this post, we will perform exploratory data analysis on the US
Cars Dataset.

First, let’s import the Pandas and NumPy libraries (NumPy is used later for sorting and for the log transform):


import pandas as pd
import numpy as np

Next, let’s remove the default display limits for Pandas data frames:
pd.set_option('display.max_columns', None)

Now, let’s read the data into a data frame:


df = pd.read_csv("USA_cars_datasets.csv")

Let’s print the list of columns in the data:


print(list(df.columns))

Let’s find the unique values in brand and year columns:


df.brand.unique()

years = df.year.unique()

Let's sort the years:


np.sort(years)

We can also take a look at the number of rows in the data:


print("Number of rows: ", len(df))

Next, let’s print the first five rows of data, followed by summary statistics for the numeric columns:


print(df.head())
df.describe()

Now, let’s look at the brands of white cars:


df_d1 = df[df['color'] =='white']
print(set(df_d1['brand']))

We can also look at the most common brands for white cars:
from collections import Counter
print(dict(Counter(df_d1['brand']).most_common(5)))
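
The same ranking can also be obtained with Pandas alone. The following is a minimal sketch on a made-up frame (the name `toy` and its values are assumptions, not rows from the real dataset), using `value_counts` in place of `Counter`:

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({
    "brand": ["ford", "ford", "dodge", "nissan", "ford", "dodge"],
    "color": ["white", "white", "white", "black", "white", "black"],
})

# Keep only white cars, then count rows per brand (sorted descending).
white_counts = toy.loc[toy["color"] == "white", "brand"].value_counts()

# Take at most the five most common brands and convert to a plain dict.
top_white = white_counts.head(5).to_dict()
print(top_white)
```

On the real data, `toy` would simply be replaced by `df`.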

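Dealing with missing values (here we fill the gaps with the column mean):


# Replace missing mileage and year values with the mean of each column
df['mileage'] = df['mileage'].fillna(df['mileage'].mean())

df['year'] = df['year'].fillna(df['year'].mean())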

df.info()
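
One way to check that mean imputation did what we expect is to count the missing values before and after. This is a small sketch on a made-up three-row frame (the name `toy` and its values are assumptions), using `fillna`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value.
toy = pd.DataFrame({"mileage": [10000.0, np.nan, 30000.0]})

n_missing_before = int(toy["mileage"].isna().sum())

# mean() skips NaN by default, so the fill value is (10000 + 30000) / 2.
toy["mileage"] = toy["mileage"].fillna(toy["mileage"].mean())

n_missing_after = int(toy["mileage"].isna().sum())
print(n_missing_before, n_missing_after, toy["mileage"].tolist())
```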

Let's begin by importing matplotlib.pyplot and seaborn.


import seaborn as sns

import matplotlib

import matplotlib.pyplot as plt

%matplotlib inline

sns.set_style('darkgrid')

matplotlib.rcParams['font.size'] = 14

matplotlib.rcParams['figure.figsize'] = (8, 6)

matplotlib.rcParams['figure.facecolor'] = '#00000000'

Let’s find popular models:
import plotly.express as px

models_df = df.dropna(subset = [ 'model'])

fig = px.treemap(models_df, path=['model'], title='Most Popular Models')

fig.show()

Relationship between Car's Release Year and Price

A better way to study this relationship is to consider the age of the car rather than the year it was released.
Let's add another column to the dataframe for the age of the car. The age is calculated with the help of
Python's datetime library.

import datetime

df['age'] = datetime.datetime.now().year - df['year']

sns.scatterplot(x=df.age, y=df.price, s=40);
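
Because the age is computed from the current year, its values change depending on when the notebook is run. If reproducible numbers matter, one option is to pin the reference year. A minimal sketch, where `toy` and `reference_year` are hypothetical names and the chosen year is only an assumption:

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({"year": [2008, 2015, 2020]})

# Assumption: a fixed reference year instead of datetime.now().year.
reference_year = 2020

# Age in years relative to the fixed reference year.
toy["age"] = reference_year - toy["year"]
print(toy)
```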


Adding Log Price Column:

df['Log Price'] = df['price'].map(lambda p: np.log(p))

sns.scatterplot(x=df.age, y=df['Log Price'], s=40);

On the logarithmic scale, the visualization becomes much clearer than before and the inverse relationship
between age and price is more obvious.
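
One caveat: the logarithm of zero is not finite, so any row with a price of 0 would map to negative infinity. The sketch below shows one possible way to handle this, keeping only positive prices before taking the log. The series `prices` and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical prices, including a zero entry.
prices = pd.Series([0, 1500, 25000, 84900])

# Keep strictly positive prices so that np.log stays finite.
positive = prices[prices > 0]
log_prices = np.log(positive)
print(log_prices)
```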

Popularity based on Model:

models= df.groupby('model')['model'].count()

models = pd.DataFrame(models)

models.columns = ['models Counts']

models.sort_values(by=['models Counts'], inplace=True, ascending=False)

models = models.head(5)

models.plot.bar();

plt.title('Preferred models')

plt.xlabel('models')

plt.ylabel('No. of Cars');

Finding Top brands in our database:

topbrands= df.groupby('brand')['brand'].count()

topbrands = pd.DataFrame(topbrands)

topbrands.columns = ['Top Brands']

topbrands.sort_values(by=['Top Brands'], inplace=True, ascending=False)

topbrands = topbrands.head(10)

topbrands.plot.bar();

plt.title('Famous Brands')

plt.xlabel('Brands')

plt.ylabel('No. of Cars');

Most Expensive Car Brands:

expensive= df.groupby('brand')['price'].mean()

expensive = pd.DataFrame(expensive)

expensive.columns = ['Average Prices']

expensive.sort_values(by=['Average Prices'], inplace=True, ascending=False)

expensive = expensive.head(10)

expensive.plot.bar();

plt.title('Expensive Brands')

plt.xlabel('Car Brands')

plt.ylabel('Average Price');
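
The group, sort and head steps above can also be written as one chain. This is a sketch on made-up data (the name `toy` and its prices are assumptions), using `nlargest` to pick the brands with the highest mean price:

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({
    "brand": ["ford", "ford", "bmw", "bmw", "kia"],
    "price": [10000, 20000, 30000, 50000, 8000],
})

# Mean price per brand, then the two brands with the highest mean.
avg_price = toy.groupby("brand")["price"].mean().nlargest(2)
print(avg_price)
```

Note that with few cars per brand, a single expensive listing can dominate the mean.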
Let’s look at the distribution of prices between 1,000 and 5,000:

cars_price_df = df[(df.price > 1000) & (df.price < 5000)]

plt.title('Distribution of Price')

# np.arange excludes its stop value, so use 5001 to include the 4500-5000 bin
plt.hist(cars_price_df.price, bins=np.arange(1000, 5001, 500));

plt.xlabel('Price')

plt.ylabel('No. of Samples')

plt.xlim(1000, 5000);
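
When building bin edges with `np.arange`, keep in mind that the stop value is excluded, so the stop has to lie just past the last edge you want. The sketch below uses `np.histogram` on a few made-up prices (the array `prices` is an assumption) to show the resulting edges and counts without drawing a plot:

```python
import numpy as np

# Hypothetical prices inside the 1000-5000 range.
prices = np.array([1200, 1400, 2600, 4900])

# Edges 1000, 1500, ..., 5000: nine edges define eight bins.
edges = np.arange(1000, 5001, 500)

# np.histogram returns the count per bin; the last bin includes its right edge.
counts, _ = np.histogram(prices, bins=edges)
print(edges)
print(counts)
```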

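Let’s plot the distribution of price across the full range:


plt.figure(figsize=(10,6))
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True replaces it
sns.histplot(df['price'], kde=True).set_title('Distribution of Car Prices')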

Finally, let’s create a boxplot of ‘price’ in the 5 most commonly occurring ‘brand’ categories:
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

def get_boxplot_of_categories(data_frame, categorical_column, numerical_column, limit):
    # Collect the `limit` most frequent categories in the chosen column
    counts = Counter(data_frame[categorical_column].values)
    keys = [key for key, _ in counts.most_common(limit)]
    print(keys)

    # Use the frame passed in, not the global df
    df_new = data_frame[data_frame[categorical_column].isin(keys)]

    sns.set()
    sns.boxplot(x=df_new[categorical_column], y=df_new[numerical_column])
    plt.show()

get_boxplot_of_categories(df, 'brand', 'price', 5)
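
A boxplot draws the median and the quartiles of each group. To read these numbers directly instead of estimating them from the figure, they can be computed with a grouped quantile. A minimal sketch on made-up data (the name `toy` and its prices are assumptions):

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({
    "brand": ["ford"] * 5 + ["dodge"] * 3,
    "price": [10000, 20000, 30000, 40000, 50000, 5000, 15000, 25000],
})

# Lower quartile, median and upper quartile of price for each brand.
quartiles = (
    toy.groupby("brand")["price"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()  # one row per brand, one column per quantile
)
print(quartiles)
```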


We can also draw the boxplot for all brands:
plt.figure(figsize=(12,8))

sns.set(style='darkgrid')
sns.boxplot(x='brand', y='price', data=df).set_title("Price Distribution of Different Brands")

Please write code to answer these questions.

Question 1: What is the price distribution of the top 3 brands in the database?

Question 2: What is the average price of Nissan, BMW and Ford cars, or of the 3 most famous car brands in the database?

Question 3: Among release years after 2000, which years have the cheapest cars on average in the database?

Question 4: Which brand's cars have covered the most mileage on the roads?

Question 5: Which state has the most registered Mercedes cars?
