
Kaggle Machine Learning


Using Pandas to Get Familiar With Your Data

The first thing you'll want to do is familiarize yourself with the data. You'll use the Pandas library
for this. Pandas is the primary tool that modern data scientists use for exploring and manipulating
data. Most people abbreviate pandas in their code as pd. We do this with the command:

In [1]:

import pandas as pd

The most important part of the Pandas library is the DataFrame. A DataFrame holds the type of data
you might think of as a table. This is similar to a sheet in Excel, or a table in a SQL database. The
Pandas DataFrame has powerful methods for most things you'll want to do with this type of data.
Let's start by looking at a basic data overview with our example data from Melbourne and the data
you'll be working with from Iowa.
We load and explore the data with the following:
In [2]:

# save filepath to variable for easier access

melbourne_file_path = 'melb_data.csv'

# read the data and store data in DataFrame titled melbourne_data

melbourne_data = pd.read_csv(melbourne_file_path)

# print a summary of the data in Melbourne data

print(melbourne_data.describe())

         Unnamed: 0         Rooms         Price      Distance      Postcode  \
count  18396.000000  18396.000000  1.839600e+04  18395.000000  18395.000000
mean   11826.787073      2.935040  1.056697e+06     10.389986   3107.140147
std     6800.710448      0.958202  6.419217e+05      6.009050     95.000995
min        1.000000      1.000000  8.500000e+04      0.000000   3000.000000
25%     5936.750000      2.000000  6.330000e+05      6.300000   3046.000000
50%    11820.500000      3.000000  8.800000e+05      9.700000   3085.000000
75%    17734.250000      3.000000  1.302000e+06     13.300000   3149.000000
max    23546.000000     12.000000  9.000000e+06     48.100000   3978.000000

           Bedroom2      Bathroom           Car       Landsize  BuildingArea  \
count  14927.000000  14925.000000  14820.000000   13603.000000   7762.000000
mean       2.913043      1.538492      1.615520     558.116371    151.220219
std        0.964641      0.689311      0.955916    3987.326586    519.188596
min        0.000000      0.000000      0.000000       0.000000      0.000000
25%        2.000000      1.000000      1.000000     176.500000     93.000000
50%        3.000000      1.000000      2.000000     440.000000    126.000000
75%        3.000000      2.000000      2.000000     651.000000    174.000000
max       20.000000      8.000000     10.000000  433014.000000  44515.000000

          YearBuilt     Lattitude    Longtitude  Propertycount
count   8958.000000  15064.000000  15064.000000   18395.000000
mean    1965.879996    -37.809849    144.996338    7517.975265
std       37.013261      0.081152      0.106375    4488.416599
min     1196.000000    -38.182550    144.431810     249.000000
25%     1950.000000    -37.858100    144.931193    4294.000000
50%     1970.000000    -37.803625    145.000920    6567.000000
75%     2000.000000    -37.756270    145.060000   10331.000000
max     2018.000000    -37.408530    145.526350   21650.000000
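
One detail worth noticing in the summary above is that the count row differs between columns. describe() counts only non-missing values, so a count below the number of rows signals missing data in that column. A minimal sketch on a small made-up table (the values and the name toy_data are invented for illustration and are not taken from the Melbourne file):

```python
import numpy as np
import pandas as pd

# A tiny made-up table; np.nan marks a missing value.
toy_data = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [79.0, np.nan, 142.0, np.nan],
})

summary = toy_data.describe()

# "count" reports non-missing values only, so it can differ by column.
print(summary.loc["count"])

# Number of missing values per column, computed directly.
missing_per_column = toy_data.isnull().sum()
print(missing_per_column)
```

Here Rooms has a count of 4 and BuildingArea a count of 2, which matches the two missing entries reported by isnull().sum().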

Selecting and Filtering Data


Your dataset has too many variables to wrap your head around, or even to print out nicely. How can
you pare down this overwhelming amount of data to something you can understand?
To show you the techniques, we'll start by picking a few variables using our intuition. Later tutorials
will show you statistical techniques to automatically prioritize variables.
Before we can choose variables/columns, it is helpful to see a list of all columns in the dataset. That is done with
the columns property of the DataFrame (the bottom line of code below).

In [1]:

import pandas as pd

melbourne_file_path = 'melb_data.csv'

melbourne_data = pd.read_csv(melbourne_file_path)

print(melbourne_data.columns)

Index(['Unnamed: 0', 'Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method',
       'SellerG', 'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom',
       'Car', 'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea',
       'Lattitude', 'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

Selecting a Single Column
You can pull out any variable (or column) with dot-notation. This single column is stored in a
Series, which is broadly like a DataFrame with only a single column of data. Here's an example:
In [2]:

# store the series of prices separately as melbourne_price_data

melbourne_price_data = melbourne_data.Price

# the head method returns the first few rows of data

print(melbourne_price_data.head())

0 1480000.0
1 1035000.0
2 1465000.0
3 850000.0
4 1600000.0
Name: Price, dtype: float64
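
Dot-notation only works when the column name is a valid Python identifier. Names such as 'Unnamed: 0' in this dataset, or '1stFlrSF' in the Iowa data, cannot be written after a dot, so bracket notation with a string is the safer general form. A small sketch with made-up values (toy_data is a hypothetical stand-in, not the real file):

```python
import pandas as pd

# Made-up rows; only the column names mirror the tutorial's data.
toy_data = pd.DataFrame({
    "Price": [1480000.0, 1035000.0],
    "1stFlrSF": [900, 1200],
})

# Both forms return the same Series when the name is a valid identifier.
price_by_dot = toy_data.Price
price_by_bracket = toy_data["Price"]
print(price_by_dot.equals(price_by_bracket))

# A name starting with a digit can only be reached with brackets.
first_floor = toy_data["1stFlrSF"]
print(type(first_floor).__name__)
```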

Selecting Multiple Columns


You can select multiple columns from a DataFrame by providing a list of column names inside
brackets. Remember, each item in that list should be a string (with quotes).
In [3]:

columns_of_interest = ['Landsize', 'BuildingArea']

two_columns_of_data = melbourne_data[columns_of_interest]

We can verify that we got the columns we need with the describe command.

two_columns_of_data.describe()

            Landsize  BuildingArea
count   13603.000000   7762.000000
mean      558.116371    151.220219
std      3987.326586    519.188596
min         0.000000      0.000000
25%       176.500000     93.000000
50%       440.000000    126.000000
75%       651.000000    174.000000
max    433014.000000  44515.000000
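
The brackets matter here: passing a list returns a DataFrame even when the list holds a single name, while passing a bare string returns a Series. A short sketch on made-up values (toy_data is a hypothetical stand-in for the Melbourne data):

```python
import pandas as pd

# Made-up rows for illustration only.
toy_data = pd.DataFrame({
    "Landsize": [156.0, 134.0, 120.0],
    "BuildingArea": [79.0, 150.0, 142.0],
    "Rooms": [2, 3, 4],
})

# A list of names (note the double brackets) gives a DataFrame.
as_frame = toy_data[["Landsize"]]

# A single string gives a Series.
as_series = toy_data["Landsize"]

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)
```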

Choosing the Prediction Target


You have the code to load your data, and you know how to index it. You are ready to choose which
column you want to predict. This column is called the prediction target. There is a convention that
the prediction target is referred to as y. Here is an example doing that with the example data.

y = melbourne_data.Price

Choosing Predictors
Next we select the predictors. Sometimes, you will want to use all of the variables except the
target.
It's possible to model with non-numeric variables, but we'll start with a narrower set of numeric
variables. In the example data, the predictors will be chosen as:
melbourne_predictors = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
'YearBuilt', 'Lattitude', 'Longtitude']

By convention, this data is called X.

X = melbourne_data[melbourne_predictors]
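
Several of these predictor columns have missing values: in the describe() output earlier, BuildingArea and YearBuilt have far lower counts than Rooms. Whether a decision tree accepts missing values depends on the scikit-learn version (older releases reject NaN in X), so one common option is to drop incomplete rows before splitting the data into X and y. This is an assumption, since the text does not show how missing values were handled. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up rows; np.nan marks a missing value.
toy_data = pd.DataFrame({
    "Rooms": [2, 2, 3, 3, 4],
    "BuildingArea": [np.nan, 79.0, 150.0, np.nan, 142.0],
    "Price": [1480000.0, 1035000.0, 1465000.0, 850000.0, 1600000.0],
})
toy_predictors = ["Rooms", "BuildingArea"]

# Keep only rows where every predictor and the target are present.
complete_rows = toy_data.dropna(subset=toy_predictors + ["Price"])

# Take y and X from the same filtered table so their rows stay aligned.
y = complete_rows["Price"]
X = complete_rows[toy_predictors]

# The original row labels are kept, so the index may skip numbers.
print(X.index.tolist())
```

Dropping rows is the simplest choice but discards data; filling in missing values is an alternative that this sketch does not cover.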

Building Your Model


Scikit-learn is easily the most popular library for modeling the types of data typically stored in
DataFrames.
The steps to building and using a model are:
• Define: What type of model will it be? A decision tree? Some other type of model? Some
other parameters of the model type are specified too.
• Fit: Capture patterns from provided data. This is the heart of modeling.
• Predict: Just what it sounds like.
• Evaluate: Determine how accurate the model's predictions are.

from sklearn.tree import DecisionTreeRegressor

# Define model

melbourne_model = DecisionTreeRegressor()

# Fit model

melbourne_model.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

print("Making predictions for the following 5 houses:")

print(X.head())

print("The predictions are")

print(melbourne_model.predict(X.head()))

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  BuildingArea  YearBuilt  Lattitude  Longtitude
1      2       1.0     156.0          79.0     1900.0   -37.8079    144.9934
2      3       2.0     134.0         150.0     1900.0   -37.8093    144.9944
4      4       1.0     120.0         142.0     2014.0   -37.8072    144.9941
6      3       2.0     245.0         210.0     1910.0   -37.8024    144.9993
7      2       1.0     256.0         107.0     1890.0   -37.8060    144.9954
The predictions are
[ 1035000. 1465000. 1600000. 1876000. 1636000.]
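
The list of steps above ends with Evaluate, which the example does not reach. A minimal sketch of that step, assuming mean absolute error as the metric (the text does not name one) and made-up data in place of the Melbourne file; toy_model and in_sample_mae are illustrative names:

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Made-up predictors and prices; every row is distinct.
X = pd.DataFrame({
    "Rooms": [2, 3, 4, 3, 2],
    "Landsize": [156.0, 134.0, 120.0, 245.0, 256.0],
})
y = pd.Series([1035000.0, 1465000.0, 1600000.0, 1876000.0, 1636000.0])

# Define and fit, as in the steps above.
toy_model = DecisionTreeRegressor(random_state=0)
toy_model.fit(X, y)

# Predict on the same rows the model was fit on.
predictions = toy_model.predict(X)

# Evaluate: average absolute gap between true and predicted prices.
in_sample_mae = mean_absolute_error(y, predictions)
print(in_sample_mae)
```

Because the model is scored on the same rows it was fit on, this number is optimistic. An unrestricted decision tree can memorize distinct training rows, so a low error here says little about accuracy on houses the model has not seen.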

Implementation
• Select the target variable you want to predict. You can go back to the list of columns
from your earlier commands to recall what it's called (hint: you've already worked with
this variable). Save this to a new variable called y.
• Create a list of the names of the predictors we will use in the initial model. Use just the
following columns in the list (you can copy and paste the whole list to save some typing,
though you'll still need to add quotes):
◦ LotArea
◦ YearBuilt
◦ 1stFlrSF
◦ 2ndFlrSF
◦ FullBath
◦ BedroomAbvGr
◦ TotRmsAbvGrd
• Using the list of variable names you just created, select a new DataFrame of the
predictors data. Save this with the variable name X.
• Create a DecisionTreeRegressor model and save it to a variable (with a name like
my_model or iowa_model). Ensure you've done the relevant import so you can run this
command.
• Fit the model you have created using the data in X and the target data you saved above.
• Make a few predictions with the model's predict command and print out the predictions.
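
The steps above might be sketched as follows. This is one possible shape rather than a reference solution: the rows are made up instead of being loaded from the Iowa file, and SalePrice as the target column name is an assumption, since the text only hints at it.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Made-up rows standing in for the Iowa data; "SalePrice" is an assumed name.
iowa_data = pd.DataFrame({
    "LotArea": [8000, 9500, 11000, 7200],
    "YearBuilt": [1995, 1976, 2001, 1950],
    "1stFlrSF": [900, 1200, 950, 800],
    "2ndFlrSF": [800, 0, 850, 0],
    "FullBath": [2, 2, 2, 1],
    "BedroomAbvGr": [3, 3, 3, 2],
    "TotRmsAbvGrd": [8, 6, 7, 5],
    "SalePrice": [200000.0, 180000.0, 220000.0, 120000.0],
})

# Target, using bracket notation.
y = iowa_data["SalePrice"]

# Predictor names as quoted strings, then the matching DataFrame.
iowa_predictors = ["LotArea", "YearBuilt", "1stFlrSF", "2ndFlrSF",
                   "FullBath", "BedroomAbvGr", "TotRmsAbvGrd"]
X = iowa_data[iowa_predictors]

# Define and fit the model, then predict for the first few rows.
iowa_model = DecisionTreeRegressor(random_state=0)
iowa_model.fit(X, y)
first_predictions = iowa_model.predict(X.head())
print(first_predictions)
```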
