Kaggle Machine Learning
The first thing you'll want to do is familiarize yourself with the data. You'll use the Pandas library
for this. Pandas is the primary tool that modern data scientists use for exploring and manipulating
data. Most people abbreviate pandas in their code as pd. We do this with the following command:
In [1]:
import pandas as pd
The most important part of the Pandas library is the DataFrame. A DataFrame holds the type of data
you might think of as a table. This is similar to a sheet in Excel, or a table in a SQL database. The
Pandas DataFrame has powerful methods for most things you'll want to do with this type of data.
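To make the idea concrete, here is a minimal sketch of building a small DataFrame by hand; the column names and values below are made up purely for illustration:
import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values
example_table = pd.DataFrame({
    'Rooms': [2, 3, 4],
    'Price': [850000.0, 1035000.0, 1465000.0],
})
print(example_table)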
Let's start with a basic overview of our example data from Melbourne; you'll then apply the same
steps to the Iowa data you'll be working with.
We load and explore the data with the following:
In [2]:
melbourne_file_path = 'melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
print(melbourne_data.describe())
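The describe output (not reproduced here) reports eight summary statistics for each numeric column: count, mean, std, min, the 25%, 50%, and 75% percentiles, and max.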
You can pull out any single variable (column) with dot notation; a single column is stored as a pandas Series. First print the column names to see what's available, then select the price column:
In [1]:
import pandas as pd
melbourne_file_path = 'melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
print(melbourne_data.columns)
melbourne_price_data = melbourne_data.Price
print(melbourne_price_data.head())
0 1480000.0
1 1035000.0
2 1465000.0
3 850000.0
4 1600000.0
Name: Price, dtype: float64
You can also select multiple columns by passing a list of column names, which gives you a new DataFrame:
columns_of_interest = ['Landsize', 'BuildingArea']
two_columns_of_data = melbourne_data[columns_of_interest]
We can verify that we got the columns we need with the describe command.
two_columns_of_data.describe()
            Landsize  BuildingArea
count   13603.000000   7762.000000
mean      558.116371    151.220219
std      3987.326586    519.188596
min         0.000000      0.000000
25%       176.500000     93.000000
50%       440.000000    126.000000
75%       651.000000    174.000000
max    433014.000000  44515.000000
The prediction target is the column we want to predict, which by convention is called y. For the Melbourne data, that is the sale price:
y = melbourne_data.Price
Choosing Predictors
Next we select the predictors. Sometimes you will want to use all of the variables except the
target.
It's possible to model with non-numeric variables, but we'll start with a narrower set of numeric
variables. In the example data, the predictors will be chosen as:
melbourne_predictors = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
'YearBuilt', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_predictors]
We build the model with the scikit-learn library, which is written as sklearn in code:
from sklearn.tree import DecisionTreeRegressor

# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(X, y)
print(X.head())
print(melbourne_model.predict(X.head()))
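Note that these predictions are made on the same rows the model was fit on, so they will track the actual prices closely; an unconstrained decision tree can essentially memorize its training data, and this is not yet a measure of how well the model will do on new houses.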
Implementation
Now apply the same process to the Iowa data. A sketch of these steps appears after this list.
• Select the target variable you want to predict. You can go back to the list of columns
from your earlier commands to recall what it's called (hint: you've already worked with
this variable). Save this to a new variable called y.
• Create a list of the names of the predictors we will use in the initial model. Use just the
following columns in the list (you can copy and paste the whole list to save some typing,
though you'll still need to add quotes):
◦ LotArea
◦ YearBuilt
◦ 1stFlrSF
◦ 2ndFlrSF
◦ FullBath
◦ BedroomAbvGr
◦ TotRmsAbvGrd
• Using the list of variable names you just created, select a new DataFrame of the
predictors data. Save this with the variable name X.
• Create a DecisionTreeRegressor model and save it to a variable (with a name like
my_model or iowa_model). Ensure you've done the relevant import so you can run this
command.
• Fit the model you have created using the data in X and the target data you saved above.
• Make a few predictions with the model's predict command and print out the predictions.
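For reference, here is a minimal sketch of these steps. It assumes the Iowa training data is in a file named 'train.csv' and that the target column is called SalePrice; confirm both against your own files and column listing before running it.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Assumed location of the Iowa data; adjust the path to your copy
iowa_file_path = 'train.csv'
iowa_data = pd.read_csv(iowa_file_path)

# Target to predict (column name assumed to be 'SalePrice';
# check print(iowa_data.columns) if unsure)
y = iowa_data.SalePrice

# Predictors listed in the exercise
iowa_predictors = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF',
                   'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = iowa_data[iowa_predictors]

# Define and fit the model
iowa_model = DecisionTreeRegressor()
iowa_model.fit(X, y)

# Make a few predictions on the first rows and print them
print(iowa_model.predict(X.head()))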