Kaggle Machine Learning

Uploaded by

Prathamesh Sawant

Using Pandas to Get Familiar With Your Data

The first thing you'll want to do is familiarize yourself with the data. You'll use the Pandas library
for this. Pandas is the primary tool that modern data scientists use for exploring and manipulating
data. Most people abbreviate pandas in their code as pd, which is done with the following import:

In [1]:

import pandas as pd

The most important part of the Pandas library is the DataFrame. A DataFrame holds the type of data
you might think of as a table. This is similar to a sheet in Excel, or a table in a SQL database. The
Pandas DataFrame has powerful methods for most things you'll want to do with this type of data.
Let's start by looking at a basic data overview with our example data from Melbourne and the data
you'll be working with from Iowa.
We load and explore the data with the following:
In [2]:

# save filepath to variable for easier access

melbourne_file_path = 'melb_data.csv'

# read the data and store data in DataFrame titled melbourne_data

melbourne_data = pd.read_csv(melbourne_file_path)

# print a summary of the data in Melbourne data

print(melbourne_data.describe())

         Unnamed: 0         Rooms         Price      Distance      Postcode  \
count  18396.000000  18396.000000  1.839600e+04  18395.000000  18395.000000
mean   11826.787073      2.935040  1.056697e+06     10.389986   3107.140147
std     6800.710448      0.958202  6.419217e+05      6.009050     95.000995
min        1.000000      1.000000  8.500000e+04      0.000000   3000.000000
25%     5936.750000      2.000000  6.330000e+05      6.300000   3046.000000
50%    11820.500000      3.000000  8.800000e+05      9.700000   3085.000000
75%    17734.250000      3.000000  1.302000e+06     13.300000   3149.000000
max    23546.000000     12.000000  9.000000e+06     48.100000   3978.000000

           Bedroom2      Bathroom           Car       Landsize  BuildingArea  \
count  14927.000000  14925.000000  14820.000000   13603.000000   7762.000000
mean       2.913043      1.538492      1.615520     558.116371    151.220219
std        0.964641      0.689311      0.955916    3987.326586    519.188596
min        0.000000      0.000000      0.000000       0.000000      0.000000
25%        2.000000      1.000000      1.000000     176.500000     93.000000
50%        3.000000      1.000000      2.000000     440.000000    126.000000
75%        3.000000      2.000000      2.000000     651.000000    174.000000
max       20.000000      8.000000     10.000000  433014.000000  44515.000000

         YearBuilt     Lattitude    Longtitude  Propertycount
count  8958.000000  15064.000000  15064.000000   18395.000000
mean   1965.879996    -37.809849    144.996338    7517.975265
std      37.013261      0.081152      0.106375    4488.416599
min    1196.000000    -38.182550    144.431810     249.000000
25%    1950.000000    -37.858100    144.931193    4294.000000
50%    1970.000000    -37.803625    145.000920    6567.000000
75%    2000.000000    -37.756270    145.060000   10331.000000
max    2018.000000    -37.408530    145.526350   21650.000000
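
The count row in the summary differs from column to column because describe() counts only non-missing values. BuildingArea, for instance, has 7762 values against 18396 rows. A minimal sketch of checking this directly, using a small made-up DataFrame (the column names mirror the Melbourne file, but the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are invented, not taken from melb_data.csv
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Price": [850000.0, 1035000.0, 1600000.0, 1465000.0],
    "BuildingArea": [79.0, np.nan, 142.0, np.nan],
})

# describe() reports the number of non-missing values per column as "count"
summary = toy.describe()
print(summary.loc["count"])

# isnull().sum() gives the complementary view: missing values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Comparing the two printouts shows that count plus the number of missing values equals the number of rows for each column.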

Selecting and Filtering Data

Your dataset has too many variables to wrap your head around, or even to print out nicely. How can
you pare down this overwhelming amount of data to something you can understand?
To show you the techniques, we'll start by picking a few variables using our intuition. Later tutorials
will show you statistical techniques to automatically prioritize variables.
Before we can choose variables (columns), it helps to see a list of all columns in the dataset. The
columns property of the DataFrame provides this (the last line of code below).

In [1]:

import pandas as pd

melbourne_file_path = 'melb_data.csv'

melbourne_data = pd.read_csv(melbourne_file_path)

print(melbourne_data.columns)

Index(['Unnamed: 0', 'Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method',

'SellerG', 'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom',
'Car', 'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea',
'Lattitude', 'Longtitude', 'Regionname', 'Propertycount'],
dtype='object')
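
Once the column names are known, it can help to see which of them hold numbers before picking any. A minimal sketch on a made-up DataFrame (the column names are borrowed from the output above; the values and suburb labels are invented):

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "Suburb": ["SuburbA", "SuburbB"],
    "Rooms": [2, 3],
    "Price": [1480000.0, 1035000.0],
})

# columns is an Index object; list() turns it into a plain Python list
all_columns = list(toy.columns)
print(all_columns)

# select_dtypes keeps only the columns whose dtype matches the filter
numeric_columns = list(toy.select_dtypes(include="number").columns)
print(numeric_columns)
```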

Selecting a Single Column

You can pull out any variable (or column) with dot-notation. This single column is stored in a
Series, which is broadly like a DataFrame with only a single column of data. Here's an example:
In [2]:

# store the series of prices separately as melbourne_price_data

melbourne_price_data = melbourne_data.Price

# the head method returns the first few rows of data

print(melbourne_price_data.head())

0 1480000.0
1 1035000.0
2 1465000.0
3 850000.0
4 1600000.0
Name: Price, dtype: float64
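
Dot notation only works when the column name is a valid Python attribute name, so a column such as 'Unnamed: 0' has to be selected with brackets instead. A minimal sketch on invented data comparing the two forms:

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration
toy = pd.DataFrame({
    "Price": [1480000.0, 1035000.0],
    "Unnamed: 0": [1, 2],
})

# Dot notation and bracket notation return the same Series
price_by_dot = toy.Price
price_by_bracket = toy["Price"]
print(price_by_dot.equals(price_by_bracket))

# A name containing a space or colon cannot be written as an attribute,
# so bracket notation is the only option here
first_column = toy["Unnamed: 0"]
print(type(first_column).__name__)
```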

Selecting Multiple Columns

You can select multiple columns from a DataFrame by providing a list of column names inside
brackets. Remember, each item in that list should be a string (with quotes).
In [3]:

columns_of_interest = ['Landsize', 'BuildingArea']

two_columns_of_data = melbourne_data[columns_of_interest]

We can verify that we got the columns we need with the describe command.

two_columns_of_data.describe()

            Landsize  BuildingArea
count   13603.000000   7762.000000
mean      558.116371    151.220219
std      3987.326586    519.188596
min         0.000000      0.000000
25%       176.500000     93.000000
50%       440.000000    126.000000
75%       651.000000    174.000000
max    433014.000000  44515.000000
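
The difference between one name in brackets and a list of names in brackets is easy to miss: the first returns a Series, the second a DataFrame. A minimal sketch on invented data:

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration
toy = pd.DataFrame({
    "Landsize": [156.0, 134.0, 120.0],
    "BuildingArea": [79.0, 150.0, 142.0],
    "Rooms": [2, 3, 4],
})

# A single name in brackets returns a Series
one_column = toy["Landsize"]

# A list of names returns a DataFrame, even if the list has only one item
one_column_frame = toy[["Landsize"]]
two_columns = toy[["Landsize", "BuildingArea"]]

print(type(one_column).__name__, type(one_column_frame).__name__)
print(two_columns.shape)
```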

Choosing the Prediction Target

You have the code to load your data, and you know how to index it. You are ready to choose which
column you want to predict. This column is called the prediction target. By convention, the
prediction target is referred to as y. Here is how to select it from the example data:

y = melbourne_data.Price

Choosing Predictors

Next we select the predictors. Sometimes, you will want to use all of the variables except the
target.
It's possible to model with non-numeric variables, but we'll start with a narrower set of numeric
variables. In the example data, the predictors will be chosen as:

melbourne_predictors = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
                        'YearBuilt', 'Lattitude', 'Longtitude']

By convention, this data is called X.

X = melbourne_data[melbourne_predictors]
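
Several of these predictors contain missing values (BuildingArea has 7762 values in the earlier summary), and many scikit-learn estimators reject missing input. The row labels 1, 2, 4, 6, 7 in the prediction output further down suggest that rows with missing values were dropped before y and X were built, although that step is not shown in this text. A minimal sketch of one way to do it, assuming that dropping incomplete rows is acceptable for your data (the toy DataFrame is invented):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are invented for illustration
toy = pd.DataFrame({
    "Rooms": [2, 2, 3, 3],
    "BuildingArea": [np.nan, 79.0, 150.0, np.nan],
    "Price": [1480000.0, 1035000.0, 1465000.0, 850000.0],
})

# Drop every row that has a missing value in any column
complete_rows = toy.dropna(axis=0)

# Build the target and predictors from the filtered data so they stay aligned
y = complete_rows.Price
X = complete_rows[["Rooms", "BuildingArea"]]

# The original row labels are kept, so gaps appear where rows were dropped
print(list(X.index))
```

Dropping rows discards data, so it is worth checking how many rows remain before fitting a model.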

Building Your Model

Scikit-learn is easily the most popular library for modeling the types of data typically stored in
DataFrames.
The steps to building and using a model are:
• Define: What type of model will it be? A decision tree? Some other type of model? Other
parameters of the model type are specified here too.
• Fit: Capture patterns from the provided data. This is the heart of modeling.
• Predict: Generate predictions from the fitted model.
• Evaluate: Determine how accurate the model's predictions are.

from sklearn.tree import DecisionTreeRegressor

# Define model

melbourne_model = DecisionTreeRegressor()

# Fit model

melbourne_model.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

print("Making predictions for the following 5 houses:")

print(X.head())

print("The predictions are")

print(melbourne_model.predict(X.head()))

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  BuildingArea  YearBuilt  Lattitude  Longtitude
1      2       1.0     156.0          79.0     1900.0   -37.8079    144.9934
2      3       2.0     134.0         150.0     1900.0   -37.8093    144.9944
4      4       1.0     120.0         142.0     2014.0   -37.8072    144.9941
6      3       2.0     245.0         210.0     1910.0   -37.8024    144.9993
7      2       1.0     256.0         107.0     1890.0   -37.8060    144.9954
The predictions are
[ 1035000. 1465000. 1600000. 1876000. 1636000.]
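
The Evaluate step from the list of steps is not shown in this text. One common measure is mean absolute error (MAE), the average absolute difference between predictions and actual values. A minimal sketch on invented data, using scikit-learn's mean_absolute_error; note that it scores the model on the same rows it was trained on, which tends to make the error look far smaller than it would be on new data:

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Invented toy data, not taken from melb_data.csv
X = pd.DataFrame({
    "Rooms": [2, 3, 4, 3, 2],
    "Landsize": [156.0, 134.0, 120.0, 245.0, 256.0],
})
y = pd.Series([1035000.0, 1465000.0, 1600000.0, 1876000.0, 1636000.0])

# random_state makes the fitted tree reproducible across runs
model = DecisionTreeRegressor(random_state=0)
model.fit(X, y)

predictions = model.predict(X)

# In-sample MAE: an unrestricted tree can memorize distinct training rows
in_sample_mae = mean_absolute_error(y, predictions)
print(in_sample_mae)
```

The same effect is visible in the output above: the predictions for rows 1, 2 and 4 equal the prices printed for those rows by head() earlier, because the model is predicting houses it was fitted on.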

Implementation

• Select the target variable you want to predict. You can go back to the list of columns
from your earlier commands to recall what it's called (hint: you've already worked with
this variable). Save this to a new variable called y.
• Create a list of the names of the predictors we will use in the initial model. Use just the
following columns in the list (you can copy and paste the whole list to save some typing,
though you'll still need to add quotes):
◦ LotArea
◦ YearBuilt
◦ 1stFlrSF
◦ 2ndFlrSF
◦ FullBath
◦ BedroomAbvGr
◦ TotRmsAbvGrd
• Using the list of variable names you just created, select a new DataFrame of the
predictors data. Save this with the variable name X.
• Create a DecisionTreeRegressor model and save it to a variable (with a name like
my_model or iowa_model). Ensure you've done the relevant import so you can run this
command.
• Fit the model you have created using the data in X and the target data you saved above.
• Make a few predictions with the model's predict command and print out the predictions.
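
The steps above can be sketched end to end as follows. This is only an outline under stated assumptions: the DataFrame is a tiny invented stand-in for the Iowa data, and the target column name SalePrice is an assumption that this text does not confirm, so check the columns output of your own data first.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Invented stand-in for the Iowa data; replace with your own pd.read_csv call.
# "SalePrice" is an assumed target name, not confirmed by this text.
iowa_data = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "YearBuilt": [2003, 1976, 2001, 1915],
    "1stFlrSF": [856, 1262, 920, 961],
    "2ndFlrSF": [854, 0, 866, 756],
    "FullBath": [2, 2, 2, 1],
    "BedroomAbvGr": [3, 3, 3, 3],
    "TotRmsAbvGrd": [8, 6, 6, 7],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Select the prediction target
y = iowa_data["SalePrice"]

# Names starting with a digit, such as 1stFlrSF, only work with brackets
iowa_predictors = ["LotArea", "YearBuilt", "1stFlrSF", "2ndFlrSF",
                   "FullBath", "BedroomAbvGr", "TotRmsAbvGrd"]
X = iowa_data[iowa_predictors]

# Define and fit the model, then predict for the first few rows
iowa_model = DecisionTreeRegressor(random_state=0)
iowa_model.fit(X, y)

predictions = iowa_model.predict(X.head())
print(predictions)
```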
