Python Regression Prediction
Online property companies offer house valuations using machine learning techniques. This report aims to predict house sale prices in King County, Washington State, USA, using Multiple Linear Regression (MLR). The dataset consists of historic data on houses sold between May 2014 and May 2015. We aim to predict sale prices in King County with an accuracy of at least 75 to 80% and to understand which factors are associated with higher property values ($650K and above).
The dataset consists of house prices from King County, an area in the US state of Washington that includes Seattle. It was obtained from Kaggle, where it was released under CC0: Public Domain. Unfortunately, the uploader has not indicated the original source of the data. Please find the citation and database description in the Glossary and Bibliography. The dataset consists of 21 variables and 21613 observations.
We start with a small exploratory data analysis. First, we import the libraries used in this project: pandas for reading data files and NumPy for numerical analysis.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
Matplotlib is a plotting library for Python that provides an object-oriented API for embedding plots into applications. The warnings module issues messages to alert the user to some condition in a program; here we silence them with filterwarnings('ignore'). To read the dataset, create a DataFrame variable and call pandas' read_csv to import the file from its local path on the computer.
house_df = pd.read_csv("housedata.csv")
housedata.csv is a CSV (comma-separated values) file, and house_df is the variable that holds the resulting DataFrame. To see the dimensions of the DataFrame, use its shape attribute.
print(house_df.shape)

To see only the column names, use the columns attribute. It returns the names of all columns in the data.
print(house_df.columns)

We differentiate between the feature columns and the target column. From the data it is clear that the target is the "price" column; we want to see how the price changes with the other factors. To check for null values in the data, use isna().sum(). Any nulls found can be filled with the mean, the median, or another suitable value.
#check if the data has any missing values
house_df.isna().sum()
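
As a minimal sketch of the imputation idea mentioned above, the snippet below fills a missing value with the column median. The toy frame and its values are hypothetical and only stand in for house_df, which may contain no missing values at all.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (an assumption for illustration, not the real data).
toy = pd.DataFrame({
    "price": [221900.0, 538000.0, np.nan, 604000.0],
    "bedrooms": [3, 3, 2, 4],
})

# Count missing values per column, same check as on house_df.
missing_before = toy.isna().sum()

# Fill the gap in the numeric column with that column's median.
toy["price"] = toy["price"].fillna(toy["price"].median())
missing_after = toy.isna().sum()

print(missing_before["price"], missing_after["price"])
```

The median is usually a safer default than the mean for skewed columns such as prices, because a few very expensive houses do not pull it upward.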

To check for outliers, we use a box plot together with the describe function. In a box plot, the bottom of the box is the 25th percentile, the top of the box is the 75th percentile, and the line inside the box is the median.
The describe method gives a statistical summary of the data: the count, mean, standard deviation, minimum, the 25th, 50th, and 75th percentiles, and the maximum. To spot outliers in this summary, compare the 75th percentile with the maximum: a large gap between them suggests extreme values.
#to check the statistical description of the data
house_df.describe().T
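
The comparison between the quartiles and the maximum can also be made explicit. The sketch below applies the common 1.5 x IQR rule, which is also what Matplotlib's box plot uses by default to decide which points to draw as outliers. The prices are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical prices; the last value is deliberately extreme.
prices = pd.Series([300000, 350000, 400000, 450000, 500000, 7700000],
                   name="price")

# Quartiles and interquartile range (the height of the box).
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print(outliers.tolist())
```

Whether flagged points should be dropped, capped, or kept depends on the problem; for house prices, very expensive homes may be genuine observations rather than errors.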
With box plots we visualize the outliers in every column of type int or float.
#checking outliers with Box Plot
for column in house_df:
    if house_df[column].dtype in ['int64', 'float64']:
        plt.figure()
        house_df.boxplot(column = [column])

To draw a box plot of a single column instead of looping, pass just that column name to the column argument. The black circles in the box plot are outliers.
house_df.boxplot(column = ['price'])

Our data has 21 columns, and we need to choose feature columns for our model. I chose 10 columns: the target, price, plus nine candidate features.
#choosing features for model
#.copy() gives an independent frame, so new columns can be added to it safely
house_feat_data = house_df[["price","date","bedrooms","bathrooms",
                            "sqft_living","floors","waterfront",
                            "view","condition","grade"]].copy()
The data has a date column, from which we need to extract the year and month. The dates are stored as strings, so we can use slicing to take the first four characters as the year and the next two as the month. After creating the new year and month columns, we can remove the date column, since the day matters little for house pricing. The selected columns each hold a set of distinct categories, so we use get_dummies to encode them as categorical features.
#extracting year and month from date
house_feat_data["year"] = house_df["date"].str[0:4]
house_feat_data["month"] = house_df["date"].str[4:6]
#removing date after this selection
house_feat_data = house_feat_data.drop(columns = ["date"])
#treating features as categorical
features = ["bedrooms","bathrooms","floors","waterfront","view","condition","grade", "year","month"]
house_en = pd.get_dummies(house_feat_data, columns = features)
print(house_en.columns)
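
To make the slicing and encoding steps easier to follow, here is a small sketch on a hypothetical three-row frame. The date strings are assumed to begin with the year and month in YYYYMM order, which is what the slicing above relies on; the values themselves are made up.

```python
import pandas as pd

# Hypothetical rows (an assumption); dates are assumed to start with YYYYMMDD.
toy = pd.DataFrame({
    "price": [221900, 538000, 180000],
    "date": ["20141013T000000", "20141209T000000", "20150225T000000"],
    "waterfront": [0, 0, 1],
})

# Slice the string: first four characters are the year, next two the month.
toy["year"] = toy["date"].str[0:4]
toy["month"] = toy["date"].str[4:6]
toy = toy.drop(columns=["date"])

# One new 0/1 column is created per distinct value of each listed column.
encoded = pd.get_dummies(toy, columns=["waterfront", "year", "month"])
print(encoded.columns.tolist())
```

Each category becomes its own indicator column, named after the original column and the value. With a linear model, it may be worth trying get_dummies with drop_first=True, since a full set of indicators per feature is redundant alongside an intercept.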

Now it's time to build a regression model on our data. For that, we use the scikit-learn library, both for the regression algorithm and for splitting the data into training and test sets.
#importing train_test_split
from sklearn.model_selection import train_test_split
#creating train and test set
train_house, test_house = train_test_split(house_en, test_size = 0.2)
#shape of the train and test set
train_house.shape, test_house.shape
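
Because the split is random, the scores will differ slightly from run to run. One way to make results repeatable is to pass random_state, as in the sketch below. The toy frame is a hypothetical stand-in for house_en, and the seed value 42 is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for house_en: 10 rows, one feature, one target.
toy = pd.DataFrame({
    "sqft_living": np.arange(10) * 100 + 1000,
    "price": np.arange(10) * 50000 + 200000,
})

# The same random_state yields the same rows in each split on every call.
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=42)

print(train_a.shape, test_a.shape)
print(test_a.index.equals(test_b.index))
```

Fixing the seed does not make the model better; it only makes comparisons between model variants fair, since they are all evaluated on the same held-out rows.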

With test_size = 0.2, 80% of the rows go to the training set and 20% to the test set. We then build the model on the training set and check how well it fits.
from sklearn.linear_model import LinearRegression
house_features = house_en.columns.drop("price")
target = ["price"]
#building model on train set
model = LinearRegression()
model.fit(train_house[house_features],train_house[target])
model.score(train_house[house_features],train_house[target])
#output: 0.6191940799329386
This is the R2 score, which tells us how well the fitted line matches the data. The higher the R2 score, the better the regression model fits. After predicting on the train and test sets, the RMS error is very large, which means we need to change our model to get better results.
from sklearn.metrics import mean_squared_error
#RMS error on the train set
train_predict = model.predict(train_house[house_features])
mean_squared_error(train_house[target], train_predict)**0.5
#output: 0.32567881525600667
#RMS error on the test set
test_predict = model.predict(test_house[house_features])
mean_squared_error(test_house[target], test_predict)**0.5
#output: 0.32780719274631726
The RMS error is high, so the model is not performing well; for better results we need to reduce it. As a next step, we can use the adjusted R2 score to evaluate the linear regression.
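
The score method of LinearRegression returns the plain R2, so the adjusted version has to be computed separately. Below is a minimal sketch using the standard formula; the helper name adjusted_r2 and the sample and feature counts in the example call are hypothetical, not taken from the model above.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2: penalizes R2 for the number of features used.

    Hypothetical helper written for illustration.
    """
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need more samples than features plus one")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


# Example with made-up counts: 1000 rows and 50 dummy-encoded features.
print(adjusted_r2(0.62, n_samples=1000, n_features=50))
```

Since one-hot encoding creates many columns, the adjusted score is a useful check that an increase in R2 reflects real explanatory power and not just a larger number of features.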
You can reach me on LinkedIn or by email: design4led@gmail.com.