A Machine Learning Model for Predicting Monthly Temperature
A Practical Machine Learning Workflow Example
Problem Introduction
The problem we will tackle is predicting the average global land and ocean temperature using over 100 years of past weather data. We are going to act as if we don’t have access to any weather forecasts. What we do have is a century’s worth of historical global temperature averages, including global maximum temperatures, global minimum temperatures, and global land and ocean temperatures. Given all of this, we know that this is a supervised regression machine learning problem.
It’s supervised because we have both the features and the target that we want to predict. During training, we will give several regression models both the features and the targets, and each model must learn how to map the features to a prediction. It’s a regression task because the target value is continuous (as opposed to the discrete classes of a classification task).
That’s pretty much all the background we need, so let’s start!
ML Workflow
Before we jump right into programming, we should outline exactly what we want to do. Now that we have our problem and model in mind, the following steps form the basis of my machine learning workflow:
- State the question and determine the required data (completed)
- Acquire the data
- Identify and correct missing data points/anomalies
- Prepare the data for the machine learning model by cleaning/wrangling
- Establish a baseline model
- Train the model on the training data
- Make predictions on the test data
- Compare predictions to the known test set targets and calculate performance metrics
- If performance is not satisfactory, adjust the model, acquire more data, or try a different modeling technique
- Interpret the model and report results visually and numerically
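The baseline, training, prediction, and metric steps in the list above can be sketched in a few lines. This is only an illustration under stated assumptions: the data is synthetic (not the temperature dataset), the model is plain NumPy least squares (not necessarily the model used later), and mean absolute error is just one possible metric.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the temperature dataset): 200 rows, 3 features,
# and a continuous target that depends linearly on the features plus noise.
features = rng.normal(size=(200, 3))
target = features @ np.array([1.5, -2.0, 0.5]) + 10.0 + rng.normal(scale=0.1, size=200)

# Hold out the last 25% of rows as a test set.
split = int(0.75 * len(features))
X_train, X_test = features[:split], features[split:]
y_train, y_test = target[:split], target[split:]

# Baseline: always predict the mean of the training targets.
baseline_pred = np.full(len(y_test), y_train.mean())
baseline_mae = np.mean(np.abs(baseline_pred - y_test))

# Model: ordinary least squares with an intercept column.
A_train = np.column_stack([X_train, np.ones(len(X_train))])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

# Predict on the held-out rows and compare against the known targets.
A_test = np.column_stack([X_test, np.ones(len(X_test))])
model_pred = A_test @ coef
model_mae = np.mean(np.abs(model_pred - y_test))

print(f"baseline MAE: {baseline_mae:.3f}, model MAE: {model_mae:.3f}")
```

A model is only worth keeping if its error on the test set is clearly lower than the baseline's; otherwise the later steps (adjusting the model, acquiring more data) come into play.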
Data Acquisition
First, we need some data. To use a realistic example, I retrieved temperature data from the Berkeley Earth "Climate Change: Earth Surface Temperature Data" dataset on Kaggle.com. Because the dataset comes from Berkeley Earth, an organization that specializes in climate data, we will assume the data is accurate.
Dataset link: https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data
After importing the necessary libraries and modules, the code below loads the CSV data into a variable that we can use later:
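A minimal sketch of this loading step, assuming pandas and a local copy of the dataset's GlobalTemperatures.csv file. The helper name `load_global_temperatures`, the variable name `global_temp`, and the file location are assumptions for illustration. The two sample rows hold placeholder values, not real measurements, and exist only so the sketch runs without the download.

```python
import io

import pandas as pd


def load_global_temperatures(source):
    """Read the CSV into a DataFrame, parsing the dt column as dates."""
    return pd.read_csv(source, parse_dates=["dt"])


# With the real file (assumed to sit in the working directory):
#     global_temp = load_global_temperatures("GlobalTemperatures.csv")

# Placeholder stand-in so the sketch is runnable without the download.
sample_csv = io.StringIO(
    "dt,LandAverageTemperature,LandAverageTemperatureUncertainty\n"
    "1750-01-01,3.0,3.5\n"
    "1750-02-01,,\n"
)
global_temp = load_global_temperatures(sample_csv)

# Show the first rows to confirm the load worked.
print(global_temp.head())
```

Parsing `dt` as a date at load time makes it easier to pull out the year or month later on.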

The columns are explained below:
dt: the date of the record; starts in 1750 for average land temperature and in 1850 for maximum and minimum land temperatures and global land and ocean temperatures
LandAverageTemperature: global average land temperature in Celsius
LandAverageTemperatureUncertainty: the 95% confidence interval around the average
LandMaxTemperature: global average maximum land temperature in Celsius
LandMaxTemperatureUncertainty: the 95% confidence interval around the maximum land temperature
LandMinTemperature: global average minimum land temperature in Celsius
LandMinTemperatureUncertainty: the 95% confidence interval around the minimum land temperature
LandAndOceanAverageTemperature: global average land and ocean temperature in Celsius
LandAndOceanAverageTemperatureUncertainty: the 95% confidence interval around the global average land and ocean temperature
Identify Anomalies/Missing Data
Looking through the data (shown above) from Berkeley Earth, I noticed several missing data points, which is a good reminder that data collected in the real world will never be perfect. Missing data can distort an analysis considerably, as can incorrect data or outliers.
To identify anomalies, we can quickly find missing values using the info() method on our DataFrame.
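A minimal sketch of this check, assuming pandas. The small frame below uses placeholder values (not real measurements) in place of the loaded dataset, and the name `global_temp` is an assumption.

```python
import numpy as np
import pandas as pd

# Placeholder rows standing in for the loaded dataset (values are illustrative).
global_temp = pd.DataFrame(
    {
        "dt": pd.to_datetime(["1750-01-01", "1750-02-01", "1750-03-01"]),
        "LandAverageTemperature": [3.0, np.nan, 5.6],
        "LandMaxTemperature": [np.nan, np.nan, np.nan],
    }
)

# info() prints each column's dtype and its number of non-null entries;
# a count lower than the number of rows signals missing values.
global_temp.info()

# count() returns the same non-null counts as a Series, which is handy
# when the numbers are needed programmatically.
non_null_counts = global_temp.count()
```

In the real dataset, the columns that only begin in 1850 should show noticeably fewer non-null entries than `dt`.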

We can also chain the “.isnull()” and “.sum()” methods directly on our DataFrame to find the total number of missing values in each column.
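A minimal sketch of the per-column count, again assuming pandas and using a small placeholder frame (illustrative values, hypothetical name `global_temp`) in place of the loaded dataset.

```python
import numpy as np
import pandas as pd

# Placeholder rows standing in for the loaded dataset (values are illustrative).
global_temp = pd.DataFrame(
    {
        "dt": pd.to_datetime(["1750-01-01", "1750-02-01", "1750-03-01"]),
        "LandAverageTemperature": [3.0, np.nan, 5.6],
        "LandMaxTemperature": [np.nan, np.nan, np.nan],
    }
)

# isnull() marks every missing cell as True; sum() then adds up the True
# values column by column, giving the number of missing entries per column.
missing_per_column = global_temp.isnull().sum()
print(missing_per_column)
```

Unlike info(), this returns a Series that can be sorted or filtered, for example to list only the columns that have gaps.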
