When it comes to machine learning classification tasks, the more data available to train algorithms, the better. In supervised learning, this data must be labeled with respect to the target class — otherwise, these algorithms wouldn’t be able to learn the relationships between the independent and target variables. However, there are a couple of issues that arise when building large, labeled data sets for classification:
Labeling data can be time-consuming. Let’s say we have 1,000,000 dog images that we want to feed to a classification algorithm, with the goal of predicting whether each image contains a Boston Terrier. If we want to use all of those images for a supervised classification task, we need a human to look at each image and determine whether a Boston Terrier is present. While I do have friends (and a wife) who wouldn’t mind scrolling through dog pictures all day, it probably isn’t how most of us want to spend our weekend.
Labeling data can be expensive. See reason 1: to get someone to painstakingly scour 1,000,000 dog pictures, we’re probably going to have to shell out some cash.
So, what if we only have enough time and money to label some of a large data set, and choose to leave the rest unlabeled? Can this unlabeled data somehow be used in a classification algorithm?
This is where semi-supervised learning comes in. In taking a semi-supervised approach, we can train a classifier on the small amount of labeled data, and then use the classifier to make predictions on the unlabeled data. Since these predictions are likely better than random guessing, the unlabeled data predictions can be adopted as ‘pseudo-labels’ in subsequent iterations of the classifier. While there are many flavors of semi-supervised learning, this specific technique is called self-training.
Self-Training

On a conceptual level, self-training works like this:
Step 1: Split the labeled data instances into train and test sets. Then, train a classification algorithm on the labeled training data.
Step 2: Use the trained classifier to predict class labels for all of the unlabeled data instances. Of these predicted class labels, the ones with the highest probability of being correct are adopted as ‘pseudo-labels’.
(A couple of variations on Step 2: a) All of the predicted labels can be adopted as ‘pseudo-labels’ at once, without considering probability, or b) The ‘pseudo-labeled’ data can be weighted by confidence in the prediction.)
Step 3: Concatenate the ‘pseudo-labeled’ data with the labeled training data. Re-train the classifier on the combined ‘pseudo-labeled’ and labeled training data.
Step 4: Use the trained classifier to predict class labels for the labeled test data instances. Evaluate classifier performance using your metric(s) of choice.
(Steps 1 through 4 can be repeated until no more predicted class labels from Step 2 meet a specific probability threshold, or until no more unlabeled data remains.)
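To make those four steps concrete before we look at real data, here is a minimal generic sketch of the loop in Python. This is my own illustrative helper, not a library function; it assumes numpy arrays and any sklearn-style classifier exposing fit, predict_proba, and classes_:

import numpy as np

def self_train(clf, X_lab, y_lab, X_unlab, threshold=0.99, max_iter=40):
    """A minimal self-training sketch (Steps 1-3 repeated; Step 4 happens outside)."""
    for _ in range(max_iter):
        clf.fit(X_lab, y_lab)                      # Step 1: train on (pseudo-)labeled data
        if len(X_unlab) == 0:                      # no unlabeled data remains
            break
        probs = clf.predict_proba(X_unlab)         # Step 2: score the unlabeled pool
        confident = probs.max(axis=1) > threshold  # keep only high-confidence predictions
        if not confident.any():                    # nothing clears the threshold
            break
        pseudo_y = clf.classes_[probs.argmax(axis=1)[confident]]  # adopt as 'pseudo-labels'
        X_lab = np.vstack([X_lab, X_unlab[confident]])            # Step 3: grow training set
        y_lab = np.concatenate([y_lab, pseudo_y])
        X_unlab = X_unlab[~confident]
    return clf

Evaluating the returned classifier on a held-out labeled test set (Step 4) stays outside the helper, so the test data never leaks into training.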
Ok, got it? Good! Let’s work through an example.
Example: Using Self-Training to Improve a Classifier
To demonstrate self-training, I’m using Python and the surgical_deepnet data set, available here on Kaggle. This data set is intended to be used for binary classification, and contains data for 14.6k+ surgeries. The attributes are measurements like bmi, age, and a variety of others, while the target variable, complication, records whether the patient suffered complications as a result of surgery. Clearly, being able to accurately predict whether a patient will suffer complications from a surgery would be in the best interest of healthcare and insurance providers alike.
Imports
For this tutorial, I import numpy, pandas, and matplotlib. I’ll also use the LogisticRegression classifier from sklearn, as well as the f1_score and plot_confusion_matrix functions for model evaluation.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.metrics import plot_confusion_matrix
Load Data
# Load data
df = pd.read_csv('surgical_deepnet.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14635 entries, 0 to 14634
Data columns (total 25 columns):
bmi 14635 non-null float64
Age 14635 non-null float64
asa_status 14635 non-null int64
baseline_cancer 14635 non-null int64
baseline_charlson 14635 non-null int64
baseline_cvd 14635 non-null int64
baseline_dementia 14635 non-null int64
baseline_diabetes 14635 non-null int64
baseline_digestive 14635 non-null int64
baseline_osteoart 14635 non-null int64
baseline_psych 14635 non-null int64
baseline_pulmonary 14635 non-null int64
ahrq_ccs 14635 non-null int64
ccsComplicationRate 14635 non-null float64
ccsMort30Rate 14635 non-null float64
complication_rsi 14635 non-null float64
dow 14635 non-null int64
gender 14635 non-null int64
hour 14635 non-null float64
month 14635 non-null int64
moonphase 14635 non-null int64
mort30 14635 non-null int64
mortality_rsi 14635 non-null float64
race 14635 non-null int64
complication 14635 non-null int64
dtypes: float64(7), int64(18)
memory usage: 2.8 MB
The attributes in the data set are all numerical, and there are no missing values. Since my focus here isn’t on data cleaning, I’ll move on to partitioning the data.
Data Splits
To experiment with self-training, I’ll need to split the data into three parts: a train set, a test set, and an unlabeled set. I’ll split the data according to the following proportions:
1% Train (Labeled)
25% Test (Labeled)
74% Unlabeled
For the unlabeled set, I will simply drop the target variable, complication, and pretend that it never existed. So, in this case, we are imagining that 74% of the surgery cases have no information regarding complications. I do this to simulate the fact that in real-world classification problems, much of the data available may not have class labels. However, if we do have class labels for a small fraction of the data (in this case 1%), semi-supervised learning techniques could then be used to draw conclusions from the unlabeled data.
Below, I shuffle the data, generate indices to partition the data, and then create the test, train, and unlabeled splits. I then check the dimensions of the splits to make sure everything went according to plan.
# Shuffle the data
df = df.sample(frac=1, random_state=15).reset_index(drop=True)
# Generate indices for splits
test_ind = round(len(df)*0.25)
train_ind = test_ind + round(len(df)*0.01)
unlabeled_ind = train_ind + round(len(df)*0.74)
# Partition the data
test = df.iloc[:test_ind]
train = df.iloc[test_ind:train_ind]
unlabeled = df.iloc[train_ind:unlabeled_ind]
# Assign data to train, test, and unlabeled sets
X_train = train.drop('complication', axis=1)
y_train = train.complication
X_unlabeled = unlabeled.drop('complication', axis=1)
X_test = test.drop('complication', axis=1)
y_test = test.complication
# Check dimensions of data after splitting
print(f"X_train dimensions: {X_train.shape}")
print(f"y_train dimensions: {y_train.shape}\n")
print(f"X_test dimensions: {X_test.shape}")
print(f"y_test dimensions: {y_test.shape}\n")
print(f"X_unlabeled dimensions: {X_unlabeled.shape}")
X_train dimensions: (146, 24)
y_train dimensions: (146,)
X_test dimensions: (3659, 24)
y_test dimensions: (3659,)
X_unlabeled dimensions: (10830, 24)
Class Distribution
# Visualize class distribution
y_train.value_counts().plot(kind='bar')
plt.xticks([0,1], ['No Complication', 'Complication'])
plt.ylabel('Count');

There are more than twice as many instances of the majority class (no complication) as of the minority class (complication). With an imbalanced class situation like this, I want to be selective about the classification evaluation metric I choose; accuracy may not be the best choice.
I choose the F1 score as the classification metric to judge the effectiveness of the classifier. The F1 score is more robust to class imbalance than accuracy, which is better suited to approximately balanced classes. The F1 score can be calculated as follows:
F1 = 2 × (precision × recall) / (precision + recall)
where precision is the fraction of predicted positives that are actually positive, and recall is the fraction of actual positives that the classifier correctly identifies.
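As a quick sanity check of that formula (using toy labels purely for illustration, not the surgical data), the manual calculation agrees with sklearn's f1_score:

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]  # toy ground truth
y_pred = [1, 0, 0, 1, 1, 1]  # toy predictions: 3 TP, 1 FP, 1 FN

precision = precision_score(y_true, y_pred)  # 3 / (3 + 1) = 0.75
recall = recall_score(y_true, y_pred)        # 3 / (3 + 1) = 0.75
print(2 * (precision * recall) / (precision + recall))  # 0.75
print(f1_score(y_true, y_pred))                         # 0.75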
Initial Classifier (Supervised)
To establish a baseline against which to judge the semi-supervised results, I first train a simple Logistic Regression classifier using only the labeled training data, and predict on the test data set.
# Logistic Regression Classifier
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_hat_test = clf.predict(X_test)
y_hat_train = clf.predict(X_train)
train_f1 = f1_score(y_train, y_hat_train)
test_f1 = f1_score(y_test, y_hat_test)
print(f"Train f1 Score: {train_f1}")
print(f"Test f1 Score: {test_f1}")
plot_confusion_matrix(clf, X_test, y_test, cmap='Blues', normalize='true',
                      display_labels=['No Comp.', 'Complication']);
Train f1 Score: 0.5846153846153846
Test f1 Score: 0.5002908667830134

The classifier has a test F1 score of 0.5. The confusion matrix tells us that the classifier correctly predicts surgeries without complications 86% of the time, but has more trouble identifying surgeries with complications, getting those right only 47% of the time.
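(If you prefer to read those per-class rates as numbers rather than off the plot, the same row-normalized matrix can be printed directly; a small optional snippet using the classifier and predictions from above:)

from sklearn.metrics import confusion_matrix

# Rows are true classes and sum to 1; the diagonal holds the per-class
# rates quoted above (roughly 0.86 for no complication, 0.47 for complication)
print(confusion_matrix(y_test, y_hat_test, normalize='true'))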
Probability of Predictions
For the self-training algorithm, we’ll want to know the probabilities of predictions made by the Logistic Regression classifier. Luckily, sklearn provides the .predict_proba() method, which allows us to see the probability of a prediction belonging to either class. As you can see below, the total probability will sum to 1.0 for each prediction in a binary classification problem.
# Generate probabilities for each prediction
clf.predict_proba(X_test)
array([[0.93931367, 0.06068633],
       [0.2327203 , 0.7672797 ],
       [0.93931367, 0.06068633],
       ...,
       [0.61940353, 0.38059647],
       [0.41240068, 0.58759932],
       [0.24306008, 0.75693992]])
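Since the self-training loop below only adopts predictions made with very high probability, it helps to know how to pull out each row's winning class and its confidence. A short illustrative snippet, using the classifier trained above:

probs = clf.predict_proba(X_test)
confidence = probs.max(axis=1)                  # probability of the predicted class
predicted = clf.classes_[probs.argmax(axis=1)]  # the predicted class itself
print(confidence[:3])
print(predicted[:3])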
Self-Training Classifier (Semi-Supervised)
Now that we know how to get prediction probabilities using sklearn, we can move ahead with coding the self-training classifier. Here is a brief outline:
Step 1: First, train a Logistic Regression classifier on the labeled training data.
Step 2: Next, use the classifier to predict labels for all unlabeled data, as well as probabilities for those predictions. In this case, I will only adopt ‘pseudo-labels’ for predictions with greater than 99% probability.
Step 3: Concatenate the ‘pseudo-labeled’ data with the labeled training data, and re-train the classifier on the concatenated data.
Step 4: Use trained classifier to make predictions for the labeled test data, and evaluate the classifier.
Repeat steps 1 through 4 until no more predictions have greater than 99% probability, or no unlabeled data remains.
See the code below for my implementation of these steps in Python, using a while-loop.
# Initiate iteration counter
iterations = 0

# Containers to hold f1_scores and # of pseudo-labels
train_f1s = []
test_f1s = []
pseudo_labels = []

# Assign value to initiate while loop
high_prob = [1]

# Loop will run until there are no more high-probability pseudo-labels
while len(high_prob) > 0:

    # Fit classifier and make train/test predictions
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    y_hat_train = clf.predict(X_train)
    y_hat_test = clf.predict(X_test)

    # Calculate and print iteration # and f1 scores, and store f1 scores
    train_f1 = f1_score(y_train, y_hat_train)
    test_f1 = f1_score(y_test, y_hat_test)
    print(f"Iteration {iterations}")
    print(f"Train f1: {train_f1}")
    print(f"Test f1: {test_f1}")
    train_f1s.append(train_f1)
    test_f1s.append(test_f1)

    # Generate predictions and probabilities for unlabeled data
    print("Now predicting labels for unlabeled data...")
    pred_probs = clf.predict_proba(X_unlabeled)
    preds = clf.predict(X_unlabeled)
    prob_0 = pred_probs[:, 0]
    prob_1 = pred_probs[:, 1]

    # Store predictions and probabilities in dataframe
    df_pred_prob = pd.DataFrame([])
    df_pred_prob['preds'] = preds
    df_pred_prob['prob_0'] = prob_0
    df_pred_prob['prob_1'] = prob_1
    df_pred_prob.index = X_unlabeled.index

    # Separate predictions with > 99% probability
    high_prob = pd.concat([df_pred_prob.loc[df_pred_prob['prob_0'] > 0.99],
                           df_pred_prob.loc[df_pred_prob['prob_1'] > 0.99]],
                          axis=0)
    print(f"{len(high_prob)} high-probability predictions added to training data.")
    pseudo_labels.append(len(high_prob))

    # Add pseudo-labeled data to training data
    X_train = pd.concat([X_train, X_unlabeled.loc[high_prob.index]], axis=0)
    y_train = pd.concat([y_train, high_prob.preds])

    # Drop pseudo-labeled instances from unlabeled data
    X_unlabeled = X_unlabeled.drop(index=high_prob.index)
    print(f"{len(X_unlabeled)} unlabeled instances remaining.\n")

    # Update iteration counter
    iterations += 1
Iteration 0
Train f1: 0.5846153846153846
Test f1: 0.5002908667830134
Now predicting labels for unlabeled data...
42 high-probability predictions added to training data.
10788 unlabeled instances remaining.
Iteration 1
Train f1: 0.7627118644067796
Test f1: 0.5037463976945246
Now predicting labels for unlabeled data...
30 high-probability predictions added to training data.
10758 unlabeled instances remaining.
Iteration 2
Train f1: 0.8181818181818182
Test f1: 0.505431675242996
Now predicting labels for unlabeled data...
20 high-probability predictions added to training data.
10738 unlabeled instances remaining.
Iteration 3
Train f1: 0.847457627118644
Test f1: 0.5076835515082526
Now predicting labels for unlabeled data...
21 high-probability predictions added to training data.
10717 unlabeled instances remaining.
...Iteration 44
Train f1: 0.9481216457960644
Test f1: 0.5259179265658748
Now predicting labels for unlabeled data...
0 high-probability predictions added to training data.
10079 unlabeled instances remaining.
The self-training algorithm went through 44 iterations before no more unlabeled instances could be predicted at >99% probability. Even though there were initially 10,830 unlabeled instances, 10,079 of those remained unlabeled (and unused by the classifier) after self-training.
# Plot f1 scores and number of pseudo-labels added for all iterations
fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(6,8))
ax1.plot(range(iterations), test_f1s)
ax1.set_ylabel('f1 Score')
ax2.bar(x=range(iterations), height=pseudo_labels)
ax2.set_ylabel('Pseudo-Labels Created')
ax2.set_xlabel('# Iterations');
# View confusion matrix after self-training
plot_confusion_matrix(clf, X_test, y_test, cmap='Blues', normalize='true',
                      display_labels=['No Comp.', 'Complication']);

Over the 44 iterations, the F1 score improved from 0.50 to 0.525! Though it’s only a small increase, it looks like self-training has improved the classifier’s performance on the test data set. The top panel of the figure above shows that most of this improvement is happening in earlier iterations of the algorithm. Similarly, the bottom panel shows that most of the ‘pseudo-labels’ added to the training data come during the first 20–30 iterations.

The final confusion matrix shows an improvement in the classification of surgeries with complications, but a slight decline in the classification of surgeries with no complications. Supported by the improved F1 score, I think this is an acceptable trade-off: it is likely more important to identify surgery cases that will result in complications (True Positives), and it is probably worth accepting a higher False Positive rate to achieve that result.
Words of Caution
So you may be thinking: is there a risk to performing self-training with so much unlabeled data? The answer, of course, is yes. Remember that although we are including our ‘pseudo-labeled’ data with labeled training data, some of the ‘pseudo-labeled’ data is certainly going to be incorrect. When enough of the ‘pseudo-labels’ are incorrect, the self-training algorithm can reinforce poor classification decisions, and classifier performance can actually get worse.
However, this risk can be mitigated by following established practices like using a test data set that the classifier has not seen during training, or using a probability threshold for ‘pseudo-label’ prediction.
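If you want to probe that risk empirically, one simple check is to re-run self-training at several probability thresholds and compare the resulting test F1 scores. A sketch, assuming the while-loop above were wrapped in a hypothetical helper run_self_training(threshold) that returns the final fitted classifier (no such function is defined in this post):

# Hypothetical sweep; run_self_training() would wrap the while-loop above
for threshold in [0.95, 0.99, 0.999]:
    clf = run_self_training(threshold)
    print(threshold, f1_score(y_test, clf.predict(X_test)))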