A GPT-2-Based Poem Couplet Model

Latest recommended article published on 2025-06-24 08:59:57

License: CC 4.0 BY-SA

Tags:

#natural-language-processing #pytorch #artificial-intelligence #neural-networks #deep-learning #gpt


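Preface

My expertise is limited, so errors and omissions are likely; corrections are welcome.
For more, see the Python Daily Tips column, the OpenCV-Python Mini Applications column, the YOLO Series column, the Natural Language Processing column, or my home page.
Face Disguise Detection with DETR
Training YOLOv7 on a Custom Dataset (Mask Detection)
Training YOLOv8 on a Custom Dataset (Football Detection)
Training YOLOv10 on a Custom Dataset (Traffic Sign Detection)
Training YOLO11 on a Custom Dataset (Smoking and Fall Detection)
YOLOv5: Accelerating YOLOv5 Inference with TensorRT
YOLOv5: IoU, GIoU, DIoU, CIoU, EIoU
Getting the Most out of Jetson Nano (5): Accelerating YOLOv5 Object Detection with TensorRT
YOLOv5: Adding SE, CBAM, CoordAtt, and ECA Attention Mechanisms
YOLOv5: Understanding the yolov5s.yaml Configuration File and Adding a Small-Object Detection Layer
Converting a COCO-Format Instance Segmentation Dataset to YOLO Format in Python
YOLOv5: Training a Custom Instance Segmentation Model with Version 7.0 (Vehicles, Pedestrians, Road Signs, Lane Lines, and More)
Trying the Open-Source Stable Diffusion Project for Free on Kaggle GPUs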

Prerequisites

Familiarity with Python

Experimental environment

Package                       Version
----------------------------- ------------
matplotlib                    3.3.4
numpy                         1.19.5
Pillow                        8.4.0
pip                           21.2.2
protobuf                      3.19.6
requests                      2.27.1
scikit-learn                  0.24.2
scipy                         1.5.4
sentencepiece                 0.1.91
setuptools                    58.0.4
threadpoolctl                 3.1.0
thulac                        0.2.2
tokenizers                    0.9.3
torch                         1.9.1+cu111
torchaudio                    0.9.1
torchvision                   0.10.1+cu111
tornado                       6.1
tqdm                          4.64.1
traitlets                     4.3.3
transformers                  3.5.1
urllib3                       1.26.20

A GPT-2-Based Poem Couplet Model

Prepare the dataset

data_splited++.jl: the preprocessed text dataset of poem lines.
id2w++.json: the mapping from IDs to words.
w2id++.json: the mapping from words to IDs.

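Load the dataset

import json
from tqdm import tqdm
import torch
import time

# Load the word-to-ID and ID-to-word vocabularies.
with open('w2id++.json', 'r', encoding='utf-8') as f:
    w2id = json.load(f)
with open('id2w++.json', 'r', encoding='utf-8') as f:
    id2w = json.load(f)

# Append the full-width comma and full stop to both vocabularies.
w2id['，'] = len(id2w)
id2w.append('，')
w2id['。'] = len(id2w)
id2w.append('。')

# Each line of the .jl file is a JSON list of poem lines; join them into one string.
data_list = []
with open('data_splited++.jl', 'r', encoding='utf-8') as f:
    for l in f:
        # Note: the separator here is an ASCII comma, whereas the vocabulary entry
        # added above and the test prompts use the full-width '，'. This is why
        # ASCII commas appear in the generated samples further down.
        d = ','.join(json.loads(l)) + '。'
        data_list.append(d)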
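To make the record format concrete, the sketch below applies the same join to a single record. It assumes each line of data_splited++.jl is a JSON array of poem lines, which is what the `json.loads` and `join` calls above imply; the sample couplet and the helper name `to_training_text` are illustrative and not taken from the dataset or the article.

```python
import json

# Hypothetical single line from the .jl file: a JSON array of poem lines.
raw_line = '["白日依山尽", "黄河入海流"]'

def to_training_text(line, sep=',', end='。'):
    """Join the poem lines with `sep` and terminate the string with `end`."""
    return sep.join(json.loads(line)) + end

text = to_training_text(raw_line)
print(text)
print(len(text))  # two 5-character lines plus two punctuation marks
```

Passing `sep='，'` instead would make the training text use the same full-width comma as the test prompts.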

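Split the dataset

# Group the data into buckets by text length so that each batch holds texts of similar length
dlx = [[] for _ in range(5)]
for d in data_list:
    # Lengths 12-13 map to bucket 0, 14-15 to bucket 1, and so on up to 20-21 for bucket 4
    dlx[len(d) // 2 - 6].append(d)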
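The bucket formula is easier to follow with concrete lengths. The sketch below wraps it in a hypothetical helper, `bucket_index`, and applies it to two well-known couplets written in the same format as the training strings; the examples are illustrative and not drawn from the dataset.

```python
def bucket_index(text):
    """Same formula as the loop above: len(text) // 2 - 6."""
    return len(text) // 2 - 6

five = '白日依山尽,黄河入海流。'            # 5 + 1 + 5 + 1 = 12 characters
seven = '两个黄鹂鸣翠柳,一行白鹭上青天。'    # 7 + 1 + 7 + 1 = 16 characters

print(bucket_index(five), bucket_index(seven))
```

With five buckets, only lengths 12 through 21 map to valid non-negative indices. A longer text raises an IndexError, and a text shorter than 12 characters produces a negative index, which Python silently resolves by counting from the end of the list, so a length check before bucketing may be worth adding.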

Set the training parameters

batch_size = 32
# data_workers = 4
data_workers = 0
learning_rate = 1e-6
gradient_accumulation_steps = 1
max_train_epochs = 10
warmup_proportion = 0.05
weight_decay = 0.01
max_grad_norm = 1.0

cur_time = time.strftime("%Y-%m-%d_%H:%M:%S")
device = torch.device('cuda')

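These parameters configure the training run in PyTorch. Each one is explained below:

batch_size = 32:
- The batch size is the number of samples used in one training iteration. With a value of 32, the loss and gradients are computed over 32 samples before each parameter update. Larger batches can speed up training, but they use more memory and may affect how well the model generalizes.
data_workers = 0:
- The number of worker processes used to load data in parallel. A value of 0 means data is loaded synchronously in the main process, which can slow loading. More workers usually speed up loading, but too many add system overhead.
learning_rate = 1e-6:
- The learning rate controls the size of each parameter update. A small rate gives smaller, usually more stable steps but slower convergence; a large rate can make training unstable or cause it to diverge. 1e-6 is a very small value, suited to careful fine-tuning of a pretrained model.
gradient_accumulation_steps = 1:
- The number of batches whose gradients are accumulated before the parameters are updated. A value of 1 means the parameters are updated after every batch. When memory is limited, raising this value simulates a larger batch size.
max_train_epochs = 10:
- The maximum number of training epochs. One epoch is a full pass of the dataset through the model, so the dataset is traversed 10 times here.
warmup_proportion = 0.05:
- The fraction of the total training steps during which the learning rate rises gradually to its configured value. Warmup makes early updates more stable and avoids the instability a high initial learning rate can cause. A value of 0.05 means the learning rate rises over the first 5% of training steps.
weight_decay = 0.01:
- Weight decay is a regularization technique that helps prevent overfitting by keeping parameter values small. In its classic form it adds a penalty proportional to the squared parameter values to the loss; the AdamW optimizer used later applies the decay directly to the weights instead. The value here is 0.01.
max_grad_norm = 1.0:
- The threshold for gradient clipping by norm. If the norm of the gradients exceeds this value, the gradients are scaled down so that their norm equals it. This helps prevent exploding gradients. The value here is 1.0.
cur_time = time.strftime("%Y-%m-%d_%H:%M:%S"):
- This line formats the current time as a string, typically used to build timestamped file names or log entries that record the training run.
device = torch.device('cuda'):
- This line selects the device on which the model and data run. 'cuda' means an NVIDIA GPU accessed through CUDA. Note that torch.device('cuda') does not fall back to the CPU on its own: if no CUDA device is available, moving the model or tensors to this device raises an error.

Together, these parameters determine the training configuration, including training speed, model quality, and training stability.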
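The warmup behaviour described for warmup_proportion can be sketched as a plain function. The version below follows the linear warmup followed by linear decay that `get_linear_schedule_with_warmup` applies as a multiplier on the base learning rate; the function name `linear_warmup_decay` and the total of 1000 steps are illustrative assumptions, not values from the article.

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Ramp linearly from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Then decay linearly from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 1e-6
total_steps = 1000                       # illustrative value, not from the article
warmup_steps = int(0.05 * total_steps)   # warmup_proportion = 0.05

for step in (0, 25, 50, 525, 1000):
    multiplier = linear_warmup_decay(step, warmup_steps, total_steps)
    print(step, base_lr * multiplier)
```

The learning rate reaches its configured value only at the end of warmup, so the total step count passed to the scheduler needs to match the number of times `scheduler.step()` is actually called.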

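Use the pretrained model

from transformers import GPT2LMHeadModel
# Load the pretrained GPT-2 weights from the local ./GPT2 directory and move the model to the GPU
model = GPT2LMHeadModel.from_pretrained('./GPT2').to(device)
# Word-level BERT tokenizer from the local tokenizations package, with the GPT2-Chinese vocabulary
from tokenizations import tokenization_bert_word_level as tokenization_bert
tokenizer = tokenization_bert.BertTokenizer(vocab_file="GPT2-Chinese/cache/vocab.txt")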

Create a custom Dataset object

class MyDataSet(torch.utils.data.Dataset):
    """Wraps a list of text examples and returns each one with its index."""
    def __init__(self, examples):
        self.examples = examples
    def __len__(self):
        return len(self.examples)
    def __getitem__(self, index):
        example = self.examples[index]
        return example, index

def the_collate_fn(batch):
    # Each batch element is an (example, index) pair from MyDataSet
    indexs = [b[1] for b in batch]
    # Tokenize the texts and pad them to the longest sequence in the batch
    r = tokenizer([b[0] for b in batch], padding=True)
    input_ids = torch.LongTensor(r['input_ids'])
    attention_mask = torch.LongTensor(r['attention_mask'])
    return input_ids, attention_mask, indexs

# Build one DataLoader per length bucket
dldx = []
for d in dlx:
    ds = MyDataSet(d)
    dld = torch.utils.data.DataLoader(
        ds,
        batch_size=batch_size,
        shuffle=True,
        num_workers=data_workers,
        collate_fn=the_collate_fn,
    )
    dldx.append(dld)
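What `padding=True` contributes inside the collate function can be illustrated without a tokenizer. The sketch below pads lists of token IDs to a common length and builds the matching attention mask; `pad_batch` and the pad ID of 0 are assumptions made for illustration, since the real pad ID comes from the tokenizer's vocabulary.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every ID list to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [8, 9]])
print(ids)
print(mask)
```

Grouping texts of similar length into the same bucket, as done earlier, reduces how much padding each batch needs.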

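Evaluate the model

import time
import torch
import torch.nn.functional as F  # used by top_k_top_p_filtering below
# time.clock was removed in Python 3.8; alias it for older libraries that still call it
time.clock = time.perf_counter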
def top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')):
    """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering
        Args:
            logits: logits distribution shape (vocabulary size)
            top_k > 0: keep only top k tokens with highest probability (top-k filtering).
            top_p > 0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).
                Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)
        From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317
    """
    assert logits.dim() == 1  # batch size 1 for now - could be updated for more but the code would be less clear
    top_k = min(top_k, logits.size(-1))  # Safety check
    if top_k > 0:
        # Remove all tokens with a probability less than the last token of the top-k
        indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
        logits[indices_to_remove] = filter_value

    if top_p > 0.0:
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

        # Remove tokens with cumulative probability above the threshold
        sorted_indices_to_remove = cumulative_probs > top_p
        # Shift the indices to the right to keep also the first token above the threshold
        sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
        sorted_indices_to_remove[..., 0] = 0

        indices_to_remove = sorted_indices[sorted_indices_to_remove]
        logits[indices_to_remove] = filter_value
    return logits
def generate(model, context, length, temperature=1.0, top_k=30, top_p=0.0, device='cpu'):
    inputs = torch.LongTensor(context).view(1, -1).to(device)
    if len(context) > 1:
        _, past = model(inputs[:, :-1], None)[:2]
        prev = inputs[:, -1].view(1, -1)
    else:
        past = None
        prev = inputs
    generate = [] + context
    with torch.no_grad():
        for i in range(length):
            output = model(prev, past)
            output, past = output[:2]
            output = output[-1].squeeze(0) / temperature
            filtered_logits = top_k_top_p_filtering(output, top_k=top_k, top_p=top_p)
            next_token = torch.multinomial(torch.softmax(filtered_logits, dim=-1), num_samples=1)
            generate.append(next_token.item())
            prev = next_token.view(1, 1)
    return generate
def is_word(word):
    for item in list(word):
        if item not in 'qwertyuiopasdfghjklzxcvbnm':
            return False
    return True
def get_next(s, temperature=1,topk=10, topp = 0, device='cuda'):
    context_tokens = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(s))
    out = generate(
        model, 
        context_tokens, 
        len(s), 
        temperature, 
        top_k=topk, 
        top_p=topp, 
        device=device
    )
    text = tokenizer.convert_ids_to_tokens(out)
    for i, item in enumerate(text[:-1]):  # make sure adjacent English words are separated by a space
        if is_word(item) and is_word(text[i + 1]):
            text[i] = item + ' '
    for i, item in enumerate(text):
        if item == '[MASK]':
            text[i] = ''
        elif item == '[CLS]':
            text[i] = '\n\n'
        elif item == '[SEP]':
            text[i] = '\n'
    text = ''.join(text).replace('##', '').strip()
    return text

def print_cases():
    print(get_next('好好学习，') + '\n')
    print(get_next('白日依山尽，') + '\n')
    print(get_next('学而时习之，') + '\n')
    print(get_next('人之初性本善，') + '\n')
print_cases()

Model loaded succeed
好好学习，[UNK]于听最优

白日依山尽，独呼白衣人。

学而时习之，无获一首。凯

人之初性本善，或是陶唐臣。妇
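The sampling step in `generate` combines temperature scaling with top-k filtering. The pure-Python sketch below shows the same idea on a small list of logits, as a simplified illustration rather than the tensor code used above; the function name `top_k_sample` and the sample logits are hypothetical.

```python
import math
import random

def top_k_sample(logits, k=2, temperature=1.0, rng=random):
    """Sample an index after keeping only the k largest logits."""
    scaled = [x / temperature for x in logits]
    threshold = sorted(scaled, reverse=True)[k - 1]
    # Logits below the k-th largest are set to -inf, so they get zero probability.
    kept = [x if x >= threshold else float('-inf') for x in scaled]
    peak = max(kept)
    weights = [math.exp(x - peak) for x in kept]   # math.exp(-inf) is 0.0
    total = sum(weights)
    probs = [w / total for w in weights]
    choice = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return choice, probs

index, probs = top_k_sample([2.0, 1.0, 0.5, -1.0], k=2)
print(index, probs)
```

A smaller k or a lower temperature concentrates probability on the most likely tokens, which tends to make the generated lines more conservative.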

Define the optimizer

from transformers import AdamW, get_linear_schedule_with_warmup

# Note: len(data_list) counts examples, while scheduler.step() in the training loop is
# called once per batch. t_total is therefore roughly batch_size times larger than the
# number of optimizer steps actually taken, and the warmup length printed in the
# training log exceeds the total number of steps that are run.
t_total = len(data_list) // gradient_accumulation_steps * max_train_epochs + 1
num_warmup_steps = int(warmup_proportion * t_total)

print('warmup steps : %d' % num_warmup_steps)

# Biases and LayerNorm weights are excluded from weight decay
no_decay = ['bias', 'LayerNorm.weight'] # no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
param_optimizer = list(model.named_parameters())
optimizer_grouped_parameters = [
    {'params':[p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],'weight_decay': weight_decay},
    {'params':[p for n, p in param_optimizer if any(nd in n for nd in no_decay)],'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=t_total)
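Because the scheduler is stepped once per batch, one way to size it is to count batches rather than examples. The sketch below is a hedged alternative to the calculation above, not the configuration that produced the logged results; `total_optimizer_steps` and the bucket sizes are hypothetical.

```python
import math

def total_optimizer_steps(bucket_sizes, batch_size, accumulation_steps, epochs):
    """Count optimizer updates when each bucket has its own DataLoader."""
    # Each DataLoader yields ceil(n / batch_size) batches per epoch.
    batches_per_epoch = sum(math.ceil(n / batch_size) for n in bucket_sizes)
    updates_per_epoch = batches_per_epoch // accumulation_steps
    return updates_per_epoch * epochs

# Hypothetical bucket sizes, chosen only to make the arithmetic easy to follow.
sizes = [100, 64, 0, 33, 10]
t_total = total_optimizer_steps(sizes, batch_size=32, accumulation_steps=1, epochs=10)
num_warmup_steps = int(0.05 * t_total)
print(t_total, num_warmup_steps)
```

In the article's setting, `bucket_sizes` would be `[len(d) for d in dlx]`, so that warmup covers about 5% of the steps actually run.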

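Train the model

# Note: gradient_accumulation_steps and max_grad_norm are defined above but are not
# applied in this loop; every batch triggers one optimizer step without gradient clipping.
loss_list = []
for e in range(max_train_epochs):
    print(e)
    loss_sum = 0
    c = 0
    # One fresh iterator per length bucket
    xxx = [x.__iter__() for x in dldx]
    j = 0
    for i in tqdm(range((len(data_list)//batch_size) + 5)):
        if len(xxx) == 0:
            break
        # Cycle through the buckets in turn
        j = j % len(xxx)
        try:
            batch = xxx[j].__next__()
        except StopIteration:
            # This bucket is exhausted for the epoch; drop it and move on
            xxx.pop(j)
            continue
        j += 1
        input_ids, attention_mask, indexs = batch
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        # Passing the inputs as labels makes the model return the language-modeling loss
        outputs = model(input_ids, attention_mask=attention_mask, labels=input_ids)
        loss, logits = outputs[:2]
        loss_sum += loss.item()
        c += 1
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
    # Print sample generations and the mean loss for this epoch
    print_cases()
    print(loss_sum / c)
    loss_list.append(loss_sum / c)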

warmup steps : 558369
0
100%|██████████████████████████████████████████████████████████████████████████| 34903/34903 [1:24:44<00:00,  6.86it/s]
好好学习，况乃不下叫

白日依山尽，无分且可留。

学而时习之，安[UNK]渊剧谁。

人之初性本善，[UNK]可[UNK]。岂以为

13.360085074213426
1
100%|██████████████████████████████████████████████████████████████████████████| 34903/34903 [1:24:47<00:00,  6.86it/s]
好好学习，论种墓金,

白日依山尽，轻使我相亲。

学而时习之，[UNK]与大深放。

人之初性本善，取为深,。冷我

8.733063704236715
2
100%|██████████████████████████████████████████████████████████████████████████| 34903/34903 [1:02:16<00:00,  9.34it/s]
好好学习，愁我在风,

白日依山尽，[UNK]死不难归。

学而时习之，山,[UNK]难开。

人之初性本善，,我[UNK]之。[PAD][PAD]

7.1793754077503555
3
100%|████████████████████████████████████████████████████████████████████████████| 34903/34903 [50:43<00:00, 11.47it/s]
好好学习，,,归来。

白日依山尽，山山不知归。

学而时习之，天,[UNK]一年。

人之初性本善，,已成虚。[PAD][PAD]

6.498912517490003
4
100%|████████████████████████████████████████████████████████████████████████████| 34903/34903 [50:44<00:00, 11.46it/s]
好好学习，归去,不须

白日依山尽，山月已成新。

学而时习之，岁,未尝安。

人之初性本善，,不可作。[PAD][PAD]

6.137750816555631
5
100%|████████████████████████████████████████████████████████████████████████████| 34903/34903 [50:44<00:00, 11.46it/s]
好好学习，诗人有奇,

白日依山尽，愁人独掩扉。

学而时习之，未肯事[UNK]。[PAD]

人之初性本善，,或可以。或云

5.934937731359383
6
100%|████████████████████████████████████████████████████████████████████████████| 34903/34903 [50:44<00:00, 11.46it/s]
好好学习，,我老而不

白日依山尽，看来似梦中。

学而时习之，观心得未遑。

人之初性本善，人我独人,,。

5.810477741523139
7
100%|████████████████████████████████████████████████████████████████████████████| 34903/34903 [50:45<00:00, 11.46it/s]
好好学习，家自有神,

白日依山尽，看山对菊□,

学而时习之，无为道德尊。

人之初性本善，,至道人亦然。

5.706207116321935
8
100%|████████████████████████████████████████████████████████████████████████████| 34903/34903 [50:45<00:00, 11.46it/s]
好好学习，无过,有时

白日依山尽，清风对月生。

学而时习之，无为亦可嗤。

人之初性本善，[UNK],。[PAD][PAD][PAD][PAD]

5.609863697298926
9
100%|████████████████████████████████████████████████████████████████████████████| 34903/34903 [50:49<00:00, 11.45it/s]
好好学习，不读父心。

白日依山尽，清风照客人。

学而时习之，[UNK]不[UNK]。[PAD][PAD]

人之初性本善，自有本来,,有

5.515527831342945
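The batch scheduling inside the training loop, which takes one batch from each length bucket in turn and drops a bucket once it is exhausted, can be isolated as a small generator. This is a sketch of the same idea using a while loop instead of the fixed iteration count above; the name `round_robin` and the toy inputs are illustrative.

```python
def round_robin(iterables):
    """Yield items from several iterables in turn, dropping each one as it runs out."""
    iterators = [iter(it) for it in iterables]
    j = 0
    while iterators:
        j %= len(iterators)
        try:
            item = next(iterators[j])
        except StopIteration:
            iterators.pop(j)   # the next iterator slides into position j
            continue
        j += 1
        yield item

order = list(round_robin([['a1', 'a2', 'a3'], ['b1'], ['c1', 'c2']]))
print(order)
```

Interleaving the buckets this way mixes sequence lengths across consecutive steps while keeping each individual batch close to uniform in length.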

Save the model

torch.save(model.state_dict(), 'GPT2_model_parameter.pkl')

Test the model

model.load_state_dict(torch.load('GPT2_model_parameter.pkl'))

print_cases()

好好学习，不肯入门。

白日依山尽，孤云对客人。

学而时习之，学行自深锄。

人之初性本善，未可[UNK],。[PAD][PAD]
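When the saved parameters are reloaded on a machine without a GPU, `torch.load` accepts a `map_location` argument that remaps the tensors. The sketch below shows the save and load round trip on a small stand-in module; `demo_model`, `restored`, and the file name are hypothetical, and the fine-tuned GPT-2 model would be handled the same way.

```python
import os
import tempfile

import torch

# Stand-in for the fine-tuned model; any nn.Module is saved and restored the same way.
demo_model = torch.nn.Linear(4, 2)
# Use the GPU when one is available, otherwise fall back to the CPU explicitly.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo_parameter.pkl')   # hypothetical file name
    torch.save(demo_model.state_dict(), path)

    restored = torch.nn.Linear(4, 2)
    # map_location remaps the saved tensors onto the chosen device while loading.
    restored.load_state_dict(torch.load(path, map_location=device))

restored.to(device).eval()   # eval() disables dropout before generating text

same = all(
    torch.equal(a, b.cpu())
    for a, b in zip(demo_model.state_dict().values(), restored.state_dict().values())
)
print(same)
```

Saving only the `state_dict`, as the article does, keeps the file independent of the Python class definition, but the same model architecture has to be constructed before loading.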
