从零开始构建大模型

1.Understanding large language models

LLM，Large Language Models，大语言模型

LLM 大语言模型（例如 ChatGPT）是一种基于深度学习的神经网络模型，主要用于自然语言处理（NLP）领域。在 LLM 出现之前，传统的 NLP 主要使用简单模型或者人工编写的规则，应用场景有限。大语言模型的出现翻开了新的篇章，它将模型对语言的理解、分析以及复杂任务的处理提升到了一个全新的高度。例如过往的模型是完全无法基于几个关键字，编写一篇意思通顺的文章，而这个任务对现在的大语言模型来说不过是小菜一碟。

大模型给出的答案语义连贯，逻辑清晰，但它输出的每个词，其实都是基于概率计算出来的，暂时还没看出来它具有类似人类的那种“自我意识”。

大模型的突破主要来自在 Transformer 注意力机制的发现，它构成了现在各种大模型的基础，同时配合海量的数据进行训练，这两方面的因素让大模型对人类语言的理解实现了质的飞跃。

什么是大语言模型

大模型之所以称为“大”，主要是因为模型中的参数非常多，以亿为单位，动不动就百亿、千亿级别。同时也指它训练的数据量很大。这些数据主要来源自过往几十年人类信息化后生产的各种电子文档，例如 wiki 网页、数字图书等。

大模型的基本原理很简单，就是基于句子的前半部分，根据概率大小，预测下一个单词应该是什么。简单的原理配合海量的数据和注意力机制进行训练后，最终得到的结果大大超出了研究人员的预期。模型出现了所谓的“涌现”，即模型对人类语言，当训练的数据量超过一个临界值后，不管是哪个国家的自然语言或者计算机语言，模型突然展示出了对各种语言的惊人理解能力。

由于大模型擅于生成内容，因此有时候也被人们称为 GenAI（生成式 AI），以下是它与各种人工智能概念的关系：

深度学习是实现 AI 的主流算法，它通过设计训练和验证数据集，让模型自己反复学习和调整自身的参数，以提高预测的准确率。当准确达到某个标准后，即完成了训练。传统的编程方法由人类设计计算规则，深度学习则是给定结果，让模型自己猜测规则。只要猜测和调整的次数足够多，模型就会不断向正确答案靠拢。

但人工智能不只有深度学习一种实现方法，还有其他一些传统的方法，例如特征工程，这些方法在某些特定的应用场景中效果也很好，甚至能够解决一些深度学习算法无法解决的问题。

大语言模型应用场景

应用场景有很多，例如机器翻译，文案生成，语义分析，摘要总结，代码生成，聊天机器人，虚拟助手，医学或法律等领域的知识检索（对文档进行筛选和总结，以便专业性的技术问题）。

LLM 大模型很可能将重新定义我们与技术的交互方式，让交互过程变得更加符合人类的直觉，同时更加易于使用。

构建和使用大语言模型的步骤

构建大模型主要由预训练和微调两个步骤组成。在预训练阶段，将给模型输入海量和多样化的数据，以便模型能够习得对各种自然语言的广泛理解。预训练后将得到一个基座模型，之后再基于特定领域的标注数据，对模型进行微调训练，这样模型能够有效提高模型在特定领域上的任务表现。

传统的监督学习方法在预训练阶段即需要使用人工标注数据，因此会有数据标注成本。但大模型的预训练刚好不需要进行人工标注，因为它是基于句子的前半部分去预测下一个单词，那么所有的输入数据，天然就是正确的预标注数据了，因此不需要额外的人工标注，设计非常巧妙。如果不是因为这一点，大模型的训练将是不可能的，因为数据量太庞大了，成本和时间都将是天文数字。

Transformer 架构介绍

大部分 LLM 模型都是基于 Transformer 架构，该架构由 2017 年的论文 “Attention is all you need” 中首次提出。它的原理是针对句子中的各个单词给予不同的注意力权重，而不是一视同仁。因此每个单词对整个句子在语义维度的贡献程度是不一样的。通过使用注意力机制，能够让模型更准确的抓住句子的精髓部分。

最早的 Transformer 架构由编码器和解码器两部分组成，当时研究的场景是语言翻译，例如将英语翻译为德语。因此先使用编码器将源语言（如英语）编码成向量，然后再使用解码器，将向量编译成目标语言（如德语）；

它背后的思想大致为，虽然语言之间使用的单词和语法不同，但对于相同意思的句子，在经过注意力机制的筛选后，其对应的向量空间坐标值应该是相同的。经过实践证明，事实也确实如此。而且实验后发现，模型还可以进一步简化，编码器都不需要了，直接使用解码器就可以了，整个模型变得更加简洁和优雅了。

利用大数据

训练大模型需要的数据量非常惊人，好处是整个模型变得非常通用，经过适当的微调后中，能够有效应对各种细分领域的工作任务。

深入 GPT 架构

GPT 架构最早在 OpenAI 公司的论文 “Improving Language Understanding by Generative Pre-Training” 中提出，它直译的意思是”生成预训练“，即训练模型，让它能够不断生成句子的下一个单词。

GPT 架构对未标注的数据进行自动标注和自监督学习，是一种自回归的模型，即将上一步的结果，当作生成下一步结果的输入参数，这种模式有助于提升输出的语义连贯性。

构建大语言模型

大模型的构建主要由几个步骤构成，分别如下：

准备数据
使用注意力机制
设计模型的结构
反复迭代训练
模型评估
模型微调

2.Working with text data

理解嵌入

深度学习模型，包括 LLM 大模型，实际上无法直接处理原始文本，因为原始文件无法直接用于数学运算，因此需要先将文本转换成数字，这个转换的过程叫做嵌入（embedding）；所谓的嵌入，就是将非标数据进行标准化的一个过程。

各种类型的数据都需要进行嵌入转换，例如文本、音频、视频等。嵌入一般由大模型的第一层进行处理，也可以由单独的模型进行处理。但不同模型的嵌入格式不同，因此 A 模型生成的嵌入数据，通常无法直接给 B 模型使用。

本质上，嵌入就是将离散对象，例如文本、图片甚至整个文件，映射到一个连续的向量空间，以方便模型进行处理和计算。虽然最常使用的嵌入是以单词为单位，但其实也可以句子、段落、章节、文档为单位，常用于检索增加生成（Retrival Augmented Generation，简称RAG）场景，例如知识库。

嵌入有很多种实现算法，例如 Word2Vec，它的原理是训练模型根据单词所在的上下文，生成单词的嵌入。它的思想为在相似上下文中出现的单词，其代表的意思更加相似。

大模型通常会在输入层中包含嵌入功能，并且在训练过程中该层的参数会进行更新，以便对其不断优化。上面的示例图表只有两个维度，即嵌入的向量空间是二维的，能够表示的分类非常有限。真实的大模型其向量空间的维度是非常多的，例如 120M 个参数的 GPT-2 有 768 维，175B 个参数的 GPT-3 有 12288 维。

文本转 token

token 主要由单词、标点符号、特殊符号等构成，因此第一步我们需要先将文本拆分（split）成各种单词和标点符号。

# 下载示例文本 get_file.py
import os
import urllib.request

if not os.path.exists("the-verdict.txt"):
    url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"
    file_path = "the-verdict.txt"
    urllib.request.urlretrieve(url, file_path)

token 转 ID

基于拆分后的单词和标点符号，创建一个字典，这样每个单词和标点符号都将为其分配一个 ID，之后我们就可以转 token 转成 ID 了，也可以将 ID 转成文本。

# 创建字典
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
    preprocessed = [item.strip() for item in preprocessed if item.strip()]
    all_words = sorted(set(preprocessed))
    vocab = {token:integer for integer, token in enumerate(all_words)}

将编码和解码的函数整合在一起，封装成一个对象：

# tokenizer.py
import re

class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {v: k for k, v in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        
        preprocessed = [item.strip() for item in preprocessed if item.strip() != '']
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s([,.:;?_!"()\'])', r'\1', text)
        return text

有了封装的对象后，获取文本 ID 就变得很简单了，示例如下：

tokenizer = SimpleTokenizerV1(vocab)
text = "Hello world"
ids = tokenizer.encode(text)
print(ids)

添加特殊 token

特殊 token 通常有两个用途：

处理不包括在字典中的未知单词或符号；
让模型对文本的理解变得更容易，例如显式的用一个特殊符号文档的开始和结束，这样模型不用自己判断；

# 改进后的版本，以便能够处理未知单词或符号
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {v: k for k, v in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        
        preprocessed = [item.strip() for item in preprocessed if item.strip() != '']
        # 将未知单词或符号替换为 '<UNK>'
        preprocessed = [item if item in self.str_to_int else '<UNK>' for item in preprocessed]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.:;?_!"()\'])', r'\1', text)
        return text

一些常用的特殊符号：

[BOS]，beginning of sequence，序列起点；
[EOS]，end of sequence，序列终点；
[PAD]，padding，占位符（为提高训练速度，训练过程中，允许一次提交多个输入；由于不同句子长度不同，因此需要使用占位符，让不同句子拥有相同的长度，方便模型使用相同的函数进行计算）；

字节对编码

GPT 模型并没有使用 [unk] 符号来处理未知单词，而是使用 BPE 算法（byte pair encoding，字节对编码），将单词进一步拆分为更小的单位，例如子词（subword）或者字母；

BPE 的工作过程如下：

先将训练数据中的所有单词拆分成单个字符，并在单词的末尾添加结束符，以便能够区分不同的单词；这样就得到了初始的词汇表
统计相邻字符对的出现频率；例如单词 low，其中 lo 算一次；ow 也算一次；
根据统计结果，将出现频率最高的字符对，作为一个词汇，添加到词汇表中；
重复第2、3步，直到字符对的最高频率降为 1 或者某个预设的值；

这个方法通过递归的方式找出在各种文档中使用过一次以上的字符组合；

BPE 的优点是将单词进一步拆分成字符对，有效提升了模型处理未知词汇的能力，以及模型的泛化能力。

数据抽样

大模型的工作方式是基于句子的已知部分，预测下一个单词。因此，接下来的一个工作，便是生成句子（作为输入）+ 下一个单词（作为目标）的嵌入，作为训练用的数据。在生成嵌入时，会用到一个叫滑动窗口的方法，它跟图像模型的采样方法有点像，区别在于这里的滑动窗口的起始位置不变，示例如下：

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()


enc_txt = tokenizer.encode(raw_text)
# print(len(enc_txt))

enc_sample = enc_txt[50:]

context_size = 4 # 滑动窗口大小
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]
# print(f"X: {x}")
# print(f"Y: {y}")

for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    # print(context, "----->", desired)
    # print(tokenizer.decode(context), "----->", tokenizer.decode([desired]))

以上代码的输出如下：

在将 token 转成嵌入之前，还有一项最后的工作，即将 token 转换成 tensor（张量，PyTorch 或 Tensorflow 等框架所使用的数据结构）

tensor 通常以浮点数来表示，以下示例使用原始字符串表示主要是为了方便理解 tensor 所表示的内容；

理论上目标张量只需包含下一个待预测的单词即可，但是这里貌似为了让输出和输出的长度相等，除了待预测的目标单词外，还包含了前面的部分。因此输入和输出存在重叠；

创建一个类，能够将输入文本，例如一本书，转换成“输入+输出”对，以便用于训练；

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # 将 txt 转换成 token_ids
        token_ids = tokenizer.encode(txt)

        # 使用滑动窗口的方法，按 max_length 将文本切分成多个样本
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i : i + max_length]
            target_chunk = token_ids[i + 1 : i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

    def tokenize(self, data):
        return self.tokenizer.encode(data)

将“输入+输出”对进行分批。相对单个输入输出对，按批次处理有助于提高训练效率，缩短训练时间；

# 基于 PyTorch 的 Dataloader，创建一个生成数据器的函数
def create_dataloader_v1(
    txt,
    batch_size=4,
    max_length=256,
    stride=128,
    shuffle=True,
    drop_last=True,
    num_workers=0,
):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers,
    )
    return dataloader

使用 create_dataloader_v1 创建一个数据加载器，加载小说 the-verdict.txt

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

# 输出如下（由于 max_length 设置为 4，所以长度只有 4 个token）
# [tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]

# 实际训练时，max_length 通常是 256

参数 stride = 1 用来设置滑动窗口的滑动步伐，1 表示每次滑动一个 token；如果是 4，则每次滑动 4 个 token；

批量参数 batch_size 用来表示每个训练批次的数据量，1 表示每次只输入一个 token 对；4 表示每次输入 4 对；批量小的话，需要的内存也小，但在训练过程中，会带来更多的模型参数波动幅度。合适的批量的大小，需要通过反复实验进行权衡，最终设置一个合理的折中值。

# 当 batch_size = 8, stride = 4 的效果
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=4, stride=4, shuffle=False
)
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

# [tensor([[   40,   367,  2885,  1464],
#         [ 1807,  3619,   402,   271],
#         [10899,  2138,   257,  7026],
#         [15632,   438,  2016,   257],
#         [  922,  5891,  1576,   438],
#         [  568,   340,   373,   645],
#         [ 1049,  5975,   284,   502],
#         [  284,  3285,   326,    11]]), 
#  tensor([[  367,  2885,  1464,  1807],
#         [ 3619,   402,   271, 10899],
#         [ 2138,   257,  7026, 15632],
#         [  438,  2016,   257,   922],
#         [ 5891,  1576,   438,   568],
#         [  340,   373,   645,  1049],
#         [ 5975,   284,   502,   284],
#         [ 3285,   326,    11,   287]])]

# 当 max_length=4, stride=4 时，每个批次的 token 对之间没有重叠，这样做的好处是可以减少过拟合；

创建 token 嵌入

初始化嵌入层的权重参数。参数值一开始是随机的，但随着训练次数的增加，这些参数会慢慢收敛，并在某个值附近稳定下来。

import torch

# 假定词汇表总共只有 6 个单词（BPE 有 50257 个单词）
vocab_size = 6
# 假设嵌入的尺寸为 3（GPT-3 的嵌入尺寸为 12288，尺寸越大，能够存放的参数个数越多，参数越多，模型就越能够记住这个 token 在多种不同场景下的使用方法）
output_dim = 3 


torch.manual_seed(123) # 用于初始化随机数
embedding = torch.nn.Embedding(vocab_size, output_dim) # 生成嵌入层
print(embedding.weight)

# 结果如下：
# tensor([[ 0.3374, -0.1778, -0.1690],
#         [ 0.9178,  1.5810,  1.3010],
#         [ 1.2753, -0.2010, -0.1606],
#         [-0.4015,  0.9666, -1.1481],
#         [-1.1589,  0.3255, -0.6315],
#         [-2.8400, -0.7849, -1.4096]], requires_grad=True

# 词汇表的 6 个token，对应此处的 6 行，每行代表一个token；3 列则对应嵌入的尺

# 例如取 token_id = 3，print(embedding(torch.tensor([3]))) 结果如下（对应前面的第4行）
# tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)

# 当输入的 token_ids 为 [2, 3, 5, 1] 时
input_ids = torch.tensor([2, 3, 5, 1])
print(embedding(input_ids))

# tensor([[ 1.2753, -0.2010, -0.1606],
#         [-0.4015,  0.9666, -1.1481],
#         [-2.8400, -0.7849, -1.4096],
#         [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)

编码单词位置

按照上面的方法，每个 token_id 会映射一个 embeding 向量，但是这里有个小问题，即同一个 token_id，不管它出现在文本中的任意位置，它映射后的向量都是一模一样的，这意味着向量中没有包含位置信息。但在真实的语言中，同一个单词，出现在句子中的不同位置时，它的意思是有可能产生变化的。因此，我们需要将单词的位置信息，也添加到嵌入中作为参数，以便让模型能够学到的位置信息的相关知识。

位置有两种表示方法：

绝对位置：每个位置用一个数值进行表示，然后添加到原嵌入上；(GPT 模型也是使用绝对位置，并会在训练过程中优化该参数值)；
相对位置：计算 token 之间的相对位置，即 token 之间的相对距离；该方法的优点是有助于模型学习到更好的泛化能力，在处理不同长度的句子时，效果更出色；

with open("the-verdict.txt", "r", encoding="utf-8") as f:
        raw_text = f.read()

max_length = 4 # 句子长度为 4 个单词
batch_size = 8 # 每批 8 个句子
dataloader = create_dataloader_v1(
    raw_text, batch_size=batch_size, max_length=max_length, stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
# print("Token IDs:\n", inputs)
# print("\nInputs shape:\n", inputs.shape)

# Token IDs: 每个批次 8 个句子，每个句子 4 个单词
#  tensor([[   40,   367,  2885,  1464],
#         [ 1807,  3619,   402,   271],
#         [10899,  2138,   257,  7026],
#         [15632,   438,  2016,   257],
#         [  922,  5891,  1576,   438],
#         [  568,   340,   373,   645],
#         [ 1049,  5975,   284,   502],
#         [  284,  3285,   326,    11]])

# Inputs shape:
# torch.Size([8, 4])


vocab_size = 50257 # 50257 个 token
output_dim = 256 # 每个 token 有 256 个参数，用来记住这个 token 的各种用法
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)
# torch.Size([8, 4, 256]) 每个 token 有 256 个维度

模型需要将文本数据转化成数值类型的向量数据（即嵌入）；文本或图像原本是离散型的数据，通过构建词汇表以及分配 token id，它们被转换成了连续的向量空间中的数值，以便能够输入神经网络模型进行计算；

3.Coding attention mechanisms

注意力机制有四种不同的变体，一些变体建立在另外一些变体的基础上，它们分别是：

简化的自注意力；
自注意力；
因果（掩码）注意力；
多头注意力（并行关注输入的不同维度）；

模型化长序列的问题

在注意力机制出现之前，一般使用 encoder-decoder 机制来处理实现语言翻译，例如将德语翻译为英语。由于不同语言的语法结构不同，显然一个单词一个单词的直接翻译是不可行的。为解决这个剖，encoder-decoder 引入了一个中间层（hidden state）；这个中间有点像是一层嵌入，用来存储句子的核心思想。然后再使用 decoder 将核心思想翻译为目标语言。

RNN 是实现 encoder-decoder 机制的一种架构，它先是逐个单词进行编码，然后将编译后的结果，和下一个单词，作为下一次编码的输入，直到迭代出最终结果，放到 hidden state 隐藏层中。但问题是在迭代的过程中，隐藏层一直在变化，存储的是最新结果，这会导致在处理复杂的长文本时，尤其是相互依赖的内容间隔较远时，RNN 无法记住前面的上下文。

通过注意力机制捕获数据依赖

为解决 RNN 因为无法获取前文，导致只因翻译短句的短板，引了一种新的注意力机制 Bahdanau；

Bahdanau 机制引入几年后，人们意外的发现，其实貌似没有必要使用 encoder + decoder 的机制，直接单独使用 decoder 就够了；

通过自注意力机制关注输入的不同部分

自注意力机制的核心，在于它能够识别一个输入序列中不同部分之前的相互关系和依赖，它通过计算权重来表示这种关系。传统的注意力机制则更多是关注两个序列的元素之间的关系（即传统的 encoder 和 decoder 机制）。

不带训练参数的自注意力机制

计算目标：上下文向量（context vector），用该向量来存储某个 token 的上下文信息，所以也叫增强嵌入向量（enriched embedding vector）；

import torch
inputs = torch.tensor(
    [
        [0.43, 0.15, 0.89],  # Your    (x^1)
        [0.55, 0.87, 0.66],  # journey (x^2)
        [0.57, 0.85, 0.64],  # starts  (x^3)
        [0.22, 0.58, 0.33],  # with    (x^4)
        [0.77, 0.25, 0.10],  # one     (x^5)
        [0.05, 0.80, 0.55],  # step    (x^6)
    ]
)

query = inputs[1]  # 第二个 token "journey"
attn_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(query, x_i)
print(attn_scores_2)  

# 结果：tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])

# 计算各个 token 注意力权重的占比，以便实现归一化
attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()
print("Attention weights:", attn_weights_2_tmp)
print("Sum:", attn_weights_2_tmp.sum())

# Attention weights: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
# Sum: tensor(1.0000)

# 为避免过大值或过小值导致计算溢出，改用 PyTorch 内置的归一化函数
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)
print("Attention weights:", attn_weights_2)
print("Sum:", attn_weights_2.sum())
# Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
# Sum: tensor(1.)

# 有了权重系数，接下来便是计算上下文向量 context vector
query = inputs[1]  # 此处取第二个 token，原单词 journey 作为示例
context_vec_2 = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i] * x_i
print(context_vec_2)
# tensor([0.4419, 0.6515, 0.5683])

计算各输入 token 的注意力权重

每个 token 相当于其他 token，都会有一个注意力权重系数。

# 计算各个 token 之间的注意力分数
attn_scores = torch.empty(6, 6)
for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores[i, j] = torch.dot(x_i, x_j)
print(attn_scores)

# 由于 for 循环的计算性能较弱，以上计算可用优化过的矩阵乘法，计算结果相同
attn_scores = inputs @ inputs.T
print(attn_scores)

# tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
#         [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
#         [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
#         [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
#         [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
#         [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])

整个计算过程由以下三步组成：

# 第二步：计算权重系数，dim = -1 表示在最后一维进行归一化计算
attn_weights = torch.softmax(attn_scores, dim=-1)
print(attn_weights) # 6 x 6 的矩阵

# tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
#         [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
#         [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
#         [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
#         [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
#         [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])

# 第三步：计算上下文向量
all_context_vecs = attn_weights @ inputs
print(all_context_vecs)

# tensor([[0.4421, 0.5931, 0.5790],
#         [0.4419, 0.6515, 0.5683],
#         [0.4431, 0.6496, 0.5671],
#         [0.4304, 0.6298, 0.5510],
#         [0.4671, 0.5910, 0.5266],
#         [0.4177, 0.6503, 0.5645]])

用可训练系数实现自注意力机制

计算注意力参数

引入三个可训练的注意力权重参数，它们分别是 W_q，W_k，W_v，这三个参数用来将 token 映射成向量 query, key, value；

有意思的是，input token 是三维的，但映射后 q, k, v 是二维的；但实际的 GPT 模型中，这三个维度通常是一样的。此处出于演示的目的，进行了简化，少了一维；

因为 W 参数是可训练的，所以初始值可先设置为一个随机值；

# 初始化训练参数
d_in = inputs.shape[1] # token 的嵌入此处是 3 维
d_out = 2 # 输出设置为 2 维
torch.manual_seed(123)
W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)

# 计算第二个 token 的 q，k，v
x_2 = inputs[1]
query_2 = x_2 @ W_query
key_2 = x_2 @ W_key
value_2 = x_2 @ W_value
print(query_2)
# tensor([0.4306, 1.4551])

权重参数和注意力参数的区别，前者用于神经网络中的输入层和输出层之间的连接计算，它有点像是一个卷积核，是一种数据变换和信息提取的操作。在训练过程中会不断优化收敛，训练结束后，它的值便固定了；后者用于表示一个句子中，每个 token 跟其他 token 的相互关系。因此它则跟上下文有关。同一个 token，放在不同的上下文中，它跟其他 token 的注意力参数不同；每一次输入，注意力参数都是动态变化的，但权重参数是固定的；

# 计算所有的q,k,v
querys = inputs @ W_query
keys = inputs @ W_key
values = inputs @ W_value
print("querys.shape:", querys.shape)
print("keys.shape:", keys.shape)
print("values.shape:", values.shape)
# querys.shape: torch.Size([6, 2])
# keys.shape: torch.Size([6, 2])
# values.shape: torch.Size([6, 2])

在前面简化版的注意力机制中，注意力分数是由各个 token 向量相互计算得出的，没有实际的意义。此处改进后的版本则是先使用权重参数对 token 进行三个维度的转换，分别转成 query, key, value；然后再由 query * key 得到注意力参数；

问：为什么 query * key 的结果可以作为注意力分数？

答：权重 W_q 应该是对所有的 token 的一种抽象和转换。W_k 貌似代表一个 token 在某种语境中的重要程度；

# 计算第二个 token 的注意力分数，注意此处的 keys.T
attn_scores_2 = query_2 @ keys.T
print("attn_scores_2: ", attn_scores_2)

# 结果是一个 6 维的数组，它表示该 token 相对其他 6 个token 的注意力重要程度
# attn_scores_2:  tensor([1.2705, 1.8524, 1.8111, 1.0795, 0.5577, 1.5440])

接下来可以将注意力分数归一化：

d_k = keys.shape[-1] # keys.shape = [6, 2]，此处取其参数维度 2
# 此处使用 softmax 进行权重归一化，但跟之前不同的点在于，额外除以 d_k 平方根，为什么呢？
# 答：该操作对归一化的点积值进行了缩放（缩小）。因为点积值过大的话，在反向传播的训练过程中，softmax 更像是阶梯函数，导致梯度接近于零，即没有平滑的过渡，这样会导致训练卡住，难以有效收敛；因此，自注意力机制也被叫做缩放点积注意力机制；
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
print(attn_weights_2)
# attn_weights_2:  tensor([0.1500, 0.2264, 0.2199, 0.1311, 0.0906, 0.1820])

得到归一化的权重参数后，最后一步便是 value vector 乘以相应的权重参数，得到最终的上下文向量 context vector；

query * key 归一化后得到 weight，weight * value 得到 context；（此处的 query, key, value 都是使用可训练参数对原 token 嵌入进行了提炼后的结果；但是暂时没有看到 token 位置信息在哪个环节纳入考虑和计算；

1
2
3

context_vec_2 = attn_weights_2 @ values
print("context_vec_2: ", context_vec_2)
# context_vec_2:  tensor([0.3061, 0.8210])

据说此处的 query, key, value 借鉴自数据库的概念，key 类似主键，value 类似值（记录），query 类似搜索查询；但我个人感觉不太像；

实现一个简单的自注意力 Python 类

import torch.nn as nn
import torch

# 使用 torch.rand 生成随机数，特点：纯随机，因此训练时会需要更多的时间收敛
class SelfAttention_V1(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, inputs):
        querys = inputs @ self.W_query
        keys = inputs @ self.W_key
        values = inputs @ self.W_value

        attn_scores = querys @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)

        context_vecs = attn_weights @ values
        return context_vecs

# 使用 nn.linear 生成随机数，特点：按一定规律随机，可加快训练收敛过程，避免一开始过于发散
class SelfAttention_V2(nn.module):
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, inputs):
        querys = self.W_query(inputs)
        keys = self.W_key(inputs)
        values = self.W_value(inputs)

        attn_scores = querys @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)

        context_vecs = attn_weights @ values
        return context_vecs

torch.manual_seed(789)
sa_v2 = SelfAttention_V2(d_in, d_out)
print(sa_v2(inputs))

# tensor([[-0.0739,  0.0713],
#         [-0.0748,  0.0703],
#         [-0.0749,  0.0702],
#         [-0.0760,  0.0685],
#         [-0.0763,  0.0679],
#         [-0.0754,  0.0693]], grad_fn=<MmBackward0>)

通过因果注意力机制屏蔽下一个单词

前面提到的自注意力机制，在计算上下文向量时，是基于整个完整的句子进行计算的。但在实际训练时，需要基于已知的部分内容进行计算，以推测下一个可能的单词，因此需要隐藏部分内容，并对已知内容的权重参数重新做归一化计算。

# 使用 SelfAttention_V2 计算权重参数
queries = sa_v2.W_query(inputs)
keys = sa_v2.W_key(inputs)
attn_scores = queries @ keys.T
attn_weights = torch.softmax(attn_scores / d_out**0.5, dim=-1)
print(attn_weights)

# tensor([[0.1921, 0.1646, 0.1652, 0.1550, 0.1721, 0.1510],
#         [0.2041, 0.1659, 0.1662, 0.1496, 0.1665, 0.1477],
#         [0.2036, 0.1659, 0.1662, 0.1498, 0.1664, 0.1480],
#         [0.1869, 0.1667, 0.1668, 0.1571, 0.1661, 0.1564],
#         [0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.1585],
#         [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
#        grad_fn=<SoftmaxBackward0>)

# 使用 PyTorch 自带的 tril 函数生成下三角矩阵
context_length = attn_scores.shape[0]
mask_simple = torch.tril(torch.ones(context_length, context_length))
print(mask_simple)
# tensor([[1., 0., 0., 0., 0., 0.],
#         [1., 1., 0., 0., 0., 0.],
#         [1., 1., 1., 0., 0., 0.],
#         [1., 1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 1., 0.],
#         [1., 1., 1., 1., 1., 1.]])

# 权重参数 * 下三角矩阵得到掩模后的结果
masked_simple = attn_weights * mask_simple
print(masked_simple)
# tensor([[0.1921, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
#         [0.2041, 0.1659, 0.0000, 0.0000, 0.0000, 0.0000],
#         [0.2036, 0.1659, 0.1662, 0.0000, 0.0000, 0.0000],
#         [0.1869, 0.1667, 0.1668, 0.1571, 0.0000, 0.0000],
#         [0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.0000],
#         [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
#        grad_fn=<MulBackward0>)

# 对掩模结果重新归一化
row_sums = masked_simple.sum(dim=-1, keepdim=True)
masked_simple_norm = masked_simple / row_sums
print(masked_simple_norm)
# tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
#         [0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
#         [0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
#         [0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
#         [0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
#         [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
#        grad_fn=<DivBackward0>)

前述的几个步骤，包含了两次的归一化，事实上可以合并成一次归一化，提高运算速度。

mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)
print(masked)
# tensor([[0.2899,   -inf,   -inf,   -inf,   -inf,   -inf],
#         [0.4656, 0.1723,   -inf,   -inf,   -inf,   -inf],
#         [0.4594, 0.1703, 0.1731,   -inf,   -inf,   -inf],
#         [0.2642, 0.1024, 0.1036, 0.0186,   -inf,   -inf],
#         [0.2183, 0.0874, 0.0882, 0.0177, 0.0786,   -inf],
#         [0.3408, 0.1270, 0.1290, 0.0198, 0.1290, 0.0078]],
#        grad_fn=<MaskedFillBackward0>)

attn_weights = torch.softmax(masked / keys.shape[-1] ** 0.5, dim=1)
print(attn_weights)
# tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
#         [0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
#         [0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
#         [0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
#         [0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
#         [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
#        grad_fn=<SoftmaxBackward0>)

引入 dropout 避免过拟合

torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5)  # 50% dropout, 实际训练时一般设置为 0.1 或 0.2
example = torch.ones(6, 6)
print(dropout(example))
# tensor([[2., 2., 2., 2., 2., 2.],
#         [0., 2., 0., 0., 0., 0.],
#         [0., 0., 2., 0., 2., 0.],
#         [2., 2., 0., 0., 0., 2.],
#         [2., 0., 0., 0., 0., 2.],
#         [0., 2., 0., 0., 0., 0.]])
# 由于 dropout 50%，因此有一半的值被置为 0, 一半的值被放大为 2, 以保持合计值不变

torch.manual_seed(123)
print(dropout(attn_weights))
# tensor([[2.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
#         [0.0000, 0.8966, 0.0000, 0.0000, 0.0000, 0.0000],
#         [0.0000, 0.0000, 0.6206, 0.0000, 0.0000, 0.0000],
#         [0.5517, 0.4921, 0.0000, 0.0000, 0.0000, 0.0000],
#         [0.4350, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
#         [0.0000, 0.3327, 0.0000, 0.0000, 0.0000, 0.0000]],
#        grad_fn=<MulBackward0>)

实现一个紧凑的 Causal Attention 类

import torch.nn as nn
import torch


class CausalAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        # register_buffer 会将缓存数据自动移动到 CPU 或 GPU 上, 无需再手工检查
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, inputs):
        b, num_tokens, d_in = inputs.shape  # batch size, number of tokens, input dim
        queries = self.W_query(inputs)
        keys = self.W_key(inputs)
        values = self.W_value(inputs)

        attn_scores = queries @ keys.transpose(
            1, 2
        )  # 只转置最后两个维度, 保持第一个维度不变, 因为第一个维度是 batch size
        attn_scores = attn_scores.masked_fill_(
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf
        )
        attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)
        context_vecs = attn_weights @ values
        return context_vecs

# 使用 CausalAttention 来计算输入的上下文向量
batch = torch.stack((inputs, inputs), dim=0)
print(batch.shape)
# torch.Size([2, 6, 3])

torch.manual_seed(123)
context_length = batch.shape[1]
ca = CausalAttention(d_in, d_out, context_length, 0.0)
context_vecs = ca(batch)
print("context_vecs.shape", context_vecs.shape)
# context_vecs.shape torch.Size([2, 6, 2])

将单头注意力扩展为多头注意力

一套权重参数对应的 Causal Attension 相当于一个单头注意力（single head attention，单个专家），我们可以训练多套权重参数，这样就可以获得多头注意力 multi-head attention（多个专家）。不同专家关注输入数据的不同维度，类似于使用不同的视角来看数据。

貌似多头注意力有可能和多模态场景结合起来？但又好像有些区别，多头注意力是同一份数据，进行不同的解读；多模态是不同的数据，解读后整合各自的解读结果，作为下一步行动的依据；

叠加多个单头注意力

# 创建一个能够聚合多个单头注意力的类
import torch.nn as nn
import torch
from CausalAttention import CausalAttention


class MultiHeadAttentionWrapper(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        self.heads = nn.ModuleList(
            [
                CausalAttention(d_in, d_out, context_length, dropout, qkv_bias)
                for _ in range(num_heads)
            ]
        )

    def forward(self, inputs):
        return torch.cat([head(inputs) for head in self.heads], dim=-1)

# 多头类使用示例
torch.manual_seed(123)
context_length = batch.shape[1]
d_in, d_out = 3, 2
mha = MultiHeadAttentionWrapper(d_in, d_out, context_length, 0.0, num_heads=2)
context_vecs = mha(batch)
print("context_vecs.shape", context_vecs.shape)
print(context_vecs)
# context_vecs.shape torch.Size([2, 6, 4]) # context_vec 的输出是 2 维，但由于有2个单头叠加，所以是 4 维
# tensor([[[-0.4519,  0.2216,  0.4772,  0.1063],
#          [-0.5874,  0.0058,  0.5891,  0.3257],
#          [-0.6300, -0.0632,  0.6202,  0.3860],
#          [-0.5675, -0.0843,  0.5478,  0.3589],
#          [-0.5526, -0.0981,  0.5321,  0.3428],
#          [-0.5299, -0.1081,  0.5077,  0.3493]],

#         [[-0.4519,  0.2216,  0.4772,  0.1063],
#          [-0.5874,  0.0058,  0.5891,  0.3257],
#          [-0.6300, -0.0632,  0.6202,  0.3860],
#          [-0.5675, -0.0843,  0.5478,  0.3589],
#          [-0.5526, -0.0981,  0.5321,  0.3428],
#          [-0.5299, -0.1081,  0.5077,  0.3493]]], grad_fn=<CatBackward0>)

用 weight split 实现多头注意力

MultiHeadAttentionWrapper 使用 for 循环来叠加多个单头，由于 for 循环是线性计算，因此可考虑使用矩阵的平行计算来提升性能。另外，还可以将 MultiHeadAttentionWrapper 和 CausalAttention 整合到一起，创建一个单独的 MultiHeadAttention 类；

import torch.nn as nn
import torch

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0  # 确保 d_out 可以被 num_heads 整除
        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, inputs):
        b, num_tokens, d_in = inputs.shape
        queries = self.W_query(inputs)
        keys = self.W_key(inputs)
        values = self.W_value(inputs)
        # 将 queries, keys, values 拆分为 num_heads 份
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        # 转置 (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        queries = queries.transpose(1, 2)
        keys = keys.transpose(1, 2)
        values = values.transpose(1, 2)

        attn_scores = queries @ keys.transpose(2, 3)
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        attn_scores.maksed_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        context_vec = (attn_weights @ values).transpose(1, 2)
        # 使用 view 方法 reshape: self.d_out = self.num_heads * self.head_dim
        # contiguous 方法可调整数据在内存上存储的顺序，与逻辑顺序保持一致，以提高后续的计算性能
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)

        context_vec = self.out_proj(context_vec)  # 输出映射（非必须）
        return context_vec

# 使用 MultiHeadAttention 计算 context_vecs
torch.manual_seed(123)
batch_size, context_length, d_in = batch.shape
d_out = 2
ha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)
context_vecs = mha(batch)
print("context_vecs.shape:", context_vecs.shape)
print(context_vecs)
# context_vecs.shape: torch.Size([2, 6, 4])
# tensor([[[-0.4519,  0.2216,  0.4772,  0.1063],
#          [-0.5874,  0.0058,  0.5891,  0.3257],
#          [-0.6300, -0.0632,  0.6202,  0.3860],
#          [-0.5675, -0.0843,  0.5478,  0.3589],
#          [-0.5526, -0.0981,  0.5321,  0.3428],
#          [-0.5299, -0.1081,  0.5077,  0.3493]],

#         [[-0.4519,  0.2216,  0.4772,  0.1063],
#          [-0.5874,  0.0058,  0.5891,  0.3257],
#          [-0.6300, -0.0632,  0.6202,  0.3860],
#          [-0.5675, -0.0843,  0.5478,  0.3589],
#          [-0.5526, -0.0981,  0.5321,  0.3428],
#          [-0.5299, -0.1081,  0.5077,  0.3493]]], grad_fn=<CatBackward0>)

以上示例中的 MultiHeadAttention，嵌入向量为 3 维，有 2 个 Head；在实际的 LLM 大模型中，维度和 Head 数量要多得多的多。例如最小的 GPT-2 模型，有 12 个 Head，嵌入向量为 768 维，总共 1.2 亿个参数；最大的 GPT-2 模型，有 25 个 Head，嵌入向量有 1600 维，总共有 15 亿个参数。

4.Implementing a GPT model from scratch to generate text

前面一章主要关注如何实现自注意力机制，这一章则主要关注构建 LLM 架构余下的部分。

编写 LLM 架构

LLM 架构的核心之一是 Transformer 模块，该模块包含前面实现的掩码多头注意力机制（Masked Multi-Head Attention）；

# 初始化模型的超参数
GPT_CONFIG_124M = {
    "vocab_size": 50257,  # Vocabulary size，词汇表大小
    "context_length": 1024,  # Context length，上下文长度
    "emb_dim": 768,  # Embedding dimension，嵌入向量的维度
    "n_heads": 12,  # Number of heads，专家数量
    "n_layers": 12,  # Number of layers，层数
    "drop_rate": 0.1,  # Dropout rate，随机失活率
    "qkv_bias": False,  # Query/Key/Value bias，是否开启QKV偏置（目的：增强模型灵活性，避免全部依赖权重参数；开启后，会对 QKV 的计算结果进行偏移
}

先搭一个骨架，然后将各个模块组合在一起。其中一个 Transformer 模块由四个子模块组成。

# 预定义一个简单的Transformer Block, 之后会被真正的 TransformerBlock 替换
import torch
import torch.nn as nn

class DummyTransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()

    def forward(self, x):
        return x

# 预定义一个简单的LayerNorm, 之后会被真正的 LayerNorm 替换
class DummyLayerNorm(nn.Module):
    def __init__(self, normlized_shape, eps=1e-5):
        super().__init__()

    def forward(self, x):
        return x

class DummyGPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # token embedding 和 position embedding
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        # dropout layer
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        # 预定义的 transformer block（此时每个 transformer 都一样，估计与实际的 GPT2 不同）
        self.trf_blocks = nn.Sequential(
            *[DummyTransformerBlock(cfg) for _ in range(cfg["n_layers"])]
        )
        # 预定义的最终 LayerNorm
        self.final_norm = DummyLayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits

import tiktoken
import torch
from DummyGPTModel import DummyGPTModel

GPT_CONFIG_124M = {
    "vocab_size": 50257,  # Vocabulary size
    "context_length": 1024,  # Context length
    "emb_dim": 768,  # Embedding dimension
    "n_heads": 12,  # Number of heads
    "n_layers": 12,  # Number of layers
    "drop_rate": 0.1,  # Dropout rate
    "qkv_bias": False,  # Query, Key, Value bias
}

tokenizer = tiktoken.get_encoding("gpt2")
batch = []
txt1 = "Every effort moves you"
txt2 = "Every day holds a"

batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
batch = torch.stack(batch, dim=0)
print(batch)
# tensor([[6109, 3626, 6100,  345],
#         [6109, 1110, 6622,  257]])

torch.manual_seed(123)
model = DummyGPTModel(GPT_CONFIG_124M)
logits = model(batch)
print("Output shape:", logits.shape)
print(logits)
# 2 个句子，每个句子 4 个 token，每个 token 有 50257 个维度，刚好对应了 GPT2 的词表大小，之后会经过 softmax 处理，得到每个 token 的概率，最终选择概率最大的 token 作为下一个 token
# Output shape: torch.Size([2, 4, 50257])
# tensor([[[-0.9289,  0.2748, -0.7557,  ..., -1.6070,  0.2702, -0.5888],
#          [-0.4476,  0.1726,  0.5354,  ..., -0.3932,  1.5285,  0.8557],
#          [ 0.5680,  1.6053, -0.2155,  ...,  1.1624,  0.1380,  0.7425],
#          [ 0.0447,  2.4787, -0.8843,  ...,  1.3219, -0.0864, -0.5856]],

#         [[-1.5474, -0.0542, -1.0571,  ..., -1.8061, -0.4494, -0.6747],
#          [-0.8422,  0.8243, -0.1098,  ..., -0.1434,  0.2079,  1.2046],
#          [ 0.1355,  1.1858, -0.1453,  ...,  0.0869, -0.1590,  0.1552],
#          [ 0.1666, -0.8138,  0.2307,  ...,  2.5035, -0.3055, -0.3083]]],
#        grad_fn=<UnsafeViewBackward0>)

使用归一层对激活值进行归一化

深度神经网络由很多层构成，但层数一多，训练过程中很容易出现梯度消失或者梯度爆炸问题，导致训练结果无法收敛。常见的解决方案是对输出结果进行归一化，以提高训练效率和稳定性。

# 添加归一化层的示例
torch.manual_seed(123)
batch_example = torch.randn(2, 5)
layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())  # ReLU 负责将负数置零
out = layer(batch_example)
print(out)
# tensor([[0.2260, 0.3470, 0.0000, 0.2216, 0.0000, 0.0000],
#         [0.2133, 0.2394, 0.0000, 0.5198, 0.3297, 0.0000]],
#        grad_fn=<ReluBackward0>)
mean = out.mean(
    dim=-1, keepdim=True
)  # dim=-1 表示最后一个维度，keepdim=True 表示保持整体维度，而不是压缩成一维
var = out.var(
    dim=-1, keepdim=True
)  # var 表示方差，即各点到均值的距离平方的平均值（如果再开方就是标准差）
print("Mean:", mean)
print("Variance:", var)
# Mean: tensor([[0.1324],
#         [0.2170]], grad_fn=<MeanBackward1>)
# Variance: tensor([[0.0231],
#         [0.0398]], grad_fn=<VarBackward0>)


out_norm = (out - mean) / torch.sqrt(var)  # 归一化，减去均值除以标准差
mean = out_norm.mean(dim=-1, keepdim=True)
var = out_norm.var(dim=-1, keepdim=True)
print("out_norm:", out_norm)
print("mean:", mean)
print("var:", var)
# out_norm: tensor([[ 0.6159,  1.4126, -0.8719,  0.5872, -0.8719, -0.8719],
#         [-0.0189,  0.1121, -1.0876,  1.5173,  0.5647, -1.0876]],
#        grad_fn=<DivBackward0>)
# mean: tensor([[9.9341e-09],
#         [5.9605e-08]], grad_fn=<MeanBackward1>)
# var: tensor([[1.0000],
#         [1.0000]], grad_fn=<VarBackward0>)
torch.set_printoptions(sci_mode=False)  # 关闭科学计数法
print("mean:", mean)
print("var:", var)
# mean: tensor([[    0.0000],
#         [    0.0000]], grad_fn=<MeanBackward1>)
# var: tensor([[1.0000],
#         [1.0000]], grad_fn=<VarBackward0>)

# 使用类来封装归一化层
class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(
            dim=-1, keepdim=True, unbiased=False
        )  # unbiased=False 表示方差计算公式为 1/N * sum((x - mean)^2)，否则为 1/(N-1) * sum((x - mean)^2)，因为 embedding 的维度通常很大，所以使用常规的 1/N-1 没有太大影响，但为了和 GPT2 保持一致，这里使用了 1/N 的方式, 也就是 unbiased=False
        norm_x = (x - mean) / torch.sqrt(var + self.eps)  # 加上 eps 防止分母为 0
        return (
            norm_x * self.scale + self.shift
        )  # scale 和 shift 为可学习参数, 跟图像处理中的缩放和平移很点像

# LayerNorm 类的使用示例
ln = LayerNorm(emb_dim=5)
out_ln = ln(batch_example)
mean = out_ln.mean(dim=-1, keepdim=True)
var = out_ln.var(dim=-1, keepdim=True, unbiased=False)
print("Mean:\n", mean)
print("Variance:\n", var)
# Mean:
# tensor([[    -0.0000],
#         [     0.0000]], grad_fn=<MeanBackward1>)
# Variance:
# tensor([[1.0000],
#         [1.0000]], grad_fn=<VarBackward0>)

相比传统的 batch normalization，使用 layer normalization 的好处是单独对特征维度（最后一维）进行归一化计算，可以不用管 batch 维度，这样在分布式计算中更方便；

至此已经完成 backbone 和归一化层，接下来将实现 GELU 激活层；

使用 GELU 激活函数实现前馈网络

问：为什么要使用前馈网络？

答：在得到归一化的结果后，需要对结果进行梳理和筛选，淘汰不合格的值，保留合格的值，以供下一步计算使用。此时常用所谓的激活函数来实现梳理和筛选。它会让离散的结果值，变得更加平滑，这样有利于计算值与值之间的变化梯度。进而可以根据变化梯度，判断权重参数的调整幅度；

传统的深度学习方法中，经常使用 ReLU 作为实现非线性计算的激活函数。但是它在某些场景中过于简单，无法取得最好的效果。因此，在 LLM 中引入了 GELU（Gaussian error linear unit，高斯误差线性单元，它使用基于高斯分布即正态分布的累积分布函数，来实现更平滑的非线性计算，而不是像 ReLU 基于简单的阈值）；

# 使用 matplotlib 用图表展示 GELU 更加平滑的过渡效果
from DummyGPTModel import GELU
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

gelu, relu = GELU(), nn.ReLU()
x = torch.linspace(-3, 3, 100)  # 创建 100 个点，范围从 -3 到 3
y_gelu, y_relu = gelu(x), relu(x)
plt.figure(figsize=(8, 3))
for i, (y, label) in enumerate(zip([y_gelu, y_relu], ["GELU", "ReLU"]), 1):
    plt.subplot(1, 2, i)
    plt.plot(x, y)
    plt.title(f"{label} activation function")
    plt.xlabel("x")
    plt.ylabel(f"{label}(x)")
    plt.grid(True)

plt.tight_layout
plt.show()

结果如下：

从图上可见 GELU 在有些位置会出现负数，虽然负数的值不大，但它的存在，能够让模型在训练过程中捕获到更细微的参数变化，从而让训练结果的收敛变得更加容易。

# 实现一个简单的前馈网络，它由三个层构成，分别是两个线性层夹一个 GELU 激活层
class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], cfg["emb_dim"] * 4),
            GELU(),
            nn.Linear(cfg["emb_dim"] * 4, cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)

ffn = FeedForward(GPT_CONFIG_124M)
x = torch.rand(2, 3, 768)
out = ffn(x)
print(out.shape)
# torch.Size([2, 3, 768])

输入层的每个 token 原本是 768 维，然后线性层将其扩大 4 倍，映射为 768 * 4 = 3072 维，据说这么做是为了带来更丰富的表达空间，有待观察其实际效果；

至此已实现 Transformer block 里面的 GELU 激活层和前馈网络层（示例发下），下一步开始添加快捷连接。

添加跳跃连接

Shortcut connections，跳跃连接，或者叫残差连接。它的目标是为了解决随着神经网络的深度增加，在反向逐级传播的过程中，出现梯度消失的问题，它的核心思想源自于计算机视觉中的残差网络（residual network，即大名鼎鼎的 ResNet）。其关键在于将输入重新加回到结果值中，这样一来，神经网络实际学习的并不是数据的映射，而是基于原始数据，进行一定程度的平移变换，这样可以保留原来的梯度；

# 构建一个由 5 个 layer 组成的简单神经网络模型
class ExampleDeepNeuralNetwork(nn.Module):
    def __init__(self, layer_sizes, use_shorcut):
        super().__init__()
        self.use_shorcut = use_shorcut
        self.layers = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(layer_sizes[0], layer_sizes[1]),
                    nn.GELU(),
                ),
                nn.Sequential(
                    nn.Linear(layer_sizes[1], layer_sizes[2]),
                    nn.GELU(),
                ),
                nn.Sequential(
                    nn.Linear(layer_sizes[2], layer_sizes[3]),
                    nn.GELU(),
                ),
                nn.Sequential(
                    nn.Linear(layer_sizes[3], layer_sizes[4]),
                    nn.GELU(),
                ),
                nn.Sequential(
                    nn.Linear(layer_sizes[4], layer_sizes[5]),
                    nn.GELU(),
                ),
            ]
        )

    def forward(self, x):
        for layer in self.layers:
            layer_output = layer(x)  # 计算当前层的输出
            # 如果输出和输入的维度一样，且 use_shorcut 为 True，则使用跳跃连接
            if self.use_shorcut and x.shape == layer_output.shape:
                x = x + layer_output
            else:
                x = layer_output
        return x

# ExampleDeepNeuralNetwork 使用示例
layer_sizes = [3, 3, 3, 3, 3, 1]
sample_input = torch.tensor([[1.0, 0.0, -1.0]])  # 模拟一个简单的输入
torch.manual_seed(123)
model_without_shortcut = ExampleDeepNeuralNetwork(layer_sizes, use_shorcut=False)

def print_gradients(model, x):
    output = model(x)
    target = torch.tensor([[0.0]])  # 为简单起见，用 0 作为目标值
    loss = nn.MSELoss()  # 初始化损失函数
    loss = loss(output, target)  # 比较输出和目标值间的差距（即损失）
    loss.backward()  # 反向传播，计算损失的梯度
    for name, param in model.named_parameters():
        if "weight" in name:
            print(f"{name} has gradient mean of {param.grad.abs().mean().item()}")

print_gradients(model_without_shortcut, sample_input)
# 从下面打印结果可见，随着网络层数的增加，反向传播时，梯度越来越小，逐渐消失
# layers.0.0.weight has gradient mean of 0.0002017411752603948
# layers.1.0.weight has gradient mean of 0.00012011770741082728
# layers.2.0.weight has gradient mean of 0.0007152437465265393
# layers.3.0.weight has gradient mean of 0.0013988513965159655
# layers.4.0.weight has gradient mean of 0.005049604922533035

torch.manual_seed(123)
model_with_shortcut = ExampleDeepNeuralNetwork(layer_sizes, use_shorcut=True)
print_gradients(model_with_shortcut, sample_input)
# 从下面打印结果可见，使用跳跃连接后，梯度不会消失，保持稳定
# layers.0.0.weight has gradient mean of 0.22186797857284546
# layers.1.0.weight has gradient mean of 0.207092747092247
# layers.2.0.weight has gradient mean of 0.32923877239227295
# layers.3.0.weight has gradient mean of 0.2667771875858307
# layers.4.0.weight has gradient mean of 1.3268063068389893

在 Transformer 模块中连接注意力层和线性层

一个 GPT 模型由多个 Transformer 模块组成，而一个 Transformer 模块由以下几个部分组成，它们分别是：

Multi-head attention，多头注意力
Normalization layer，归一化
Dropout，随机失活
Feed Forward layer，前向传播
GELU activation，激活函数

其中自注意力和多头注意力负责分析输入的各个 token 之间的相互关系，Feed Forward 则负责对单个 token 的数据进行转换。二者结合起来，不仅有助于模型更好的理解和处理输入数据，也有助于找到背后的规律。

# 创建一个 Transformer 类，它主要由 MultiHeadAttention 和 FeedForward 两部分构成
class TransformerBlock(nn.Modlue):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["num_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"],
        )
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # forward 由两部分构成，先使用多头注意力计算 token 间的相互关系，然后用 FeedForward 进行数据转换（激活），为下一轮计算做准备
        shortcut = x
        x = self.norm1(x)
        x = self.att(x)  # att 即 MultiHeadAttention
        x = self.drop_shortcut(x)
        x = x + shortcut  # 跳跃连接

        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)  # ff 即 FeedForward
        x = self.drop_shortcut(x)
        x = x + shortcut
        return x

# TransformerBlock 使用示例
torch.manual_seed(123)
x = torch.rand(2, 4, 768)
block = TransformerBlock(GPT_CONFIG_124M)
output = block(x)
print("Input shape:", x.shape)
print("Output shape:", output.shape)
# Input shape: torch.Size([2, 4, 768])
# Output shape: torch.Size([2, 4, 768])

从 Input shape 和 Output shape 可以看出输入和输出的 shape 保持不变，这并非意外，而是有意为之，以便可以在每一层 layer 的计算过程中，保持高效；虽然 shape 不变，但输出的张量已经包含了上下文信息。至此一个完整的 Transformer Block 组件已经形成。

编写 GPT 模型

GPT-2 模型的结构如下，分别包含：

token 转嵌入层 + 添加位置信息嵌入层；
12 个 Transformer 模块（每个模块有四层，叠加后有 48 层）；
最终归一化层 + 线性输出层；

最终结果是一个 shape 为 [4, 50257] 的张量（形状跟输入一样）；模型的最终目前是基于该张量，得到下一个最大概率的 token 是什么。

生成文本

最后一步是将模型生成的张量 [batch_size, num_token, vocab_size] 转换成目标语言的文本（例如英文或者中文）。

过程其实非常简单，输出的张量的最后一行，即是基于输入中的最后一个单词与下一个最有可能出现的单词之间的概率值，此时只需取出该行，然后找出最大概率的那个值的索引。然后根据索引，从词汇表里面找到对应的单词即可。

def generate_text_simple(model, idx, max_new_tokens, context_size):
    # max_new_tokens 用来控制生成的文本长度
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]  # 根据上下文的长度，截取上下文（从后往前截）
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]  # 取最后一个 token 的 logits
        probas = torch.softmax(
            logits, dim=-1
        )  # 对 logits 进行 softmax 处理，得到每个 token 的概率，但其实这步有点多余，因为最大值不会变，直接计算最大值的索引就可以了
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)
        idx = torch.cat(
            (idx, idx_next), dim=-1
        )  # 将新生成的 token 拼接到 idx 的后面， 为下一轮生成做准备
    return idx


start_context = "Hello, I am"
encoded = tokenizer.encode(start_context)
print("encoded:", encoded)
encoded_tensor = torch.tensor(encoded).unsqueeze(
    0
)  # 增加一个维度，变成 (1, 6),模拟批量处理，此时 batch_size=1
print("encoded_tensor.shape:", encoded_tensor.shape)
# encoded: [15496, 11, 314, 716]
# encoded_tensor.shape: torch.Size([1, 4])

# 文本生成示例
model.eval()  # 将模型设置为评估模式，以便关闭 dropout(训练时才需要开启 dropout)
out = generate_text_simple(
    model=model,
    idx=encoded_tensor,
    max_new_tokens=6,
    context_size=GPT_CONFIG_124M["context_length"],
)
print("Output:", out)
print("Ootput length:", len(out[0]))
# Output: tensor([[15496,    11,   314,   716, 27018,  7283, 46275, 41426, 33167, 33239]])
# Ootput length: 10

decoded_text = tokenizer.decode(out.squeeze(0).tolist())
print("Decoded text:", decoded_text)
# Decoded text: Hello, I am Feature IT snowballProtect youngstersMu
# Feature IT snowballProtect youngstersMu 是生成的内容，由于参数还未训练过，所以此时生成的答案约等于胡言乱语

5.Pretraining on unlabeled data

接下来将使用数据集，对模型参数进行训练，并选择合适的指标，对模型训练后的质量进行评估。

评估文本生成模型

使用 GPT 生成文本

import torch
import tiktoken

from GPTModel import GPTModel, generate_text_simple

GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 256,  # 上下文长度小一些，不然笔记本训练时容易卡死
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}
torch.manual_seed
model = GPTModel(GPT_CONFIG_124M)
model.eval()


def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
    encoded_tensor = torch.tensor(encoded).unsqueeze(
        0
    )  # 增加一个维度，模拟批量处理，此时 batch_size=1
    return encoded_tensor


def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0)
    return tokenizer.decode(flat.tolist())


start_context = "Every effort moves you"
tokenizer = tiktoken.get_encoding("gpt2")

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer),
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"],
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
# Output text:
#  Every effort moves you icatorsthirdwenSave Black energies nationality retreating!!!!uz
# 由于还未训练，所以此时模型生成的文本暂时没有意义

计算文本生成损失

# 输入，为简单示例起见，上下文长度为3，即句子只有 3 个单词
inputs = torch.tensor(
    [
        [16833, 3626, 6100],  # ["Every", "effort", "moves"]
        [40, 1107, 588],  # ["I", "really", "like"]
    ]
)

# 目标输出，可用于计算实际输出和目标输出之间的损失
targets = torch.tensor(
    [
        [3626, 6100, 345],  # ["effort", "moves", "you"]
        [1107, 588, 11311],  # ["really", "like", "chocolate"]
    ]
)

with torch.no_grad():
    logits = model(inputs)
probas = torch.softmax(logits, dim=-1)
print("probas.shape", probas.shape)
# probas.shape torch.Size([2, 3, 50257])

token_ids = torch.argmax(probas, dim=-1, keepdim=True)
print("Token IDs:\n", token_ids)
# Token IDs:
#  tensor([[[44689],
#          [12200],
#          [ 3288]],

#         [[38526],
#          [19969],
#          [44901]]])

print(f"Targets batch 1: {token_ids_to_text(targets[0], tokenizer)}")
print(f"Outputs batch 1: {token_ids_to_text(token_ids[0].flatten(), tokenizer)}")
# 期望的输出
# Targets batch 1:  effort moves you
# 目前实际的输出，肉眼可见毫不相关
# Outputs batch 1:  remem nativesillery

接下来将设计一个函数，用来计算实际输出和目标输出之间的差距，然后基于该差距，调整权重参数值。由于目标是预测下一个单词，那么只要 softmax 得到的最大概率值索引，刚好对应词汇表中的预期单词的索引，就算成功了。这个概率值越大，理论上来说，模型预测的准确性越高。

# 目标输出的 token 目前在该批次中的概率值，训练的目标，就是让该概率值相对最大化
text_idx = 0
target_probas_1 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 1: ", target_probas_1)
# Text 1:  tensor([3.7347e-05, 3.9809e-05, 2.0112e-05])

text_idx = 1
target_probas_2 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 2:", target_probas_2)
# Text 2: tensor([1.6814e-05, 1.6684e-05, 1.2155e-05])

# torch.log 用于计算张量中每个元素的自然对数值，即 ln(x)
# 相对于直接使用原始概率值进行计算，将其转换成对数后，更方便后续的损失计算处理(计算结果更稳定，避免出现)
log_probas = torch.log(torch.cat((target_probas_1, target_probas_2)))
print("log_probas:", log_probas)
# log_probas: tensor([-10.0696, -11.0272, -12.2895, -10.4855, -10.3192, -11.6028])

# 训练的目标，便是让平均值尽量接近 0
avg_log_probas = torch.mean(log_probas)
print(avg_log_probas)
# tensor(-11.4343)

print("Logits shape:", logits.shape)
print("Targets shape:", targets.shape)
# Logits shape: torch.Size([2, 3, 50257])
# Targets shape: torch.Size([2, 3])

logits_flat = logits.flatten(0, 1)  # 只 flat 前两个维度
targets_flat = targets.flatten()
print("Flattened logits:", logits_flat.shape)
print("Flattened targets:", targets_flat.shape)
# Flattened logits: torch.Size([6, 50257])
# Flattened targets: torch.Size([6])

# 通过交叉熵计算损失
# logits_flat 有 50257 的维度，此处 cross_entropy 函数做了好几个动作，包括：
# 根据 target tokenId 索引从 50257 个元素中取到对应的概率值，然后 log 它并计算平均值
loss = torch.nn.functional.cross_entropy(logits_flat, targets_flat)
print(loss)
# tensor(11.1071)

损失函数的计算结果有两种表示形式，一种是交叉熵，还有一种是交叉熵的对数形式，称为 Perplexity（困惑度），它的取值范围在 1 到词汇表大小 N 之间。1 表示结果只有一个单词，没有困惑。50 表示有 50 个候选单词。N 表示词汇表中的所有单词都有可能是候选结果，相当于随机了。

计算训练集和验证集的损失

为了方便在有限的空间内进行展示，在以下示意图中，训练的数据集最大长度 max_length 为 6（实际上前面演示的 GPT_CONFIG_124M 用的是 256）; 加载的数据分成两部分，大部分做为训练数据集，小部分做为验证数据集；

此处将长度设置为固定值进行训练，但实际上也可以作为变量进行训练。这样的好处是可以让模型从文本中学习到更泛化的理解能力；

# 使用 the-verdict 文本，对进行训练，训练前，先设计好损失计算的函数
file_path = "the-verdict.txt"
with open(file_path, "r", encoding="utf-8") as file:
    text_data = file.read()

total_characters = len(text_data)
total_tokens = len(tokenizer.encode(text_data))
print("Characters:", total_characters)
print("Tokens:", total_tokens)
# Characters: 20479
# Tokens: 5145

train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]

torch.manual_seed(123)
train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,  # 此处的 batch size 非常小，实际上训练时取 1024 甚至更大并不罕见
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0,
)
val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],  # context_length 为 256
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0,
)

print("Train loader:")
for x, y in train_loader:
    print(x.shape, y.shape)
# Train loader:
# torch.Size([2, 256]) torch.Size([2, 256])
# torch.Size([2, 256]) torch.Size([2, 256])
# torch.Size([2, 256]) torch.Size([2, 256])
# torch.Size([2, 256]) torch.Size([2, 256])
# torch.Size([2, 256]) torch.Size([2, 256])
# torch.Size([2, 256]) torch.Size([2, 256])
# torch.Size([2, 256]) torch.Size([2, 256])
# torch.Size([2, 256]) torch.Size([2, 256])
# torch.Size([2, 256]) torch.Size([2, 256])
print("Validation loader:")
for x, y in val_loader:
    print(x.shape, y.shape)
# Validation loader:
# torch.Size([2, 256]) torch.Size([2, 256])


def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch = input_batch.to(device)
    target_batch = target_batch.to(device)
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(
        logits.flatten(0, 1), target_batch.flatten()
    )
    return loss


def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.0
    if (len(data_loader)) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches  # 所有批次的总平均损失


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
with torch.no_grad():
    train_loss = calc_loss_loader(train_loader, model, device)
    val_loss = calc_loss_loader(val_loader, model, device)
print("Training loss:", train_loss)
print("Validation loss:", val_loss)
# Training loss: 10.99226008521186
# Validation loss: 10.980721473693848

有了损失计算函数后，那一步就可以开始正式的训练了；

训练大模型

训练主要由以下几个步骤组成：

def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval()  # 切换为评估模式，用于临时关闭 dropout，以便得到准确的评估结果
    with torch.no_grad():  # 临时关闭梯度计算，节省内存和计算资源
        train_loss = calc_loss_loader(
            train_loader, model, device, num_batches=eval_iter
        )
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
    model.train()  # 恢复训练模式
    return train_loss, val_loss


def generate_and_print_sample(model, tokenizer, device, start_context):
    model.eval()
    context_size = model.pos_emb.weight.shape[0]
    encoded = text_to_token_ids(start_context, tokenizer).to(device)
    with torch.no_grad():
        token_ids = generate_text_simple(
            model=model, idx=encoded, max_new_tokens=50, context_size=context_size
        )
    decoded_text = token_ids_to_text(token_ids, tokenizer)
    print(decoded_text.replace("\n", " "))
    model.train()


def train_model_simple(
    model,
    train_loader,
    val_loader,
    optimizer,
    device,
    num_epochs,
    eval_freq,
    eval_iter,
    start_context,
    tokenizer,
):
    train_losses, val_lossed, track_tokens_seen = [], [], []
    tokens_seen, global_step = 0, -1
    for epoch in range(num_epochs):
        model.train()  # 强制进入训练模式，避免其他地方不小心将其设置为评估模式
        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()  # 清空上一步遗留的梯度值
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward()  # 反向传播，计算当前梯度值
            optimizer.step()  # 使用梯度值，更新模型参数
            tokens_seen += input_batch.numel()
            global_step += 1

            if global_step % eval_freq == 0:  # 每 eval_freq 步评估一次
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter
                )
                train_losses.append(train_loss)
                val_lossed.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(
                    f"Ep {epoch + 1} (Step {global_step:06d}):"
                    f" Train loss {train_loss:.3f}, "
                    f"Val loss {val_loss:.3f},"
                )

        generate_and_print_sample(model, tokenizer, device, start_context)
    return train_losses, val_lossed, track_tokens_seen


torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)
num_epochs = 10
train_losses, val_lossed, tokens_seen = train_model_simple(
    model,
    train_loader,
    val_loader,
    optimizer,
    device,
    num_epochs=num_epochs,
    eval_freq=5,
    eval_iter=5,
    start_context="Every effort moves you",
    tokenizer=tokenizer,
)
# Ep 1 (Step 000000): Train loss 9.781, Val loss 9.923,
# Ep 1 (Step 000005): Train loss 8.057, Val loss 8.332,
# Every effort moves you,.
# Ep 2 (Step 000010): Train loss 6.763, Val loss 7.044,
# Ep 2 (Step 000015): Train loss 6.146, Val loss 6.628,
# Every effort moves you, and, and, and, and, and, and, and, and,, and,, and, and, and, and, and, and, and,, and, and, and, and, and, and,, and
# Ep 3 (Step 000020): Train loss 13.849, Val loss 14.409,
# Ep 3 (Step 000025): Train loss 5.536, Val loss 6.441,
# Every effort moves you, and to to to the to to the to the to the to the to the to the to the to the
# Ep 4 (Step 000030): Train loss 5.181, Val loss 6.360,
# Ep 4 (Step 000035): Train loss 5.026, Val loss 6.373,
# Every effort moves you of the picture to the picture to the picture to the picture to the picture to the picture to the picture to the picture to the picture to the the picture to the picture to the my to the picture to the picture to the of the picture to the
# Ep 5 (Step 000040): Train loss 4.689, Val loss 6.335,
# Every effort moves you know it was not to have to have to have to have to have to have to have--and, and I was, and I had been the picture--as Jack himself, and I had been to have to have to have to have to have
# Ep 6 (Step 000045): Train loss 4.133, Val loss 6.177,
# Ep 6 (Step 000050): Train loss 3.686, Val loss 6.150,
# Every effort moves you know it was not to have to have to see the fact of the last word.
# Ep 7 (Step 000055): Train loss 3.395, Val loss 6.097,
# Ep 7 (Step 000060): Train loss 2.701, Val loss 6.093,
# Every effort moves you know it was not that the picture--I had the fact the fact of the donkey, I had been--I
# Ep 8 (Step 000065): Train loss 2.494, Val loss 6.123,
# Ep 8 (Step 000070): Train loss 2.166, Val loss 6.153,
# Every effort moves you know it was not that the picture for nothing--I told Mrs.
# Ep 9 (Step 000075): Train loss 1.793, Val loss 6.197,
# Ep 9 (Step 000080): Train loss 1.471, Val loss 6.180,
# Every effort moves you know," was not that my hostess was "interesting": on the last word.
# Ep 10 (Step 000085): Train loss 1.071, Val loss 6.233,
# Every effort moves you know," was not that my hostess was "interesting": on that point I could have given Miss Croft the

import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
    fig, ax1 = plt.subplots(figsize=(5, 3))
    ax1.plot(epochs_seen, train_losses, label="Trainning loss")
    ax1.plot(epochs_seen, val_losses, label="Validation loss", linestyle="-.")
    ax1.set_xlabel("Epochs")
    ax1.set_ylabel("Loss")
    ax1.legend(loc="upper right")
    ax1.xaxis.set_major_locator(MaxNLocator(integer=True))
    ax2 = ax1.twiny()
    ax2.plot(tokens_seen, train_losses, alpha=0)
    ax2.set_xlabel("Tokens seen")
    fig.tight_layout()
    plt.show()

训练过程中，一开始 val_loss 下降的挺快，但很快进入了平台期，之后 train_loss 仍然继续下降。二者开始出现明显的分离，说明训练参数与训练数据之间开始出现过拟合了。

由于训练数据集非常小，而且还训练了 10 轮，所以很容易出现过拟合。反之，如果训练集很大，例如几万本书，而且只训练一轮的话，那么就不会出现过拟合了；

文本生成策略的随机性控制

温度缩放

# 使用模型生成输出
model.to("cpu")  # 将模型移回 CPU，推理时不需要 GPU
model.eval()  # 切换为评估模式，避免 dropout 等操作影响推理结果

tokenizer = tiktoken.get_encoding("gpt2")
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=25,
    context_size=GPT_CONFIG_124M["context_length"],
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

# Every effort moves you know," was not that my hostess was "interesting": on that point I could have given Miss Croft the

这里发现一个问题，即对于训练好的模型，相同的输入，每次调用时，总是会得到相同的输出。这是因为模型在训练时，dropout 的概率值是固定的，所以每次调用时，模型的行为是确定的。

为了更好的控制生成文本的随机性，以应对不同的使用场景，此处引入一个新的参数 temperature；

调低温度时，即 T < 1，减少模型生成结果的随机性，以便应对需要严谨结果的场景，例如生成代码；
调高温度时，即 T > 1，增加模型生成结果的随机性，以便应对需要引入创意的场景，例如诗歌；

它的原理是模型在输出 token 的概率分布前，对这个概率进行缩放。放大概率，意味着结果更有确定性。缩小概率，变得平滑，则结果会更加富有多样性。

new_logit = old_logit / temperature，所以 T 越小，logit 越大；T 越大，logit 越小

import torch

vocab = {
    "closer": 0,
    "every": 1,
    "effort": 2,
    "forward": 3,
    "inches": 4,
    "moves": 5,
    "pizza": 6,
    "toward": 7,
    "you": 8,
}

inverse_vocab = {v: k for k, v in vocab.items()}

next_token_logits = torch.tensor(
    [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)

probas = torch.softmax(next_token_logits, dim=0)

def print_sampled_tokens(probas):
    torch.manual_seed(123)
    sample = [torch.multinomial(probas, num_samples=1).item() for _ in range(1000)]
    sampled_ids = torch.bincount(torch.tensor(sample))
    for i, freq in enumerate(sampled_ids):
        print(f"{inverse_vocab[i]} : {freq}")


print_sampled_tokens(probas)
# 通过 multinomial 引入了随机性
# closer: 71
# every: 2
# effort: 0
# forward: 544
# inches: 2
# moves: 1
# pizza: 0
# toward: 376
# you: 4

# 使用不同的 temperature 参数输出不同的结果
def softmax_with_temperature(logits, temperature=1.0):
    scaled_logits = logits / temperature
    return torch.softmax(scaled_logits, dim=0)

Top-K 取样

较大的 temperature 虽然会增加多样性，但同时也意味着生成的结果存在不可控的情况，即有可能会生成一个没有意义的文本。因此，可搭配 Top-K 策略一起使用。Top-K 从名字可以看出来，就是取前 K 个最大概率，其他选项则忽略，这样可以减少无意义的情况出现。

top_k = 3
top_logits, top_pos = torch.topk(next_token_logits, top_k) # 结果是降序排列
print("Top logits:", top_logits)
print("Top positions:", top_pos)
# Top logits: tensor([6.7500, 6.2800, 4.5100])
# Top positions: tensor([3, 7, 0])

new_logits = torch.where(
    condition=next_token_logits < top_logits[-1],  # top_logits 已是降序排列
    input=torch.tensor(float("-inf")),
    other=next_token_logits,
)
print("New logits:", new_logits)
# New logits: tensor([4.5100,   -inf,   -inf, 6.7500,   -inf,   -inf,   -inf, 6.2800,   -inf])

修改文本生成函数

将 temperature 和 top-k 结合起来使用；

def generate(
    model, idx, max_new_tokens, context_size, temperature=0.0, top_k=None, eos_id=None
):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]  # 只取最后一步的 logits

        if top_k is not None:
            top_logits, _ = torch.topk(logits, top_k)
            min_val = top_logits[:, -1]
            logits = torch.where(
                logits < min_val, torch.tensor(float("-inf")).to(logits.device), logits
            )

        if temperature > 0.0:
            logits = logits / temperature
            probs = torch.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
        else:
            idx_next = torch.argmax(logits, dim=-1, keepdim=True)

        if idx_next == eos_id:
            break

        idx = torch.cat((idx, idx_next), dim=1)  # 拼接新的 token

    return idx

torch.manual_seed(123)
token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=15,
    context_size=GPT_CONFIG_124M["context_length"],
    top_k=25,
    temperature=1.4,
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
# Output text:
# Every effort moves you stand to work on surprise, a one of us had gone
# with randomness, and the other side of the world. I was

加载和保存模型参数

# 直接调用 save 方法，即可保存模型的参数
torch.save(model.state_dict(), "model.pth")

model = GPTModel(GPT_CONFIG_124M)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)

model.load_state_dict(torch.load("model.pth"), map_location=device)
model.eval()

# 可同时保存模型和优化器的参数
torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    "model_and_optimizer.pth",
)

# 加载模型和优化器的参数
checkpoint = torch.load("model_and_optimizer.pth", map_location=device)
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train()

加载预训练参数

加载预训练的模型参数前，需要先下载它们，以下是下载模型参数的代码；

# 下载模型参数
import os
import urllib.request

import json
import numpy as np
import tensorflow as tf
from tqdm import tqdm

def download_and_load_gpt2(model_size, models_dir):
    # 有多个不同参数数量的版本
    allowed_sizes = ("124M", "355M", "774M", "1558M")
    if model_size not in allowed_sizes:
        raise ValueError(f"Model size not in {allowed_sizes}")

    # 待下载的文件
    model_dir = os.path.join(models_dir, model_size)
    base_url = "https://openaipublic.blob.core.windows.net/gpt-2/models"
    backup_base_url = "https://f001.backblazeb2.com/file/LLMs-from-scratch/gpt2"
    filenames = [
        "checkpoint", "encoder.json", "hparams.json",
        "model.ckpt.data-00000-of-00001", "model.ckpt.index",
        "model.ckpt.meta", "vocab.bpe"
    ]

    # 下载文件
    os.makedirs(model_dir, exist_ok=True)
    for filename in filenames:
        file_url = os.path.join(base_url, model_size, filename)
        backup_url = os.path.join(backup_base_url, model_size, filename)
        file_path = os.path.join(model_dir, filename)
        download_file(file_url, file_path, backup_url)

    # 加载配置和参数
    tf_ckpt_path = tf.train.latest_checkpoint(model_dir)
    settings = json.load(open(os.path.join(model_dir, "hparams.json"), "r", encoding="utf-8"))
    params = load_gpt2_params_from_tf_ckpt(tf_ckpt_path, settings)

    return settings, params


def download_file(url, destination, backup_url=None):
    def _attempt_download(download_url):
        with urllib.request.urlopen(download_url) as response:
            # Get the total file size from headers, defaulting to 0 if not present
            file_size = int(response.headers.get("Content-Length", 0))

            # Check if file exists and has the same size
            if os.path.exists(destination):
                file_size_local = os.path.getsize(destination)
                if file_size == file_size_local:
                    print(f"File already exists and is up-to-date: {destination}")
                    return True  # Indicate success without re-downloading

            block_size = 1024  # 1 Kilobyte

            # Initialize the progress bar with total file size
            progress_bar_description = os.path.basename(download_url)
            with tqdm(total=file_size, unit="iB", unit_scale=True, desc=progress_bar_description) as progress_bar:
                with open(destination, "wb") as file:
                    while True:
                        chunk = response.read(block_size)
                        if not chunk:
                            break
                        file.write(chunk)
                        progress_bar.update(len(chunk))
            return True

    try:
        if _attempt_download(url):
            return
    except (urllib.error.HTTPError, urllib.error.URLError):
        if backup_url is not None:
            print(f"Primary URL ({url}) failed. Attempting backup URL: {backup_url}")
            try:
                if _attempt_download(backup_url):
                    return
            except urllib.error.HTTPError:
                pass

        # If we reach here, both attempts have failed
        error_message = (
            f"Failed to download from both primary URL ({url})"
            f"{' and backup URL (' + backup_url + ')' if backup_url else ''}."
            "\nCheck your internet connection or the file availability.\n"
            "For help, visit: https://github.com/rasbt/LLMs-from-scratch/discussions/273"
        )
        print(error_message)
    except Exception as e:
        print(f"An unexpected error occurred: {e}")


# Alternative way using `requests`
"""
def download_file(url, destination):
    # Send a GET request to download the file in streaming mode
    response = requests.get(url, stream=True)

    # Get the total file size from headers, defaulting to 0 if not present
    file_size = int(response.headers.get("content-length", 0))

    # Check if file exists and has the same size
    if os.path.exists(destination):
        file_size_local = os.path.getsize(destination)
        if file_size == file_size_local:
            print(f"File already exists and is up-to-date: {destination}")
            return

    # Define the block size for reading the file
    block_size = 1024  # 1 Kilobyte

    # Initialize the progress bar with total file size
    progress_bar_description = url.split("/")[-1]  # Extract filename from URL
    with tqdm(total=file_size, unit="iB", unit_scale=True, desc=progress_bar_description) as progress_bar:
        # Open the destination file in binary write mode
        with open(destination, "wb") as file:
            # Iterate over the file data in chunks
            for chunk in response.iter_content(block_size):
                progress_bar.update(len(chunk))  # Update progress bar
                file.write(chunk)  # Write the chunk to the file
"""


def load_gpt2_params_from_tf_ckpt(ckpt_path, settings):
    # Initialize parameters dictionary with empty blocks for each layer
    params = {"blocks": [{} for _ in range(settings["n_layer"])]}

    # Iterate over each variable in the checkpoint
    for name, _ in tf.train.list_variables(ckpt_path):
        # Load the variable and remove singleton dimensions
        variable_array = np.squeeze(tf.train.load_variable(ckpt_path, name))

        # Process the variable name to extract relevant parts
        variable_name_parts = name.split("/")[1:]  # Skip the 'model/' prefix

        # Identify the target dictionary for the variable
        target_dict = params
        if variable_name_parts[0].startswith("h"):
            layer_number = int(variable_name_parts[0][1:])
            target_dict = params["blocks"][layer_number]

        # Recursively access or create nested dictionaries
        for key in variable_name_parts[1:-1]:
            target_dict = target_dict.setdefault(key, {})

        # Assign the variable array to the last key
        last_key = variable_name_parts[-1]
        target_dict[last_key] = variable_array

    return params

# 下载 124M 个参数的 gpt2 模型
from gpt_download import download_and_load_gpt2

settings, params = download_and_load_gpt2(model_size="124M", models_dir="gpt2")

print("Settings:", settings)
print("Parameter dictionary keys:", params.keys())

GPT-2 模型有好几个，它们的差别主要在于参数规模不同，但模型结构是差不多的。不同的地方包括：

参数的数量；
嵌入维度的数量；
Transformer Block 的数量；
Multi-head 多头注意力的数量；

model_configs = {
    "gpt2-small(124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium(355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large(774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl(1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}

model_name = "gpt2-small(124M)"
NEW_CONFIG = GPT_CONFIG_124M.copy()
NEW_CONFIG.update(model_configs[model_name])
NEW_CONFIG.update({"context_length": 1024})
NEW_CONFIG.update({"qkv_bias": True})

gpt = GPTModel(NEW_CONFIG)
gpt.eval()


def assign(left, right):
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch: {left.shape} vs {right.shape}")
    return torch.nn.Parameter(torch.tensor(right))


def load_weights_into_gpt(gpt, params):
    gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params["wpe"])
    gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params["wte"])

    for b in range(len(params["blocks"])):
        q_w, k_w, v_w = np.split(
            (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1
        )
        gpt.trf_blocks[b].att.W_query.weight = assign(
            gpt.trf_blocks[b].att.W_query.weight, q_w.T
        )
        gpt.trf_blocks[b].att.W_key.weight = assign(
            gpt.trf_blocks[b].att.W_key.weight, k_w.T
        )
        gpt.trf_blocks[b].att.W_value.weight = assign(
            gpt.trf_blocks[b].att.W_value.weight, v_w.T
        )

        q_b, k_b, v_b = np.split(
            (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1
        )
        gpt.trf_blocks[b].att.W_query.bias = assign(
            gpt.trf_blocks[b].att.W_query.bias, q_b
        )
        gpt.trf_blocks[b].att.W_key.bias = assign(gpt.trf_blocks[b].att.W_key.bias, k_b)
        gpt.trf_blocks[b].att.W_value.bias = assign(
            gpt.trf_blocks[b].att.W_value.bias, v_b
        )

        gpt.trf_blocks[b].att.out_proj.weight = assign(
            gpt.trf_blocks[b].att.out_proj.weight,
            params["blocks"][b]["attn"]["c_proj"]["w"].T,
        )
        gpt.trf_blocks[b].att.out_proj.bias = assign(
            gpt.trf_blocks[b].att.out_proj.bias,
            params["blocks"][b]["attn"]["c_proj"]["b"],
        )

        gpt.trf_blocks[b].ff.layers[0].weight = assign(
            gpt.trf_blocks[b].ff.layers[0].weight,
            params["blocks"][b]["mlp"]["c_fc"]["w"].T,
        )
        gpt.trf_blocks[b].ff.layers[0].bias = assign(
            gpt.trf_blocks[b].ff.layers[0].bias,
            params["blocks"][b]["mlp"]["c_fc"]["b"],
        )
        gpt.trf_blocks[b].ff.layers[2].weight = assign(
            gpt.trf_blocks[b].ff.layers[2].weight,
            params["blocks"][b]["mlp"]["c_proj"]["w"].T,
        )
        gpt.trf_blocks[b].ff.layers[2].bias = assign(
            gpt.trf_blocks[b].ff.layers[2].bias,
            params["blocks"][b]["mlp"]["c_proj"]["b"],
        )

        gpt.trf_blocks[b].norm1.scale = assign(
            gpt.trf_blocks[b].norm1.scale,
            params["blocks"][b]["ln_1"]["g"],
        )
        gpt.trf_blocks[b].norm1.shift = assign(
            gpt.trf_blocks[b].norm1.shift,
            params["blocks"][b]["ln_1"]["b"],
        )
        gpt.trf_blocks[b].norm2.scale = assign(
            gpt.trf_blocks[b].norm2.scale,
            params["blocks"][b]["ln_2"]["g"],
        )
        gpt.trf_blocks[b].norm2.shift = assign(
            gpt.trf_blocks[b].norm2.shift,
            params["blocks"][b]["ln_2"]["b"],
        )

    gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"])
    gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"])
    gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"])


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

load_weights_into_gpt(gpt, params)
gpt.to(device)

tokenizer = tiktoken.get_encoding("gpt2")

torch.manual_seed(123)
token_ids = generate(
    model=gpt,
    idx=text_to_token_ids("Every efforts moves you", tokenizer).to(device),
    max_new_tokens=25,
    context_size=NEW_CONFIG["context_length"],
    top_k=50,
    temperature=1.5,
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
# Every efforts moves you have made a movement a moves a moved move was moved

6.Fine-tuning for classification

目标：微调模型，以便其能够完成分类任务。

不同的微调类型

有两种常见的微调，一种是用于完成分类任务，一种是用于遵循指令输出结果。

指令型微调能够处理更加复杂和通用的任务，但同时它也需要更大的数据集进行训练，以及更多的算力。反之，分类型微调只需可较少的数据集即可完成训练。

准备数据集

进行微调训练前，需要先准备好数据集。此处为了方便，直接从网上下载一份现成的数据集，其中包括垃圾短信和正常短信（也可以用于垃圾邮件的分类训练，原理一样）；

import urllib.request
import zipfile
import os
from pathlib import Path
import pandas as pd

# 下载数据集
def download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path):
    if data_file_path.exists():
        print(f"{data_file_path} already exists. Skipping download and extraction")
        return

    with urllib.request.urlopen(url) as response:
        with open(zip_path, "wb") as out_file:
            out_file.write(response.read())

    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(extracted_path)

    original_file_path = Path(extracted_path) / "SMSSpamCollection"
    os.rename(original_file_path, data_file_path)
    print(f"File downloaded and saved as {data_file_path}")


url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"
data_dir = "data"
zip_path = f"{data_dir}/sms_spam_coleectioni.zip"
# print("zip_path", zip_path)
extracted_path = f"{data_dir}/sms_spam_collection"
data_file_path = Path(extracted_path) / "SMSSpamCollection.tsv"
download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path)

df = pd.read_csv(data_file_path, sep="\t", header=None, names=["Label", "Text"])
print(df["Label"].value_counts())
# ham     4825
# spam     747
# Name: count, dtype: int64


def create_balanced_dataset(df):
    num_spam = df[df["Label"] == "spam"].shape[0]
    ham_subset = df[df["Label"] == "ham"].sample(num_spam, random_state=42)
    balanced_df = pd.concat([ham_subset, df[df["Label"] == "spam"]])
    return balanced_df


balanced_df = create_balanced_dataset(df)
print(balanced_df["Label"].value_counts())
# Label
# ham     747
# spam    747
# Name: count, dtype: int64

balanced_df["Label"] = balanced_df["Label"].map({"ham": 0, "spam": 1})
# print(balanced_df["Label"])

def random_split(df, train_frac, validation_frac):
    df = df.sample(frac=1, random_state=123).reset_index(drop=True)
    train_end = int(len(df) * train_frac)
    validation_end = train_end + int(len(df) * validation_frac)

    train_df = df[:train_end]
    validation_df = df[train_end:validation_end]
    test_df = df[validation_end:]

    return train_df, validation_df, test_df

train_df, validation_df, test_df = random_split(balanced_df, 0.7, 0.1)

train_df.to_csv(f"{data_dir}/train.csv", index=None)
validation_df.to_csv(f"{data_dir}/validation.csv", index=None)
test_df.to_csv(f"{data_dir}/test.csv", index=None)

创建数据加载器

由于每个句子长短不一，因此需要对较短的句子做 padding 填充，以便所有句子的长度相同。

import torch
from torch.utils.data import Dataset
import pandas as pd
import tiktoken

class SpamDataset(Dataset):
    # 模型的上下文长度为 1024，如果单句文本长度超过 1024，则需要传入 max_length 参数
    def __init__(self, csv_file, tokenizer, max_length=None, pad_token_id=50256):
        self.data = pd.read_csv(csv_file)
        self.encoded_texts = [tokenizer.encode(text) for text in self.data["Text"]]
        if max_length is None:
            self.max_length = self._longest_encoded_length()
        else:
            self.max_length = max_length
            self.encoded_texts = [
                encoded_text[: self.max_length] for encoded_text in self.encoded_texts
            ]
        self.encoded_texts = [
            encoded_text + [pad_token_id] * (self.max_length - len(encoded_text))
            for encoded_text in self.encoded_texts
        ]

    def __getitem__(self, idx):
        encoded = self.encoded_texts[idx]
        label = self.data.iloc[idx]["Label"]
        return [
            torch.tensor(encoded, dtype=torch.long),
            torch.tensor(label, dtype=torch.long),
        ]

    def __len__(self):
        return len(self.data)

    def _longest_encoded_length(self):
        max_length = 0
        for encoded_text in self.encoded_texts:
            max_length = max(max_length, len(encoded_text))
        return max_length


tokenizer = tiktoken.get_encoding("gpt2")
train_dataset = SpamDataset(
    csv_file="data/train.csv", max_length=None, tokenizer=tokenizer
)
# print(train_dataset.max_length)
# 120

val_dataset = SpamDataset(
    csv_file="data/validation.csv",
    max_length=train_dataset.max_length,
    tokenizer=tokenizer,
)
test_dataset = SpamDataset(
    csv_file="data/test.csv",
    max_length=train_dataset.max_length,
    tokenizer=tokenizer,
)

分类任务跟之前的模型训练有一个很大的不同，即它不再是预测下一个单词是什么，而是预测标签（label），例如判断是否为垃圾消息，即是（0）或者否（1）

num_workers = 0
batch_size = 8
torch.manual_seed(123)

train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=num_workers,
    drop_last=True,
)
val_loader = DataLoader(
    dataset=val_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    drop_last=True,
)
test_loader = DataLoader(
    dataset=test_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    drop_last=True,
)

for input_batch, target_batch in train_loader:
    pass
print("Input batch dimension:", input_batch.shape)
print("Label batch dimension:", target_batch.shape)
# Input batch dimension: torch.Size([8, 120])
# Label batch dimension: torch.Size([8])
print(f"{len(train_loader)} training batches")
print(f"{len(val_loader)} valication batches")
print(f"{len(test_loader)} test batches")
# 130 training batches
# 18 valication batches
# 37 test batches

用预训练参数初始化模型

# 初始化模型
CHOOSE_MODEL = "gpt2-small(124M)"
INPUT_PROMPT = "Every effort moves"
BASE_CONFIG = {
    "vocab_size": 50257,
    "context_length": 1024,
    "drop_rate": 0.0,
    "qkv_bias": True,
}
model_configs = {
    "gpt2-small(124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium(355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large(774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl(1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}
BASE_CONFIG.update(model_configs[CHOOSE_MODEL])

model_size = get_model_size(CHOOSE_MODEL)
settings, params = download_and_load_gpt2(model_size=model_size, models_dir="data/gpt2")
model = GPTModel(BASE_CONFIG)
load_weights_into_gpt(model, params)
model.eval()

text_1 = "Every effort moves you"
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(text_1, tokenizer),
    max_new_tokens=15,
    context_size=BASE_CONFIG["context_length"],
)
print(token_ids_to_text(token_ids, tokenizer))
# Every effort moves you.

text_2 = (
    "Is the following text 'spam'? Answer with 'yes' or 'no':"
    " 'You are a winner you have been specially"
    " selected to receive $1000 cash or a $2000 award.'"
)
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(text_2, tokenizer),
    max_new_tokens=23,
    context_size=BASE_CONFIG["context_length"],
)
print(token_ids_to_text(token_ids, tokenizer))
# Is the following text 'spam'? Answer with 'yes' or 'no': 'You are a winner you have been specially selected to receive $1000 cash or a $2000 award.' Answer a cash award a cash award cash' cash cash cash a cash a cash' cash a cash' cash'

虽然输入显式的指示模型最后输出 Yes 或者 No 作为结果，但显然模型最后的输出跟预期不同。这是因为模型之前的训练并没有专门针对分类任务进行设计。

添加分类头

为了更好的完成分类任务，需要对模型进行微调。其中一种办法就是替换模型最后的输出层，原本是输出 vocab 中各个单词的概率，现在改为输出 yes 和 no 两个 label 的概率即可，分别用数值 1 和 0 表示。

以下是模型的结构，中间有 12 个 transformer block，最后有一个 out_head，这个 out_head 负责输出 50257 个单词的概率；通过替换 out_head 层，即可实现将输出转换成 yes/no 两个结果；

GPTModel(
  (tok_emb): Embedding(50257, 768)
  (pos_emb): Embedding(1024, 768)
  (drop_emb): Dropout(p=0.0, inplace=False)
  (trf_blocks): Sequential(
    (0): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features=768, out_features=768, bias=True)
        (W_key): Linear(in_features=768, out_features=768, bias=True)
        (W_value): Linear(in_features=768, out_features=768, bias=True)
        (out_proj): Linear(in_features=768, out_features=768, bias=True)
        (dropout): Dropout(p=0.0, inplace=False)
      )
      (ff): FeedForward(
        (layers): Sequential(
          (0): Linear(in_features=768, out_features=3072, bias=True)
          (1): GELU()
          (2): Linear(in_features=3072, out_features=768, bias=True)
        )
      )
      (norm1): LayerNorm()
      (norm2): LayerNorm()
      (drop_shortcut): Dropout(p=0.0, inplace=False)
    )
    ... (为方便查看，此处省略了从 1 到 10 等 10 个 transformer block)
    (11): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features=768, out_features=768, bias=True)
        (W_key): Linear(in_features=768, out_features=768, bias=True)
        (W_value): Linear(in_features=768, out_features=768, bias=True)
        (out_proj): Linear(in_features=768, out_features=768, bias=True)
        (dropout): Dropout(p=0.0, inplace=False)
      )
      (ff): FeedForward(
        (layers): Sequential(
          (0): Linear(in_features=768, out_features=3072, bias=True)
          (1): GELU()
          (2): Linear(in_features=3072, out_features=768, bias=True)
        )
      )
      (norm1): LayerNorm()
      (norm2): LayerNorm()
      (drop_shortcut): Dropout(p=0.0, inplace=False)
    )
  )
  (final_norm): LayerNorm()
  (out_head): Linear(in_features=768, out_features=50257, bias=False)
)

# 将模型参数的 requires_grad 属性设置为 False，以便冻结模型参数，这样在训练时不会更新模型参数
for param in model.parameters():
    param.requires_grad = False

torch.manual_seed(123)
num_classes = 2
model.out_head = torch.nn.Linear(
    in_features=BASE_CONFIG["emb_dim"],
    out_features=num_classes,
)  # 新添加的层 requires_grad 默认是 True

# 上面只替换最后的 out_head 层，但基于过往经验，对最后几层也进行训练，模型最终的效果会更好，因此下面将 final_norm 层和最后一个 transformer 的参数放开，允许在训练过程中更新
for param in model.trf_blocks[-1].parameters():
    param.requires_grad = True
for param in model.final_norm.parameters():
    param.requires_grad = True

# 调整后的模型，用法仍然跟以前相同，区别在于最后的输出有点点不一样
inputs = tokenizer.encode("Do you have time")
inputs = torch.tensor(inputs).unsqueeze(0)
print("Inputs:", inputs)
print("Inputs dimensions:", inputs.shape)
# Inputs: tensor([[5211,  345,  423,  640]])
# Inputs dimensions: torch.Size([1, 4])

with torch.no_grad():
    outputs = model(inputs)
print("Outputs:\n", outputs)
print("Outputs dimensions:", outputs.shape)
# Outputs:
#  tensor([[[-0.5966,  0.6271],
#          [-2.8117,  6.8164],
#          [-2.0197,  5.6612],
#          [-3.3210,  4.7156]]])
# Outputs dimensions: torch.Size([1, 4, 2])
# 未调整前的模型，最后的输出是 50257 维的，现在只剩下 2 维了

print("Last output token:", outputs[:, -1, :])
# Last output token: tensor([[-3.3210,  4.7156]])
# 在预测最后的输出时，只取读取最后一个 token 所在行的概率进行计算即可
# 因为该 token 包含前面所有 token 完整的相关信息

计算分类损失和精度

之前在预测 token 时，通过 softmax + argmax 找出最大概率的索引值即可。在分类任务，原理仍然不变，只是之前有 50267 个索引，现在只剩下 2 个了；

# 之前在预测 token 时，通过 softmax + argmax 找出最大概率的索引值即可
# 在分类任务，原理仍然不变，只是之前有 50267 个索引，现在只剩下 2 个了；

probas = torch.softmax(outputs[:, -1, :], dim=-1)
label = torch.argmax(probas)
print("Class label:", label.item())
# Class label: 1


# 计算预测的准确率
def calc_accuracy_loader(data_loader, model, device, num_batches=None):
    model.eval()
    correct_predictions, num_examples = 0, 0

    if num_batches is None:
        num_batches = len(data_loader)
    else:
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            input_batch = input_batch.to(device)
            target_batch = target_batch.to(device)

            with torch.no_grad():
                logits = model(input_batch)[:, -1, :]
            predicted_labels = torch.argmax(logits, dim=-1)

            num_examples += predicted_labels.shape[0]
            correct_predictions += (predicted_labels == target_batch).sum().item()
        else:
            break
    return correct_predictions / num_examples


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

torch.manual_seed(123)
train_accuracy = calc_accuracy_loader(
    data_loader=train_loader, model=model, device=device, num_batches=10
)
val_accuracy = calc_accuracy_loader(
    data_loader=val_loader, model=model, device=device, num_batches=10
)
test_accuracy = calc_accuracy_loader(
    data_loader=test_loader, model=model, device=device, num_batches=10
)
print(f"Training accuracy: {train_accuracy*100:.2f}%")
print(f"Validation accuracy: {val_accuracy*100:.2f}%")
print(f"Test accuracy: {test_accuracy*100:.2f}%")
# Training accuracy: 56.25%
# Validation accuracy: 62.50%
# Test accuracy: 55.00%
# 从以上结果可见，目前预测准确率都很低，因为此时模型暂时没有学会如何区分垃圾短信和正常短信


# 计算单批损失
def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch = input_batch.to(device)
    target_batch = target_batch.to(device)
	# 计算模型的输出，并取出输出中的最后一个 token 的 logits 即可
    logits = model(input_batch)[:, -1, :]
    loss = torch.nn.functional.cross_entropy(logits, target_batch)
    return loss


# 计算所有批次的损失，取平均值
def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.0
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches


with torch.no_grad():
    train_loss = calc_loss_loader(train_loader, model, device, num_batches=5)
    val_loss = calc_loss_loader(val_loader, model, device, num_batches=5)
    test_loss = calc_loss_loader(test_loader, model, device, num_batches=5)

print(f"Training loss: {train_loss:.3f}")
print(f"Validation loss: {val_loss:.3f}")
print(f"Test loss: {test_loss:.3f}")
# Training loss: 2.213
# Validation loss: 2.050
# Test loss: 1.408
# 从以上结果可见，目前模型的损失函数值都很高

使用监督数据微调模型

# 分类器训练
def train_classifier_simple(
    model, train_loader, val_loader, optimizer, device, num_epochs, eval_freq, eval_iter
):
    train_losses, val_losses, train_accs, val_accs = [], [], [], []
    examples_seen, global_step = 0, -1

    for epoch in range(num_epochs):
        model.train()

        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()  # 清空上一步遗留的梯度值
            # calc_loss_batch 将预测输出，并返回损失值
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward()  # 反向传播，计算当前梯度值
            optimizer.step()  # 使用梯度值，更新模型参数
            examples_seen += input_batch.shape[0]
            global_step += 1

            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter
                )
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                print(
                    f"Ep {epoch + 1} (Step {global_step: 06d}):"
                    f"Train loss {train_loss:.3f},"
                    f" Val loss {val_loss:.3f}"
                )

        train_accuracy = calc_accuracy_loader(
            train_loader, model, device, num_batches=eval_iter
        )
        val_accuracy = calc_accuracy_loader(
            val_loader, model, device, num_batches=eval_iter
        )
        print(f"Training accuracy: {train_accuracy*100:.2f}% | ", end="")
        print(f"Validation accuracy: {val_accuracy*100:.2f}%")
        train_accs.append(train_accuracy)
        val_accs.append(val_accuracy)

    return train_losses, val_losses, train_accs, val_accs, examples_seen


# 评估模型准确率
def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval()
    with torch.no_grad():
        train_loss = calc_loss_loader(
            train_loader, model, device, num_batches=eval_iter
        )
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
    model.train()
    return train_loss, val_loss


start_time = time.time()
torch.manual_seed(123)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
num_epochs = 5

train_losses, val_losses, train_accs, val_accs, examples_seen = train_classifier_simple(
    model,
    train_loader,
    val_loader,
    optimizer,
    device,
    num_epochs=num_epochs,
    eval_freq=50,
    eval_iter=5,
)
end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes")
# Last output token: tensor([[-3.3210,  4.7156]])
# Ep 1 (Step  00000):Train loss 1.369, Val loss 1.781
# Ep 1 (Step  00050):Train loss 0.488, Val loss 0.349
# Ep 1 (Step  00100):Train loss 0.216, Val loss 0.335
# Training accuracy: 95.00% | Validation accuracy: 95.00%
# Ep 2 (Step  00150):Train loss 0.257, Val loss 0.209
# Ep 2 (Step  00200):Train loss 0.086, Val loss 0.191
# Ep 2 (Step  00250):Train loss 0.138, Val loss 0.128
# Training accuracy: 100.00% | Validation accuracy: 97.50%
# Ep 3 (Step  00300):Train loss 0.127, Val loss 0.211
# Ep 3 (Step  00350):Train loss 0.187, Val loss 0.105
# Training accuracy: 92.50% | Validation accuracy: 97.50%
# Ep 4 (Step  00400):Train loss 0.108, Val loss 0.089
# Ep 4 (Step  00450):Train loss 0.027, Val loss 0.094
# Ep 4 (Step  00500):Train loss 0.203, Val loss 0.048
# Training accuracy: 100.00% | Validation accuracy: 97.50%
# Ep 5 (Step  00550):Train loss 0.073, Val loss 0.044
# Ep 5 (Step  00600):Train loss 0.047, Val loss 0.074
# Training accuracy: 100.00% | Validation accuracy: 97.50%
# Training completed in 14.17 minutes

训练设置了 5 轮，但其实 2 轮之后，预测的准确率就很高了。因此，实际需要训练的轮次，跟数据集的大小和任务的复杂度有关。如果数据集足够大，任务也较为简单，有可能 1~2 轮就足够了，这样也可以很多时间和算力。

train_accuracy = calc_accuracy_loader(
    data_loader=train_loader, model=model, device=device, num_batches=10
)
val_accuracy = calc_accuracy_loader(
    data_loader=val_loader, model=model, device=device, num_batches=10
)
test_accuracy = calc_accuracy_loader(
    data_loader=test_loader, model=model, device=device, num_batches=10
)
print(f"Training accuracy: {train_accuracy*100:.2f}%")
print(f"Validation accuracy: {val_accuracy*100:.2f}%")
print(f"Test accuracy: {test_accuracy*100:.2f}%")
# Training accuracy: 100%
# Validation accuracy: 97.5%
# Test accuracy: 95.67%
# 可见在训练过后，相对之前约 50% 的准确率，现在的准确率得到了明显的提高

将模型用于垃圾消息分类

def classify_review(
    text, model, tokenizer, device, max_length=None, pad_token_id=50256
):
    model.eval()

    input_ids = tokenizer.encode(text)
    supported_context_length = model.pos_emb.weight.shape[1]
    # 限制输入长度
    input_ids = input_ids[: min(max_length, supported_context_length)]
    # 如果输入长度小于最大长度，则用 pad_token_id 填充
    input_ids += [pad_token_id] * (max_length - len(input_ids))

    input_tensor = torch.tensor(input_ids, device=device).unsqueeze(0)

    with torch.no_grad():
        logits = model(input_tensor)[:, -1, :]
    predicted_labels = torch.argmax(logits, dim=-1).item()

    return "spam" if predicted_labels == 1 else "not_spam"


text_1 = "You are a winner you have been specially selected to receive $1000 cash or a $2000 award."

print(
    classify_review(
        text_1, model, tokenizer, device, max_length=train_dataset.max_length
    )
)

text_2 = "Hey, just wanted to check if we're still on for dinner tonight? Let me know!"
print(
    classify_review(
        text_2, model, tokenizer, device, max_length=train_dataset.max_length
    )
)
# 保存模型参数
torch.save(model.state_dict(), "data/review_classifier.pth")

# 模型的参数保存后，可以在下次使用时直接加载
# model_state_dict = torch.load("data/review_classifier.pth, map_location=device")
# model.load_state_dict(model_state_dict)

# 直接加载模型进行分类
import torch
from GPTModel import GPTModel
import tiktoken

from utils import get_model_size

tokenizer = tiktoken.get_encoding("gpt2")


def classify_review(
    text, model, tokenizer, device, max_length=None, pad_token_id=50256
):
    model.eval()

    input_ids = tokenizer.encode(text)
    supported_context_length = model.pos_emb.weight.shape[1]
    # 限制输入长度
    input_ids = input_ids[: min(max_length, supported_context_length)]
    # 如果输入长度小于最大长度，则用 pad_token_id 填充
    input_ids += [pad_token_id] * (max_length - len(input_ids))

    input_tensor = torch.tensor(input_ids, device=device).unsqueeze(0)

    with torch.no_grad():
        logits = model(input_tensor)[:, -1, :]
    predicted_labels = torch.argmax(logits, dim=-1).item()

    return "spam" if predicted_labels == 1 else "not_spam"


CHOOSE_MODEL = "gpt2-small(124M)"
BASE_CONFIG = {
    "vocab_size": 50257,
    "context_length": 1024,
    "drop_rate": 0.0,
    "qkv_bias": True,
}
model_configs = {
    "gpt2-small(124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium(355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large(774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl(1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}
BASE_CONFIG.update(model_configs[CHOOSE_MODEL])
model_size = get_model_size(CHOOSE_MODEL)
model = GPTModel(BASE_CONFIG)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
torch.manual_seed(123)
num_classes = 2
model.out_head = torch.nn.Linear(
    in_features=BASE_CONFIG["emb_dim"],
    out_features=num_classes,
)

# 模型的参数保存后，可以在下次使用时直接加载
model_state_dict = torch.load("data/review_classifier.pth", map_location=device)
model.load_state_dict(model_state_dict)

text_1 = "You are a winner you have been specially selected to receive $1000 cash or a $2000 award."

print(classify_review(text_1, model, tokenizer, device, max_length=120))

text_2 = "Hey, just wanted to check if we're still on for dinner tonight? Let me know!"
print(classify_review(text_2, model, tokenizer, device, max_length=120))

7.Fine-tuning to follow instructions

指令微调介绍

根据特定的指令来完成特定的任务，示例如下：

指令微调同样由三个步骤组成，分别为：

准备数据集
微调模型
评估模型

准备数据集

import json
import os
import urllib.request


def download_and_load_file(file_path, url):
    if not os.path.exists(file_path):
        with urllib.request.urlopen(url) as response:
            text_data = response.read().decode("utf-8")
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(text_data)
    else:
        with open(file_path, "r", encoding="utf-8") as f:
            text_data = f.read()
    with open(file_path, "r") as f:
        data = json.load(f)
    return data


file_path = "data/instruction-data.json"
url = (
    "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch"
    "/main/ch07/01_main-chapter-code/instruction-data.json"
)

data = download_and_load_file(file_path, url)
print("Number of entries:", len(data))
# Number of entries: 1100

print("Example entry:\n", data[50])
# Example entry:
#  {'instruction': 'Identify the correct spelling of the following word.', 'input': 'Ocassion', 'output': "The correct spelling is 'Occasion.'"}
print("Another example entry:\n", data[999])
# {'instruction': "What is an antonym of 'complicated'?", 'input': '', 'output': "An antonym of 'complicated' is 'simple'."}

按指令微调模型，其训练数据集是一个包含指令、输入和输出的 JSON 文件。数据可以有多种不同的格式化模板，例如 Alpaca 和 Phi-3 两种常见的类型。

# 格式化输入的数据，以便用于训练
def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task."
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text


model_input = format_input(data[50])
desired_response = f"\n\n### Response:\n{data[50]['output']}"
print(model_input + desired_response)
# Below is an instruction that describes a task.Write a response that appropriately completes the request.

# ### Instruction:
# Identify the correct spelling of the following word.

# ### Input:
# Ocassion

# ### Response:
# The correct spelling is 'Occasion.'


model_input = format_input(data[999])
desired_response = f"\n\n### Response:\n{data[999]['output']}"
print(model_input + desired_response)
# Below is an instruction that describes a task.Write a response that appropriately completes the request.

# ### Instruction:
# What is an antonym of 'complicated'?

# ### Response:
# An antonym of 'complicated' is 'simple'.

将数据分批

train_portion = int(len(data) * 0.85)
test_portion = int(len(data) * 0.1)
val_portion = len(data) - train_portion - test_portion

train_data = data[:train_portion]
test_data = data[train_portion : train_portion + test_portion]
val_data = data[train_portion + test_portion :]
print("Training set length:", len(train_data))
print("Validation set length:", len(val_data))
print("Test set length:", len(test_data))
# Training set length: 935
# Validation set length: 55
# Test set length: 110

数据分批涉及多个动作，包括：

格式化
token 化
填充，保证长度相同
创建目标 token ids
用占位符替换填充值

以下是格式化和 token 化的示例：

# 对数据集进行格式化
class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data
        self.encoded_texts = []
        for entry in data:
            instruction_plus_input = format_input(entry)
            response_text = f"\n\n### Response:\n{entry['output']}"
            full_text = instruction_plus_input + response_text
            self.encoded_texts.append(tokenizer.encode(full_text))

    def __getitem__(self, index):
        return self.encoded_texts[index]

    def __len__(self):
        return len(self.data)

填充长度可按需调整，不同 batch 大小可以会有所不同，单个 batch 内部的长度相同即可

# 自定义的 collate_fn 函数，用于对每个 batch 内部的长度进行 padding，
# 用于替换 DataLoader 默认的 collate_fn
def custom_collate_draft_1(batch, pad_token_id=50256, device="cpu"):
    batch_max_length = max(len(item) + 1 for item in batch)
    inputs_lst = []

    for item in batch:
        new_item = item.copy()
        new_item += [pad_token_id]  # 添加一个结束标记
        # 填充
        padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))
        inputs = torch.tensor(padded[:-1])
        inputs_lst.append(inputs)

    inputs_tensor = torch.stack(inputs_lst).to(device)
    return inputs_tensor


inputs_1 = [0, 1, 2, 3, 4]
inputs_2 = [5, 6]
inputs_3 = [7, 8, 9]
batch = (inputs_1, inputs_2, inputs_3)
print(custom_collate_draft_1(batch))
# tensor([[    0,     1,     2,     3,     4],
#         [    5,     6, 50256, 50256, 50256],
#         [    7,     8,     9, 50256, 50256]])

除了需要将 inputs 转成 token id，还需要将 response（即预期输出）也转成 token id，以便提交给模型进行训练学习；

response 和 inputs 的区别在于往右移动了一位，示例如下：

# 改进版的 collate_fn 函数，增加了 targets 目标输出
def custom_collate_draft_2(batch, pad_token_id=50256, device="cpu"):
    batch_max_length = max(len(item) + 1 for item in batch)
    inputs_lst = []
    target_lst = []

    for item in batch:
        new_item = item.copy()
        new_item += [pad_token_id]  # 添加一个结束标记
        # 填充
        padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))
        inputs = torch.tensor(padded[:-1])
        targets = torch.tensor(padded[1:])
        inputs_lst.append(inputs)
        target_lst.append(targets)

    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(target_lst).to(device)
    return inputs_tensor, targets_tensor

inputs, targets = custom_collate_draft_2(batch)
print(inputs)
print(targets)
# tensor([[    0,     1,     2,     3,     4],
#         [    5,     6, 50256, 50256, 50256],
#         [    7,     8,     9, 50256, 50256]])
# tensor([[    1,     2,     3,     4, 50256],
#         [    6, 50256, 50256, 50256, 50256],
#         [    8,     9, 50256, 50256, 50256]])

接下来需要将多余的 end-of-text 标记替换成 -100，以便在计算损失函数时，能够将其排除，避免影响计算；仅保留第一个 end-of-text，因为我们需要模型学习任何结束输出。

end-of-text 之所以替换为 -100，是因为在框架中，使用写死的 -100 做为判断；

def custom_collate_fn(
    batch, pad_token_id=50256, ignore_index=-100, allowed_max_length=None, device="cpu"
):
    batch_max_length = max(len(item) + 1 for item in batch)
    inputs_lst, targets_lst = [], []

    for item in batch:
        new_item = item.copy()
        new_item += [pad_token_id]
        padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))
        inputs = torch.tensor(padded[:-1])
        targets = torch.tensor(padded[1:])
        mask = targets == pad_token_id
        indices = torch.nonzero(mask).squeeze()
        if indices.numel() > 1:  # numel 方法会返回元素的总数量
            # 将多余的 pad_token_id 替换为 ignore_index，即 -100
            # 这样交叉熵函数计算损失时，会忽略该值；该值是内置的参数，有点像个魔术数字
            targets[indices[1:]] = ignore_index
        # 通过 allowed_max_length 限制长度（如需）
        if allowed_max_length is not None:
            inputs = inputs[:allowed_max_length]
            targets = targets[:allowed_max_length]

        inputs_lst.append(inputs)
        targets_lst.append(targets)

    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)
    return inputs_tensor, targets_tensor


inputs, targets = custom_collate_fn(batch)
print(inputs)
print(targets)
# tensor([[    0,     1,     2,     3,     4],
#         [    5,     6, 50256, 50256, 50256],
#         [    7,     8,     9, 50256, 50256]])
# tensor([[    1,     2,     3,     4, 50256],
#         [    6, 50256,  -100,  -100,  -100],
#         [    8,     9, 50256,  -100,  -100]])

接下来还可以考虑进一步 mask 掉输出中与指令重叠的部分，以便得到纯输出。但这个做法目前存在争议，因为有部分研究发现，如果不 mask 重叠的部分，模型的性能更好。

创建数据加载器

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)
# Device: cpu

# 使用 partial 封装函数和具体的 device 参数
customized_collate_fn = partial(
    custom_collate_fn, device=device, allowed_max_length=1024
)

num_workers = 0
batch_size = 8

torch.manual_seed(123)
tokenizer = tiktoken.get_encoding("gpt2")

train_dataset = InstructionDataset(train_data, tokenizer)
train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=True,
    drop_last=True,
    num_workers=num_workers,
)

val_dataset = InstructionDataset(val_data, tokenizer)
val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=True,
    drop_last=True,
    num_workers=num_workers,
)

test_dataset = InstructionDataset(test_data, tokenizer)
test_loader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=True,
    drop_last=True,
    num_workers=num_workers,
)
print("Train loader:")
for inputs, targets in train_loader:
    print(inputs.shape, targets.shape)
# Train loader:
# torch.Size([8, 61]) torch.Size([8, 61])
# torch.Size([8, 76]) torch.Size([8, 76])
# torch.Size([8, 73]) torch.Size([8, 73])
# ...
# torch.Size([8, 74]) torch.Size([8, 74])
# torch.Size([8, 69]) torch.Size([8, 69])
# 第一维 8 是批量大小，第二维是每个批次的长度

加载预训练模型

之前做分类任务微调时，使用的是 124M 参数的小模型。但对于指令型任务，这个模型性能一般，因此接下来改用 355M 参数的模型。

BASE_CONFIG = {
    "vocab_size": 50257,
    "context_length": 1024,
    "drop_rate": 0.0,
    "qkv_bias": True,
}
model_configs = {
    "gpt2-small(124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium(355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large(774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl(1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}
CHOOSE_MODEL = "gpt2-medium(355M)"
BASE_CONFIG.update(model_configs[CHOOSE_MODEL])
model_size = get_model_size(CHOOSE_MODEL)


settings, params = download_and_load_gpt2(
    model_size=model_size,
    models_dir="data/gpt2m",
)

model = GPTModel(BASE_CONFIG)
load_weights_into_gpt(model, params)
model.eval()


torch.manual_seed(123)
input_text = format_input(val_data[0])
print(input_text)


token_ids = generate(
    model=model,
    idx=text_to_token_ids(input_text, tokenizer),
    max_new_tokens=35,
    context_size=BASE_CONFIG["context_length"],
    eos_id=50256,
)
generated_text = token_ids_to_text(token_ids, tokenizer)

response_text = generated_text[len(input_text) :].strip()
print(response_text)
# ###
# ### the active
# The active
# The active
# The active
# The active
# The active
# The active
# The active
# The active
# The active
# The

# 以上生成的内容完全没有实现预期中的效果，毕竟还没有进行微调训练，情有可原

微调模型

# 未训练前的预测损失
model.to(device)
torch.manual_seed(123)

with torch.no_grad():
    train_loss = calc_loss_loader(train_loader, model, device, num_batches=5)
    val_loss = calc_loss_loader(val_loader, model, device, num_batches=5)

print("Traning loss:", train_loss)
print("Validation loss:", val_loss)
# Traning loss: 4.88994665145874
# Validation loss: 5.035206127166748

# 开始训练
start_time = time.time()
torch.manual_seed(123)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.00005, weight_decay=0.1)

num_epochs = 2
train_losses, val_losses, tokens_seen = train_model_simple(
    model,
    train_loader,
    val_loader,
    optimizer,
    device,
    num_epochs=num_epochs,
    eval_freq=5,
    eval_iter=5,
    start_context=format_input(val_data[0]),
    tokenizer=tokenizer,
)
end_time = time.time()
executed_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {executed_time_minutes:.2f} minutes.")

# 保存训练结果，方便下次直接复用，无需重新训练一次
torch.save(model.state_dict(), "data/instruction.pth")

# Ep 1 (Step 000000): Train loss 3.258, Val loss 3.389,
# Ep 1 (Step 000005): Train loss 1.878, Val loss 1.920,
# Ep 1 (Step 000010): Train loss 1.472, Val loss 1.480,
# Ep 1 (Step 000015): Train loss 1.302, Val loss 1.353,
# Ep 1 (Step 000020): Train loss 1.195, Val loss 1.181,
# Ep 1 (Step 000025): Train loss 1.023, Val loss 1.142,
# Ep 1 (Step 000030): Train loss 1.020, Val loss 1.092,
# Ep 1 (Step 000035): Train loss 0.991, Val loss 1.062,
# Ep 1 (Step 000040): Train loss 0.892, Val loss 0.959,
# Ep 1 (Step 000045): Train loss 0.919, Val loss 0.905,
# Ep 1 (Step 000050): Train loss 0.905, Val loss 0.873,
# Ep 1 (Step 000055): Train loss 0.816, Val loss 0.849,
# Ep 1 (Step 000060): Train loss 0.783, Val loss 0.895,
# Ep 1 (Step 000065): Train loss 0.794, Val loss 0.818,
# Ep 1 (Step 000070): Train loss 0.705, Val loss 0.757,
# Ep 1 (Step 000075): Train loss 0.678, Val loss 0.777,
# Ep 1 (Step 000080): Train loss 0.700, Val loss 0.699,
# Ep 1 (Step 000085): Train loss 0.603, Val loss 0.673,
# Ep 1 (Step 000090): Train loss 0.533, Val loss 0.606,
# Ep 1 (Step 000095): Train loss 0.572, Val loss 0.633,
# Ep 1 (Step 000100): Train loss 0.432, Val loss 0.589,
# Ep 1 (Step 000105): Train loss 0.497, Val loss 0.620,
# Ep 1 (Step 000110): Train loss 0.515, Val loss 0.591,
# Ep 1 (Step 000115): Train loss 0.464, Val loss 0.576,
# Below is an instruction that describes a task.Write a response that appropriately completes the request.  ### Instruction: Convert the active sentence to passive: 'The chef cooks the meal every day.' ### Response: 'The chef.' ### Response: 'The chef.' ### Response: '### Response: '### Response: '### Response: '### Response: '### Response: '### Response: '### Response: '###
# Ep 2 (Step 000120): Train loss 0.486, Val loss 0.598,
# Ep 2 (Step 000125): Train loss 0.436, Val loss 0.605,
# Ep 2 (Step 000130): Train loss 0.360, Val loss 0.516,
# Ep 2 (Step 000135): Train loss 0.358, Val loss 0.465,
# Ep 2 (Step 000140): Train loss 0.531, Val loss 0.552,
# Ep 2 (Step 000145): Train loss 0.379, Val loss 0.546,
# Ep 2 (Step 000150): Train loss 0.324, Val loss 0.478,
# Ep 2 (Step 000155): Train loss 0.418, Val loss 0.467,
# Ep 2 (Step 000160): Train loss 0.394, Val loss 0.438,
# Ep 2 (Step 000165): Train loss 0.307, Val loss 0.495,
# Ep 2 (Step 000170): Train loss 0.358, Val loss 0.425,
# Ep 2 (Step 000175): Train loss 0.294, Val loss 0.425,
# Ep 2 (Step 000180): Train loss 0.270, Val loss 0.464,
# Ep 2 (Step 000185): Train loss 0.353, Val loss 0.413,
# Ep 2 (Step 000190): Train loss 0.372, Val loss 0.383,
# Ep 2 (Step 000195): Train loss 0.246, Val loss 0.368,
# Ep 2 (Step 000200): Train loss 0.334, Val loss 0.396,
# Ep 2 (Step 000205): Train loss 0.339, Val loss 0.424,
# Ep 2 (Step 000210): Train loss 0.263, Val loss 0.388,
# Ep 2 (Step 000215): Train loss 0.197, Val loss 0.375,
# Ep 2 (Step 000220): Train loss 0.241, Val loss 0.321,
# Ep 2 (Step 000225): Train loss 0.198, Val loss 0.349,
# Ep 2 (Step 000230): Train loss 0.235, Val loss 0.400,
# Below is an instruction that describes a task.Write a response that appropriately completes the request.  ### Instruction: Convert the active sentence to passive: 'The chef cooks the meal every day.' The active sentence to the active sentence to the active sentence to the active sentence to the active sentence to the active sentence to the active sentence active sentence active sentence active sentence active sentence active sentence active sentence active sentence active sentence active sentence active sentence active sentence
# Training completed in 44.29 minutes.

# 使用微调后的模型进行测试
torch.manual_seed(123)
for entry in test_data[:3]:
    input_text = format_input(entry)
    token_ids = generate(
        model=model,
        idx=text_to_token_ids(input_text, tokenizer).to(device),
        max_new_tokens=256,
        context_size=BASE_CONFIG["context_length"],
        eos_id=50256,
    )
    generated_text = token_ids_to_text(token_ids, tokenizer)

    response_text = (
        generated_text[len(input_text) :].replace("### Response:", "").strip()
    )
    print(input_text)
    print(f"\nCorrect response:\n>> {entry['output']}")
    print(f"\nModel response:\n>> {response_text.strip()}")
    print("------------------------------------------")

# Below is an instruction that describes a task.Write a response that appropriately completes the request.

# ### Instruction:
# Rewrite the sentence using a simile.

# ### Input:
# The car is very fast.

# Correct response:
# >> The car is as fast as lightning.

# Model response:
# >> The car is very fast.
# ------------------------------------------
# Below is an instruction that describes a task.Write a response that appropriately completes the request.

# ### Instruction:
# What type of cloud is typically associated with thunderstorms?

# Correct response:
# >> The type of cloud typically associated with thunderstorms is cumulonimbus.

# Model response:
# >> What type of cloud associated with type of cloud associated with associated with associated with associated with associated with associated with associated with associated cloud associated cloud associated cloud associated cloud associated cloud associated cloud associated cloud associated cloud associated with cloud associated with cloud associated with cloud associated with cloud associated with cloud associated with cloud associated with cloud with cloud with cloud with cloud with cloud with cloud with cloud with cloud with cloud with cloud with cloud with associated with associated with cloud with associated with associated with associated with associated with associated with associated with associated with associated with associated with associated with associated with associated with associated with associated with associated with associated with associated with associated with associated with associated with associated with associated with associated cloud with associated cloud with associated cloud with associated cloud with associated cloud with associated cloud with associated cloud with associated cloud with associated cloud with associated cloud with associated cloud with associated cloud with associated cloud associated cloud associated cloud associated cloud associated cloud associated cloud associated cloud associated cloud associated cloud associated cloud associated with associated with associated with associated with associated with associated with cloud associated with cloud associated with cloud associated with cloud associated with cloud associated with cloud associated with cloud associated with cloud associated with cloud associated with cloud associated with cloud associated with cloud associated with cloud associated with cloud associated with cloud associated with cloud with cloud with cloud with cloud with cloud with cloud
# ------------------------------------------
# Below is an instruction that describes a task.Write a response that appropriately completes the request.

# ### Instruction:
# Name the author of 'Pride and Prejudice'.

# Correct response:
# >> Jane Austen.

# Model response:
# >> Name the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of the author of author of author of author of author of author of author of author of author of author of author of author of author of author of author of author of author of author of author of author of author describes the author describes the author describes the author describes the author describes the author describes the author describes the author describes the author describes the author describes the author describes the author describes the author describes the author describes the author describes the author describes the author describes the author describes the author describes the author describes the author describes the author describes the author describes the author describes the author describes the author describes the author describes the author describes the author describes the

# 训练结果远不如预期，很菜，待查找一下原因

读取并保存响应

有多种方法可用来评估微调后的模型质量，包括：

简答题和多选题测试，例如 MMMU
- 优点：可自动化；
- 缺点：结果会受到题目设计的影响，例如选择题中选项的顺序；
与其他模型相比的测试，例如 LMSYS（需人工介入进行主观评估，例如进行手工数据标注）；
- 优点：不容易出错
- 缺点：费时费力，而且每个人有自己不同的偏好；
自动化对话测试，例如用其他更高级的模型，对生成结果进行评分，例如 AlpacaEval；
- 可自动化
- 缺点：会受其他模型局限性的影响

每种测试方法都有其优缺点，因此更好的方式是将它们组合起来使用，而不是只使用其中一种。

from tqdm import tqdm
import tiktoken
import torch
import json

from GPTModel import GPTModel
from utils import (
    download_and_load_file,
    format_input,
    generate,
    text_to_token_ids,
    token_ids_to_text,
)

file_path = "data/instruction-data.json"
url = (
    "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch"
    "/main/ch07/01_main-chapter-code/instruction-data.json"
)

data = download_and_load_file(file_path, url)
print("Number of entries:", len(data))

train_portion = int(len(data) * 0.85)
test_portion = int(len(data) * 0.1)
val_portion = len(data) - train_portion - test_portion

train_data = data[:train_portion]
test_data = data[train_portion : train_portion + test_portion]
val_data = data[train_portion + test_portion :]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = tiktoken.get_encoding("gpt2")
BASE_CONFIG = {
    "vocab_size": 50257,
    "context_length": 1024,
    "drop_rate": 0.0,
    "qkv_bias": True,
    "emb_dim": 1024,
    "n_layers": 24,
    "n_heads": 16,
}

# 加载微调后的模型
model = GPTModel(BASE_CONFIG)
checkpoint = torch.load("data/instruction.pth", map_location=device)
model.load_state_dict(checkpoint["model_state_dict"])

# 生成文本
for i, entry in tqdm(enumerate(test_data), total=len(test_data)):
    input_text = format_input(entry)

    token_ids = generate(
        model=model,
        idx=text_to_token_ids(input_text, tokenizer).to(device),
        max_new_tokens=256,
        context_size=BASE_CONFIG["context_length"],
        eos_id=50256,
    )
    generated_text = token_ids_to_text(token_ids, tokenizer)
    response_text = (
        generated_text[len(input_text) :].replace("### Response:", "").strip()
    )
    test_data[i]["model_response"] = response_text


with open("data/instruction-data-with-response.json", "w") as f:
    json.dump(test_data, f, indent=4)

# 检查生成的文本是否保存成功
print(test_data[0])

评估模型

人工审核模型的输出过于费时费力，一种简单高效的办法是使用另外一个训练好的模型，对当前模型的输出结果进行评估和打分。

Ollama 是一个专门用来加载和运行大模型的库，但不能用来训练或微调；可在官网上下载：https://ollama.com

我们可使用该库直接下载另外一个大模型，例如 llama3，然后用它来评估我们自己训练的模型；下载完后，让其在后台运行，然后就可以在命令行中调用它了，例如：运行 ollama run llama3 命令，会自动下载 llama3，并进入对话状态；

# 下载并运行 ollama3 模型
PS C:\Users\ccw> ollama run llama3
>>> what do llamas eat?
Llamas are herbivores, which means they primarily eat plants and plant-based foods. Their diet typically consists of:

1. Grasses: Llamas love to graze on grasses, including various species like timothy grass, orchard grass, and brome.
2. Hay: They enjoy eating hay, such as alfalfa or oat hay, which is high in fiber and protein.
3. Grains: Llamas might eat grains like oats, barley, or corn, but these should not make up more than 10% of their diet.
4. Fruits and vegetables: Many llamas enjoy fruits and veggies like apples, carrots, sweet potatoes, and peas as treats or supplements.
5. Minerals: Llamas need access to mineral blocks or loose minerals that provide essential nutrients like calcium, phosphorus, and salt.

In the wild, llamas might also eat:

1. Leaves: They'll munch on leaves from trees and shrubs, such as willow, alder, and juniper.
2. Bark: In some cases, they might eat bark, especially during times of food scarcity.
3. Mushrooms: Llamas have been known to eat certain types of mushrooms, like those in the genus Boletus.

In captivity or on farms, llama owners usually provide a diet that consists mainly of hay and grains, with occasional treats like fruits and veggies.
It's essential to ensure they receive a balanced diet and access to clean water at all times.

# 退出模型对话
>>> /bye

手动使用 Llama3 对之前的微调模型进行打分

从打分结果来看，效果很差，Llama3 给模型打了 100 分，但实际输入的是多个重复无意义的 Input 单词。另外使用通义千问对相同输入打分，通义的结果是 0 分。显然通义正确理解了输入的内容，而 Llama3 完全没理解输入的内容；

PS C:\Users\ccw> ollama run llama3
>>> Given the input `Below is an instruction that describe a task. Write a response that appropriately completes the request.
...
... ### Instruction:
... Rewrite the following sentence so that it is in active voice.
...
... ### Input:
... The cake was baked by Sarah.` and correct output `Sarah baked the cake.`, score the model response `Input: Input: Input: Input: Input: Input: Input` on
...  a scale from 0 to 100, where 100 is the best score.
A nice and simple instruction!

To rewrite the sentence in active voice, I'll make sure that the subject of the sentence performs the action described by the verb.

Original sentence: The cake was baked by Sarah.
Rewritten sentence: Sarah baked the cake.

Score: 100

Why? Because the rewritten sentence follows the structure of an active voice sentence, where the subject (Sarah) performs the action (baked) on the
object (the cake). The original sentence is in passive voice, which can sometimes make it harder to identify who performed the action.

自动批量打分

def generate_model_scores(json_data, json_key, model="llama3"):
    scores = []
    for entry in tqdm(json_data, desc="Scoring entries"):
        prompt = (
            f"Given the input `{format_input(entry)}` "
            f"and correct output `{entry['output']}`, "
            f"score the model response `{entry[json_key]}`"
            f" on a scale from 0 to 100, where 100 is the best score. "
            f"Respond with the integer number only."
        )
        score = query_model(prompt, model)
        try:
            scores.append(int(score))
        except ValueError:
            print(f"Count not convert score: {score}")
            continue
    return scores


scores = generate_model_scores(test_data, "model_response")
print(f"Number of scores: {len(scores)} of {len(test_data)}")
print(f"Average score: {sum(scores) / len(scores):.2f}\n")
# Number of scores: 110 of 110
# Average score: 50.32

通过打分，可以比较不同模型的性能，同时也可以用来调整方法重新训练模型，例如：

调整微调过程中的相关参数，例如学习率 learning rate，批量大小 batch siz，迭代次数 num of epochs 等；
增加训练数据集中的数据量，或者让内容覆盖更多的主题和风格；
使用不同的 prompt 或 instruction 格式，以便让模型更准确的理解意图；
使用更强大的预训练模型进行微调，以便能够理解更复杂的输入内容；

计算机

#机器学习

从零开始构建大模型

https://ccw1078.github.io/2025/03/10/从零开始构建大模型/

作者

ccw

发布于

2025年3月10日

许可协议

富足人生指南上一篇

金钱心理学下一篇