Stanford CS336 Study Notes
Purpose of the Course
To truly innovate in the AI industry, you have to understand how things work under the hood, not just tune parameters and call APIs.
The big AI labs are increasingly monopolizing and closing off state-of-the-art AI technology; learning the underlying principles is the only way to reproduce their results.
In this course, students hand-train a model with fewer than 1 B parameters across the full pipeline: data crawling → cleaning → tokenizer → Transformer → training → evaluation. The hands-on work surfaces engineering details such as scheduling, parallelism, mixed precision, and memory savings, pulling the network architectures we usually invoke as abstractions back into territory we can control and modify ourselves.
Byte-Pair Encoding (BPE) Tokenizer
The Unicode Standard
The Unicode standard assigns a unique numeric code point to every character in all of the world's writing systems.
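A quick REPL check of this mapping (ord and chr are Python's built-in conversions between characters and code points):

```python
>>> ord("a")      # character -> code point
97
>>> ord("你")     # CJK characters get larger code points
20320
>>> chr(20320)    # code point -> character
'你'
```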
Problem (unicode1): Understanding Unicode (1 point)
- (a) What Unicode character does chr(0) return?
'\x00'
- (b) How does this character's string representation (repr()) differ from its printed representation?
__repr__() returns a representation intended for program debugging, while print produces human-readable output. For example, chr(0) shows nothing useful when printed, but repr() displays it as the hex escape '\x00'.
- (c) What happens when this character occurs in text? It may be helpful to play around with the following in your Python interpreter and see if it matches your expectations:
```python
>>> chr(0)
>>> print(chr(0))
>>> "this is a test" + chr(0) + "string"
>>> print("this is a test" + chr(0) + "string")
```
chr(0) returns an invisible character: print() does emit it, but nothing visible appears in the output, while the character is still carried along inside the string, as the outputs below show.
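Filling in the outputs (the repr lines are deterministic; what print shows for the NUL byte depends on the terminal, here it is simply invisible):

```python
>>> chr(0)
'\x00'
>>> print(chr(0))        # prints the NUL character: nothing visible

>>> "this is a test" + chr(0) + "string"
'this is a test\x00string'
>>> print("this is a test" + chr(0) + "string")
this is a teststring
```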
Unicode Encodings
Encodings such as UTF-8 convert natural-language text into byte strings that computers can process.
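A quick comparison of encoded lengths (the byte counts include the BOM that Python prepends for UTF-16/UTF-32):

```python
>>> [len("hello".encode(enc)) for enc in ("utf-8", "utf-16", "utf-32")]
[5, 12, 24]
>>> [len("你好".encode(enc)) for enc in ("utf-8", "utf-16", "utf-32")]
[6, 6, 12]
```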
Problem (unicode2): Unicode Encodings (3 points)
- (a) What are some reasons to prefer training our tokenizer on UTF-8 encoded bytes, rather than UTF-16 or UTF-32? It may be helpful to compare the output of these encodings for various input strings.
Most characters in typical training text are ASCII (e.g., English letters), which UTF-8 encodes in a single byte, so UTF-8 byte sequences are much shorter than their UTF-16 or UTF-32 counterparts.
- (b) Consider the following (incorrect) function, which is intended to decode a UTF-8 byte string into a Unicode string. Why is this function incorrect? Provide an example of an input byte string that yields incorrect results.
```python
def decode_utf8_bytes_to_str_wrong(bytestring: bytes):
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])

>>> decode_utf8_bytes_to_str_wrong("hello".encode("utf-8"))
'hello'
```
UTF-8 is a variable-length encoding: a multi-byte character cannot be decoded one byte at a time, so the function fails on any non-ASCII input.

```python
>>> decode_utf8_bytes_to_str_wrong("你好".encode("utf-8"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in decode_utf8_bytes_to_str_wrong
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 0: unexpected end of data
```
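For contrast, a minimal correct version simply decodes the whole byte string in one call, so the decoder can consume multi-byte sequences as units:

```python
def decode_utf8_bytes_to_str(bytestring: bytes) -> str:
    # Decoding the full byte string at once lets the UTF-8 decoder
    # group continuation bytes with their lead byte.
    return bytestring.decode("utf-8")

>>> decode_utf8_bytes_to_str("你好".encode("utf-8"))
'你好'
```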
- (c) Give a two byte sequence that does not decode to any Unicode character(s).
0xFF 0xFF: the byte 0xFF can never appear in valid UTF-8, so b"\xff\xff" does not decode to any character.
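Verified in the REPL: 0xFF is not a legal lead byte in UTF-8, so decoding fails immediately:

```python
>>> b"\xff\xff".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
```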
Subword Tokenization
Byte-level tokenization completely solves the out-of-vocabulary (OOV) problem, but it substantially lengthens the input sequence, which increases compute cost and stretches long-range dependencies, making models slower to train and harder to fit.
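The blow-up is easy to quantify: every non-ASCII character becomes several byte-level tokens, e.g.:

```python
>>> len("你好世界")                   # 4 characters
4
>>> len("你好世界".encode("utf-8"))   # 12 byte-level tokens
12
```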
Subword tokenization is a middle ground between word-level and byte-level tokenization: it trades a larger vocabulary for shorter sequences. Byte-level tokenization starts from a base vocabulary of only 256 byte values; subword tokenization merges frequently occurring byte sequences (such as the) into single tokens, shortening the sequence. BPE (Byte-Pair Encoding) builds these subword units by repeatedly merging the most frequent adjacent pair, so common words and fragments gradually become single tokens. A BPE-based subword tokenizer keeps byte-level tokenization's robust OOV handling while markedly reducing sequence length; the process of building its vocabulary is called training the BPE tokenizer.
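To make the merge idea concrete, here is a toy sketch of a single BPE merge step on the classic example string aaabdaaabac (pair counting plus greedy left-to-right replacement; the real training loop in the next section repeats this until the target vocabulary size is reached):

```python
from collections import Counter

tokens = list("aaabdaaabac")          # toy corpus as single-character tokens

# Count adjacent pairs and pick the most frequent one.
pairs = Counter(zip(tokens, tokens[1:]))
best = max(pairs, key=pairs.get)      # ('a', 'a'), which occurs 4 times

# Merge every non-overlapping occurrence of the best pair into one token.
merged, i = [], 0
while i < len(tokens):
    if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
        merged.append(tokens[i] + tokens[i + 1])
        i += 2
    else:
        merged.append(tokens[i])
        i += 1

print(merged)  # ['aa', 'a', 'b', 'd', 'aa', 'a', 'b', 'a', 'c']
```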
BPE Tokenizer Training
Training a BPE tokenizer involves three steps:
Vocabulary initialization
The vocabulary is a mapping from token to integer ID. Before training it is trivially simple: each of the 256 possible byte values maps to its corresponding number, e.g. 'a' (0x61) -> 97.
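A minimal sketch of this initial table (the dict direction, token bytes -> ID, follows the note above; the variable name is my own, and real implementations often also store the inverse mapping):

```python
# Initial BPE vocabulary: one entry for each of the 256 byte values.
vocab: dict[bytes, int] = {bytes([i]): i for i in range(256)}

assert vocab[b"a"] == 97    # 'a' (0x61) -> 97
assert len(vocab) == 256
```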
Pre-tokenization

