BERT Part 3: Tokenizer


    Table of Contents

    Basic meaning of tokenizer
    Tokenizers involved in BERT
        BasicTokenizer
        WordpieceTokenizer
        FullTokenizer
        PreTrainedTokenizer
    Relationship diagram
    Hands-on practice
    How to train
    Training your own Chinese tokenizer
    Summary
    References

    Basic meaning of tokenizer

    A tokenizer is a word segmenter. In BERT, however, it is not quite the same as Chinese word segmentation as we usually understand it, and the difference is not mainly the segmentation method itself: BERT basically uses greedy maximum matching.

    The biggest difference lies in how a "word" is understood and defined. In Chinese, the basic unit is essentially the character. In English, the unit is the subword: for example, "unwanted" is decomposed into ["un", "##want", "##ed"]. Think carefully about the advantages of this approach; it is the essence of the tokenizer.

    Tokenizers involved in BERT

    BasicTokenizer

    The main class is BasicTokenizer. It performs basic operations such as unicode conversion, punctuation splitting, lowercasing, splitting Chinese text into individual characters, and stripping accents, and finally returns an array of words (for Chinese, an array of characters).

    def tokenize(self, text):
        """Tokenizes a piece of text."""
        text = convert_to_unicode(text)
        text = self._clean_text(text)
        # This was added on November 1st, 2018 for the multilingual and Chinese
        # models. This is also applied to the English models now, but it doesn't
        # matter since the English models were not trained on any Chinese data
        # and generally don't have any Chinese data in them (there are Chinese
        # characters in the vocabulary because Wikipedia does have some Chinese
        # words in the English Wikipedia.).
        text = self._tokenize_chinese_chars(text)
        orig_tokens = whitespace_tokenize(text)
        split_tokens = []
        for token in orig_tokens:
            if self.do_lower_case:
                token = token.lower()
                token = self._run_strip_accents(token)
            split_tokens.extend(self._run_split_on_punc(token))
        output_tokens = whitespace_tokenize(" ".join(split_tokens))
        return output_tokens

    BasicTokenizer is the preprocessing step.
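
    A minimal usage sketch (assuming the same pre-v4 transformers version as the imports used later in this post, where BasicTokenizer is exposed in transformers.tokenization_bert):

    from transformers.tokenization_bert import BasicTokenizer

    basic = BasicTokenizer(do_lower_case=True)
    # English is lowercased and split on whitespace/punctuation;
    # every Chinese character becomes its own token.
    print(basic.tokenize("I LOVE 北京天安门!"))
    # expected (roughly): ['i', 'love', '北', '京', '天', '安', '门', '!']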

    WordpieceTokenizer

    The other, key one is WordpieceTokenizer, which splits tokens into subwords against the vocab.

    def tokenize(self, text):
        """Tokenizes a piece of text into its word pieces.

        This uses a greedy longest-match-first algorithm to perform tokenization
        using the given vocabulary.

        For example:
          input = "unaffable"
          output = ["un", "##aff", "##able"]

        Args:
          text: A single token or whitespace separated tokens. This should have
            already been passed through `BasicTokenizer`.

        Returns:
          A list of wordpiece tokens.
        """
        text = convert_to_unicode(text)
        output_tokens = []
        for token in whitespace_tokenize(text):
            chars = list(token)
            if len(chars) > self.max_input_chars_per_word:
                output_tokens.append(self.unk_token)
                continue

            is_bad = False
            start = 0
            sub_tokens = []
            while start < len(chars):
                end = len(chars)
                cur_substr = None
                # Look for the longest piece that is in the vocab; if nothing
                # matches, slide `end` backwards. Reading the code makes this concrete!
                while start < end:
                    substr = "".join(chars[start:end])
                    if start > 0:
                        substr = "##" + substr
                    if substr in self.vocab:
                        cur_substr = substr
                        break
                    end -= 1
                if cur_substr is None:
                    is_bad = True
                    break
                sub_tokens.append(cur_substr)
                start = end

            if is_bad:
                output_tokens.append(self.unk_token)
            else:
                output_tokens.extend(sub_tokens)
        return output_tokens
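
    A quick toy illustration of the greedy longest-match-first behavior; the mini vocab below is made up purely for this example:

    from transformers.tokenization_bert import WordpieceTokenizer

    toy_vocab = {"[UNK]", "un", "##aff", "##able"}
    wp = WordpieceTokenizer(vocab=toy_vocab, unk_token="[UNK]")

    print(wp.tokenize("unaffable"))  # ['un', '##aff', '##able']
    print(wp.tokenize("xyzzy"))      # ['[UNK]'] -- nothing in the toy vocab matches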

    FullTokenizer

    This one basically combines BasicTokenizer and WordpieceTokenizer to split text; it is used as the preprocessing step for BERT training. It essentially exposes just a tokenize method and does not have methods such as encode_plus.
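
    The gist of it, roughly as in the original google-research/bert tokenization.py (a sketch rather than a verbatim copy, reusing the helpers from transformers.tokenization_bert so it runs standalone):

    from transformers.tokenization_bert import (
        BasicTokenizer,
        WordpieceTokenizer,
        load_vocab,
    )

    class FullTokenizer(object):
        """Runs end-to-end tokenization: BasicTokenizer, then WordpieceTokenizer."""

        def __init__(self, vocab_file, do_lower_case=True):
            self.vocab = load_vocab(vocab_file)  # token -> id mapping
            self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
            self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token="[UNK]")

        def tokenize(self, text):
            # First split into "words" (whitespace / punctuation / single Chinese
            # characters), then split each word into wordpieces against the vocab.
            split_tokens = []
            for token in self.basic_tokenizer.tokenize(text):
                for sub_token in self.wordpiece_tokenizer.tokenize(token):
                    split_tokens.append(sub_token)
            return split_tokens

        def convert_tokens_to_ids(self, tokens):
            return [self.vocab[token] for token in tokens]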

    PreTrainedTokenizer

    This is the base class in huggingface transformers. It defines many shared methods (convert_ids_to_tokens and so on). The concrete tokenizers such as BertTokenizer and GPT2Tokenizer all inherit from PreTrainedTokenizer; the relationship diagram below shows the whole picture.
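
    A small check you can run yourself (it downloads the pretrained vocabs on first use):

    from transformers import BertTokenizer, GPT2Tokenizer, PreTrainedTokenizer

    bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
    gpt2_tok = GPT2Tokenizer.from_pretrained("gpt2")

    # Both concrete tokenizers inherit from the PreTrainedTokenizer base class,
    # so they share methods such as convert_tokens_to_ids / convert_ids_to_tokens.
    print(isinstance(bert_tok, PreTrainedTokenizer))   # True
    print(isinstance(gpt2_tok, PreTrainedTokenizer))   # True
    print(bert_tok.convert_ids_to_tokens([101, 102]))  # ['[CLS]', '[SEP]']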

    Relationship diagram

    Hands-on practice

    from transformers.tokenization_bert import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    print("Vocab size:", tokenizer.vocab_size)

    text = "the game has gone!unaffable I have a new GPU!"
    tokens = tokenizer.tokenize(text)
    print("English tokenization:", tokens)

    text = "我爱北京天安门,吢吣"
    tokens = tokenizer.tokenize(text)
    print("Chinese tokenization:", tokens)

    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    print("Token-to-id conversion:", input_ids)

    sen_code = tokenizer.encode_plus("i like you much", "but not him")
    print("Two-sentence encode:", sen_code)
    print("decode:", tokenizer.decode(sen_code['input_ids']))

    Output:

    Vocab size: 30522
    English tokenization: ['the', 'game', 'has', 'gone', '!', 'una', '##ffa', '##ble', 'i', 'have', 'a', 'new', 'gp', '##u', '!']
    Chinese tokenization: ['我', '[UNK]', '北', '京', '天', '安', '[UNK]', ',', '[UNK]', '[UNK]']
    Token-to-id conversion: [1855, 100, 1781, 1755, 1811, 1820, 100, 1989, 100, 100]
    Two-sentence encode: {'input_ids': [101, 1045, 2066, 2017, 2172, 102, 2021, 2025, 2032, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
    decode: [CLS] i like you much [SEP] but not him [SEP]

    Reading the code or actually running it once makes the theory much easier to absorb afterwards. Hands-on practice is the key; it is where the ideas become concrete.

    Of course, you can also experiment with BertWordPieceTokenizer on its own (note that it is provided by the tokenizers package, not transformers):

    # pip install tokenizers
    from tokenizers import BertWordPieceTokenizer

    # initialize tokenizer
    tokenizer = BertWordPieceTokenizer(
        vocab_file="vocab.txt",
        unk_token="[UNK]",
        sep_token="[SEP]",
        cls_token="[CLS]",
        pad_token="[PAD]",
        mask_token="[MASK]",
        clean_text=True,
        handle_chinese_chars=True,
        strip_accents=True,
        lowercase=True,
        wordpieces_prefix="##",
    )

    # sample sentence
    sentence = "Language is a thing of beauty. But mastering a new language from scratch is quite a daunting prospect."

    # tokenize the sample sentence
    encoded_output = tokenizer.encode(sentence)
    print(encoded_output)
    print(encoded_output.tokens)

    How to train

    Training is essentially the process of building the vocab. The BPE algorithm is also fairly easy to understand: keep selecting the most common pair and adding its merge to the vocabulary. Why the most common? Because it covers the largest share of the corpus.

    Here is a BPE example; a runnable sketch of the merge loop follows it.

    Initial word counts:        ('hug', 10), ('pug', 5), ('pun', 12), ('bun', 4), ('hugs', 5)
    Start from characters:      ('h' 'u' 'g', 10), ('p' 'u' 'g', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'u' 'g' 's', 5)
    Merge the most frequent pair 'u'+'g' -> 'ug':  ('h' 'ug', 10), ('p' 'ug', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'ug' 's', 5)
    The next merges, 'u'+'n' -> 'un' and 'h'+'ug' -> 'hug', give the vocab: ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']
    The original words are then represented as: ('hug', 10), ('p' 'ug', 5), ('p' 'un', 12), ('b' 'un', 4), ('hug' 's', 5)
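
    To make the merge loop concrete, here is a minimal self-contained sketch of BPE training on exactly this toy corpus (an illustration of plain BPE, not the code of any particular library):

    from collections import Counter

    # Toy corpus from the example above: word -> frequency
    word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

    # Start with each word split into characters
    splits = {w: list(w) for w in word_freqs}

    def count_pairs(splits, word_freqs):
        """Count each adjacent symbol pair, weighted by word frequency."""
        pairs = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs

    def merge_pair(pair, splits):
        """Replace every occurrence of the chosen pair with its concatenation."""
        a, b = pair
        for word, symbols in splits.items():
            i, merged = 0, []
            while i < len(symbols):
                if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[word] = merged
        return splits

    vocab = sorted({c for w in word_freqs for c in w})  # initial character vocab
    for _ in range(3):  # three merges: 'ug', 'un', 'hug'
        pairs = count_pairs(splits, word_freqs)
        best = max(pairs, key=pairs.get)
        splits = merge_pair(best, splits)
        vocab.append(best[0] + best[1])

    print(vocab)   # ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']
    print(splits)  # {'hug': ['hug'], 'pug': ['p', 'ug'], 'pun': ['p', 'un'], 'bun': ['b', 'un'], 'hugs': ['hug', 's']}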

    Training your own Chinese tokenizer

    def train_cn_tokenizer():
        # ! pip install tokenizers
        from pathlib import Path
        from tokenizers import ByteLevelBPETokenizer

        paths = [str(x) for x in Path("zho-cn_web_2015_10K").glob("**/*.txt")]

        # Initialize a tokenizer
        tokenizer = ByteLevelBPETokenizer()

        # Customize training
        tokenizer.train(files=paths, vocab_size=52_000, min_frequency=3, special_tokens=[
            "<s>",
            "<pad>",
            "</s>",
            "<unk>",
            "<mask>",
        ])

        # Save files to disk
        tokenizer.save(".", "zh-tokenizer-train")

    I strongly recommend building a vocab tailored to your own domain, together with a matching model. The final result:

    {"<s>":0,"<pad>":1,"</s>":2,"<unk>":3,"<mask>":4,"!":5,"\"":6,"#":7,"$":8,"%":9,"&":10,"'":11,"(":12,")":13,"*":14,"+":15,",":16,"-":17,".":18,"/":19,"0":20,"1":21,"2":22,"3":23,"4":24,"5":25,"6":26,"7":27,"8":28,"9":29,":":30,";":31,"<":32,"=":33,">":34,"?":35,"@":36,"A":37,"B":38,"C":39,"D":40,"E":41,"F":42,"G":43,"H":44,"I":45,"J":46,"K":47,"L":48,"M":49,"N":50,"O":51,"P":52,"Q":53,"R":54,"S":55,"T":56,"U":57,"V":58,"W":59,"X":60,"Y":61,"Z":62,"[":63,"\\":64,"]":65,"^":66,"_":67,"`":68,"a":69,"b":70,"c":71,"d":72,"e":73,"f":74,"g":75,"h":76,"i":77,"j":78,"k":79,"l":80,"m":81,"n":82,"o":83,"p":84,"q":85,"r":86,"s":87,"t":88,"u":89,"v":90,"w":91,"x":92,"y":93,"z":94,"{":95,"|":96,"}":97,"~":98,"¡":99,"¢":100,"£":101,"¤":102,"¥":103,"¦":104,"§":105,"¨":106,"©":107,"ª":108,"«":109,"¬":110,"®":111,"¯":112,"°":113,"±":114,"²":115,"³":116,"´":117,"µ":118,"¶":119,"·":120,"¸":121,"¹":122,"º":123,"»":124,"¼":125,"½":126,"¾":127,"¿":128,"À":129,"Á":130,"Â":131,"Ã":132,"Ä":133,"Å":134,"Æ":135, ...

    Summary

    Combine theory with practice, and type the code yourself to understand it deeply. The essence of a tokenizer is segmentation: extract meaningful wordpieces while keeping them as few as possible, i.e. describe an unlimited number of combinations with as few information units as possible. Get the inheritance relationships among the handful of classes straight; for the details, keep reading the original classes. A wordpiece is a smaller unit than a word: what does that buy you? Does it solve OOV? Worth thinking through again.

    References

    https://albertauyeung.github.io/2020/06/19/bert-tokenization.html
    https://spacy.io/usage/spacy-101
    https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer
    https://zhuanlan.zhihu.com/p/160813500
    https://github.com/google/sentencepiece
    https://huggingface.co/transformers/tokenizer_summary.html
    https://huggingface.co/blog/how-to-train