NLP Basics: Edit Distance + Spell Correction in Practice

Tech · 2026-03-30


1. Edit distance
    1.1 Computing the edit distance
    1.2 Output
    1.3 Generating strings at a given edit distance
        1.3.1 Strings at edit distance 1 from the target
        1.3.2 Output
        1.3.3 Strings at edit distance 2 from the target
        1.3.4 Output
2. Spell correction
    2.1 Overall approach
    2.2 Loading the vocabulary
    2.3 Generating the candidate set
    2.4 Building a bigram model
        2.4.1 Debugging the corpus download
        2.4.2 Code
    2.5 Estimating typo probabilities from user logs
    2.6 Correcting the test data
    2.7 Sample output
3. Summary

1. Edit distance

1.1 Computing the edit distance

We compute the edit distance between two strings with dynamic programming (DP).

```python
def edit_distance(str1, str2):
    """
    :param str1: input string
    :param str2: another input string
    :return: the edit distance between the two strings
    """
    length1 = len(str1)
    length2 = len(str2)
    # Build a (length1+1) x (length2+1) matrix
    matrix = [[0 for j in range(length2 + 1)] for i in range(length1 + 1)]
    for i in range(length1 + 1):
        for j in range(length2 + 1):
            if i == 0:
                matrix[i][j] = j
            elif j == 0:
                matrix[i][j] = i
            elif str1[i - 1] == str2[j - 1]:
                # Same character: no extra operation on top of (i-1, j-1)
                matrix[i][j] = matrix[i - 1][j - 1]
            else:
                matrix[i][j] = 1 + min(matrix[i - 1][j],      # deletion
                                       matrix[i][j - 1],      # insertion
                                       matrix[i - 1][j - 1])  # substitution
    return matrix[length1][length2]

# test
str1 = "there"
str2 = "therir"
print(edit_distance(str1, str2))
```

1.2 Output

D:\Anaconda\python.exe "E:/project/spell correction/edit_distance.py"
2
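To make the meaning of the matrix cells concrete, here is a standalone sketch (not part of the original file) that returns the whole DP table for the test pair; cell [i][j] holds the edit distance between the first i characters of str1 and the first j characters of str2:

```python
def edit_distance_matrix(s1, s2):
    """Return the full DP table; cell [i][j] is the edit
    distance between s1[:i] and s2[:j]."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # deleting i characters
    for j in range(n + 1):
        d[0][j] = j          # inserting j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = 1 + min(d[i - 1][j],      # deletion
                                  d[i][j - 1],      # insertion
                                  d[i - 1][j - 1])  # substitution
    return d

table = edit_distance_matrix("there", "therir")
for row in table:
    print(row)
print("distance:", table[-1][-1])  # 2
```

The bottom-right cell is the full edit distance; the first row and column encode the pure-insert and pure-delete base cases.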

1.3 Generating strings at a given edit distance

1.3.1 Strings at edit distance 1 from the target

```python
def generate_edit_distance_one(word):
    """
    :param word: input string
    :return: the set of strings at edit distance 1 from the input
    """
    letters = 'abcdefghijklmnopqrstuvwxyz'
    length = len(word)
    splits = [(word[:i], word[i:]) for i in range(length + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    inserts = [left + c + right for left, right in splits for c in letters]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    return set(deletes + inserts + replaces)

print(generate_edit_distance_one("apple"))
```

1.3.2 Output

{'apptle', 'aqpple', 'apyle', 'gpple', 'applek', 'ypple', 'ajple', 'apnple', 'appfle', 'ahple', 'applre', 'pple', 'apfle', 'aepple', 'vapple', 'appce', 'applie', 'appkle', 'appleq', 'napple', 'appfe', 'eapple', 'appre', 'agple', 'aprple', 'applf', 'gapple', 'adple', 'apjple', 'appxle', ...}

(output truncated; the full set runs to several hundred entries and also contains 'apple' itself, since a replace may reuse the original letter)

1.3.3 Strings at edit distance 2 from the target

```python
def generate_edit_distance_two(word):
    return set(y for x in generate_edit_distance_one(word)
                 for y in generate_edit_distance_one(x))

print(len(generate_edit_distance_two("apple")))
```

1.3.4 Output

D:\Anaconda\python.exe "E:/project/spell correction/edit_distance.py"
35334
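Naive distance-2 generation explodes to tens of thousands of strings. A common refinement (used in Norvig's classic spelling corrector) is to filter against the vocabulary while generating, so only real words are kept. A minimal sketch, with `generate_edit_distance_one` repeated and a tiny made-up `vocab` so it runs standalone:

```python
def generate_edit_distance_one(word):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    inserts = [left + c + right for left, right in splits for c in letters]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    return set(deletes + inserts + replaces)

def known_edit_distance_two(word, vocab):
    """Keep only the distance-2 candidates that are real words."""
    return set(e2 for e1 in generate_edit_distance_one(word)
                  for e2 in generate_edit_distance_one(e1)
                  if e2 in vocab)

vocab = {"apple", "apply", "ample", "maple", "apples"}
print(sorted(known_edit_distance_two("aple", vocab)))
# ['ample', 'apple', 'apples', 'apply', 'maple']
```

Note that the result also includes words within edit distance 1 (or 0), because a replace may reuse the original letter; that is harmless here, since all of them are valid candidates anyway.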

2. Spell correction

2.1 Overall approach

For a user's input string, the obvious approach is to walk through every word in the vocabulary, compute its edit distance to the input, and return the closest word as the correction. That is far too expensive, because it visits the whole vocabulary. An alternative is to generate, from the input string, the set of strings within a small edit distance of it, and then pick the most likely string in that set as the correction. How do we find the most likely candidate? First, delete any generated string that is not in the vocabulary. Then score the survivors with Bayes' rule: p(correct|mistake) is proportional to p(correct) * p(mistake|correct), where mistake is the word the user typed and correct is a word from the candidate set. What remains is to estimate p(correct) and p(mistake|correct). p(correct) can be computed from a large corpus by counting word frequencies with a unigram, bigram, or similar language model. p(mistake|correct) is the probability that a user who meant correct typed mistake, which can be estimated from user logs.
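As a worked toy example of the scoring rule (all numbers below are made up for illustration): suppose the user typed "rian" and the surviving candidates are "rain" and "ruin".

```python
import math

mistake = "rian"
# candidate: (p(candidate) from the language model,
#             p(mistake|candidate) from the error logs)
candidates = {
    "rain": (0.004, 0.10),
    "ruin": (0.0005, 0.40),
}

# score = log p(correct) + log p(mistake|correct); the highest score wins
scores = {c: math.log(p_c) + math.log(p_m)
          for c, (p_c, p_m) in candidates.items()}
best = max(scores, key=scores.get)
print(best)  # rain: 0.004*0.10 = 4e-4 beats 0.0005*0.40 = 2e-4
```

Working in log space avoids underflow when the probabilities get small, which is why the real correction loop below sums logs rather than multiplying raw probabilities.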

2.2 Loading the vocabulary

```python
import numpy as np
from nltk.corpus import reuters
from edit_distance import generate_edit_distance_one

# Load the vocabulary
vocab = []
with open("././data/vocab.txt", 'r') as f:
    for line in f:
        vocab.append(line.strip())
vocab = set(vocab)  # deduplicate
print("The number of vocabulary entries: {}".format(len(vocab)))
```

2.3 Generating the candidate set

```python
# Generate the full candidate set
def generate_candidates(word):
    """
    word: the given (misspelled) input
    Returns all valid candidates.
    """
    # Strings at edit distance 1
    candidates = generate_edit_distance_one(word)
    # Drop anything that is not in the vocabulary
    return [w for w in candidates if w in vocab]
```
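When no distance-1 candidate survives the vocabulary filter, a natural extension is to fall back to edit distance 2. A sketch, with `generate_edit_distance_one` repeated, the vocabulary passed in explicitly, and a made-up two-word `vocab` so the snippet runs standalone:

```python
def generate_edit_distance_one(word):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    inserts = [left + c + right for left, right in splits for c in letters]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    return set(deletes + inserts + replaces)

def generate_candidates(word, vocab):
    """Distance-1 candidates first; fall back to distance 2 only if none survive."""
    one = [w for w in generate_edit_distance_one(word) if w in vocab]
    if one:
        return one
    two = set(y for x in generate_edit_distance_one(word)
                for y in generate_edit_distance_one(x))
    return [w for w in two if w in vocab]

vocab = {"shipment", "delivery"}
print(generate_candidates("shepmint", vocab))  # ['shipment']
```

"shepmint" is two substitutions away from "shipment", so the distance-1 pass finds nothing and the distance-2 fallback recovers it.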

2.4 Building a bigram model

Build a bigram model from the corpus, in preparation for computing p(correct).

2.4.1 Debugging the corpus download

```python
import nltk
nltk.download('reuters')
```

This step fails with an error (screenshot omitted). Workaround: download the reuters corpus manually; for a download link, see the reference blog post on corpus downloads (Baidu Cloud). If you leave the downloaded folders as-is, you will still get an error saying the punkt package cannot be found; in that case:

1. Unzip punkt.zip (under the corpora folder) into the tokenizers folder.
2. Create a PY3 folder under \tokenizers\punkt.
3. Copy english.pickle into the PY3 folder.

After these steps, if an AttributeError about seek is still raised, it can simply be ignored: wrap the code in try/except and it will run to completion.
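A small convenience sketch (not part of the original post) for checking whether the needed NLTK resources are already in place before calling nltk.download; the resource paths follow the standard nltk_data layout:

```python
import nltk

def installed(path):
    """Return True if an NLTK resource is already present under nltk_data."""
    try:
        nltk.data.find(path)
        return True
    except LookupError:
        return False

for name, path in [("reuters", "corpora/reuters"),
                   ("punkt", "tokenizers/punkt")]:
    if not installed(path):
        print("missing:", name, "- run nltk.download or install it manually as above")
```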

2.4.2 Code

```python
# Read the corpus
categories = reuters.categories()
corpus = reuters.sents(categories=categories)

# Build the language model: bigram
term_count = {}
bigram_count = {}
try:
    for doc in corpus:
        doc = ['<s>'] + doc
        for i in range(0, len(doc) - 1):
            # bigram: [i, i+1]
            term = doc[i]
            bigram = ' '.join(doc[i:i + 2])
            if term in term_count:
                term_count[term] += 1
            else:
                term_count[term] = 1
            if bigram in bigram_count:
                bigram_count[bigram] += 1
            else:
                bigram_count[bigram] = 1
except Exception:
    # Work around the AttributeError about seek mentioned above
    pass
```
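These counts later feed an add-one (Laplace) smoothed bigram probability. The same computation can be exercised on a toy corpus (the two sentences below are made up) without downloading reuters:

```python
# Toy stand-in for the reuters sentences; the counting mirrors the loop above.
corpus = [["the", "market", "rose"], ["the", "market", "fell"]]
term_count, bigram_count = {}, {}
for doc in corpus:
    doc = ['<s>'] + doc
    for i in range(len(doc) - 1):
        term_count[doc[i]] = term_count.get(doc[i], 0) + 1
        key = doc[i] + ' ' + doc[i + 1]
        bigram_count[key] = bigram_count.get(key, 0) + 1

V = len(term_count)

def bigram_prob(prev, word):
    """Add-one smoothed p(word | prev), as used in the correction loop."""
    return (bigram_count.get(prev + ' ' + word, 0) + 1) / (term_count.get(prev, 0) + V)

print(bigram_prob("the", "market"))   # seen bigram: (2+1)/(2+3) = 0.6
print(bigram_prob("market", "the"))   # unseen bigram: (0+1)/(2+3) = 0.2
```

The smoothing guarantees that an unseen bigram still gets a small nonzero probability instead of making the whole log-score minus infinity.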

2.5 Estimating typo probabilities from user logs

This prepares the estimate of p(mistake|correct).

```python
channel_prob = {}
for line in open('././data/spell-errors.txt'):
    items = line.split(":")
    correct = items[0].strip()
    mistakes = [item.strip() for item in items[1].strip().split(",")]
    channel_prob[correct] = {}
    for mis in mistakes:
        channel_prob[correct][mis] = 1.0 / len(mistakes)
```
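The spell-errors.txt lines have the form `correct: mistake1, mistake2, ...`. A few lines in that format are inlined below (chosen to match the format, not the actual file contents) so the parsing and the uniform 1/len(mistakes) assignment can be checked without the file:

```python
# Inline sample in the spell-errors.txt format: correct: mistake1, mistake2, ...
sample = """raining: rainning, raning
writings: writtings
believe: beleive, belive, beleave"""

channel_prob = {}
for line in sample.splitlines():
    correct, _, mistakes_part = line.partition(":")
    correct = correct.strip()
    mistakes = [m.strip() for m in mistakes_part.split(",")]
    # Each recorded typo for `correct` gets an equal share of the probability mass
    channel_prob[correct] = {m: 1.0 / len(mistakes) for m in mistakes}

print(channel_prob["raining"])  # {'rainning': 0.5, 'raning': 0.5}
```

Splitting the mass uniformly is a simplification; with real logs one could weight each mistake by how often it was actually observed.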

2.6 Correcting the test data

```python
V = len(term_count.keys())

file = open("././data/testdata.txt", 'r')
for line in file:
    items = line.rstrip().split('\t')
    line = items[2].split()  # e.g. line = ["I", "like", "playing"]
    for word in line:
        if word not in vocab:
            # word needs to be replaced by a correct word
            # Step 1: generate all valid candidates
            candidates = generate_candidates(word)
            # One option: if candidates == [], generate more of them,
            # e.g. everything within edit distance 2
            # TODO: generate a larger candidate set in that case
            if len(candidates) < 1:
                continue  # not recommended (this is not right)
            probs = []
            # For each candidate, compute its score
            # score = p(correct) * p(mistake|correct)
            #       = log p(correct) + log p(mistake|correct)
            # and return the candidate with the highest score.
            for candi in candidates:
                prob = 0
                # a. compute p(mistake|correct)
                if candi in channel_prob and word in channel_prob[candi]:
                    prob += np.log(channel_prob[candi][word])
                else:
                    prob += np.log(0.0001)
                # b. compute p(correct) with the bigram model
                idx = line.index(word)
                if idx > 0:
                    # bigram of (previous word, candidate)
                    two_word = line[idx - 1] + " " + candi
                    if two_word in bigram_count and line[idx - 1] in term_count:
                        prob += np.log((bigram_count[two_word] + 1) /
                                       (term_count[line[idx - 1]] + V))
                    else:
                        prob += np.log(1.0 / V)
                else:
                    # no previous word: fall back to a unigram probability
                    if candi in term_count:
                        prob += np.log((term_count[candi] + 1) / V)
                    else:
                        prob += np.log(1.0 / V)
                probs.append(prob)
            max_idx = probs.index(max(probs))
            print(word, candidates[max_idx])
```

2.7 Sample output

Format: input word, corrected word

Amerrican American
participation. participation
althlugh although
disappointing. disappointing
tonnes, tonnes
FOB, FOB
said. said
lots, lots
bilion billion
rupiah. rupiah
broers brokers
traders. traders
shepment shipment
delivery. delivery
mine, mined
project, projects
unit. unit
Rivir River
Mt. Mt
Bundey, Bundey
mine, mined
plant, plants
produc product
mid-1988. mid-1988
abouct about
tonnes. tonnes
Sumitomo, Sumitomo
yen, yen
Sogo, Sogo
small, small
October. October
link-up, link-up
position. position
We'll Well
years, years

3. Summary

Using dynamic programming to compute the edit distance had me lost at first; the blog post 《编辑距离算法详解:Levenshtein Distance算法——动态规划问题》 cleared it up. The key is to pin down exactly what the value of each matrix cell means, and what that meaning is specifically for edit distance; after that, filling in the table with the recurrence is straightforward. Dynamic programming again; when I get the chance, I should finally write that post of small DP exercises I keep putting off. I always thought spell correction was mysterious, but after the lectures it turns out it can be done with traditional methods plus a large corpus, and the results look decent. Here a bigram model computes the word probabilities; left to myself I would have been lazy and used a unigram model instead. I did not try that, but in theory it should be worse than the bigram.