利用动态规划(DP算法)计算两个字符串的编辑距离
def edit_distance(str1, str2): """ :param str1:input string :param str2:another input string :return : the edit distance between two strings """ length1 = len(str1) length2 = len(str2) matri = [[0 for i in range(length2 + 1)] for j in range(length1 + 1)] #创建(lenth1+1)*(lenth2+1)的矩阵 for i in range(length1 + 1): for j in range(length2 + 1): if i == 0: matri[i][j] = j if j == 0: matri[i][j] = i elif str1[i-1] == str2[j-1]:#如果此时的字符相同,则在i-1 ---> j-1 的基础上不需要操作,操作数不增加 matri[i][j] = matri[i-1][j-1] else: matri[i][j] = 1 + min(matri[i-1][j],# 删除操作 matri[i][j-1],# 插入操作 matri[i-1][j-1])#替换操作 return matri[length1][length2] #test str1 = "there" str2 = "therir" print(edit_distance(str1, str2))D:\Anaconda\python.exe “E:/project/spell correction/edit_distance.py” 2
{‘apptle’, ‘aqpple’, ‘apyle’, ‘gpple’, ‘applek’, ‘ypple’, ‘ajple’, ‘apnple’, ‘appfle’, ‘ahple’, ‘applre’, ‘pple’, ‘apfle’, ‘aepple’, ‘vapple’, ‘appce’, ‘applie’, ‘appkle’, ‘appleq’, ‘napple’, ‘appfe’, ‘eapple’, ‘appre’, ‘agple’, ‘aprple’, ‘applf’, ‘gapple’, ‘adple’, ‘apjple’, ‘appxle’, ‘avpple’, ‘applne’, ‘wapple’, ‘applpe’, ‘applwe’, ‘xapple’, ‘npple’, ‘apsle’, ‘appleu’, ‘appple’, ‘applz’, ‘apgle’, ‘apzle’, ‘azple’, ‘afple’, ‘applme’, ‘appmle’, ‘applel’, ‘applxe’, ‘applep’, ‘upple’, ‘xpple’, ‘aphle’, ‘kapple’, ‘appke’, ‘applo’, ‘applk’, ‘apppe’, ‘appve’, ‘anple’, ‘tapple’, ‘appxe’, ‘awpple’, ‘apiple’, ‘appne’, ‘jpple’, ‘aqple’, ‘applge’, ‘applfe’, ‘vpple’, ‘rpple’, ‘applef’, ‘atpple’, ‘apole’, ‘apople’, ‘aopple’, ‘happle’, ‘aptple’, ‘applqe’, ‘lpple’, ‘appee’, ‘apprle’, ‘aupple’, ‘applg’, ‘axple’, ‘epple’, ‘aeple’, ‘aople’, ‘apphle’, ‘apqple’, ‘applu’, ‘applje’, ‘apaple’, ‘azpple’, ‘ipple’, ‘apqle’, ‘akple’, ‘apkple’, ‘apkle’, ‘appae’, ‘apale’, ‘apfple’, ‘rapple’, ‘applle’, ‘appli’, ‘acpple’, ‘apwle’, ‘aple’, ‘appcle’, ‘applee’, ‘applce’, ‘axpple’, ‘applx’, ‘anpple’, ‘opple’, ‘dapple’, ‘adpple’, ‘appyle’, ‘appge’, ‘appwe’, ‘mapple’, ‘appwle’, ‘applv’, ‘appble’, ‘zpple’, ‘kpple’, ‘applze’, ‘appse’, ‘appled’, ‘fapple’, ‘appvle’, ‘applve’, ‘apvple’, ‘arpple’, ‘appte’, ‘mpple’, ‘applm’, ‘japple’, ‘appoe’, ‘apdple’, ‘appule’, ‘appe’, ‘zapple’, ‘apgple’, ‘ampple’, ‘aprle’, ‘uapple’, ‘qpple’, ‘aptle’, ‘spple’, ‘qapple’, ‘appld’, ‘ahpple’, ‘apnle’, ‘apeple’, ‘appze’, ‘asple’, ‘appile’, ‘applq’, ‘apples’, ‘apzple’, ‘aplple’, ‘appje’, ‘yapple’, ‘apdle’, ‘cpple’, ‘appie’, ‘appde’, ‘applt’, ‘apphe’, ‘wpple’, ‘appdle’, ‘fpple’, ‘applr’, ‘oapple’, ‘appsle’, ‘apxple’, ‘appjle’, ‘appleg’, ‘applde’, ‘applem’, ‘aapple’, ‘appley’, ‘apjle’, ‘appbe’, ‘appme’, ‘appln’, ‘applej’, ‘applw’, ‘applke’, ‘apyple’, ‘applev’, ‘apcle’, ‘applp’, ‘appls’, ‘appleo’, ‘apmle’, ‘applez’, ‘applue’, ‘awple’, ‘applj’, ‘applb’, ‘tpple’, ‘atple’, ‘acple’, ‘applc’, ‘arple’, ‘appye’, ‘abpple’, ‘aphple’, ‘apsple’, ‘applye’, ‘agpple’, ‘applen’, ‘auple’, ‘ppple’, ‘avple’, ‘aypple’, ‘appleh’, ‘papple’, ‘applex’, ‘alpple’, ‘apploe’, ‘aspple’, ‘appzle’, ‘appgle’, ‘sapple’, ‘dpple’, ‘ajpple’, ‘apmple’, ‘appole’, ‘bpple’, ‘apble’, ‘applew’, ‘akpple’, ‘applte’, ‘ayple’, ‘appue’, ‘bapple’, ‘applse’, ‘appleb’, ‘apuple’, ‘apile’, ‘appqle’, ‘lapple’, ‘applea’, ‘appl’, ‘applei’, ‘applbe’, ‘aiple’, ‘alple’, ‘appla’, ‘abple’, ‘apwple’, ‘capple’, ‘appele’, ‘applhe’, ‘applae’, ‘appler’, ‘aipple’, ‘applh’, ‘apxle’, ‘apple’, ‘afpple’, ‘apule’, ‘applet’, ‘apcple’, ‘hpple’, ‘apbple’, ‘apele’, ‘appll’, ‘apply’, ‘appnle’, ‘aaple’, ‘ample’, ‘aplle’, ‘iapple’, ‘apvle’, ‘applec’, ‘appqe’, ‘appale’}
D:\Anaconda\python.exe “E:/project/spell correction/edit_distance.py” 35334
根据语料得到bigram模型,从而为计算p(correct)做好准备。
会报如下错: 解决办法:手动下载reuters语料库,下载链接参考:语料库下载参考博客(百度云下载)。下载后如果不对文件夹进行调整,依然会报找不到punkt库的错,此时需要:
将corpora文件夹下的punkt.zip压缩文件,解压至tokenizers文件夹下 在\tokenizers\punkt文件夹下创建PY3文件夹 将english.pickle文件复制至PY3文件夹下 完成以上步骤后,若报Attribute Error seek的错,直接忽略,采用try except 运行代码即可,可以运行出结果的。为计算p(mistake|correct)做好准备
channel_prob = {} for line in open('././data/spell-errors.txt'): items = line.split(":") correct = items[0].strip() mistakes = [item.strip() for item in items[1].strip().split(",")] channel_prob[correct] = {} for mis in mistakes: channel_prob[correct][mis] = 1.0/len(mistakes)格式为:输入词 纠正词
Amerrican American participation. participation althlugh although disappointing. disappointing tonnes, tonnes FOB, FOB said. said lots, lots bilion billion rupiah. rupiah broers brokers traders. traders shepment shipment delivery. delivery mine, mined project, projects unit. unit Rivir River Mt. Mt Bundey, Bundey mine, mined plant, plants produc product mid-1988. mid-1988 abouct about tonnes. tonnes Sumitomo, Sumitomo yen, yen Sogo, Sogo small, small October. October link-up, link-up position. position We’ll Well years, years
