NLP Fundamentals: Word Segmentation Algorithms in Practice
1. Forward Maximum Matching
1.1 Loading the Vocabulary
1.2 Forward Maximum Matching Implementation
1.3 Forward Maximum Matching Results
2. Backward Maximum Matching
2.1 Loading the Vocabulary and Implementation
2.2 Backward Maximum Matching Results
3. Bidirectional Maximum Matching
3.1 Importing Forward and Backward Maximum Matching
3.2 Bidirectional Matching Implementation
3.3 Bidirectional Matching Results
4. Word Segmentation with a Language Model
4.1 Loading the Vocabulary and a Partial Unigram Probability Dictionary
4.2 Core Implementation
4.3 Results
4.4 Viterbi Optimization
4.4.1 Graph Construction
4.4.2 Viterbi Implementation
4.4.3 Viterbi Results
4.5 Dijkstra Optimization
5. Summary
1. Forward Maximum Matching
1.1 Loading the Vocabulary
import pickle

# Load the vocabulary (the set of known words) from a pickle file
with open("././data/word2id.pkl", 'rb') as f:
    dic_words = pickle.load(f)
print("Load vocabulary success!!")
1.2 Forward Maximum Matching Implementation
def forward_max_matching(input_str, max_length):
    """
    Recursive forward maximum matching segmentation.
    :param input_str: the input string
    :param max_length: the maximum word length to try
    :return: the segmentation result, joined by "/"
    """
    max_len = min(max_length, len(input_str))
    if len(input_str) < 1:
        return " "
    else:
        # Try the longest prefix first and shrink until a dictionary word matches
        for end_index in range(max_len, 0, -1):
            if input_str[:end_index] in dic_words:
                return input_str[:end_index] + "/" + forward_max_matching(input_str[end_index:], max_length)

# The base case returns " ", so [:-2] strips the trailing "/ "
print("forward_max_matching:{}".format(forward_max_matching("北京的天气真好啊", 5)[:-2]))
print("forward_max_matching:{}".format(forward_max_matching("今天的课程内容很有意思", 5)[:-2]))
print("forward_max_matching:{}".format(forward_max_matching("经常有意见分歧", 5)[:-2]))
1.3 Forward Maximum Matching Results
Load vocabulary success!!
forward_max_matching:北京/的/天气/真好/啊
forward_max_matching:今天/的/课程/内容/很/有意思
forward_max_matching:经常/有意/见/分歧
2. Backward Maximum Matching
2.1 Loading the Vocabulary and Implementation
import pickle

# Load the vocabulary from a pickle file
with open("././data/word2id.pkl", 'rb') as f:
    dic_words = pickle.load(f)
print("Load vocabulary success!!")

def backward_max_matching(input_str, max_length):
    """
    Recursive backward maximum matching segmentation.
    :param input_str: the input string
    :param max_length: the maximum word length to try
    :return: the segmentation result, joined by "/"
    """
    max_len = min(max_length, len(input_str))
    if len(input_str) < 1:
        return " "
    else:
        # Try the longest suffix first and shrink until a dictionary word matches
        for end_index in range(-1 * max_len, 0):
            word = input_str[end_index:]
            if word in dic_words:
                return backward_max_matching(input_str[:end_index], max_length) + "/" + word

# The base case returns " ", so [2:] strips the leading " /"
print("backward_max_matching:{}".format(backward_max_matching("北京的天气真好啊", 5)[2:]))
print("backward_max_matching:{}".format(backward_max_matching("今天的课程内容很有意思", 5)[2:]))
print("backward_max_matching:{}".format(backward_max_matching("经常有意见分歧", 5)[2:]))
2.2 Backward Maximum Matching Results
Load vocabulary success!!
backward_max_matching:北京/的/天气/真好/啊
backward_max_matching:今天/的/课程/内容/很/有意思
backward_max_matching:经/常有/意见/分歧
3. Bidirectional Maximum Matching
This method compares the forward and backward maximum matching results: first prefer the result with fewer words; if the word counts are equal, prefer the one with fewer single-character words (see the tie-break example below).
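As a concrete illustration of the tie-breaking rule, using the segmentations from the outputs in sections 1.3 and 2.2:

fw = ['经常', '有意', '见', '分歧']  # forward result: 4 words, one single-character word ("见")
bw = ['经', '常有', '意见', '分歧']  # backward result: 4 words, one single-character word ("经")
# Same word count and same single-character count: the tie goes to the forward result.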
3.1 Importing Forward and Backward Maximum Matching
from forward_max_matching import forward_max_matching
from backward_max_matching import backward_max_matching
3.2 Bidirectional Matching Implementation
def bi_direction_matching(input_str):
    """
    Bidirectional matching: compare the forward and backward
    maximum matching results and pick the better one.
    :param input_str: the input string
    :return: the segmentation result as a list
    """
    count1 = 0  # single-character words in the forward result
    count2 = 0  # single-character words in the backward result
    result_fw = forward_max_matching(input_str, 5)[:-2]
    result_bw = backward_max_matching(input_str, 5)[2:]
    result_fw = result_fw.split('/')
    result_bw = result_bw.split('/')
    # Prefer the result with fewer words
    if len(result_fw) < len(result_bw):
        return result_fw
    elif len(result_fw) > len(result_bw):
        return result_bw
    else:
        # Equal word counts: prefer the result with fewer single-character words
        for word_fw in result_fw:
            if len(word_fw) == 1:
                count1 += 1
        for word_bw in result_bw:
            if len(word_bw) == 1:
                count2 += 1
        if count1 <= count2:
            return result_fw
        else:
            return result_bw

print("bi_direction_matching:{}".format(bi_direction_matching("北京的天气真好啊")))
print("bi_direction_matching:{}".format(bi_direction_matching("今天的课程内容很有意思")))
print("bi_direction_matching:{}".format(bi_direction_matching("经常有意见分歧")))
3.3 Bidirectional Matching Results
Load vocabulary success!!
forward_max_matching:北京/的/天气/真好/啊
forward_max_matching:今天/的/课程/内容/很/有意思
forward_max_matching:经常/有意/见/分歧
Load vocabulary success!!
backward_max_matching:北京/的/天气/真好/啊
backward_max_matching:今天/的/课程/内容/很/有意思
backward_max_matching:经/常有/意见/分歧
bi_direction_matching:['北京', '的', '天气', '真好', '啊']
bi_direction_matching:['今天', '的', '课程', '内容', '很', '有意思']
bi_direction_matching:['经常', '有意', '见', '分歧']
4. Word Segmentation with a Language Model
Consider every possible segmentation of the input, use a language model (here a unigram model) to compute how likely each segmentation is, and take the most probable one as the result.
4.1 Loading the Vocabulary and a Partial Unigram Probability Dictionary
import pickle
import numpy as np

# Load the vocabulary from a pickle file
with open("././data/word2id.pkl", 'rb') as f:
    dic_words = pickle.load(f)
print("Load vocabulary success!!")

# Unigram probabilities for a small subset of the vocabulary
word_prob = {"北京":0.03,"的":0.08,"天":0.005,"气":0.005,"天气":0.06,"真":0.04,"好":0.05,"真好":0.04,"啊":0.01,"真好啊":0.02,
             "今":0.01,"今天":0.07,"课程":0.06,"内容":0.06,"有":0.05,"很":0.03,"很有":0.04,"意思":0.06,"有意思":0.005,"课":0.01,
             "程":0.005,"经常":0.08,"意见":0.08,"意":0.01,"见":0.005,"有意见":0.02,"分歧":0.04,"分":0.02,"歧":0.005}
4.2 Core Implementation
def word_segment_naive(input_str):
    """
    1. Enumerate every possible segmentation of the input string.
    2. Compute the probability of each segmentation.
    3. Return the most probable one as the final result.
    input_str: input string, e.g. "今天天气好"
    best_segment: best segmentation, e.g. ["今天","天气","好"]
    """
    segments = []

    def word_segment(start_index):
        """Recursively enumerate all possible segmentations."""
        result = []
        if start_index >= len(input_str):
            return [""]
        else:
            for i in range(start_index + 1, len(input_str) + 1):
                word = input_str[start_index:i]
                if word in dic_words:
                    result = result + [word + ',' + x for x in word_segment(i)]
            return result

    results = word_segment(0)
    for result in results:
        segments.append(result.split(',')[:-1])

    # Pick the segmentation with the smallest sum of negative log probabilities
    best_score = 1000000000000
    for seg in segments:
        score = 0.0
        for word in seg:
            if word in word_prob:
                score += -np.log(word_prob[word])
            else:
                score += -np.log(0.00001)  # out-of-vocabulary fallback
        if score <= best_score:
            best_score = score
            best_segment = seg
    return best_segment

print(word_segment_naive("北京的天气真好啊"))
print(word_segment_naive("今天的课程内容很有意思"))
print(word_segment_naive("经常有意见分歧"))
4.3 Results
The results are better than those of maximum matching: "经常有意见分歧" is now segmented correctly.
Load vocabulary success!!
['北京', '的', '天气', '真好', '啊']
['今天', '的', '课程', '内容', '很', '有意思']
['经常', '有', '意见', '分歧']
4.4 Viterbi Optimization
Enumerating every segmentation is far too expensive (the sketch below shows how quickly the number of candidates grows), so the procedure needs optimizing. The first optimization is the Viterbi algorithm, a dynamic-programming method.
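In the worst case, where every substring is a dictionary word, a string of length L has 2^(L-1) segmentations, one per subset of its L-1 cut points:

# Worst-case number of candidate segmentations for a length-L string
for L in (5, 10, 20, 30):
    print(L, 2 ** (L - 1))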
4.4.1 Graph Construction
def generate_graph(input_str):
    """
    Build the segmentation graph from the vocabulary, to support dynamic programming.
    :param input_str: the input string
    :return: a dict mapping each node index to the list of words on its incoming links
    """
    graph_dic = {}
    incoming_links = []
    length = len(input_str)
    for node in range(length, 0, -1):
        last_node = node - 1
        # Collect every dictionary word that ends exactly at this node
        while last_node >= 0:
            word = input_str[last_node:node]
            if word in dic_words:
                incoming_links.append(word)
            last_node -= 1
        graph_dic[node] = incoming_links
        incoming_links = []
    return graph_dic
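For instance, assuming "北", "京", "北京", and "的" are all in dic_words, the graph for a short input would look like this (shorter words appear first because the inner loop scans backwards from node - 1):

# generate_graph("北京的") would return, under the dictionary assumption above:
# {3: ['的'], 2: ['京', '北京'], 1: ['北']}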
4.4.2 Viterbi Implementation
def word_segment_viterbi(input_str):
    """
    1. Build a DAG from the input string, the dictionary, and the given unigram probabilities.
    2. Run the Viterbi algorithm to find the optimal path.
    3. Return the corresponding segmentation.
    input_str: input string, e.g. "今天天气好"
    best_segment: best segmentation, e.g. ["今天","天气","好"]
    """
    graph = generate_graph(input_str)
    cost = [0] * (len(input_str) + 1)    # cost[i]: minimum cost of segmenting input_str[:i]
    parent = [0] * (len(input_str) + 1)  # parent[i]: start index of the last word on the best path to i
    for node in range(1, len(input_str) + 1):
        min_cost = np.inf
        for neighbor in graph[node]:
            if neighbor in word_prob:
                cost_tmp = -np.log(word_prob[neighbor])
            else:
                cost_tmp = -np.log(0.00001)  # out-of-vocabulary fallback
            cost_ = cost[node - len(neighbor)] + cost_tmp
            if cost_ < min_cost:
                min_cost = cost_
                cost[node] = min_cost
                parent[node] = node - len(neighbor)
    # Trace back through parent to recover the cut points of the optimal path
    index_lst = []
    current_node = len(input_str)
    last_node = parent[current_node]
    index_lst.append(current_node)
    index_lst.append(last_node)
    while last_node > 0:
        current_node = parent[last_node]
        last_node = parent[current_node]
        if last_node == 0 and current_node == 0:
            index_lst.append(0)
            break
        index_lst.append(current_node)
        index_lst.append(last_node)
    index_lst.reverse()
    # Cut the input string at the recovered indices
    best_segment = []
    for i in range(len(index_lst) - 1):
        best_segment.append(input_str[index_lst[i]:index_lst[i + 1]])
    return best_segment

print(word_segment_viterbi("北京的天气真好啊"))
print(word_segment_viterbi("今天的课程内容很有意思"))
print(word_segment_viterbi("经常有意见分歧"))
4.4.3 Viterbi Results
The results match those of the unoptimized version.
['北京', '的', '天气', '真好', '啊']
['今天', '的', '课程', '内容', '很', '有意思']
['经常', '有', '意见', '分歧']
4.5 Dijkstra Optimization
Converting the product of word probabilities into a sum of -log probabilities makes every edge weight positive and turns segmentation into a shortest-path problem, so the greedy Dijkstra algorithm also applies (a quick numeric check of the transform is shown below). The implementation follows, and its results match the previous ones.
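The equivalence is easy to verify numerically; the probabilities here are the ones word_prob assigns to 经常/有/意见/分歧:

p = [0.08, 0.05, 0.08, 0.04]
print(np.prod(p))                   # the probability product
print(np.exp(-np.sum(-np.log(p))))  # the same value recovered from the -log sum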
def word_segment_dijkstra(input_str):
    """
    1. Build a DAG from the input string, the dictionary, and the given unigram probabilities.
    2. Find the lowest-cost path with Dijkstra-style edge relaxation.
    3. Return the corresponding segmentation.
    input_str: input string, e.g. "今天天气好"
    best_segment: best segmentation, e.g. ["今天","天气","好"]
    """
    graph = generate_graph(input_str)
    cost = [np.inf] * (len(input_str) + 1)  # cost[i]: minimum cost of segmenting input_str[:i]
    cost[0] = 0
    parent = [0] * (len(input_str) + 1)
    for node in range(1, len(input_str) + 1):
        neighbors = graph[node]
        for neighbor in neighbors:
            if neighbor in word_prob:
                cost_tmp = -np.log(word_prob[neighbor])
            else:
                cost_tmp = -np.log(0.00001)  # out-of-vocabulary fallback
            # Relax the edge from (node - len(neighbor)) to node
            current_cost = cost[node - len(neighbor)] + cost_tmp
            if current_cost <= cost[node]:
                cost[node] = current_cost
                parent[node] = node - len(neighbor)
    # Trace back through parent to recover the cut points of the optimal path
    index_lst = []
    current_node = len(input_str)
    last_node = parent[current_node]
    index_lst.append(current_node)
    index_lst.append(last_node)
    while last_node > 0:
        current_node = parent[last_node]
        last_node = parent[current_node]
        if last_node == 0 and current_node == 0:
            index_lst.append(0)
            break
        index_lst.append(current_node)
        index_lst.append(last_node)
    index_lst.reverse()
    # Cut the input string at the recovered indices
    best_segment = []
    for i in range(len(index_lst) - 1):
        best_segment.append(input_str[index_lst[i]:index_lst[i + 1]])
    return best_segment
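The version above relaxes nodes in left-to-right order, which is valid because the segmentation graph is a DAG. For comparison, a textbook priority-queue Dijkstra over the same graph might look like this (a minimal sketch; word_segment_dijkstra_pq and its internals are illustrative, not part of the original code):

import heapq

def word_segment_dijkstra_pq(input_str):
    # Classic Dijkstra with a min-heap over the segmentation DAG
    graph = generate_graph(input_str)
    n = len(input_str)
    dist = [np.inf] * (n + 1)
    parent = [0] * (n + 1)
    dist[0] = 0
    heap = [(0.0, 0)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v in range(u + 1, n + 1):
            word = input_str[u:v]
            if word in graph[v]:  # an edge u -> v exists for this word
                weight = -np.log(word_prob.get(word, 0.00001))
                if d + weight < dist[v]:
                    dist[v] = d + weight
                    parent[v] = u
                    heapq.heappush(heap, (dist[v], v))
    # Walk back through parent to rebuild the segmentation
    segment = []
    v = n
    while v > 0:
        segment.append(input_str[parent[v]:v])
        v = parent[v]
    return segment[::-1]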
5. Summary
Let the sentence length be L and let each node have at most M incoming links. The unoptimized method enumerates every candidate segmentation, so its running time grows exponentially with L, while the Viterbi version runs in O(L*M); since M is bounded by the maximum word length and is therefore far smaller than L, the speedup is dramatic. The unigram model used here assigns a fixed probability of 0.00001 to any word not in word_prob, whereas real probabilities should be estimated from a large corpus. A stronger language model, such as an HMM, would also help: the unigram independence assumption does not match real language, where words depend on their context rather than being independent. A sketch of moving to a bigram score follows.
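As one possible direction, a bigram score could replace the unigram score used above. A minimal sketch, assuming a hypothetical bigram_prob dictionary estimated from a corpus (keys are (previous_word, word) pairs):

def bigram_score(seg, unigram_prob, bigram_prob):
    # Score the first word by its unigram probability, and each later
    # word by its probability conditioned on the previous word
    score = -np.log(unigram_prob.get(seg[0], 0.00001))
    for prev_word, word in zip(seg, seg[1:]):
        score += -np.log(bigram_prob.get((prev_word, word), 0.00001))
    return score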