NLP Basics: Vector Representations of Text
1. Bag-of-Words Model
   1.1 Using sklearn
   1.2 Manual Computation
   1.3 Comparing the Results
2. TF-IDF
   2.1 Using sklearn
   2.2 Manual Computation
   2.3 Comparing the Results
3. Summary
1. Bag-of-Words Model
1.1 Using sklearn
import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = [
    'He is going from Beijing to Shanghai.',
    'He denied my request, but he actually lied.',
    'Mike lost the phone, and phone was in the car.',
]
X = vectorizer.fit_transform(corpus)
print("Vector representation of text: Bag of Words")
print("sklearn output:")
print(X.toarray())
1.2 Manual Computation
Y = []
word_voc = ['actually', 'and', 'beijing', 'but', 'car', 'denied', 'from', 'going', 'he', 'in', 'is', 'lied', 'lost', 'mike', 'my', 'phone', 'request', 'shanghai', 'the', 'to', 'was']
for sentence_ in corpus:
    # Tokenize: drop the trailing period, strip commas, lowercase.
    sentence = []
    for x in sentence_[:-1].replace(',', '').split():
        sentence.append(x.lower())
    # Count how often each vocabulary word occurs in this sentence.
    vector = [0] * len(word_voc)
    for word_index in range(len(word_voc)):
        word = word_voc[word_index]
        vector[word_index] = Counter(sentence)[word]
    Y.append(vector)
Y = np.array(Y)
print("Manual computation output:")
print(Y)
1.3 Comparing the Results
Vector representation of text: Bag of Words
sklearn output:
[[0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 1 0 1 0]
[1 0 0 1 0 1 0 0 2 0 0 1 0 0 1 0 1 0 0 0 0]
[0 1 0 0 1 0 0 0 0 1 0 0 1 1 0 2 0 0 2 0 1]]
Manual computation output:
[[0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 1 0 1 0]
[1 0 0 1 0 1 0 0 2 0 0 1 0 0 1 0 1 0 0 0 0]
[0 1 0 0 1 0 0 0 0 1 0 0 1 1 0 2 0 0 2 0 1]]
2. TF-IDF
2.1 Using sklearn
import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(use_idf=True, smooth_idf=False, norm=None)
corpus = [
    'He is going from Beijing to Shanghai.',
    'He denied my request, but he actually lied.',
    'Mike lost the phone, and phone was in the car.',
]
X = vectorizer.fit_transform(corpus)
print("Vector representation of text: TF-IDF")
print("sklearn output:")
print(X.toarray())
2.2 Manual Computation
Y = []
word_voc = ['actually', 'and', 'beijing', 'but', 'car', 'denied', 'from', 'going', 'he', 'in', 'is', 'lied', 'lost', 'mike', 'my', 'phone', 'request', 'shanghai', 'the', 'to', 'was']
# Document frequency: in how many sentences each vocabulary word appears.
doc_count = [0] * len(word_voc)
corpus_ = []
for sentence_ in corpus:
    # Tokenize: drop the trailing period, strip commas, lowercase.
    sentence = []
    for x in sentence_[:-1].replace(',', '').split():
        sentence.append(x.lower())
    corpus_.append(sentence)
for word_index in range(len(word_voc)):
    count = 0
    for sent in corpus_:
        if word_voc[word_index] in sent:
            count += 1
    doc_count[word_index] = count
for sentence_ in corpus:
    sentence = []
    for x in sentence_[:-1].replace(',', '').split():
        sentence.append(x.lower())
    vector = [0] * len(word_voc)
    for word_index in range(len(word_voc)):
        word = word_voc[word_index]
        # tf * idf with idf = ln(N / df) + 1, matching smooth_idf=False.
        vector[word_index] = (Counter(sentence)[word]) * (np.log(len(corpus) / doc_count[word_index]) + 1)
    Y.append(vector)
Y = np.array(Y)
print("Manual computation output:")
print(Y)
2.3 Comparing the Results
Vector representation of text: TF-IDF
sklearn output:
[[0. 0. 2.09861229 0. 0. 0.
2.09861229 2.09861229 1.40546511 0. 2.09861229 0.
0. 0. 0. 0. 0. 2.09861229
0. 2.09861229 0. ]
[2.09861229 0. 0. 2.09861229 0. 2.09861229
0. 0. 2.81093022 0. 0. 2.09861229
0. 0. 2.09861229 0. 2.09861229 0.
0. 0. 0. ]
[0. 2.09861229 0. 0. 2.09861229 0.
0. 0. 0. 2.09861229 0. 0.
2.09861229 2.09861229 0. 4.19722458 0. 0.
4.19722458 0. 2.09861229]]
Manual computation output:
[[0. 0. 2.09861229 0. 0. 0.
2.09861229 2.09861229 1.40546511 0. 2.09861229 0.
0. 0. 0. 0. 0. 2.09861229
0. 2.09861229 0. ]
[2.09861229 0. 0. 2.09861229 0. 2.09861229
0. 0. 2.81093022 0. 0. 2.09861229
0. 0. 2.09861229 0. 2.09861229 0.
0. 0. 0. ]
[0. 2.09861229 0. 0. 2.09861229 0.
0. 0. 0. 2.09861229 0. 0.
2.09861229 2.09861229 0. 4.19722458 0. 0.
4.19722458 0. 2.09861229]]
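The recurring value 2.09861229 can be verified by hand. With use_idf=True, smooth_idf=False, and norm=None, the score is tf(t, d) * (ln(N / df(t)) + 1), where N is the number of documents and df(t) the number of documents containing term t. Two spot checks against the matrix above:

```python
import numpy as np

N = 3  # documents in the corpus

# A word occurring once in exactly one document (tf = 1, df = 1),
# e.g. 'beijing': 1 * (ln(3/1) + 1)
print(np.log(N / 1) + 1)        # → 2.0986122886681098

# 'he' appears in 2 of the 3 documents and twice in sentence 2
# (tf = 2, df = 2): 2 * (ln(3/2) + 1)
print(2 * (np.log(N / 2) + 1))  # → 2.8109302162163288
```

These match the entries 2.09861229 and 2.81093022 printed above (to display precision).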
3. Summary
What you learn on paper always feels shallow; true understanding only comes from doing it yourself. However simple something seems in your head, you cannot know whether you have actually mastered it until you work through it in code, even for something as basic as text representation. Note that the TF-IDF formula here is not the one on p. 322 of Li Hang's Statistical Learning Methods: with smooth_idf=False, sklearn computes idf(t) = ln(N / df(t)) + 1 rather than a plain logarithm. A problem shared by both representations is the curse of dimensionality: each sentence (document) is represented by a vector whose length equals the vocabulary size, and real vocabularies are huge, often on the order of millions of words.
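One standard mitigation of the dimensionality problem is worth noting as a sketch: sklearn's fit_transform already returns a scipy sparse matrix, so the huge dense vectors only materialize when .toarray() is called, as in the examples above.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'He is going from Beijing to Shanghai.',
    'He denied my request, but he actually lied.',
    'Mike lost the phone, and phone was in the car.',
]
X = CountVectorizer().fit_transform(corpus)

# X is a scipy.sparse CSR matrix: only the nonzero entries are stored,
# so memory grows with the distinct words per document,
# not with the full vocabulary size.
print(X.shape)  # (3, 21)
print(X.nnz)    # 22 stored nonzero entries
```

For a million-word vocabulary, the dense row would hold a million numbers per document, while the sparse row still holds only as many entries as the document has distinct in-vocabulary words.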