NLP Basics: Vector Representations of Text
1. Bag-of-Words Model
   1.1 Using sklearn
   1.2 Manual Computation
   1.3 Comparing the Results
2. TF-IDF
   2.1 Using sklearn
   2.2 Manual Computation
   2.3 Comparing the Results
3. Summary
1. Bag-of-Words Model
1.1 Using sklearn
import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = [
    'He is going from Beijing to Shanghai.',
    'He denied my request, but he actually lied.',
    'Mike lost the phone, and phone was in the car.',
]
X = vectorizer.fit_transform(corpus)
print("Vector representation of text: Bag of Words")
print("sklearn output:")
print(X.toarray())
1.2 Manual Computation
Y = []
word_voc = ['actually', 'and', 'beijing', 'but', 'car', 'denied', 'from', 'going', 'he', 'in', 'is', 'lied', 'lost', 'mike', 'my', 'phone', 'request', 'shanghai', 'the', 'to', 'was']
for sentence_ in corpus:
    # Tokenize: drop the trailing period, strip commas, lowercase.
    sentence = []
    for x in sentence_[:-1].replace(',', '').split():
        sentence.append(x.lower())
    # Count how often each vocabulary word occurs in this sentence.
    vector = [0] * len(word_voc)
    for word_index in range(len(word_voc)):
        word = word_voc[word_index]
        vector[word_index] = Counter(sentence)[word]
    Y.append(vector)
Y = np.array(Y)
print("Manual computation output:")
print(Y)
1.3 Comparing the Results
Vector representation of text: Bag of Words
sklearn output:
[[0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 1 0 1 0]
[1 0 0 1 0 1 0 0 2 0 0 1 0 0 1 0 1 0 0 0 0]
[0 1 0 0 1 0 0 0 0 1 0 0 1 1 0 2 0 0 2 0 1]]
Manual computation output:
[[0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 1 0 1 0]
[1 0 0 1 0 1 0 0 2 0 0 1 0 0 1 0 1 0 0 0 0]
[0 1 0 0 1 0 0 0 0 1 0 0 1 1 0 2 0 0 2 0 1]]
2. TF-IDF
2.1 Using sklearn
import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(use_idf=True, smooth_idf=False, norm=None)
corpus = [
    'He is going from Beijing to Shanghai.',
    'He denied my request, but he actually lied.',
    'Mike lost the phone, and phone was in the car.',
]
X = vectorizer.fit_transform(corpus)
print("Vector representation of text: TF-IDF")
print("sklearn output:")
print(X.toarray())
2.2 Manual Computation
Y = []
word_voc = ['actually', 'and', 'beijing', 'but', 'car', 'denied', 'from', 'going', 'he', 'in', 'is', 'lied', 'lost', 'mike', 'my', 'phone', 'request', 'shanghai', 'the', 'to', 'was']
# Document frequency: in how many sentences each vocabulary word appears.
doc_count = [0] * len(word_voc)
corpus_ = []
for sentence_ in corpus:
    # Tokenize: drop the trailing period, strip commas, lowercase.
    sentence = []
    for x in sentence_[:-1].replace(',', '').split():
        sentence.append(x.lower())
    corpus_.append(sentence)
for word_index in range(len(word_voc)):
    count = 0
    for sent in corpus_:
        if word_voc[word_index] in sent:
            count += 1
    doc_count[word_index] = count
for sentence_ in corpus:
    sentence = []
    for x in sentence_[:-1].replace(',', '').split():
        sentence.append(x.lower())
    vector = [0] * len(word_voc)
    for word_index in range(len(word_voc)):
        word = word_voc[word_index]
        # tf * idf with idf = ln(N / df) + 1, matching smooth_idf=False.
        vector[word_index] = (Counter(sentence)[word]) * (np.log(len(corpus) / doc_count[word_index]) + 1)
    Y.append(vector)
Y = np.array(Y)
print("Manual computation output:")
print(Y)
2.3 Comparing the Results
Vector representation of text: TF-IDF
sklearn output:
[[0. 0. 2.09861229 0. 0. 0.
2.09861229 2.09861229 1.40546511 0. 2.09861229 0.
0. 0. 0. 0. 0. 2.09861229
0. 2.09861229 0. ]
[2.09861229 0. 0. 2.09861229 0. 2.09861229
0. 0. 2.81093022 0. 0. 2.09861229
0. 0. 2.09861229 0. 2.09861229 0.
0. 0. 0. ]
[0. 2.09861229 0. 0. 2.09861229 0.
0. 0. 0. 2.09861229 0. 0.
2.09861229 2.09861229 0. 4.19722458 0. 0.
4.19722458 0. 2.09861229]]
Manual computation output:
[[0. 0. 2.09861229 0. 0. 0.
2.09861229 2.09861229 1.40546511 0. 2.09861229 0.
0. 0. 0. 0. 0. 2.09861229
0. 2.09861229 0. ]
[2.09861229 0. 0. 2.09861229 0. 2.09861229
0. 0. 2.81093022 0. 0. 2.09861229
0. 0. 2.09861229 0. 2.09861229 0.
0. 0. 0. ]
[0. 2.09861229 0. 0. 2.09861229 0.
0. 0. 0. 2.09861229 0. 0.
2.09861229 2.09861229 0. 4.19722458 0. 0.
4.19722458 0. 2.09861229]]
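The recurring value 2.09861229 can be verified by hand. With use_idf=True, smooth_idf=False, and norm=None, the score is tf(t, d) * (ln(N / df(t)) + 1), where N is the number of documents and df(t) the number of documents containing term t. Two spot checks against the matrix above:

```python
import numpy as np

N = 3  # documents in the corpus

# A word occurring once in exactly one document (tf = 1, df = 1),
# e.g. 'beijing': 1 * (ln(3/1) + 1)
print(np.log(N / 1) + 1)        # → 2.0986122886681098

# 'he' appears in 2 of the 3 documents and twice in sentence 2
# (tf = 2, df = 2): 2 * (ln(3/2) + 1)
print(2 * (np.log(N / 2) + 1))  # → 2.8109302162163288
```

These match the entries 2.09861229 and 2.81093022 printed above (to display precision).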
3. Summary
What you learn on paper always feels shallow; true understanding only comes from doing it yourself. However simple something seems in your head, you cannot know whether you have actually mastered it until you work through it in code, even for something as basic as text representation. Note that the TF-IDF formula here is not the one on p. 322 of Li Hang's Statistical Learning Methods: with smooth_idf=False, sklearn computes idf(t) = ln(N / df(t)) + 1 rather than a plain logarithm. A problem shared by both representations is the curse of dimensionality: each sentence (document) is represented by a vector whose length equals the vocabulary size, and real vocabularies are huge, often on the order of millions of words.
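One standard mitigation of the dimensionality problem is worth noting as a sketch: sklearn's fit_transform already returns a scipy sparse matrix, so the huge dense vectors only materialize when .toarray() is called, as in the examples above.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'He is going from Beijing to Shanghai.',
    'He denied my request, but he actually lied.',
    'Mike lost the phone, and phone was in the car.',
]
X = CountVectorizer().fit_transform(corpus)

# X is a scipy.sparse CSR matrix: only the nonzero entries are stored,
# so memory grows with the distinct words per document,
# not with the full vocabulary size.
print(X.shape)  # (3, 21)
print(X.nnz)    # 22 stored nonzero entries
```

For a million-word vocabulary, the dense row would hold a million numbers per document, while the sparse row still holds only as many entries as the document has distinct in-vocabulary words.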