NLP Basics: Vector Representation of Text



1. Bag-of-Words Model
   1.1 Using the sklearn Function
   1.2 Manual Computation
   1.3 Comparing the Results
2. TF-IDF
   2.1 Using the sklearn Function
   2.2 Manual Computation
   2.3 Comparing the Results
3. Summary

1. Bag-of-Words Model

1.1 Using the sklearn Function

    import numpy as np
    from collections import Counter  # numpy and Counter are used by the manual computation in 1.2
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer()
    corpus = [
        'He is going from Beijing to Shanghai.',
        'He denied my request, but he actually lied.',
        'Mike lost the phone, and phone was in the car.',
    ]
    X = vectorizer.fit_transform(corpus)
    print("Vector representation of text: Bag of Words")
    print("sklearn output:")
    print(X.toarray())
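CountVectorizer lowercases the text and, by default, keeps only tokens of two or more alphanumeric characters; the columns of the output are sorted alphabetically by token. To see which column corresponds to which word (and to confirm the vocabulary hard-coded in the manual computation below), the fitted vectorizer can be inspected. A minimal sketch, assuming the vectorizer fitted above and a recent scikit-learn (older versions use get_feature_names() instead):

    # Prints the 21 tokens in alphabetical order, one per column of X;
    # this is exactly the word_voc list used in section 1.2.
    print(vectorizer.get_feature_names_out())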

1.2 Manual Computation

    Y = []
    word_voc = ['actually', 'and', 'beijing', 'but', 'car', 'denied', 'from',
                'going', 'he', 'in', 'is', 'lied', 'lost', 'mike', 'my',
                'phone', 'request', 'shanghai', 'the', 'to', 'was']
    for sentence_ in corpus:
        # Strip the trailing period, drop commas, lowercase, and split on spaces.
        sentence = []
        for x in sentence_[:-1].replace(',', '').split():
            sentence.append(x.lower())
        # Count raw occurrences of each vocabulary word in this sentence.
        counts = Counter(sentence)
        vector = [0] * len(word_voc)
        for word_index in range(len(word_voc)):
            vector[word_index] = counts[word_voc[word_index]]
        Y.append(vector)
    Y = np.array(Y)
    print("Manual output:")
    print(Y)
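Rather than only eyeballing the two printouts in the next subsection, the agreement can also be checked programmatically. A small sketch assuming the X and Y arrays computed above:

    # X is a scipy.sparse matrix, so densify before comparing; the
    # counts are integers, so exact equality is safe here.
    assert (X.toarray() == Y).all()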

1.3 Comparing the Results

    Vector representation of text: Bag of Words
    sklearn output:
    [[0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 1 0 1 0]
     [1 0 0 1 0 1 0 0 2 0 0 1 0 0 1 0 1 0 0 0 0]
     [0 1 0 0 1 0 0 0 0 1 0 0 1 1 0 2 0 0 2 0 1]]
    Manual output:
    [[0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 1 0 1 0]
     [1 0 0 1 0 1 0 0 2 0 0 1 0 0 1 0 1 0 0 0 0]
     [0 1 0 0 1 0 0 0 0 1 0 0 1 1 0 2 0 0 2 0 1]]

    2. TF-IDF

2.1 Using the sklearn Function

    import numpy as np
    from collections import Counter
    from sklearn.feature_extraction.text import TfidfVectorizer

    # smooth_idf=False and norm=None are needed to match the manual
    # computation in section 2.2 (see the note below).
    vectorizer = TfidfVectorizer(use_idf=True, smooth_idf=False, norm=None)
    corpus = [
        'He is going from Beijing to Shanghai.',
        'He denied my request, but he actually lied.',
        'Mike lost the phone, and phone was in the car.',
    ]
    X = vectorizer.fit_transform(corpus)
    print("Vector representation of text: TF-IDF")
    print("sklearn output:")
    print(X.toarray())
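The two non-default arguments are what make the manual computation below reproducible: smooth_idf=False switches off the "+1 inside the logarithm" smoothing that sklearn applies by default, and norm=None disables the default L2 normalization of each row, so the raw tf-idf products are printed. The fitted vectorizer also stores the per-term idf values; a quick sketch assuming the objects above:

    # With smooth_idf=False, idf[i] = ln(N / df_i) + 1. A word occurring
    # in 1 of the 3 documents gets ln(3) + 1 ≈ 2.0986; 'he', occurring
    # in 2 of the 3 documents, gets ln(3/2) + 1 ≈ 1.4055.
    print(vectorizer.idf_)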

2.2 Manual Computation

    Y = []
    word_voc = ['actually', 'and', 'beijing', 'but', 'car', 'denied', 'from',
                'going', 'he', 'in', 'is', 'lied', 'lost', 'mike', 'my',
                'phone', 'request', 'shanghai', 'the', 'to', 'was']

    # Tokenize every sentence once: strip the final period, drop commas,
    # lowercase, and split on spaces.
    corpus_ = []
    for sentence_ in corpus:
        sentence = []
        for x in sentence_[:-1].replace(',', '').split():
            sentence.append(x.lower())
        corpus_.append(sentence)

    # Document frequency: in how many sentences does each word appear?
    doc_count = [0] * len(word_voc)
    for word_index in range(len(word_voc)):
        count = 0
        for sent in corpus_:
            if word_voc[word_index] in sent:
                count += 1
        doc_count[word_index] = count

    # tf-idf = raw count * (ln(N / df) + 1), matching smooth_idf=False.
    for sentence in corpus_:
        counts = Counter(sentence)
        vector = [0] * len(word_voc)
        for word_index in range(len(word_voc)):
            word = word_voc[word_index]
            vector[word_index] = counts[word] * (np.log(len(corpus) / doc_count[word_index]) + 1)
        Y.append(vector)
    Y = np.array(Y)
    print("Manual output:")
    print(Y)
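Since the entries are now floating-point numbers, the comparison with the sklearn result should allow for rounding error. A minimal sketch assuming X and Y from above:

    # Element-wise comparison within numpy's default tolerances.
    assert np.allclose(X.toarray(), Y)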

2.3 Comparing the Results

    Vector representation of text: TF-IDF
    sklearn output:
    [[0.         0.         2.09861229 0.         0.         0.
      2.09861229 2.09861229 1.40546511 0.         2.09861229 0.
      0.         0.         0.         0.         0.         2.09861229
      0.         2.09861229 0.        ]
     [2.09861229 0.         0.         2.09861229 0.         2.09861229
      0.         0.         2.81093022 0.         0.         2.09861229
      0.         0.         2.09861229 0.         2.09861229 0.
      0.         0.         0.        ]
     [0.         2.09861229 0.         0.         2.09861229 0.
      0.         0.         0.         2.09861229 0.         0.
      2.09861229 2.09861229 0.         4.19722458 0.         0.
      4.19722458 0.         2.09861229]]
    Manual output:
    [[0.         0.         2.09861229 0.         0.         0.
      2.09861229 2.09861229 1.40546511 0.         2.09861229 0.
      0.         0.         0.         0.         0.         2.09861229
      0.         2.09861229 0.        ]
     [2.09861229 0.         0.         2.09861229 0.         2.09861229
      0.         0.         2.81093022 0.         0.         2.09861229
      0.         0.         2.09861229 0.         2.09861229 0.
      0.         0.         0.        ]
     [0.         2.09861229 0.         0.         2.09861229 0.
      0.         0.         0.         2.09861229 0.         0.
      2.09861229 2.09861229 0.         4.19722458 0.         0.
      4.19722458 0.         2.09861229]]

3. Summary

As the classical poem goes, what is learned on paper always feels shallow; to truly understand a thing, you must do it yourself. However simple a topic feels in your head, you cannot tell whether you have actually mastered it until you work through it in code, even for something as basic as text representation. Note that sklearn's TF-IDF formula is not the same as the one on p. 322 of Li Hang's Statistical Learning Methods! A problem shared by both representations is the curse of dimensionality: each sentence (document) is represented by a vector as long as the vocabulary, and vocabularies are typically huge, easily on the order of millions of words.
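To make the formula discrepancy concrete: with smooth_idf=False, sklearn documents its weighting as

    \mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)} + 1, \qquad
    \text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)

where N is the number of documents and df(t) is the number of documents containing term t. The classical textbook idf, \log\frac{N}{\mathrm{df}(t)}, has no "+1" term, and with the default smooth_idf=True sklearn instead uses \ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1, so the numbers above are reproducible only under the exact settings shown in 2.1. On the storage side of the dimensionality problem, note that fit_transform returns a scipy.sparse matrix, so memory grows with the number of nonzero entries rather than with the full vocabulary size.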