Python NLTK library
Python, being Python, has some remarkable libraries at hand on top of its incredible readability. One of them is NLTK. NLTK, the Natural Language Toolkit, is one of the best Python NLP libraries out there. The functionality it puts at your fingertips, while staying easy to use and, again, readable, is just fantastic.
In fact, we're going to complete this mini project in under 25 lines of code. And you'll most probably understand each line as you read through it. Crazy, I know.
1. IDE
Personally, whenever I'm doing anything even relatively fancy in Python, I use Jupyter Lab. Being able to see what each line does makes it really easy to debug, and it's also strangely therapeutic. Shrugs.
But you're free to use whatever you want. It's a free world. Mostly.
2. Dependencies
Now, we've got to get hold of the libraries we need. Just 4, and they're super easy to get.
- NLTK
- Numpy
- Pandas
- Scikit-learn

To install NLTK, run the following in the terminal:

    pip install nltk

To install Numpy, run the following in the terminal:

    pip install numpy

To install Pandas, run the following in the terminal:

    pip install pandas

To install Scikit-learn, run the following in the terminal:

    pip install scikit-learn

So intuitive. I mean, come on, it really can't get any easier.
First things first. Let's import NLTK.

    import nltk

Now, there's a slight hitch. I did say 4 dependencies, didn't I? OK, here's the last one, I swear. But this one's programmatic.

    nltk.download('vader_lexicon')  # one time only

This is going to go ahead and grab, well, the vader_lexicon.
While this is the official page for NLTK's VADER, it's actually the code and not an explanation of VADER, which, by the way, does not refer to Darth Vader. Very sad, I know.
It actually stands for Valence Aware Dictionary and sEntiment Reasoner. It's basically going to do all the sentiment analysis for us. So convenient. I mean, at this rate jobs are definitely going to be vanishing faster. (No, I'm kidding.)
The way this magical downloadable works is by mapping the words you pass into it to lexical features with emotional intensities. In plain English, since you ask, that means looking each word up among related words (let's just call them synonyms for now) to figure out what it relates to, and then giving it a score. A sentiment score, to be precise.
So now that each word has a sentiment score, the score of a paragraph of words is going to be, you guessed it, the sum of all the sentiment scores. Shocking, I know.
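To make the "look up each word, then add the scores" idea concrete, here is a toy sketch. The lexicon below is a hand-written stand-in with illustrative scores, not the real vader_lexicon, and the function name `toy_sentiment_sum` is made up. The real VADER also applies extra rules (negation, punctuation, capitalization) before it reports a score, none of which are modeled here.

```python
# Toy lexicon: these words and scores are illustrative stand-ins,
# not values read from the real vader_lexicon file.
TOY_LEXICON = {
    'love': 3.2,
    'great': 3.1,
    'terrible': -2.1,
    'broken': -2.0,
}


def toy_sentiment_sum(text):
    """Add up the scores of every known word; unknown words count as 0."""
    words = text.lower().split()
    return sum(TOY_LEXICON.get(word, 0.0) for word in words)


positive_total = toy_sentiment_sum('I love this great phone')
negative_total = toy_sentiment_sum('terrible battery and a broken screen')
print(positive_total, negative_total)
```

The first sentence ends up with a positive total and the second with a negative one, purely because of which words appear in the toy lexicon.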
Now, you might be thinking: OK, fine, it goes ahead and gets the score of each word. But does it understand context? Like, for example, the difference between did work and did not work?

DUH!!!

I mean, otherwise why would it be 'one of the best'?
Another really important thing to keep in mind is that VADER actually pays attention to capitalization and exclamation marks. It will give a higher positive score to AWESOME!!!!! than to AWESOME or awesome.
That's it, class. Theory's over.
Let's now import the downloaded VADER module.

    from nltk.sentiment.vader import SentimentIntensityAnalyzer

and then make an instance of the SentimentIntensityAnalyzer, by doing this:

    vader = SentimentIntensityAnalyzer()  # or whatever you want to call it

By now your code should look something like this:

[Code snippet]

Upon running it, you should see something like this. If you get the same error as me, don't worry, it's basically warning you that the Twitter module from NLTK is not installed and so you won't be able to tap into that functionality.

[Code snippet 2]

Now let's try out what this 'VADER' can do. Write the following and run it:
    sample = 'I really love NVIDIA'
    vader.polarity_scores(sample)

[Code snippet 3]

So, it was 69.2% positive. Which might not be perfect, but it definitely gets the job done, as you'll see.
In case you're wondering, neg, neu and pos are the proportions of the text that read as negative, neutral and positive, while the compound value is a single overall score, normalized to lie between -1 (most negative) and +1 (most positive).
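For the curious, the squashing into the -1 to +1 range can be sketched in a few lines. This assumes the normalization used by the reference VADER implementation, x / sqrt(x² + α) with α = 15; treat both the formula and the constant as an assumption worth checking against the source you have installed.

```python
import math


def normalize(total, alpha=15):
    """Squash an unbounded summed score into the range -1 to 1.

    Assumed form of VADER's normalization; alpha=15 is the assumed default.
    """
    return total / math.sqrt(total * total + alpha)


for total in (-6.0, -1.0, 0.0, 1.0, 6.0):
    print(f'{total:+.1f} -> {normalize(total):+.4f}')
```

Large totals in either direction flatten out close to the extremes, so a very long glowing review cannot run away with the score.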
Now, try this:

    sample = 'I really don\'t love NVIDIA'
    vader.polarity_scores(sample)

[Code snippet 4]

54.9% negative, whew, by the skin of its teeth.
Here's a file with Amazon reviews of a product, from which we're going to be extracting sentiments. Go ahead and download it. Also make sure it's in the same directory as the Python file you're working on. Otherwise, remember to use the correct path to it.
We're going to be needing both pandas and numpy now.

    import numpy as np
    import pandas as pd

    df = pd.read_csv('wherever you stored the file.tsv', sep='\t')
    df.head()

In the above code, we've loaded the file into a Pandas DataFrame object and called head() on it to view the first 5 rows of the dataframe.
This dataset already has all the reviews labeled as positive or negative. This is just for you to cross-check the values you get back from VADER and calculate your metrics.
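If you'd like to see the shape of the data before downloading anything, here is a sketch that reads a tiny tab-separated sample from memory instead of from disk. The rows are made up, and the 'pos' / 'neg' label values are an assumption about how the real file encodes its labels; only the column names 'label' and 'review' come from the code in this walkthrough.

```python
import io

import pandas as pd

# Made-up rows standing in for the real file; the 'pos'/'neg' values
# are an assumption about how the labels are encoded.
sample_tsv = (
    "label\treview\n"
    "pos\tWorks great, would buy again.\n"
    "neg\tStopped working after two days.\n"
    "pos\tExactly what I needed.\n"
)

# sep='\t' tells pandas the columns are separated by tabs, not commas.
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep='\t')
print(sample_df.head())
print(sample_df.columns.tolist())
```

Because the separator is a tab, the comma inside the first review stays part of the text rather than splitting it into another column.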
To see how many positive and negative reviews we have, type in the following:

    df['label'].value_counts()

[Code snippet 5]

Let's try one of the objects out, shall we?
But before we do that, let's ensure that our dataset is nice and clean, i.e., that there aren't any blank objects.
    df.dropna(inplace=True)

    empty_objects = []
    # Each row unpacks as (index, label, review), so this assumes
    # the dataframe has exactly those two columns.
    for index, label, review in df.itertuples():
        if type(review) == str:
            if review.isspace():
                empty_objects.append(index)

    df.drop(empty_objects, inplace=True)

This little convenience snippet will drop any blank dataframe objects. The inplace=True argument ensures that the dataframe keeps the changes made by dropping any blank objects, rather than cheekily throwing them away despite all our effort. Very much like a commit in GitHub.

[Code snippet 6]

As it happens, this particular dataset had no empty objects, but still, it doesn't hurt to be careful.
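The same cleanup can also be written without an explicit loop. This is an alternative sketch, not the approach used above: it runs on a small made-up dataframe and relies on pandas string methods. One difference to be aware of is that it also drops completely empty strings, which `isspace()` on its own would not flag.

```python
import pandas as pd

# Made-up data with one missing review and one whitespace-only review.
sample_df = pd.DataFrame({
    'label': ['pos', 'neg', 'pos', 'neg'],
    'review': ['Love it', None, '   ', 'Broke in a week'],
})

# Drop rows with missing values first.
sample_df = sample_df.dropna()

# Keep only rows whose review still has characters after stripping whitespace.
has_text = sample_df['review'].str.strip() != ''
cleaned_df = sample_df[has_text]
print(cleaned_df)
```

Here the result is assigned to a new variable instead of using inplace=True, which makes it easy to compare the dataframe before and after.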
Currently there are a couple of problems:

- We can't compare the extracted sentiment to the original sentiment, as doing that for each sentiment is time consuming and, quite frankly, completely caveman.
- The extracted sentiment is printed out, which, in my opinion, is plain flimsy.

Let's fix it.
Let's add the sentiment to the dataframe alongside its original sentiment.

    df['scores'] = df['review'].apply(lambda review: vader.polarity_scores(review))
    df.head()

The above code will create a new column called 'scores' which will contain the extracted sentiments.
[Code snippet 7]

But currently the scores column has just the raw sentiment, which we can't really compare programmatically with the 'label' column that already has all the data, so let's find a workaround.

Let's use the compound value.
[Code snippet 8]

If the compound value is greater than 0, we can safely say that the review is positive; otherwise it's negative. Great! Let's implement that now!
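One possible way to implement that rule is sketched below. It runs on a tiny made-up dataframe whose 'scores' column holds dictionaries shaped like the ones `polarity_scores()` returns; the column names 'compound' and 'predicted', and the 'pos' / 'neg' label values, are assumptions chosen for this example.

```python
import pandas as pd

# Made-up stand-ins for the dictionaries that polarity_scores() returns.
sample_df = pd.DataFrame({
    'label': ['pos', 'neg', 'pos'],
    'scores': [
        {'neg': 0.0, 'neu': 0.4, 'pos': 0.6, 'compound': 0.7},
        {'neg': 0.5, 'neu': 0.5, 'pos': 0.0, 'compound': -0.6},
        {'neg': 0.2, 'neu': 0.8, 'pos': 0.0, 'compound': -0.1},
    ],
})

# Pull the compound value out of each dictionary into its own column.
sample_df['compound'] = sample_df['scores'].apply(lambda scores: scores['compound'])

# Anything above 0 counts as positive, everything else as negative.
sample_df['predicted'] = sample_df['compound'].apply(
    lambda compound: 'pos' if compound > 0 else 'neg'
)
print(sample_df[['label', 'compound', 'predicted']])
```

The third made-up row is deliberately a miss: it is labeled positive but lands just below zero, which is the kind of disagreement the comparison is meant to surface.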
[Code snippet 9]

There's definitely room for improvement. But do keep in mind that we got this score without making any changes to VADER, and that we didn't write any custom code to figure out the sentiment ourselves.
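If you want to put a number on how well the predictions match the labels, scikit-learn has helpers for exactly that. This is a minimal sketch on made-up label lists, not the results from the Amazon dataset; with the real data you would pass in the 'label' column and whatever column holds your predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration; swap in the real columns from the dataframe.
true_labels = ['pos', 'neg', 'pos', 'neg', 'pos']
predicted_labels = ['pos', 'neg', 'neg', 'neg', 'pos']

# Fraction of predictions that match the true label.
accuracy = accuracy_score(true_labels, predicted_labels)

# Rows are true labels, columns are predictions, both ordered neg then pos.
matrix = confusion_matrix(true_labels, predicted_labels, labels=['neg', 'pos'])
print(accuracy)
print(matrix)
```

The confusion matrix is worth a look alongside the accuracy, since it shows whether the misses are mostly positives read as negative or the other way around.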
Alright then, if you have any queries, feel free to post them in the comments and I'll try to help out! Peace.
Translated from: https://medium.com/@bossbeagle1509/sentiment-analysis-using-python-and-nltk-library-d68caba27e1d