Machine Learning Prediction for Virtual Currency

Technology, 2022-08-01

Using topic modelling and Twitter scraping to extract the Voice of Customer after the Clicks Group's racially offensive advertisement debacle.

On Friday, 4 September 2020, Twitter reacted to the Clicks Group, one of South Africa’s largest healthcare retailers, after it released an online advert that described a black woman’s hair as ‘frizzy and dull’ and a white woman’s hair as ‘normal’, as part of an ad campaign with the American hair care brand TRESemmé.

In this post I scrape Twitter data related to the debacle and use a Natural Language Processing algorithm, latent Dirichlet allocation (LDA), to unveil some of the major topics that have emerged from the Twitter discourse since the event.

Why is Twitter a good source for Voice of Customer in South Africa?

Academics studying social media have a multitude of data sources to choose from, but Twitter is often the data source of choice because of its access-friendly infrastructure, its predisposition for social discourse and the near-total availability of its data, particularly in South Africa. Google, in turn, provides a signal of search trends by region in South Africa, which speaks to the heartbeat of social media.

Google Search Trends for Clicks in South Africa

The most related Google searches were “Clicks Hair Advert” and “Clicks Racism Ad”, each with average growth of over 900%. This surge coincided with the sharp fall in the stock price that followed the advertisement’s release.

Twitter Voice of Customer Analysis

I used the scraping library GetOldTweets3 to extract around 15,000 tweets linked to Google's trending related searches for Clicks, limited to South Africa, into a pandas data frame holding the user, tweet text, date and hashtags. The tweets run from 4 September to 8 September 2020.
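
The scraping step might look roughly like the sketch below. It is an illustration under assumptions, not the code behind this post: the helper names (`fetch_tweets`, `tweets_to_rows`), the sample tweet and any query string or limit you pass in are hypothetical, the restriction to South Africa is not shown, and GetOldTweets3 may no longer work against current Twitter. The library is imported lazily so that the row-building part runs offline.

```python
from types import SimpleNamespace


def fetch_tweets(query, since, until, limit):
    """Query Twitter through GetOldTweets3 (needs network access and a working endpoint)."""
    import GetOldTweets3 as got  # lazy import: the helper below works without the library

    criteria = (
        got.manager.TweetCriteria()
        .setQuerySearch(query)
        .setSince(since)      # dates as "YYYY-MM-DD" strings
        .setUntil(until)
        .setMaxTweets(limit)
    )
    return got.manager.TweetManager.getTweets(criteria)


def tweets_to_rows(tweets):
    """Keep only the four fields described in the text: user, text, date, hashtags."""
    return [
        {"user": t.username, "text": t.text, "date": t.date, "hashtags": t.hashtags}
        for t in tweets
    ]


# Hypothetical stand-in for one scraped tweet, so the sketch runs without network access.
sample = [
    SimpleNamespace(
        username="someone",
        text="Clicks hair advert",
        date="2020-09-04",
        hashtags="#clicksmustfall",
    )
]
rows = tweets_to_rows(sample)
```

A list of dictionaries like `rows` can be handed straight to `pandas.DataFrame(rows)` to obtain the data frame.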

I extracted the tweet texts, which are the most relevant data, and performed gensim's standard pre-processing: removing stop words, using tokenisers to strip punctuation and unnecessary characters, and finally creating the dictionary and corpus needed for topic modelling. I go through the detail of pre-processing data for topic modelling in my previous post.
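
To make the pre-processing steps concrete, here is a plain-Python sketch of what they produce. It deliberately does not call gensim; it mimics the shape of the structures that gensim's `corpora.Dictionary` and `doc2bow` return (a token-to-id mapping and, per document, sorted `(token_id, count)` pairs). The stop-word list, the minimum token length and the example tweets are assumptions made for illustration.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "and", "to", "of", "for", "this"}  # tiny illustrative list


def tokenise(text):
    """Lower-case, drop URLs, keep alphabetic tokens, remove stop words and short tokens."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]


def build_dictionary(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Bag-of-words: sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in doc)
    return sorted(counts.items())


# Made-up tweets, for illustration only.
tweets = [
    "The Clicks hair advert is racist! https://example.com/x",
    "Protest at the Clicks store, store closed.",
]
docs = [tokenise(t) for t in tweets]
dictionary = build_dictionary(docs)
corpus = [to_bow(d, dictionary) for d in docs]
```

The dictionary and corpus are the two inputs a topic model needs: the first says which word each id stands for, the second says how often each word occurs in each tweet.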

Visualising the Tweets

I’ve found that word clouds are useful for getting a visual sense of some of the key themes before even attempting the modelling. To make the insights more useful I used the generate_from_frequencies function from Python's wordcloud library, passing it a word-frequency dictionary restructured so that words from tweets with more reach, in terms of retweets and favourites, are prioritised.
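
One way to build such a reach-weighted frequency dictionary is sketched below. The weighting formula (1 + retweets + favourites per tweet) is an assumption, since the post does not give the exact scheme, and the function name and sample data are hypothetical.

```python
from collections import Counter


def weighted_frequencies(tweets):
    """Sum, per word, a weight of 1 + retweets + favourites for each tweet that uses it."""
    freqs = Counter()
    for tokens, retweets, favourites in tweets:
        weight = 1 + retweets + favourites
        for token in set(tokens):  # count each word once per tweet
            freqs[token] += weight
    return dict(freqs)


# Hypothetical pre-processed tweets: (tokens, retweets, favourites).
tweets = [
    (["clicks", "racism"], 10, 5),
    (["clicks", "protest"], 0, 0),
]
freqs = weighted_frequencies(tweets)
```

The resulting dictionary can then be passed to `wordcloud.WordCloud(max_words=100).generate_from_frequencies(freqs)`, so that a word from a widely shared tweet is drawn larger than one from a tweet nobody engaged with.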

Word clouds are great for extracting the essential insights: the top 100 words from the 15,000 tweets reveal that “racism”, “racists” and “protests” were some of the words most associated with the Clicks brand. Word clouds alone, however, cannot reveal what topics emerge from the 15,000 tweets, which is where the LDA model comes in.

Topic Modelling with the LDA

To see why LDA is a useful topic modelling algorithm for this exercise, imagine that all the tweets have been broken down into individual words, as we did in pre-processing. LDA examines the distribution and frequency of these words within their respective documents and tries to infer a fixed number of topics. Each topic is represented by a set of words from the tweets, and LDA maps all the words from the tweets to the topics so that every word is captured by some topic. This yields output like the following:

LDA Output by Topic
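
The assumptions behind that output can be stated compactly. What follows is the standard textbook formulation of LDA's generative process, not something taken from the original post; the symbols (K topics, D documents, Dirichlet hyperparameters α and β) are the conventional ones.

```latex
\begin{align*}
\phi_k &\sim \operatorname{Dirichlet}(\beta), & k &= 1,\dots,K \\
\theta_d &\sim \operatorname{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \operatorname{Categorical}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \phi &\sim \operatorname{Categorical}(\phi_{z_{d,n}})
\end{align*}
```

Here each topic k is a distribution φ_k over the vocabulary, each tweet d has its own mixture θ_d over topics, and the n-th word of tweet d is drawn by first picking a topic z and then picking a word w from that topic. Fitting the model runs this story in reverse: given only the observed words, it estimates the topics and mixtures that most plausibly generated them. In this analysis K = 4.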

Key Topics & their Interpretations

Four topics emerged from the topic model, which speak to some of the key themes of the events that ensued after Clicks released the offensive online ad.

Visualising the Twitter Topics with pyLDAvis

The LDA output doesn't make much sense in this raw format, which is why we use pyLDAvis, an interactive LDA visualisation package, to plot all generated topics and their keywords. pyLDAvis calculates the semantic distance between topics and projects the topics onto a 2D plane.

The size of a bubble represents the “importance” of the topic, that is, how prevalent it is across the corpus, and the distance between bubbles reflects the similarity between topics: the closer two circles are, the more similar the topics. A good topic model should have a few dominant, bigger bubbles, with smaller ones scattered across the plane, and should avoid overlaps, which indicate that topics are bleeding into one another. I iterated to 4 topics, which meet these criteria.
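
The distance behind this plot can be illustrated in a few lines. By default, as far as I understand its documentation, pyLDAvis measures inter-topic distance with the Jensen-Shannon divergence between the topics' word distributions and then projects the distances onto two dimensions. The sketch below covers only the distance step, in plain Python rather than through pyLDAvis, and the three topic distributions are made up.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(p, q):
    """Symmetric, bounded divergence between two topic-word distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic-word distributions over a four-word vocabulary.
topic_a = [0.7, 0.2, 0.1, 0.0]
topic_b = [0.6, 0.3, 0.1, 0.0]
topic_c = [0.0, 0.1, 0.2, 0.7]

near = jensen_shannon(topic_a, topic_b)  # topics that favour the same words
far = jensen_shannon(topic_a, topic_c)   # topics that favour different words
```

Topics that put their probability on the same words get a small divergence and land close together on the plane; topics with little vocabulary in common end up far apart.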

Topic 1: General Clicks & Hair Ad Discourse

At over 9,000 occurrences, the root word “click” is, as expected, the most common in the entire corpus, since it names the brand at hand. The key themes surrounding the brand have to do with the hair advert, with key words emerging such as “hair”, “clicksmustfall”, “employee” and even “eff”, the political party that has been the most vocal in this discussion.

Topic 2: Protests Violence & Closed Stores

The key words in this topic speak to the discussion around the protests, which resulted in the closing of Clicks stores on 8 September. Words such as “close”, “store”, “protest” and “right” were dominant in the topic, while words such as “wrong”, “violence” and “damage” were also in the same cluster. By frequency and distribution, it would appear there were mixed feelings about the protests: the dominant sentiment deemed the “protest” to be “right”, or a “right”, while a smaller but significant share saw the protests as “wrong” and “violent”.

Topics 3 and 4: Racism, Racists & White People

These two topics sit in close proximity, with good reason: they both deal with assertions around race and racism, a topic which is at the core of this debacle.

Topic 3 has “racism” and “racist” as its most frequent words, followed by “take”, “job”, “lose” and “today”. This points to the discourse around professional accountability for the advert, which may very well have resulted in the firing, on 8 September, of a senior executive at Clicks responsible for the racist advert. Interestingly enough, the word “apology” appears at the very bottom of this topic.

Topic 4 speaks to some of the same themes as Topic 3 but centres specifically on “white”, “people” and “clicksshutdown”.

Conclusion

The social media landscape has created a new social currency which users can leverage to get companies to respond to criticism. Topic modelling of social media allows us to parse thousands of data points within the Twitter sphere, gauge the heartbeat of the populace and understand the themes which emerge.

Original article: https://medium.com/@juliansteam/voice-of-customer-is-the-new-currency-what-machine-learning-twitter-reveal-about-the-clicks-248699698a9a