Chatter Charts (CC) is a sports visualization that mixes statistics with social media data to create a storyboard retelling of the game through the collective fanbase’s perspective.
CC splits a sports game’s social media comments into two-minute intervals and treats these comments as if they are part of a book telling a linear story where each interval is a chapter.
With this approach, it can leverage a statistical method called TF-IDF to rank all words at every interval and filter for the best performing one.
In the data frame above:
interval is the comment timestamp rounded to its two-minute interval
interval_volume is the number of full-text comments inside that interval
word is the word with the highest tf_idf in that interval
n is the word’s raw count inside that interval
tf is the word’s share of all words in that interval; for example, 2.4% of the words were “scorianov”
idf is a weight based on how rarely the word occurs across the entire corpus; 3.5 is quite high because “scorianov” rarely occurs over the whole game
tf_idf is tf multiplied by idf
In short, TF-IDF calculates the relative word count in an interval (Term Frequency) and weights it based on how often that word appears throughout the entire game (Inverse-Document Frequency).
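To make the arithmetic concrete, here is a minimal sketch on made-up counts (the words and numbers are hypothetical, not the game data from the article). It computes tf and idf by hand, assuming the natural-log idf that {tidytext}'s bind_tf_idf uses, and checks the result against bind_tf_idf itself.

```r
library(dplyr)
library(tidytext)

# Hypothetical word counts for three two-minute intervals (not real game data)
counts <- tribble(
  ~interval, ~word,     ~n,
  1,         "the",     10,
  1,         "penalty",  4,
  1,         "hooking",  3,
  2,         "the",      9,
  2,         "penalty",  5,
  2,         "dive",     3,
  3,         "the",     12,
  3,         "goal",     6
)

n_intervals <- n_distinct(counts$interval)

by_hand <- counts %>%
  group_by(interval) %>%
  mutate(tf = n / sum(n)) %>%                              # share of the interval's words
  group_by(word) %>%
  mutate(idf = log(n_intervals / n_distinct(interval))) %>% # rarer across intervals = larger weight
  ungroup() %>%
  mutate(tf_idf = tf * idf)

# Same calculation through {tidytext}
with_tidytext <- counts %>% bind_tf_idf(word, interval, n)

# Top word per interval: "the" appears in every interval, so its idf is 0,
# and "hooking" outranks the more frequent "penalty" in interval 1
top_words <- by_hand %>%
  group_by(interval) %>%
  slice_max(tf_idf, n = 1) %>%
  ungroup()

top_words
```

In this toy data “the” is kept in on purpose to show the weighting at work; in the actual workflow stop-words are removed before this step.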
This way, generic words such as “the” are punished heavily. Meanwhile, words such as “hooking” or “dive” are able to outperform “penalty” when multiple penalties happen in a game.
You can read more about TF-IDF on its Wikipedia page.
It’s also the answer to one of the questions I get asked most often:
“How come this isn’t all swearing?!”
Here’s the cold water: cussing is very common among sports fans, so common that it is treated as noise. Look at the data frame above, specifically at the IDF weighting for “fuck”: it is DRASTICALLY lower than the others, so the word needed to appear 18 times to outrank the other words in the interval.
I pull comment data from two sources:
Reddit game threads via Python’s {PRAW} package:
[Figure: r/Canucks game thread]
and tweets via R’s {rtweet} package:
[Figure: a tweet that used #GoStars]
Tip: You’ll need to create a Reddit web app to use {PRAW}, and a Twitter developer account to get API access and make requests with {rtweet}.
I also find the goal, intermission, and game start/end markers by leveraging the respective timestamps from either team’s official Twitter account. They love announcing when their team scores!
So by combining the comments with the marker data, I can plot real-time reactions to the game.
I’ve built a workflow where I only need to provide a few details about the game and everything else will populate. This is my command center:
[Figure: Canucks POV example. I just need to fill this out and my scripts will do the rest.]
In the background, team-specific data is looked up using the main_team and opponent variables. My script looks up colours, logos, and social media info, and tracks down a list of fans who tweet about the team from a CSV I’ve created.
This workflow was designed with the ability to build charts for any sport. For instance, football, soccer, and baseball all have large events that define a game: touchdowns, goals, and RBIs respectively.
The second portion of the code contains all the event-based markers. I have to open Twitter and copy these manually. What I paste is the string of numbers at the end of a tweet.
https://twitter.com/<account>/status/1300474445925167104
I look up those numbers, also known as the status_id, and grab their timestamps to draw goals, game start/end, intermissions, and logos.
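As an aside, the lookup above goes through the Twitter API, but modern tweet IDs are “snowflake” IDs whose upper bits hold a millisecond timestamp, so an approximate creation time can also be decoded from the status_id itself. The sketch below shows that alternative; it is not the author’s workflow, and the helper name status_id_to_time is hypothetical.

```r
# Hypothetical helper: decode the creation time embedded in a tweet's status_id.
# Assumes a "snowflake" ID, where the bits above the lowest 22 hold the number of
# milliseconds since Twitter's custom epoch (1288834974657 ms after 1970-01-01 UTC).
status_id_to_time <- function(status_id) {
  # A double cannot hold all 19 digits exactly, but the rounding only touches
  # the low bits, far below one millisecond after the division
  id <- as.numeric(status_id)
  ms <- floor(id / 2^22) + 1288834974657
  as.POSIXct(ms / 1000, origin = "1970-01-01", tz = "UTC")
}

# The status_id from the example URL above
marker_time <- status_id_to_time("1300474445925167104")
format(marker_time, "%Y-%m-%d %H:%M", tz = "UTC")
```

This gives a time in UTC, so it would still need converting to the same time zone as the comment data before plotting.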
[Figure: real-time markers]
The code to make this visualization is 97% R and 3% Python; Python only fetches the Reddit comments for me.
I have my raw data pulled. And some markers. Time to get to work. I use R from here on out.
First, group the comments into two-minute intervals. I do this with the round_date function from {lubridate}. Super easy to use.
    ### ROUND DATES
    # round_date() works on a date-time vector, so it is applied inside mutate()
    rounded_interval_df <- raw_df %>%
      mutate(interval = round_date(created_at, unit = "2 mins"))
[Figure: rounded interval output; see how `created_at` rounds into `interval`]
Why two minutes, you ask? Hockey is fast. Things happen quickly. Anything longer can drown out events.
Why not one minute? There’s usually not enough volume to satisfy TF-IDF, especially in smaller markets. However, I can use one-minute intervals for ad-hoc charts, like a third-period collapse.
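The effect of the interval length can be seen on a handful of made-up timestamps. This is a small sketch, assuming a created_at column like the one in the raw data; the comments themselves are hypothetical.

```r
library(dplyr)
library(lubridate)

# Hypothetical comments; created_at mimics the timestamp column of the raw data
raw_df <- tibble(
  created_at = ymd_hms(c("2020-08-31 19:02:10", "2020-08-31 19:02:55",
                         "2020-08-31 19:03:40", "2020-08-31 19:05:20")),
  text = c("lets go", "what a save", "scorianov", "no way")
)

rounded <- raw_df %>%
  mutate(
    interval     = round_date(created_at, unit = "2 mins"), # nearest even minute
    interval_1m  = round_date(created_at, unit = "1 min")   # nearest minute
  )

rounded

# Two-minute rounding groups the first two comments together,
# while one-minute rounding leaves every comment alone in its own interval
n_distinct(rounded$interval)
n_distinct(rounded$interval_1m)
```

With only one comment per interval there is nothing for TF-IDF to rank, which is the volume problem described above.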
Next, calculate the comment volume for each interval.
This lets me plot the line in the chart, and it acts as the y-axis guide for the animated words to follow.
    ### CALCULATE VOLUME
    interval_volume_df <- rounded_interval_df %>%
      count(interval, name = "interval_volume")
[Figure: interval volume output]
Next, tokenize. This means I take data that is currently structured as one comment per row and break it up so each word in a comment has its own row.
{tidytext} does this with unnest_tokens:
    ### TOKENIZE
    unnested_df <- rounded_interval_df %>%
      unnest_tokens(word, text, token = "tweets")
Remove stop-words. These are words like “I” and “the”; the stop_words data set is available once you load {tidytext}. TF-IDF does discount these, but I find it’s friendlier to flat-out remove them.
    ### TOKENIZED AND PROCESSED
    processed_df <- unnested_df %>%
      anti_join(stop_words, by = "word")
[Figure: a tokenized data frame; see how each non-stop-word is pulled from the sentence]
Next, count the words in each interval. This is the last data preparation step before TF-IDF.
    ### COUNT TOKENS
    counted_token_df <- processed_df %>%
      count(word, interval)
[Figure: word count of a single interval, excuse the cussing]
Lastly, apply TF-IDF.
This is the code I use to get the output above with “scorianov.” The bind_tf_idf function is from {tidytext}. You can also try using log-odds from {tidylo}.
    ### TF-IDF
    important_word_df <- counted_token_df %>%
      bind_tf_idf(word, interval, n) %>%      # one line!
      filter(n >= 3) %>%                      # number of occurrences to be considered
      filter(idf < 4) %>%                     # limit VERY random words (typically noise)
      arrange(interval, desc(tf_idf)) %>%
      distinct(interval, .keep_all = TRUE)    # take the top term

    ### COMBINE WITH VOLUME
    full_data <- interval_volume_df %>%
      full_join(important_word_df, by = "interval") %>%
      filter(interval >= min_hour, interval <= max_hour) %>%
      arrange(interval) %>%
      fill(word, .direction = "down")
Note: full_join and fill make sure that any interval where no word meets the minimum number of occurrences is forward-filled with the previous interval’s word.
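The whole chain, from tokenizing to forward-filling, can be tried end to end on a toy data frame. This is only a sketch: the comments are made up, plain numbers stand in for the interval timestamps, the default word tokenizer replaces token = "tweets", and the idf and min_hour/max_hour filters are left out because they do nothing on data this small.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical comments already assigned to intervals (1, 2, 3 stand in for timestamps)
comments <- tibble(
  interval = c(1, 1, 1, 2, 3, 3, 3),
  text = c("what a goal", "goal goal goal", "the goal horn",
           "quiet stretch",
           "penalty again", "bad penalty", "soft penalty call")
)

interval_volume_df <- comments %>%
  count(interval, name = "interval_volume")

important_word_df <- comments %>%
  unnest_tokens(word, text) %>%             # one lower-cased word per row
  anti_join(stop_words, by = "word") %>%    # drop "what", "a", "the", ...
  count(word, interval) %>%
  bind_tf_idf(word, interval, n) %>%
  filter(n >= 3) %>%                        # interval 2 has no word used 3+ times
  arrange(interval, desc(tf_idf)) %>%
  distinct(interval, .keep_all = TRUE)      # top term per interval

full_data <- interval_volume_df %>%
  full_join(important_word_df, by = "interval") %>%
  arrange(interval) %>%
  fill(word, .direction = "down")           # interval 2 inherits "goal" from interval 1

full_data %>% select(interval, interval_volume, word)
```

Interval 2 drops out of important_word_df entirely, so the full_join is what brings it back with a missing word, and fill then carries the previous word forward.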
That’s about 70% of the important code that goes into the chart; the vast majority of the work happens behind the scenes. The rest is styling the plot.
The Chatter Chart looks like this before animation.
[Figure: base plot before animation]
At this stage, I animate the plot using {gganimate}.
    animated_plot <- base_plot +
      transition_reveal(interval) # animate over interval

    animate(
      plot = animated_plot,
      fps = 25, duration = 38,
      height = 608, width = 1080, units = "px",
      type = "cairo", res = 144,
      renderer = av_renderer("file-name.mp4")
    )
By transitioning over the interval, I can build dynamic parts of the chart. For instance, my scoreboard increments and grows when someone scores. It’s quite similar to creating markers in Adobe After Effects.
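One way such a scoreboard could be driven, sketched here under assumptions rather than taken from the author’s code, is to turn the goal markers into a running score per interval and let transition_reveal expose it along with the volume line. The data, the goal_intervals vector, and the column names are all hypothetical.

```r
library(dplyr)
library(ggplot2)
library(gganimate)

# Hypothetical comment volume per interval, plus the intervals where goals happened
volume_df <- tibble(
  interval = 1:10,
  interval_volume = c(5, 8, 6, 20, 35, 18, 12, 9, 14, 30)
)
goal_intervals <- c(4, 9)

# Running score: increases by one at each goal marker
score_df <- volume_df %>%
  mutate(score = cumsum(interval %in% goal_intervals))

base_plot <- ggplot(score_df, aes(x = interval, y = interval_volume)) +
  geom_line() +
  geom_text(aes(label = paste("Goals:", score)), vjust = -1)

# transition_reveal() draws the plot progressively along `interval`
animated_plot <- base_plot + transition_reveal(interval)

# Rendering is left commented out because it needs a renderer such as {gifski} or {av}
# animate(animated_plot, fps = 10, duration = 5)
```

Because the score is just another column keyed on interval, it stays in sync with the line without any separate timing logic.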
The rest is intermediate-level ggplot. Some of the packages I leverage include {ggtext} for adding HTML styling to the title, {shadowtext} for adding the white background to the words, and {extrafont} for importing custom fonts.
Want to read more about how this evolved over the years?
I talk about scaling, marketing strategy, and iterating in public in this article.
https://medium.com/@chattercharts/ok-whats-the-statistics-behind-chatter-charts-5bb6b149e6c9
As for now? I’m looking for sponsors and partners, and I’d love to come on your podcast.
Email: chattercharts@gmail.com
Translated from: https://medium.com/@chattercharts/chatter-charts-methodology-5f82a405a673