    Calculating the 10 Most Viewed Memes

    The Nielsen ratings estimate audience numbers for broadcast TV shows, but we have no independent estimate of viewing figures for memes (an art form that has exploded in popularity over the past decade). So I created some!


    I will describe how the view counts displayed in the chart above were arrived at, using a dataset of ~43k images and 5 data science principles (principles I have applied to many projects over the years).


    1) Define explicitly what you want to estimate

    Words are open to interpretation. I have lost count of how many times two people will agree in conversation on the aim of a project... only to later find they had interpreted one crucial word differently. You will save time if you thoroughly debate the definition of every word at the start of the project.


    For example, say you want to estimate the ‘UK population in 2019’. Does this mean the population at the start, end, or middle of 2019? Or the average of all three? Does population mean permanent residents, or every person including tourists and temporary residents? There is no right answer!


    So back to memes. To be clear, I am not talking about the academic definition of a meme, but rather the internet image meme (a combination of image and text that is shared online). More specifically I am interested in working out the most popular meme template (the underlying image used for the meme). So when I say ‘the most viewed meme’ I actually mean:


    The most viewed meme template (which we work out by adding up the view counts for all internet image memes using that meme template)


    So with that out of the way, it's time to start collecting data.


    2) Sample data in a way that minimises bias

    It is not possible for the Nielsen system to monitor every single television set, just like how I was not able to download every meme shared online. In both cases sampling is necessary.


    We say a data sample is unbiased if it truly represents the wider population, but in many cases this is impossible. Often we have to sample data in such a way that we minimise bias as much as reasonably possible, then do our best to correct for the remaining bias later on when analysing the data.


    For this project memes were sampled from Reddit, one of the largest image sharing websites in the world. Several times a day a web scraper would look at several meme-focused parts of the site and collect the 100 most popular posts. Many of these memes were hosted on Imgur, a site that publishes viewing figures, so cross-referencing their data allows us to infer viewing figures for Reddit posts. This sampling can be done with a few lines of Python, thanks to the APIs of both Reddit & Imgur. For brevity I won't explain it here, but the code used can be found in this article:

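    The cross-referencing step can be sketched roughly as follows. This assumes the Reddit posts have already been scraped via the API; the field names `imgur_id` and `views` are illustrative, not the author's actual schema:

```python
# Sketch of the Reddit/Imgur cross-referencing step.
# The scraping itself needs API credentials; here we assume the
# posts and Imgur view counts have already been collected.

def infer_view_counts(reddit_posts, imgur_views):
    """Attach an Imgur view count to each Reddit post that links
    to an Imgur-hosted image; posts without a match are skipped."""
    matched = []
    for post in reddit_posts:
        image_id = post.get("imgur_id")
        if image_id in imgur_views:
            matched.append({**post, "views": imgur_views[image_id]})
    return matched

posts = [
    {"title": "drake meme", "imgur_id": "abc123"},
    {"title": "other meme", "imgur_id": "zzz999"},  # no Imgur data
]
views = {"abc123": 54_000}
print(infer_view_counts(posts, views))
```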

    Now the question is: does this sampling method minimise bias? Reddit is only one website, so it does not truly represent the entire internet. We could reduce bias by sampling memes posted to other sites, like Instagram or Facebook. However, these websites provide limited public data that is not comparable; the only way to compare between sites would be to make wild assumptions that could introduce much more bias into our final estimates.


    Sometimes you just need to accept there is no right answer and make a judgement call. I judged it was better to sample from the one best source, rather than combining many sources and ending up with an unreliable dataset. I say Reddit is the best source because it is the largest image sharing website from which you can infer viewing figures to a reasonable degree of accuracy (by cross referencing with Imgur data).


    3) Complex models are only for complex problems

    We need to identify the meme template used in each of the memes in our dataset. This is an image classification problem, but more importantly it is a simple image classification problem. Do not use a complicated solution when a simple one will do just fine.


    Recent state-of-the-art image classifiers, such as those that win the ImageNet competition, are deep neural nets capable of object recognition regardless of angle, lighting, or background. Looking at a meme and recognising the underlying meme template is a far easier task, and so requires something far simpler than a 100-layer neural net.


    There are only so many meme templates, and they all have distinctive color palettes. We can accurately classify memes just by counting pixels and passing these counts to a linear support vector machine, which takes seconds to train (as opposed to the days required for a neural net). A worked example of how to construct this exact model can be found here:

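    A rough illustration of that approach, using synthetic "images" in place of the real dataset (the real feature extraction is in the linked article; this sketch just shows the shape of the pipeline):

```python
# Pixel-count features + linear SVM, on synthetic data:
# two fake "templates", one mostly dark, one mostly bright.
import numpy as np
from sklearn.svm import LinearSVC

def color_histogram(image, bins=8):
    """Count pixels per intensity bin, per RGB channel (3*bins features)."""
    feats = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    return np.concatenate(feats)

rng = np.random.default_rng(0)
dark = [rng.integers(0, 100, size=(32, 32, 3)) for _ in range(20)]
bright = [rng.integers(150, 256, size=(32, 32, 3)) for _ in range(20)]
X = np.array([color_histogram(img) for img in dark + bright])
y = np.array([0] * 20 + [1] * 20)

clf = LinearSVC(dual=False).fit(X, y)  # trains in well under a second
print(clf.score(X, y))  # palettes alone separate the two "templates"
```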

    4) Verify, using a human if possible

    Many a time, an eager young data scientist has run over to my desk proudly proclaiming a magnificent result, only for their confidence to disappear upon being asked what they did to verify the result. Magnificent results often disappear after a bit of basic verification uncovers a major flaw.


    When it comes to verifying the results of an image classification model, there is no substitute for the human eye (yet). You might think the results of an image classifier on this dataset (of around 43,000 images) would take a long time to verify, but there are many tools that can speed things up. Using this labelling tool I was able to verify the results (and mark any incorrect classifications) in a matter of hours:


    It took me on average 20 seconds to verify 100 images (viewing them in a 10x10 grid), so I ended up getting through all 43,000 images in under 3 hours. Not something I want to do every day, but once a year is fine.

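    The arithmetic checks out:

```python
# Back-of-envelope check of the verification time quoted above:
# 43,000 images viewed 100 at a time, ~20 seconds per grid.
import math

images = 43_000
per_grid = 100
seconds_per_grid = 20

grids = math.ceil(images / per_grid)           # 430 grids of 10x10
total_hours = grids * seconds_per_grid / 3600  # ~2.4 hours
print(f"{grids} grids, ~{total_hours:.1f} hours")
```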

    5) Consider every assumption carefully

    Statistical models rely on data and assumptions. Often you cannot improve the raw data, but you can improve the assumptions.


    The final step of this work is to take the dataset and extract viewing figures for each meme template. Due to limitations in the data this analysis requires a couple of extra assumptions, which I will explain below. If you want the full code for this step it can be found in this Kaggle notebook:


    The first assumption concerns missing values. When an entry in your dataset has a missing value, is it better to remove the entry (thus reducing the size of your sample) or infer what that value is (thus potentially introducing inaccuracy)? It depends on what proportion of your dataset has missing values. For a low proportion it is often better to just drop them, but for a high proportion (as with this meme dataset) dropping all those values could significantly reduce how representative your sample is, so it made more sense for me to fill those missing values as best I could.

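    One way this drop-vs-impute decision could look in code. The 10% threshold and median imputation are hypothetical choices for illustration, not rules given in the article:

```python
# Drop rows when few values are missing; impute when many are,
# so the sample stays representative.
import pandas as pd

def handle_missing_views(df, threshold=0.1):
    """Drop rows with missing view counts when few are missing;
    otherwise fill them with the median to keep the sample large."""
    missing_fraction = df["views"].isna().mean()
    if missing_fraction < threshold:
        return df.dropna(subset=["views"])
    return df.fillna({"views": df["views"].median()})

df = pd.DataFrame({
    "template": ["drake", "drake", "distracted", "expanding"],
    "views": [100.0, None, 50.0, None],
})
# 50% missing here, so the median (75.0) is imputed.
print(handle_missing_views(df))
```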

    The second assumption concerns correcting for bias towards Reddit users in our dataset. I used the following ‘propagation’ assumption to address this. I sampled from dozens of different sections of Reddit, so I could measure how many of these sections each meme template appeared in. I assumed the wider a meme spreads within Reddit, the wider it has spread outside of Reddit, and therefore the view counts for those memes are inflated to reflect that.

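    The exact inflation function is not given in the article, but one plausible reading of the propagation assumption is a linear boost proportional to a template's spread across the sampled subreddits:

```python
# Hypothetical form of the 'propagation' correction: the more of
# the sampled subreddits a template appears in, the more we assume
# it has also spread beyond Reddit, so we scale its views up.

def inflate_views(reddit_views, subreddits_seen, total_subreddits):
    """Scale raw view counts by the template's spread
    (spread = fraction of sampled subreddits it appeared in)."""
    spread = subreddits_seen / total_subreddits
    return reddit_views * (1 + spread)

# A template seen in 30 of 60 sampled subreddits gets a 1.5x boost.
print(inflate_views(1_000_000, 30, 60))
```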

    When it comes to assumptions, there is never one correct answer. The only thing you can do is make a judgement that you can justify to others.


    Results: the most viewed meme templates (of 2018)

    The method was run throughout 2018, and resulted in downloading a total of 400,000 images, of which 43,660 were identified as using one of the 250 most common meme templates.


    As we can see, the Drake meme was by far the most viewed throughout 2018, with over 157 million views (according to this analysis, which most likely underestimates the true numbers).


    Meme thumbnails are reproduced here for the purpose of commentary, as permitted under the fair use doctrine

    And the distribution of total view counts among the top templates broadly resembles a Pareto distribution.


    Figure by author
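    A synthetic illustration of what a Pareto-like distribution of view counts implies: a small share of templates accounts for most of the views. The numbers below are made up for illustration, not the article's data:

```python
# Rank-ordered view counts following a rough power law
# (exponent 1.2 chosen arbitrarily), for 250 templates.
views = [int(157e6 / rank ** 1.2) for rank in range(1, 251)]

# Under this distribution the top 10 templates capture well over
# half of all views, which is the Pareto-like pattern in the chart.
top10_share = sum(views[:10]) / sum(views)
print(f"top 10 templates: {top10_share:.0%} of all views")
```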

    Care to take a look?

    You can download all ~43k images along with the metadata here:


    Final thoughts

    There are lots of things that are difficult to measure exactly; meme popularity is one of them. Sometimes we just have to accept that and do the best job possible. I discussed 5 principles used during this work, which can be summarised as: think carefully about every step of the project, before you take the step.


    Translated from: https://towardsdatascience.com/calculating-the-10-most-viewed-memes-dc8e1e24caf3
