Explore Machine Learning from Scratch
Many people imagine that data science is mostly machine learning and that data scientists mostly build and train and tweak machine-learning models all day long. (Then again, many of those people don’t actually know what machine learning is.)
In fact, data science is mostly turning business problems into data problems and collecting data and understanding data and cleaning data and formatting data, after which machine learning is almost an afterthought. Even so, it’s an interesting and essential afterthought that you pretty much have to know about in order to do data science.
Before we can talk about machine learning we need to talk about models.
What is a model? It’s simply a specification of a mathematical (or probabilistic) relationship that exists between different variables.
For instance, if you’re trying to raise money for your social networking site, you might build a business model (likely in a spreadsheet) that takes inputs like “number of users” and “ad revenue per user” and “number of employees” and outputs your annual profit for the next several years.
A cookbook recipe entails a model that relates inputs like “number of eaters” and “hungriness” to quantities of ingredients needed. And if you’ve ever watched poker on television, you know that they estimate each player’s “win probability” in real time based on a model that takes into account the cards that have been revealed so far and the distribution of cards in the deck.
The business model is probably based on simple mathematical relationships: profit is revenue minus expenses, revenue is units sold times average price, and so on. The recipe model is probably based on trial and error — someone went in a kitchen and tried different combinations of ingredients until they found one they liked. And the poker model is based on probability theory, the rules of poker, and some reasonably innocuous assumptions about the random process by which cards are dealt.
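To make the idea of a model as a written-down relationship concrete, here is a minimal sketch of the spreadsheet-style business model. The function name and the `cost_per_employee` input are assumptions made for the example; the text only names users, ad revenue per user, and number of employees.

```python
def annual_profit(num_users, ad_revenue_per_user, num_employees, cost_per_employee):
    """Hypothetical, simplified business model: profit is revenue minus expenses."""
    revenue = num_users * ad_revenue_per_user        # revenue = units * average price
    expenses = num_employees * cost_per_employee     # assumed: salaries are the only expense
    return revenue - expenses

# Made-up inputs, purely for illustration.
print(annual_profit(num_users=1000, ad_revenue_per_user=5.0,
                    num_employees=2, cost_per_employee=1500.0))
```

Nothing here is learned from data; the relationship is simply asserted, which is what separates this kind of model from the ones discussed next.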
Everyone has her own exact definition, but we’ll use machine learning to refer to creating and using models that are learned from data. In other contexts this might be called predictive modeling or data mining, but we will stick with machine learning. Typically, our goal will be to use existing data to develop models that we can use to predict various outcomes for new data, such as:
• Predicting whether an email message is spam or not
• Predicting whether a credit card transaction is fraudulent
• Predicting which advertisement a shopper is most likely to click on
• Predicting which football team is going to win the Super Bowl
We’ll look at both supervised models (in which there is a set of data labeled with the correct answers to learn from) and unsupervised models (in which there are no such labels). There are various other types, like semi-supervised (in which only some of the data are labeled) and online (in which the model needs to continuously adjust to newly arriving data), that we won’t cover in this article.
Now, in even the simplest situation there are entire universes of models that might describe the relationship we’re interested in. In most cases we will ourselves choose a parameterized family of models and then use data to learn parameters that are in some way optimal.
For instance, we might assume that a person’s height is (roughly) a linear function of his weight and then use data to learn what that linear function is. Or we might assume that a decision tree is a good way to diagnose what diseases our patients have and then use data to learn the “optimal” such tree. Throughout the rest of the articles we’ll be investigating different families of models that we can learn.
But before we can do that, we need to better understand the fundamentals of machine learning. For the rest of this article, we’ll discuss some of those basic concepts, before we move on to the models themselves.
A common danger in machine learning is overfitting — producing a model that performs well on the data you train it on but that generalizes poorly to any new data. This could involve learning noise in the data. Or it could involve learning to identify specific inputs rather than whatever factors are actually predictive for the desired output.
The other side of this is underfitting, producing a model that doesn’t perform well even on the training data, although typically when this happens you decide your model isn’t good enough and keep looking for a better one.
I’ve fit three polynomials to a sample of data. (Don’t worry about how; we’ll get to that in a later article.)
The horizontal line shows the best fit degree 0 (i.e., constant) polynomial. It severely underfits the training data. The best fit degree 9 (i.e., 10-parameter) polynomial goes through every training data point exactly, but it very severely overfits — if we were to pick a few more data points it would quite likely miss them by a lot. And the degree 1 line strikes a nice balance — it’s pretty close to every point, and (if these data are representative) the line will likely be close to new data points as well.
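The three fits described above can be reproduced with a small pure-Python sketch. The data here is made up (a noisy line, `y = 2x + 1` plus noise, is an assumption), and the degree 9 fit uses the fact that the best-fit degree 9 polynomial through 10 points with distinct x values is exactly the interpolating polynomial, written here in Lagrange form. It is an illustration, not the method used to produce the original figure.

```python
import random

def fit_constant(xs, ys):
    """Best-fit degree 0 polynomial: always predict the mean of y."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Best-fit degree 1 polynomial via the closed-form least-squares solution."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_interpolant(xs, ys):
    """Degree n-1 polynomial passing through all n points (Lagrange form)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict

def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)

def noisy_line(x):
    return 2 * x + 1 + rng.gauss(0, 1)     # hypothetical "true" relationship

train_x = [float(i) for i in range(10)]
train_y = [noisy_line(x) for x in train_x]
test_x = [i + 0.5 for i in range(9)]       # fresh points between the training ones
test_y = [noisy_line(x) for x in test_x]

models = {
    "degree 0": fit_constant(train_x, train_y),
    "degree 1": fit_line(train_x, train_y),
    "degree 9": fit_interpolant(train_x, train_y),
}
train_errors = {name: mean_squared_error(m, train_x, train_y) for name, m in models.items()}
test_errors = {name: mean_squared_error(m, test_x, test_y) for name, m in models.items()}
for name in models:
    print(name, "train MSE:", round(train_errors[name], 3),
          "test MSE:", round(test_errors[name], 3))
```

The degree 9 model has zero training error by construction, so the number to watch is how much worse it does on the fresh points than on the ones it was fit to.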
Clearly models that are too complex lead to overfitting and don’t generalize well beyond the data they were trained on. So how do we make sure our models aren’t too complex? The most fundamental approach involves using different data to train the model and to test the model.
The simplest way to do this is to split your data set, so that (for example) two-thirds of it is used to train the model, after which you measure the model’s performance on the remaining third.
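A minimal sketch of such a split, assuming the data set fits in a plain Python list. The name `split_data` and its `prob` and `seed` parameters are illustrative choices, not something fixed by the text.

```python
import random

def split_data(data, prob, seed=None):
    """Shuffle a copy of data and split it into fractions [prob, 1 - prob]."""
    shuffled = list(data)                      # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * prob)            # number of items in the first part
    return shuffled[:cut], shuffled[cut:]

# Roughly two-thirds for training, the rest for testing.
train, test = split_data(range(1000), 2 / 3, seed=0)
print(len(train), len(test))
```

Shuffling before cutting matters: if the rows arrive sorted (by date, by user, by label), a plain slice would give training and test sets that differ systematically.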
Often, we’ll have a matrix “x” of input variables and a vector “y” of output variables. In that case, we need to keep corresponding values together, so that each input row and its output both land in the training data or both land in the test data.
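One way to keep the pairs aligned is to shuffle indices rather than the values themselves. This is a sketch under the assumption that `xs` and `ys` are equal-length lists; the function name and parameters are illustrative.

```python
import random

def train_test_split(xs, ys, test_pct, seed=None):
    """Split xs and ys the same way, so that xs[i] always travels with ys[i]."""
    idxs = list(range(len(xs)))
    random.Random(seed).shuffle(idxs)
    n_test = int(len(idxs) * test_pct)
    test_idxs, train_idxs = idxs[:n_test], idxs[n_test:]
    return ([xs[i] for i in train_idxs], [xs[i] for i in test_idxs],
            [ys[i] for i in train_idxs], [ys[i] for i in test_idxs])

xs = list(range(100))
ys = [2 * x for x in xs]                       # made-up outputs
x_train, x_test, y_train, y_test = train_test_split(xs, ys, 0.25, seed=0)
print(len(x_train), len(x_test))
```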
With the data split this way, the usual workflow is to train on one part and measure performance on the other.
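The workflow might look like the following sketch. `MeanModel` is a hypothetical stand-in with `train` and `test` methods, chosen only so the example runs on its own; the data is made up, and the paired split is done inline by shuffling (x, y) pairs together.

```python
import random

class MeanModel:
    """Hypothetical stand-in model: always predicts the mean training output."""

    def train(self, xs, ys):
        self.prediction = sum(ys) / len(ys)

    def test(self, xs, ys):
        """Return the mean squared error on the given data."""
        return sum((self.prediction - y) ** 2 for y in ys) / len(ys)

rng = random.Random(0)
xs = [[i, i % 7] for i in range(300)]          # made-up input rows
ys = [3.0 + rng.gauss(0, 1) for _ in xs]       # made-up outputs

# Shuffle (x, y) pairs together, then hold out the last third for testing.
pairs = list(zip(xs, ys))
rng.shuffle(pairs)
cut = len(pairs) * 2 // 3
x_train, y_train = zip(*pairs[:cut])
x_test, y_test = zip(*pairs[cut:])

model = MeanModel()
model.train(x_train, y_train)                  # the model only ever sees training data
performance = model.test(x_test, y_test)       # judged on data it has not seen
print("test MSE:", performance)
```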
If the model was overfit to the training data, then it will hopefully perform really poorly on the (completely separate) test data. Said differently, if it performs well on the test data, then you can be more confident that it’s fitting rather than overfitting.
However, there are a couple of ways this can go wrong.
The first is if there are common patterns in the test and train data that wouldn’t generalize to a larger data set.
For example, imagine that your data set consists of user activity, one row per user per week. In such a case, most users will appear in both the training data and the test data, and certain models might learn to identify users rather than discover relationships involving attributes. This isn’t a huge worry, although it did happen to me once.
A bigger problem is if you use the test/train split not just to judge a model but also to choose from among many models. In that case, although each individual model may not be overfit, choosing the model that performs best on the test set is a kind of meta-training that makes the test set function as a second training set. (Of course the model that performed best on the test set is going to perform well on the test set.)
In such a situation, you should split the data into three parts: a training set for building models, a validation set for choosing among trained models, and a test set for judging the final model.
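A three-way split can be sketched the same way as a two-way one. The function name, the parameters, and the 60/20/20 proportions in the example are assumptions; the text does not prescribe particular sizes.

```python
import random

def train_validation_test_split(data, train_pct, validation_pct, seed=None):
    """Shuffle a copy of data and cut it into train, validation, and test parts."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    train_end = int(len(shuffled) * train_pct)
    validation_end = train_end + int(len(shuffled) * validation_pct)
    return (shuffled[:train_end],                  # for building models
            shuffled[train_end:validation_end],    # for choosing among trained models
            shuffled[validation_end:])             # for judging the final model

train, validation, test = train_validation_test_split(range(1000), 0.6, 0.2, seed=0)
print(len(train), len(validation), len(test))
```

The test part should be touched only once, after the model has been chosen; otherwise it slides back into being a second validation set.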
When I’m not doing data science, I dabble in medicine. And in my spare time I’ve come up with a cheap, noninvasive test that can be given to a newborn baby that predicts — with greater than 98% accuracy — whether the newborn will ever develop leukemia. My lawyer has convinced me the test is unpatentable, so I’ll share with you the details here: predict leukemia if and only if the baby is named Luke (which sounds sort of like “leukemia”).
As we’ll see below, this test is indeed more than 98% accurate. Nonetheless, it’s an incredibly stupid test, and a good illustration of why we don’t typically use “accuracy” to measure how good a model is.
Imagine building a model to make a binary judgment. Is this email spam? Should we hire this candidate? Is this air traveler secretly a terrorist?
Given a set of labeled data and such a predictive model, every data point lies in one of four categories:
• True positive: “This message is spam, and we correctly predicted spam.”
• False positive (Type 1 Error): “This message is not spam, but we predicted spam.”
• False negative (Type 2 Error): “This message is spam, but we predicted not spam.”
• True negative: “This message is not spam, and we correctly predicted not spam.”
We often represent these as counts in a confusion matrix, with one row for each prediction (spam or not spam) and one column for each actual label.
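The four counts can be tallied directly from labeled data and predictions. This is a minimal sketch with made-up labels; the function name and the tuple order `(tp, fp, fn, tn)` are illustrative choices.

```python
def confusion_counts(actual, predicted):
    """Count (tp, fp, fn, tn) for two equal-length lists of booleans."""
    tp = fp = fn = tn = 0
    for is_positive, said_positive in zip(actual, predicted):
        if said_positive and is_positive:
            tp += 1          # spam, predicted spam
        elif said_positive:
            fp += 1          # not spam, predicted spam (Type 1 error)
        elif is_positive:
            fn += 1          # spam, predicted not spam (Type 2 error)
        else:
            tn += 1          # not spam, predicted not spam
    return tp, fp, fn, tn

actual    = [True, True, False, False, True, False]   # made-up labels: is it spam?
predicted = [True, False, True, False, True, False]   # made-up model output
print(confusion_counts(actual, predicted))            # (2, 1, 1, 2)
```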
Let’s see how my leukemia test fits into this framework. These days approximately 5 babies out of 1,000 are named Luke. And the lifetime prevalence of leukemia is about 1.4%, or 14 out of every 1,000 people.
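If we believe these two factors are independent and apply my “Luke is for leukemia” test to 1 million people, we’d expect to see a confusion matrix with 70 true positives (Lukes who develop leukemia), 4,930 false positives (Lukes who don’t), 13,930 false negatives (non-Lukes who develop leukemia), and 981,070 true negatives.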
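The expected counts follow from the two rates quoted above plus the independence assumption. The sketch below works them out with integer arithmetic; the variable names are illustrative.

```python
population = 1_000_000
lukes = population * 5 // 1000          # 5 in 1,000 babies are named Luke
leukemia = population * 14 // 1000      # 1.4% lifetime prevalence

# Independence assumption: Lukes develop leukemia at the same 1.4% rate as everyone else.
tp = lukes * 14 // 1000                 # named Luke and develops leukemia
fp = lukes - tp                         # named Luke, no leukemia
fn = leukemia - tp                      # not named Luke, develops leukemia
tn = population - tp - fp - fn          # not named Luke, no leukemia

print(f"{'':>10} {'leukemia':>10} {'no leukemia':>12}")
print(f"{'Luke':>10} {tp:>10,} {fp:>12,}")
print(f"{'not Luke':>10} {fn:>10,} {tn:>12,}")
```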
We can then use these to compute various statistics about model performance. For example, accuracy is defined as the fraction of correct predictions: (true positives + true negatives) / total. For the Luke test that comes to 981,140 correct predictions out of 1,000,000, or about 98.1%.
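As a small sketch, accuracy can be computed from the four confusion-matrix counts. The counts passed in below are the ones implied by the figures above under the independence assumption (70 true positives, 4,930 false positives, 13,930 false negatives, 981,070 true negatives); the function signature is an illustrative choice.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that were correct."""
    correct = tp + tn
    total = tp + fp + fn + tn
    return correct / total

print(accuracy(70, 4930, 13930, 981070))   # about 0.981
```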
That seems like a pretty impressive number. But clearly this is not a good test, which means that we probably shouldn’t put a lot of credence in raw accuracy.
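It’s common to look at the combination of precision and recall. Precision measures how accurate our positive predictions were: true positives / (true positives + false positives). Under the independence assumption only 1.4% of the babies named Luke develop leukemia, so the precision of the Luke test is 1.4%.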
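And recall measures what fraction of the positives our model identified: true positives / (true positives + false negatives). Only 0.5% of the people who develop leukemia are named Luke, so the recall of the Luke test is 0.5%.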
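Both measures are one-liners over the confusion-matrix counts. In this sketch the counts are again the ones implied by the Luke figures under the independence assumption, and the four-argument signatures are an illustrative convention (unused arguments are kept so both functions can be called the same way).

```python
def precision(tp, fp, fn, tn):
    """Of everything we predicted positive, what fraction really was positive?"""
    return tp / (tp + fp)

def recall(tp, fp, fn, tn):
    """Of everything that really was positive, what fraction did we catch?"""
    return tp / (tp + fn)

print(precision(70, 4930, 13930, 981070))   # 70 / 5,000, i.e. 1.4%
print(recall(70, 4930, 13930, 981070))      # 70 / 14,000, i.e. 0.5%
```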
These are both terrible numbers, reflecting that this is a terrible model.
Sometimes precision and recall are combined into the F1 score, which is defined as 2 × precision × recall / (precision + recall).
This is the harmonic mean of precision and recall, and it necessarily lies between them.
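A short sketch of the F1 computation, using the same Luke-test counts as before (implied by the figures above under the independence assumption). The function name and signature are illustrative.

```python
def f1_score(tp, fp, fn, tn):
    """Harmonic mean of precision and recall."""
    p = tp / (tp + fp)      # precision
    r = tp / (tp + fn)      # recall
    return 2 * p * r / (p + r)

print(f1_score(70, 4930, 13930, 981070))   # roughly 0.0074, between 0.005 and 0.014
```

The harmonic mean is pulled toward the smaller of its two inputs, so a model cannot earn a good F1 score by excelling at only one of precision and recall.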
Usually the choice of a model involves a trade-off between precision and recall.
• A model that predicts “yes” when it’s even a little bit confident will probably have a high recall but a low precision.
• A model that predicts “yes” only when it’s extremely confident is likely to have a low recall and a high precision.
Alternatively, you can think of this as a trade-off between false positives and false negatives.
• Saying “yes” too often will give you lots of false positives.
• Saying “no” too often will give you lots of false negatives.
Imagine that there were 10 risk factors for leukemia, and that the more of them you had the more likely you were to develop leukemia. In that case you can imagine a continuum of tests:
• “predict leukemia if at least one risk factor,”
• “predict leukemia if at least two risk factors,”
and so on.
As you increase the threshold, you increase the test’s precision (since people with more risk factors are more likely to develop the disease), and you decrease the test’s recall (since fewer and fewer of the eventual disease sufferers will meet the threshold). In cases like this, choosing the right threshold is a matter of finding the right trade-off.
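The threshold trade-off can be seen on a tiny made-up data set. Everything below (the ten people, their risk-factor counts, and their outcomes) is invented for illustration; only the idea of sweeping the threshold comes from the text.

```python
# Made-up data: (number of risk factors, whether the person developed the disease).
people = [(0, False), (1, False), (1, False), (2, False), (2, True),
          (3, False), (3, True), (4, True), (5, True), (6, True)]

def precision_and_recall(threshold):
    """Predict disease for everyone with at least `threshold` risk factors."""
    tp = sum(1 for count, sick in people if count >= threshold and sick)
    fp = sum(1 for count, sick in people if count >= threshold and not sick)
    fn = sum(1 for count, sick in people if count < threshold and sick)
    return tp / (tp + fp), tp / (tp + fn)

results = [precision_and_recall(threshold) for threshold in range(1, 6)]
for threshold, (p, r) in zip(range(1, 6), results):
    print(f"at least {threshold} risk factor(s): precision={p:.2f} recall={r:.2f}")
```

On this data, precision climbs and recall falls as the threshold rises. Recall can never go up when the threshold is raised; precision usually rises but is not guaranteed to on real data.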
Another way of thinking about the overfitting problem is as a trade-off between bias and variance.
Both are measures of what would happen if you were to retrain your model many times on different sets of training data (from the same larger population).
For example, the degree 0 model from the overfitting and underfitting discussion above will make a lot of mistakes for pretty much any training set (drawn from the same population), which means that it has a high bias. However, any two randomly chosen training sets should give pretty similar models (since any two randomly chosen training sets should have pretty similar average values). So we say that it has a low variance. High bias and low variance typically correspond to underfitting.
On the other hand, the degree “9” model fit the training set perfectly. It has very low bias but very high variance (since any two training sets would likely give rise to very different models). This corresponds to overfitting.
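The retraining thought experiment can be simulated directly. The sketch below assumes a made-up population (`y = 2x + 1` plus Gaussian noise at ten fixed x values), retrains a degree 0 and a degree 9 model on 200 fresh training sets, and looks at how each model’s prediction at a single query point behaves. Measuring bias and variance at one point is a simplification chosen for brevity.

```python
import random

def fit_constant(xs, ys):
    """Degree 0 model: always predict the mean training output."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_interpolant(xs, ys):
    """Degree n-1 polynomial through all n training points (Lagrange form)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict

def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

rng = random.Random(0)
xs = [float(i) for i in range(10)]
query_x = 0.5
true_y = 2 * query_x + 1          # hypothetical true relationship: y = 2x + 1

predictions = {"degree 0": [], "degree 9": []}
for _ in range(200):              # 200 independent training sets
    ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]
    predictions["degree 0"].append(fit_constant(xs, ys)(query_x))
    predictions["degree 9"].append(fit_interpolant(xs, ys)(query_x))

for name, preds in predictions.items():
    bias = mean(preds) - true_y   # how far off the model is on average
    print(name, "bias:", round(bias, 2), "variance:", round(variance(preds), 2))
```

The constant model is consistently wrong in the same way (large bias, small variance), while the degree 9 model is right on average but swings widely from one training set to the next.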
Thinking about model problems this way can help you figure out what to do when your model doesn’t work so well.
If your model has high bias (which means it performs poorly even on your training data), then one thing to try is adding more features. Going from the degree 0 model to the degree 1 model in the overfitting and underfitting example was a big improvement.
If your model has high variance, then you can similarly remove features. But another solution is to obtain more data (if you can).
We fit a degree 9 polynomial to different size samples. The model fit based on 10 data points is all over the place, as we saw before. If we instead trained on 100 data points, there’s much less overfitting. And the model trained from 1,000 data points looks very similar to the degree 1 model.
Holding model complexity constant, the more data you have, the harder it is to overfit.
On the other hand, more data won’t help with bias. If your model doesn’t use enough features to capture regularities in the data, throwing more data at it won’t help.
As we mentioned, when your data doesn’t have enough features, your model is likely to underfit. And when your data has too many features, it’s easy to overfit. But what are features and where do they come from?
Features are whatever inputs we provide to our model.
In the simplest case, features are simply given to you. If you want to predict someone’s salary based on her years of experience, then years of experience is the only feature you have.
(Although, as we saw in “Overfitting and Underfitting”, you might also consider adding years of experience squared, cubed, and so on if that helps you build a better model.)
Things become more interesting as your data becomes more complicated. Imagine trying to build a spam filter to predict whether an email is junk or not. Most models won’t know what to do with a raw email, which is just a collection of text. You’ll have to extract features. For example:
• Does the email contain the word “Viagra”?
• How many times does the letter d appear?
• What was the domain of the sender?
The first is simply a “yes” or “no”, which we typically encode as a “1” or “0”. The second is a number. And the third is a choice from a discrete set of options.
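The three example features can be extracted with a few lines of code. This sketch assumes an email is represented as a dict with `sender` and `body` keys, which is a made-up representation, and it counts the letter d case-insensitively, which is also an assumption.

```python
def extract_features(email):
    """Turn a raw email (a dict with 'sender' and 'body') into three features."""
    body = email["body"].lower()
    return {
        "contains_viagra": 1 if "viagra" in body else 0,    # yes/no, encoded as 1/0
        "d_count": body.count("d"),                         # a number
        "sender_domain": email["sender"].split("@")[-1],    # one of a discrete set
    }

email = {"sender": "deals@example.com", "body": "Discounted Viagra delivered today"}
print(extract_features(email))
# {'contains_viagra': 1, 'd_count': 5, 'sender_domain': 'example.com'}
```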
Pretty much always, we’ll extract features from our data that fall into one of these three categories. What’s more, the type of features we have constrains the type of models we can use.
The Naive Bayes classifier we’ll build in a later article is suited to yes-or-no features, like the first one in the preceding list.
Regression models, which we’ll study in a later article, require numeric features (which could include dummy variables that are 0s and 1s).
And decision trees, which we’ll also look at in a later article, can deal with numeric or categorical data.
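The dummy variables mentioned above are how a categorical feature, such as the sender’s domain, can be handed to a model that only accepts numbers. A minimal sketch, with a hypothetical list of known domains:

```python
def one_hot(value, categories):
    """Encode a categorical value as a list of 0/1 dummy variables, one per category."""
    return [1 if value == category else 0 for category in categories]

domains = ["example.com", "example.org", "example.net"]   # hypothetical known domains
print(one_hot("example.org", domains))                    # [0, 1, 0]
print(one_hot("unknown.io", domains))                     # [0, 0, 0]
```

A value outside the known categories encodes as all zeros here; whether that is acceptable or should get its own “other” column depends on the model and the data.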
Although in the spam filter example we looked for ways to create features, sometimes we’ll instead look for ways to remove them. For example, your inputs might be vectors of several hundred numbers. Depending on the situation, it might be appropriate to distill these down to a handful of important dimensions and use only that small number of features. Or it might be appropriate to use a technique (like regularization, which we’ll look at in a later article) that penalizes models the more features they use.
How do we choose features? That’s where a combination of experience and domain expertise comes into play. If you’ve received lots of emails, then you probably have a sense that the presence of certain words might be a good indicator of spamminess. And you might also have a sense that the number of d’s is likely not a good indicator of spamminess. But in general you’ll have to try different things, which is part of the fun.
I hope you found this article useful. Thank you for reading this far. If you have any questions or suggestions, let me know in the comments. You can also get in touch with me directly through email, LinkedIn, or Twitter.
Translated from: https://medium.com/@ravivarmathotakura/explore-machine-learning-from-scratch-20503f8a3865