Linear Algebra for Data Science
Linear algebra is the branch of mathematics that deals with vector spaces. Although I can’t hope to teach you linear algebra in a brief article, it underpins a large number of data science concepts and techniques, which means I owe it to you to at least try. What we learn in this article we’ll use heavily throughout the later data science and machine learning articles.
Abstractly, vectors are objects that can be added together (to form new vectors) and that can be multiplied by scalars (i.e., numbers), also to form new vectors.
Concretely (for us), vectors are points in some finite-dimensional space. Although you might not think of your data as vectors, they are a good way to represent numeric data.
For example, if you have the heights, weights, and ages of a large number of people, you can treat your data as three-dimensional vectors (height, weight, age). If you’re teaching a class with four exams, you can treat student grades as four-dimensional vectors (exam1, exam2, exam3, exam4).
The simplest from-scratch approach is to represent vectors as lists of numbers. A list of three numbers corresponds to a vector in three-dimensional space, and vice versa:
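As a minimal sketch of the list representation, the two examples above might look like this. The variable names and the sample values are assumptions chosen purely for illustration.

```python
# Hypothetical sample values, used only to illustrate the representation.
height_weight_age = [70,   # inches
                     170,  # pounds
                     40]   # years

grades = [95,  # exam1
          80,  # exam2
          75,  # exam3
          62]  # exam4
```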
One problem with this approach is that we will want to perform arithmetic on vectors. Because Python lists aren’t vectors (and hence provide no facilities for vector arithmetic), we’ll need to build these arithmetic tools ourselves. So let’s start with that.
To begin with, we’ll frequently need to add two vectors. Vectors add componentwise. This means that if two vectors v and w are the same length, their sum is just the vector whose first element is v[0] + w[0], whose second element is v[1] + w[1], and so on. (If they’re not the same length, then we’re not allowed to add them.)
For example, adding the vectors [1, 2] and [2, 1] results in [1 + 2, 2 + 1], or [3, 3].
[Figure: Adding two vectors]
We can easily implement this by zip-ing the vectors together and using a list comprehension to add the corresponding elements:
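One way the zip-and-comprehension approach could be written is sketched below. The name vector_add matches the one used later in the text; the length check is an added assumption that enforces the same-length rule.

```python
def vector_add(v, w):
    """Add two vectors componentwise."""
    # Assumption: refuse to add vectors of different lengths.
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i + w_i for v_i, w_i in zip(v, w)]

print(vector_add([1, 2], [2, 1]))  # [3, 3]
```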
Similarly, to subtract two vectors we just subtract corresponding elements:
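A matching sketch for subtraction follows. The function name vector_subtract is an assumption, as is the length check.

```python
def vector_subtract(v, w):
    """Subtract w from v componentwise."""
    # Assumption: refuse to subtract vectors of different lengths.
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i - w_i for v_i, w_i in zip(v, w)]

print(vector_subtract([5, 7, 9], [4, 5, 6]))  # [1, 2, 3]
```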
We’ll also sometimes want to componentwise sum a list of vectors. That is, create a new vector whose first element is the sum of all the first elements, whose second element is the sum of all the second elements, and so on. The easiest way to do this is by adding one vector at a time:
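The one-at-a-time approach might be sketched as follows. The name vector_sum is an assumption, and vector_add is restated so the block runs on its own.

```python
def vector_add(v, w):
    """Add two vectors componentwise."""
    return [v_i + w_i for v_i, w_i in zip(v, w)]

def vector_sum(vectors):
    """Sum a non-empty list of same-length vectors componentwise."""
    result = vectors[0]                      # start with the first vector
    for vector in vectors[1:]:               # then fold in each remaining one
        result = vector_add(result, vector)
    return result

print(vector_sum([[1, 2], [3, 4], [5, 6], [7, 8]]))  # [16, 20]
```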
If you think about it, we are just reduce-ing the list of vectors using vector_add, which means we can rewrite this more briefly using higher-order functions:
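A sketch of the reduce-based version, assuming Python 3, where reduce lives in functools. vector_add is restated so the block is self-contained.

```python
from functools import reduce

def vector_add(v, w):
    """Add two vectors componentwise."""
    return [v_i + w_i for v_i, w_i in zip(v, w)]

def vector_sum(vectors):
    """Fold vector_add over the list, left to right."""
    return reduce(vector_add, vectors)

print(vector_sum([[1, 2], [3, 4], [5, 6]]))  # [9, 12]
```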
or even:
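The even shorter form presumably relies on partial application; one possible sketch, using functools.partial to pre-fill the first argument of reduce:

```python
from functools import partial, reduce

def vector_add(v, w):
    """Add two vectors componentwise."""
    return [v_i + w_i for v_i, w_i in zip(v, w)]

# partial fixes reduce's first argument, so vector_sum(vectors)
# is the same call as reduce(vector_add, vectors).
vector_sum = partial(reduce, vector_add)

print(vector_sum([[1, 2], [3, 4], [5, 6]]))  # [9, 12]
```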
although this last one is probably more clever than helpful.
We’ll also need to be able to multiply a vector by a scalar, which we do simply by multiplying each element of the vector by that number:
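A minimal sketch of scalar multiplication; the name scalar_multiply and the argument order (scalar first) are assumptions.

```python
def scalar_multiply(c, v):
    """Multiply every element of the vector v by the number c."""
    return [c * v_i for v_i in v]

print(scalar_multiply(2, [1, 2, 3]))  # [2, 4, 6]
```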
This allows us to compute the componentwise means of a list of (same-sized) vectors:
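The mean can be sketched as the vector sum scaled by 1/n. The name vector_mean is an assumption, and the helpers are restated so the block runs by itself.

```python
from functools import reduce

def vector_add(v, w):
    return [v_i + w_i for v_i, w_i in zip(v, w)]

def vector_sum(vectors):
    return reduce(vector_add, vectors)

def scalar_multiply(c, v):
    return [c * v_i for v_i in v]

def vector_mean(vectors):
    """Vector whose i-th element is the mean of the i-th elements of the inputs."""
    n = len(vectors)
    return scalar_multiply(1 / n, vector_sum(vectors))

print(vector_mean([[1, 2], [3, 4], [5, 6], [7, 8]]))  # [4.0, 5.0]
```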
A less obvious tool is the dot product. The dot product of two vectors is the sum of their componentwise products:
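A sketch of the dot product as just defined. The name dot follows the text's own usage; the length check is an added assumption.

```python
def dot(v, w):
    """Return v_1 * w_1 + ... + v_n * w_n."""
    assert len(v) == len(w), "vectors must be the same length"
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

print(dot([1, 2, 3], [4, 5, 6]))  # 32
print(dot([3, 4], [1, 0]))        # 3, the first component of [3, 4]
```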
When w has length 1, the dot product measures how far the vector v extends in the w direction. For example, if w = [1, 0] then dot(v, w) is just the first component of v. Another way of saying this is that it’s the length of the vector you’d get if you projected v onto w.
[Figure: The dot product as vector projection]
Using this, it’s easy to compute a vector’s sum of squares:
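Since the dot product of a vector with itself adds up the squares of its elements, the sum of squares could be sketched like this (the name sum_of_squares is an assumption):

```python
def dot(v, w):
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def sum_of_squares(v):
    """Return v_1 * v_1 + ... + v_n * v_n."""
    return dot(v, v)

print(sum_of_squares([1, 2, 3]))  # 14
```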
Which we can use to compute its magnitude (or length):
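The magnitude is then the square root of the sum of squares. A sketch, with magnitude as an assumed name:

```python
import math

def dot(v, w):
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def sum_of_squares(v):
    return dot(v, v)

def magnitude(v):
    """Length of the vector v."""
    return math.sqrt(sum_of_squares(v))

print(magnitude([3, 4]))  # 5.0
```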
We now have all the pieces we need to compute the distance between two vectors, defined as:
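The formula that originally appeared here is missing from this copy. Assuming it is the usual Euclidean distance, sqrt((v_1 - w_1)^2 + ... + (v_n - w_n)^2), a sketch in code follows. The names squared_distance and distance are assumptions.

```python
import math

def vector_subtract(v, w):
    return [v_i - w_i for v_i, w_i in zip(v, w)]

def sum_of_squares(v):
    return sum(v_i * v_i for v_i in v)

def squared_distance(v, w):
    """Return (v_1 - w_1)^2 + ... + (v_n - w_n)^2."""
    return sum_of_squares(vector_subtract(v, w))

def distance(v, w):
    """Euclidean distance between v and w."""
    return math.sqrt(squared_distance(v, w))

print(distance([1, 2], [4, 6]))  # 5.0
```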
[Figure: Distance between two vectors]
Which is possibly clearer if we write it as (the equivalent):
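The equivalent form is presumably the magnitude of the difference vector. A sketch under that assumption, with helpers restated so it runs on its own:

```python
import math

def vector_subtract(v, w):
    return [v_i - w_i for v_i, w_i in zip(v, w)]

def magnitude(v):
    return math.sqrt(sum(v_i * v_i for v_i in v))

def distance(v, w):
    """Distance between v and w as the length of v - w."""
    return magnitude(vector_subtract(v, w))

print(distance([1, 2], [4, 6]))  # 5.0
```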
That should be plenty to get us started. We’ll be using these functions heavily throughout the article.
Note: Using lists as vectors is great for exposition but terrible for performance. In production code, you would want to use the NumPy library, which includes a high-performance array class with all sorts of arithmetic operations included.
A matrix is a two-dimensional collection of numbers. We will represent matrices as lists of lists, with each inner list having the same size and representing a row of the matrix. If A is a matrix, then A[i][j] is the element in the ith row and the jth column. Per mathematical convention, we will typically use capital letters to represent matrices. For example:
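Two small matrices in the list-of-lists form, with entries picked only for illustration:

```python
# A has 2 rows and 3 columns.
A = [[1, 2, 3],
     [4, 5, 6]]

# B has 3 rows and 2 columns.
B = [[1, 2],
     [3, 4],
     [5, 6]]

print(A[1][2])  # 6, the element in row 1, column 2 (zero-indexed)
```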
Note: In mathematics, you would usually name the first row of the matrix “row 1” and the first column “column 1.” Because we’re representing matrices with Python lists, which are zero-indexed, we’ll call the first row of a matrix “row 0” and the first column “column 0.”
Given this list-of-lists representation, the matrix A has len(A) rows and len(A[0]) columns, which together we call its shape:
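A sketch of a shape helper. The function name shape is an assumption, and returning 0 columns for an empty matrix is a design choice made here so that A[0] is never read from an empty list.

```python
def shape(A):
    """Return (number of rows, number of columns) of the matrix A."""
    num_rows = len(A)
    num_cols = len(A[0]) if A else 0  # assumption: an empty matrix has 0 columns
    return num_rows, num_cols

print(shape([[1, 2, 3], [4, 5, 6]]))  # (2, 3)
```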
If a matrix has n rows and k columns, we will refer to it as an n × k matrix. We can (and sometimes will) think of each row of an n × k matrix as a vector of length k, and each column as a vector of length n:
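Row and column access might be sketched as follows. The names get_row and get_column are assumptions.

```python
def get_row(A, i):
    """Return row i of A as a vector of length k."""
    return A[i]

def get_column(A, j):
    """Return column j of A as a vector of length n."""
    return [A_i[j] for A_i in A]  # take the j-th element of every row

A = [[1, 2, 3],
     [4, 5, 6]]
print(get_row(A, 1))     # [4, 5, 6]
print(get_column(A, 2))  # [3, 6]
```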
We’ll also want to be able to create a matrix given its shape and a function for generating its elements. We can do this using a nested list comprehension:
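A sketch of the nested comprehension. The name make_matrix and its parameter names are assumptions; entry_fn is any function of (i, j) that returns the element for that position.

```python
def make_matrix(num_rows, num_cols, entry_fn):
    """Return a num_rows x num_cols matrix whose (i, j) entry is entry_fn(i, j)."""
    return [[entry_fn(i, j)              # given i, build the list of entries
             for j in range(num_cols)]   # one entry for each column j
            for i in range(num_rows)]    # one such list for each row i

print(make_matrix(2, 3, lambda i, j: i + j))  # [[0, 1, 2], [1, 2, 3]]
```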
Given this function, you could make a 5 × 5 identity matrix (with 1s on the diagonal and 0s elsewhere) with:
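For instance, the identity matrix could be built like this. The helper name is_diagonal is an assumption, and make_matrix is restated so the block runs on its own.

```python
def make_matrix(num_rows, num_cols, entry_fn):
    return [[entry_fn(i, j) for j in range(num_cols)]
            for i in range(num_rows)]

def is_diagonal(i, j):
    """1 on the diagonal, 0 everywhere else."""
    return 1 if i == j else 0

identity_matrix = make_matrix(5, 5, is_diagonal)

for row in identity_matrix:
    print(row)
```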
Matrices will be important to us for several reasons.
First, we can use a matrix to represent a data set consisting of multiple vectors, simply by considering each vector as a row of the matrix. For example, if you had the heights, weights, and ages of 1,000 people, you could put them in a 1,000 × 3 matrix:
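A sketch of what the first few rows of such a matrix could look like. The numbers are hypothetical, and only three of the 1,000 rows are shown.

```python
# Each row is one person: [height, weight, age]. Values are made up.
data = [[70, 170, 40],
        [65, 120, 26],
        [77, 250, 19]]

print(len(data), len(data[0]))  # 3 3
```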
Second, as we’ll see later, we can use an n × k matrix to represent a linear function that maps k-dimensional vectors to n-dimensional vectors. Several of our techniques and concepts will involve such functions.
Third, matrices can be used to represent binary relationships. We represented the edges of a network as a collection of pairs (i, j). An alternative representation would be to create a matrix A such that A[i][j] is 1 if nodes i and j are connected and 0 otherwise.
We could also represent this as:
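The original listing is missing from this copy, so here is a sketch on a small hypothetical network of five nodes, showing the same connections first as pairs and then as a matrix. The names edges and friendships and the specific connections are assumptions.

```python
# Hypothetical undirected edges between nodes 0..4.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (3, 4)]

# The same network as a matrix: friendships[i][j] is 1 if i and j are connected.
#               node 0  1  2  3  4
friendships = [[0, 1, 1, 0, 0],  # node 0
               [1, 0, 1, 1, 0],  # node 1
               [1, 1, 0, 0, 0],  # node 2
               [0, 1, 0, 0, 1],  # node 3
               [0, 0, 0, 1, 0]]  # node 4
```

Because the edges are undirected, the matrix is symmetric: friendships[i][j] equals friendships[j][i].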
If there are very few connections, this is a much less efficient representation, since you end up having to store a lot of zeroes. However, with the matrix representation it is much quicker to check whether two nodes are connected: you just have to do a matrix lookup instead of (potentially) inspecting every edge:
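With a hypothetical five-node matrix (the name friendships and its entries are assumptions), the lookup is a single index operation:

```python
friendships = [[0, 1, 1, 0, 0],
               [1, 0, 1, 1, 0],
               [1, 1, 0, 0, 0],
               [0, 1, 0, 0, 1],
               [0, 0, 0, 1, 0]]

print(friendships[0][2] == 1)  # True: nodes 0 and 2 are connected
print(friendships[0][4] == 1)  # False: nodes 0 and 4 are not
```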
Similarly, to find the connections a node has, you only need to inspect the column (or the row) corresponding to that node:
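Again on a hypothetical five-node matrix, the connections of node 3 can be read off its row in one pass:

```python
friendships = [[0, 1, 1, 0, 0],
               [1, 0, 1, 1, 0],
               [1, 1, 0, 0, 0],
               [0, 1, 0, 0, 1],
               [0, 0, 0, 1, 0]]

# Only row 3 needs to be inspected.
friends_of_three = [i
                    for i, is_friend in enumerate(friendships[3])
                    if is_friend]

print(friends_of_three)  # [1, 4]
```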
Previously we added a list of connections to each node object to speed up this process, but for a large, evolving graph that would probably be too expensive and difficult to maintain.
Linear algebra is widely used by data scientists (frequently implicitly, and not infrequently by people who don’t understand it). It wouldn’t be a bad idea to read a textbook. You can find several freely available online:
I hope you found this article useful. Thank you for reading this far. If you have any questions or suggestions, let me know in the comments. You can also get in touch with me directly through email and LinkedIn.
Translated from: https://medium.com/analytics-vidhya/linear-algebra-for-data-science-3c423078dc22