term-document matrix: an M × N matrix C, each of whose rows represents a term and each of whose columns represents a document in the collection.
develop a class of operations from linear algebra, known as matrix decomposition
use a special form of matrix decomposition to construct a low-rank approximation to the term-document matrix
examine the application of such low-rank approximations to indexing and retrieving documents, a technique referred to as latent semantic indexing (LSI)

For a square M × M matrix C and a vector x that is not all zeros, the values of λ satisfying

Cx = λx

are called the eigenvalues of C. The M-vector x satisfying this equation for an eigenvalue λ is the corresponding right eigenvector.
The eigenvector corresponding to the eigenvalue of largest magnitude is called the principal eigenvector.
In a similar fashion, the left eigenvectors of C are the M-vectors y such that yᵀC = λyᵀ.
The number of nonzero eigenvalues of C is at most rank(C).
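A minimal numeric sketch of these definitions, assuming NumPy (the 2 × 2 matrix is purely illustrative, not from the text):

```python
import numpy as np

# Illustrative square matrix (not from the text).
C = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# np.linalg.eig returns the eigenvalues and the corresponding right
# eigenvectors (as columns), i.e. C @ vecs[:, i] == vals[i] * vecs[:, i].
vals, vecs = np.linalg.eig(C)

# The principal eigenvector is the one paired with the eigenvalue of
# largest magnitude.
principal = vecs[:, np.argmax(np.abs(vals))]
print(vals)        # eigenvalues, here 3.0 and 1.0
print(principal)   # principal eigenvector
```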
Note:
the effect of small eigenvalues (and their eigenvectors) on a matrix–vector product is small (see the sketch after these notes)
for a symmetric matrix S, the eigenvectors corresponding to distinct eigenvalues are orthogonal; further, if S is both real and symmetric, the eigenvalues are all real
a square matrix can be factored into the product of matrices derived from its eigenvectors
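A tiny numeric sketch of the first note, assuming NumPy (the diagonal matrix and the vector are illustrative choices):

```python
import numpy as np

# Diagonal (hence symmetric) matrix with eigenvalues 30, 20 and 1;
# the third eigenvalue is small compared with the other two.
S = np.diag([30.0, 20.0, 1.0])
x = np.array([2.0, 4.0, 6.0])

# In the product S @ x, the component along the eigenvector of the small
# eigenvalue contributes little compared with the other components.
print(S @ x)   # [60. 80.  6.]
```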
Two theorems.
Let S be a square real-valued M × M matrix with M linearly independent eigenvectors. Then there exists an eigen decomposition

S = UΛU⁻¹

where the columns of U are the eigenvectors of S and Λ is a diagonal matrix whose diagonal entries are the eigenvalues of S in decreasing order.
If the eigenvalues are distinct, then this decomposition is unique.
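A minimal sketch of this eigen decomposition, assuming NumPy (the matrix is illustrative; np.linalg.eig does not sort, so we reorder to get decreasing eigenvalues as in the theorem):

```python
import numpy as np

# Illustrative square matrix with M linearly independent eigenvectors.
S = np.array([[4.0, 1.0],
              [2.0, 3.0]])

vals, U = np.linalg.eig(S)          # columns of U are eigenvectors of S
order = np.argsort(-vals)           # sort eigenvalues in decreasing order
vals, U = vals[order], U[:, order]
Lam = np.diag(vals)                 # diagonal matrix of eigenvalues

# Verify S = U Λ U^{-1} up to floating-point error.
print(np.allclose(S, U @ Lam @ np.linalg.inv(U)))   # True
```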
Let S be a square, symmetric real-valued M × M matrix with M linearly independent eigenvectors. Then there exists a symmetric diagonal decomposition

S = QΛQᵀ

where the columns of Q are the orthogonal, unit-length (and real) eigenvectors of S and Λ is the diagonal matrix of its eigenvalues; in particular Q⁻¹ = Qᵀ.
We build on this symmetric diagonal decomposition to construct low-rank approximations to term–document matrices.
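A matching sketch for the symmetric case, assuming NumPy (np.linalg.eigh is the routine specialised for real symmetric matrices; the matrix is illustrative):

```python
import numpy as np

# Illustrative real symmetric matrix.
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh returns real eigenvalues and orthonormal eigenvectors (columns of Q).
vals, Q = np.linalg.eigh(S)
Lam = np.diag(vals)

# Q is orthogonal, so Q^{-1} = Q^T, and S = Q Λ Q^T.
print(np.allclose(Q.T @ Q, np.eye(2)))   # True
print(np.allclose(S, Q @ Lam @ Q.T))     # True
```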
The term-document matrix C is M × N; barring a rare coincidence M ≠ N, and thus C is very unlikely to be symmetric.
Theorem (SVD). Let r be the rank of the M × N matrix C. Then there is a singular value decomposition of C of the form

C = UΣVᵀ

where U is the M × M matrix whose columns are the orthogonal eigenvectors of CCᵀ, V is the N × N matrix whose columns are the orthogonal eigenvectors of CᵀC, and Σ is the M × N matrix whose diagonal entries Σii = σi are the singular values of C (the square roots of the eigenvalues of CᵀC) in decreasing order, with all other entries zero.
Illustration of the SVD: depending on the shape of C there are two cases, M > N and M < N.
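A minimal sketch of the SVD itself, assuming NumPy (the small term-document matrix is illustrative; full_matrices=False gives the reduced form, with singular values in decreasing order):

```python
import numpy as np

# Illustrative term-document matrix: M = 4 terms, N = 3 documents.
C = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# Reduced SVD: C = U Σ V^T with the singular values in decreasing order.
U, sigma, Vt = np.linalg.svd(C, full_matrices=False)

print(sigma)                                     # singular values
print(np.allclose(C, U @ np.diag(sigma) @ Vt))   # True: C is recovered
```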
Given an M × N matrix C and a positive integer k, we wish to find an M × N matrix Ck of rank at most k, so as to minimize the Frobenius norm of the matrix difference X = C − Ck, defined to be

‖X‖_F = √( Σ_{i=1}^{M} Σ_{j=1}^{N} X_{ij}² )
the Frobenius norm of X measures the discrepancy between Ck and C; our goal is to find a matrix Ck that minimizes this discrepancy
When k is far smaller than r, we refer to Ck as a low-rank approximation.
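A one-line check of the Frobenius norm definition, assuming NumPy (the matrix X stands in for the difference C − Ck and is illustrative):

```python
import numpy as np

# Illustrative difference matrix X = C - C_k.
X = np.array([[0.1, -0.2, 0.0],
              [0.0,  0.3, -0.1]])

# Frobenius norm: square root of the sum of the squared entries of X.
frob = np.sqrt((X ** 2).sum())
print(np.isclose(frob, np.linalg.norm(X, 'fro')))   # True
```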
The SVD can be used to solve the low-rank matrix approximation problem. We then derive from it an application to approximating term–document matrices. We invoke the following three-step procedure to this end:
1. Given C, construct its SVD: C = UΣVᵀ.
2. Derive from Σ the matrix Σk formed by replacing by zeros the r − k smallest singular values on the diagonal of Σ.
3. Compute and output Ck = UΣkVᵀ as the rank-k approximation to C.
The rank of Ck is at most k.
this procedure yields the matrix of rank k with the lowest possible Frobenius error.
the form of Ck:

Ck = Σ_{i=1}^{k} σi ui viᵀ

where ui and vi are the ith columns of U and V, respectively. Thus ui viᵀ is a rank-1 matrix, so that we have just expressed Ck as the sum of k rank-1 matrices, each weighted by a singular value.
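A minimal sketch of the three-step procedure, assuming NumPy (the matrix and k are illustrative):

```python
import numpy as np

def low_rank_approximation(C, k):
    """Return C_k, the rank-at-most-k approximation of C via the SVD."""
    # Step 1: construct the SVD of C, C = U Sigma V^T.
    U, sigma, Vt = np.linalg.svd(C, full_matrices=False)
    # Step 2: replace the r - k smallest singular values by zeros.
    sigma_k = sigma.copy()
    sigma_k[k:] = 0.0
    # Step 3: output C_k = U Sigma_k V^T; its rank is at most k.
    return U @ np.diag(sigma_k) @ Vt

C = np.random.rand(5, 4)                  # illustrative term-document matrix
C2 = low_rank_approximation(C, 2)
print(np.linalg.matrix_rank(C2))          # at most 2
print(np.linalg.norm(C - C2, 'fro'))      # lowest possible error for rank 2
```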
LSI: the low-rank approximation to C yields a new representation for each document in the collection. We will cast queries into this low-rank representation as well, enabling us to compute query–document similarity scores in this low-rank representation. This process is known as latent semantic indexing.
use SVD to construct a low-rank approximation Ck to the term-document matrix
map each row/column to a k-dimensional space
use the new k-dimensional LSI representation to compute similarities between vectors
a query vector q is mapped into its representation in the LSI space by q_k = Σk⁻¹ Ukᵀ q
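A minimal end-to-end sketch of these four steps, assuming NumPy (the term-document matrix, the query, and k = 2 are all illustrative):

```python
import numpy as np

# Illustrative term-document matrix: M = 5 terms, N = 4 documents.
C = np.array([[1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]], dtype=float)
k = 2

# SVD of C, keeping only the k largest singular values.
U, sigma, Vt = np.linalg.svd(C, full_matrices=False)
U_k, sigma_k, Vt_k = U[:, :k], sigma[:k], Vt[:k, :]

# Each document j is represented by column j of Vt_k, i.e. by the result of
# mapping column j of C with Sigma_k^{-1} U_k^T.
docs_k = Vt_k.T                                    # N x k

# Map the query term vector q into the LSI space, q_k = Sigma_k^{-1} U_k^T q,
# then score documents by cosine similarity in the k-dimensional space.
q = np.array([1, 0, 1, 0, 0], dtype=float)
q_k = np.diag(1.0 / sigma_k) @ U_k.T @ q

sims = docs_k @ q_k / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k))
print(sims)    # query-document similarity scores in the LSI space
```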
Note: The computational cost of the SVD is significant. One approach to this obstacle is to build the LSI representation on a randomly sampled subset of the documents in the collection.
A value of k in the low hundreds can actually increase precision on some query benchmarks. This suggests that, for a suitable value of k, LSI addresses some of the challenges of synonymy.
LSI works best in applications where there is little overlap between queries and documents.
soft clustering: LSI can be viewed as soft clustering by interpreting each dimension of the reduced space as a cluster and the value that a document has on that dimension as its fractional membership in that cluster.