term-document matrix: an M × N matrix C, each of whose rows represents a term and each of whose columns represents a document in the collection.
develop a class of operations from linear algebra, known as matrix decomposition
use a special form of matrix decomposition to construct a low-rank approximation to the term-document matrix
examine the application of such low-rank approximations to indexing and retrieving documents, a technique referred to as latent semantic indexing (LSI)

For a square M × M matrix C and a vector x that is not all zeros, the values of λ satisfying

Cx = λx

are called the eigenvalues of C. The M-vector x satisfying this equation for an eigenvalue λ is the corresponding right eigenvector.
The eigenvector corresponding to the eigenvalue of largest magnitude is called the principal eigenvector.
In a similar fashion, the left eigenvectors of C are the M-vectors y such that yᵀC = λyᵀ.
The number of nonzero eigenvalues of C is at most rank(C).
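A minimal numeric sketch of these definitions, assuming NumPy (the 2 × 2 matrix is purely illustrative, not from the text):

```python
import numpy as np

# Illustrative square matrix (not from the text).
C = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# np.linalg.eig returns the eigenvalues and the corresponding right
# eigenvectors (as columns), i.e. C @ vecs[:, i] == vals[i] * vecs[:, i].
vals, vecs = np.linalg.eig(C)

# The principal eigenvector is the one paired with the eigenvalue of
# largest magnitude.
principal = vecs[:, np.argmax(np.abs(vals))]
print(vals)        # eigenvalues, here 3.0 and 1.0
print(principal)   # principal eigenvector
```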
Note:
the effect of small eigenvalues (and their eigenvectors) on a matrix–vector product is small (see the sketch after these notes)
for a symmetric matrix S, the eigenvectors corresponding to distinct eigenvalues are orthogonal; further, if S is both real and symmetric, the eigenvalues are all real
a square matrix can be factored into the product of matrices derived from its eigenvectors
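A tiny numeric sketch of the first note, assuming NumPy (the diagonal matrix and the vector are illustrative choices):

```python
import numpy as np

# Diagonal (hence symmetric) matrix with eigenvalues 30, 20 and 1;
# the third eigenvalue is small compared with the other two.
S = np.diag([30.0, 20.0, 1.0])
x = np.array([2.0, 4.0, 6.0])

# In the product S @ x, the component along the eigenvector of the small
# eigenvalue contributes little compared with the other components.
print(S @ x)   # [60. 80.  6.]
```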
Two theorems.
Let S be a square real-valued M × M matrix with M linearly independent eigenvectors. Then there exists an eigen decomposition

S = UΛU⁻¹

where the columns of U are the eigenvectors of S and Λ is a diagonal matrix whose diagonal entries are the eigenvalues of S in decreasing order.
If the eigenvalues are distinct, then this decomposition is unique.
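A minimal sketch of this eigen decomposition, assuming NumPy (the matrix is illustrative; np.linalg.eig does not sort, so we reorder to get decreasing eigenvalues as in the theorem):

```python
import numpy as np

# Illustrative square matrix with M linearly independent eigenvectors.
S = np.array([[4.0, 1.0],
              [2.0, 3.0]])

vals, U = np.linalg.eig(S)          # columns of U are eigenvectors of S
order = np.argsort(-vals)           # sort eigenvalues in decreasing order
vals, U = vals[order], U[:, order]
Lam = np.diag(vals)                 # diagonal matrix of eigenvalues

# Verify S = U Λ U^{-1} up to floating-point error.
print(np.allclose(S, U @ Lam @ np.linalg.inv(U)))   # True
```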
Let S be a square, symmetric real-valued M × M matrix with M linearly independent eigenvectors. Then there exists a symmetric diagonal decomposition

S = QΛQᵀ

where the columns of Q are the orthogonal, unit-length (and real) eigenvectors of S and Λ is the diagonal matrix of its eigenvalues; in particular Q⁻¹ = Qᵀ.
We build on this symmetric diagonal decomposition to construct low-rank approximations to term–document matrices.
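A matching sketch for the symmetric case, assuming NumPy (np.linalg.eigh is the routine specialised for real symmetric matrices; the matrix is illustrative):

```python
import numpy as np

# Illustrative real symmetric matrix.
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh returns real eigenvalues and orthonormal eigenvectors (columns of Q).
vals, Q = np.linalg.eigh(S)
Lam = np.diag(vals)

# Q is orthogonal, so Q^{-1} = Q^T, and S = Q Λ Q^T.
print(np.allclose(Q.T @ Q, np.eye(2)))   # True
print(np.allclose(S, Q @ Lam @ Q.T))     # True
```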
The term-document matrix C is M × N; barring a rare coincidence M ≠ N, and thus C is very unlikely to be symmetric.
Theorem (SVD). Let r be the rank of the M × N matrix C. Then there is a singular value decomposition of C of the form

C = UΣVᵀ

where U is the M × M matrix whose columns are the orthogonal eigenvectors of CCᵀ, V is the N × N matrix whose columns are the orthogonal eigenvectors of CᵀC, and Σ is the M × N matrix whose diagonal entries Σii = σi are the singular values of C (the square roots of the eigenvalues of CᵀC) in decreasing order, with all other entries zero.
Illustration of the SVD: depending on the shape of C there are two cases, M > N and M < N.
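A minimal sketch of the SVD itself, assuming NumPy (the small term-document matrix is illustrative; full_matrices=False gives the reduced form, with singular values in decreasing order):

```python
import numpy as np

# Illustrative term-document matrix: M = 4 terms, N = 3 documents.
C = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# Reduced SVD: C = U Σ V^T with the singular values in decreasing order.
U, sigma, Vt = np.linalg.svd(C, full_matrices=False)

print(sigma)                                     # singular values
print(np.allclose(C, U @ np.diag(sigma) @ Vt))   # True: C is recovered
```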
Given an M × N matrix C and a positive integer k, we wish to find an M × N matrix Ck of rank at most k, so as to minimize the Frobenius norm of the matrix difference X = C − Ck, defined to be

‖X‖_F = √( Σ_{i=1}^{M} Σ_{j=1}^{N} X_{ij}² )
the Frobenius norm of X measures the discrepancy between Ck and C; our goal is to find a matrix Ck that minimizes this discrepancy
When k is far smaller than r, we refer to Ck as a low-rank approximation.
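A one-line check of the Frobenius norm definition, assuming NumPy (the matrix X stands in for the difference C − Ck and is illustrative):

```python
import numpy as np

# Illustrative difference matrix X = C - C_k.
X = np.array([[0.1, -0.2, 0.0],
              [0.0,  0.3, -0.1]])

# Frobenius norm: square root of the sum of the squared entries of X.
frob = np.sqrt((X ** 2).sum())
print(np.isclose(frob, np.linalg.norm(X, 'fro')))   # True
```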
The SVD can be used to solve the low-rank matrix approximation problem. We then derive from it an application to approximating term–document matrices. We invoke the following three-step procedure to this end:
1. Given C, construct its SVD: C = UΣVᵀ.
2. Derive from Σ the matrix Σk formed by replacing by zeros the r − k smallest singular values on the diagonal of Σ.
3. Compute and output Ck = UΣkVᵀ as the rank-k approximation to C.
The rank of Ck is at most k.
this procedure yields the matrix of rank k with the lowest possible Frobenius error.
the form of Ck:

Ck = Σ_{i=1}^{k} σi ui viᵀ

where ui and vi are the ith columns of U and V, respectively. Thus ui viᵀ is a rank-1 matrix, so that we have just expressed Ck as the sum of k rank-1 matrices, each weighted by a singular value.
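A minimal sketch of the three-step procedure, assuming NumPy (the matrix and k are illustrative):

```python
import numpy as np

def low_rank_approximation(C, k):
    """Return C_k, the rank-at-most-k approximation of C via the SVD."""
    # Step 1: construct the SVD of C, C = U Sigma V^T.
    U, sigma, Vt = np.linalg.svd(C, full_matrices=False)
    # Step 2: replace the r - k smallest singular values by zeros.
    sigma_k = sigma.copy()
    sigma_k[k:] = 0.0
    # Step 3: output C_k = U Sigma_k V^T; its rank is at most k.
    return U @ np.diag(sigma_k) @ Vt

C = np.random.rand(5, 4)                  # illustrative term-document matrix
C2 = low_rank_approximation(C, 2)
print(np.linalg.matrix_rank(C2))          # at most 2
print(np.linalg.norm(C - C2, 'fro'))      # lowest possible error for rank 2
```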
LSI: the low-rank approximation to C yields a new representation for each document in the collection. We will cast queries into this low-rank representation as well, enabling us to compute query–document similarity scores in this low-rank representation. This process is known as latent semantic indexing.
use SVD to construct a low-rank approximation Ck to the term-document matrix
map each row/column to a k-dimensional space
use the new k-dimensional LSI representation to compute similarities between vectors
a query vector q is mapped into its representation in the LSI space by q_k = Σk⁻¹ Ukᵀ q
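A minimal end-to-end sketch of these four steps, assuming NumPy (the term-document matrix, the query, and k = 2 are all illustrative):

```python
import numpy as np

# Illustrative term-document matrix: M = 5 terms, N = 4 documents.
C = np.array([[1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]], dtype=float)
k = 2

# SVD of C, keeping only the k largest singular values.
U, sigma, Vt = np.linalg.svd(C, full_matrices=False)
U_k, sigma_k, Vt_k = U[:, :k], sigma[:k], Vt[:k, :]

# Each document j is represented by column j of Vt_k, i.e. by the result of
# mapping column j of C with Sigma_k^{-1} U_k^T.
docs_k = Vt_k.T                                    # N x k

# Map the query term vector q into the LSI space, q_k = Sigma_k^{-1} U_k^T q,
# then score documents by cosine similarity in the k-dimensional space.
q = np.array([1, 0, 1, 0, 0], dtype=float)
q_k = np.diag(1.0 / sigma_k) @ U_k.T @ q

sims = docs_k @ q_k / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k))
print(sims)    # query-document similarity scores in the LSI space
```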
Note: The computational cost of the SVD is significant. One approach to this obstacle is to build the LSI representation on a randomly sampled subset of the documents in the collection.
A value of k in the low hundreds can actually increase precision on some query benchmarks. This suggests that, for a suitable value of k, LSI addresses some of the challenges of synonymy.
LSI works best in applications where there is little overlap between queries and documents.
soft clustering: LSI can be viewed as soft clustering by interpreting each dimension of the reduced space as a cluster and the value that a document has on that dimension as its fractional membership in that cluster.