Notes on Chapter 18 of Introduction to Information Retrieval (in English)


    Matrix decompositions and latent semantic indexing

term-document matrix: an M × N matrix C, each of whose rows represents a term and each of whose columns represents a document in the collection.
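To make the definition concrete, here is a minimal sketch (the toy corpus and vocabulary are invented for illustration, not taken from the chapter) that builds such a count matrix with numpy:

```python
import numpy as np

# Toy corpus and vocabulary, made up purely for illustration.
docs = [
    "ship boat ocean",
    "boat ocean voyage",
    "voyage trip",
]
vocab = sorted({t for d in docs for t in d.split()})

# C[i, j] = number of times term vocab[i] occurs in document j,
# so rows are terms and columns are documents.
C = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for t in d.split():
        C[vocab.index(t), j] += 1

print(vocab)  # the M terms (rows)
print(C)      # the M x N term-document matrix
```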

develop a class of operations from linear algebra, known as matrix decompositions

use a special form of matrix decomposition to construct a low-rank approximation to the term-document matrix

examine the application of such low-rank approximations to indexing and retrieving documents, a technique referred to as latent semantic indexing (LSI)

    Linear algebra review

    eigenvalues of C

For a square M × M matrix C and a vector x that is not all zeros, the eigenvalues of C are the values of λ satisfying:

Cx = λx

    The eigenvector corresponding to the eigenvalue of largest magnitude is called the principal eigenvector.

In a similar fashion, the left eigenvectors of C are the M-vectors y such that

y^T C = λ y^T

The number of nonzero eigenvalues of C is at most rank(C).

    Note:

the effect of small eigenvalues (and their eigenvectors) on a matrix-vector product is small

For a symmetric matrix S, the eigenvectors corresponding to distinct eigenvalues are orthogonal. Further, if S is both real and symmetric, the eigenvalues are all real.
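A quick numerical check of these facts (a sketch using numpy with a made-up symmetric matrix, not an example from the chapter):

```python
import numpy as np

# A small real symmetric matrix, made up for illustration.
S = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# eigh is specialized for symmetric matrices: real eigenvalues,
# orthonormal eigenvectors.
eigvals, eigvecs = np.linalg.eigh(S)

# Check S x = lambda x for each eigenpair.
for lam, x in zip(eigvals, eigvecs.T):
    assert np.allclose(S @ x, lam * x)

# Eigenvectors for distinct eigenvalues are orthogonal.
assert np.isclose(eigvecs[:, 0] @ eigvecs[:, 1], 0.0)

# The principal eigenvector is the one with the largest-magnitude eigenvalue.
principal = eigvecs[:, np.argmax(np.abs(eigvals))]
print(eigvals, principal)
```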

    Matrix decompositions

    a square matrix can be factored into the product of matrices derived from its eigenvectors

Two theorems:

Let S be a square real-valued M × M matrix with M linearly independent eigenvectors. Then there exists an eigendecomposition

S = UΛU⁻¹

where the columns of U are the eigenvectors of S and Λ is a diagonal matrix whose diagonal entries are the eigenvalues of S in decreasing order.

    If the eigenvalues are distinct, then this decomposition is unique.

Let S be a square, symmetric real-valued M × M matrix with M linearly independent eigenvectors. Then there exists a symmetric diagonal decomposition

S = QΛQ^T

where the columns of Q are the orthogonal and normalized eigenvectors of S, and Λ is the diagonal matrix whose entries are the eigenvalues of S. Further, all entries of Q are real, and Q⁻¹ = Q^T.

We build on this symmetric diagonal decomposition to construct low-rank approximations to term–document matrices.
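Before moving on, a small numpy sketch of the symmetric diagonal decomposition S = QΛQ^T (the matrix below is made up for illustration):

```python
import numpy as np

# A real symmetric matrix, made up for illustration.
S = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])

eigvals, Q = np.linalg.eigh(S)   # Q has orthonormal eigenvector columns
Lam = np.diag(eigvals)

# Q is orthogonal, so Q^{-1} = Q^T, and S is recovered exactly.
assert np.allclose(Q @ Q.T, np.eye(3))
assert np.allclose(Q @ Lam @ Q.T, S)
```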

    Term–document matrices and singular value decompositions

The M × N term-document matrix C is in general not square (M ≠ N); thus C is very unlikely to be symmetric, and the symmetric diagonal decomposition above does not apply directly.

Theorem (SVD). Let r be the rank of the M × N matrix C. Then there is a singular value decomposition of C of the form

C = UΣV^T

where U is the M × M matrix whose columns are the orthogonal eigenvectors of CC^T, V is the N × N matrix whose columns are the orthogonal eigenvectors of C^T C (CC^T and C^T C have the same nonzero eigenvalues), and Σ is an M × N matrix with Σ_ii = σ_i for 1 ≤ i ≤ r and zeros elsewhere. The values σ_1 ≥ σ_2 ≥ ... ≥ σ_r > 0, the square roots of the eigenvalues of CC^T, are the singular values of C.

    Illustration of the SVD

    there are two cases:

M > N

M < N
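A sketch of how the factor shapes work out in the two cases, using numpy's SVD on random matrices (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two cases: a tall matrix (M > N) and a wide matrix (M < N).
for M, N in [(5, 3), (3, 5)]:
    C = rng.standard_normal((M, N))
    # full_matrices=False gives the reduced SVD: U is M x min(M,N),
    # s holds min(M,N) singular values, Vt is min(M,N) x N.
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    print(M, N, U.shape, s.shape, Vt.shape)
    assert np.allclose(U @ np.diag(s) @ Vt, C)
```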

    Low-rank approximations

Frobenius Norm

Given an M × N matrix C and a positive integer k, we wish to find an M × N matrix Ck of rank at most k, so as to minimize the Frobenius norm of the matrix difference X = C − Ck, defined to be

||X||_F = sqrt( Σ_i Σ_j X_ij² )

    the Frobenius norm of X measures the discrepancy between Ck and C; our goal is to find a matrix Ck that minimizes this discrepancy

When k is far smaller than r, we refer to Ck as a low-rank approximation.
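A small sketch of the Frobenius norm as an error measure (the matrices are made up for illustration):

```python
import numpy as np

# Toy matrices, made up for illustration: C and a candidate approximation Ck.
C  = np.array([[1.0, 0.0, 2.0],
               [0.0, 3.0, 0.0]])
Ck = np.array([[1.0, 0.0, 1.5],
               [0.0, 2.5, 0.0]])

X = C - Ck
frob = np.sqrt((X ** 2).sum())                      # the definition above
assert np.isclose(frob, np.linalg.norm(X, 'fro'))   # numpy's built-in version
print(frob)
```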

The SVD can be used to solve the low-rank matrix approximation problem; we then derive from it an application to approximating term–document matrices. We invoke the following three-step procedure to this end:

1. Given C, construct its SVD: C = UΣV^T.

2. Derive from Σ the matrix Σk formed by replacing by zeros the r − k smallest singular values on the diagonal of Σ.

3. Compute and output Ck = UΣkV^T as the rank-k approximation to C.

    The rank of Ck is at most k.

    this procedure yields the matrix of rank k with the lowest possible Frobenius error.

the form of Ck:

Ck = Σ_{i=1..k} σ_i u_i v_i^T

where u_i and v_i are the ith columns of U and V, respectively. Thus, u_i v_i^T is a rank-1 matrix, so we have just expressed Ck as the sum of k rank-1 matrices, each weighted by a singular value.
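Putting the procedure together, a numpy sketch of the rank-k approximation via the truncated SVD (the matrix is made up for illustration):

```python
import numpy as np

# Toy matrix, made up for illustration.
rng = np.random.default_rng(0)
C = rng.standard_normal((6, 4))
k = 2

U, s, Vt = np.linalg.svd(C, full_matrices=False)

# Keep only the k largest singular values.
Ck = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Equivalent view: the sum of k rank-1 matrices weighted by singular values.
Ck_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(k))
assert np.allclose(Ck, Ck_sum)

print(np.linalg.matrix_rank(Ck))        # at most k
print(np.linalg.norm(C - Ck, 'fro'))    # lowest possible Frobenius error for rank k
```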

    Latent semantic indexing

LSI: the low-rank approximation to C yields a new representation for each document in the collection. We cast queries into this low-rank representation as well, enabling us to compute query-document similarity scores in this low-rank space. This process is known as latent semantic indexing.

    use SVD to construct a low-rank approximation Ck to the term-document matrix

    map each row/column to a k-dimensional space

use the new k-dimensional LSI representation to compute similarities between vectors

    query vector is mapped into its representation in the LSI space
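A minimal end-to-end sketch of these steps in numpy. The toy matrix and query are made up, and the fold-in mapping qk = Σk⁻¹ Uk^T q is the standard LSI query mapping, assumed here rather than quoted from these notes:

```python
import numpy as np

# Toy term-document matrix (4 terms x 4 documents), made up for illustration.
C = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 1.0],
              [0.0, 0.0, 0.0, 1.0]])
k = 2

U, s, Vt = np.linalg.svd(C, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T   # Vk rows: documents in k-dim LSI space

# Fold a query (term-count vector) into the k-dimensional LSI space:
# qk = Sigma_k^{-1} U_k^T q  (standard LSI fold-in, assumed here)
q = np.array([1.0, 0.0, 1.0, 0.0])
qk = np.diag(1.0 / sk) @ Uk.T @ q

# Cosine similarity between the query and each document in LSI space.
def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cos(qk, Vk[j]) for j in range(Vk.shape[0])]
print(np.argsort(scores)[::-1])   # documents ranked by LSI similarity
```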

Note: The computational cost of the SVD is significant; one approach to this obstacle is to build the LSI representation on a randomly sampled subset of the documents in the collection. A value of k in the low hundreds can actually increase precision on some query benchmarks, which suggests that, for a suitable value of k, LSI addresses some of the challenges of synonymy. LSI works best in applications where there is little overlap between queries and documents.

    soft clustering

    LSI can be viewed as soft clustering by interpreting each dimension of the reduced space as a cluster and the value that a document has on that dimension as its fractional membership in that cluster.
