Friday, June 28, 2013

Probabilistic latent semantic indexing

This paper presents an approach to automated document indexing based on a statistical
latent class model for factor analysis of count data.


Many retrieval methods rely on simple word matching to rank documents by their relevance to a query.
These methods have drawbacks, mainly due to the ambiguity of words and their unavoidable lack of precision, as well as to personal style and individual differences in word usage.

Latent Semantic Analysis (LSA) is an approach to automatic indexing and information retrieval that attempts
to overcome these problems by mapping documents as well as terms to a representation in the so-called latent semantic space.
However, it has a number of deficits, mainly due to its unsatisfactory statistical foundation.

Probabilistic Latent Semantic Analysis (PLSA), in contrast, has a solid statistical foundation: it is based on the likelihood principle and defines a proper generative model of the data.
PLSA can deal with polysemous words and explicitly distinguish between different meanings
and different types of word usage.


The Model 

Hidden topic: z \in Z = \{z_1, ..., z_K\}
Word (Term): w \in W = \{w_1, ..., w_M\}
Document: d \in D = \{d_1, ..., d_N\}
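
For reference, the aspect model built on these variables is commonly written as follows. This is a sketch of the standard formulation rather than a quotation from the paper; n(d, w), the number of times word w occurs in document d, is notation assumed here.

```latex
% Generative view: pick a document, then a topic, then a word
P(d, w) = P(d) \, P(w \mid d), \qquad
P(w \mid d) = \sum_{z \in Z} P(w \mid z) \, P(z \mid d)

% Equivalent symmetric parameterization
P(d, w) = \sum_{z \in Z} P(z) \, P(d \mid z) \, P(w \mid z)

% Log-likelihood maximized during model fitting
\mathcal{L} = \sum_{d \in D} \sum_{w \in W} n(d, w) \, \log P(d, w)
```

The key assumption is that d and w are conditionally independent given the hidden topic z.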


Expectation Maximization (EM) algorithm

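E-step: compute the posterior probabilities P(z | d, w) of the hidden topics under the current parameter estimates.

M-step: re-estimate the parameters from those posteriors.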
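
A minimal NumPy sketch of one way to implement these two steps, assuming the symmetric parameterization P(d, w) = sum_z P(z) P(d|z) P(w|z). The function names `plsa_em` and `log_likelihood`, the random initialization, and the fixed iteration count are illustrative choices, not details from the paper.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero and log(0)


def plsa_em(counts, n_topics, n_iter=50, seed=0):
    """Fit P(z), P(d|z), P(w|z) to an (N, M) count matrix n(d, w) by EM.

    Returns p_z with shape (K,), p_d_z with shape (K, N), p_w_z with shape (K, M).
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    n_docs, n_words = counts.shape

    # Assumed initialization: uniform P(z), random normalized P(d|z) and P(w|z).
    p_z = np.full(n_topics, 1.0 / n_topics)
    p_d_z = rng.random((n_topics, n_docs))
    p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: posterior P(z|d,w) for every (d, w) pair, shape (K, N, M).
        joint = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
        post = joint / (joint.sum(axis=0, keepdims=True) + EPS)

        # M-step: re-estimate parameters from expected counts n(d,w) P(z|d,w).
        weighted = counts[None, :, :] * post
        p_w_z = weighted.sum(axis=1)       # sum over documents -> (K, M)
        p_d_z = weighted.sum(axis=2)       # sum over words     -> (K, N)
        p_z = weighted.sum(axis=(1, 2))    # sum over both      -> (K,)
        p_w_z /= np.maximum(p_w_z.sum(axis=1, keepdims=True), EPS)
        p_d_z /= np.maximum(p_d_z.sum(axis=1, keepdims=True), EPS)
        p_z /= p_z.sum()

    return p_z, p_d_z, p_w_z


def log_likelihood(counts, p_z, p_d_z, p_w_z):
    """Log-likelihood sum_{d,w} n(d,w) log P(d,w) under the fitted model."""
    p_dw = np.einsum("k,kn,km->nm", p_z, p_d_z, p_w_z)
    return float((np.asarray(counts) * np.log(p_dw + EPS)).sum())
```

The dense (K, N, M) posterior array keeps the code close to the equations but only suits small corpora; a practical implementation would iterate over the nonzero entries of the count matrix instead.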

The author also proposes a generalization of maximum likelihood for mixture models,
called tempered EM (TEM), which is based on entropic regularization and is closely related to a method known as deterministic annealing.

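TEM modifies the E-step by introducing a control parameter \beta (an inverse temperature) that dampens the influence of the likelihood terms on the posterior.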
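
A hedged sketch of a tempered E-step, assuming the symmetric parameterization and the commonly cited form in which the likelihood part P(d|z) P(w|z) is raised to the power beta while the prior P(z) is left untouched. The function name `tempered_posterior` and the array layout are illustrative.

```python
import numpy as np


def tempered_posterior(p_z, p_d_z, p_w_z, beta=1.0):
    """Tempered posterior P_beta(z|d,w), shape (K, N, M).

    p_z: (K,), p_d_z: (K, N), p_w_z: (K, M); all entries assumed positive.
    beta = 1 recovers the standard E-step; beta < 1 flattens the posterior
    towards the prior P(z).
    """
    likelihood = (p_d_z[:, :, None] * p_w_z[:, None, :]) ** beta
    joint = p_z[:, None, None] * likelihood
    return joint / joint.sum(axis=0, keepdims=True)
```

With beta = 1 this is the ordinary posterior, and with beta = 0 every (d, w) pair simply gets the prior P(z). In practice beta is usually lowered from 1 while performance on held-out data is monitored, but that schedule is not shown here.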

The main advantage of TEM in this context is that it helps avoid overfitting.


Geometry of the Aspect Model

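The aspect model defines a continuous latent space within the simplex of all multinomial distributions over words: the K topic distributions P(w | z) are the corners of a sub-simplex, and each document's word distribution P(w | d) is a convex combination of them.
This can also be thought of in terms of dimensionality reduction, and the sub-simplex can be identified with a probabilistic latent semantic space.
Figure: Sketch of the probability sub-simplex spanned by the aspect model.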
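
To make the geometric picture concrete, here is a small sketch showing that each document's word distribution is a convex combination of the topic distributions, so it lies inside the sub-simplex they span. The function name `document_word_distributions` and the toy numbers are made up for illustration.

```python
import numpy as np


def document_word_distributions(p_z_d, p_w_z):
    """Return P(w|d) = sum_z P(z|d) P(w|z) as an (N, M) array.

    p_z_d: (N, K) mixing weights per document, each row sums to 1.
    p_w_z: (K, M) topic distributions, the corners of the sub-simplex.
    """
    return np.asarray(p_z_d) @ np.asarray(p_w_z)


# Two topics over a three-word vocabulary (toy values).
topics = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6]])
# First document mixes both topics equally; second uses topic 1 only.
weights = np.array([[0.5, 0.5],
                    [1.0, 0.0]])
print(document_word_distributions(weights, topics))
```

The second document sits exactly on a corner of the sub-simplex, while the first lies halfway along the edge between the two corners.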



In summary, PLSI is a method for automated indexing based on a statistical latent class model. It has important theoretical advantages over standard LSI, since it is based on the likelihood principle.
It can also take advantage of standard statistical methods for model fitting, overfitting control, and model combination.



