Friday, June 28, 2013

Latent Dirichlet allocation

The goal is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships.


The probabilistic LSI (pLSI) model, also known as the aspect model, has several problems:
(1) The number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting.
(2) It is not clear how to assign probability to a document outside of the training set.
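
To make (1) concrete, in the paper's notation: with k topics, a vocabulary of V words, and M training documents, pLSI fits k multinomials over words plus a separate topic mixture p(z | d) for each training document d, so the parameter count is

    kV + kM

which grows linearly in the number of documents M.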

Latent Dirichlet allocation (LDA) is a generative probabilistic model of a corpus. The basic idea is
that documents are represented as random mixtures over latent topics, where each topic is characterized
by a distribution over words.
LDA arises from considering mixture models that capture the exchangeability of both words and documents.

The key distribution is the Dirichlet, a probability density on the (k-1)-simplex:

    p(\theta \mid \alpha) = \frac{\Gamma(\sum_{i=1}^k \alpha_i)}{\prod_{i=1}^k \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}

Given the parameters \alpha and \beta, the joint distribution of a topic mixture \theta, a set of N topics \mathbf{z}, and a set of N words \mathbf{w} is given by:

    p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)
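
To make the generative story concrete, here is a minimal NumPy sketch of sampling one document from this joint distribution; the toy numbers for alpha and beta are my own, not from the paper.

import numpy as np

def generate_document(alpha, beta, N, rng):
    """Sample one document from the LDA generative model.
    alpha: Dirichlet parameter, shape (k,)
    beta:  topic-word probabilities, shape (k, V); each row sums to 1
    N:     number of words in the document
    """
    theta = rng.dirichlet(alpha)                  # theta ~ Dir(alpha)
    words = []
    for _ in range(N):
        z = rng.choice(len(alpha), p=theta)       # z_n ~ Multinomial(theta)
        w = rng.choice(beta.shape[1], p=beta[z])  # w_n ~ Multinomial(beta_{z_n})
        words.append(w)
    return theta, words

# Toy example: k = 3 topics over a V = 5 word vocabulary (made-up numbers).
rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5, 0.5])
beta = np.array([[0.6, 0.1, 0.1, 0.1, 0.1],
                 [0.1, 0.6, 0.1, 0.1, 0.1],
                 [0.1, 0.1, 0.1, 0.6, 0.1]])
theta, words = generate_document(alpha, beta, N=10, rng=rng)
print(theta, words)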

Graphical model representation of LDA




Relationship with other latent variable models

The paper compares LDA to simpler latent variable models for text: the unigram model, a mixture of unigrams, and the pLSI model.
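
Writing out the document likelihoods side by side makes the differences clear (these are the standard forms given in the paper). The unigram model generates every word from a single multinomial:

    p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)

The mixture of unigrams adds one topic z per document:

    p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)

And pLSI lets each word of a training document d have its own topic:

    p(d, w_n) = p(d) \sum_{z} p(w_n \mid z) \, p(z \mid d)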

Graphical model representation of different models of discrete data.

The topic simplex for three topics embedded in the word simplex for three words. The pLSI model induces an empirical distribution on the topic simplex (denoted by x marks in the figure), while LDA places a smooth distribution on the topic simplex (denoted by the contour lines).
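
My geometric reading of this figure: under all of these models, a document's distribution over words is the convex combination p(w) = \sum_{i=1}^{k} \theta_i \, p(w \mid z = i), so every achievable document lies in the sub-simplex spanned by the k topic points; the models differ in where they put probability mass on that sub-simplex.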


To be honest, I don't fully understand this paper yet; there are too many mathematical equations in it.

