The probabilistic LSI (pLSI) model, also known as the aspect model, has several problems:
(1) the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting.
(2) it is not clear how to assign probability to a document outside of the training set.
Latent Dirichlet allocation (LDA) is a generative probabilistic model of a corpus. The basic idea is
that documents are represented as random mixtures over latent topics, where each topic is characterized
by a distribution over words.
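To make this generative story concrete, here is a minimal Python sketch of it; the toy sizes K, V, N and all variable names are my own, not from the paper.

import numpy as np

rng = np.random.default_rng(0)
K, V, N = 3, 10, 20          # topics, vocabulary size, words per document (toy values)
alpha = np.full(K, 0.5)      # Dirichlet prior over topic mixtures
beta = rng.dirichlet(np.full(V, 0.1), size=K)   # row k is topic k's distribution over words

def generate_document():
    theta = rng.dirichlet(alpha)             # 1. choose a topic mixture theta ~ Dir(alpha)
    z = rng.choice(K, size=N, p=theta)       # 2. for each word, choose a topic z_n ~ Multinomial(theta)
    w = np.array([rng.choice(V, p=beta[zn]) for zn in z])   # 3. choose w_n ~ Multinomial(beta_{z_n})
    return theta, z, w

theta, z, w = generate_document()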
Consider mixture models that capture the exchangeability of both words and documents; by de Finetti's representation theorem, a collection of exchangeable random variables has a representation as a mixture over latent parameters, and this line of thinking leads to LDA.
A k-dimensional Dirichlet random variable \theta has the following probability density on the (k-1)-simplex:

p(\theta \mid \alpha) = \frac{\Gamma(\sum_{i=1}^{k} \alpha_i)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}

Given the parameters \alpha and \beta, the joint distribution of a topic mixture \theta, a set of N topics \mathbf{z}, and a set of N words \mathbf{w} is given by:

p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)
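As a sanity check, the joint distribution above can be evaluated directly; the sketch below does so with scipy's Dirichlet density, reusing the toy variables from the sampler above (again, my naming, not the paper's).

import numpy as np
from scipy.stats import dirichlet

def joint_log_prob(theta, z, w, alpha, beta):
    log_p = dirichlet.logpdf(theta, alpha)   # log p(theta | alpha)
    log_p += np.sum(np.log(theta[z]))        # sum_n log p(z_n | theta)
    log_p += np.sum(np.log(beta[z, w]))      # sum_n log p(w_n | z_n, beta)
    return log_p

print(joint_log_prob(theta, z, w, alpha, beta))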
[Figure: graphical model representation of LDA.]
The paper compares LDA to simpler latent variable models for text: the unigram model, a mixture of unigrams, and the pLSI model.
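For contrast, here is a rough sketch of how each simpler model generates a document (my own toy rendering, not the paper's code). Note how pLSI ties p(z | d) to a training-document index d, which is exactly why its parameter count grows with the corpus.

import numpy as np
rng = np.random.default_rng(1)
K, V, N = 3, 10, 20

# Unigram model: every word of every document from a single distribution p(w).
p_w = rng.dirichlet(np.ones(V))
unigram_doc = rng.choice(V, size=N, p=p_w)

# Mixture of unigrams: one topic z per document, then all N words from beta[z].
beta = rng.dirichlet(np.ones(V), size=K)
z = rng.choice(K)
mixture_doc = rng.choice(V, size=N, p=beta[z])

# pLSI: a topic per word, drawn from p(z | d), a weight vector learned for
# training document d; there is no generative story for unseen documents.
p_z_given_d = rng.dirichlet(np.ones(K))      # stand-in for a learned p(z | d)
zs = rng.choice(K, size=N, p=p_z_given_d)
plsi_doc = np.array([rng.choice(V, p=beta[zn]) for zn in zs])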
[Figure: graphical model representations of different models of discrete data.]
[Figure: the topic simplex for three topics embedded in the word simplex for three words. pLSI induces an empirical distribution on the topic simplex (marked by x), while LDA places a smooth distribution on it (shown by the contour lines).]
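This embedding is easy to verify numerically: a point theta on the topic simplex maps into the word simplex through p(w) = \sum_k \theta_k \beta_{k,w}, a convex combination of the topic rows. A small sketch (toy sizes and naming are mine):

import numpy as np
rng = np.random.default_rng(2)
K, V = 3, 3                                  # three topics inside a three-word simplex, as in the figure
beta = rng.dirichlet(np.ones(V), size=K)     # the topics: K points in the word simplex
theta = rng.dirichlet(np.ones(K))            # a point in the topic simplex
p_w = theta @ beta                           # its image in the word simplex
assert np.isclose(p_w.sum(), 1.0)            # still a valid distribution over words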
Honestly, I don't understand this paper very well yet; there are too many mathematical equations.