In the topic modeling task, we focus on inferring the topic distribution for each item i based on its corresponding textual features (document). Given a corpus of documents, we propose a new generative probabilistic model for latent topic modeling and design a novel Inference Network to infer the model parameters following the VAE framework. In this subsection, we first introduce the generative process of the proposed topic model and then present the inference method.
4.1.1 Generative Process.
Suppose we have a corpus of
D documents, where there are
\(N_i\) words in the
i-th document. We assume there are
K latent topics present in these documents, and we aim to develop a model to find the latent topics and the semantics of each topic. Similar to LDA [
10], we assume that each document is represented as a random mixture over latent topics, and each topic is described as a distribution over words. The topic-word distribution can be denoted by a matrix
\(\boldsymbol {B}\), where the
j-th row represents the word distribution of the
j-th topic. For inference and parameter estimation of this model using the VAE framework, as is done in many previous studies [
14,
80], we impose a logistic normal prior on the topic distribution
\(\boldsymbol {\theta }\) to approximate the Dirichlet prior used in LDA. We do so because the Dirichlet prior is important for obtaining interpretable topics in LDA [
80], but it is incompatible with the inference process of VAE, whereas a logistic normal prior, which is a multivariate Gaussian distribution followed by a softmax transformation, can be incorporated into the VAE framework. By setting its parameters via the Laplace approximation of Hennig et al. [
38], the logistic normal prior can approximate the Dirichlet prior [
80].
To improve the topic quality and document representation of neural topic models, to differentiate items by their topic distributions so as to benefit the few-shot classification task, and to solve the component collapsing problem [
22], we improve the existing neural topic model by introducing an additional prior
\(\boldsymbol {\alpha }\) on the logistic normal prior and name it a
topic mask. We propose the topic mask with the following three motivations.
First, the performance of topic models depends on their priors, especially on the priors for the topic distributions
\(\boldsymbol {\theta }\) [
21,
88]. The imposed prior for the topic distributions
\(\boldsymbol {\theta }\) can be interpreted as the prior topic distribution of the whole corpus. Its dimensionality equals the number of topics, and the value of each dimension represents the prior probability of the corresponding topic appearing in the corpus. For example, a high prior value means that the corresponding topic appears with a higher probability. A symmetric prior means that all dimensions of the prior are equal, reflecting the assumption that each topic appears with equal probability in the corpus, whereas an asymmetric one reflects the belief that different topics appear with varying probabilities in the real-world corpus. A symmetric Dirichlet prior is usually imposed on the topic distributions in topic models like LDA [
10], and this symmetric prior is commonly set as a constant recommended by Steyvers and Griffiths [
81]. It makes sense to believe that different topics appear with varying probabilities in the real-world corpus, especially in the corpus of a particular field. For example, the words “model,” “data,” and “algorithm” appear frequently in papers in the machine learning field [
88]. Hence, topics containing these words tend to appear with a higher probability in documents from this field. Existing studies have pointed out that, compared with a symmetric prior, an asymmetric Dirichlet prior can better model the prior topic distributions and achieve better performance, measured by the probability of held-out documents, the quality of inferred topics [
88], and the document representation ability through document clustering and classification tasks [
21]. For neural topic models, although an asymmetric prior also seems likely to be beneficial for model performance, the commonly used Dirichlet prior is hard to combine with most neural topic models, and no existing research on neural topic models has studied the incorporation of an asymmetric prior [
14,
66,
80]. For the preceding reasons, in our model we design a component to impose an asymmetric prior on topic distributions and validate its effectiveness in improving the performance of the neural topic model.
Second, the inferred topic distributions are used as item (document) representations for the subsequent automatic tagging task. For this task, the ideal situation is that for each tag, items belonging to the tag have very different representations from those not belonging to the tag. The more significant the difference between topic distributions of different items (documents), the easier it is to train an accurate classification model [
32,
68]. In neural topic models, when we use a fixed prior for topic distributions, no matter whether it is symmetric or asymmetric, all of the documents share the same prior topic distributions, limiting the differences among the inferred topic distributions of different documents. To increase these differences and thereby improve the accuracy of the subsequent few-shot classification task, it is natural to consider a document-specific prior. Thus, in our proposed model, we design the prior as a probability distribution, which allows us to sample priors for different documents and obtain document-specific priors.
Third, most variational inference based neural topic models may easily suffer from the component collapsing problem, which is a particular type of local optimum very close to the prior belief [
14,
80]. The component collapsing problem in neural topic models is caused by the
Kullback-Leibler (KL) divergence regularization term in the variational objective of VAE. When the topic number is large, this regularization term dominates the objective. As a result, the latent topic distributions of different documents tend to be similar to each other. To tackle this problem, some studies improve the structure and training techniques of the neural network, including batch normalization, dropout layers, and a high moment weight and learning rate [
14,
80]. We adopt these techniques, which have been shown to help avoid poor local optima and alleviate the component collapsing problem. However, the root cause of the component collapsing problem lies in the KL divergence regularization term of the objective function. One way to solve this problem is to abandon the model structure of VAE-based neural topic models and use MMD [
66] or GAN [
91] instead of KL divergence to perform distribution matching. In this article, we propose an alternative method, which is to impose an asymmetric prior on the topic distributions and make it vary across documents. In this way, the KL divergence regularization cannot force all latent topic distributions to collapse to the same prior and yield identical topics. We adopt this idea because it naturally combines with the other two motivations and solves the corresponding problems simultaneously by imposing an asymmetric and document-specific prior.
The proposed topic mask is assumed to follow a Beta distribution, generating numbers between 0 and 1 that determine whether each topic should be assigned to the item. In this way, we can impose an asymmetric prior on the topic distributions and force them to be more differentiated and document-specific, thereby improving both the classification performance and the topic quality.
To handle the incompatibility problem with the inference process in VAE, we propose to use a Gaussian distribution followed by a sigmoid transformation to approximate the Beta distribution. In addition, we relax the constraint on the topic-word matrix
\(\boldsymbol {B}\). In LDA,
\(\boldsymbol {B}\) must be a stochastic matrix so that each row in
\(\boldsymbol {B}\) represents a multinomial topic-word distribution, and we multiply the topic distribution
\(\boldsymbol {\theta }\) and the topic-word distribution
\(\boldsymbol {B}\) to get a multinomial distribution for generating words. However, the constraint on
\(\boldsymbol {B}\) can reduce the topic quality in VAE [
80]. Thus, we relax this constraint when generating words, as is commonly done.
We demonstrate the generative process in Table
2, and the corresponding graphical representation is shown in Figure
4. In Table
2, lines 4 through 6 show the generation process of the topic mask
\(\boldsymbol {\alpha }\). For item
i’s document (or document
i for brevity), its topic mask
\(\boldsymbol {\alpha }_i\) is a
K-dimensional vector, where each element
\(\alpha _{ik}\) is a number between 0 and 1, indicating the degree to which item
i is related to the
k-th topic. The topic mask
\(\boldsymbol {\alpha }_i\) of document
i is generated through the following two steps. First, an auxiliary variable
\(\beta _{ik}\) is sampled from a Gaussian distribution
\(N(\mu _{Beta}(s, t), \sigma ^2_{Beta}(s, t))\). Then,
\(\alpha _{ik}\) is obtained by mapping
\(\beta _{ik}\) through a sigmoid function.
\(\mu _{Beta}(s, t)\) and
\(\sigma ^2_{Beta}(s, t)\) are functions derived from the Laplace approximation to ensure that the generated
\(\boldsymbol {\alpha }_i\) is approximately sampled from a Beta distribution
\(Beta(s, t)\). Specifically, we can derive that
\(\mu _{Beta}(s, t) = \log (s / t) - 1\) and
\(\sigma ^2_{Beta}(s, t) = \frac{1}{s} + \frac{1}{t}\) based on the work of Hennig et al. [
38].
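For illustration, the following minimal sketch samples a topic mask for one document using the approximation parameters stated above (the function, the NumPy dependency, and the example values of \(s\), \(t\), and \(K\) are illustrative rather than the implementation used in our experiments).

```python
import numpy as np

def sample_topic_mask(K, s, t, rng=None):
    """Sample a K-dimensional topic mask alpha_i (Table 2, lines 4-6).

    Each Beta(s, t)-distributed entry is approximated by a Gaussian draw
    followed by a sigmoid, with mu_Beta(s, t) = log(s / t) - 1 and
    sigma^2_Beta(s, t) = 1/s + 1/t, as stated in the text.
    """
    rng = np.random.default_rng() if rng is None else rng
    mu_beta = np.log(s / t) - 1.0                 # mu_Beta(s, t)
    var_beta = 1.0 / s + 1.0 / t                  # sigma^2_Beta(s, t)
    beta_ik = rng.normal(mu_beta, np.sqrt(var_beta), size=K)  # auxiliary beta_ik
    return 1.0 / (1.0 + np.exp(-beta_ik))         # sigmoid: each entry in (0, 1)


# Example with illustrative values: K = 50 topics, Beta(0.2, 1.0) prior.
alpha_i = sample_topic_mask(K=50, s=0.2, t=1.0)
```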
Next, given the topic mask
\(\boldsymbol {\alpha }_i\), we generate the topic distribution
\(\boldsymbol {\theta }_i\) for document
i by the process described in lines 7 through 9. As stated previously, a logistic normal prior is used to approximate the Dirichlet prior. For a Dirichlet prior with concentration parameters
\(\boldsymbol {c} = (c_1, c_2, \ldots , c_K)\), Hennig et al. [
38] proposed the following approximation functions to calculate the parameters of the logistic normal prior:
where
\(\mu _{Dir}(\boldsymbol {c})\) and
\(\sigma ^2_{Dir}(\boldsymbol {c})\) map the concentration parameters of a Dirichlet prior to two
K-dimensional vectors, which are respectively the mean and the diagonal of the covariance matrix of a Gaussian distribution, and the two equations give the
k-th element of these two vectors. In line 8, we use the logistic normal distribution to approximate a Dirichlet prior whose concentration parameter is
\(\pi \boldsymbol {\alpha }_i + \bar{\pi }\). The Dirichlet distribution requires its concentration parameter
\(\pi \boldsymbol {\alpha }_i + \bar{\pi }\) to be greater than 0, necessitating a constant positive value for
\(\bar{\pi }\) to ensure that
\(\boldsymbol {r}_i\) can approximate a Dirichlet distribution.
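For reference, the \(k\)-th components of the approximation functions \(\mu _{Dir}(\boldsymbol {c})\) and \(\sigma ^2_{Dir}(\boldsymbol {c})\) can be sketched following the standard Laplace approximation used in [38, 80] (our notation may differ slightly from the original equations):
\[
\mu _{Dir,k}(\boldsymbol {c}) = \log c_k - \frac{1}{K}\sum _{j=1}^{K}\log c_j, \qquad
\sigma ^2_{Dir,k}(\boldsymbol {c}) = \frac{1}{c_k}\left(1-\frac{2}{K}\right) + \frac{1}{K^2}\sum _{j=1}^{K}\frac{1}{c_j}.
\]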
Note that unlike the symmetric prior used in most works [
14,
80], in our proposed model, the concentration parameter
\(\pi \boldsymbol {\alpha }_i + \bar{\pi }\) is controlled by the topic mask
\(\boldsymbol {\alpha }_i\), and it defines an asymmetric prior in which the elements of the vector differ from one another. The asymmetric prior results in different probabilities of generating these
K topics. We define
\(\bar{\pi } \ll \pi\); thus, a larger
\(\alpha _{ik}\) leads to a larger prior on the
k-th dimension of the topic distribution, which indicates that the topic distribution
\(\boldsymbol {\theta }_i\) should be biased toward the
k-th topic.
Finally, each word is generated in lines 10 and 11. In line 11,
\(softmax(\boldsymbol {\theta }_i\boldsymbol {B} + \boldsymbol {d})\) is a generalization of the product of the topic distribution and the topic-word distribution. Here,
\(\boldsymbol {d}\) is a
V-dimensional background vector constituted by the logarithm of each word’s overall frequency.
\(\boldsymbol {B}\) is a topic-word matrix whose size is
\(K\times V\), and the
k-th row of
\(\boldsymbol {B}\) evaluates the expected word-frequency deviations from the background vector
\(\boldsymbol {d}\) when the
k-th topic appears in document
i. We can take
\(\boldsymbol {B}\) as an extension to the topic-word distribution in LDA, and this modification helps improve the topic quality [
80]. Then, the
j-th word
\(w_{ij}\) is assumed to be drawn from a multinomial distribution whose parameter is
\(softmax(\boldsymbol {\theta }_i\boldsymbol {B} + \boldsymbol {d})\).
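For illustration, the following sketch implements lines 10 and 11 of Table 2 for one document (the function and variable names are illustrative and are not taken from our implementation).

```python
import numpy as np

def generate_words(theta_i, B, d, n_words, rng=None):
    """Generate n_words word indices for one document (Table 2, lines 10-11).

    theta_i: (K,) topic distribution; B: (K, V) topic-word matrix;
    d: (V,) background vector of log word frequencies.
    """
    rng = np.random.default_rng() if rng is None else rng
    logits = theta_i @ B + d                    # deviations from the background
    probs = np.exp(logits - logits.max())       # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(d), size=n_words, p=probs)  # w_ij ~ Multinomial(probs)


# Example with random illustrative parameters (K = 5 topics, V = 100 words).
rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(5))
words = generate_words(theta, rng.normal(size=(5, 100)), rng.normal(size=100), n_words=20)
```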
4.1.2 Model Inference.
The inference for the graphical model in Figure
4 is intractable. Therefore, we propose a method to infer the latent variables following the VAE framework [
49]. The VAE framework has the favorable property that it uses neural networks to output the inferred topic distributions, which makes it easy to integrate with the neural network based classification component introduced in Section
4.2.
The VAE framework is developed based on the conventional variational inference algorithm, where we use variational distributions to approximate posterior distributions. In the proposed generative model (see Figure
4), we have two latent variables
\(\boldsymbol {\theta }_i\) and
\(\boldsymbol {\alpha }_i\) to infer for each document
i, thus we propose to use variational distribution
\(q(\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i)\) for approximating their true posterior
\(p(\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i | \boldsymbol {w}_i)\), where
\(\boldsymbol {w}_i\) is the word sequence of document
i. In NTFSL, we focus on extracting the topics inherent in the documents, so we extract distributional information (e.g., TFIDF) from
\(\boldsymbol {w}_i\) as feature vector
\(\boldsymbol {x}_i\) and ignore other local features, such as the sequence of words, in
\(\boldsymbol {w}_i\). Therefore, we substitute
\(\boldsymbol {w}_i\) with
\(\boldsymbol {x}_i\) in the rest of the article. To ensure a good approximation, KL divergence between
\(q(\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i)\) and
\(p(\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i | \boldsymbol {x}_i)\) is minimized. Previous work [
9] on variational inference demonstrates that minimizing the KL divergence is equivalent to maximizing the evidence lower bound (ELBO) shown next:
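In generic form (a sketch, writing \(\mathcal{L}(i)\) for the bound of document \(i\); this notation is ours), the ELBO is
\[
\mathcal{L}(i) = E_{q(\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i)}\big[\log p(\boldsymbol {x}_i \mid \boldsymbol {\theta }_i, \boldsymbol {\alpha }_i)\big] - KL\big(q(\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i)\,\|\,p(\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i)\big).
\]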
Instead of optimizing the ELBO analytically as in the conventional approach, we develop a new Inference Network together with a Reconstruction Network to encode the variational distributions following the VAE framework. To reflect this parameterization, we rewrite
\(q(\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i)\) as
\(q_\Phi (\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i | \boldsymbol {x}_i)\).
\(\Phi\) denotes the parameters of the neural network for generating the inferred
\(\boldsymbol {\theta }_i\) and
\(\boldsymbol {\alpha }_i\), and the network takes distributional features
\(\boldsymbol {x}_i\) as input. In Figure
5, we illustrate the Inference Network of
\(q_\Phi (\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i | \boldsymbol {x}_i)\).
The proposed Inference Network first uses a shared
Multi-Layer Perceptron (MLP) neural network with (pre-trained) word embeddings to encode the mean
\(\boldsymbol {\mu }_{ri}\) and the standard deviation
\(\boldsymbol {\sigma }_{ri}\) of the unnormalized topic distribution
\(\boldsymbol {r}_i \sim N(\boldsymbol {\mu }_{ri}, \boldsymbol {\sigma }_{ri}^2)\). Letting
\(\boldsymbol {W}_e\) denote the word embedding matrix, we use the following formulas to calculate
\(\boldsymbol {\mu }_{ri}\) and
\(\boldsymbol {\sigma }_{ri}\):
where
\(f(\cdot)\) is a non-linear transformation, and both
\(\boldsymbol {\mu }_{ri}\) and
\(\boldsymbol {\sigma }_{ri}\) are
K-dimensional vectors.
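As a sketch, one plausible parameterization consistent with the parameter set \(\Phi\) listed later is given below; the intermediate representation \(\boldsymbol {h}_i\) and the positivity transform \(g(\cdot)\) (e.g., the exponential) are our assumptions rather than the exact layer composition used in the model:
\[
\boldsymbol {h}_i = f(\boldsymbol {W}_e \boldsymbol {x}_i), \qquad
\boldsymbol {\mu }_{ri} = \boldsymbol {W}_{\mu _r} \boldsymbol {h}_i + \boldsymbol {b}_{\mu _r}, \qquad
\boldsymbol {\sigma }_{ri} = g\big(\boldsymbol {W}_{\sigma _r} \boldsymbol {h}_i + \boldsymbol {b}_{\sigma _r}\big).
\]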
Previous works [
14,
80] directly calculate the topic distribution
\(\boldsymbol {\theta }_i\) by normalizing the generated
\(\boldsymbol {r}_i\). However, when transforming the distributional features
\(\boldsymbol {x}_i\) through a non-linear transformation, we may lose some of the information inherent in the raw distributional features. In many cases, the appearance of certain words indicates whether a document has a specific topic, and therefore the word frequencies (i.e., the word distributional features of a document) may be very helpful for distinguishing topics. In fact, classical topic models like LDA [
10] use only word frequencies to infer the topic distributions. To incorporate this consideration, we propose to construct the network for approximating the posterior distribution of the topic mask
\(\boldsymbol {\alpha }_i\) in the following way so that we make use of distributional features
\(\boldsymbol {x}_i\) to better distinguish the different topics in the generative process of our model. In addition, by making the topic distributions have larger variance, the generation of the topic mask variables
\(\boldsymbol {\alpha }_i\) also alleviates the component collapsing problem.
The topic mask variable
\(\boldsymbol {\alpha }_i\) in the Inference Network helps improve the topic coherence as well as the classification performance. Suppose
\(\boldsymbol {\beta }_i\) is the unnormalized topic mask, and we assume it is Gaussian distributed—that is,
\(\boldsymbol {\beta }_i \sim N(\boldsymbol {\mu }_{\beta i}, \boldsymbol {\sigma }_{\beta i}^2)\), where
\(\boldsymbol {\mu }_{\beta i}\) and
\(\boldsymbol {\sigma }_{\beta i}\) are respectively the mean and the standard deviation and are computed as follows:
Given
\(\boldsymbol {\beta }_i\), we can then calculate the topic mask
\(\boldsymbol {\alpha }_i\) by a sigmoid transformation:
We then construct the variational distribution of
\(\boldsymbol {\theta }_i\) in the Inference Network as
where
\(\Delta\) is a very small positive number like
\(10^{-10}\). Note that the topic mask
\(\boldsymbol {\alpha }_i\) can be interpreted as the probability of a topic being assigned to the document. By adding
\(\log {(\boldsymbol {\alpha }_i + \Delta)}\) to
\(\boldsymbol {r}_i\), we can ensure that a small
\(\alpha _{ik}\) can lead to a low probability on the
k-th dimension in
\(\boldsymbol {\theta }_i\).
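Combining the two steps above, a compact sketch of how the mask enters the variational topic distribution is as follows; here we assume that the unnormalized \(\boldsymbol {r}_i\) is normalized by a softmax, consistent with its role as the unnormalized topic distribution:
\[
\boldsymbol {\alpha }_i = \mathrm{sigmoid}(\boldsymbol {\beta }_i), \qquad
\boldsymbol {\theta }_i = \mathrm{softmax}\big(\boldsymbol {r}_i + \log (\boldsymbol {\alpha }_i + \Delta)\big).
\]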
Although we leverage neural networks to output the distribution parameters of
\(\boldsymbol {\theta }_i\) and
\(\boldsymbol {\beta }_i\), calculating the expectation in Equation (
3) is still intractable. To resolve this problem, we adopt the sampling and reparameterization trick [
49]. In the sampling step, variables
\(\boldsymbol {\theta }_i^{(s)}\) and
\(\boldsymbol {\alpha }_i^{(s)}\) are sampled in accordance with the variational distribution
\(q_\Phi (\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i|\boldsymbol {x}_i)\). Then they are substituted into
\(\log {p(\boldsymbol {x}_i | \boldsymbol {\theta }_i, \boldsymbol {\alpha }_i)}\) to estimate the expectation
\(E_{q_\Phi (\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i|\boldsymbol {x}_i)}(\log {p(\boldsymbol {x}_i | \boldsymbol {\theta }_i, \boldsymbol {\alpha }_i)})\).
Then, we design another neural network
\(p_\Psi (\boldsymbol {x}_i | \boldsymbol {\theta }_i, \boldsymbol {\alpha }_i)\), whose inputs are the latent variables and whose output is the normalized
\(\hat{\boldsymbol {x}}_i\). The normalized
\(\hat{\boldsymbol {x}}_i\) is calculated by
\(softmax(\boldsymbol {\theta }_i\boldsymbol {B}+\boldsymbol {d})\) as described in Table
2.
\(\boldsymbol {B}\) is randomly initialized, and it represents the unnormalized word distributions of each topic.
\(\boldsymbol {d}\) is pre-calculated by taking the logarithm of each word’s overall frequency. Thus,
\(\boldsymbol {\theta }_i\boldsymbol {B}\) evaluates the deviation of the word occurrence probability with respect to the whole corpus so that we can better capture those topics related to some infrequent words. Each element in
\(\hat{\boldsymbol {x}}_i\) is then the probability that the corresponding word appears in the document, and we can evaluate the probability
\(p(\boldsymbol {x}_i | \boldsymbol {\theta }_i, \boldsymbol {\alpha }_i)\) with these numbers. To some degree, this network regenerates the normalized
\(\boldsymbol {x}_i\) from
\(\boldsymbol {\theta }_i\) generated by the Inference Network, whose input is
\(\boldsymbol {x}_i\). Therefore, we name this network the
Reconstruction Network, and its structure is illustrated in Figure
6.
To make the described sampling step compatible with gradient-based training, the reparameterization trick is used to sample
\(\boldsymbol {\theta }_i^{(s)}\) and
\(\boldsymbol {\alpha }_i^{(s)}\):
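In its standard form (a sketch; the auxiliary noise vectors \(\boldsymbol {\epsilon }_r\) and \(\boldsymbol {\epsilon }_\beta\) and the elementwise product \(\odot\) are our notation), the reparameterization first draws
\[
\boldsymbol {r}_i^{(s)} = \boldsymbol {\mu }_{ri} + \boldsymbol {\sigma }_{ri} \odot \boldsymbol {\epsilon }_r, \qquad
\boldsymbol {\beta }_i^{(s)} = \boldsymbol {\mu }_{\beta i} + \boldsymbol {\sigma }_{\beta i} \odot \boldsymbol {\epsilon }_\beta, \qquad
\boldsymbol {\epsilon }_r, \boldsymbol {\epsilon }_\beta \sim N(\boldsymbol {0}, \boldsymbol {I}_K).
\]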
Then,
\(\boldsymbol {\theta }_i^{(s)}\) and
\(\boldsymbol {\alpha }_i^{(s)}\) are calculated with the sampled
\(\boldsymbol {r}_i^{(s)}\) and
\(\boldsymbol {\beta }_i^{(s)}\) using Equations (
9) and (
10). With this sampling process, we can derive a Monte Carlo approximation of Equation (
3) and use it as the loss function
\(l_t(i)\) for the topic modeling task:
where
\(\Phi = \lbrace \boldsymbol {W}_e, \boldsymbol {W}_{\mu _r}, \boldsymbol {W}_{\sigma _r}, \boldsymbol {W}_{\mu _\beta }, \boldsymbol {W}_{\sigma _\beta }, \boldsymbol {b}_{\mu _r}, \boldsymbol {b}_{\sigma _r}, \boldsymbol {b}_{\mu _\beta }, \boldsymbol {b}_{\sigma _\beta }\rbrace\),
\(\Psi = \lbrace \boldsymbol {B}\rbrace\) are the parameters to be learned in the Inference Network and the Reconstruction Network, respectively.
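Concretely, based on the description of its terms given below, the loss for a single sample \(s\) can be sketched as
\[
l_t(i) = -\log p\big(\boldsymbol {x}_i \mid \boldsymbol {\theta }_i^{(s)}, \boldsymbol {\alpha }_i^{(s)}\big)
+ KL\big(q_\Phi (\boldsymbol {\theta }_i \mid \boldsymbol {x}_i, \boldsymbol {\alpha }_i^{(s)}) \,\|\, p(\boldsymbol {\theta }_i \mid \boldsymbol {\alpha }_i^{(s)})\big)
+ KL\big(q_\Phi (\boldsymbol {\alpha }_i \mid \boldsymbol {x}_i) \,\|\, p(\boldsymbol {\alpha }_i)\big).
\]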
We start the derivation of this loss function from Equation (
3):
Here, \(q(\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i)\) is replaced with \(q_\Phi (\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i | \boldsymbol {x}_i)\) because we use the Inference Network to generate the variational distribution.
According to the generative process described in Section
4.1.1,
\(\boldsymbol {x}\) is conditionally independent of the topic mask
\(\boldsymbol {\alpha }\) given the topic distribution
\(\boldsymbol {\theta }\); thus, we can further obtain the following:
Using the sampling trick, we can approximate the expectation
\(E_{q_{\Phi }(\boldsymbol {\alpha }_i|\boldsymbol {x}_i)}(\cdot)\) by sampling
\(\boldsymbol {\alpha }_i^{(s)} \sim q_{\Phi }(\boldsymbol {\alpha }_i|\boldsymbol {x}_i)\) according to Equation (
13) in Section
4.1.2:
Using the sampling trick again to sample
\(\boldsymbol {\theta }_i^{(s)} \sim q_{\Phi }(\boldsymbol {\theta }_i|\boldsymbol {x}_i,\boldsymbol {\alpha }_i^{(s)})\):
Taking the negative of the preceding equation, we can get the result in Equation (
14).
Assuming the Reconstruction Network outputs a normalized
\(\hat{\boldsymbol {x}}_i\) (a probability vector over the vocabulary), the first term in Equation (
14) can be written out as follows.
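Writing \(x_{iv}\) and \(\hat{x}_{iv}\) for the \(v\)-th entries of \(\boldsymbol {x}_i\) and \(\hat{\boldsymbol {x}}_i\) (our notation), and assuming \(\boldsymbol {x}_i\) is treated as the word counts of document \(i\) in the reconstruction term, the multinomial word model gives, up to an additive constant independent of the parameters,
\[
\log p\big(\boldsymbol {x}_i \mid \boldsymbol {\theta }_i^{(s)}, \boldsymbol {\alpha }_i^{(s)}\big) = \sum _{v=1}^{V} x_{iv} \log \hat{x}_{iv},
\]
so the first term of the loss is the negative of this sum.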
All of the distributions in the last two terms (i.e.,
\(q_\Phi (\boldsymbol {\theta }_i | \boldsymbol {x}_i, \boldsymbol {\alpha }_i^{(s)})\),
\(p(\boldsymbol {\theta }_i|\boldsymbol {\alpha }_i^{(s)})\),
\(q_\Phi (\boldsymbol {\alpha }_i|\boldsymbol {x}_i)\), and
\(p(\boldsymbol {\alpha }_i)\)) are log-normal distributions. For two
K-dimensional log-normal distributions
\(p \sim LN(\mu _p, \Sigma _p)\) and
\(q \sim LN(\mu _q, \Sigma _q)\), their KL divergence
\(KL(q||p)\) has an analytical result:
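Because the KL divergence is invariant under applying the same invertible transformation to both distributions, this quantity equals the KL divergence between the underlying K-dimensional Gaussians, whose standard closed form is
\[
KL(q\,\|\,p) = \frac{1}{2}\left[\operatorname{tr}\big(\Sigma _p^{-1}\Sigma _q\big) + (\mu _p-\mu _q)^{\top }\Sigma _p^{-1}(\mu _p-\mu _q) - K + \log \frac{\det \Sigma _p}{\det \Sigma _q}\right].
\]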
Applying this result to the last two terms in Equation (
14), we can get the topic modeling task loss
\(l_{t}\).
The technical details (i.e., the neural network settings) of the topic modeling task, including the Inference Network and the Reconstruction Network, are shown in Figures
11 and
12 in Appendix
A.