In the topic modeling task, we focus on inferring the topic distribution for each item i based on its corresponding textual features (document). Given a corpus of documents, we propose a new generative probabilistic model for latent topic modeling and design a novel Inference Network to infer the model parameters following the VAE framework. In this subsection, we first introduce the generative process of the proposed topic model and then present the inference method.
4.1.1 Generative Process.
Suppose we have a corpus of
D documents, where there are
\(N_i\) words in the
i-th document. We assume there are
K latent topics present in these documents, and we aim to develop a model to find the latent topics and the semantics of each topic. Similar to LDA [
10], we assume that each document is represented as a random mixture over latent topics, and each topic is described as a distribution over words. The topic-word distribution can be denoted by a matrix
\(\boldsymbol {B}\), where the
j-th row represents the word distribution of the
j-th topic. For inference and parameter estimation of this model using the VAE framework, as is done in many previous studies [
14,
80], we impose a logistic normal prior on the topic distribution
\(\boldsymbol {\theta }\) to approximate the Dirichlet prior used in LDA. We do so because the Dirichlet prior is important for obtaining interpretable topics in LDA [
80], but it is incompatible with the inference process of VAE, whereas a logistic normal prior, which is a multivariate Gaussian distribution followed by a softmax transformation, can be incorporated into the VAE framework. By setting its parameters via the Laplace approximation of Hennig et al. [
38], the logistic normal prior can approximate the Dirichlet prior [
80].
To improve the topic quality and document representation of neural topic models, to differentiate items by their topic distributions so as to benefit the few-shot classification task, and to solve the component collapsing problem [
22], we improve the existing neural topic model by introducing an additional prior
\(\boldsymbol {\alpha }\) on the logistic normal prior and name it a
topic mask. We propose the topic mask with the following three motivations.
First, the performance of topic models depends on their priors, especially on the priors for the topic distributions
\(\boldsymbol {\theta }\) [
21,
88]. The imposed prior for the topic distributions
\(\boldsymbol {\theta }\) can be interpreted as the prior topic distribution of the whole corpus. Its dimensionality equals the number of topics, and the value of each dimension represents the prior probability of the corresponding topic appearing in the corpus. For example, a high prior value means that the corresponding topic appears with a higher probability. A symmetric prior means that all dimensions of the prior are equal, reflecting the assumption that each topic appears with equal probability in the corpus, whereas an asymmetric one reflects the belief that different topics appear with varying probabilities in the real-world corpus. A symmetric Dirichlet prior is usually imposed on the topic distributions in topic models like LDA [
10], and this symmetric prior is commonly set as a constant recommended by Steyvers and Griffiths [
81]. It makes sense to believe that different topics appear with varying probabilities in the real-world corpus, especially in the corpus of a particular field. For example, the words “model,” “data,” and “algorithm” appear frequently in papers in the machine learning field [
88]. Hence, topics containing these words tend to appear with a higher probability in documents from this field. Existing studies have pointed out that, compared with a symmetric prior, an asymmetric Dirichlet prior can better model the prior topic distributions and achieve better performance, measured by the probability of held-out documents, the quality of inferred topics [
88], and the document representation ability through document clustering and classification tasks [
21]. For neural topic models, although an asymmetric prior also seems likely to be beneficial for model performance, the commonly used Dirichlet prior is hard to combine with most neural topic models, and no existing research on neural topic models has studied the incorporation of an asymmetric prior [
14,
66,
80]. For the preceding reasons, in our model we design a component to impose an asymmetric prior on topic distributions and validate its effectiveness in improving the performance of the neural topic model.
Second, the inferred topic distributions are used as item (document) representations for the subsequent automatic tagging task. For this task, the ideal situation is that for each tag, items belonging to the tag have very different representations from those not belonging to the tag. The more significant the difference between topic distributions of different items (documents), the easier it is to train an accurate classification model [
32,
68]. In neural topic models, when we use a fixed prior for topic distributions, no matter whether it is symmetric or asymmetric, all of the documents share the same prior topic distributions, limiting the differences among the inferred topic distributions of different documents. To increase these differences and thereby improve the accuracy of the subsequent few-shot classification task, it is natural to consider a document-specific prior. Thus, in our proposed model, we design the prior as a probability distribution, which allows us to sample priors for different documents and obtain document-specific priors.
Third, most variational inference based neural topic models may easily suffer from the component collapsing problem, which is a particular type of local optimum very close to the prior belief [
14,
80]. The component collapsing problem in neural topic models is caused by the
Kullback-Leibler (KL) divergence regularization term in the variational objective of VAE. When the topic number is large, this regularization term dominates the objective. As a result, the latent topic distributions of different documents tend to be similar to each other. To tackle this problem, some studies improve the structure and training techniques of the neural network, including batch normalization, dropout layers, and a high moment weight and learning rate [
14,
80]. We adopt these techniques, which have been shown to help avoid poor local optima and alleviate the component collapsing problem. However, the root cause of the component collapsing problem lies in the KL divergence regularization term of the objective function. One way to solve this problem is to abandon the model structure of VAE-based neural topic models and use MMD [
66] or GAN [
91] instead of KL divergence to perform distribution matching. In this article, we propose an alternative method, which is to impose an asymmetric prior on the topic distributions and make it vary across documents. In this way, the KL divergence regularization cannot force all latent topic distributions to collapse to the same prior and yield identical topics. We adopt this idea because it naturally combines with the other two motivations and solves the corresponding problems simultaneously by imposing an asymmetric and document-specific prior.
The proposed topic mask is assumed to follow a Beta distribution, generating numbers between 0 and 1 that determine whether each topic should be assigned to the item. In this way, we can impose an asymmetric prior on the topic distributions and force them to be more differentiated and document-specific, thereby improving both the classification performance and the topic quality.
To handle the incompatibility problem with the inference process in VAE, we propose to use a Gaussian distribution followed by a sigmoid transformation to approximate the Beta distribution. In addition, we relax the constraint on the topic-word matrix
\(\boldsymbol {B}\). In LDA,
\(\boldsymbol {B}\) must be a stochastic matrix so that each row in
\(\boldsymbol {B}\) represents a multinomial topic-word distribution, and we multiply the topic distribution
\(\boldsymbol {\theta }\) and the topic-word distribution
\(\boldsymbol {B}\) to get a multinomial distribution for generating words. However, the constraint on
\(\boldsymbol {B}\) can reduce the topic quality in VAE [
80]. Thus, we relax this constraint when generating words, as is commonly done.
We demonstrate the generative process in Table
2, and the corresponding graphical representation is shown in Figure
4. In Table
2, lines 4 through 6 show the generation process of the topic mask
\(\boldsymbol {\alpha }\). For item
i’s document (or document
i for brevity), its topic mask
\(\boldsymbol {\alpha }_i\) is a
K-dimensional vector, where each element
\(\alpha _{ik}\) is a number between 0 and 1, indicating the degree to which item
i is related to the
k-th topic. The topic mask
\(\boldsymbol {\alpha }_i\) of document
i is generated through the following two steps. First, an auxiliary variable
\(\beta _{ik}\) is sampled from a Gaussian distribution
\(N(\mu _{Beta}(s, t), \sigma ^2_{Beta}(s, t))\). Then,
\(\alpha _{ik}\) is obtained by mapping
\(\beta _{ik}\) through a sigmoid function.
\(\mu _{Beta}(s, t)\) and
\(\sigma ^2_{Beta}(s, t)\) are functions derived from the Laplace approximation to ensure that the generated
\(\boldsymbol {\alpha }_i\) is approximately sampled from a Beta distribution
\(Beta(s, t)\). Specifically, we can derive that
\(\mu _{Beta}(s, t) = \log (s / t) - 1\) and
\(\sigma ^2_{Beta}(s, t) = \frac{1}{s} + \frac{1}{t}\) based on the work of Hennig et al. [
38].
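For illustration, the following minimal sketch samples a topic mask for one document using the approximation parameters stated above (the function, the NumPy dependency, and the example values of \(s\), \(t\), and \(K\) are illustrative rather than the implementation used in our experiments).

```python
import numpy as np

def sample_topic_mask(K, s, t, rng=None):
    """Sample a K-dimensional topic mask alpha_i (Table 2, lines 4-6).

    Each Beta(s, t)-distributed entry is approximated by a Gaussian draw
    followed by a sigmoid, with mu_Beta(s, t) = log(s / t) - 1 and
    sigma^2_Beta(s, t) = 1/s + 1/t, as stated in the text.
    """
    rng = np.random.default_rng() if rng is None else rng
    mu_beta = np.log(s / t) - 1.0                 # mu_Beta(s, t)
    var_beta = 1.0 / s + 1.0 / t                  # sigma^2_Beta(s, t)
    beta_ik = rng.normal(mu_beta, np.sqrt(var_beta), size=K)  # auxiliary beta_ik
    return 1.0 / (1.0 + np.exp(-beta_ik))         # sigmoid: each entry in (0, 1)


# Example with illustrative values: K = 50 topics, Beta(0.2, 1.0) prior.
alpha_i = sample_topic_mask(K=50, s=0.2, t=1.0)
```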
Next, given the topic mask
\(\boldsymbol {\alpha }_i\), we generate the topic distribution
\(\boldsymbol {\theta }_i\) for document
i by the process described in lines 7 through 9. As stated previously, a logistic normal prior is used to approximate the Dirichlet prior. For a Dirichlet prior with concentration parameters
\(\boldsymbol {c} = (c_1, c_2, \ldots , c_K)\), Hennig et al. [
38] proposed the following approximation functions to calculate the parameters of the logistic normal prior:
where
\(\mu _{Dir}(\boldsymbol {c})\) and
\(\sigma ^2_{Dir}(\boldsymbol {c})\) map the concentration parameters of a Dirichlet prior to two
K-dimensional vectors, which are respectively the mean and the diagonal of the covariance matrix of a Gaussian distribution, and the two equations give the
k-th element of these two vectors. In line 8, we use the logistic normal distribution to approximate a Dirichlet prior whose concentration parameter is
\(\pi \boldsymbol {\alpha }_i + \bar{\pi }\). The Dirichlet distribution requires its concentration parameter
\(\pi \boldsymbol {\alpha }_i + \bar{\pi }\) to be greater than 0, necessitating a constant positive value for
\(\bar{\pi }\) to ensure that
\(\boldsymbol {r}_i\) can approximate a Dirichlet distribution.
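For reference, the \(k\)-th components of the approximation functions \(\mu _{Dir}(\boldsymbol {c})\) and \(\sigma ^2_{Dir}(\boldsymbol {c})\) can be sketched following the standard Laplace approximation used in [38, 80] (our notation may differ slightly from the original equations):
\[
\mu _{Dir,k}(\boldsymbol {c}) = \log c_k - \frac{1}{K}\sum _{j=1}^{K}\log c_j, \qquad
\sigma ^2_{Dir,k}(\boldsymbol {c}) = \frac{1}{c_k}\left(1-\frac{2}{K}\right) + \frac{1}{K^2}\sum _{j=1}^{K}\frac{1}{c_j}.
\]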
Note that unlike the symmetric prior used in most works [
14,
80], in our proposed model, the concentration parameter
\(\pi \boldsymbol {\alpha }_i + \bar{\pi }\) is controlled by the topic mask
\(\boldsymbol {\alpha }_i\), and it defines an asymmetric prior in which the elements of the vector differ from one another. The asymmetric prior results in different probabilities of generating these
K topics. We define
\(\bar{\pi } \ll \pi\); thus, a larger
\(\alpha _{ik}\) leads to a larger prior on the
k-th dimension of the topic distribution, which indicates that the topic distribution
\(\boldsymbol {\theta }_i\) should be biased toward the
k-th topic.
Finally, each word is generated in lines 10 and 11. In line 11,
\(softmax(\boldsymbol {\theta }_i\boldsymbol {B} + \boldsymbol {d})\) is a generalization of the product of the topic distribution and the topic-word distribution. Here,
\(\boldsymbol {d}\) is a
V-dimensional background vector constituted by the logarithm of each word’s overall frequency.
\(\boldsymbol {B}\) is a topic-word matrix whose size is
\(K\times V\), and the
k-th row of
\(\boldsymbol {B}\) evaluates the expected word-frequency deviations from the background vector
\(\boldsymbol {d}\) when the
k-th topic appears in document
i. We can take
\(\boldsymbol {B}\) as an extension to the topic-word distribution in LDA, and this modification helps improve the topic quality [
80]. Then, the
j-th word
\(w_{ij}\) is assumed to be drawn from a multinomial distribution whose parameter is
\(softmax(\boldsymbol {\theta }_i\boldsymbol {B} + \boldsymbol {d})\).
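For illustration, the following sketch implements lines 10 and 11 of Table 2 for one document (the function and variable names are illustrative and are not taken from our implementation).

```python
import numpy as np

def generate_words(theta_i, B, d, n_words, rng=None):
    """Generate n_words word indices for one document (Table 2, lines 10-11).

    theta_i: (K,) topic distribution; B: (K, V) topic-word matrix;
    d: (V,) background vector of log word frequencies.
    """
    rng = np.random.default_rng() if rng is None else rng
    logits = theta_i @ B + d                    # deviations from the background
    probs = np.exp(logits - logits.max())       # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(d), size=n_words, p=probs)  # w_ij ~ Multinomial(probs)


# Example with random illustrative parameters (K = 5 topics, V = 100 words).
rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(5))
words = generate_words(theta, rng.normal(size=(5, 100)), rng.normal(size=100), n_words=20)
```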
4.1.2 Model Inference.
The inference for the graphical model in Figure
4 is intractable. Therefore, we propose a method to infer the latent variables following the VAE framework [
49]. The VAE framework has the favorable property that it uses neural networks to output the inferred topic distributions, which makes it easy to integrate with the neural network based classification component introduced in Section
4.2.
The VAE framework is developed based on the conventional variational inference algorithm, where we use variational distributions to approximate posterior distributions. In the proposed generative model (see Figure
4), we have two latent variables
\(\boldsymbol {\theta }_i\) and
\(\boldsymbol {\alpha }_i\) to infer for each document
i, thus we propose to use variational distribution
\(q(\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i)\) for approximating their true posterior
\(p(\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i | \boldsymbol {w}_i)\), where
\(\boldsymbol {w}_i\) is the word sequence of document
i. In NTFSL, we focus on extracting the topics inherent in the documents, so we extract distributional information (e.g., TFIDF) from
\(\boldsymbol {w}_i\) as feature vector
\(\boldsymbol {x}_i\) and ignore other local features, such as the sequence of words, in
\(\boldsymbol {w}_i\). Therefore, we substitute
\(\boldsymbol {w}_i\) with
\(\boldsymbol {x}_i\) in the rest of the article. To ensure a good approximation, KL divergence between
\(q(\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i)\) and
\(p(\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i | \boldsymbol {x}_i)\) is minimized. Previous work [
9] on variational inference demonstrates that minimizing the KL divergence is equivalent to maximizing the evidence lower bound (ELBO) shown next:
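In generic form (a sketch, writing \(\mathcal{L}(i)\) for the bound of document \(i\); this notation is ours), the ELBO is
\[
\mathcal{L}(i) = E_{q(\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i)}\big[\log p(\boldsymbol {x}_i \mid \boldsymbol {\theta }_i, \boldsymbol {\alpha }_i)\big] - KL\big(q(\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i)\,\|\,p(\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i)\big).
\]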
Instead of optimizing the ELBO analytically as in the conventional approach, we develop a new Inference Network together with a Reconstruction Network to encode the variational distributions following the VAE framework. To reflect this parameterization, we rewrite
\(q(\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i)\) as
\(q_\Phi (\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i | \boldsymbol {x}_i)\).
\(\Phi\) denotes the parameters of the neural network for generating the inferred
\(\boldsymbol {\theta }_i\) and
\(\boldsymbol {\alpha }_i\), and the network takes distributional features
\(\boldsymbol {x}_i\) as input. In Figure
5, we illustrate the Inference Network of
\(q_\Phi (\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i | \boldsymbol {x}_i)\).
The proposed Inference Network first uses a shared
Multi-Layer Perceptron (MLP) neural network with (pre-trained) word embeddings to encode the mean
\(\boldsymbol {\mu }_{ri}\) and the standard deviation
\(\boldsymbol {\sigma }_{ri}\) of the unnormalized topic distribution
\(\boldsymbol {r}_i \sim N(\boldsymbol {\mu }_{ri}, \boldsymbol {\sigma }_{ri}^2)\). Letting
\(\boldsymbol {W}_e\) denote the word embedding matrix, we use the following formulas to calculate
\(\boldsymbol {\mu }_{ri}\) and
\(\boldsymbol {\sigma }_{ri}\):
where
\(f(\cdot)\) is a non-linear transformation, and both
\(\boldsymbol {\mu }_{ri}\) and
\(\boldsymbol {\sigma }_{ri}\) are
K-dimensional vectors.
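As a sketch, one plausible parameterization consistent with the parameter set \(\Phi\) listed later is given below; the intermediate representation \(\boldsymbol {h}_i\) and the positivity transform \(g(\cdot)\) (e.g., the exponential) are our assumptions rather than the exact layer composition used in the model:
\[
\boldsymbol {h}_i = f(\boldsymbol {W}_e \boldsymbol {x}_i), \qquad
\boldsymbol {\mu }_{ri} = \boldsymbol {W}_{\mu _r} \boldsymbol {h}_i + \boldsymbol {b}_{\mu _r}, \qquad
\boldsymbol {\sigma }_{ri} = g\big(\boldsymbol {W}_{\sigma _r} \boldsymbol {h}_i + \boldsymbol {b}_{\sigma _r}\big).
\]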
Previous works [
14,
80] directly calculate the topic distribution
\(\boldsymbol {\theta }_i\) by normalizing the generated
\(\boldsymbol {r}_i\). However, when transforming the distributional features
\(\boldsymbol {x}_i\) through a non-linear transformation, we may lose some of the information inherent in the raw distributional features. In many cases, the appearance of certain words indicates whether a document has a specific topic, and therefore the word frequencies (i.e., the word distributional features of a document) may be very helpful for distinguishing topics. In fact, classical topic models like LDA [
10] use only word frequencies to infer the topic distributions. To incorporate this consideration, we propose to construct the network for approximating the posterior distribution of the topic mask
\(\boldsymbol {\alpha }_i\) in the following way so that we make use of distributional features
\(\boldsymbol {x}_i\) to better distinguish the different topics in the generative process of our model. In addition, by making the topic distributions have larger variance, the generation of the topic mask variables
\(\boldsymbol {\alpha }_i\) also alleviates the component collapsing problem.
The topic mask variable
\(\boldsymbol {\alpha }_i\) in the Inference Network helps improve the topic coherence as well as the classification performance. Suppose
\(\boldsymbol {\beta }_i\) is the unnormalized topic mask, and we assume it is Gaussian distributed—that is,
\(\boldsymbol {\beta }_i \sim N(\boldsymbol {\mu }_{\beta i}, \boldsymbol {\sigma }_{\beta i}^2)\), where
\(\boldsymbol {\mu }_{\beta i}\) and
\(\boldsymbol {\sigma }_{\beta i}\) are respectively the mean and the standard deviation and are computed as follows:
Given
\(\boldsymbol {\beta }_i\), we can then calculate the topic mask
\(\boldsymbol {\alpha }_i\) by a sigmoid transformation:
We then construct the variational distribution of
\(\boldsymbol {\theta }_i\) in the Inference Network as
where
\(\Delta\) is a very small positive number like
\(10^{-10}\). Note that the topic mask
\(\boldsymbol {\alpha }_i\) can be interpreted as the probability of a topic being assigned to the document. By adding
\(\log {(\boldsymbol {\alpha }_i + \Delta)}\) to
\(\boldsymbol {r}_i\), we can ensure that a small
\(\alpha _{ik}\) can lead to a low probability on the
k-th dimension in
\(\boldsymbol {\theta }_i\).
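Combining the two steps above, a compact sketch of how the mask enters the variational topic distribution is as follows; here we assume that the unnormalized \(\boldsymbol {r}_i\) is normalized by a softmax, consistent with its role as the unnormalized topic distribution:
\[
\boldsymbol {\alpha }_i = \mathrm{sigmoid}(\boldsymbol {\beta }_i), \qquad
\boldsymbol {\theta }_i = \mathrm{softmax}\big(\boldsymbol {r}_i + \log (\boldsymbol {\alpha }_i + \Delta)\big).
\]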
Although we leverage neural networks to output the distribution parameters of
\(\boldsymbol {\theta }_i\) and
\(\boldsymbol {\beta }_i\), calculating the expectation in Equation (
3) is still intractable. To resolve this problem, we adopt the sampling and reparameterization trick [
49]. In the sampling step, variables
\(\boldsymbol {\theta }_i^{(s)}\) and
\(\boldsymbol {\alpha }_i^{(s)}\) are sampled in accordance with the variational distribution
\(q_\Phi (\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i|\boldsymbol {x}_i)\). Then they are substituted into
\(\log {p(\boldsymbol {x}_i | \boldsymbol {\theta }_i, \boldsymbol {\alpha }_i)}\) to estimate the expectation
\(E_{q_\Phi (\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i|\boldsymbol {x}_i)}(\log {p(\boldsymbol {x}_i | \boldsymbol {\theta }_i, \boldsymbol {\alpha }_i)})\).
Then, we design another neural network
\(p_\Psi (\boldsymbol {x}_i | \boldsymbol {\theta }_i, \boldsymbol {\alpha }_i)\), whose inputs are the latent variables and whose output is the normalized
\(\hat{\boldsymbol {x}}_i\). The normalized
\(\hat{\boldsymbol {x}}_i\) is calculated by
\(softmax(\boldsymbol {\theta }_i\boldsymbol {B}+\boldsymbol {d})\) as described in Table
2.
\(\boldsymbol {B}\) is randomly initialized, and it represents the unnormalized word distributions of each topic.
\(\boldsymbol {d}\) is pre-calculated by taking the logarithm of each word’s overall frequency. Thus,
\(\boldsymbol {\theta }_i\boldsymbol {B}\) evaluates the deviation of the word occurrence probability with respect to the whole corpus so that we can better capture those topics related to some infrequent words. Each element in
\(\hat{\boldsymbol {x}}_i\) is then the probability that the corresponding word appears in the document, and we can evaluate the probability
\(p(\boldsymbol {x}_i | \boldsymbol {\theta }_i, \boldsymbol {\alpha }_i)\) with these numbers. To some degree, this network regenerates the normalized
\(\boldsymbol {x}_i\) from
\(\boldsymbol {\theta }_i\) generated by the Inference Network, whose input is
\(\boldsymbol {x}_i\). Therefore, we name this network the
Reconstruction Network, and its structure is illustrated in Figure
6.
To make the described sampling step compatible with gradient-based training, the reparameterization trick is used to sample
\(\boldsymbol {\theta }_i^{(s)}\) and
\(\boldsymbol {\alpha }_i^{(s)}\):
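In its standard form (a sketch; the auxiliary noise vectors \(\boldsymbol {\epsilon }_r\) and \(\boldsymbol {\epsilon }_\beta\) and the elementwise product \(\odot\) are our notation), the reparameterization first draws
\[
\boldsymbol {r}_i^{(s)} = \boldsymbol {\mu }_{ri} + \boldsymbol {\sigma }_{ri} \odot \boldsymbol {\epsilon }_r, \qquad
\boldsymbol {\beta }_i^{(s)} = \boldsymbol {\mu }_{\beta i} + \boldsymbol {\sigma }_{\beta i} \odot \boldsymbol {\epsilon }_\beta, \qquad
\boldsymbol {\epsilon }_r, \boldsymbol {\epsilon }_\beta \sim N(\boldsymbol {0}, \boldsymbol {I}_K).
\]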
Then,
\(\boldsymbol {\theta }_i^{(s)}\) and
\(\boldsymbol {\alpha }_i^{(s)}\) are calculated with the sampled
\(\boldsymbol {r}_i^{(s)}\) and
\(\boldsymbol {\beta }_i^{(s)}\) using Equations (
9) and (
10). With this sampling process, we can derive a Monte Carlo approximation of Equation (
3) and use it as the loss function
\(l_t(i)\) for the topic modeling task:
where
\(\Phi = \lbrace \boldsymbol {W}_e, \boldsymbol {W}_{\mu _r}, \boldsymbol {W}_{\sigma _r}, \boldsymbol {W}_{\mu _\beta }, \boldsymbol {W}_{\sigma _\beta }, \boldsymbol {b}_{\mu _r}, \boldsymbol {b}_{\sigma _r}, \boldsymbol {b}_{\mu _\beta }, \boldsymbol {b}_{\sigma _\beta }\rbrace\),
\(\Psi = \lbrace \boldsymbol {B}\rbrace\) are the parameters to be learned in the Inference Network and the Reconstruction Network, respectively.
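Concretely, based on the description of its terms given below, the loss for a single sample \(s\) can be sketched as
\[
l_t(i) = -\log p\big(\boldsymbol {x}_i \mid \boldsymbol {\theta }_i^{(s)}, \boldsymbol {\alpha }_i^{(s)}\big)
+ KL\big(q_\Phi (\boldsymbol {\theta }_i \mid \boldsymbol {x}_i, \boldsymbol {\alpha }_i^{(s)}) \,\|\, p(\boldsymbol {\theta }_i \mid \boldsymbol {\alpha }_i^{(s)})\big)
+ KL\big(q_\Phi (\boldsymbol {\alpha }_i \mid \boldsymbol {x}_i) \,\|\, p(\boldsymbol {\alpha }_i)\big).
\]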
We start the derivation of this loss function from Equation (
3):
Here, \(q(\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i)\) is replaced with \(q_\Phi (\boldsymbol {\theta }_i, \boldsymbol {\alpha }_i | \boldsymbol {x}_i)\) because we use the Inference Network to generate the variational distribution.
According to the generative process described in Section
4.1.1,
\(\boldsymbol {x}\) is conditionally independent of the topic mask
\(\boldsymbol {\alpha }\) given the topic distribution
\(\boldsymbol {\theta }\); thus, we can further obtain the following:
Using the sampling trick, we can approximate the expectation
\(E_{q_{\Phi }(\boldsymbol {\alpha }_i|\boldsymbol {x}_i)}(\cdot)\) by sampling
\(\boldsymbol {\alpha }_i^{(s)} \sim q_{\Phi }(\boldsymbol {\alpha }_i|\boldsymbol {x}_i)\) according to Equation (
13) in Section
4.1.2:
Using the sampling trick again to sample
\(\boldsymbol {\theta }_i^{(s)} \sim q_{\Phi }(\boldsymbol {\theta }_i|\boldsymbol {x}_i,\boldsymbol {\alpha }_i^{(s)})\):
Taking the negative of the preceding equation, we can get the result in Equation (
14).
Assuming the Reconstruction Network outputs a normalized
\(\hat{\boldsymbol {x}}_i\) (a probability vector over the vocabulary), the first term in Equation (
14) can be written out as follows.
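Writing \(x_{iv}\) and \(\hat{x}_{iv}\) for the \(v\)-th entries of \(\boldsymbol {x}_i\) and \(\hat{\boldsymbol {x}}_i\) (our notation), and assuming \(\boldsymbol {x}_i\) is treated as the word counts of document \(i\) in the reconstruction term, the multinomial word model gives, up to an additive constant independent of the parameters,
\[
\log p\big(\boldsymbol {x}_i \mid \boldsymbol {\theta }_i^{(s)}, \boldsymbol {\alpha }_i^{(s)}\big) = \sum _{v=1}^{V} x_{iv} \log \hat{x}_{iv},
\]
so the first term of the loss is the negative of this sum.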
All of the distributions in the last two terms (i.e.,
\(q_\Phi (\boldsymbol {\theta }_i | \boldsymbol {x}_i, \boldsymbol {\alpha }_i^{(s)})\),
\(p(\boldsymbol {\theta }_i|\boldsymbol {\alpha }_i^{(s)})\),
\(q_\Phi (\boldsymbol {\alpha }_i|\boldsymbol {x}_i)\), and
\(p(\boldsymbol {\alpha }_i)\)) are log-normal distributions. For two
K-dimensional log-normal distributions
\(p \sim LN(\mu _p, \Sigma _p)\) and
\(q \sim LN(\mu _q, \Sigma _q)\), their KL divergence
\(KL(q||p)\) has an analytical result:
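Because the KL divergence is invariant under applying the same invertible transformation to both distributions, this quantity equals the KL divergence between the underlying K-dimensional Gaussians, whose standard closed form is
\[
KL(q\,\|\,p) = \frac{1}{2}\left[\operatorname{tr}\big(\Sigma _p^{-1}\Sigma _q\big) + (\mu _p-\mu _q)^{\top }\Sigma _p^{-1}(\mu _p-\mu _q) - K + \log \frac{\det \Sigma _p}{\det \Sigma _q}\right].
\]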
Applying this result to the last two terms in Equation (
14), we can get the topic modeling task loss
\(l_{t}\).
The technical details (i.e., the neural network settings) of the topic modeling task, including the Inference Network and the Reconstruction Network, are shown in Figures
11 and
12 in Appendix
A.