Topic Extraction In Chatbot Applications

Following the release of ChatGPT, built on OpenAI’s GPT-3.5 family of models, natural language applications surged in popularity. Designed for natural language understanding and generation, ChatGPT excels at a wide range of language tasks, including text completion, question answering, and translation. Its optimization for conversational use makes it an excellent choice for chat-based applications.

OpenAI has further broadened access to ChatGPT by providing an API that lets developers integrate its capabilities into their own applications and services. This has fueled widespread adoption and placed ChatGPT at the forefront of the conversational AI landscape.

One potential application is retrieval-augmented generation (RAG), a hybrid approach in natural language processing that combines retrieval-based models with generative models. This approach is often used in applications that must balance the precision of retrieval-based methods with the creativity and adaptability of generative models. It can significantly enhance the overall performance of natural language processing systems, making them well suited for tasks such as chatbots, question answering systems, and conversational agents.

How can one embark on developing a RAG application? Each design decision holds the potential to impact the system’s responses, whether through adjustments in prompts, text embeddings, or the scale of the text to embed. The challenge lies in systematically testing these modifications. Unlike software engineering, where testing can be largely automated, working with a chatbot introduces complexity due to the stochastic nature of its responses. While automated testing frameworks exist, human intervention is often necessary to effectively evaluate model performance.

In this context, formulating test questions becomes critical. Given the inherent stochasticity in chatbot responses, evaluation questions should be concise and insightful. This approach minimizes the time invested in the assessment process, recognizing the need for human involvement for nuanced evaluation.

One can categorize questions by assigning them to specific topics, forming cohesive clusters. The primary goal is to autonomously uncover and extract the overarching themes discussed among a set of questions, all without prior knowledge of the content. This methodology enables the inclusion of only a select number of questions from each topic in the test collection, ensuring a representative dataset. Furthermore, topics offer valuable analytical insights into customer behavior, enhancing our understanding of their interaction with the system.

The subsequent sections outline a straightforward method for extracting topics from questions. The code for the presented approach is available in the accompanying repository.

A simple approach to topic extraction



An embedding is a numerical representation of various types of information, including text, documents, images, audio, and more. This representation effectively captures the semantic meaning of the embedded content.

Numerous embedding models are available, and a viable method for benchmarking them is through the MTEB benchmark. This benchmark comprises 8 embedding tasks, covering a total of 56 datasets and spanning 112 languages. For the purpose of this article, we have opted for the gte-large model, readily available on Hugging Face. Its relatively compact size allows for swift experimentation on a personal computer, and it currently ranks among the top 15 models on the MTEB leaderboard.

Manifold Learning

t-SNE stands for t-distributed Stochastic Neighbor Embedding. It seeks a low-dimensional representation of the data that minimizes the Kullback-Leibler divergence between the joint probabilities of the low-dimensional map and those of the original high-dimensional data. Notably, t-SNE has a non-convex cost function, so different initializations can yield different results.

In our context, t-SNE serves dual purposes:
1. It functions as a powerful tool for visualizing high-dimensional data, enabling manual inspection, potentially aiding in the selection of an optimal number of clusters.
2. Additionally, t-SNE is employed to reduce the dimensionality to a more manageable number. This reduction not only mitigates noise but also improves the performance of clustering algorithms, which often struggle with sparse, high-dimensional data. In high dimensions, pairwise distances tend to become large and uninformative, especially under the Euclidean metric used by k-means clustering. This phenomenon is commonly referred to as the “curse of dimensionality.”
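A minimal sketch of this reduction step with scikit-learn, using synthetic high-dimensional data in place of real question embeddings:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic stand-in for question embeddings: 300 points in 100
# dimensions, drawn from 5 well-separated Gaussian blobs.
X, _ = make_blobs(n_samples=300, n_features=100, centers=5, random_state=0)

# Reduce to 2 dimensions; PCA initialization makes runs more stable,
# which matters because the t-SNE cost function is non-convex.
X_2d = TSNE(n_components=2, perplexity=30, init="pca",
            random_state=0).fit_transform(X)
print(X_2d.shape)  # each point now has a 2-dimensional coordinate
```

For clustering rather than plotting, a slightly higher target dimensionality can be used (for `n_components` above 3, scikit-learn requires `method="exact"`).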

K-means Clustering

The k-means clustering algorithm is a widely used unsupervised machine learning technique that partitions a dataset into k distinct, non-overlapping subsets. The primary objective is to group data points into clusters according to their similarity, with each cluster represented by its centroid, the mean of all data points within it. It is computationally efficient and relatively easy to understand and implement.

Much like the t-SNE algorithm, the k-means method is susceptible to the influence of initializations. Diverse starting points may yield distinct outcomes, emphasizing the need for careful consideration and potentially multiple runs to ensure robust results.
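The "multiple runs" advice above is already built into scikit-learn's implementation; a sketch on synthetic 2-D data standing in for the t-SNE-reduced embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data standing in for the t-SNE-reduced embeddings.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_init=10 runs k-means from 10 different random initializations and
# keeps the solution with the lowest inertia, mitigating the
# sensitivity to starting points discussed above.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
print(sorted(set(labels)))  # three cluster ids: [0, 1, 2]
```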

Selection of the Number of Clusters

A critical consideration arises in clustering algorithms: determining the optimal number of clusters. While a manual inspection of the data is a valid approach, it becomes challenging when clusters are not easily separable. In such cases, an automated method proves advantageous.

One effective automated solution involves leveraging the silhouette score. This score is computed by evaluating the mean intra-cluster distance and mean nearest-cluster distance for each data point. Essentially, it serves as a metric for assessing how well an object aligns with its own cluster in comparison to neighboring clusters. The silhouette score varies between -1 and +1, with a higher value indicating a strong match within its cluster and a weaker match with neighboring clusters.

A favorable clustering configuration is characterized by a prevalence of high silhouette scores among objects. Conversely, if a significant number of data points exhibit low or negative silhouette values, it suggests an imbalance in the clustering configuration, indicating either an excess or deficiency of clusters.
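The selection procedure described above can be sketched as a loop over candidate cluster counts, keeping the k with the highest mean silhouette score (shown here on synthetic data with a known ground truth of 4 clusters):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with a known ground truth of 4 clusters.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.8, random_state=42)

# Fit k-means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest score is selected as the number of clusters.
best_k = max(scores, key=scores.get)
print(best_k)
```

On well-separated data such as this, the silhouette score recovers the true cluster count; on real question embeddings the peak is flatter, which is why the article averages the selection over many runs.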


Illustrative example: The Stanford Question Answering Dataset


The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset of questions formulated by crowdworkers on a curated selection of Wikipedia articles. It contains over 100,000 question-answer pairs on 500+ articles. Because each question maps to an article, the articles can serve as ground truth for the number of topics, i.e., the number of clusters in our example. For this article, the smaller development set was used, containing approximately 12,000 questions originating from 35 articles.

The histogram illustrates outcomes obtained by applying the silhouette score to select the number of clusters in 150 runs, each initialized differently within the range of 20 to 45 clusters. The orange line represents the ground truth, specifically 35 distinct articles. Notably, both the mean and mode of the resulting distribution converge at 36, closely aligning with the count of articles that served as the source for the questions.

The scatter plot visually represents the outcomes of k-means clustering employing 36 clusters on a 2-dimensional manifold derived from the t-SNE model, applied to question embeddings. At first glance, the results appear suitable.

For validation purposes, we additionally estimated the histogram depicting the conditional probability of topics (corresponding to the titles of Wikipedia articles) given the predicted cluster. Ideally, one anticipates a minimal number of topics within each cluster, with the optimal scenario being the prevalence of only one predominant topic.
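This validation step amounts to a normalized contingency table; a sketch with toy labels (the topic names and assignments here are illustrative, not taken from SQuAD), assuming pandas:

```python
import pandas as pd

# Toy ground-truth article titles and predicted cluster ids for 8 questions.
topics = ["Eiffel Tower", "Eiffel Tower", "Eiffel Tower", "Photosynthesis",
          "Photosynthesis", "Photosynthesis", "Eiffel Tower", "Photosynthesis"]
clusters = [0, 0, 0, 1, 1, 1, 1, 1]

# Rows are clusters; normalize="index" turns counts into P(topic | cluster).
p_topic_given_cluster = pd.crosstab(
    pd.Series(clusters, name="cluster"),
    pd.Series(topics, name="topic"),
    normalize="index",
)
print(p_topic_given_cluster)
```

In the ideal case each row of this table concentrates its mass on a single topic, as in cluster 0 above; cluster 1 shows the more realistic case of one dominant topic with some leakage.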

As we can see, one topic usually dominates each cluster, which indicates good performance. The prevalent topic is shown in the respective histogram title. Word clouds corresponding to the clusters depicted in the histograms are also shown. Word clouds offer a visual representation of text data, commonly used to showcase the keywords of free-form text. Typically, tags are single words, and the significance of each tag is conveyed through font size or color.

Once more, it is evident that the keywords align with the displayed article title situated above the corresponding word cloud. This alignment signifies effective clustering of the questions, demonstrating their cohesive association with the predominant topic encapsulated within the predicted cluster.


Limitations and alternative approaches


The approach presented in this article is simple and pragmatic. Numerous options exist for each part of the pipeline, and we suggest some possible improvements below.

The embedding model was selected to perform well on the MTEB benchmark while remaining compact enough for use on a local computer. However, exploring multiple embedding models could be advantageous, especially when dealing with questions in foreign languages that are not rigorously benchmarked. Additionally, for domain-specific applications, it could be beneficial to consider models fine-tuned for that domain, or to undertake the fine-tuning yourself. A guide to text embeddings is available on OpenAI’s website.

The proposed approach leveraged the t-SNE algorithm to reduce the dimensionality of the embeddings. In a broader context, dimensionality reduction falls under the category of manifold learning. Various frameworks address linear dimensionality reduction, e.g., PCA; manifold learning, in essence, extends linear frameworks like PCA to capture nonlinear structure in the data. Exploring alternative algorithms could be beneficial at this stage. It is also worth noting that optimizing the KL divergence in t-SNE can be somewhat challenging, and comprehensive guides on using t-SNE effectively are available. An alternative would be to cluster the questions directly on the text embeddings.
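As an example of the linear alternative mentioned above, PCA could replace t-SNE as the reduction step; a sketch on synthetic data standing in for question embeddings:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for question embeddings: 300 points, 100 dimensions.
X, _ = make_blobs(n_samples=300, n_features=100, centers=5, random_state=0)

# Keep enough principal components to explain 95% of the variance.
# PCA is deterministic and much cheaper than t-SNE, but only captures
# linear structure in the data.
pca = PCA(n_components=0.95, random_state=0)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1], "components retained")
```

For well-clustered data, most of the variance lies in the low-rank between-cluster structure, so very few components survive the 95% cutoff.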

The employed clustering algorithm was k-means clustering. However, a prerequisite for k-means is the a priori knowledge of the number of clusters. Exploring alternative clustering techniques that would select the number of clusters automatically could be a valuable avenue. Moreover, it’s worth considering alternative performance assessment scores for the clustering algorithm.

Finally, our approach relies on a significant assumption – that each question exclusively belongs to a singular cluster, representing a topic. In reality, this assumption falls short, as questions can often span multiple topics. Introducing a fuzzy clustering algorithm might prove beneficial, enabling the prediction of a distribution across clusters to which a question may belong. Noteworthy methodologies in Bayesian clustering and topic modeling encompass the Dirichlet distribution, exemplified by Dirichlet mixture models. Furthermore, nonparametric extensions of these models can automatically infer the optimal number of clusters.
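As a sketch of such a soft-assignment alternative (using scikit-learn's variational Gaussian mixture rather than a full Dirichlet mixture model over text), a Dirichlet-process prior yields a distribution over clusters for each point and can prune unneeded components:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

# Synthetic 2-D data standing in for reduced question embeddings.
X, _ = make_blobs(n_samples=400, centers=3, random_state=42)

# The Dirichlet-process prior lets the model use fewer than the
# n_components upper bound; predict_proba returns a distribution over
# clusters instead of a single hard assignment.
bgm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)

proba = bgm.predict_proba(X)             # shape (400, 10), rows sum to 1
effective = (bgm.weights_ > 0.01).sum()  # components actually in use
print(proba.shape, effective)
```

Each row of `proba` is the kind of per-question distribution over topics that the hard k-means assignment cannot provide.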




Conclusion

The analysis of textual data has gained significant relevance in recent years, thanks to the rapid advancement of natural language processing models powering a multitude of applications. Such analysis offers valuable insights into customer behavior, helps identify areas for model improvement, and streamlines testing by condensing a question collection into a representative set for manual evaluation.

The methodology outlined in this article introduces a straightforward approach to topic extraction, contributing to the realization of the aforementioned benefits. While the pragmatic approach demonstrated strong performance on an illustrative dataset, it’s crucial to acknowledge the existing limitations, paving the way for potential refinements and enhancements in future work.