SciTopic: Enhancing Topic Discovery in Scientific Literature through Advanced LLM

Paper · arXiv 2508.20514 · Published August 28, 2025

Abstract—Topic discovery in scientific literature provides valuable insights for researchers, helping them identify emerging trends, explore new avenues of investigation, and retrieve scientific information more easily. Many machine learning methods, particularly deep embedding techniques, have been applied to discover research topics. However, most existing topic discovery methods rely on word embeddings to capture semantics and lack a comprehensive understanding of scientific publications, struggling with complex, high-dimensional text relationships. Inspired by the exceptional comprehension of textual information by large language models (LLMs), we propose SciTopic, an LLM-enhanced topic discovery method that improves scientific topic identification. Specifically, we first build a textual encoder to capture the content of scientific publications, including metadata, title, and abstract. Next, we construct a space optimization module that integrates entropy-based sampling with triplet tasks guided by LLMs, sharpening the focus on thematic relevance and contextual nuance between ambiguous instances. Then, we fine-tune the textual encoder under the guidance of the LLMs by optimizing a contrastive loss over the triplets, forcing the text encoder to better discriminate instances of different topics. Finally, extensive experiments conducted on three real-world datasets of scientific publications demonstrate that SciTopic outperforms state-of-the-art (SOTA) scientific topic discovery methods, enabling researchers to gain deeper and faster insights.

However, these bag-of-words methods ignore contextual word interrelations and fail to capture the intricate semantics of modern scientific texts. In addition, they often require dimensionality reduction such as Principal Component Analysis (PCA) [4] or Uniform Manifold Approximation and Projection (UMAP) [5], which can cause significant loss of information vital to preserving the thematic depth of the documents. Deep embedding methods have gained prominence for their ability to represent textual data in a high-dimensional space, capturing semantic relationships between words and phrases. Unlike traditional bag-of-words approaches that treat words in isolation, document embeddings encapsulate the overall meaning of a document in a continuous vector space, placing semantically similar words closer together [6]. Even advanced deep topic models such as the Embedded Topic Model (ETM) [7] and the Neural Variational Document Model (NVDM) [8], though they incorporate word embeddings to capture semantic nuance, still yield a limited understanding of these intricate relationships. These limitations ultimately affect the quality of insights derived from topic discovery, potentially producing incomplete or inaccurate representations of the underlying thematic structures in the literature.

Along this line, by leveraging the strengths of LLMs, we propose SciTopic, an effective method that enhances scientific topic identification and provides deeper insights for researchers. First, we construct a text encoder that captures essential content from scientific publications, including metadata, titles, and abstracts. Through this module, we extract meaningful textual features crucial for accurate topic identification. Then, we introduce an LLM-guided clustering technique that leverages entropy-based sampling and triplet tasks. This approach diverges from traditional unsupervised clustering methods by actively involving the LLM in the clustering process. We use an entropy-based sampling strategy to identify the most ambiguous or uncertain documents, i.e., those whose cluster membership is least well defined. These high-entropy instances serve as anchors for the triplet tasks, in which two candidate titles or abstracts from nearby clusters are selected. By analyzing these triplets, the LLM refines the document embeddings, sharpening the distinctions between closely related clusters. This method not only improves clustering precision but also minimizes computational overhead by focusing the LLM's attention on the most informative cases. Ultimately, this approach yields more contextually accurate and thematically coherent clusters, ensuring that even subtle topic differences are effectively captured.

Sampling from Closest Clusters. High-entropy anchors, representing ambiguous or uncertain instances, are paired with candidate points sampled from nearby clusters based on the cosine similarity of their embeddings. Specifically, we identify the closest clusters by computing the mean embedding vector of each cluster and selecting the clusters with the smallest (Euclidean or cosine) distance to the anchor. To enhance informativeness, candidates are then sampled in proportion to the density of these nearby clusters, ensuring diverse coverage.
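The closest-cluster sampling step can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's implementation: the function name `sample_candidates` is hypothetical, cosine distance to centroids picks the nearby clusters, and cluster size stands in as a simple density proxy.

```python
import numpy as np

def sample_candidates(anchor, embeddings, labels, anchor_label,
                      n_clusters=2, n_candidates=4, rng=None):
    """Sample candidate points from the clusters closest to a high-entropy anchor.

    Closeness is the cosine distance between the anchor and each cluster's
    mean embedding; candidates are drawn from the nearest clusters in
    proportion to cluster size (a simple density proxy).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # Candidate clusters: every cluster except the anchor's own.
    others = [c for c in np.unique(labels) if c != anchor_label]
    means = np.stack([embeddings[labels == c].mean(axis=0) for c in others])
    # Cosine distance from the anchor to each cluster centroid.
    a = anchor / np.linalg.norm(anchor)
    m = means / np.linalg.norm(means, axis=1, keepdims=True)
    dist = 1.0 - m @ a
    nearest = [others[i] for i in np.argsort(dist)[:n_clusters]]
    # Sample proportionally to cluster size for diverse coverage.
    sizes = np.array([np.sum(labels == c) for c in nearest], dtype=float)
    probs = sizes / sizes.sum()
    picks = []
    for _ in range(n_candidates):
        c = rng.choice(nearest, p=probs)
        picks.append(int(rng.choice(np.flatnonzero(labels == c))))
    return picks
```

A usage pattern would pass one anchor embedding at a time, collecting candidate indices for the subsequent triplet construction.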

This targeted sampling generates informative triplets (anchor, positive, negative), where the positive is sampled from the same cluster as the anchor, and the negative is sampled from a closely related but distinct cluster. By focusing on ambiguous instances, this entropy-driven method reduces LLM query frequency and improves cost-effectiveness while maintaining high-quality clustering results.
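The entropy-driven anchor selection and triplet assembly described above can be sketched as follows, assuming soft cluster-assignment probabilities are available (e.g., from a soft clustering step); the function names and the choice of the nearest other cluster as the negative source are illustrative assumptions.

```python
import numpy as np

def entropy_anchors(soft_assignments, top_k=2):
    """Select high-entropy anchors: the instances whose soft
    cluster-assignment distributions are most uncertain."""
    p = np.clip(soft_assignments, 1e-12, 1.0)
    ent = -(p * np.log(p)).sum(axis=1)
    return np.argsort(-ent)[:top_k]

def make_triplet(anchor_idx, embeddings, labels, rng=None):
    """Build one (anchor, positive, negative) index triplet: the positive
    comes from the anchor's own cluster, the negative from the most
    similar other cluster (by centroid cosine similarity)."""
    if rng is None:
        rng = np.random.default_rng(0)
    a_lab = labels[anchor_idx]
    pos_pool = np.flatnonzero((labels == a_lab)
                              & (np.arange(len(labels)) != anchor_idx))
    others = [c for c in np.unique(labels) if c != a_lab]
    cents = np.stack([embeddings[labels == c].mean(axis=0) for c in others])
    a = embeddings[anchor_idx]
    sims = (cents @ a) / (np.linalg.norm(cents, axis=1) * np.linalg.norm(a))
    neg_pool = np.flatnonzero(labels == others[int(np.argmax(sims))])
    return anchor_idx, int(rng.choice(pos_pool)), int(rng.choice(neg_pool))
```

Only the high-entropy anchors returned by `entropy_anchors` would be fed to `make_triplet`, which is what keeps the number of LLM queries low.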

Triplet Task for Clustering Perspective. At the core of the method is a triplet task that asks the LLM to evaluate the thematic relationships among three elements: an anchor and two candidates. The LLM is prompted with the query ρ = "Select the paper closest to a: c1, c2, or Neither." The LLM processes this prompt and determines which candidate aligns more closely with the anchor in thematic similarity. If the LLM responds with "Neither," the triplet is excluded, ensuring that only meaningful comparisons guide the clustering process. This mechanism enhances contextual precision by eliminating ambiguous or irrelevant triplets, providing cleaner input for fine-tuning the clustering model.
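The triplet query can be wrapped as a small helper. In this sketch, `query_llm` is a hypothetical callable standing in for whatever LLM API is actually used, and the prompt wording follows the ρ template quoted above; everything else is an assumption for illustration.

```python
def build_triplet_prompt(anchor, c1, c2):
    """Render the triplet query rho for one anchor and two candidates."""
    return ("Select the paper closest to a: c1, c2, or Neither.\n"
            f"a: {anchor}\nc1: {c1}\nc2: {c2}")

def resolve_triplet(anchor, c1, c2, query_llm):
    """query_llm: hypothetical callable, prompt str -> 'c1' | 'c2' | 'Neither'.

    Returns an (anchor, positive, negative) triplet, or None when the LLM
    answers 'Neither' and the triplet is discarded as uninformative.
    """
    answer = query_llm(build_triplet_prompt(anchor, c1, c2)).strip().lower()
    if answer == "c1":
        return (anchor, c1, c2)
    if answer == "c2":
        return (anchor, c2, c1)
    return None  # ambiguous or irrelevant comparison: exclude
```

For example, `resolve_triplet(a, c1, c2, query_llm)` yields `None` for a "Neither" response, so only LLM-validated triplets flow into fine-tuning.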

Each valid triplet consists of an anchor, a positive example (closer candidate), and a negative example (distant candidate), which are used to fine-tune the embedding model. By learning from these structured comparisons, the model develops embeddings that capture nuanced thematic distinctions, improving clustering coherence and accuracy.
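The fine-tuning objective over valid triplets can be illustrated with a cosine-similarity triplet margin loss: the anchor-positive similarity is pushed above the anchor-negative similarity by at least a margin. The margin value and the use of cosine similarity here are assumptions for this sketch, not necessarily the paper's exact contrastive loss.

```python
import numpy as np

def triplet_contrastive_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on cosine similarity.

    Zero when sim(anchor, positive) exceeds sim(anchor, negative) by at
    least `margin`; positive otherwise, which is the signal the encoder
    is fine-tuned to minimize.
    """
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return max(0.0, margin - cos(anchor, positive) + cos(anchor, negative))
```

Summed over all LLM-validated triplets, minimizing this quantity pulls same-topic embeddings together and pushes closely related but distinct topics apart.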