ClusterLDA: Clustering-Based Topic Extraction for Online News

  • Juan Kenichi Sutan*
  • Jumanah Alshehri
  • Rafaa Aljurbua*
  • Zoran Obradovic*

* Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Topic extraction from large text datasets is a critical task for understanding and analyzing thematic patterns across documents. This study explores three methods for improving topic modeling: standard Latent Dirichlet Allocation (LDA), a clustering-based approach combined with LDA, and an experimental method enhanced with Wikipedia-supplemented clustering. We demonstrate that applying clustering prior to LDA significantly improves topic extraction by aligning the modeling process with the natural structure of the data, allowing for the identification of smaller, yet meaningful, topics that are often overshadowed in standard LDA due to thematic imbalances. Additionally, we applied the improved topic modeling approach to construct an entity-topic network that contextualizes entities' sentiments within the topics in which they were mentioned, providing a nuanced view of the dataset. Our hypothesis that Wikipedia-enhanced clustering would further improve topic extraction was not supported, as it introduced noise and worsened performance. Nevertheless, the ClusterLDA approach proved effective in enhancing the granularity of topic extraction and addressing the limitations of standard LDA. These findings highlight the potential of clustering to support more coherent and semantically relevant topic extraction, offering a foundation for future advancements in topic modeling and sentiment analysis.

Original language: English
Title of host publication: Proceedings - 2025 8th International Women in Data Science Conference at Prince Sultan University, WiDS-PSU 2025
Editors: Tanzila Saba, Amjad Rehman
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 145-150
Number of pages: 6
ISBN (Electronic): 9798331520922
State: Published - 2025
Event: 8th International Women in Data Science Conference at Prince Sultan University, WiDS-PSU 2025 - Riyadh, Saudi Arabia
Duration: 13 Apr 2025 → 14 Apr 2025

Publication series

Name: Proceedings - 2025 8th International Women in Data Science Conference at Prince Sultan University, WiDS-PSU 2025

Conference

Conference: 8th International Women in Data Science Conference at Prince Sultan University, WiDS-PSU 2025
Country/Territory: Saudi Arabia
City: Riyadh
Period: 13/04/25 → 14/04/25

Keywords

  • Clustering
  • Latent Dirichlet Allocation (LDA)
  • Text Mining
  • Topic Modeling
