From Tweets to Reddit: Leveraging Semi-supervised Domain Adaptation for Improving Data Filtering

  • Shelly Gupta*
  • Jumanah Alshehri
  • Ameen Abdel Hai
  • Hussain Otudi
  • Zoran Obradovic

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Reddit has emerged as a leading platform for microblogging data collection, providing valuable insights into patterns and knowledge discovery. However, the process of gathering and preparing this data presents significant challenges, particularly when it comes to ensuring its accuracy. Existing methodologies often yield an abundance of irrelevant posts, and datasets for relevance prediction on Reddit are rare. To overcome these obstacles, we propose a new semi-supervised framework that filters Reddit posts based on their topic relevance. Our approach combines annotated data from Twitter with weak labels generated from Wikipedia pages associated with relevant subreddits to automatically label Reddit posts. To enhance the model’s generalization performance, we utilize a domain adversarial adaptation network to bridge the distribution gap between Twitter and Reddit data. Our novel framework achieves an accuracy of 73% and an F1 score of 0.77, which is a significant improvement of 20% compared to baseline models. Additionally, we address important research questions regarding the effectiveness of automatic labeling, the use of weakly labeled data, the contextual requirements for training domain adaptation models, and the optimal weak labeling method.
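The weak-labeling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' method, which the abstract does not specify: it assumes a simple bag-of-words cosine similarity between a post and a reference text, with an arbitrary threshold. The names `tokenize`, `cosine_similarity`, and `weak_label`, the threshold value, and the sample texts are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def weak_label(post, reference_text, threshold=0.2):
    """Return 1 (relevant) if the post is similar enough to the reference, else 0.

    The threshold is an arbitrary illustrative value, not one from the paper.
    """
    score = cosine_similarity(
        Counter(tokenize(post)), Counter(tokenize(reference_text))
    )
    return 1 if score >= threshold else 0


# Hypothetical stand-in for text taken from a Wikipedia page on the topic.
reference = "A vaccine is a biological preparation that provides immunity to a disease."
# Hypothetical Reddit posts to be labeled.
posts = [
    "Got my vaccine today, hoping for immunity to the disease",
    "Anyone watching the game tonight?",
]
labels = [weak_label(p, reference) for p in posts]
```

Weak labels of this kind are noisy by design. In the framework the abstract describes, they are combined with annotated Twitter data, and a domain adversarial adaptation network is used to bridge the gap between the two platforms.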

Original language: English
Title of host publication: Artificial Intelligence Applications and Innovations - 20th IFIP WG 12.5 International Conference, AIAI 2024, Proceedings
Editors: Ilias Maglogiannis, Lazaros Iliadis, Antonios Papaleonidas, John Macintyre, Markos Avlonitis
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 290-304
Number of pages: 15
ISBN (Print): 9783031632228
State: Published - 2024
Event: 20th IFIP WG 12.5 International Conference on Artificial Intelligence Applications and Innovations, AIAI 2024 - Corfu, Greece
Duration: 27 Jun 2024 → 30 Jun 2024

Publication series

Name: IFIP Advances in Information and Communication Technology
Volume: 714
ISSN (Print): 1868-4238
ISSN (Electronic): 1868-422X

Conference

Conference: 20th IFIP WG 12.5 International Conference on Artificial Intelligence Applications and Innovations, AIAI 2024
Country/Territory: Greece
City: Corfu
Period: 27/06/24 → 30/06/24

Keywords

  • Domain Adaptation
  • Reddit
  • Relevance learning
  • Semi-supervised
  • Twitter
