From Tweets to Reddit: Leveraging Semi-supervised Domain Adaptation for Improving Data Filtering

Shelly Gupta; Jumanah Alshehri; Ameen Abdel Hai; Hussain Otudi; Zoran Obradovic

doi:10.1007/978-3-031-63223-5_22

Shelly Gupta^*, Jumanah Alshehri, Ameen Abdel Hai, Hussain Otudi, Zoran Obradovic

^*Corresponding author for this work

Management Information Systems Department

Temple University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Reddit has emerged as a leading platform for microblogging data collection, providing valuable insights into patterns and knowledge discovery. However, the process of gathering and preparing this data presents significant challenges, particularly when it comes to ensuring its accuracy. Existing methodologies often yield an abundance of irrelevant posts, and datasets for relevance prediction on Reddit are rare. To overcome these obstacles, we propose a new semi-supervised framework that filters Reddit posts based on their topic relevance. Our approach combines annotated data from Twitter with weak labels generated from Wikipedia pages associated with relevant subreddits to automatically label Reddit posts. To enhance the model’s generalization performance, we utilize a domain adversarial adaptation network to bridge the distribution gap between Twitter and Reddit data. Our novel framework achieves an accuracy of 73% and an F1 score of 0.77, which is a significant improvement of 20% compared to baseline models. Additionally, we address important research questions regarding the effectiveness of automatic labeling, the use of weakly labeled data, the contextual requirements for training domain adaptation models, and the optimal weak labeling method.

Original language	English
Title of host publication	Artificial Intelligence Applications and Innovations - 20th IFIP WG 12.5 International Conference, AIAI 2024, Proceedings
Editors	Ilias Maglogiannis, Lazaros Iliadis, Antonios Papaleonidas, John Macintyre, Markos Avlonitis
Publisher	Springer Science and Business Media Deutschland GmbH
Pages	290-304
Number of pages	15
ISBN (Print)	9783031632228
DOIs	https://doi.org/10.1007/978-3-031-63223-5_22
State	Published - 2024
Event	20th IFIP WG 12.5 International Conference on Artificial Intelligence Applications and Innovations, AIAI 2024 - Corfu, Greece, 27 Jun 2024 → 30 Jun 2024

Publication series

Name	IFIP Advances in Information and Communication Technology
Volume	714
ISSN (Print)	1868-4238
ISSN (Electronic)	1868-422X

Conference

Conference	20th IFIP WG 12.5 International Conference on Artificial Intelligence Applications and Innovations, AIAI 2024
Country/Territory	Greece
City	Corfu
Period	27/06/24 → 30/06/24

Keywords

Domain Adaptation
Reddit
Relevance learning
Semi-supervised
Twitter

Cite this

Gupta, S., Alshehri, J., Hai, A. A., Otudi, H., & Obradovic, Z. (2024). From Tweets to Reddit: Leveraging Semi-supervised Domain Adaptation for Improving Data Filtering. In I. Maglogiannis, L. Iliadis, A. Papaleonidas, J. Macintyre, & M. Avlonitis (Eds.), Artificial Intelligence Applications and Innovations - 20th IFIP WG 12.5 International Conference, AIAI 2024, Proceedings (pp. 290-304). (IFIP Advances in Information and Communication Technology; Vol. 714). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-63223-5_22

Gupta, Shelly; Alshehri, Jumanah; Hai, Ameen Abdel et al. / From Tweets to Reddit: Leveraging Semi-supervised Domain Adaptation for Improving Data Filtering. Artificial Intelligence Applications and Innovations - 20th IFIP WG 12.5 International Conference, AIAI 2024, Proceedings. editor / Ilias Maglogiannis; Lazaros Iliadis; Antonios Papaleonidas; John Macintyre; Markos Avlonitis. Springer Science and Business Media Deutschland GmbH, 2024. pp. 290-304 (IFIP Advances in Information and Communication Technology).

@inproceedings{6e5b381a421e493299e68961b7a9647c,

title = "From Tweets to Reddit: Leveraging Semi-supervised Domain Adaptation for Improving Data Filtering",

abstract = "Reddit has emerged as a leading platform for microblogging data collection, providing valuable insights into patterns and knowledge discovery. However, the process of gathering and preparing this data presents significant challenges, particularly when it comes to ensuring its accuracy. Existing methodologies often yield an abundance of irrelevant posts, and datasets for relevance prediction on Reddit are rare. To overcome these obstacles, we propose a new semi-supervised framework that filters Reddit posts based on their topic relevance. Our approach combines annotated data from Twitter with weak labels generated from Wikipedia pages associated with relevant subreddits to automatically label Reddit posts. To enhance the model{\textquoteright}s generalization performance, we utilize a domain adversarial adaptation network to bridge the distribution gap between Twitter and Reddit data. Our novel framework achieves an accuracy of 73\% and an F1 score of 0.77, which is a significant improvement of 20\% compared to baseline models. Additionally, we address important research questions regarding the effectiveness of automatic labeling, the use of weakly labeled data, the contextual requirements for training domain adaptation models, and the optimal weak labeling method.",

keywords = "Domain Adaptation, Reddit, Relevance learning, Semi-supervised, Twitter",

author = "Shelly Gupta and Jumanah Alshehri and Hai, {Ameen Abdel} and Hussain Otudi and Zoran Obradovic",

note = "Publisher Copyright: {\textcopyright} IFIP International Federation for Information Processing 2024.; 20th IFIP WG 12.5 International Conference on Artificial Intelligence Applications and Innovations, AIAI 2024 ; Conference date: 27-06-2024 Through 30-06-2024",

year = "2024",

doi = "10.1007/978-3-031-63223-5_22",

language = "English",

isbn = "9783031632228",

series = "IFIP Advances in Information and Communication Technology",

volume = "714",

publisher = "Springer Science and Business Media Deutschland GmbH",

pages = "290--304",

editor = "Ilias Maglogiannis and Lazaros Iliadis and Antonios Papaleonidas and John Macintyre and Markos Avlonitis",

booktitle = "Artificial Intelligence Applications and Innovations - 20th IFIP WG 12.5 International Conference, AIAI 2024, Proceedings",

}

Gupta, S, Alshehri, J, Hai, AA, Otudi, H & Obradovic, Z 2024, From Tweets to Reddit: Leveraging Semi-supervised Domain Adaptation for Improving Data Filtering. in I Maglogiannis, L Iliadis, A Papaleonidas, J Macintyre & M Avlonitis (eds), Artificial Intelligence Applications and Innovations - 20th IFIP WG 12.5 International Conference, AIAI 2024, Proceedings. IFIP Advances in Information and Communication Technology, vol. 714, Springer Science and Business Media Deutschland GmbH, pp. 290-304, 20th IFIP WG 12.5 International Conference on Artificial Intelligence Applications and Innovations, AIAI 2024, Corfu, Greece, 27/06/24. https://doi.org/10.1007/978-3-031-63223-5_22

From Tweets to Reddit: Leveraging Semi-supervised Domain Adaptation for Improving Data Filtering. / Gupta, Shelly; Alshehri, Jumanah; Hai, Ameen Abdel et al.
Artificial Intelligence Applications and Innovations - 20th IFIP WG 12.5 International Conference, AIAI 2024, Proceedings. ed. / Ilias Maglogiannis; Lazaros Iliadis; Antonios Papaleonidas; John Macintyre; Markos Avlonitis. Springer Science and Business Media Deutschland GmbH, 2024. p. 290-304 (IFIP Advances in Information and Communication Technology; Vol. 714).

TY - GEN

T1 - From Tweets to Reddit: Leveraging Semi-supervised Domain Adaptation for Improving Data Filtering

T2 - 20th IFIP WG 12.5 International Conference on Artificial Intelligence Applications and Innovations, AIAI 2024

AU - Gupta, Shelly

AU - Alshehri, Jumanah

AU - Hai, Ameen Abdel

AU - Otudi, Hussain

AU - Obradovic, Zoran

N1 - Publisher Copyright: © IFIP International Federation for Information Processing 2024.

PY - 2024

Y1 - 2024

N2 - Reddit has emerged as a leading platform for microblogging data collection, providing valuable insights into patterns and knowledge discovery. However, the process of gathering and preparing this data presents significant challenges, particularly when it comes to ensuring its accuracy. Existing methodologies often yield an abundance of irrelevant posts, and datasets for relevance prediction on Reddit are rare. To overcome these obstacles, we propose a new semi-supervised framework that filters Reddit posts based on their topic relevance. Our approach combines annotated data from Twitter with weak labels generated from Wikipedia pages associated with relevant subreddits to automatically label Reddit posts. To enhance the model’s generalization performance, we utilize a domain adversarial adaptation network to bridge the distribution gap between Twitter and Reddit data. Our novel framework achieves an accuracy of 73% and an F1 score of 0.77, which is a significant improvement of 20% compared to baseline models. Additionally, we address important research questions regarding the effectiveness of automatic labeling, the use of weakly labeled data, the contextual requirements for training domain adaptation models, and the optimal weak labeling method.

AB - Reddit has emerged as a leading platform for microblogging data collection, providing valuable insights into patterns and knowledge discovery. However, the process of gathering and preparing this data presents significant challenges, particularly when it comes to ensuring its accuracy. Existing methodologies often yield an abundance of irrelevant posts, and datasets for relevance prediction on Reddit are rare. To overcome these obstacles, we propose a new semi-supervised framework that filters Reddit posts based on their topic relevance. Our approach combines annotated data from Twitter with weak labels generated from Wikipedia pages associated with relevant subreddits to automatically label Reddit posts. To enhance the model’s generalization performance, we utilize a domain adversarial adaptation network to bridge the distribution gap between Twitter and Reddit data. Our novel framework achieves an accuracy of 73% and an F1 score of 0.77, which is a significant improvement of 20% compared to baseline models. Additionally, we address important research questions regarding the effectiveness of automatic labeling, the use of weakly labeled data, the contextual requirements for training domain adaptation models, and the optimal weak labeling method.

KW - Domain Adaptation

KW - Reddit

KW - Relevance learning

KW - Semi-supervised

KW - Twitter

UR - https://www.scopus.com/pages/publications/85197336579

U2 - 10.1007/978-3-031-63223-5_22

DO - 10.1007/978-3-031-63223-5_22

M3 - Conference contribution

AN - SCOPUS:85197336579

SN - 9783031632228

T3 - IFIP Advances in Information and Communication Technology

SP - 290

EP - 304

BT - Artificial Intelligence Applications and Innovations - 20th IFIP WG 12.5 International Conference, AIAI 2024, Proceedings

A2 - Maglogiannis, Ilias

A2 - Iliadis, Lazaros

A2 - Papaleonidas, Antonios

A2 - Macintyre, John

A2 - Avlonitis, Markos

PB - Springer Science and Business Media Deutschland GmbH

Y2 - 27 June 2024 through 30 June 2024

ER -

Gupta S, Alshehri J, Hai AA, Otudi H, Obradovic Z. From Tweets to Reddit: Leveraging Semi-supervised Domain Adaptation for Improving Data Filtering. In Maglogiannis I, Iliadis L, Papaleonidas A, Macintyre J, Avlonitis M, editors, Artificial Intelligence Applications and Innovations - 20th IFIP WG 12.5 International Conference, AIAI 2024, Proceedings. Springer Science and Business Media Deutschland GmbH. 2024. p. 290-304. (IFIP Advances in Information and Communication Technology). doi: 10.1007/978-3-031-63223-5_22