TY - GEN
T1 - From Tweets to Reddit
T2 - 20th IFIP WG 12.5 International Conference on Artificial Intelligence Applications and Innovations, AIAI 2024
AU - Gupta, Shelly
AU - Alshehri, Jumanah
AU - Hai, Ameen Abdel
AU - Otudi, Hussain
AU - Obradovic, Zoran
N1 - Publisher Copyright: © IFIP International Federation for Information Processing 2024.
PY - 2024
Y1 - 2024
AB - Reddit has emerged as a leading platform for microblogging data collection, providing valuable insights into patterns and knowledge discovery. However, the process of gathering and preparing this data presents significant challenges, particularly when it comes to ensuring its accuracy. Existing methodologies often yield an abundance of irrelevant posts, and datasets for relevance prediction on Reddit are rare. To overcome these obstacles, we propose a new semi-supervised framework that filters Reddit posts based on their topic relevance. Our approach combines annotated data from Twitter with weak labels generated from Wikipedia pages associated with relevant subreddits to automatically label Reddit posts. To enhance the model’s generalization performance, we utilize a domain adversarial adaptation network to bridge the distribution gap between Twitter and Reddit data. Our novel framework achieves an accuracy of 73% and an F1 score of 0.77, which is a significant improvement of 20% compared to baseline models. Additionally, we address important research questions regarding the effectiveness of automatic labeling, the use of weakly labeled data, the contextual requirements for training domain adaptation models, and the optimal weak labeling method.
KW - Domain Adaptation
KW - Reddit
KW - Relevance learning
KW - Semi-supervised
KW - Twitter
UR - https://www.scopus.com/pages/publications/85197336579
U2 - 10.1007/978-3-031-63223-5_22
DO - 10.1007/978-3-031-63223-5_22
M3 - Conference contribution
AN - SCOPUS:85197336579
SN - 9783031632228
T3 - IFIP Advances in Information and Communication Technology
SP - 290
EP - 304
BT - Artificial Intelligence Applications and Innovations - 20th IFIP WG 12.5 International Conference, AIAI 2024, Proceedings
A2 - Maglogiannis, Ilias
A2 - Iliadis, Lazaros
A2 - Papaleonidas, Antonios
A2 - Macintyre, John
A2 - Avlonitis, Markos
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 27 June 2024 through 30 June 2024
ER -