Developing efficient classification for Reddit posts/comments/communities with Graph Neural Networks (GNNs)

Background

Reddit is a social media platform where users post, comment, and vote on content across a wide range of topics. It is divided into topic-based subcommunities (subreddits), which makes it a rich source of data for studying social networks and community detection.

A person’s affiliation with a community can be predicted from features such as demographics, language use, and clickstream data. Traditional classification models use these features directly; an alternative is to predict affiliation from the graph structure of the entire platform.

This project aims to efficiently discover subcommunities, and track how they change over time, using only the graph structure of the Reddit platform. The graph is constructed solely from interactions between users and posts (votes, comments, etc.), without any other categorical features. The project also performs node classification on new posts and comments.
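As a rough illustration of an interaction-only graph, the sketch below links two posts whenever at least one user interacted with both. This linking rule is an assumption made for the example, not necessarily the construction used in this project. The function name `build_post_graph` and the toy interaction log are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_post_graph(interactions):
    """Return sorted undirected edges between posts that share at least one user."""
    # Group the posts each user has interacted with.
    posts_by_user = defaultdict(set)
    for user, post in interactions:
        posts_by_user[user].add(post)

    # Connect every pair of posts touched by the same user.
    edges = set()
    for posts in posts_by_user.values():
        for a, b in combinations(sorted(posts), 2):
            edges.add((a, b))
    return sorted(edges)


# Hypothetical toy interaction log: (user_id, post_id) pairs.
interactions = [
    ("u1", "p1"), ("u1", "p2"),
    ("u2", "p2"), ("u2", "p3"),
    ("u3", "p4"),
]
edges = build_post_graph(interactions)
print(edges)  # [('p1', 'p2'), ('p2', 'p3')]
```

Note that no post text, user attribute, or subreddit label enters the graph; only co-interaction does. Pairwise linking grows quadratically in the number of posts per user, so very active users would likely need to be capped or sampled in practice.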

Objectives

  • Collect community data and set up environment
    • Pull in community data using PyTorch Geometric (Reddit dataset)
    • Connect to remote workstation through SSH (sponsored by AI Safety at UCLA)
  • Test and validate various community detection models
    • GraphSAGE
    • Graph Attention Networks
    • FlashAttention
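To make the first of these models concrete, here is a minimal pure-Python sketch of one GraphSAGE-style layer with a mean aggregator. Each node combines its own features with the average of its neighbors' features and then applies a ReLU. The function `sage_mean_layer`, the toy graph, and the identity weight matrices are all assumptions for illustration. A real model learns its weights, usually samples neighbors, and would typically use a library layer such as PyTorch Geometric's `SAGEConv` rather than hand-written loops.

```python
def sage_mean_layer(features, neighbors, w_self, w_neigh):
    """One layer: ReLU(W_self * h_v + W_neigh * mean(h_u for u in N(v)))."""

    def matvec(w, x):
        # Multiply a matrix (list of rows) by a vector.
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

    out = {}
    for node, h in features.items():
        nbrs = neighbors.get(node, [])
        if nbrs:
            # Element-wise mean of the neighbors' feature vectors.
            mean = [sum(features[u][k] for u in nbrs) / len(nbrs)
                    for k in range(len(h))]
        else:
            # Isolated node: no neighbor signal to aggregate.
            mean = [0.0] * len(h)
        combined = [a + b for a, b in zip(matvec(w_self, h), matvec(w_neigh, mean))]
        out[node] = [max(0.0, c) for c in combined]  # ReLU
    return out


# Hypothetical toy graph: three posts with 2-dimensional features.
features = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [1.0, 1.0]}
neighbors = {"p1": ["p2"], "p2": ["p1", "p3"], "p3": ["p2"]}
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned weights

hidden = sage_mean_layer(features, neighbors, identity, identity)
print(hidden["p2"])  # [1.0, 1.5]
```

Stacking two such layers lets each post's representation depend on its two-hop neighborhood. Because the layer relies only on local neighborhoods, it can also be applied to posts that were not present during training.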

Relevant Files

  • Code for this project is available on GitHub.

The following slides were presented during the quarterly DataRes Demo Day at UCLA.

Download PDF