Research Paper Classification Optimization Using NLP and Instance-Based Learning

About

This project aims to develop an efficient classifier for research papers using Natural Language Processing (NLP) and instance-based learning. It was developed for the Mathematics 156 (Machine Learning) course at UCLA.

Background & Motivation

The arXiv is a free distribution service and an open-access archive for scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, and statistics. It is a valuable resource for researchers to share their work and stay up to date with the latest research in their fields. While the current site provides various search and filtering options, including a map that visualizes the connections between papers based on citations, there is no built-in classification system that groups papers by topic.

This project aims to develop an efficient classification system for research papers based on their abstracts. The goal is to provide a tool that lets researchers quickly identify existing papers related to their topic of interest, helping to prevent overlap and redundancy in research.

Methodology & Key Concepts

Data Collection
- The dataset used for this project is the arXiv dataset from Kaggle.
- The dataset contains metadata for over 1.7 million papers from the arXiv.
- The metadata includes the paper ID, title, abstract, authors, categories, and publication date.
  - In the context of this project, only the abstract and categories are used.
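The selection of abstracts and categories described above can be sketched as follows. This is a minimal illustration, not the project's actual loader: it assumes the metadata is stored as one JSON object per line, with an `abstract` string and a space-separated `categories` string. The function name `iter_abstracts` and the sample records are hypothetical.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def iter_abstracts(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (abstract, categories) pairs from JSON-lines metadata records."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Collapse the line breaks and repeated spaces found in raw abstracts.
        abstract = " ".join(record["abstract"].split())
        # Assumed format: categories is a space-separated string, e.g. "math.CO cs.DM".
        categories = record["categories"].split()
        yield abstract, categories


# Two made-up records standing in for lines of the real metadata file.
sample_lines = [
    '{"id": "0001", "abstract": "We study  sparse\\ngraphs.", "categories": "math.CO cs.DM"}',
    '{"id": "0002", "abstract": "A model of dark matter halos.", "categories": "astro-ph"}',
]
pairs = list(iter_abstracts(sample_lines))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, so a file with over a million records is streamed rather than loaded into memory at once.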
Data Preprocessing
- The abstracts are preprocessed using NLP techniques.
- The abstracts are tokenized, lemmatized, and vectorized using the TF-IDF vectorizer.
- The categories are one-hot encoded.
- The dataset is split into training and testing sets.
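The preprocessing steps above can be sketched in plain Python as shown below. This is an illustrative sketch under stated assumptions, not the project's pipeline: lemmatization is omitted (it would need an NLP library), the TF-IDF weighting uses a smoothed IDF with L2 normalization, each paper is assumed to carry a single label, and all function names and sample data are hypothetical.

```python
import math
import random
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphabetic tokens (lemmatization omitted)."""
    return re.findall(r"[a-z]+", text.lower())


def fit_idf(token_lists: List[List[str]]) -> Dict[str, float]:
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse TF-IDF vector scaled to unit L2 norm; unseen terms are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def one_hot(label: str, classes: List[str]) -> List[int]:
    """Encode a single label as a 0/1 vector over the class list."""
    return [1 if label == c else 0 for c in classes]


# Made-up abstracts and categories, for illustration only.
abstracts = [
    "Graph coloring bounds for sparse graphs",
    "Dark matter halos in galaxy clusters",
    "Convergence of gradient descent for neural networks",
    "Spectral graph theory and random walks",
]
labels = ["math.CO", "astro-ph", "cs.LG", "math.CO"]

# Shuffle indices with a fixed seed, then hold out 25% for testing.
indices = list(range(len(abstracts)))
random.Random(0).shuffle(indices)
n_test = max(1, len(indices) // 4)
test_idx, train_idx = indices[:n_test], indices[n_test:]

# Fit IDF weights on the training split only, so no test vocabulary leaks in.
train_tokens = [tokenize(abstracts[i]) for i in train_idx]
idf = fit_idf(train_tokens)
X_train = [tfidf_vector(tokens, idf) for tokens in train_tokens]
X_test = [tfidf_vector(tokenize(abstracts[i]), idf) for i in test_idx]

classes = sorted(set(labels))
Y_train = [one_hot(labels[i], classes) for i in train_idx]
```

Storing each vector as a term-to-weight dictionary keeps the representation sparse, which matters because an abstract uses only a small fraction of the vocabulary.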
Instance-Based Learning
We test several techniques for instance-based learning:
- After TF-IDF vectorization:
  - Truncated Singular Value Decomposition (SVD) -> Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
  - K-Means Clustering (K-Means++) -> Approximate Nearest Neighbors (ANNOY)
  - K-Nearest Neighbors (KNN)
  - Random Forest
- Directly using FastText embeddings
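Of the techniques listed above, K-Nearest Neighbors is the most direct example of instance-based learning: the training vectors themselves are the model, and a new abstract is labeled by a vote among its closest stored neighbors. The sketch below is an exact brute-force version over sparse TF-IDF vectors, assuming unit-normalized vectors and cosine similarity. It does not reproduce the project's implementation, and the approximate-neighbor, clustering, and FastText variants are not shown. The names and toy vectors are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

SparseVec = Dict[str, float]


def cosine(a: SparseVec, b: SparseVec) -> float:
    """Dot product of two sparse vectors; equals cosine similarity for unit vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def knn_predict(query: SparseVec, train: List[Tuple[SparseVec, str]], k: int = 3) -> str:
    """Return the majority label among the k most similar training vectors."""
    ranked = sorted(train, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy unit-norm vectors standing in for TF-IDF representations of abstracts.
train = [
    ({"graph": 0.8, "coloring": 0.6}, "math.CO"),
    ({"graph": 0.6, "spectral": 0.8}, "math.CO"),
    ({"galaxy": 0.6, "halo": 0.8}, "astro-ph"),
    ({"dark": 0.8, "matter": 0.6}, "astro-ph"),
    ({"neural": 0.6, "gradient": 0.8}, "cs.LG"),
]
query = {"graph": 1.0}
predicted = knn_predict(query, train, k=3)
```

Each query here is compared against every stored vector, so the cost grows linearly with the training set. That cost is the usual motivation for approximate nearest-neighbor indexes or for clustering the data first on a corpus of this size.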

Relevant Files

The following slides were used as part of the final project presentation.

Download PDF

The following report provides a detailed overview of the project.

Download PDF

Junwon Choi
