Research Paper Classification Optimization Using NLP and Instance-Based Learning
About
This project develops an efficient classifier for research papers using Natural Language Processing (NLP) and instance-based learning. It was built for the Mathematics 156 (Machine Learning) course at UCLA.
Background & Motivation
The arXiv is a free distribution service and an open-access archive for scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, and statistics. The arXiv is a valuable resource for researchers to share their work and stay up-to-date with the latest research in their field. While the current site provides various search and filtering options, including a map that visualizes the connections between papers based on citations, there is no built-in classification system that groups papers by topic.
This project aims to develop an efficient classification system for research papers based on their abstracts. The goal is a tool that lets researchers quickly identify existing papers related to their topic of interest, which helps reduce overlap and redundancy in research.
Methodology & Key Concepts
- Data Collection
  - The dataset used for this project is the arXiv dataset from Kaggle.
  - The dataset contains metadata for over 1.7 million papers from the arXiv.
  - The metadata includes the paper ID, title, abstract, authors, categories, and publication date.
  - In the context of this project, only the abstract and categories are used.
- Data Preprocessing
  - The abstracts are preprocessed using NLP techniques.
  - The abstracts are tokenized, lemmatized, and vectorized using the TF-IDF vectorizer.
  - The categories are one-hot encoded.
  - The dataset is split into training and testing sets.
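The preprocessing steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the abstracts and categories are made-up toy data, the helper names (`tokenize`, `fit_idf`, `tfidf_vector`, `one_hot`, `train_test_split`) are hypothetical, lemmatization is omitted, and the smoothed IDF formula is one common variant chosen for the example.

```python
import math
import random
import re
from collections import Counter

# Toy data, made up for illustration (not taken from the arXiv dataset).
ABSTRACTS = [
    "We study convergence of gradient descent for neural networks.",
    "A new bound on prime gaps is proved using sieve methods.",
    "Neural networks are trained with stochastic gradient methods.",
    "We prove a theorem on the distribution of prime numbers.",
]
CATEGORIES = ["cs.LG", "math.NT", "cs.LG", "math.NT"]


def tokenize(text):
    """Lowercase and keep alphabetic runs. Lemmatization is omitted in this sketch."""
    return re.findall(r"[a-z]+", text.lower())


def fit_idf(token_lists):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse, L2-normalized TF-IDF vector stored as a {term: weight} dict."""
    counts = Counter(t for t in tokens if t in idf)  # unseen terms are dropped
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def one_hot(labels):
    """Return the sorted class list and one indicator row per label."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    rows = [[1 if index[lab] == i else 0 for i in range(len(classes))] for lab in labels]
    return classes, rows


def train_test_split(n_items, test_fraction=0.25, seed=0):
    """Shuffle indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(round(n_items * test_fraction)))
    return indices[n_test:], indices[:n_test]


token_lists = [tokenize(a) for a in ABSTRACTS]
train_idx, test_idx = train_test_split(len(ABSTRACTS))
# Fit IDF weights on the training abstracts only, then vectorize everything.
idf = fit_idf([token_lists[i] for i in train_idx])
vectors = [tfidf_vector(tokens, idf) for tokens in token_lists]
classes, encoded = one_hot(CATEGORIES)
```

Fitting the IDF weights on the training split alone keeps test abstracts from influencing the vocabulary, which is the usual reason to split before vectorizing.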
Instance-Based Learning
We test several techniques for instance-based learning:
- After TF-IDF vectorization:
  - Truncated Singular Value Decomposition (SVD) -> Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
  - K-Means Clustering (K-Means++) -> Approximate Nearest Neighbors (ANNOY)
  - K-Nearest Neighbors (KNN)
  - Random Forest
- Directly using FastText embeddings
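To make the instance-based idea concrete, here is a minimal sketch of K-Nearest Neighbors classification over sparse TF-IDF-style vectors using cosine similarity. It is an exact brute-force search written for illustration, not the project's implementation; approximate methods such as ANNOY trade some accuracy for speed on the same task. The vectors, labels, and function names (`normalize`, `cosine_similarity`, `knn_predict`) are all hypothetical.

```python
import math
from collections import Counter


def normalize(vector):
    """Scale a sparse {term: weight} vector to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    return {t: w / norm for t, w in vector.items()} if norm else {}


def cosine_similarity(u, v):
    """Dot product of two unit-length sparse vectors (equals their cosine)."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def knn_predict(query, train_vectors, train_labels, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(
        range(len(train_vectors)),
        key=lambda i: cosine_similarity(query, train_vectors[i]),
        reverse=True,
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    # On a tied vote, the label encountered first among the neighbors wins.
    return votes.most_common(1)[0][0]


# Hand-made toy vectors standing in for TF-IDF output (illustrative only).
train_vectors = [
    normalize({"neural": 2.0, "gradient": 1.0}),
    normalize({"network": 1.0, "gradient": 2.0}),
    normalize({"prime": 2.0, "sieve": 1.0}),
    normalize({"prime": 1.0, "theorem": 1.0}),
]
train_labels = ["cs.LG", "cs.LG", "math.NT", "math.NT"]

query = normalize({"gradient": 1.0, "neural": 1.0, "theorem": 0.2})
predicted = knn_predict(query, train_vectors, train_labels, k=3)
```

Because nothing is fitted ahead of time, every prediction compares the query against all stored vectors, which is why a dataset of this size motivates clustering or approximate nearest-neighbor indexing before the search.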
Relevant Files
The following slides were used as part of the final project presentation.
The following report provides a detailed overview of the project.