NeurIPS Conference Papers Classification Based on Topic Modeling
This paper illustrates the process of topic modeling and text classification. The dataset is a corpus of scientific papers published at the Conference on Neural Information Processing Systems (NeurIPS). Topic modeling is performed with a Latent Dirichlet Allocation (LDA) model, and the number of topics is then optimized on the basis of topic coherence, a quality measure that reflects human interpretability. The topic modeling results are used to label the data prior to text classification. Labels are determined from the distribution of the papers' assigned topics over time. Specifically, the peak changes used to differentiate between time periods dominated by specific topics are calculated using the Kullback-Leibler divergence. Finally, after the data are transformed into feature vectors, several text classification approaches are evaluated. The highest accuracy, 77.1%, is achieved by an extreme gradient boosting classifier.
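
As a rough illustration of the labeling step, the sketch below computes the Kullback-Leibler divergence between the topic distributions of consecutive years and picks the year with the largest change. The direction of the divergence, the smoothing constant `eps`, the function names, and the toy topic shares are all assumptions made for illustration; none of them are taken from the paper.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Return KL(p || q) in nats for two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Add a small constant and renormalize so zero entries never reach log(0).
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    p_sum, q_sum = sum(p), sum(q)
    return sum(
        (a / p_sum) * math.log((a / p_sum) / (b / q_sum))
        for a, b in zip(p, q)
    )


def year_over_year_divergence(topic_shares_by_year):
    """Map each year after the first to KL(current year || previous year)."""
    years = sorted(topic_shares_by_year)
    return {
        curr: kl_divergence(topic_shares_by_year[curr], topic_shares_by_year[prev])
        for prev, curr in zip(years, years[1:])
    }


# Hypothetical topic shares per year (made-up numbers, three topics).
toy_shares = {
    2000: [0.70, 0.20, 0.10],
    2001: [0.68, 0.22, 0.10],
    2002: [0.30, 0.60, 0.10],
    2003: [0.28, 0.62, 0.10],
}

divergences = year_over_year_divergence(toy_shares)
# The year with the largest divergence marks a candidate period boundary.
peak_year = max(divergences, key=divergences.get)
print(peak_year, round(divergences[peak_year], 3))
```

In this toy example the dominant topic switches between 2001 and 2002, so the divergence for 2002 is much larger than for the neighboring years. A real pipeline would take the yearly topic shares from the fitted topic model rather than from hand-written values.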