Feature selection for the classification of large document collections

Janez Brank*, Dunja Mladenić, Marko Grobelnik, Nataša Milić-Frayling

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-review

7 Citations (Scopus)

Abstract

Feature selection methods are often applied in the context of document classification. They are particularly important for processing large data sets that may contain millions of documents and are typically represented by a large number of features, possibly tens of thousands. Processing large data sets thus raises the issue of computational resources, and we often have to find the right trade-off between the size of the feature set and the amount of training data that we can take into account. Furthermore, depending on the selected classification technique, different feature selection methods require different optimization approaches, raising the issue of compatibility between the two. We demonstrate an effective classifier training and feature selection method that is suitable for large data collections. We explore feature selection based on the weights obtained from linear classifiers themselves, trained on a subset of training documents. While most feature weighting schemes score individual features independently of each other, the weights of linear classifiers incorporate the relative importance of a feature for classification as observed for a given subset of documents, thus taking feature dependence into account. We investigate how these feature selection methods combine with various learning algorithms. Our experiments include a comparative analysis of three learning algorithms (Naïve Bayes, Perceptron, and Support Vector Machines (SVM)) in combination with three feature weighting methods: Odds ratio, Information Gain, and weights from the linear SVM and Perceptron. We show that by regulating the size of the feature space (and thus the sparsity of the resulting vector representation of the documents) using an effective feature scoring method, such as the linear SVM, we need only half or even a quarter of the computer memory to train a classifier of almost the same quality as the one obtained from the complete data set.
Feature selection using weights from linear SVMs yields better classification performance than the other feature weighting methods when combined with the three learning algorithms. The results support the conjecture that it is the sophistication of the feature weighting method, rather than its compatibility with the learning algorithm, that improves classification performance.
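
The idea of scoring features by the weights of a linear classifier trained on a subset of documents can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a plain Perceptron in pure Python, and the function names, the toy documents, the number of epochs, and the cut-off k are all assumptions made for illustration.

```python
def train_perceptron(docs, labels, n_features, epochs=10):
    """Train a basic Perceptron on sparse documents.

    docs: list of dicts {feature_index: value}; labels: +1 or -1.
    Hypothetical helper; documents are visited in a fixed order.
    """
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for doc, y in zip(docs, labels):
            score = b + sum(w[j] * v for j, v in doc.items())
            if y * score <= 0:  # misclassified (or on the boundary): update
                for j, v in doc.items():
                    w[j] += y * v
                b += y
    return w, b


def select_features(weights, k):
    """Keep the k features with the largest absolute weight."""
    ranked = sorted(range(len(weights)),
                    key=lambda j: abs(weights[j]), reverse=True)
    return set(ranked[:k])


def project(doc, kept):
    """Drop all features of a sparse document that were not selected."""
    return {j: v for j, v in doc.items() if j in kept}


# Toy training subset (assumed data): features 0 and 1 separate the
# classes, features 2 and 3 occur equally often in both classes.
docs = [
    {0: 1.0, 2: 1.0},
    {0: 1.0, 3: 1.0},
    {1: 1.0, 2: 1.0},
    {1: 1.0, 3: 1.0},
]
labels = [1, 1, -1, -1]

w, b = train_perceptron(docs, labels, n_features=4)
kept = select_features(w, k=2)
reduced = [project(d, kept) for d in docs]
print(kept, reduced)
```

The reduced documents are sparser than the originals, so a final classifier (of any of the three kinds compared in the abstract) could then be trained on more documents within the same memory budget. A linear SVM could be substituted for the Perceptron in the scoring step without changing the rest of the sketch.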

Original language: English
Pages (from-to): 1562-1596
Number of pages: 35
Journal: Journal of Universal Computer Science
Volume: 14
Issue number: 10
Publication status: Published - 2008
Externally published: Yes
