Is it hard to learn a classifier on this dataset?

Published in "8th ACM IKDD CODS and 26th COMAD."
Sudarsun Santhiappan, Nitin Shravan, Balaraman Ravindran

Identifying how hard it is to achieve good classification performance on a given dataset can be useful in data analysis, model selection, and meta-learning. We hypothesize that the clustering indices of a dataset, which capture its characteristics, are related to its classification complexity. In this work, we propose a method for determining the empirical classification complexity of a dataset based on its clustering indices. We model this mapping problem as a supervised classification task in which the estimated clustering indices of a dataset form the features and an indicator variable representing its classification complexity forms the label. For the experiments, we use a set of clustering and classification algorithms spanning different modeling assumptions. To test whether a given dataset is complex, we estimate its clustering indices and feed them to the trained complexity classifier, which outputs the prediction. Our approach is simple, yet effective and robust across many datasets and classifiers. We evaluate our method using 60 publicly available datasets.
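The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the choice of k-means, the three clustering indices, the logistic-regression baseline, the 0.8 accuracy threshold used to label a dataset as hard, and the synthetic datasets are all assumptions made for illustration, and the helper names (`clustering_indices`, `is_hard`, `make_dataset`) are hypothetical. It relies on scikit-learn and NumPy.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.model_selection import cross_val_score


def clustering_indices(X, n_clusters=2, seed=0):
    """Cluster X with k-means and return three internal clustering indices.

    Assumption: the abstract does not say which indices or clustering
    algorithms are used; these are common stand-ins.
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    return np.array([
        silhouette_score(X, labels),
        davies_bouldin_score(X, labels),
        calinski_harabasz_score(X, labels),
    ])


def is_hard(X, y, threshold=0.8):
    """Hypothetical labelling rule: 1 (hard) if the 5-fold cross-validated
    accuracy of a baseline classifier falls below `threshold`, else 0."""
    accuracy = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
    return int(accuracy < threshold)


def make_dataset(class_sep, seed):
    """Synthetic stand-in for one dataset; smaller class_sep means more overlap."""
    return make_classification(
        n_samples=200, n_features=5, n_informative=3, n_redundant=0,
        class_sep=class_sep, random_state=seed,
    )


# Build the meta-dataset: one row of clustering indices per dataset,
# with the empirical complexity indicator as the label.
meta_X, meta_y = [], []
for seed in range(20):
    class_sep = 0.2 if seed % 2 == 0 else 2.0
    X, y = make_dataset(class_sep, seed)
    meta_X.append(clustering_indices(X))
    meta_y.append(is_hard(X, y))
meta_X = np.vstack(meta_X)
meta_y = np.array(meta_y)

# Train the complexity classifier on the meta-dataset.
complexity_clf = RandomForestClassifier(n_estimators=100, random_state=0)
complexity_clf.fit(meta_X, meta_y)

# Query a new dataset: only its clustering indices are needed, not its labels.
X_new, _ = make_dataset(class_sep=0.3, seed=123)
prediction = complexity_clf.predict(clustering_indices(X_new).reshape(1, -1))[0]
print("predicted hard" if prediction == 1 else "predicted easy")
```

Note that the query step never touches the class labels of the new dataset: the complexity classifier sees only the clustering indices, which is what makes the approach usable before any classifier has been trained on that dataset.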