A framework for analyzing the relationship between size and complexity of data sets

Authors

DOI:

https://doi.org/10.5335/rbca.v13i2.10898

Keywords:

Bagging, Boosting, Complexity Measures, Dataset Size

Abstract

In the Pattern Recognition field, a classification problem is considered complex when samples of different classes are highly similar. Consequently, a variety of complexity descriptors have been proposed in the literature, given the importance of complexity as a promising indicator of attainable accuracy. However, the sensitivity of these descriptors to variations in the size of the training set is not known. The goal of this work is to analyze this behavior. To that end, a variety of descriptors were estimated on 20,800 subsets created from: i) 26 classification problems, ii) 2 subset generators, and iii) 4 sizes. The results show that the descriptors are indeed sensitive to size, with the effect being less noticeable in F1, F2, L2, N4, L3, T1, D2, and D3. The measures F3, F4, N1, N2, and N3 are more influenced by variations in the number of instances in the set.
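As a rough illustration of the kind of experiment described above, the sketch below subsamples a classification problem at several sizes and computes one classical complexity descriptor, the maximum Fisher's discriminant ratio (F1), for each size. The dataset, the subset sizes, the number of repetitions, and the plain two-class F1 formulation are assumptions made for illustration only; they are not taken from the paper.

```python
# Illustrative sketch (not the authors' code): estimate how one complexity
# descriptor, the maximum Fisher's discriminant ratio (F1), varies with
# training-set size. Dataset, sizes, and F1 formulation are assumptions.
import numpy as np
from sklearn.datasets import make_classification

rng = np.random.default_rng(42)


def fisher_f1(X, y):
    """Maximum Fisher's discriminant ratio over features (two-class case)."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12  # avoid division by zero
    return float(np.max(num / den))


# Stand-in for one of the paper's 26 classification problems.
X, y = make_classification(n_samples=5000, n_features=10, n_informative=5,
                           n_classes=2, random_state=42)

# Four hypothetical subset sizes; the paper's exact sizes are not listed here.
for size in (100, 200, 400, 800):
    scores = []
    for _ in range(30):  # average over random subsets of this size
        idx = rng.choice(len(X), size=size, replace=False)
        scores.append(fisher_f1(X[idx], y[idx]))
    print(f"n={size:4d}  F1 = {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

Comparing the mean and spread of a descriptor across sizes, as printed above, is one simple way to quantify the size sensitivity that the study investigates.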


Published

2021-05-18

Issue

Vol. 13 No. 2 (2021)

Section

Original Paper

How to Cite

[1] 2021. A framework for analyzing the relationship between size and complexity of data sets. Brazilian Journal of Applied Computing. 13, 2 (May 2021), 1–15. DOI: https://doi.org/10.5335/rbca.v13i2.10898.