Continuous Variable Binning Algorithm to Maximize Information Value Using Genetic Algorithm

Authors

Asst. Prof. Dr. Pramote Kuacharoen (ปราโมทย์ กั่วเจริญ), Mr. Nattawut Vejkanchana (ณัฐวุฒิ เวชกาญจนา)

Published

Communications in Computer and Information Science

Abstract

Binning (also called bucketing or discretization) is a commonly used data pre-processing technique for continuous predictor variables in machine learning. Guidelines for good binning can be treated as constraints, while certain statistics should be optimized. We therefore view binning as a constrained optimization problem. This paper presents GAbin, a novel supervised binning algorithm for binary classification problems that uses a genetic algorithm, and demonstrates its use on a well-known dataset. The algorithm is inspired by the way humans bin continuous variables. To bin a variable, we first choose an output shape (e.g., monotonic, or best bins in the middle). Second, we define constraints (e.g., a minimum number of samples in each bin). Finally, we try to maximize key statistics that assess the quality of the output bins. GAbin automates these steps, and its results follow the user-desired shape and satisfy the constraints. The experimental results show that GAbin is competitive with other binning algorithms. Moreover, GAbin maximizes information value while satisfying user-desired constraints such as monotonicity or output shape controls.
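
To make the objective in the abstract concrete, the following is a minimal Python sketch of how a candidate set of cut points might be scored. It is not the GAbin implementation. It assumes the standard definitions of weight of evidence, WoE_i = ln(p_good_i / p_bad_i), and information value, IV = sum over bins of (p_good_i - p_bad_i) * WoE_i. The function names, the `min_samples` parameter, the toy data, and the choice to give infeasible candidates a fitness of zero are all illustrative assumptions. A genetic algorithm could use a function like `fitness` to rank candidate cut-point sets.

```python
import math
from bisect import bisect_right


def bin_counts(values, labels, cuts):
    """Count [non-events, events] per bin; bin i holds cuts[i-1] <= x < cuts[i]."""
    counts = [[0, 0] for _ in range(len(cuts) + 1)]
    for x, y in zip(values, labels):
        counts[bisect_right(cuts, x)][y] += 1
    return counts


def woe_and_iv(counts):
    """Return per-bin weight of evidence and the total information value."""
    total_good = sum(good for good, _ in counts)
    total_bad = sum(bad for _, bad in counts)
    woes, iv = [], 0.0
    for good, bad in counts:
        p_good = good / total_good
        p_bad = bad / total_bad
        woe = math.log(p_good / p_bad)
        woes.append(woe)
        iv += (p_good - p_bad) * woe
    return woes, iv


def is_monotonic(seq):
    """True if seq is entirely non-decreasing or entirely non-increasing."""
    pairs = list(zip(seq, seq[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


def fitness(values, labels, cuts, min_samples=2):
    """IV of the binning, or 0.0 if a constraint is violated (assumed penalty)."""
    counts = bin_counts(values, labels, sorted(cuts))
    # Infeasible: a bin is too small, or lacks one class (WoE undefined).
    if any(g + b < min_samples or g == 0 or b == 0 for g, b in counts):
        return 0.0
    woes, iv = woe_and_iv(counts)
    # Shape constraint: here, WoE must be monotonic across bins.
    return iv if is_monotonic(woes) else 0.0


# Toy data (hypothetical): labels are 0 = non-event, 1 = event.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
labels = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1]

print(fitness(values, labels, [4.5, 8.5]))  # feasible, monotonic WoE
print(fitness(values, labels, [1.5, 8.5]))  # first bin has only one sample
```

Returning zero for infeasible candidates is the simplest way to keep them out of the surviving population; a graded penalty proportional to the size of the violation is a common alternative.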

(2019). Continuous Variable Binning Algorithm to Maximize Information Value Using Genetic Algorithm. Communications in Computer and Information Science, 2019(1), 158-172.