As described last week, Scikit-learn's chi-square feature selection is not safely usable until bug #21455 is addressed. The problem concerns sklearn.feature_selection.chi2 and the methods built on it, including SelectKBest, when they are used with categorical features other than binary ones.
The nature of the problem
As described in last week's article, the current implementation of Scikit-learn feature selection returns faulty results when chi2 (chi-square) is used as the ranking mechanism for categorical features. The chi2 statistic is computed incorrectly because the code was originally intended for binary features only.
I originally wrote that the problem only concerned sklearn.feature_selection.SelectKBest. In fact, it also impacts any model that uses sklearn.feature_selection.chi2 directly.
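To see why the binary-only assumption matters, consider the minimal sketch below (synthetic, made-up values). sklearn.feature_selection.chi2 treats feature values as counts, so merely relabeling the categories of an ordinal-encoded feature changes the statistic, which a proper test of independence would never do:

```python
import numpy as np
from sklearn.feature_selection import chi2

# One categorical feature with three levels, encoded as ordinal codes.
codes = np.array([0, 0, 1, 2, 2, 2, 1, 0, 2, 1])
# The same feature after a meaningless relabeling of the categories (0 <-> 2).
relabeled = np.array([2, 2, 1, 0, 0, 0, 1, 2, 0, 1])
y = np.array([0, 1, 0, 1, 1, 0, 0, 1, 1, 0])

# sklearn's chi2 sums the feature values per class, so the two encodings
# of the very same feature yield different statistics and p-values.
print(chi2(codes.reshape(-1, 1), y))
print(chi2(relabeled.reshape(-1, 1), y))
```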
Also, when calculating the chi-square statistic, at least 80% of the contingency table (RC table) cells must have an expected count at or above 5. To comply with this, one must aggregate some rows before proceeding with the calculation. The current implementation of SelectKBest, which was never meant for categorical features, does not enforce this condition. A notebook available here demonstrates how this changes the results.
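For reference, here is a minimal sketch of that check on a toy feature/target pair (made-up data), using scipy.stats.chi2_contingency to obtain the expected counts:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy categorical feature and binary target (made-up values).
feature = pd.Series(list("aabcccbacb"))
target = pd.Series([0, 1, 0, 1, 1, 0, 0, 1, 1, 0])

observed = pd.crosstab(feature, target)     # contingency (RC) table
expected = chi2_contingency(observed)[3]    # expected counts under independence

# Rule of thumb: at least 80% of cells need an expected count of 5 or more.
share = (expected >= 5).mean()
print(f"cells with expected count >= 5: {share:.0%}")
```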
The implications
Even though the issue may seem abstract, I believe its business implications are substantial. Many machine-learning models in production today could be wrong (inaccurate).
A model will only deliver good predictions if it receives good data as input. What does this mean? The input data must contain a causal relationship to the prediction target.
Hence the goal of feature selection is to remove the noise from the data and keep only the best data (the best features), so that the predictions are of high quality. And what happens if this step is messed up? Good data (good features) will be mistakenly treated as noise and thrown away, while poor data will be used to train the model. Garbage in, garbage out: the model will respond with inferior results (lower-than-possible accuracy, precision, recall, confusion matrix).
The good news
The good news is that the damage (if any) has already happened: if your model delivered bad results in the past due to suboptimal feature selection, it can only get better once the problem is fixed.
How can you tell whether your model was affected? Compare your current feature ranking with the alternative implementation of chi-square, called chi2_score, available in this GitHub folder: the source code and the example usage. If you see a difference in the results, it may signal that you have missed out on features that could actually improve your model.
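If chi2_score is not at hand, a quick way to spot a discrepancy is to compare sklearn's ranking against a per-feature scipy.stats.chi2_contingency ranking. The sketch below uses synthetic ordinal-encoded features (made-up data, no row aggregation), so it only illustrates the comparison, not the full corrected procedure:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.feature_selection import chi2

# Synthetic stand-in: three ordinal-encoded categorical features, binary target.
rng = np.random.default_rng(42)
X = rng.integers(0, 4, size=(300, 3))
y = rng.integers(0, 2, size=300)

sk_stat, _ = chi2(X, y)  # current sklearn implementation (values as counts)
ct_stat = np.array([chi2_contingency(pd.crosstab(X[:, i], y))[0]
                    for i in range(X.shape[1])])  # values as categories

print("sklearn chi2 ranking:     ", np.argsort(sk_stat)[::-1])
print("contingency-table ranking:", np.argsort(ct_stat)[::-1])
```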
The “correct” way of ranking features with chi-square
The alternative implementation of the chi2 feature ranking now enforces the aforementioned condition (expected count >= 5). The word “correct” in the title is in quotation marks, as any improvements or criticism are welcome. In the calculation I used the following steps (a sketch of the whole procedure follows the list):
- build the contingency table (RC table) for each feature separately
- in each case, calculate the expected counts
- eliminate cells with an expected count below 5 by aggregating rows
- only then calculate the chi2 statistic and p-value (I use scipy.stats.chi2_contingency)
- calculate the critical value (I use scipy.stats.chi2.ppf)
- rank the features
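Putting the steps together, here is a minimal sketch of such a ranking. The function name chi2_rank and the merging strategy (fold the smallest row into the next-smallest) are my illustrative choices here; for the actual implementation, see the linked source code:

```python
import pandas as pd
from scipy.stats import chi2_contingency, chi2 as chi2_dist

def chi2_rank(df, features, target, alpha=0.05):
    """Sketch of the steps above; see the linked GitHub folder for the real thing."""
    results = []
    for col in features:
        observed = pd.crosstab(df[col], df[target])   # step 1: RC table
        while len(observed) > 2:
            expected = chi2_contingency(observed)[3]  # step 2: expected counts
            if (expected >= 5).all():
                break
            # step 3: fold the smallest row into the next-smallest one
            rows = observed.sum(axis=1).sort_values().index
            observed.loc[rows[1]] += observed.loc[rows[0]]
            observed = observed.drop(rows[0])
        stat, pval, dof, _ = chi2_contingency(observed)  # step 4: statistic, p-value
        critical = chi2_dist.ppf(1 - alpha, dof)         # step 5: critical value
        results.append((col, stat, pval, critical))
    return sorted(results, key=lambda r: r[2])           # step 6: rank by p-value

# Hypothetical usage on a toy frame:
df = pd.DataFrame({"f1": list("aabbccddee") * 10,
                   "f2": list("xyxyxyxyxy") * 10,
                   "y":  [0, 1] * 50})
for name, stat, pval, crit in chi2_rank(df, ["f1", "f2"], "y"):
    print(f"{name}: chi2={stat:.2f} p={pval:.3f} critical={crit:.2f}")
```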
I hope this method serves as a useful replacement until the improvement in Scikit-learn is implemented.