With Machine Learning in Python, you may do feature selection with SelectKBest. As was just confirmed, this method sometimes returns faulty results when it is used with the chi2 scoring function. This potentially impacts the accuracy of numerous ML models worldwide. The details and the way out are below.
Machine Learning models predict some result based on source data features (columns). In a typical situation, there are too many features for an algorithm to handle well. Therefore feature reduction (feature selection, feature elimination) is used to detect and drop the insignificant features and keep the most powerful ones. The model can then be more accurate, because it is based on the features that are most strongly associated with the target.
To distinguish between significant and insignificant features, the SelectKBest method from the scikit-learn library is commonly used. If the features are categorical (which is the typical case), it lets you pass the parameter score_func=skfs.chi2 (with skfs as an alias for sklearn.feature_selection), so that the chi-square statistical test ranks the features. Under the hood, this calls sklearn.feature_selection.chi2. Last month I noticed that the ranking was incorrect. I wrote an article about it and filed a bug report. The details can be found here:
- the article: Don’t trust Data Science. Ask the People
- the Cross Validated thread: Does chi-square test for independence produce incorrect results?
- scikit-learn bug report #21455
The bug in SelectKBest
As was confirmed yesterday (see the bug #21455 thread), the results of SelectKBest are indeed incorrect. As Guillaume Lemaitre, a scikit-learn contributor, explained, the following happens. The method sklearn.feature_selection.chi2 is based on older core code, originally meant for binary features. At one step of the algorithm, the feature values are summed to build the statistics. So, for instance, if the binary value 1 occurs 100 times, the algorithm correctly adds up all the ones and arrives at 100.
The problem is that the same algorithm cannot be used for categorical features. For instance, if a categorical feature has codes 1, 2 and 3, then adding up the 100 occurrences of the code 3 will produce the number 300 rather than the count 100.
This step of the algorithm affects the end results, so they cannot be trusted unless the features are binary. But most features are not binary.
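The summing step can be reproduced on a toy example. The following is a minimal sketch, not taken from the bug report: the data, the variable names and the helper contingency_table are made up for illustration. It gives the same categorical feature two different integer codings and compares sklearn.feature_selection.chi2 with scipy.stats.chi2_contingency.

```python
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.feature_selection import chi2


def contingency_table(feature, target):
    """Hypothetical helper: count each (feature code, target class) pair."""
    _, f_idx = np.unique(feature, return_inverse=True)
    _, t_idx = np.unique(target, return_inverse=True)
    table = np.zeros((f_idx.max() + 1, t_idx.max() + 1), dtype=int)
    np.add.at(table, (f_idx, t_idx), 1)
    return table


# Toy data (invented): a binary target and one three-level feature.
y = np.array([0] * 6 + [1] * 6)
codes_a = np.array([1, 1, 1, 2, 2, 3, 1, 2, 2, 3, 3, 3])

# Same categories, different arbitrary integer labels.
recode = {1: 3, 2: 1, 3: 2}
codes_b = np.array([recode[int(c)] for c in codes_a])

# sklearn's chi2 sums the feature values per class, so the labels matter.
sk_a = chi2(codes_a.reshape(-1, 1), y)[0][0]
sk_b = chi2(codes_b.reshape(-1, 1), y)[0][0]

# The contingency-table test only looks at counts, so the labels do not matter.
sp_a = chi2_contingency(contingency_table(codes_a, y))[0]
sp_b = chi2_contingency(contingency_table(codes_b, y))[0]

print("sklearn chi2:     ", sk_a, "vs", sk_b)
print("chi2_contingency: ", sp_a, "vs", sp_b)
```

If the two sklearn values differ while the two contingency-table values agree, the score depends on the arbitrary coding of the categories, which is the symptom described above.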
All Machine Learning code that works with categorical features and uses either SelectKBest(score_func=chi2) or sklearn.feature_selection.chi2 directly for feature selection is potentially impacted. Note: if you use SelectKBest() without score_func=chi2, you will not be impacted, provided that you use an alternative scoring function appropriate for your features. This bug makes almost all existing usages of SelectKBest with chi2 on categorical features technically invalid. Examples of articles that are now incorrect:
- DataTechNotes: SelectKBest Feature Selection Example in Python
- Feature selection using SelectKBest | Kaggle
- sklearn.feature_selection.SelectKBest Example (programtalk.com)
- Using the Chi-Squared test for feature selection with implementation | by Dr. Saptarsi Goswami | Towards Data Science
- Chi-Squared For Feature Selection (chrisalbon.com)
How many ML models, using Scikit-learn, will be significantly impacted? Not all.
The nature of the broken internal calculation is such that the ranking of the features deviates only in certain situations, where features have a “nasty” distribution of values. So it may not affect the feature selection in your model at all, and even if it does, your model may still work well with some features missing. What’s more, some automatic hyperparameter tuning procedures, such as GridSearchCV, may help to reduce the impact.
I can, however, see situations where, due to the bad selection, the key feature is hidden from the otherwise best performing model. In such a case, your current model may perform significantly worse than it could, until you apply the fix described below.
To summarize: in practice, some production models will be impacted significantly, but the majority will not. The problem is that it is difficult to guess whether your model is impacted or not. The good news is that the harm, if any, has been done already, and the fix (below) can only improve the performance of your model.
The way out
As the way out, you have the following options:
- wait for a future release of scikit-learn containing the fix for bug #21455
- continue to use SelectKBest, but without chi2: change the parameter score_func to something else. Be careful though, because not all scoring functions work with all kinds of features
- use a lower-level chi2 implementation directly, such as scipy.stats.chi2_contingency (note that sklearn.feature_selection.chi2 is the function affected by the bug)
- use the utility chi2_score I wrote, as described below
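The option of keeping SelectKBest but changing score_func could look like the sketch below. It assumes that mutual information is an acceptable criterion for your data; mutual_info_classif is chosen here only for illustration, and the toy data is invented. With discrete_features=True the score is computed from value counts, so recoding the categories should not change it.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Toy data (invented): one uninformative and one informative feature.
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
noise = np.array([1, 2, 1, 2, 1, 2, 1, 2])
informative = np.array([1, 1, 2, 2, 3, 3, 3, 3])
X = np.column_stack([noise, informative])

# Treat every column as categorical instead of continuous.
score_func = partial(mutual_info_classif, discrete_features=True)

selector = SelectKBest(score_func=score_func, k=1).fit(X, y)
print("scores:  ", selector.scores_)
print("selected:", selector.get_support())

# Relabel the categories of the informative feature and score again.
recode = {1: 2, 2: 3, 3: 1}
X_recoded = np.column_stack([noise, [recode[int(c)] for c in informative]])
print("same scores after recoding:",
      np.allclose(score_func(X, y), score_func(X_recoded, y)))
```

As noted in the list, not every scoring function suits every kind of feature, so check the assumptions of whichever one you pick.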
For my purposes, I wrote a utility, chi2_score, which can be used as a temporary alternative to SelectKBest. It ranks the features according to the chi2 statistic and p-value, and finally finds the top K best features. It is a wrapper over scipy.stats.chi2_contingency. The utility is available on GitHub here, and here is the usage example. In addition, here is the original article demonstrating the difference between chi2_util and SelectKBest, and here is the notebook doing the same. The notebooks are not perfect; you will find notes in the source code, and improvements are welcome.
Update 2021-11-29: The impact of this bug is actually broader. Here is a follow-up article.
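To illustrate the general idea of ranking features with scipy.stats.chi2_contingency, here is a minimal sketch. It is not the chi2_score code: the function names rank_features_by_chi2 and contingency_table, the toy data, and the choice to sort by p-value are all assumptions made for this example.

```python
import numpy as np
from scipy.stats import chi2_contingency


def contingency_table(feature, target):
    """Hypothetical helper: count each (feature code, target class) pair."""
    _, f_idx = np.unique(feature, return_inverse=True)
    _, t_idx = np.unique(target, return_inverse=True)
    table = np.zeros((f_idx.max() + 1, t_idx.max() + 1), dtype=int)
    np.add.at(table, (f_idx, t_idx), 1)
    return table


def rank_features_by_chi2(X, y, names):
    """Return (name, statistic, p_value, dof) tuples, smallest p-value first."""
    rows = []
    for j, name in enumerate(names):
        stat, p_value, dof, _ = chi2_contingency(contingency_table(X[:, j], y))
        rows.append((name, stat, p_value, dof))
    # Sort by p-value, because features with different numbers of
    # categories have different degrees of freedom.
    return sorted(rows, key=lambda row: row[2])


# Toy data (invented): three categorical features, binary target.
y = np.array([0] * 10 + [1] * 10)
noise = np.array([1, 2] * 10)
weak = np.array([1] * 6 + [2] * 4 + [1] * 4 + [2] * 6)
strong = np.array([1] * 8 + [2] * 2 + [2] * 2 + [3] * 8)
X = np.column_stack([noise, weak, strong])

ranking = rank_features_by_chi2(X, y, ["noise", "weak", "strong"])
top_k = [name for name, *_ in ranking[:2]]  # keep the K = 2 best features
for row in ranking:
    print(row)
```

Note that chi2_contingency applies a continuity correction to 2x2 tables by default, so statistics for two-level features are slightly lower than the uncorrected chi-square value.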
Credit for this work goes primarily to Sopra Steria, where I currently work; my work there led to the discovery. The Sopra Steria people who made this work possible, through funding, management support, or organizational and technical support, are: Fabien Colin, Joachim von Ekensteen, Mohammed Sijelmassi, Marzena Rybicka-Szudera, Roman Dróżdż, Nagarajan Muthunagagowda Lakshmanan, Wiktor Flis, and Katarzyna Kopeć.
Special thanks to Marcello Anselmi Tamburini, who made the original observation that our results might be incorrect, which led to the discovery of the bug. Without his remark, nothing would have happened.
Separately, Guillaume Lemaitre, the scikit-learn contributor from the Inria Foundation, was kind enough to analyse the case I provided very thoroughly and even ran his code over my data, and in the end identified the low-level issue.