Can intuition beat the popular data science tools? Is SelectKBest, the popular feature selection method, wrong? Here is the story of a recent project. The story ends with a puzzle which I cannot solve. Help is welcome. Both data and code are available in GitHub [here]. The question has also been posted to the Stack Exchange/Cross Validated community [here].
import local_params as config
import pandas as pd, numpy as np, seaborn as sns
import os, sys
from sklearn.feature_selection import SelectKBest, chi2
import sklearn.feature_selection as skfs
df = pd.read_csv(config.SAMPLE_300)
The problem description¶
Here is our job.
The database SAMPLE_300 contains data on historical tasks performed by employees. The column ‘label’ indicates whether the task was completed successfully (1) or the result was unsatisfactory (2). Our goal is to understand which factors determine the success.
Categorical columns A through F represent various traits of the tasks, such as, perhaps, the technical category, the complexity, the execution time and so on. Two columns are particularly interesting: column A represents the type of education of the task executor (which school the employee completed), while column E represents the technical expertise of the executor (in which team of our company the employee was trained). For clarity, let’s rename these two columns to ‘education’ and ‘expertise’, and also rename the label, representing the task’s success, to ‘success’.
df = df.rename(columns = {'A': 'education',
'E': 'expertise',
'label': 'success'})
label = 'success'
df.head()
|   | education | B | C | D | expertise | F | success |
|---|---|---|---|---|---|---|---|
| 0 | 4 | 11 | 131 | 45 | 20 | 159 | 2 |
| 1 | 0 | 6 | 12 | 63 | 73 | 64 | 2 |
| 2 | 4 | 8 | 137 | 56 | 102 | 240 | 2 |
| 3 | 3 | 14 | 137 | 58 | 116 | 59 | 1 |
| 4 | 4 | 4 | 137 | 10 | 50 | 200 | 2 |
Note: for the purpose of this article I significantly reworked the original data. The sample contains just 300 records and 6 features, selected from a much larger set of features and records. The data is balanced and anonymized. The sample is representative, which means that it produces the same statistical results as the entire set. For the purpose of the article, the meanings of the columns have also been changed.
So the question is: what determines success? Is it the employee’s education? Or the employee’s expertise? Or something else?
Quick win: Chi2 correlation (with sklearn)¶
An obvious first step is to check whether there is a strong correlation between one of the columns and the column ‘success’. In statistical terms, we would be checking whether the variable ‘success’ is dependent on any of the features. One popular way to do this is the chi-square test of independence. Chi-square, also known as chi2 (its theory is out of scope of this article), is a popular statistical test that allows us to verify whether two variables are related.
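To make the idea concrete, here is a minimal, purely illustrative example (not part of the original analysis): the test applied to a small, made-up contingency table using scipy.stats.chi2_contingency. A small p-value indicates that the two variables are unlikely to be independent.

from scipy.stats import chi2_contingency

# hypothetical 2x2 table: rows = two schools, columns = success (1) / failure (2)
toy_table = [[90, 10],
             [30, 70]]
chi2_stat, p_value, dof, expected = chi2_contingency(toy_table)
print(round(chi2_stat, 2), p_value)   # large chi2 and tiny p-value: the variables look related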
So this is what I did. I used the function SelectKBest() from the sklearn package to perform the chi2 test. Under the hood it uses another method, sklearn.feature_selection.chi2(), to check which feature is the most important:
cat_feature_cols = list(set(df.columns) - set([label, 'id']))   # all feature columns
X, y = df[cat_feature_cols], df[label]
fs = SelectKBest(score_func=skfs.chi2, k = 1)                    # keep only the single best feature
selector = fs.fit(X, y)
filter = fs.get_support()                                        # boolean mask of the selected feature
best_feature = (np.array(cat_feature_cols))[filter]
best_feature[0]
'expertise'
This method revealed that success depends mainly on expertise. Let’s also check the ranking of the remaining features:
fs = SelectKBest(score_func=skfs.chi2, k = 'all')
selector = fs.fit(X, y)
kbest = pd.DataFrame({'feature': X.columns, 'score': fs.scores_})
kbest.sort_values(by = 'score', ascending = False).reset_index()
|   | index | feature | score |
|---|---|---|---|
| 0 | 0 | expertise | 1647.696011 |
| 1 | 5 | C | 232.577148 |
| 2 | 2 | D | 116.422861 |
| 3 | 3 | B | 69.178778 |
| 4 | 1 | F | 24.250652 |
| 5 | 4 | education | 1.412797 |
The result showed that the expertise of the executor is the most fundamental condition of success. It also showed that the education of the executor is practically insignificant. In other words, there is practically no correlation between the employee’s education and his/her effectiveness.
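As a side note (a minimal sketch, not part of the original analysis): the same scores, together with the corresponding p-values, can be obtained by calling the underlying scorer sklearn.feature_selection.chi2 directly, since SelectKBest merely wraps it:

# skfs.chi2 returns two arrays: the chi2 statistics and the p-values, one per feature
scores, pvalues = skfs.chi2(X, y)
pd.DataFrame({'feature': X.columns, 'chi2': scores, 'p': pvalues}).sort_values(by = 'chi2', ascending = False)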
I discussed the result with a domain expert, an experienced manager.
The domain expert disagreed with the last point. He said that, in his experience, education is quite important.
Manual verification¶
Puzzled by this response, I looked carefully at the column ‘education’ and manually built the contingency table, to see for myself whether there really is no correlation between education and success.
df['dummy'] = 1
df.pivot_table(values = 'dummy', columns = label, index = 'education', aggfunc = len).fillna(0)
| education \ success | 1 | 2 |
|---|---|---|
| 0 | 1.0 | 18.0 |
| 1 | 1.0 | 0.0 |
| 2 | 0.0 | 1.0 |
| 3 | 127.0 | 23.0 |
| 4 | 21.0 | 108.0 |
To my inexperienced eye, the correlation not only exists, but is quite strong. For instance, the employees who completed school 3 usually complete the task successfully (success code 1). In contrast, people coming from school 0 or school 4 tend to fail (success code 2).
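(As a side note, a minimal equivalent not used in the original notebook: the same contingency table can be produced directly with pd.crosstab, without the helper ‘dummy’ column.)

# cross-tabulation of education vs. success, equivalent to the pivot_table above
pd.crosstab(df['education'], df['success'])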
Chi2 computed manually (with scipy.stats)¶
How could such a strong correlation return low chi2 results?
Trying to understand this, I built my own piece of code, ranking the features with chi2 (scipy.stats.chi2_contingency). The utility used below is available for review in GitHub and does the following:
- prior to the chi2 test, it removes categories with too few values
- it uses the chi2 test directly from scipy.stats.chi2_contingency, rather than the version from sklearn
- it implements a ranking based on the p-value, the chi2 score and the critical value
I wasn’t certain about some minor intricacies (my concerns are marked in the code), so I will welcome a review. Nevertheless, I do not think those issues visibly influence the results.
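For readers who prefer not to open the repository, here is a minimal sketch of the core idea behind the utility (the function name, the column names and the rare-category threshold below are illustrative assumptions, not the exact implementation):

from scipy.stats import chi2_contingency

def chi2_score_sketch(df, features, target, min_count = 5):
    # for every feature, build its contingency table against the target
    # and run scipy's chi2 test of independence on it
    rows = []
    for feature in features:
        table = pd.crosstab(df[feature], df[target])
        # drop categories with too few observations before testing
        table = table[table.sum(axis = 1) >= min_count]
        if len(table) < 2:
            # not enough categories left for a meaningful test
            rows.append({'feature': feature, 'chi2': 0.0, 'p': 1.0})
            continue
        chi2_stat, p_value, dof, expected = chi2_contingency(table)
        rows.append({'feature': feature, 'chi2': chi2_stat, 'p': p_value})
    return pd.DataFrame(rows).set_index('feature').sort_values(by = 'p')

The actual utility additionally computes the critical value, the degrees of freedom and two rankings. Its results on our data are: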
import chi2_util
chi2_util.chi2_score(df,
features = cat_feature_cols,
target = label,
alpha = 0.05,
deep = True)
|   | chi2 | critical | dof | p | rank | reverse_rank |
|---|---|---|---|---|---|---|
| education | 127.497517 | 3.841459 | 1.0 | 1.445816e-29 | 33.189870 | 0.030130 |
| D | 32.384980 | 9.487729 | 4.0 | 1.595989e-06 | 3.413354 | 0.292967 |
| expertise | 6.153006 | 3.841459 | 1.0 | 1.311889e-02 | 1.601737 | 0.624322 |
| B | 10.313977 | 7.814728 | 3.0 | 1.607738e-02 | 1.319813 | 0.757683 |
| F | 0.000000 | NaN | 0.0 | 1.000000e+00 | NaN | NaN |
| C | 0.000000 | NaN | 0.0 | 1.000000e+00 | NaN | NaN |
In summary: this test confirms that education is the most important feature, just like the domain expert said. It is then even more surprising that our previous attempt (also using chi2) gave a different result.
Practical confirmation¶
Still confused why two chi2 tests return different results, I implemented a quick practical check of which implementation should be trusted more. To get practical evidence of which feature is most correlated with success, I built a naive machine learning classifier that predicts success based on one feature only. A minimal sketch of such a single-feature check is shown below, followed by the actual results for two popular classifiers: LogisticRegression and XGBoost.
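The helper accuracy_by_feature comes from the same utility as before; the following sketch shows the idea only, under the assumption of a simple stratified train/test split (the actual implementation may differ in details):

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def accuracy_by_feature_sketch(X, y, classifier, test_size = 0.3, random_state = 0):
    # train the classifier on one feature at a time and record its accuracy on a held-out test set
    rows = []
    for feature in X.columns:
        X_train, X_test, y_train, y_test = train_test_split(
            X[[feature]], y, test_size = test_size, random_state = random_state, stratify = y)
        classifier.fit(X_train, y_train)
        rows.append({'feature': feature,
                     'accuracy': accuracy_score(y_test, classifier.predict(X_test))})
    return pd.DataFrame(rows).sort_values(by = 'accuracy', ascending = False)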
from sklearn.linear_model import LogisticRegression
chi2_util.accuracy_by_feature(X, y, classifier = LogisticRegression(max_iter = 1000)).round(2)
|   | accuracy | feature |
|---|---|---|
| 0 | 0.81 | expertise |
| 4 | 0.80 | education |
| 3 | 0.66 | B |
| 2 | 0.61 | D |
| 5 | 0.56 | C |
| 1 | 0.46 | F |
from xgboost import XGBClassifier
chi2_util.accuracy_by_feature(X, y, classifier = XGBClassifier()).round(2)
|   | accuracy | feature |
|---|---|---|
| 4 | 0.86 | education |
| 0 | 0.84 | expertise |
| 2 | 0.69 | D |
| 3 | 0.65 | B |
| 5 | 0.59 | C |
| 1 | 0.44 | F |
Interpretation: two popular classifiers, XGBoost and LogisticRegression, both achieve high accuracy (around 80% or more) when trying to predict ‘success’ based on ‘education’ alone.
Conclusion¶
The chi2 implementation from sklearn, encapsulated in the SelectKBest() method, suggests that the employee’s education is uncorrelated with task success.
But the evidence I gathered suggests this is not true: the manual verification of the feature’s contingency table, the alternative implementation of chi2 (the scipy.stats package), and finally the experimental results of the two classifiers. I also used more tests, like Kendall’s Tau, and more classifiers not mentioned here, which roughly confirmed the conclusion. Finally, let’s not forget the opinion of my management colleague, who knows nothing about data science but was the first to point at the flaw in the original answer.
This leaves me with a puzzle. I see two possibilities.
Either I have made a methodological error in my reasoning. This is naturally quite possible. Or sklearn’s implementation of chi2 has a flaw. The latter would be surprising, as this implementation, wrapped by the SelectKBest() utility, is commonly used by thousands of data scientists worldwide and influences feature reduction. A flaw would potentially mean that numerous models in production today have been built with the wrong features and produce suboptimal results.
I will welcome hints from readers. Below are the versions of the libraries I used; all the code and data are available in GitHub at the link given at the beginning of this article.
pd.__version__
'1.2.3'
import sklearn
sklearn.__version__
'0.24.1'
import scipy
scipy.__version__
'1.6.2'
from platform import python_version
python_version()
'3.7.7'
Update 2021-11-04
I am still not convinced whether this is a bug or a fault in my reasoning. I submitted this issue as a Scikit-learn bug report [click here] #21455 and I hope the feedback there will either confirm or disprove the logic presented above.
I also opened a discussion at Cross Validated, hoping to get more community feedback: https://stats.stackexchange.com/questions/549536/does-chi-square-test-for-independence-sklearn-feature-selection-selectkbest-pr
Meanwhile I also noticed that similar issues were reported in the past. They may or may not refer to the same bug (or feature): https://stackoverflow.com/questions/50932433/scipy-and-sklearn-chi2-implementations-give-different-results
Update 2021-11-29: the bug has been confirmed, and I have published more details and an alternative implementation which can be used as a fix. [All described here.]
Thank you for your tremendous efforts