Can intuition beat the popular data science tools? Is SelectKBest, the popular feature selection method, wrong? Here is the story of a recent project. It ends with a puzzle that I cannot solve, and help is welcome. Both the data and the code are available on GitHub [here]. The question has also been posted to the Stack Exchange Cross Validated community [here].

Update 2021-11-04

I am still not convinced whether this is a bug or a fault in my reasoning. I submitted this issue as a scikit-learn bug report [click here] #21455, and I hope the feedback there will either confirm or disprove the logic presented above.

I also opened a discussion at Cross Validated, hoping to get more community feedback: https://stats.stackexchange.com/questions/549536/does-chi-square-test-for-independence-sklearn-feature-selection-selectkbest-pr

Meanwhile, I also noticed that similar issues were reported in the past. They may or may not refer to the same bug (or feature): https://stackoverflow.com/questions/50932433/scipy-and-sklearn-chi2-implementations-give-different-results
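To make the kind of discrepancy in that Stack Overflow question concrete, here is a minimal pure-Python sketch of two ways a chi-square score can be computed for one binary feature. The function names and the toy data are hypothetical. `contingency_chi2` is the textbook Pearson test for independence on the full feature-by-class table, without continuity correction. `summed_value_chi2` treats the feature values as counts and sums them per class, which is my understanding of how the `chi2` scorer in scikit-learn forms its observed values. That reading is an assumption, and the sketch is not necessarily the issue reported in #21455.

```python
from collections import Counter


def contingency_chi2(feature, target):
    """Pearson chi-square statistic from the full feature-by-class table."""
    n = len(feature)
    cell = Counter(zip(feature, target))   # joint counts per (value, class)
    f_tot = Counter(feature)               # row totals
    t_tot = Counter(target)                # column totals
    stat = 0.0
    for f in f_tot:
        for t in t_tot:
            expected = f_tot[f] * t_tot[t] / n
            stat += (cell[(f, t)] - expected) ** 2 / expected
    return stat


def summed_value_chi2(feature, target):
    """Chi-square statistic where 'observed' is the sum of feature values
    per class (assumed to mirror the sklearn-style scorer)."""
    n = len(feature)
    total = sum(feature)
    t_tot = Counter(target)
    stat = 0.0
    for t, count in t_tot.items():
        observed = sum(f for f, y in zip(feature, target) if y == t)
        expected = total * count / n
        stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical toy data: 10 samples per class.
# Class 1 has six 1s and four 0s; class 0 has two 1s and eight 0s.
feature = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 8
target = [1] * 10 + [0] * 10
flipped = [1 - f for f in feature]  # same information, 0/1 labels swapped

print("contingency:", contingency_chi2(feature, target))
print("summed values:", summed_value_chi2(feature, target))
print("summed values, flipped encoding:", summed_value_chi2(flipped, target))
```

In this sketch the summed-value score only sees the rows where the feature equals 1, so it changes when the 0/1 labels are swapped, while the contingency-table statistic stays the same under that recoding.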

Update 2021-11-29: the bug has been confirmed, and I have published more details along with an alternative implementation that can be used as a fix. [All described here.]

Don’t trust Data Science. Ask the people
