Solving categorical feature selection with Kydavra ChiSquaredSelector

    Technology, 2025-02-18

    As we said in previous articles about the Kydavra library, feature selection is a very important part of machine learning model development. Unfortunately, there is no single way to get the ideal model, mostly because data almost always comes in different forms, and different forms call for different approaches. In this article, I would like to share a way to select categorical features using the Kydavra ChiSquaredSelector, created by Sigmoid.

    Using ChiSquaredSelector from the Kydavra library

    As always, for those who are here mostly for the solution to their problem, here are the commands and the code:

    To install kydavra, just run the following command in the terminal:

    pip install kydavra

    Now you can import the selector and apply it to your data set as follows:

    from kydavra import ChiSquaredSelector

    selector = ChiSquaredSelector()
    # df is your pandas DataFrame; 'target' is the name of its label column
    new_columns = selector.select(df, 'target')

    To test it, let's apply it to the Heart Disease UCI dataset with a small change. Instead of keeping all features, we will drop the numerical columns, so our new dataset will consist only of the following features:

    sex, cp, fbs, restecg, exang, slope, ca and thal
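
    As a rough sketch of that preparation step, assuming the dataset is already loaded into a pandas DataFrame with the column names listed above and a label column called target (as in the earlier snippet), keeping only the categorical columns could look like this. The helper keep_categorical and the two toy rows are made up for illustration; they are not part of kydavra or of the real dataset.

```python
import pandas as pd

# Column names taken from the list above; "target" is assumed to be the label column.
CATEGORICAL_COLUMNS = ["sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal"]


def keep_categorical(df: pd.DataFrame, target: str = "target") -> pd.DataFrame:
    """Return a copy holding only the categorical features and the target."""
    return df[CATEGORICAL_COLUMNS + [target]].copy()


# Two made-up rows, only to show the call; real use would load the dataset file.
toy = pd.DataFrame({
    "age": [63, 37], "sex": [1, 1], "cp": [3, 2], "trestbps": [145, 130],
    "fbs": [1, 0], "restecg": [0, 1], "exang": [0, 0], "slope": [0, 0],
    "ca": [0, 0], "thal": [1, 2], "target": [1, 1],
})

categorical_df = keep_categorical(toy)
print(list(categorical_df.columns))
```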

    The algorithm I chose is SVC, and before feature selection its cross_val_score was:

    0.6691582491582491

    But after applying ChiSquaredSelector, the cross_val_score became:

    0.8452525252525251

    The selector kept the following features: sex, cp, exang, slope, ca, thal.
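
    The scores above come from the heart disease data. The sketch below only illustrates how such a before/after comparison can be set up, assuming scikit-learn's SVC with default parameters and 5-fold cross-validation (both are assumptions, since the exact settings are not given here). The data is synthetic, so the numbers it prints have nothing to do with the scores above.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def mean_cv_score(X, y, folds=5):
    """Mean cross-validated accuracy of an SVC with default settings."""
    return cross_val_score(SVC(), X, y, cv=folds).mean()


# Synthetic stand-in data: 100 rows, 3 integer-coded features.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(100, 3))
y = (X[:, 0] > 0).astype(int)  # the label depends only on the first column

score_all = mean_cv_score(X, y)               # all features
score_selected = mean_cv_score(X[:, [0]], y)  # only the informative feature
print(score_all, score_selected)
```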

    So how does it work?

    Like the other selectors, ChiSquaredSelector was inspired by statistics, in this case by the chi-squared (Chi2) test. Like p-values, the Chi2 test is used to decide whether the null hypothesis can be rejected. As a reminder:

    The null hypothesis is a general statement that there is no relationship between two measured phenomena (or, in our case, features).

    So, to find out whether features are related, we need to see if we can reject the null hypothesis. Technically speaking, ChiSquaredSelector takes the p-values obtained when the Chi2 statistics are calculated. To recap:
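
    To make this concrete, here is a small sketch of a chi-squared test of independence between one feature and the target, using scipy.stats.chi2_contingency on a contingency table. This is a generic illustration, not necessarily how ChiSquaredSelector computes its values internally, and the 20 rows of data are made up.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up toy data: "exang" tracks the target closely, "fbs" does not.
df = pd.DataFrame({
    "exang": [1] * 10 + [0] * 10,
    "fbs": [1, 0] * 10,
    "target": [0] * 8 + [1] * 2 + [1] * 8 + [0] * 2,
})


def independence_p_value(data: pd.DataFrame, feature: str, target: str = "target") -> float:
    """P-value of the chi-squared independence test between a feature and the target."""
    table = pd.crosstab(data[feature], data[target])  # contingency table of counts
    _, p_value, _, _ = chi2_contingency(table)
    return float(p_value)


p_exang = independence_p_value(df, "exang")
p_fbs = independence_p_value(df, "fbs")
print(p_exang, p_fbs)
```

    On this toy data, exang gets a p-value below 0.05, so the null hypothesis of no relationship can be rejected for it, while fbs does not.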

    The p-value is the probability, under a given statistical model and assuming the null hypothesis is true, of obtaining a result at least as extreme as the one actually observed.
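
    For the chi-squared test, this tail probability can be computed directly from the test statistic and the degrees of freedom. A minimal sketch, assuming SciPy is available (the helper name p_value_from_statistic is made up for this example):

```python
from scipy.stats import chi2


def p_value_from_statistic(statistic: float, dof: int) -> float:
    """Probability of a chi-squared value at least this large if the null hypothesis holds."""
    return float(chi2.sf(statistic, dof))  # sf is the upper-tail probability, 1 - cdf


p_small_stat = p_value_from_statistic(1.0, dof=1)
p_large_stat = p_value_from_statistic(5.0, dof=1)
p_critical = p_value_from_statistic(3.841, dof=1)
print(p_small_stat, p_large_stat, p_critical)
```

    The larger the statistic, the smaller the p-value; with one degree of freedom, a statistic of about 3.841 corresponds to a p-value of about 0.05.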

    So, given a significance level (a parameter of ChiSquaredSelector), the selector iteratively eliminates the features with the highest p-values.
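
    The elimination loop can be sketched roughly as follows. This is not kydavra's source code: it uses scikit-learn's chi2 function as a stand-in for the statistic, the default significance level of 0.05 is an assumption, and the stopping rule (stop once every remaining p-value is at or below the level) is my reading of the description. The function name select_by_chi2 and the toy data are made up.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2


def select_by_chi2(df, target, significance_level=0.05):
    """Drop the feature with the highest p-value until all remaining ones pass the level."""
    features = [column for column in df.columns if column != target]
    while features:
        _, p_values = chi2(df[features], df[target])
        worst = int(np.argmax(p_values))
        if p_values[worst] <= significance_level:
            break  # every remaining feature is significant
        features.pop(worst)  # eliminate the least significant feature
    return features


# Made-up toy data: "exang" is related to the target, "fbs" is not.
toy = pd.DataFrame({
    "exang": [1] * 20 + [0] * 20,
    "fbs": [1, 0] * 20,
    "target": [0] * 17 + [1] * 3 + [1] * 17 + [0] * 3,
})

selected = select_by_chi2(toy, "target")
print(selected)
```

    With scikit-learn's chi2, each feature's p-value does not depend on the other features, so the loop ends up with the same result as a simple threshold; it is written as a loop only to mirror the description above.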

    Bonus!

    If you are interested in why the selector chose some features and left out others, you can always plot the selection process. ChiSquaredSelector has two plotting functions, one for Chi2 values and another for p-values:

    selector.plot_chi2()

    and for p-values:

    selector.plot_p_value()

    Each function has the following parameters:

    title: the title of the plot.

    save: a boolean value; True saves the plot and False does not. By default, it is set to False.

    file_path: the file path to the newly created plot.
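
    To show how these three parameters fit together, here is a standalone matplotlib sketch of a p-value plot. It is not kydavra's plotting code: the function plot_p_values, the bar-chart layout, the significance line and the p-values themselves are all made up for illustration; only the parameter names title, save and file_path come from the list above.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt


def plot_p_values(names, p_values, significance_level=0.05,
                  title="P-values per feature", save=False, file_path=None):
    """Bar chart of p-values with a dashed line at the significance level."""
    fig, ax = plt.subplots()
    ax.bar(names, p_values)
    ax.axhline(significance_level, color="red", linestyle="--")
    ax.set_title(title)
    ax.set_ylabel("p-value")
    if save and file_path:
        fig.savefig(file_path)  # written only when save is True
    return fig


# Made-up p-values, only to exercise the function.
out_path = os.path.join(tempfile.gettempdir(), "toy_p_values.png")
fig = plot_p_values(["sex", "fbs"], [0.01, 0.60],
                    title="Toy p-values", save=True, file_path=out_path)
plt.close(fig)
```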

    If you want to dig deeper into notions such as the null hypothesis, the Chi2 test and p-values, or into how this feature selection works, you will find a list of links at the end of the article.

    If you have tried kydavra, I invite you to leave some feedback and share your experience with it by responding to this form.

    Made with ❤ by Sigmoid.

    Useful links:

    https://en.wikipedia.org/wiki/Null_hypothesis

    https://zh.wikipedia.org/wiki/空假设

    https://en.wikipedia.org/wiki/P-value

    https://zh.wikipedia.org/wiki/P-value

    https://en.wikipedia.org/wiki/Chi-squared_test

    https://zh.wikipedia.org/wiki/卡方检验

    Translated from: https://towardsdatascience.com/solving-categorical-feature-selection-with-kydavra-chisquaredselector-1aa19aa1fe4d
