Some commonly used correlation filtering methods have a tendency to drop more features than required. The problem is amplified as datasets grow larger and contain more pairwise correlations above a specified threshold. If we drop more variables than necessary, less information is available, potentially leading to suboptimal model performance. In this article, I will demonstrate the shortcomings of the current methods and propose a possible solution.
Let's look at an example of how current methods drop features that should have remained in the dataset. We will use the revised Boston Housing dataset and show examples in both R and Python.
R: The code below uses the findCorrelation() function from the caret package to determine which columns should be dropped.
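The original snippet is not embedded here, so below is a minimal sketch of what it plausibly looked like. It assumes the revised data is mlbench's BostonHousing2 and restricts it to the 13 numeric predictors the later snippets reference through num and mtx:

library(caret)
library(mlbench)

# Assumption: the revised Boston Housing data comes from mlbench
data(BostonHousing2)

# The 13 numeric predictors the rest of the walkthrough uses
cols <- c("lon", "lat", "crim", "zn", "indus", "nox", "rm",
          "age", "dis", "tax", "ptratio", "b", "lstat")
num <- BostonHousing2[, cols]

mtx <- cor(num)
drop <- findCorrelation(mtx, cutoff = 0.6, names = TRUE)
print(drop)
# the article reports: "indus" "nox" "lstat" "age" "dis"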
The function determined that [ 'indus', 'nox', 'lstat', 'age', 'dis' ] should be dropped based on the correlation cutoff of 0.6.
Python: Python doesn't have a built-in function like findCorrelation(), so I wrote a function called corrX_orig().
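corrX_orig() is shown in full later in the article. Here is a sketch of the call that produced this result, assuming the same 13-column Boston subset was exported from R to a csv (the file name here is hypothetical):

import pandas as pd

# Hypothetical export of the 13-column Boston subset created in R above
df = pd.read_csv('boston_filtered.csv', index_col = 0)

drop = corrX_orig(df, cut = 0.6)  # corrX_orig() is defined later in the article
print(drop)
# the article reports: ['indus', 'nox', 'lstat', 'age', 'dis']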
We get the same result as R: drop columns [ 'indus', 'nox', 'lstat', 'age', 'dis' ].
How do these functions work? A correlation matrix is created first. Its entries represent the pairwise correlations for all combinations of numeric variables.
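In pandas this is a one-liner, the same call used inside the functions shown later (df is the Boston frame from above):

corr_mtx = df.corr().abs()  # absolute pairwise correlations for all numeric columns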
Correlation Matrix for Boston Housing

Then, the mean correlation for each variable is calculated. This can be accomplished by taking the mean of every row or every column, since they are equivalent.
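Mirroring the function internals, the row-wise mean of the correlation matrix gives each variable's average correlation:

avg_corr = corr_mtx.mean(axis = 1)  # mean absolute correlation for each variable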
Mean Correlations for Columns and Rows

Next, the lower triangle of the matrix and the diagonal are masked. We don't need the lower triangle because the same information exists on either side of the diagonal (see matrix above). We don't need the diagonal because it represents correlations between variables and themselves (always 1).
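This is the masking step as it appears inside the functions below: np.triu() with k=1 builds a boolean mask that keeps only the strict upper triangle.

import numpy as np

# Everything outside the strict upper triangle becomes NaN
up = corr_mtx.where(np.triu(np.ones(corr_mtx.shape), k=1).astype(bool))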
Matrix with Lower Triangle and Diagonal Masked

Here is pseudocode to demonstrate how the rest of the function works. I hard-coded 0.6 as the correlation cutoff for this example:
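For every cell in the upper triangle:
    If cell.value > 0.6:
        If mean(row_correlation) > mean(column_correlation): drop(column)
        Else: drop(row)

(The same pseudocode appears as comments inside corrX_orig() later in the article.)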
Now to the part you’ve been waiting for. Why should the functions not drop age?
Below is a table that shows the variable states I captured from the original function. Remember, the functions told us to drop [ 'indus', 'nox', 'lstat', 'age', 'dis' ]. So we manually eliminate [ 'indus', 'nox', 'lstat', 'dis' ] from the table. As you can see in the table, there are no other variables left to compare against age to make a drop decision. Therefore age should not be dropped.
But why is this happening?
Because of the sequential nature of the R and Python functions, they are unable to consider the state of all the variables holistically. The decision to drop variables happens in order and is final: by the time age's correlated partners are themselves dropped, the decision to drop age has already been made and is never revisited.
How can we prove age belongs in the dataset?
We can remove age from the drop list, leaving [ 'indus', 'nox', 'dis', 'lstat' ], and then remove those four columns from the original dataset. When we rerun the correlation filter on this subset of variables, we would expect 'age' as the output if it really should be dropped. If we get no output, that means 'age' should have stayed in the set.
As you will see below, both functions provided no output. Age should have stayed.
R
# Remove age from drop set
drop_no_age = drop[-4]
print(drop_no_age)
# [1] "indus" "nox" "lstat" "dis"

# Rerun correlation analysis with age included
n = setdiff(colnames(num), drop_no_age)
print(n)
# [1] "lon" "lat" "crim" "zn" "rm" "age" "tax" "ptratio" "b"

mtx2 = mtx[n, n]
drop2 = findCorrelation(mtx2, cutoff = .6)
print(drop2)
# integer(0)
Python
drop.remove('age')
df = df.drop(drop, axis = 1)
print(df.columns)
# ['lon', 'lat', 'crim', 'zn', 'rm', 'age', 'tax', 'ptratio', 'b']

drop2 = corrX_orig(df, cut = 0.6)
print(drop2)
# []

Brief Recap
In this example, we have demonstrated that the commonly used correlation filter functions overstate the number of columns to drop. My assertion is that this is due to the sequential way each cell in the correlation matrix is evaluated and acted upon.
Original: The original solution drops the columns sequentially, immediately, and with finality.
import numpy as np
import pandas as pd

def corrX_orig(df, cut = 0.9):
    # Get correlation matrix and upper triangle
    corr_mtx = df.corr().abs()
    avg_corr = corr_mtx.mean(axis = 1)
    up = corr_mtx.where(np.triu(np.ones(corr_mtx.shape), k=1).astype(bool))

    drop = list()

    # The loop implements this pseudocode:
    # For every cell in the upper triangle:
    #     If cell.value > cut:
    #         If mean(row_correlation) > mean(column_correlation): drop(column)
    #         Else: drop(row)
    for row in range(len(up)-1):
        col_idx = row + 1
        for col in range(col_idx, len(up)):
            if corr_mtx.iloc[row, col] > cut:
                if avg_corr.iloc[row] > avg_corr.iloc[col]:
                    drop.append(row)
                else:
                    drop.append(col)

    drop_set = list(set(drop))
    dropcols_names = list(df.columns[drop_set])
    return dropcols_names

1. Revised: Captures the variable states, without dropping anything, into a dataframe res
def corrX_new(df, cut = 0.9):
    # Get correlation matrix and upper triangle
    corr_mtx = df.corr().abs()
    avg_corr = corr_mtx.mean(axis = 1)
    up = corr_mtx.where(np.triu(np.ones(corr_mtx.shape), k=1).astype(bool))

    dropcols = list()
    res = pd.DataFrame(columns = ['v1', 'v2', 'v1.target', 'v2.target', 'corr', 'drop'])

    for row in range(len(up)-1):
        col_idx = row + 1
        for col in range(col_idx, len(up)):
            if corr_mtx.iloc[row, col] > cut:
                # Make the same drop decision as the original, but only record it
                if avg_corr.iloc[row] > avg_corr.iloc[col]:
                    dropcols.append(row)
                    drop = corr_mtx.columns[row]
                else:
                    dropcols.append(col)
                    drop = corr_mtx.columns[col]

                # Log the state of this variable pair as a row of res
                s = pd.Series([corr_mtx.index[row], up.columns[col],
                               avg_corr.iloc[row], avg_corr.iloc[col],
                               up.iloc[row, col], drop],
                              index = res.columns)
                res.loc[len(res)] = s

    dropcols_names = calcDrop(res)
    return dropcols_names

2. Revised: Calculate which variables to drop using res
Below is the output of res, containing the variable states, along with the variable definitions:
v1, v2: The row and column variables being analyzed
v1.mean, v2.mean: The average correlation for v1 and v2, respectively
corr: The pairwise correlation between v1 and v2
drop: The initial drop decision: whichever of v1 and v2 has the higher mean correlation
Captured variable states (res)

Revised (2): Steps in the drop calculation
I would encourage the reader to manually walk through the steps below using the captured variable states (res) illustration above. I've also embedded the code for each step from the calcDrop() function. The entire function appears at the end of this section.
Step 1: all_corr_vars = All variables that exceeded the correlation cutoff of 0.6. Since our logic captures variables meeting this condition, this is the set of unique variables in columns v1 and v2 of the res table above.
all_corr_vars = list(set(res['v1'].tolist() + res['v2'].tolist()))

Result: ['tax', 'indus', 'lstat', 'rm', 'zn', 'age', 'nox', 'dis']
Step 2: poss_drop = Unique variables from the drop column. These may or may not be dropped in the end.
poss_drop = list(set(res['drop'].tolist()))

Result: ['indus', 'lstat', 'age', 'nox', 'dis']
Step 3: keep = Variables from v1 and v2 not in poss_drop. Essentially, any variable that isn't a candidate for dropping will be kept.
keep = list(set(all_corr_vars).difference(set(poss_drop)))

Result: ['zn', 'tax', 'rm']
Step 4: drop = Variables from v1 and v2 appearing in the same row as a keep variable. If we know which variables to keep, then any variable paired with one of those will be dropped.
p = res[ res['v1'].isin(keep) | res['v2'].isin(keep) ][['v1', 'v2']]
q = list(set(p['v1'].tolist() + p['v2'].tolist()))
drop = list(set(q).difference(set(keep)))

Result: ['lstat', 'nox', 'dis', 'indus']
Step 5: poss_drop = Remove drop variables from poss_drop. We are removing variables we know we are dropping from the list of possibles.
poss_drop = list(set(poss_drop).difference(set(drop)))

Result: ['age']. This is the only variable left among the possibles.
Step 6: Subset the dataframe to include only poss_drop variables in v1 and v2. We want to see if there is any reason to drop age.
m = res[ res['v1'].isin(poss_drop) | res['v2'].isin(poss_drop) ][['v1', 'v2', 'drop']]

Result of Step 6

Step 7: Remove the rows where drop variables appear in v1 or v2, and store the unique variables from the drop column in more_drop. Here we are removing rows we know contain variables we are already dropping. In this smaller example, we get an empty set, since every row contains a variable we know we are dropping. This is the correct result: age is not in this set.
more_drop = set(list(m[~m['v1'].isin(drop) & ~m['v2'].isin(drop)]['drop']))

Result: set()
Step 8: Add more_drop variables to drop and return drop
for item in more_drop:
    drop.append(item)

Result: ['lstat', 'nox', 'dis', 'indus']. After manually completing the steps on the res table, more_drop doesn't contain age, which is exactly what we expect.
Here is the entire calcDrop() function:
def calcDrop(res):
    # All variables with correlation > cutoff
    all_corr_vars = list(set(res['v1'].tolist() + res['v2'].tolist()))

    # All unique variables in the drop column
    poss_drop = list(set(res['drop'].tolist()))

    # Keep any variable not in the drop column
    keep = list(set(all_corr_vars).difference(set(poss_drop)))

    # Drop any variable appearing in the same row as a keep variable
    p = res[ res['v1'].isin(keep) | res['v2'].isin(keep) ][['v1', 'v2']]
    q = list(set(p['v1'].tolist() + p['v2'].tolist()))
    drop = list(set(q).difference(set(keep)))

    # Remove drop variables from the possible drops
    poss_drop = list(set(poss_drop).difference(set(drop)))

    # Subset res to include only the possible drop pairs
    m = res[ res['v1'].isin(poss_drop) | res['v2'].isin(poss_drop) ][['v1', 'v2', 'drop']]

    # Remove rows that are already decided (drop), take the set, and add to drop
    more_drop = set(list(m[~m['v1'].isin(drop) & ~m['v2'].isin(drop)]['drop']))
    for item in more_drop:
        drop.append(item)

    return drop

Brief Recap
In this example, we have demonstrated a revised pair of functions for filtering variables based on correlation. The functions work in the following way:
corrX_new: Log the variable states based on the original logic
calcDrop: Calculate which variables to drop
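As a quick usage sketch, assuming the same Boston frame df from earlier, the revised pipeline reproduces the drop list derived in the walkthrough, without age:

drop_new = corrX_new(df, cut = 0.6)
print(drop_new)
# based on the walkthrough above: ['lstat', 'nox', 'dis', 'indus'], with age kept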
Let's use the mdrr dataset from R's caret package, which contains many correlated features. We will use both the old and new functions in this section; this part will be less verbose since we've already covered the general testing routine.
R (original)
library(caret)
data(mdrr)

nzv <- nearZeroVar(mdrrDescr)
f <- mdrrDescr[, -nzv]

# Save the set for Python
write.csv(f, ".../mdrr_filtered.csv")

mtx3 = cor(f)
drop3 = findCorrelation(mtx3, cutoff = .9)
length(names(f)[drop3])
# [1] 203

findCorrelation() drops 203 columns.
Python (original)
import numpy as np
import pandas as pd

# Be sure to run the R code above first to generate this csv file
source2 = '.../mdrr_filtered.csv'
mdrr = pd.read_csv(source2, index_col = 0)

drop = corrX_orig(mdrr, cut = 0.9)
len(drop)
# 203

corrX_orig() drops 203 columns.
Python (revised)
drop_new = corrX_new(mdrr, cut = 0.9)
len(drop_new)
# Out[247]: 194

list(set(drop).difference(set(drop_new)))
# Out[268]: ['DDI', 'ZM1V', 'X2v', 'piPC05', 'VAR', 'SPAN', 'QYYe', 'GMTIV', 'X5sol']

The revised function identifies 9 columns that shouldn't have been dropped from the dataset. Let's confirm in R and Python.
R
library(dplyr)  # for %>% and select()

# These were the columns identified by the Python code that shouldn't be dropped
pydrops = c('DDI', 'ZM1V', 'X2v', 'piPC05', 'VAR', 'SPAN', 'QYYe', 'GMTIV', 'X5sol')
drops = names(f)[drop3]
findrop = setdiff(drops, pydrops)

print(f %>% select(-all_of(findrop)) %>% cor() %>% findCorrelation(cutoff = .9))
# numeric(0)

When the columns identified by Python are added back to the main set in R, no further drop columns are identified.
Python
mdrr2 = mdrr.drop(drop_new, axis = 1)
drop_orig = corrX_orig(mdrr2, cut = 0.9)
len(drop_orig)
# Out[273]: 0

The results in Python are identical. The columns ['DDI', 'ZM1V', 'X2v', 'piPC05', 'VAR', 'SPAN', 'QYYe', 'GMTIV', 'X5sol'] shouldn't have been dropped originally.
In this article, we have demonstrated how commonly used correlation filtering methods have a tendency to unnecessarily drop features, and we've shown how the problem is exacerbated as the data becomes larger. Although we haven't shown direct evidence, it's a fair assumption that unnecessary feature removal can have a negative effect on model performance.
We have also provided an effective solution with code, explanations, and examples. In a future article, we will extend this solution by adding target correlation to the filtering decision.
Feel free to reach out to me on LinkedIn.
Translated from: https://towardsdatascience.com/are-you-dropping-too-many-correlated-features-d1c96654abe6