List the highest correlation pairs from a large correlation matrix in pandas?

How do you find the top correlations in a correlation matrix with pandas? There are many answers on how to do this with R (show the correlations as an ordered list rather than a large matrix, or efficient ways to get highly correlated pairs from a large data set in Python or R), but I am wondering how to do it with pandas. In my case the matrix is 4460x4460, so it can't be done visually.


You can use DataFrame.values to get a NumPy array of the data and then use NumPy functions such as argsort() to get the most correlated pairs.
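
For instance, a minimal sketch of that NumPy route (assuming df is your DataFrame; the top-10 slice is arbitrary) could look like this:

import numpy as np

c = df.corr().abs().values                           # correlation matrix as a plain NumPy array
np.fill_diagonal(c, 0)                               # ignore self correlations
order = c.ravel().argsort()[::-1]                    # flat positions, highest correlation first
rows, cols = np.unravel_index(order[:10], c.shape)   # top 10 entries (each pair appears twice)
for r, col in zip(rows, cols):
    print(df.columns[r], df.columns[col], c[r, col])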

But if you want to do it in pandas, you can unstack and sort the DataFrame:

import pandas as pd
import numpy as np

shape = (50, 4460)

# random data with a deliberately strong correlation between columns 1000 and 2000
data = np.random.normal(size=shape)
data[:, 1000] += data[:, 2000]

df = pd.DataFrame(data)

c = df.corr().abs()

s = c.unstack()
so = s.sort_values(kind="quicksort")

print(so[-4470:-4460])

Here is the output:

2192  1522    0.636198
1522  2192    0.636198
3677  2027    0.641817
2027  3677    0.641817
242   130     0.646760
130   242     0.646760
1171  2733    0.670048
2733  1171    0.670048
1000  2000    0.742340
2000  1000    0.742340
dtype: float64

@HYRY's answer is perfect. Just building on that answer by adding a bit more logic to avoid duplicate and self correlations, plus proper sorting:

import pandas as pd

d = {'x1': [1, 4, 4, 5, 6],
     'x2': [0, 0, 8, 2, 4],
     'x3': [2, 8, 8, 10, 12],
     'x4': [-1, -4, -4, -4, -5]}
df = pd.DataFrame(data=d)
print("Data Frame")
print(df)
print()

print("Correlation Matrix")
print(df.corr())
print()

def get_redundant_pairs(df):
    '''Get diagonal and lower triangular pairs of correlation matrix'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_top_abs_correlations(df, n=5):
    au_corr = df.corr().abs().unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]

print("Top Absolute Correlations")
print(get_top_abs_correlations(df, 3))

The output is:

Data Frame
   x1  x2  x3  x4
0   1   0   2  -1
1   4   0   8  -4
2   4   8   8  -4
3   5   2  10  -4
4   6   4  12  -5

Correlation Matrix
          x1        x2        x3        x4
x1  1.000000  0.399298  1.000000 -0.969248
x2  0.399298  1.000000  0.399298 -0.472866
x3  1.000000  0.399298  1.000000 -0.969248
x4 -0.969248 -0.472866 -0.969248  1.000000

Top Absolute Correlations
x1  x3    1.000000
x3  x4    0.969248
x1  x4    0.969248
dtype: float64

A few-line solution without redundant pairs of variables:

import numpy as np

corr_matrix = df.corr().abs()

# the matrix is symmetric, so we only need the upper triangle without the diagonal (k=1)
sol = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
                  .stack()
                  .sort_values(ascending=False))

# the first element of the sol Series is the pair with the highest correlation

Then you can iterate through the names of the variable pairs (a pandas MultiIndex on the Series) and their values like this:

for index, value in sol.items():
    # do some stuff with each (variable_1, variable_2) pair and its correlation
    ...
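
For instance, a small usage sketch (assuming the sol Series built above) that prints the five strongest pairs:

for (var_a, var_b), value in sol.head(5).items():
    print(f"{var_a} and {var_b}: {value:.3f}")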

Use itertools.combinations to get all unique correlations from pandas' own correlation matrix .corr(), generate a list of lists, and feed it back into a DataFrame so you can use .sort_values. Set ascending=True to display the lowest correlations on top.

corrank takes a DataFrame as its argument because it needs .corr().

import itertools

def corrank(X: pd.DataFrame):
    df = pd.DataFrame([[(i, j), X.corr().loc[i, j]]
                       for i, j in list(itertools.combinations(X.corr(), 2))],
                      columns=['pairs', 'corr'])
    print(df.sort_values(by='corr', ascending=False))

corrank(X)  # prints a descending list of correlation pairs (max on top)

Use the code below to see the correlations in descending order.

# See the correlations in descending order
corr = df.corr()  # df is the pandas dataframe
c1 = corr.abs().unstack()
c1.sort_values(ascending=False)

Combining some features of @HYRY's and @arun's answers, you can print the top correlations for the dataframe df on a single line using:

df.corr().unstack().sort_values().drop_duplicates()

Note: one downside is that if you have correlations of 1.0 that are not a variable with itself, the drop_duplicates() addition will remove them.
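
If that is a concern, one workaround (a sketch, not part of the original one-liner) is to mask the upper triangle first, as several of the answers below do, instead of relying on drop_duplicates():

import numpy as np

c = df.corr()
mask = np.triu(np.ones(c.shape, dtype=bool), k=1)   # keep each pair exactly once, no diagonal
c.where(mask).stack().sort_values()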

Lots of good answers here. The easiest way I found was a combination of some of the answers above:

corr = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
corr = corr.unstack().transpose() \
           .sort_values(ascending=False) \
           .dropna()

I didn't want to unstack or over-complicate this issue, since I just wanted to drop some highly correlated features as part of a feature selection phase.

So I ended up with the following simplified solution:

# map features to their absolute correlation values
corr = features.corr().abs()

# set equality (self correlation) to zero
corr[corr == 1] = 0

# for each feature, find the max correlation
# and sort the resulting array in descending order
corr_cols = corr.max().sort_values(ascending=False)

# display the highly correlated features
display(corr_cols[corr_cols > 0.8])

In this case, if you want to drop the correlated features, you can map through the filtered corr_cols array and remove the odd-indexed (or even-indexed) ones.
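
A minimal sketch of that idea (assuming the features DataFrame and corr_cols from above; keeping every other entry is only a rough heuristic, not a guarantee that exactly one feature of each correlated pair survives):

to_drop = corr_cols[corr_cols > 0.8].index[::2]   # every other highly correlated feature
features_reduced = features.drop(columns=to_drop)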

I was trying some of the solutions here, but then I actually came up with my own one. I hope this might be useful for the next person, so I share it here:

def sort_correlation_matrix(correlation_matrix):
    cor = correlation_matrix.abs()
    top_col = cor[cor.columns[0]][1:]
    top_col = top_col.sort_values(ascending=False)
    ordered_columns = [cor.columns[0]] + top_col.index.tolist()
    return correlation_matrix[ordered_columns].reindex(ordered_columns)

This is improved code from @MiFi. It orders by absolute value but does not exclude the negative values.

def top_correlation(df, n):
    corr_matrix = df.corr()
    correlation = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
                              .stack()
                              .sort_values(ascending=False))
    correlation = pd.DataFrame(correlation).reset_index()
    correlation.columns = ["Variable_1", "Variable_2", "Correlacion"]
    correlation = correlation.reindex(
        correlation.Correlacion.abs().sort_values(ascending=False).index
    ).reset_index().drop(["index"], axis=1)
    return correlation.head(n)

top_correlation(ANYDATA, 10)

You can also do it graphically, by substituting your own data into this simple code.

import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr()

kot = corr[corr >= .9]
plt.figure(figsize=(12, 8))
sns.heatmap(kot, cmap="Greens")


The following function should do the trick. This implementation

  • removes self correlations
  • removes duplicates
  • enables the selection of the top N most highly correlated feature pairs

It is also configurable, so that you can keep both the self correlations and the duplicates. You can also report as many feature pairs as you wish.


def get_feature_correlation(df, top_n=None, corr_method='spearman',
                            remove_duplicates=True, remove_self_correlations=True):
    """
    Compute the feature correlation and sort feature pairs based on their correlation

    :param df: The dataframe with the predictor variables
    :type df: pandas.core.frame.DataFrame
    :param top_n: Top N feature pairs to be reported (if None, all of the pairs will be returned)
    :param corr_method: Correlation computation method
    :type corr_method: str
    :param remove_duplicates: Indicates whether duplicate features must be removed
    :type remove_duplicates: bool
    :param remove_self_correlations: Indicates whether self correlations will be removed
    :type remove_self_correlations: bool

    :return: pandas.core.frame.DataFrame
    """
    corr_matrix_abs = df.corr(method=corr_method).abs()
    corr_matrix_abs_us = corr_matrix_abs.unstack()
    sorted_correlated_features = corr_matrix_abs_us \
        .sort_values(kind="quicksort", ascending=False) \
        .reset_index()

    # Remove comparisons of the same feature
    if remove_self_correlations:
        sorted_correlated_features = sorted_correlated_features[
            (sorted_correlated_features.level_0 != sorted_correlated_features.level_1)
        ]

    # Remove duplicates
    if remove_duplicates:
        sorted_correlated_features = sorted_correlated_features.iloc[:-2:2]

    # Create meaningful names for the columns
    sorted_correlated_features.columns = ['Feature 1', 'Feature 2', 'Correlation (abs)']

    if top_n:
        return sorted_correlated_features[:top_n]

    return sorted_correlated_features
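
For instance, a usage sketch (assuming a predictors DataFrame df):

top_pairs = get_feature_correlation(df, top_n=10, corr_method='pearson')
print(top_pairs)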


I liked Addison Klinke's post the most, as being the simplest, but used Wojciech Moszczyński's suggestion for filtering and charting, and extended the filter to avoid absolute values. So, given a large correlation matrix: filter it, chart it, and then flatten it.

Create, filter, and chart

import matplotlib.pyplot as plt
import seaborn as sn

dfCorr = df.corr()
filteredDf = dfCorr[((dfCorr >= .5) | (dfCorr <= -.5)) & (dfCorr != 1.000)]
plt.figure(figsize=(30, 10))
sn.heatmap(filteredDf, annot=True, cmap="Reds")
plt.show()

filtered heat map

Function

In the end, I made a small function to create the correlation matrix, filter it, and then flatten it. As an idea, it could easily be extended, e.g. with asymmetric upper and lower bounds, etc.

def corrFilter(x: pd.DataFrame, bound: float):
    xCorr = x.corr()
    xFiltered = xCorr[((xCorr >= bound) | (xCorr <= -bound)) & (xCorr != 1.000)]
    xFlattened = xFiltered.unstack().sort_values().drop_duplicates()
    return xFlattened

corrFilter(df, .7)


Follow-up

Eventually, I refined the functions:

# Returns correlation matrix
def corrFilter(x: pd.DataFrame, bound: float):
    xCorr = x.corr()
    xFiltered = xCorr[((xCorr >= bound) | (xCorr <= -bound)) & (xCorr != 1.000)]
    return xFiltered

# Flattens correlation matrix with bounds
def corrFilterFlattened(x: pd.DataFrame, bound: float):
    xFiltered = corrFilter(x, bound)
    xFlattened = xFiltered.unstack().sort_values().drop_duplicates()
    return xFlattened

# Returns correlation for a variable from flattened correlation matrix
def filterForLabels(df: pd.DataFrame, label):
    try:
        sideLeft = df[label,]
    except:
        sideLeft = pd.DataFrame()

    try:
        sideRight = df[:, label]
    except:
        sideRight = pd.DataFrame()

    if sideLeft.empty and sideRight.empty:
        return pd.DataFrame()
    elif sideLeft.empty:
        concat = sideRight.to_frame()
        concat.rename(columns={0: 'Corr'}, inplace=True)
        return concat
    elif sideRight.empty:
        concat = sideLeft.to_frame()
        concat.rename(columns={0: 'Corr'}, inplace=True)
        return concat
    else:
        concat = pd.concat([sideLeft, sideRight], axis=1)
        concat["Corr"] = concat[0].fillna(0) + concat[1].fillna(0)
        concat.drop(columns=[0, 1], inplace=True)
        return concat

Combining most of the answers above into a short snippet:

def top_entries(df):
    mat = df.corr().abs()

    # Remove duplicate and identity entries
    mat.loc[:, :] = np.tril(mat.values, k=-1)
    mat = mat[mat > 0]

    # Unstack, sort descending, and reset the index, so features are in columns
    # instead of indexes (allowing e.g. a pretty print in Jupyter).
    # Also rename the columns for good measure.
    return (mat.unstack()
               .sort_values(ascending=False)
               .reset_index()
               .rename(columns={
                   "level_0": "feature_a",
                   "level_1": "feature_b",
                   0: "correlation"
               }))

The simpler the better:

from collections import defaultdict

res = defaultdict(dict)
corr = returns.corr().replace(1, -1)  # mask self correlations by turning the diagonal into -1
names = list(corr)

for name in names:
    idx = corr[name].argmax()
    max_pairwise_name = names[idx]
    res[name][max_pairwise_name] = corr.loc[max_pairwise_name, name]

Now res contains the maximum pairwise correlation for each variable.
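
For example, a small usage sketch (assuming the res dict built above) to inspect the result:

for name, partner in res.items():
    (best_match, value), = partner.items()
    print(f"{name}: most correlated with {best_match} ({value:.3f})")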