Nested dictionary to MultiIndex DataFrame where the dictionary keys are the column labels

Say I have a dictionary that looks like this:

dictionary = {'A' : {'a': [1,2,3,4,5],
                     'b': [6,7,8,9,1]},

              'B' : {'a': [2,3,4,5,6],
                     'b': [7,8,9,1,2]}}

and I want a DataFrame that looks something like this:

   A     B
   a  b  a  b
0  1  6  2  7
1  2  7  3  8
2  3  8  4  9
3  4  9  5  1
4  5  1  6  2

Is there a convenient way to do this? If I try:

In [99]:

DataFrame(dictionary)

Out[99]:
                 A                B
a  [1, 2, 3, 4, 5]  [2, 3, 4, 5, 6]
b  [6, 7, 8, 9, 1]  [7, 8, 9, 1, 2]

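I get a DataFrame where each element is a list. What I need is a MultiIndex where each level corresponds to the keys in the nested dict, and the rows correspond to each element of the lists, as shown above. I think I could work out a very crude solution, but I'm hoping there is something simpler.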


Pandas wants the MultiIndex values as tuples, not nested dicts. The simplest thing is to convert your dictionary to the right format before trying to pass it to DataFrame:

>>> reform = {(outerKey, innerKey): values for outerKey, innerDict in dictionary.items() for innerKey, values in innerDict.items()}
>>> reform
{('A', 'a'): [1, 2, 3, 4, 5],
('A', 'b'): [6, 7, 8, 9, 1],
('B', 'a'): [2, 3, 4, 5, 6],
('B', 'b'): [7, 8, 9, 1, 2]}
>>> pandas.DataFrame(reform)
   A     B
   a  b  a  b
0  1  6  2  7
1  2  7  3  8
2  3  8  4  9
3  4  9  5  1
4  5  1  6  2

[5 rows x 4 columns]
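
Once the tuple keys have become MultiIndex columns, the frame can be sliced by either level. The following is only a small sketch of that, reusing the sample data from the question; the variable names (`block_a`, `col_ab`, `all_a`) are made up for illustration.

```python
import pandas as pd

dictionary = {'A': {'a': [1, 2, 3, 4, 5], 'b': [6, 7, 8, 9, 1]},
              'B': {'a': [2, 3, 4, 5, 6], 'b': [7, 8, 9, 1, 2]}}

# Flatten the nested keys into (outer, inner) tuples.
reform = {(outer, inner): values
          for outer, inner_dict in dictionary.items()
          for inner, values in inner_dict.items()}
df = pd.DataFrame(reform)

# The tuple keys become a two-level MultiIndex on the columns.
is_multi = isinstance(df.columns, pd.MultiIndex)

# Everything under outer key 'A' (a frame with columns 'a' and 'b').
block_a = df['A']

# A single column, addressed by its full (outer, inner) key.
col_ab = df[('A', 'b')]

# Inner key 'a' across all outer keys (a frame with columns 'A' and 'B').
all_a = df.xs('a', axis=1, level=1)
```
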

import pandas as pd
# build one DataFrame per outer key, then concatenate them column-wise
dict_of_df = {k: pd.DataFrame(v) for k, v in dictionary.items()}
df = pd.concat(dict_of_df, axis=1)

Note that the order of the columns is not guaranteed on Python < 3.6, where dicts do not preserve insertion order.
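
As a quick sanity check, here is a sketch that builds the frame both ways (via `pd.concat` and via tuple keys) on the question's sample data and compares them. The names `via_concat` and `via_tuples` are just for illustration, and the comparison assumes a Python version where dicts keep insertion order.

```python
import pandas as pd

dictionary = {'A': {'a': [1, 2, 3, 4, 5], 'b': [6, 7, 8, 9, 1]},
              'B': {'a': [2, 3, 4, 5, 6], 'b': [7, 8, 9, 1, 2]}}

# Approach 1: one DataFrame per outer key, concatenated side by side.
# The dict keys become the outer level of the column MultiIndex.
via_concat = pd.concat({k: pd.DataFrame(v) for k, v in dictionary.items()},
                       axis=1)

# Approach 2: a flat dict keyed by (outer, inner) tuples.
via_tuples = pd.DataFrame({(outer, inner): vals
                           for outer, inner_dict in dictionary.items()
                           for inner, vals in inner_dict.items()})

# Compare column labels and cell values.
same_columns = via_concat.columns.tolist() == via_tuples.columns.tolist()
same_values = bool((via_concat.to_numpy() == via_tuples.to_numpy()).all())
```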

You're looking for the functionality in .stack:

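df = pd.DataFrame.from_dict(dictionary, orient="index").stack().to_frame()
# break the lists out into columns; the (outer, inner) keys end up as a MultiIndex on the rows
df = pd.DataFrame(df[0].values.tolist(), index=df.index)
# transpose to put the MultiIndex on the columns, as in the question
df = df.T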

This recursive function should work:

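def reform_dict(dictionary, t=(), reform=None):
    # a mutable default ({}) would be shared between calls, so create it here
    if reform is None:
        reform = {}
    for key, val in dictionary.items():
        path = t + (key,)  # tuple of keys leading to this value
        if isinstance(val, dict):
            reform_dict(val, path, reform)  # descend one level
        else:
            reform[path] = val  # leaf: store under the full key path
    return reform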
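
To illustrate where recursion helps, here is a sketch that flattens a three-level dictionary and passes the result to `pd.DataFrame`. The helper name `flatten_keys` and the sample data `deep` are made up for this example, and it assumes every list sits at the same nesting depth so that all key tuples have the same length.

```python
import pandas as pd


def flatten_keys(nested, prefix=()):
    """Return {tuple_of_keys: leaf_value} for an arbitrarily nested dict."""
    flat = {}
    for key, val in nested.items():
        path = prefix + (key,)
        if isinstance(val, dict):
            # Descend and merge the flattened sub-dictionary.
            flat.update(flatten_keys(val, path))
        else:
            # Leaf value: store it under the full key path.
            flat[path] = val
    return flat


# Hypothetical three-level input.
deep = {'A': {'a': {'x': [1, 2], 'y': [3, 4]}},
        'B': {'a': {'x': [5, 6], 'y': [7, 8]}}}

flat = flatten_keys(deep)
df = pd.DataFrame(flat)  # columns form a three-level MultiIndex
```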

If the lists in the dictionary are not all the same length, you can adapt BrenBarn's method:

>>> dictionary = {'A' : {'a': [1,2,3,4,5],
...                      'b': [6,7,8,9,1]},
...               'B' : {'a': [2,3,4,5,6],
...                      'b': [7,8,9,1]}}

>>> reform = {(outerKey, innerKey): values for outerKey, innerDict in dictionary.items() for innerKey, values in innerDict.items()}
>>> reform
{('A', 'a'): [1, 2, 3, 4, 5],
('A', 'b'): [6, 7, 8, 9, 1],
('B', 'a'): [2, 3, 4, 5, 6],
('B', 'b'): [7, 8, 9, 1]}


>>> df = pandas.DataFrame.from_dict(reform, orient='index').transpose()
>>> df.columns = pandas.MultiIndex.from_tuples(df.columns)
>>> df
   A     B
   a  b  a    b
0  1  6  2    7
1  2  7  3    8
2  3  8  4    9
3  4  9  5    1
4  5  1  6  NaN
[5 rows x 4 columns]
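
Another way to handle unequal lengths, offered here only as a sketch and not as part of the answer above, is to wrap each list in a `pd.Series` before building the frame. pandas then aligns the Series on their row index and pads the short one with NaN. The `reform` dict is the one from the example.

```python
import pandas as pd

reform = {('A', 'a'): [1, 2, 3, 4, 5],
          ('A', 'b'): [6, 7, 8, 9, 1],
          ('B', 'a'): [2, 3, 4, 5, 6],
          ('B', 'b'): [7, 8, 9, 1]}   # one element short

# Each list becomes a Series indexed 0..n-1; the DataFrame constructor
# aligns them on the union of those indexes and fills the gap with NaN.
df = pd.DataFrame({key: pd.Series(vals) for key, vals in reform.items()})

# The padded cell is the last row of column ('B', 'b').
missing = df.loc[4, ('B', 'b')]
```

Note that the padded column is stored as floats, since NaN cannot be held in an integer column.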

This solution works for a larger DataFrame that already exists with flat column labels, and it produces the requested layout:

cols = df.columns
int_cols = len(cols)
# skip column 0 (relabelled 'myIndex' below); split the remaining columns into two halves
col_subset_1 = [cols[x] for x in range(1, int_cols // 2 + 1)]
col_subset_2 = [cols[x] for x in range(int_cols // 2 + 1, int_cols)]

# pair each column name with its top-level label
col_subset_1_label = list(zip(['A'] * len(col_subset_1), col_subset_1))
col_subset_2_label = list(zip(['B'] * len(col_subset_2), col_subset_2))
df.columns = pd.MultiIndex.from_tuples([('', 'myIndex'), *col_subset_1_label, *col_subset_2_label])

OUTPUT

                    A                   B
    myIndex         a         b         c         d
0  0.159710  1.472925  0.619508 -0.476738  0.866238
1 -0.665062  0.609273 -0.089719  0.730012  0.751615
2  0.215350 -0.403239  1.801829 -2.052797 -1.026114
3 -0.609692  1.163072 -1.007984 -0.324902 -1.624007
4  0.791321 -0.060026 -1.328531 -0.498092  0.559837
5  0.247412 -0.841714  0.354314  0.506985  0.425254
6  0.443535  1.037502 -0.433115  0.601754 -1.405284
7 -0.433744  1.514892  1.963495 -2.353169  1.285580
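
For anyone who wants to try the relabelling end to end, here is a self-contained sketch. The starting frame is an assumption (eight rows of random numbers with flat columns `myIndex`, `a`, `b`, `c`, `d`), so the values will not match the output shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical input: a flat-column frame similar in shape to the one above.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((8, 5)),
                  columns=['myIndex', 'a', 'b', 'c', 'd'])

cols = list(df.columns)
half = len(cols) // 2

# Column 0 stays on its own; the rest is split into two halves.
first = cols[1:half + 1]     # ['a', 'b']
second = cols[half + 1:]     # ['c', 'd']

# Put 'A' above the first half and 'B' above the second half.
df.columns = pd.MultiIndex.from_tuples(
    [('', 'myIndex')]
    + [('A', c) for c in first]
    + [('B', c) for c in second]
)
```

With an odd number of remaining columns the second group gets the extra one, which matches the `range`-based split in the answer.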