ValueError: Length of values does not match length of index | Pandas DataFrame.unique()

I am trying to get a new dataset, or change the values of the current dataset's columns to their unique values. Here is an example of what I am trying to get:

    A  B
   -----
0|  1  1
1|  2  5
2|  1  5
3|  7  9
4|  7  9
5|  8  9


Wanted Result        Not Wanted Result
    A  B                 A  B
   -----                -----
0|  1  1             0|  1  1
1|  2  5             1|  2  5
2|  7  9             2|
3|  8                3|  7  9
                     4|
                     5|  8

I don't really care about the index, but it seems to be the problem. My code so far is very simple; I have tried two approaches, one with a new DataFrame and one without.

import pandas as pd

# With a new DataFrame
def UniqueResults(dataframe):
    df = pd.DataFrame()
    for col in dataframe:
        S = pd.Series(dataframe[col].unique())
        df[col] = S.values
    return df


# Without a new DataFrame
def UniqueResults(dataframe):
    for col in dataframe:
        dataframe[col] = dataframe[col].unique()
    return dataframe

Both times, I got the error:

ValueError: Length of values does not match length of index

The error comes up when you are trying to assign a list or numpy array of a different length to the data frame. It can be reproduced as follows:

A data frame of four rows:

df = pd.DataFrame({'A': [1,2,3,4]})

Now trying to assign a list/array of two elements to it:

df['B'] = [3,4]   # or df['B'] = np.array([3,4])

Both error out with:

ValueError: Length of values does not match length of index

This is because the data frame has four rows but the list and the array have only two elements.
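To confirm, a minimal sketch (same toy frame as above): the assignment succeeds once the lengths match, and a short list can be padded by hand, assuming None is an acceptable filler:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4]})

# Works: the list has exactly as many elements as the frame has rows.
df['B'] = [3, 4, 5, 6]

# Manual workaround: pad the short list with None up to len(df);
# pandas stores the Nones as NaN.
vals = [3, 4]
df['C'] = vals + [None] * (len(df) - len(vals))

df
#    A  B    C
# 0  1  3  3.0
# 1  2  4  4.0
# 2  3  5  NaN
# 3  4  6  NaN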

Workaround (use with caution): convert the list/array to a pandas Series; then, when you do the assignment, rows whose index is missing from the Series will be filled with NaN:

df['B'] = pd.Series([3,4])


df
#   A     B
#0  1   3.0
#1  2   4.0
#2  3   NaN          # NaN because the values at index 2 and 3 don't exist in the Series
#3  4   NaN
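One caveat behind the "use with caution" above: Series assignment aligns on index labels, not on position. A minimal sketch, assuming a frame with a non-default index:

import pandas as pd

# This frame's labels (10..13) never match the Series' default
# labels (0, 1), so every row receives NaN.
df2 = pd.DataFrame({'A': [1, 2, 3, 4]}, index=[10, 11, 12, 13])
df2['B'] = pd.Series([3, 4])

df2
#     A   B
# 10  1 NaN
# 11  2 NaN
# 12  3 NaN
# 13  4 NaN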

For your specific problem, if you don't care about the index or the correspondence of values between columns, you can reset the index of each column after dropping the duplicates:

df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))


#   A     B
#0  1   1.0
#1  2   5.0
#2  7   9.0
#3  8   NaN
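
For reference, a self-contained sketch with the sample frame from the question:

import pandas as pd

# The sample frame from the question
df = pd.DataFrame({'A': [1, 2, 1, 7, 7, 8],
                   'B': [1, 5, 5, 9, 9, 9]})

# Drop duplicates per column, then renumber each column from 0
out = df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))

out
#    A    B
# 0  1  1.0
# 1  2  5.0
# 2  7  9.0
# 3  8  NaN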

One way to get around this issue is to keep the unique values in a list and use itertools.zip_longest to transpose the data and pass it into the DataFrame constructor:

import pandas as pd
from itertools import zip_longest

def UniqueResults(dataframe):
    tmp = [dataframe[col].unique() for col in dataframe]
    return pd.DataFrame(zip_longest(*tmp), columns=dataframe.columns)


out = UniqueResults(df)

Output:

   A    B
0  1  1.0
1  2  5.0
2  7  9.0
3  8  NaN
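
Note that zip_longest pads the shorter columns with its fillvalue (None by default), and pandas stores those Nones as NaN, which is why column B comes out as float.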

At least for small DataFrames, this seems to be faster (for example on OP's sample):

%timeit out = df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))
1.27 ms ± 50.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


%timeit x = UniqueResults(df)
426 µs ± 24.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Another simple solution is to turn the approach suggested by the OP into a working one. We just need to cast the unique values of each column to a pandas Series:

df1 = df.apply(lambda col: pd.Series(col.unique()))
df1

#   A    B
#0  1  1.0
#1  2  5.0
#2  7  9.0
#3  8  NaN
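
As with the other approaches, the introduced NaN forces column B to float. If you would rather keep integer columns, one option (a sketch, assuming pandas >= 1.0) is convert_dtypes(), which switches to the nullable Int64 dtype and shows the missing value as <NA>:

df1 = df.apply(lambda col: pd.Series(col.unique())).convert_dtypes()

df1
#    A     B
# 0  1     1
# 1  2     5
# 2  7     9
# 3  8  <NA>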