Pandas version of rbind

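In R, you can combine two data frames by using rbind to stick the rows of one onto the bottom of the other. How do you do the same thing in pandas? It seems bizarrely difficult.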

Using append results in a horrible mess, including NaNs, for reasons I don't understand. I'm just trying to "rbind" two identical frames that look like this:

EDIT: I was creating the DataFrames in a stupid way, which was causing the problems. Append = rbind to all intents and purposes. See the answer below.

        0         1       2        3          4          5        6                    7
0   ADN.L  20130220   437.4   442.37   436.5000   441.9000  2775364  2013-02-20 18:47:42
1   ADM.L  20130220  1279.0  1300.00  1272.0000  1285.0000   967730  2013-02-20 18:47:42
2   AGK.L  20130220  1717.0  1749.00  1709.0000  1739.0000   834534  2013-02-20 18:47:43
3  AMEC.L  20130220  1030.0  1040.00  1024.0000  1035.0000  1972517  2013-02-20 18:47:43
4   AAL.L  20130220  1998.0  2014.50  1942.4999  1951.0000  3666033  2013-02-20 18:47:44
5  ANTO.L  20130220  1093.0  1097.00  1064.7899  1068.0000  2183931  2013-02-20 18:47:44
6   ARM.L  20130220   941.5   965.10   939.4250   951.5001  2994652  2013-02-20 18:47:45

But I get something horrible like this:

        0         1        2        3          4         5        6                    7       0         1       2        3          4          5        6                    7
0     NaN       NaN      NaN      NaN        NaN       NaN      NaN                  NaN   ADN.L  20130220   437.4   442.37   436.5000   441.9000  2775364  2013-02-20 18:47:42
1     NaN       NaN      NaN      NaN        NaN       NaN      NaN                  NaN   ADM.L  20130220  1279.0  1300.00  1272.0000  1285.0000   967730  2013-02-20 18:47:42
2     NaN       NaN      NaN      NaN        NaN       NaN      NaN                  NaN   AGK.L  20130220  1717.0  1749.00  1709.0000  1739.0000   834534  2013-02-20 18:47:43
3     NaN       NaN      NaN      NaN        NaN       NaN      NaN                  NaN  AMEC.L  20130220  1030.0  1040.00  1024.0000  1035.0000  1972517  2013-02-20 18:47:43
4     NaN       NaN      NaN      NaN        NaN       NaN      NaN                  NaN   AAL.L  20130220  1998.0  2014.50  1942.4999  1951.0000  3666033  2013-02-20 18:47:44
5     NaN       NaN      NaN      NaN        NaN       NaN      NaN                  NaN  ANTO.L  20130220  1093.0  1097.00  1064.7899  1068.0000  2183931  2013-02-20 18:47:44
6     NaN       NaN      NaN      NaN        NaN       NaN      NaN                  NaN   ARM.L  20130220   941.5   965.10   939.4250   951.5001  2994652  2013-02-20 18:47:45
0     NaN       NaN      NaN      NaN        NaN       NaN      NaN                  NaN   ADN.L  20130220   437.4   442.37   436.5000   441.9000  2775364  2013-02-20 18:47:42
1     NaN       NaN      NaN      NaN        NaN       NaN      NaN                  NaN   ADM.L  20130220  1279.0  1300.00  1272.0000  1285.0000   967730  2013-02-20 18:47:42
2     NaN       NaN      NaN      NaN        NaN       NaN      NaN                  NaN   AGK.L  20130220  1717.0  1749.00  1709.0000  1739.0000   834534  2013-02-20 18:47:43
3     NaN       NaN      NaN      NaN        NaN       NaN      NaN                  NaN

I don't understand why. I'm starting to miss R :(

Ah, this is to do with how I created the DataFrame, not with how I was combining them. The long and the short of it is, if you are creating a frame using a loop and a statement that looks like this:

Frame = Frame.append(pandas.DataFrame(data = SomeNewLineOfData))

You must ignore the index:

Frame = Frame.append(pandas.DataFrame(data = SomeNewLineOfData), ignore_index=True)

Or you will have issues later when combining data.

[EDIT] DataFrame.append() has been deprecated since pandas 1.4.0 and was removed in pandas 2.0; use pandas.concat() instead: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html
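
For readers on a pandas version without append(), the loop pattern described above can be sketched with pd.concat. This is only an illustration under assumptions: the `rows` data is made up for the example (borrowing three tickers from the question), and collecting the pieces in a list before a single concat is a design choice of this sketch, not something the original answer prescribes.

```python
import pandas as pd

# Hypothetical lines of data, standing in for whatever the loop produces.
rows = [
    ["ADN.L", 20130220, 437.4],
    ["ADM.L", 20130220, 1279.0],
    ["AGK.L", 20130220, 1717.0],
]

# Build one single-row frame per line of data, as the loop in the answer does.
pieces = [pd.DataFrame([row]) for row in rows]

# Without ignore_index every piece keeps its own index label 0.
clashing = pd.concat(pieces)
print(clashing.index.tolist())   # [0, 0, 0]

# With ignore_index=True the result gets a fresh 0..n-1 index.
frame = pd.concat(pieces, ignore_index=True)
print(frame.index.tolist())      # [0, 1, 2]
```

Concatenating once at the end also avoids copying the growing frame on every iteration.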

This worked for me:

import numpy as np
import pandas as pd


dates = np.asarray(pd.date_range('1/1/2000', periods=8))
df1 = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
df2 = df1.copy()
df = df1.append(df2)  # pandas < 2.0 only; on newer versions use pd.concat([df1, df2])

Yields:

                   A         B         C         D
2000-01-01 -0.327208  0.552500  0.862529  0.493109
2000-01-02  1.039844 -2.141089 -0.781609  1.307600
2000-01-03 -0.462831  0.066505 -1.698346  1.123174
2000-01-04 -0.321971 -0.544599 -0.486099 -0.283791
2000-01-05  0.693749  0.544329 -1.606851  0.527733
2000-01-06 -2.461177 -0.339378 -0.236275  0.155569
2000-01-07 -0.597156  0.904511  0.369865  0.862504
2000-01-08 -0.958300 -0.583621 -2.068273  0.539434
2000-01-01 -0.327208  0.552500  0.862529  0.493109
2000-01-02  1.039844 -2.141089 -0.781609  1.307600
2000-01-03 -0.462831  0.066505 -1.698346  1.123174
2000-01-04 -0.321971 -0.544599 -0.486099 -0.283791
2000-01-05  0.693749  0.544329 -1.606851  0.527733
2000-01-06 -2.461177 -0.339378 -0.236275  0.155569
2000-01-07 -0.597156  0.904511  0.369865  0.862504
2000-01-08 -0.958300 -0.583621 -2.068273  0.539434

If you don't already use the latest version of pandas I highly recommend upgrading. It is now possible to operate with DataFrames which contain duplicate indices.
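
As a rough sketch of what duplicate indices look like in practice, the same stacking can be written with pd.concat and the resulting index inspected. The variable names mirror the answer above; `df_reset` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2000-01-01", periods=8)
df1 = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=["A", "B", "C", "D"])
df2 = df1.copy()

# Stack df2 under df1; every date now appears twice in the index.
df = pd.concat([df1, df2])
print(df.shape)                          # (16, 4)
print(df.index.is_unique)                # False
print(int(df.index.duplicated().sum()))  # 8

# If the dates are not needed as labels, a fresh integer index avoids duplicates.
df_reset = pd.concat([df1, df2], ignore_index=True)
print(df_reset.index.is_unique)          # True
```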

import pandas as pd
import numpy as np

If you have a DataFrame like this:

array = np.random.randint(0, 10, size=(2, 4))
df = pd.DataFrame(array, columns=['A', 'B', 'C', 'D'],
                  index=['10aa', '20bb'])  # some crazy indexes
df


      A  B  C  D
10aa  4  2  4  6
20bb  5  1  0  2

And you want to add a NEW ROW that is a list (or another iterable object):

List = [i**3 for i in range(df.shape[1]) ]
List
[0, 1, 8, 27]

You should turn the list into a dictionary whose keys are the DataFrame's columns, using the zip() function:

Dict = dict(  zip(df.columns, List)  )
Dict
{'A': 0, 'B': 1, 'C': 8, 'D': 27}

Then you can use the append() method to add the new dictionary:

df = df.append(Dict, ignore_index=True)  # pandas < 2.0 only; append() was removed in 2.0
df
   A  B  C   D
0  7  5  5   4
1  5  8  4   1
2  0  1  8  27

N.B. the indexes are dropped.

And yeah, it's not as simple as rbind() in R :(
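
On pandas 2.0 and later, where DataFrame.append no longer exists, the same row addition could be sketched as follows. The fixed values in `df` and the label "30cc" are assumptions made for this example, chosen so the result is reproducible.

```python
import pandas as pd

# Fixed values instead of random ones, so the result is reproducible.
df = pd.DataFrame([[4, 2, 4, 6], [5, 1, 0, 2]],
                  columns=["A", "B", "C", "D"],
                  index=["10aa", "20bb"])
new_row = [i**3 for i in range(df.shape[1])]   # [0, 1, 8, 27]

# Wrap the list in a one-row frame with matching columns, then concatenate.
row_df = pd.DataFrame([new_row], columns=df.columns)
appended = pd.concat([df, row_df], ignore_index=True)
print(appended)

# Alternative that keeps the existing labels: assign to a new label with .loc.
labelled = df.copy()
labelled.loc["30cc"] = new_row   # "30cc" is a hypothetical label
print(labelled)
```

The concat variant drops the original labels, just like append(..., ignore_index=True) did; the .loc variant keeps them.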

pd.concat will serve the purpose of rbind in R.

import pandas as pd
df1 = pd.DataFrame({'col1': [1,2], 'col2':[3,4]})
df2 = pd.DataFrame({'col1': [5,6], 'col2':[7,8]})
print(df1)
print(df2)
print(pd.concat([df1, df2]))

The output will look like this:

   col1  col2
0     1     3
1     2     4
   col1  col2
0     5     7
1     6     8
   col1  col2
0     1     3
1     2     4
0     5     7
1     6     8

If you read the documentation carefully, you will find that it also explains other operations, such as the equivalent of cbind.

dplyr's bind_rows does the same thing.

In python, you can do it the same way:

>>> from datar.all import bind_rows, head, tail
>>> from datar.datasets import iris
>>>
>>> iris >> head(3) >> bind_rows(iris >> tail(3))
   Sepal_Length  Sepal_Width  Petal_Length  Petal_Width    Species
      <float64>    <float64>     <float64>    <float64>   <object>
0           5.1          3.5           1.4          0.2     setosa
1           4.9          3.0           1.4          0.2     setosa
2           4.7          3.2           1.3          0.2     setosa
3           6.5          3.0           5.2          2.0  virginica
4           6.2          3.4           5.4          2.3  virginica
5           5.9          3.0           5.1          1.8  virginica

I am the author of the datar package. Feel free to submit issues if you have any questions.
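
For comparison, the same head-plus-tail stacking can be sketched in plain pandas. The `flowers` frame below is a small made-up stand-in, not the real iris dataset, and it carries only two of the columns.

```python
import pandas as pd

# Hypothetical stand-in data; the values are for illustration only.
flowers = pd.DataFrame({
    "Sepal_Length": [5.1, 4.9, 4.7, 4.6, 6.5, 6.2, 5.9],
    "Species": ["setosa"] * 4 + ["virginica"] * 3,
})

# First three rows stacked on top of the last three, with a fresh index.
combined = pd.concat([flowers.head(3), flowers.tail(3)], ignore_index=True)
print(combined)
```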

Yes, rbind() (row bind dataframes) and cbind() (column bind dataframes) in R are very simple and intuitive.

You can use the concat() function from the pandas library to achieve both. The pandas equivalent of rbind(df1, df2) is:

pd.concat([df1, df2], ignore_index = True)

However, I have written rbind() and cbind() functions below using pandas for ease of use.


import pandas as pd


def rbind(df1, df2):
    return pd.concat([df1, df2], ignore_index=True)


def cbind(df1, df2):
    # Note: this does not keep the original indexes of the frames; it resets them to 0, 1, ...
    return pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)


If you copy, paste, and run the functions above, you can use them in Python much as you would in R. As with R's rbind(df1, df2), df1 and df2 are expected to have the same column names; note, however, that pd.concat does not raise an error when they differ and instead fills the missing cells with NaN.
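
To illustrate the column-name assumption, here is a minimal sketch of what pd.concat does with mismatched columns, together with a stricter variant. `rbind_strict` is a hypothetical helper written for this example, not part of pandas or of the answer's functions.

```python
import pandas as pd


def rbind_strict(df1, df2):
    """Hypothetical variant of rbind that refuses frames with different column names."""
    if set(df1.columns) != set(df2.columns):
        raise ValueError("rbind_strict: column names differ")
    return pd.concat([df1, df2], ignore_index=True)


a = pd.DataFrame({"Name": ["Ali"], "Age": [36]})
b = pd.DataFrame({"Name": ["Fahad"], "Years": [42]})   # "Years" instead of "Age"

# Plain concat keeps the union of the columns and fills the gaps with NaN.
loose = pd.concat([a, b], ignore_index=True)
print(loose)

# The strict variant raises instead of silently producing NaN.
try:
    rbind_strict(a, b)
except ValueError as err:
    print(err)
```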

Below is an example of the rbind() function:

import pandas as pd


dict1 = {'Name': ['Ali', 'Craig', 'Shaz', 'Maheen'], 'Age': [36, 38, 33, 34]}
dict2 = {'Name': ['Fahad', 'Tyler', 'Thai-Son', 'Shazmeen', 'Uruj', 'Tatyana'], 'Age': [42, 27, 29, 60, 42, 31]}


data1 = pd.DataFrame(dict1)
data2 = pd.DataFrame(dict2)


# We now row-bind the two dataframes and save it as df_final.


df_final = rbind(data1, data2)


print(df_final)
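
A matching sketch for cbind() shows why the index reset matters. The function body repeats the cbind() definition above so that the block runs on its own; the two small frames and their index values are made up for the example.

```python
import pandas as pd


def cbind(df1, df2):
    # Same logic as the cbind() defined above, repeated so this block is self-contained.
    return pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)


# Two frames with the same number of rows but unrelated index labels.
names = pd.DataFrame({"Name": ["Ali", "Craig"]}, index=[10, 20])
ages = pd.DataFrame({"Age": [36, 38]}, index=[30, 40])

# With the reset, rows are paired by position, as R's cbind does.
combined = cbind(names, ages)
print(combined)

# Without the reset, concat aligns on index labels and pads with NaN.
unaligned = pd.concat([names, ages], axis=1)
print(unaligned.shape)   # (4, 2)
```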


Here is an open public GitHub repo file I created for writing and consolidating python equivalent R functions in one central place: https://github.com/CubeStatistica/Learning-Data-Science-Properly-for-Work-and-Production-Using-Python/blob/main/Writing-R-Functions-in-Python.ipynb

Feel free to contribute.

Happy coding!