Pandas version of rbind - 开卷题库

小开

Best answer

Ah, this has to do with how I created the DataFrame, not with how I was combining them. In short, if you build a frame in a loop with a statement that looks like this:

Frame = Frame.append(pandas.DataFrame(data = SomeNewLineOfData))

you must ignore the index:

Frame = Frame.append(pandas.DataFrame(data = SomeNewLineOfData), ignore_index=True)

Otherwise the appended rows keep their original index labels, and the resulting duplicates cause problems later when you combine data.
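On pandas 2.0 and later, where DataFrame.append() no longer exists, the same loop can be sketched with pd.concat(). The row data below is made up for illustration, and collecting the pieces in a list before concatenating once is a common pattern rather than something this answer prescribes.

```python
import pandas as pd

# Hypothetical rows standing in for SomeNewLineOfData in the answer above.
new_lines = [
    {"A": [1], "B": [2]},
    {"A": [3], "B": [4]},
    {"A": [5], "B": [6]},
]

# Build one small frame per iteration, then concatenate once at the end.
pieces = [pd.DataFrame(data=line) for line in new_lines]

# Without ignore_index every piece keeps its own index label.
with_duplicates = pd.concat(pieces)

# With ignore_index=True the result gets a fresh 0..n-1 index.
frame = pd.concat(pieces, ignore_index=True)

print(with_duplicates.index.tolist())  # [0, 0, 0]
print(frame.index.tolist())            # [0, 1, 2]
```

Concatenating once also avoids copying the growing frame on every iteration.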

小开

[EDIT] DataFrame.append() has been deprecated since pandas 1.4.0 and was removed in pandas 2.0; use pd.concat() instead: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html

This worked for me:

import numpy as np
import pandas as pd


dates = np.asarray(pd.date_range('1/1/2000', periods=8))
df1 = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
df2 = df1.copy()
df = pd.concat([df1, df2])  # df1.append(df2) on pandas versions before 2.0

Yields:

                   A         B         C         D
2000-01-01 -0.327208  0.552500  0.862529  0.493109
2000-01-02  1.039844 -2.141089 -0.781609  1.307600
2000-01-03 -0.462831  0.066505 -1.698346  1.123174
2000-01-04 -0.321971 -0.544599 -0.486099 -0.283791
2000-01-05  0.693749  0.544329 -1.606851  0.527733
2000-01-06 -2.461177 -0.339378 -0.236275  0.155569
2000-01-07 -0.597156  0.904511  0.369865  0.862504
2000-01-08 -0.958300 -0.583621 -2.068273  0.539434
2000-01-01 -0.327208  0.552500  0.862529  0.493109
2000-01-02  1.039844 -2.141089 -0.781609  1.307600
2000-01-03 -0.462831  0.066505 -1.698346  1.123174
2000-01-04 -0.321971 -0.544599 -0.486099 -0.283791
2000-01-05  0.693749  0.544329 -1.606851  0.527733
2000-01-06 -2.461177 -0.339378 -0.236275  0.155569
2000-01-07 -0.597156  0.904511  0.369865  0.862504
2000-01-08 -0.958300 -0.583621 -2.068273  0.539434

If you are not already on the latest version of pandas, I highly recommend upgrading. It is now possible to operate on DataFrames that contain duplicate index values.
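A small sketch of what working with duplicate index values looks like in practice. The frame here is a cut-down, deterministic stand-in for the random one above, and the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2000-01-01", periods=3)
df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=dates, columns=["A", "B"])

# Stacking a frame on top of its copy repeats every index label.
stacked = pd.concat([df1, df1.copy()])
print(stacked.index.is_unique)       # False

# Selecting a repeated label returns every matching row.
first_day = stacked.loc[dates[0]]
print(len(first_day))                # 2

# If unique labels are needed later, renumber the rows.
clean = stacked.reset_index(drop=True)
print(clean.index.is_unique)         # True
```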

小开

import pandas as pd
import numpy as np

If you have a DataFrame like this:

array = np.random.randint(0, 10, size=(2, 4))
df = pd.DataFrame(array, columns=['A', 'B', 'C', 'D'],
                  index=['10aa', '20bb'])  # some unusual index labels
df


      A  B  C  D
10aa  4  2  4  6
20bb  5  1  0  2

And you want to add a NEW ROW that is a list (or another iterable object):

List = [i**3 for i in range(df.shape[1]) ]
List
[0, 1, 8, 27]

You should turn the list into a dictionary whose keys are the DataFrame's columns, using the zip() function:

Dict = dict(  zip(df.columns, List)  )
Dict
{'A': 0, 'B': 1, 'C': 8, 'D': 27}

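Then you can use the append() method to add the new dictionary (on pandas 2.0 and later, where append() has been removed, pass a one-row DataFrame to pd.concat() instead):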

df = df.append(Dict, ignore_index=True)
df
   A  B  C   D
0  4  2  4   6
1  5  1  0   2
2  0  1  8  27

N.B. the indexes are dropped.
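For readers on pandas 2.0 or later, here is a hedged sketch of two ways to add the same row without append(). The frame uses fixed values instead of random ones so the result is reproducible, and the label '30cc' is a made-up example, not something from the answer.

```python
import pandas as pd

df = pd.DataFrame([[4, 2, 4, 6], [5, 1, 0, 2]],
                  columns=["A", "B", "C", "D"],
                  index=["10aa", "20bb"])
new_row = [i**3 for i in range(df.shape[1])]  # [0, 1, 8, 27]

# Option 1: wrap the dict in a one-row DataFrame and concatenate.
# ignore_index=True drops the old labels, as append(..., ignore_index=True) did.
row_frame = pd.DataFrame([dict(zip(df.columns, new_row))])
appended = pd.concat([df, row_frame], ignore_index=True)
print(appended.index.tolist())   # [0, 1, 2]

# Option 2: assign to a new label with .loc, which keeps the existing labels.
# "30cc" is a hypothetical label chosen for this example.
labelled = df.copy()
labelled.loc["30cc"] = new_row
print(labelled.index.tolist())   # ['10aa', '20bb', '30cc']
```

Option 2 is worth considering when the index labels carry meaning and should survive the operation.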

And yeah, it's not as simple as rbind() in R :(

小开

pd.concat will serve the purpose of rbind in R.

import pandas as pd
df1 = pd.DataFrame({'col1': [1,2], 'col2':[3,4]})
df2 = pd.DataFrame({'col1': [5,6], 'col2':[7,8]})
print(df1)
print(df2)
print(pd.concat([df1, df2]))

The output will look like:

   col1  col2
0     1     3
1     2     4
   col1  col2
0     5     7
1     6     8
   col1  col2
0     1     3
1     2     4
0     5     7
1     6     8

If you read the documentation carefully, it also explains other operations such as cbind.
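As a rough illustration of the cbind case, pd.concat() with axis=1 places frames side by side, aligning rows on the index. The second frame and its column names (col3, col4) are invented for this sketch.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})
df3 = pd.DataFrame({"col3": [5, 6], "col4": [7, 8]})  # hypothetical second frame

# axis=1 binds columns; rows are matched by index label, not by position.
side_by_side = pd.concat([df1, df3], axis=1)

print(side_by_side.columns.tolist())  # ['col1', 'col2', 'col3', 'col4']
print(side_by_side.shape)             # (2, 4)
```

Because the alignment is by label, frames whose indexes differ produce NaN-filled rows instead of a purely positional bind.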

小开

dplyr's bind_rows does the same thing.

In Python, you can do it the same way with the datar package:

>>> from datar.all import bind_rows, head, tail
>>> from datar.datasets import iris
>>>
>>> iris >> head(3) >> bind_rows(iris >> tail(3))
   Sepal_Length  Sepal_Width  Petal_Length  Petal_Width    Species
      <float64>    <float64>     <float64>    <float64>   <object>
0           5.1          3.5           1.4          0.2     setosa
1           4.9          3.0           1.4          0.2     setosa
2           4.7          3.2           1.3          0.2     setosa
3           6.5          3.0           5.2          2.0  virginica
4           6.2          3.4           5.4          2.3  virginica
5           5.9          3.0           5.1          1.8  virginica

I am the author of the datar package. Feel free to submit issues if you have any questions.

小开

Yes, rbind() (row bind dataframes) and cbind() (column bind dataframes) in R are very simple and intuitive.

You can use the concat() function from the pandas library to achieve both. The pandas equivalent of rbind(df1, df2) is:

pd.concat([df1, df2], ignore_index = True)

However, I have written rbind() and cbind() functions below using pandas for ease of use.


import pandas as pd


def rbind(df1, df2):
    # Stack df2 below df1 and renumber the rows 0,1,...
    return pd.concat([df1, df2], ignore_index=True)


def cbind(df1, df2):
    # Note this does not keep the original indexes of the df's and resets them to 0,1,...
    return pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)

If you copy, paste, and run the above functions, you can use them in Python much as you would in R. They are meant to be used under the same assumptions as their R counterparts: for rbind(df1, df2), df1 and df2 should have the same column names. Unlike R, though, pd.concat() does not raise an error when the names differ; it fills the missing columns with NaN.
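To make the column-name assumption concrete, here is a minimal sketch of what pd.concat() does with mismatched columns, plus a stricter variant. rbind_strict is a hypothetical helper written for this illustration, not part of pandas or of the functions above.

```python
import pandas as pd


def rbind_strict(df1, df2):
    """Hypothetical helper: row-bind only when the column names match."""
    if set(df1.columns) != set(df2.columns):
        raise ValueError("column names do not match")
    return pd.concat([df1, df2], ignore_index=True)


a = pd.DataFrame({"Name": ["Ali"], "Age": [36]})
b = pd.DataFrame({"Name": ["Fahad"], "Years": [42]})  # "Years" instead of "Age"

# Plain concat keeps the union of the columns and fills the gaps with NaN.
loose = pd.concat([a, b], ignore_index=True)
print(loose.columns.tolist())          # ['Name', 'Age', 'Years']
print(int(loose.isna().sum().sum()))   # 2

# The strict variant refuses the same input.
try:
    rbind_strict(a, b)
    raised = False
except ValueError:
    raised = True
print(raised)                          # True
```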

Below is an example of the rbind() function:

import pandas as pd


dict1 = {'Name': ['Ali', 'Craig', 'Shaz', 'Maheen'], 'Age': [36, 38, 33, 34]}
dict2 = {'Name': ['Fahad', 'Tyler', 'Thai-Son', 'Shazmeen', 'Uruj', 'Tatyana'], 'Age': [42, 27, 29, 60, 42, 31]}


data1 = pd.DataFrame(dict1)
data2 = pd.DataFrame(dict2)


# We now row-bind the two dataframes and save it as df_final.


df_final = rbind(data1, data2)


print(df_final)

Here is a public GitHub notebook I created to collect Python equivalents of R functions in one central place: https://github.com/CubeStatistica/Learning-Data-Science-Properly-for-Work-and-Production-Using-Python/blob/main/Writing-R-Functions-in-Python.ipynb

Feel free to contribute.

Happy coding!