Merge pandas dataframes where one value is between two others

I need to merge two pandas dataframes on an identifier and a condition where a date in one dataframe is between two dates in the other dataframe.

Dataframe A has a date ("fdate") and an ID ("cusip"):

(screenshot of dataframe A)

I need to merge it with this dataframe B:

(screenshot of dataframe B)

The merge condition is A.cusip == B.ncusip, where A.fdate is between B.namedt and B.nameenddt.

In SQL this would be trivial, but the only way I can think to do it in pandas is to first merge unconditionally on the identifier, and then filter on the date condition:

df = pd.merge(A, B, how='inner', left_on='cusip', right_on='ncusip')
df = df[(df['fdate']>=df['namedt']) & (df['fdate']<=df['nameenddt'])]

Is this really the best way to do it? It seems it would be much better if one could filter within the merge, so as to avoid the potentially very large dataframe that exists after the merge but before the filter has been applied.


There is no native pandas way of doing this at the moment.

This answer used to be about tackling the problem with polymorphism, which turned out to be a very bad idea.

Then the numpy.piecewise function appeared in another answer, but with little explanation, so I thought I would clarify how this function can be used.

Numpy way with piecewise (Memory heavy)

The np.piecewise function can be used to emulate the behavior of a custom join. There is a lot of overhead involved and it is not very efficient per se, but it does the job.
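Before the full example, a minimal illustration (my addition, with toy data not from the question) of how np.piecewise maps a list of boolean condition arrays to output values:

import numpy as np

x = np.zeros(3)  # the output takes its length and dtype from this array
conditions = [np.array([True, False, False]),
              np.array([False, True, False])]
values = [10, 20]  # written where the corresponding condition array is True

np.piecewise(x, conditions, values)
# array([ 10.,  20.,   0.])  # positions matched by no condition stay 0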

Producing conditions for joining

import pandas as pd
import numpy as np  # needed below for np.piecewise
from datetime import datetime

presidents = pd.DataFrame({"name": ["Bush", "Obama", "Trump"],
                           "president_id": [43, 44, 45]})
terms = pd.DataFrame({'start_date': pd.date_range('2001-01-20', periods=5, freq='48M'),
                      'end_date': pd.date_range('2005-01-21', periods=5, freq='48M'),
                      'president_id': [43, 43, 44, 44, 45]})
war_declarations = pd.DataFrame({"date": [datetime(2001, 9, 14), datetime(2003, 3, 3)],
                                 "name": ["War in Afghanistan", "Iraq War"]})


start_end_date_tuples = zip(terms.start_date.values, terms.end_date.values)
conditions = [(war_declarations.date.values >= start_date) &
              (war_declarations.date.values <= end_date)
              for start_date, end_date in start_end_date_tuples]


> conditions
[array([ True,  True], dtype=bool),
 array([False, False], dtype=bool),
 array([False, False], dtype=bool),
 array([False, False], dtype=bool),
 array([False, False], dtype=bool)]

This is a list of arrays, where each array tells us whether that term's time span matched for each of the two war declarations. With larger datasets the conditions can explode: the total number of boolean entries is the length of the left df multiplied by the length of the right df.
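A rough back-of-the-envelope check (my addition, assuming one byte per boolean) makes that scaling concrete:

n_left, n_right = len(terms), len(war_declarations)
approx_bytes = n_left * n_right  # 5 * 2 = 10 bytes here
# at 10**5 rows on each side this is already ~10 GB of boolean entries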

The piecewise "magic"

Now piecewise will take the president_id from the terms and place it in the war_declarations dataframe for each of the corresponding wars.

war_declarations['president_id'] = np.piecewise(np.zeros(len(war_declarations)),
                                                conditions,
                                                terms.president_id.values)

         date                name  president_id
0  2001-09-14  War in Afghanistan          43.0
1  2003-03-03            Iraq War          43.0

Now to finish this example we just need to regularly merge in the presidents' names.

war_declarations.merge(presidents, on="president_id", suffixes=["_war", "_president"])

         date            name_war  president_id name_president
0  2001-09-14  War in Afghanistan          43.0           Bush
1  2003-03-03            Iraq War          43.0           Bush
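As an aside, the ids carry a float dtype (43.0) because np.piecewise wrote into the float array created by np.zeros; a cast (my addition, not part of the original answer) cleans this up:

# np.zeros defaults to float64, so the assigned ids came out as floats
war_declarations['president_id'] = war_declarations['president_id'].astype(int)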

Polymorphism (does not work)

I wanted to share my research efforts, so even if this does not solve the problem, I hope it will be allowed to live on here as a useful reply at least. Since it is hard to spot the error, someone else may try this and think they have a working solution, while in fact, they don't.

The only other way I could figure out is to create two new classes: PointInTime and Timespan.

Both should have __eq__ methods that return True when a PointInTime is compared to a Timespan which contains it.

After that you can fill your DataFrame with these objects, and join on the columns they live in.

Something like this:

class PointInTime(object):
    def __init__(self, year, month, day):
        self.dt = datetime(year, month, day)

    def __eq__(self, other):
        # a PointInTime equals a Timespan that contains it
        return other.start_date < self.dt < other.end_date

    def __ne__(self, other):
        return not self.__eq__(other)

    def __repr__(self):
        return "{}-{}-{}".format(self.dt.year, self.dt.month, self.dt.day)


class Timespan(object):
    def __init__(self, start_date, end_date):
        self.start_date = start_date
        self.end_date = end_date

    def __eq__(self, other):
        # a Timespan equals a PointInTime it contains
        return self.start_date < other.dt < self.end_date

    def __ne__(self, other):
        return not self.__eq__(other)

    def __repr__(self):
        return "{}-{}-{} -> {}-{}-{}".format(self.start_date.year, self.start_date.month, self.start_date.day,
                                             self.end_date.year, self.end_date.month, self.end_date.day)

Important note: I do not subclass datetime because pandas will consider the dtype of the column of datetime objects to be a datetime dtype, and since the timespan is not, pandas silently refuses to merge on them.
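A quick check (my addition) makes the dtype point concrete:

# a Series of PointInTime objects has the plain 'object' dtype, while a
# Series of real datetimes is 'datetime64[ns]'; as noted above, pandas will
# not merge a datetime64 column against an 'object' column of Timespans
print(pd.Series([PointInTime(2015, 1, 1)]).dtype)  # object
print(pd.Series([datetime(2015, 1, 1)]).dtype)     # datetime64[ns]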

If we instantiate two objects of these classes, they can now be compared:

pit = PointInTime(2015,1,1)
ts = Timespan(datetime(2014,1,1), datetime(2015,2,2))
pit == ts
True

We can also fill two DataFrames with these objects:

df = pd.DataFrame({"pit": [PointInTime(2015, 1, 1),
                           PointInTime(2015, 2, 2),
                           PointInTime(2015, 3, 3)]})

df2 = pd.DataFrame({"ts": [Timespan(datetime(2015, 2, 1), datetime(2015, 2, 5)),
                           Timespan(datetime(2015, 2, 1), datetime(2015, 4, 1))]})

And then the merging kind of works:

pd.merge(left=df, left_on='pit', right=df2, right_on='ts')

        pit                    ts
0  2015-2-2  2015-2-1 -> 2015-2-5
1  2015-2-2  2015-2-1 -> 2015-4-1

But only kind of.

PointInTime(2015,3,3) should also have been included in this join on Timespan(datetime(2015,2,1), datetime(2015,4,1))

But it is not.
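The comparison itself does hold, which suggests the rows are being dropped inside the merge machinery rather than by __eq__ (a quick check, my addition):

PointInTime(2015, 3, 3) == Timespan(datetime(2015, 2, 1), datetime(2015, 4, 1))
# True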

I figure pandas compares PointInTime(2015,3,3) to PointInTime(2015,2,2), and assumes that since they are not equal, PointInTime(2015,3,3) cannot be equal to Timespan(datetime(2015,2,1), datetime(2015,4,1)) either, since that timespan was equal to PointInTime(2015,2,2).

Sort of like this:

Rose == Flower
Lilly != Rose

Therefore:

Lilly != Flower

Edit:

I tried making all PointInTime objects equal to each other. This changed the behaviour of the join to include 2015-3-3, but 2015-2-2 was then only included for the Timespan 2015-2-1 -> 2015-2-5, which strengthens the hypothesis above.
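The code for that experiment was not posted; a hypothetical reconstruction of the change (my sketch, not the original) would be:

# hypothetical reconstruction: every PointInTime compares equal to any other
# PointInTime, while keeping the containment check against Timespans
class PointInTime(object):
    def __init__(self, year, month, day):
        self.dt = datetime(year, month, day)

    def __eq__(self, other):
        if isinstance(other, PointInTime):
            return True
        return other.start_date < self.dt < other.end_date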

If anyone has any other ideas, please comment and I can try it.

A pandas solution would be great if it were implemented similar to foverlaps() from the data.table package in R. So far I've found numpy's piecewise() to be efficient. I've provided the code below, based on the earlier discussion Merging dataframes based on date range.

A['permno'] = np.piecewise(np.zeros(len(A)),
                           [(A['cusip'].values == ncusip) &
                            (A['fdate'].values >= start) &
                            (A['fdate'].values <= end)
                            for ncusip, start, end in zip(B['ncusip'].values,
                                                          B['namedt'].values,
                                                          B['nameenddt'].values)],
                           B['permno'].values).astype(int)

As you say, this is pretty easy in SQL, so why not do it in SQL?

import pandas as pd
import sqlite3
from datetime import datetime  # needed for the war_declarations dates

# We'll use firelynx's tables:
presidents = pd.DataFrame({"name": ["Bush", "Obama", "Trump"],
                           "president_id": [43, 44, 45]})
terms = pd.DataFrame({'start_date': pd.date_range('2001-01-20', periods=5, freq='48M'),
                      'end_date': pd.date_range('2005-01-21', periods=5, freq='48M'),
                      'president_id': [43, 43, 44, 44, 45]})
war_declarations = pd.DataFrame({"date": [datetime(2001, 9, 14), datetime(2003, 3, 3)],
                                 "name": ["War in Afghanistan", "Iraq War"]})

# Make the db in memory
conn = sqlite3.connect(':memory:')
# Write the tables
terms.to_sql('terms', conn, index=False)
presidents.to_sql('presidents', conn, index=False)
war_declarations.to_sql('wars', conn, index=False)


qry = '''
select
    start_date PresTermStart,
    end_date PresTermEnd,
    wars.date WarStart,
    presidents.name Pres
from
    terms
    join wars on date between start_date and end_date
    join presidents on terms.president_id = presidents.president_id
'''
df = pd.read_sql_query(qry, conn)

df:

         PresTermStart          PresTermEnd             WarStart  Pres
0  2001-01-31 00:00:00  2005-01-31 00:00:00  2001-09-14 00:00:00  Bush
1  2001-01-31 00:00:00  2005-01-31 00:00:00  2003-03-03 00:00:00  Bush

You should be able to do this now using the pandasql package:

import pandasql as ps

sqlcode = '''
select A.cusip
from A
inner join B on A.cusip = B.ncusip
where A.fdate >= B.namedt and A.fdate <= B.nameenddt
group by A.cusip
'''

newdf = ps.sqldf(sqlcode, locals())
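For a self-contained run, here are hypothetical stand-ins for the question's A and B (my addition; the real tables come from the asker's data):

import pandas as pd

# hypothetical stand-ins for the question's dataframes
A = pd.DataFrame({'cusip': ['00001', '00002'],
                  'fdate': pd.to_datetime(['2010-06-15', '2011-03-01'])})
B = pd.DataFrame({'ncusip': ['00001', '00002'],
                  'namedt': pd.to_datetime(['2010-01-01', '2011-06-01']),
                  'nameenddt': pd.to_datetime(['2010-12-31', '2011-12-31'])})
# only cusip '00001' falls inside its date range, so only it survives the join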

I think the answer from @ChuHo is good. I believe pandasql is doing the same for you. I haven't benchmarked the two, but it is easier to read.
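If someone does want to benchmark the two, a minimal timing harness (my sketch, assuming the question's A and B and the sqlcode above are in scope) could look like this:

import time

def merge_then_filter():
    # the asker's original two-step approach
    df = pd.merge(A, B, how='inner', left_on='cusip', right_on='ncusip')
    return df[(df['fdate'] >= df['namedt']) & (df['fdate'] <= df['nameenddt'])]

def sql_join():
    # the pandasql approach from the answer above
    return ps.sqldf(sqlcode, globals())

for fn in (merge_then_filter, sql_join):
    start = time.perf_counter()
    fn()
    print(fn.__name__, time.perf_counter() - start)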