Python pandas: efficiently convert datetime to timestamp via the dt accessor

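I have a DataFrame with a very large number of rows (hundreds of millions). I want to convert a datetime column to Unix timestamps efficiently. How can I do that?

My sample df: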

import datetime as dt
import pandas as pd

# pd.date_range replaces the removed pd.DatetimeIndex(start=..., end=..., freq=...) constructor
df = pd.DataFrame({'datetime': pd.date_range(start=dt.datetime(2016,1,1,0,0,1),
                                             end=dt.datetime(2016,1,2,0,0,1), freq='h')})

It looks like this:

             datetime
0 2016-01-01 00:00:01
1 2016-01-01 01:00:01
2 2016-01-01 02:00:01
3 2016-01-01 03:00:01
4 2016-01-01 04:00:01

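Currently I use .apply() to convert each datetime to a timestamp one value at a time, but with hundreds of millions of rows this takes a very long time (several hours):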

df['ts'] = df[['datetime']].apply(lambda x: x[0].timestamp(), axis=1).astype(int)

Output:

             datetime          ts
0 2016-01-01 00:00:01  1451602801
1 2016-01-01 01:00:01  1451606401
2 2016-01-01 02:00:01  1451610001
3 2016-01-01 03:00:01  1451613601
4 2016-01-01 04:00:01  1451617201

The result above is exactly what I want.

If I try to use the .dt accessor of pandas.Series, I get an error message:

df['ts'] = df['datetime'].dt.timestamp

AttributeError: 'DatetimeProperties' object has no attribute 'timestamp'

If I use the .dt accessor to extract the date part of the datetime, it is much faster than using .apply():

df['date'] = df['datetime'].dt.date

Output:

             datetime          ts        date
0 2016-01-01 00:00:01  1451602801  2016-01-01
1 2016-01-01 01:00:01  1451606401  2016-01-01
2 2016-01-01 02:00:01  1451610001  2016-01-01
3 2016-01-01 03:00:01  1451613601  2016-01-01
4 2016-01-01 04:00:01  1451617201  2016-01-01

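I would like something similar for timestamps...

But I don't really understand the official documentation: it talks about "Converting to Timestamps", yet I don't see any timestamps there. It only covers converting to datetime with pd.to_datetime(), not converting to timestamps...

The pandas.Timestamp constructor doesn't work either (it raises the following error):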

df['ts2'] = pd.Timestamp(df['datetime'])

TypeError: Cannot convert input to Timestamp

pandas.Series.to_timestamp also does something completely different from what I want:

df['ts3'] = df['datetime'].to_timestamp

Output:

             datetime          ts                                                ts3
0 2016-01-01 00:00:01  1451602801  <bound method Series.to_timestamp of 0    2016...
1 2016-01-01 01:00:01  1451606401  <bound method Series.to_timestamp of 0    2016...
2 2016-01-01 02:00:01  1451610001  <bound method Series.to_timestamp of 0    2016...
3 2016-01-01 03:00:01  1451613601  <bound method Series.to_timestamp of 0    2016...
4 2016-01-01 04:00:01  1451617201  <bound method Series.to_timestamp of 0    2016...


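Best answer

I think you first need to convert to a numpy array via values and then cast to int64. The result is in nanoseconds, so it has to be divided by 10 ** 9 to get seconds: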

df['ts'] = df.datetime.values.astype(np.int64) // 10 ** 9
print (df)
              datetime          ts
0  2016-01-01 00:00:01  1451606401
1  2016-01-01 01:00:01  1451610001
2  2016-01-01 02:00:01  1451613601
3  2016-01-01 03:00:01  1451617201
4  2016-01-01 04:00:01  1451620801
5  2016-01-01 05:00:01  1451624401
6  2016-01-01 06:00:01  1451628001
7  2016-01-01 07:00:01  1451631601
8  2016-01-01 08:00:01  1451635201
9  2016-01-01 09:00:01  1451638801
10 2016-01-01 10:00:01  1451642401
11 2016-01-01 11:00:01  1451646001
12 2016-01-01 12:00:01  1451649601
13 2016-01-01 13:00:01  1451653201
14 2016-01-01 14:00:01  1451656801
15 2016-01-01 15:00:01  1451660401
16 2016-01-01 16:00:01  1451664001
17 2016-01-01 17:00:01  1451667601
18 2016-01-01 18:00:01  1451671201
19 2016-01-01 19:00:01  1451674801
20 2016-01-01 20:00:01  1451678401
21 2016-01-01 21:00:01  1451682001
22 2016-01-01 22:00:01  1451685601
23 2016-01-01 23:00:01  1451689201
24 2016-01-02 00:00:01  1451692801

to_timestamp is for converting a PeriodIndex to a DatetimeIndex, which is why it does not do what you want here.
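
For anyone who wants to try this on a fresh session, here is a minimal self-contained sketch of the same idea. The five-row frame and the explicit cast to datetime64[ns] are my own additions for illustration: the division by 10**9 only makes sense if the underlying resolution really is nanoseconds, so the sketch pins it.

```python
import numpy as np
import pandas as pd

# Hourly sample data; the explicit cast pins the resolution to nanoseconds,
# which is what the division by 10**9 below assumes.
df = pd.DataFrame({
    "datetime": pd.date_range("2016-01-01 00:00:01", periods=5,
                              freq=pd.Timedelta(hours=1)),
})
df["datetime"] = df["datetime"].astype("datetime64[ns]")

# Reinterpret the datetimes as int64 nanoseconds since the epoch, then
# floor-divide to whole seconds. Naive datetimes are treated as UTC.
df["ts"] = df["datetime"].values.astype(np.int64) // 10**9
print(df)
```

Because the whole column is converted in one vectorized step, no Python-level function is called per row, which is where .apply() loses its time.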

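
There is another way to do this, using asi8, a "hidden" attribute of DatetimeIndex that returns the values as integer timestamps (in nanoseconds for the default resolution).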

pd.DatetimeIndex(df.datetime).asi8

Wes McKinney suggested it in a tangentially related Stack Overflow question.
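
A short sketch of how asi8 could be used to get whole seconds. The two-row frame is made up for illustration, and I cast to datetime64[ns] explicitly so the nanosecond assumption behind the division holds.

```python
import pandas as pd

df = pd.DataFrame({
    "datetime": pd.to_datetime(["2016-01-01 00:00:01", "2016-01-01 01:00:01"]),
})

# Build a DatetimeIndex from the column and pin it to nanosecond resolution.
idx = pd.DatetimeIndex(df["datetime"]).astype("datetime64[ns]")

# asi8 exposes the underlying int64 values (nanoseconds since the epoch here).
ns = idx.asi8
df["ts"] = ns // 10**9
print(df)
```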


If you don't want to use numpy, you can do the conversion in pure pandas:

df['ts'] = pd.to_timedelta(df['datetime'], unit='ns').dt.total_seconds().astype(int)
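
In case the to_timedelta call above is rejected by your pandas version (I would not rely on it accepting datetime input everywhere), another pure-pandas route is to subtract the epoch and floor-divide by one second. This is a sketch on a made-up two-row frame, not a claim about what the answer's author intended.

```python
import pandas as pd

df = pd.DataFrame({
    "datetime": pd.to_datetime(["2016-01-01 00:00:01", "2016-01-01 01:00:01"]),
})

# Subtracting the Unix epoch gives a timedelta column; floor-dividing by a
# one-second Timedelta turns it into whole seconds as integers.
epoch = pd.Timestamp("1970-01-01")
df["ts"] = (df["datetime"] - epoch) // pd.Timedelta(seconds=1)
print(df)
```

This version does not depend on the column's internal resolution, since the unit is stated explicitly in the divisor.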


I don't think you should use apply here; a simple astype is enough:

df['ts'] = df.datetime.astype('int64') // 10**9
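
One thing to watch: an integer cast like this treats naive datetimes as UTC. The values in the question are 3600 seconds lower than the ones in the best answer, which is what you would get if the datetimes were interpreted as UTC+1 local time (that is my assumption about the asker's environment, nothing more). If your column really holds local wall-clock times, a sketch like the following localizes first. The fixed UTC+1 offset is purely illustrative.

```python
import datetime as dt
import pandas as pd

df = pd.DataFrame({
    "datetime": pd.to_datetime(["2016-01-01 00:00:01", "2016-01-01 01:00:01"]),
})

# Assumption for illustration: the column holds wall-clock times at UTC+1.
local_tz = dt.timezone(dt.timedelta(hours=1))
aware = df["datetime"].dt.tz_localize(local_tz)

# Subtract a tz-aware epoch and floor-divide by one second.
epoch = pd.Timestamp("1970-01-01", tz="UTC")
df["ts_local"] = (aware - epoch) // pd.Timedelta(seconds=1)
print(df)
```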


One option is to use a lambda expression like this:

df['datetime'] = df['datetime'].apply(lambda x: pd.Timestamp(x))

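
The simplest way to convert a pandas datetime column to Unix timestamps is the following (note that the resulting integers are nanoseconds since the epoch, not seconds):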
df['datetime'].values.tolist()


Another option is to use pandas.to_numeric:

df['datetime'] = pandas.to_numeric(df['datetime'].values) / 10 ** 9

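
Plenty of correct answers here; just be aware of the deprecation notice when using astype. The recommended approach now is the following (the result is in nanoseconds, so divide by 10 ** 9 for seconds):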

df['ts'] = df.datetime.view('int64')