如何在熊猫中将列转换为一个日期时间列?

我有一个数据框架,其中的前3列是’月’,’日’,’年’

每列中都有一个整数。 是否有一种 Python 方法可以在数据框中将所有三列转换为日期时间?

来自:

M    D    Y    Apples   Oranges
5    6  1990      12        3
5    7  1990      14        4
5    8  1990      15       34
5    9  1990      23       21

变成:

Datetimes    Apples   Oranges
1990-6-5        12        3
1990-7-5        14        4
1990-8-5        15       34
1990-9-5        23       21
136164 次浏览

In 0.13 (coming very soon), this is heavily optimized and quite fast (but still pretty fast in 0.12); both orders of magnitude faster than looping

In [3]: df
Out[3]:
M  D     Y  Apples  Oranges
0  5  6  1990      12        3
1  5  7  1990      14        4
2  5  8  1990      15       34
3  5  9  1990      23       21


In [4]: df.dtypes
Out[4]:
M          int64
D          int64
Y          int64
Apples     int64
Oranges    int64
dtype: object


# in 0.12, use this
In [5]: pd.to_datetime((df.Y*10000+df.M*100+df.D).apply(str),format='%Y%m%d')


# in 0.13 the above or this will work
In [5]: pd.to_datetime(df.Y*10000+df.M*100+df.D,format='%Y%m%d')
Out[5]:
0   1990-05-06 00:00:00
1   1990-05-07 00:00:00
2   1990-05-08 00:00:00
3   1990-05-09 00:00:00
dtype: datetime64[ns]

I re-approached the problem and I think I found a solution. I initialized the csv file in the following way:

pandas_object = DataFrame(read_csv('/Path/to/csv/file', parse_dates=True, index_col = [2,0,1] ))

Where the:

index_col = [2,0,1]

represents the columns of the [year, month, day]

Only problem now is that now I have three new index columns, one represent the year, another the month, and another the day.

Here is a alternative which uses NumPy datetime64 and timedelta64 arithmetic. It appears to be a bit faster for small DataFrames and much faster for larger DataFrames:

import numpy as np
import pandas as pd


df = pd.DataFrame({'M':[1,2,3,4], 'D':[6,7,8,9], 'Y':[1990,1991,1992,1993]})
#    D  M     Y
# 0  6  1  1990
# 1  7  2  1991
# 2  8  3  1992
# 3  9  4  1993


y = np.array(df['Y']-1970, dtype='<M8[Y]')
m = np.array(df['M']-1, dtype='<m8[M]')
d = np.array(df['D']-1, dtype='<m8[D]')
dates2 = pd.Series(y+m+d)
# 0   1990-01-06
# 1   1991-02-07
# 2   1992-03-08
# 3   1993-04-09
# dtype: datetime64[ns]

In [214]: df = pd.concat([df]*1000)


In [215]: %timeit pd.to_datetime((df['Y']*10000+df['M']*100+df['D']).astype('int'), format='%Y%m%d')
100 loops, best of 3: 4.87 ms per loop


In [216]: %timeit pd.Series(np.array(df['Y']-1970, dtype='<M8[Y]')+np.array(df['M']-1, dtype='<m8[M]')+np.array(df['D']-1, dtype='<m8[D]'))
1000 loops, best of 3: 839 µs per loop

Here's a helper function to make this easier to use:

def combine64(years, months=1, days=1, weeks=None, hours=None, minutes=None,
seconds=None, milliseconds=None, microseconds=None, nanoseconds=None):
years = np.asarray(years) - 1970
months = np.asarray(months) - 1
days = np.asarray(days) - 1
types = ('<M8[Y]', '<m8[M]', '<m8[D]', '<m8[W]', '<m8[h]',
'<m8[m]', '<m8[s]', '<m8[ms]', '<m8[us]', '<m8[ns]')
vals = (years, months, days, weeks, hours, minutes, seconds,
milliseconds, microseconds, nanoseconds)
return sum(np.asarray(v, dtype=t) for t, v in zip(types, vals)
if v is not None)


In [437]: combine64(df['Y'], df['M'], df['D'])
Out[437]: array(['1990-01-06', '1991-02-07', '1992-03-08', '1993-04-09'], dtype='datetime64[D]')

Convert the dataframe to strings for easy string concatenation:

df=df.astype(str)

then convert to datetime, specify the format:

df.index=pd.to_datetime(df.Y+df.M+df.D,format="%Y%m%d")

which replaces the index rather than creating a new column.

Let's assume you've got a dictionary foo with each column of dates in parallel. If so, here's your one liner:

>>> from datetime import datetime
>>> foo = {"M": [1,2,3], "D":[30,30,21], "Y":[1980,1981,1982]}
>>>
>>> df = pd.DataFrame({"Datetime": [datetime(y,m,d) for y,m,d in zip(foo["Y"],foo["M"],foo["D"])]})

The real guts of it are this bit:

>>> [datetime(y,m,d) for y,m,d in zip(foo["Y"],foo["M"],foo["D"])]
[datetime.datetime(1980, 1, 30, 0, 0), datetime.datetime(1981, 2, 28, 0, 0), datetime.datetime(1982, 3, 21, 0, 0)]

This is the sort of thing zip was made for. It takes parallel lists and turns them into tuples. Then they get tuple unpacked (the for y,m,d in bit) by the list comprehension there, then fed into the datetime object constructor.

pandas seems happy with the datetime objects.

In version 0.18.1 you can use to_datetime, but:

  • The names of the columns have to be year, month, day, hour, minute and second:
  • Minimal columns are year, month and day

Sample:

import pandas as pd


df = pd.DataFrame({'year': [2015, 2016],
'month': [2, 3],
'day': [4, 5],
'hour': [2, 3],
'minute': [10, 30],
'second': [21,25]})
    

print df
day  hour  minute  month  second  year
0    4     2      10      2      21  2015
1    5     3      30      3      25  2016


print pd.to_datetime(df[['year', 'month', 'day']])
0   2015-02-04
1   2016-03-05
dtype: datetime64[ns]


print pd.to_datetime(df[['year', 'month', 'day', 'hour']])
0   2015-02-04 02:00:00
1   2016-03-05 03:00:00
dtype: datetime64[ns]


print pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute']])
0   2015-02-04 02:10:00
1   2016-03-05 03:30:00
dtype: datetime64[ns]


print pd.to_datetime(df)
0   2015-02-04 02:10:21
1   2016-03-05 03:30:25
dtype: datetime64[ns]

Another solution is convert to dictionary:

print df
M  D     Y  Apples  Oranges
0  5  6  1990      12        3
1  5  7  1990      14        4
2  5  8  1990      15       34
3  5  9  1990      23       21


print pd.to_datetime(dict(year=df.Y, month=df.M, day=df.D))
0   1990-05-06
1   1990-05-07
2   1990-05-08
3   1990-05-09
dtype: datetime64[ns]
 [pd.to_datetime(str(a)+str(b)+str(c),
format='%m%d%Y'
) for a,b,c in zip(df.M, df.D, df.Y)]

Even better way to do is as below:

import pandas as pd


import datetime


dataset = pd.read_csv('dataset.csv')


date=dataset.apply(lambda x: datetime.date(int(x['Yr']), x['Mo'], x['Dy']),axis=1)


date = pd.to_datetime(date)


dataset = dataset.drop(columns=['Yr', 'Mo', 'Dy'])


dataset.insert(0, 'Date', date)


dataset.head()