Extracting a date from a string in Python

How can I extract the date from a string like "monkey 2010-07-10 love banana"? Thanks!


If the date is given in a fixed form, you can simply use a regular expression to extract the date and "datetime.datetime.strptime" to parse the date:

import re
from datetime import datetime


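text = 'monkey 2010-07-10 love banana'

# Find the first YYYY-MM-DD substring, then parse it into a date object.
match = re.search(r'\d{4}-\d{2}-\d{2}', text)
date = datetime.strptime(match.group(), '%Y-%m-%d').date()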

Otherwise, if the date is given in an arbitrary form, you can't extract it easily.
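If the string may contain several fixed-form dates, or a run of digits that only looks like a date, the same regex-plus-strptime idea can be extended. This is a minimal sketch, assuming YYYY-MM-DD dates; the helper name extract_iso_dates is made up for illustration:

```python
import re
from datetime import datetime

ISO_DATE = re.compile(r'\d{4}-\d{2}-\d{2}')

def extract_iso_dates(text):
    """Hypothetical helper: return every valid YYYY-MM-DD date found in text."""
    dates = []
    for candidate in ISO_DATE.findall(text):
        try:
            dates.append(datetime.strptime(candidate, '%Y-%m-%d').date())
        except ValueError:
            # Matches the pattern but is not a real date, e.g. 2010-07-32.
            continue
    return dates

print(extract_iso_dates('monkey 2010-07-10 love banana 2010-07-32 and 2011-01-05'))
# [datetime.date(2010, 7, 10), datetime.date(2011, 1, 5)]
```

Letting strptime do the validation keeps the regex simple: the pattern only finds candidates, and impossible dates are dropped instead of raising.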

Using python-dateutil:

In [1]: import dateutil.parser as dparser


In [18]: dparser.parse("monkey 2010-07-10 love banana",fuzzy=True)
Out[18]: datetime.datetime(2010, 7, 10, 0, 0)

Invalid dates raise a ValueError:

In [19]: dparser.parse("monkey 2010-07-32 love banana",fuzzy=True)
# ValueError: day is out of range for month

It can recognize dates in many formats:

In [20]: dparser.parse("monkey 20/01/1980 love banana",fuzzy=True)
Out[20]: datetime.datetime(1980, 1, 20, 0, 0)

Note that it makes a guess if the date is ambiguous:

In [23]: dparser.parse("monkey 10/01/1980 love banana",fuzzy=True)
Out[23]: datetime.datetime(1980, 10, 1, 0, 0)

But the way it parses ambiguous dates is customizable:

In [21]: dparser.parse("monkey 10/01/1980 love banana",fuzzy=True, dayfirst=True)
Out[21]: datetime.datetime(1980, 1, 10, 0, 0)
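The ambiguity itself can be reproduced with the standard library alone. This is only a sketch to show why a guess is needed; it uses datetime.strptime with two explicit formats and is not how dateutil works internally:

```python
from datetime import datetime

ambiguous = '10/01/1980'

# Month-first reading: October 1, 1980.
month_first = datetime.strptime(ambiguous, '%m/%d/%Y')
# Day-first reading: January 10, 1980.
day_first = datetime.strptime(ambiguous, '%d/%m/%Y')

print(month_first.date())  # 1980-10-01
print(day_first.date())    # 1980-01-10

# '20/01/1980' is not ambiguous: there is no month 20, so the
# month-first format fails and only the day-first reading remains.
try:
    datetime.strptime('20/01/1980', '%m/%d/%Y')
    only_day_first = False
except ValueError:
    only_day_first = True
print(only_day_first)  # True
```

Both readings of 10/01/1980 are valid dates, so no parser can know which one was meant without a hint such as dayfirst.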

To extract a date from a string in Python, the best module I know of is datefinder.

You can use it in your Python project by following the easy steps given below.

Step 1: Install datefinder Package

pip install datefinder

Step 2: Use It In Your Project

import datefinder


input_string = "monkey 2010-07-10 love banana"
# find_dates returns a generator; it is converted to a list here. See the note of caution at the bottom.
matches = list(datefinder.find_dates(input_string))

if matches:
    # Each match is a datetime.datetime object; only the first one is used here.
    date = matches[0]
    print(date)
else:
    print('No dates found')

Note: if you expect a large number of matches, converting the generator to a list is not recommended, because building the whole list adds considerable overhead.
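When only the first match is needed, the generator can be consumed lazily with next() instead of list(). The sketch below uses a stand-in generator built on re.finditer so that it runs without datefinder installed; find_iso_dates is a hypothetical name, and I am assuming the same idiom carries over to the generator returned by datefinder.find_dates:

```python
import re
from datetime import datetime

def find_iso_dates(text):
    """Stand-in generator (hypothetical): yield YYYY-MM-DD dates lazily."""
    for match in re.finditer(r'\d{4}-\d{2}-\d{2}', text):
        try:
            yield datetime.strptime(match.group(), '%Y-%m-%d')
        except ValueError:
            continue  # skip impossible dates such as 2010-07-32

# next() pulls one item and stops; the default None avoids StopIteration
# when the generator yields nothing.
first = next(find_iso_dates('monkey 2010-07-10 love banana 2011-01-05'), None)
print(first if first is not None else 'No dates found')
# 2010-07-10 00:00:00
```

This way the rest of the text is never scanned once the first date has been found.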

Using Pygrok, you can define abstracted extensions to the Regular Expression syntax.

The custom patterns can be included in your regex in the format %{PATTERN_NAME}.

You can also create a label for that pattern by separating it with a colon: %{PATTERN_NAME:matched_string}. If the pattern matches, the value will be returned as part of the resulting dictionary (e.g. result.get('matched_string')).

For example:

from pygrok import Grok


input_string = 'monkey 2010-07-10 love banana'
date_pattern = '%{YEAR:year}-%{MONTHNUM:month}-%{MONTHDAY:day}'


grok = Grok(date_pattern)
print(grok.match(input_string))

The resulting value will be a dictionary:

{'month': '07', 'day': '10', 'year': '2010'}

If date_pattern does not match anything in input_string, the return value will be None. By contrast, if your pattern does not have any labels, it will return an empty dictionary {}.
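Note that the dictionary values are strings. If you need an actual date object, a small conversion step is enough. This is a minimal sketch that works on the dictionary shape shown above; fields_to_date is a made-up helper name and not part of Pygrok:

```python
from datetime import date

# Same shape as the dictionary shown above; the values are strings.
fields = {'month': '07', 'day': '10', 'year': '2010'}

def fields_to_date(fields):
    """Hypothetical helper: build a date from string year/month/day fields."""
    if fields is None:
        # No match: there is nothing to convert.
        return None
    return date(int(fields['year']), int(fields['month']), int(fields['day']))

print(fields_to_date(fields))  # 2010-07-10
print(fields_to_date(None))    # None
```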


If you know the position of the date object in the string (for example in a log file), you can use .split()[index] to extract the date without fully knowing the format.

For example:

>>> string = 'monkey 2010-07-10 love banana'
>>> date = string.split()[1]
>>> date
'2010-07-10'
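Keep in mind that split() gives you a plain string. Turning it into a date object still requires a format assumption, and the index may not exist in every line. A minimal sketch, assuming the YYYY-MM-DD format; date_at is a hypothetical helper name:

```python
from datetime import datetime

def date_at(string, index, fmt='%Y-%m-%d'):
    """Hypothetical helper: parse the whitespace-separated token at index."""
    tokens = string.split()
    try:
        return datetime.strptime(tokens[index], fmt).date()
    except (IndexError, ValueError):
        # Either there is no token at that position,
        # or the token is not a date in the given format.
        return None

print(date_at('monkey 2010-07-10 love banana', 1))  # 2010-07-10
print(date_at('monkey 2010-07-10 love banana', 0))  # None
```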

You could also try the dateparser module, which may be slower than datefinder on free text but which should cover more potential cases and date formats, as well as a significant number of languages.

Hands Down The Best Ways

There are two good modules on PyPI and GitHub that make this task easier. They are:

  1. DATEFINDER Module, useful for finding dates in strings of text.

Installation: pip install datefinder

EXAMPLE

import datefinder


input_string = "monkey 2010-07-10 love banana"
# find_dates returns a generator; it is converted to a list here.
matches = list(datefinder.find_dates(input_string))

if matches:
    # Each match is a datetime.datetime object; only the first one is used here.
    date = matches[0]
    print(date)
else:
    print('No dates found')


SOURCE: Finny Abraham

  2. DATEPARSER Module, extremely useful for scraping dates from HTML pages in many languages. It also supports the Hijri and Jalali calendars and more than 200 language locales in different formats.

Features

Generic parsing of dates in over 200 language locales plus numerous formats in a language agnostic fashion. Generic parsing of relative dates like: '1 min ago', '2 weeks ago', '3 months, 1 week and 1 day ago', 'in 2 days', 'tomorrow'.

Advanced Features

Generic parsing of dates with time zone abbreviations or UTC offsets like: 'August 14, 2015 EST', 'July 4, 2013 PST', '21 July 2013 10:15 pm +0500'. Date lookup in longer texts. Support for non-Gregorian calendar systems. Extensive test coverage.

SOURCE CODE [Example]

>>> from dateparser import parse
>>> parse('1 hour ago')
datetime.datetime(2015, 5, 31, 23, 0)
>>> parse('Il ya 2 heures')  # French (2 hours ago)
datetime.datetime(2015, 5, 31, 22, 0)
>>> parse('1 anno 2 mesi')  # Italian (1 year 2 months)
datetime.datetime(2014, 4, 1, 0, 0)
>>> parse('yaklaşık 23 saat önce')  # Turkish (23 hours ago)
datetime.datetime(2015, 5, 31, 1, 0)
>>> parse('Hace una semana')  # Spanish (a week ago)
datetime.datetime(2015, 5, 25, 0, 0)
>>> parse('2小时前')  # Chinese (2 hours ago)
datetime.datetime(2015, 5, 31, 22, 0)
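The outputs above depend on when the code was run, because relative expressions are resolved against the current time. The toy sketch below shows the idea for English "N units ago" phrases only, using the standard library. It is not how dateparser is implemented, and parse_ago and the fixed base time are assumptions chosen so that the first result matches the output above:

```python
import re
from datetime import datetime, timedelta

# Map the unit words this toy parser understands to timedelta keywords.
UNITS = {'min': 'minutes', 'minute': 'minutes', 'hour': 'hours',
         'day': 'days', 'week': 'weeks'}
PATTERN = re.compile(r'(\d+)\s+(min|minute|hour|day|week)s?\s+ago')

def parse_ago(text, now):
    """Hypothetical helper: resolve 'N units ago' relative to now."""
    match = PATTERN.search(text)
    if match is None:
        return None
    amount, unit = int(match.group(1)), match.group(2)
    return now - timedelta(**{UNITS[unit]: amount})

base = datetime(2015, 6, 1, 0, 0)  # assumed "current" time
print(parse_ago('1 hour ago', base))   # 2015-05-31 23:00:00
print(parse_ago('2 weeks ago', base))  # 2015-05-18 00:00:00
```

Passing the base time explicitly makes the results reproducible, which is handy in tests.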


HARD MODE:

If your dates are not separated by whitespace from surrounding text, combining datefinder with wordninja will solve this problem:

>>> import datefinder
>>> import wordninja
>>> example = '04.02.22ILeftMyHeartInSF ---> I Left My Heart In Sf - blah blah blah'
>>> list(datefinder.find_dates(' '.join(wordninja.split(example))))
[datetime.datetime(2022, 4, 22, 0, 0)]

Well, sort of. That date was actually in February 2004, not April 2022, but any tool would have to guess.

Just to be clear, this is what wordninja does to squishedtogethertext:

>>> wordninja.split(example)
['04', '02', '22', 'I', 'Left', 'My', 'Heart', 'In', 'SF', 'I', 'Left', 'My', 'Heart', 'In', 'Sf', 'blah', 'blah', 'blah']
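To see why any tool has to guess here, you can try the plausible readings of those six digits yourself. This is a minimal sketch using only the standard library; it assumes the date has the shape NN.NN.NN, which a regex can find even without surrounding whitespace:

```python
import re
from datetime import datetime

example = '04.02.22ILeftMyHeartInSF'

# The date is glued to the text, but a regex does not need whitespace.
token = re.search(r'\d{2}\.\d{2}\.\d{2}', example).group()

# Three plausible readings of the same six digits.
readings = {}
for fmt in ('%y.%m.%d', '%d.%m.%y', '%m.%d.%y'):
    readings[fmt] = datetime.strptime(token, fmt).date()
    print(fmt, readings[fmt])
# %y.%m.%d 2004-02-22
# %d.%m.%y 2022-02-04
# %m.%d.%y 2022-04-02
```

All three are valid dates, so only knowledge about the source of the data (here, year first) can pick the right one.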