Best way to identify and extract dates from text in Python?

redct · Nov 15, 2013 · Viewed 38.8k times

As part of a larger personal project I'm working on, I'm attempting to separate out inline dates from a variety of text sources.

For example, I have a large list of strings (usually English sentences or statements) that take a variety of forms:

Central design committee session Tuesday 10/22 6:30 pm

Th 9/19 LAB: Serial encoding (Section 2.2)

There will be another one on December 15th for those who are unable to make it today.

Workbook 3 (Minimum Wage): due Wednesday 9/18 11:59pm

He will be flying in Sept. 15th.

While these dates are inline with natural text, none of them are written in natural-language form themselves (e.g., there's no "The meeting will be two weeks from tomorrow"; every date is explicit).

As someone who doesn't have too much experience with this kind of processing, what would be the best place to begin? I've looked into things like the dateutil.parser module and parsedatetime, but those seem to be for after you've isolated the date.

Because of this, is there any good way to separate the date from the extraneous text, e.g.

input:  Th 9/19 LAB: Serial encoding (Section 2.2)
output: ['Th 9/19', 'LAB: Serial encoding (Section 2.2)']

or something similar? It seems like this sort of processing is done by applications like Gmail and Apple Mail, but is it possible to implement in Python?

Answer

akoumjian · Jan 28, 2016

I was also looking for a solution to this and couldn't find any, so a friend and I built a tool to do this. I thought I would come back and share it in case others find it helpful.

datefinder -- find and extract dates inside text

Here's an example:

import datefinder

string_with_dates = '''
    Central design committee session Tuesday 10/22 6:30 pm
    Th 9/19 LAB: Serial encoding (Section 2.2)
    There will be another one on December 15th for those who are unable to make it today.
    Workbook 3 (Minimum Wage): due Wednesday 9/18 11:59pm
    He will be flying in Sept. 15th.
    We expect to deliver this between late 2021 and early 2022.
'''

matches = datefinder.find_dates(string_with_dates)
for match in matches:
    print(match)
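
For the split the question asks for (date text on one side, remaining text on the other), here is a rough standard-library sketch. To be clear, this is not how datefinder works internally; it is a hand-rolled regex that only covers the explicit formats in the sample strings, and the names DATE_RE and split_date are made up for illustration.

```python
import re

# Hypothetical patterns covering only the sample formats from the question.
WEEKDAY = (r"(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"
           r"|Mon|Tues?|Wed|Thu(?:rs?)?|Th|Fri|Sat|Sun)\.?")
MONTH = (r"(?:January|February|March|April|May|June|July|August|September"
         r"|October|November|December"
         r"|Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sept?|Oct|Nov|Dec)\.?")
NUMERIC = r"\d{1,2}/\d{1,2}(?:/\d{2,4})?\b"          # 9/19, 10/22, 9/19/2013
NAMED = MONTH + r"\s+\d{1,2}(?:st|nd|rd|th)?\b"      # December 15th, Sept. 15th
TIME = r"\d{1,2}:\d{2}(?:\s*[ap]m)?\b"               # 6:30 pm, 11:59pm

# Optional weekday, then a numeric or named date, then an optional time.
DATE_RE = re.compile(
    r"\b(?:" + WEEKDAY + r"\s+)?(?:" + NUMERIC + "|" + NAMED + r")"
    r"(?:\s+" + TIME + r")?",
    re.IGNORECASE,
)


def split_date(text):
    """Return (date_text, remaining_text); date_text is None if nothing matched."""
    match = DATE_RE.search(text)
    if match is None:
        return None, text.strip()
    # Remove the matched span and collapse the whitespace left behind.
    rest = text[:match.start()] + " " + text[match.end():]
    return match.group(0), " ".join(rest.split())


print(split_date("Th 9/19 LAB: Serial encoding (Section 2.2)"))
# -> ('Th 9/19', 'LAB: Serial encoding (Section 2.2)')
```

Only the first date in each string is returned, and relative phrases or bare years such as "late 2021" are deliberately ignored. The extracted date text could then be handed to a parser such as dateutil.parser to get an actual datetime.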