views:

29

answers:

1

I am trying to write some code that will function like google calendars quick add feature . You know the One where you can input any of the following : 1) 24th sep 2010 , Johns Birthday 2) John's Birthday , 24/9/10 3) 24 September 2010 , Birthday of John Doe 4) 24-9-2010 : John Does Birthday 5) John Does Birthday 24th of September 2010

And it can figure out that we want an event on a date 24/9/2010 have the rest of the material as the event text.

I want to do this is python .

I am thinking of a design where I write regular expressions that may match all of the cases listed above and extract the date. But I am sur there is a smarter way to approach this problem . Since I clearly am not trained in lexical analysis or the many types of parsers styles. I am looking for whats a good way to approach this problem.

+1  A: 

NOTE: The python code here is not correct! It is just a rough pseudo-code of how it might look.

Regular Expressions are good at finding and extracting data from text in a fixed format (e.g. a DD/MM/YYYY date).

A lexer/parser pair is good at processing data in a structured, but somewhat variable format. Lexers split text into tokens. These tokens are units of information of a given type (number, string, etc.). Parsers take this series of tokens and does something depending on the order of the tokens.

Looking at the data, you have a basic (subject, verb, object) structure in different combinations for the relation (person, 'birthday', date):

I would handle 29/9/10 and 24-9-2010 as a single token using a regex, returning it as a date type. You could probably do the same for the other dates, with a map to convert September and sep to 9.

You could then return the everything else as strings (separated by whitespace).

You then have:

  1. date ',' string 'birthday'
  2. string 'birthday' ',' date
  3. date 'birthday' 'of' string string
  4. date ':' string string 'birthday'
  5. string string 'birthday' date

NOTE: 'birthday', ',', ':' and 'of' here are keywords, so:

class Lexer:
    DATE = 1
    STRING = 2
    COMMA = 3
    COLON = 4
    BIRTHDAY = 5
    OF = 6

    keywords = { 'birthday': BIRTHDAY, 'of': OF, ',': COMMA, ':', COLON }

    def next_token():
        if have_saved_token:
            have_saved_token = False
            return saved_type, saved_value
        if date_re.match(): return DATE, date
        str = read_word()
        if str in keywords.keys(): return keywords[str], str
        return STRING, str

    def keep(type, value):
        have_saved_token = True
        saved_type = type
        saved_value = value

All except 3 use the possessive form of the person ('s if the last character is a consonant, s if it is a vowel). This can be tricky, as 'Alexis' could be the plural form of 'Alexi', but since you are restricting where plural forms can be, it is easy to detect:

def parseNameInPluralForm():
    name = parseName()
    if name.ends_with("'s"): name.remove_from_end("'s")
    elif name.ends_with("s"): name.remove_from_end("s")
    return name

Now, name can either be first-name or first-name last-name (yes, I know Japan swaps these around, but from a processing perspective, the above problem does not need to differentiate first and last names). The following will handle these two forms:

def parseName():
    type, firstName = Lexer.next_token()
    if type != Lexer.STRING: raise ParseError()
    type, lastName = Lexer.next_token()
    if type == Lexer.STRING: # first-name last-name
        return firstName + ' ' + lastName
    else:
        Lexer.keep(type, lastName)
        return firstName

Finally, you can process forms 1-5 using something like this:

def parseBirthday():
    type, data = Lexer.next_token()
    if type == Lexer.DATE: # 1, 3 & 4
        date = data
        type, data = Lexer.next_token()
        if type == Lexer.COLON or type == Lexer.COMMA: # 1 & 4
            person = parsePersonInPluralForm()
            type, data = Lexer.next_token()
            if type != Lexer.BIRTHDAY: raise ParseError()
        elif type == Lexer.BIRTHDAY: # 3
            type, data = Lexer.next_token()
            if type != Lexer.OF: raise ParseError()
            person = parsePerson()
    elif type == Lexer.STRING: # 2 & 5
        Lexer.keep(type, data)
        person = parsePersonInPluralForm()
        type, data = Lexer.next_token()
        if type != Lexer.BIRTHDAY: raise ParseError()
        type, data = Lexer.next_token()
        if type == Lexer.COMMA: # 2
            type, data = Lexer.next_token()
        if type != Lexer.DATE: raise ParseError()
        date = data
    else:
        raise ParseError()
    return person, date
reece