ansaurus

Question

Answer 1

A:

You have given two sample patterns for the text files.
I think both can be handled with scripting,
for example AWK, sed, and grep driven from a bash script.

The first sample follows this pattern:

Section starts with keyword Location [Number]
second line of section has columns describing product names
third line of section has columns with prices for the products

There can be variable number of products per section.
There can be variable number of sections per file.
Products and prices are always on their designated lines of a section.
Whitespace separation identifies the (product,price) column-association.
Number of products in a section matches the number of prices in that section.
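
The rules above could be sketched as a small line-based parser. This uses Python rather than the AWK/sed/grep tools suggested here, and it is only an illustration: the sample text, the name parse_sections, and the assumption that columns are separated by two or more spaces are mine, not taken from the question.

```python
import re

SAMPLE = """\
Some introductory text to ignore

Location 1
Product 1     Product 2     Product 3
$20.99        $21.99        $33.79

stuff in here you want to ignore

Location 2
Product 1     Product 2     Product 3
$24.99        $22.88        $35.59
"""

# Assumption: columns are separated by two or more spaces, so a product
# name such as "Product 1" may itself contain a single space.
COLUMN_GAP = re.compile(r"\s{2,}")
SECTION_START = re.compile(r"^Location\s+(\d+)$")


def parse_sections(text):
    """Return a list of (location_number, {product: price}) tuples."""
    lines = [line.strip() for line in text.splitlines()]
    sections = []
    for i, line in enumerate(lines):
        match = SECTION_START.match(line)
        if not match or i + 2 >= len(lines):
            continue
        # Products and prices sit on the two lines after the header.
        products = COLUMN_GAP.split(lines[i + 1])
        prices = COLUMN_GAP.split(lines[i + 2])
        if len(products) != len(prices):
            raise ValueError("product/price count mismatch in section " + match.group(1))
        sections.append((int(match.group(1)), dict(zip(products, prices))))
    return sections


sections = parse_sections(SAMPLE)
for number, prices in sections:
    print(number, prices)
```

Each tuple could then be turned into rows for whatever database is used.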

The collected data would probably then be loaded into a database.

nik 2009-08-07 15:47:39

Answer 2

A:

The one thing I know I would use here is regular expressions. Three or four expressions could drive the parse logic for each e-mail format.

Trying to write the parse engine more generally than that would, I think, be skirting the edge of overprogramming it.
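
As a rough sketch of the "three or four expressions" idea, assuming a format with Location headers, product names of the form "Product N", and dollar prices (the patterns, the sample text, and the name extract are hypothetical, and a second e-mail format would need its own set of expressions):

```python
import re

# Hypothetical patterns for one e-mail format.
LOCATION_RE = re.compile(r"^[ \t]*Location[ \t]+(\d+)[ \t]*$", re.MULTILINE)
PRODUCT_RE = re.compile(r"Product\s+\d+")
PRICE_RE = re.compile(r"\$\d+\.\d{2}")

SAMPLE = """\
Location 1
Product 1     Product 2
$20.99        $21.99

ignore this

Location 2
Product 1
$24.99
"""


def extract(text):
    """Yield (location, products, prices) for every Location header found."""
    headers = list(LOCATION_RE.finditer(text))
    for index, header in enumerate(headers):
        # A section runs from its header to the next header (or end of text).
        end = headers[index + 1].start() if index + 1 < len(headers) else len(text)
        body = text[header.end():end]
        yield int(header.group(1)), PRODUCT_RE.findall(body), PRICE_RE.findall(body)


results = list(extract(SAMPLE))
for location, products, prices in results:
    print(location, products, prices)
```

This only works if the ignored text between sections never contains something that looks like a product name or a price.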

John Pirie 2009-08-07 18:18:19

Thanks. Right now we're exploring because it may be that the more general method could triple revenues, which changes our usual definition of 'overprogramming'. :) If you have ideas about more general methods, even if they seem like they might be overkill, please add them.

Scott Saunders 2009-08-07 18:28:17

Answer 3

+2 A:

I think this problem would be suitable for a proper parser generator. Regular expressions are too difficult to test and debug when they go wrong. However, I would go for a parser generator that is as simple to use as if it were part of the language.

For this type of task I would go with pyparsing, as it has the power of a full recursive-descent parser without a difficult grammar to define, and it has very good helper functions. The code is easy to read, too.

from pyparsing import *

aaa ="""    This is example text that could be many lines long...
             another line

    Location 1
    Product 1     Product 2     Product 3
    $20.99        $21.99        $33.79

    stuff in here you want to ignore

    Location 2
    Product 1     Product 2     Product 3
    $24.99        $22.88        $35.59 """

# "Location" could be replaced by any other kind of match, such as a regular expression.
result = (SkipTo("Location").suppress()
          + OneOrMore(Word(alphas) + Word(nums))
          + OneOrMore(Word(nums + "$.")))

all_results = OneOrMore(Group(result))

parsed = all_results.parseString(aaa)

for block in parsed:
    print(block)

This returns a list of lists.

['Location', '1', 'Product', '1', 'Product', '2', 'Product', '3', '$20.99', '$21.99', '$33.79']
['Location', '2', 'Product', '1', 'Product', '2', 'Product', '3', '$24.99', '$22.88', '$35.59']

You can group things however you want, but for simplicity I have just returned lists. Whitespace is ignored by default, which makes things a lot simpler.
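
One possible way to regroup a flat block afterwards, in plain Python and without relying on any further pyparsing features, is sketched here. The function name regroup is made up for illustration, and it assumes every block has exactly the token layout shown in the output above.

```python
def regroup(block):
    """Turn one flat token list into (location, {product: price})."""
    # Assumption: tokens are 'Location', n, then ('Product', n) pairs,
    # then one '$' price per product.
    location = int(block[1])
    rest = block[2:]
    prices = [token for token in rest if token.startswith("$")]
    names = [token for token in rest if not token.startswith("$")]
    # Re-join each ('Product', n) pair into a single product name.
    products = [" ".join(names[i:i + 2]) for i in range(0, len(names), 2)]
    return location, dict(zip(products, prices))


flat = ['Location', '1', 'Product', '1', 'Product', '2', 'Product', '3',
        '$20.99', '$21.99', '$33.79']
print(regroup(flat))
```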

I do not know if there are equivalents in other languages.

David Raznick 2009-08-08 00:59:18

Algorithms or Patterns for reading text