My company has a client that tracks prices for products from different companies at different locations. This information goes into a database.
These companies email the prices to our client each day, and of course the emails are all formatted differently. Asking the companies to change their formats is not an option - they will not do it.
Some look sort of like this:
    This is example text that could be many lines long...

    Location 1
    Product 1    Product 2    Product 3
    $20.99       $21.99       $33.79

    Location 2
    Product 1    Product 2    Product 3
    $24.99       $22.88       $35.59
Others look sort of like this:
    PRODUCT       PRICE     + / -
    ------------  --------  -------
    Location 1
    1             2007.30   +048.20
    2             2022.50   +048.20

    Maybe some multiline text here about a holiday or something...

    Location 2
    1             2017.30   +048.20
    2             2032.50   +048.20
Currently we have an individual parser written for each company's email format. But these formats change slightly, and fairly often, so we can't count on the prices being on the same row or in the same column each time.
It's trivial for us to look at the emails and determine which price goes with which product at which location, but not so much for our code. So I'm trying to find a more flexible solution and would like your suggestions about what approaches to take. I'm open to anything from regex to neural networks - I'll learn whatever I need to make this work, I just don't know what I need to learn. Is this a lexing/parsing problem? Or is it more similar to OCR?
The code doesn't have to figure out the formats all on its own. The emails fall into a few main 'styles' like the ones above. We really just need the code to be flexible enough that a new product line, extra whitespace, or some similar small change doesn't make the file unparsable.
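As a rough illustration of the kind of flexibility I mean, here is a minimal Python sketch with one parser per style that matches tokens (location headers, product labels, prices) in order of appearance instead of relying on fixed rows or columns. Everything in it is an assumption made for illustration: the patterns are taken from the two made-up samples above, and the names location_blocks, parse_style_one and parse_style_two are hypothetical. Real emails would need their own patterns, and free text that happens to contain numbers could still confuse the second parser.

```python
import re

# Hypothetical patterns based on the two sample layouts; real emails
# would need their own location, product, and price patterns.
LOCATION = re.compile(r"Location \d+")
PRODUCT = re.compile(r"Product \d+")
DOLLAR_PRICE = re.compile(r"\$(\d+\.\d{2})")
# Style two row: product number, price, signed change, split by any whitespace.
ROW = re.compile(r"\b(\d+)\s+(\d+\.\d{2})\s+([+-]\d+\.\d{2})")


def location_blocks(text):
    """Yield (location, block); a block runs until the next location header."""
    headers = list(LOCATION.finditer(text))
    for i, header in enumerate(headers):
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        yield header.group(), text[header.end():end]


def parse_style_one(text):
    """Style one: product labels followed by dollar prices, paired by order."""
    result = {}
    for location, block in location_blocks(text):
        products = PRODUCT.findall(block)
        prices = [float(p) for p in DOLLAR_PRICE.findall(block)]
        if len(products) != len(prices):
            # Fail loudly instead of silently pairing the wrong price.
            raise ValueError(
                f"{location}: {len(products)} products, {len(prices)} prices"
            )
        result[location] = dict(zip(products, prices))
    return result


def parse_style_two(text):
    """Style two: 'product price change' rows under each location header."""
    result = {}
    for location, block in location_blocks(text):
        rows = ROW.findall(block)
        if not rows:
            raise ValueError(f"{location}: no price rows found")
        result[location] = {
            product: (float(price), float(change))
            for product, price, change in rows
        }
    return result


# Deliberately messy spacing and line breaks to show layout tolerance.
sample_one = """This is example text that could be many lines long...
Location 1
Product 1   Product 2   Product 3
$20.99      $21.99      $33.79

Location 2
Product 1 Product 2 Product 3
  $24.99 $22.88
$35.59
"""

sample_two = """PRODUCT PRICE + / -
------------ -------- -------
Location 1
1 2007.30 +048.20
2     2022.50    +048.20
Maybe some multiline text here about a holiday or something...
Location 2
1 2017.30 +048.20
2 2032.50 +048.20
"""

if __name__ == "__main__":
    print(parse_style_one(sample_one))
    print(parse_style_two(sample_two))
```

Because the matching is by token order within each location block, extra blank lines, changed column widths, or an added product should not break it, and the count check in parse_style_one raises an error when products and prices no longer line up instead of guessing.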
Thanks for any suggestions about where to start.