I am looking to extract specific items from a large pool of unstructured documents. Each document is 1-5 pages of text, formatted in various ways by the user, but in most cases contains at least:

  • Name
  • Address (physical)
  • Email Address
  • Phone number
  • Website URL

I'm looking for a semantic parser that can attempt to extract these elements from the documents so that I can load that information into a relational database and work with these records as contacts.

Other services I've looked at, while valuable for other purposes, do not address this specific need.

Any thoughts, suggestions or leads?

A: 

Have you found an answer to your question? I found some research articles:

www.cis.upenn.edu/~pereira/papers/crf.pdf

citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84.9192&rep=rep1&type=pdf

www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta04extracting.pdf

However, I found no specific code examples implementing any of these ideas.
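For the more regular fields (email, phone, URL), a simple pattern-based baseline might be enough before reaching for the statistical models in those papers. What follows is only a rough sketch and not an implementation of anything in the articles: the function name `extract_contact_fields` is made up for illustration, the patterns are deliberately loose, and the phone pattern assumes North American style numbers.

```python
import re

# Deliberately simple patterns (assumptions for illustration, not full validators).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>]+", re.IGNORECASE)
# Assumes North American style numbers such as (555) 123-4567 or 555.123.4567.
PHONE_RE = re.compile(
    r"(?:\+?1[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}"
)


def extract_contact_fields(text):
    """Return candidate emails, phone numbers, and URLs found in free text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
        # Trim trailing punctuation that often follows a URL in running text.
        "urls": [u.rstrip(".,;") for u in URL_RE.findall(text)],
    }


sample = (
    "Jane Doe\n"
    "123 Main St, Springfield\n"
    "jane.doe@example.com\n"
    "(555) 123-4567\n"
    "www.example.com"
)
fields = extract_contact_fields(sample)
```

Each value is a list of candidates, so a document with several matches would still need some rule for choosing which one goes into the contact record. Names and physical addresses are much less regular, which is where the approaches in the papers (or a dedicated address parser) are more likely to be needed.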

Take a look at this too: stackoverflow.com/questions/953150/general-address-parser-for-freeform-text

(Sorry, I left out the http prefix; this system does not allow me to post more than one URL/link.)