Extracting Demographic and Contact Information from unstructured text files | ansaurus

Q:

Extracting Demographic and Contact Information from unstructured text files

I am looking to extract specific items out of a large pool of unstructured documents. These documents could be 1-5 pages of text formatted in various ways by the user, but in most cases would contain at least:

Name
Address (physical)
Email Address
Phone number
Website URL

I'm looking for a semantic parser that can attempt to extract these elements from the documents so that I can load that information into a relational database and work with these records as contacts.

Other services I've looked at, while valuable for other purposes, do not address this specific need.

Any thoughts, suggestions or leads?

A:

Have you found a solution yet? I found some research papers on the topic:

www.cis.upenn.edu/~pereira/papers/crf.pdf

citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84.9192&rep=rep1&type=pdf

www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta04extracting.pdf

However, I found no code examples that implement any of these ideas.

Take a look at this too: stackoverflow.com/questions/953150/general-address-parser-for-freeform-text

(Sorry, I left the http:// prefix off the links; this system does not allow me to post more than one URL/link.)
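
For what it's worth, the easier fields (email address, phone number, website URL) can often be pulled out with plain regular expressions before reaching for a trained model. Here is a minimal Python sketch of that baseline, under stated assumptions: the function name `extract_contact_fields` and all three patterns are hypothetical illustrations, not taken from the papers linked above; the phone pattern assumes North American 10-digit numbers; and names and physical addresses are deliberately left out, since those are the fields where a statistical approach is more likely to be needed.

```python
import re

# Illustrative patterns only (assumptions, not a complete or standard grammar).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>\"']+", re.IGNORECASE)
# Assumes North American numbers such as (555) 123-4567 or 555.123.4567.
PHONE_RE = re.compile(
    r"(?:\+?1[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}"
)


def extract_contact_fields(text):
    """Return every email, URL, and phone-like string found in text."""
    emails = EMAIL_RE.findall(text)
    # Trim punctuation that often trails a URL at the end of a sentence.
    urls = [u.rstrip(".,;:)") for u in URL_RE.findall(text)]
    phones = PHONE_RE.findall(text)
    return {"emails": emails, "urls": urls, "phones": phones}


if __name__ == "__main__":
    sample = (
        "Jane Doe\n"
        "123 Main St, Springfield\n"
        "Email: jane.doe@example.com\n"
        "Phone: (555) 123-4567\n"
        "Web: www.example.com/about.\n"
    )
    print(extract_contact_fields(sample))
```

Each list in the returned dict could then be loaded into its own column or child table in the contacts database. Expect false positives and misses on real documents; treat this as a first pass to review, not a finished parser.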

2010-07-26 20:52:53
