I am looking for references (tutorials, books, academic literature) on structuring unstructured text in a manner similar to the Google Calendar Quick Add button.

I understand this may come under the NLP category, but I am interested only in the process of going from something like "Levi jeans size 32 A0b293"

to: Brand: Levi, Size: 32, Category: Jeans, code: A0b293

I imagine it would be some combination of lexical parsing and machine learning techniques.

I am rather language-agnostic, but if pushed I would prefer Python, MATLAB or C++ references.

Thanks

+4  A: 

You need to provide more information about the source of the text (the web? user input?), the domain (is it just clothes?), the potential formatting and vocabulary...

Assuming the worst-case scenario, you need to start learning NLP. A very good free book is the documentation of NLTK: http://www.nltk.org/book . It is also a very good introduction to Python, and the software is free (for various usages). Be warned: NLP is hard. It doesn't always work. It is not fun at times. The state of the art is nowhere near where you imagine it is.

Assuming a better scenario (your text is semi-structured), a good free tool is pyparsing. There is a book and plenty of examples, and the resulting code is extremely attractive.

I hope this helps...

Tal Weiss
A: 

Possibly look at "Programming Collective Intelligence" by Toby Segaran. I seem to remember it addressing the basics of this in one chapter.

leancz
A: 

After some research I have found that this problem is commonly referred to as Information Extraction. I have amassed a few papers and stored them in a Mendeley collection:

http://www.mendeley.com/research-papers/collections/3237331/Information-Extraction/

Also, as Tal Weiss noted, NLTK for Python is a good starting point, and this chapter of the book looks specifically at information extraction.

zenna
A: 

If you are only working with cases like the example you cited, you are better off using a manual rule-based approach that is 100% predictable and covers 90% of the cases it might encounter in production.

You could enumerate lists of all possible brands and categories and detect which is which in an input string, because there is usually very little overlap between these two lists.

The other two fields (size and code) could easily be detected and extracted using regular expressions (for example, 1-3 digit numbers are always sizes).
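A minimal sketch of that rule-based idea in Python, using only the standard library. The brand and category lists, the `parse_item` name and the rule for product codes (a token mixing letters and digits) are assumptions made for illustration; they are not taken from any real data feed.

```python
import re

# Hypothetical vocabularies; a real system would load much longer lists.
BRANDS = {"levi", "wrangler", "diesel"}
CATEGORIES = {"jeans", "shirt", "jacket"}

# Rule from the answer: a standalone 1-3 digit number is treated as a size.
SIZE_RE = re.compile(r"^\d{1,3}$")
# Assumed rule: a product code is an alphanumeric token containing
# at least one letter and at least one digit.
CODE_RE = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9]+$")


def parse_item(text):
    """Map each whitespace-separated token to a field using fixed rules."""
    record = {}
    for token in text.split():
        lowered = token.lower()
        if lowered in BRANDS:
            record["brand"] = token
        elif lowered in CATEGORIES:
            record["category"] = token
        elif SIZE_RE.match(token):
            record["size"] = int(token)
        elif CODE_RE.match(token):
            record["code"] = token
        # Any other token (such as the literal word "size") is ignored.
    return record


print(parse_item("Levi jeans size 32 A0b293"))
```

Because every rule is explicit, a wrong result can be traced to a specific list entry or pattern, which is the predictability argued for above. Tokens that match no rule are silently dropped here; logging them would be one way to find the cases the rules miss.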

Your problem domain doesn't seem big enough to warrant a more heavy-duty approach such as statistical learning.

adi92
Agreed. I suspect Google Calendar uses a rule-based system (think regular expressions) to do the parsing. Machine learning is more useful when pulling information out of more uncertain text. But if your domain is fairly well known and the input strings are limited, then you can get away with rules.
Thien
A: 

I am curious about what you're working on. What's the application? Mostly structuring product data?

SODA
Trying to solve the biggest problem in online clothing retail!
zenna
I am guessing you're using affiliate data feeds to get the data?
SODA
Affiliates and crawling.
zenna
We might have an interesting conversation. Where are you located?
SODA