information-extraction

How to get started on Information Extraction?

Could you recommend a training path for getting started with, and becoming very good at, information extraction? I started reading about it for one of my hobby projects and soon realized that I would have to be good at math (algebra, statistics, probability). I have read some introductory books on different math topics (and it's so much fun). Looking for some gu...

Looking for an information retrieval / text mining application or library

We extract various information from e-mails: flights, car rentals, hotels, and more. The method is to extract the body of the mail, usually in HTML form, though sometimes it is plain text or we use the information in a PDF/Word/RTF attachment. We then apply regular expressions (sometimes in several steps) in order to get the information, which is provid...
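
The multi-step approach described in this question (reduce the body to text, then apply regular expressions) might look roughly like the sketch below. It is only an illustration: the field names, the patterns, and the helper names `html_to_text` and `extract_fields` are hypothetical and are not taken from the asker's system.

```python
import re
from html import unescape

# Hypothetical field patterns; real e-mails would need one set per sender format.
FIELD_PATTERNS = {
    "flight": re.compile(r"Flight:\s*([A-Z]{2}\d{1,4})"),
    "confirmation": re.compile(r"Confirmation(?: code)?:\s*([A-Z0-9]{6})"),
}


def html_to_text(html):
    """Step 1: crude tag stripping (a real HTML parser would be more robust)."""
    # Drop script/style blocks entirely, then every remaining tag.
    text = re.sub(r"<(script|style)\b.*?</\1>", " ", html, flags=re.I | re.S)
    text = re.sub(r"<[^>]+>", " ", text)
    # Decode entities such as &amp; and collapse runs of whitespace.
    return re.sub(r"\s+", " ", unescape(text)).strip()


def extract_fields(body, is_html=True):
    """Step 2: run each field pattern over the normalized text."""
    text = html_to_text(body) if is_html else body
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            result[name] = match.group(1)
    return result


sample = "<html><body><p>Flight: <b>LH400</b></p><p>Confirmation code: ABC123</p></body></html>"
print(extract_fields(sample))
```

Normalizing to plain text first means the same field patterns can be reused for HTML bodies, text bodies, and text pulled from attachments.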

Date Extraction Libraries

Does anyone know if there are any libraries around that will extract dates and times given a body of text? It doesn't matter which language, I'm just looking for a library to play with. ...
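
For playing with the idea before picking a library, a minimal standard-library sketch in Python is shown below. It assumes dates are written in English in the form "February 1, 2010" and handles nothing else; the function name `extract_dates` is made up for this example.

```python
import re
from datetime import date

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Matches only dates written like "February 1, 2010" (an assumption of this sketch).
DATE_PATTERN = re.compile(
    r"\b(" + "|".join(MONTHS) + r")\s+(\d{1,2}),\s*(\d{4})\b"
)


def extract_dates(text):
    """Return every valid 'Month D, YYYY' date found in text, in order."""
    found = []
    for month_name, day, year in DATE_PATTERN.findall(text):
        try:
            found.append(date(int(year), MONTHS.index(month_name) + 1, int(day)))
        except ValueError:
            # Skip impossible dates such as "February 31, 2010".
            continue
    return found


text = ("Applications started after 12:00 A.M. Midnight (EST) "
        "February 1, 2010 will not be considered.")
print(extract_dates(text))
```

A dedicated library would also cover relative expressions ("next Tuesday"), other formats, and times, which this sketch deliberately ignores.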

How to parse a rendered web page containing JavaScript

How can one extract data from a rendered web page in which JavaScript updates the data over time? Is it possible to write a user script that can access variables from the web page's JavaScript? Please suggest a possible way to achieve this. ...

Advanced PDF Parsing Using Python (extracting text without tables, etc.): What's the Best Library?

I'm looking for a PDF library which will allow me to extract the text from a PDF document. I've looked at PyPDF, and this can extract the text from a PDF document very nicely. The problem with this is that if there are tables in the document, the text in the tables is extracted in-line with the rest of the document text. This can be prob...

Using Conditional Random Fields for Named Entity Recognition

What is a Conditional Random Field? How exactly does a Conditional Random Field identify proper names as person, organization, or place in structured or unstructured text? For example: This product is ordered by StackOverFlow Inc. What does a Conditional Random Field do to identify StackOverFlow Inc. a...

Extract a single string from HTML using Ruby/Mechanize (and Nokogiri)

I am extracting data from a forum. My script based on is working fine. Now I need to extract the date and time (21 Dec 2009, 20:39) from a single post. I cannot get it to work. I used FireXPath to determine the XPath. Sample code: require 'rubygems' require 'mechanize' post_agent = WWW::Mechanize.new post_page = post_agent.get('http:/...

Is there a better tool than OpenCalais?

OpenCalais lets you submit a string (via a REST API), and it will analyze that string and break it down into named entities, relationships, keywords, etc. Are there better tools than OpenCalais (both free and commercial)? ...

Parsing SGML and storing it in a PHP array

If you can help with this you're a genius. Basically, I will have some text like this: <parent wealthy> <parent> <children female> <child> jessica <hobbies> basketball, soccer, video games </hobbies> </child> <child> jane <hobbies> ...

Media Information Extractor for Java

I need a media information extraction library (pure Java or JNI wrapper) that can handle common media formats. I primarily use it for video files, and I need at least the following information: video length (runtime), video bitrate, video framerate, video format and codec, video size (width x height), audio channels, audio format, audio bitrate and sa...

Parsing date from text using Ruby

I'm trying to figure out how to extract dates from unstructured text using Ruby. For example, I'd like to parse the date out of this string "Applications started after 12:00 A.M. Midnight (EST) February 1, 2010 will not be considered." Any suggestions? ...

Any Java library for address extraction from emails?

I'm looking for an open-source Java library that can extract address information from a (German) email (signature). The library should find: name, street, city, city code/postal code, email, tel/fax. address-parser.com is a commercial product, but a free (albeit simple) library would be great. stackoverflow.com/questions/16413/pa...

Extracting Demographic and Contact Information from unstructured text files

I am looking to extract specific items out of a large pool of unstructured documents. These documents could be 1-5 pages of text formatted in various ways by the user, but in most cases would contain at least: name, address (physical), email address, phone number, and website URL. I'm looking for a semantic parser that can attempt to extract...

Extracting graphics from crawled sites (ARC files)

I'm working with ARC files that were generated by a Heritrix crawl. When I view these pages in the Wayback Machine, it looks like most of the graphics are being loaded from my local machine, so I'm assuming that those graphics are stored inside the ARC files. Is that correct? If so, what is the best way to extract the images? ...

How do I get started with information extraction?

I am a newbie when it comes to information extraction. For the past several days, I have read a lot of academic papers and ordered a book on NLP. I want to figure out how I can build a FlipDog.com-like system (hopefully not from scratch). They extract job openings from more than 60,000 company web sites. How do I get started? I am open ...

Medical information extraction using Python

Hello there, I am a nurse and I know Python, but I am not an expert; I have only used it to process DNA sequences. We have hospital records written in human language, and I am supposed to insert the data into a database or CSV file, but there are more than 5000 lines, so this can be very hard. All the data are written in a consistent format; let me s...