Could you recommend a training path for getting started with, and becoming very good at, information extraction? I started reading about it for one of my hobby projects and soon realized that I would have to be good at math (algebra, statistics, probability). I have read some introductory books on different math topics (and it's so much fun). Looking for some gu...
We extract various kinds of information from e-mails: flights, car rentals, hotels and more. The method is to extract the body of the mail, which is usually HTML but sometimes plain text, or to use the information in a PDF/Word/RTF attachment. We then apply regular expressions (sometimes in several steps) in order to get the information, which is provid...
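The two-step approach described in this question (get the body as text, then apply regular expressions) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the asker's actual pipeline: the sample message, the `Flight:` and `Departure:` labels, and the names `body_text`, `extract_fields` and `FIELD_PATTERNS` are all hypothetical, and stripping tags with a regex is a shortcut that a real HTML parser would handle more robustly.

```python
import re
from email import message_from_string
from html import unescape

# Hypothetical sample message; addresses, flight number and date are made up.
RAW_EMAIL = """\
From: bookings@example.com
To: traveller@example.com
Subject: Your booking
MIME-Version: 1.0
Content-Type: text/html; charset="utf-8"

<html><body><p>Flight: <b>XY123</b></p><p>Departure: 2010-02-01 08:30</p></body></html>
"""

# Step 2 patterns, one per field. The labels are assumptions about the layout.
FIELD_PATTERNS = {
    "flight": re.compile(r"Flight:\s*([A-Z]{2}\d{1,4})"),
    "departure": re.compile(r"Departure:\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2})"),
}


def body_text(raw):
    """Step 1: return the first text part of the message as plain text."""
    msg = message_from_string(raw)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and binary attachments
        payload = part.get_payload(decode=True)
        charset = part.get_content_charset() or "utf-8"
        text = payload.decode(charset, errors="replace")
        if part.get_content_type() == "text/html":
            # Crude tag stripping; good enough for a sketch only.
            text = unescape(re.sub(r"<[^>]+>", " ", text))
        return re.sub(r"\s+", " ", text).strip()
    return ""


def extract_fields(raw):
    """Step 2: run each field pattern over the normalized body text."""
    text = body_text(raw)
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1)
    return fields


if __name__ == "__main__":
    print(extract_fields(RAW_EMAIL))
```

Normalizing whitespace before matching keeps the per-field patterns short, since they no longer have to account for markup or line breaks between a label and its value.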
Does anyone know if there are any libraries around that will extract dates and times given a body of text? It doesn't matter which language, I'm just looking for a library to play with.
...
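For experimenting before settling on a library, a hand-rolled sketch using only the Python standard library might look like the following. It is not a library recommendation and covers just two formats chosen for illustration: English dates written as "Month D, YYYY" and 24-hour times written as "HH:MM". The names `find_dates`, `find_times` and the sample sentence are hypothetical.

```python
import re
from datetime import date, time

MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# "Month D, YYYY", e.g. "March 3, 2011" (English month names only).
DATE_RE = re.compile(r"\b(" + "|".join(MONTH_NAMES) + r") (\d{1,2}), (\d{4})\b")
# 24-hour "HH:MM", e.g. "14:30".
TIME_RE = re.compile(r"\b(\d{1,2}):(\d{2})\b")


def find_dates(text):
    """Return every valid 'Month D, YYYY' date found in text, in order."""
    found = []
    for match in DATE_RE.finditer(text):
        month = MONTH_NAMES.index(match.group(1)) + 1
        try:
            found.append(date(int(match.group(3)), month, int(match.group(2))))
        except ValueError:
            continue  # pattern matched but the calendar date does not exist
    return found


def find_times(text):
    """Return every valid 24-hour 'HH:MM' time found in text, in order."""
    found = []
    for match in TIME_RE.finditer(text):
        try:
            found.append(time(int(match.group(1)), int(match.group(2))))
        except ValueError:
            continue  # e.g. "25:99" looks like a time but is out of range
    return found


if __name__ == "__main__":
    sample = ("Check-in opens at 14:30 on March 3, 2011; "
              "the listing also mentions April 31, 2011 and 25:99.")
    print(find_dates(sample))
    print(find_times(sample))
```

Constructing `date` and `time` objects doubles as validation, so strings that merely look like dates or times are dropped rather than returned.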
How can one extract data from a rendered web page in which JavaScript updates the data over time?
Is it possible to write a user script that can access variables from the web page's JavaScript?
Please suggest a possible way to achieve this.
...
I'm looking for a PDF library which will allow me to extract the text from a PDF document. I've looked at PyPDF, and this can extract the text from a PDF document very nicely. The problem with this is that if there are tables in the document, the text in the tables is extracted in-line with the rest of the document text. This can be prob...
What is Conditional Random Field?
How exactly does a Conditional Random Field identify proper names as a person, organization, or place in structured or unstructured text?
For example: This product is ordered by StackOverFlow Inc.
What does Conditional Random Field do to identify StackOverFlow Inc. a...
I am extracting data from a forum. My script, based on Mechanize, is working fine. Now I need to extract the date and time (21 Dec 2009, 20:39) from a single post, but I cannot get it to work. I used FireXPath to determine the XPath.
Sample code:
require 'rubygems'
require 'mechanize'
post_agent = WWW::Mechanize.new
post_page = post_agent.get('http:/...
OpenCalais lets you submit a string (via a REST API), and it will analyze that string and break it down into named entities, relationships, keywords, etc.
Are there better tools than OpenCalais (either free or commercial)?
...
If you can help with this you're a genius.
Basically, I will have some text like this:
<parent wealthy>
<parent>
<children female>
<child>
jessica
<hobbies>
basketball, soccer, video games
</hobbies>
</child>
<child>
jane
<hobbies>
...
I need a media information extraction library (pure Java or a JNI wrapper) that can handle common media formats. I primarily use it for video files, and I need at least the following information:
Video length (Runtime)
Video bitrate
Video framerate
Video format and codec
Video size (width X height)
Audio channels
Audio format
Audio bitrate and sa...
I'm trying to figure out how to extract dates from unstructured text using Ruby.
For example, I'd like to parse the date out of this string "Applications started after 12:00 A.M. Midnight (EST) February 1, 2010 will not be considered."
Any suggestions?
...
I'm looking for an open-source Java library that can extract address information from a (German) email signature. The library should find:
name
street
city, city code/postal code
email
tel/fax
address-parser.com is a commercial product, but a free (albeit simple) library would be great.
stackoverflow.com/questions/16413/pa...
I am looking to extract specific items out of a large pool of unstructured documents. These documents could be 1-5 pages of text formatted in various ways by the user, but in most cases would contain at least:
Name
Address (physical)
Email Address
Phone number
website URL
I'm looking for a semantic parser that can attempt to extract...
I'm working with ARC files that were generated by a Heritrix crawl. When I view these pages in the Wayback Machine, it looks like most of the graphics are being loaded from my local machine, so I'm assuming that those graphics are stored inside the ARC files. Is that correct? If so, what is the best way to extract the images?
...
I am a newbie when it comes to information extraction. For the past several days, I have read a lot of academic papers and ordered a book on NLP. I want to figure out how I can build a FlipDog.com-like system (hopefully not from scratch). They extract job openings from more than 60,000 company websites. How do I get started?
I am open ...
Hello there, I am a nurse and I know Python, but I am not an expert; I have only used it to process DNA sequences.
We have hospital records written in natural language, and I am supposed to insert this data into a database or CSV file, but there are more than 5000 lines, which makes doing it by hand very hard. All the data is written in a consistent format, let me s...