text-extraction

Read PDF file with original contents

Hi, I want to read a PDF file and keep its original content, such as its fonts (some text may be in a small font size and some in a large one), paragraphs, and tables, if there are any. How is this possible? Please help. ...

Extracting Demographic and Contact Information from unstructured text files

I am looking to extract specific items out of a large pool of unstructured documents. These documents could be 1-5 pages of text formatted in various ways by the user, but in most cases would contain at least: name, physical address, email address, phone number, and website URL. I'm looking for a semantic parser that can attempt to extract...
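For the more regular fields in a question like this one (email address and website URL), a plain regex pass is a common first step before reaching for a semantic parser. The following is only a minimal sketch: the patterns are deliberately loose, the function name `extract_contacts` is hypothetical, and names and physical addresses are not attempted because they usually need something beyond regular expressions.

```python
import re

# Loose patterns chosen for illustration; they are assumptions, not a full
# implementation of the email or URL grammars.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_contacts(text):
    """Return the email addresses and http(s) URLs found in free text."""
    return {
        "emails": EMAIL_PATTERN.findall(text),
        "urls": URL_PATTERN.findall(text),
    }


sample = "Jane Doe can be reached at jane.doe@example.com or via https://example.com/about today"
contacts = extract_contacts(sample)
print(contacts)
```

The URL pattern stops at whitespace, so trailing punctuation such as a comma or full stop directly after a URL would be captured and may need trimming.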

Extracting text from PDF, DOC, HTML after crawling with Heritrix

I'm looking to use Heritrix to crawl websites. I'm wondering what tools Heritrix users are using to extract text from crawled files prior to indexing them with Lucene. ...

Extracting readable text from HTML using Python?

I know about utils like html2text, BeautifulSoup, etc., but the issue is that they also extract JavaScript and add it to the text, making it tough to separate them. For example: htmlDom = BeautifulSoup(webPage); htmlDom.findAll(text=True). Alternatively: from stripogram import html2text; extract = html2text(webPage). Both of these extract all the java...
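One way to keep script content out of the extracted text is to track whether the parser is currently inside a `<script>` or `<style>` element and drop any text seen there. The sketch below uses only the standard library's `html.parser` rather than the libraries named in the question; the class and function names are hypothetical, and it assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_visible_text(html):
    """Return the text of an HTML string, joined with single spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><head><script>var x = 1;</script><style>p {color: red}</style></head>"
    "<body><p>Hello</p><p>world</p></body></html>"
)
print(extract_visible_text(page))
```

The same idea applies when using a DOM library: remove the script and style elements from the tree first, then collect the remaining text nodes.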

Java text extraction and data structure design

I have a huge set of data of tables in OpenOffice 3.0 document format.
Table 1:
(x range) | (x1,y1) | (x2,y2) | (x3,y3) | (x4,y4)
(-20,90)  | (-20,0) | (-5,1)  | (5,1)   | (10,0)
...
Likewise, I have n such tables. All of these tables are fuzzy set membership functions. In simple terms, they are computational models according to...

Extract data from nested tables in PDF (C#)

I have a few PDF files that were created from Word or Excel files. I need to get the information that's in the tables. The text in the document is not an image, so I'm able to extract the text using tools such as PDFBox. Once I have the text, I have no way of knowing which cells in the table it belongs to because I don't know where the tab...

Garbage character at end of string?

Hi there, I'm reading a string, breaking it into words, and sorting them into name, email, and phone number, using the string joe bloggs [email protected] 12345. But once I break everything down, the individual separated variables that hold the name, email, and phone number have garbage characters at the end of them. I can't figure out why. test f...

Regex to extract info from SQL query

As I am new to regex, I am not able to solve the problem below. Please also share some parser-related links so that I can learn more. I am having trouble with the SQL statement below. It has more lines added to the previous input. Please help me solve this. DECLARE numerator NUMBER; BEGIN SELECT x, y INTO numerator, denominator FROM...

Looking for find-in-files that supports all formats (in C++ or Java)

How can components such as the "Total Commander" search open every file format and search inside it? Is there a free library that offers such a feature? Basically, in the end I would like to extract text from files and be able to support all formats (PDF, Microsoft DOC, CHM …) ...

C# regex to extract link after =

I couldn't find a better title, but I need a regex to extract the link from the sample below. snip... flashvars.image_url = 'http://domain.com/test.jpg' ..snip Assuming regex is the best way. Thanks ...
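A pattern for the sample above can anchor on the `flashvars.image_url` assignment and capture whatever sits between the matching quotes. This is a sketch in Python rather than C#, and the function name `extract_image_url` is hypothetical; the pattern itself uses only a backreference and a lazy quantifier, which .NET's regex engine also supports, so it should carry over with little change.

```python
import re

# Match the assignment, then capture the text between matching quotes.
# (['"]) remembers which quote opened the value; \1 requires the same one to close it.
IMAGE_URL_PATTERN = re.compile(r"""flashvars\.image_url\s*=\s*(['"])(?P<url>.*?)\1""")


def extract_image_url(source):
    """Return the URL assigned to flashvars.image_url, or None if absent."""
    match = IMAGE_URL_PATTERN.search(source)
    return match.group("url") if match else None


snippet = "snip... flashvars.image_url = 'http://domain.com/test.jpg' ..snip"
print(extract_image_url(snippet))
```

Anchoring on the variable name rather than on "anything after =" avoids picking up other assignments in the same script.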