text-extraction

Python module for converting PDF to text

Is there any Python module to convert PDF files into text? I tried a piece of code found on ActiveState that uses pypdf, but the generated text had no spaces between words and was of no use. ...

What's a good method for extracting text from a PDF using C# or classic ASP (VBScript)?

Is there a good library for extracting text from a PDF? I'm willing to pay for it if I have to. Something that works with C# or classic ASP (VBScript) would be ideal and I also need to be able to separate the pages from the PDF. This question had some interesting stuff, especially pdftotext but I'd like to avoid calling to an external...

How do screen scrapers work?

I hear about people writing these programs all the time and I know what they do, but how do they actually do it? I'm looking for general concepts. ...

regular expression to extract text from HTML

I would like to extract all the text (displayed or not) from a general HTML page. I would like to remove any HTML tags, any JavaScript, and any CSS styles. Is there a regular expression (one or more) that will achieve that? ...
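For illustration only, here is a minimal Python sketch of the task described above. It swaps the requested regular expression for the standard library's `html.parser`, because script and style bodies are awkward to handle with a pattern alone. The class name `TextExtractor`, the function `html_to_text`, and the sample markup are hypothetical and not taken from the question.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a script or style element
        self._chunks = []      # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join fragments with a space and collapse runs of whitespace.
        # Assumption: a space between fragments is acceptable, even though
        # it can split a word that is broken up by inline tags.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p {color: red}</style>"
    "<script>var x = '<b>';</script></head>"
    "<body><p>Hello <b>world</b></p></body></html>"
)
result = html_to_text(sample)
print(result)
```

This keeps text that is present in the markup but hidden by CSS, which matches the "displayed or not" requirement. It does not try to handle malformed documents beyond what `html.parser` tolerates.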

Strip HTML from a web page and calculate word frequency?

In Groovy, how do I grab a web page and remove HTML tags, etc., leaving only the document's text? I'd like the results dumped into a collection so I can build a word frequency counter. Finally, let me mention again that I'd like to do this in Groovy. ...

Extracting data from an email message (or several thousand emails) [Exchange based]

My marketing department, bless them, has decided to run a sweepstakes that people enter through a webpage. That is great, but the information isn't stored in a DB of any sort; it is sent to an Exchange mailbox as an email. Great. My challenge is to extract the entry (and marketing info) from these emails and store them someplace more u...

HTML downloading and text extraction

What would be a good tool, or set of tools, to download a list of URLs and extract only the text content? Spidering is not required, but control over the download file names and threading would be a bonus. The platform is Linux. ...

Moving data from one master pdf to other individual pdf's with different layouts

I have 8-10 different company applications that have to be filled out. About 85-90% of the information is common (however it is not located in the same spot on each application form). I want to create a master application with the common fields and the application specific fields in the master application. I want to have a person fill...

how to extract a portion of a string in php

I am using preg_replace() for some string replacement. $str = "<aa>Let's find the stuff qwe in between <id>12345</id> these two previous brackets</h>"; $do = preg_match("/qwe(.*)12345/", $str, $matches); which is working just fine and gives the following result: $matches[0]=qwe in between 12345 $matches[1]=in between but I am using s...

Best open source library or application to crawl and data mine web sites

I would like to know what the best open-source library is for crawling and analyzing websites. One example would be a crawler for property agencies, where I would like to grab information from a number of sites and aggregate it into my own site. For this I need to crawl the sites and extract the property ads. ...

Text detection / location libraries

I need to detect the bounding box(es) around portions of text in an image, and while there are quite a number of scholarly articles describing algorithms, I haven't found any implementations. The specific problem I'm trying to solve is this: Given an image that may or may not contain text, determine if the image does contain text, an...

Extract text from a PowerPoint (.ppt or .pptx) file?

I'm currently using a combination of OpenOffice macros and a pdf2text program to extract text and would like to find an easier, more efficient way of getting the text out of a PowerPoint file. I've tried using the Apache POI library and have not had much luck, encountering numerous exceptions within the library when trying to process the f...

How to extract text from MS office documents in C#

I was trying to extract text (strings) from MS Word (.doc, .docx), Excel and PowerPoint files using C#. Where can I find a free and simple .NET library to read MS Office documents? I tried to use NPOI, but I couldn't find a sample showing how to use it. ...

optical character recognition of PDFs of parliamentary debates

Hi, For contract work, I need to digitize a lot of old, scanned-graphic-only plenary debate protocol PDFs from the Federal Parliament of Germany. The problem is that most of these files have a two-column format: I would love to read your answer to my following questions: How can I split the two columns before feeding them into...

Search by topics and extract keywords from articles in Wikipedia

Hi. I'm doing a project in Java in which I have to process a Wikipedia dump file. I'm looking for a library to extract keywords from Wikipedia articles... Basically I want to read every page tag in the Wikipedia XML dump and compare it with a list of topics and categories, and if it matches, select it and add it to my results. I'm not i...

Regexp for extracting a mailto: address

I'd like a reg exp which can take a block of string, and find the strings matching the format: <a href="mailto:[email protected]">....</a> And for all strings which match this format, it will extract out the email address found after the mailto:. Any thoughts? This is needed for an internal app and not for any spammer purposes! ...
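As a rough illustration of the kind of pattern being asked for, the sketch below uses Python's `re` module. The pattern, the helper name `extract_mailto_addresses`, and the sample addresses are all hypothetical. It assumes the address is quoted in an `href` attribute and ends at the closing quote or at a `?` that starts mailto parameters.

```python
import re

# Hypothetical pattern: find an <a ...> tag whose href starts with "mailto:"
# and capture everything up to the closing quote, a "?", ">" or whitespace.
MAILTO_RE = re.compile(
    r'<a\s[^>]*href\s*=\s*["\']mailto:([^"\'?>\s]+)',
    re.IGNORECASE,
)


def extract_mailto_addresses(text):
    """Return every address found after 'mailto:' in anchor tags, in order."""
    return MAILTO_RE.findall(text)


sample = (
    '<p><a href="mailto:alice@example.com">Alice</a> and '
    '<a class="x" href=\'mailto:bob@example.org?subject=Hi\'>Bob</a></p>'
)
addresses = extract_mailto_addresses(sample)
print(addresses)
```

The pattern does not validate the address itself. It only isolates whatever follows `mailto:`, so a stricter check would be a separate step.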

Text Extraction from HTML Java

Hi. I'm working on a program that downloads HTML pages, selects some of the information, and writes it to another file. I want to extract the information that is in between the paragraph tags, but I can only get one line of the paragraph. My code is as follows: FileReader fileReader = new FileReader(file); BufferedReader buffR...

Is OCR a solved problem?

According to Wikipedia, "The accurate recognition of Latin-script, typewritten text is now considered largely a solved problem on applications where clear imaging is available such as scanning of printed documents." However, it gives no citation. My question is: is this true? Is the current state-of-the-art so good that - for a good sca...

Extracting pure content / text from HTML Pages by excluding navigation and chrome content

Hi, I am crawling news websites and want to extract the news title, news abstract (first paragraph), etc. I plugged into the WebKit parser code to easily navigate the webpage as a tree. To eliminate navigation and other non-news content, I take the text version of the article (minus the HTML tags; WebKit provides an API for this). Then I run th...

Is there a way to extract text from PostScript (.ps , .eps) files using Java?

I am looking for a solution similar to PDFBox for PDFs or Apache Tika, but for PS files. Thanks. ...