questions about layout-extraction

Text detection / location libraries

I need to detect the bounding box(es) around portions of text in an image, and while there are quite a number of scholarly articles describing algorithms, I haven't found any implementations. The specific problem I'm trying to solve is this: Given an image that may or may not contain text, determine if the image does contain text, an...

image-processing

ocr

text-extraction

layout-extraction

optical character recognition of PDFs of parliamentary debates

Hi, for contract work I need to digitize a lot of old, image-only scanned PDFs of plenary debate protocols from the German Federal Parliament. The problem is that most of these files use a two-column format. I would appreciate answers to the following questions: How can I split the two columns before feeding them into...
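
One possible way to approach the column split asked about here is a vertical projection profile: count the ink pixels in each pixel column of the binarized page and cut at the widest blank run near the middle. The sketch below is only an illustration under stated assumptions, not a method taken from the question: the page is assumed to be already binarized and deskewed, and the names `column_ink`, `find_gutter`, `split_columns` and the `max_ink` parameter are hypothetical.

```python
from typing import List, Optional, Tuple

Page = List[List[int]]  # assumed input: binarized page rows, 1 = ink, 0 = background


def column_ink(page: Page) -> List[int]:
    """Vertical projection profile: number of ink pixels in each pixel column."""
    return [sum(col) for col in zip(*page)]


def find_gutter(page: Page, max_ink: int = 0) -> Optional[int]:
    """Return the x position at the centre of the widest near-blank run of
    pixel columns inside the middle half of the page, or None if there is none.

    max_ink is a hypothetical noise tolerance: a pixel column counts as blank
    when it holds at most this many ink pixels.
    """
    profile = column_ink(page)
    width = len(profile)
    lo, hi = width // 4, (3 * width) // 4  # only search the middle half
    best_start, best_len = None, 0
    run_start = None
    for x in range(lo, hi + 1):
        # x == hi acts as a sentinel that closes a run still open at the end
        blank = x < hi and profile[x] <= max_ink
        if blank and run_start is None:
            run_start = x
        elif not blank and run_start is not None:
            if x - run_start > best_len:
                best_start, best_len = run_start, x - run_start
            run_start = None
    if best_start is None:
        return None
    return best_start + best_len // 2


def split_columns(page: Page, max_ink: int = 0) -> Tuple[Page, Page]:
    """Cut every row at the detected gutter and return (left, right) halves."""
    gutter = find_gutter(page, max_ink)
    if gutter is None:
        raise ValueError("no blank gutter found near the page centre")
    left = [row[:gutter] for row in page]
    right = [row[gutter:] for row in page]
    return left, right
```

Each half could then be passed to the OCR engine as a separate image. Real scans are rarely this clean, so a nonzero `max_ink` and a deskewing step beforehand would probably be needed.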

Is OCR a solved problem?

According to Wikipedia, "The accurate recognition of Latin-script, typewritten text is now considered largely a solved problem on applications where clear imaging is available such as scanning of printed documents." However, it gives no citation. My question is: is this true? Is the current state of the art so good that - for a good sca...

ocr

text-extraction

layout-extraction

Extracting html elements in a given region?

Given a region defined by a rectangle and a URL, is there any way to determine which elements lie within the given rectangle on the page at the given URL? EDIT: Screen resolution, font size, etc. can all be set to reasonable defaults. ...

html

url

screen-scraping

html-content-extraction

layout-extraction

dvi2rtf - who can convert DVI files to RTF files?

Consider a *NIX executable, dvi2rtf, whose contents are:

    #!/bin/sh
    TMPX=`mktemp /tmp/dvi2rtf.XXXXXX`
    dvitty $1 $TMPX # CTAN
    txt2rtf $TMPX $2 # CTAN, in rtfutils

If my head is working this morning and the right executables are on the PATH, this clobbers the second argument with an rtf file whose text contents will roughl...