This isn't really OCR, since it's not recognizing characters, but it's the same idea. Does anyone know of an image-processing library or established algorithm for retrieving the values from a (raster) plot image? For instance, in this graph, it's hard for me to read exact values by eye because the gaps between gridlines are so large:

[image: the graph in question]

I can use a straight edge or whatever, but it's still going to be error-prone. It would be great if there were software that could just take a screenshot of any old graph and automatically convert it into a table of values or a function that could be queried.

Seems to be called "curve recognition"?

And it's ok to have some human guidance. There's no reason an OCR couldn't read the "100" and match it up with the line, for instance, but it's ok to have a human give the lines numerical values after the machine has extracted the curve's path relative to the gridlines. I'm mostly interested in the part that traces the curve relative to the grid.
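To make the "trace the curve, then let a human label the gridlines" idea concrete, here is a minimal sketch in plain Python. It is only an illustration under loud assumptions: the image is a small grayscale grid of numbers, the gridlines have already been removed so the only dark pixels belong to the curve, and the axis is linear (a log axis would need interpolation in log space instead). The names `trace_curve` and `make_axis_map` are made up for this example.

```python
def trace_curve(image, threshold=128):
    """For each pixel column, return the mean row of the dark pixels (None if none).

    Assumes `image` is a list of rows of grayscale values (0 = black, 255 = white)
    and that anything darker than `threshold` belongs to the curve.
    """
    n_rows = len(image)
    n_cols = len(image[0])
    path = []
    for col in range(n_cols):
        dark_rows = [row for row in range(n_rows) if image[row][col] < threshold]
        path.append(sum(dark_rows) / len(dark_rows) if dark_rows else None)
    return path


def make_axis_map(pixel_a, value_a, pixel_b, value_b):
    """Build a linear pixel-to-value map from two gridlines labelled by a human."""
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale


# Synthetic 5x5 test image: a dark diagonal rising from bottom-left to top-right.
image = [[255] * 5 for _ in range(5)]
for i in range(5):
    image[4 - i][i] = 0

path = trace_curve(image)

# Human guidance: pixel row 4 is the gridline labelled 0, pixel row 0 is labelled 100.
y_map = make_axis_map(4, 0.0, 0, 100.0)
values = [y_map(row) for row in path]
print(values)
```

On a real scan the hard part is everything this sketch assumes away: separating the curve from the gridlines and text, and coping with noise and skew.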

+4  A: 

This is extremely hard and error-prone. (We do this sort of thing a lot in chemistry where we try to analyze chemistry.) It depends critically on various parameters and conditions.

  1. Is the image a bit-map (pixels-only) or vectors (EMF, WMF, SVG, PS, PDF...)? Vectors are vastly better than pixels. We tackle vectors (including PDF) but don't touch pixels. Some of our collaborators will try to use pixels but only on fairly recent documents.
  2. If you are stuck with pixels then are your images all from the same source? If so you have a small chance of extracting font information. I am afraid your image is so poor that it would require a great deal of work. However if you can work out the font you have a chance of extracting text and numbers if all docs are from the same source. You could use heuristics (rules such as where the numbers might be) or machine-learning (a list of features on which the methods can be trained).
  3. Your image appears to have been scanned (as the axes are pixelated). That makes it even worse. What appears a straight line to the eye is horrible for a machine. Is your image skewed on the page? You may have to deskew it.
  4. If you have a model for the lines and curves then you may have a chance of fitting the expected parameters to the image. But it's not trivial.

I'm sorry to be pessimistic. If you really want the info then it can be done with a lot of investment or collaboration with groups which do this sort of thing.

peter.murray.rust
I don't think it's as hard as you imagine it to be. What specific experience do you have with this? I don't understand what scraping graphs has to do with "analyzing chemistry".
endolith
And yes, I mean rasterized graphs, not vector images.
endolith
@endolith the graph above could well appear in a chemistry paper. We have analysed, and published in peer-reviewed journals, how to extract information from scientific papers. These happen to be mainly in chemistry but they contain graphs that show all the aspects of this problem. You "don't think it's as hard as I imagine". If you have actually managed to write software that can extract information (without human help) from the picture shown then you will amaze a lot of people.
peter.murray.rust
@endolith even OCR on the characters in your graph (let alone the lines) will give rise to considerable errors. If you don't believe this, get an OCR program and try.
peter.murray.rust
*There's no reason an OCR couldn't read the "100"* The quality of these glyphs is so poor that you will almost certainly get things like "lOO" (el-oh-oh, not one-zero-zero). Indeed the pixels bleed from one glyph to another so I doubt you would even get this. Remember the OCR has not been trained on this graph. It is, of course, possible to create software that allows manual annotation on an overlay of the graph but I assumed you wanted something more automatic.
peter.murray.rust
The point of my question is to read the position of the curve in relation to the grid lines, not to read the text. I said so in the first sentence of the question. But I still stand by my statement that OCR has no trouble reading the number "100", especially since I just ran this image through ocrterminal.com, onlineocr.net, free-ocr.com, and googlecodesamples.com and they all read "100". And those are optimized for pages of text. If an OCR algorithm knows it's looking for numbers and not letters, and that they're aligned along a grid, it's going to be even more accurate.
endolith
"Your image appears to have been scanned ... That makes it even worse. What appears a straight line to the eye is horrible for a machine."I don't see why. Even an example Hough transform script can find the lines in the image: http://www.flickr.com/photos/56868697@N00/4071011102/ A dedicated program looking for evenly-spaced parallel lines of equal length should be able to do this very well.
endolith
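For readers unfamiliar with the Hough transform mentioned in the comment above, here is a minimal sketch of the voting step in plain Python. It is not the script behind the linked image; it assumes the dark pixels have already been extracted as (x, y) coordinates, and the names `hough_accumulate` and `strongest_line` are invented for this example. A real program would more likely use an image-processing library and would look for many peaks, not just one.

```python
import math


def hough_accumulate(points, n_theta=180):
    """Vote in (rho, theta) space for every dark pixel.

    Each point (x, y) votes for all lines rho = x*cos(theta) + y*sin(theta)
    passing through it, with theta sampled in n_theta steps over [0, pi).
    Returns a dict mapping (rho, theta_index) to the number of votes.
    """
    acc = {}
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            acc[(rho, t)] = acc.get((rho, t), 0) + 1
    return acc


def strongest_line(points, n_theta=180):
    """Return (rho, theta in degrees, votes) for the most-voted line."""
    acc = hough_accumulate(points, n_theta)
    (rho, t), votes = max(acc.items(), key=lambda item: item[1])
    return rho, 180.0 * t / n_theta, votes


# Synthetic input: a horizontal gridline at y = 3 plus two stray noise pixels.
points = [(x, 3) for x in range(60)] + [(10, 20), (40, 35)]
rho, theta_deg, votes = strongest_line(points)
print(rho, theta_deg, votes)
```

A horizontal line shows up as a peak at theta = 90 degrees with rho equal to its row, so evenly spaced gridlines would appear as a column of peaks at the same angle. Because the votes are pooled over the whole line, a few ragged or missing pixels from scanning only lower the peak slightly.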
+1  A: 

I don't know of any software that does what you're asking, but if you can get just a few points you can use some kind of regression to find the best function that fits those points. This particular graph looks like an exponential function. So you'd want to find an exponential regression calculator.
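As a rough sketch of what such an exponential regression does, the snippet below fits y = a * exp(b * x) by ordinary least squares on ln(y), using only the standard library. The function name `fit_exponential` and the sample points are made up for illustration, and the guess that the curve is exponential is an assumption about the graph, not something established. Note that fitting in log space weights small y values more heavily than a direct nonlinear fit would.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) via linear least squares on ln(y).

    Requires every y to be strictly positive and at least two distinct x values.
    """
    n = len(xs)
    logs = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (lg - mean_log) for x, lg in zip(xs, logs))
    b = sxy / sxx
    a = math.exp(mean_log - b * mean_x)
    return a, b


# Hypothetical points read off a graph by hand (generated here from a known curve).
xs = [0, 1, 2, 3]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
a, b = fit_exponential(xs, ys)
print(a, b)
```

With hand-picked points the recovered parameters are only as good as the readings, so this complements rather than replaces tracing the curve itself.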

Nali4Freedom
+2  A: 

Googling for "curve recognition software" suggests http://www.curveunscan.com/

anonymous
Hmmm... It says "curve recognition algorithm", but also talks about picking the points by hand: http://www.curveunscan.com/features.htm
endolith
It kind of works, but requires a lot of hand-picking of points, tracks curves poorly, and crashes often. :/
endolith
Here's another software solution, with some curve following ability: http://digitizer.sourceforge.net/
endolith
+1  A: 

There is also potrace, which is related; its page in turn mentions other alternatives.

pixelbeat