I am a university student and it's time to buy textbooks again. This quarter there are over 20 books I need for classes. Normally this wouldn't be such a big deal, as I would just copy and paste the ISBNs into Amazon. The ISBNs, however, are converted into an image on my school's book site. All I want to do is get the ISBNs into a string so I don't have to type each one by hand. I have used GOCR to convert the images into text, but I want to use it with a Ruby script so I can automate the process and do the same for my classmates.

I can navigate to the site. How can I save each image to a file on my computer (running Ubuntu), convert it with GOCR, and finally save the resulting text to a file so I can read it back in my Ruby script?

+1  A: 

Sounds like a cool project, and shouldn't be too hard if the ISBN images are stored in individual files.

All of this can run in the background:

  • download web page (net/http)
  • save metadata + image file for each book (paperclip)
  • run GOCR on all the images

All you need is a list of URLs or a crawler (Mechanize), and then you will probably need to spend a few minutes writing a parser (see joe's post) for the university's HTML pages.
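The steps above could be sketched roughly like this, using only the standard library. It assumes the `gocr` binary is installed and on your PATH and that it can read the image format the site serves (otherwise convert to PNM first). The method names, file names and URLs are made up for illustration, and there is no redirect or error handling on the download.

```ruby
require "net/http"
require "uri"
require "open3"

# Download one image and write it to disk. Returns the path written.
# Plain GET only; redirects and HTTP errors are not handled here.
def save_image(url, path)
  body = Net::HTTP.get(URI(url))
  File.binwrite(path, body)
  path
end

# Run the gocr command-line tool on an image file and return its raw output.
def run_gocr(path)
  out, status = Open3.capture2("gocr", "-i", path)
  raise "gocr failed on #{path}" unless status.success?
  out
end

# Keep only characters that can appear in an ISBN (digits and the X check digit).
def clean_isbn(raw)
  raw.upcase.gsub(/[^0-9X]/, "")
end

# Example usage (hypothetical URLs):
#   urls = ["http://books.example.edu/isbn/1.png"]
#   isbns = urls.each_with_index.map do |url, i|
#     clean_isbn(run_gocr(save_image(url, "isbn_#{i}.png")))
#   end
#   File.write("isbns.txt", isbns.join("\n"))
```

Writing one ISBN per line to a text file makes it easy to read them back later with `File.readlines`.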

klochner
+1  A: 

GOCR seems like a good choice at first, but from what I can tell from my own "research", its quality isn't quite sufficient for daily use. Depending on the image input, this could become a problem. If it doesn't work out for you, try the "new" feature of Google Docs, which allows you to upload images for OCR. You can then retrieve the results using a Google API library (there are tons out there; I'm using gdata-ruby-util, which requires some hacking, though).

You could also use tesseract-ocr for the OCR part. It's also open source and in active development.
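If you go the tesseract route, calling it from Ruby might look something like the sketch below. It assumes a reasonably recent `tesseract` binary on your PATH that accepts `stdout` as the output target and the `--psm` option (older releases spelled it `-psm`). Page segmentation mode 7 treats the image as a single line of text, which is my guess at what an ISBN image looks like. The method names are hypothetical.

```ruby
require "open3"

# Build the tesseract command line for one image.
# "stdout" makes tesseract print the text instead of writing a .txt file;
# "--psm 7" asks it to treat the image as a single text line (an assumption
# about how the ISBN images are laid out).
def tesseract_command(image_path)
  ["tesseract", image_path, "stdout", "--psm", "7"]
end

# Run tesseract on an image and return the recognized text.
def run_tesseract(image_path)
  out, status = Open3.capture2(*tesseract_command(image_path))
  raise "tesseract failed on #{image_path}" unless status.success?
  out.strip
end
```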

For the retrieval part, I would also stick with Hpricot; it is very powerful and flexible.
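To give an idea of what the retrieval step has to do, here is a minimal sketch that pulls image URLs out of a page using only the standard library. The regex is a stand-in for illustration, not a replacement for a real HTML parser, and the method name and URLs are made up. You would still need to filter the result down to the ISBN images, which depends on how your school's pages are marked up.

```ruby
require "uri"

# Return absolute URLs for every <img> tag found in the HTML string.
# Uses a simple regex rather than a real HTML parser, so it will miss
# unquoted src attributes and other unusual markup.
def image_urls(html, base_url)
  html.scan(/<img\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']/i).flatten.map do |src|
    URI.join(base_url, src).to_s   # resolve relative paths against the page URL
  end
end
```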

moritz
A: 

There's a free API that can help you with this: http://www.webservius.com/corp/docs/wisetrend.pdf. It takes the URL of an image as input and returns the OCRed text.

Eugene Osovetsky