Does anyone know of a good free or cheap (under £100/$200) OCR library? It needs to run on Windows and preferably be a .NET library, though a COM interface is fine.
Maybe not exactly what you are looking for, but this might point you in the right direction.
The code below is an unmodified copy/paste from the source (given only so that readers can find the essence of this solution in one place).
Original Author: Martin Welker
_MODIDocument = new MODI.Document();
_MODIDocument.Create(filename);

// The MODI call for OCR
_MODIDocument.OCR(_MODIParameters.Language,
    _MODIParameters.WithAutoRotation,
    _MODIParameters.WithStraightenImage);

// add event handler for progress visualisation
_MODIDocument.OnOCRProgress +=
    new MODI._IDocumentEvents_OnOCRProgressEventHandler(this.ShowProgress);

public void ShowProgress(int progress, ref bool cancel)
{
    statusBar1.Text = progress.ToString() + "% processed.";
}

axMiDocView1.Document = _MODIDocument;

private void Statistic()
{
    // iterating through the document's structure doing some statistics.
    string statistic = "";
    for (int i = 0 ; i < _MODIDocument.Images.Count; i++)
    {
        int numOfCharacters = 0;
        int charactersHeights = 0;
        MODI.Image image = (MODI.Image)_MODIDocument.Images[i];
        MODI.Layout layout = image.Layout;
        // getting the page's words
        for (int j= 0; j< layout.Words.Count; j++)
        {
            MODI.Word word = (MODI.Word) layout.Words[j];
            // getting the word's characters
            for (int k = 0; k < word.Rects.Count; k++)
            {
                MODI.MiRect rect = (MODI.MiRect) word.Rects[k];
                charactersHeights += rect.Bottom-rect.Top;
                numOfCharacters++;
            }
        }
        float avHeight = (float )charactersHeights/numOfCharacters;
        statistic += "Page "+i+ ": Avarage character height is: "+
            avHeight.ToString(" 0.00 ") +" pixel!"+ "\r\n";
    }
    MessageBox.Show("Document Statistic:\r\n"+statistic);
}

// initialize MODI search
MODI.MiDocSearchClass search = new MODI.MiDocSearchClass();
search.Initialize(
    _MODIDocument,
    _DialogSearch.Properties.Pattern,
    ref PageNum,
    ref WordIndex,
    ref StartAfterIndex,
    ref Backward,
    MatchMinus,
    MatchFullHalfWidthForm,
    MatchHiraganaKatakana,
    IgnoreSpace);
MODI.IMiSelectableItem SelectableItem = null;
// the one and only search call
search.Search(null,ref SelectableItem);

It uses the Microsoft Office Document Imaging (MODI) library from Office 2003 to provide the OCR functionality for your application (you need to add a reference to MDIVWCTL.DLL).
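If all you need is the recognized text, the snippets above can be boiled down to something like the following. This is only a minimal sketch, not code from the original article: it assumes MODI is installed and referenced so that the `MODI` namespace is available, and the class name `ModiOcrSketch`, the method `ReadText`, the fallback path `scan.tif`, and the choice of English as the OCR language are all made up for illustration.

```csharp
using System;
using System.Text;

// Assumption: the project references the MODI COM library, so the MODI
// namespace resolves. ModiOcrSketch and ReadText are hypothetical names.
static class ModiOcrSketch
{
    // Runs OCR on every page of the image file and returns the recognized text.
    static string ReadText(string imagePath)
    {
        MODI.Document document = new MODI.Document();
        try
        {
            document.Create(imagePath);

            // English, with auto-rotation and image straightening enabled.
            document.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, true, true);

            StringBuilder text = new StringBuilder();
            for (int i = 0; i < document.Images.Count; i++)
            {
                MODI.Image page = (MODI.Image)document.Images[i];
                text.AppendLine(page.Layout.Text);
            }
            return text.ToString();
        }
        finally
        {
            // Close without saving the OCR results back into the file.
            document.Close(false);
        }
    }

    static void Main(string[] args)
    {
        // "scan.tif" is a placeholder; pass a real TIFF or MDI file instead.
        string path = args.Length > 0 ? args[0] : "scan.tif";
        Console.WriteLine(ReadText(path));
    }
}
```

Because MODI is a COM component that ships with Office, this only works on machines where the Document Imaging feature is actually installed.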
As Jon Galloway describes, the Microsoft Office Document Imaging libraries included with Microsoft Office are available on many computers and are easy to automate with .NET.
Jon lists a few others in his article.
tessnet (http://www.pixel-technology.com/freeware/tessnet2/) is an open-source .NET OCR engine based on Tesseract.
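For comparison with the MODI approach, a call into tessnet2 might look roughly like this. Treat it as a sketch under assumptions: it presumes a reference to the tessnet2 assembly and to System.Drawing, an English language data set unpacked into a `tessdata` folder next to the executable, and the three-argument `Init` overload (I believe older builds took only the language and the numeric-mode flag). The class name `TessnetSketch` and the fallback file name are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;

// Assumption: tessnet2.dll and System.Drawing are referenced.
// TessnetSketch is a hypothetical name used only for this example.
static class TessnetSketch
{
    static void Main(string[] args)
    {
        // "scan.png" is a placeholder; pass a real image path instead.
        string path = args.Length > 0 ? args[0] : "scan.png";

        using (Bitmap image = new Bitmap(path))
        {
            tessnet2.Tesseract ocr = new tessnet2.Tesseract();

            // Folder with the language data, language code, numeric-only mode off.
            ocr.Init("tessdata", "eng", false);

            // Rectangle.Empty asks the engine to process the whole image.
            List<tessnet2.Word> words = ocr.DoOCR(image, Rectangle.Empty);

            foreach (tessnet2.Word word in words)
            {
                Console.WriteLine("{0}\t{1}", word.Confidence, word.Text);
            }
        }
    }
}
```

Unlike MODI, this does not depend on Office being installed, but you do have to ship the language data files alongside your application.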
The best OCR engine is Tesseract. You can check how it works in this online OCR tool.
If you're OK with using an external, web-based API to do the OCR, take a look at http://www.wisetrend.com/wisetrend_ocr_cloud.shtml
Sample code in .NET (C#) to use this: http://snipt.org/lOgh/
Charges are per page, with a free trial available.