Improving OCR Accuracy
OCR (optical character recognition) is the process of electronically reading the text contained within an image and converting it into machine-readable text. Whilst you might be able to read printed text clearly with the eye, OCR technology relies on different logic, so a document that looks legible to you may still produce errors.
Here is a list of things to consider for getting the best OCR results:
- Plain fonts with clearly separated characters and no unusual flourishes or exotic serifs tend to provide better accuracy. Common fonts such as Arial and Times New Roman tend to produce better results.
- Scanning in black and white at around 300 dpi tends to give the best results. Low-resolution scans often present a pixelated, jagged image to the OCR engine, whilst very high resolutions can amplify imperfections in the document and cause errors. If scanning in color or greyscale, 200 dpi is an ideal resolution.
- Aim for no image artifacts such as speckling, lines, or images. Avoid any non-character ink.
- Choose clearly formed characters that are not too small.
- Ensure good image contrast. White should be white. Black should be solid black.
- Characters should be clearly defined and fully formed, without blurred edges.
- OCR is more accurate on words it recognizes from the selected language. Text such as names and random values will always be less accurate.
- Ensure the text is not skewed.
- Select as few languages as possible. If your document is only expected to contain English content, select only English in the OCR settings.
- Avoid highly compressed source documents such as "compact PDF" or similar, as these may reduce both the resolution and the image quality. Even PDFs that report being 300 dpi may have compression applied that reduces image clarity. Try to use TIF files where possible.
- Important! Ensure good quality source documents. Running image enhancers over poor quality images will never be as effective as starting with a good quality source document before scanning. Poor quality documents will provide poor OCR results.
- Select the most suitable OCR engine for the job. Different OCR engines can give different results on the same printed text. Umango offers a choice between Tesseract and ABBYY. Test both to see which provides the better result.
- Using regular expressions as part of your OCR rules can help ensure an accurate result. Together with OCR confidence thresholds, they let you define when Umango should present the captured text to the user to be confirmed or corrected before saving.
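To illustrate how a regular expression and a confidence threshold can work together, here is a minimal Python sketch. It is not Umango's implementation or configuration syntax. The invoice number pattern, the 0.90 threshold, the 0.0 to 1.0 confidence scale, and the names OcrCapture and needs_confirmation are all assumptions made for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical values for illustration only; these are not Umango settings.
INVOICE_PATTERN = re.compile(r"INV-\d{6}")  # assumed format: INV- plus six digits
MIN_CONFIDENCE = 0.90                       # assumed threshold on a 0.0 to 1.0 scale


@dataclass
class OcrCapture:
    text: str          # text returned by the OCR engine
    confidence: float  # engine confidence, assumed to lie between 0.0 and 1.0


def needs_confirmation(capture: OcrCapture) -> bool:
    """Return True when the captured text should be shown to a user for review."""
    # The whole value must match the expected format, ignoring outer whitespace.
    matches_pattern = INVOICE_PATTERN.fullmatch(capture.text.strip()) is not None
    confident = capture.confidence >= MIN_CONFIDENCE
    # Accept automatically only when both checks pass.
    return not (matches_pattern and confident)


if __name__ == "__main__":
    samples = [
        OcrCapture("INV-004217", 0.97),  # well formed and confident
        OcrCapture("INV-OO4217", 0.95),  # letter O misread in place of zero
        OcrCapture("INV-004217", 0.62),  # well formed but low confidence
    ]
    for sample in samples:
        print(sample.text, sample.confidence, needs_confirmation(sample))
```

Only the first sample passes both checks. The second fails the pattern despite its high confidence, which shows why the two checks complement each other: a confident misread can still be caught by the expected format.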
Link to this article http://umango.com/KB?article=120