Regular Expressions ('RegEx') are a fast, powerful and accurate way to identify exactly the text you want to extract from an area of a document. For those with some technical know-how, creating Regular Expressions can be simple; for others it may take a little more time.
The internet is full of websites that help you build and validate Regular Expressions. Some examples that you may wish to use include:
Here are a few examples of Regular Expressions to help get you started.
Hints:
Fig 1. Test your results: the Tesseract and ABBYY OCR engines can give you different results. ABBYY set to Accurate tends to provide the best results and doesn't slow the extraction down (compared to Fast). When using Tesseract, leaving the setting on Fast tends to provide the best balance of results vs. speed.
Fig 2. Remember, Format and Validation provides the rules for how data should be structured once it has been captured.
Fig 3. Smart Seek provides the rules we use to locate and capture the data we want.
Use 'Test' in Umango Extract to check your settings before saving your job.
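The difference between the two rule types can be illustrated outside Umango Extract with a minimal Python sketch. It assumes that a Smart Seek pattern is searched for anywhere in the OCR text, while a Format and Validation pattern must describe the whole captured value; this is an interpretation of the hints above, not documented Umango behaviour. The sample text and the function names smart_seek and is_valid_format are hypothetical, and Python's re module stands in for whatever regex engine Umango uses.

```python
import re

# Hypothetical OCR output for one page (not taken from a real document).
PAGE_TEXT = "Invoice date 3/07/2021\nAccount 123 456 789\nTotal $1250.00"

# The date pattern from the examples, without the REGEX(...) wrapper.
DATE_PATTERN = r"\d{1,2}/\d{1,2}/\d{2,4}"


def smart_seek(pattern, text):
    """Return the first piece of text matching the pattern, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


def is_valid_format(pattern, value):
    """Return True only if the whole captured value fits the pattern."""
    return re.fullmatch(pattern, value) is not None


captured = smart_seek(DATE_PATTERN, PAGE_TEXT)
print(captured)
print(is_valid_format(DATE_PATTERN, captured))
print(is_valid_format(DATE_PATTERN, "3 July 2021"))
```

A quick check like this does not replace 'Test' in Umango Extract, since the OCR engine and regex engine there may behave differently.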
Fig 1. | Fig 2. | Fig 3. |
Objective | Regular Expression for Format and Validation | Regular Expression for Smart Seek |
A 6 to 7 digit number | REGEX(\d{6,7}) | |
A number after the word BALANCE | REGEX(\d{1,5}.\d{2}) | REGEX((?<=BALANCE.*)\d{1,5}.\d{2}) |
Date below looking for NN/NN/NNNN | REGEX(\d{1,2}/\d{1,2}/\d{2,4}) | REGEX(\d{1,2}/\d{1,2}/\d{2,4}) |
Looking for a number NN NNN NNN NNN | REGEX(\d{2}\s*\d{3}\s*\d{3}\s*\d{3}) | |
Looking for a string of data after the word Name: | RegEx((?<=Name:\s).*\n) | |
Looking for the dollar amount | REGEX(\$[0-9]{1,5}.[0-9]{2}) | REGEX(\$[0-9]{1,5}.[0-9]{2}) |
Extracting the account number | REGEX([0-9]{3}\s?[0-9]{3}\s?[0-9]{3}) | REGEX([0-9]{3}\s?[0-9]{3}\s?[0-9]{3}) |
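The example patterns can be tried against sample text before they go into a job. The Python sketch below is one way to do that, under a few assumptions: the REGEX(...) wrapper is treated as Umango syntax and dropped, the sample text is invented, and Python's re module is used in place of Umango's engine. Python only accepts fixed-width lookbehinds, so the (?<=BALANCE.*) pattern is rewritten here with a capture group. The sketch also escapes the decimal point as \. because an unescaped . matches any character; whether the looser form is intentional in the original patterns (for example, to tolerate OCR misreads) is not known.

```python
import re

# Invented sample text standing in for OCR output.
SAMPLE = (
    "Name: Jane Citizen\n"
    "Account 123 456 789\n"
    "Reference 12 345 678 901\n"
    "BALANCE 1520.75\n"
    "Amount due $1520.75\n"
)

# Patterns adapted from the examples, without the REGEX(...) wrapper.
PATTERNS = {
    "reference": r"\d{2}\s*\d{3}\s*\d{3}\s*\d{3}",
    "dollar_amount": r"\$[0-9]{1,5}\.[0-9]{2}",
    # Fixed-width lookbehind, so this one works unchanged in Python.
    "name": r"(?<=Name:\s).*\n",
    # Capture-group rewrite of (?<=BALANCE.*)\d{1,5}.\d{2}
    "balance": r"BALANCE.*?(\d{1,5}\.\d{2})",
}


def extract(label, text):
    """Return the first match for a labelled pattern, or None."""
    match = re.search(PATTERNS[label], text)
    if match is None:
        return None
    # Use the capture group when the pattern defines one.
    value = match.group(1) if match.groups() else match.group(0)
    return value.strip()


for label in PATTERNS:
    print(label, "->", extract(label, SAMPLE))
```

The capture-group form returns the same value as the lookbehind would for this sample, but the two are not interchangeable in every engine, so the original pattern should still be checked in Umango Extract itself.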
Link to this article http://umango.com/KB?article=85