The default OCR action in Foxtrot offers a powerful and precise way to perform optical character recognition, either on a target on the screen or on an image based on a set of coordinates. In some cases, however, you might find the output of the OCR action unsatisfying, or it may not offer the flexibility you need. For example, you may not know in advance where to OCR: you need to scan a larger portion of the screen to find the position of a specific word. These types of features are coming to the built-in OCR action in Foxtrot. Until then, if you need such functionality, you may use the Google Cloud Vision API, the open-source Tesseract OCR engine as explained in this article, or any third-party solution.
Please reference a full example project and the test images at the end of the article.
Tesseract is an open-source OCR engine with support for Unicode and the ability to recognize more than 100 languages out of the box. You may access the official website for Tesseract here. The engine runs on many different platforms and can be used in many different ways. In this article, we will go through a simple approach: running the Windows Tesseract OCR engine from Foxtrot using the DOS Command action. This is a simple but limited approach. If you need to perform more advanced OCR, or find the output of this approach unsatisfying, you may consider reading the guide and solution using Tesseract OCR via Python (article coming...).
IMPORTANT: If you are working with PDF files, you can use the "Poppler Utility Library" (please reference the article below) to convert your PDF files to images in order to perform OCR with Tesseract:
Getting started
The first step is to download and install Tesseract. Go to this website, which is the official place to download Tesseract for Windows as specified here. We recommend downloading the latest version appropriate for your bit version of Windows. In this article, we will be using:
- tesseract-ocr-w64-setup-v4.1.0.20190314 (rc1)
After downloading Tesseract, run the simple installation. We do recommend placing the installed Tesseract OCR somewhere easily accessible for later use, for example, directly on the C: drive or in your Program Files folder.
When the installation is completed, you should be all set to open Foxtrot and run your first OCR test. It is that simple!
Simple OCR
This article is based on this information. There, you can find all the information you need to use the Tesseract OCR engine through the Foxtrot DOS Command action. All of the images tested throughout the article can be found at the end of the article. Let's make our first test.
The general concept of all DOS Command actions in Foxtrot will be:
cd TESSERACT_PATH && tesseract imagename outputbase [options...] [configfile...]
So, for example:
cd C:\Tesseract-OCR && tesseract C:\test_1.png C:\test_1
This will OCR the image located at "C:\test_1.png" and generate a text file output with the same name at the same location.
This is the image tested.
And this is the output.
This is the first line of
this text example.
This is the second line
of the same text.
Let's try one more.
cd C:\Tesseract-OCR && tesseract C:\test_2.png C:\test_2
This is the image tested.
And this is the output.
A Quick Brown Fox Jumps Over The Lazy Dog 0123456789
‘A Quick Brown Fox Jumps Over The Lazy Dog 0123456789
‘A Quick Brown Fox Jumps Over The Lazy Dog 0123456789
‘A Quick Brown Fox Jumps Over The Lazy Dog 0123456789
A Quick Brown Fox Jumps Over The Lazy Dog 0123456789
A Quick Brown Fox Jumps Over The Lazy Dog 0123456789
A Quick Brown Fox Jumps Over The Lazy Dog 0123456789
A Quick Brown Fox Jumps Over The Lazy Dog 0123456789
A Quick Brown Fox Jumps Over The Lazy Dog 0123456789
A Quick Brown Fox Jumps Over The Lazy Dog 0123456789
A Quick Brown Fox Jumps Over The Lazy Dog 0123456789
A Quick Brown Fox Jumps Over The Lazy Dog 0123456789
Notice how the Tesseract OCR engine struggles a bit in the beginning. This illustrates that it is not flawless, especially if the text is very small, unclear, or set in many different colors and thicknesses. Let's do one last simple test.
cd C:\Tesseract-OCR && tesseract C:\test_3.png C:\test_3
This is the image tested.
And this is the output.
No.
01121212
01445544
01454545
Name
Spotsmey er's Furnishings
Mr. Mike Nash
612 South Sunset Drive
Miami, FL US-FL 37125
USA
Progressive Home Furnishings
Mr. Scott Mitchell
3000 Roosevelt Blvd.
Chicago, US-IL 61236
USA
New Concepts Furniture
Ms. Tammy L. McDonald
705 West Peachtree Street
Atlanta, GA US-GA 31772
USA
VAT Registration No.
Amount of Sales
(Ley)
0,00
1.499,03
0,00
This time, it is almost perfect. It changes "Spotsmeyer's" --> "Spotsmey er's" and "(LCY)" --> "(Ley)", but other than that, the output is correct.
Performance improvements
In this article, we will not discuss in detail how the output could be improved. We recommend that you play around with the different available settings as explained in the official documentation. It is especially relevant to specify the language, page segmentation mode, and OCR engine mode for optimal performance. The quality and size of the images also play a significant role. Therefore, in some cases, you might need to write additional code in, for example, Python to achieve useful results.
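As a hedged sketch of what those settings look like on the command line, the language, page segmentation mode, and OCR engine mode can be appended to the same DOS Command pattern used above. The flag names below are from the Tesseract 4 CLI; verify them against your installation with "tesseract --help-extra" before relying on them:

```shell
REM Illustrative paths; adjust to your own installation and files.
REM -l eng   : use the English language data
REM --oem 1  : use the LSTM (neural network) engine only
REM --psm 6  : assume a single uniform block of text
cd C:\Tesseract-OCR && tesseract C:\test_1.png C:\test_1 -l eng --oem 1 --psm 6
```

Different page segmentation modes can change the output dramatically, so it is worth testing a few values of --psm against your own images.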
Remember that Foxtrot offers a wide variety of image actions as well. For example, you could take a screenshot of the whole application and then use the Crop Image action before the DOS Command action in order to only scan the desired area of the image. You can also grayscale and resize the image for better results.
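If you prefer to do that preprocessing in supplementary code instead of Foxtrot actions, here is a minimal sketch using the Pillow imaging library (an assumption: Pillow must be installed, and the paths and crop box are purely illustrative). It performs the same crop, grayscale, and resize steps described above before the image is handed to Tesseract:

```python
# Sketch: pre-process a screenshot before running Tesseract on it.
# Assumes the Pillow library is installed (pip install Pillow).
from PIL import Image


def preprocess(src_path, dst_path, box=None, scale=2):
    """Crop to `box` (left, top, right, bottom), grayscale, and upscale."""
    img = Image.open(src_path)
    if box is not None:
        img = img.crop(box)          # keep only the area of interest
    img = img.convert("L")           # grayscale often helps recognition
    img = img.resize((img.width * scale, img.height * scale),
                     Image.LANCZOS)  # upscale small text
    img.save(dst_path)
    return img
```

You would then point the Tesseract DOS Command at the saved, preprocessed image instead of the raw screenshot.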
Advanced features
So far, we have simply performed OCR on different images to extract all the text. But this is something Foxtrot is already capable of doing; in fact, in most cases Foxtrot performs better than Tesseract OCR at this. However, there are some advanced output features available with Tesseract OCR that make it very useful in many cases.
Find location of words
Please reference this solution through the last part of this article as we will use position-based clicks in combination with the Tesseract OCR engine.
With the current OCR action in Foxtrot, you need to know the location of where you wish to OCR. In some cases, however, you need to OCR in order to find the location of something. For example, suppose you are looking for a specific word in an application that you cannot find reliably with the standard targeting technology in Foxtrot. You want to OCR the application in order to find the word and then perform an action, like a click, based on its position (either clicking on the word itself or somewhere relative to it). Of course, it does not have to be a click; it could be any other action. The concept is using Tesseract OCR to find the location of words.
This is quite complex and advanced; therefore, we will go through it step by step. We will use Notepad, as that is an application everyone has. Open the Notepad application and write something in it like this.
Now it is time to set up the flow in Foxtrot. The goal of this case/exercise is to be able to click on specific words inside of the Notepad application. Let's start by saying that we need to click on the word "together".
In Foxtrot, the first thing is to always resize the target window and make sure it is placed properly (typically the top left corner) on the screen.
Hereafter, we target the whole window (targeting the whole window is important!) and create a Screenshot action.
Straight after, we can use the Tesseract OCR engine to OCR that screenshot generated by Foxtrot.
IMPORTANT: notice the addition of "tsv" and "-l eng" at the end. The "tsv" changes the method from simply extracting all text to generating a list-formatted output with information on every detected word, and "-l eng" specifies that we are working with English words.
cd C:\Tesseract-OCR && tesseract C:\test_4.png C:\test_4 tsv -l eng
The output is now a ".tsv" file. You can open it with Notepad (right-click and select "Open with") to have a look. It is a tab-separated list of data, and we will now create two actions in Foxtrot to dynamically load in the information. So far, we have these actions.
Next, we create a "Rename file" action to change the extension of the file.
Now, we use the Open List action to load in the data from the txt file into a list. Select the action from the action list, fill out the first three fields and then click "Import".
Now, make sure to select "Tab" as the delimiter before clicking "Next".
Here, make sure to select "Use included field names" to use the column headers from the OCR output. Here you can properly see the structure of the output, where the last column is the detected words.
Hereafter, simply continue to the end and then run the action. You should now have a list in Foxtrot similar to this.
Now, we will mostly focus on the last columns, as these contain the information on where the word was detected (left, top, width, height), how sure Tesseract is of the word, and the word/text itself. The only thing left is to perform the click, as we now have all the information we need to click on the word "together". We can see the location of the word in the list.
Here are the actions we have so far.
So, now we can make a command using the pyautogui_Screen solution to click somewhere on the screen based on coordinates - the coordinates from the list generated by Tesseract OCR. We need to retrieve the coordinates based on the information from the list. The correct values all depend on where you actually want to click. For example, if you do not want to click on the word "together" but rather 50 pixels to the right of the word, your Coordinate_X should be:
- [left] + [width] + 50
In our case, we would like to click right in the middle of the word "together", so our variables should be:
- Coordinate_X = [left] + ([width] / 2)
- Coordinate_Y = [top] + ([height] / 2)
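The lookup and coordinate calculation above can be sketched in a few lines of Python. This is an illustrative helper, not part of the Foxtrot flow itself; it assumes the .tsv file produced by the "tesseract ... tsv" command, whose columns include left, top, width, height, conf, and text:

```python
# Sketch: find the first occurrence of a word in a Tesseract .tsv output
# and return the rounded center coordinates for a click.
import csv


def find_click_point(tsv_path, word):
    """Return (x, y) at the center of the first match of `word`, or None."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row["text"].strip() == word:
                x = int(row["left"]) + int(row["width"]) / 2
                y = int(row["top"]) + int(row["height"]) / 2
                return round(x), round(y)   # pixel coordinates are integers
    return None
```

The same arithmetic is what the Foxtrot calculation actions perform with the [left], [top], [width], and [height] values from the list.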
Here is the full overview of the approach. First, we create and set a variable that defines the word we wish to work with, then, we retrieve all the information from the list to finally calculate the desired coordinates.
After performing the calculations, it is important to round the values to zero decimals, as pixel coordinates need to be whole integers. Therefore, do that for both coordinate variables.
Hereafter, we are ready to click! This is how you perform the click using the pyautogui_Screen solution. Be aware that you need to validate the path to your pyautogui_Screen solution, as it will likely differ from the one below!
C:\pyautogui_Screen_v0.3\pyautogui_Screen_v0.3.exe -command_click "click" -position "[%Coordinate_X], [%Coordinate_Y]"
That's it! That is the general concept of working with the tsv output of Tesseract OCR to, for example, click on a specific word. As mentioned, you can also easily do other things than simply clicking, whether that is sending values, performing a new OCR with the Foxtrot OCR action based on the location information retrieved via Tesseract OCR, etc.
It is also important to acknowledge that this requires a lot of testing and practice. You need to consider the "conf" value from the tsv output in the list, as this indicates how sure (0-100) Tesseract is of the output. And what if the same word appears more than once? Or if the Tesseract OCR engine is not able to locate the desired word? You will have to test the approach and find the appropriate method for the specific application(s) you work with.
Comments
6 comments
I mean that right now it is returning coordinates for each word found, but can it be told to discern whole lines of text?
If a screenshot has multiple lines of text, including this one: "Cross Application Transactions Enter/Update"
It will return the coordinates for "Cross", for "Application", for "Transaction", etc. with the X, Y, width, etc. for each.
Can it be told to treat whole lines of text as one segment, instead of returning a result for each word in that line of text?
I see. Well, the standard command for Tesseract (the one not returning coordinates) will give you the full text output with linebreaks, etc. There is no such command built into Tesseract that returns the coordinates of full sentences/lines of text, so you would have to build the logic yourself.
But, it shouldn't be too hard to do. You can use the Query List action to get rid of, for example, all blank rows in your list to begin with (so you only have the actual words detected). Then, you could use either the value of the "line_num" column or the "top" column to determine when you get to the next row of text. There are also other ways to do it, but that's the general idea. It's not super smooth, but that's the options offered by Tesseract.
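For anyone who wants to do that grouping outside of Foxtrot, here is a minimal Python sketch of the "line_num" idea, under the assumption that the .tsv has the standard Tesseract columns (level, block_num, par_num, line_num, text) and that word rows have level 5:

```python
# Sketch: rebuild full lines of text from a word-level Tesseract .tsv.
import csv


def lines_from_tsv(tsv_path):
    """Group non-blank word rows by (block, paragraph, line) into lines."""
    lines = {}  # insertion-ordered in Python 3.7+
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row["level"] != "5" or not row["text"].strip():
                continue  # skip non-word rows and blank detections
            key = (row["block_num"], row["par_num"], row["line_num"])
            lines.setdefault(key, []).append(row["text"])
    return [" ".join(words) for words in lines.values()]
```

Averaging the "top" values per group, as you did, is the same idea; keying on line_num just avoids the manual coordinate arithmetic.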
This is an awesome example writeup and sample script, thank you! Made template-izing it very easy.
I've found a good trick for making the findings more accurate by eliminating clutter and cutting down on the amount of work Tesseract has to do, plus helping avoid false positives if a word could potentially be found in more than one place--
Get your initial screenshot with the pre-determined window size and location, then use Foxtrot's Crop Image action on all four sides, using variables for each (CropLeft, CropRight, CropTop, CropBottom) so you keep track of how much was trimmed from the original.
Now during or immediately after retrieval of the Top and Left variables, just add values of CropTop and CropLeft back into each to get the true screen position.
Is it possible to have Tesseract return text as whole lines/rows from the screenshots, or will it only return individual words? For a multiple word hit I have ended up adding coordinates together and averaging them, and that worked, but seems like a clumsy way to do it.
Line_num! That one got past me. Awesome, I see how to do this now.
James,
Great input! In terms of your question, I'm not sure I understand your question. Are you asking if it is possible to return all text from a screenshot or? That's what is illustrated and explained in the first part of the article. But are you referring to something else?