PDF files are a common document format in business processes. They come in different layouts and types and can contain various kinds of data. Typically, PDF files are associated with purchase order and invoice processes, where you receive an email with an attached PDF file, download it, extract the data, and possibly enter it into a system or send the document for approval. So, how do you work with PDF files in Foxtrot?
It is very important to recognize that the right approach to PDF files depends greatly on a number of factors:
- Is it a system/text PDF file or an image saved as a PDF file? In other words, can you open the PDF and copy-paste the text?
- What type of data does it contain? Is it plain text, a table of data, an invoice/order (typically with multiple columns in combination with a table of item lines), or something else?
- How much of the data do you need to extract? If it is an invoice, do you only need a few of the master fields such as invoice number and date or do you need everything including the individual item lines?
- How much do the layout and quality of the PDF files fluctuate? Is it a single format that you can always expect (typically internal documents you are confident will not change without notice)? Are the documents sent by external parties such as vendors? Or are there possibly many different documents from different external parties, each with its own unique layout?
Generally speaking, automating processes that involve PDF files is quite complicated because the layout and structure of the documents are typically hard to predict, which makes it hard to set up a bullet-proof solution. That does not mean it is impossible; it is just important to be aware of!
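The first question in the list can also be checked automatically. The following is a minimal Python sketch run outside Foxtrot, assuming the Poppler utility pdftotext is installed and available on the PATH. The function names and the 20-character threshold are illustrative assumptions, not part of Foxtrot or Poppler.

```python
import subprocess


def extract_text(pdf_path: str) -> str:
    """Run Poppler's pdftotext and return whatever text it finds.

    Passing "-" as the output file makes pdftotext write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout


def looks_like_text_pdf(extracted_text: str, min_chars: int = 20) -> bool:
    """Heuristic: a scanned ("image") PDF usually yields little or no text.

    min_chars is an arbitrary threshold chosen for illustration.
    """
    visible = [ch for ch in extracted_text if ch.isalnum()]
    return len(visible) >= min_chars
```

A call such as looks_like_text_pdf(extract_text("invoice.pdf")) returning False suggests the file needs OCR before its content can be processed. The file name here is only a placeholder.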
Working with "text" PDF files
It is always much easier to work with system-generated PDF files, or "text" PDF files, recognizable by the fact that you can open the document and select/copy the text from it. In contrast, when you open an "image" PDF file and attempt to click on the content to select/copy the text, you will see that it will not let you, indicating that the document is essentially an image.
Why are "text" PDF files easier than "image" PDF files? Because before you can work with the content of "image" PDF files, you have to perform optical character recognition (OCR) on the document to convert the image to text, and only then can you work with the content. With "text" PDF files, you can skip the OCR part (which is by far the most complicated step of them all) and go straight to working with the content of the documents. Later in the article, we reference articles on how to work with OCR.
If you are working with "text" PDF files, we recommend that you take a look at the following two approaches. One is not necessarily better than the other; it is a matter of taste. The free Poppler utility offers quite a wide variety of features: it not only converts PDF files to text for processing, but can also convert PDF files to images before you perform OCR. On the other hand, the tool called A-PDF Text Extractor is a free desktop application offering a graphical interface to convert PDF files to text, which some users find more user-friendly and therefore prefer.
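As a rough illustration of the Poppler approach, the Python sketch below converts a PDF to text with pdftotext and then pulls two master fields out of the result. It assumes Poppler is installed and on the PATH. The function names, the regular expressions, and the sample invoice wording are all hypothetical, and real documents will need patterns matched to their own layout.

```python
import re
import subprocess
from typing import Dict, List, Optional


def build_pdftotext_command(pdf_path: str, txt_path: str) -> List[str]:
    """Build the pdftotext call; -layout keeps columns roughly aligned."""
    return ["pdftotext", "-layout", pdf_path, txt_path]


def convert_pdf_to_text(pdf_path: str, txt_path: str) -> None:
    """Run pdftotext; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)


# Hypothetical patterns for an invoice that prints
# "Invoice No: ..." and "Date: YYYY-MM-DD".
INVOICE_NUMBER = re.compile(
    r"Invoice\s+(?:No\.?|Number)\s*:?\s*(\S+)", re.IGNORECASE
)
INVOICE_DATE = re.compile(r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE)


def extract_master_fields(text: str) -> Dict[str, Optional[str]]:
    """Return the invoice number and date, or None where nothing matched."""
    number = INVOICE_NUMBER.search(text)
    date = INVOICE_DATE.search(text)
    return {
        "invoice_number": number.group(1) if number else None,
        "invoice_date": date.group(1) if date else None,
    }
```

Returning None for a missing field, rather than raising an error, lets the calling process decide whether to route the document for manual handling.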
Working with "image" PDF files
If your documents are actually scanned images, the best approach to OCR depends on the type of PDF and on the quality and structure of the data. For very simple OCR, you can use the built-in OCR action. With the included action, you can simply open your files (using, for example, the Open File action) and then target the whole window of the application to OCR the part of the file you need. A Resize Window action should be used before the OCR action to make sure the size of the window is always the same. Alternatively, you could use the Poppler utility called "pdftoppm" to convert the PDF to an image and then use the Open Image action to OCR that opened image. However, the built-in OCR action is only useful for simple jobs.
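The pdftoppm conversion mentioned above could be scripted along these lines. This is a hedged Python sketch that assumes Poppler is installed and on the PATH. The function names and the 300 DPI default are illustrative choices, not Foxtrot or Poppler defaults.

```python
import subprocess
from typing import List, Optional


def build_pdftoppm_command(
    pdf_path: str,
    output_prefix: str,
    dpi: int = 300,
    first_page: Optional[int] = None,
    last_page: Optional[int] = None,
) -> List[str]:
    """Build a pdftoppm call that renders PDF pages as PNG images."""
    command = ["pdftoppm", "-png", "-r", str(dpi)]
    if first_page is not None:
        command += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        command += ["-l", str(last_page)]  # last page to convert
    # pdftoppm writes one image per page, named from the output prefix.
    command += [pdf_path, output_prefix]
    return command


def convert_pdf_to_images(pdf_path: str, output_prefix: str) -> None:
    """Run pdftoppm; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_pdftoppm_command(pdf_path, output_prefix), check=True)
```

A higher resolution generally gives the OCR step more detail to work with, at the cost of larger image files.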
For more powerful OCR solutions, you could consider using the Tesseract OCR engine as explained in the article below, or, for example, the Google Cloud Vision API, which offers a very powerful OCR engine. Again, the appropriate method of OCR depends entirely on the type(s) of documents you are going to work with. It is very important to note that OCR is an extremely difficult process to master, and most people have to compromise on their expectations.
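For orientation, a call to the Tesseract command-line tool on an already converted image might look like the Python sketch below. It assumes Tesseract 4 or later is installed and on the PATH (older releases spell the page segmentation option differently). The function names and default values are illustrative assumptions and may differ from the setup described in the referenced article.

```python
import subprocess
from typing import List


def build_tesseract_command(
    image_path: str, language: str = "eng", page_seg_mode: int = 6
) -> List[str]:
    """Build a tesseract call that prints the recognized text.

    Using "stdout" as the output base sends the text to standard output.
    Page segmentation mode 6 assumes a single uniform block of text.
    """
    return [
        "tesseract",
        image_path,
        "stdout",
        "-l",
        language,
        "--psm",
        str(page_seg_mode),
    ]


def ocr_image(image_path: str, language: str = "eng") -> str:
    """Run Tesseract on one image and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path, language),
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout
```

The recognized text will often contain misread characters, so any field extraction that follows should tolerate small errors or flag uncertain documents for review.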