Poppler is a free software utility library for rendering Portable Document Format (PDF) documents. Among its many useful features, Poppler lets you convert .pdf files to .txt, so that you can use all the formulas in Foxtrot to extract information from the document with high precision, flexibility, and speed. Here is the official Poppler website:
Now, Poppler is primarily developed for Linux, a different operating system than Windows. It can therefore be quite tricky to get it working if you download the release from the official website, as that release is a bunch of "C++" files (files containing code) that have to be compiled first. Luckily, you may download a compiled version of the utility library via the following link:
You might notice that the compiled version is not quite up-to-date with the official release. If you are eager to use the latest version and know how to do it, you may download the release from the official website and compile the code yourself (Google is always your friend!).
If you find this to be too complicated and technical, you may consider using the A-PDF Text Extractor tool that offers the same ability to extract text from PDF files but through a desktop application:
IMPORTANT: Remember, both the Poppler utility library and A-PDF Text Extractor work well with "text" PDF files, but they are not able to extract text from "image" PDF files. If you are not sure about the difference, please read our overall guide on how to work with PDF files:
Getting started
Begin by downloading the binary version of the Poppler utility library from the (blog) link above. The download will be a "7z" file, so you will need a program to open and extract it. You can use 7-Zip, WinRAR, or anything similar to do this.
After the download, make sure to extract the zipped folder to an appropriate destination. For this article, we will extract it directly to the C: drive, as that makes using it in Foxtrot much more convenient, but you can place it anywhere you would like.
You should now have a folder in the directory you selected with contents similar to this.
All of the programs in the Poppler utility library are available in the bin folder. This is the folder you will be working with, as it contains the programs we are interested in running to perform actions such as converting a .pdf file to a .txt file using "pdftotext.exe".
For this article, we will not go into detail about all the different programs available in Poppler. Luckily, this great website offers some detailed guides on all the different features:
Command | Description |
---|---|
pdfdetach | Portable Document Format (PDF) document embedded file extractor (version 3.03) |
pdffonts | Portable Document Format (PDF) font analyzer (version 3.03) |
pdfimages | Portable Document Format (PDF) image extractor (version 3.03) |
pdfinfo | Portable Document Format (PDF) document information extractor (version 3.03) |
pdfseparate | Portable Document Format (PDF) page extractor |
pdfsig | Portable Document Format (PDF) digital signatures tool |
pdftocairo | Portable Document Format (PDF) to PNG/JPEG/TIFF/PDF/PS/EPS/SVG using cairo |
pdftohtml | program to convert PDF files into HTML, XML and PNG images |
pdftoppm | Portable Document Format (PDF) to Portable Pixmap (PPM) converter (version 3.03) |
pdftops | Portable Document Format (PDF) to PostScript converter (version 3.03) |
pdftotext | Portable Document Format (PDF) to text converter (version 3.03) |
pdfunite | Portable Document Format (PDF) page merger |
Convert PDF to TXT
After successfully downloading and extracting the Poppler utility library, it is time to open Foxtrot and get started using the awesome features! To convert .pdf files to .txt files, we will use "pdftotext.exe" (this is the equivalent of A-PDF Text Extractor). For the purpose of this article, we have uploaded two dummy invoices that you can download at the end of the article for practice. Now, as discussed already, you need to make sure that the PDF files you are working with are "text" PDF files. If we take a look at the first invoice, you can see that we can mark the text (the individual characters), indicating that the document is, in fact, a "text" PDF file.
Now, let us test using Poppler's "pdftotext" to convert this PDF file in order to be able to extract certain information from the document such as invoice number, invoice date, invoice terms, etc.
In Foxtrot, we will use the DOS Command action. Before setting up the action, we need to consider:
- Where did we place the Poppler utility library? In this case, we placed it directly in the C:\ drive, therefore, the full path to the "pdftotext" program will be: "C:\poppler-0.68.0\bin\pdftotext.exe"
- Where is the file we need to convert? In this case, we placed the "INV0001.pdf" file in the downloads directory, therefore, the path will be: "[*DOWNLOADS_DIRECTORY]INV0001.pdf"
- Where would we like to place the output (.txt) file? In this case, we would like to place it in the same directory as the existing .pdf file, simply with .txt as the file extension.
In this case, the DOS Command line in Foxtrot would be:
cd c:\poppler-0.68.0\bin && pdftotext "[*DOWNLOADS_DIRECTORY]INV0001.pdf" "[*DOWNLOADS_DIRECTORY]INV0001.txt"
And here is the output:
Notice how the structure of the output does not match the layout of the original document. As explained in the "pdftotext" guide, the default is to "undo" the physical layout (columns, hyphenation, etc.) and output the text in reading order. It is possible to use the "-layout" option to maintain (as best as possible) the original physical layout of the text instead. So, let us try that:
cd c:\poppler-0.68.0\bin && pdftotext -layout "[*DOWNLOADS_DIRECTORY]INV0001.pdf" "[*DOWNLOADS_DIRECTORY]INV0001.txt"
And here is the new output after implementing the "-layout" option:
Notice the significant change in the output. One approach is not necessarily better than the other; it is up to you to find the best approach for your specific files.
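If you want to try the same two commands outside Foxtrot, the following is a minimal Python sketch of how they could be assembled and run. It assumes Poppler was extracted to C:\poppler-0.68.0 as in this article. The function names are made up for illustration and are not part of Poppler or Foxtrot; only "pdftotext.exe" and the "-layout" option come from the commands above.

```python
import subprocess
from pathlib import PureWindowsPath

# Assumption: Poppler was extracted directly to the C: drive, as in this article.
POPPLER_BIN = PureWindowsPath(r"C:\poppler-0.68.0\bin")


def default_txt_path(pdf_path):
    """Return the .pdf path with a .txt extension, in the same folder."""
    return str(PureWindowsPath(pdf_path).with_suffix(".txt"))


def build_pdftotext_command(pdf_path, txt_path=None, layout=False):
    """Build the argument list for one pdftotext call (hypothetical helper)."""
    command = [str(POPPLER_BIN / "pdftotext.exe")]
    if layout:
        # Keep the original physical layout as far as possible.
        command.append("-layout")
    command.append(str(pdf_path))
    command.append(str(txt_path) if txt_path else default_txt_path(pdf_path))
    return command


def convert_pdf_to_text(pdf_path, layout=False):
    """Run pdftotext; raises CalledProcessError if the tool reports a failure."""
    subprocess.run(build_pdftotext_command(pdf_path, layout=layout), check=True)
    return default_txt_path(pdf_path)
```

Passing the arguments as a list avoids having to quote paths that contain spaces, which is why the sketch does not build one long command string.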
Now, after converting the PDF file to a text file, you can simply use the Read File action to load the text into a variable and then perform formulas or Regular Expressions in order to extract the information you are looking for. Here is a simple example. First, we read the file into a variable.
And now, here is the variable:
To then extract some specific data, we could use the Formula action. If the input contains line breaks, you always need to clean the input first using the Clean formula under "Text". Here, you remove any line breaks and replace them with an appropriate character, such as a pipe. We are not done making the formula yet, so hit "Hold" after setting up the Clean formula.
We can then, for example, use the Between formula.
The input will be the output of our Clean formula.
Tip: Click on the textbox containing the Clean formula in order to activate it and see the preview below.
Now, if we would like, for example, to extract the invoice number, we can simply set up the formula like this, taking the value between "Number" and the next line break (which the Clean formula has replaced with a pipe):
Of course, the formulas you need depend on the documents and the data you are working with. And if you are working with different types of documents and layouts, you might need to set up logic for each document. But, hopefully, this gives you an idea of how to use the Poppler utility library to convert PDF files to text files and extract information from the output. We recommend that you spend some time training on the two attached files to learn the basics before heading into some real projects.
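To make the logic of the two formulas concrete, here is a rough Python analogue. It is only a sketch: the functions below are not Foxtrot's Clean and Between formulas, whose exact behavior may differ, and the sample text is a hypothetical stand-in for the real invoice output.

```python
def clean(text, replacement="|"):
    """Replace every line break with a single replacement character."""
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.replace("\n", replacement)


def between(text, start, end):
    """Return the text between the first `start` and the next `end`."""
    start_index = text.find(start)
    if start_index == -1:
        return ""  # start marker not found
    start_index += len(start)
    end_index = text.find(end, start_index)
    if end_index == -1:
        end_index = len(text)  # no end marker: take the rest of the text
    return text[start_index:end_index].strip()


# Hypothetical invoice text; the real content depends on your PDF file.
sample = "Invoice\nNumber INV0001\nDate 2020-01-31\n"
invoice_number = between(clean(sample), "Number", "|")
print(invoice_number)  # INV0001
```

Replacing the line breaks first means the end marker is a single visible character, which is easier to work with than the different line-break styles a text file can contain.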
Convert PDF to image
Another very useful feature of the Poppler utility library is the ability to convert PDF files to images. This is useful when you work with PDF files that are not "text" but "image" PDF files, because in order to OCR such a file, you should have it in an image format. For the purpose of testing this, we have uploaded a file called "Image_PDF.pdf" that you may find at the end of the article. As you can see below, this document is clearly an image, as it is not possible to mark anything in the document.
Therefore, in order to read the content of the document, we have to perform OCR. For more details on your different OCR options, please read our overall guide on how to work with PDF files:
To illustrate the concept of using the Poppler utility library to convert a PDF file to an image in order to be able to perform OCR, we will convert this file and use the built-in OCR action in Foxtrot to extract text from the image.
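Besides trying to mark text by hand, one possible way to tell the two kinds of PDF files apart automatically is to run "pdftotext" first and check whether the output contains any visible characters. The sketch below shows this heuristic in Python. It assumes that an "image" PDF file yields only whitespace such as page-break characters, which is typical but not guaranteed, and the function name is hypothetical.

```python
def looks_like_text_pdf(extracted_text, min_characters=1):
    """Heuristic: True if the pdftotext output has enough visible characters.

    Assumption: an "image" PDF file produces only whitespace, such as the
    form feed that marks the end of a page.
    """
    visible = [ch for ch in extracted_text if not ch.isspace()]
    return len(visible) >= min_characters
```

Raising `min_characters` can help with scanned documents that contain a few stray text characters, such as a stamped page number.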
"pdftoppm.exe" is the program to use to convert PDF files to images (you may also test some of the other programs in the Poppler utility library that are also able to convert PDF files to images - use the program you find to perform best for your needs).
In Foxtrot, we will use the DOS Command action. Before setting up the action, we need to consider:
- Where did we place the Poppler utility library? In this case, we placed it directly in the C:\ drive, therefore, the full path to the "pdftoppm" program will be: "C:\poppler-0.68.0\bin\pdftoppm.exe"
- Where is the file we need to convert? In this case, we placed the "Image_PDF.pdf" file in the downloads directory, therefore, the path will be: "[*DOWNLOADS_DIRECTORY]Image_PDF.pdf"
- Where would we like to place the output (image) file? In this case, we would like to place it in the same directory as the existing .pdf file, simply as an image file instead of PDF.
In this case, the DOS Command line in Foxtrot would be:
cd c:\poppler-0.68.0\bin && pdftoppm "[*DOWNLOADS_DIRECTORY]Image_PDF.pdf" "[*DOWNLOADS_DIRECTORY]Image_PDF"
Notice how we do NOT specify the file extension on the output file. That is because this will automatically be defined by the "pdftoppm.exe" program. The standard output format is ".ppm", however, you typically would like the output file to be either ".png" or ".jpg". To change the output file format, you can simply use the options like this.
PNG ("-png"):
cd c:\poppler-0.68.0\bin && pdftoppm -png "[*DOWNLOADS_DIRECTORY]Image_PDF.pdf" "[*DOWNLOADS_DIRECTORY]Image_PDF"
JPEG ("-jpeg"):
cd c:\poppler-0.68.0\bin && pdftoppm -jpeg "[*DOWNLOADS_DIRECTORY]Image_PDF.pdf" "[*DOWNLOADS_DIRECTORY]Image_PDF"
The output file name will be the name you specified plus an appended page number, because the "pdftoppm.exe" program converts every page of the PDF file to a separate image file. If you would like to avoid the appended number in the file name, you can use the "-singlefile" option, which tells the "pdftoppm.exe" program to convert only the first page of the file and not append any number to the file name.
cd c:\poppler-0.68.0\bin && pdftoppm -singlefile -png "[*DOWNLOADS_DIRECTORY]Image_PDF.pdf" "[*DOWNLOADS_DIRECTORY]Image_PDF"
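If you convert every page instead of using "-singlefile", you end up with one image per page, and you typically want to process them in page order. The following Python sketch collects those files. It assumes the output names follow the pattern of the name you specified, a hyphen, the page number, and the extension (for example "Image_PDF-1.png"); the helper name is made up for illustration.

```python
import re
from pathlib import Path


def list_page_images(output_root, extension="png"):
    """Return the page images written for output_root, ordered by page number.

    Assumption: files are named <root>-<page number>.<extension>.
    """
    root = Path(output_root)
    pattern = re.compile(
        re.escape(root.name) + r"-(\d+)\." + re.escape(extension)
    )
    pages = []
    for candidate in root.parent.iterdir():
        match = pattern.fullmatch(candidate.name)
        if match:
            pages.append((int(match.group(1)), candidate))
    # Sort numerically so that page 10 comes after page 2.
    return [path for _, path in sorted(pages, key=lambda item: item[0])]
```

Sorting by the numeric page number rather than by file name keeps the order correct whether or not the page numbers are zero-padded.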
One other option that is quite useful is the "-r" option, which allows you to specify the X and Y resolution, in DPI, of the output image. The default is 150 DPI. So, if you have a smaller PDF file, or a PDF file where the OCR does not perform well on the standard output image, it is worth trying to adjust the DPI to something higher like this:
cd c:\poppler-0.68.0\bin && pdftoppm -singlefile -r 300 -png "[*DOWNLOADS_DIRECTORY]Image_PDF.pdf" "[*DOWNLOADS_DIRECTORY]Image_PDF"
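To get a feeling for what the DPI setting means for the output image, here is a small worked calculation in Python. PDF page sizes are measured in points, with 72 points per inch, so the pixel size is roughly the size in inches multiplied by the DPI. This is only an approximation for illustration; the exact rounding used by "pdftoppm.exe" may differ by a pixel.

```python
def rendered_size_in_pixels(width_points, height_points, dpi=150):
    """Approximate pixel size of a page rendered at the given DPI."""
    points_per_inch = 72  # PDF page sizes are measured in points
    width = round(width_points / points_per_inch * dpi)
    height = round(height_points / points_per_inch * dpi)
    return width, height


# A US Letter page is 612 x 792 points (8.5 x 11 inches).
print(rendered_size_in_pixels(612, 792))       # (1275, 1650) at 150 DPI
print(rendered_size_in_pixels(612, 792, 300))  # (2550, 3300) at 300 DPI
```

Doubling the DPI doubles both the width and the height, so the image has four times as many pixels. That gives the OCR more detail to work with, at the cost of larger files and slower processing.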
Now, after converting the PDF file to an image, let us take a quick look at how we could OCR the image using the built-in OCR action in Foxtrot. Start by using the Open Image action to open the output image.
After running the action, Foxtrot will have opened the image file in the image editor.
You can now use the OCR action to OCR the whole image or portions of the image in order to extract text.