OCR: How to Convert PDF and Image Files to Text

As the translation industry continues to grow along with global trade, useful tools have come along that can make our work a lot easier, such as translation memory (TM) / CAT tools, or OCR, which stands for Optical Character Recognition.

OCR_optical-character-recognition

The characters are the letters in words, so OCR is basically software that recognises letters visually (optically) in PDF or image files (with extensions such as .tiff or .jpg) and converts those files into text documents, such as Microsoft Word files, the format predominantly used by customers in the industry.

OCR-process-1

Sometimes a customer will create their documents in Word and send that to the translator to translate, while many times they will send the translator a PDF file.

If they send a Word file, it is easy to write the translation over the text as per these instructions. If not, then one way is to create a Word document from scratch and format as per these instructions. But since creating the formatting can be rather time consuming, and because time is money, using effective OCR software is definitely the way to go.

If the customer sends an image file, such as a photo of some documents, FineReader is a good, reliable program for that.

If your customer faxes you the document to translate and you receive the fax through a modem connected to your computer (often cheaper than a fax machine), there is no need to print it out (a waste of paper): the incoming fax has already been saved as an image file on your computer. It may take some tinkering and research, but you should be able to find this image (such as a multi-page .tiff file) and convert it to a text document through OCR, which will save you tons of time in manual formatting.

ocr_scanners

The same applies to printed documents. One option is to take a picture of each page and feed that to the software to convert for you (assuming it is worthwhile, such as with complex formatting or a lot of figures and names that do not get translated). Having received a lot of such work in the past, though, I figured out how to plug my physical fax machine directly into my computer modem and use the fax machine as a scanner to quickly convert all the pages into a multi-page image file (which can then be OCRed). As they say, there are many ways to skin a cat, so be creative in your problem solving, meditate on solutions, and I am sure you will find lots of ways to save time. The more time you save, the more money you can make, or the more time you can spend doing what you enjoy.

ABBYY PDF Transformer

OCR-PDFs-with-Abby-PDF-Transformer

I started off with version 2.0, which I was quite happy with, but after I had moved to a new computer a few times, it eventually would not let me install the software with my existing licence. I purchased and downloaded their latest software, ABBYY PDF Transformer+, but I was highly dissatisfied with the new setup. I wrote to them asking whether they could transfer my licence back to 2.0, but because they did not respond, I went ahead and downloaded 3.0 from The Pirate Bay and use that now. These instructions are for that version.

The trick is not to let the software run in automatic mode but to go manual and designate certain areas as either text or table (or picture) and make other adjustments.

I usually have my PDFs open automatically in Nitro Reader (also free), because I find it faster and less problematic than Adobe Reader. When you right-click on a file, you will see the Open With option in the popup menu, as shown below. This lists the programs installed on your system that can open the file you clicked on. You can select Choose Default Program if you want to add something to this menu or change the default setting. PDFill PDF Editor is good software if you need to make changes to a PDF file, such as filling in a form and adding a picture of your signature, instead of printing it and sending it by post (as with confidentiality agreements with the customer).

OCR-convert-image-files-to-text-image003

Since I don’t use PDF Transformer that often, I launch it through Open With whenever I need it.
Below is a screenshot of how I set up a sample page (you can click to enlarge).

OCR-convert-image-files-to-text-image001

In the middle along the top, to the left of the hand icon, you will find three icons with which you can designate a section as either Text, Table or Picture.

Areas 1, 2 and 5, in green, I chose for text, while areas 3 and 4, in blue, I chose for table. Choosing 3 and 4 for text instead would not yield the nice boxes.

In the left panel, under PDF document languages, I always keep the three languages shown (Czech, Slovak and English), because those are the only languages I translate from and some of my documents contain English/bilingual text, like this one.

Under that, I usually use Original Layout instead of Text Flow. This, though, may result in some paragraphs being broken up with paragraph marks (hard returns), as in the example below:

OCR-convert-image-files-to-text-image005

As suggested in my instructions on how to Format in Word, you should have yours set to always show the Backwards P / Paragraph Mark.

Such cases are easily remedied by dragging the mouse across the paragraph to select all the paragraph marks you want to get rid of, like so:

OCR-convert-image-files-to-text-image007

Then run a macro to replace the paragraph marks (^p) with spaces, and then replace double spaces with single spaces a few times, as a precaution (see the more detailed search and replace instructions). This pieces the paragraph together. If it ends up stretched across the entire page width but you would rather keep it narrow, as above, simply drag the right indent to the left (figure X in the Ruler Bar instructions). When you deliver a project, customers often want your translation to look just like the source file, and leaving in paragraphs broken up in this way is quite unprofessional.
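For anyone who prefers to clean up the exported text outside Word, the same replacement can be sketched in a few lines of Python. This is only an illustration of the idea, not the macro itself: the function name is made up for this example, and it assumes the paragraph has been copied out as plain text with line breaks standing in for the paragraph marks.

```python
import re

def join_broken_paragraph(text):
    """Replace paragraph marks (line breaks) with spaces, then collapse extra spaces."""
    # Normalise Windows line endings, then turn every break into a space
    joined = text.replace("\r\n", "\n").replace("\n", " ")
    # In Word you would repeat "two spaces -> one space" a few times;
    # a regular expression collapses any run of spaces in a single pass
    collapsed = re.sub(r" {2,}", " ", joined)
    return collapsed.strip()

# Hypothetical sample of a paragraph broken up by OCR
broken = "This agreement is made\nbetween the parties  \nlisted below."
print(join_broken_paragraph(broken))
```

The regular expression does the job of the repeated double-space replacement, so there is no need to run it "a few times" as a precaution.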

With Original Layout, most of the time the paragraphs are not broken up this way. If you plan to use the exported Word file in translation memory software, it is especially important to go through the document first and look for such broken-up paragraphs; otherwise the software will treat each fragment as an individual sentence, which reduces the usefulness of the translation memory.
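Looking for broken-up paragraphs can be partly automated. The following is a rough Python sketch under stated assumptions: the exported text is available as a list of paragraph strings, and a paragraph is suspect when it lacks closing punctuation and the next one starts with a lowercase letter. Both the function name and the heuristic are illustrative and will miss some cases, so treat the result as a list of places to check by eye.

```python
# Characters that normally close a sentence or a heading-like line
SENTENCE_END = (".", "!", "?", ":", ";")

def find_suspect_breaks(paragraphs):
    """Return the indexes of paragraphs that probably continue in the next one."""
    suspects = []
    for i in range(len(paragraphs) - 1):
        current = paragraphs[i].rstrip()
        following = paragraphs[i + 1].lstrip()
        if not current or not following:
            continue  # skip empty paragraphs
        # Heuristic: no closing punctuation, and the next paragraph starts lowercase
        if not current.endswith(SENTENCE_END) and following[0].islower():
            suspects.append(i)
    return suspects

# Hypothetical sample: the first sentence was split in two by the OCR export
paragraphs = [
    "The supplier shall deliver the goods",
    "within thirty days of the order.",
    "Payment terms",
    "Invoices are due within fourteen days.",
]
print(find_suspect_breaks(paragraphs))  # [0]
```

A heading such as "Payment terms" is not flagged, because the paragraph after it starts with a capital letter.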

If your customer wants to keep pictures, logos, graphs and so forth in the document, simply check the Keep Pictures box and make sure to select the pictures with the picture icon, as you did with the text and table sections. Otherwise the picture will be interpreted as text and it won’t look pretty.

If you want to remove a part of a section, click on it once so that the + and – icons appear, as such:

OCR-convert-image-files-to-text-image013

Click on the – (minus) icon in the top right and then drag across the area you want to remove, such as:

OCR-convert-image-files-to-text-image015

This may seem like an unimportant extra step, but skipping it may cost you more time later, when you have to delete the software's interpretation of the signature from the exported Word file.

Similarly, for tables, click inside the selection for six icons to appear across the top right of the table selection, as shown below. Hold the mouse still over each of the icons to see what they do. In this case, the mouse was held over the fourth icon for Show Table Structure. Clicking on this icon will draw blue lines across where the software presumes the table borders (column and row separators) should be. If you are not satisfied with the results, you can use any of the other icons to:

  • add vertical separator
  • add horizontal separator
  • delete separator
  • split cells, or
  • merge cells

OCR-convert-image-files-to-text-image013b

Of course, any of these operations can also be performed in Word, but sometimes the vertical separators are not aligned exactly and it becomes difficult to fix them in Word later, so experimenting with the various options can lead to better and more efficient results.

Notice that the three far-right columns were not selected for the creation of this table, because I find that the signatures often muddle things up and it is easier, later in the exported Word file, to select from the menu

Table > Insert > Columns to the Right

and fill the columns in later. Or, if there is a lot of text in the header, select the entire table, but consider copying the header text, deleting the three columns, adding them back in manually, and then pasting the header text back in.

As with anything on the computer, always reserve some free space in your mind to search for alternative and better ways to do things. Do not settle into an easy routine and simply repeat yourself without looking for improvements. Remember, if a better way saves you x seconds every time you perform the same operation, those seconds add up and represent incrementally higher income in the future. Every little saving in time counts, with every way you learn to work faster.

OCR-convert-image-files-to-text-image015b

On the far right of the PDF Transformer window you will see images of the various pages, such as the image to the left.

If you right-click on one of the pages, you will see the popup menu as shown, which is useful if you only want to convert some of the pages to Word. If you want to convert several pages, select them by holding down either the Shift or Ctrl key while left-clicking on them. Otherwise just press the large Convert icon in the top left-hand corner of the software to convert all the pages.

Under this icon, note that you can also choose a different format to convert the PDF file into, such as:

OCR-convert-image-files-to-text-image017

Now, for a second sample page:

OCR-convert-image-files-to-text-image019

Note that I have only selected the first two columns of the table, because it would just be a mess to deal with the converted signatures. Once exported into Word, it is an easy matter to add a column to the right and type (paste) in “[signature]”, or whatever the customer’s instructions are regarding this.

Now, in Area 3 of our first sample page above, which we chose to convert into a table, it turned out that it converted this section:

OCR-convert-image-files-to-text-shot

to look like this in Word, meaning that the text in the blocks was broken up into many rows:

OCR-convert-image-files-to-text-image021

Fortunately, that is easily resolved in Word by simply selecting the cells you want to merge and then pressing the Merge Table Cells icon, which you should have permanently placed on your main toolbar.

It would look more professional to merge the cells so that the final output more closely reflects the source file. It can also happen that a customer does not want you to use OCR software at all, and not merging these cells would raise suspicion that you are indeed using it. The reason some customers are against the use of OCR software is that, if not used properly, it can lead to formatting migraines when making changes later. But if you use it properly, there shouldn’t be a problem.

Once you have preliminarily prepared the document, you can commence translation by writing over top of the source text, applying Search and Replace (such as the title MUDr. becoming Dr.) or Autotext (such as the expansion and translation of abbreviations for institutions) when possible.
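The same kind of batch replacement can be sketched in Python if you ever work on plain text outside Word. This is a minimal illustration, not a substitute for Word's Search and Replace or Autotext: only the MUDr. to Dr. pair comes from the paragraph above, while the second entry, the table name and the function name are hypothetical.

```python
import re

# Replacement table; only "MUDr." -> "Dr." comes from the article,
# the abbreviation expansion is a hypothetical example
REPLACEMENTS = {
    "MUDr.": "Dr.",
    "FN": "University Hospital",
}

def apply_replacements(text, table):
    """Replace every key of the table that appears as a whole token in the text."""
    # Longest keys first, so a shorter key never pre-empts a longer one
    keys = sorted(table, key=len, reverse=True)
    # The lookarounds stop "FN" from matching inside a longer word
    pattern = re.compile(
        "|".join(r"(?<!\w)" + re.escape(key) + r"(?!\w)" for key in keys)
    )
    return pattern.sub(lambda match: table[match.group(0)], text)

print(apply_replacements("MUDr. Jan Novák, FN Brno", REPLACEMENTS))
```

Keeping the pairs in one table means a list built up on one project can be reused on the next similar document.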

Always think and be on the lookout for ways to save yourself time and get your job done faster. Perhaps you do a lot of similar text and it is worthwhile to pump the Word files through translation memory software.

For both of the sample pages above, I will show you one nifty trick. A lot of the material is bilingual, with the source text in normal font, followed by the English translation in italics. Some parts of the document are source only, which I am meant to translate, but the customer wants the end result to show only English and no source text. In the expanded Search and Replace window, click on the Format icon at the bottom, and then Font from the list to get this window:

OCR-convert-image-files-to-text-image023

Select Latin text > Font style > Not Italic, as shown above. That would give you the resulting window:

OCR-convert-image-files-to-text-image025

This essentially instructs Word to replace all text that is not italic (the source text in our sample project) with nothing – meaning it gets erased. You can either press the Find Next icon and go through the entire document, pressing Replace for every case that applies, or select entire sections and press Replace All.
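To make the logic of this replacement concrete, here is a small Python model of it. It does not touch Word files: it assumes, purely for illustration, that a document is a list of text runs, each flagged as italic or not, and the class name, function name and sample text are all made up for this sketch.

```python
from dataclasses import dataclass

@dataclass
class Run:
    """A stretch of text sharing one format (hypothetical model of a document)."""
    text: str
    italic: bool

def keep_italic_only(runs):
    """Drop every run that is not italic, like Find: Not Italic / Replace with: nothing."""
    return [run for run in runs if run.italic]

# Hypothetical bilingual line: source in normal font, English in italics
runs = [
    Run("Rodný list ", italic=False),
    Run("Birth certificate", italic=True),
]
print("".join(run.text for run in keep_italic_only(runs)))  # Birth certificate
```

The model also shows the risk: anything not yet translated is in normal font and disappears as well, which is why stepping through with Find Next or limiting Replace All to selected sections matters.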

This is just to show you some examples of how you can use your creativity to greatly speed up your work. Where most translators might take the traditional route and force the agencies to pay a higher price for the time required, I get a lot of these documents and, after coming up with all sorts of tricks and shortcuts, I find myself earning a hundred bucks an hour for rather simple work!

madmin
Head Honcho at KENAX
After translating and managing translation projects for more than 20 years, I'm happy to teach others the ropes and move on to other interests. My greatest perk from this profession is that it has given me the freedom to work when and where I want, and eventually to loosen the straps and travel freely around the world.