
How to use PDF Multitool OCR Analyzer to create and test OCR image-to-text configurations for the cloud and on-prem versions of PDF.co

If you are working with scanned PDFs and the extracted text (text, csv, json, xml) is incomplete or inaccurate, consider using our desktop app, ByteScout PDF Multitool (compatible with Windows 7/10/11 and higher). This app emulates most of the major functions of the PDF.co API and, more importantly, allows you to create and test configurations for PDF extraction and image-to-text functions locally.

ByteScout PDF Multitool includes the OCR Analyzer tool, which helps you quickly find the best combination of OCR filters and parameters to enhance the quality of PDF text extraction results.

PDF Multitool and its OCR Analyzer generate JSON profiles that can be used with both the cloud and on-premises versions of PDF.co. Simply pass this JSON config as the value of the profiles parameter of the PDF To Text/CSV/XML/JSON API methods.

Step-by-step guide on how to start using the PDF Multitool free app:

  1. First, download the free version of PDF Multitool from here.
  2. Next, load your PDF/JPG/PNG document into the multitool.
  3. Then, in the left navigation menu, select OCR Analyzer.
  4. Choose the OCR Language and OCR Resolution and click Go.
  5. Click the Copy To button and select Send to CSV... or a similar option to copy this configuration into the appropriate extractor.
  6. This opens the PDF Extractor config for PDF to CSV/Text/XML/JSON accordingly.
  7. Try the new configuration by clicking Preview.
  8. If you’re satisfied with the outcome, go to the Profile for PDF.co and API Server tab.
  9. Click on Copy as payload for PDF.co or API Server.
  10. Finally, paste this as the value of the profiles parameter in your script/code or in the Zapier/Make plugin accordingly.
  11. If you are not satisfied with the results, try adjusting the parameters and filters on the All Options tab (see Tips and Tricks below).
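
As a rough illustration of step 10, the sketch below shows one way the copied profile could be placed into a request body from Python. It is a minimal sketch under stated assumptions: the helper build_payload, the "url" field name and the example file address are hypothetical, and it assumes the profiles value is sent as a JSON string. Only the profiles parameter and the rotationAngle key come from this article. Sending the request is left out, since the endpoint and authentication depend on your setup.

```python
import json


def build_payload(file_url: str, profile: dict) -> dict:
    """Build a request body that carries an OCR profile.

    Assumptions (not confirmed by this article): the source file is passed
    in a "url" field, and the profile is serialized to a JSON string
    before being stored under "profiles".
    """
    return {
        "url": file_url,                   # hypothetical field name
        "profiles": json.dumps(profile),   # profile copied from PDF Multitool
    }


# Example profile; replace it with the payload copied from PDF Multitool.
payload = build_payload("https://example.com/scan.pdf", {"rotationAngle": 1})
print(json.dumps(payload, indent=2))
```

In Zapier or Make, the same profile text would simply be pasted into the profiles field of the plugin instead.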

For a demo on how to use this tool, watch this video: https://youtu.be/NSyyohNNe6E

Tips and Tricks On Finding Best OCR Settings Using PDF Multitool

  1. For fuzzy or blurred scans: try increasing OCR Resolution from the default 300 dpi (dots per inch) to 600, or even 800 or 1200 dpi, and run the extraction again. Note: higher resolution means more time to process the document.
  2. For dark scans: try adding the Gamma Correction filter with the default value of 1.4 or 1.5 and run the extraction again. Note: this filter automatically makes dark images lighter.
  3. To get text printed near borders or lines, add filters that remove lines before extraction. For tables with borders or lines, if the layout is reproduced incorrectly or some words or letters are lost, try adding the Horizontal Line Removal and Vertical Line Removal filters in the All Options - OCRImageProcessingFilters section. Make sure to put these filters first in the list (use the Up and Down buttons to move filters within the list).
  4. For non-English documents, set the proper recognition language: set OCR Language to the language used in the document. The default is eng (English). If you have a document in German, set it to deu (German). If you have multiple languages in the same document, select both languages (for example, eng and deu).
  5. If you don’t need the whole page, try limiting extraction to a specific area of the page. This improves both the quality of text extraction and the processing speed. To set the extraction area, click the Select tool on the main toolbar in PDF Multitool and use the mouse to select the area with the source text. Then run the extraction and preview again.
  6. If the extracted text is missing some important snippets, try setting an extraction area around them. Limiting extraction to a specific area of the page may dramatically increase the quality of text recognition.
  7. If extracting from the whole page produces broken results: try running several extractions from the same page, each limited to a selected area. For example, extract from the top area, then from the middle area, then from the bottom area, and combine the results into one file. This helps when the page mixes different layouts, fonts or font sizes.
  8. Setting the extraction area to exclude the header, footer and/or side notes in the document may greatly simplify text analysis.
  9. Removing Background Noise: Lowering Gamma (with values below 1.4) and raising Contrast can effectively remove background noise from images.
  10. Extracting text from color photos or scans: the Gamma filter improves extraction quality on color photos. Applying the Grayscale filter before Gamma may yield a better gamma effect. Grayscale alone is generally less useful.
  11. Removing Parasite Dots and Artifacts producing small garbled text snippets: Combining the Median filter with high-resolution rendering (600+ DPI) can help remove parasite dots from scanned images or fax rasterization artifacts. However, this approach may also remove punctuation symbols.
  12. Fixing Etched/Distorted Letters: The Dilate filter can be used to repair etched or distorted letters in images.
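
To give a feel for why higher OCR Resolution takes longer (tip 1), the short sketch below computes the rendered pixel dimensions of a page at several dpi values. It assumes a US Letter page (8.5 x 11 inches), which is only an example size, and the pixel_size helper is hypothetical. Pixel count is a rough proxy for workload; actual processing time depends on the document and the filters used.

```python
def pixel_size(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Return the rendered (width, height) in pixels for a page size in inches."""
    return round(width_in * dpi), round(height_in * dpi)


# Assumed page size: US Letter, 8.5 x 11 inches.
base_w, base_h = pixel_size(8.5, 11, 300)

for dpi in (300, 600, 1200):
    w, h = pixel_size(8.5, 11, dpi)
    # Doubling the dpi doubles each side, so the pixel count grows fourfold.
    ratio = (w * h) / (base_w * base_h)
    print(f"{dpi} dpi: {w} x {h} px, {ratio:.0f}x the pixels of 300 dpi")
```

Doubling the resolution quadruples the number of pixels to analyze, so it is usually worth trying 600 dpi before going to 1200 dpi.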

List of OCR Image Preprocessing filters supported by PDF Multitool and the PDF.co API:

  • Contrast - Adds the Contrast image filter, which enhances image quality for OCR by improving contrast. This filter is particularly helpful for images where the text color is gray or similar to the background color. Lowering gamma and raising contrast can effectively remove background noise from images.

  • Deskew - Applies the Deskew image filter with a default angle threshold of 0.4 degrees (minimal admissible skew angle). This filter is useful for fixing slight rotation of scanned images. For scans rotated by 90, 180 or 270 degrees, use the RotationAngle parameter in profiles instead, for example { 'rotationAngle': 1 }. The available RotationAngle values are:

    • 0 no rotation (default)
    • 1 90 degrees
    • 2 180 degrees
    • 3 270 degrees
  • Dilate - Incorporates the “Dilate” image filter, which improves image quality for OCR by thickening the letter strokes. The Dilate filter can be used to repair etched or distorted letters in images.

  • Fit - Adds the Fit image filter with a specified size limit. The image is proportionally resized when its width or height exceeds the limit, which improves text extraction performance from large images.

  • Gamma - Implements the Gamma Correction filter with a default value of 1.4. This filter enhances image quality for OCR by automatically lightening dark images.

  • Grayscale - Applies the “Grayscale” image filter. Applying the Grayscale filter before Gamma may yield better gamma effects on color photos, although Grayscale alone is less useful.

  • HorizontalLinesRemover - Integrates the “Horizontal Lines Remover” image filter. This filter enhances OCR text recognition quality inside borders and near borders by removing horizontal lines before text recognition. IMPORTANT: this filter is added by default in PDF.co cloud and on-prem. If you don’t need it, set profiles to { 'OCRImagePreprocessingFilters.Clear()': [] }

  • VerticalLinesRemover - Implements the “Vertical Lines Remover” image filter. This filter enhances OCR text recognition quality inside borders and near borders by removing vertical lines before text recognition. IMPORTANT: this filter is added by default in PDF.co cloud and on-prem. If you don’t need it, set profiles to { 'OCRImagePreprocessingFilters.Clear()': [] }

  • Invert - Adds the Invert (negative) image filter. Sometimes scanned documents are inverted (white text on a black background). This filter can be used to fix this issue by inverting all colors before extracting text.

  • Median - Incorporates the “Median” image filter. Combining the Median filter with high-resolution rendering (600+ DPI) can help remove parasite dots from scanned images or fax rasterization artifacts. However, this approach may also remove punctuation symbols.

  • Scale - Adds the Scale image filter with a specified scale factor. For example, 2.0 doubles the size of the input image, improving the recognition quality of small letters.
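
The two profile fragments mentioned in the list above (the rotationAngle values and the OCRImagePreprocessingFilters.Clear() setting) can be generated from a script. The sketch below is one possible way to do it. The helper names rotation_profile and clear_filters_profile are hypothetical, while the keys and values are taken from the list above. Note that json.dumps emits double-quoted JSON, whereas the examples in this article use single quotes, and whether the two fragments can be merged into a single profile is not covered here.

```python
import json

# Mapping from scan rotation in degrees to the rotationAngle code
# listed under the Deskew filter above.
ROTATION_CODES = {0: 0, 90: 1, 180: 2, 270: 3}


def rotation_profile(degrees: int) -> dict:
    """Return a profile fragment for a scan rotated by 0, 90, 180 or 270 degrees."""
    if degrees not in ROTATION_CODES:
        raise ValueError("rotation must be one of 0, 90, 180, 270 degrees")
    return {"rotationAngle": ROTATION_CODES[degrees]}


def clear_filters_profile() -> dict:
    """Return the profile fragment that clears the default preprocessing filters."""
    return {"OCRImagePreprocessingFilters.Clear()": []}


print(json.dumps(rotation_profile(90)))
print(json.dumps(clear_filters_profile()))
```

For small skew angles, the Deskew filter remains the appropriate choice; the mapping above only covers quarter-turn rotations.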