PDF to JSON/PDF to Text - fixing malformed PDF or incorrectly embedded font

For Zapier, Integromat, and other plugins, insert custom profiles into the profiles field. For API calls, pass the value as a string in the profiles parameter.

Sometimes the PDF file is malformed: the embedded font used to draw characters has a modified character table, so the correct symbol codes cannot be recovered for any relevant charset. To check for this, open the document in Adobe Reader and copy and paste text from it. If all characters come out garbled there too, this might be some sort of extraction protection.

If we need to get the text from this kind of file at any cost, we can try a special mode that renders each document page and passes it to Optical Character Recognition (OCR). This “repairs” the text. In the Web API, you can enable this mode through the profiles parameter, which lets you change advanced options of the underlying PDF Extractor engine.

{"OCRMode": "TextFromImagesAndVectorsAndRepairedFonts" }

or

{"OCRMode": 3}
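Because API calls expect the profiles value as a string rather than a nested object, the profile has to be serialized before it goes into the request body. The following is a minimal sketch of that step only. The helper name build_request_body and the "url" field are assumptions made for illustration; check the API reference for the exact request fields your endpoint expects.

```python
import json


def build_request_body(pdf_url, ocr_mode="TextFromImagesAndVectorsAndRepairedFonts"):
    """Build a request body whose `profiles` value is a JSON string.

    `build_request_body` is a hypothetical helper, and the "url" field name
    is an assumption; only the `profiles` parameter and the `OCRMode` option
    come from this article.
    """
    profiles = {"OCRMode": ocr_mode}
    return {
        "url": pdf_url,                    # assumed name of the source-file field
        "profiles": json.dumps(profiles),  # serialized to a string, not nested
    }


body = build_request_body("https://example.com/sample.pdf")
print(json.dumps(body, indent=2))
```

Serializing the profile with json.dumps also avoids hand-written quoting mistakes, such as mixing single and double quotes inside the profile string.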

If you are running pdf/convert/to/json, you can check the output JSON for the ocrWasPerformed property to see whether OCR was performed on a given page. If this property is set to true in the JSON response, the engine detected a malformed font on that page and ran the OCR engine to extract the correct text from it.
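The check for ocrWasPerformed can be sketched as a small recursive search over the parsed output. This article does not describe the exact layout of the output JSON, so the sketch assumes nothing about where the property sits and simply walks every nested object and list. The function name find_ocr_flags and the sample structure below are made up for illustration.

```python
from typing import Any, Iterator, Tuple


def find_ocr_flags(node: Any, path: str = "$") -> Iterator[Tuple[str, bool]]:
    """Yield (location, flag) for every `ocrWasPerformed` key found in `node`.

    The walk is structure-agnostic on purpose: the real nesting of the
    output JSON is not assumed here.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key == "ocrWasPerformed":
                yield child, value is True  # only a JSON `true` counts as set
            else:
                yield from find_ocr_flags(value, child)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_ocr_flags(item, f"{path}[{i}]")


# Hypothetical sample, NOT the real output schema.
sample = {
    "document": {
        "page": [
            {"index": 0, "ocrWasPerformed": True},
            {"index": 1, "ocrWasPerformed": False},
        ]
    }
}

for location, performed in find_ocr_flags(sample):
    print(location, "OCR performed" if performed else "no OCR")
```

In practice, the sample dictionary would be replaced by the result of json.loads on the downloaded output file.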

Applies To:

  • /pdf/convert/to/csv
  • /pdf/convert/to/xml
  • /pdf/convert/to/json
  • /pdf/convert/to/xls
  • /pdf/convert/to/xlsx