Link Search Menu Expand Document

Using “profiles” parameter to customize API calls

Barcode Generator: change redundancy level

To control the QR Code error correction level, please add the following string in the Custom Profiles parameter for PDF.co:

{ "profiles": [ { "profile1": { "Options.QRErrorCorrectionLevel": "Low" } } ] }

Applies To

  • /barcode/generate

Valid error correction levels:

LevelDescription
Low[default] Lowest error correction level. (Approx. 7% of codewords can be restored).
MediumMedium error correction level. (Approx. 15% of codewords can be restored).
QuarterQuarter error correction level (Approx. 25% of codewords can be restored).
HighHighest error correction level (Approx. 30% of codewords can be restored).

By default all these 2d barcodes (pdf417, datamatrix, qr code) have built-in redundancy, so you can erase like 10% of the barcode and it will be decodable still. 1d barcodes (ean13) can not do this, they have checksum protection to verify if value is decoded OK or not, but not protection against damage

Line Grouping Mode not working. PDF to XML/JSON/CSV

There is issues with line grouping with double-width spaces, as they’re treated by extractor as separators. To resolve this issue we should increase the internal criteria controlling this aspect of extraction.

We can simply achieve this by adding following Profiles parameter for PDF.co api call.

{ 'profiles': [ { 'profile1': { 'DetectNewColumnBySpacesRatio': '2.0' } } ] }

Applies To

  • /pdf/convert/to/csv
  • /pdf/convert/to/xml
  • /pdf/convert/to/json
  • /pdf/convert/to/xls
  • /pdf/convert/to/xlsx

OCR - Perform explicit page rotation and converting PDF to XML/JSON/CSV

By default PDF.co OCR is smart to detect page roation and extract text. However there are cases when PDF is drawn unusually. I.e. Instead of marking horizontal pages with the rotation flag, all the text in the document is drawn with rotated text. This kind of rotation can not be detected by OCR automatically. In these kind of sitautions, we can explicitly perform rotation by following profiles values.

{ 'profiles': [ { 'profile1': { 'OCRDetectPageRotation': false, 'RotationAngle': 'Deg90' } } ] }

valid values for the RotationAngle param are Deg90, Deg180, Deg270. Note: 'OCRDetectPageRotation': false is required because this param overrides the RotationAngle if enabled.

Applies To

  • /pdf/convert/to/csv
  • /pdf/convert/to/xml
  • /pdf/convert/to/json
  • /pdf/convert/to/xls
  • /pdf/convert/to/xlsx

On the other side if we want to have these rotation fixed prior to sending to PDF.co api, we can also opt some tool or library. For example qpdf command-line utility can rotate files fast and easy:

qpdf.exe --rotate=+90:1-z input.pdf output.pdf

Issue with extracting XML from Scanned PDF

There might be problem with extracting XML from Scanned PDF due to special cases when file contains both scanned images and long text objects (“Generated by Foxit PDF Creator ….”). The Optical Character Recognition (OCR) runs automatically only when a document contains no text. We can force the OCR for such documents. It can be done with a custom profile by using DetectNewColumnBySpacesRatio option.

Following profile will force OCR.

{ 'profiles': [ { 'profile1': { 'OCRMode': 'TextFromImagesOnly' } } ] }

We can also combile profiles like below.

private const string Profiles = "{ 'profiles': [ { 'profile1': { 'DetectNewColumnBySpacesRatio': '2.0' } }, { 'profile2': { 'OCRMode': 'TextFromImagesOnly' } } ] }";

Applies To

  • /pdf/convert/to/csv
  • /pdf/convert/to/xml
  • /pdf/convert/to/json
  • /pdf/convert/to/xls
  • /pdf/convert/to/xlsx

PDF file not converting properly to XML or other format due to malformed font

There are instances when PDF file contains malformed fonts with customized character map making the backward conversion to text impossible. In that situations ensure this by opening the document in Adobe Reader, select text with mouse and copy-paste to any text editor. The same garbled text is copied.

If you need to extract the text from this file at any cost, you can try the special mode that recognizes text using OCR:

string profile = "{ 'profiles': [ { 'profile1': { 'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts' } } ] }";

Applies To

  • /pdf/convert/to/csv
  • /pdf/convert/to/xml
  • /pdf/convert/to/json
  • /pdf/convert/to/xls
  • /pdf/convert/to/xlsx

Note, this mode works much slower so you should perform requests asynchronously.

PDF to XLS converstion, Issue with incorrect split of cells

Use following profiles with lineGrouping property set to 1.

lineGrouping: 1
{ 'profiles': [ { 'profile1': { 'AutoDetectNumbers': false } } ] }

Applies To

  • /pdf/convert/to/csv
  • /pdf/convert/to/xls
  • /pdf/convert/to/xlsx

PDF to JSON/TEXT conversion failed due to malformed PDF or embedded font

Some times PDF file used is malformed. The embedded font used to draw characters has modified character table that doesn’t allow to get correct symbol codes of any relevant charset. In this case we can ensure that if document opens in Adobe Reader and copy-paste the text from it. If all characters are garbled too, This might be some sort of extraction protection.

If we need to get the text from this kind of file at any cost, we can try a special mode that renders document page and pass it to Optical Character Recognition (OCR). This allows to “repair” the text. In Web API you can enable this mode using “profiles” URL parameter allowing to change advanced options of underlying PDF Extractor SDK.

{ "profiles": [ { "profile1": { "OCRMode": "TextFromImagesAndVectorsAndRepairedFonts" } } ] }

or

{ 'profiles': [ { 'profile1': { 'OCRMode': 3 } } ] }

Applies To

  • /pdf/convert/to/csv
  • /pdf/convert/to/xml
  • /pdf/convert/to/json
  • /pdf/convert/to/xls
  • /pdf/convert/to/xlsx

Extract Text from PDF including from vectors Object

Sample Program: https://github.com/bytescout/ByteScout-SDK-SourceCode/tree/master/PDF.co%20Web%20API/PDF%20To%20XML%20API/C%23/Advanced%20Conversion%20Options

Following is the profiles setting used in such cases.

{
 "profiles": [
    {
      "profile1": {
        "SaveVectors": "True"
      }
    }
  ]
}

Applies To

  • /pdf/convert/to/csv
  • /pdf/convert/to/xml
  • /pdf/convert/to/json
  • /pdf/convert/to/xls
  • /pdf/convert/to/xlsx

Rotating PDF prior to data extraction

Normally OCR detects PDF rotation and extracts text properly. But in some cases PDF is constructed in such a way that page is not rotated instead font is drawn vertically, OCR does not detect page rotation automatically. In such scenarios we can use following profile setting.

"{ 'profiles': [ { 'profile1': { 'RotationAngle': 2 } } ] }"

Applies To

  • /pdf/convert/to/csv
  • /pdf/convert/to/xml
  • /pdf/convert/to/json
  • /pdf/convert/to/xls
  • /pdf/convert/to/xlsx

Valid RotationAngle values:

ValueDescription
0no rotation
190 degrees
2180 degrees
3270 degrees

Issue with data extraction due to huge number of vector objects

Some document has problem that it contains an enormous number of vector objects. Every line in it is drawn in very short separate segments. By default, PDF Extractor analyses each vector object whether it is a table border or a columns/row separator. To speed up the conversion of this document we should exclude vectors from the table structure analysis.

In Web API:

"{ 'profiles': [ { 'profile1': { 'ColumnDetectionMode': 'ContentGroups' } } ] }"

With Offline DLL

extractor.ColumnDetectionMode = ColumnDetectionMode.ContentGroups;

Applies To

  • /pdf/convert/to/csv
  • /pdf/convert/to/xml
  • /pdf/convert/to/json
  • /pdf/convert/to/xls
  • /pdf/convert/to/xlsx

PDF to XLS conversion issue - space is replaced with ! symbol in PDF

This can be due to document containing font with modified charset. The space character in it has the code corresponding to the exclamation mark. Adobe Reader extracts exactly the same text.

This can be fixed by a special filter that allows replacing a text before the table structure reconstruction.

{
   'profiles': [ 
        { 
            'profile1': { 
                'AddFilter()': [ '!', ' ', false, false]
            } 
        } 
    ] 
}

Applies To

  • /pdf/convert/to/xls
  • /pdf/convert/to/xlsx

PDF to XLS conversion issues

In case issue with numbers removed in resulted xls file, use following profiles.

"{ 'profiles': [ { 'profile1': { 'AutoDetectNumbers': false } } ] }"

In scenarios when issue with column detections can use following profile.

{ 'profiles': [ { 'profile1': { 'ColumnDetectionMode': 'Borders' } } ] }

In scenarios when there is no border between columns. Moreover there is a single whitespace between columns that looks as a single phrase even to human eyes; the only way to split object correctly is to specify column coordinate explicitly using CustomExtactionColumns property. The profile will look as follows:

{ 'profiles': [ { 'profile1': { 'CustomExtractionColumns': [ 0, 112.5, 172, 225, 431, 489, 543, 597, 651, 712 ] } } ] }

Applies To

  • /pdf/convert/to/xls
  • /pdf/convert/to/xlsx

To measure columns and get PDF coordinates you can use our “PDF Multitool” app installed with the offline SDK. Or download it separately from ByteScout website.

PDF to CSV - result has wrongly positioned data

This might be case when document contains a number of overlapping invisible text and vector objects that affect column detection.

'profiles': [
        {
            'profile1': {
                'ColumnDetectionMode': 'ContentGroups'
            }
        }
    ]

Applies To

  • /pdf/convert/to/csv
  • /pdf/convert/to/xls
  • /pdf/convert/to/xlsx

PDF to TEXT with scanned documents - Configuring corrections

We can configure manually corrections for known errors when extracting data from PDF like below.

{
    "profiles": [ 
        { 
            "options": { 
                "TrimSpaces": "False", 
                "PreserveFormattingOnTextExtraction": "True", 
                "Unwrap": "True"
            }            
        },
        {
            "correction1": { 
                "OCRCorrections.Add()": [ "Test", "XXXX", false]
            } 
        },
        {
            "correction2": { 
                "OCRCorrections.Add()": [ "OCR", "YYY", false]
            } 
        }
    ] 
}

We can also specify regex based corrections like following. It’s for cases when date like 11/03/2018 is resulted in 11V03V2018.

{
  "profiles": [ 
        { 
            "profile1": { 
                "OCRCorrections.Add()": [ "(\\d)V(\\d)", "$1\\/$2", true], 
            } 
        } 
    ] 
}

Applies To

  • /pdf/convert/to/csv
  • /pdf/convert/to/xml
  • /pdf/convert/to/json
  • /pdf/convert/to/xls
  • /pdf/convert/to/xlsx

PDF to TIFF rendering - always have black and white tiff image output

To get black-and-white TIFF images we need to set appropriate TIFF compression type. Following profiles value allows changing advanced conversion options in the underlying converter.

{
   "profiles": [
        {
            "profile1": {
                "TextSmoothingMode": "HighQuality", // Valid values: "HighSpeed", "HighQuality"
                "VectorSmoothingMode": "HighQuality", // Valid values: "HighSpeed", "HighQuality"
                "ImageInterpolationMode": "HighQuality", // Valid values: "HighSpeed", "HighQuality"
                "RenderTextObjects": true, // Valid values: true, false
                "RenderVectorObjects": true, // Valid values: true, false
                "RenderImageObjects": true, // Valid values: true, false
                "RenderCurveVectorObjects": true, // Valid values: true, false
                "JPEGQuality": 85, // from 0 (lowest) to 100 (highest)
                "TIFFCompression": "LZW", // Valid values: "None", "LZW", "CCITT3", "CCITT4", "RLE"
                "RotateFlipType": "RotateNoneFlipNone", // RotateFlipType enum values from here: https://docs.microsoft.com/en-us/dotnet/api/system.drawing.rotatefliptype?view=netframework-2.0
                "ImageBitsPerPixel": "BPP24", // Valid values: "BPP1", "BPP8", "BPP24", "BPP32"
                "OneBitConversionAlgorithm": "OtsuThreshold", // Valid values: "OtsuThreshold", "BayerOrderedDithering"
                "FontHintingMode": "Default", // Valid values: "Default", "Stronger"
                "NightMode": false // Valid values: true, false
            }
        }
    ]
}

Applies To

  • /pdf/convert/to/tiff

Read barcodes from negative and Rotated faxes

By default Barcode reader scan for horizontal barcodes. Use following profile to enable decoding of vertically oriented barcodes.

{ 'profiles': [ { 'profile1': { 'Orientation': 'HorizontalAndVertical' } } ] }

In some scenarios it is possible that when barcode rotated, it’s returning duplicate values due to same code is decoded with different values when rotated. The only way to get rid of this behaviour is set limit with MaxNumberOfBarcodesPerPage.

{ 'profiles': [ { 'profile1': { 'Orientation': 'HorizontalAndVertical', 'MaxNumberOfBarcodesPerPage': 1 } } ] }

Applies To

  • /barcode/read/from/url

Scanned PDF to CSV - English language scanned PDF’s output csv contains characters of other language

This is mainly due to corrupted font. Following profile setting is helpful.

{ 'profiles': [ { 'profile1': { 'OCRMode': 'TextFromRepairedFontsOnly' } } ] }

Applies To

  • /pdf/convert/to/csv
  • /pdf/convert/to/xml
  • /pdf/convert/to/json
  • /pdf/convert/to/xls
  • /pdf/convert/to/xlsx

PDF to Image - Render only images from PDF in output image

We can control how PDF is rendered and turn on/off text layer, vector drawings layer, images layer. To turn off text use following profile.

{ "profiles": [ { "profile1": { "RenderTextObjects": false } } ] }

Applies To

  • /pdf/convert/to/tiff
  • /pdf/convert/to/png
  • /pdf/convert/to/jpg
  • /url/convert/to/jpg
  • /url/convert/to/png

PDF to XML - save images inside xml as base64 encoded strings

When converting from PDF to XML if we want to have output xml to contain images as base64 encoded string, following profile is helpful.

{ 'profiles': [ { 'profile1': { 'SaveImages': 'Embed' } } ] }

Applies To

  • /pdf/convert/to/xml
  • /pdf/convert/to/json

Reading barcode - not decoding due to abnormal distances between bars

In some scenarios barcode has abnormal distances between some bars making it not decodable. “Heuristic Mode” decoding option can help here.

{ "profiles": [ { "profile1": { "HeuristicMode": true } } ] }

Applies To

  • /barcode/read/from/url

Barcode generation - specifying bar height in generated barcode

Use following profile to control height of bar in resulting barcode image.

{ "profiles": [ { "profile1": { "BarHeight": 100 } } ] } 

Applies To

  • /barcode/generate

PDF to JSON - Get more detailed font information

To get additional information in extracted/converted result for example Vector lines and more detailed font information, we can use following profile. It allows changing advanced options in the underlying PDF Extractor SDK converter passing them in JSON string.

{ "profiles": [ { "profile1": { "SaveVectors": true, "ConsiderFontSizes": true, "ConsiderFontNames": true, "ConsiderFontColors": true } } ] }

Applies To

  • /pdf/convert/to/xml
  • /pdf/convert/to/json

PDF to CSV/XLS conversion - not converting properly

In certain files, document requires some fine-tuning because it contains wide spaces that affect the column detection. Use following profile value with api call.

{ "profiles": [ { "profile1": { "ShrinkMultipleSpaces": true } } ] }

Applies To

  • /pdf/convert/to/csv
  • /pdf/convert/to/xls
  • /pdf/convert/to/xlsx

Barcode generation - Hide barcode caption value

To produce the barcode without the addition of the text as shown under the barcode.

{ 'profiles': [ { 'profile1': { 'DrawCaption': false } } ] }

Applies To

  • /barcode/generate

Barcode generation - QR redundacy setting

QR Code barcodes are generated with default settings which allows some % of the surface to be damaged but allow to be decodable still. To control the QR Code error correction level, please add the following string in the Custom Profiles parameter of PDF.co Barcode Generator plugin:

{ "profiles": [ { "profile1": { "Options.QRErrorCorrectionLevel": "Low" } } ] }

Applies To

  • /barcode/generate

Valid error correction levels:

ValueDescription
Low[default] Lowest error correction level. (Approx. 7% of codewords can be restored).
MediumMedium error correction level. (Approx. 15% of codewords can be restored).
QuarterQuarter error correction level (Approx. 25% of codewords can be restored).
HighHighest error correction level (Approx. 30% of codewords can be restored).

Barcode reading issue due to variying lighting condition or image size

Certain input files for barcode reading can have varying lighting condition or image size. So we need to bring them to approximately the same state.

{ 'profiles': [ { 'profile1': { 'ImagePreprocessingFilters.AddScale()': [1500, 'true', 'HighQualityBicubic'] , 'ImagePreprocessingFilters.AddGamma()': [1.4] } } ] }

Applies To

  • /barcode/read/from/url

AddScale filter will auto-scale input images to the same size (1500 pixels).

AddGamma filter will increase the color intensity for better recognition.

Please Note: This does not guarantee that all similar images will be decoded, but it should work with most of them.

Crop PDF files with PDF Filler

PDF Filler action can crop PDF files. PDF Filler is usually used to add text and images on top of existing PDF files but it can also adjust PDF files via special profiles param available in new version of the plugin.

{ "profiles": [ { "profile1": { "Pages[0].SetCropBox()": ["28", "28", "539", "284"] } } ] }

Applies To

  • /pdf/edit/add

The crop box is defined by a rectangle (x, y, width, height) in PDF Points (1 Point = 1/72 in.). Rectangle calculations:


A4 page size in Points: 595 x 842

Crop rectangle:
x = 0.39 * 72 = 28
y = 0.39 * 72 = 28
width = 595 - 0.39 * 72 - x = 539
height = 842 - 7.36 * 72 - y = 284

Or we can check coordinates from PDF file using this interactive editor that displays coordinates: https://app.pdf.co/pdfviewer

PDF to CSV - Determine if need to force OCR or not

To determine if OCR is required, the converter checks each PDF document page if it contains at least one raster image or vector drawing but very little text (less than 8 characters).

The OCR can be forced by changing an option of underlying PDF Extractor SDK using the profiles parameter.

{ 'profiles': [ { 'profile1': { 'OCRMode': 'TextFromImagesAndVectorsAndFonts' } } ] }

In the TextFromImagesAndVectorsAndFonts mode, the converter will perform OCR on image and vector objects and combine the result with the extraction of normal text objects.

Applies To

  • /pdf/convert/to/csv
  • /pdf/convert/to/xml
  • /pdf/convert/to/json
  • /pdf/convert/to/xls
  • /pdf/convert/to/xlsx

PDF to CSV/JSON - considering font colors column seperation using

{ 'profiles': [ { 'profile1': { 'LineGroupingMode': 'JoinOrphanedRows', 'ConsiderFontColors': 'true', 'DetectNewColumnBySpacesRatio': '1.1', 'AutoAlignColumnsToHeader': 'false', 'OCRImagePreprocessingFilters.AddGammaCorrection()': [ '1.4' ] } } ] }

AddGammaCorrection improves OCR, other params tunes the representation of extracted text objects.

ConsiderFontColors is useful to detect cell change by font-color

Applies To

  • /pdf/convert/to/csv
  • /pdf/convert/to/xml
  • /pdf/convert/to/json
  • /pdf/convert/to/xls
  • /pdf/convert/to/xlsx