Profiles - Collection of Sample Profiles

PDF Extractor SDK does not capture font styles in the PDF or convert them into JSON format, even though the following settings are enabled:

"profiles": "{'ExtractColumnByColumn': true}"

The PDF to CSV conversion process produces broken CSV files that are unusable for further processing or analysis:

"profiles": "{ 'OCRImagePreprocessingFilters.Clear()': [],'TableXMinIntersectionRequiredInPercents': 27,'ExtractShadowLikeText': false,'DetectNewColumnBySpacesRatio': 1.6,'ColumnDetectionMode': 'ContentGroups','OCRMode': 'Auto','OCRResolution': 600.0,'OCRImagePreprocessingFilters.AddHorizontalLinesRemover()': [],'OCRImagePreprocessingFilters.AddDilate()': [],'OCRImagePreprocessingFilters.AddContrast()': [-40] }"

The template crashes when a new PDF is converted:

"profiles" : "{ 'ExtractShadowLikeText': 'false','OCRMode':'Auto','CSVSeparatorSymbol':',','ColumnDetectionMode':'ContentGroupsAI','OCRDetectPageRotation': false }"

The PDF extraction process is resulting in missing and shadowed text:

"profiles": "{ 'ExtractShadowLikeText': false, 'ColumnDetectionMode': 'ContentGroupsAI', 'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts', 'OCRResolution': 600.0, 'OCRImagePreprocessingFilters.AddGammaCorrection()': [2.0], 'OCRImagePreprocessingFilters.AddContrast()': [-40] }"

Shadowed Text Issue:

"profiles": "{ 'ExtractShadowLikeText': false }"

PDF to JSON duplicate characters Issue:

"profiles": "{ 'ExtractShadowLikeText': false, 'ColumnDetectionMode': 'ContentGroupsAI', 'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts', 'OCRResolution': 600.0, 'OCRImagePreprocessingFilters.AddGammaCorrection()': [2.0], 'OCRImagePreprocessingFilters.AddContrast()': [-40] }"

Missing Texts in CSV Issue:

"profiles": "{ 'DetectNewColumnBySpacesRatio': 5.0, 'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts', 'OCRLanguage': 'nld', 'OCRImagePreprocessingFilters.AddVerticalLinesRemover()': [], 'OCRImagePreprocessingFilters.AddHorizontalLinesRemover()': [], 'OCRImagePreprocessingFilters.AddGammaCorrection()': [0.6], 'OCRImagePreprocessingFilters.AddContrast()': [20] }"

PDF to XLS conversion process is not detecting the columns properly and auto-aligning the columns to the header:

"profiles": "{ 'ColumnDetectionMode': 'ColumnDetectionMode_ContentGroupsAndBorders', 'LineGroupingMode': 'LineGroupingMode_GroupByRows', 'DetectNewColumnBySpacesRatio': 2.0, 'AutoAlignColumnsToHeader': false }"

The PDF to text conversion process for scanned PDFs is slow:

"profiles": "{ 'OCRResolution': 200.0 }"

PDF Extractor is unable to extract a few particular word strings with light fonts:

"profiles": "{ 'ExtractShadowLikeText': false, 'OCRMode': 'Auto', 'OCRImagePreprocessingFilters.AddGammaCorrection()': [1.8] }"

PDF to XLS Extract All Pages into a Single Worksheet:

"profiles": "{ 'PageToWorksheet': false }"

“11” Extracted as Letter M in PDF to CSV Issue:

"profiles": "{ 'ExtractShadowLikeText': false, 'ColumnDetectionMode': 'ContentGroupsAI', 'OCRMode': 'TextFromRepairedFontsOnly', 'OCRResolution': 600.0, 'OCRImagePreprocessingFilters.AddVerticalLinesRemover()': [], 'OCRImagePreprocessingFilters.AddDilate()': [], 'OCRImagePreprocessingFilters.AddContrast()': [-60], 'OCRImagePreprocessingFilters.AddGammaCorrection()': [2.0], 'CSVSeparatorSymbol': ',' }"
"profiles": "{ 'CustomScript': 'document.querySelectorAll(\'a\').forEach(a => { a.href = \'#\' });' }"

The PDF extractor is unable to extract specific text from the document:

"profiles": "{\n  \"ExtractShadowLikeText\": false,\n  \"OCRMode\": \"Auto\",\n  \"OCRDisableAutoSegmentation\": true,\n  \"CSVSeparatorSymbol\": \",\"\n}"
"profiles": "{\n  \"ExtractShadowLikeText\": false,\n  \"OCRMode\": \"Auto\",\n  \"OCRImagePreprocessingFilters.AddHorizontalLinesRemover()\": [],\n  \"CSVSeparatorSymbol\": \",\"\n}"
"profiles": "{ 'ExtractShadowLikeText': false,'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts','OCRLanguage': 'ita','OCRImagePreprocessingFilters.AddVerticalLinesRemover()': [], 'OCRImagePreprocessingFilters.AddContrast()': [20] }"

The Scanned PDF to CSV OCR issue:

"profiles": "{'ExtractShadowLikeText': false,'OCRMode': 'TextFromImagesAndRepairedFonts'}"

Removes all asterisks from the document:

"profiles": {
"AddFilter()": [ "*", true, false ]
}

Extract all pages as a single worksheet:

"profiles": "{ 'TableXMinIntersectionRequiredInPercents': 89, 'ExtractShadowLikeText': false, 'DetectNewColumnBySpacesRatio': 1.5, 'ColumnDetectionByTextAlignment': 'Middle', 'ColumnDetectionMode': 'Borders', 'OCRMode': 'TextFromVectorsAndRepairedFonts', 'OCRLanguage': 'pol', 'OCRImagePreprocessingFilters.AddVerticalLinesRemover()': [], 'OCRImagePreprocessingFilters.AddHorizontalLinesRemover()': [], 'OCRImagePreprocessingFilters.AddGammaCorrection()': [1.4], 'ShrinkMultipleSpaces': true, 'PageToWorksheet': true }"

When adding Hebrew letters, the full letter is not being displayed:

"profiles": "{ 'DisableLigatures': true }"

Alignment issue when extracting data due to column detection problems:

"profiles": "{ 'DetectNewColumnBySpacesRatio': 1.5, 'ColumnDetectionMode': 'Borders', 'OCRImagePreprocessingFilters.AddGammaCorrection()': [ 0.8 ], 'OCRImagePreprocessingFilters.AddContrast()': [ 20 ], 'OCRDetectLines': true }"

The document is not being automatically rotated during processing:

"profiles": "{ 'OCRDetectPageRotation': true }"

Issue with extracting table data from the document:

"profiles": "{ 'DetectNewColumnBySpacesRatio': 2.0, 'ColumnDetectionMode': 'ContentGroupsAI' }"

Extract images from PDF and get output in JSON:

"profiles": "{ 'SaveImages': 'Embed' }"

PDF to XML conversion is not extracting the correct font name, style, and extra text:

"profiles": "{ 'ExtractionAreaUsageMode': 'UseObjectsCompetelyInsideAreaOnly' }"

Font style is not being extracted during the PDF to XML conversion:

"profiles": "{ 'ExtractionAreaUsageMode': 'UseObjectsCompetelyInsideAreaOnly', 'ConsiderFontNames': true }"

Hyphens are missing in the output:

"profiles": "{ 'NormalizeText': true }"

Crop empty space around images:

"profiles": "{ 'AutoCropImages': true }"
"profiles": "{ 'OutputStructure': 'OnlyLinks' }"
"profiles": "{ 'OutputStructure': 'OnlyLinks', 'OutputTransformation': '$..text' }"

The optical character recognition (OCR) process is not providing accurate recognition for scanned documents:

"profiles": { "OCRImagePreprocessingFilters.AddContrast()": [ 20 ], "OCRImagePreprocessingFilters.AddGammaCorrection()": [ 2.0 ] }
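The filter entries in these profiles follow one pattern: a key that looks like a method call on `OCRImagePreprocessingFilters`, with a list of arguments as the value. The sketch below builds such entries in Python; the helper name `ocr_filter` is hypothetical and only mirrors the key pattern shown on this page. It is not part of any SDK.

```python
import json


def ocr_filter(method: str, *args) -> dict:
    """Build one filter entry in the key/argument-list style used on this page.

    Hypothetical helper: it only formats the key, it does not check that
    the filter name exists.
    """
    return {f"OCRImagePreprocessingFilters.{method}()": list(args)}


# Recreate the profile shown above; dict insertion order is preserved.
profile = {}
profile.update(ocr_filter("AddContrast", 20))
profile.update(ocr_filter("AddGammaCorrection", 2.0))

print(json.dumps(profile))
```

A filter without arguments, such as `AddDeskew`, would be written as `ocr_filter("AddDeskew")`, which yields an empty list as in the other samples.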

Unprintable characters are appearing in the output:

"profiles": {"OCRMode": "TextFromRepairedFontsOnly"}

The text contains duplicated words:

"profiles": {"ExtractShadowLikeText": false, "OCRMode": "Auto", "CSVSeparatorSymbol": ","}

Repeating words in CSV:

"profiles": "{\n  \"ExtractShadowLikeText\": false,\n  \"ColumnDetectionMode\": \"ContentGroupsAI\",\n  \"OCRMode\": \"TextFromRepairedFontsOnly\",\n  \"CSVSeparatorSymbol\": \",\"\n}"

Extract text from a scanned PDF with Japanese characters using OCR:

"profiles": "{ 'OCRImagePreprocessingFilters.AddGammaCorrection()': [ 0.6 ],'OCRImagePreprocessingFilters.AddContrast()': [ 40 ] }"

Missing information and invisible text objects in output CSV (French language):

"profiles": "{'ExtractShadowLikeText': false, 'ColumnDetectionMode': 'ContentGroups', 'OCRMode': 'TextFromImagesAndVectorsAndFonts', 'OCRLanguage': 'fra', 'DetectNewColumnBySpacesRatio': 1.8, 'CSVSeparatorSymbol': ',', 'ExtractInvisibleText': false}"

Extracting text from a document with vector objects packed into a font:

"profiles": "{ 'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts' }"

Extracting text from a scanned file using OCR:

"profiles": "{ 'OCRMode': 'Auto' }"

Extracting data from bank statements:

"profiles": "{ 'DetectNewColumnBySpacesRatio': 1.3, 'ColumnDetectionMode': 'ContentGroupsAI' }"

Normalizing text for CSV containing a crucifix character:

"profiles": "{'NormalizeText': true}"

Extracting data from complex tables:

"profiles": "{\n  \"ExtractShadowLikeText\": false,\n  \"ColumnDetectionMode\": \"ContentGroupsAI\",\n  \"OCRMode\": \"Auto\",\n  \"DetectUnderlineTextStyle\": true,\n  \"DetectStrikeoutTextStyle\": true\n}"

The output CSV file is broken:

"profiles": "{ 'DetectNewColumnBySpacesRatio': 5.0 }"

Auto-Resize column width in XLS output:

"profiles": "{ 'ColumnDetectionMode': 'ContentGroupsAI', 'ShrinkMultipleSpaces': true, 'CustomColumnWidths': [ 100, 100 ] }"

Converting non-editable image-based PDF to JSON format:

"profiles": "{ 'OCRMode': 'TextFromImagesAndFonts'}"

Setting up a PDF as read-only:

"profiles": "{ 'FlattenDocument()': [] }"

Separate the header from the rest of the content:

"profiles": "{ 'ConsiderFontSizes': true }"

OCRMode==AutoRepair mistakenly identifying a PDF in Czech as broken:

"profiles": "{'ExtractShadowLikeText': false, 'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts', 'OCRLanguage': 'ces+eng'}"
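Several samples on this page combine OCR languages by joining their codes with `+` (`ces+eng`, `eng+spa`). A minimal sketch of building that value in Python follows; `ocr_languages` is a hypothetical helper, and the sketch assumes nothing about which language codes are actually installed.

```python
import json


def ocr_languages(*codes: str) -> str:
    """Join language codes with '+', as in 'ces+eng' (hypothetical helper)."""
    if not codes:
        raise ValueError("at least one language code is required")
    return "+".join(codes)


profile = {
    "ExtractShadowLikeText": False,
    "OCRMode": "TextFromImagesAndVectorsAndRepairedFonts",
    "OCRLanguage": ocr_languages("ces", "eng"),
}

print(json.dumps(profile))
```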

Extracting text from images and vectors with repaired fonts:

"profiles": "{ 'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts' }"

Encountering unknown characters when using the PDF to CSV conversion method:

"profiles": "{\n  \"ExtractShadowLikeText\": false,\n  \"OCRLanguage\": \"deu\", \"OCRMode\": \"TextFromImagesAndFonts\",\n  \"CSVSeparatorSymbol\": \",\"\n}"

Experiencing text shadow and missing text problems in French:

"profiles": "{'ExtractShadowLikeText': false, 'DetectNewColumnBySpacesRatio': 1.8, 'ColumnDetectionMode': 'ContentGroups', 'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts', 'OCRLanguage': 'fra', 'CSVSeparatorSymbol': ','}"

Missing values and unaligned columns:

"profiles": "{\n  \"ExtractShadowLikeText\": false,\n  \"ColumnDetectionMode\": \"ContentGroupsAI\",\n  \"OCRMode\": \"AutoRepairFonts\",\n  \"OCRImagePreprocessingFilters.AddHorizontalLinesRemover()\": [],\n  \"OCRImagePreprocessingFilters.AddGammaCorrection()\": [\n    1.8\n  ],\n  \"OCRImagePreprocessingFilters.AddContrast()\": [\n    -20\n  ],\n  \"ShrinkMultipleSpaces\": true,\n  \"CSVSeparatorSymbol\": \",\"\n}"

Creating a custom extraction area for converting PDF to CSV:

"profiles": "{\n  \"ExtractionArea\": [\n    8.25,\n    312.75,\n    583.5,\n    135.0\n  ]\n}"

Rows with misaligned cells:

"profiles": "{\n  \"ExtractShadowLikeText\": false,\n  \"LineGroupingMode\": \"GroupByRows\",\n  \"Unwrap\": true,\n  \"OCRMode\": \"Auto\",\n  \"ShrinkMultipleSpaces\": true,\n  \"CSVSeparatorSymbol\": \",\"\n}"

Text misalignment in a 90-degree rotated PDF:

"profiles": "{\n  \"DetectNewColumnBySpacesRatio\": 0.4,\n  \"RotationAngle\": \"Deg90\"\n}"

Auto-repairing fonts in PDF files:

"profiles": "{\n  \"OCRMode\": \"AutoRepairFonts\",\n  \"ColumnDetectionMode\": \"ContentGroupsAI\",\n  \"PageSeparator\": \"--- New Page ---\"\n}"

Alignment problems between column headers and data in a table:

"profiles": "{\n  \"TableXMinIntersectionRequiredInPercents\": 70,\n  \"ShrinkMultipleSpaces\": true\n}"
"profiles": "{\n  \"ColumnDetectionMode\": \"ContentGroupsAI\"\n}"

OCR for processing public reports:

"profiles": "{\n  \"OCRDetectLines\": true,\n  \"CSVSeparatorSymbol\": \",\",\n  \"TableYMinIntersectionRequiredInPercents\": 20,\n  \"DetectNewColumnBySpacesRatio\": 2.0,\n  \"AllowStandalonePunctuation\": true,\n  \"OCRMode\": \"Auto\",\n  \"OCRImagePreprocessingFilters.AddDeskew()\": [],\n  \"OCRImagePreprocessingFilters.AddHorizontalLinesRemover()\": [],\n  \"OCRImagePreprocessingFilters.AddVerticalLinesRemover()\": [],\n  \"OCRImagePreprocessingFilters.AddGammaCorrection()\": [\n    1.4\n  ]\n}"

Encountering unknown symbols while processing data or text:

"profiles": "{\n  \"TableXMinIntersectionRequiredInPercents\": 51,\n  \"ExtractShadowLikeText\": false,\n  \"OCRMode\": \"TextFromImagesAndVectorsAndRepairedFonts\",\n  \"OCRLanguage\": \"eng+spa\",\n  \"OCRImagePreprocessingFilters.AddHorizontalLinesRemover()\": [],\n  \"OCRImagePreprocessingFilters.AddGammaCorrection()\": [\n    1.8\n  ],\n  \"OCRImagePreprocessingFilters.AddContrast()\": [\n    20\n  ],\n  \"OutputFormat\": \"XLSX\",\n  \"NumberDecimalSeparator\": \".\",\n  \"NumberGroupSeparator\": \",\"\n}"

Handling multi-line headers in a table:

"profiles": "{\n  \"TableXMinIntersectionRequiredInPercents\": 55\n}"
"profiles": "{ 'ShrinkMultipleSpaces': true }"

Grouping lines in a PDF based on the background color:

"profiles": "{ 'LineGroupingMode': 'GroupByRows', 'Unwrap': true, 'ConsiderBackgroundColors': true }"

Text recognition errors:

"profiles": "{ 'OCRMode': 'TextFromImagesOnly', 'OCRImagePreprocessingFilters.AddDeskew()': [] }"

Duplicate words appearing in a CSV file:

"profiles": "{\n \"TableXMinIntersectionRequiredInPercents\": 21,\n  \"ExtractInvisibleText\": false,\n  \"ExtractShadowLikeText\": false,\n  \"DetectNewColumnBySpacesRatio\": 2.5,\n  \"OCRMode\": \"TextFromImagesAndVectorsAndRepairedFonts\",\n  \"OCRImagePreprocessingFilters.AddDeskew()\": [],\n  \"OCRImagePreprocessingFilters.AddMedian()\": [],\n  \"OCRImagePreprocessingFilters.AddGammaCorrection()\": [\n    1.4\n  ],\n  \"ShrinkMultipleSpaces\": true,\n  \"CSVSeparatorSymbol\": \",\"\n}"

Cells merging when a hyphen precedes them:

"profiles": "{ 'AllowStandalonePunctuation': 'true' }"

Alignment problems between column headers and data in a table:

"profiles": "{ \"ColumnDetectionByTextAlignment\": \"Middle\", \"TableXMinIntersectionRequiredInPercents\": \"35\" }"

Using DocumentRotator and Deskew filter to perform OCR and normalize small page rotations:

"profiles": "{ \"RotationAngle\": \"Deg90\", \"OCRImagePreprocessingFilters.AddDeskew()\": \"\", \"DetectNewColumnBySpacesRatio\": \"1.5\" }"

Preset OCR preprocessing filters damage character recognition in certain rare documents:

"profiles": "{ 'OCRImagePreprocessingFilters.Clear()': '' }"

Rows not being split correctly in a table:

"profiles": "{\n  \"LineGroupingMode\": \"GroupByRows\",\n  \"ColumnDetectionMode\": \"Borders\",\n  \"Unwrap\": false,\n  \"ShrinkMultipleSpaces\": false,\n  \"DetectNewColumnBySpacesRatio\": 1\n}"

OCR mode to exclude images during text extraction from a document:

"profiles": "{\n  \"DetectNewColumnBySpacesRatio\": \"1.5\",\n  \"OCRMode\": \"TextFromVectorsAndRepairedFonts\"\n}"

Incorrect values and merged columns in a table:

"profiles": "{ 'OCRResolution': 600, 'OCRMode': 'TextFromImagesOnly', 'OCRImagePreprocessingFilters.AddDeskew()': [], 'OCRImagePreprocessingFilters.AddHorizontalLinesRemover()': [], 'OCRImagePreprocessingFilters.AddGammaCorrection()': [ 2.0 ], 'OCRImagePreprocessingFilters.AddContrast()': [ -40 ], 'CustomExtractionColumns': [ 0, 73, 124, 163, 203, 252, 285, 342, 375, 410, 448, 489, 530 ] }"

Misalignment between column headers and values in a table:

"profiles": "{\n  \"ColumnDetectionByTextAlignment\": \"Right\",\n  \"TableXMinIntersectionRequiredInPercents\": \"10\"\n}"
"profiles": "{\n  \"OCRMode\": \"TextFromRepairedFontsOnly\"\n}"

Shrinking multiple spaces in a document:

"profiles": "{\n  \"ShrinkMultipleSpaces\": \"true\"\n}"

Extracting text from images and vectors with repaired fonts:

"profiles": "{\n  \"OCRMode\": \"TextFromImagesAndVectorsAndRepairedFonts\"\n}"

Disabling OCR page rotation detection and setting a fixed rotation angle:

"profiles": "{\n  \"OCRDetectPageRotation\": false,\n  \"RotationAngle\": \"Deg270\"\n}"

Disabling automatic detection of numbers:

"profiles": "{\n  \"AutoDetectNumbers\": false\n}"

Uneven document design without column borders that makes proper extraction impossible using default parameters:

"profiles": "{\n  \"DetectNewColumnBySpacesRatio\": \"2.0\"\n}"
"profiles": "{\n  \"CheckPermissions\": false\n}"

Missing text caused by preset filters during text extraction:

"profiles": "{ 'OCRImagePreprocessingFilters.Clear()' : [ ], 'OCRMode': 'Auto' }"

Converting a PDF to text without preserving its layout:

"profiles": "{'ExtractColumnByColumn': true}"

Default PDF to CSV settings include page rotation detection, but in some cases, certain lines in the PDF may trigger the rotation detector, resulting in a rotated page that causes problems for OCR:

"profiles" : "{ 'ExtractShadowLikeText': 'false','OCRMode':'Auto','CSVSeparatorSymbol':',','ColumnDetectionMode':'ContentGroupsAI','OCRDetectPageRotation': false }"

Problems with shadowed and missing text:

"profiles": "{ 'ExtractShadowLikeText': false, 'ColumnDetectionMode': 'ContentGroupsAI', 'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts', 'OCRResolution': 600.0, 'OCRImagePreprocessingFilters.AddGammaCorrection()': [2.0], 'OCRImagePreprocessingFilters.AddContrast()': [-40] }"

The first letter in an expression is shifting to the last letter:

"profiles": "{ 'ExtractShadowLikeText': false, 'DetectNewColumnBySpacesRatio': 1.2, 'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts', 'OCRLanguage': 'deu', 'ShrinkMultipleSpaces': true }"

Data in a field being duplicated and inserted after the first character:

"profiles": "{ 'ExtractShadowLikeText': false, 'ColumnDetectionMode': 'ContentGroupsAI', 'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts', 'OCRResolution': 600.0, 'OCRImagePreprocessingFilters.AddGammaCorrection()': [2.0], 'OCRImagePreprocessingFilters.AddContrast()': [-40] }"

Text appearing as separate and disconnected in a document:

"profiles": "{'ExtractShadowLikeText': false, 'DetectNewColumnBySpacesRatio': 5.0, 'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts', 'OCRLanguage': 'nld', 'OCRImagePreprocessingFilters.AddVerticalLinesRemover()': [], 'OCRImagePreprocessingFilters.AddHorizontalLinesRemover()': [], 'OCRImagePreprocessingFilters.AddGammaCorrection()': [0.6], 'OCRImagePreprocessingFilters.AddContrast()': [20], 'CSVSeparatorSymbol': ','}"

Extracting text from images and vectors with repaired fonts, using the German language OCR engine:

"profiles": "{'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts', 'OCRLanguage': 'deu'}"

Using OCR mode to extract text only from images, and setting the OCR language to German:

"profiles": "{'OCRMode': 'TextFromImagesOnly', 'OCRLanguage': 'deu'}"
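A stray or missing quote in a profile string is easy to overlook. The sketch below is one way to check a single-quoted profile before sending it. It assumes the single-quoted style is equivalent to JSON with the quotes swapped, and that no value contains a literal quote character. `parse_profile` is a hypothetical helper, not an SDK function.

```python
import json


def parse_profile(profile: str) -> dict:
    """Parse a single-quoted profile string by treating it as JSON.

    Assumption: swapping ' for " yields valid JSON, which holds only when
    no key or value contains a quote character itself.
    """
    return json.loads(profile.replace("'", '"'))


good = "{'OCRMode': 'TextFromImagesOnly', 'OCRLanguage': 'deu'}"
bad = "{'OCRMode': 'TextFromImagesOnly', 'OCRLanguage: 'deu'}"  # quote missing after the key

print(parse_profile(good))

try:
    parse_profile(bad)
except json.JSONDecodeError as err:
    print(f"Malformed profile: {err.msg}")
```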

Remove Unwanted Invisible Text in PDF Documents:

{
    "profiles": "{\n  \"ExtractInvisibleText\": false,\n  \"ExtractShadowLikeText\": false,\n  \"OCRMode\": \"Auto\"\n}"
}
{
    "profiles": "{\n  \"ExtractInvisibleText\": false,\n  \"ExtractShadowLikeText\": false,\n  \"ColumnDetectionMode\": \"ContentGroups\",\n  \"OCRMode\": \"Auto\",\n  \"CSVSeparatorSymbol\": \",\"\n}"
}
{
    "profiles": "{\n  \"ExtractInvisibleText\": false,\n  \"ExtractShadowLikeText\": false,\n  \"LineGroupingMode\": \"GroupByRows\",\n  \"ColumnDetectionMode\": \"ContentGroups\",\n  \"OCRMode\": \"Auto\",\n  \"CSVSeparatorSymbol\": \",\"\n}"
}
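In the wrapped form above, the value of `"profiles"` is itself a JSON document stored as a string, which is why its quotes and newlines appear escaped. A minimal Python sketch of producing that shape by serializing twice follows. Only the `"profiles"` key comes from this page; any other request fields a real call needs are omitted here.

```python
import json

# Settings copied from the first sample in this section.
settings = {
    "ExtractInvisibleText": False,
    "ExtractShadowLikeText": False,
    "OCRMode": "Auto",
}

# First pass: the profile becomes a JSON string with two-space indentation.
profile_text = json.dumps(settings, indent=2)

# Second pass: embedding that string escapes its quotes and newlines.
body = json.dumps({"profiles": profile_text}, indent=4)

print(body)
```

Decoding works the same way in reverse: `json.loads(json.loads(body)["profiles"])` returns the original settings.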