Profiles - Collection of Sample Profiles
PDF Extractor SDK cannot capture font styles in a PDF and convert them into JSON format, even though the following settings are enabled:
"profiles": "{'ExtractColumnByColumn': true}"
The PDF to CSV conversion produces broken CSV files that are unusable for further processing or analysis:
"profiles": "{ 'OCRImagePreprocessingFilters.Clear()': [],'TableXMinIntersectionRequiredInPercents': 27,'ExtractShadowLikeText': false,'DetectNewColumnBySpacesRatio': 1.6,'ColumnDetectionMode': 'ContentGroups','OCRMode': 'Auto','OCRResolution': 600.0,'OCRImagePreprocessingFilters.AddHorizontalLinesRemover()': [],'OCRImagePreprocessingFilters.AddDilate()': [],'OCRImagePreprocessingFilters.AddContrast()': [-40] }"
The template crashes when a new PDF is converted:
"profiles" : "{ 'ExtractShadowLikeText': 'false','OCRMode':'Auto','CSVSeparatorSymbol':',','ColumnDetectionMode':'ContentGroupsAI','OCRDetectPageRotation': false }"
The PDF extraction process is resulting in missing and shadowed text:
"profiles": "{ 'ExtractShadowLikeText': false, 'ColumnDetectionMode': 'ContentGroupsAI', 'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts', 'OCRResolution': 600.0, 'OCRImagePreprocessingFilters.AddGammaCorrection()': [2.0], 'OCRImagePreprocessingFilters.AddContrast()': [-40] }"
Shadowed Text Issue:
"profiles": "{ 'ExtractShadowLikeText': false }"
PDF to JSON duplicate characters Issue:
"profiles": "{ 'ExtractShadowLikeText': false, 'ColumnDetectionMode': 'ContentGroupsAI', 'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts', 'OCRResolution': 600.0, 'OCRImagePreprocessingFilters.AddGammaCorrection()': [2.0], 'OCRImagePreprocessingFilters.AddContrast()': [-40] }"
Missing Texts in CSV Issue:
"profiles": "{ 'DetectNewColumnBySpacesRatio': 5.0, 'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts', 'OCRLanguage': 'nld', 'OCRImagePreprocessingFilters.AddVerticalLinesRemover()': [], 'OCRImagePreprocessingFilters.AddHorizontalLinesRemover()': [], 'OCRImagePreprocessingFilters.AddGammaCorrection()': [0.6], 'OCRImagePreprocessingFilters.AddContrast()': [20] }"
PDF to XLS conversion process is not detecting the columns properly and auto-aligning the columns to the header:
"profiles": "{ 'ColumnDetectionMode': 'ColumnDetectionMode_ContentGroupsAndBorders', 'LineGroupingMode': 'LineGroupingMode_GroupByRows', 'DetectNewColumnBySpacesRatio': 2.0, 'AutoAlignColumnsToHeader': false }"
The PDF to text conversion process for scanned PDFs is slow:
"profiles": "{ 'OCRResolution': 200.0 }"
PDF Extractor is unable to extract a few particular word strings with light fonts:
"profiles": "{ 'ExtractShadowLikeText': false, 'OCRMode': 'Auto', 'OCRImagePreprocessingFilters.AddGammaCorrection()': [1.8] }"
PDF to XLS Extract All Pages into a Single Worksheet:
"profiles": "{ 'PageToWorksheet': false }"
“11” Extracted as Letter M in PDF to CSV Issue:
"profiles": "{ 'ExtractShadowLikeText': false, 'ColumnDetectionMode': 'ContentGroupsAI', 'OCRMode': 'TextFromRepairedFontsOnly', 'OCRResolution': 600.0, 'OCRImagePreprocessingFilters.AddVerticalLinesRemover()': [], 'OCRImagePreprocessingFilters.AddDilate()': [], 'OCRImagePreprocessingFilters.AddContrast()': [-60], 'OCRImagePreprocessingFilters.AddGammaCorrection()': [2.0], 'CSVSeparatorSymbol': ',' }"
Disable Active Links in PDF:
"profiles": "{ 'CustomScript': 'document.querySelectorAll(\'a\').forEach(a => { a.href = \'#\' });' }"
The PDF extractor is unable to extract specific text from the document:
"profiles": "{\n \"ExtractShadowLikeText\": false,\n \"OCRMode\": \"Auto\",\n \"OCRDisableAutoSegmentation\": true,\n \"CSVSeparatorSymbol\": \",\"\n}"
"profiles": "{\n \"ExtractShadowLikeText\": false,\n \"OCRMode\": \"Auto\",\n \"OCRImagePreprocessingFilters.AddHorizontalLinesRemover()\": [],\n \"CSVSeparatorSymbol\": \",\"\n}"
The PDF file is not converting to XML format properly due to font-related issues:
"profiles": "{ 'ExtractShadowLikeText': false,'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts','OCRLanguage': 'ita','OCRImagePreprocessingFilters.AddVerticalLinesRemover()': [], 'OCRImagePreprocessingFilters.AddContrast()': [20] }"
The Scanned PDF to CSV OCR issue:
"profiles": "{'ExtractShadowLikeText': false,'OCRMode': 'TextFromImagesAndRepairedFonts'}"
Remove all asterisks from the document:
"profiles": {
"AddFilter()": [ "*", true, false ]
}
Extract all pages as a single worksheet:
"profiles": "{ 'TableXMinIntersectionRequiredInPercents': 89, 'ExtractShadowLikeText': false, 'DetectNewColumnBySpacesRatio': 1.5, 'ColumnDetectionByTextAlignment': 'Middle', 'ColumnDetectionMode': 'Borders', 'OCRMode': 'TextFromVectorsAndRepairedFonts', 'OCRLanguage': 'pol', 'OCRImagePreprocessingFilters.AddVerticalLinesRemover()': [], 'OCRImagePreprocessingFilters.AddHorizontalLinesRemover()': [], 'OCRImagePreprocessingFilters.AddGammaCorrection()': [1.4], 'ShrinkMultipleSpaces': true, 'PageToWorksheet': true }"
When adding Hebrew letters, the full letter is not being displayed:
"profiles": "{ 'DisableLigatures': true }"
Alignment issue when extracting data due to column detection problems:
"profiles": "{ 'DetectNewColumnBySpacesRatio': 1.5, 'ColumnDetectionMode': 'Borders', 'OCRImagePreprocessingFilters.AddGammaCorrection()': [ 0.8 ], 'OCRImagePreprocessingFilters.AddContrast()': [ 20 ], 'OCRDetectLines': true }"
The document is not being automatically rotated during processing:
"profiles": "{ 'OCRDetectPageRotation': true }"
Issue with extracting table data from the document:
"profiles": "{ 'DetectNewColumnBySpacesRatio': 2.0, 'ColumnDetectionMode': 'ContentGroupsAI' }"
Extract images from PDF and get output in JSON:
"profiles": "{ 'SaveImages': 'Embed' }"
PDF to XML conversion is not extracting the correct font name, style, and extra text:
"profiles": "{ 'ExtractionAreaUsageMode': 'UseObjectsCompetelyInsideAreaOnly' }"
Font style is not being extracted during the PDF to XML conversion:
"profiles": "{ 'ExtractionAreaUsageMode': 'UseObjectsCompetelyInsideAreaOnly', 'ConsiderFontNames': true }"
Hyphens are missing in the output:
"profiles": "{ 'NormalizeText': true }"
Crop empty space around images:
"profiles": "{ 'AutoCropImages': true }"
Find and extract all links from a PDF document:
"profiles": "{ 'OutputStructure': 'OnlyLinks' }"
Find and extract all links from a PDF document and make the list more compact:
"profiles": "{ 'OutputStructure': 'OnlyLinks', 'OutputTransformation': '$..text' }"
The optical character recognition (OCR) process is not providing accurate recognition for scanned documents:
"profiles": { "OCRImagePreprocessingFilters.AddContrast()": [ 20 ], "OCRImagePreprocessingFilters.AddGammaCorrection()": [ 2.0 ] }
Unprintable characters are appearing in the output:
"profiles": {"OCRMode": "TextFromRepairedFontsOnly"}
The text contains duplicated words:
"profiles": {"ExtractShadowLikeText": false, "OCRMode": "Auto", "CSVSeparatorSymbol": ","}
Repeating words in CSV:
"profiles": "{\n \"ExtractShadowLikeText\": false,\n \"ColumnDetectionMode\": \"ContentGroupsAI\",\n \"OCRMode\": \"TextFromRepairedFontsOnly\",\n \"CSVSeparatorSymbol\": \",\"\n}"
Extract text from a scanned PDF with Japanese characters using OCR:
"profiles": "{ 'OCRImagePreprocessingFilters.AddGammaCorrection()': [ 0.6 ],'OCRImagePreprocessingFilters.AddContrast()': [ 40 ] }"
Missing information and invisible text objects in output CSV (French language):
"profiles": "{'ExtractShadowLikeText': false, 'ColumnDetectionMode': 'ContentGroups', 'OCRMode': 'TextFromImagesAndVectorsAndFonts', 'OCRLanguage': 'fra', 'DetectNewColumnBySpacesRatio': 1.8, 'CSVSeparatorSymbol': ',', 'ExtractInvisibleText': false}"
Extracting text from a document with vector objects packed into a font:
"profiles": "{ 'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts' }"
Extracting text from a scanned file using OCR:
"profiles": "{ 'OCRMode': 'Auto' }"
Extracting data from bank statements:
"profiles": "{ 'DetectNewColumnBySpacesRatio': 1.3, 'ColumnDetectionMode': 'ContentGroupsAI' }"
Normalizing text for CSV containing a crucifix character:
"profiles": "{'NormalizeText': true}"
Extracting data from complex tables:
"profiles": "{\n \"ExtractShadowLikeText\": false,\n \"ColumnDetectionMode\": \"ContentGroupsAI\",\n \"OCRMode\": \"Auto\",\n \"DetectUnderlineTextStyle\": true,\n \"DetectStrikeoutTextStyle\": true\n}"
The output CSV file is broken:
"profiles": "{ 'DetectNewColumnBySpacesRatio': 5.0 }"
Auto-Resize column width in XLS output:
"profiles": "{ 'ColumnDetectionMode': 'ContentGroupsAI', 'ShrinkMultipleSpaces': true, 'CustomColumnWidths': [ 100, 100 ] }"
Converting non-editable image-based PDF to JSON format:
"profiles": "{ 'OCRMode': 'TextFromImagesAndFonts'}"
Setting up a PDF as read-only:
"profiles": "{ 'FlattenDocument()': [] }"
Separate the header from the rest of the content:
"profiles": "{ 'ConsiderFontSizes': true }"
OCRMode==AutoRepair mistakenly identifying a PDF in Czech as broken:
"profiles": "{'ExtractShadowLikeText': false, 'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts', 'OCRLanguage': 'ces+eng'}"
Extracting text from images and vectors with repaired fonts:
"profiles": "{ 'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts' }"
Encountering unknown characters when using the PDF to CSV conversion method:
"profiles": "{\n \"ExtractShadowLikeText\": false,\n \"OCRLanguage\": \"deu\", \"OCRMode\": \"TextFromImagesAndFonts\",\n \"CSVSeparatorSymbol\": \",\"\n}"
Experiencing text shadow and missing text problems in French:
"profiles": "{'ExtractShadowLikeText': false, 'DetectNewColumnBySpacesRatio': 1.8, 'ColumnDetectionMode': 'ContentGroups', 'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts', 'OCRLanguage': 'fra', 'CSVSeparatorSymbol': ','}"
Missing values and unaligned columns:
"profiles": "{\n \"ExtractShadowLikeText\": false,\n \"ColumnDetectionMode\": \"ContentGroupsAI\",\n \"OCRMode\": \"AutoRepairFonts\",\n \"OCRImagePreprocessingFilters.AddHorizontalLinesRemover()\": [],\n \"OCRImagePreprocessingFilters.AddGammaCorrection()\": [\n 1.8\n ],\n \"OCRImagePreprocessingFilters.AddContrast()\": [\n -20\n ],\n \"ShrinkMultipleSpaces\": true,\n \"CSVSeparatorSymbol\": \",\"\n}"
Creating a custom extraction area for converting PDF to CSV:
"profiles": "{\n \"ExtractionArea\": [\n 8.25,\n 312.75,\n 583.5,\n 135.0\n ]\n}"
Rows with misaligned cells:
"profiles": "{\n \"ExtractShadowLikeText\": false,\n \"LineGroupingMode\": \"GroupByRows\",\n \"Unwrap\": true,\n \"OCRMode\": \"Auto\",\n \"ShrinkMultipleSpaces\": true,\n \"CSVSeparatorSymbol\": \",\"\n}"
Text misalignment in a 90-degree rotated PDF:
"profiles": "{\n \"DetectNewColumnBySpacesRatio\": 0.4,\n \"RotationAngle\": \"Deg90\"\n}"
Auto-repairing fonts in PDF files:
"profiles": "{\n \"OCRMode\": \"AutoRepairFonts\",\n \"ColumnDetectionMode\": \"ContentGroupsAI\",\n \"PageSeparator\": \"--- New Page ---\"\n}"
Alignment problems between column headers and data in a table:
"profiles": "{\n \"TableXMinIntersectionRequiredInPercents\": 70,\n \"ShrinkMultipleSpaces\": true\n}"
"profiles": "{\n \"ColumnDetectionMode\": \"ContentGroupsAI\"\n}"
OCR for processing public reports:
"profiles": "{\n \"OCRDetectLines\": true,\n \"CSVSeparatorSymbol\": \",\",\n \"TableYMinIntersectionRequiredInPercents\": 20,\n \"DetectNewColumnBySpacesRatio\": 2.0,\n \"AllowStandalonePunctuation\": true,\n \"OCRMode\": \"Auto\",\n \"OCRImagePreprocessingFilters.AddDeskew()\": [],\n \"OCRImagePreprocessingFilters.AddHorizontalLinesRemover()\": [],\n \"OCRImagePreprocessingFilters.AddVerticalLinesRemover()\": [],\n \"OCRImagePreprocessingFilters.AddGammaCorrection()\": [\n 1.4\n ]\n}"
Encountering unknown symbols while processing data or text:
"profiles": "{\n \"TableXMinIntersectionRequiredInPercents\": 51,\n \"ExtractShadowLikeText\": false,\n \"OCRMode\": \"TextFromImagesAndVectorsAndRepairedFonts\",\n \"OCRLanguage\": \"eng+spa\",\n \"OCRImagePreprocessingFilters.AddHorizontalLinesRemover()\": [],\n \"OCRImagePreprocessingFilters.AddGammaCorrection()\": [\n 1.8\n ],\n \"OCRImagePreprocessingFilters.AddContrast()\": [\n 20\n ],\n \"OutputFormat\": \"XLSX\",\n \"NumberDecimalSeparator\": \".\",\n \"NumberGroupSeparator\": \",\"\n}"
Handling multi-line headers in a table:
"profiles": "{\n \"TableXMinIntersectionRequiredInPercents\": 55\n}"
Problems related to table columns:
"profiles": "{ 'ShrinkMultipleSpaces': true }"
Grouping lines in a PDF based on the background color:
"profiles": "{ 'LineGroupingMode': 'GroupByRows', 'Unwrap': true, 'ConsiderBackgroundColors': true }"
Text recognition errors:
"profiles": "{ 'OCRMode': 'TextFromImagesOnly', 'OCRImagePreprocessingFilters.AddDeskew()': [] }"
Duplicate words appearing in a CSV file:
"profiles": "{\n \"TableXMinIntersectionRequiredInPercents\": 21,\n \"ExtractInvisibleText\": false,\n \"ExtractShadowLikeText\": false,\n \"DetectNewColumnBySpacesRatio\": 2.5,\n \"OCRMode\": \"TextFromImagesAndVectorsAndRepairedFonts\",\n \"OCRImagePreprocessingFilters.AddDeskew()\": [],\n \"OCRImagePreprocessingFilters.AddMedian()\": [],\n \"OCRImagePreprocessingFilters.AddGammaCorrection()\": [\n 1.4\n ],\n \"ShrinkMultipleSpaces\": true,\n \"CSVSeparatorSymbol\": \",\"\n}"
Cells merging when a hyphen precedes them:
"profiles": "{ 'AllowStandalonePunctuation': 'true' }"
Alignment problems between column headers and data in a table:
"profiles": "{ \"ColumnDetectionByTextAlignment\": \"Middle\", \"TableXMinIntersectionRequiredInPercents\": \"35\" }"
Using DocumentRotator and Deskew filter to perform OCR and normalize small page rotations:
"profiles": "{ \"RotationAngle\": \"Deg90\", \"OCRImagePreprocessingFilters.AddDeskew()\": \"\", \"DetectNewColumnBySpacesRatio\": \"1.5\" }"
Preset preprocessing filters damaging character recognition in certain rare documents:
"profiles": "{ 'OCRImagePreprocessingFilters.Clear()': '' }"
Rows not being split correctly in a table:
"profiles": "{\n \"LineGroupingMode\": \"GroupByRows\",\n \"ColumnDetectionMode\": \"Borders\",\n \"Unwrap\": false,\n \"ShrinkMultipleSpaces\": false,\n \"DetectNewColumnBySpacesRatio\": 1\n}"
OCR mode to exclude images during text extraction from a document:
"profiles": "{\n \"DetectNewColumnBySpacesRatio\": \"1.5\",\n \"OCRMode\": \"TextFromVectorsAndRepairedFonts\"\n}"
Incorrect values and merged columns in a table:
"profiles": "{ 'OCRResolution': 600, 'OCRMode': 'TextFromImagesOnly', 'OCRImagePreprocessingFilters.AddDeskew()': [], 'OCRImagePreprocessingFilters.AddHorizontalLinesRemover()': [], 'OCRImagePreprocessingFilters.AddGammaCorrection()': [ 2.0 ], 'OCRImagePreprocessingFilters.AddContrast()': [ -40 ], 'CustomExtractionColumns': [ 0, 73, 124, 163, 203, 252, 285, 342, 375, 410, 448, 489, 530 ] }"
Misalignment between column headers and values in a table:
"profiles": "{\n \"ColumnDetectionByTextAlignment\": \"Right\",\n \"TableXMinIntersectionRequiredInPercents\": \"10\"\n}"
Issues related to Chinese characters:
"profiles": "{\n \"OCRMode\": \"TextFromRepairedFontsOnly\"\n}"
Shrinking multiple spaces in a document:
"profiles": "{\n \"ShrinkMultipleSpaces\": \"true\"\n}"
Extracting text from images and vectors with repaired fonts:
"profiles": "{\n \"OCRMode\": \"TextFromImagesAndVectorsAndRepairedFonts\"\n}"
Disabling OCR page rotation detection and setting a fixed rotation angle:
"profiles": "{\n \"OCRDetectPageRotation\": false,\n \"RotationAngle\": \"Deg270\"\n}"
Disabling automatic detection of numbers:
"profiles": "{\n \"AutoDetectNumbers\": false\n}"
Uneven document design without column borders that makes proper extraction impossible using default parameters:
"profiles": "{\n \"DetectNewColumnBySpacesRatio\": \"2.0\"\n}"
Extraction being prohibited by the document creator. Disabling the permissions check is possible using the following URL parameter, but it is important to note that disabling it means taking sole responsibility for any copyright or other violations:
"profiles": "{\n \"CheckPermissions\": false\n}"
Missing text caused by preset filters during text extraction:
"profiles": "{ 'OCRImagePreprocessingFilters.Clear()' : [ ], 'OCRMode': 'Auto' }"
Converting a PDF to text without preserving its layout:
"profiles": "{'ExtractColumnByColumn': true}"
Default PDF to CSV settings include page rotation detection, but in some cases, certain lines in the PDF may trigger the rotation detector, resulting in a rotated page that causes problems for OCR:
"profiles" : "{ 'ExtractShadowLikeText': 'false','OCRMode':'Auto','CSVSeparatorSymbol':',','ColumnDetectionMode':'ContentGroupsAI','OCRDetectPageRotation': false }"
Problems with shadowed and missing text:
"profiles": "{ 'ExtractShadowLikeText': false, 'ColumnDetectionMode': 'ContentGroupsAI', 'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts', 'OCRResolution': 600.0, 'OCRImagePreprocessingFilters.AddGammaCorrection()': [2.0], 'OCRImagePreprocessingFilters.AddContrast()': [-40] }"
The first letter in an expression is shifting to the last letter:
"profiles": "{ 'ExtractShadowLikeText': false, 'DetectNewColumnBySpacesRatio': 1.2, 'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts', 'OCRLanguage': 'deu', 'ShrinkMultipleSpaces': true }"
Data in a field being duplicated and inserted after the first character:
"profiles": "{ 'ExtractShadowLikeText': false, 'ColumnDetectionMode': 'ContentGroupsAI', 'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts', 'OCRResolution': 600.0, 'OCRImagePreprocessingFilters.AddGammaCorrection()': [2.0], 'OCRImagePreprocessingFilters.AddContrast()': [-40] }"
Text appearing as separate and disconnected in a document:
"profiles": "{'ExtractShadowLikeText': false, 'DetectNewColumnBySpacesRatio': 5.0, 'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts', 'OCRLanguage': 'nld', 'OCRImagePreprocessingFilters.AddVerticalLinesRemover()': [], 'OCRImagePreprocessingFilters.AddHorizontalLinesRemover()': [], 'OCRImagePreprocessingFilters.AddGammaCorrection()': [0.6], 'OCRImagePreprocessingFilters.AddContrast()': [20], 'CSVSeparatorSymbol': ','}"
Extracting text from images and vectors with repaired fonts, using the German language OCR engine:
"profiles": "{'OCRMode': 'TextFromImagesAndVectorsAndRepairedFonts', 'OCRLanguage': 'deu'}"
Using OCR mode to extract text only from images, and setting the OCR language to German:
"profiles": "{'OCRMode': 'TextFromImagesOnly', 'OCRLanguage': 'deu'}"
Remove Unwanted Invisible Text in PDF Documents:
{
"profiles": "{\n \"ExtractInvisibleText\": false,\n \"ExtractShadowLikeText\": false,\n \"OCRMode\": \"Auto\"\n}"
}
{
"profiles": "{\n \"ExtractInvisibleText\": false,\n \"ExtractShadowLikeText\": false,\n \"ColumnDetectionMode\": \"ContentGroups\",\n \"OCRMode\": \"Auto\",\n \"CSVSeparatorSymbol\": \",\"\n}"
}
{
"profiles": "{\n \"ExtractInvisibleText\": false,\n \"ExtractShadowLikeText\": false,\n \"LineGroupingMode\": \"GroupByRows\",\n \"ColumnDetectionMode\": \"ContentGroups\",\n \"OCRMode\": \"Auto\",\n \"CSVSeparatorSymbol\": \",\"\n}"
}
Copyright © 2016 - 2024 PDF.co