Document Classifier

Automatic Classification of Incoming Documents

Use the /pdf/classifier endpoint (see below) to automatically detect the class of a document based on keyword rules. For example, you can define rules that identify which vendor issued a document, and then apply the template that matches that vendor.

Tip

To quickly create and test classification rules, download and install ByteScout PDF Multitool. Run it and open PDF Classifier in the left sidebar. Test your rules there, then export them as a JSON request for the PDF.co PDF Classifier.

Available Methods

[POST] /pdf/classifier

Description: Document Classifier automatically determines the class of an input PDF, JPG, or PNG document by analyzing its content, using either the built-in AI or custom classification rules.

IMPORTANT: The best way to develop, test, and maintain classification rules is to use the Classifier Tester Tool from the PDF.co Document Classifier UI, which you can download from this page. Use this tool to quickly edit rules and test them on single PDFs or on whole folders.

Attributes

Hint: Attributes must be passed inside the JSON body of the POST request:

{
    "url": "url-input-link"
}

url (required): URL to the source file. Supports links from Google Drive, Dropbox, and PDF.co built-in files storage. To upload files via the API, see the Files Upload section. Note: If you experience intermittent `Too Many Requests` or `Access Denied` errors, add the `cache:` prefix to enable built-in URL caching (e.g. `cache:https://example.com/file1.pdf`). For data security, you have the option to encrypt output files and decrypt input files. Learn more about user-controlled data encryption.

httpusername (optional): HTTP auth user name, if required to access the source `url`.

httppassword (optional): HTTP auth password, if required to access the source `url`.

rulescsv (optional): Custom classification rules in CSV format. Each row contains the class name, the logic (`AND` or `OR`; `OR` is the default), and the keywords, all separated by commas. Rows are separated by the `\n` symbol. Keywords can be regular expressions, written as `/keyword or regexp/i`, where `i` is the case-insensitive flag. Because the rules are sent inside JSON, every `\` must be escaped with another `\`, so `\d` becomes `\\d`, and so on.
Custom Rules Example 1 for `rulescsv` (for more examples, see the classifier guide): `Amazon AWS, OR, Amazon Web Services Invoice, Amazon CloudFront\nDigital Ocean, OR,DigitalOcean, DOInvoice\nACME,OR, ACME Inc.,1540 Long Street`
Custom Rules Example 2 (with regular expressions; for more examples, see the classifier guide): `Medical Report,AND,/Instructing Party\|Medical Report\|Date Of Injury\|Med Agency Ref/i\r\nInjured Claimant,OR, Injured Claimant, Injured Patient ID`

rulescsvurl (optional): URL of a CSV file with classification rules, used instead of inline CSV. This is useful if a separate developer maintains the CSV rules. Sample Dropbox link: `https://www.dropbox.com/s/12345abcdef/document_sorting_rules.csv?dl=0`. Sample content of `document_sorting_rules.csv` with the `Medical Report` and `Injured Claimant` classes: `Medical Report,AND,/Instructing Party\|Medical Report\|Date Of Injury\|Med Agency Ref/i\r\nInjured Claimant,OR,Injured Claimant,Injured Patient ID`

caseSensitive (optional): Defines whether keywords in rules are case-sensitive. Defaults to `true`.

inline (optional): Set to `true` to return results inside the response. Otherwise, the endpoint returns a link to the generated output file.

password (optional): Password of the PDF file. Must be a string.

async (optional): Set to `true` to run long processes in the background. The API then returns a `jobId`, which you can use with the `/job/check` endpoint to check the status of the process and retrieve the output. This lets you proceed with other tasks without waiting for the process to finish.

name (optional): File name for the generated output. Must be a string.

expiration (optional): Expiration time for the output link, in minutes (default is `60`, i.e. 1 hour). After this duration, any generated output files are automatically deleted from PDF.co temporary files storage. The maximum duration for link expiration depends on your current subscription plan. To store permanent input files (e.g. reusable images, PDF templates, documents), consider using PDF.co built-in Files Storage.

profiles (optional): Use this parameter to set additional configuration for fine-tuning and extra options. See the PDF.co knowledge base for profile examples. Must be a string.
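The `rulescsv` format and its JSON escaping are easy to get wrong by hand, so here is a minimal Python sketch of one way to assemble the value. The helper name `build_rules_csv` and the `Numbered Invoice` regex rule are hypothetical and exist only for illustration. The sketch assumes keywords contain no commas, since the row format described above has no quoting mechanism.

```python
import json


def build_rules_csv(rules):
    """Join (class_name, logic, keywords) tuples into a rulescsv string.

    Hypothetical helper: one row per class, rows separated by a newline.
    Assumes no keyword contains a comma.
    """
    rows = []
    for class_name, logic, keywords in rules:
        if logic not in ("AND", "OR"):
            raise ValueError(f"logic must be AND or OR, got {logic!r}")
        rows.append(",".join([class_name, logic, *keywords]))
    return "\n".join(rows)


rules = [
    # Plain keyword rule taken from Custom Rules Example 1.
    ("Amazon AWS", "OR", ["Amazon Web Services Invoice", "Amazon CloudFront"]),
    # Hypothetical regex rule; a raw string keeps single backslashes in Python.
    ("Numbered Invoice", "AND", [r"/invoice\s+no\.?\s*\d+/i"]),
]

payload = {
    "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/document-parser/sample-invoice.pdf",
    "rulescsv": build_rules_csv(rules),
}

# json.dumps performs the escaping described above: each backslash is
# doubled (\d becomes \\d) and the row separator is written as \n.
body = json.dumps(payload)
print(body)
```

Letting the JSON serializer do the escaping avoids doubling backslashes manually, which is the step most likely to go wrong when rules are typed directly into a request body.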

Method: POST
URL: /v1/pdf/classifier

Query parameters

No query parameters accepted.

Body payload

{
    "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/document-parser/sample-invoice.pdf",
    "async": false,
    "inline": "true",
    "password": "",
    "profiles": ""
} 

Example responses

/pdf/classifier

{
    "body": {
        "classes": [
            {
                "class": "invoice"
            },
            {
                "class": "finance"
            },
            {
                "class": "documents"
            }
        ]
    },
    "pageCount": 1,
    "error": false,
    "status": 200,
    "credits": 42,
    "duration": 353,
    "remainingCredits": 98019328
}
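As a rough illustration of consuming this response, the Python sketch below pulls the class names out of the sample above. The function name `extract_classes` is hypothetical, and the code assumes the response has the same shape as the sample (an `error` flag plus `body.classes`), which is the shape returned when results are included inline.

```python
import json

# The sample response shown above, trimmed to the fields used here.
response_text = """
{
    "body": {
        "classes": [
            {"class": "invoice"},
            {"class": "finance"},
            {"class": "documents"}
        ]
    },
    "pageCount": 1,
    "error": false,
    "status": 200
}
"""


def extract_classes(response):
    """Return the detected class names from a parsed classifier response."""
    if response.get("error"):
        raise RuntimeError(f"classifier returned an error, status={response.get('status')}")
    return [item["class"] for item in response["body"]["classes"]]


classes = extract_classes(json.loads(response_text))
print(classes)  # ['invoice', 'finance', 'documents']
```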

Code Snippet

CURL

curl --location --request POST 'https://api.pdf.co/v1/pdf/classifier' \
--header 'Content-Type: application/json' \
--header 'x-api-key: YOUR_API_KEY' \
--data-raw '{
    "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/document-parser/sample-invoice.pdf",
    "async": false,
    "inline": "true",
    "password": "",
    "profiles": ""
} '
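For readers working in Python, the same request could be sketched with the standard library as follows. The function names `build_classifier_request` and `classify` are hypothetical, `YOUR_API_KEY` is a placeholder for your own key, and the payload simply mirrors the curl example above. Only the request is built at import time; nothing is sent until `classify` is called.

```python
import json
import urllib.request

API_URL = "https://api.pdf.co/v1/pdf/classifier"


def build_classifier_request(api_key, source_url):
    """Build (but do not send) the POST request from the curl example."""
    payload = {
        "url": source_url,
        "async": False,
        "inline": "true",
        "password": "",
        "profiles": "",
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def classify(api_key, source_url):
    """Send the request and return the parsed JSON response."""
    request = build_classifier_request(api_key, source_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Inspect the request without contacting the API.
request = build_classifier_request(
    "YOUR_API_KEY",
    "https://pdfco-test-files.s3.us-west-2.amazonaws.com/document-parser/sample-invoice.pdf",
)
print(request.get_method(), request.full_url)
```

Calling `classify("YOUR_API_KEY", source_url)` with a real key would perform the actual network call and return the response as a dictionary.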