Document Classifier

Auto Classification of Incoming Documents

Use the /pdf/classifier endpoint (see below) to automatically detect the class of a document using keyword-based rules. For example, you can define rules that identify which vendor issued a document and then choose the template to apply accordingly.

Tip

To quickly create and test classification rules, download and install ByteScout PDF Multitool. Run it and open PDF Classifier in the left sidebar. Test your rules there, then export them as a JSON request for the PDF.co PDF Classifier.

Available Methods

[POST] /pdf/classifier

Description: PDF classifier that sorts PDF, JPG and PNG files by their content using a set of keyword rules.

IMPORTANT: the best way to develop, test and maintain classification rules is to use the Classifier Tester Tool from the ByteScout PDF Multitool desktop app for Windows, which you can download from this page. Use this tool to quickly edit and test rules on single PDFs and on folders.

Parameters:

Endpoint Parameters

  • url required. URL to the source file. Supports links from Google Drive, Dropbox and the built-in PDF.co files storage. For uploading files via API, please check the Files Upload section. If you intermittently get Too Many Requests or Access Denied errors for your input url, try adding cache: to enable built-in url caching.
  • httpusername (optional) - HTTP auth user name, if required to access the source url.
  • httppassword (optional) - HTTP auth password, if required to access the source url.
  • rulescsv. required. Sets classification rules in CSV format, where each row contains a class name, a comma, and then keywords separated by |. Rows are separated by the \n symbol. You can use regular expressions with this syntax: /keyword or regexp/i where i is the case-insensitive flag. Please note that because the value is sent as JSON, every \ symbol must be prefixed with another \, so \d becomes \\d and so on.

Example 1 for rulescsv:

Amazon AWS,Amazon Web Services Invoice|Amazon CloudFront\nDigital Ocean,DigitalOcean|DOInvoice\nACME,ACME Inc.|1540 Long Street, Jacksonville, 32099

Example 2 (two classes, rows separated by \r\n):

Medical Report,Instructing Party|Medical Report|Date Of Injury|Med Agency Ref\r\nInjured Claimant,Injured Claimant
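As a rough illustration of the row format described above, the following Python sketch assembles a rulescsv string from a mapping of class names to keywords and shows how JSON encoding doubles the backslashes in a regular expression. The helper name build_rulescsv and the "Invoice Number" rule are hypothetical and not part of the PDF.co API; the sketch also assumes that class names contain no commas, since the first comma in a row appears to end the class name.

```python
import json


def build_rulescsv(rules):
    """Join {class name: [keywords]} into 'Class,kw1|kw2' rows separated by newlines."""
    rows = []
    for class_name, keywords in rules.items():
        # Assumption: the first comma in a row ends the class name.
        if "," in class_name:
            raise ValueError(f"class name must not contain a comma: {class_name!r}")
        rows.append(f"{class_name},{'|'.join(keywords)}")
    return "\n".join(rows)


rules = {
    "Amazon AWS": ["Amazon Web Services Invoice", "Amazon CloudFront"],
    "Digital Ocean": ["DigitalOcean", "DOInvoice"],
    # Hypothetical rule using the /regexp/i syntax; \d holds a single backslash here.
    "Invoice Number": [r"/INV-\d+/i"],
}

rulescsv = build_rulescsv(rules)

# json.dumps applies the JSON escaping: the newline becomes \n and \d becomes \\d.
payload_json = json.dumps({"rulescsv": rulescsv})
print(payload_json)
```

Building the string in code and letting the JSON encoder do the escaping avoids having to double the backslashes by hand.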

  • rulescsvurl. optional. Instead of inline CSV, you can use this parameter to set the url of a CSV file with classification rules. This is useful when a separate developer maintains the CSV rules. Example: https://www.dropbox.com/s/12345abcdef/document_sorting_rules.csv?dl=0

Sample content of document_sorting_rules.csv with Medical Report and Injured Claimant classes:

Medical Report,Instructing Party|Medical Report|Date Of Injury|Med Agency Ref
Injured Claimant,Injured Claimant

  • caseSensitive. optional (default: true). Defines whether keywords in rules are case sensitive.
  • inline. optional. Set to true to return results inside the response. Otherwise, the endpoint returns a link to the generated output file.
  • password optional. Password of the PDF file. Must be a String.
  • async optional. Runs processing asynchronously. Returns a JobId that you may use with /job/check to check the state of the processing (possible states: working, failed, aborted and success). Must be one of: true, false.
  • encrypt optional. Enable encryption for output file. Must be one of: true, false.
  • name optional. File name for generated output. Must be a String.
  • expiration (optional). Output link expiration in minutes. Default is 60 (i.e. 60 minutes or 1 hour). After this delay generated output file(s) (if any) will be auto-removed from PDF.co temporary files storage. Max allowed expiration period depends on your current subscription plan. To store permanent input files (e.g. re-usable images, pdf, documents), please use PDF.co built-in Files Storage instead.
  • profiles optional. Must be a String. Use this parameter to set additional options and custom configuration. See profiles samples for examples.
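When async is set to true, the caller has to keep checking the job until it leaves the working state. The Python sketch below shows one possible polling loop over the four states listed above. It is only an outline: the /job/check request itself is not documented in this section, so it is passed in as a hypothetical check_status callable, and the names wait_for_job, interval and max_checks are likewise assumptions.

```python
import time

# States listed for /job/check in the async parameter description.
TERMINAL_STATES = {"success", "failed", "aborted"}


def wait_for_job(check_status, job_id, interval=2.0, max_checks=30, sleep=time.sleep):
    """Poll until the job reaches a terminal state and return that state.

    check_status is a hypothetical callable that performs the /job/check
    request for job_id and returns the state as a string.
    """
    for _ in range(max_checks):
        state = check_status(job_id)
        if state in TERMINAL_STATES:
            return state
        if state != "working":
            raise ValueError(f"unexpected job state: {state!r}")
        sleep(interval)  # still working: wait before the next check
    raise TimeoutError(f"job {job_id} still working after {max_checks} checks")
```

Passing the status check and the sleep function as arguments keeps the loop independent of any HTTP client and makes it easy to exercise with a stub.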

Description

  • Method: POST
  • URL: /v1/pdf/classifier

Query parameters

No query parameters accepted.

Body payload

{
    "url": "https://bytescout-com.s3-us-west-2.amazonaws.com/files/demo-files/cloud-api/document-parser/sample-invoice.pdf",
    "rulescsv": "Amazon,Amazon Web Services Invoice|Amazon CloudFront\nDigital Ocean,DigitalOcean|DOInvoice\nAcme,ACME Inc.|/ACME.*Inc\\./|1540 Long Street, Jacksonville, 32099",
    "caseSensitive": "true",
    "async": false,
    "encrypt": "false",
    "inline": "true",
    "password": "",
    "profiles": ""
} 

Example responses

/pdf/classifier
{
    "body": {
        "classes": [
            {
                "class": "ACME"
            }
        ]
    },
    "pageCount": 1,
    "error": false,
    "status": 200,
    "remainingCredits": 99972360,
    "credits": 42
}
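To show how a response shaped like the one above might be consumed, here is a small Python sketch that pulls the detected class names out of the body. The function name extract_classes is hypothetical, and the sketch assumes the inline response layout shown in this example; the fields of an error response are not documented here, so only the error flag and status are used.

```python
import json


def extract_classes(response):
    """Return the list of class names from an inline /pdf/classifier response."""
    if response.get("error"):
        # Assumption: "error" is true on failure, as implied by the sample response.
        raise RuntimeError(f"classifier request failed with status {response.get('status')}")
    body = response.get("body") or {}
    return [item["class"] for item in body.get("classes", [])]


# The sample response from this section, parsed from JSON text.
sample = json.loads("""
{
    "body": {"classes": [{"class": "ACME"}]},
    "pageCount": 1,
    "error": false,
    "status": 200,
    "remainingCredits": 99972360,
    "credits": 42
}
""")

print(extract_classes(sample))
```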

Code Snippet

CURL
curl --location --request POST 'https://api.pdf.co/v1/pdf/classifier' \
--header 'Content-Type: application/json' \
--header 'x-api-key: ' \
--data-raw '{
    "url": "https://bytescout-com.s3-us-west-2.amazonaws.com/files/demo-files/cloud-api/document-parser/sample-invoice.pdf",
    "rulescsv": "Amazon,Amazon Web Services Invoice|Amazon CloudFront\nDigital Ocean,DigitalOcean|DOInvoice\nAcme,ACME Inc.|/ACME.*Inc\\./|1540 Long Street, Jacksonville, 32099",
    "caseSensitive": "true",
    "async": false,
    "encrypt": "false",
    "inline": "true",
    "password": "",
    "profiles": ""
} '
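The same request could be sent from Python using only the standard library. The sketch below mirrors the cURL call above; the function names build_request and classify are hypothetical, and YOUR_API_KEY is a placeholder for your own key. The actual call is left commented out because it needs a valid API key and network access.

```python
import json
import urllib.request

API_URL = "https://api.pdf.co/v1/pdf/classifier"


def build_request(api_key, payload):
    """Build (but do not send) the POST request for /pdf/classifier."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),  # JSON body, as in --data-raw
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def classify(api_key, payload):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(api_key, payload)) as resp:
        return json.loads(resp.read().decode("utf-8"))


payload = {
    "url": "https://bytescout-com.s3-us-west-2.amazonaws.com/files/demo-files/cloud-api/document-parser/sample-invoice.pdf",
    # "\\." is a single backslash in Python; json.dumps doubles it for the JSON body.
    "rulescsv": "Amazon,Amazon Web Services Invoice|Amazon CloudFront\n"
                "Digital Ocean,DigitalOcean|DOInvoice\n"
                "Acme,ACME Inc.|/ACME.*Inc\\./|1540 Long Street, Jacksonville, 32099",
    "caseSensitive": "true",
    "async": False,
    "inline": "true",
}

# result = classify("YOUR_API_KEY", payload)  # requires a valid key and network access
```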

Samples