Document Classifier
Auto Classification of Incoming Documents
Use the /pdf/classifier endpoint (see below) to automatically detect the class of a document using keyword-based rules. For example, you can define rules that identify which vendor issued a document and then choose the template to apply accordingly.
Tip
To quickly create and test classification rules, download and install ByteScout PDF Multitool. Run it and select PDF Classifier in the left sidebar. Test your rules, then export them as a JSON request for the PDF.co PDF Classifier.
Available Methods
[POST] /pdf/classifier
Description: PDF Classifier determines the class of input PDF, JPG, and PNG documents by analyzing their content with the built-in AI content analyzer. It can also use custom rules.
IMPORTANT: the best way to develop, test, and maintain classification rules is the Classifier Tester Tool in the ByteScout PDF Multitool desktop app for Windows, which you can download from this page. Use this tool to quickly edit and test rules on single PDFs and on folders.
Tools and Guides
- PDF Classifier Guide
- PDF Multitool desktop app with PDF Classifier Tester (link to direct EXE for Windows)
See Also:
- Document Parser Template Editor (online version)
- Document Parser Template Coding Guide
- Document Parser Template Editor (offline desktop version)
Parameters:
Endpoint Parameters
- url: required. URL to the source file. Supports links from Google Drive, Dropbox, and the built-in PDF.co files storage. To upload files via the API, see the Files Upload section. If you intermittently get a Too Many Requests or Access Denied error for your input URL, try adding cache: to enable built-in URL caching.
- httpusername: optional. HTTP auth user name, if required to access the source url.
- httppassword: optional. HTTP auth password, if required to access the source url.
- rulescsv: optional. Custom classification rules in CSV format. Each row contains a class name and its keywords, with keywords separated by |. Rows are separated by the \n symbol. Regular expressions use the syntax /keyword or regexp/i, where i is the case-insensitive flag. Because of the JSON format, every \ symbol must itself be prefixed with \, so \d becomes \\d and so on.
Example 1 for rulescsv:
Amazon AWS,Amazon Web Services Invoice|Amazon CloudFront\nDigital Ocean,DigitalOcean|DOInvoice\nACME,ACME Inc.|1540 Long Street, Jacksonville, 32099
Example 2 (with regular expressions):
Medical Report,Instructing Party|Medical Report|Date Of Injury|Med Agency Ref\r\nInjured Claimant,Injured Claimant
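The row format and the backslash-escaping note for rulescsv can be illustrated with a minimal Python sketch. The helper name build_rulescsv and the "Invoice Number" regex rule are hypothetical and used only for illustration. The sketch relies on json.dumps to do the JSON escaping, so \d is written as \\d in the request body without manual doubling.

```python
import json

def build_rulescsv(rules):
    """Join {class_name: [keywords]} into rows of 'class,kw1|kw2' separated by newlines."""
    rows = []
    for class_name, keywords in rules.items():
        rows.append(class_name + "," + "|".join(keywords))
    return "\n".join(rows)

rules = {
    "Amazon AWS": ["Amazon Web Services Invoice", "Amazon CloudFront"],
    # Hypothetical regex rule, written with a single backslash in Python.
    "Invoice Number": [r"/INV-\d+/i"],
}

rulescsv = build_rulescsv(rules)

# json.dumps escapes the newline as \n and the backslash as \\ in the JSON text.
body = json.dumps({
    "url": "https://bytescout-com.s3-us-west-2.amazonaws.com/files/demo-files/cloud-api/document-parser/sample-invoice.pdf",
    "rulescsv": rulescsv,
})
print(body)
```

Building the payload as a Python dictionary and serializing it once is usually less error-prone than hand-writing escaped JSON strings.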
- rulescsvurl: optional. Instead of inline CSV, you can use this parameter and set it to the URL of a CSV file with classification rules. This is useful if you have a separate developer working on the CSV rules. Example: https://www.dropbox.com/s/12345abcdef/document_sorting_rules.csv?dl=0
Sample content of document_sorting_rules.csv with Medical Report and Injured Claimant classes:
Medical Report,Instructing Party|Medical Report|Date Of Injury|Med Agency Ref
Injured Claimant,Injured Claimant
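To sanity-check a rules file before publishing it, the rows can be parsed locally. The sketch below is an illustration only, not the service's own parser. It assumes the class name ends at the first comma, which is inferred from the ACME row in Example 1 where the keywords themselves contain commas. The function name parse_rules is hypothetical, and the sketch does not handle a regular expression that contains | itself.

```python
def parse_rules(csv_text):
    """Parse rules text into {class_name: [keywords]} (illustrative sketch)."""
    rules = {}
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank rows
        # Assumption: split on the first comma only, since keywords may contain commas.
        class_name, _, keyword_part = line.partition(",")
        keywords = [k.strip() for k in keyword_part.split("|") if k.strip()]
        rules[class_name.strip()] = keywords
    return rules

sample = (
    "Medical Report,Instructing Party|Medical Report|Date Of Injury|Med Agency Ref\n"
    "Injured Claimant,Injured Claimant\n"
)
parsed = parse_rules(sample)
for class_name, keywords in parsed.items():
    print(class_name, "->", keywords)
```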
- caseSensitive: optional (defaults to true). Defines whether keywords in rules are case sensitive.
- inline: optional. Set to true to return results inside the response. Otherwise the endpoint returns a link to the generated output file.
- password: optional. Password of the PDF file. Must be a String.
- async: optional. Runs processing asynchronously. Returns a JobId that you can use with /job/check to check the state of the processing (possible states: working, failed, aborted, and success). Must be one of: true, false.
- encrypt: legacy; all files are now stored in encrypted cloud storage by default. Important: you can also encrypt output files and decrypt input files with user-controlled data encryption (strong AES encryption + custom keys). Click here to learn more.
- name: optional. File name for the generated output. Must be a String.
- expiration: optional. Output link expiration in minutes. Default is 60 (i.e. 60 minutes or 1 hour). After this period, generated output file(s) (if any) are automatically removed from PDF.co temporary files storage. The maximum allowed expiration period depends on your current subscription plan. To store permanent input files (e.g. reusable images, PDFs, documents), use the PDF.co built-in Files Storage instead.
- profiles: optional. Must be a String. Use this parameter to set additional configuration for fine-tuning and extra options. Explore the PDF.co knowledge base for profile examples.
- Method: POST
- URL: /v1/pdf/classifier
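When async is set to true, the caller has to poll /job/check until the job leaves the working state. A minimal sketch of such a loop follows. The function wait_for_job and its check_state argument are hypothetical: check_state stands in for whatever code calls /job/check with the JobId and returns one of the four state strings listed for the async parameter. The request and response format of /job/check is not described in this section, so it is left out of the sketch.

```python
import time

# Terminal states taken from the async parameter description; "working" means keep polling.
TERMINAL_STATES = {"success", "failed", "aborted"}

def wait_for_job(check_state, interval_seconds=2.0, max_checks=30, sleep=time.sleep):
    """Call check_state() until it returns a terminal state or the checks run out."""
    for attempt in range(max_checks):
        state = check_state()
        if state in TERMINAL_STATES:
            return state
        if attempt < max_checks - 1:
            sleep(interval_seconds)  # wait before the next check
    raise TimeoutError("job still working after %d checks" % max_checks)

# Demonstration with a fake state source instead of a real /job/check call.
_fake_states = iter(["working", "working", "success"])
final_state = wait_for_job(lambda: next(_fake_states), sleep=lambda seconds: None)
print(final_state)
```

Passing the state check in as a callable keeps the loop independent of any HTTP client and makes it easy to test without network access.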
Query parameters
No query parameters accepted.
Body payload
{
"url": "https://bytescout-com.s3-us-west-2.amazonaws.com/files/demo-files/cloud-api/document-parser/sample-invoice.pdf",
"async": false,
"encrypt": "false",
"inline": "true",
"password": "",
"profiles": ""
}
Example responses
/pdf/classifier
{
"body": {
"classes": [
{
"class": "invoice"
},
{
"class": "finance"
},
{
"class": "documents"
}
]
},
"pageCount": 1,
"error": false,
"status": 200,
"credits": 42,
"duration": 353,
"remainingCredits": 98019328
}
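A response of this shape can be reduced to a list of class names with a few lines of Python. This is a sketch that assumes the inline result layout shown in the example response above. When inline is not set, the endpoint returns a link to the output file instead, and this code does not cover that case. The function name extract_classes is hypothetical.

```python
import json

def extract_classes(response_text):
    """Return the class names from an inline /pdf/classifier response."""
    data = json.loads(response_text)
    if data.get("error"):
        # The error payload format is not documented here, so only the status is reported.
        raise RuntimeError("classifier reported an error (status %s)" % data.get("status"))
    return [item["class"] for item in data["body"]["classes"]]

sample_response = """
{
  "body": {"classes": [{"class": "invoice"}, {"class": "finance"}, {"class": "documents"}]},
  "pageCount": 1,
  "error": false,
  "status": 200,
  "credits": 42,
  "duration": 353,
  "remainingCredits": 98019328
}
"""
classes = extract_classes(sample_response)
print(classes)
```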
Code Snippet
CURL
curl --location --request POST 'https://api.pdf.co/v1/pdf/classifier' \
--header 'Content-Type: application/json' \
--header 'x-api-key: ' \
--data-raw '{
"url": "https://bytescout-com.s3-us-west-2.amazonaws.com/files/demo-files/cloud-api/document-parser/sample-invoice.pdf",
"async": false,
"encrypt": "false",
"inline": "true",
"password": "",
"profiles": ""
} '
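For comparison, the same request can be prepared in Python using only the standard library. This is a sketch rather than an official client. The function name build_classifier_request and the YOUR_API_KEY placeholder are assumptions, and the payload is a subset of the fields in the cURL snippet above. The request is only constructed here; sending it is left as a comment.

```python
import json
import urllib.request

API_URL = "https://api.pdf.co/v1/pdf/classifier"

def build_classifier_request(api_key, source_url):
    """Build (but do not send) a POST request for /pdf/classifier."""
    # Values mirror the sample body, which passes inline as the string "true".
    payload = {"url": source_url, "async": False, "inline": "true"}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )

request = build_classifier_request(
    "YOUR_API_KEY",  # placeholder, replace with a real key
    "https://bytescout-com.s3-us-west-2.amazonaws.com/files/demo-files/cloud-api/document-parser/sample-invoice.pdf",
)
# To send it: response_text = urllib.request.urlopen(request).read().decode("utf-8")
print(request.get_method(), request.full_url)
```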
Samples
- C# - Classify PDF From URL
- Java - Classify PDF From URL
- JavaScript - Classify PDF From URL (jQuery)
- JavaScript - Classify PDF From URL (nodeJs)
- PHP - Classify PDF From URL
- PowerShell - Classify Uploaded PDF Asynchronously
- PowerShell - Classify Uploaded PDF From URL
- cURL - Classify PDF From URL
Copyright © 2016 - 2022 PDF.co