PDF To TEXT

Convert PDF and scanned images to Text with layout preserved.

Available Methods

[POST] /pdf/convert/to/text (with layout and ocr)
[POST] /pdf/convert/to/text-simple (no layout and ocr, cheaper and faster)

[POST] /pdf/convert/to/text (with layout and ocr)

No need to reproduce layout or OCR?

Then try the /pdf/convert/to/text-simple endpoint instead. It works faster and requires fewer credits.

Auto classification Of Incoming Documents

Use the /pdf/classifier (Document Classifier) endpoint to automatically sort documents or detect their class using keyword-based rules. For example, you can define rules that identify which vendor provided a document and then apply the matching template.

Attributes
url required URL to the source file. Supports links from Google Drive, Dropbox, and PDF.co built-in files storage. To upload files via the API, check out the Files Upload section. Note: If you experience intermittent `Too Many Requests` or `Access Denied` errors, try adding the `cache:` prefix to enable built-in URL caching (e.g. `cache:https://example.com/file1.pdf`). For data security, you have the option to encrypt output files and decrypt input files. Learn more about user-controlled data encryption.
httpusername optional HTTP auth user name if required to access source `url`.
httppassword optional HTTP auth password if required to access source `url`.
pages optional Comma-separated list of page indices (or ranges) to process. IMPORTANT: The very first page starts at `0` (zero). To set a range, use a dash `-`, for example: `0,2-5,7-`. To set a range from an index to the last page, use an open range such as `2-` (from page #3, since the index starts at zero, to the end of the document). To process ALL pages, leave this parameter empty. Example: `0,2-5,7-` means the first page, then the 3rd through 6th pages, and then from the 8th page (index = `7`) to the end of the document.
unwrap optional Unwrap lines into a single line within table cells when `lineGrouping` is enabled. Must be `true` or `false`.
rect optional Defines the coordinates of the region to extract, e.g. `51.8, 114.8, 235.5, 204.0`. Use the PDF.co PDF Edit Add Helper to get or measure PDF coordinates. The input must be in string format.
lang optional Sets the OCR (text from image) language used when extracting text from scanned PDF, PNG, and JPG input documents. The default is `eng`. Other languages are also supported: `deu`, `spa`, `chi_sim`, `jpn`, and many others (see the full list of supported OCR languages). You can also use two languages simultaneously, like this: `eng+deu` or `jpn+kor` (any combination).
inline optional Must be one of: `true` to return data as inline or `false` to return a link to the output file (default).
lineGrouping optional Line grouping within table cells. Set to `1` to enable grouping. The input must be in string format.
async optional Set `async` to `true` to run long processes in the background. The API then returns a `jobId`, which you can use with the `/job/check` endpoint to check the status of the process and retrieve the output, so you can proceed with other tasks without waiting for the process to finish.
name optional File name for the generated output. The input must be in string format.
expiration optional Sets the expiration time for the output link in minutes (the default is `60`, i.e. 60 minutes or 1 hour). After this duration, any generated output files are automatically deleted from PDF.co temporary files storage. The maximum duration for link expiration varies based on your current subscription plan. Learn more. To store permanent input files (e.g. reusable images, PDF templates, documents), consider using PDF.co built-in Files Storage.
profiles optional Sets additional configuration for fine-tuning and enables more options. Visit the PDF.co knowledgebase for profile examples and more. Make sure to provide the input in string format. For instance, to alter the CSV separator, you can use: `{ 'CSVSeparatorSymbol': ';' }`. Tip: Use the OCR Analyzer of PDF Multitool to generate and examine OCR configuration profiles. Learn more.
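
Several of the attributes above (`rect`, `lineGrouping`, `profiles`) must be sent as strings even when they hold numbers or structured settings. The following is a minimal Python sketch of assembling such a body. The helper name `build_text_payload` and its argument names are hypothetical; only the JSON keys come from the attribute list. Serializing `profiles` with `json.dumps` is an assumption about an acceptable string form.

```python
import json


def build_text_payload(url, pages=None, rect=None, lang=None,
                       line_grouping=False, profiles=None, inline=True):
    """Assemble a request body for /pdf/convert/to/text (hypothetical helper)."""
    payload = {"url": url, "inline": inline, "async": False}
    if pages:
        payload["pages"] = pages                  # e.g. "0,2-5,7-"
    if rect is not None:
        # rect is sent as one string, e.g. "51.8, 114.8, 235.5, 204.0"
        payload["rect"] = ", ".join(str(v) for v in rect)
    if lang:
        payload["lang"] = lang                    # e.g. "eng+deu"
    if line_grouping:
        payload["lineGrouping"] = "1"             # the string "1", not the integer 1
    if profiles is not None:
        # Assumption: a JSON-encoded string is an acceptable "string format"
        payload["profiles"] = json.dumps(profiles)
    return payload


example = build_text_payload(
    "https://example.com/file1.pdf",
    rect=(51.8, 114.8, 235.5, 204.0),
    line_grouping=True,
    profiles={"CSVSeparatorSymbol": ";"},
)
```

Keeping the string conversion in one place avoids accidentally sending `1` or a nested JSON object where the attribute list asks for a string.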

Method: POST
URL: /v1/pdf/convert/to/text

Query parameters

No query parameters accepted.

Body payload

{
    "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-text/sample.pdf",
    "inline": true,
    "async": false
}
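
As an illustration of how this body could be posted from code, here is a small Python sketch that uses only the standard library. The base URL and header names are taken from the curl snippet on this page; the function name `make_request` and the `YOUR_API_KEY` placeholder are assumptions for the example. The request is built but not sent.

```python
import json
import urllib.request

API_URL = "https://api.pdf.co/v1/pdf/convert/to/text"


def make_request(api_key, payload):
    """Build (but do not send) the POST request for the conversion endpoint."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


payload = {
    "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-text/sample.pdf",
    "inline": True,
    "async": False,
}
request = make_request("YOUR_API_KEY", payload)  # placeholder, not a real key
# To send it: response = json.load(urllib.request.urlopen(request))
```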

Example responses

/pdf/convert/to/text (with layout and ocr)

{
    "body": "   Your Company Name \r\n       Your Address \r\n        City, State Zip \r\n                                                                                      Invoice No. 123456 \r\n                                                                                   Invoice Date 01/01/2016 \r\n      Client Name \r\n       Address \r\n        City, State Zip \r\n\r\n       Notes \r\n\r\n\r\n       Item                                     Quantity                     Price                     Total \r\n       Item 1                                              1                      40.00                      40.00 \r\n       Item 2                                              2                      30.00                      60.00 \r\n       Item 3                                              3                      20.00                      60.00 \r\n       Item 4                                              4                      10.00                      40.00 \r\n                                                           TOTAL                200.00\r\n",
    "pageCount": 1,
    "error": false,
    "status": 200,
    "name": "sample.txt",
    "remainingCredits": 99032333,
    "credits": 21
}
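
A response like the one above can be checked and unpacked with a few lines of Python. This is a sketch under assumptions: the function name `extract_text` is hypothetical, and treating a truthy `error` or a non-200 `status` as failure is inferred from the success example rather than from a documented error format.

```python
def extract_text(response):
    """Return the converted text from an inline response, or raise on failure."""
    if response.get("error") or response.get("status") != 200:
        raise RuntimeError(f"conversion failed: {response}")
    # The sample body uses \r\n line endings; normalize them to \n.
    return response["body"].replace("\r\n", "\n")


sample = {
    "body": "Item 1 1 40.00 40.00 \r\nTOTAL 200.00\r\n",
    "pageCount": 1,
    "error": False,
    "status": 200,
    "name": "sample.txt",
    "credits": 21,
}
lines = extract_text(sample).splitlines()
```

With layout preserved, the leading spaces in each line carry column positions, so avoid stripping whitespace if you plan to split the text into columns later.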

Code Snippet

CURL

curl --location --request POST 'https://api.pdf.co/v1/pdf/convert/to/text' \
--header 'Content-Type: application/json' \
--header 'x-api-key: ' \
--data-raw '{
    "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-text/sample.pdf",
    "inline": true,
    "async": false
}'

[POST] /pdf/convert/to/text-simple (no layout and ocr, cheaper and faster)

Note: This PDF-to-plain-text endpoint works faster and requires far fewer credits because it does not use AI-powered layout analysis or OCR and does not support profiles for fine-tuning. For advanced PDF-to-text conversion with layout analysis, OCR (for scanned pages), PDF repair, and other features, please use the /pdf/convert/to/text endpoint instead.

Auto classification Of Incoming Documents

Use the /pdf/classifier (Document Classifier) endpoint to automatically sort documents or detect their class using keyword-based rules. For example, you can define rules that identify which vendor provided a document and then apply the matching template.

Attributes
url required URL to the source file. Supports links from Google Drive, Dropbox, and PDF.co built-in files storage. To upload files via the API, check out the Files Upload section. Note: If you experience intermittent `Too Many Requests` or `Access Denied` errors, try adding the `cache:` prefix to enable built-in URL caching (e.g. `cache:https://example.com/file1.pdf`). For data security, you have the option to encrypt output files and decrypt input files. Learn more about user-controlled data encryption.
httpusername optional HTTP auth user name if required to access source `url`.
httppassword optional HTTP auth password if required to access source `url`.
pages optional Comma-separated list of page indices (or ranges) to process. IMPORTANT: The very first page starts at `0` (zero). To set a range, use a dash `-`, for example: `0,2-5,7-`. To set a range from an index to the last page, use an open range such as `2-` (from page #3, since the index starts at zero, to the end of the document). To process ALL pages, leave this parameter empty. Example: `0,2-5,7-` means the first page, then the 3rd through 6th pages, and then from the 8th page (index = `7`) to the end of the document. The input must be in string format.
inline optional Must be one of: `true` to return data as inline or `false` to return a link to the output file (default).
async optional Set `async` to `true` to run long processes in the background. The API then returns a `jobId`, which you can use with the `/job/check` endpoint to check the status of the process and retrieve the output, so you can proceed with other tasks without waiting for the process to finish.
name optional File name for the generated output. The input must be in string format.
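
To make the `pages` syntax concrete, the sketch below expands a specification into zero-based page indices on the client side. It illustrates how the syntax reads and is not the service's own parser. The function name `expand_pages` is hypothetical, and silently dropping indices beyond the page count is a choice made for this example only.

```python
def expand_pages(spec, page_count):
    """Expand a spec such as "0,2-5,7-" into a list of zero-based page indices."""
    if not spec.strip():
        return list(range(page_count))        # empty spec means all pages
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-", 1)
            # An open range such as "7-" runs to the last page.
            last = int(end) if end else page_count - 1
            indices.extend(range(int(start), last + 1))
        else:
            indices.append(int(part))
    # Example-only choice: ignore indices outside the document.
    return [i for i in indices if 0 <= i < page_count]


selected = expand_pages("0,2-5,7-", 10)
# selected == [0, 2, 3, 4, 5, 7, 8, 9]
```

For a 10-page document, `0,2-5,7-` therefore selects pages 1, 3 to 6, and 8 to 10 in human (one-based) numbering.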

Method: POST
URL: /v1/pdf/convert/to/text-simple

Query parameters

No query parameters accepted.

Body payload

{
    "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-text/sample.pdf",
    "inline": true,
    "async": false
}

Example responses

/pdf/convert/to/text-simple

{
    "body": "Your Company Name \r\nYour Address \r\nCity, State Zip \r\nInvoice No. 123456 \r\nInvoice Date 01/01/2016 \r\nClient Name \r\nAddress \r\nCity, State Zip  \r\nNotes   \r\nItem Quantity Price Total \r\nItem 1 1 40.00 40.00 \r\nItem 2 2 30.00 60.00 \r\nItem 3 3 20.00 60.00 \r\nItem 4 4 10.00 40.00   \r\nTOTAL 200.00   \r\n",
    "pageCount": 1,
    "error": false,
    "status": 200,
    "name": "sample.txt",
    "remainingCredits": 99885491,
    "credits": 2
}

Code Snippet

CURL

curl --location --request POST 'https://api.pdf.co/v1/pdf/convert/to/text-simple' \
--header 'Content-Type: application/json' \
--header 'x-api-key: ' \
--data-raw '{
    "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-text/sample.pdf",
    "inline": true,
    "async": false
}'
