---
layout: default
title: Template Creation Guide
---

Document Parser: Template Creation Guide


What Is Document Parser and How Does It Work?

Document Parser is a versatile document parsing engine for accurate, easy-to-maintain data extraction from PDF invoices, statements, reports, paystubs, tables, and receipts. No programming is required to create and maintain data extraction templates. It supports native and scanned PDF files; PNG, JPG, and TIFF images; and Doc, Docx, and PPT files (Web API version only). It works with English, German, French, Spanish, and many other languages. Document Parser is available as a Web API, through Zapier, as an on-premise API server, or as a direct integration module.

Template specification version: 3.

Templates can be written in YAML or JSON. A template defines one or more keywords that match the template to a document, plus expressions for the fields and tables to extract. A single template file can contain multiple templates. In a YAML file, separate templates with a --- line. In a JSON file, arrange templates as an array [].
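As a minimal sketch of a multi-template YAML file, the example below holds two templates separated by a --- line. The sourceId values, keywords, and field names are hypothetical and chosen only for illustration; the parameters themselves are the ones used elsewhere in this guide.

```yaml
---
# First template in the file (hypothetical vendor)
templateVersion: 3
sourceId: Vendor A Invoice
detectionRules:
  keywords:
  - Vendor A Ltd
fields:
  invoiceNumber:
    type: regex
    expression: 'Invoice No: ({{Digits}})'
---
# Second template in the same file, separated by the --- line
templateVersion: 3
sourceId: Vendor B Statement
detectionRules:
  keywords:
  - Vendor B GmbH
fields:
  statementDate:
    type: regex
    expression: 'Statement Date: ({{SmartDate}})'
    dataType: date
```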

Sample YAML template showing the main features:

---
templateVersion: 3
templatePriority: 1
sourceId: ACME Inc. Invoice
culture: en-US
detectionRules:
  keywords:
  - ACME Inc\.
  - Invoice No
  - ABN 01 234 567 890
fields:
  companyName:
    type: static
    expression: ACME Inc.
  invoiceNumber:
    type: regex
    expression: 'Invoice No.: ({{LettersOrDigitsOrSymbols}})'
    pageIndex: 0
  invoiceDate:
    type: regex
    expression: 'Invoice Date: ({{SmartDate}})'
    dataType: date
    dateFormat: MM/dd/yyyy
  billTo:
    type: rectangle
    rectangle:
    - 32.5
    - 64.5
    - 200
    - 100
    pageIndex: 0
  total:
    type: regex
    expression: 'TOTAL{{Spaces}}({{Number}})'
    dataType: decimal
tables:
- name: table1
  start:
    expression: 'Item{{Spaces}}Quantity{{Spaces}}Price{{Spaces}}Total'
  end:
    expression: TOTAL
  row:
    expression: '{{LineStart}}{{Spaces}}(?<description>{{SentenceWithSingleSpaces}})(?<quantity>{{Digits}}){{Spaces}}(?<unitPrice>{{Number}}){{Spaces}}(?<itemTotal>{{Number}})'
  columns:
  - name: description
    type: string
  - name: quantity
    type: integer
  - name: unitPrice
    type: decimal
  - name: itemTotal
    type: decimal
  multipage: true

Template Parameters

TemplatePriority

Templates are sorted and tried by templatePriority, then alphabetically. 0 is the highest priority and 999999 is the lowest.
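To illustrate the ordering, here is a sketch of two templates where a vendor-specific template should be tried before a generic fallback. The fallback template and its priority value of 100 are hypothetical.

```yaml
---
# Vendor-specific template: tried first because 0 is the highest priority
templateVersion: 3
templatePriority: 0
sourceId: ACME Inc. Invoice
detectionRules:
  keywords:
  - ACME Inc\.
  - Invoice No
---
# Generic fallback (hypothetical): sorted after the template above
templateVersion: 3
templatePriority: 100
sourceId: Generic Invoice
detectionRules:
  keywords:
  - Invoice
```

Leaving a wide gap between priority values makes it easier to slot in new templates later without renumbering.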

SourceId

A name that identifies the design of the document. It is passed to the result unchanged.

Culture

Template culture that affects the detection of dates and decimal numbers. For example, if en-US culture is set, the parser will expect dates in month-day-year sequence, and decimal numbers with the dot as the decimal symbol and the comma as the digit grouping symbol. For fr-FR culture, the parser will expect dates in day-month-year sequence, and decimal numbers with the comma as the decimal symbol and the space as the digit grouping symbol. You can find the list of culture names at https://msdn.microsoft.com/en-us/library/cc233982.aspx.

Example:

culture: fr-FR
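A slightly fuller sketch shows the culture next to fields whose parsing it affects. The field names and label text are hypothetical, and the comments restate the fr-FR conventions described above rather than verified parser output.

```yaml
templateVersion: 3
sourceId: Facture Exemple          # hypothetical name
culture: fr-FR
fields:
  # With fr-FR, dates are expected in day-month-year sequence, e.g. "31/12/2019"
  invoiceDate:
    type: regex
    expression: 'Date: ({{SmartDate}})'
    dataType: date
  # With fr-FR, the comma is the decimal symbol and the space groups digits, e.g. "1 234,56"
  total:
    type: regex
    expression: 'Total{{Spaces}}({{Number}})'
    dataType: decimal
```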

DetectionRules

A few expressions that uniquely identify the document design. An expression can be an exact phrase, or it can contain macros and regular expressions (Regex).

Example:

detectionRules:
  keywords:
  - ACME Inc\.
  - 'Invoice No:{{Spaces}}{{6Digits}}'
  - \[CONFIDENTIAL\]

DocumentStart

If your PDF file contains multiple documents to parse, the documentStart expression should indicate the beginning of each new document in the PDF file.

Example:

documentStart: TAX INVOICE
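For context, the sketch below places documentStart next to the other template parameters. It assumes documentStart sits at the top level of the template, as its listing among the template parameters suggests. The sourceId, keyword, and field are hypothetical.

```yaml
templateVersion: 3
sourceId: Batch Of Tax Invoices    # hypothetical name
detectionRules:
  keywords:
  - TAX INVOICE
# Each occurrence of this text marks the beginning of a new document in the PDF file
documentStart: TAX INVOICE
fields:
  invoiceNumber:
    type: regex
    expression: 'Invoice No: ({{Digits}})'
```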

Fields

Standalone fields to extract. For example, invoice number, invoice date, etc.

Field parameters:

Note 1: If the same document contains different number representations, you can override the template culture by appending the culture name in square brackets to the data type name. That single field will then be parsed according to the specified culture.

Example:

type: decimal[fr-FR]
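The following sketch shows the override in context: the template culture is en-US, but one field carries a French-formatted amount. The sample template in this guide puts a field's data type under dataType, so this sketch assumes the same key; the field names and label text are hypothetical.

```yaml
culture: en-US
fields:
  # Parsed with the template culture (en-US), e.g. "1,234.56"
  totalUsd:
    type: regex
    expression: 'Total USD{{Spaces}}({{Number}})'
    dataType: decimal
  # Assumption: the culture suffix goes on the data type name, here under dataType.
  # This single field is parsed with fr-FR conventions, e.g. "1 234,56".
  # The capture uses a plain character class because {{Number}} takes its
  # separators from the template culture.
  totalEur:
    type: regex
    expression: 'Total EUR{{Spaces}}([0-9 ]+,[0-9]+)'
    dataType: decimal[fr-FR]
```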

Note 2: The dateFormat and outputDateFormat can contain a format string defining the exact date format. Find the format string description here: https://docs.microsoft.com/en-us/dotnet/api/system.datetime.tryparseexact.

Example:

type: date
dateFormat: MM-dd-yyyy

The dateFormat can also contain auto-format strings:

Example:

type: date
dateFormat: auto-DMY
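As a sketch of both date parameters on one field, the example below reads a day-month-year date and requests a specific output format. It assumes that outputDateFormat controls how the date is written to the result, which its name suggests but this guide does not spell out. It also uses dataType for the data type, as the sample template does. The field name and label are hypothetical.

```yaml
fields:
  dueDate:
    type: regex
    expression: 'Due Date: ({{SmartDate}})'
    dataType: date
    # Input: auto-detect, reading ambiguous dates in day-month-year order
    dateFormat: auto-DMY
    # Output (assumed behavior): .NET format string, e.g. "2019-12-31"
    outputDateFormat: yyyy-MM-dd
```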

Tables

This section defines the tabular data you need to extract. Tables can be defined by coordinates or by expressions that find the table start, the table end, and the rows. The tables section can contain multiple table definitions arranged as an array.

Table parameters:

Example of table parsing:

Description    Interval           Quantity    Amount ($)
Basic Plan     Jan 1 - Jan 31     1           25.00
Basic Plan     Feb 1 - Feb 28     1           25.00
Total in USD:                                 50.00

The table above can be parsed with macro expressions or with explicitly defined column coordinates.

Macros approach:

tables:
- name: table1
  start:
    # The table will start after the text "Amount ($)"
    expression: 'Amount{{Space}}{{OpeningParenthesis}}{{Dollar}}{{ClosingParenthesis}}'
  end:
    # The table will end before the text "Total in USD"
    expression: Total in USD
  row:
    # Groups <description>, <interval>, <quantity>, and <amount> will become columns in the result table.
    expression: '{{LineStart}}{{Spaces}}(?<description>{{SentenceWithSingleSpaces}})(?<interval>{{3Letters}}{{Space}}{{Digits}}{{Space}}{{Minus}}{{Space}}{{3Letters}}{{Space}}{{Digits}}){{Spaces}}(?<quantity>{{Digits}}){{Spaces}}(?<amount>{{Number}})'
  columns:
  # Suggest data types for table columns:
  - name: description
    type: string
  - name: interval
    type: string
  - name: quantity
    type: integer
  - name: amount
    type: decimal

If the macros approach does not work for a complicated table, you can specify column coordinates explicitly. To determine the coordinates of a column visually, use the included Template Editor application, which shows the cursor coordinates in the toolbar.

Explicit column coordinates approach:

tables:
- name: table1
  start:
    # The table will start below the text "Description Interval"
    expression: 'Description{{Spaces}}Interval'
  end:
    # The table will end above the text "Total in USD"
    expression: 'Total in USD'
  columns:
  # Suggest coordinates and data types for table columns
  - name: description
    x: 0
    type: string
  - name: interval
    x: 100
    type: string
  - name: quantity
    x: 150
    type: integer
  - name: amount
    x: 200
    type: decimal

Options

Template options.

Template-level macros

Template-level (TL) macros let you define reusable blocks that you can use in the expression parameters of fields and tables.

A TL macro can contain built-in macros and regular expressions (Regex).

A TL macro can reuse macros defined above it in the template code.

In an expression, TL macros must be enclosed in double angle brackets: << >>.

Example:

templateMacros:
  # Detects "Yes" or "No" in the text
  yesOrNo: '(Yes|No)'
  # Detects "12/01/19 - 12/31/19" date range in the text
  dateRange: '{{DateMM/DD/YY}} {{Minus}} {{DateMM/DD/YY}}'
  # Detects 24h time "13:00" in the text
  time: '{{2Digits}}{{Colon}}{{2Digits}}'
  # Detects "08:00-17:00" time range in the text. Example of reusing the macro defined above.
  timeRange: '<<time>>{{Minus}}<<time>>'

# Example of use of template-level macros defined in the templateMacros section
fields:
  answer:
    expression: 'Answer: (<<yesOrNo>>)'
  period:
    expression: '<<dateRange>>'
  workHours:
    expression: '<<timeRange>>'
  workHoursStart:
    expression: '(?<value><<time>>){{Minus}}<<time>>'
  workHoursEnd:
    expression: '<<time>>{{Minus}}(?<value><<time>>)'
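Because TL macros can be used in the expression parameters of tables as well as fields, the sketch below reuses one macro in a table row expression. The table name, header text, and column names are hypothetical.

```yaml
templateMacros:
  # Matches a 24h time such as "09:30"
  time: '{{2Digits}}{{Colon}}{{2Digits}}'

tables:
- name: shifts
  start:
    expression: 'Employee{{Spaces}}Start{{Spaces}}End'
  end:
    expression: 'Total hours'
  row:
    # The <<time>> macro is reused for both the start and the end column
    expression: '{{LineStart}}{{Spaces}}(?<employee>{{SentenceWithSingleSpaces}}){{Spaces}}(?<start><<time>>){{Spaces}}(?<end><<time>>)'
  columns:
  - name: employee
    type: string
  - name: start
    type: string
  - name: end
    type: string
```

Defining the time pattern once means a later change to the format has to be made in only one place.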

APPENDIX 1: Macros.

Built-in macros:

| Macro | Description |
| --- | --- |
| {{Space}} | Single space. |
| {{Spaces}} | One or more spaces. |
| {{2Spaces}} | Two spaces. |
| {{3Spaces}} | Three spaces. |
| {{4Spaces}} | Four spaces. |
| {{5Spaces}} | Five spaces. |
| {{6Spaces}} | Six spaces. |
| {{7Spaces}} | Seven spaces. |
| {{8Spaces}} | Eight spaces. |
| {{9Spaces}} | Nine spaces. |
| {{10Spaces}} | Ten spaces. |
| {{Digit}} | One digit. |
| {{Digits}} | One or more digits. |
| {{2Digits}} | Two digits. |
| {{3Digits}} | Three digits. |
| {{4Digits}} | Four digits. |
| {{5Digits}} | Five digits. |
| {{6Digits}} | Six digits. |
| {{7Digits}} | Seven digits. |
| {{8Digits}} | Eight digits. |
| {{9Digits}} | Nine digits. |
| {{10Digits}} | Ten digits. |
| {{DigitOrSymbol}} | One digit or symbol ("_-+=/"). |
| {{DigitsOrSymbols}} | One or more digits or symbols ("_-+=/"). |
| {{2DigitsOrSymbols}} | Two digits or symbols ("_-+=/"). |
| {{3DigitsOrSymbols}} | Three digits or symbols ("_-+=/"). |
| {{4DigitsOrSymbols}} | Four digits or symbols ("_-+=/"). |
| {{5DigitsOrSymbols}} | Five digits or symbols ("_-+=/"). |
| {{6DigitsOrSymbols}} | Six digits or symbols ("_-+=/"). |
| {{7DigitsOrSymbols}} | Seven digits or symbols ("_-+=/"). |
| {{8DigitsOrSymbols}} | Eight digits or symbols ("_-+=/"). |
| {{9DigitsOrSymbols}} | Nine digits or symbols ("_-+=/"). |
| {{10DigitsOrSymbols}} | Ten digits or symbols ("_-+=/"). |
| {{Letter}} | One letter from any language. |
| {{Letters}} | One or more letters from any language. |
| {{2Letters}} | Two letters from any language. |
| {{3Letters}} | Three letters from any language. |
| {{4Letters}} | Four letters from any language. |
| {{5Letters}} | Five letters from any language. |
| {{6Letters}} | Six letters from any language. |
| {{7Letters}} | Seven letters from any language. |
| {{8Letters}} | Eight letters from any language. |
| {{9Letters}} | Nine letters from any language. |
| {{10Letters}} | Ten letters from any language. |
| {{UppercaseLetter}} | One uppercase letter from any language. |
| {{UppercaseLetters}} | One or more uppercase letters from any language. |
| {{2UppercaseLetter}} | Two uppercase letters from any language. |
| {{3UppercaseLetter}} | Three uppercase letters from any language. |
| {{4UppercaseLetter}} | Four uppercase letters from any language. |
| {{5UppercaseLetter}} | Five uppercase letters from any language. |
| {{6UppercaseLetter}} | Six uppercase letters from any language. |
| {{7UppercaseLetter}} | Seven uppercase letters from any language. |
| {{8UppercaseLetter}} | Eight uppercase letters from any language. |
| {{9UppercaseLetter}} | Nine uppercase letters from any language. |
| {{10UppercaseLetter}} | Ten uppercase letters from any language. |
| {{LetterOrDigit}} | One letter or digit. |
| {{LettersOrDigits}} | One or more letters or digits. |
| {{2LettersOrDigits}} | Two letters or digits. |
| {{3LettersOrDigits}} | Three letters or digits. |
| {{4LettersOrDigits}} | Four letters or digits. |
| {{5LettersOrDigits}} | Five letters or digits. |
| {{6LettersOrDigits}} | Six letters or digits. |
| {{7LettersOrDigits}} | Seven letters or digits. |
| {{8LettersOrDigits}} | Eight letters or digits. |
| {{9LettersOrDigits}} | Nine letters or digits. |
| {{10LettersOrDigits}} | Ten letters or digits. |
| {{UppercaseLetterOrDigit}} | One uppercase letter or digit. |
| {{UppercaseLettersOrDigits}} | One or more uppercase letters or digits. |
| {{2UppercaseLettersOrDigits}} | Two uppercase letters or digits. |
| {{3UppercaseLettersOrDigits}} | Three uppercase letters or digits. |
| {{4UppercaseLettersOrDigits}} | Four uppercase letters or digits. |
| {{5UppercaseLettersOrDigits}} | Five uppercase letters or digits. |
| {{6UppercaseLettersOrDigits}} | Six uppercase letters or digits. |
| {{7UppercaseLettersOrDigits}} | Seven uppercase letters or digits. |
| {{8UppercaseLettersOrDigits}} | Eight uppercase letters or digits. |
| {{9UppercaseLettersOrDigits}} | Nine uppercase letters or digits. |
| {{10UppercaseLettersOrDigits}} | Ten uppercase letters or digits. |
| {{LetterOrDigitOrSymbol}} | One letter, or digit, or symbol ("_-+=/"). |
| {{LettersOrDigitsOrSymbols}} | One or more letters, or digits, or symbols ("_-+=/"). |
| {{2LettersOrDigitsOrSymbols}} | Two letters, or digits, or symbols ("_-+=/"). |
| {{3LettersOrDigitsOrSymbols}} | Three letters, or digits, or symbols ("_-+=/"). |
| {{4LettersOrDigitsOrSymbols}} | Four letters, or digits, or symbols ("_-+=/"). |
| {{5LettersOrDigitsOrSymbols}} | Five letters, or digits, or symbols ("_-+=/"). |
| {{6LettersOrDigitsOrSymbols}} | Six letters, or digits, or symbols ("_-+=/"). |
| {{7LettersOrDigitsOrSymbols}} | Seven letters, or digits, or symbols ("_-+=/"). |
| {{8LettersOrDigitsOrSymbols}} | Eight letters, or digits, or symbols ("_-+=/"). |
| {{9LettersOrDigitsOrSymbols}} | Nine letters, or digits, or symbols ("_-+=/"). |
| {{10LettersOrDigitsOrSymbols}} | Ten letters, or digits, or symbols ("_-+=/"). |
| {{UppercaseLetterOrDigitOrSymbol}} | One uppercase letter, or digit, or symbol ("_-+=/"). |
| {{UppercaseLettersOrDigitsOrSymbols}} | One or more uppercase letters, or digits, or symbols ("_-+=/"). |
| {{2UppercaseLettersOrDigitsOrSymbols}} | Two uppercase letters, or digits, or symbols ("_-+=/"). |
| {{3UppercaseLettersOrDigitsOrSymbols}} | Three uppercase letters, or digits, or symbols ("_-+=/"). |
| {{4UppercaseLettersOrDigitsOrSymbols}} | Four uppercase letters, or digits, or symbols ("_-+=/"). |
| {{5UppercaseLettersOrDigitsOrSymbols}} | Five uppercase letters, or digits, or symbols ("_-+=/"). |
| {{6UppercaseLettersOrDigitsOrSymbols}} | Six uppercase letters, or digits, or symbols ("_-+=/"). |
| {{7UppercaseLettersOrDigitsOrSymbols}} | Seven uppercase letters, or digits, or symbols ("_-+=/"). |
| {{8UppercaseLettersOrDigitsOrSymbols}} | Eight uppercase letters, or digits, or symbols ("_-+=/"). |
| {{9UppercaseLettersOrDigitsOrSymbols}} | Nine uppercase letters, or digits, or symbols ("_-+=/"). |
| {{10UppercaseLettersOrDigitsOrSymbols}} | Ten uppercase letters, or digits, or symbols ("_-+=/"). |
| {{Dollar}} | Dollar sign ($). |
| {{Euro}} | Euro sign (€). |
| {{Pound}} | Pound sign (£). |
| {{Yen}} | Yen sign (¥). |
| {{Yuan}} | Yuan sign (¥). |
| {{CurrencySymbol}} | Any currency symbol ($, €, £, ¥, etc.). |
| {{Dot}} | Single dot symbol ("."). |
| {{Comma}} | Single comma symbol (","). |
| {{Colon}} | Single colon symbol (":"). |
| {{Semicolon}} | Single semicolon symbol (";"). |
| {{Minus}} | Single minus (dash, hyphen) symbol ("-"). |
| {{Slash}} | Slash symbol ("/"). |
| {{Backslash}} | Backslash symbol ("\"). |
| {{Percent}} | Percent symbol ("%"). |
| {{LineStart}} | Start of line (virtual symbol). |
| {{LineEnd}} | End of line (virtual symbol). |
| {{SentenceWithSingleSpaces}} | Single-space-separated sequence of words and symbols. Breaks on double space. |
| {{SentenceWithDoubleSpaces}} | Extended {{SentenceWithSingleSpaces}} macro allowing two spaces between words. Breaks on triple space. |
| {{EndOfPage}} | End of page or end of document. |
| {{WordBoundary}} | Start or end of word (virtual symbol). |
| {{OpeningCurlyBrace}} | Opening curly brace symbol ("{"). |
| {{ClosingCurlyBrace}} | Closing curly brace symbol ("}"). |
| {{OpeningParenthesis}} | Opening parenthesis symbol ("("). |
| {{ClosingParenthesis}} | Closing parenthesis symbol (")"). |
| {{OpeningSquareBracket}} | Opening square bracket symbol ("["). |
| {{ClosingSquareBracket}} | Closing square bracket symbol ("]"). |
| {{OpeningAngleBracket}} | Opening angle bracket symbol ("<"). |
| {{ClosingAngleBracket}} | Closing angle bracket symbol (">"). |
| {{DateMM/DD/YY}} | Date in format "01/01/19" (with leading zero). |
| {{DateM/D/YY}} | Date in format "1/1/19" (without leading zero). |
| {{DateMM/DD/YYYY}} | Date in format "01/01/2019" (with leading zero). |
| {{DateM/D/YYYY}} | Date in format "1/1/2019" (without leading zero). |
| {{DateMM-DD-YY}} | Date in format "01-01-19" (with leading zero). |
| {{DateM-D-YY}} | Date in format "1-1-19" (without leading zero). |
| {{DateMM-DD-YYYY}} | Date in format "01-01-2019" (with leading zero). |
| {{DateM-D-YYYY}} | Date in format "1-1-2019" (without leading zero). |
| {{DateMM.DD.YY}} | Date in format "01.01.19" (with leading zero). |
| {{DateM.D.YY}} | Date in format "1.1.19" (without leading zero). |
| {{DateMM.DD.YYYY}} | Date in format "01.01.2019" (with leading zero). |
| {{DateM.D.YYYY}} | Date in format "1.1.2019" (without leading zero). |
| {{DateDD/MM/YY}} | Date in format "01/01/19" (with leading zero). |
| {{DateD/M/YY}} | Date in format "1/1/19" (without leading zero). |
| {{DateDD/MM/YYYY}} | Date in format "01/01/2019" (with leading zero). |
| {{DateD/M/YYYY}} | Date in format "1/1/2019" (without leading zero). |
| {{DateDD-MM-YY}} | Date in format "01-01-19" (with leading zero). |
| {{DateD-M-YY}} | Date in format "1-1-19" (without leading zero). |
| {{DateDD-MM-YYYY}} | Date in format "01-01-2019" (with leading zero). |
| {{DateD-M-YYYY}} | Date in format "1-1-2019" (without leading zero). |
| {{DateDD.MM.YY}} | Date in format "01.01.19" (with leading zero). |
| {{DateD.M.YY}} | Date in format "1.1.19" (without leading zero). |
| {{DateDD.MM.YYYY}} | Date in format "01.01.2019" (with leading zero). |
| {{DateD.M.YYYY}} | Date in format "1.1.2019" (without leading zero). |
| {{DateYYYYMMDD}} | Date in format "20190101". |
| {{DateYYYY/MM/DD}} | Date in format "2019/01/01" (with leading zero). |
| {{DateYYYY/M/D}} | Date in format "2019/1/1" (without leading zero). |
| {{DateYYYY-MM-DD}} | Date in format "2019-01-01" (with leading zero). |
| {{DateYYYY-M-D}} | Date in format "2019-1-1" (without leading zero). |
| {{SmartDate}} | Tries to detect the date in the most common formats. |
| {{Number}} | Decimal number like the following: "12.34", "-123,456.78", "123.456". Decimal separator and thousands separator are automatically taken from the template culture. |
| {{Money}} | Decimal number with currency symbol like the following: "USD 12.34", "$123,456.78", "123.45 €". Decimal separator and thousands separator are automatically taken from the template culture. |
| {{Anything}} | Any characters up to the next macro in the expression. |
| {{AnythingGreedy}} | Any characters up to the next macro in the expression or to the end of line. Greedy version. |
| {{ToggleSingleLineMode}} | Enables or disables single-line mode. In single-line mode, {{Anything}} and {{AnythingGreedy}} macros do not stop at the end of the line and proceed to the next line of text. |
| {{ToggleCaseInsensitiveMode}} | Enables or disables case-insensitive mode. |
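As a sketch of how several built-in macros combine inside field expressions, the fragment below uses {{Money}}, {{Percent}}, and {{Anything}}. The field names and label text are hypothetical, and the comments describe the intended match rather than verified parser output.

```yaml
fields:
  # Intended to match "Amount Due: $1,250.00" and capture the amount with its currency symbol
  amountDue:
    type: regex
    expression: 'Amount Due:{{Spaces}}({{Money}})'
  # Intended to match "VAT 20%" and capture the digits before the percent sign
  vatRate:
    type: regex
    expression: 'VAT{{Spaces}}({{Digits}}){{Percent}}'
    dataType: integer
  # Intended to capture everything between "Reference:" and the end of the line
  reference:
    type: regex
    expression: 'Reference:{{Space}}({{Anything}}){{LineEnd}}'
```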

APPENDIX 2: Sample templates.

Sample 1.

Sample document text:

DigitalOcean
101 Avenue of the Americas, 10th Floor
New York, NY 10013

Date Issued: February 1, 2016
Period: January 1 - 31, 2016
Invoice Number: 1234567

Description            Hours   Start         End           USD
Website-Dev (1GB)      744     01-01 00:00   01-31 23:59   $10.00
Website-Live (1GB)     744     01-01 00:00   01-31 23:59   $10.00
Database-Live (2GB)    744     01-01 00:00   01-31 23:59   $20.00
Tasks-Dev (1GB)        744     01-01 00:00   01-31 23:59   $10.00
Total: $50.00

Bill To: Samee Sikka <admin@meee.org>
meee.org
Gouran

If you have a credit card on file it will be automatically charged within 24 hours.

Sample template (YAML):

---
templateVersion: 3
templatePriority: 0
sourceId: DigitalOcean Invoice
detectionRules:
  keywords:
  # Template will match documents containing the following phrases:
  - 'DigitalOcean'
  - '101 Avenue of the Americas'
  - 'Invoice Number'
fields:
  # Static field that will output "DigitalOcean" to the result
  companyName:
    type: static
    expression: DigitalOcean
  # Macro field that will find the text "Invoice Number: 1234567" and return "1234567" to the result
  invoiceId:
    type: macros
    expression: 'Invoice Number: ({{Digits}})'
  # Macro field that will find the text "Date Issued: February 1, 2016" and return the date "February 1, 2016" in ISO format to the result
  dateIssued:
    type: macros
    expression: 'Date Issued: ({{SmartDate}})'
    dataType: date
    dateFormat: auto-mdy
  # Macro field that will find the text "Total: $50.00" and return "50.00" to the result
  total:
    type: macros
    expression: 'Total: {{Dollar}}({{Number}})'
    dataType: decimal
  # Static field that will output "USD" to the result
  currency:
    type: static
    expression: USD
tables:
- name: table1
  # The table will start after the text "Description Hours"
  start:
    expression: 'Description{{Spaces}}Hours'
  # The table will end before the text "Total:"
  end:
    expression: 'Total:'
  # Macro expression that will find table rows "Website-Dev (1GB) 744 01-01 00:00 01-31 23:59 $10.00", etc.
  row:
    # Groups <description>, <hours>, <start>, <end> and <unitPrice> will become columns in the result table.
    expression: '{{LineStart}}{{Spaces}}(?<description>{{SentenceWithSingleSpaces}}){{Spaces}}(?<hours>{{Digits}}){{Spaces}}(?<start>{{2Digits}}{{Minus}}{{2Digits}}{{Space}}{{2Digits}}{{Colon}}{{2Digits}}){{Spaces}}(?<end>{{2Digits}}{{Minus}}{{2Digits}}{{Space}}{{2Digits}}{{Colon}}{{2Digits}}){{Spaces}}{{Dollar}}(?<unitPrice>{{Number}})'
  # Suggest data types for table columns (missing columns will have the default "string" type):
  columns:
  - name: hours
    type: integer
  - name: unitPrice
    type: decimal

Result (JSON):

{
  "templateId": "DigitalOcean.yml",
  "templateVersion": "3",
  "sourceId": "DigitalOcean Invoice",
  "fields": {
    "companyName": { "value": "DigitalOcean" },
    "invoiceId": { "value": "1234567", "pageIndex": 0 },
    "dateIssued": { "value": "2016-02-01T00:00:00", "pageIndex": 0 },
    "total": { "value": 50.00, "pageIndex": 0 },
    "currency": { "value": "USD" }
  },
  "tables": [
    {
      "name": "table1",
      "rows": [
        {
          "description": { "value": "Website-Dev (1GB)", "pageIndex": 0 },
          "hours": { "value": 744, "pageIndex": 0 },
          "start": { "value": "01-01 00:00", "pageIndex": 0 },
          "end": { "value": "01-31 23:59", "pageIndex": 0 },
          "unitPrice": { "value": 10.00, "pageIndex": 0 }
        },
        {
          "description": { "value": "Website-Live (1GB)", "pageIndex": 0 },
          "hours": { "value": 744, "pageIndex": 0 },
          "start": { "value": "01-01 00:00", "pageIndex": 0 },
          "end": { "value": "01-31 23:59", "pageIndex": 0 },
          "unitPrice": { "value": 10.00, "pageIndex": 0 }
        },
        {
          "description": { "value": "Database-Live (2GB)", "pageIndex": 0 },
          "hours": { "value": 744, "pageIndex": 0 },
          "start": { "value": "01-01 00:00", "pageIndex": 0 },
          "end": { "value": "01-31 23:59", "pageIndex": 0 },
          "unitPrice": { "value": 20.00, "pageIndex": 0 }
        },
        {
          "description": { "value": "Tasks-Dev (1GB)", "pageIndex": 0 },
          "hours": { "value": 744, "pageIndex": 0 },
          "start": { "value": "01-01 00:00", "pageIndex": 0 },
          "end": { "value": "01-31 23:59", "pageIndex": 0 },
          "unitPrice": { "value": 10.00, "pageIndex": 0 }
        }
      ]
    }
  ]
}

Copyright (c) 2018-2020 ByteScout, Inc.
