Breaking the Container: Native PDF Text Extraction in FileMaker 2025

For nearly two decades, the PDF container field has been a "black box" in the FileMaker ecosystem. We could store the file, export it, or view it, but programmatic access to the data inside the document required external dependencies. Architects had to rely on expensive plugins, fragile OS-level scripting, or complex micro-services just to read an invoice number.

With the release of FileMaker 2025 (v22.0), Claris has closed this gap. The introduction of GetTextFromPDF marks a significant shift in how we architect document management systems. By moving text extraction into the native calculation engine, we reduce deployment complexity and improve solution portability across Windows, macOS, Linux, and iOS.

Here is an architectural breakdown of this function, its limitations, and how to implement it effectively.

The Function: Simplicity by Design

Unlike the complex plugin calls of the past, Claris has opted for extreme simplicity with GetTextFromPDF. It accepts a single argument: the container field holding the document.

Format: GetTextFromPDF ( container )

Parameters:

container: Any expression that returns container data for a PDF file.

Returns:

Text containing the content of the PDF.
"?" if the extraction fails (e.g., password-protected files, empty fields, or scanned documents without a text layer).

How It Works (and What It Doesn't Do)

It is vital to understand that GetTextFromPDF operates on the embedded text layer of a Portable Document Format file.

It is NOT OCR: This function does not perform Optical Character Recognition. If you scan a paper contract and save it as a PDF (a page image wrapped in a PDF, with no text layer), GetTextFromPDF will return "?" or an empty string. The PDF must contain selectable text.
No Formatting: The function returns the raw text stream. Fonts, sizes, colors, and layout positioning are discarded.
No Passwords: The function currently does not accept a password parameter. If a PDF is encrypted, the function will return "?" immediately.
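
Because each of these limitations surfaces as "?" (or, for image-only files, possibly an empty string), a small wrapper can normalize the result before anything downstream sees it. This is a minimal sketch for use in a Set Variable or Set Field step, assuming the Invoices::FileContainer field used later in this article; the ~raw variable name is only illustrative.

```
Let (
  ~raw = GetTextFromPDF ( Invoices::FileContainer ) ;
  // "?" and an empty string both mean no usable text layer was read
  If ( ~raw = "?" or IsEmpty ( ~raw ) ; "" ; ~raw )
)
```

With this in place, a single IsEmpty test is enough to decide whether the document needs a fallback path.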

Architectural Pattern: The "Ingest and Index"

Do not calculate GetTextFromPDF on the fly in unstored calculation fields. PDF parsing, while faster than OCR, is still resource-intensive compared to standard text operations. Displaying this calculation on a list view of 1,000 records will cause significant performance degradation.

Instead, use the Ingest and Index pattern.

Trigger: Use an OnObjectSave trigger on the container field or a generic "Process Inbox" server-side script.
Process: Extract the text immediately upon upload.
Store: Save the result in a static Text field (e.g., Documents::ContentIndex).
Search: Perform Find operations against the indexed text field, not the container.
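
The four steps above could be sketched as a server-side batch script along the following lines. The script name is hypothetical, and the sketch assumes that an empty Invoices::Status marks a record that has not been processed yet; adjust both to your own schema.

```
# Script: Documents_ProcessInbox (hypothetical name)
# Assumption: an empty Invoices::Status means "not processed yet"

Set Error Capture [ On ]
Go to Layout [ "Invoices" (Invoices) ]

# Find records that have no status yet
Enter Find Mode [ Pause: Off ]
Set Field [ Invoices::Status ; "=" ]
Perform Find []

If [ Get ( LastError ) ≠ 0 ]
    # Typically error 401: no records match the request
    Exit Script [ Text Result: "Nothing to process" ]
End If

Go to Record/Request/Page [ First ]
Loop
    If [ not IsEmpty ( Invoices::FileContainer ) ]
        Set Variable [ $text ; Value: GetTextFromPDF ( Invoices::FileContainer ) ]
        If [ $text = "?" or IsEmpty ( $text ) ]
            Set Field [ Invoices::Status ; "Extraction Failed" ]
        Else
            # Store the text once so later Finds never touch the container
            Set Field [ Invoices::TextIndex ; $text ]
            Set Field [ Invoices::Status ; "Indexed" ]
        End If
    End If
    Go to Record/Request/Page [ Next ; Exit after last: On ]
End Loop

Commit Records/Requests [ With dialog: Off ]
```

Records that have no file attached keep an empty status and are simply picked up again on the next run.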

Implementation Example

This script demonstrates how to handle the extraction safely, specifically trapping for the "?" error which indicates a failure to read the text layer.

# Script: Document_ProcessUpload
# Context: Invoices Layout

Set Error Capture [ On ]
Set Variable [ $pdf ; Value: Invoices::FileContainer ]

# 1. Validate Content Existence
If [ IsEmpty ( $pdf ) ]
    Exit Script [ Text Result: "No file present" ]
End If

# 2. Attempt Extraction
# Note: We trap for "?" and for an empty result; either one means no text could be read
Set Variable [ $extractedText ; Value: GetTextFromPDF ( $pdf ) ]

If [ $extractedText = "?" or IsEmpty ( $extractedText ) ]
    # Handle Failure Cases (Password Protected or Image-Only PDF)
    Set Field [ Invoices::Status ; "Extraction Failed" ]
    Set Field [ Invoices::ErrorLog ; "File is either password protected or a flat image scan." ]
Else
    # Success Case
    Set Field [ Invoices::TextIndex ; $extractedText ]
    Set Field [ Invoices::Status ; "Indexed" ]

    # Optional: Parse specific data
    Perform Script [ "Sub_ParseInvoiceHeader" ]
End If

Commit Records/Requests [ With dialog: Off ]

Parsing the Output (The Traditional Method)

Once you have the raw text, the challenge shifts to parsing. The output of GetTextFromPDF generally preserves the visual reading order (left-to-right, top-to-bottom), but this depends heavily on how the PDF was generated.

A standard header in a PDF might look like this in the extracted text:

INVOICE # 10234
DATE: 2025-10-12
TOTAL: $450.00

To extract the Invoice Number specifically, you can use FileMaker's standard text parsing functions.

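Let (
  [
    ~text = Invoices::TextIndex ;
    ~marker = "INVOICE # " ;
    ~pos = Position ( ~text ; ~marker ; 1 ; 1 ) ;
    ~start = ~pos + Length ( ~marker ) ;
    ~lineEnd = Position ( ~text ; "¶" ; ~start ; 1 ) ;
    // No return after the value means it runs to the end of the text
    ~end = If ( ~lineEnd = 0 ; Length ( ~text ) + 1 ; ~lineEnd )
  ] ;
  // Return an empty result when the marker is not found
  If ( ~pos > 0 ; Middle ( ~text ; ~start ; ~end - ~start ) ; "" )
)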
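
The same approach extends to the other header lines. As a sketch, the TOTAL line from the sample above could be read into a number like this; the "TOTAL: " label comes from the sample text, while the variable names are illustrative assumptions.

```
Let (
  [
    ~text = Invoices::TextIndex ;
    ~marker = "TOTAL: " ;
    ~pos = Position ( ~text ; ~marker ; 1 ; 1 ) ;
    ~start = ~pos + Length ( ~marker ) ;
    ~lineEnd = Position ( ~text ; "¶" ; ~start ; 1 ) ;
    // If the label sits on the last line there is no trailing return
    ~end = If ( ~lineEnd = 0 ; Length ( ~text ) + 1 ; ~lineEnd ) ;
    ~value = Middle ( ~text ; ~start ; ~end - ~start )
  ] ;
  // GetAsNumber ignores the currency symbol and keeps the numeric part
  If ( ~pos > 0 ; GetAsNumber ( ~value ) ; "" )
)
```

Note that a line such as "SUBTOTAL: " appearing earlier in the text would also match this marker, which is exactly the kind of fragility that marker-based parsing carries.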

Parsing the Output (A Better Approach)

While standard text functions (Middle, Position) work for predictable layouts, they are brittle. A minor shift in the PDF structure can break your script. In FileMaker 2025, a more resilient pattern is to pass the raw text from GetTextFromPDF to an LLM (Large Language Model) for semantic extraction.

You can construct a prompt within a generic Insert from URL call or a native AI script step. By specifying a JSON schema in your prompt, you make it far more likely that the response is machine-readable and ready for JSONGetElement, though the reply should still be validated before use.

You are a specialized data extraction agent.
Analyze the unstructured text provided below, which was extracted from a PDF invoice.

Your task:
1. Identify the Invoice Number, Invoice Date, and Total Amount.
2. Standardize the date to YYYY-MM-DD format.
3. Return the result strictly as a JSON object. Do not include markdown formatting or conversational text.

Use the following JSON schema:
{
  "invoice_number": "string",
  "invoice_date": "YYYY-MM-DD",
  "total_amount": number,
  "confidence_score": number (0-1)
}

--- BEGIN SOURCE TEXT ---
{{$extractedText}}
--- END SOURCE TEXT ---

Because the prompt specifies the exact JSON structure to return, you can read each value with JSONGetElement and write it directly into your FileMaker fields.
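
A minimal sketch of the receiving side follows. It assumes the model's reply has already been stored in a hypothetical $response variable and that Invoices::InvoiceNumber, Invoices::InvoiceDate, and Invoices::Total exist in your schema. The 0.8 confidence threshold and the status values are arbitrary illustrations, not recommendations.

```
# Reject replies that are not valid JSON before reading any keys
If [ Left ( JSONFormatElements ( $response ) ; 1 ) = "?" ]
    Set Field [ Invoices::Status ; "Parse Failed" ]
    Exit Script [ Text Result: "Invalid JSON from model" ]
End If

Set Variable [ $isoDate ; Value: JSONGetElement ( $response ; "invoice_date" ) ]
Set Variable [ $confidence ; Value: JSONGetElement ( $response ; "confidence_score" ) ]

If [ IsEmpty ( $confidence ) or GetAsNumber ( $confidence ) < 0.8 ]
    # Low or missing confidence: hold the record for human review
    Set Field [ Invoices::Status ; "Needs Review" ]
Else
    Set Field [ Invoices::InvoiceNumber ; JSONGetElement ( $response ; "invoice_number" ) ]
    # Build a FileMaker date from YYYY-MM-DD using Date ( month ; day ; year )
    Set Field [ Invoices::InvoiceDate ; Date ( GetAsNumber ( Middle ( $isoDate ; 6 ; 2 ) ) ; GetAsNumber ( Right ( $isoDate ; 2 ) ) ; GetAsNumber ( Left ( $isoDate ; 4 ) ) ) ]
    Set Field [ Invoices::Total ; JSONGetElement ( $response ; "total_amount" ) ]
End If
```

Checking the reply with JSONFormatElements first matters because a model can still return prose or markdown despite the instructions in the prompt.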

New to integrating AI in FileMaker with cURL? Read my article: Architecting AI into FileMaker

Trade-offs and Considerations

While GetTextFromPDF is a powerful addition to FileMaker 2025, it requires specific architectural accommodations.

The "Scanned Document" Problem: Since this function lacks OCR capabilities, you must architect a fallback. If GetTextFromPDF returns "?", your workflow might need to route the document to an external OCR service (like AWS Textract) or, if on macOS/iOS, fall back to GetLiveText (which works on images).
Security Gaps: Because you cannot pass a password to the function, you cannot natively index encrypted PDFs provided by clients (e.g., banking statements often come password-protected). In these scenarios, a plugin is still required to decrypt the file before indexing.
Search Noise: The function extracts everything, including legal footers and page numbers. When performing Finds against your index field, users may get false positives from boilerplate text. Consider using Substitute to strip known boilerplate during the indexing phase.
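
For the search-noise point, the clean-up can happen in the same step that writes the index field. A sketch, assuming the $extractedText variable from the script above; both boilerplate strings are made-up placeholders for whatever recurring text your own documents contain.

```
Substitute (
  $extractedText ;
  // Each bracketed pair is [ text to find ; replacement ]
  [ "This invoice is subject to our standard terms and conditions." ; "" ] ;
  [ "CONFIDENTIAL" ; "" ]
)
```

Substitute is case-sensitive and matches literal strings only, so variable text such as page numbers would need a different approach.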

Conclusion

The GetTextFromPDF function in FileMaker 2025 allows developers to retire third-party dependencies for standard document indexing. It is fast, native, and server-compatible. However, its simplicity—specifically the lack of OCR and password support—means it is best used as the primary indexing method in a tiered strategy, with plugins or API integrations serving as the fallback for complex or secured documents.