Text Extraction

Extract text, search content, read metadata, inspect bookmarks, retrieve attachments, and examine signatures from existing PDF documents using the PdfiumDocument inspection API.

Overview

FolioPDF's inspection API is powered by PDFium — the same PDF engine embedded in Chromium. It can parse and extract data from any valid PDF, regardless of which tool created it. The entry point is PdfiumDocument, which wraps a native PDFium document handle and provides typed .NET APIs for every inspection operation.

using FolioPDF.Toolkit.Pdfium;

using var doc = PdfiumDocument.Load(File.ReadAllBytes("report.pdf"));
Console.WriteLine($"Pages: {doc.PageCount}");

Loading Documents

From Byte Array

byte[] pdfBytes = File.ReadAllBytes("report.pdf");
using var doc = PdfiumDocument.Load(pdfBytes);

From File Path

using var doc = PdfiumDocument.LoadFile("report.pdf");

With Password (Encrypted PDFs)

using var doc = PdfiumDocument.Load(pdfBytes, password: "secret");
// or, from a file path:
using var fileDoc = PdfiumDocument.LoadFile("encrypted.pdf", password: "secret");

Important: PdfiumDocument implements IDisposable. Always create it with a using statement or using declaration so that the native handle and pinned buffer are released.

Document Properties

Basic properties are available directly on the document object:

using var doc = PdfiumDocument.Load(pdfBytes);

Console.WriteLine($"Page count:   {doc.PageCount}");
Console.WriteLine($"Is encrypted: {doc.IsEncrypted}");
Console.WriteLine($"PDF version:  {doc.Version}");
Console.WriteLine($"Permissions:  {doc.Permissions}");

Property	Type	Description
`PageCount`	`int`	Total number of pages in the document.
`IsEncrypted`	`bool`	Whether the document is password-protected.
`Version`	`PdfFileVersion`	PDF version (e.g. 1.4, 1.7, 2.0).
`Permissions`	`PdfPermissions`	Security permissions (printing, copying, modification, etc.).
`SignatureCount`	`int`	Number of digital signatures in the document.
`HasSignatures`	`bool`	Convenience: `SignatureCount > 0`.
`BookmarkCount`	`int`	Number of top-level bookmarks.

Metadata

Read the standard PDF metadata fields (Info dictionary):

var meta = doc.GetMetadata();

Console.WriteLine($"Title:         {meta.Title}");
Console.WriteLine($"Author:        {meta.Author}");
Console.WriteLine($"Subject:       {meta.Subject}");
Console.WriteLine($"Keywords:      {meta.Keywords}");
Console.WriteLine($"Creator:       {meta.Creator}");
Console.WriteLine($"Producer:      {meta.Producer}");
Console.WriteLine($"Created:       {meta.CreationDate}");
Console.WriteLine($"Modified:      {meta.ModDate}");

PdfiumDocumentMetadata Fields

Property	Type	PDF Key
`Title`	`string?`	`/Title`
`Author`	`string?`	`/Author`
`Subject`	`string?`	`/Subject`
`Keywords`	`string?`	`/Keywords`
`Creator`	`string?`	`/Creator`
`Producer`	`string?`	`/Producer`
`CreationDate`	`string?`	`/CreationDate`
`ModDate`	`string?`	`/ModDate`

Date fields use the raw PDF date string format (e.g. D:20260407120000+02'00') and are returned verbatim because real-world PDFs sometimes contain malformed dates.
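Because the dates come back as raw strings, callers who need a typed value have to parse them themselves. The following is a minimal sketch of a tolerant parser for the D:YYYYMMDDHHmmSS+HH'mm' layout; TryParsePdfDate is a hypothetical helper, not part of FolioPDF, and it returns false rather than throwing when the string is malformed.

```csharp
using System;
using System.Globalization;

// Hypothetical helper (not a FolioPDF API): parse a raw PDF date string such as
// "D:20260407120000+02'00'". Everything after the year is optional.
static bool TryParsePdfDate(string? raw, out DateTimeOffset result)
{
    result = default;
    if (string.IsNullOrWhiteSpace(raw)) return false;

    string s = raw.Trim();
    if (s.StartsWith("D:", StringComparison.Ordinal)) s = s.Substring(2);

    // Count the leading digits: YYYY[MM[DD[HH[mm[SS]]]]]
    int digits = 0;
    while (digits < s.Length && digits < 14 && s[digits] >= '0' && s[digits] <= '9') digits++;
    if (digits < 4 || digits % 2 != 0) return false;

    // Read a fixed-width field, or fall back to its default when it is absent.
    int Field(int start, int length, int fallback) =>
        start + length <= digits
            ? int.Parse(s.Substring(start, length), NumberStyles.None, CultureInfo.InvariantCulture)
            : fallback;

    int year = Field(0, 4, 1), month = Field(4, 2, 1), day = Field(6, 2, 1);
    int hour = Field(8, 2, 0), minute = Field(10, 2, 0), second = Field(12, 2, 0);

    // Optional zone: "Z", or +HH'mm' / -HH'mm' (apostrophes are sometimes missing).
    string zone = s.Substring(digits).Replace("'", "").Trim();
    TimeSpan offset = TimeSpan.Zero;
    if (zone.Length > 0 && zone != "Z")
    {
        if (zone.Length < 3 || (zone[0] != '+' && zone[0] != '-')) return false;
        if (!int.TryParse(zone.Substring(1, 2), NumberStyles.None,
                          CultureInfo.InvariantCulture, out int zoneHours)) return false;
        int zoneMinutes = 0;
        if (zone.Length >= 5 &&
            !int.TryParse(zone.Substring(3, 2), NumberStyles.None,
                          CultureInfo.InvariantCulture, out zoneMinutes)) return false;
        offset = new TimeSpan(zoneHours, zoneMinutes, 0);
        if (zone[0] == '-') offset = offset.Negate();
    }

    try
    {
        result = new DateTimeOffset(year, month, day, hour, minute, second, offset);
        return true;
    }
    catch (ArgumentException) // out-of-range month, day, hour, offset, ...
    {
        return false;
    }
}

if (TryParsePdfDate("D:20260407120000+02'00'", out var created))
    Console.WriteLine(created.ToString("o"));
```

In practice the argument would be meta.CreationDate or meta.ModDate; keep the raw string around for display when parsing fails.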

Page-Level Text Extraction

Extract all text from a page in reading order:

using var page = doc.GetPage(0);
string text = page.ExtractText();
Console.WriteLine(text);

Extract text from every page:

for (int i = 0; i < doc.PageCount; i++)
{
    using var page = doc.GetPage(i);
    string text = page.ExtractText();
    Console.WriteLine($"--- Page {i + 1} ---");
    Console.WriteLine(text);
}

Note: ExtractText() returns an empty string for scanned, image-only pages that lack an OCR text layer. Pages whose fonts are embedded with ToUnicode CMaps extract as readable Unicode text.

Text Runs with Bounding Boxes

For position-aware extraction, use ExtractTextRuns(), which returns rectangular runs of text together with their page coordinates:

using var page = doc.GetPage(0);

foreach (var run in page.ExtractTextRuns())
{
    Console.WriteLine($"Text: \"{run.Text}\"");
    Console.WriteLine($"  Bounds: X={run.Bounds.X:F1}, Y={run.Bounds.Y:F1}, " +
                      $"W={run.Bounds.Width:F1}, H={run.Bounds.Height:F1}");
}

Each PdfiumTextRun carries:

Property	Type	Description
`Text`	`string`	The text content in reading order.
`Bounds`	`RectangleF`	Bounding rectangle in PDF page coordinates (PostScript points, origin at lower-left).

Character-Level Positions

For fine-grained positioning (OCR comparison, custom layout analysis), get individual character bounding boxes:

using var page = doc.GetPage(0);

foreach (var ch in page.GetCharacterBoxes())
{
    Console.WriteLine($"[{ch.CharIndex}] '{ch.Character}' " +
                      $"L={ch.Left:F1} R={ch.Right:F1} B={ch.Bottom:F1} T={ch.Top:F1} " +
                      $"({ch.Width:F1} x {ch.Height:F1})");
}

PdfiumCharBox Properties

Property	Type	Description
`CharIndex`	`int`	Zero-based character index on the page.
`Character`	`char`	The Unicode character.
`Left`	`float`	Left edge in PDF points.
`Right`	`float`	Right edge in PDF points.
`Bottom`	`float`	Bottom edge in PDF points.
`Top`	`float`	Top edge in PDF points.
`Width`	`float`	Computed: `Right - Left`.
`Height`	`float`	Computed: `Top - Bottom`.
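Character boxes use PDF points with the origin at the lower-left corner, while most bitmap and UI toolkits put the origin at the top-left and measure in pixels. The sketch below shows the conversion for a page rendered at a given DPI. ToTopLeftPixels is a hypothetical helper, and it assumes an unrotated page whose visible area starts at (0, 0); neither assumption is guaranteed by the API described here.

```csharp
using System;

// Hypothetical helper (not a FolioPDF API): map a box in PDF points
// (origin lower-left, y grows upward) to pixel coordinates with the origin
// at the top-left (y grows downward) for a page rendered at 'dpi'.
// ASSUMES an unrotated page whose visible area starts at (0, 0).
static (float X, float Y, float Width, float Height) ToTopLeftPixels(
    float left, float bottom, float right, float top, float pageHeightPts, float dpi)
{
    float scale = dpi / 72f;                 // 72 points per inch
    return (left * scale,
            (pageHeightPts - top) * scale,   // flip the y axis
            (right - left) * scale,
            (top - bottom) * scale);
}

// A 72 x 12 pt box near the top of a US Letter page (792 pt tall), at 144 DPI.
var box = ToTopLeftPixels(left: 72, bottom: 700, right: 144, top: 712,
                          pageHeightPts: 792, dpi: 144);
Console.WriteLine($"X={box.X} Y={box.Y} W={box.Width} H={box.Height}");
```

With a PdfiumCharBox, the first four arguments would be ch.Left, ch.Bottom, ch.Right and ch.Top, and pageHeightPts would come from page.Height.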

Document Search

Search for text across all pages in the document:

foreach (var match in doc.Search("confidential"))
{
    Console.WriteLine($"Page {match.PageIndex + 1}: \"{match.Text}\" " +
                      $"at char {match.CharIndex} (length {match.Length})");
    foreach (var rect in match.Rectangles)
        Console.WriteLine($"  Rect: ({rect.X:F1}, {rect.Y:F1}, {rect.Width:F1}, {rect.Height:F1})");
}

Page-Level Search

Search within a single page:

using var page = doc.GetPage(0);
foreach (var match in page.FindText("invoice"))
{
    Console.WriteLine($"\"{match.Text}\" at char {match.CharIndex}");
}

Search Options

// Case-sensitive search
var caseSensitive = doc.Search("FolioPDF", new PdfiumSearchOptions { CaseSensitive = true });

// Whole-word search (won't match "informant" when searching for "form")
var wholeWord = doc.Search("form", PdfiumSearchOptions.WholeWord);

// Convenience presets
var substring = doc.Search("term", PdfiumSearchOptions.Default);               // case-insensitive substring
var exactCase = doc.Search("TERM", PdfiumSearchOptions.CaseSensitiveDefault);  // case-sensitive substring

PdfiumSearchOptions Properties

Property	Type	Default	Description
`CaseSensitive`	`bool`	`false`	When true, uppercase and lowercase letters must match exactly.
`MatchWholeWord`	`bool`	`false`	When true, matches only when bounded by non-letter/non-digit characters on both sides.
`MatchConsecutive`	`bool`	`false`	When true, overlapping matches are reported separately.
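To make the MatchWholeWord rule concrete, here is a small re-implementation of the boundary test on a plain string. It is an illustration only, not FolioPDF's or PDFium's code: IsWholeWordAt and FindWholeWords are hypothetical names, and the start and end of the text are assumed to count as boundaries.

```csharp
using System;
using System.Collections.Generic;

// Illustration only (not FolioPDF's implementation): a hit is a whole word when
// the characters immediately before and after it are neither letters nor digits.
// The start and end of the text are ASSUMED to count as boundaries.
static bool IsWholeWordAt(string text, int start, int length)
{
    int end = start + length;
    bool boundaryBefore = start == 0 || !char.IsLetterOrDigit(text[start - 1]);
    bool boundaryAfter = end >= text.Length || !char.IsLetterOrDigit(text[end]);
    return boundaryBefore && boundaryAfter;
}

// Returns the start index of every whole-word occurrence of 'term' in 'text'.
static List<int> FindWholeWords(string text, string term, bool caseSensitive = false)
{
    var hits = new List<int>();
    if (term.Length == 0) return hits;

    var comparison = caseSensitive ? StringComparison.Ordinal : StringComparison.OrdinalIgnoreCase;
    for (int i = text.IndexOf(term, comparison); i >= 0; i = text.IndexOf(term, i + 1, comparison))
    {
        if (IsWholeWordAt(text, i, term.Length)) hits.Add(i);
    }
    return hits;
}

var found = FindWholeWords("The form asks an informant to sign the Form.", "form");
Console.WriteLine(string.Join(", ", found)); // prints "4, 39"; "informant" is skipped
```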

PdfiumSearchMatch Properties

Property	Type	Description
`PageIndex`	`int`	Zero-based page index where the match was found.
`CharIndex`	`int`	Character index of the match start within the page's text stream.
`Length`	`int`	Number of characters covered by the match.
`Text`	`string`	The actual matched text as extracted from the PDF.
`Rectangles`	`IReadOnlyList<RectangleF>`	Bounding rectangles in page coordinates. Multi-line matches have one rectangle per visual line.
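When a match wraps across lines, a caller that wants a single highlight region can merge the per-line rectangles into one bounding box. A minimal sketch, assuming RectangleF here is System.Drawing.RectangleF (the page does not say which RectangleF type is used); UnionAll is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;

// Hypothetical helper (not a FolioPDF API): smallest rectangle that encloses
// every rectangle in the list. Returns RectangleF.Empty for an empty list.
static RectangleF UnionAll(IReadOnlyList<RectangleF> rects)
{
    if (rects.Count == 0) return RectangleF.Empty;

    RectangleF box = rects[0];
    for (int i = 1; i < rects.Count; i++)
        box = RectangleF.Union(box, rects[i]);
    return box;
}

// Two per-line rectangles of a match that wraps onto a second, shorter line.
var lines = new[]
{
    new RectangleF(72, 700, 100, 12),
    new RectangleF(72, 686, 40, 12),
};
Console.WriteLine(UnionAll(lines));
```

For a real match, pass match.Rectangles. Note that a merged box also covers text outside the match on wrapped lines, so per-line rectangles remain the better choice for precise highlighting.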

Hyperlink Extraction

Extract URLs detected by PDFium's web link analysis engine:

using var page = doc.GetPage(0);

foreach (var link in page.GetLinks())
{
    Console.WriteLine($"URL: {link.Url}");
    Console.WriteLine($"  Starts at char {link.StartCharIndex}, spans {link.CharCount} characters");
}

PdfiumLink Properties

Property	Type	Description
`Url`	`string`	The URL string (e.g. "https://example.com").
`StartCharIndex`	`int`	Zero-based character index where the link text starts.
`CharCount`	`int`	Number of characters in the link text.

Image Extraction

Extract all embedded images from the document or a specific page:

// All images from the entire document
foreach (var img in doc.GetAllImages())
{
    Console.WriteLine($"Size: {img.Width}x{img.Height}");
    Console.WriteLine($"Page bounds: {img.BoundsOnPage}");

    // Save as PNG; the GUID keeps images with identical dimensions from overwriting each other
    File.WriteAllBytes($"image_{img.Width}x{img.Height}_{Guid.NewGuid():N}.png", img.ToPng());
}

// Images from a specific page
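var pageImages = doc.GetImages(pageIndex: 0);
int imageNumber = 0;
foreach (var img in pageImages)
{
    byte[] jpeg = img.ToJpeg(quality: 85);
    // Number the files so each image on the page gets its own output file
    File.WriteAllBytes($"page0_image_{imageNumber++}.jpg", jpeg);
}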

PdfiumExtractedImage Properties

Property	Type	Description
`Width`	`int`	Pixel width of the rendered image.
`Height`	`int`	Pixel height of the rendered image.
`BoundsOnPage`	`RectangleF`	On-page bounds in PostScript points (may differ from pixel dimensions due to scaling).
`PixelsBgra32`	`byte[]`	Decoded BGRA32 pixel buffer.
`Stride`	`int`	Row pitch in bytes.

Method	Returns	Description
`ToPng()`	`byte[]`	Encode as PNG (lossless) via Skia.
`ToJpeg(quality)`	`byte[]`	Encode as JPEG at the given quality (1-100, default 85) via Skia.
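For callers that want to work on PixelsBgra32 directly instead of encoding it, the sketch below shows how Stride is used to address a pixel. GetPixel is a hypothetical helper, not a FolioPDF API, and it assumes that row 0 is the top row of the image, which this page does not state.

```csharp
using System;

// Hypothetical helper (not a FolioPDF API): read one pixel from a BGRA32 buffer.
// Each pixel is 4 bytes in the order blue, green, red, alpha. Rows start
// 'stride' bytes apart, which can exceed width * 4 when rows are padded.
// Row 0 is ASSUMED to be the top row.
static (byte R, byte G, byte B, byte A) GetPixel(byte[] pixels, int stride, int x, int y)
{
    int offset = y * stride + x * 4;
    return (pixels[offset + 2], pixels[offset + 1], pixels[offset], pixels[offset + 3]);
}

// Synthetic 2x2 image whose rows are padded from 8 to 12 bytes.
byte[] demo =
{
    255, 0, 0, 255,   0, 255, 0, 255,   0, 0, 0, 0,   // row 0: blue, green, padding
    0, 0, 255, 255,   9, 9, 9, 128,     0, 0, 0, 0,   // row 1: red, dark grey, padding
};

var pixel = GetPixel(demo, stride: 12, x: 0, y: 1);
Console.WriteLine($"R={pixel.R} G={pixel.G} B={pixel.B} A={pixel.A}"); // prints "R=255 G=0 B=0 A=255"
```

With an extracted image, the call would be GetPixel(img.PixelsBgra32, img.Stride, x, y), with x below img.Width and y below img.Height.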

Bookmarks

Read the document's bookmark (outline) tree:

foreach (var bm in doc.GetBookmarks())
{
    PrintBookmark(bm, indent: 0);
}

void PrintBookmark(PdfiumBookmark bm, int indent)
{
    Console.WriteLine($"{new string(' ', indent * 2)}{bm.Title} -> Page {bm.PageIndex + 1}");
    foreach (var child in bm.Children)
        PrintBookmark(child, indent + 1);
}

PdfiumBookmark Properties

Property	Type	Description
`Title`	`string`	Display title shown in PDF viewers.
`PageIndex`	`int`	Zero-based target page index, or -1 if the destination could not be resolved.
`Children`	`IReadOnlyList<PdfiumBookmark>`	Nested child bookmarks. Empty for leaf entries.

Attachments

Read embedded file attachments (common in ZUGFeRD/Factur-X invoices):

foreach (var att in doc.GetAttachments())
{
    Console.WriteLine($"Name: {att.Name}");
    Console.WriteLine($"MIME: {att.MimeType}");
    Console.WriteLine($"Desc: {att.Description}");
    Console.WriteLine($"Size: {att.Length} bytes");

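    // Extract the payload. The name comes from the PDF itself, so drop any
    // directory part before using it as an output file name.
    byte[] data = att.GetBytes();
    File.WriteAllBytes(Path.GetFileName(att.Name), data);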
}

PdfiumAttachment Properties

Property	Type	Description
`Name`	`string`	Attachment file name as recorded in the PDF.
`MimeType`	`string?`	MIME type (e.g. `text/xml`). Null if absent.
`Description`	`string?`	Optional description from the `/Desc` entry.
`Length`	`int`	Payload length in bytes.

Method	Returns	Description
`GetBytes()`	`byte[]`	Extract the attachment payload. Cached after the first call.
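Attachment names are read from the PDF and should be treated as untrusted input when they are used as output paths: a name such as ../../etc/passwd would otherwise escape the target directory. The following sketch reduces a name to a bare file name before combining it with an output directory. SafeAttachmentPath and the attachment.bin fallback are hypothetical, not part of FolioPDF.

```csharp
using System;
using System.IO;

// Hypothetical helper (not a FolioPDF API): turn an attachment name taken from
// an untrusted PDF into a path that stays inside 'outputDir'.
static string SafeAttachmentPath(string outputDir, string attachmentName)
{
    // Treat both separator styles as separators, whatever the host OS is,
    // and keep only the last path segment.
    string name = attachmentName.Replace('\\', '/');
    name = name.Substring(name.LastIndexOf('/') + 1);

    // Replace characters the file system rejects.
    foreach (char c in Path.GetInvalidFileNameChars())
        name = name.Replace(c, '_');

    // Refuse empty names and names made only of dots or spaces ("..", ".").
    if (name.Trim('.', ' ').Length == 0)
        name = "attachment.bin";

    return Path.Combine(outputDir, name);
}

Console.WriteLine(SafeAttachmentPath("out", "../../etc/passwd"));
Console.WriteLine(SafeAttachmentPath("out", "factur-x.xml"));
```

A caller would then write the payload with File.WriteAllBytes(SafeAttachmentPath(dir, att.Name), att.GetBytes()). Two attachments with the same name would still overwrite each other, so add a suffix if that matters.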

Asymmetric routing: Attachments are written via qpdf (DocumentOperation.AddAttachment) but read via PDFium. This split is by design — qpdf's write path includes the /AFRelationship support needed for PDF/A-3 compliance, while PDFium has the cleanest read API.

Signatures

Enumerate digital signatures and their metadata:

foreach (var sig in doc.GetSignatures())
{
    Console.WriteLine($"SubFilter:    {sig.SubFilter}");
    Console.WriteLine($"Reason:       {sig.Reason}");
    Console.WriteLine($"Signing time: {sig.SigningTime}");
    Console.WriteLine($"Contents size: {sig.Contents.Length} bytes");
}

For cryptographic verification, see the Digital Signing page.

Complete Inspection Example

An end-to-end example that combines most of the operations above into a single inspection report:

using FolioPDF.Toolkit.Pdfium;

byte[] pdfBytes = File.ReadAllBytes("document.pdf");
using var doc = PdfiumDocument.Load(pdfBytes);

// Document properties
Console.WriteLine($"=== Document Properties ===");
Console.WriteLine($"Pages:     {doc.PageCount}");
Console.WriteLine($"Encrypted: {doc.IsEncrypted}");
Console.WriteLine($"Version:   {doc.Version}");

// Metadata
var meta = doc.GetMetadata();
Console.WriteLine($"\n=== Metadata ===");
Console.WriteLine($"Title:  {meta.Title ?? "(none)"}");
Console.WriteLine($"Author: {meta.Author ?? "(none)"}");

// Bookmarks
Console.WriteLine($"\n=== Bookmarks ({doc.BookmarkCount}) ===");
foreach (var bm in doc.GetBookmarks())
    Console.WriteLine($"  {bm.Title} -> Page {bm.PageIndex + 1}");

// Attachments
Console.WriteLine($"\n=== Attachments ===");
foreach (var att in doc.GetAttachments())
    Console.WriteLine($"  {att.Name} ({att.Length} bytes, {att.MimeType})");

// Signatures
Console.WriteLine($"\n=== Signatures ({doc.SignatureCount}) ===");
foreach (var sig in doc.GetSignatures())
{
    Console.WriteLine($"  {sig.SubFilter} - {sig.Reason}");
    var result = PdfiumSignatureVerifier.Verify(sig, pdfBytes);
    Console.WriteLine($"    Valid: {result.IsValid}");
}

// Text from each page
Console.WriteLine($"\n=== Text ===");
for (int i = 0; i < doc.PageCount; i++)
{
    using var page = doc.GetPage(i);
    var text = page.ExtractText();
    Console.WriteLine($"--- Page {i + 1} ({page.Width:F0}x{page.Height:F0} pts) ---");
    Console.WriteLine(text.Length > 200 ? text[..200] + "..." : text);

    // Links on this page
    foreach (var link in page.GetLinks())
        Console.WriteLine($"  Link: {link.Url}");
}

// Images
Console.WriteLine($"\n=== Images ===");
foreach (var img in doc.GetAllImages())
    Console.WriteLine($"  {img.Width}x{img.Height} pixels at {img.BoundsOnPage}");

Page Properties

Each PdfiumPage exposes basic geometry:

Property	Type	Description
`Index`	`int`	Zero-based page index.
`Width`	`float`	Page width in PostScript points (1 pt = 1/72 inch).
`Height`	`float`	Page height in PostScript points.
`Rotation`	`int`	Page rotation in degrees: 0, 90, 180, or 270.
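Since page dimensions are reported in points, converting them to physical units is plain arithmetic: 1 pt = 1/72 inch and 1 inch = 25.4 mm. A minimal sketch, with PointsToMillimetres as a hypothetical helper name:

```csharp
using System;

// Hypothetical helper (not a FolioPDF API): convert a page size from
// PostScript points to millimetres.
static (double WidthMm, double HeightMm) PointsToMillimetres(float widthPts, float heightPts)
{
    const double MmPerPoint = 25.4 / 72.0;   // 1 pt = 1/72 inch, 1 inch = 25.4 mm
    return (widthPts * MmPerPoint, heightPts * MmPerPoint);
}

// An A4 page is roughly 595.276 x 841.89 pt, which is 210 x 297 mm.
var a4 = PointsToMillimetres(595.276f, 841.89f);
Console.WriteLine($"{a4.WidthMm:F1} x {a4.HeightMm:F1} mm");
```

For a loaded page, pass page.Width and page.Height. Whether those values already account for Rotation is not stated here, so check a rotated sample before relying on the orientation.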

Thread Safety

All PdfiumDocument and PdfiumPage operations route through a process-global lock. Multiple .NET threads can safely call inspection methods on the same or different documents concurrently; the calls are serialized internally, so they are safe but do not run in parallel. The lock is reentrant on the same thread, so a callback that calls back into the API cannot deadlock.
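The behaviour of a reentrant, process-wide gate can be demonstrated with plain C#. The sketch below is a stand-in built on the lock statement, not FolioPDF's internal lock (which is not shown on this page); Inspect, gate and the counters are hypothetical names. It shows that at most one thread is inside the gate at a time, and that a nested call on the same thread re-enters without deadlocking.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Stand-in for a process-global gate; FolioPDF's real lock is internal.
object gate = new object();
int insideCount = 0;   // threads currently inside the gate
int maxInside = 0;     // highest value insideCount ever reached
int calls = 0;         // total calls that completed, nested ones included

void Inspect(bool nested = false)
{
    lock (gate)        // Monitor-based locks are reentrant on the owning thread
    {
        if (!nested)
        {
            insideCount++;
            maxInside = Math.Max(maxInside, insideCount);
            Inspect(nested: true);   // same thread re-enters without deadlocking
            Thread.Sleep(1);         // hold the gate briefly to invite contention
            insideCount--;
        }
        calls++;
    }
}

// Eight concurrent callers, each making one outer and one nested call.
Parallel.For(0, 8, _ => Inspect());
Console.WriteLine($"calls={calls}, max threads inside at once={maxInside}");
// prints "calls=16, max threads inside at once=1"
```

The practical consequence is that spreading inspection work over more threads does not speed up the PDFium calls themselves, although work done outside the gate, such as processing the extracted text, can still run in parallel.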