Text Extraction

Extract text, search content, read metadata, inspect bookmarks, retrieve attachments, and examine signatures from existing PDF documents using the PdfiumDocument inspection API.

Overview

FolioPDF's inspection API is powered by PDFium — the same PDF engine embedded in Chromium. It can parse and extract data from any valid PDF, regardless of which tool created it. The entry point is PdfiumDocument, which wraps a native PDFium document handle and provides typed .NET APIs for every inspection operation.

using FolioPDF.Toolkit.Pdfium;

using var doc = PdfiumDocument.Load(File.ReadAllBytes("report.pdf"));
Console.WriteLine($"Pages: {doc.PageCount}");

Loading Documents

From Byte Array

byte[] pdfBytes = File.ReadAllBytes("report.pdf");
using var doc = PdfiumDocument.Load(pdfBytes);

From File Path

using var doc = PdfiumDocument.LoadFile("report.pdf");

With Password (Encrypted PDFs)

using var doc = PdfiumDocument.Load(pdfBytes, password: "secret");
// or, directly from a file:
using var fileDoc = PdfiumDocument.LoadFile("encrypted.pdf", password: "secret");

Important: PdfiumDocument implements IDisposable. Always wrap it in a using statement to ensure the native handle and pinned buffer are released.

Document Properties

Basic properties are available directly on the document object:

using var doc = PdfiumDocument.Load(pdfBytes);

Console.WriteLine($"Page count:   {doc.PageCount}");
Console.WriteLine($"Is encrypted: {doc.IsEncrypted}");
Console.WriteLine($"PDF version:  {doc.Version}");
Console.WriteLine($"Permissions:  {doc.Permissions}");

| Property | Type | Description |
| --- | --- | --- |
| PageCount | int | Total number of pages in the document. |
| IsEncrypted | bool | Whether the document is password-protected. |
| Version | PdfFileVersion | PDF version (e.g. 1.4, 1.7, 2.0). |
| Permissions | PdfPermissions | Security permissions (printing, copying, modification, etc.). |
| SignatureCount | int | Number of digital signatures in the document. |
| HasSignatures | bool | Convenience: SignatureCount > 0. |
| BookmarkCount | int | Number of top-level bookmarks. |

Metadata

Read the standard PDF metadata fields (Info dictionary):

var meta = doc.GetMetadata();

Console.WriteLine($"Title:         {meta.Title}");
Console.WriteLine($"Author:        {meta.Author}");
Console.WriteLine($"Subject:       {meta.Subject}");
Console.WriteLine($"Keywords:      {meta.Keywords}");
Console.WriteLine($"Creator:       {meta.Creator}");
Console.WriteLine($"Producer:      {meta.Producer}");
Console.WriteLine($"Created:       {meta.CreationDate}");
Console.WriteLine($"Modified:      {meta.ModDate}");

PdfiumDocumentMetadata Fields

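| Property | Type | PDF Key |
| --- | --- | --- |
| Title | string? | /Title |
| Author | string? | /Author |
| Subject | string? | /Subject |
| Keywords | string? | /Keywords |
| Creator | string? | /Creator |
| Producer | string? | /Producer |
| CreationDate | string? | /CreationDate |
| ModDate | string? | /ModDate |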

Date fields use the raw PDF date string format (e.g. D:20260407120000+02'00') and are returned verbatim because real-world PDFs sometimes contain malformed dates.
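
Because the dates come back as raw strings, callers that need a real timestamp have to parse them themselves. The following is a minimal sketch of a tolerant parser that uses only the .NET base library. ParsePdfDate is a hypothetical helper, not part of FolioPDF, and it assumes the D:YYYYMMDDHHmmSSOHH'mm' layout from the PDF specification, with every field after the year optional. Anything it cannot interpret yields null instead of an exception.

```csharp
#nullable enable
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper: converts a raw PDF date string to a DateTimeOffset,
// or returns null when the string does not follow the expected layout.
static DateTimeOffset? ParsePdfDate(string? raw)
{
    if (string.IsNullOrWhiteSpace(raw)) return null;

    // Groups: 1 year, 2 month, 3 day, 4 hour, 5 minute, 6 second,
    //         7 offset sign (or Z), 8 offset hours, 9 offset minutes.
    var m = Regex.Match(raw.Trim(),
        @"^(?:D:)?([0-9]{4})([0-9]{2})?([0-9]{2})?([0-9]{2})?([0-9]{2})?([0-9]{2})?" +
        @"(?:([Zz+\-])(?:([0-9]{2})'?(?:([0-9]{2})'?)?)?)?$");
    if (!m.Success) return null;

    // Missing fields fall back to the defaults given in the PDF specification.
    int Part(int group, int fallback) =>
        m.Groups[group].Success
            ? int.Parse(m.Groups[group].Value, CultureInfo.InvariantCulture)
            : fallback;

    var offset = TimeSpan.Zero;
    string sign = m.Groups[7].Value;
    if (sign == "+" || sign == "-")
    {
        offset = new TimeSpan(Part(8, 0), Part(9, 0), 0);
        if (sign == "-") offset = -offset;
    }

    try
    {
        return new DateTimeOffset(Part(1, 1), Part(2, 1), Part(3, 1),
                                  Part(4, 0), Part(5, 0), Part(6, 0), offset);
    }
    catch (ArgumentOutOfRangeException)
    {
        return null; // e.g. month 13, or an offset beyond +/-14 hours
    }
}

Console.WriteLine(ParsePdfDate("D:20260407120000+02'00'")?.ToString("o") ?? "(unparseable)");
Console.WriteLine(ParsePdfDate("yesterday")?.ToString("o") ?? "(unparseable)");
```

In practice the input would be meta.CreationDate or meta.ModDate. Keeping the raw string alongside the parsed value is a reasonable choice when the original needs to be shown or logged.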

Page-Level Text Extraction

Extract all text from a page in reading order:

using var page = doc.GetPage(0);
string text = page.ExtractText();
Console.WriteLine(text);

Extract text from every page:

for (int i = 0; i < doc.PageCount; i++)
{
    using var page = doc.GetPage(i);
    string text = page.ExtractText();
    Console.WriteLine($"--- Page {i + 1} ---");
    Console.WriteLine(text);
}

Note: ExtractText() returns an empty string for scanned, image-only pages that lack an OCR text layer. Pages whose fonts are embedded with ToUnicode CMaps extract as readable text.

Text Runs with Bounding Boxes

For position-aware extraction, use ExtractTextRuns(), which returns rectangular runs of text with their page coordinates:

using var page = doc.GetPage(0);

foreach (var run in page.ExtractTextRuns())
{
    Console.WriteLine($"Text: \"{run.Text}\"");
    Console.WriteLine($"  Bounds: X={run.Bounds.X:F1}, Y={run.Bounds.Y:F1}, " +
                      $"W={run.Bounds.Width:F1}, H={run.Bounds.Height:F1}");
}

Each PdfiumTextRun carries:

| Property | Type | Description |
| --- | --- | --- |
| Text | string | The text content in reading order. |
| Bounds | RectangleF | Bounding rectangle in PDF page coordinates (PostScript points, origin at lower-left). |
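
Since the origin is at the lower-left, a larger y value means a position higher on the page. The sketch below shows one way run positions might be used to rebuild visual lines. It works on stand-in tuples instead of real PdfiumTextRun objects, and the tuple layout, the sample values, and the 2-point tolerance are all assumptions made for illustration.

```csharp
using System;
using System.Linq;

// Stand-in data: (text, x, y) in PDF points, y measured up from the bottom edge.
var runs = new (string Text, float X, float Y)[]
{
    ("42.00",   140f, 100.4f),
    ("Invoice",  72f, 700.0f),
    ("Total:",   72f, 100.0f),
    ("#1042",   140f, 700.3f),
};

// Bucket runs into lines by rounding y, then order top-to-bottom, left-to-right.
const float lineTolerance = 2f;
var lines = runs
    .GroupBy(r => MathF.Round(r.Y / lineTolerance))
    .OrderByDescending(g => g.Key)              // larger y = higher on the page
    .Select(g => string.Join(" ", g.OrderBy(r => r.X).Select(r => r.Text)))
    .ToList();

foreach (var line in lines)
    Console.WriteLine(line);
// prints:
// Invoice #1042
// Total: 42.00
```

Rounding into fixed buckets is the simplest grouping rule, but it can split two runs that straddle a bucket boundary. A production version would more likely compare each run against the previous line's position.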

Character-Level Positions

For fine-grained positioning (OCR comparison, custom layout analysis), get individual character bounding boxes:

using var page = doc.GetPage(0);

foreach (var ch in page.GetCharacterBoxes())
{
    Console.WriteLine($"[{ch.CharIndex}] '{ch.Character}' " +
                      $"L={ch.Left:F1} R={ch.Right:F1} B={ch.Bottom:F1} T={ch.Top:F1} " +
                      $"({ch.Width:F1} x {ch.Height:F1})");
}

PdfiumCharBox Properties

| Property | Type | Description |
| --- | --- | --- |
| CharIndex | int | Zero-based character index on the page. |
| Character | char | The Unicode character. |
| Left | float | Left edge in PDF points. |
| Right | float | Right edge in PDF points. |
| Bottom | float | Bottom edge in PDF points. |
| Top | float | Top edge in PDF points. |
| Width | float | Computed: Right - Left. |
| Height | float | Computed: Top - Bottom. |
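
For OCR comparison, these boxes usually have to be matched against a rendered bitmap, whose origin is at the top-left with y growing downward. The following sketch shows the conversion arithmetic. ToPixelRect is a hypothetical helper, and it assumes an unrotated page whose visible area starts at (0, 0) in PDF space.

```csharp
using System;

// Maps a box given in PDF points (origin lower-left, y up) to pixel
// coordinates (origin top-left, y down) for a page rendered at `dpi`.
static (double X, double Y, double Width, double Height) ToPixelRect(
    double left, double right, double bottom, double top,
    double pageHeightPts, double dpi)
{
    double scale = dpi / 72.0;               // 72 points per inch
    return (left * scale,
            (pageHeightPts - top) * scale,   // flip the y axis
            (right - left) * scale,
            (top - bottom) * scale);
}

// A box near the top-left of an A4 page (595 x 842 pt), rendered at 144 dpi.
var px = ToPixelRect(left: 72, right: 144, bottom: 770, top: 782,
                     pageHeightPts: 842, dpi: 144);
Console.WriteLine($"x={px.X}, y={px.Y}, w={px.Width}, h={px.Height}");
// prints: x=144, y=120, w=144, h=24
```

With real data, the arguments would come from a PdfiumCharBox (Left, Right, Bottom, Top) and the page's Height. Rotated pages need an additional transform that this sketch does not cover.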

Document Search

Search for text across all pages in the document:

foreach (var match in doc.Search("confidential"))
{
    Console.WriteLine($"Page {match.PageIndex + 1}: \"{match.Text}\" " +
                      $"at char {match.CharIndex} (length {match.Length})");
    foreach (var rect in match.Rectangles)
        Console.WriteLine($"  Rect: ({rect.X:F1}, {rect.Y:F1}, {rect.Width:F1}, {rect.Height:F1})");
}

Page-Level Search

Search within a single page:

using var page = doc.GetPage(0);
foreach (var match in page.FindText("invoice"))
{
    Console.WriteLine($"\"{match.Text}\" at char {match.CharIndex}");
}

Search Options

// Case-sensitive search
var caseSensitive = doc.Search("FolioPDF", new PdfiumSearchOptions { CaseSensitive = true });

// Whole-word search (won't match "informant" when searching for "form")
var wholeWord = doc.Search("form", PdfiumSearchOptions.WholeWord);

// Convenience presets
var substring = doc.Search("term", PdfiumSearchOptions.Default);               // case-insensitive substring
var exactCase = doc.Search("TERM", PdfiumSearchOptions.CaseSensitiveDefault);  // case-sensitive substring

PdfiumSearchOptions Properties

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| CaseSensitive | bool | false | When true, uppercase and lowercase letters must match exactly. |
| MatchWholeWord | bool | false | When true, matches only when bounded by non-letter/non-digit characters on both sides. |
| MatchConsecutive | bool | false | When true, overlapping matches are reported separately. |
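
The whole-word rule can be made concrete on a plain string. The sketch below applies the boundary test exactly as the table describes it. IsWholeWord is a hypothetical helper written for illustration and is not how PDFium implements the check internally.

```csharp
using System;

// True when the match at [start, start + length) has no letter or digit
// directly before or after it (string edges count as boundaries).
static bool IsWholeWord(string text, int start, int length)
{
    int end = start + length;
    bool leftOk = start == 0 || !char.IsLetterOrDigit(text[start - 1]);
    bool rightOk = end == text.Length || !char.IsLetterOrDigit(text[end]);
    return leftOk && rightOk;
}

string sample = "The informant filled in a form.";
for (int i = sample.IndexOf("form", StringComparison.OrdinalIgnoreCase);
     i >= 0;
     i = sample.IndexOf("form", i + 1, StringComparison.OrdinalIgnoreCase))
{
    Console.WriteLine($"index {i}: whole word = {IsWholeWord(sample, i, 4)}");
}
// prints:
// index 6: whole word = False
// index 26: whole word = True
```

The first hit sits inside "informant" and is rejected because a letter precedes it. The second is accepted because it is bounded by a space and a period.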

PdfiumSearchMatch Properties

| Property | Type | Description |
| --- | --- | --- |
| PageIndex | int | Zero-based page index where the match was found. |
| CharIndex | int | Character index of the match start within the page's text stream. |
| Length | int | Number of characters covered by the match. |
| Text | string | The actual matched text as extracted from the PDF. |
| Rectangles | IReadOnlyList<RectangleF> | Bounding rectangles in page coordinates. Multi-line matches have one rectangle per visual line. |

Hyperlink Extraction

Extract URLs detected by PDFium's web link analysis engine:

using var page = doc.GetPage(0);

foreach (var link in page.GetLinks())
{
    Console.WriteLine($"URL: {link.Url}");
    Console.WriteLine($"  Starts at char {link.StartCharIndex}, spans {link.CharCount} characters");
}

PdfiumLink Properties

| Property | Type | Description |
| --- | --- | --- |
| Url | string | The URL string (e.g. "https://example.com"). |
| StartCharIndex | int | Zero-based character index where the link text starts. |
| CharCount | int | Number of characters in the link text. |

Image Extraction

Extract all embedded images from the document or a specific page:

// All images from the entire document
int imageNumber = 0;
foreach (var img in doc.GetAllImages())
{
    Console.WriteLine($"Size: {img.Width}x{img.Height}");
    Console.WriteLine($"Page bounds: {img.BoundsOnPage}");

    // Save as PNG; the counter keeps same-sized images from overwriting each other
    File.WriteAllBytes($"image_{imageNumber++}_{img.Width}x{img.Height}.png", img.ToPng());
}

// Images from a specific page
var pageImages = doc.GetImages(pageIndex: 0);
int pageImageNumber = 0;
foreach (var img in pageImages)
{
    byte[] jpeg = img.ToJpeg(quality: 85);
    File.WriteAllBytes($"page0_image_{pageImageNumber++}.jpg", jpeg);
}

PdfiumExtractedImage Properties

| Property | Type | Description |
| --- | --- | --- |
| Width | int | Pixel width of the rendered image. |
| Height | int | Pixel height of the rendered image. |
| BoundsOnPage | RectangleF | On-page bounds in PostScript points (may differ from pixel dimensions due to scaling). |
| PixelsBgra32 | byte[] | Decoded BGRA32 pixel buffer. |
| Stride | int | Row pitch in bytes. |

| Method | Returns | Description |
| --- | --- | --- |
| ToPng() | byte[] | Encode as PNG (lossless) via Skia. |
| ToJpeg(quality) | byte[] | Encode as JPEG at the given quality (1-100, default 85) via Skia. |
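
When working with the raw buffer instead of the encoded output, the row pitch matters: a row may occupy more than Width * 4 bytes. The sketch below shows the usual indexing arithmetic on a hand-made buffer. GetPixel is a hypothetical helper, and it assumes that each pixel is stored as blue, green, red, alpha bytes in that order and that rows run top to bottom.

```csharp
using System;

// Reads one pixel from a BGRA32 buffer. `stride` is the row pitch in bytes,
// which may be larger than width * 4 when rows are padded.
static (byte R, byte G, byte B, byte A) GetPixel(byte[] bgra, int stride, int x, int y)
{
    int offset = y * stride + x * 4;   // 4 bytes per pixel: B, G, R, A
    return (bgra[offset + 2], bgra[offset + 1], bgra[offset], bgra[offset + 3]);
}

// Stand-in data: a 2x1 image with 4 bytes of row padding (stride 12).
byte[] pixels = { 255, 0, 0, 255,   0, 0, 255, 128,   0, 0, 0, 0 };
var p = GetPixel(pixels, stride: 12, x: 1, y: 0);
Console.WriteLine($"R={p.R} G={p.G} B={p.B} A={p.A}");
// prints: R=255 G=0 B=0 A=128
```

With an extracted image, the arguments would be img.PixelsBgra32 and img.Stride. Multiplying by Stride instead of Width * 4 keeps the lookup correct whether or not the rows are padded.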

Bookmarks

Read the document's bookmark (outline) tree:

foreach (var bm in doc.GetBookmarks())
{
    PrintBookmark(bm, indent: 0);
}

void PrintBookmark(PdfiumBookmark bm, int indent)
{
    Console.WriteLine($"{new string(' ', indent * 2)}{bm.Title} -> Page {bm.PageIndex + 1}");
    foreach (var child in bm.Children)
        PrintBookmark(child, indent + 1);
}

PdfiumBookmark Properties

| Property | Type | Description |
| --- | --- | --- |
| Title | string | Display title shown in PDF viewers. |
| PageIndex | int | Zero-based target page index, or -1 if the destination could not be resolved. |
| Children | IReadOnlyList<PdfiumBookmark> | Nested child bookmarks. Empty for leaf entries. |

Attachments

Read embedded file attachments (common in ZUGFeRD/Factur-X invoices):

foreach (var att in doc.GetAttachments())
{
    Console.WriteLine($"Name: {att.Name}");
    Console.WriteLine($"MIME: {att.MimeType}");
    Console.WriteLine($"Desc: {att.Description}");
    Console.WriteLine($"Size: {att.Length} bytes");

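    // Extract the payload. The name is read from the PDF, so strip any
    // directory components before using it as a local file name.
    byte[] data = att.GetBytes();
    File.WriteAllBytes(Path.GetFileName(att.Name), data);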
}

PdfiumAttachment Properties

| Property | Type | Description |
| --- | --- | --- |
| Name | string | Attachment file name as recorded in the PDF. |
| MimeType | string? | MIME type (e.g. text/xml). Null if absent. |
| Description | string? | Optional description from the /Desc entry. |
| Length | int | Payload length in bytes. |

| Method | Returns | Description |
| --- | --- | --- |
| GetBytes() | byte[] | Extract the attachment payload. Cached after the first call. |

Asymmetric routing: Attachments are written via qpdf (DocumentOperation.AddAttachment) but read via PDFium. This split is by design — qpdf's write path includes the /AFRelationship support needed for PDF/A-3 compliance, while PDFium has the cleanest read API.

Signatures

Enumerate digital signatures and their metadata:

foreach (var sig in doc.GetSignatures())
{
    Console.WriteLine($"SubFilter:    {sig.SubFilter}");
    Console.WriteLine($"Reason:       {sig.Reason}");
    Console.WriteLine($"Signing time: {sig.SigningTime}");
    Console.WriteLine($"Contents size: {sig.Contents.Length} bytes");
}

For cryptographic verification, see the Digital Signing page.

Complete Inspection Example

A comprehensive example that extracts all available information from a PDF:

using System;
using System.IO;
using FolioPDF.Toolkit.Pdfium;

byte[] pdfBytes = File.ReadAllBytes("document.pdf");
using var doc = PdfiumDocument.Load(pdfBytes);

// Document properties
Console.WriteLine($"=== Document Properties ===");
Console.WriteLine($"Pages:     {doc.PageCount}");
Console.WriteLine($"Encrypted: {doc.IsEncrypted}");
Console.WriteLine($"Version:   {doc.Version}");

// Metadata
var meta = doc.GetMetadata();
Console.WriteLine($"\n=== Metadata ===");
Console.WriteLine($"Title:  {meta.Title ?? "(none)"}");
Console.WriteLine($"Author: {meta.Author ?? "(none)"}");

// Bookmarks
Console.WriteLine($"\n=== Bookmarks ({doc.BookmarkCount}) ===");
foreach (var bm in doc.GetBookmarks())
    Console.WriteLine($"  {bm.Title} -> Page {bm.PageIndex + 1}");

// Attachments
Console.WriteLine($"\n=== Attachments ===");
foreach (var att in doc.GetAttachments())
    Console.WriteLine($"  {att.Name} ({att.Length} bytes, {att.MimeType})");

// Signatures
Console.WriteLine($"\n=== Signatures ({doc.SignatureCount}) ===");
foreach (var sig in doc.GetSignatures())
{
    Console.WriteLine($"  {sig.SubFilter} - {sig.Reason}");
    var result = PdfiumSignatureVerifier.Verify(sig, pdfBytes);
    Console.WriteLine($"    Valid: {result.IsValid}");
}

// Text from each page
Console.WriteLine($"\n=== Text ===");
for (int i = 0; i < doc.PageCount; i++)
{
    using var page = doc.GetPage(i);
    var text = page.ExtractText();
    Console.WriteLine($"--- Page {i + 1} ({page.Width:F0}x{page.Height:F0} pts) ---");
    Console.WriteLine(text.Length > 200 ? text[..200] + "..." : text);

    // Links on this page
    foreach (var link in page.GetLinks())
        Console.WriteLine($"  Link: {link.Url}");
}

// Images
Console.WriteLine($"\n=== Images ===");
foreach (var img in doc.GetAllImages())
    Console.WriteLine($"  {img.Width}x{img.Height} pixels at {img.BoundsOnPage}");

Page Properties

Each PdfiumPage exposes basic geometry:

| Property | Type | Description |
| --- | --- | --- |
| Index | int | Zero-based page index. |
| Width | float | Page width in PostScript points (1 pt = 1/72 inch). |
| Height | float | Page height in PostScript points. |
| Rotation | int | Page rotation in degrees: 0, 90, 180, or 270. |

Thread Safety

All PdfiumDocument and PdfiumPage operations route through a process-global lock. Multiple .NET threads can safely call inspection methods on the same or different documents concurrently; the calls are serialized internally, so they do not run in parallel. The lock is reentrant on the same thread, so callbacks cannot deadlock.
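
The locking behaviour can be pictured with an ordinary C# lock. The sketch below only illustrates the pattern, assuming a single shared gate object. It does not show FolioPDF's internal code, and gate, Outer, and Inner are hypothetical names.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

object gate = new object();   // stand-in for the process-global lock
int inside = 0, maxInside = 0;

void Outer()
{
    lock (gate)               // C# lock (Monitor) is reentrant on the same thread
    {
        int now = Interlocked.Increment(ref inside);
        maxInside = Math.Max(maxInside, now);
        Inner();              // re-enters the same lock without deadlocking
        Thread.Sleep(1);      // simulate some work while holding the lock
        Interlocked.Decrement(ref inside);
    }
}

void Inner()
{
    lock (gate) { /* a nested call into the native library would go here */ }
}

// Sixteen concurrent callers, but only one is ever inside the lock.
Parallel.For(0, 16, _ => Outer());
Console.WriteLine($"Max threads inside the lock at once: {maxInside}");
// prints: Max threads inside the lock at once: 1
```

The practical consequence is that spreading inspection calls across threads is safe but does not by itself make them faster, since each call waits its turn at the lock.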