Text Extraction
Extract text, search content, read metadata, inspect bookmarks, retrieve attachments, and examine signatures from existing PDF documents using the PdfiumDocument inspection API.
Overview
FolioPDF's inspection API is powered by PDFium — the same PDF engine embedded in Chromium. It can parse and extract data from any valid PDF, regardless of which tool created it. The entry point is PdfiumDocument, which wraps a native PDFium document handle and provides typed .NET APIs for every inspection operation.
using FolioPDF.Toolkit.Pdfium;
using var doc = PdfiumDocument.Load(File.ReadAllBytes("report.pdf"));
Console.WriteLine($"Pages: {doc.PageCount}");
Loading Documents
From Byte Array
byte[] pdfBytes = File.ReadAllBytes("report.pdf");
using var doc = PdfiumDocument.Load(pdfBytes);
From File Path
using var doc = PdfiumDocument.LoadFile("report.pdf");
With Password (Encrypted PDFs)
// From a byte array
using var docFromBytes = PdfiumDocument.Load(pdfBytes, password: "secret");
// or from a file path
using var docFromFile = PdfiumDocument.LoadFile("encrypted.pdf", password: "secret");
Important: PdfiumDocument implements IDisposable. Always wrap it in a using statement to ensure the native handle and pinned buffer are released.
Document Properties
Basic properties are available directly on the document object:
using var doc = PdfiumDocument.Load(pdfBytes);
Console.WriteLine($"Page count: {doc.PageCount}");
Console.WriteLine($"Is encrypted: {doc.IsEncrypted}");
Console.WriteLine($"PDF version: {doc.Version}");
Console.WriteLine($"Permissions: {doc.Permissions}");
| Property | Type | Description |
|---|---|---|
| PageCount | int | Total number of pages in the document. |
| IsEncrypted | bool | Whether the document is password-protected. |
| Version | PdfFileVersion | PDF version (e.g. 1.4, 1.7, 2.0). |
| Permissions | PdfPermissions | Security permissions (printing, copying, modification, etc.). |
| SignatureCount | int | Number of digital signatures in the document. |
| HasSignatures | bool | Convenience: SignatureCount > 0. |
| BookmarkCount | int | Number of top-level bookmarks. |
Metadata
Read the standard PDF metadata fields (Info dictionary):
var meta = doc.GetMetadata();
Console.WriteLine($"Title: {meta.Title}");
Console.WriteLine($"Author: {meta.Author}");
Console.WriteLine($"Subject: {meta.Subject}");
Console.WriteLine($"Keywords: {meta.Keywords}");
Console.WriteLine($"Creator: {meta.Creator}");
Console.WriteLine($"Producer: {meta.Producer}");
Console.WriteLine($"Created: {meta.CreationDate}");
Console.WriteLine($"Modified: {meta.ModDate}");
PdfiumDocumentMetadata Fields
| Property | Type | PDF Key |
|---|---|---|
| Title | string? | /Title |
| Author | string? | /Author |
| Subject | string? | /Subject |
| Keywords | string? | /Keywords |
| Creator | string? | /Creator |
| Producer | string? | /Producer |
| CreationDate | string? | /CreationDate |
| ModDate | string? | /ModDate |
Date fields use the raw PDF date string format (e.g. D:20260407120000+02'00') and are returned verbatim because real-world PDFs sometimes contain malformed dates.
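Because the dates arrive as raw strings, callers that need a real timestamp have to parse them. The following is a minimal sketch of a tolerant parser, assuming the common D:YYYYMMDDHHmmSS+HH'mm' layout with optional trailing fields. TryParsePdfDate is a hypothetical helper, not part of FolioPDF, and it returns null rather than throwing when the string is malformed.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Example: parse the sample value shown above.
var created = TryParsePdfDate("D:20260407120000+02'00'");
Console.WriteLine(created?.ToString("o") ?? "(unparseable)");

// Hypothetical helper (not a FolioPDF API): converts a raw PDF date string
// into a DateTimeOffset, or returns null when the string is malformed.
static DateTimeOffset? TryParsePdfDate(string? raw)
{
    if (string.IsNullOrWhiteSpace(raw)) return null;

    // D:YYYY[MM[DD[HH[mm[SS]]]]] followed by an optional Z or +HH'mm' / -HH'mm'
    var m = Regex.Match(raw.Trim(),
        @"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$");
    if (!m.Success) return null;

    // Missing fields fall back to 1 for month/day and 0 for time parts.
    int Part(int group, int fallback) =>
        m.Groups[group].Success
            ? int.Parse(m.Groups[group].Value, CultureInfo.InvariantCulture)
            : fallback;

    var offset = TimeSpan.Zero;
    if (m.Groups[8].Success)
    {
        offset = new TimeSpan(Part(9, 0), Part(10, 0), 0);
        if (m.Groups[8].Value == "-") offset = -offset;
    }

    try
    {
        return new DateTimeOffset(Part(1, 1), Part(2, 1), Part(3, 1),
                                  Part(4, 0), Part(5, 0), Part(6, 0), offset);
    }
    catch (ArgumentException)
    {
        // Out-of-range values (month 13, oversized offset, ...) count as malformed.
        return null;
    }
}
```

Keeping the raw string next to the parsed value is usually worthwhile, since a null result only says the date did not fit this pattern.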
Page-Level Text Extraction
Extract all text from a page in reading order:
using var page = doc.GetPage(0);
string text = page.ExtractText();
Console.WriteLine(text);
Extract text from every page:
for (int i = 0; i < doc.PageCount; i++)
{
    using var page = doc.GetPage(i);
    string text = page.ExtractText();
    Console.WriteLine($"--- Page {i + 1} ---");
    Console.WriteLine(text);
}
Note: ExtractText() returns an empty string for scanned, image-only pages that lack an OCR text layer. Extraction is most reliable when the page's fonts include ToUnicode CMaps; without that mapping, the extracted characters may be wrong or unreadable.
Text Runs with Bounding Boxes
For position-aware extraction, use ExtractTextRuns(), which returns rectangular runs of text with their page coordinates:
using var page = doc.GetPage(0);
foreach (var run in page.ExtractTextRuns())
{
    Console.WriteLine($"Text: \"{run.Text}\"");
    Console.WriteLine($" Bounds: X={run.Bounds.X:F1}, Y={run.Bounds.Y:F1}, " +
        $"W={run.Bounds.Width:F1}, H={run.Bounds.Height:F1}");
}
Each PdfiumTextRun carries:
| Property | Type | Description |
|---|---|---|
| Text | string | The text content in reading order. |
| Bounds | RectangleF | Bounding rectangle in PDF page coordinates (PostScript points, origin at lower-left). |
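Page coordinates start at the lower-left corner, while most image and UI frameworks start at the top-left, so drawing a run's bounds onto a rendered page needs a Y flip and a scale. The sketch below shows that arithmetic. ToPixelRect is a hypothetical helper, and it assumes that the rectangle's (x, y) is its lower-left corner, which should be checked against the actual Bounds values.

```csharp
using System;

// Hypothetical helper: maps a rectangle from PDF page space (points, origin at
// lower-left, Y grows upward) to pixel space (origin at top-left, Y grows downward).
// Assumption: (x, y) is the rectangle's lower-left corner.
static (float X, float Y, float Width, float Height) ToPixelRect(
    float x, float y, float width, float height, float pageHeightPts, float dpi)
{
    float scale = dpi / 72f;                    // 72 points per inch
    float top = pageHeightPts - (y + height);   // distance from page top to rect top
    return (x * scale, top * scale, width * scale, height * scale);
}

// A 144 x 12 pt run near the top of a 792 pt tall page, rendered at 144 DPI.
var px = ToPixelRect(x: 72f, y: 720f, width: 144f, height: 12f, pageHeightPts: 792f, dpi: 144f);
Console.WriteLine(px);   // (144, 120, 288, 24)
```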
Character-Level Positions
For fine-grained positioning (OCR comparison, custom layout analysis), get individual character bounding boxes:
using var page = doc.GetPage(0);
foreach (var ch in page.GetCharacterBoxes())
{
    Console.WriteLine($"[{ch.CharIndex}] '{ch.Character}' " +
        $"L={ch.Left:F1} R={ch.Right:F1} B={ch.Bottom:F1} T={ch.Top:F1} " +
        $"({ch.Width:F1} x {ch.Height:F1})");
}
PdfiumCharBox Properties
| Property | Type | Description |
|---|---|---|
| CharIndex | int | Zero-based character index on the page. |
| Character | char | The Unicode character. |
| Left | float | Left edge in PDF points. |
| Right | float | Right edge in PDF points. |
| Bottom | float | Bottom edge in PDF points. |
| Top | float | Top edge in PDF points. |
| Width | float | Computed: Right - Left. |
| Height | float | Computed: Top - Bottom. |
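As one illustration of custom layout analysis, character boxes can be clustered into visual lines by comparing their bottom edges. This is a simplified sketch, not a FolioPDF feature: GroupIntoLines is hypothetical, the tuples stand in for the Character, Left, and Bottom fields of PdfiumCharBox, and the 2 pt tolerance is an arbitrary assumption that will not suit every font size.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: groups character boxes into lines by bottom-edge proximity.
static List<string> GroupIntoLines(
    IEnumerable<(char Character, float Left, float Bottom)> boxes, float tolerance = 2f)
{
    var lines = new List<(float Bottom, List<(char Character, float Left)> Chars)>();
    foreach (var box in boxes)
    {
        // Reuse an existing line whose bottom edge is within the tolerance.
        int i = lines.FindIndex(l => Math.Abs(l.Bottom - box.Bottom) <= tolerance);
        if (i < 0)
        {
            lines.Add((box.Bottom, new List<(char, float)>()));
            i = lines.Count - 1;
        }
        lines[i].Chars.Add((box.Character, box.Left));
    }

    // A larger Bottom value is closer to the top of the page, so sort lines
    // descending, then order each line's characters left to right.
    return lines
        .OrderByDescending(l => l.Bottom)
        .Select(l => new string(l.Chars.OrderBy(c => c.Left).Select(c => c.Character).ToArray()))
        .ToList();
}

// Stand-in data, deliberately out of order.
var sample = new[]
{
    ('k', 22f, 680f), ('H', 10f, 700f), ('o', 16f, 680f), ('i', 16f, 700.4f),
};
var textLines = GroupIntoLines(sample);
Console.WriteLine(string.Join(" / ", textLines));   // Hi / ok
```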
Document Search
Search for text across all pages in the document:
foreach (var match in doc.Search("confidential"))
{
    Console.WriteLine($"Page {match.PageIndex + 1}: \"{match.Text}\" " +
        $"at char {match.CharIndex} (length {match.Length})");
    foreach (var rect in match.Rectangles)
        Console.WriteLine($" Rect: ({rect.X:F1}, {rect.Y:F1}, {rect.Width:F1}, {rect.Height:F1})");
}
Page-Level Search
Search within a single page:
using var page = doc.GetPage(0);
foreach (var match in page.FindText("invoice"))
{
    Console.WriteLine($"\"{match.Text}\" at char {match.CharIndex}");
}
Search Options
// Case-sensitive search
var caseSensitive = doc.Search("FolioPDF", new PdfiumSearchOptions { CaseSensitive = true });
// Whole-word search (won't match "informant" when searching for "form")
var wholeWord = doc.Search("form", PdfiumSearchOptions.WholeWord);
// Convenience presets
var substring = doc.Search("term", PdfiumSearchOptions.Default); // case-insensitive substring
var exactCase = doc.Search("TERM", PdfiumSearchOptions.CaseSensitiveDefault); // case-sensitive substring
PdfiumSearchOptions Properties
| Property | Type | Default | Description |
|---|---|---|---|
| CaseSensitive | bool | false | When true, uppercase and lowercase letters must match exactly. |
| MatchWholeWord | bool | false | When true, matches only when bounded by non-letter/non-digit characters on both sides. |
| MatchConsecutive | bool | false | When true, overlapping matches are reported separately. |
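To make the three flags concrete, here is a plain-string model of the behaviour the table describes. It is only an illustration under that reading of the table: FindAll is a hypothetical function, and it is not how FolioPDF or PDFium implements search.

```csharp
using System;
using System.Collections.Generic;

// Illustrative model only (not FolioPDF or PDFium code): returns match start indices.
static List<int> FindAll(string text, string term,
    bool caseSensitive = false, bool wholeWord = false, bool consecutive = false)
{
    var hits = new List<int>();
    if (term.Length == 0) return hits;
    var cmp = caseSensitive ? StringComparison.Ordinal : StringComparison.OrdinalIgnoreCase;

    int pos = 0;
    while (pos <= text.Length - term.Length)
    {
        int i = text.IndexOf(term, pos, cmp);
        if (i < 0) break;

        // Whole-word: the neighbours on both sides must not be letters or digits.
        int end = i + term.Length;
        bool bounded = (i == 0 || !char.IsLetterOrDigit(text[i - 1]))
                    && (end == text.Length || !char.IsLetterOrDigit(text[end]));

        bool accepted = !wholeWord || bounded;
        if (accepted) hits.Add(i);

        // Overlapping mode resumes one character later; otherwise skip past an accepted match.
        pos = accepted && !consecutive ? end : i + 1;
    }
    return hits;
}

Console.WriteLine(FindAll("aaaa", "aa").Count);                     // 2
Console.WriteLine(FindAll("aaaa", "aa", consecutive: true).Count);  // 3
Console.WriteLine(FindAll("The informant signed the form.", "form", wholeWord: true).Count); // 1
```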
PdfiumSearchMatch Properties
| Property | Type | Description |
|---|---|---|
| PageIndex | int | Zero-based page index where the match was found. |
| CharIndex | int | Character index of the match start within the page's text stream. |
| Length | int | Number of characters covered by the match. |
| Text | string | The actual matched text as extracted from the PDF. |
| Rectangles | IReadOnlyList<RectangleF> | Bounding rectangles in page coordinates. Multi-line matches have one rectangle per visual line. |
Hyperlink Extraction
Extract URLs detected by PDFium's web link analysis engine:
using var page = doc.GetPage(0);
foreach (var link in page.GetLinks())
{
    Console.WriteLine($"URL: {link.Url}");
    Console.WriteLine($" Starts at char {link.StartCharIndex}, spans {link.CharCount} characters");
}
PdfiumLink Properties
| Property | Type | Description |
|---|---|---|
| Url | string | The URL string (e.g. "https://example.com"). |
| StartCharIndex | int | Zero-based character index where the link text starts. |
| CharCount | int | Number of characters in the link text. |
Image Extraction
Extract all embedded images from the document or a specific page:
// All images from the entire document
int imageNumber = 0;
foreach (var img in doc.GetAllImages())
{
    Console.WriteLine($"Size: {img.Width}x{img.Height}");
    Console.WriteLine($"Page bounds: {img.BoundsOnPage}");
    // Save as PNG; the counter keeps same-sized images from overwriting each other
    File.WriteAllBytes($"image_{imageNumber++}_{img.Width}x{img.Height}.png", img.ToPng());
}
// Images from a specific page
var pageImages = doc.GetImages(pageIndex: 0);
int pageImageNumber = 0;
foreach (var img in pageImages)
{
    byte[] jpeg = img.ToJpeg(quality: 85);
    // Number the files so every image gets its own output file
    File.WriteAllBytes($"page0_image_{pageImageNumber++}.jpg", jpeg);
}
PdfiumExtractedImage Properties
| Property | Type | Description |
|---|---|---|
| Width | int | Pixel width of the rendered image. |
| Height | int | Pixel height of the rendered image. |
| BoundsOnPage | RectangleF | On-page bounds in PostScript points (may differ from pixel dimensions due to scaling). |
| PixelsBgra32 | byte[] | Decoded BGRA32 pixel buffer. |
| Stride | int | Row pitch in bytes. |
| Method | Returns | Description |
|---|---|---|
| ToPng() | byte[] | Encode as PNG (lossless) via Skia. |
| ToJpeg(quality) | byte[] | Encode as JPEG at the given quality (1-100, default 85) via Skia. |
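When working with the raw pixel buffer instead of the encoders, the row pitch matters: a row can be wider than Width * 4 bytes if it is padded. The sketch below shows how a single pixel would be addressed in a BGRA32 buffer. GetPixel is a hypothetical helper, and the sample buffer is made up for the example.

```csharp
using System;

// Hypothetical helper: reads one pixel from a BGRA32 buffer.
// Rows may be padded, so the row offset uses stride rather than width * 4.
static (byte R, byte G, byte B, byte A) GetPixel(byte[] bgra, int stride, int x, int y)
{
    int offset = y * stride + x * 4;   // 4 bytes per pixel, stored as B, G, R, A
    return (bgra[offset + 2], bgra[offset + 1], bgra[offset], bgra[offset + 3]);
}

// A 2x1 image with 4 bytes of row padding (stride 12 instead of 8):
// pixel 0 is opaque blue, pixel 1 is half-transparent red.
byte[] pixels = { 255, 0, 0, 255,   0, 0, 255, 128,   0, 0, 0, 0 };
var second = GetPixel(pixels, stride: 12, x: 1, y: 0);
Console.WriteLine(second);   // (255, 0, 0, 128)
```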
Bookmarks
Read the document's bookmark (outline) tree:
foreach (var bm in doc.GetBookmarks())
{
    PrintBookmark(bm, indent: 0);
}
void PrintBookmark(PdfiumBookmark bm, int indent)
{
    // PageIndex is -1 when the destination could not be resolved
    string target = bm.PageIndex >= 0 ? $"Page {bm.PageIndex + 1}" : "(unresolved)";
    Console.WriteLine($"{new string(' ', indent * 2)}{bm.Title} -> {target}");
    foreach (var child in bm.Children)
        PrintBookmark(child, indent + 1);
}
PdfiumBookmark Properties
| Property | Type | Description |
|---|---|---|
| Title | string | Display title shown in PDF viewers. |
| PageIndex | int | Zero-based target page index, or -1 if the destination could not be resolved. |
| Children | IReadOnlyList<PdfiumBookmark> | Nested child bookmarks. Empty for leaf entries. |
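If a flat table of contents is more convenient than a recursive print, the tree can be walked depth-first into a list of (node, depth) pairs. The sketch below is generic so that it runs without the library: Flatten is a hypothetical helper, and the dictionary is stand-in data. With FolioPDF the child selector would presumably be bm => bm.Children.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical generic helper: depth-first walk over any tree.
static IEnumerable<(T Node, int Depth)> Flatten<T>(
    IEnumerable<T> roots, Func<T, IEnumerable<T>> children, int depth = 0)
{
    foreach (var node in roots)
    {
        yield return (node, depth);
        foreach (var item in Flatten(children(node), children, depth + 1))
            yield return item;
    }
}

// Stand-in outline: title -> child titles.
var outline = new Dictionary<string, string[]>
{
    ["Intro"] = new[] { "Scope" },
    ["Scope"] = Array.Empty<string>(),
    ["Appendix"] = Array.Empty<string>(),
};
var flat = Flatten(new[] { "Intro", "Appendix" }, t => outline[t]).ToList();
foreach (var (title, level) in flat)
    Console.WriteLine($"{new string(' ', level * 2)}{title}");
```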
Attachments
Read embedded file attachments (common in ZUGFeRD/Factur-X invoices):
foreach (var att in doc.GetAttachments())
{
    Console.WriteLine($"Name: {att.Name}");
    Console.WriteLine($"MIME: {att.MimeType}");
    Console.WriteLine($"Desc: {att.Description}");
    Console.WriteLine($"Size: {att.Length} bytes");
    // Extract the payload
    byte[] data = att.GetBytes();
    // The name comes from the PDF, so strip any directory part before writing to disk
    File.WriteAllBytes(Path.GetFileName(att.Name), data);
}
PdfiumAttachment Properties
| Property | Type | Description |
|---|---|---|
| Name | string | Attachment file name as recorded in the PDF. |
| MimeType | string? | MIME type (e.g. text/xml). Null if absent. |
| Description | string? | Optional description from the /Desc entry. |
| Length | int | Payload length in bytes. |
| Method | Returns | Description |
|---|---|---|
| GetBytes() | byte[] | Extract the attachment payload. Cached after the first call. |
Asymmetric routing: Attachments are written via qpdf (DocumentOperation.AddAttachment) but read via PDFium. This split is by design — qpdf's write path includes the /AFRelationship support needed for PDF/A-3 compliance, while PDFium has the cleanest read API.
Signatures
Enumerate digital signatures and their metadata:
foreach (var sig in doc.GetSignatures())
{
    Console.WriteLine($"SubFilter: {sig.SubFilter}");
    Console.WriteLine($"Reason: {sig.Reason}");
    Console.WriteLine($"Signing time: {sig.SigningTime}");
    Console.WriteLine($"Contents size: {sig.Contents.Length} bytes");
}
For cryptographic verification, see the Digital Signing page.
Complete Inspection Example
A comprehensive example that extracts all available information from a PDF:
using FolioPDF.Toolkit.Pdfium;
byte[] pdfBytes = File.ReadAllBytes("document.pdf");
using var doc = PdfiumDocument.Load(pdfBytes);
// Document properties
Console.WriteLine($"=== Document Properties ===");
Console.WriteLine($"Pages: {doc.PageCount}");
Console.WriteLine($"Encrypted: {doc.IsEncrypted}");
Console.WriteLine($"Version: {doc.Version}");
// Metadata
var meta = doc.GetMetadata();
Console.WriteLine($"\n=== Metadata ===");
Console.WriteLine($"Title: {meta.Title ?? "(none)"}");
Console.WriteLine($"Author: {meta.Author ?? "(none)"}");
// Bookmarks
Console.WriteLine($"\n=== Bookmarks ({doc.BookmarkCount}) ===");
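foreach (var bm in doc.GetBookmarks())
{
    // PageIndex is -1 when the destination could not be resolved
    string target = bm.PageIndex >= 0 ? $"Page {bm.PageIndex + 1}" : "(unresolved)";
    Console.WriteLine($" {bm.Title} -> {target}");
}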
// Attachments
Console.WriteLine($"\n=== Attachments ===");
foreach (var att in doc.GetAttachments())
    Console.WriteLine($" {att.Name} ({att.Length} bytes, {att.MimeType})");
// Signatures
Console.WriteLine($"\n=== Signatures ({doc.SignatureCount}) ===");
foreach (var sig in doc.GetSignatures())
{
    Console.WriteLine($" {sig.SubFilter} - {sig.Reason}");
    var result = PdfiumSignatureVerifier.Verify(sig, pdfBytes);
    Console.WriteLine($" Valid: {result.IsValid}");
}
// Text from each page
Console.WriteLine($"\n=== Text ===");
for (int i = 0; i < doc.PageCount; i++)
{
    using var page = doc.GetPage(i);
    var text = page.ExtractText();
    Console.WriteLine($"--- Page {i + 1} ({page.Width:F0}x{page.Height:F0} pts) ---");
    Console.WriteLine(text.Length > 200 ? text[..200] + "..." : text);
    // Links on this page
    foreach (var link in page.GetLinks())
        Console.WriteLine($" Link: {link.Url}");
}
// Images
Console.WriteLine($"\n=== Images ===");
foreach (var img in doc.GetAllImages())
    Console.WriteLine($" {img.Width}x{img.Height} pixels at {img.BoundsOnPage}");
Page Properties
Each PdfiumPage exposes basic geometry:
| Property | Type | Description |
|---|---|---|
| Index | int | Zero-based page index. |
| Width | float | Page width in PostScript points (1 pt = 1/72 inch). |
| Height | float | Page height in PostScript points. |
| Rotation | int | Page rotation in degrees: 0, 90, 180, or 270. |
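Since Width and Height are in points, reporting a page size in physical units or sizing a raster target takes a small conversion. The helpers below are hypothetical and use only the fixed relations 1 pt = 1/72 inch and 1 inch = 25.4 mm. The 612 x 792 pt page is the standard US Letter size, used here as sample input.

```csharp
using System;

// Hypothetical unit-conversion helpers (not FolioPDF APIs).
static double PointsToMillimetres(double points) => points / 72.0 * 25.4;
static int PointsToPixels(double points, double dpi) => (int)Math.Round(points * dpi / 72.0);

// US Letter: 612 x 792 pt, which is 8.5 x 11 inches.
double widthMm = PointsToMillimetres(612);
double heightMm = PointsToMillimetres(792);
Console.WriteLine($"{widthMm:F1} x {heightMm:F1} mm");
Console.WriteLine($"{PointsToPixels(612, 300)} x {PointsToPixels(792, 300)} px at 300 DPI");
```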
Thread Safety
All PdfiumDocument and PdfiumPage operations route through a process-global lock. Multiple .NET threads can safely call inspection methods on the same or different documents concurrently, but the calls serialize internally, so adding threads does not make PDFium work run in parallel. The lock is reentrant on the same thread, so callbacks that call back into the API cannot deadlock on it.
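The locking model can be pictured with a small stand-alone program. This is only a model built on the C# lock statement, which is reentrant for the thread that holds it. The gate object and the Inspect function are hypothetical and say nothing about how FolioPDF implements its lock.

```csharp
using System;
using System.Threading.Tasks;

// Illustrative model only, not FolioPDF code: one process-wide gate object.
object gate = new object();
int calls = 0;

void Inspect(int depth)
{
    lock (gate)
    {
        calls++;                             // safe: only one thread is inside at a time
        if (depth > 0) Inspect(depth - 1);   // re-enters the lock on the same thread
    }
}

// 100 parallel callers, each re-entering once; every call serializes on the gate.
Parallel.For(0, 100, _ => Inspect(1));
Console.WriteLine(calls);   // 200
```

The count is exact because the increments never interleave, and the nested call returns instead of blocking because the owning thread is allowed to take the lock again.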