pdftools_sdk.ocr.processor

Module Attributes

WarningFunc

Event for warnings occurring during OCR processing

Classes

Processor()

Process PDF documents with OCR

pdftools_sdk.ocr.processor.WarningFunc

Event for warnings occurring during OCR processing

Non-critical issues during processing are reported via this event. Review the pdftools_sdk.ocr.warning_category.WarningCategory of each warning and handle it where the application requires.

Parameters:
  • message (str) – The message describing the warning

  • category (pdftools_sdk.ocr.warning_category.WarningCategory) – The category of the warning

  • pageNo (int) – The page number this warning is associated with, or 0 if not page-specific

  • context (str) – A description of the context where the warning occurred

alias of Callable[[str, WarningCategory, int, str], None]
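As a minimal sketch of a callable that fits this alias, the handler below collects each warning into a list for later inspection. The names OcrWarning, collected, on_warning and describe are hypothetical and not part of the SDK; the category is kept as an opaque value here so the snippet runs without the SDK installed.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class OcrWarning:
    """Hypothetical record type holding the four arguments of one warning."""
    message: str
    category: Any  # a WarningCategory value at runtime; treated as opaque here
    page_no: int   # 0 means the warning is not tied to a specific page
    context: str


collected: List[OcrWarning] = []


def on_warning(message: str, category: Any, page_no: int, context: str) -> None:
    """Record a warning; same shape as Callable[[str, WarningCategory, int, str], None]."""
    collected.append(OcrWarning(message, category, page_no, context))


def describe(w: OcrWarning) -> str:
    """Format a collected warning, treating page 0 as document-level."""
    where = f"page {w.page_no}" if w.page_no > 0 else "document"
    return f"[{w.category}] {where}: {w.message} ({w.context})"
```

Collecting warnings instead of printing them lets the application decide after processing which categories matter.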

class pdftools_sdk.ocr.processor.Processor

Bases: _NativeObject

Process PDF documents with OCR

The processor applies Optical Character Recognition (OCR) to PDF documents. It can make scanned documents searchable, fix text extraction issues and generate PDF tagging/structure.

The processor is decoupled from the document - it takes a pdftools_sdk.pdf.document.Document as input and produces a new pdftools_sdk.pdf.document.Document as output.

__init__()

process(document: Document, engine: Engine | None, out_stream: IOBase, options: OcrOptions | None = None, out_options: OutputOptions | None = None) → Document

Apply OCR to a PDF document

Process the input PDF document with OCR according to the specified options. The processed document is written to the output stream.

Non-critical processing issues are reported through the pdftools_sdk.ocr.processor.WarningFunc event. Review the pdftools_sdk.ocr.warning_category.WarningCategory of each warning and handle it where the application requires.

Parameters:
  • document (pdftools_sdk.pdf.document.Document) – The input PDF document to process

  • engine (Optional[pdftools_sdk.ocr.engine.Engine]) – The OCR engine to use for recognition. This parameter may be None for operations that do not require OCR, such as pdftools_sdk.ocr.image_processing_mode.ImageProcessingMode.REMOVETEXT. For all other modes, a valid engine must be provided.

  • out_stream (io.IOBase) – The stream to which the output PDF is written. The stream must support both random read and write access.

  • options (Optional[pdftools_sdk.ocr.ocr_options.OcrOptions]) – The OCR processing options. If None, default options are used.

  • out_options (Optional[pdftools_sdk.pdf.output_options.OutputOptions]) – The PDF output options, e.g. to encrypt the output document.

Returns:

The resulting output PDF which can be used as a new input for further processing.

Note that this object must be disposed before the output stream object (method argument out_stream).

Return type:

pdftools_sdk.pdf.document.Document
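The stream and lifetime requirements of process() can be sketched as follows. The helper ocr_to_file is hypothetical, and it assumes that the returned Document can be used in a with statement to dispose it; check this against the Document reference. The processor, document and engine are passed in rather than constructed, since their creation is outside the scope of this page.

```python
from typing import Any, Optional


def ocr_to_file(processor: Any, document: Any, engine: Optional[Any], out_path: str) -> None:
    """Hypothetical helper: run processor.process() and write the result to out_path."""
    # "w+b" opens the file for reading and writing with seeking, which covers
    # the "random read and write access" requirement on out_stream.
    with open(out_path, "w+b") as out_stream:
        result = processor.process(document, engine, out_stream)
        # Assumption: the returned Document is a context manager. Nesting it
        # inside the file's with-block disposes it before the stream closes.
        with result:
            pass  # further processing of the output document would go here
```

Nesting the two with blocks makes the required disposal order hard to get wrong: the inner block always exits first.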

Raises:
add_warning_handler(handler: Callable[[str, WarningCategory, int, str], None]) → None

Add a handler for the WarningFunc event.

Parameters:

handler – Event handler. Adding a handler that is already registered has no effect.

remove_warning_handler(handler: Callable[[str, WarningCategory, int, str], None]) → None

Remove a registered handler of the WarningFunc event.

Parameters:

handler – Event handler to remove. If the handler is not registered, the call is ignored.
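One way to pair the two methods, shown here as a sketch, is a small context manager that registers a handler for the duration of a block and removes it afterwards, even if processing raises. The name warning_scope is hypothetical; the only SDK calls used are add_warning_handler and remove_warning_handler as documented above.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def warning_scope(
    processor: Any,
    handler: Callable[[str, Any, int, str], None],
) -> Iterator[None]:
    """Hypothetical helper: keep handler registered only inside the with-block."""
    processor.add_warning_handler(handler)
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions, so the handler never
        # stays registered beyond the block.
        processor.remove_warning_handler(handler)
```

A call such as `with warning_scope(processor, on_warning): ...` would then wrap the process() call, where on_warning is any callable with the WarningFunc shape.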