Class Processor


  • public class Processor
    extends NativeObject

    Process PDF documents with OCR

    The processor applies Optical Character Recognition (OCR) to PDF documents. It can make scanned documents searchable, fix text extraction issues and generate PDF tagging/structure.

    The processor is decoupled from the document: it takes a pdftools.pdf.Document as input and produces a new pdftools.pdf.Document as output.

    • Constructor Detail

      • Processor

        public Processor()
    • Method Detail

      • removeWarningListener

        public void removeWarningListener(Processor.WarningListener listener)
        Removes a registered listener for the Processor.Warning event.
        Parameters:
        listener - The listener for the Processor.Warning event to remove. If the listener is not registered, the call is ignored.
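        The removal rule above (an unregistered listener is ignored) can be illustrated with a small stand-alone sketch. WarningRegistrySketch and WarningListenerSketch are hypothetical stand-ins written for this example, not SDK types; the real Processor.WarningListener signature is not shown in this section.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class WarningRegistryDemo {
    public static void main(String[] args) {
        WarningRegistrySketch registry = new WarningRegistrySketch();
        List<String> seen = new ArrayList<>();
        WarningListenerSketch listener = seen::add;

        registry.add(listener);
        // Removing a listener that was never registered changes nothing.
        registry.remove(message -> { });
        registry.fire("low OCR confidence");

        registry.remove(listener);
        // Removing the same listener a second time is also a no-op.
        registry.remove(listener);
        registry.fire("not delivered to anyone");

        System.out.println("warnings seen: " + seen);
        System.out.println("listeners left: " + registry.size());
    }
}

// Hypothetical stand-in for Processor.WarningListener (assumed shape, not the SDK interface).
interface WarningListenerSketch {
    void warning(String message);
}

// Hypothetical registry that mimics the documented removal behaviour.
class WarningRegistrySketch {
    private final List<WarningListenerSketch> listeners = new CopyOnWriteArrayList<>();

    void add(WarningListenerSketch listener) {
        listeners.add(listener);
    }

    // List.remove returns false for an unknown element; the result is dropped,
    // so an unregistered listener is silently ignored.
    void remove(WarningListenerSketch listener) {
        listeners.remove(listener);
    }

    void fire(String message) {
        for (WarningListenerSketch listener : listeners) {
            listener.warning(message);
        }
    }

    int size() {
        return listeners.size();
    }
}
```

        With this behaviour, cleanup code can call the remove method unconditionally, without first checking whether the listener was ever added.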
      • process

        public Document process(Document document,
                                Engine engine,
                                Stream outStream)
                         throws java.io.IOException,
                                GenericException,
                                LicenseException,
                                CorruptException,
                                PasswordException,
                                ConformanceException,
                                UnsupportedFeatureException,
                                ProcessingException

        Apply OCR to a PDF document

        Process the input PDF document with OCR. This overload takes no options argument. The processed document is written to the output stream.

        Non-critical processing issues raise a Processor.Warning event, which is reported to registered Processor.WarningListener instances. Review the reported WarningCategory values and handle them where the application requires it.

        Parameters:
        document - The input PDF document to process
        engine - The OCR engine to use for recognition. This parameter may be null for operations that do not require OCR, such as ImageProcessingMode.REMOVE_TEXT. For all other modes, a valid engine must be provided.
        outStream - The stream to which the output PDF is written. The stream must support both random read and write access.
        Returns:

        The resulting output PDF, which can be used as new input for further processing.

        Note that the returned object must be disposed before the output stream object (method argument outStream) is disposed.

        Throws:
        LicenseException - The license check has failed.
        java.io.IOException - Writing to the outStream failed.
        ProcessingException - The document could not be processed.
        java.lang.IllegalArgumentException - An OCR engine is required for the specified options but engine is null.
        java.lang.IllegalArgumentException - The options specify invalid or contradictory settings.
        java.lang.IllegalArgumentException - The outOptions specifies document encryption for a PDF/A file, which is not allowed.
        GenericException - An unexpected failure occurred.
        CorruptException - An input image in the document is corrupt and cannot be read.
        PasswordException - The document is encrypted and the password is invalid.
        ConformanceException - The document has an invalid conformance level.
        UnsupportedFeatureException - The input PDF contains unrendered XFA form fields. See pdftools.pdf.Document.getXfa for more information.
        java.lang.IllegalArgumentException - The document argument is null.
        java.lang.IllegalArgumentException - The outStream argument is null.
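        The outStream requirement (random read and write access) can be pictured with plain JDK I/O. The sketch below uses java.io.RandomAccessFile only to illustrate that access pattern: write, seek back, overwrite, then read everything. It is not the SDK's Stream type, and how an SDK stream is created is outside this section.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

class RandomAccessDemo {
    // Writes a placeholder header and a body, seeks back to patch the header,
    // then rewinds and reads the whole file through the same handle.
    static String writePatchRead(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            raf.setLength(0); // start from an empty file
            raf.write("HEAD0000".getBytes(StandardCharsets.US_ASCII));
            raf.write("body".getBytes(StandardCharsets.US_ASCII));

            // Random write access: jump back and overwrite bytes 4..7.
            raf.seek(4);
            raf.write("0012".getBytes(StandardCharsets.US_ASCII));

            // Random read access: rewind and read what was written.
            raf.seek(0);
            byte[] all = new byte[(int) raf.length()];
            raf.readFully(all);
            return new String(all, StandardCharsets.US_ASCII);
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("random-access", ".bin");
        tmp.deleteOnExit();
        System.out.println(writePatchRead(tmp)); // prints HEAD0012body
    }
}
```

        A write-only, forward-only sink such as a socket or pipe output cannot support the seek-and-read steps shown here, so it would not satisfy a requirement of this kind.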
      • process

        public Document process(Document document,
                                Engine engine,
                                Stream outStream,
                                OcrOptions options)
                         throws java.io.IOException,
                                GenericException,
                                LicenseException,
                                CorruptException,
                                PasswordException,
                                ConformanceException,
                                UnsupportedFeatureException,
                                ProcessingException

        Apply OCR to a PDF document

        Process the input PDF document with OCR according to the specified options. The processed document is written to the output stream.

        Non-critical processing issues raise a Processor.Warning event, which is reported to registered Processor.WarningListener instances. Review the reported WarningCategory values and handle them where the application requires it.

        Parameters:
        document - The input PDF document to process
        engine - The OCR engine to use for recognition. This parameter may be null for operations that do not require OCR, such as ImageProcessingMode.REMOVE_TEXT. For all other modes, a valid engine must be provided.
        outStream - The stream to which the output PDF is written. The stream must support both random read and write access.
        options - The OCR processing options. If null, default options are used.
        Returns:

        The resulting output PDF, which can be used as new input for further processing.

        Note that the returned object must be disposed before the output stream object (method argument outStream) is disposed.

        Throws:
        LicenseException - The license check has failed.
        java.io.IOException - Writing to the outStream failed.
        ProcessingException - The document could not be processed.
        java.lang.IllegalArgumentException - An OCR engine is required for the specified options but engine is null.
        java.lang.IllegalArgumentException - The options specify invalid or contradictory settings.
        java.lang.IllegalArgumentException - The outOptions specifies document encryption for a PDF/A file, which is not allowed.
        GenericException - An unexpected failure occurred.
        CorruptException - An input image in the document is corrupt and cannot be read.
        PasswordException - The document is encrypted and the password is invalid.
        ConformanceException - The document has an invalid conformance level.
        UnsupportedFeatureException - The input PDF contains unrendered XFA form fields. See pdftools.pdf.Document.getXfa for more information.
        java.lang.IllegalArgumentException - The document argument is null.
        java.lang.IllegalArgumentException - The outStream argument is null.
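        The disposal rule in the Returns note maps naturally onto try-with-resources, which closes resources in reverse declaration order. The following is a minimal sketch, assuming the SDK's Document and Stream objects can be managed as AutoCloseable resources (an assumption, not confirmed in this section). Tracked is a hypothetical stand-in that only records the order in which resources are closed.

```java
import java.util.ArrayList;
import java.util.List;

class DisposalOrderDemo {
    // Hypothetical stand-in resource: records its name when closed.
    static final class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.add(name);
        }
    }

    // Declares the stream first and the document second, so the document
    // is closed first when the try block exits.
    static List<String> closeOrder() {
        List<String> log = new ArrayList<>();
        try (Tracked outStream = new Tracked("outStream", log);
             Tracked outDoc = new Tracked("outDoc", log)) {
            // In real code, outDoc would be the value returned by process(...)
            // and any follow-up work on it would happen here.
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(closeOrder()); // prints [outDoc, outStream]
    }
}
```

        Declaring the output stream before the returned document in the same try header is enough to get the required order, including when an exception is thrown inside the block.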
      • process

        public Document process(Document document,
                                Engine engine,
                                Stream outStream,
                                OcrOptions options,
                                OutputOptions outOptions)
                         throws java.io.IOException,
                                GenericException,
                                LicenseException,
                                CorruptException,
                                PasswordException,
                                ConformanceException,
                                UnsupportedFeatureException,
                                ProcessingException

        Apply OCR to a PDF document

        Process the input PDF document with OCR according to the specified options. The processed document is written to the output stream.

        Non-critical processing issues raise a Processor.Warning event, which is reported to registered Processor.WarningListener instances. Review the reported WarningCategory values and handle them where the application requires it.

        Parameters:
        document - The input PDF document to process
        engine - The OCR engine to use for recognition. This parameter may be null for operations that do not require OCR, such as ImageProcessingMode.REMOVE_TEXT. For all other modes, a valid engine must be provided.
        outStream - The stream to which the output PDF is written. The stream must support both random read and write access.
        options - The OCR processing options. If null, default options are used.
        outOptions - The PDF output options, e.g. to encrypt the output document.
        Returns:

        The resulting output PDF, which can be used as new input for further processing.

        Note that the returned object must be disposed before the output stream object (method argument outStream) is disposed.

        Throws:
        LicenseException - The license check has failed.
        java.io.IOException - Writing to the outStream failed.
        ProcessingException - The document could not be processed.
        java.lang.IllegalArgumentException - An OCR engine is required for the specified options but engine is null.
        java.lang.IllegalArgumentException - The options specify invalid or contradictory settings.
        java.lang.IllegalArgumentException - The outOptions specifies document encryption for a PDF/A file, which is not allowed.
        GenericException - An unexpected failure occurred.
        CorruptException - An input image in the document is corrupt and cannot be read.
        PasswordException - The document is encrypted and the password is invalid.
        ConformanceException - The document has an invalid conformance level.
        UnsupportedFeatureException - The input PDF contains unrendered XFA form fields. See pdftools.pdf.Document.getXfa for more information.
        java.lang.IllegalArgumentException - The document argument is null.
        java.lang.IllegalArgumentException - The outStream argument is null.
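        The documented IllegalArgumentException conditions for document, outStream and engine can be mirrored in a caller-side pre-check, which gives clearer error messages before any processing starts. This is only a sketch: ProcessArgsSketch and its Mode enum are hypothetical stand-ins (only REMOVE_TEXT is named in this section), and how the mode is carried by OcrOptions is not shown here.

```java
class ProcessArgsSketch {
    // Hypothetical stand-in for ImageProcessingMode. REMOVE_TEXT is the only
    // mode this section names as not requiring an engine; OTHER stands for
    // every mode that does.
    enum Mode { REMOVE_TEXT, OTHER }

    // Mirrors the documented argument rules: document and outStream must not
    // be null, and engine may be null only when the mode needs no OCR.
    static void check(Object document, Object engine, Object outStream, Mode mode) {
        if (document == null) {
            throw new IllegalArgumentException("document is null");
        }
        if (outStream == null) {
            throw new IllegalArgumentException("outStream is null");
        }
        if (engine == null && mode != Mode.REMOVE_TEXT) {
            throw new IllegalArgumentException("an OCR engine is required for mode " + mode);
        }
    }

    public static void main(String[] args) {
        Object document = new Object();
        Object outStream = new Object();

        // Accepted: no engine, but the mode does not need OCR.
        check(document, null, outStream, Mode.REMOVE_TEXT);

        try {
            // Rejected: no engine although the mode needs OCR.
            check(document, null, outStream, Mode.OTHER);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

        The checked exceptions listed under Throws (for example CorruptException or PasswordException) depend on the document content and cannot be anticipated by a check of this kind; they still need to be handled around the process call itself.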