Pikepdf Readthedocs Io en Latest
Release 1.19.0
James R. Barlow
1 At a glance
1.1 Requirements
1.2 Similar libraries
1.3 In use
Index
pikepdf Documentation, Release 1.19.0
CHAPTER 1
At a glance
pikepdf is a library intended for developers who want to create, manipulate, parse, repair, and abuse the PDF format.
It supports reading and writing PDFs, including creating them from scratch. Thanks to QPDF, it supports linearizing PDFs
and accessing encrypted PDFs.
It is a low level library that requires knowledge of PDF internals and some familiarity with the PDF specification. It
does not provide a user interface of its own.
pikepdf would help you build apps that split, merge, inspect, repair, and transform PDFs.
pikepdf is not intended for the following tasks:
• Rasterize PDF pages for display (that is, produce an image that shows
what a PDF page looks like at a particular resolution/zoom level) – use
Ghostscript instead
• Convert from PDF to other similar paper capture formats like epub, XPS,
DjVu, Postscript – use MuPDF or PyMuPDF
• Print to paper
If you only want to generate PDFs and not read or modify them, consider reportlab (a “write-only” PDF generator).
Fig. 2: Pikemen bracing for a cavalry charge, carrying pikes.
1.1 Requirements
pikepdf currently requires Python 3.5+. There are no plans to backport to 2.7 or
older versions in the 3.x series.
Support for Python 3.5 will end in September 2020, when Python 3.5 itself
reaches “end of life”.
1.2 Similar libraries
Unlike similar Python libraries such as PyPDF2 and pdfrw, pikepdf is not pure Python. Both were designed prior to
Python wheels, which have made Python extension libraries much easier to work with. By leveraging the existing mature
code base of QPDF, pikepdf, despite being new, is already more capable than both in many respects – for example, it
can read compressed object streams, repair damaged PDFs in many cases, and linearize PDFs. Unlike those libraries,
it’s not pure Python: it is impure and proud of it.
1.3 In use
pikepdf is used by the same author’s OCRmyPDF to inspect input PDFs, graft the generated OCR layers on to page
content, and output PDFs. Its code contains several practical examples, particularly in pdfinfo.py, graft.py, and
optimize.py. pikepdf is also used in its test suite.
1.3.1 Installation
Basic installation
Most users on Linux, macOS or Windows with x64 systems should use pip to install pikepdf in their
current Python environment (such as your project’s virtual environment).
Fig. 3: A pike installation failure.
Use pip install --user pikepdf to install the package for the current user only. Use pip
install pikepdf to install to a virtual environment.
Linux users: If you have an older version of pip, such as the one that ships with Ubuntu 18.04, this
command will attempt to compile the project instead of installing the wheel. If you want to get the
binary wheel, upgrade pip before installing pikepdf.
32- and 64-bit wheels are available for Windows, Linux and macOS. Binary wheels should work on
most systems: Linux distributions from 2010 and newer, macOS 10.11 and newer (for Homebrew), and
Windows 7 and newer, provided a recent version of pip is used to install them. The Linux wheels
currently include copies of libqpdf, libjpeg, and zlib. The Windows wheels include libqpdf. This is to
ensure that up-to-date, compatible copies of dependent libraries are included.
Currently we do not build wheels for architectures other than x86 and x64.
Alpine Linux does not support Python wheels.
Platform support
Some platforms include versions of pikepdf that are distributed by the system package
manager (such as apt). These versions may lag behind the version distributed
with PyPI, but may be convenient for users that cannot use binary wheels.
Fedora
Alpine Linux
Installing on FreeBSD
pkg install python3 py37-lxml py37-pip py37-pybind11 qpdf
Requirements
pikepdf requires:
• a C++14 compliant compiler - GCC (5 and up), clang (3.3 and up), MSVC (2015 or newer)
• pybind11
• libqpdf 8.4.2 or higher from the QPDF project.
On Linux the library and headers for libqpdf must be installed because pikepdf compiles code against it and links to it.
Check Repology for QPDF to see if a recent version of QPDF is available for your platform. Otherwise you must
build QPDF from source. (Consider using the binary wheels, which bundle the required version of libqpdf.)
Compiling with GCC or Clang
• clone this repository
• install libjpeg, zlib and libqpdf on your platform, including headers
• pip install .
Note: pikepdf should be built with the same compiler and linker as libqpdf; to be precise both must use the same C++
ABI. On some platforms, setup.py may not pick the correct compiler so one may need to set environment variables CC
and CXX to redirect it. If the wrong compiler is selected, import pikepdf._qpdf will throw an ImportError
about a missing symbol.
"%VS140COMNTOOLS%\..\..\VC\vcvarsall.bat" x64
set DISTUTILS_USE_SDK=1
set MSSdk=1
4. Extract bin\*.dll (all the DLLs, both QPDF’s and the Microsoft Visual C++ Runtime library) from the zip
file above, and copy them to the src/pikepdf folder in the repository.
5. Run pip install . in the root directory of the repository.
Note: The user compiling pikepdf must have registry editing rights on the machine to be able to run the
vcvarsall.bat script.
Documentation is generated using Sphinx and you are currently reading it. To regenerate it:
pip install -r requirements/docs.txt
cd docs
make html
v1.19.0
• Learned how to export CCITT images from PDFs that have ICC profiles
attached.
• Cherry-picked a workaround to a possible use-after-free caused by
pybind11 (pybind11 PR 2223).
• Improved test coverage of code that handles inline images.
v1.18.0
v1.17.3
• Fixed crash when pikepdf.Pdf objects are used inside generators (#114) and not freed or closed before the
generator exits.
v1.17.2
• Fixed issue, “seek of closed file” where JBIG2 image data could not be accessed (only metadata could be) when
a JBIG2 was extracted from a PDF.
v1.17.1
• Fixed building against the oldest supported version of QPDF (8.4.2), and configure CI to test against the oldest
version. (#109)
v1.17.0
• Fixed a failure to extract PDF images, where the image had both a palette and colorspace set to an ICC profile.
The image is now extracted with the profile embedded. (#108)
• Added opt-in support for memory-mapped file access, using pikepdf.open(...,
access_mode=pikepdf.AccessMode.mmap). Memory mapping improves file access performance considerably,
but may make application exception handling more difficult.
v1.16.1
• Fixed an issue with JBIG2 extraction, where the version number of the jbig2dec software may be written to
standard output as a side effect. This could interfere with test cases or software that expects pikepdf to be
stdout-clean.
• Fixed an error that occurred when updating DocumentInfo to match XMP metadata, when XMP metadata had
unexpected empty tags.
• Fixed setup.py to better support Python 3.8 and 3.9.
• Documentation updates.
v1.16.0
• Added support for extracting JBIG2 images with the image API. JBIG2 images are converted to PIL.Image.
Requires a JBIG2 decoder such as jbig2dec.
• Python 3.5 support is deprecated and will end when Python 3.5 itself reaches end of life, in September 2020. At
the moment, some tests are skipped on Python 3.5 because they depend on Python 3.6.
• Python 3.9beta is supported and is known to work on Fedora 33.
v1.15.1
• Fixed a regression - Pdf.save(filename) may hold file handles open after the file is fully written.
• Documentation updates.
v1.15.0
• Fixed an issue where Decimal objects of precision exceeding the PDF specification could be written to output
files, causing some PDF viewers, notably Acrobat, to parse the file incorrectly. We now limit precision to 15
digits, which ought to be enough to prevent rounding error and parsing errors.
• We now refuse to create pikepdf objects from float or Decimal that are NaN or ±Infinity. These
concepts have no equivalent in PDF.
• pikepdf.Array objects now implement .append() and .extend() with familiar Python list
semantics, making them easier to edit.
v1.14.0
v1.13.0
• Added support for editing PDF Outlines (also known as bookmarks or the table of contents). Many thanks to
Matthias Erll for this contribution.
• Added support for decoding run length encoded images.
• Object.read_bytes() and Object.get_stream_buffer() can now request decoding of
uncommon PDF filters.
• Fixed test suite warnings related to pytest and hypothesis.
• Fixed build on Cygwin. Thanks to @jhgarrison for report and testing.
v1.12.0
• Microsoft Visual C++ Runtime libraries are now included in the pikepdf Windows wheel, to improve ease of
use on Windows.
• Defensive code added to prevent using .emplace() on objects from a foreign PDF without first copying the
object. Previously, this would raise an exception when the file was saved.
v1.11.2
v1.11.1
• We now avoid creating an empty XMP metadata entry when files are saved.
• Updated documentation to describe how to delete the document information dictionary.
v1.11.0
• Prevent creation of dictionaries with invalid names (not beginning with /).
• Allow pikepdf’s build to specify a qpdf source tree, allowing one to compile pikepdf against an
unreleased/modified version of qpdf.
• Improved behavior of pages.p() and pages.remove() when invalid parameters were given.
• Fixed compatibility with libqpdf version 10.0.1, and build official wheels against this version.
• Fixed compatibility with pytest 5.x.
• Fixed the documentation build.
• Fixed an issue with running tests in a non-Unicode locale.
• Fixed a test that randomly failed due to a “deadline error”.
• Removed a possibly nonfree test file.
v1.10.4
• Rebuild Python wheels with newer version of libqpdf. Fixes problems with opening certain password-protected
files (#87).
v1.10.3
v1.10.2
• Fixed an issue where pages added from a foreign PDF were added as references rather than copies. (#80)
• Documentation updates.
v1.10.1
v1.10.0
v1.9.0
v1.8.3
• If the XMP metadata packet is not well-formed and we are confident that it is essentially empty apart from XML
fluff, we fix the problem instead of raising an exception.
v1.8.2
• Fixed an issue where QPDF 8.4.2 would report different errors from QPDF 9.0.0, causing a test to fail. (#71)
v1.8.1
• Fixed an issue where files opened by name may not be closed correctly. Regression from v1.8.0.
• Fixed a test for readable/seekable streams that always evaluated to true.
v1.8.0
v1.7.1
• This release was incorrectly marked as a patch-level release when it actually introduced one minor new feature.
It includes the API change to support pikepdf.Pdf.objects.
v1.7.0
• Shallow object copy with copy.copy(pikepdf.Object) is now supported. (Deep copy is not yet
supported.)
• Support for building on C++11 has been removed. A C++14 compiler is now required.
• pikepdf now generates manylinux2010 wheels on Linux.
• Build and deploy infrastructure migrated to Azure Pipelines.
• All wheels are now available for Python 3.5 through 3.8.
v1.6.5
• Fixed build settings to support Python 3.8 on macOS and Linux. Windows support for Python 3.8 is not currently
tested since continuous integration providers have not updated to Python 3.8 yet.
• pybind11 2.4.3 is now required, to support Python 3.8.
v1.6.4
• When images were encoded with CCITTFaxDecode, type G4, with the /EncodedByteAlign set to true (not
default), the image extracted by pikepdf would be a corrupted form of the original, usually appearing as a small
speckling of black pixels at the top of the page. Saving an image with pikepdf was not affected; this problem
only occurred when attempting to extract images. We now refuse to extract images with these parameters, as
there is not sufficient documentation to determine how to extract them. This image format is relatively rare.
v1.6.3
v1.6.2
• Fixed another build problem on Alpine Linux - musl-libc defines struct FILE as an incomplete type, which
breaks pybind11 metaprogramming that attempts to reason about the type.
• Documentation improved to mention FreeBSD port.
v1.6.1
• Dropped our one usage of QPDF’s C API so that we use only C++.
• Documentation improvements.
v1.6.0
• Added bindings for QPDF’s page object helpers and token filters. These enable: filtering content streams,
capturing pages as Form XObjects, more convenient manipulation of page boxes.
• Fixed a logic error on attempting to save a PDF created in memory in a way that overwrites an existing file.
• Fixed Pdf.get_warnings() failing with an exception when attempting to return a warning or exception.
• Improved manylinux1 binary wheels to compile all dependencies from source rather than using older versions.
• More tests and more coverage.
• libqpdf 8.4.2 is required.
v1.5.0
• Improved interpretation of images within PDFs that use an ICC colorspace. Where possible we embed the ICC
profile when extracting the image, and provide access to the ICC profile.
• Fixed saving PDFs with their existing encryption.
• Fixed documentation to reflect the fact that saving a PDF without specifying encryption settings will remove
encryption.
• Added a test to prevent overwriting the input PDF since overwriting corrupts lazy loading.
• Object.write(filters=, decode_parms=) now detects invalid parameters instead of writing
invalid values to Filters and DecodeParms.
• We can now extract some images that had stacked compression, provided it is /FlateDecode.
• Add convenience function Object.wrap_in_array().
v1.4.0
• Added support for saving encrypted PDFs. (Reading them has been supported for a long time.)
• Added support for setting the PDF extension level as well as version.
• Added support for converting strings to and from PDFDocEncoding, by registering a "pdfdoc" codec.
v1.3.1
• Updated pybind11 to v2.3.0, fixing a possible GIL deadlock when pikepdf objects were shared across threads.
(#27)
• Fixed an issue where PDFs with valid XMP metadata but missing an element that is usually present would be
rejected as malformed XMP.
v1.3.0
• Remove dependency on defusedxml.lxml, because this library is deprecated. In the absence of other
options for XML hardening we have reverted to standard lxml.
• Fixed an issue where PdfImage.extract_to() would write a file in the wrong directory.
• Eliminated an intermediate buffer that was used when saving to an IO stream (as opposed to a filename). We
would previously write the entire output to a memory buffer and then write to the output buffer; we now write
directly to the stream.
• Added Object.emplace() as a workaround for when one wants to update a page without generating a new
page object so that links/table of contents entries to the original page are preserved.
• Improved documentation. Eliminated all arg0 placeholder variable names, which appeared when the
documentation generator could not read a C++ variable name.
• Added PageList.remove(p=1), so that it is possible to remove pages using counting numbers.
v1.2.0
• Implemented Pdf.close() and with-block context manager, to allow Pdf objects to be closed without
relying on del.
• PdfImage.extract_to() has a new keyword argument fileprefix=, which specifies a file path where
an image should be extracted, with pikepdf setting the appropriate file suffix. This simplifies the API for the most
common case of extracting images to files.
• Fixed an internal test that should have suppressed the extraction of JPEGs with a nonstandard ColorTransform
parameter set. Without the proper color transform applied, the extracted JPEGs will typically look very pink.
Now, these images should fail to extract as was intended.
• Fixed that Pdf.save(object_stream_mode=...) was ignored if the default
fix_metadata_version=True was also set.
• Data from one Pdf is now copied to other Pdf objects immediately, instead of creating a reference that required
source PDFs to remain available. Pdf objects no longer reference each other.
• libqpdf 8.4.0 is now required
• Various documentation improvements
v1.1.0
• Added workaround for macOS/clang build problem of the wrong exception type being thrown in some cases.
• Improved translation of certain system errors to their Python equivalents.
• Fixed issues resulting from platform differences in datetime.strftime. (#25)
• Added Pdf.new, Pdf.add_blank_page and Pdf.make_stream convenience methods for creating new
PDFs from scratch.
• Added binding for new QPDF JSON feature: Object.to_json.
• We now automatically update the XMP PDFVersion metadata field to be consistent with the PDF’s declared
version, if the field is present.
• Made our Python-augmented C++ classes easier for Python code inspectors to understand.
• Eliminated use of the imghdr library.
v1.0.5
• Fixed an issue where an invalid date in XMP metadata would cause an exception when updating DocumentInfo.
For now, we warn that some DocumentInfo is not convertible. (In the future, we should also check if the XMP
date is valid, because it probably is not.)
• Rebuilt the binary wheels with libqpdf 8.3.0. libqpdf 8.2.1 is still supported.
v1.0.4
• Updates to tests/resources (provenance of one test file, replaced another test file with a synthetic one)
v1.0.3
v1.0.2
• Fixed an issue where invalid values such as out of range years (e.g. 0) in DocumentInfo would raise exceptions
when using DocumentInfo to populate XMP metadata with .load_from_docinfo.
v1.0.1
• Fixed an exception with handling metadata that contains the invalid XML entity &#0; (an escaped NUL)
v1.0.0
v0.10.2
Fixes
• Fixed segfault when overwriting the pikepdf file that is currently open on Linux.
• Fixed removal of an attribute metadata value when values were present on the same node.
v0.10.1
Fixes
v0.10.0
Fixes
• Fixed several issues related to generating XMP metadata that passed veraPDF validation.
• Fixed a random test suite failure for very large negative integers.
• The lxml library is now required.
v0.9.2
Fixes
• Added all of the commonly used XML namespaces to XMP metadata handling, so we are less likely to name
something ‘ns1’, etc.
• Skip a test that fails on Windows.
• Fixed build errors in documentation.
v0.9.1
Fixes
v0.9.0
Updates
• New API to access and edit PDF metadata and make consistent edits to the new and old style of PDF metadata.
• 32-bit binary wheels are now available for Windows
• PDFs can now be saved in QPDF’s “qdf” mode
• The Python package defusedxml is now required
• The Python package python-xmp-toolkit and its dependency libexempi are suggested for testing, but not required
Fixes
Breaking
• The Pdf.metadata property was removed, and replaced with the new metadata API
• Pdf.attach() has been removed, because the interface as implemented had no way to deal with existing
attachments.
v0.3.7
v0.3.6
v0.3.5
Breaking
Fixes
• A use-after-free memory error that caused occasional segfaults and “QPDFFakeName” errors when opening
from stream objects has been resolved.
v0.3.4
Updates
• pybind11 vendoring has ended now that v2.2.4 has been released
v0.3.3
Breaking
Updates
Fixes
• del obj.AttributeName was not implemented. The attribute interface is now consistent
• Deleting named attributes now defers to the attribute dictionary for Stream objects, as get/set do
• Fixed handling of JPEG2000 images where metadata must be retrieved from the file
v0.3.2
Updates
• Added support for direct image extraction of CMYK and grayscale JPEGs, where previously only RGB
(internally YUV) was supported
• Array() now creates an empty array properly
• The syntax Name.Foo in Dictionary(), e.g. Name.XObject in page.Resources, now works
v0.3.1
Breaking
• pikepdf.open now validates its keyword arguments properly, potentially breaking code that passed invalid
arguments
• libqpdf 8.1.0 is now required - libqpdf 8.1.0 API is now used for creating Unicode strings
• If a non-existent file is opened with pikepdf.open, a FileNotFoundError is raised instead of a generic
error
• We are now temporarily vendoring a copy of pybind11 since its master branch contains unreleased and important
fixes for Python 3.7.
Updates
Fixes
v0.3.0
Breaking
• Modified Object.write method signature to require filter and decode_parms as keyword arguments
• Implement automatic type conversion from the PDF Null type to None
• Removed Object.unparse_resolved in favor of Object.unparse(resolved=True)
• libqpdf 8.0.2 is now required at minimum
Updates
v0.2.2
• Added Python 3.7 support to build and test (not yet available for Windows, due to lack of availability on
AppVeyor)
• Removed setter API from PdfImage because it never worked anyway
• Improved handling of PdfImage with trivial palettes
v0.2.1
v0.2.0
• Implemented automatic type conversion for int, bool and Decimal, eliminating the
pikepdf.{Integer,Boolean,Real} types. Removed a lot of associated numerical code.
Everything before v0.2.0 can be considered too old to document.
1.3.3 Tutorial
This brief tutorial should give you an introduction and orientation to pikepdf’s
paradigm and syntax. From there, we refer you to various topics.
In contrast to better known PDF libraries, pikepdf uses a single object to represent
a PDF, whether reading, writing or merging. We have cleverly named this
pikepdf.Pdf. In this documentation, a Pdf is a class that allows manipulating
the PDF, meaning the file.
You may of course use from pikepdf import Pdf as ... if the short
class name conflicts or from pikepdf import Pdf as PDF if you prefer uppercase.
pikepdf.open() is a shorthand for pikepdf.Pdf.open().
The PDF class API follows the example of the widely-used Pillow image library. For clarity there is no default
constructor since the arguments used for creation and opening are different. To make a new empty PDF, use Pdf.
new() not Pdf().
Pdf.open() also accepts seekable streams as input, and Pdf.save() accepts streams as output. pathlib.
Path objects are fully supported anywhere pikepdf accepts a filename.
Inspecting pages
Manipulating pages is fundamental to PDFs. pikepdf presents the pages in a PDF through the pikepdf.Pdf.pages
property, which follows the list protocol. As such, page numbers begin at 0.
Let’s open a simple PDF that contains four pages.
In [1]: from pikepdf import Pdf
In [2]: pdf = Pdf.open('../tests/resources/fourpages.pdf')
In [3]: len(pdf.pages)
Out[3]: 4
pikepdf integrates with IPython and Jupyter’s rich object APIs so that you can view PDFs, PDF pages, or images
within a PDF in an IPython window or Jupyter notebook. This makes it easier to test visual changes.
In [4]: pdf
Out[4]: « In Jupyter you would see the PDF here »
In [5]: pdf.pages[0]
Out[5]: « In Jupyter you would see an image of the PDF page here »
You can also examine individual pages, which we’ll explore in the next section. Suffice to say that you can access
pages by indexing them and slicing them.
In [6]: pdf.pages[0]
Out[6]: « In Jupyter you would see an image of the PDF page here »
Note: pikepdf.Pdf.open() can open almost all types of encrypted PDF! Just provide the password= keyword
argument.
For more details on document assembly, see PDF split, merge and document assembly.
In PDFs, the main data structure is the dictionary, a key-value data structure much like a Python dict or attrdict.
The major difference is that the keys can only be names, and the values can only be PDF types, including other
dictionaries.
PDF dictionaries are represented as pikepdf.Dictionary, and names are of type pikepdf.Name. A page is
just a dictionary with certain required keys and a reference from the document’s “page tree”. (pikepdf manages the
page tree for you.)
In [7]: from pikepdf import Pdf
repr() output
The angle brackets in the output indicate that this object cannot be constructed with a Python expression because it
contains a reference. When angle brackets are omitted from the repr() of a pikepdf object, then the object can be
replicated with a Python expression, such as eval(repr(x)) == x. Pages typically have indirect references to
themselves and other pages, so they cannot be represented as an expression.
By convention, pikepdf uses attribute notation for standard names (the names that are normally part of a dictionary,
according to the PDF Reference Manual), and item notation for names that may not always appear. For example,
the images belonging to a page always appear at page.Resources.XObject, but the names of the images are arbitrarily
chosen by whatever software generates the PDF (/Im0, in this case). (Whenever expressed as strings, names begin
with /.)
In [13]: page1.Resources.XObject['/Im0']
Deleting pages
In [14]: del pdf.pages[1:3] # Remove pages 2-3 labeled "second page" and "third page"
In [15]: len(pdf.pages)
Out[15]: 2
Saving changes
In [16]: pdf.save('output.pdf')
You may save a file multiple times, and you may continue modifying
it after saving. For example, you could create an unencrypted version
of a document, then apply a watermark, and create an encrypted
version.
Fig. 6: Saving pike.
Note: You may not overwrite the input file (or whatever Python object provides the data) when saving or at any other
time. pikepdf assumes it will have exclusive access to the input file or input data you give it, until pdf.close()
is called.
As in all PDFs, if a user password is set, it will not be possible to open the PDF without the password. If the owner
password is set, changes will not be permitted without the owner password. If the user password is an empty string and
an owner password is set, the PDF can be viewed by anyone without entering a password. PDF viewers only
enforce pikepdf.Permissions restrictions when a PDF is opened with the user password, since the owner may
change anything.
pikepdf does not and cannot enforce the restrictions in pikepdf.Permissions if you open a file with the user
password. Someone with either the user or owner password can access all the contents of the PDF. If you are developing
an application, however, you should consider enforcing the restrictions.
For widest compatibility, passwords should be ASCII, since the PDF reference manual is unclear about how non-ASCII
passwords are supposed to be encoded. See the documentation on Pdf.save() for more details.
Next steps
Have a look at pikepdf topics that interest you, or jump to our detailed API reference. . .
This section discusses working with PDF pages: splitting, merging, copying, deleting. We’re treating pages as a unit,
rather than working with the content of individual pages.
Let’s continue with fourpages.pdf from the Tutorial.
Note: This example will transfer data associated with each page, so that every page stands on its own. It will not
transfer some metadata associated with the PDF as a whole, such as the list of bookmarks.
We create an empty Pdf which will be the container for all the others.
In [8]: pdf.save('merged.pdf')
This code sample is enough to merge most PDFs, but there are some things it does not do that a more sophisticated
function might do. One could call pikepdf.Pdf.remove_unreferenced_resources() to remove
unreferenced resources. It may also be necessary to choose the most recent version of all source PDFs. Here is a more
sophisticated example:
In [13]: pdf.remove_unreferenced_resources()
This improved example would still leave metadata blank. It’s up to you to decide how to combine metadata from
multiple PDFs.
Suppose the file was scanned backwards, a common problem with automatic document scanners. We can easily
reverse it in place.
In [15]: pdf.pages.reverse()
In [16]: pdf
Out[16]: <pikepdf.Pdf description='../tests/resources/fourpages.pdf'>
Pretty nice, isn’t it? But the pages in this file already were in correct order, so let’s put them back.
In [17]: pdf.pages.reverse()
Now, let’s add some content from another file. Because pdf.pages behaves like a list, we can use pages.
extend() on another file’s pages.
In [18]: pdf = Pdf.open('../tests/resources/fourpages.pdf')
In [20]: pdf.pages.extend(appendix.pages)
We can use pages.insert() to insert one or more pages at a specific position, bumping everything else
ahead.
Copying pages between Pdf objects will create a shallow copy of the source page within the target Pdf, rather
than the typical Python behavior of creating a reference. As such, modifying pdf.pages[-1] will not affect
appendix.pages[0]. (Normally, assigning objects between Python lists creates a reference, so that the two
objects are identical, list[0] is list[1].)
In [21]: graph = Pdf.open('../tests/resources/graph.pdf')
In [23]: len(pdf.pages)
Out[23]: 6
In [25]: pdf.pages[2].objgen
Out[25]: (4, 0)
In [27]: pdf.pages[2].objgen
Out[27]: (33, 0)
The method above will break any indirect references (such as table of contents entries and hyperlinks) within pdf
to pdf.pages[2]. Perhaps that is the behavior you want, if the replacement means those references are no longer
valid. This is shown by the change in pikepdf.Object.objgen.
Emplacing pages
To preserve indirect references, use pikepdf.Object.emplace(), which will (conceptually) delete all of the
content of target and replace it with the content of source, thus preserving indirect references to the page. (Think of
this as demolishing the interior of a house, but keeping it at the same address.)
In [28]: pdf = Pdf.open('../tests/resources/fourpages.pdf')
In [30]: pdf.pages[2].objgen
Out[30]: (5, 0)
In [32]: pdf.pages[2].emplace(pdf.pages[-1])
In [34]: pdf.pages[2].objgen
Out[34]: (5, 0)
As you may have guessed, we can assign pages to copy them within a Pdf:
As above, copying a page creates a shallow copy rather than a Python object reference.
Also as above, pikepdf.Object.emplace() can be used to create a copy that preserves the functionality of
indirect references within the PDF.
Because PDF pages are usually numbered in counting numbers (1, 2, 3, ...), pikepdf provides a convenience accessor
.p() that uses counting numbers:
To avoid confusion, the .p() accessor does not accept Python slices, and .p(0) raises an exception. It is also not
possible to delete using it.
PDFs may define their own numbering scheme or different numberings for different sections, such as using Roman
numerals for an introductory section. .pages does not look up this information.
Warning: It’s possible to obtain page information through the pikepdf.Pdf.Root object, but this is not recommended.
(In PDF parlance, this is the /Root object).
The internal consistency of the various /Page and /Pages is not guaranteed when accessed in this manner, and
in some PDFs the data structure for these is fairly complex. Use the .pages interface.
This section deals with how to view and edit the contents of a page.
pikepdf is not an ideal tool for producing new PDFs from scratch – and there are many good tools for that, as mentioned
elsewhere. pikepdf is better at inspecting, editing and transforming existing PDFs.
In [4]: pageobj1
Out[4]:
<pikepdf.Dictionary(type_="/Page")({
"/Contents": pikepdf.Stream(stream_dict={
"/Length": 50
}, data=<...>),
"/MediaBox": [ 0, 0, 200, 304 ],
"/Parent": <reference to /Pages>,
"/Resources": {
"/XObject": {
"/Im0": pikepdf.Stream(stream_dict={
"/BitsPerComponent": 8,
"/ColorSpace": "/DeviceRGB",
"/Filter": [ "/DCTDecode" ],
"/Height": 1520,
"/Length": 192956,
"/Subtype": "/Image",
"/Type": "/XObject",
"/Width": 1000
}, data=<...>)
}
},
"/Type": "/Page"
})>
The page’s /Contents key contains instructions for drawing the page content. This is a content stream, which is a
stream object that follows special rules.
Also attached to this page is a /Resources dictionary, which contains a single XObject image. The image is
compressed with the /DCTDecode filter, meaning it is encoded with the DCT (discrete cosine transform), so it is a
JPEG. pikepdf has special APIs for working with images.
The /MediaBox describes the bounding box of the page in PDF pt units (1/72” or 0.35 mm).
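As a quick check of the units, the following sketch converts the /MediaBox shown above into inches and millimetres. It uses plain arithmetic only; the helper name points_to_mm is made up for this illustration.

```python
POINTS_PER_INCH = 72
MM_PER_INCH = 25.4


def points_to_mm(points):
    """Convert PDF points (1/72 inch) to millimetres."""
    return points * MM_PER_INCH / POINTS_PER_INCH


# Values copied from the /MediaBox of the page shown above:
# [lower-left x, lower-left y, upper-right x, upper-right y]
media_box = [0, 0, 200, 304]
llx, lly, urx, ury = media_box

width_pt, height_pt = urx - llx, ury - lly
width_in, height_in = width_pt / POINTS_PER_INCH, height_pt / POINTS_PER_INCH
width_mm, height_mm = points_to_mm(width_pt), points_to_mm(height_pt)

print(f"{width_pt} x {height_pt} pt")
print(f"{width_in:.2f} x {height_in:.2f} in")
print(f"{width_mm:.1f} x {height_mm:.1f} mm")
```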
You can access the page dictionary data structure directly, but it’s fairly complicated. There are a number of rules,
optional values and implied values. It’s easier to use page helpers, which ensure that the page is modified in a
semantically correct manner.
Page helpers
pikepdf provides a helper class, pikepdf.Page, which provides higher-level functions to manipulate pages than the
standard page dictionary used in the previous examples.
Currently pikepdf does not automatically return helper classes. You must initialize them. In a future release, it will
return them automatically.
In [5]: from pikepdf import Pdf, Page
One advantage of page helpers is that they resolve implicit information. For example, page.trimbox will return an
appropriate trim box for this page, which in this case is equal to the media box.
This section covers the object model pikepdf uses in more detail.
A pikepdf.Object is a Python wrapper around a C++ QPDFObjectHandle which, as the name suggests, is a
handle (or pointer) to a data structure in memory, or possibly a reference to data that exists in a file. Importantly, an
object can be a scalar quantity (like a string) or a compound quantity (like a list or dict, that contains other objects).
The fact that the C++ class involved here is an object handle is an implementation detail; it shouldn’t matter for a
pikepdf user.
The simplest types in PDFs are directly represented as Python types: int, bool, and None stand for PDF integers,
booleans and the “null”. Decimal is used for floating point numbers in PDFs. If a value in a PDF is assigned to a
Python float, pikepdf will convert it to Decimal.
Types that are not directly convertible to Python are represented as pikepdf.Object, a compound object that offers
a superset of possible methods, some of which work only if the underlying type is suitable. Use the EAFP (easier to ask
forgiveness than permission) idiom, or isinstance, to determine the type more precisely. This partly reflects the
fact that the PDF specification allows many data fields to be one of several types.
For convenience, the repr() of a pikepdf.Object will display a Python expression that replicates the existing
object (when possible), so it will say:
The great news is that it’s often unnecessary to construct pikepdf.Object objects when working with pikepdf.
Python types are transparently converted to the appropriate pikepdf object when passed to pikepdf APIs – when
possible. However, pikepdf sends pikepdf.Object types back to Python on return calls, in most cases, because
pikepdf needs to keep track of objects that came from PDFs originally.
As mentioned above, a pikepdf.Object may reference data that is lazily loaded from its source pikepdf.Pdf.
Closing the Pdf with pikepdf.Pdf.close() will invalidate some objects, depending on whether or not the data
was loaded, and other implementation details that may change. Generally speaking, a pikepdf.Pdf should be held
open until it is no longer needed, and objects that were derived from it may or may not be usable after it is closed.
Simple objects (booleans, integers, decimals, None) are copied directly to Python as pure Python objects.
For PDF stream objects, use pikepdf.Object.read_bytes() to obtain a copy of the object as pure bytes data,
if this information is required after closing a PDF.
When objects are copied from one pikepdf.Pdf to another, the underlying data is copied immediately into the
target. As such it is possible to merge hundreds of Pdf into one, keeping only a single source and the target file open
at a time.
A pikepdf.Stream object works like a PDF dictionary with some encoded bytes attached. The dictionary is
metadata that describes how the stream is encoded. PDF can, and regularly does, use a variety of encoding filters. A
stream can be encoded with one or more filters. Images are a type of stream object.
Most of the interesting content in a PDF (images and content streams) is inside stream objects.
Because the PDF specification unfortunately defines several terms that involve the word stream, let’s attempt to clarify:
stream object A PDF object that contains binary data and a metadata dictionary that describes it, represented
as pikepdf.Stream. In HTML this is equivalent to an <object> tag with attributes and data.
object stream A stream object (not a typo, an object stream really is a type of stream object) in a PDF that
contains a number of other objects in a PDF, grouped together for better compression. In pikepdf there is an
option to save PDFs with this feature enabled to improve compression. Otherwise, this is just a detail about
how PDF files are encoded.
content stream A stream object that contains some instructions to draw graphics and text on a page, or inside
a Form XObject. In HTML this is equivalent to the HTML file itself. Content streams do not cross pages.
Form XObject A group of images, text and drawing commands that can be rendered elsewhere in a PDF as a
group. This is often used when a group of objects is needed at different scales or on multiple pages. In HTML
this is like an <svg>. It is not a fillable PDF form (although a fillable PDF form could involve Form XObjects).
Fig. 7: When it comes to taxonomy, software developers have it easy.
A content stream is a stream object associated with either a page or a Form XObject that describes where and how to
draw images, vectors, and text.
Content streams are binary data that can be thought of as a list of operators and zero or more operands. Operands
are given first, followed by the operator. It is a stack-based language, loosely based on PostScript. (It’s not actually
PostScript, but sometimes well-meaning people mistakenly say that it is!) Like HTML, it has a precise grammar, and
also like (pure) HTML, it has no loops, conditionals or variables.
A typical example is as follows (with additional whitespace and PostScript-style %-comments):
The pattern q, cm, <drawing commands>, Q is extremely common. The drawing commands may recurse
with another q, cm, ..., Q.
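To make the operands-then-operator layout concrete, here is a deliberately simplified sketch that groups the tokens of a small content stream into instructions. It is not pikepdf's parser: group_instructions and is_number are hypothetical helpers, the sample stream is made up for illustration (its values echo the 200 × 304 page and the /Im0 image used in this chapter), and the tokenizer ignores strings, arrays, dictionaries and inline images.

```python
def is_number(token):
    """Return True if the token looks like a PDF number."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def group_instructions(stream_text):
    """Group whitespace-separated tokens into (operands, operator) pairs."""
    instructions = []
    operands = []
    for line in stream_text.splitlines():
        code = line.split("%", 1)[0]  # drop a PostScript-style %-comment
        for token in code.split():
            if token.startswith("/") or is_number(token):
                operands.append(token)  # operands accumulate first...
            else:
                instructions.append((operands, token))  # ...then the operator
                operands = []
    return instructions


EXAMPLE = """
q                     % save the graphics state
200 0 0 304 0 0 cm    % set the current transform matrix
/Im0 Do               % draw the image named /Im0
Q                     % restore the graphics state
"""

for operands, operator in group_instructions(EXAMPLE):
    print(operator, operands)
```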
pikepdf provides a C++ optimized content stream parser and a filter. The parser is best used for reading and interpreting
content streams; the filter is better for low level editing.
This example prints a typical content stream from a real file, which like the contrived example above, displays an
actual image.
In [3]: commands = []
PDF content streams are stateful. The commands q, cm and Q manipulate the current transform matrix (CTM) which
describes where we will draw next. In most cases you have to track every manipulation of the CTM to figure out what
will happen, even to answer a question like, “where will this image be drawn, and how big will it be?”
But in this simple case, we can read the matrix directly. The decimal numbers 200.0 and 304.0 establish the width
and height at which the image should be drawn, in PDF points (1/72” or about 0.35 mm). The pixel dimensions of the
image have no effect. If we substituted that image for another, the new image would be drawn in the same location on
the page, painted into the 200 × 304 rectangle regardless of its pixel dimensions.
Let’s continue with the file above and center the image on the page, and reduce its size by 50%. Because we can! For
that, we need to rewrite the second command in the content stream.
We take the original matrix (original) and then translate it to the center of this page. We know that the full page
image is 200 × 304 PDF points, so we translate by one half on each axis: .translated(200/2, 304/2). Then
we scale by 0.5: .scaled(0.5, 0.5).
In [7]: new_matrix
Out[7]: pikepdf.Matrix(((100.0, 0.0, 0.0), (0.0, 152.0, 0.0), (50.0, 76.0, 1.0)))
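The arithmetic behind this result can be reproduced with plain 3 × 3 matrices. The sketch below is not pikepdf's Matrix class; the functions matmul, translated, scaled and apply are stand-ins written for this illustration. It assumes that the original matrix is 200 0 0 304 0 0 and that each operation post-multiplies the current matrix (row-vector convention), which is consistent with the output shown above.

```python
def matmul(a, b):
    """Multiply two 3x3 matrices given as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3))
        for i in range(3)
    )


def translated(m, tx, ty):
    return matmul(m, ((1, 0, 0), (0, 1, 0), (tx, ty, 1)))


def scaled(m, sx, sy):
    return matmul(m, ((sx, 0, 0), (0, sy, 0), (0, 0, 1)))


def apply(m, x, y):
    """Transform the point (x, y) as the row vector [x y 1] times m."""
    return (
        x * m[0][0] + y * m[1][0] + m[2][0],
        x * m[0][1] + y * m[1][1] + m[2][1],
    )


# Assumed original matrix: draw the unit square as a 200 x 304 rectangle.
original = ((200.0, 0.0, 0.0), (0.0, 304.0, 0.0), (0.0, 0.0, 1.0))
new_matrix = scaled(translated(original, 200 / 2, 304 / 2), 0.5, 0.5)

print(new_matrix)
print(apply(new_matrix, 0, 0), apply(new_matrix, 1, 1))
```

An image occupies the unit square before the matrix is applied, so the two printed corners give the rectangle it is painted into: a 100 × 152 area centered on the 200 × 304 page.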
On an important note, the PDF coordinate system is nailed to the bottom left corner of the page, and on the y-axis, up is
positive. That is, the coordinate system is more like the first quadrant of a Cartesian graph than the “down is positive”
convention normally used in pixel graphics.
Thus the command .translated(200/2, 304/2) translates from the origin at the bottom left, (0, 0), to the
right by 100 units, and up 152 units. (Some PDF programs insert a command to “flip” the coordinate system, by
translating to the top left corner and scaling by (1, -1).)
After calculating our new matrix, we need to insert it back into the parsed content stream, “unparse” it to binary data,
and replace the old content stream.
In [10]: new_content_stream
Out[10]: b' q\n100.000000 0.000000 0.000000 152.000000 50.000000 76.000000 cm\n/Im0 Do\n Q'
Note: To rotate an image, first translate it so that the image is centered at (0, 0), then apply the rotation, then
translate it to its new center position. This is because rotations occur around (0, 0).
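The translate, rotate, translate recipe can be checked on a single point with ordinary trigonometry. This is a sketch, not pikepdf API: rotate_about is a hypothetical helper, and it assumes the PDF convention that y points up, so positive angles turn counterclockwise.

```python
import math


def rotate_about(point, center, degrees):
    """Rotate a point around an arbitrary center in three steps."""
    x, y = point
    cx, cy = center
    # 1. Translate so that the center of rotation sits at (0, 0).
    x, y = x - cx, y - cy
    # 2. Rotate around the origin.
    theta = math.radians(degrees)
    x, y = (
        x * math.cos(theta) - y * math.sin(theta),
        x * math.sin(theta) + y * math.cos(theta),
    )
    # 3. Translate back to the (here unchanged) center position.
    return x + cx, y + cy


# A point 50 units to the right of the page center (100, 152) ends up
# 50 units above the center after a quarter turn.
rotated = rotate_about((150, 152), (100, 152), 90)
print(rotated)
```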
Note: In this illustration, the page’s MediaBox is located at (0, 0) for simplicity. The MediaBox can be offset from
the origin, and code that edits content streams may need to account for this condition.
The stateful nature of PDF content streams makes editing them complicated. Edits like the example above will work
when the input file is known to have a fixed structure (that is, the state at the time of editing is known). You can always
prepend content to the top of the content stream, since the initial state is known. And you can often append content to
the end of the stream, since the final state is predictable if every q (push state) has a matching Q (pop state).
Otherwise, you must track the graphics state and maintain a stack of states.
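A minimal sketch of the q/Q bookkeeping: the hypothetical helpers below take a list of operator names (for example, obtained from a parsed content stream) and report whether every q has a matching Q, which is the condition under which appending to the end of the stream is predictable. They track only the nesting depth, not the full graphics state.

```python
def final_state_depth(operators):
    """Return how many graphics states are still pushed at the end."""
    depth = 0
    for op in operators:
        if op == "q":  # push state
            depth += 1
        elif op == "Q":  # pop state
            if depth == 0:
                raise ValueError("Q without a matching q")
            depth -= 1
    return depth


def safe_to_append(operators):
    """True if every q is matched by a Q, so the final state is known."""
    try:
        return final_state_depth(operators) == 0
    except ValueError:
        return False


balanced = safe_to_append(["q", "cm", "Do", "Q"])
unbalanced = safe_to_append(["q", "cm", "Do"])
print(balanced, unbalanced)
```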
Most applications will end up parsing the content stream into a higher level representation that is easier to edit and then
serializing it back, totally rewriting the content stream. Content streams should be thought of as an output format.
If you guessed that the content streams were the place to look for text inside a PDF – you’d be correct. Unfortunately,
extracting the text is fairly difficult because a content stream actually specifies a font and the glyph numbers to use.
Sometimes, there is a 1:1 transparent mapping between Unicode numbers and glyph numbers, and a dump of the content
stream will show the text. In general, you cannot rely on there being a transparent mapping; in fact, it is perfectly
legal for a font to specify no Unicode mapping at all, or to use an unconventional mapping (when a PDF contains a
subsetted font for example).
We strongly recommend against trying to scrape text from the content stream.
pikepdf does not currently implement text extraction. We recommend pdfminer.six, a read-only text extraction tool. If
you wish to write PDFs containing text, consider reportlab.
PDFs embed images as binary stream objects within the PDF’s data stream. The stream object’s dictionary describes
properties of the image such as its dimensions and color space. The same image may be drawn multiple times on
multiple pages, at different scales and positions.
In some cases such as JPEG2000, the standard file format of the image is used verbatim, even when the file format
contains headers and information that is repeated in the stream dictionary. In other cases such as for PNG-style
encoding, the image file format is not used directly.
pikepdf currently has no facility to embed new images into PDFs. We recommend img2pdf instead, because it does
the job so well. pikepdf instead allows for image inspection and lossless/transcode free (where possible) “pdf2img”.
pikepdf provides a helper class PdfImage for manipulating images in a PDF. The helper class helps manage the
complexity of the image dictionaries.
In [4]: list(page1.images.keys())
Out[4]: ['/Im0']
In [7]: type(pdfimage)
Out[7]: pikepdf.models.image.PdfImage
In Jupyter (or IPython with a suitable backend) the image will be displayed.
You can also inspect the properties of the image. The parameters are similar to Pillow’s.
In [8]: pdfimage.colorspace
Out[8]: '/DeviceRGB'
Note: .width and .height are the resolution of the image in pixels, not the size of the image in page coordinates.
The size of the image in page coordinates is determined by the content stream.
Extracting images
Extracting images is straightforward. extract_to() will extract images to a specified file prefix. The extension
is determined while extracting and appended to the filename. Where possible, extract_to writes compressed data
directly to the stream without transcoding. (Transcoding lossy formats like JPEG can reduce their quality.)
In [10]: pdfimage.extract_to(fileprefix='image')
Out[10]: 'image.jpg'
In [11]: type(pdfimage.as_pil_image())
Out[11]: PIL.JpegImagePlugin.JpegImageFile
Note: This simple example PDF displays a single full page image. Some PDF creators will paint a page using
multiple images, and features such as layers, transparency and image masks. Accessing the first image on a page is
like an HTML parser that scans for the first <img src=""> tag it finds. A lot more could be happening. There
can be multiple images drawn multiple times on a page, vector art, overdrawing, masking, and transparency. A set of
resources can be grouped together in a “Form XObject” (not to be confused with a PDF Form), and drawn all at once.
Images can be referenced by multiple pages.
Replacing an image
str(pikepdf.String) is performed by inspecting the binary data. If the binary data begins with a UTF-16 byte
order mark, then the data is interpreted as UTF-16 and returned as a Python str. Otherwise, the binary data is
interpreted as PDFDocEncoding and decoded to a Python str. Again, in most cases
this is correct behavior and will operate transparently.
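The decision described above can be sketched in a few lines of standard-library Python. This is an illustration, not pikepdf's code: decode_pdf_string is a hypothetical helper, and latin-1 is only a stand-in for PDFDocEncoding that gives the same result for printable ASCII but not necessarily for other bytes. With pikepdf imported, the registered "pdfdoc" codec could be passed as fallback_codec instead.

```python
import codecs


def decode_pdf_string(data, fallback_codec="latin-1"):
    """Decode PDF string bytes: UTF-16 if a BOM is present, else fallback."""
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        # The "utf-16" codec consumes the byte order mark itself.
        return data.decode("utf-16")
    # Stand-in for PDFDocEncoding; exact only for printable ASCII.
    return data.decode(fallback_codec)


with_bom = decode_pdf_string(b"\xfe\xff\x00H\x00i")
without_bom = decode_pdf_string(b"Hello World")
print(with_bom, without_bom)
```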
Some functions are available in circumstances where it is necessary to force a particular conversion.
PDFDocEncoding
The PDF specification defines PDFDocEncoding, a character encoding used only in PDFs. This encoding matches
ASCII for code points 32 through 126 (0x20 to 0x7e). At all other code points, it is not ASCII and cannot be treated
as equivalent. If you look at a PDF in a binary file viewer (hex editor), a string surrounded by parentheses such as
(Hello World) is usually using PDFDocEncoding.
When pikepdf is imported, it automatically registers "pdfdoc" as a codec with the standard library, so that it may be
used in string and byte conversions.
"•".encode('pdfdoc') == b'\x81'
Other codecs
Two other codecs are commonly used in PDFs, but they are already part of the standard library.
WinAnsiEncoding is identical to Windows Code Page 1252, and may be converted using the "cp1252" codec.
MacRomanEncoding may be converted using the "macroman" codec.
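Both codecs ship with Python, so no pikepdf import is needed to use them. A short illustration with a sample string chosen for this sketch:

```python
text = "café"

# The same character maps to different bytes in the two encodings.
win_bytes = text.encode("cp1252")    # WinAnsiEncoding
mac_bytes = text.encode("macroman")  # MacRomanEncoding

print(win_bytes, mac_bytes)

# Decoding with the matching codec recovers the original text.
print(win_bytes.decode("cp1252"), mac_bytes.decode("macroman"))
```

Decoding with the wrong codec does not raise an error here; it silently produces a different character, which is why the encoding declared by the PDF should be respected.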
PDF has two different types of metadata: XMP metadata, and DocumentInfo, which is deprecated but still relevant. For
backward compatibility, both should contain the same content. pikepdf provides a convenient interface that coordinates
edits to both, but is limited to the most common metadata features.
XMP (Extensible Metadata Platform) Metadata is a metadata specification in XML format that is used in many formats
other than PDF. For full information on XMP, see Adobe’s XMP Developer Center. The XMP Specification also
provides useful information.
pikepdf can read compound metadata quantities, but can only modify scalars. For more complex changes consider
using the python-xmp-toolkit library and its libexempi dependency; but note that it is not capable of
synchronizing changes to the older DocumentInfo metadata.
By default pikepdf will create an XMP metadata block and set pdf:PDFVersion to a value that matches the PDF
version declared elsewhere in the PDF, whenever a PDF is saved. To suppress this behavior, save with
pdf.save(..., fix_metadata_version=False).
Also by default, Pdf.open_metadata() will synchronize the XMP metadata with the older document information
dictionary. This behavior can also be adjusted using keyword arguments.
Accessing metadata
The XMP metadata stream is attached to the PDF’s root object, but to simplify management of this, use
pikepdf.Pdf.open_metadata(). The returned pikepdf.models.PdfMetadata object may be used for reading, or
entered with a with block to modify and commit changes. If you use this interface, pikepdf will synchronize changes
to new and old metadata.
A PDF must still be saved after metadata is changed.
In [3]: meta['xmp:CreatorTool']
Out[3]: 'ocrmypdf 5.3.3 / Tesseract OCR-PDF 3.05.01'
The list of available metadata fields may be found in the XMP Specification.
Use del meta['dc:title'] to delete a metadata entry. To remove all of the XMP metadata, use
del pdf.Root.Metadata.
The metadata interface can also test if a file claims to be conformant to the PDF/A specification.
In [7]: meta.pdfa_status
Out[7]: '1B'
Note: This property merely tests whether the file claims to be conformant to the PDF/A standard. Use a tool such
as veraPDF to verify conformance.
If you are using pikepdf to create some kind of PDF application, you should update the fields xmp:CreatorTool
and pdf:Producer. You could, for example, set xmp:CreatorTool to your application’s name and version, and
pdf:Producer to pikepdf. Refer to Adobe’s documentation to decide what describes the circumstances.
This will help PDF developers identify the application that generated a particular PDF and is valuable debugging
information.
You can read the raw XMP metadata if desired. For example, one could extract it and edit it using the full featured
python-xmp-toolkit library.
In [9]: type(xmp)
Out[9]: bytes
In [10]: print(xmp.decode())
<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
<dc:creator>
<rdf:Seq>
<rdf:li>veraPDF Consortium</rdf:li>
</rdf:Seq>
</dc:creator>
</rdf:Description>
<rdf:Description xmlns:xmp="http://ns.adobe.com/xap/1.0/" rdf:about="" xmp:CreatorTool="veraPDF Test Builder" xmp:CreateDate="2015-03-10T17:19:21+01:00" xmp:ModifyDate="2015-03-10T17:19:21+01:00"/>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end='w'?>
Editing XMP with a generic XML library is probably not worth the trouble; the semantics are fairly complex.
Warning: Manual changes to the XMP stream object will not be synchronized with a live PdfMetadata object or the
DocumentInfo block.
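For read-only inspection, the raw packet printed above can be examined with the standard library. The sketch below only reads two values and is not a substitute for the metadata interface; the NS prefix mapping is defined here for convenience, and the packet text is copied from the output above.

```python
import xml.etree.ElementTree as ET

XMP = b"""<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
<dc:creator>
<rdf:Seq>
<rdf:li>veraPDF Consortium</rdf:li>
</rdf:Seq>
</dc:creator>
</rdf:Description>
<rdf:Description xmlns:xmp="http://ns.adobe.com/xap/1.0/" rdf:about=""
 xmp:CreatorTool="veraPDF Test Builder"
 xmp:CreateDate="2015-03-10T17:19:21+01:00"
 xmp:ModifyDate="2015-03-10T17:19:21+01:00"/>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end='w'?>"""

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "xmp": "http://ns.adobe.com/xap/1.0/",
}

root = ET.fromstring(XMP)

# dc:creator is a compound value: an ordered list of names.
creators = [li.text for li in root.findall(".//dc:creator/rdf:Seq/rdf:li", NS)]

# xmp:CreatorTool is stored as an attribute of an rdf:Description element.
creator_tool = None
for description in root.iter("{%s}Description" % NS["rdf"]):
    creator_tool = description.get("{%s}CreatorTool" % NS["xmp"], creator_tool)

print(creators, creator_tool)
```

Note that the same property may legally be written either as an attribute or as a child element, which is one reason generic XML handling of XMP becomes complicated quickly.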
The Document Info block is an older, now deprecated object in which metadata may be stored. The Document Info
is not attached to the /Root object. It may be accessed using the .docinfo property. If no Document Info exists,
touching the .docinfo will properly initialize an empty one.
Here is an example of a Document Info block.
In [12]: pdf.docinfo
Out[12]:
pikepdf.Dictionary({
"/CreationDate": "D:20170911132748-07'00'",
(continues on next page)
It is permitted in pikepdf to directly interact with Document Info as with other PDF dictionaries. However, it is better to
use .open_metadata() because that interface will apply changes to both XMP and Document Info in a consistent
manner.
You may copy data from a Document Info object in the current PDF or another PDF into XMP metadata using
load_from_docinfo().
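Document Info dates such as the /CreationDate value in the example above use a PDF-specific text format. The following sketch converts the complete form of such a string into a Python datetime. parse_pdf_date is a hypothetical helper, not a pikepdf function, and it handles only the full D:YYYYMMDDHHmmSS form with an optional Z or signed offset.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_pdf_date(value):
    """Parse a complete PDF date string such as D:20170911132748-07'00'."""
    match = re.fullmatch(r"D:(\d{14})(?:(Z)|([+-])(\d{2})'(\d{2})'?)?", value)
    if match is None:
        raise ValueError(f"unsupported PDF date: {value!r}")
    stamp, zulu, sign, hours, minutes = match.groups()
    naive = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    if zulu:
        return naive.replace(tzinfo=timezone.utc)
    if sign:
        offset = timedelta(hours=int(hours), minutes=int(minutes))
        if sign == "-":
            offset = -offset
        return naive.replace(tzinfo=timezone(offset))
    return naive  # no time zone information was given


created = parse_pdf_date("D:20170911132748-07'00'")
print(created.isoformat())
```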
1.3.12 Outlines
Outlines (sometimes also called bookmarks) are shown in the PDF viewer alongside the page, allowing for navigation
within the document.
Creating outlines
Outlines can be created from scratch, e.g. when assembling a set of PDF files into a single document.
The following example adds outline entries referring to the 1st, 3rd and 9th page of an existing PDF.
In [4]: pdf.save('document_with_outline.pdf')
Another example, for automatically adding an entry for each file in a merged document:
In [7]: page_count = 0
In [9]: pdf.save('merged.pdf')
Editing outlines
Existing outlines can be edited. Entries can be moved and renamed without affecting the targets they refer to.
Destinations
Destinations tell the PDF viewer where to go when navigating through outline items. The simplest case is a reference
to a page, together with the page location, e.g. Fit (default). However, named destinations can also be assigned.
The PDF specification allows for either use of a destination (Dest attribute) or an action (A attribute), but not both on
the same element. OutlineItem elements handle this as follows:
• When creating new outline entries passing in a page number or reference name, the Dest attribute is used.
• When editing an existing entry with an assigned action, it is left as-is, unless a destination is set. The latter
is preferred if both are present.
Creating a more detailed destination with page location:
The above will call make_page_destination when saving to a Pdf document, roughly equivalent to the
following:
Outline structure
For nesting outlines, add items to the children list of another OutlineItem.
class pikepdf.Pdf
In-memory representation of a PDF
Root
The /Root object of the PDF.
add_blank_page(*, page_size=(612, 792))
Add a blank page to this PDF. If pages already exist, the page will be added to the end. Pages may be
reordered using Pdf.pages.
The caller may add content to the page by modifying its objects after creating it.
Parameters page_size (tuple) – The size of the page in PDF units (1/72 inch or 0.35mm).
Default size is set to a US Letter 8.5” x 11” page.
allow
Report permissions associated with this PDF.
By default these permissions will be replicated when the PDF is saved. Permissions may also only be
changed when a PDF is being saved, and are only available for encrypted PDFs. If a PDF is not encrypted,
all operations are reported as allowed.
pikepdf has no way of enforcing permissions.
Returns pikepdf.models.Permissions
check()
Check if PDF is well-formed. Similar to qpdf --check.
Returns list of strings describing errors or warnings in the PDF
check_linearization(self: pikepdf.Pdf, stream: object = sys.stderr) → None
Reports information on the PDF’s linearization
Parameters stream – A stream to write this information to; must implement .write() and
.flush() methods. Defaults to sys.stderr.
close()
Close a Pdf object and release resources acquired by pikepdf.
If pikepdf opened the file handle it will close it (e.g. when opened with a file path). If the caller opened
the file for pikepdf, the caller must close the file.
pikepdf lazily loads data from PDFs, so some pikepdf.Object may implicitly depend on the
pikepdf.Pdf being open. This is always the case for pikepdf.Stream but can be true for any
object. Do not close the Pdf object if you might still be accessing content from it.
When an Object is copied from one Pdf to another, the Object is copied into the destination Pdf
immediately, so after accessing all desired information from the source Pdf it may be closed.
Caution: Closing the Pdf is currently implemented by resetting it to an empty sentinel. It is currently
possible to edit the sentinel as if it were a live object. This behavior should not be relied on and is
subject to change.
Returns: pikepdf.models.EncryptionInfo
filename
The source filename of an existing PDF, when available.
get_object(*args, **kwargs)
Overloaded function.
1. get_object(self: pikepdf.Pdf, objgen: Tuple[int, int]) -> pikepdf.Object
Look up an object by ID and generation number
Return type: pikepdf.Object
2. get_object(self: pikepdf.Pdf, objid: int, gen: int) -> pikepdf.Object
Look up an object by ID and generation number
Return type: pikepdf.Object
get_warnings(self: pikepdf.Pdf ) → list
is_linearized
Returns True if the PDF is linearized.
Specifically returns True iff the file starts with a linearization parameter dictionary. Does no additional
validation.
make_indirect(*args, **kwargs)
Overloaded function.
1. make_indirect(self: pikepdf.Pdf, h: pikepdf.Object) -> pikepdf.Object
Attach an object to the Pdf as an indirect object
Direct objects appear inline in the binary encoding of the PDF. Indirect objects appear inline
as references (in English, “look up object 4 generation 0”) and then read from another location
in the file. The PDF specification requires that certain objects are indirect - consult the PDF
specification to confirm.
Generally a resource that is shared should be attached as an indirect object.
pikepdf.Stream objects are always indirect, and creating them will automatically attach them to the
Pdf.
See Also: pikepdf.Object.is_indirect()
Return type: pikepdf.Object
2. make_indirect(self: pikepdf.Pdf, obj: object) -> pikepdf.Object
Encode a Python object and attach to this Pdf as an indirect object
Return type: pikepdf.Object
make_stream(data)
Create a new pikepdf.Stream object that is attached to this PDF.
Parameters data (bytes) – Binary data for the stream object
static new() → pikepdf.Pdf
Create a new empty PDF from scratch.
objects
Return an iterable list of all objects in the PDF.
After deleting content from a PDF such as pages, objects related to that page, such as images on the page,
may still be present.
Examples
Parameters
• filename_or_stream (os.PathLike) – Filename of PDF to open.
• password (str or bytes) – User or owner password to open an encrypted PDF. If
the type of this parameter is str it will be encoded as UTF-8. If the type is bytes it will
be saved verbatim. Passwords are always padded or truncated to 32 bytes internally. Use
ASCII passwords for maximum compatibility.
• hex_password (bool) – If True, interpret the password as a hex-encoded version of
the exact encryption key to use, without performing the normal key computation. Useful
in forensics.
• ignore_xref_streams (bool) – If True, ignore cross-reference streams. See qpdf
documentation.
• suppress_warnings (bool) – If True (default), warnings are not printed to stderr.
Use pikepdf.Pdf.get_warnings() to retrieve warnings.
• attempt_recovery (bool) – If True (default), attempt to recover from PDF parsing
errors.
• inherit_page_attributes (bool) – If True (default), push attributes set on a
group of pages to individual pages
Example
Parameters
• set_pikepdf_as_editor (bool) – Update the metadata to show that this version
of pikepdf is the most recent software to modify the metadata. Recommended, except for
testing.
• update_docinfo (bool) – Update the standard fields of DocumentInfo (the old PDF
metadata dictionary) to match the corresponding XMP fields. The mapping is described
in PdfMetadata.DOCINFO_MAPPING. Nonstandard DocumentInfo fields and XMP
metadata fields with no DocumentInfo equivalent are ignored.
• strict (bool) – If False (the default), we aggressively attempt to recover from any
parse errors in XMP, and if that fails we overwrite the XMP with an empty XMP record.
If True, raise errors when either metadata bytes are not valid and well-formed XMP
(and thus, XML). Some trivial cases that are equivalent to empty or incomplete “XMP
skeletons” are never treated as errors, and always replaced with a proper empty XMP
block. Certain errors may be logged.
Returns pikepdf.models.PdfMetadata
open_outline(max_depth=15, strict=False)
Open the PDF outline (“bookmarks”) for editing.
Recommended for use in a with block. Changes are committed to the PDF when the block exits. (The Pdf
must still be opened.)
Example
Parameters
• max_depth (int) – Maximum recursion depth of the outline to be imported and
rewritten to the document. 0 means only considering the root level, 1 the first-level
sub-outline of each root element, and so on. Items beyond this depth will be silently ignored.
Default is 15.
• strict (bool) – With the default behavior (set to False), structural errors (e.g.
reference loops) in the PDF document will only cancel processing further nodes on that
particular level, recovering the valid parts of the document outline without raising an exception.
When set to True, any such error will raise an OutlineStructureError, leaving
the invalid parts in place. Similarly, outline objects that have been accidentally duplicated
in the Outline container will be silently fixed (i.e. reproduced as new objects) or raise
an OutlineStructureError.
Returns pikepdf.models.Outline
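The meaning of max_depth and strict can be illustrated with a toy tree walk. This sketch does not use pikepdf: Node and collect_titles are hypothetical, and a plain ValueError stands in for OutlineStructureError. It only mirrors the rules stated above, namely that level 0 is the root level, that deeper items are silently skipped, and that a reference loop either stops processing on that level or raises.

```python
class Node:
    """Hypothetical outline entry with a title and child entries."""

    def __init__(self, title, children=None):
        self.title = title
        self.children = children or []


def collect_titles(items, max_depth=15, strict=False, _level=0, _seen=None):
    """Return (level, title) pairs, honoring max_depth and loop handling."""
    seen = set() if _seen is None else _seen
    found = []
    for item in items:
        if id(item) in seen:
            if strict:
                raise ValueError("reference loop in outline")
            break  # cancel processing further nodes on this level
        seen.add(id(item))
        found.append((_level, item.title))
        if _level < max_depth:
            found.extend(
                collect_titles(item.children, max_depth, strict, _level + 1, seen)
            )
    return found


tree = [Node("A", [Node("A.1", [Node("A.1.a")])]), Node("B")]
root_only = collect_titles(tree, max_depth=0)
one_level = collect_titles(tree, max_depth=1)
print(root_only)
print(one_level)
```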
pages
Returns the list of pages.
Return type: pikepdf._qpdf.PageList
pdf_version
The version of the PDF specification used for this file, such as ‘1.7’.
remove_unreferenced_resources(self: pikepdf.Pdf ) → None
Remove from /Resources of each page any object not referenced in page’s contents
PDF pages may share resource dictionaries with other pages. If pikepdf is used for page splitting, pages
may reference resources in their /Resources dictionary that are not actually required. This purges all
unnecessary resource entries.
Suggested before saving.
root
Alias for .Root, the /Root object of the PDF.
save(filename_or_stream=None, static_id=False, preserve_pdfa=True, min_version='',
force_version='', fix_metadata_version=True, compress_streams=True,
stream_decode_level=None, object_stream_mode=ObjectStreamMode.preserve,
normalize_content=False, linearize=False, qdf=False, progress=None, encryption=None)
Save all modifications to this pikepdf.Pdf.
Parameters
• filename (Path or str or stream) – Where to write the output. If a
file exists in this location it will be overwritten. If the file was opened with
allow_overwriting_input=True, then it is permitted to overwrite the original
file, and this parameter may be omitted to implicitly use the original filename. Otherwise,
the filename may not be the same as the input file, as overwriting the input file would
corrupt data since pikepdf uses lazy loading.
• static_id (bool) – Indicates that the /ID metadata, normally calculated as a hash of
certain PDF contents and metadata including the current time, should instead be generated
deterministically. Normally for debugging.
• preserve_pdfa (bool) – Ensures that the file is generated in a manner compliant with
PDF/A and other stricter variants. This should be True, the default, in most cases.
• min_version (str or tuple) – Sets the minimum version of PDF specification
that should be required. If left alone QPDF will decide. If a tuple, the second element
is an integer, the extension level. If the version number is not a valid format, QPDF will
decide what to do.
• force_version (str or tuple) – Override the version recommended by QPDF,
potentially creating an invalid file that does not display in old versions. See QPDF manual
for details. If a tuple, the second element is an integer, the extension level.
• fix_metadata_version (bool) – If True (default) and the XMP metadata contains
the optional PDF version field, ensure the version in metadata is correct. If the XMP
metadata does not contain a PDF version field, none will be added. To ensure that the field
is added, edit the metadata and insert a placeholder value in pdf:PDFVersion. If XMP
metadata does not exist, it will not be created regardless of the value of this argument.
• object_stream_mode (pikepdf.ObjectStreamMode) – disable prevents
the use of object streams. preserve keeps object streams from the input file.
generate uses object streams wherever possible, creating the smallest files but requiring
PDF 1.5+.
• compress_streams (bool) – Enables or disables the compression of stream objects
in the PDF. Metadata is never compressed. By default this is set to True, and should be
except for debugging.
• stream_decode_level (pikepdf.StreamDecodeLevel) – Specifies how to
encode stream objects. See documentation for StreamDecodeLevel.
• normalize_content (bool) – Enables parsing and reformatting the content stream
within PDFs. This may make debugging PDFs easier.
• linearize (bool) – Enables creating linearized or “fast web view” files, where the
file’s contents are organized sequentially so that a viewer can begin rendering before it
has the whole file. As a drawback, it tends to make files larger.
• qdf (bool) – Save output in QDF mode. QDF mode is a special output mode in QPDF to
allow editing of PDFs in a text editor. Use the program fix-qdf to convert back to a
standard PDF.
• progress (callable) – Specify a callback function that is called as the PDF is written.
The function will be called with an integer between 0 and 100 as the sole parameter, the
progress percentage. This function may not access or modify the PDF while it is being
written, or data corruption will almost certainly occur.
• encryption (pikepdf.models.Encryption or bool) – If False or omit-
ted, existing encryption will be removed. If True encryption settings are copied from
the originating PDF. Alternately, an Encryption object may be provided that sets the
parameters for new encryption.
You may call .save() multiple times with different parameters to generate different versions of a file,
and you may continue to modify the file after saving it. .save() does not modify the Pdf object in
memory, except possibly by updating the XMP metadata version with fix_metadata_version.
Note: pikepdf can read PDFs with incremental updates, but always coalesces any incremental updates into
a single non-incremental PDF file when saving.
specialized
In addition to uncompressing the generalized compression formats, supported non-lossy compression will
also be decoded. At present, this includes the RunLengthDecode filter.
all
In addition to generalized and non-lossy specialized filters, supported lossy compression filters will be
applied. At present, this includes DCTDecode (JPEG) compression. Note that compressing the resulting
data with DCTDecode again will accumulate loss, so avoid multiple compression and decompression
cycles. This is mostly useful for (low-level) retrieving image data; see pikepdf.PdfImage for the
preferred method.
class pikepdf.Encryption(*, owner, user, R=6, allow=Permissions(accessibility=True,
extract=True, modify_annotation=True, modify_assembly=False,
modify_form=True, modify_other=True, print_highres=True,
print_lowres=True), aes=True, metadata=True)
Specify the encryption settings to apply when a PDF is saved.
Parameters
• owner (str) – The owner password to use. This allows full control of the file. If blank,
the PDF will be encrypted and present as “(SECURED)” in PDF viewers. If the owner
password is blank, the user password should be as well.
• user (str) – The user password to use. With this password, some restrictions will be
imposed by a typical PDF reader. If blank, the PDF can be opened by anyone, but only
modified as allowed by the permissions in allow.
• R (int) – Select the security handler algorithm to use. Choose from: 2, 3, 4 or 6. By
default, the highest version is selected (6). 5 is a deprecated algorithm that should not be
used.
• allow (pikepdf.Permissions) – The permissions to set. If omitted, all permissions
are granted to the user.
• aes (bool) – If True, request the AES algorithm. If False, use RC4. If omitted, AES is
selected whenever possible (R >= 4).
• metadata (bool) – If True, also encrypt the PDF metadata. If False, metadata is not
encrypted. Reading document metadata without decryption may be desirable in some cases.
Requires aes=True. If omitted, metadata is encrypted whenever possible.
exception pikepdf.PdfError
exception pikepdf.PasswordError
Object construction
class pikepdf.Object
Particularly when working with pages, it may be desirable to remove all of the existing page’s contents
and emplace (insert) a new page on top of it, in a way that preserves all links and references to the original
page. (Or similarly, for other Dictionary objects in a PDF.)
When a page is assigned (pdf.pages[0] = new_page), only the application knows if references to
the original page are still valid. For example, a PDF optimizer might restructure a page object
into another visually similar one, and references would be valid; but for a program that reorganizes page
contents such as an N-up compositor, references may not be valid anymore.
This method takes precautions to ensure that child objects in common with self and other are not
inadvertently deleted.
Example
>>> pdf.pages[0].objgen
(16, 0)
>>> pdf.pages[0].emplace(pdf.pages[1])
>>> pdf.pages[0].objgen
(16, 0) # Same object
The generation number is usually 0, except for PDFs that have been incrementally updated. Incrementally
updated PDFs are now uncommon, since it does not take too long for modern CPUs to reconstruct an entire
PDF. pikepdf will consolidate all incremental updates when saving.
page_contents_add(self: pikepdf.Object, contents: pikepdf.Object, prepend: bool = False) →
None
Append or prepend to an existing page’s content stream.
page_contents_coalesce(self: pikepdf.Object) → None
Coalesce an array of page content streams into a single content stream.
The PDF specification allows the /Contents object to contain either an array of content streams or a
single content stream. However, it simplifies parsing and editing if there is only a single content stream.
This function merges all content streams.
static parse(stream: str, description: str = '') → pikepdf.Object
Parse PDF binary representation into PDF objects.
read_bytes(self: pikepdf.Object, decode_level: pikepdf._qpdf.StreamDecodeLevel =
StreamDecodeLevel.generalized) → bytes
Decode and read the content stream associated with this object.
read_raw_bytes(self: pikepdf.Object) → bytes
Read the content stream associated with this object without decoding
same_owner_as(self: pikepdf.Object, arg0: pikepdf.Object) → bool
Test if two objects are owned by the same pikepdf.Pdf.
stream_dict
Access the dictionary key-values for a pikepdf.Stream.
to_json(self: pikepdf.Object, dereference: bool = False) → bytes
Convert to a QPDF JSON representation of the object.
See the QPDF manual for a description of its JSON representation.
http://qpdf.sourceforge.net/files/qpdf-manual.html#ref.json
Not necessarily compatible with other PDF-JSON representations that exist in the wild.
• Names are encoded as UTF-8 strings
• Indirect references are encoded as strings containing obj gen R
• Strings are encoded as UTF-8 strings with unrepresentable binary characters encoded as \uHHHH
• Encoding streams just encodes the stream’s dictionary; the stream data is not represented
• Object types that are only valid in content streams (inline image, operator) as well as “reserved”
objects are not representable and will be serialized as null.
Parameters dereference (bool) – If True, dereference the object if it is an indirect
object.
Returns JSON bytestring of object. The object is UTF-8 encoded and may be decoded to a
Python str that represents the binary values \x00-\xFF as U+0000 to U+00FF; that is, it
may contain mojibake.
Return type bytes
class pikepdf.Dictionary
Constructs a PDF Dictionary object
static __new__(cls, d=None, **kwargs)
Constructs a PDF Dictionary from either a Python dict or keyword arguments.
These two examples are equivalent:
pikepdf.Dictionary({'/NameOne': 1, '/NameTwo': 'Two'})
pikepdf.Dictionary(NameOne=1, NameTwo='Two')
In either case, the keys must be strings, and the strings correspond to the desired Names in the PDF
Dictionary. The values must all be convertible to pikepdf.Object.
Returns pikepdf.Object
class pikepdf.Stream
Constructs a PDF Stream object
static __new__(cls, owner, obj)
Parameters
• owner (pikepdf.Pdf) – The Pdf to which this stream shall be attached.
• obj (bytes or list) – If bytes, the data bytes for the stream. If list,
a list of (operands, operator) tuples such as returned by
pikepdf.parse_content_stream().
Returns pikepdf.Object
class pikepdf.Operator
Internal objects
These objects are returned by other pikepdf objects. They are part of the API, but not intended to be created explicitly.
class pikepdf._qpdf.PageList
A list-like object enumerating all pages in a pikepdf.Pdf.
append(self: pikepdf._qpdf.PageList, page: object) → None
Add another page to the end.
extend(*args, **kwargs)
Overloaded function.
1. extend(self: pikepdf._qpdf.PageList, other: pikepdf._qpdf.PageList) -> None
Extend the Pdf by adding pages from another Pdf.pages.
2. extend(self: pikepdf._qpdf.PageList, iterable: iterable) -> None
Extend the Pdf by adding pages from an iterable of pages.
insert(self: pikepdf._qpdf.PageList, index: int, obj: object) → None
Insert a page at the specified location.
Parameters
• index (int) – location at which to insert page, 0-based indexing
• obj (pikepdf.Object) – page object to insert
Support models are abstractions over “raw” objects within a Pdf. For example, a page in a PDF is a Dictionary with
/Type set to /Page. The Dictionary in that case is the “raw” object. Upon establishing what type of object it is, we
can wrap it with a support model that adds features to ensure consistency with the PDF specification.
pikepdf does not currently apply support models to “raw” objects automatically, but might do so in a future release
(this would break backward compatibility).
For example, to initialize a Page support model:
from pikepdf import Pdf, Page
pdf = Pdf.open(...)
page_support_model = Page(pdf.pages[0])
class pikepdf.Page
c
d
e
f
Return one of the six “active values” of the matrix.
encode()
Encode this matrix in binary suitable for including in a PDF
static identity()
Constructs and returns an identity matrix
rotated(angle_degrees_ccw)
Concatenates a rotation matrix on this matrix
scaled(x, y)
Concatenates a scaling matrix on this matrix
shorthand
Return the 6-tuple (a,b,c,d,e,f) that describes this matrix
translated(x, y)
Translates this matrix
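For readers unfamiliar with the 6-tuple form, the following standalone sketch shows the arithmetic behind concatenating such matrices. It is not pikepdf's implementation: it uses plain tuples, and it assumes the row-vector convention of the PDF specification, in which (a, b, c, d, e, f) stands for the 3×3 matrix with rows [a b 0], [c d 0], [e f 1] and a transformation is appended by multiplying on the right. All function names are illustrative.

```python
import math

IDENTITY = (1, 0, 0, 1, 0, 0)


def concat(m, n):
    """Return the product m x n of two matrices given as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def translated(m, x, y):
    return concat(m, (1, 0, 0, 1, x, y))


def scaled(m, x, y):
    return concat(m, (x, 0, 0, y, 0, 0))


def rotated(m, angle_degrees_ccw):
    angle = math.radians(angle_degrees_ccw)
    c, s = math.cos(angle), math.sin(angle)
    return concat(m, (c, s, -s, c, 0, 0))


# Translate by (10, 20), then scale by (2, 3).
combined = scaled(translated(IDENTITY, 10, 20), 2, 3)
quarter_turn = rotated(IDENTITY, 90)
print(combined, quarter_turn)
```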
class pikepdf.PdfImage(obj)
Support class to provide a consistent API for manipulating PDF images
The data structure for images inside PDFs is irregular and flexible, making it difficult to work with without
introducing errors for less typical cases. This class addresses these difficulties by providing a regular, Pythonic
API similar in spirit to (and convertible to) the Python Pillow imaging library.
as_pil_image()
Extract the image as a Pillow Image, using decompression as necessary
Returns PIL.Image.Image
extract_to(*, stream=None, fileprefix='')
Attempt to extract the image directly to a usable image file
If possible, the compressed data is extracted and inserted into a compressed image file format without
transcoding the compressed content. If this is not possible, the data will be decompressed and extracted to
an appropriate format.
Because it is not known until attempted what image format will be extracted, users should not assume what
format they are getting back. When saving the image to a file, use a temporary filename, and then rename
the file to its final name based on the returned file extension.
Examples
>>> im.extract_to(stream=bytes_io)
'.png'
>>> im.extract_to(fileprefix='/tmp/image00')
'/tmp/image00.jpg'
Parameters
• stream – Writable stream to write data to.
• fileprefix (str or Path) – The path to write the extracted image to, without the
file extension.
Returns If fileprefix was provided, then the fileprefix with the appropriate extension. If no
fileprefix, then an extension indicating the file type.
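The temporary-file advice above could be followed along these lines when writing to a stream. The helper extract_with_final_name and the FakeImage class are hypothetical: FakeImage merely stands in for a pikepdf.PdfImage so that the sketch runs without a real PDF, and it mimics only the documented behaviour of returning a file extension.

```python
import os
import tempfile
from pathlib import Path


def extract_with_final_name(image, directory, stem):
    """Write image into directory, naming the file after the returned extension."""
    directory = Path(directory)
    # Write to a temporary file first, because the format is not known up front.
    with tempfile.NamedTemporaryFile(dir=directory, delete=False) as tmp:
        extension = image.extract_to(stream=tmp)
    final_path = directory / (stem + extension)
    os.replace(tmp.name, final_path)
    return final_path


class FakeImage:
    """Hypothetical stand-in for pikepdf.PdfImage, for demonstration only."""

    def extract_to(self, *, stream=None, fileprefix=""):
        stream.write(b"\x89PNG\r\n\x1a\n")  # PNG signature as sample data
        return ".png"


with tempfile.TemporaryDirectory() as workdir:
    result = extract_with_final_name(FakeImage(), workdir, "image00")
    result_name, result_exists = result.name, result.exists()

print(result_name, result_exists)
```

When fileprefix is used instead of stream, the returned path already includes the extension, so no rename step is needed.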
get_stream_buffer(decode_level=StreamDecodeLevel.specialized)
Access this image with the buffer protocol
icc
If an ICC profile is attached, return a Pillow object that describes it.
Most of the information may be found in icc.profile.
Returns PIL.ImageCms.ImageCmsProfile
is_inline
False for image XObject
read_bytes(decode_level=StreamDecodeLevel.specialized)
Decompress this image and return it as unencoded bytes
show()
Show the image however PIL wants to
class pikepdf.PdfInlineImage(*, image_data, image_object: tuple)
Support class for PDF inline images
class pikepdf.models.PdfMetadata(pdf, pikepdf_mark=True, sync_docinfo=True,
overwrite_invalid_xml=True)
Read and edit the metadata associated with a PDF
The PDF specification contains two types of metadata, the newer XMP (Extensible Metadata Platform, XML-
based) and older DocumentInformation dictionary. The PDF 2.0 specification removes the
DocumentInformation dictionary.
This primarily works with XMP metadata, but includes methods to generate XMP from DocumentInformation
and will also coordinate updates to DocumentInformation so that the two are kept consistent.
XMP metadata fields may be accessed using the full XML namespace URI or the short name. For example,
metadata['dc:description'] and metadata['{http://purl.org/dc/elements/1.1/}description']
both refer to the same field. Several common XML namespaces are registered automatically.
See the XMP specification for details of allowable fields.
To update metadata, use a with block.
Example
See also:
pikepdf.Pdf.open_metadata()
load_from_docinfo(docinfo, delete_missing=False, raise_failure=False)
Populate the XMP metadata object with DocumentInfo
Parameters
• docinfo – a DocumentInfo, e.g. pdf.docinfo
• delete_missing – if the entry is not in DocumentInfo, delete the equivalent from XMP
• raise_failure – if True, raise any failure to convert docinfo; otherwise warn and
continue
A few entries in the deprecated DocumentInfo dictionary are considered approximately equivalent to
certain XMP records. This method copies those entries into the XMP metadata.
pdfa_status
Returns the PDF/A conformance level claimed by this PDF, or an empty string
A PDF may claim to be PDF/A compliant without this being true. Use an independent verifier such as
veraPDF to test if a PDF is truly conformant.
Returns The conformance level of the PDF/A, or an empty string if the PDF does not claim
PDF/A conformance. Possible valid values are: 1A, 1B, 2A, 2B, 2U, 3A, 3B, 3U.
Return type str
pdfx_status
Returns the PDF/X conformance level claimed by this PDF, or an empty string
A PDF may claim to be PDF/X compliant without this being true. Use an independent verifier such as
veraPDF to test if a PDF is truly conformant.
Returns The conformance level of the PDF/X, or an empty string if the PDF does not claim
PDF/X conformance.
Return type str
class pikepdf.models.Encryption(*, owner, user, R=6, allow=Permissions(accessibility=True,
extract=True, modify_annotation=True, modify_assembly=False,
modify_form=True, modify_other=True, print_highres=True,
print_lowres=True), aes=True, metadata=True)
Specify the encryption settings to apply when a PDF is saved.
Parameters
• owner (str) – The owner password to use. This allows full control of the file. If blank,
the PDF will be encrypted and present as “(SECURED)” in PDF viewers. If the owner
password is blank, the user password should be as well.
• user (str) – The user password to use. With this password, some restrictions will be
imposed by a typical PDF reader. If blank, the PDF can be opened by anyone, but only
modified as allowed by the permissions in allow.
• R (int) – Select the security handler algorithm to use. Choose from: 2, 3, 4 or 6. By
default, the highest version is selected (6). 5 is a deprecated algorithm that should not be
used.
• allow (pikepdf.Permissions) – The permissions to set. If omitted, all permissions
are granted to the user.
• aes (bool) – If True, request the AES algorithm. If False, use RC4. If omitted, AES is
selected whenever possible (R >= 4).
• metadata (bool) – If True, also encrypt the PDF metadata. If False, metadata is not
encrypted. Reading document metadata without decryption may be desirable in some cases.
Requires aes=True. If omitted, metadata is encrypted whenever possible.
bits
The number of encryption bits.
encryption_key
The RC4 or AES encryption key used for this file.
file_method
Encryption method used to encode the whole file.
stream_method
Encryption method used to encode streams.
string_method
Encryption method used to encode strings.
user_password
If possible, return the user password.
The user password can only be retrieved when a PDF is opened with the owner password and when older
versions of the encryption algorithm are used.
The password is always returned as bytes even if it has a clear Unicode representation.
In PDF, drawing operations are all performed in content streams that describe the positioning and drawing order of all
graphics (including text, images and vector drawing).
pikepdf (and libqpdf) provide two tools for interpreting content streams: a parser and a filter. The parser returns higher
level information, conveniently grouping all commands with their operands. The parser is useful when one wants to
retrieve information from a content stream, such as determining the position of an element. The parser should not be
used to edit or reconstruct the content stream because some subtleties are lost in parsing.
The token filter works at a lower level, considering each token including comments, and distinguishing different types
of spaces. This allows modifying content streams. A TokenFilter must be subclassed; the specialized version describes
how it should transform the stream of tokens.
pikepdf.parse_content_stream(page_or_stream, operators='')
Parse a PDF content stream into a sequence of instructions.
A PDF content stream is a list of instructions that describe where to render the text and graphics in a PDF. This is
the starting point for analyzing PDFs.
If the input is a page and page.Contents is an array, then the content stream is automatically treated as one
coalesced stream.
Each instruction contains at least one operator and zero or more operands.
Parameters
• page_or_stream (pikepdf.Object) – A page object, or the content stream attached
to another object such as a Form XObject.
• operators (str) – A space-separated string of operators to whitelist. For example ‘q
Q cm Do’ will return only operators that pertain to drawing images. Use ‘BI ID EI’ for
inline images. All other operators and associated tokens are ignored. If blank, all tokens are
accepted.
Returns A list of (operands, operator) tuples.
Example
class pikepdf.Token
raw_value
The binary representation of a token.
Return type: bytes
type_
Returns the type of token.
Return type: pikepdf.TokenType
value
Interprets the token as a string.
Return type: str or bytes
class pikepdf.TokenType
When filtering content streams, each token is labeled according to the role it plays.
Standard tokens
array_open
array_close
brace_open
brace_close
dict_open
dict_close
These tokens mark the start and end of an array, text string, and dictionary, respectively.
integer
real
null
bool
The token data represents an integer, real number, null or boolean, respectively.
Name
The token is the name of an object. In practice, these are among the most interesting tokens.
inline_image
An inline image in the content stream. The whole inline image is represented by the single token.
Lexical tokens
comment
Signifies a comment that appears in the content stream.
word
Otherwise uncategorized bytes are returned as word tokens. PDF operators are words.
bad
An invalid token.
space
Whitespace within the content stream.
eof
Denotes the end of the tokens in this content stream.
class pikepdf.TokenFilter
1.3.16 Architecture
pikepdf uses pybind11 to bind the C++ interface of QPDF. pybind11 was selected after evaluating Cython, CFFI and
SWIG as possible binding solutions.
In addition to bindings pikepdf includes support code written in a mix of C++ and Python, mainly to present a clean
Pythonic interface to C++ and implement higher level functionality.
Internals
Internally the package presents a module named pikepdf from which objects can be imported. The C++ extension
module is currently named pikepdf._qpdf. Users of pikepdf should not directly access _qpdf since it is an
internal interface.
In general, modules or objects behind an underscore are private (although they may be returned in some situations).
Thread safety
Because of the global interpreter lock (GIL), it is safe to read pikepdf objects across Python threads. Also because of
the GIL, there may not be much performance gain from doing so.
If one or more threads will be modifying pikepdf objects, you will have to coordinate read and write access with a
threading.Lock.
It is not currently possible to pickle pikepdf objects or marshall them across process boundaries (as would be required
to use pikepdf in multiprocessing). If this were implemented, it would not be much more efficient than saving a
full PDF and sending it to another process. Parallelizing work (for example, by dividing work by PDF pages) can still
be achieved by having each worker process open the same file.
File handles
Because of technical limitations in underlying libraries, pikepdf keeps the source PDF file open when content is
copied from it to another PDF, even when all Python variables pointing to the source are removed. If a PDF is being
assembled from many sources, then all of those sources are held open in memory.
PyPy3 support
pybind11 does not yet support PyPy3, so it’s not possible to use pikepdf in PyPy3 at this time. When pybind11 finalizes
PyPy3 support, pikepdf will be able to work with PyPy3 as well.
Big changes
Please open a new issue to discuss or propose a major change. Not only is it fun to discuss big ideas, but we might save
each other’s time too. Perhaps some of the work you’re contemplating is already half-done in a development branch.
We use PEP8, black for code formatting and isort for import sorting. The settings for these programs are in
pyproject.toml and setup.cfg. Pull requests should follow the style guide. One difference we use from
“black” style is that strings shown to the user are always in double quotes (") and strings for internal uses are in single
quotes (').
In lieu of a C++ autoformatter that is half as good as black, formatting is more lax.
We have no idea whether to put the pointer designator beside the type or the variable. It logically belongs to the type,
but looks better beside the variable, and ugly in between.
As a general rule for code style, PEP8-style naming conventions should be used. That is, variable and method names
are snake_case, class names are CamelCase. Our coding conventions are closer to pybind11’s than QPDF’s. When
a C++ object wraps a Python object, it should follow the Python naming conventions for that type of object,
e.g. auto Decimal = py::module::import("decimal").attr("Decimal") for a reference to the
Python Decimal class.
We don’t like the traditional C++ .cpp/.h separation that results in a lot of repetition. Headers that are included by only
one .cpp can contain a complete class.
Use RAII. Avoid naked pointers. Use the STL, use std::string instead of char *. Use #pragma once as a
header guard; it’s been around for 25 years.
Tests
New features should come with tests that confirm their correctness.
New dependencies
If you are proposing a change that will require a new dependency, we prefer dependencies that are already packaged
by Debian or Red Hat. This makes life much easier for our downstream package maintainers.
Dependencies must also be compatible with the source code license.
pikepdf is always spelled “pikepdf”, and never capitalized even at the beginning of a sentence.
Periodic allusions to fish are required, and the writer shall be energetic and mildly amusing.
Known ports/packagers
pikepdf has been ported to many platforms already. If you are interested in porting to a new platform, check with
Repology to see the status of that platform.
Package maintainers, please ensure that the command line completion scripts in misc/ are installed.
1.3.18 Debugging
pikepdf does a complex job in providing bindings from Python to a C++ library, both of which have different ideas
about how to manage memory. This page documents some methods that may help should it be necessary to debug the
Python C++ extension (pikepdf._qpdf).
Build pikepdf._qpdf against the version of QPDF above, rather than the system version:
When running Python, ensure that you override shared library load locations:
# Linux
env LD_LIBRARY_PATH=$QPDF_SOURCE_TREE/libqpdf/build/.libs python ...
You can also run Python through a debugger (gdb or lldb) in this manner, and you will have access to the source
code for both pikepdf’s C++ and QPDF.
Valgrind
Valgrind may also be helpful - see the Python documentation for information on setting up Python and Valgrind.
1.3.19 Resources
• QPDF Manual
• PDF 1.7 ISO Specification PDF 32000-1:2008
• Adobe Supplement to ISO 32000 BaseVersion 1.7 ExtensionLevel 3, Adobe Acrobat 9.0, June 2008, for AESv3
• Other Adobe extensions to the PDF specification
For information about copyrights and licenses, including those associated with the images in this documentation, see
the file debian/copyright.