1

Converting an image to PDF and then back to an image format:

$ convert in.jpeg out.pdf
$ convert out.pdf out.jpg
$ diff in.jpeg out.jpg
> Binary files in.jpeg and out.jpg differ

Trying a different utility:

$ gm convert in.jpeg out.pdf
$ pdfimages -j out.pdf orig
$ diff in.jpeg orig-000.jpg
> Binary files in.jpeg and orig-000.jpg differ

Are these tools doing some compression behind the scenes, or is this how PDF works, i.e. is it always lossy?

What about image metadata? Is it possible to preserve it inside a PDF?

1
  • Could you also indicate the sizes of the in/out files for both utilities?
    – harrymc
    Commented Apr 26, 2023 at 9:19

2 Answers

5

PDF is a container format, not an image format, so it is possible to embed an image in a PDF and later extract it without recompression. However, ImageMagick's convert recompresses the image when creating a PDF, and when converting from a PDF it does not extract the embedded image: it renders the page (via its Ghostscript delegate) and encodes that rendering anew. The image is therefore recompressed twice.

You can avoid this by using alternative tools. For example, img2pdf embeds the image into the PDF document as-is, and pdfimages from poppler (or poppler-utils — package name differs by OS and package manager) can extract embedded images.

$ img2pdf -o out.pdf in.jpeg
$ pdfimages -all out.pdf out
$ diff in.jpeg out-000.jpg
$
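
The same byte-for-byte check can be scripted. This is a minimal sketch using Python's standard library; the helper name same_bytes and the file names are only illustrative and are assumed to come from the commands above.

```python
import filecmp


def same_bytes(path_a, path_b):
    """Return True only if both files have identical contents."""
    # shallow=False forces a content comparison instead of trusting
    # matching os.stat() signatures (size and modification time).
    return filecmp.cmp(path_a, path_b, shallow=False)


# Hypothetical usage, after running img2pdf and pdfimages as above:
# print(same_bytes("in.jpeg", "out-000.jpg"))
```
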
2

The short answers are at the end.

Background

For a simple "standard" JPEG, the image data passes into and out of a PDF untouched: the extracted output is a binary copy of the input. HOWEVER, that only holds when the PDF writer embeds the file as-is and the extractor does not alter the image data (e.g. its resolution or compression) during extraction.

The file below is a PDF wrapping an image from Wikipedia, built from the Windows command line (CMD). Its first 16 lines are:

%PDF-1.7
%ANSI

1 0 obj <</Type/Catalog/Pages 2 0 R>> endobj
2 0 obj <</Type/Pages/Count 1/Kids [ 3 0 R ]>> endobj
3 0 obj <</Type/Page/MediaBox [ 0 0 841.5 594.75 ]/Rotate 0/Resources 4 0 R/Contents 5 0 R/Parent 2 0 R>> endobj
4 0 obj <</XObject <</Img1 6 0 R>>>> endobj
5 0 obj <</Length 61>>
stream
500.000 000.000 000.000 477.000 170.750 053.875 cm /Img1  Do
endstream
endobj
6 0 obj <</Type/XObject/Subtype/Image/ColorSpace/DeviceRGB/BitsPerComponent 8/Filter/DCTDecode
/Width  500/Height 477/Length  36287 >>stream
ÿØÿà JFIF  H H  ÿþ [Photo by David Crawshaw, 2002-01-28
Image composition by David Crawshaw, 2004-09-08, GFDLÿÛ „ 

Note that the "density" may change (the image can be scaled up, scaled down or distorted on the page), but the pixel dimensions of 500x477 are maintained; there is no such thing as DPI in a PDF. Critically, the image stream that was inserted, and thus the one that is extracted, has /Length 36287 and is a standard JPEG using /DCTDecode (standard, not exotic, compression). The trailer is:

endstream
endobj
xref
0 7
0000000000 65535 f 
0000000016 00000 n 
0000000061 00000 n 
0000000115 00000 n 
0000000228 00000 n 
0000000272 00000 n 
0000000380 00000 n 

trailer
<</Size 7/Info <</Producer (Cmd2PDF)>>/Root 1 0 R>>
startxref
36826
%%EOF
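
To make the "container" point concrete, here is a rough Python sketch that writes a PDF with the same six-object layout as the sample and then pulls the JPEG bytes back out by honouring /Length. The function names wrap_jpeg_in_pdf and extract_dct_stream are made up for this illustration, and the extractor assumes a direct integer /Length that follows /Filter/DCTDecode, as in the sample; it is not a general PDF parser.

```python
import re


def wrap_jpeg_in_pdf(jpeg, width, height):
    """Build a one-page PDF whose image stream is the JPEG bytes, unmodified."""
    content = b"%d 0 0 %d 0 0 cm /Img1 Do" % (width, height)
    objects = [
        b"<</Type/Catalog/Pages 2 0 R>>",
        b"<</Type/Pages/Count 1/Kids [ 3 0 R ]>>",
        b"<</Type/Page/MediaBox [ 0 0 %d %d ]/Resources 4 0 R"
        b"/Contents 5 0 R/Parent 2 0 R>>" % (width, height),
        b"<</XObject <</Img1 6 0 R>>>>",
        b"<</Length %d>>\nstream\n" % len(content) + content + b"\nendstream",
        b"<</Type/XObject/Subtype/Image/ColorSpace/DeviceRGB/BitsPerComponent 8"
        b"/Filter/DCTDecode/Width %d/Height %d/Length %d>>\nstream\n"
        % (width, height, len(jpeg)) + jpeg + b"\nendstream",
    ]
    out = bytearray(b"%PDF-1.7\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset recorded in the xref table
        out += b"%d 0 obj " % number + body + b"\nendobj\n"
    xref_at = len(out)
    out += b"xref\n0 %d\n0000000000 65535 f \n" % (len(objects) + 1)
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<</Size %d/Root 1 0 R>>\nstartxref\n%d\n%%%%EOF\n" % (
        len(objects) + 1, xref_at)
    return bytes(out)


def extract_dct_stream(pdf):
    """Return the raw bytes of the first /DCTDecode stream, sized by /Length."""
    match = re.search(
        rb"/Filter\s*/DCTDecode[^>]*?/Length\s+(\d+)\s*>>\s*stream\r?\n", pdf)
    if match is None:
        raise ValueError("no DCTDecode stream with a direct /Length found")
    start = match.end()
    return pdf[start:start + int(match.group(1))]
```

Because the JPEG bytes are copied into the stream and sliced back out by length, nothing is decoded or re-encoded at any point, which is why the round trip can be binary identical.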

Thus the 36,287-byte image is wrapped in a PDF of 37,058 bytes in total (the xref table starts at byte offset 36826 and the trailer follows it). That is a PDF overhead of 771 bytes, about as lean as this PDF can be. Any further size reduction will come at the cost of reduced quality.

For TIFF and other image types, the header metadata is usually stripped, so on extraction the core data will be very similar but the file cannot be 100% identical. The same issue affects most other image types whose structure needs to be altered, for example to store a transparent "alpha" layer (PNG).

So in general a simple 24-bit RGB .jpg CAN be binary identical on input and output, in the same way as an MP4 video stream; most other content will usually be re-encoded in a different style.

Many users are surprised to find that asking a tool to "compress" a PDF full of JPEGs makes no difference unless the images are degraded, since they were already optimally compressed in the PDF.
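
The arithmetic behind that overhead figure, written out as a small Python sketch. The three numbers are taken from the sample PDF above; the variable names are just for illustration.

```python
image_length = 36287  # /Length of the DCTDecode stream
pdf_size = 37058      # total size of the PDF file in bytes
xref_offset = 36826   # value after "startxref"

overhead = pdf_size - image_length  # every byte that is not JPEG data: 771
tail = pdf_size - xref_offset       # xref table, trailer and %%EOF: 232
head = overhead - tail              # objects and stream keywords before the xref table

print(overhead, tail, head)
print(f"overhead is {overhead / pdf_size:.1%} of the file")
```
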

Answers

    1. GM convert is adjusting the image. Many other PDF writers will not; they simply inject the file as-is, scaled up or down on the page, along with its internal metadata. For image data, PDF is by nature "lossless" in the sense that it carries the same (possibly already lossy) content. By analogy, subset fonts are lossy by implication, while fully embedded fonts are lossless.
    2. If the image is inserted and extracted untouched, the compressed data and the visible metadata (as shown in the sample above) are retained.
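
The "Photo by David Crawshaw" text visible in the sample stream sits in a JPEG comment (COM, marker 0xFFFE) segment. As a rough sketch, the Python below lists such comments from a raw JPEG byte string, so the same check can be run on the file before embedding and after extraction. The function name jpeg_comments is hypothetical, and the walker assumes well-formed segments with no fill bytes between them.

```python
import struct


def jpeg_comments(data):
    """Collect payloads of COM (0xFFFE) segments that precede the scan data."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    comments = []
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            break  # lost sync with the segment structure
        marker = data[pos + 1]
        if marker in (0xDA, 0xD9):  # SOS (scan data follows) or EOI
            break
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker == 0xFE:
            comments.append(data[pos + 4:pos + 2 + length])
        pos += 2 + length
    return comments
```

If the lists returned for the original file and for the extracted file match, the comment metadata survived the trip through the PDF.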
