grayscale images are not extracted #789

adamgreenhall · 2024-01-27T17:10:22Z

Grayscale images in a PDF are not extracted. I think the problem may be that the images don't define a filter and this code:

pdfcpu/pkg/pdfcpu/extract.go

Lines 386 to 388 in 04634d3

    
    if sd.FilterPipeline == nil {
        return nil, nil
    }

is skipping the image without warning.

Low priority issue for me - but thought that the code above should at least generate a warning if skipping images.

Here's the example.pdf. It was generated by the adobe suite, which may be part of the problem.

$ pdfcpu version
pdfcpu: v0.6.0 dev
commit: 04634d3a (2024-01-25T20:46:43Z)
base  : go1.21.4

$ pdfcpu images list  example.pdf    
pages: all

example.pdf    
2 images available(16.3 MB)
Page Obj# │ Id  │ Type  SoftMask ImgMask │ Width │ Height │ ColorSpace Comp bpc Interp │   Size │ Filters
━━━━━━━━━━┿━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━
   1   15 │ Im2 │ image                  │  2400 │   3554 │ DeviceGray    1   8        │ 8.1 MB │ 
━━━━━━━━━━┿━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━
   2   28 │ Im3 │ image                  │  2400 │   3554 │ DeviceGray    1   8        │ 8.1 MB │ 

$ pdfcpu extract -m image example.pdf images/
extracting images from example.pdf into images
optimizing...

$ ls -l images
total 0

Interestingly, when I open the PDF in macOS Preview, edit it (e.g. delete a page) and then save it again, this seems to add filter metadata (and change the color space 🤷 ), which allows the images to be extracted.

pdfcpu images list  example-edited.pdf
Page Obj# │ Id  │ Type  SoftMask ImgMask │ Width │ Height │ ColorSpace Comp bpc Interp │   Size │ Filters
━━━━━━━━━━┿━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━
   1    9 │ Im1 │ image                  │  2400 │   3554 │   ICCBased    1   8    *   │ 667 KB │ FlateDecode
━━━━━━━━━━┿━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━
   2   20 │ Im2 │ image                  │  2400 │   3554 │   ICCBased    1   8    *   │ 1.5 MB │ FlateDecode


adamgreenhall · 2024-01-27T18:45:09Z

seeing a second example of exiting without processing the image or warning here:

pdfcpu/pkg/pdfcpu/image.go

Lines 239 to 241 in 98cb73b

    
    if img.Reader == nil {
        return nil
    }

ran into this with a different PDF, with a DeviceN colorspace image:

pdfcpu images list  example2-DeviceN.pdf

1 images available(3.4 MB)
Page Obj# │ Id  │ Type  SoftMask ImgMask │ Width │ Height │ ColorSpace Comp bpc Interp │   Size │ Filters
━━━━━━━━━━┿━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━
   1   13 │ Im0 │ image                  │  2400 │   3554 │    DeviceN    6   8        │ 3.4 MB │ FlateDecode

hhrutter · 2024-01-27T18:45:46Z

Yeah, that does not surprise me at all, Apple magic..
Thanks, I'll take a look.

hhrutter · 2024-01-27T19:06:19Z

Could you provide a sample for the DeviceN colorspace img issue? 🙏🏻
The intention behind ignoring these was to avoid interrupting an ongoing image extraction
and to postpone the implementation until samples were available to test a particular decoding.

adamgreenhall · 2024-01-27T19:09:39Z

Yes - here's the DeviceN file
example2-DeviceN.pdf

adamgreenhall · 2024-01-27T19:19:44Z

To explain a bit more about the color space of this image - it's "CMYK" + two spot colors, so 6 channels in all (im.comp=6). Each channel contains a grayscale image.
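
To make the "one grayscale image per channel" idea concrete, here is a rough sketch that splits 8-bit samples into one plane per component. It assumes the decoded samples are pixel-interleaved (all 6 components of pixel 0, then all 6 of pixel 1, and so on); `splitChannels` is a hypothetical helper, not pdfcpu code:

```go
package main

import "fmt"

// splitChannels splits pixel-interleaved 8-bit samples into one plane per
// component. The interleaved layout is an assumption of this sketch.
func splitChannels(samples []byte, width, height, comp int) ([][]byte, error) {
	n := width * height
	if comp <= 0 || len(samples) != n*comp {
		return nil, fmt.Errorf("expected %d bytes, got %d", n*comp, len(samples))
	}
	planes := make([][]byte, comp)
	for c := range planes {
		planes[c] = make([]byte, n)
	}
	for i := 0; i < n; i++ {
		for c := 0; c < comp; c++ {
			planes[c][i] = samples[i*comp+c]
		}
	}
	return planes, nil
}

func main() {
	// Two pixels with six components each.
	samples := []byte{1, 2, 3, 4, 5, 6, 11, 12, 13, 14, 15, 16}
	planes, err := splitChannels(samples, 2, 1, 6)
	if err != nil {
		panic(err)
	}
	fmt.Println(planes[0], planes[5])
}
```

Each resulting plane has width × height bytes and can be treated as an ordinary 8-bit gray image.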

The case statement here:

pdfcpu/pkg/pdfcpu/writeImage.go

Lines 775 to 777 in 043541b

    
    func renderDeviceN(xRefTable *model.XRefTable, im *PDFImage, resourceName string, cs types.Array) (io.Reader, string, error) {
        switch im.comp {

probably should have a default with a warning
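
Roughly what I mean, as a standalone sketch. `renderForComponents` is a hypothetical dispatcher and the case values 1, 3 and 4 are only illustrative; I have not checked which component counts the real `renderDeviceN` handles:

```go
package main

import (
	"fmt"
	"os"
)

// renderForComponents is a hypothetical dispatcher, not pdfcpu's renderDeviceN.
// It returns the name of the renderer it would pick, or an error for
// component counts it does not handle.
func renderForComponents(comp int) (string, error) {
	switch comp {
	case 1:
		return "gray", nil
	case 3:
		return "rgb", nil
	case 4:
		return "cmyk", nil
	default:
		// Instead of falling through silently, surface the unsupported case.
		return "", fmt.Errorf("DeviceN image with %d components is not supported, skipping", comp)
	}
}

func main() {
	for _, comp := range []int{4, 6} {
		kind, err := renderForComponents(comp)
		if err != nil {
			fmt.Fprintln(os.Stderr, "warning:", err)
			continue
		}
		fmt.Println("rendering as", kind)
	}
}
```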

I suspect I will need to make a custom renderDevice for this type of thing. Does that sound right?

hhrutter · 2024-01-27T19:22:59Z

I think so..
I am busy in another corner; meanwhile, if you want to take a stab at it, go for it.

hhrutter · 2024-01-31T08:40:10Z

Your first example contains uncompressed images, the latest commit is a fix for this.
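
For context, an uncompressed 8-bit DeviceGray image is just width × height raw bytes (2400 × 3554 = 8,529,600 bytes, which matches the 8.1 MB in the listing above). The following is a minimal sketch of turning such samples into a PNG with the Go standard library; it is an illustration under that assumption, not the code from the commit, and `grayToPNG` is a hypothetical helper:

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/png"
)

// grayToPNG wraps raw 8-bit gray samples (one byte per pixel, no filter)
// in an image.Gray and encodes it as PNG.
func grayToPNG(samples []byte, width, height int) ([]byte, error) {
	if len(samples) != width*height {
		return nil, fmt.Errorf("expected %d bytes, got %d", width*height, len(samples))
	}
	img := &image.Gray{
		Pix:    samples,
		Stride: width,
		Rect:   image.Rect(0, 0, width, height),
	}
	var buf bytes.Buffer
	if err := png.Encode(&buf, img); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	out, err := grayToPNG([]byte{0, 64, 128, 255}, 2, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println("PNG bytes:", len(out))
}
```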

The second example is tricky, since it involves some postscript processing in order to map
the 6 color components to the alternative CMYK colorspace.

The latest commit contains an incomplete fix, in the sense that it at least renders a gray image for your example 2.
So processing of DeviceN colorspaces with more than 4 components remains open.

At some point I need to return to this; right now I am tied up with other issues.

adamgreenhall · 2024-01-31T14:53:15Z

Thanks for handling the uncompressed case.

I can tackle the DeviceN colorspaces with more than 4 components in a new PR - I already have some code for this. The tricky part is that there are multiple output files per PDFImage, which will probably need a type change for RenderImage() to return []io.Reader, along with all of its sub-functions.

hhrutter · 2024-01-31T17:05:44Z

I will help out with the overall design of this once you have the rendering part working somehow.
I believe this is going to be tricky though, because what we'd actually need is a PostScript interpreter for PostScript functions (type 4), or did I miss anything?

adamgreenhall · 2024-02-02T15:06:00Z

this image parsing code:

https://github.com/adamgreenhall/pdfcpu/blob/1f162698f29345bf8f886b29ad6cb28b001b6cbd/pkg/pdfcpu/writeImage.go#L414-L440

is properly extracting the 6 grayscale images in the channels of the example2-DeviceN.pdf file.

But clearly the organization of where to write the files needs to change. Ideas on how to do that? My initial thought was that RenderImage() returns []io.Reader, along with all of its sub-functions, but that's going to be a lot of changes, all for this one unusual case.
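
One possible shape for the multi-output idea, sketched standalone. The `namedReader` type and `renderChannels` function are hypothetical and only show how a slice of readers could carry a per-channel file name; this is not a proposal for the final pdfcpu API:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// namedReader pairs rendered image data with a suggested file name.
// Hypothetical type; not part of pdfcpu.
type namedReader struct {
	Name string
	io.Reader
}

// renderChannels returns one reader per colorant instead of a single reader.
func renderChannels(base string, colorants []string, planes [][]byte) ([]namedReader, error) {
	if len(colorants) != len(planes) {
		return nil, fmt.Errorf("%d colorants but %d planes", len(colorants), len(planes))
	}
	out := make([]namedReader, len(planes))
	for i, p := range planes {
		// Spot names such as "light teal" contain spaces; keep file names simple.
		suffix := strings.ReplaceAll(colorants[i], " ", "_")
		out[i] = namedReader{
			Name:   fmt.Sprintf("%s_%s", base, suffix),
			Reader: bytes.NewReader(p),
		}
	}
	return out, nil
}

func main() {
	rs, err := renderChannels("Im0", []string{"Cyan", "light teal"}, [][]byte{{1}, {2}})
	if err != nil {
		panic(err)
	}
	for _, r := range rs {
		data, _ := io.ReadAll(r)
		fmt.Println(r.Name, len(data))
	}
}
```

Callers that only ever get one image would see a slice of length 1, so the common case stays simple while the multi-channel case has a place to put its extra files.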

hhrutter · 2024-02-02T16:25:38Z

Awesome!
Let me take a look.

hhrutter · 2024-02-06T01:24:21Z

How did you figure out the necessary decoding for this?
Did you take all of the following into account?
Looks like your solution is hardcoded sort of..

 11:   offset=    5708 generation=0 types.Array
[DeviceN [Cyan Magenta Yellow Black coral light teal] DeviceCMYK (24 0 R) (25 0 R)]

 15:   offset= 3535694 generation=0 types.Array
[Separation coral DeviceRGB
	<<
		<C0, [1.00 1.00 1.00]>
		<C1, [1.00 0.56 0.57]>
		<Domain, [0 1]>
		<FunctionType, 2>
		<N, 1.00>
		<Range, [0.00 1.00 0.00 1.00 0.00 1.00]>
	>>
	]

19:   offset= 3535850 generation=0 types.Array
[Separation light teal DeviceRGB
	<<
		<C0, [1.00 1.00 1.00]>
		<C1, [0.00 0.62 0.65]>
		<Domain, [0 1]>
		<FunctionType, 2>
		<N, 1.00>
		<Range, [0.00 1.00 0.00 1.00 0.00 1.00]>
	>>
	]

24:   offset= 3537685 generation=0 types.StreamDict
<<
	<Domain, [0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00]>
	<Filter, FlateDecode>
	<FunctionType, 4>
	<Length, 185>
	<Range, [0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00]>
>>

25:   offset= 3538049 generation=0 types.Dict subType=NChannel
<<
	<Colorants, (26 0 R)>
	<Process, (27 0 R)>
	<Subtype, NChannel>
>>
   26:   offset= 3538119 generation=0 types.Dict
<<
	<coral, (15 0 R)>
	<light teal, (19 0 R)>
>>
   27:   offset= 3538173 generation=0 types.Dict
<<
	<ColorSpace, DeviceCMYK>
	<Components, [Cyan Magenta Yellow Black]>
>>

Your code is working on the assumption that any DeviceN color space using more than 4 components is a
CMYK plus Spot To MultiGray image.

I am unsure if we can commit to this - can we?
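
For reference, the two Separation tint transforms in the dump above (FunctionType 2) are exponential interpolation functions, which the PDF spec defines as y_j = C0_j + x^N × (C1_j − C0_j). They are simple to evaluate; the FunctionType 4 stream in object 24 is the part that needs a PostScript calculator interpreter. A small sketch of the Type 2 case, where `expInterp` is a hypothetical helper and not existing pdfcpu code:

```go
package main

import (
	"fmt"
	"math"
)

// expInterp evaluates a PDF Type 2 (exponential interpolation) function:
// y[j] = C0[j] + x^N * (C1[j] - C0[j]).
// x is clipped to the Domain [0, 1] used in the dump; c0 and c1 are assumed
// to have the same length.
func expInterp(c0, c1 []float64, n, x float64) []float64 {
	x = math.Max(0, math.Min(1, x))
	t := math.Pow(x, n)
	y := make([]float64, len(c0))
	for j := range c0 {
		y[j] = c0[j] + t*(c1[j]-c0[j])
	}
	return y
}

func main() {
	// Values for the "coral" Separation from object 15, at full tint.
	coral := expInterp([]float64{1, 1, 1}, []float64{1.00, 0.56, 0.57}, 1, 1)
	fmt.Println(coral)
}
```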

adamgreenhall · 2024-02-08T22:39:22Z

Agree that this can't be committed as written. To get it to a place where we could merge, I think we'd want:

- A way to detect this CMYK+Spot type of image (rather than assuming that any DeviceN image with >4 channels is automatically it). I'm not sure how to do this. Possibly the DeviceN colorspace plus the other Separation info in the PDF (matching the extra color channel names) could be a way to decide?
- A way to name/write multiple files per PDFImage that makes sense. I have an idea on this, but it's a little messy.

Do ^ those two make sense?
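
A rough sketch of the detection idea, assuming the DeviceN colorant names and the keys of the Colorants dictionary have already been read from the PDF. `isCMYKPlusSpots` is a hypothetical helper and the check is only a heuristic:

```go
package main

import "fmt"

// isCMYKPlusSpots reports whether the DeviceN colorant names are the four
// process colors followed by spot colors that all have an entry in the
// Colorants dictionary. Heuristic only; not pdfcpu code.
func isCMYKPlusSpots(names []string, colorants map[string]bool) bool {
	process := []string{"Cyan", "Magenta", "Yellow", "Black"}
	if len(names) <= len(process) {
		return false
	}
	for i, p := range process {
		if names[i] != p {
			return false
		}
	}
	for _, spot := range names[len(process):] {
		if !colorants[spot] {
			return false
		}
	}
	return true
}

func main() {
	// Names and Colorants keys as they appear in example2-DeviceN.pdf.
	names := []string{"Cyan", "Magenta", "Yellow", "Black", "coral", "light teal"}
	colorants := map[string]bool{"coral": true, "light teal": true}
	fmt.Println(isCMYKPlusSpots(names, colorants))
}
```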

As for the encoding, I knew what the gray images should look like, and I tried <x,y,c> ordering options until the outputs looked right. I don't know that there is a spec for these InDesign generated PDFs.

adamgreenhall added the investigate label Jan 27, 2024

adamgreenhall assigned hhrutter Jan 27, 2024

hhrutter closed this as completed in 96659b7 Jan 31, 2024

hhrutter reopened this Jan 31, 2024
