grayscale images are not extracted #789

Open
adamgreenhall opened this issue Jan 27, 2024 · 13 comments
@adamgreenhall
Contributor

adamgreenhall commented Jan 27, 2024

Grayscale images in a PDF are not extracted. I think the problem may be that the images don't define a filter, and this code:

if sd.FilterPipeline == nil {
	return nil, nil
}

is skipping the image without warning.

Low priority issue for me - but thought that the code above should at least generate a warning if skipping images.
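
A minimal sketch of what such a warning could look like. The `streamDict` type and the `extract` function here are hypothetical stand-ins written for illustration, not pdfcpu's actual types or API; the only point is that the nil-filter branch reports why it skipped instead of returning silently.

```go
package main

import (
	"fmt"
	"log"
)

// streamDict is a hypothetical stand-in for a PDF image stream;
// it models only the fields this example needs.
type streamDict struct {
	ObjNr          int
	FilterPipeline []string // nil when the stream defines no /Filter
	Raw            []byte
}

// extract returns the image bytes, or a non-empty warning explaining
// why the image was skipped.
func extract(sd streamDict) ([]byte, string) {
	if sd.FilterPipeline == nil {
		return nil, fmt.Sprintf("skipping image obj#%d: no filter defined", sd.ObjNr)
	}
	return sd.Raw, ""
}

func main() {
	streams := []streamDict{
		{ObjNr: 15},
		{ObjNr: 9, FilterPipeline: []string{"FlateDecode"}, Raw: []byte{1, 2, 3}},
	}
	for _, sd := range streams {
		data, warn := extract(sd)
		if warn != "" {
			log.Println("warning:", warn)
			continue
		}
		fmt.Printf("obj#%d: %d bytes\n", sd.ObjNr, len(data))
	}
}
```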

Here's the example.pdf. It was generated by the adobe suite, which may be part of the problem.

$ pdfcpu version
pdfcpu: v0.6.0 dev
commit: 04634d3a (2024-01-25T20:46:43Z)
base  : go1.21.4

$ pdfcpu images list  example.pdf    
pages: all

example.pdf    
2 images available(16.3 MB)
Page Obj# │ Id  │ Type  SoftMask ImgMask │ Width │ Height │ ColorSpace Comp bpc Interp │   Size │ Filters
━━━━━━━━━━┿━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━
   1   15 │ Im2 │ image                  │  2400 │   3554 │ DeviceGray    1   8        │ 8.1 MB │ 
━━━━━━━━━━┿━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━
   2   28 │ Im3 │ image                  │  2400 │   3554 │ DeviceGray    1   8        │ 8.1 MB │ 

$ pdfcpu extract -m image example.pdf images/
extracting images from example.pdf into images
optimizing...

$ ls -l images
total 0

Interestingly, when I open the PDF in macOS Preview, edit it (e.g. delete a page), and then save it again, this seems to add filter metadata (and change the color space 🤷), which allows the images to be extracted.

pdfcpu images list  example-edited.pdf
Page Obj# │ Id  │ Type  SoftMask ImgMask │ Width │ Height │ ColorSpace Comp bpc Interp │   Size │ Filters
━━━━━━━━━━┿━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━
   1    9 │ Im1 │ image                  │  2400 │   3554 │   ICCBased    1   8    *   │ 667 KB │ FlateDecode
━━━━━━━━━━┿━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━
   2   20 │ Im2 │ image                  │  2400 │   3554 │   ICCBased    1   8    *   │ 1.5 MB │ FlateDecode
@adamgreenhall
Contributor Author

adamgreenhall commented Jan 27, 2024

seeing a second example of exiting without processing the image or warning here:

pdfcpu/pkg/pdfcpu/image.go

Lines 239 to 241 in 98cb73b

if img.Reader == nil {
	return nil
}

ran into this with a different PDF, with a DeviceN colorspace image:

pdfcpu images list  example2-DeviceN.pdf

1 images available(3.4 MB)
Page Obj# │ Id  │ Type  SoftMask ImgMask │ Width │ Height │ ColorSpace Comp bpc Interp │   Size │ Filters
━━━━━━━━━━┿━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━
   1   13 │ Im0 │ image                  │  2400 │   3554 │    DeviceN    6   8        │ 3.4 MB │ FlateDecode

@hhrutter
Collaborator

Yeah, that does not surprise me at all, Apple magic..
Thanks, I'll take a look.

@hhrutter
Collaborator

Could you provide a sample for the DeviceN colorspace img issue? 🙏🏻
The intention behind ignoring these was not to interrupt any ongoing image extraction
and to postpone the implementation until samples were available in order to also test a particular decoding.

@adamgreenhall
Contributor Author

Yes - here's the DeviceN file
example2-DeviceN.pdf

@adamgreenhall
Contributor Author

adamgreenhall commented Jan 27, 2024

To explain a bit more about the color space of this image - it's "CMYK" + two spot colors, so 6 channels in all (im.comp=6). Each channel contains a grayscale image.

The case statement here:

func renderDeviceN(xRefTable *model.XRefTable, im *PDFImage, resourceName string, cs types.Array) (io.Reader, string, error) {
	switch im.comp {

probably should have a default with a warning

I suspect I will need to make a custom renderDevice for this type of thing. Does that sound right?
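
To illustrate the suggestion, here is a small sketch of a switch with a `default` branch that surfaces the unsupported case. `renderKind` and the specific cases it handles are assumptions made up for this example; they do not reflect which component counts pdfcpu's `renderDeviceN` actually supports.

```go
package main

import "fmt"

// renderKind is a hypothetical dispatcher on the number of color
// components. The cases are illustrative only.
func renderKind(comp int) (string, error) {
	switch comp {
	case 1:
		return "gray", nil
	case 3:
		return "rgb", nil
	case 4:
		return "cmyk", nil
	default:
		// Report the unsupported case instead of falling through silently.
		return "", fmt.Errorf("DeviceN: unsupported component count %d", comp)
	}
}

func main() {
	for _, comp := range []int{1, 4, 6} {
		kind, err := renderKind(comp)
		if err != nil {
			fmt.Println("warning:", err)
			continue
		}
		fmt.Println(comp, "->", kind)
	}
}
```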

@hhrutter
Collaborator

I think so..
I am busy in another corner; meanwhile, if you want to take a stab, go for it.

@hhrutter hhrutter reopened this Jan 31, 2024
@hhrutter
Collaborator

Your first example contains uncompressed images; the latest commit is a fix for this.

The second example is tricky, since it involves some PostScript processing in order to map
the 6 color components to the alternate CMYK colorspace.

The latest commit contains an incomplete fix, in the sense that it at least renders a gray image for your example 2.
So processing of DeviceN colorspaces with more than 4 components remains open.

At some point I need to return to this; right now I am tied up with other issues.

@adamgreenhall
Contributor Author

Thanks for handling the uncompressed case.

I can tackle the DeviceN colorspaces with more than 4 components in a new PR - I already have some code for this. The tricky part there is going to be that there are multiple output files per PDFImage, which will probably need a type change for RenderImage() to return []io.Reader, along with all of its sub-functions.

@hhrutter
Collaborator

I will help out with the overall design of this once you have the rendering part working somehow.
I believe this is going to be tricky though, because what we'd actually need is a PostScript interpreter for PostScript calculator functions (type 4), or did I miss anything?

@adamgreenhall
Contributor Author

this image parsing code:

https://github.com/adamgreenhall/pdfcpu/blob/1f162698f29345bf8f886b29ad6cb28b001b6cbd/pkg/pdfcpu/writeImage.go#L414-L440

is properly extracting the 6 grayscale images in the channels of the example2-DeviceN.pdf file.

But clearly the organization of where to write the files needs to change. Ideas on how to do that? My initial thought was that RenderImage() returns []io.Reader, along with all of its sub-functions, but that's going to be a lot of changes, all for this one unusual case.
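
One possible shape for this, sketched under assumptions: each rendered output carries an optional name suffix, so the common single-image case keeps its existing file name and only multi-channel images fan out. `renderedImage` and `outputNames` are hypothetical names invented for this sketch and are not part of pdfcpu.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// renderedImage is a hypothetical result type: one output stream plus
// an optional suffix (e.g. a channel name) used to build the file name.
type renderedImage struct {
	Suffix string // empty for the usual single-output case
	Reader io.Reader
}

// outputNames derives one file name per rendered output.
func outputNames(base, ext string, imgs []renderedImage) []string {
	names := make([]string, 0, len(imgs))
	for _, img := range imgs {
		if img.Suffix == "" {
			names = append(names, base+"."+ext)
			continue
		}
		suffix := strings.ReplaceAll(img.Suffix, " ", "_")
		names = append(names, fmt.Sprintf("%s_%s.%s", base, suffix, ext))
	}
	return names
}

func main() {
	channels := []string{"Cyan", "Magenta", "Yellow", "Black", "coral", "light teal"}
	imgs := make([]renderedImage, 0, len(channels))
	for _, ch := range channels {
		// Empty readers stand in for the per-channel image data.
		imgs = append(imgs, renderedImage{Suffix: ch, Reader: bytes.NewReader(nil)})
	}
	for _, name := range outputNames("Im0", "png", imgs) {
		fmt.Println(name)
	}
}
```

Callers that only ever produce one output would return a one-element slice with an empty suffix, which keeps their file naming unchanged.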

@hhrutter
Collaborator

hhrutter commented Feb 2, 2024

Awesome!
Let me take a look.

@hhrutter
Collaborator

hhrutter commented Feb 6, 2024

How did you figure out the necessary decoding for this?
Did you take all of the following into account?
Looks like your solution is hardcoded sort of..

 11:   offset=    5708 generation=0 types.Array
[DeviceN [Cyan Magenta Yellow Black coral light teal] DeviceCMYK (24 0 R) (25 0 R)]

 15:   offset= 3535694 generation=0 types.Array
[Separation coral DeviceRGB
	<<
		<C0, [1.00 1.00 1.00]>
		<C1, [1.00 0.56 0.57]>
		<Domain, [0 1]>
		<FunctionType, 2>
		<N, 1.00>
		<Range, [0.00 1.00 0.00 1.00 0.00 1.00]>
	>>
	]

19:   offset= 3535850 generation=0 types.Array
[Separation light teal DeviceRGB
	<<
		<C0, [1.00 1.00 1.00]>
		<C1, [0.00 0.62 0.65]>
		<Domain, [0 1]>
		<FunctionType, 2>
		<N, 1.00>
		<Range, [0.00 1.00 0.00 1.00 0.00 1.00]>
	>>
	]

24:   offset= 3537685 generation=0 types.StreamDict
<<
	<Domain, [0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00]>
	<Filter, FlateDecode>
	<FunctionType, 4>
	<Length, 185>
	<Range, [0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00]>
>>

25:   offset= 3538049 generation=0 types.Dict subType=NChannel
<<
	<Colorants, (26 0 R)>
	<Process, (27 0 R)>
	<Subtype, NChannel>
>>
   26:   offset= 3538119 generation=0 types.Dict
<<
	<coral, (15 0 R)>
	<light teal, (19 0 R)>
>>
   27:   offset= 3538173 generation=0 types.Dict
<<
	<ColorSpace, DeviceCMYK>
	<Components, [Cyan Magenta Yellow Black]>
>>

Your code is working on the assumption that any DeviceN color space using more than 4 components is a
CMYK plus Spot To MultiGray image.

I am unsure if we can commit to this - can we?
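
For context on the dump above: the two Separation entries use type 2 (exponential interpolation) functions, which the PDF specification defines as y_j = C0_j + x^N * (C1_j - C0_j). A minimal sketch of evaluating one follows; the `expFunc` type is a name made up for this example, and the type 4 function in object 24 is a different matter that this does not cover.

```go
package main

import (
	"fmt"
	"math"
)

// expFunc models a PDF type 2 function with Domain [0 1].
// The type name is hypothetical; the fields mirror the dictionary keys.
type expFunc struct {
	C0, C1 []float64
	N      float64
}

// eval maps a tint x in [0,1] to the output color components.
func (f expFunc) eval(x float64) []float64 {
	x = math.Min(1, math.Max(0, x)) // clip to the Domain
	p := math.Pow(x, f.N)
	out := make([]float64, len(f.C0))
	for j := range out {
		out[j] = f.C0[j] + p*(f.C1[j]-f.C0[j])
	}
	return out
}

func main() {
	// Values taken from the "coral" Separation entry in the dump.
	coral := expFunc{
		C0: []float64{1, 1, 1},
		C1: []float64{1, 0.56, 0.57},
		N:  1,
	}
	fmt.Println("tint 0:", coral.eval(0))
	fmt.Println("tint 1:", coral.eval(1))
}
```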

@adamgreenhall
Contributor Author

Agree that this can't be committed as written. To get it to a place where we could merge, I think we'd want:

  1. a way to detect this CMYK+Spot type of image (rather than assuming that CMYK with >4 channels is automatically it). I'm not sure how to do this. Possibly the DeviceN colorspace plus the other Separation info in the PDF (matching the extra color channel names) could be a way to decide?
  2. a way to name/write multiple files per PDFImage that makes sense. I have an idea on this, but it's a little messy.

Do ^ those two make sense?
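
A rough sketch of point 1, assuming the names from the DeviceN array, the /Process /Components list, and the keys of the /Colorants dict have already been read into plain Go values. `isProcessPlusSpot` is a hypothetical helper and the rule it encodes is only one candidate heuristic, not something the spec or pdfcpu prescribes.

```go
package main

import "fmt"

// isProcessPlusSpot reports whether the DeviceN names start with the
// process components (in order) and every remaining name is a declared
// colorant. Hypothetical heuristic for illustration only.
func isProcessPlusSpot(names, process []string, colorants map[string]bool) bool {
	if len(process) == 0 || len(names) <= len(process) {
		return false
	}
	for i, p := range process {
		if names[i] != p {
			return false
		}
	}
	for _, n := range names[len(process):] {
		if !colorants[n] {
			return false
		}
	}
	return true
}

func main() {
	// Values taken from the object dump earlier in the thread.
	names := []string{"Cyan", "Magenta", "Yellow", "Black", "coral", "light teal"}
	process := []string{"Cyan", "Magenta", "Yellow", "Black"}
	colorants := map[string]bool{"coral": true, "light teal": true}

	fmt.Println(isProcessPlusSpot(names, process, colorants))
}
```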

As for the encoding, I knew what the gray images should look like, and I tried <x,y,c> ordering options until the outputs looked right. I don't know that there is a spec for these InDesign generated PDFs.
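
For illustration, this is roughly what splitting into per-channel gray images looks like if the decoded samples are pixel-interleaved at 8 bits per component, row by row. That layout is an assumption for this sketch, and `splitChannels` is a hypothetical helper rather than the code linked above.

```go
package main

import "fmt"

// splitChannels separates pixel-interleaved 8-bit samples into one
// width*height gray plane per component.
// Assumed layout: index = (y*width + x)*comp + c.
func splitChannels(samples []byte, width, height, comp int) ([][]byte, error) {
	if width <= 0 || height <= 0 || comp <= 0 {
		return nil, fmt.Errorf("invalid dimensions %dx%dx%d", width, height, comp)
	}
	if len(samples) != width*height*comp {
		return nil, fmt.Errorf("got %d samples, want %d", len(samples), width*height*comp)
	}
	planes := make([][]byte, comp)
	for c := range planes {
		planes[c] = make([]byte, width*height)
	}
	for i := 0; i < width*height; i++ {
		for c := 0; c < comp; c++ {
			planes[c][i] = samples[i*comp+c]
		}
	}
	return planes, nil
}

func main() {
	// A 2x1 image with 3 components per pixel.
	planes, err := splitChannels([]byte{1, 2, 3, 4, 5, 6}, 2, 1, 3)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for c, p := range planes {
		fmt.Println("channel", c, p)
	}
}
```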
