Generating PDF/A conforming PDFs #630

Closed
sheppie123 opened this issue May 15, 2018 · 31 comments
Labels
feature New feature that should be supported sponsored Issues sponsored to be resolved faster

@sheppie123

Is it possible to generate PDFs that conform to PDF/A using WeasyPrint?
From Wikipedia:

Other key elements to PDF/A compatibility include:

  • Audio and video content are forbidden.
  • JavaScript and executable file launches are forbidden.
  • All fonts must be embedded and also must be legally embeddable for
    unlimited, universal rendering. This also applies to the so-called
    PostScript standard fonts such as Times or Helvetica.
  • Colorspaces specified in a device-independent manner.
  • Encryption is disallowed.
  • Use of standards-based metadata is mandated.

Many Thanks

@liZe liZe added the feature New feature that should be supported label May 15, 2018
@LukasKlement

I opened a ticket on PDF/X-3 compliance: #640

To start the discussion on what direction WeasyPrint should take, it may be worthwhile to collect the purposes of the different standards:

PDF/A -> a standard used predominantly for document archiving
PDF/X -> a standard used predominantly for professional printing (e.g. offset printing)

For detailed differences between the two standards, see page 17 of this document: https://www.impressed.de/DOWNLOADS/pdfToolbox_Server/callas_pdfEngine_Reference.pdf

I attach great importance to PDF/X, as I believe achieving full print compliance is an absolute necessity for a mature PDF creation/conversion tool.

@liZe liZe added this to the 43 milestone Aug 3, 2018
@liZe
Member

liZe commented Aug 7, 2018

I've tried to give Acrobat various PDF files generated by WeasyPrint… It's awful, there are many, many, many things to fix before reaching PDF/A or PDF/X conformance.

I attach great importance to PDF/X, as I believe achieving full print compliance is an absolute necessity for a mature PDF creation/conversion tool.

I agree, but there's a long way to go.

@hejsan
Contributor

hejsan commented Apr 13, 2020

Hi - opening this can of worms - can we list the things needed to conform to PDF/A?
@liZe You mention giving Acrobat various PDF files generated by WeasyPrint, how does it tell you what's amiss?
I thought it was mostly about not referencing any outside files and embedding all fonts - is weasyprint not complying with that already?

@liZe
Member

liZe commented Apr 13, 2020

opening this can of worms

🐛🐛🐛🐛🐛🐛🐛🐛

can we list the things needed to conform to PDF/A?

That would be really useful.

@liZe You mention giving Acrobat various PDF files generated by WeasyPrint, how does it tell you what's amiss?

I don’t really remember, but I think that there’s a PDF validator in Acrobat (not in Reader, it’s not free 😢).

Does anyone know an open source (or at least free) tool to check PDF/A and PDF/X conformance?

I thought it was mostly about not referencing any outside files and embedding all fonts - is weasyprint not complying with that already?

As far as I can remember, there were lots of errors, and most of them were just impossible to fix with Cairo. I think that we need a dedicated PDF generator for that (see #841).

@hejsan
Contributor

hejsan commented Apr 13, 2020

I seem to recall Apache PDFBox having some features, I'll have to check better though.

I think that we need a dedicated PDF generator for that

Maybe this is another use for a post-processor that would parse the PDF and do what is needed. It seems like a massive undertaking, though, if it is supposed to change everything needed for PDF/A compliance. It might be smart to start by converting simple PDFs that don't include edge cases like embedded video.

edit: I was looking through #841 and must say I somewhat disagree about getting rid of external dependencies unless they're proving to be severe limiting factors (Maybe Cairo is?). They're literally what make big opensource projects viable and not just a massive liability to the developers.

@liZe
Member

liZe commented Apr 13, 2020

It might be smart to start by converting simple PDFs that don't include edge cases like embedded video.

The current post-processor only knows how to parse PDF files generated by Cairo. It removes a lot of edge cases.

edit: I was looking through #841 and must say I somewhat disagree about getting rid of external dependencies unless they're proving to be severe limiting factors (Maybe Cairo is?). They're literally what make big opensource projects viable and not just a massive liability to the developers.

Of course, removing all external dependencies is not a goal per se. But there are some reasons why it would be interesting to consider getting rid of some of them:

  • Having non-Python dependencies is the source of many, many, many installation problems, at least on Windows and macOS.
  • We’ve had many problems with Cairo. More than 20% of the reported issues have the "Cairo" word in their comments.
  • Cairo releases are … sometimes late. Issue #278 ("SVG getting mangled when I export to pdf") is a good example of why it’s been really frustrating to work with its dev team.
  • Cairo does a lot of things WeasyPrint’s not interested in. Generating PNG is useful for WeasyPrint, but it could be done with a PDF-to-PNG converter. Cairo is complex, and it is unlikely to get new PDF-only features soon (the latest stable version is the first one providing metadata and links, for example).
  • Pango should be useless for us. We use it to break lines, but HTML has requirements that are really different from "normal" use cases. That’s why we have a lot of workarounds for text. We should use HarfBuzz instead, and break lines using a custom algorithm, just as browsers do. See issue #301 ("Rewrite the line breaking algorithm"), for example.

So. Here’s what I think.

  • Using a "real" PDF generator would be hard but not impossible. I don’t really like ReportLab for many reasons, but something like that would be really useful.
  • Having a real line-breaking algorithm would make Pango useless.
  • FontConfig is really convenient for Pango, but it should be used only on Linux where it’s the standard library. We could probably rely on macOS and Windows APIs to find fonts (what do other browsers do?).
  • We have to keep HarfBuzz.

@hejsan
Contributor

hejsan commented Apr 13, 2020

Ok, I understand and agree with your points.

I don’t really like ReportLab for many reasons

I agree to steer away from any freemium solutions as they tend to become a liability down the road when they refuse to push features to their "community" versions.

Do you see this new PDF generator as a separate project or would it be part of WeasyPrint?

@liZe
Member

liZe commented Apr 13, 2020

I agree to steer away from any freemium solutions as they tend to become a liability down the road when they refuse to push features to their "community" versions.

👍

Do you see this new PDF generator as a separate project or would it be part of WeasyPrint?

It can be a separate project, with a quite low-level API. The hard part is probably to handle fonts, by creating a PangoCairo equivalent.

(If anyone knows how to convert PDF to PNG in pure Python, that would be useful too 😒.)

@hejsan
Contributor

hejsan commented Apr 14, 2020

I found an open-source PDF/A conformance checker that is pretty cool: https://verapdf.org/
(Download here: https://verapdf.org/software/)
There's both a simple GUI for checking individual files and a command-line tool that can be used for automated testing.
It gives you a simple breakdown report that links to details about each error (they're all hosted in this list: https://github.com/veraPDF/veraPDF-validation-profiles/wiki/PDFA-Part-1-rules).
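
For automated testing, the command-line checker can be driven from a script. The following is a minimal sketch and not part of the original comment: it assumes a `verapdf` executable on the PATH and that the `--flavour` and `--format` options behave as described in the veraPDF documentation (check `verapdf --help` for your version). The helper names `build_verapdf_command` and `validate_pdfa` are hypothetical.

```python
import shutil
import subprocess


def build_verapdf_command(pdf_path, flavour="1b", executable="verapdf"):
    """Return the argument list for one veraPDF validation run.

    Assumption: the CLI accepts --flavour (e.g. "1b" for PDF/A-1b) and
    --format; verify both against the installed version.
    """
    return [executable, "--flavour", flavour, "--format", "text", str(pdf_path)]


def validate_pdfa(pdf_path, flavour="1b"):
    """Run veraPDF and return (exit_code, report_text) without interpreting them."""
    if shutil.which("verapdf") is None:
        raise FileNotFoundError("verapdf executable not found on PATH")
    result = subprocess.run(
        build_verapdf_command(pdf_path, flavour),
        capture_output=True, text=True, check=False)
    return result.returncode, result.stdout
```

The sketch deliberately returns the raw exit code and report text, since how to interpret them depends on the veraPDF version in use.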

@liZe
Member

liZe commented Apr 15, 2020

I found an open-source PDF/A conformance checker that is pretty cool: https://verapdf.org/

That’s really cool, thanks!

It gives you a simple breakdown report that links to details about each error (they're all hosted in this list: https://github.com/veraPDF/veraPDF-validation-profiles/wiki/PDFA-Part-1-rules).

That’s really impressive.

Having PDF/A conformance is probably one of the best features we can get once we have a new PDF generator. I’m currently working on that 😉. (That = the generator, not the PDF/A conformance yet)

@hejsan
Contributor

hejsan commented Apr 15, 2020

I’m currently working on that

Cool, do you have an open repo for it yet? I had been pondering the same.
Thinking out loud: PDF/A conformance would have to be an option, as it would impact speed and available features?

@malnajdi
Contributor

@liZe is teasing a lot about this new generator. If you need help let me know 😄

@oleg-medovikov

How is it going?

@liZe
Member

liZe commented Jan 19, 2021

How is it going?

Pretty well! The new PDF generator (called pydyf) is now used in master, almost everything works fine (missing SVG images support is the major point we have to fix).

An online PDF validator thinks that many PDF files we generate are already PDF/A compliant, but I suppose that we still have a lot of work (for tags at least, I think). We have to check with veraPDF too.

As explained in #1232, the next step is to have a master branch with the same features as the current stable versions. When it’s done, we’ll have to implement features that people want, but we can’t do everything at the same time 😉. And before that, we’re currently implementing sponsored features (#247, #1057) that may also be useful for you!

@guidocioni

If I get the latest version from Conda, is pydyf already included? I've been trying to produce quite simple PDF/A-compliant files (no images or unusual components), and the file info shows that the PDF version is only 1.5 and that they're not PDF/A compliant. :( So maybe the version I'm using (52.4) does not include pydyf support yet?
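
The PDF version shown in a viewer's file info normally comes from the `%PDF-x.y` header at the start of the file. A minimal sketch for reading it directly follows. The function name `pdf_header_version` is hypothetical, and the sketch ignores the fact that a document's catalog may declare a different version than its header.

```python
import re


def pdf_header_version(data):
    """Return the version string from the %PDF-x.y header, or None if absent.

    `data` is the first bytes of the file; the header is expected near the start.
    """
    match = re.search(rb"%PDF-(\d+\.\d+)", data[:1024])
    return match.group(1).decode("ascii") if match else None


# Typical use (the file name is only an example):
#     with open("out.pdf", "rb") as pdf_file:
#         print(pdf_header_version(pdf_file.read(1024)))
```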

@grewn0uille
Member

Hello @guidocioni!

The latest version on Conda (52.5) doesn’t include pydyf. All 52.x versions are using (and will use) Cairo.

Currently there is no release working with pydyf, but the current master branch uses it so you can give it a try if you want 😀

@guidocioni

That would be good. The problem is that where I'm deploying this, I can only use Conda to install anything :D Is there a way to install the master branch with Conda? As you can imagine, converting a PDF to PDF/A using only a Conda/Python installation is also kind of a nightmare :D

@grewn0uille
Member

I don’t think there is an easy way to install the master branch directly with Conda, but you can use pip in a Conda environment and so install the master branch with pip.

@guidocioni

Eh, I wish it were that easy. Unfortunately, I can only provide a list of dependencies to install through conda-forge and access a Python environment running with Spark. I have no access to pip or the underlying Unix system. Thanks for the help anyway! I hope this will make its way into a stable release someday.

@guidocioni

@grewn0uille I managed to install the latest 53.0b1 version (which uses pydyf) in our system and produce a PDF. The file info shows that it was generated according to the 1.7 standard, but when I check it with the online validator, I unfortunately get these errors:

The value of the key Flags is 8 but must be either symbolic or non-symbolic.
The value of the key Flags is 10 but must be either symbolic or non-symbolic.
The value of the key Flags is 8 but must be either symbolic or non-symbolic.
The document does not conform to the requested standard.
The document contains fonts without embedded font programs or encoding information (CMAPs).
The document doesnot conform to the PDF 1.7 standard.

Any idea where those are coming from?

liZe added a commit that referenced this issue Apr 26, 2021
Related to #630.
@liZe
Member

liZe commented Apr 26, 2021

Any idea where those are coming from?

They come from a bug that’s just been fixed by f804d59. Thanks a lot for the report!
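
For readers puzzled by the Flags messages: in the PDF specification, a font descriptor's Flags entry is a bit field in which exactly one of the Symbolic (bit 3) and Nonsymbolic (bit 6) bits must be set. The values 8 and 10 reported by the validator set neither. The sketch below decodes such values; the function names are hypothetical, and it only illustrates the rule without claiming anything about what f804d59 changed.

```python
# Font descriptor flag bits as numbered in the PDF specification (bit 1 = lowest).
FONT_FLAGS = {
    1: "FixedPitch", 2: "Serif", 3: "Symbolic", 4: "Script",
    6: "Nonsymbolic", 7: "Italic", 17: "AllCap", 18: "SmallCap", 19: "ForceBold",
}


def decode_font_flags(flags):
    """Return the names of the flag bits set in a /Flags integer."""
    return [name for bit, name in FONT_FLAGS.items() if flags & (1 << (bit - 1))]


def has_valid_symbolic_flag(flags):
    """True when exactly one of Symbolic and Nonsymbolic is set."""
    names = decode_font_flags(flags)
    return ("Symbolic" in names) != ("Nonsymbolic" in names)


for value in (8, 10, 32):
    print(value, decode_font_flags(value), has_valid_symbolic_flag(value))
```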

@guidocioni

Nice, with f804d59 applied the PDF now conforms to PDF 1.7, but it is still not PDF/A compliant :(

@liZe
Member

liZe commented Apr 26, 2021

Nice, with f804d59 applied the PDF now conforms to PDF 1.7, but it is still not PDF/A compliant :(

And there’s a long way ahead… But at least now we can generate the PDF we want.

@grewn0uille
Member

grewn0uille commented May 5, 2021

Hello!

(The survey is now closed. Thanks for all your answers! We’ll share the results soon 😉)

If you’re interested in PDF/A compliance, we created a short survey where you can give a boost to this feature and help us to improve WeasyPrint 😉

Vote for it!

@guidocioni

So is there no way to force pydyf to produce a PDF/A-compliant file, for example by changing default fonts or other things?
Or is it just something that can't be controlled right now?

@liZe
Member

liZe commented May 12, 2021

So is there no way to force pydyf to produce a PDF/A-compliant file, for example by changing default fonts or other things?
Or is it just something that can't be controlled right now?

It can’t be controlled right now, at least without code being added to WeasyPrint. pydyf is theoretically able to generate PDF/A-compliant PDFs, but the current code of WeasyPrint doesn’t follow the PDF/A rules. pydyf is only the first step.

@guidocioni

OK, thanks. For the moment I'm using Ghostscript to convert WeasyPrint's output directly to PDF/A: WeasyPrint writes to a temporary file, which is the Ghostscript input, and the Ghostscript output goes to the filesystem. Of course, it would be amazing to have such a feature built into the tool. Anyway, keep up the good work!

@liZe liZe added this to the 56.0 milestone May 17, 2022
@liZe liZe pinned this issue May 17, 2022
@grewn0uille grewn0uille added the sponsored Issues sponsored to be resolved faster label May 17, 2022
liZe added a commit that referenced this issue May 20, 2022
@liZe liZe closed this as completed in deda575 Jun 13, 2022
@grewn0uille grewn0uille unpinned this issue Jul 7, 2022
@winklemint

Hi, can you share the code you use to convert an existing PDF to PDF/A with Ghostscript? I am trying to do this, but it is not working for me.

@guidocioni

This is something I used in the past but I'm not sure it is still working now

import os
import subprocess


def convert_to_pdfa(source_file, target_file):
    """Convert an existing PDF to PDF/A by running it through Ghostscript."""
    ghostscript_cmd = ['gs', '-dPDFA', '-dBATCH', '-dNOPAUSE',
                       '-sColorConversionStrategy=UseDeviceIndependentColor',
                       '-sDEVICE=pdfwrite', '-dPDFACompatibilityPolicy=2']
    # Because of a Ghostscript bug that rejects parameters longer than
    # 255 characters, run gs from the target directory and pass only the
    # base name of the output file.
    # Resolve the source path first so that it stays valid from there.
    source_file = os.path.abspath(source_file)
    target_dir, target_name = os.path.split(os.path.abspath(target_file))
    try:
        # cwd= replaces the manual os.chdir() calls, so the caller's working
        # directory is never changed, even when gs fails.
        subprocess.check_output(
            ghostscript_cmd + ['-sOutputFile=' + target_name, source_file],
            cwd=target_dir)
    except subprocess.CalledProcessError as e:
        raise RuntimeError("command '{}' returned with error (code {}): {}".format(
            e.cmd, e.returncode, e.output)) from e
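
The workflow described earlier in the thread (WeasyPrint writes to a temporary file, then Ghostscript converts that file) could be chained roughly as follows. This is a sketch under assumptions, not the commenter's actual code: the helper names `ghostscript_pdfa_args` and `html_to_pdfa` are hypothetical, the Ghostscript flags are copied unchanged from the snippet above without being verified, and `gs` and WeasyPrint must both be installed.

```python
import os
import subprocess
import tempfile


def ghostscript_pdfa_args(output_name, source_path):
    """Build the gs argument list, reusing the flags quoted in this thread."""
    return ['gs', '-dPDFA', '-dBATCH', '-dNOPAUSE',
            '-sColorConversionStrategy=UseDeviceIndependentColor',
            '-sDEVICE=pdfwrite', '-dPDFACompatibilityPolicy=2',
            '-sOutputFile=' + output_name, source_path]


def html_to_pdfa(html_string, target_path):
    """Render HTML with WeasyPrint, then convert the result with Ghostscript."""
    from weasyprint import HTML  # imported lazily; requires WeasyPrint

    target_dir, target_name = os.path.split(os.path.abspath(target_path))
    with tempfile.TemporaryDirectory() as tmp_dir:
        intermediate = os.path.join(tmp_dir, 'intermediate.pdf')
        HTML(string=html_string).write_pdf(intermediate)
        # Run gs from the target directory so -sOutputFile stays short.
        subprocess.run(ghostscript_pdfa_args(target_name, intermediate),
                       cwd=target_dir, check=True)
```

Whether the output actually validates as PDF/A still has to be checked with a tool such as veraPDF.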

@winklemint

Hi, thanks for this solution. I tried different policies and made several changes to make the file PDF/A-3B compliant, and veraPDF validated it. I am now looking for a way to embed an XML file in it to comply with the Factur-X standard. Any suggestion or help is highly appreciated. Thanks

@FelixSchwarz
Contributor

I am now looking for a way to embed an XML file in it to comply with the Factur-X standard. Any suggestion or help is highly appreciated.

@winklemint WeasyPrint does not use GitHub Discussions, but maybe you can open an issue about Factur-X support. My idea is to gather snippets and advice on how to generate Factur-X PDFs using WeasyPrint.


10 participants