Docx -> HTML: Pandoc discards comments on tables, table rows and table cells #9727

Open
bvobart opened this issue May 7, 2024 · 0 comments

Explain the problem.
I'm working on a project that converts Docx files to accessible HTML. We do some pre- and postprocessing of our own, but Pandoc does the main Docx-to-HTML conversion. To track how each element of a Docx file is transformed through preprocessing, Pandoc and postprocessing, we attach a UUID comment to each element that we want to track. A simplified example for a paragraph (w:p):

<w:p>
  <w:commentRangeStart w:id="0"/>
  ...
  <w:r>
    <w:t>Some Text</w:t>
  </w:r>
  ...
  <w:commentRangeEnd w:id="0"/>
</w:p>

The comments.xml of that document will then contain a w:comment with ID 0 and a UUID as contents.
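For illustration, the annotation step described above could be sketched in Python roughly as follows. This is a minimal sketch and not our actual preprocessing code: the `annotate` helper and its parameters are hypothetical, and it omits parts that a complete Word comment normally carries (for example the `w:commentReference` run and author/date attributes).

```python
import uuid
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def w(tag):
    """Return the namespace-qualified name for a WordprocessingML tag."""
    return f"{{{W_NS}}}{tag}"


def annotate(element, comments_root, comment_id):
    """Hypothetical helper: put a comment range around the children of
    `element` and record a matching w:comment whose text is a fresh UUID."""
    tracking_id = str(uuid.uuid4())
    id_attr = {w("id"): str(comment_id)}

    # Keep a leading w:pPr (paragraph properties) first, if there is one.
    index = 1 if len(element) > 0 and element[0].tag == w("pPr") else 0
    element.insert(index, ET.Element(w("commentRangeStart"), id_attr))
    element.append(ET.Element(w("commentRangeEnd"), id_attr))

    # The entry that would live in comments.xml: the UUID is the comment text.
    comment = ET.SubElement(comments_root, w("comment"), id_attr)
    para = ET.SubElement(comment, w("p"))
    run = ET.SubElement(para, w("r"))
    ET.SubElement(run, w("t")).text = tracking_id
    return tracking_id


if __name__ == "__main__":
    comments = ET.Element(w("comments"))
    paragraph = ET.Element(w("p"))
    text_run = ET.SubElement(paragraph, w("r"))
    ET.SubElement(text_run, w("t")).text = "Some Text"
    annotate(paragraph, comments, 0)
    print(ET.tostring(paragraph, encoding="unicode"))
    print(ET.tostring(comments, encoding="unicode"))
```

The same helper can be pointed at a w:tbl, w:tr or w:tc element, which is essentially what the table case in this issue does.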

Now we want to track what happens to tables, so I tried adding comments to w:tbl, w:tr and w:tc elements in the same way as in the w:p example above. However, these comments never make it into the resulting HTML. Only the comments placed on the w:p elements inside table cells end up on the HTML td elements. The comments on w:tr do not end up on the HTML tr or thead elements, and the comments on w:tbl do not end up on the HTML table element. In both of those cases, the comments are simply discarded.

This is the command I'm using to call Pandoc:

pandoc --from docx --to html --output result.html --track-changes=all annotator_tables_test_gh_issue.docx

See here for an example file containing several tables, where each w:tbl, w:tr, w:tc and w:p has been annotated with a UUID comment (note: they're only visible in the XML, not when you open the file with Word):
annotator_tables_test_gh_issue.docx
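Since the annotations are only visible in the XML, a small script can confirm which element types carry them. The following is a sketch under the assumption that each comment range starts as a direct child of the annotated element, as in the simplified w:p example; the function names are made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
RANGE_START = f"{{{W_NS}}}commentRangeStart"


def annotated_parents(document_xml):
    """Count, per local tag name, the elements that have a
    w:commentRangeStart as a direct child."""
    root = ET.fromstring(document_xml)
    counts = Counter()
    for parent in root.iter():
        if any(child.tag == RANGE_START for child in parent):
            counts[parent.tag.split("}")[-1]] += 1
    return counts


def annotated_parents_in_docx(path):
    """Run the count on word/document.xml inside a .docx archive."""
    with zipfile.ZipFile(path) as docx:
        return annotated_parents(docx.read("word/document.xml"))


if __name__ == "__main__":
    import sys
    # Usage: python inspect_annotations.py annotator_tables_test_gh_issue.docx
    for tag, count in sorted(annotated_parents_in_docx(sys.argv[1]).items()):
        print(f"w:{tag}: {count}")
```

Non-zero counts for tbl, tr and tc would show that the table-level comments are present in the input, which makes the comparison with the HTML output easier.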

My expectation is that the w:tbl comments end up on the HTML table element, the w:tr comments on the HTML tr element, and the w:tc or w:p comments on the HTML td element.
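To check which comments survive the conversion, the output can be scanned for comment spans. The sketch below assumes that `--track-changes=all` emits comments as spans with the classes `comment-start` and `comment-end`, as described in the Pandoc manual; the exact attribute layout (comment ID in `id`, comment text inside the start span) is my assumption, and the `CommentLocator` class is hypothetical.

```python
from html.parser import HTMLParser

TRACKED = {"table", "thead", "tbody", "tr", "th", "td", "p"}
VOID = {"br", "img", "hr", "col", "input", "meta", "link"}


class CommentLocator(HTMLParser):
    """Record each comment-start span with its nearest enclosing tracked tag."""

    def __init__(self):
        super().__init__()
        self.stack = []            # currently open tags
        self.found = []            # one dict per comment-start span
        self.comment_depth = None  # stack depth at which a comment span opened

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag == "span" and "comment-start" in classes:
            inside = next((t for t in reversed(self.stack) if t in TRACKED), None)
            self.found.append({"id": attr.get("id"), "inside": inside, "text": ""})
            self.comment_depth = len(self.stack)
        if tag not in VOID:
            self.stack.append(tag)

    def handle_data(self, data):
        # Text inside the comment-start span is the comment body (the UUID).
        if self.comment_depth is not None:
            self.found[-1]["text"] += data

    def handle_endtag(self, tag):
        # Pop up to and including the matching open tag.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass
        if self.comment_depth is not None and len(self.stack) <= self.comment_depth:
            self.comment_depth = None


def locate_comments(html):
    locator = CommentLocator()
    locator.feed(html)
    return locator.found


if __name__ == "__main__":
    with open("result.html", encoding="utf-8") as handle:
        for entry in locate_comments(handle.read()):
            print(entry)
```

Comparing the reported UUIDs against the ones written into the Docx shows which annotations were dropped; with the behaviour described in this issue, the ones from w:tbl and w:tr would be missing from the list.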

Pandoc version?
3.1.13

bvobart added the bug label May 7, 2024