Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strip HTML tags (but keep any text content) when rendering text #33

Merged
merged 16 commits into from Oct 2, 2021

Conversation

mntn-xyz
Copy link
Contributor

Fixes #6

This also solves the issue of disappearing block content from top-level HTML blocks. Apparently gomarkdown does not provide a content-only attribute for HTML blocks; this is only for spans.

I think we could just strip out the tags from the block content using https://github.com/grokify/html-strip-tags-go; the issue of untrusted data is not important since clients should not be rendering HTML tags or JavaScript for gemtext! bluemonday could be used if additional sanitization is desired, but this is a heavier solution.
@mntn-xyz
Copy link
Contributor Author

mntn-xyz commented Sep 19, 2021

I changed it a bit to ensure that text content inside HTML blocks is rendered, even though blocks are currently rendered with tags and all. Apparently gomarkdown does not strip the tags from the content with HTML blocks, it only does this for span elements. (It actually sets the value of Literal to the full text content, including tags, and then nulls out Content.)

I think we could just strip out the tags from the block content using https://github.com/grokify/html-strip-tags-go; the issue of untrusted data is not important since clients should not be rendering HTML tags or JavaScript for gemtext! bluemonday could be used if additional sanitization is desired, but this is a heavier solution.

@tdemin
Copy link
Owner

tdemin commented Sep 20, 2021

I am not sure if this is how it should behave..?

Source file for reproduction:

# Blockquote test

> Testing text with an <b>HTML</b> tag.
> Another line of <pre>testing text</pre>.

<code>Entire paragraph.</code>

<p>Paragraph 2.</p>

Test of <b>inline spans</b>.

> <code>Line of text.</code>

Test with fix to #6

@tdemin tdemin added the bug Something isn't working label Sep 20, 2021
@tdemin tdemin added this to the 0.4.0 milestone Sep 20, 2021
@mntn-xyz
Copy link
Contributor Author

I went ahead and implemented tag stripping for HTML blocks using html-strip-tags-go. I also made methods for HTMLBlock and HTMLSpan for consistency, and because I've got some ideas for them later (namely detecting tags like sup/sub and converting them to the proper ast type).

internal/renderer/renderer.go Show resolved Hide resolved
if entering {
r.text(w, node)
w.Write(lineBreak)
w.Write(lineBreak)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Subroutines called from RenderNode generally do the final linebreak that splits Gemtext paragraphs with noNewLine. Not that it matters much; it's just a consistency thing (entering and double AST passthrough was initially intended for that job in gomarkdown, but block elements in HTML and Gemtext are different, so it doesn't get much use for that).

@mntn-xyz
Copy link
Contributor Author

mntn-xyz commented Sep 22, 2021

OK, after some deeper investigation, I've discovered the following:

HTMLBlock does not correspond exactly with HTML "block" elements. Although only HTML "block" type elements (<p>, <blockquote>, etc) will be rendered as HTMLBlocks, this alone is not sufficient. The block must also begin at the very start of a line of text (no leading spaces!) and end at the very end of a line of text. It's really just a top-level ast element like a paragraph or list; if you add something outside the block but on the same line, it becomes a paragraph containing HTMLSpans.

HTMLSpan does not correspond with HTML inline elements, it just indicates a single tag within an ast container element. Hello <b>world</b> represents a paragraph containing TWO spans.

This is a single HTMLBlock:

<p>HTML block</p>

This is also a single HTMLBlock:

<p>HTML
block</p>

This is also a single HTMLBlock!

<p>HTML block</p><p>Same HTML block!</p>

This becomes a paragraph containing two HTMLSpans:

<p>HTML span</p>Extra text

This is somewhat disappointing, as I was hoping to be able to easily get the contents of span tags and modify them as needed based on the tag, and also to skip rendering the content of some tags (<script> etc). I will implement some of this at the block level, but I'll have to just naively strip out HTMLSpans for now.

This corrects the issues with HTMLBlock and HTMLSpan parsing, adds a couple of features, and cleans up the code a bit.

For HTMLBlock only, <br> is now interpreted as a hard line break. This should present the content more closely to how it was originaly intended. This has not been implemented for HTMLSpan because it could cause issues with blockquotes, so this is reserved for another time.

For HTMLBlock only, the contents of several tags (script, iframe, etc) are stripped completely and will not be rendered. This can't be done as easily in HTMLSpan because the HTMLSpan only includes the tag itself, not the contents. It might be worth revisiting this later, although it's unlikely that many people will be including these tags inside of (for example) blockquotes or paragraphs.
internal/renderer/renderer.go Show resolved Hide resolved
internal/renderer/renderer.go Show resolved Hide resolved
internal/renderer/renderer.go Show resolved Hide resolved
@mntn-xyz
Copy link
Contributor Author

mntn-xyz commented Oct 1, 2021

I added some tests, and it looks like there is a problem with HTML blocks inside blockquotes.

I also need to handle tag escaping properly inside HTML blocks (<br>, <br>). Edit: in other markdown implementations this doesn't seem to be used

@mntn-xyz
Copy link
Contributor Author

mntn-xyz commented Oct 1, 2021

Even more tests and some fixes for hard breaks. One thing I have found from testing is that we probably need to be unescaping HTML escapes like &lt; (unescaped: <). I did this in the HTML block but it looks like it needs to be done for regular text as well. It's going to be tricky because of inline code spans and backslash escapes.

Still have to fix the issue with HTML blocks inside blockquotes. The issue is that somehow the block is being duplicated, once with tags stripped and once as a blockquote without tags stripped.

@mntn-xyz
Copy link
Contributor Author

mntn-xyz commented Oct 1, 2021

Should be done now.

Copy link
Owner

@tdemin tdemin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In retrospect, this shows very well why my impulsive pick of gomarkdown really wasn't the best solution out there. :D

@tdemin tdemin merged commit ddb3943 into tdemin:master Oct 2, 2021
@tdemin tdemin mentioned this pull request Oct 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HTML tags in blockquotes are not stripped
2 participants