Strip HTML tags (but keep any text content) when rendering text #33

mntn-xyz · 2021-09-19T02:08:21Z

Fixes #6

This also solves the issue of disappearing block content from top-level HTML blocks. Apparently gomarkdown does not provide a content-only attribute for HTML blocks; this is only for spans. I think we could just strip out the tags from the block content using https://github.com/grokify/html-strip-tags-go; the issue of untrusted data is not important since clients should not be rendering HTML tags or JavaScript for gemtext! bluemonday could be used if additional sanitization is desired, but this is a heavier solution.

mntn-xyz · 2021-09-19T03:01:13Z

I changed it a bit to ensure that text content inside HTML blocks is rendered, even though blocks are currently rendered with tags and all. Apparently gomarkdown does not strip the tags from the content with HTML blocks, it only does this for span elements. (It actually sets the value of Literal to the full text content, including tags, and then nulls out Content.)

I think we could just strip out the tags from the block content using https://github.com/grokify/html-strip-tags-go; the issue of untrusted data is not important since clients should not be rendering HTML tags or JavaScript for gemtext! bluemonday could be used if additional sanitization is desired, but this is a heavier solution.
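As a rough illustration of the idea above, here is a minimal, standard-library-only sketch. It is not the code from this PR and not how html-strip-tags-go works internally. The `leaf` type is a hypothetical stand-in that only mirrors the behaviour described (full source in `Literal`, empty `Content` for HTML blocks), and `stripTags` and `textOf` are made-up helper names.

```go
package main

import (
	"fmt"
	"strings"
)

// leaf is a hypothetical stand-in for a parsed node. Per the discussion,
// an HTML block keeps its full source (tags included) in Literal and has
// an empty Content.
type leaf struct {
	Literal []byte
	Content []byte
}

// stripTags drops everything between '<' and '>' and keeps the rest.
// Naive by design: a bare '<' in running text is treated as a tag start,
// and '>' inside attribute values or comments is not handled.
func stripTags(s string) string {
	var b strings.Builder
	inTag := false
	for _, r := range s {
		switch {
		case r == '<':
			inTag = true
		case r == '>' && inTag:
			inTag = false
		case !inTag:
			b.WriteRune(r)
		}
	}
	return b.String()
}

// textOf prefers Content when the parser filled it in, and otherwise
// falls back to the tag-stripped Literal.
func textOf(n leaf) string {
	if len(n.Content) > 0 {
		return string(n.Content)
	}
	return stripTags(string(n.Literal))
}

func main() {
	block := leaf{Literal: []byte("<p>Paragraph 2.</p>")}
	fmt.Println(textOf(block)) // prints "Paragraph 2."
}
```

A real implementation would likely delegate the stripping step to a maintained library, as proposed above, rather than to a hand-written scanner like this one.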

tdemin · 2021-09-20T17:57:47Z

I am not sure if this is how it should behave..?

Source file for reproduction:

# Blockquote test

> Testing text with an <b>HTML</b> tag.
> Another line of <pre>testing text</pre>.

<code>Entire paragraph.</code>

<p>Paragraph 2.</p>

Test of <b>inline spans</b>.

> <code>Line of text.</code>

mntn-xyz · 2021-09-20T18:34:04Z

I went ahead and implemented tag stripping for HTML blocks using html-strip-tags-go. I also made methods for HTMLBlock and HTMLSpan for consistency, and because I've got some ideas for them later (namely detecting tags like sup/sub and converting them to the proper ast type).

tdemin · 2021-09-21T20:39:28Z

internal/renderer/renderer.go

+	if entering {
+		r.text(w, node)
+		w.Write(lineBreak)
+		w.Write(lineBreak)


Subroutines called from RenderNode generally do the final linebreak that splits Gemtext paragraphs with noNewLine. Not that it matters much; it's just a consistency thing (entering and double AST passthrough was initially intended for that job in gomarkdown, but block elements in HTML and Gemtext are different, so it doesn't get much use for that).

mntn-xyz · 2021-09-22T14:43:56Z

OK, after some deeper investigation, I've discovered the following:

HTMLBlock does not correspond exactly with HTML "block" elements. Although only HTML block-level elements (<blockquote>, etc.) will be parsed as HTMLBlocks, being a block-level element is not sufficient by itself. The block must also begin at the very start of a line of text (no leading spaces!) and end at the very end of a line of text. It's really just a top-level ast element like a paragraph or list; if you add something outside the block but on the same line, it becomes a paragraph containing HTMLSpans.

HTMLSpan does not correspond with HTML inline elements; it just indicates a single tag within an ast container element. Hello world wrapped in an opening and a closing tag represents a paragraph containing TWO spans.

This is a single HTMLBlock:

<p>HTML block</p>

This is also a single HTMLBlock:

<p>HTML
block</p>

This is also a single HTMLBlock!

<p>HTML block</p><p>Same HTML block!</p>

This becomes a paragraph containing two HTMLSpans:

<p>HTML span</p>Extra text
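To make the "one span per tag" observation concrete, here is a small sketch that splits inline text into tag segments and plain-text segments. It is only a model of the behaviour described above, not gomarkdown's parser; `segment`, `splitSpans`, and `countHTML` are hypothetical names, and the regular expression is a deliberately simple approximation of a tag.

```go
package main

import (
	"fmt"
	"regexp"
)

// tagRe is a simplistic approximation of an HTML tag.
var tagRe = regexp.MustCompile(`<[^>]*>`)

// segment is one piece of inline content: a single tag or a run of text.
type segment struct {
	Text   string
	IsHTML bool
}

// splitSpans cuts s into alternating text and tag segments, one segment
// per tag, which mirrors the "a span is a single tag" observation.
func splitSpans(s string) []segment {
	var out []segment
	pos := 0
	for _, loc := range tagRe.FindAllStringIndex(s, -1) {
		if loc[0] > pos {
			out = append(out, segment{Text: s[pos:loc[0]]})
		}
		out = append(out, segment{Text: s[loc[0]:loc[1]], IsHTML: true})
		pos = loc[1]
	}
	if pos < len(s) {
		out = append(out, segment{Text: s[pos:]})
	}
	return out
}

// countHTML returns how many segments are tags.
func countHTML(segs []segment) int {
	n := 0
	for _, s := range segs {
		if s.IsHTML {
			n++
		}
	}
	return n
}

func main() {
	segs := splitSpans("<p>HTML span</p>Extra text")
	for _, s := range segs {
		fmt.Printf("html=%v %q\n", s.IsHTML, s.Text)
	}
	fmt.Println(countHTML(segs)) // prints 2
}
```

Because the opening and closing tags arrive as separate segments, the text between them is just an ordinary sibling, which is why acting on "the contents of a span" is awkward at this level.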

This is somewhat disappointing, as I was hoping to be able to easily get the contents of span tags and modify them as needed based on the tag, and also to skip rendering the content of some tags (<script> etc). I will implement some of this at the block level, but I'll have to just naively strip out HTMLSpans for now.

This corrects the issues with HTMLBlock and HTMLSpan parsing, adds a couple of features, and cleans up the code a bit. For HTMLBlock only, <br> is now interpreted as a hard line break. This should present the content closer to how it was originally intended. It has not been implemented for HTMLSpan because it could cause issues with blockquotes, so that is reserved for another time. For HTMLBlock only, the contents of several tags (script, iframe, etc.) are stripped completely and will not be rendered. This can't be done as easily in HTMLSpan because the HTMLSpan only includes the tag itself, not the contents. It might be worth revisiting this later, although it's unlikely that many people will be including these tags inside of (for example) blockquotes or paragraphs.
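The block-level handling described above could look roughly like the following. This is a sketch under stated assumptions, not the code merged in this PR: `renderHTMLBlock` is a hypothetical name, the dropped-tag list contains only the two tags named in the comment, and regular expressions stand in for real HTML parsing.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// brRe matches <br>, <br/> and <br /> in any letter case.
	brRe = regexp.MustCompile(`(?i)<br\s*/?>`)
	// tagRe is a simplistic approximation of any remaining tag.
	tagRe = regexp.MustCompile(`<[^>]*>`)
	// droppedTags lists elements whose contents are discarded entirely.
	// Only the two tags named in the discussion are included here.
	droppedTags = []string{"script", "iframe"}
)

// renderHTMLBlock turns the raw source of an HTML block into plain text:
// drop unwanted elements with their contents, turn <br> into a newline,
// then strip whatever tags remain.
func renderHTMLBlock(src string) string {
	for _, name := range droppedTags {
		// (?is): case-insensitive, and '.' also matches newlines.
		re := regexp.MustCompile(`(?is)<` + name + `\b[^>]*>.*?</` + name + `\s*>`)
		src = re.ReplaceAllString(src, "")
	}
	src = brRe.ReplaceAllString(src, "\n")
	src = tagRe.ReplaceAllString(src, "")
	return strings.TrimSpace(src)
}

func main() {
	in := "<p>Line one<br>Line two</p><script>alert(1)</script>"
	fmt.Printf("%q\n", renderHTMLBlock(in)) // prints "Line one\nLine two"
}
```

The order matters in this sketch: dropping script and iframe contents has to happen before generic tag stripping, otherwise their inner text would survive as ordinary content.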

mntn-xyz · 2021-10-01T14:44:15Z

I added some tests, and it looks like there is a problem with HTML blocks inside blockquotes.

~~I also need to handle tag escaping properly inside HTML blocks.~~ Edit: in other markdown implementations this doesn't seem to be used.

mntn-xyz · 2021-10-01T16:41:22Z

Even more tests and some fixes for hard breaks. One thing I have found from testing is that we probably need to be unescaping HTML escapes like &lt; (unescaped: <). I did this in the HTML block but it looks like it needs to be done for regular text as well. It's going to be tricky because of inline code spans and backslash escapes.
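A minimal sketch of the unescaping concern, assuming Go's standard `html.UnescapeString` is acceptable for entity decoding. `unescapeOutsideCode` is a hypothetical helper, not something from this PR. It deliberately ignores backslash escapes and multi-backtick code spans, which are exactly the tricky cases mentioned above.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// unescapeOutsideCode decodes HTML entities in s but leaves text between
// single backticks untouched. Naive: it assumes balanced single backticks
// and knows nothing about backslash escapes or multi-backtick spans.
func unescapeOutsideCode(s string) string {
	parts := strings.Split(s, "`")
	for i := range parts {
		// With balanced backticks, even-indexed parts lie outside code spans.
		if i%2 == 0 {
			parts[i] = html.UnescapeString(parts[i])
		}
	}
	return strings.Join(parts, "`")
}

func main() {
	fmt.Println(unescapeOutsideCode("a &lt; b but `&lt;` stays"))
	// prints: a < b but `&lt;` stays
}
```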

Still have to fix the issue with HTML blocks inside blockquotes. The issue is that somehow the block is being duplicated, once with tags stripped and once as a blockquote without tags stripped.

mntn-xyz · 2021-10-01T18:26:32Z

Should be done now.

tdemin

In retrospect, this shows very well why my impulsive pick of gomarkdown really wasn't the best solution out there. :D

mntn-xyz added 5 commits September 12, 2021 18:25

Render blockquote children recursively

7834c27

Remove fmt.Println

34cd4c6

Merge branch 'master' into blockquote

5df2399

Integrate blockquote fix with textWithNewlineReplacement

9e78c95

Remove nonewline change, does not apply

8fdc8c0

tdemin added the bug Something isn't working label Sep 20, 2021

tdemin added this to the 0.4.0 milestone Sep 20, 2021

mntn-xyz added 2 commits September 20, 2021 14:24

Merge branch 'master' into blockquote

25aec5f

Implement tag stripping for HTML blocks

a13aa93

tdemin reviewed Sep 21, 2021

tdemin reviewed Sep 22, 2021

mntn-xyz added 4 commits September 22, 2021 16:00

Fix hardbreak

1f902d1

Merge from master

6dc81d8

Add test for hard break

280c969

Add more HTML tests

53c388e

Unescape HTML escapes in HTML blocks, fix for hard break, more tests

1f6bb3a

mntn-xyz added 2 commits October 1, 2021 13:17

Add unescaping of HTML entities

443e1c5

Fix HTML block in blockquote, organize HTML tests

247c5c7

tdemin approved these changes Oct 2, 2021

tdemin merged commit ddb3943 into tdemin:master Oct 2, 2021

tdemin mentioned this pull request Oct 3, 2021

Migrate to Goldmark #40

Open
