Strip HTML tags (but keep any text content) when rendering text #33
This also solves the issue of disappearing block content from top-level HTML blocks. Apparently gomarkdown does not provide a content-only attribute for HTML blocks; this is only for spans. I think we could just strip out the tags from the block content using https://github.com/grokify/html-strip-tags-go; the issue of untrusted data is not important since clients should not be rendering HTML tags or JavaScript for gemtext! bluemonday could be used if additional sanitization is desired, but this is a heavier solution.
I changed it a bit to ensure that text content inside HTML blocks is rendered, even though blocks are currently rendered with tags and all. Apparently gomarkdown does not strip the tags from the content of HTML blocks; it only does this for span elements. (It actually sets the value of Literal to the full text content, including tags, and then nulls out Content.) I think we could just strip out the tags from the block content using https://github.com/grokify/html-strip-tags-go; the issue of untrusted data is not important since clients should not be rendering HTML tags or JavaScript for gemtext! bluemonday could be used if additional sanitization is desired, but this is a heavier solution.
I went ahead and implemented tag stripping for HTML blocks using html-strip-tags-go. I also made methods for HTMLBlock and HTMLSpan for consistency, and because I've got some ideas for them later (namely detecting tags like sup/sub and converting them to the proper ast type).
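For readers following along, here is a minimal sketch of what "strip the tags but keep the text" means for a block whose literal still contains its markup. It is not the PR's code and not html-strip-tags-go: stripTags is a hypothetical, deliberately naive stand-in written against the standard library only, and it assumes well-formed tags with no '>' inside attribute values and no comments.

```go
package main

import (
	"fmt"
	"strings"
)

// stripTags drops everything between '<' and the next '>' and keeps the
// surrounding text. Hypothetical helper for illustration only: it does not
// handle comments, quoted '>' inside attributes, or a literal '<' in text.
func stripTags(s string) string {
	var b strings.Builder
	inTag := false
	for _, r := range s {
		switch {
		case r == '<':
			inTag = true // start skipping until the tag closes
		case r == '>' && inTag:
			inTag = false // tag closed, resume copying text
		case !inTag:
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	// An HTML block as a parser might hand it over: tags and text together.
	literal := "<div><p>Hello, <em>Gemini</em>!</p></div>"
	fmt.Println(stripTags(literal))
}
```

A real renderer would likely lean on a maintained stripper or sanitizer instead, as discussed above; the sketch only shows the intended effect on the block content.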
internal/renderer/renderer.go
if entering {
	r.text(w, node)
	w.Write(lineBreak)
	w.Write(lineBreak)
Subroutines called from RenderNode generally do the final linebreak that splits Gemtext paragraphs with noNewLine. Not that it matters much; it's just a consistency thing (entering and double AST passthrough was initially intended for that job in gomarkdown, but block elements in HTML and Gemtext are different, so it doesn't get much use for that).
OK, after some deeper investigation, I've discovered the following: HTMLBlock does not correspond exactly with HTML "block" elements. Although only HTML "block" type elements ( HTMLSpan does not correspond with HTML inline elements, it just indicates a single tag within an ast container element. This is a single HTMLBlock:
This is also a single HTMLBlock:
This is also a single HTMLBlock!
This becomes a paragraph containing two HTMLSpans:
This is somewhat disappointing, as I was hoping to be able to easily get the contents of span tags and modify them as needed based on the tag, and also to skip rendering the content of some tags (
This corrects the issues with HTMLBlock and HTMLSpan parsing, adds a couple of features, and cleans up the code a bit. For HTMLBlock only, <br> is now interpreted as a hard line break. This should present the content more closely to how it was originally intended. This has not been implemented for HTMLSpan because it could cause issues with blockquotes, so this is reserved for another time. For HTMLBlock only, the contents of several tags (script, iframe, etc.) are stripped completely and will not be rendered. This can't be done as easily in HTMLSpan because the HTMLSpan only includes the tag itself, not the contents. It might be worth revisiting this later, although it's unlikely that many people will be including these tags inside of (for example) blockquotes or paragraphs.
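As a rough illustration of the two HTMLBlock behaviours described above, the sketch below drops the contents of a few tags outright, turns <br> into a line break, and then strips the remaining tags. It is an assumption-laden sketch, not the code in this PR: blockToText, the regular expressions, and the choice of script and iframe as the dropped tags are all hypothetical, and regex-based HTML handling like this breaks down on nested or malformed markup.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// Matches <br>, <br/>, <br /> in any letter case.
	brTag = regexp.MustCompile(`(?i)<br\s*/?>`)
	// Matches any remaining tag; assumes no '>' inside attribute values.
	anyTag = regexp.MustCompile(`<[^>]*>`)
	// Elements whose contents are discarded entirely (illustrative list).
	droppedElements = compileDropped("script", "iframe")
)

// compileDropped builds one pattern per tag name, since Go's regexp
// package has no backreferences to pair an opening tag with its close.
func compileDropped(tags ...string) []*regexp.Regexp {
	res := make([]*regexp.Regexp, 0, len(tags))
	for _, tag := range tags {
		pattern := `(?is)<` + tag + `\b[^>]*>.*?</` + tag + `\s*>`
		res = append(res, regexp.MustCompile(pattern))
	}
	return res
}

// blockToText converts the raw literal of an HTML block to plain text.
func blockToText(s string) string {
	for _, re := range droppedElements {
		s = re.ReplaceAllString(s, "") // remove the tag and its contents
	}
	s = brTag.ReplaceAllString(s, "\n") // <br> becomes a hard line break
	s = anyTag.ReplaceAllString(s, "")  // strip tags, keep text content
	return strings.TrimSpace(s)
}

func main() {
	fmt.Println(blockToText("<p>one<br>two<script>alert(1)</script></p>"))
}
```

The order matters in this sketch: the dropped elements have to go before the generic tag stripping, otherwise their text content would survive as plain text.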
I added some tests, and it looks like there is a problem with HTML blocks inside blockquotes.
Even more tests and some fixes for hard breaks. One thing I have found from testing is that we probably need to be unescaping HTML escapes like Still have to fix the issue with HTML blocks inside blockquotes. The issue is that somehow the block is being duplicated, once with tags stripped and once as a blockquote without tags stripped.
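On the unescaping point, Go's standard library already covers the basic case with html.UnescapeString. The snippet below is only a sketch of how that could slot in: unescapeText is a hypothetical wrapper, the sample input is made up for illustration, and running it after tag stripping is an assumption about ordering rather than what this PR does.

```go
package main

import (
	"fmt"
	"html"
)

// unescapeText turns HTML character references such as &amp; or &lt; back
// into the characters they stand for. Hypothetical wrapper; assumed to run
// after tags have been stripped, so an escaped "<" stays literal text.
func unescapeText(s string) string {
	return html.UnescapeString(s)
}

func main() {
	fmt.Println(unescapeText("Fish &amp; chips &lt;3"))
}
```

Unescaping last means text that was escaped on purpose, such as a literal angle bracket, is not mistaken for a tag by the stripping step.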
Should be done now.
In retrospect, this shows very well why my impulsive pick of gomarkdown really wasn't the best solution out there. :D
Fixes #6