Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checkstyle fails with "Path contains invalid character" if path to config file contains non-ascii characters #13012

Closed
reftel opened this issue Apr 20, 2023 · 5 comments · Fixed by #14883

Comments

@reftel
Copy link

reftel commented Apr 20, 2023

If the config file is in a location that contains non-ascii characters, then checkstyle fails:

% ls
checkstyle-10.9.3-all.jar	checkstyle.xml
% cat checkstyle.xml 
<?xml version="1.0"?>
<!DOCTYPE module PUBLIC
  "-//Puppy Crawl//DTD Check Configuration 1.3//EN"
  "configuration_1_3.dtd">
<module name="Checker">
</module>
% mkdir a
% cp checkstyle.xml a
% java -jar checkstyle-10.9.3-all.jar -c a/checkstyle.xml .
Starting audit...
Audit done.
% mkdir æ
% cp checkstyle.xml æ
% java -jar checkstyle-10.9.3-all.jar -c æ/checkstyle.xml .
com.puppycrawl.tools.checkstyle.api.CheckstyleException: unable to parse configuration stream
	at com.puppycrawl.tools.checkstyle.ConfigurationLoader.loadConfiguration(ConfigurationLoader.java:324)
	at com.puppycrawl.tools.checkstyle.ConfigurationLoader.loadConfiguration(ConfigurationLoader.java:266)
	at com.puppycrawl.tools.checkstyle.Main.runCheckstyle(Main.java:380)
	at com.puppycrawl.tools.checkstyle.Main.runCli(Main.java:338)
	at com.puppycrawl.tools.checkstyle.Main.execute(Main.java:195)
	at com.puppycrawl.tools.checkstyle.Main.main(Main.java:130)
Caused by: com.sun.org.apache.xerces.internal.util.URI$MalformedURIException: Path contains invalid character: æ
	at java.xml/com.sun.org.apache.xerces.internal.util.URI.initializePath(URI.java:1110)
	at java.xml/com.sun.org.apache.xerces.internal.util.URI.initialize(URI.java:583)
	at java.xml/com.sun.org.apache.xerces.internal.util.URI.<init>(URI.java:336)
	at java.xml/com.sun.org.apache.xerces.internal.util.URI.<init>(URI.java:299)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLEntityManager.expandSystemIdStrictOff1(XMLEntityManager.java:2393)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLEntityManager.expandSystemId(XMLEntityManager.java:2239)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLEntityManager.resolveEntityAsPerStax(XMLEntityManager.java:996)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1142)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1040)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:943)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:605)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:534)
	at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:888)
	at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:824)
	at java.xml/com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
	at java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1216)
	at java.xml/com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:635)
	at com.puppycrawl.tools.checkstyle.XmlLoader.parseInputSource(XmlLoader.java:81)
	at com.puppycrawl.tools.checkstyle.ConfigurationLoader.parseInputSource(ConfigurationLoader.java:196)
	at com.puppycrawl.tools.checkstyle.ConfigurationLoader.loadConfiguration(ConfigurationLoader.java:315)
	... 5 more
Checkstyle ends with 1 errors.

I would have expected checkstyle to work the same, regardless of what characters were used in the path.

See also https://issues.apache.org/jira/browse/MCHECKSTYLE-425 , where Michael Osipov suggests it´s due to use of URI#toString where URI#toASCIIString should have been used.


From plugin issue:

While I can reproduce it, it is not a bug in maven plugin. The issue is with Checkstyle itself. Unfortunately, Sun – back then – decided to print gargabe output with URI#toString() instead of doing the right thing. #toString() produces invalid URIs. One must ALWAYS use #toASCIIString(). Never rely on #toString(). I checked the code of Checkstyle and it unfortunately does on many occasions. File an issue there and have it fixed.

Affected files:

Name Line Text Path
AbstractAutomaticBean.java 24 import java.net.URI; D:\Entwicklung\Projekte\checkstyle\src\main\java\com\puppycrawl\tools\checkstyle
AbstractHeaderCheck.java 28 import java.net.URI; D:\Entwicklung\Projekte\checkstyle\src\main\java\com\puppycrawl\tools\checkstyle\checks\header
ImportControlCheck.java 22 import java.net.URI; D:\Entwicklung\Projekte\checkstyle\src\main\java\com\puppycrawl\tools\checkstyle\checks\imports
ImportControlLoader.java 25 import java.net.URI; D:\Entwicklung\Projekte\checkstyle\src\main\java\com\puppycrawl\tools\checkstyle\checks\imports
ConfigurationLoader.java 23 import java.net.URI; D:\Entwicklung\Projekte\checkstyle\src\main\java\com\puppycrawl\tools\checkstyle
SuppressionsLoader.java 24 import java.net.URI; D:\Entwicklung\Projekte\checkstyle\src\main\java\com\puppycrawl\tools\checkstyle\filters
PropertyCacheFile.java 29 import java.net.URI; D:\Entwicklung\Projekte\checkstyle\src\main\java\com\puppycrawl\tools\checkstyle
CommonUtil.java 28 import java.net.URI; D:\Entwicklung\Projekte\checkstyle\src\main\java\com\puppycrawl\tools\checkstyle\utils
See: apache/maven@8e0efaa

reftel added a commit to reftel/checkstyle that referenced this issue Apr 20, 2023
@reftel
Copy link
Author

reftel commented Apr 20, 2023

I can confirm that changing from toString to toASCIIString on line 265 of ConfigurationLoader.java resolves this particular issue. There may be other instances of the same issue in the code base, but the test suite does not detect them (it does detect this one - if the repo is cloned to a directory with a non-ascii character, the testsuite on master fails). I have a fix ready for PR is the issue is approved.

@michael-o
Copy link
Contributor

I can confirm that changing from toString to toASCIIString on line 265 of ConfigurationLoader.java resolves this particular issue. There may be other instances of the same issue in the code base, but the test suite does not detect them (it does detect this one - if the repo is cloned to a directory with a non-ascii character, the testsuite on master fails). I have a fix ready for PR is the issue is approved.

Like I told you :-D ALL occurences of toString() must be replaced.

@romani
Copy link
Member

romani commented Apr 20, 2023

Yes, please search and replace all

Abandoned PR - #13016 that is almost done.

reftel added a commit to reftel/checkstyle that referenced this issue Apr 21, 2023
reftel added a commit to reftel/checkstyle that referenced this issue Apr 24, 2023
bot-gradle added a commit to gradle/gradle that referenced this issue Jun 15, 2023
… Linux and macOS

The overarching goal here is to shake out more encoding problems throughout Gradle by running all our integration tests in a way that the tested code has to deal with non-ASCII characters in file paths. This PR takes a step towards that goal by forcing all our non-Windows integration tests to use such a path.

To keep the scope manageable, this PR does not force non-ASCII paths for Windows. That needs to be enabled in a followup PR where we can deal with Windows-specific encoding problems.

The idea is similar to how we add a space to the `build/tmp/test files` directory's name where all the test output is typically located; this time we replace the `s` in `test files` with an `ŝ` (see [U+015D](https://www.compart.com/en/unicode/U+015D)). Importantly this character is not part of ASCII nor any ISO-8859-X codepage, and cannot be represented by a single byte. (See [Wikipedia](https://en.wikipedia.org/wiki/ŝ)).

There is also an escape hatch for tests that for some reason can't support Unicode paths; these need to be tagged with `@DoesNotSupportNonAsciiPaths`. We have such offenders today:

  - Checkstyle fails because of this bug: checkstyle/checkstyle#13012
  - Java 6 wrapper tests fail because Java 6 barfs on non-ASCII characters in the path

Most of the problems that had to be fixed for Unixes come from the fact that `URI.toString()` does not encode non-ASCII characters, and some tools can't parse string representations of URIs with non-ASCII characters in them.

So some of the `URI`s are now converted to strings using `toASCIIString()` instead. This is the canonical form of a URI, and is the intended way to go when the string form is passed to places where we can't ensure that it will be read with the right encoding (see https://www.w3.org/Addressing/URL/3_URI_Choices.html)

There are followups:
- #25316
- #25322

This PR is a followup to the daemon encoding fix in:
- #25319

Co-authored-by: Lóránt Pintér <lorant@gradle.com>
@AayushSaini101
Copy link
Contributor

I am on it

@nrmancuso
Copy link
Member

Closed via #14883

@nrmancuso nrmancuso added this to the 10.17.0 milestone May 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants