Issue
I have some code that removes the BOM from the first line of a file, like this:
public static String removeBOMIndicator(String line) {
    if (line.length() > 1) {
        byte[] bytes = line.getBytes();
        if (bytes.length >= 3 && bytes[0] == (byte) 0xEF && bytes[1] == (byte) 0xBB && bytes[2] == (byte) 0xBF) {
            line = line.substring(1);
        }
    }
    return line;
}
This function works well and I created a test case to be sure it stays that way. The test passes without trouble when I launch it with IntelliJ or when our SonarQube instance runs it.
However, when I launch the test using Git Bash (mvn surefire:test -Dtest=RemoveBomHeadertTest), the output of my function contains two characters, ╗┐, at the start.
If I change my code to remove the first three characters instead of only the first one, then it works well in Git Bash, but in IntelliJ I'm missing the first two characters of my String.
Any idea why the behaviour of substring might be different in these two cases?
Solution
Prior to JDK 18 and JEP 400, getBytes() on a String uses the platform’s default character encoding, which is not guaranteed to be UTF-8. You would have to use line.getBytes(StandardCharsets.UTF_8) to ensure that you always get the bytes according to the UTF-8 encoding.
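To illustrate why the two environments can disagree, here is a minimal sketch. It assumes the file's raw bytes were decoded once as UTF-8 and once with a single-byte charset; ISO-8859-1 is used here only as a stand-in for whatever default charset the Git Bash run actually picked up, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class BomDecodingDemo {
    // Raw file content: the UTF-8 BOM (EF BB BF) followed by "ab".
    static final byte[] RAW = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', 'b'};

    // Decoded as UTF-8, the three BOM bytes become the single char U+FEFF.
    static String decodeAsUtf8() {
        return new String(RAW, StandardCharsets.UTF_8);
    }

    // Decoded with a single-byte charset (stand-in for a non-UTF-8 platform
    // default), each BOM byte becomes its own char.
    static String decodeAsLatin1() {
        return new String(RAW, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(decodeAsUtf8().length());   // prints 3
        System.out.println(decodeAsLatin1().length()); // prints 5
    }
}
```

In both cases, encoding the string back with the same charset yields EF BB BF as the first three bytes, so the byte check in the question passes either way. Only the number of chars that substring has to skip differs: one in the first case, three in the second.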
However, this is unnecessarily complicated. To remove the BOM, if present, you can simply use the string’s UTF-16 based API:
if (line.startsWith("\uFEFF")) line = line.substring(1);
Not only is this shorter, it is more efficient as it doesn’t convert the entire string to UTF-8, just to check the first char.
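As a drop-in replacement for the method in the question, the check could be wrapped like this. This is only a sketch; the class name BomStripper and the method name removeBom are made up for illustration.

```java
public class BomStripper {
    // Removes a leading BOM (U+FEFF) if present; otherwise returns the line unchanged.
    // Works on the string's chars directly, so no charset is involved.
    static String removeBom(String line) {
        if (line.startsWith("\uFEFF")) {
            return line.substring(1);
        }
        return line;
    }

    public static void main(String[] args) {
        System.out.println(removeBom("\uFEFFhello")); // prints hello
        System.out.println(removeBom("hello"));       // prints hello
    }
}
```

Note that this only helps if the string was decoded correctly in the first place. If the BOM bytes were decoded with a non-UTF-8 charset, the string does not start with U+FEFF and the method leaves it untouched.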
But you also have to check the source of the string. It is possible that you have a similar problem at the file reading site, using the platform’s default encoding rather than reading the file as UTF-8.
Note the differences between the APIs (prior to JDK 18):
Files.newBufferedReader(path) always uses UTF-8
new FileReader(file) uses the platform’s default charset
new InputStreamReader(inputStream) also uses the platform’s default charset
If in doubt, always specify the intended charset.
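For the reading side, a sketch along these lines would make the charset explicit. It assumes the file is UTF-8 encoded; the class name FirstLineReader and the method name firstLineWithoutBom are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FirstLineReader {
    // Reads the first line of a file as UTF-8, regardless of the platform
    // default, and strips a leading BOM if there is one.
    static String firstLineWithoutBom(Path path) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line = reader.readLine();
            if (line == null) {
                return ""; // empty file: nothing to strip
            }
            return line.startsWith("\uFEFF") ? line.substring(1) : line;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("bom-demo", ".txt");
        Files.write(tmp, new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i', '\n'});
        System.out.println(firstLineWithoutBom(tmp)); // prints hi
        Files.delete(tmp);
    }
}
```

With the charset passed explicitly, the result no longer depends on whether the test is started from IntelliJ, SonarQube, or Git Bash.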
Answered By - Holger
Answer Checked By - Mary Flores (JavaFixing Volunteer)