Issue
I'm using Java 11 with Tomcat 9 with the latest JSP/JSTL. I'm testing in Chrome 71 and Firefox 64.0 on Windows 10. I have the following test document:
<%@ page contentType="text/html; charset=UTF-8" %>
<%@ taglib prefix="c" uri="http://java.sun.com/jsp/jstl/core" %>
<!DOCTYPE html>
<html lang="en-US">
<head>
<meta charset="UTF-8"/>
<title>Hello</title>
</head>
<body>
<c:if test="${not empty param.fullName}">
<p>Hello, ${param.fullName}.</p>
</c:if>
<form>
<div>
<label>Full name: <input name="fullName" /></label>
</div>
<button>Say Hello</button>
</form>
</body>
</html>
This is perhaps the simplest form possible. As you know, the form method defaults to get, the form action defaults to "" (submitting to the same page), and the form enctype defaults to application/x-www-form-urlencoded.
If I enter the name "Flávio José" (a famous Brazilian forró singer and musician) in the field and submit, the form is submitted via HTTP GET to the same page using hello.jsp?fullName=Fl%C3%A1vio+Jos%C3%A9. This is correct, and the page says:
Hello, Flávio José.
If I change the form method to post and enter the same name "Flávio José", the form contents are instead submitted via POST, with HTTP request contents:
fullName=Fl%C3%A1vio+Jos%C3%A9
This also appears correct. But this time the page says:
Hello, FlÃ¡vio JosÃ©.
Rather than seeing %C3%A1
as a sequence of UTF-8 octets, JSP seems to think that these are a series of ISO-8859-1 octets (or code page 1252 octets), and is therefore decoding them to the wrong character sequence.
But where is it getting ISO-8859-1? What is my JSP page lacking to indicate the correct encoding?
I'll also note that the WHATWG specification says that application/x-www-form-urlencoded octets should be parsed as UTF-8 by default. Is the Java servlet specification simply broken? How do I work around this?
Solution
This is caused by Tomcat, but the root problem is the Java Servlet 4 specification, which is incorrect and outdated.
Originally HTML 4.01 said that application/x-www-form-urlencoded encoded octets should be decoded as US-ASCII. The servlet specification changed this to say that, if the request encoding is not specified, the octets should be decoded as ISO-8859-1. Tomcat is simply following the servlet specification.
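The effect of that default can be reproduced outside the container. The following is only a minimal sketch using the standard java.net.URLDecoder (the Charset overload needs Java 10 or later); the class and method names are made up for illustration, and this is not Tomcat's actual parameter-parsing code. It decodes the same form value once as UTF-8 and once as ISO-8859-1, and then shows that the ISO-8859-1 misreading can be reversed, which suggests the octets arrived intact and only the charset choice was wrong.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tomcat or the servlet API.
final class FormDecodingDemo {

    // Percent-decodes a form value, interpreting the resulting octets in the given charset.
    static String decode(String encoded, Charset charset) {
        return URLDecoder.decode(encoded, charset);
    }

    // Reverses an ISO-8859-1 misreading of UTF-8 octets. This round trip is lossless
    // for ISO-8859-1 because every byte value maps to exactly one character.
    static String reinterpretAsUtf8(String misread) {
        return new String(misread.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String value = "Fl%C3%A1vio+Jos%C3%A9";

        String asUtf8 = decode(value, StandardCharsets.UTF_8);        // the intended name
        String asLatin1 = decode(value, StandardCharsets.ISO_8859_1); // the garbled name

        System.out.println(asUtf8);
        System.out.println(asLatin1);
        // Compare the repaired string with the intended one.
        System.out.println(reinterpretAsUtf8(asLatin1).equals(asUtf8));
    }
}
```

The reinterpretAsUtf8 step is shown as a diagnostic only; fixing the request encoding, as described in the workarounds, is the cleaner route.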
There are two problems with the Java servlet specification. The first is that the modern interpretation of application/x-www-form-urlencoded
is that encoded octets should be decoded using UTF-8. The second problem is that tying the octet decoding to the resource charset confuses two levels of decoding.
Take another look at this POST content:
fullName=Fl%C3%A1vio+Jos%C3%A9
You'll notice that it is ASCII! It doesn't matter if you consider the POST HTTP request charset to be ISO-8859-1, UTF-8, or US-ASCII: you'll still wind up with exactly the same Unicode characters before decoding the percent-encoded octets! Which encoding is used to decode those octets is a completely separate matter.
As a further example, let's say I download a text file instructions.txt that is clearly marked as ISO-8859-1, and it contains the URI https://example.com/example.jsp?fullName=Fl%C3%A1vio+Jos%C3%A9. Just because the text file has a charset of ISO-8859-1, does that mean I need to decode %C3%A1 using ISO-8859-1? Of course not! The charset used for decoding URI characters is a separate level of decoding on top of the resource content type charset! Similarly, the octets of values encoded in application/x-www-form-urlencoded should be decoded using UTF-8, regardless of the underlying charset of the resource.
There are several workarounds, some of them found in the Tomcat character encoding FAQ under how to "use UTF-8 everywhere".
Set the request character encoding in your web.xml file.
Add the following to your WEB-INF/web.xml file:
<request-character-encoding>UTF-8</request-character-encoding>
This setting is agnostic of the servlet container implementation and is set forth in the servlet specification (as of Servlet 4.0). (Alternatively, you should be able to put it in Tomcat's conf/web.xml file, if you want a global setting and don't mind changing the Tomcat configuration.)
Set the SetCharacterEncodingFilter in your web.xml file.
Tomcat has a proprietary equivalent: use the org.apache.catalina.filters.SetCharacterEncodingFilter in the WEB-INF/web.xml file, as the Tomcat FAQ above mentions, and as illustrated by https://stackoverflow.com/a/37833977/421049, excerpted below:
<filter>
  <filter-name>setCharacterEncodingFilter</filter-name>
  <filter-class>org.apache.catalina.filters.SetCharacterEncodingFilter</filter-class>
  <init-param>
    <param-name>encoding</param-name>
    <param-value>UTF-8</param-value>
  </init-param>
</filter>
<filter-mapping>
  <filter-name>setCharacterEncodingFilter</filter-name>
  <url-pattern>/*</url-pattern>
</filter-mapping>
This will make your web application work only on Tomcat, so it's better to put this in the Tomcat installation's conf/web.xml file instead, as the post above mentions. In fact, the conf/web.xml file in a Tomcat installation already has these two sections, but commented out; simply uncomment them and things should work.
Force the request character encoding to UTF-8 in the JSP or servlet.
You can force the character encoding of the servlet request to UTF-8, somewhere early in the JSP:
<% request.setCharacterEncoding("UTF-8"); %>
But that is ugly, unwieldy, error-prone, and goes against modern best practices—JSP scriptlets shouldn't be used anymore.
Hopefully we can get a newer Java servlet specification to remove any relationship between the resource charset and the decoding of application/x-www-form-urlencoded octets, and simply state that application/x-www-form-urlencoded octets must be decoded as UTF-8, as is modern practice as clarified by the latest W3C and WHATWG specifications.
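To make the "always UTF-8" rule concrete, here is a rough sketch of a form-body parser that ignores the resource charset entirely. The class name and the map shape are assumptions made for illustration; this is not the WHATWG algorithm verbatim, nor what any servlet container does internally.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser: always decodes percent-encoded octets as UTF-8.
final class Utf8FormParser {

    // Splits an application/x-www-form-urlencoded body into name/value pairs,
    // keeping repeated names and preserving the order of first appearance.
    static Map<String, List<String>> parse(String body) {
        Map<String, List<String>> params = new LinkedHashMap<>();
        for (String pair : body.split("&")) {
            if (pair.isEmpty()) {
                continue; // skip empty segments such as a trailing "&"
            }
            int eq = pair.indexOf('=');
            String rawName = eq < 0 ? pair : pair.substring(0, eq);
            String rawValue = eq < 0 ? "" : pair.substring(eq + 1);
            // Names and values are decoded the same way: "+" becomes a space,
            // and %XX octets are interpreted as UTF-8.
            String name = URLDecoder.decode(rawName, StandardCharsets.UTF_8);
            String value = URLDecoder.decode(rawValue, StandardCharsets.UTF_8);
            params.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, List<String>> params = parse("fullName=Fl%C3%A1vio+Jos%C3%A9&lang=pt");
        System.out.println(params.get("fullName").get(0));
        System.out.println(params.get("lang").get(0));
    }
}
```

One difference worth noting: URLDecoder throws IllegalArgumentException on a malformed escape such as a trailing "%C3%A", so a production parser would need to decide how to handle such input.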
Update: I've updated the Tomcat FAQ on Character Encoding Issues with this information.
Answered By - Garret Wilson