Configuring character encoding

I recently spent some time working out character encoding issues as part of the internationalization of a project. There is an unfortunate number of places where the character encoding needs to be set to UTF-8.

OS

Windows by default uses CP-1252 (at least for Western European locales), which is incompatible with UTF-8 for anything beyond plain ASCII. This can be a big problem when cutting and pasting from clients’ Word documents into Java code. But it’s not a big obstacle as long as your tools have the encoding correctly set up for UTF-8. Most Linux distributions default to UTF-8.
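
To see why the mismatch matters, here is a small Java sketch that encodes a string as UTF-8 and then decodes the same bytes as if they were CP-1252. The class and method names are made up for illustration. It assumes the JVM provides the windows-1252 charset, which is not one of the charsets Java guarantees but is present in common JDK builds.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode text as UTF-8, then decode the same bytes with the wrong charset.
    static String misread(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        // Assumption: this JVM provides windows-1252 (Java's name for CP-1252).
        Charset cp1252 = Charset.forName("windows-1252");
        return new String(utf8Bytes, cp1252);
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the e, written as an escape so the
        // encoding of this source file does not matter.
        String original = "caf\u00e9";
        String garbled = misread(original);
        System.out.println(original.length() + " chars became " + garbled.length());
    }
}
```

The accented letter takes two bytes in UTF-8, and CP-1252 treats each byte as its own character, so the four-character word comes back as five characters.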

Database

You can specify the encoding when creating a database. It is possible to change the character encoding on a database, but it’s not easy. Just use UTF-8 from the beginning.

Eclipse

Go to Window -> Preferences -> General -> Workspace and change the text file encoding to UTF-8. I think UTF-8 is the default on Linux, but if you are on Windows then be sure to do this.

Individual projects can also have their own encoding settings. If you do the step above, each project will get UTF-8 as its default, “container” setting. Make sure your individual projects are not overriding this.

Maven

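When no source encoding is configured, Maven plugins fall back to the platform's default encoding and warn that the build is platform dependent. On many machines that default is a single-byte encoding such as ISO-8859-1 or CP-1252. It doesn’t really make sense to use an encoding like that any more, because while Western European alphabets are supported, Asian ones are not. Technically, it’s adequate for Canada’s official languages of English and French, but why paint yourself into a corner?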

Add this property to your pom.xml to get Maven to use UTF-8.

<properties>
	<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>

Strings in Java classes

If you set the character encoding to UTF-8 in both Maven and Eclipse, then any strings in Java classes should work fine.

Properties files

Java properties files are stored as ISO-8859-1 (that is what Properties.load(InputStream) expects). To add characters outside that range, they’ll need to be escaped using \u plus four hex digits. Surprisingly, this doesn’t cause too many problems as long as you set the Java runtime properly. This article has some other ideas.
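
As a rough illustration of the escaping, the sketch below feeds Properties.load the bytes a properties file would contain and checks that the escape comes back as a single character. The class name, method name, and the "greeting" key are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class PropertiesEscapeDemo {
    // Loads one property from bytes laid out the way a .properties file stores them.
    static String loadGreeting() {
        // The file content is plain ASCII: the accented letter is written as a
        // backslash, the letter u, and the four hex digits 00e9.
        String fileContents = "greeting=caf\\u00e9\n";
        byte[] fileBytes = fileContents.getBytes(StandardCharsets.ISO_8859_1);

        Properties props = new Properties();
        try {
            // load(InputStream) reads ISO-8859-1 and decodes the escape sequences.
            props.load(new ByteArrayInputStream(fileBytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("greeting");
    }

    public static void main(String[] args) {
        String greeting = loadGreeting();
        // The six-character escape in the file became one character in memory.
        System.out.println(greeting.length());
    }
}
```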

Java run-time

You can set the JVM's default file encoding by passing this system property on the command line.

-Dfile.encoding=UTF8
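
To confirm that the setting took effect, something like the following can be run with and without the flag. This is only a sketch and the class name is invented. It prints the raw system property next to the charset the JVM actually uses when no encoding is given.

```java
import java.nio.charset.Charset;

class DefaultCharsetCheck {
    // Returns the charset the JVM uses when no encoding is given explicitly.
    static Charset jvmDefault() {
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + jvmDefault().name());
    }
}
```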

File reading and writing

The FileReader and FileWriter classes use the JVM's default character encoding, so they won’t cause problems as long as that default has been set to UTF-8 as described above.
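
If you would rather not depend on the JVM default at all, one option is to name the encoding explicitly. Here is a minimal sketch using the standard java.nio.file.Files helpers. The class name, method name, and temp file prefix are hypothetical.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitEncodingIo {
    // Writes a line to a temp file and reads it back, naming UTF-8 both times.
    static String roundTrip(String line) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            try {
                try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                    out.write(line);
                }
                try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                    return in.readLine();
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Three CJK characters, written as escapes so the source encoding is irrelevant.
        String original = "\u65e5\u672c\u8a9e";
        System.out.println(original.equals(roundTrip(original)));
    }
}
```

Because both the writer and the reader are told to use UTF-8, the round trip gives back the same text whatever file.encoding happens to be.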

However, third-party libraries may ignore the system file encoding, so you’ll need to tell them the encoding yourself. Check whether the API has parameters for encoding.

Take Velocity for example:

Template template = ve.getTemplate("my.vm", "UTF-8");

Tomcat

Refer to this post.

jQuery AJAX

jQuery claims that its default encoding is UTF-8; however, as of version 1.4.1 that was not the case for Chrome or IE7. You can specify it like this:

$.ajax({
   contentType: 'application/x-www-form-urlencoded;charset=utf-8',
   // etc...
});
