How to replace accented characters with their unaccented equivalents in Java?

Let’s say you have written an application which processes regular text.

If a user enters an accented word like “tête-à-tête”, your application might break!

How do you remove the accents from these characters?

That is, you want the word “tete-a-tete” and not “tête-à-tête”.

Java provides a way.

Below are the steps:

STEP1: Normalize the word

Normalizing prepares accented words for conversion to plain text.

It is a preprocessing step.

So if you normalize the above word “tête-à-tête”, each accented character is split into its base letter followed by a separate accent mark.

Below is the code to do that:

// Requires: import java.text.Normalizer; import java.text.Normalizer.Form;
String normalizedWord = Normalizer.normalize("tête-à-tête", Form.NFD);

NFD (Normalization Form D, canonical decomposition) is one of the four Unicode normalization forms; the others are NFC, NFKC and NFKD. The concepts are a bit complex, but for removing accents we can use “NFD” directly.
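To make the splitting concrete, here is a minimal sketch that normalizes a single “ê” and checks the result. The class name NfdDemo and the method decompose are made up for this illustration; only java.text.Normalizer comes from the JDK. Unicode escapes are used in the string literals so the example does not depend on the source file encoding.

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of the original article's code.
class NfdDemo {
    // Splits each accented character into base letter + combining mark(s).
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "\u00EA"; // the letter e with circumflex as one char
        String decomposed = decompose(composed);

        System.out.println(composed.length());            // 1
        System.out.println(decomposed.length());          // 2
        // The result is a plain 'e' followed by the combining circumflex.
        System.out.println(decomposed.equals("e\u0302")); // true
        System.out.println(
            Normalizer.isNormalized(decomposed, Normalizer.Form.NFD)); // true
    }
}
```

The single character becomes two: the base letter “e” and a combining circumflex that the next step can remove.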

Once the word is normalized, we can remove the accent marks.

STEP2: Remove the accent marks

The accent marks are also called diacritics.

The code below replaces them with an empty string:

String finalWord = normalizedWord.replaceAll("\\p{M}", "");

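Note that the p in the above regular expression is lowercase. \p{M} matches any Unicode combining mark, whereas an uppercase \P{M} would match everything except marks.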

That’s it!

finalWord now contains the word “tete-a-tete”.
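If you need this in more than one place, the two steps can be wrapped in a small helper. The sketch below is one possible way to do it; the class name AccentStripper and the method name stripAccents are made up for this example, and the null check is an assumption about how you would want null input handled.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper class, shown only as an illustration.
class AccentStripper {
    // Compiled once and reused; \p{M} matches any Unicode combining mark.
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");

    static String stripAccents(String input) {
        if (input == null) {
            return null; // assumption: pass null through unchanged
        }
        // Step 1: split accented characters into base letter + mark.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Step 2: delete the marks, keeping the base letters.
        return MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        // Escapes are used so the file encoding does not matter.
        System.out.println(stripAccents("t\u00EAte-\u00E0-t\u00EAte")); // tete-a-tete
        System.out.println(stripAccents("Cr\u00E8me br\u00FBl\u00E9e")); // Creme brulee
    }
}
```

One thing to keep in mind: letters such as “ø” have no decomposition into a base letter plus a mark, so this approach leaves them unchanged.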

Here is the entire code:

import java.text.Normalizer;
import java.text.Normalizer.Form;

public class RemoveAccents { // class name is only an example
	public static void main(String a[]) {
		String word  = "tête-à-tête";
		System.out.println("Original word:"+word);
		String normalizedWord = Normalizer.normalize(word, Form.NFD);
		System.out.println("After normalization:"+normalizedWord);
		String finalWord = normalizedWord.replaceAll("\\p{M}", "");
		System.out.println("After replacing accents"+finalWord);
	}
}

Here is the output:

Original word:tête-à-tête
After normalization:te?te-a?-te?te
After replacing accentstete-a-tete

Notice that the normalized word looks a bit confusing: the accent marks are shown as question marks. After normalization each accent is a separate combining character, and many consoles cannot display those. This only affects how the string is printed to the console, so we can ignore it.
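If you would rather see what the normalized string really contains, you can print its code points instead of the characters themselves. This is a small sketch for that purpose; the class name CodePointDump and the method dump are made up for this example.

```java
import java.text.Normalizer;

// Hypothetical debugging helper, not part of the original article's code.
class CodePointDump {
    // Renders every code point in U+XXXX notation, separated by spaces,
    // so the output does not depend on what the console can display.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String normalized =
            Normalizer.normalize("t\u00EAte", Normalizer.Form.NFD);
        System.out.println(dump(normalized));
        // U+0074 U+0065 U+0302 U+0074 U+0065
    }
}
```

U+0302 is the combining circumflex. It is the character that the console showed as a question mark, and it is what the \p{M} pattern removes.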

You can also use the StringUtils.stripAccents() method provided by the Apache Commons Lang library.

Internally it uses java.text.Normalizer to achieve a similar result.




