Let’s say you have written an application that processes plain text.
If a user enters an accented word like “tête-à-tête”, the application may fail to match or process it correctly.
How do you remove the accents from such characters?
That is, you want the word “tete-a-tete” rather than “tête-à-tête”.
Java provides a way.
Below are the steps:
STEP1: Normalize the word
Normalizing prepares an accented word for conversion to plain text.
It is a preprocessing step.
When you normalize the word “tête-à-tête”, each accented character is split into its base letter followed by a separate accent mark.
Below is the code to do that (Normalizer and Normalizer.Form are in the java.text package):
String normalizedWord = Normalizer.normalize("tête-à-tête", Form.NFD);
NFD (canonical decomposition) is one of the four Unicode normalization forms. The concepts are a bit complex, but for this purpose we can use NFD directly.
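To make the effect of NFD more concrete, here is a minimal sketch that compares the length of the word before and after normalization. The class name NfdDemo and the helper decompose are made up for illustration; the word is written with Unicode escapes so the result does not depend on the encoding of the source file.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class NfdDemo {
    // Hypothetical helper: returns the NFD (canonically decomposed) form of the input.
    static String decompose(String text) {
        return Normalizer.normalize(text, Form.NFD);
    }

    public static void main(String[] args) {
        // "tête-à-tête" with precomposed accented characters.
        String word = "t\u00EAte-\u00E0-t\u00EAte";
        String decomposed = decompose(word);

        System.out.println(word.length());       // 11
        System.out.println(decomposed.length()); // 14: three accent marks were split off
        System.out.println(Normalizer.isNormalized(word, Form.NFD)); // false
    }
}
```

The three extra characters are the combining marks that now follow the letters e, a and e.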
Once the word is normalized, we can remove the accent marks.
STEP2: Remove the accent marks.
The accent marks are also called diacritics.
The code below replaces them with an empty string:
String finalWord = normalizedWord.replaceAll("\\p{M}", "");
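Note that the p in the regular expression is lowercase: \p{M} matches any Unicode combining mark, whereas an uppercase \P{M} would match everything except marks.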
That’s it!
finalWord now contains the word “tete-a-tete”.
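The two steps can also be wrapped in a small reusable method. The following is only a sketch: the class name AccentStripper and the method stripAccents are hypothetical, and the pattern is precompiled on the assumption that the method is called often.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

public class AccentStripper {
    // Compiled once; matches one or more Unicode combining marks (category M).
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");

    // Hypothetical helper: decomposes the text, then deletes the combining marks.
    static String stripAccents(String text) {
        if (text == null) {
            return null;
        }
        String decomposed = Normalizer.normalize(text, Form.NFD);
        return MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("t\u00EAte-\u00E0-t\u00EAte")); // tete-a-tete
        System.out.println(stripAccents("plain text"));                  // plain text
    }
}
```

One limitation of this approach: letters such as “ø” or “ł” have no canonical decomposition, so they pass through unchanged.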
Here is the entire code:
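import java.text.Normalizer;
import java.text.Normalizer.Form;

public class RemoveAccents {
    public static void main(String[] args) {
        String word = "tête-à-tête";
        System.out.println("Original word:" + word);
        // Step 1: split each accented character into base letter + accent mark
        String normalizedWord = Normalizer.normalize(word, Form.NFD);
        System.out.println("After normalization:" + normalizedWord);
        // Step 2: delete the accent marks (Unicode category M)
        String finalWord = normalizedWord.replaceAll("\\p{M}", "");
        System.out.println("After replacing accents:" + finalWord);
    }
}
Here is the output:
Original word:tête-à-tête
After normalization:te?te-a?-te?te
After replacing accents:tete-a-tete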
Notice that the normalized word looks a bit confusing: the separated accent marks are shown as question marks, typically because the console cannot display them. How the intermediate string is printed depends on the console, and we can ignore it.
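If you want to see what normalization actually produced without relying on the console, you can print the code points instead of the characters. This is a minimal sketch; the class name CodePointDump and the method dump are made up for illustration.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class CodePointDump {
    // Hypothetical helper: formats each code point as U+XXXX, separated by spaces.
    static String dump(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String accented = "\u00EA"; // the single precomposed character ê
        System.out.println(dump(accented));                                 // U+00EA
        System.out.println(dump(Normalizer.normalize(accented, Form.NFD))); // U+0065 U+0302
    }
}
```

U+0065 is the plain letter e and U+0302 is the combining circumflex accent, which is the part that \p{M} removes.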
You can also use the StringUtils.stripAccents() method provided by the Apache Commons Lang library (https://commons.apache.org/proper/commons-lang/).
Internally it uses java.text.Normalizer to achieve the same output.