Issue
Recently I noticed that it is possible for substring
to return a string containing an invalid Unicode character.
For instance:
public class Main {
    public static void main(String[] args) {
        String text = "🥦_Salade verte";
        /* We should avoid using endIndex = 1, as it will cause an invalid character in the returned substring. */
        // 1 : ?
        System.out.println("1 : " + text.substring(0, 1));
        // 2 : 🥦
        System.out.println("2 : " + text.substring(0, 2));
        // 3 : 🥦_
        System.out.println("3 : " + text.substring(0, 3));
        // 4 : 🥦_S
        System.out.println("4 : " + text.substring(0, 4));
    }
}
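The root cause is that this emoji lies outside the Basic Multilingual Plane, so Java stores it as a surrogate pair of two char values. A minimal sketch illustrating the mismatch (the class name SurrogateDemo is just for illustration):

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        String broccoli = "🥦"; // U+1F966, outside the BMP
        // length() counts 16-bit char units, not characters
        System.out.println(broccoli.length()); // 2
        // codePointCount counts actual Unicode characters
        System.out.println(broccoli.codePointCount(0, broccoli.length())); // 1
        // substring(0, 1) keeps only the first half of the surrogate pair
        System.out.println(Character.isHighSurrogate(broccoli.charAt(0))); // true
    }
}
```

So substring(0, 1) returns a lone high surrogate, which is not a valid character on its own.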
I was wondering: when trimming a long string with String.substring
, what are some good ways to prevent the returned substring from containing invalid Unicode?
Solution
char is obsolete
The char type has been legacy since Java 2, essentially broken. As a 16-bit value, char is physically incapable of representing most characters.
Your discovery reflects the fact that the String#substring method is char-based: its indexes count 16-bit char units, not characters. Hence the problem shown in your code.
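As an aside, not part of the original answer: if you do want to keep calling substring, the String#offsetByCodePoints method can translate a count of code points into the corresponding char-based index, so the cut never lands inside a surrogate pair. A sketch:

```java
public class SafeSubstring {
    public static void main(String[] args) {
        String text = "🥦_Salade verte";
        // Translate "first 3 code points" into a char-based end index
        int endIndex = text.offsetByCodePoints(0, 3); // 4, since 🥦 spans two chars
        System.out.println(text.substring(0, endIndex)); // 🥦_S
    }
}
```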
Code point
Instead, use code point integer numbers when working with individual characters.
int[] codePoints = "🥦_Salade".codePoints().toArray() ;
[129382, 95, 83, 97, 108, 97, 100, 101]
Extract the first character’s code point.
int codePoint = codePoints[ 0 ] ;
129382
Make a single-character String object for that code point.
String firstCharacter = Character.toString( codePoint ) ;
🥦
You can grab a subset of that int array of code points.
int[] firstFewCodePoints = Arrays.copyOfRange( codePoints , 0 , 3 ) ;
And make a String object from those code points.
String s =
Arrays
.stream( firstFewCodePoints )
.collect( StringBuilder::new , StringBuilder::appendCodePoint , StringBuilder::append )
.toString();
🥦_S
Or we can use a String constructor that takes a subset of the array directly.
String result = new String( codePoints , 0 , 3 ) ;
🥦_S
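Putting those pieces together, a hypothetical helper method (the name truncate is an assumption, not from the original answer) that keeps at most n code points might look like:

```java
public class Truncate {
    // Keep at most maxCodePoints characters, never splitting a surrogate pair.
    static String truncate(String input, int maxCodePoints) {
        int[] codePoints = input.codePoints().toArray();
        int count = Math.min(maxCodePoints, codePoints.length);
        // The (int[], offset, count) constructor builds a valid String
        return new String(codePoints, 0, count);
    }

    public static void main(String[] args) {
        System.out.println(truncate("🥦_Salade verte", 3)); // 🥦_S
        System.out.println(truncate("🥦", 5));              // 🥦
    }
}
```

Unlike substring, this never produces a lone surrogate, because the cut happens between whole code points.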
See this code run live at IdeOne.com.
Answered By - Basil Bourque