Issue
So I know about String#codePointAt(int), but it's indexed by the char offset, not by the codepoint offset.
I'm thinking about trying something like:
- using String#charAt(int) to get the char at an index
- testing whether the char is in the high-surrogates range
  - if so, use String#codePointAt(int) to get the codepoint, and increment the index by 2
  - if not, use the given char value as the codepoint, and increment the index by 1
But my concerns are:
- I'm not sure whether codepoints which are naturally in the high-surrogates range will be stored as two char values or one
- this seems like an awfully expensive way to iterate through characters
- someone must have come up with something better.
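For reference, the manual approach described in the question can be sketched roughly as follows. This is only an illustration: the class name ManualCodePoints and the helper manualCodePoints are hypothetical, and the sketch adds a check that a low surrogate actually follows the high surrogate. An unpaired surrogate occupies a single char in a Java String, so it is passed through unchanged as its own value.

```java
import java.util.ArrayList;
import java.util.List;

public class ManualCodePoints {
    // Hypothetical helper: walks the string by char index, following the steps in the question.
    static List<Integer> manualCodePoints(String s) {
        List<Integer> result = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            // A high surrogate only starts a pair if a low surrogate follows it.
            if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                result.add(s.codePointAt(i)); // combines the surrogate pair
                i += 2;
            } else {
                result.add((int) c); // BMP char, or an unpaired surrogate kept as-is
                i += 1;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "a", U+1F600 (written as a surrogate pair), "b"
        for (int cp : manualCodePoints("a\uD83D\uDE00b")) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```

This works, but it duplicates bookkeeping that the standard library can already do.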
Solution
Yes, Java uses a UTF-16-esque encoding for internal representations of Strings, and, yes, it encodes characters outside the Basic Multilingual Plane (BMP) as surrogate pairs, that is, two char values per codepoint.
If you know you'll be dealing with characters outside the BMP, then here is the canonical way to iterate over the characters of a Java String:
    final int length = s.length();
    for (int offset = 0; offset < length; ) {
        final int codepoint = s.codePointAt(offset);
        // do something with the codepoint
        offset += Character.charCount(codepoint);
    }
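As a self-contained sketch of the loop above, the following wraps it in a method and compares it with String#codePoints(), which is available from Java 8 onward and returns the same sequence as an IntStream. The class name CodePointIteration and the helpers withLoop and withStream are made up for illustration; the sample string holds "a", U+1F600 written as a surrogate pair, and "b".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class CodePointIteration {
    // The loop from the answer, collecting each codepoint into a list.
    static List<Integer> withLoop(String s) {
        List<Integer> out = new ArrayList<>();
        final int length = s.length();
        for (int offset = 0; offset < length; ) {
            final int codepoint = s.codePointAt(offset);
            out.add(codepoint);
            // Advances by 2 chars for a supplementary codepoint, otherwise by 1.
            offset += Character.charCount(codepoint);
        }
        return out;
    }

    // Alternative on Java 8+: let the library iterate the codepoints.
    static List<Integer> withStream(String s) {
        return s.codePoints().boxed().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";
        for (int cp : withLoop(s)) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
        // Both approaches should agree on the same input.
        System.out.println(withLoop(s).equals(withStream(s)));
    }
}
```

The explicit loop avoids boxing and stream overhead, so it is probably the better fit for hot paths; the stream form is shorter when the codepoints feed into further stream operations.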
Answered By - Jonathan Feinberg
Answer Checked By - Dawn Plyler (JavaFixing Volunteer)