Issue
Does anyone know why the regex \p{Cs} does not match the symbol constructed below in Java 16?
It used to match it in Java 11.
Java 11
jshell
| Welcome to JShell -- Version 11.0.7
| For an introduction type: /help intro
jshell> import java.util.regex.*
jshell> var text = new StringBuilder().appendCodePoint(55622).appendCodePoint(56380)
text ==>
jshell> Pattern.compile("\\p{Cs}").matcher(text).find()
$3 ==> true
Java 16
INFO: Created user preferences directory.
| Welcome to JShell -- Version 16.0.1
| For an introduction type: /help intro
jshell> import java.util.regex.*
jshell> var text = new StringBuilder().appendCodePoint(55622).appendCodePoint(56380)
text ==>
jshell> Pattern.compile("\\p{Cs}").matcher(text).find()
$3 ==> false
Solution
First, your “symbol” has the codepoint 399420 (U+6183C), which is not (yet) assigned by the Unicode standard, so if you are seeing something useful here, it’s non-standard behavior of your system.
The way you construct the string is not semantically correct, but it happens to create the intended string. For historical reasons, Java’s API is centered around a UTF-16 representation.
When you define the symbol using two surrogate characters, i.e.
var text = "\uD946\uDC3C";
System.out.println(text.codePointAt(0));
you’ll get
399420
On the other hand, when you use
var text = new StringBuilder().appendCodePoint(399420);
text.chars().forEach(c -> System.out.printf("\\u%04X", c));
System.out.println();
you’ll get
\uD946\uDC3C
In other words, the sequence of the two UTF-16 surrogate char units \uD946, \uDC3C is equivalent to the single codepoint 399420. Conceptually, the string consists of that single codepoint; that is,
System.out.println(text.codePointCount(0, text.length()) + " codepoint(s)");
System.out.println(text.codePointAt(0));
System.out.println("type " + Character.getType(text.codePointAt(0)));
will print
1 codepoint(s)
399420
type 0
in either case. The type 0 (Character.UNASSIGNED) indicates that this codepoint is unassigned.
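To make the equivalence concrete, here is a minimal sketch that decodes the surrogate pair by hand and compares the result with Character.toCodePoint. The class name SurrogateMath and the helper decode are made up for illustration; only the Character methods and constants are part of the JDK.

```java
public class SurrogateMath {
    // Manual UTF-16 decoding: combine a high and a low surrogate into one codepoint.
    public static int decode(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char high = '\uD946';
        char low = '\uDC3C';

        int manual = decode(high, low);
        int viaApi = Character.toCodePoint(high, low);
        System.out.println(manual + " " + viaApi);        // prints "399420 399420"
        System.out.println(Integer.toHexString(manual));  // prints "6183c"

        // A single surrogate char unit, looked at in isolation, has the SURROGATE type (19).
        System.out.println(Character.getType(high) == Character.SURROGATE);

        // The combined codepoint is reported as UNASSIGNED (0) only for as long as
        // the JDK's Unicode version leaves it unassigned.
        System.out.println(Character.getType(manual) == Character.UNASSIGNED);
    }
}
```

The two type checks show the distinction the rest of the answer relies on: the individual char units are surrogates, while the codepoint they encode together is not.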
You are using appendCodePoint to append two UTF-16 units to the StringBuilder, but since this method treats codepoints of the BMP the same way as UTF-16 units, it happens to construct the same string, too.
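As a quick check of that claim, the following sketch builds the string both ways and compares the results. The class and method names (AppendDemo, viaUnits, viaCodePoint) are hypothetical; the numeric values are the ones from the question.

```java
public class AppendDemo {
    // The question's approach: two surrogate values passed to appendCodePoint.
    // Both values are below 0x10000, so each one is appended as a single char.
    public static String viaUnits() {
        return new StringBuilder()
                .appendCodePoint(55622)   // 0xD946, high surrogate
                .appendCodePoint(56380)   // 0xDC3C, low surrogate
                .toString();
    }

    // The semantically intended approach: one supplementary codepoint,
    // which appendCodePoint encodes as a surrogate pair.
    public static String viaCodePoint() {
        return new StringBuilder().appendCodePoint(399420).toString();
    }

    public static void main(String[] args) {
        System.out.println(viaUnits().equals(viaCodePoint()));            // prints "true"
        System.out.println(viaUnits().length());                          // prints "2"
        System.out.println(viaUnits().codePointCount(0, viaUnits().length())); // prints "1"
    }
}
```

If the intent really is to append individual char units, append(char) states that more directly than appendCodePoint.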
Since the category of the codepoint is “unassigned”, it can’t also be “surrogate”, so \p{Cs} should never find a match here. When processing a valid Unicode string, you should never encounter this category, as it can only match dangling surrogate characters, which cannot be interpreted as a codepoint outside the BMP.
But there’s the bug JDK-8247546, “Pattern matching does not skip correctly over supplementary characters”. Before Java 16, the regex engine did process the codepoint at position 0 correctly, but then advanced only one char position, so it found a dangling surrogate character when looking at char position 1 alone.
We can verify it using
var m = Pattern.compile("\\p{Cs}").matcher(text);
if (m.find()) {
    System.out.println("found a match at " + m.start());
}
which prints “found a match at 1” prior to JDK 16. That is wrong, as position 1 should be skipped when a single codepoint occupies char positions 0 and 1.
This bug has been fixed in JDK 16. So now, the string is treated as a single codepoint of the “unassigned” category. Of course, this category might change again in the future. But it should never be “surrogate”.
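For comparison, here is a sketch of what \p{Cs} is meant to find, namely a dangling surrogate, together with a check that does not go through the regex engine and so should not depend on the JDK version. The class SurrogateCheck and its two methods are assumptions made for illustration, not part of the original answer.

```java
import java.util.regex.Pattern;

public class SurrogateCheck {
    private static final Pattern SURROGATE = Pattern.compile("\\p{Cs}");

    // Regex-based check, as in the question.
    public static boolean regexFindsSurrogate(CharSequence s) {
        return SURROGATE.matcher(s).find();
    }

    // Codepoint-based check: codePoints() combines well-formed pairs into one
    // supplementary codepoint and passes unpaired surrogates through unchanged,
    // so any value in the surrogate range here is a dangling surrogate.
    public static boolean hasDanglingSurrogate(CharSequence s) {
        return s.codePoints()
                .anyMatch(cp -> cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE);
    }

    public static void main(String[] args) {
        String dangling = "a\uD946b";       // high surrogate without a low surrogate
        String wellFormed = "\uD946\uDC3C"; // the pair from the question

        System.out.println(regexFindsSurrogate(dangling));    // prints "true"
        System.out.println(hasDanglingSurrogate(dangling));   // prints "true"
        System.out.println(hasDanglingSurrogate(wellFormed)); // prints "false"

        // Result depends on the JDK: per the answer, true before JDK 16 (the bug)
        // and false from JDK 16 on.
        System.out.println(regexFindsSurrogate(wellFormed));
    }
}
```

If the goal was instead to detect characters outside the BMP, a test such as Character.isSupplementaryCodePoint on each codepoint expresses that more directly than the surrogate category.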
Answered By - Holger
Answer Checked By - Candace Johnson (JavaFixing Volunteer)