Issue
Recently I noticed that it is possible for substring
to return a string containing an invalid Unicode character.
For instance:
public class Main {
    public static void main(String[] args) {
        String text = "🥦_Salade verte";
        /* We should avoid using endIndex = 1, as it will cause an invalid character in the returned substring. */
        // 1 : ?
        System.out.println("1 : " + text.substring(0, 1));
        // 2 : 🥦
        System.out.println("2 : " + text.substring(0, 2));
        // 3 : 🥦_
        System.out.println("3 : " + text.substring(0, 3));
        // 4 : 🥦_S
        System.out.println("4 : " + text.substring(0, 4));
    }
}
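The root cause is that this emoji lies outside the Basic Multilingual Plane, so Java stores it as a surrogate pair of two char values. A minimal sketch illustrating the mismatch (the class name SurrogateDemo is just for illustration):

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        String broccoli = "🥦"; // U+1F966, outside the BMP
        // length() counts 16-bit char units, not characters
        System.out.println(broccoli.length()); // 2
        // codePointCount counts actual Unicode characters
        System.out.println(broccoli.codePointCount(0, broccoli.length())); // 1
        // substring(0, 1) keeps only the first half of the surrogate pair
        System.out.println(Character.isHighSurrogate(broccoli.charAt(0))); // true
    }
}
```

So substring(0, 1) returns a lone high surrogate, which is not a valid character on its own.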
I was wondering: when trimming a long string with String.substring
, what are some good ways to prevent the returned substring from containing invalid Unicode?
Solution
char is obsolete
The char type has been legacy since Java 2, essentially broken. As a 16-bit value, char is physically incapable of representing most characters.
Your discovery reflects the fact that the String#substring method is char-based: its indexes count 16-bit char units, not characters. Hence the problem shown in your code.
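As an aside, not part of the original answer: if you do want to keep calling substring, the String#offsetByCodePoints method can translate a count of code points into the corresponding char-based index, so the cut never lands inside a surrogate pair. A sketch:

```java
public class SafeSubstring {
    public static void main(String[] args) {
        String text = "🥦_Salade verte";
        // Translate "first 3 code points" into a char-based end index
        int endIndex = text.offsetByCodePoints(0, 3); // 4, since 🥦 spans two chars
        System.out.println(text.substring(0, endIndex)); // 🥦_S
    }
}
```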
Code point
Instead, use code point integer numbers when working with individual characters.
int[] codePoints = "🥦_Salade".codePoints().toArray() ;
[129382, 95, 83, 97, 108, 97, 100, 101]
Extract the first character’s code point.
int codePoint = codePoints[ 0 ] ;
129382
Make a single-character String object for that code point.
String firstCharacter = Character.toString( codePoint ) ;
🥦
You can grab a subset of that int array of code points.
int[] firstFewCodePoints = Arrays.copyOfRange( codePoints , 0 , 3 ) ;
And make a String object from those code points.
String s =
Arrays
.stream( firstFewCodePoints )
.collect( StringBuilder::new , StringBuilder::appendCodePoint , StringBuilder::append )
.toString();
🥦_S
Or we can use a String constructor that takes a subset of the array directly.
String result = new String( codePoints , 0 , 3 ) ;
🥦_S
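Putting those pieces together, a hypothetical helper method (the name truncate is an assumption, not from the original answer) that keeps at most n code points might look like:

```java
public class Truncate {
    // Keep at most maxCodePoints characters, never splitting a surrogate pair.
    static String truncate(String input, int maxCodePoints) {
        int[] codePoints = input.codePoints().toArray();
        int count = Math.min(maxCodePoints, codePoints.length);
        // The (int[], offset, count) constructor builds a valid String
        return new String(codePoints, 0, count);
    }

    public static void main(String[] args) {
        System.out.println(truncate("🥦_Salade verte", 3)); // 🥦_S
        System.out.println(truncate("🥦", 5));              // 🥦
    }
}
```

Unlike substring, this never produces a lone surrogate, because the cut happens between whole code points.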
See this code run live at IdeOne.com.
Answered By - Basil Bourque