Class Utf8


  • final class Utf8
    extends java.lang.Object
    A set of low-level, high-performance static utility methods related to the UTF-8 character encoding. This class has no dependencies outside of the core JDK libraries.

    There are several variants of UTF-8. The one implemented by this class is the restricted definition of UTF-8 introduced in Unicode 3.1, which mandates the rejection of "overlong" byte sequences as well as rejection of 3-byte surrogate codepoint byte sequences. Note that the UTF-8 decoder included in Oracle's JDK has been modified to also reject "overlong" byte sequences, but (as of 2011) still accepts 3-byte surrogate codepoint byte sequences.

    The byte sequences considered valid by this class are exactly those that can be roundtrip converted to Strings and back to bytes using the UTF-8 charset, without loss:

    
     Arrays.equals(bytes, new String(bytes, Internal.UTF_8).getBytes(Internal.UTF_8))
     

    See the Unicode Standard,
    Table 3-6. UTF-8 Bit Distribution,
    Table 3-7. Well Formed UTF-8 Byte Sequences.
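
The round-trip property stated above can be tried out with plain JDK classes. The following is a minimal sketch, not this class's implementation: it assumes that Internal.UTF_8 denotes the same charset as StandardCharsets.UTF_8, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8RoundTripSketch {
    private Utf8RoundTripSketch() {}

    /** Returns true if decoding and then re-encoding reproduces the input exactly. */
    static boolean roundTrips(byte[] bytes) {
        // The JDK decoder substitutes U+FFFD for malformed input, so malformed
        // bytes come back changed after re-encoding.
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        return Arrays.equals(bytes, decoded.getBytes(StandardCharsets.UTF_8));
    }
}
```

This check allocates a String and a second byte array, which is the cost the static methods of this class are meant to avoid.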

    • Field Summary

      Fields 
      Modifier and Type Field Description
      private static long ASCII_MASK_LONG
      A mask used when performing unsafe reads to determine if a long value contains any non-ASCII characters (i.e. any byte >= 0x80).
      (package private) static int MAX_BYTES_PER_CHAR
      Maximum number of bytes per Java UTF-16 char in UTF-8.
      private static Utf8.Processor processor
      UTF-8 is a runtime hot spot so we attempt to provide heavily optimized implementations depending on what is available on the platform.
      private static int UNSAFE_COUNT_ASCII_THRESHOLD
      Used by Unsafe UTF-8 string validation logic to determine the minimum string length above which to employ an optimized algorithm for counting ASCII characters.
    • Constructor Summary

      Constructors 
      Modifier Constructor Description
      private Utf8()  
    • Method Summary

      All Methods Static Methods Concrete Methods 
      Modifier and Type Method Description
      (package private) static java.lang.String decodeUtf8​(byte[] bytes, int index, int size)
      Decodes the given UTF-8 encoded byte array slice into a String.
      (package private) static java.lang.String decodeUtf8​(java.nio.ByteBuffer buffer, int index, int size)
      Decodes the given UTF-8 portion of the ByteBuffer into a String.
      (package private) static int encode​(java.lang.String in, byte[] out, int offset, int length)  
      (package private) static int encodedLength​(java.lang.String string)
      Returns the number of bytes in the UTF-8-encoded form of the given string.
      private static int encodedLengthGeneral​(java.lang.String string, int start)  
      (package private) static void encodeUtf8​(java.lang.String in, java.nio.ByteBuffer out)
      Encodes the given characters to the target ByteBuffer using UTF-8 encoding.
      private static int estimateConsecutiveAscii​(java.nio.ByteBuffer buffer, int index, int limit)
      Counts (approximately) the number of consecutive ASCII characters in the given buffer.
      (package private) static boolean isValidUtf8​(byte[] bytes)
      Returns true if the given byte array is a well-formed UTF-8 byte sequence.
      (package private) static boolean isValidUtf8​(byte[] bytes, int index, int limit)
      Returns true if the given byte array slice is a well-formed UTF-8 byte sequence.
      (package private) static boolean isValidUtf8​(java.nio.ByteBuffer buffer)
      Determines if the given ByteBuffer is a valid UTF-8 string.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • processor

        private static final Utf8.Processor processor
        UTF-8 is a runtime hot spot so we attempt to provide heavily optimized implementations depending on what is available on the platform. The processor is the platform-optimized delegate to which all methods of this class delegate directly.
      • ASCII_MASK_LONG

        private static final long ASCII_MASK_LONG
        A mask used when performing unsafe reads to determine if a long value contains any non-ASCII characters (i.e. any byte >= 0x80).
        See Also:
        Constant Field Values
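
To illustrate how such a mask can test eight bytes at once, here is a minimal sketch. The mask value (the high bit of every byte) is an assumption inferred from the description "any byte >= 0x80"; the class and method names are hypothetical.

```java
final class AsciiMaskSketch {
    private AsciiMaskSketch() {}

    // Assumed value: the high bit of each of the eight bytes in a long.
    static final long ASCII_MASK_LONG = 0x8080808080808080L;

    /** Returns true if any of the eight bytes packed into the long is >= 0x80. */
    static boolean hasNonAscii(long eightBytes) {
        // An ASCII byte never has its high bit set, so any surviving bit
        // after masking means at least one byte is non-ASCII.
        return (eightBytes & ASCII_MASK_LONG) != 0L;
    }
}
```

Because the mask is the same in every byte position, the result does not depend on the byte order in which the long was read.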
      • MAX_BYTES_PER_CHAR

        static final int MAX_BYTES_PER_CHAR
        Maximum number of bytes per Java UTF-16 char in UTF-8.
        See Also:
        CharsetEncoder.maxBytesPerChar(), Constant Field Values
      • UNSAFE_COUNT_ASCII_THRESHOLD

        private static final int UNSAFE_COUNT_ASCII_THRESHOLD
        Used by Unsafe UTF-8 string validation logic to determine the minimum string length above which to employ an optimized algorithm for counting ASCII characters. The reason for this threshold is that for small strings, the optimization may not be beneficial or may even negatively impact performance, since it requires additional logic to avoid unaligned reads (when calling Unsafe.getLong). The threshold ensures that, even if the initial offset is unaligned, at least one call to Unsafe.getLong() is made, which provides a performance improvement that entirely subsumes the cost of the additional logic.
        See Also:
        Constant Field Values
    • Constructor Detail

      • Utf8

        private Utf8()
    • Method Detail

      • isValidUtf8

        static boolean isValidUtf8​(byte[] bytes)
        Returns true if the given byte array is a well-formed UTF-8 byte sequence.

        This is a convenience method, equivalent to a call to isValidUtf8(bytes, 0, bytes.length).

      • isValidUtf8

        static boolean isValidUtf8​(byte[] bytes,
                                   int index,
                                   int limit)
        Returns true if the given byte array slice is a well-formed UTF-8 byte sequence. The range of bytes to be checked extends from position index (inclusive) to limit (exclusive).
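
For readers who want to see what "well-formed" means in code, the following is a straightforward, unoptimized sketch of a slice validator that follows Table 3-7 of the Unicode Standard. It is not the implementation behind this method. The class name, the method name, and the exception thrown for a bad range are all hypothetical.

```java
final class Utf8ValidatorSketch {
    private Utf8ValidatorSketch() {}

    /** Checks bytes[index, limit) against the well-formed sequences of Table 3-7. */
    static boolean isWellFormed(byte[] bytes, int index, int limit) {
        if (index < 0 || index > limit || limit > bytes.length) {
            // Assumption: a bad range is reported with an exception.
            throw new IndexOutOfBoundsException("index=" + index + ", limit=" + limit);
        }
        int i = index;
        while (i < limit) {
            int lead = bytes[i++] & 0xFF;
            if (lead < 0x80) {
                continue; // one-byte (ASCII) sequence
            }
            // 0x80..0xC1 are stray continuation bytes or overlong 2-byte leads;
            // leads above 0xF4 would encode code points beyond U+10FFFF.
            if (lead < 0xC2 || lead > 0xF4) {
                return false;
            }
            int lo = 0x80; // allowed range of the second byte
            int hi = 0xBF;
            int extra;     // continuation bytes after the second byte
            if (lead < 0xE0) {
                extra = 0;
            } else if (lead < 0xF0) {
                extra = 1;
                if (lead == 0xE0) lo = 0xA0;      // rejects overlong 3-byte forms
                else if (lead == 0xED) hi = 0x9F; // rejects surrogate code points
            } else {
                extra = 2;
                if (lead == 0xF0) lo = 0x90;      // rejects overlong 4-byte forms
                else if (lead == 0xF4) hi = 0x8F; // rejects values above U+10FFFF
            }
            if (limit - i <= extra) {
                return false; // sequence is truncated by the end of the slice
            }
            int second = bytes[i++] & 0xFF;
            if (second < lo || second > hi) {
                return false;
            }
            for (int k = 0; k < extra; k++) {
                int next = bytes[i++] & 0xFF;
                if (next < 0x80 || next > 0xBF) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

Note that a multi-byte sequence cut off by limit counts as malformed, so the same array can be valid for one range and invalid for a shorter one.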
      • isValidUtf8

        static boolean isValidUtf8​(java.nio.ByteBuffer buffer)
        Determines if the given ByteBuffer is a valid UTF-8 string.

        Selects an optimal algorithm based on the type of ByteBuffer (i.e. heap or direct) and the capabilities of the platform.

        Parameters:
        buffer - the buffer to check.
        See Also:
        isValidUtf8(byte[], int, int)
      • encodedLength

        static int encodedLength​(java.lang.String string)
        Returns the number of bytes in the UTF-8-encoded form of the given string. This method is equivalent to string.getBytes(UTF_8).length, but is more efficient in both time and space.
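
The saving comes from counting bytes per UTF-16 char instead of materializing the encoded array. A minimal sketch of that idea follows; it is not the optimized implementation. The class name is hypothetical, and throwing IllegalArgumentException for an unpaired surrogate or an oversized result is an assumption made for this sketch.

```java
final class Utf8LengthSketch {
    private Utf8LengthSketch() {}

    /** Counts the UTF-8 bytes needed for the string without allocating. */
    static int encodedLength(String string) {
        long total = 0; // long, so that overflow of int can be detected
        for (int i = 0; i < string.length(); i++) {
            char c = string.charAt(i);
            if (c < 0x80) {
                total += 1; // ASCII
            } else if (c < 0x800) {
                total += 2;
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < string.length()
                    && Character.isLowSurrogate(string.charAt(i + 1))) {
                total += 4; // a surrogate pair encodes one supplementary code point
                i++;        // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                // Assumption: unpaired surrogates are treated as an error here.
                throw new IllegalArgumentException("Unpaired surrogate at index " + i);
            } else {
                total += 3; // rest of the Basic Multilingual Plane
            }
        }
        if (total > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("UTF-8 length does not fit in int: " + total);
        }
        return (int) total;
    }
}
```

For strings without unpaired surrogates, the result matches string.getBytes(StandardCharsets.UTF_8).length.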
      • encode

        static int encode​(java.lang.String in,
                          byte[] out,
                          int offset,
                          int length)
      • encodeUtf8

        static void encodeUtf8​(java.lang.String in,
                               java.nio.ByteBuffer out)
        Encodes the given characters to the target ByteBuffer using UTF-8 encoding.

        Selects an optimal algorithm based on the type of ByteBuffer (i.e. heap or direct) and the capabilities of the platform.

        Parameters:
        in - the source string to be encoded
        out - the target buffer to receive the encoded string.
        See Also:
        encode(String, byte[], int, int)
      • estimateConsecutiveAscii

        private static int estimateConsecutiveAscii​(java.nio.ByteBuffer buffer,
                                                    int index,
                                                    int limit)
        Counts (approximately) the number of consecutive ASCII characters in the given buffer. The byte order of the ByteBuffer does not matter, so performance can be improved if native byte order is used (i.e. no byte-swapping in ByteBuffer.getLong(int)).
        Parameters:
        buffer - the buffer to be scanned for ASCII chars
        index - the starting index of the scan
        limit - the limit within buffer for the scan
        Returns:
        the number of ASCII characters found. The stopping position will be at or before the first non-ASCII byte.
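
One way to obtain such an approximate count is to scan eight bytes at a time and stop at the first long that contains a non-ASCII byte. The sketch below illustrates this under stated assumptions: the mask value is inferred from the ASCII_MASK_LONG description, limit is assumed not to exceed buffer.limit(), and the class name is hypothetical.

```java
import java.nio.ByteBuffer;

final class AsciiRunSketch {
    private AsciiRunSketch() {}

    // Assumed value: the high bit of each byte in a long.
    private static final long ASCII_MASK_LONG = 0x8080808080808080L;

    /** Returns a lower bound on the run of ASCII bytes starting at index. */
    static int estimateConsecutiveAscii(ByteBuffer buffer, int index, int limit) {
        int i = index;
        // Last position from which a full 8-byte read stays below limit.
        final int lastLongStart = limit - 8;
        // The absolute getLong(int) does not move the buffer's position.
        while (i <= lastLongStart && (buffer.getLong(i) & ASCII_MASK_LONG) == 0L) {
            i += 8;
        }
        // Always a multiple of 8, hence "approximately": the scan stops at or
        // before the first non-ASCII byte, and ignores a tail shorter than 8.
        return i - index;
    }
}
```

A caller would then continue byte by byte from index plus the returned count.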