Tuesday, November 29, 2016

Java UTF-8

To see what UTF-8 actually does, you need to write more than plain English text. Include characters outside the ASCII range, such as Chinese, which UTF-8 stores using more than one byte each.

"UTF-8 is a character encoding capable of encoding all possible characters, or code points, defined by Unicode and originally designed by Ken Thompson and Rob Pike. The encoding is variable-length and uses 8-bit code units." ~ Ref: https://en.wikipedia.org/wiki/UTF-8
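
To make "variable-length" concrete, here is a minimal sketch that counts how many bytes UTF-8 uses for a few sample strings. The class and method names (Utf8Lengths, utf8Length) are made up for this illustration, and the non-ASCII characters are written as Unicode escapes so the example does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths
{
   // Returns the number of bytes standard UTF-8 needs for the string.
   static int utf8Length(String s)
   {
      return s.getBytes(StandardCharsets.UTF_8).length;
   }

   public static void main(String[] args)
   {
      // "A", e-acute (U+00E9), and the two characters of "你好" (U+4F60, U+597D)
      String[] samples = { "A", "\u00e9", "\u4f60", "\u4f60\u597d" };

      for (String s : samples)
         System.out.println(s.length() + " char(s) -> " + utf8Length(s) + " byte(s)");
   }
}
```

An English letter takes 1 byte, an accented Latin letter takes 2, and each of these Chinese characters takes 3, so the byte count of a string cannot be read off its character count.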

Use translate.google.com to generate simple strings in different languages to test.  
 
import java.io.IOException;
import java.io.FileOutputStream;
import java.io.DataOutputStream;
/**
 * This program opens a binary file and writes a series of strings
 * using DataOutputStream.writeUTF, which stores each string as a
 * length-prefixed sequence of (modified) UTF-8 bytes.
 */

public class WriteUTF
{
   public static void main(String[] args) throws IOException
   {
      // Save and compile this source as UTF-8 (for example,
      // javac -encoding UTF-8 WriteUTF.java) so the Chinese literals survive.
      String[] names = { "Warren", "Becky", "Holly", "Chloe", "缅甸城", "炫耀的生", "你好" };

      System.out.println("Writing the names to the file...");

      // try-with-resources closes the streams even if a write fails.
      try (DataOutputStream outputFile =
              new DataOutputStream(new FileOutputStream("c:\\temp\\UTFnames.dat")))
      {
         for (String name : names)
            outputFile.writeUTF(name);   // 2-byte length, then the encoded bytes
      }

      System.out.println("Done.");
   }
}


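Because UTF-8 is variable length, Java's writeUTF records the length of each string first, followed by the string's bytes. The length is a 2-byte count of the encoded bytes (not of the characters), and the bytes use Java's "modified UTF-8", which differs from standard UTF-8 only for the null character and for characters outside the Basic Multilingual Plane.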
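
The length-prefixed layout is easy to see by writing into memory instead of a file and dumping the bytes. This is only a sketch: WriteUTFLayout, encode and toHex are names invented for the illustration, and the Chinese string is given as Unicode escapes ("\u4f60\u597d" is 你好).

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WriteUTFLayout
{
   // Returns exactly the bytes that writeUTF would put in the file.
   static byte[] encode(String s)
   {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      try (DataOutputStream out = new DataOutputStream(buffer))
      {
         out.writeUTF(s);
      }
      catch (IOException e)
      {
         // Not expected for an in-memory stream; rethrown unchecked.
         throw new UncheckedIOException(e);
      }
      return buffer.toByteArray();
   }

   // Formats bytes as space-separated two-digit hex values.
   static String toHex(byte[] bytes)
   {
      StringBuilder sb = new StringBuilder();
      for (byte b : bytes)
         sb.append(String.format("%02X ", b));
      return sb.toString().trim();
   }

   public static void main(String[] args)
   {
      System.out.println(toHex(encode("Holly")));         // 00 05 48 6F 6C 6C 79
      System.out.println(toHex(encode("\u4f60\u597d")));  // 00 06 E4 BD A0 E5 A5 BD
   }
}
```

The first two bytes of each record are the length: 5 for "Holly", but 6 for the two-character string 你好, because each of those characters encodes to 3 bytes.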

  
Files saved in Notepad as "Unicode" (UTF-16, little-endian) start with a Unicode signature, the byte order mark FF FE, and each character is stored as 2 bytes (characters outside the Basic Multilingual Plane take 4).
 
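Files saved in Notepad as UTF-8 also start with a signature (EF BB BF), but the file structure is very different from what Java's writeUTF produces: Notepad writes the text as one continuous run of UTF-8 bytes, with no length prefix in front of each string.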
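
For comparison with the writeUTF layout, the sketch below builds the byte sequences these two Notepad formats are expected to contain for a short string. It does not call Notepad. The signatures are prepended by hand on the assumption that Notepad writes FF FE for "Unicode" and EF BB BF for UTF-8, and all names (SignatureDemo, notepadUtf8, notepadUnicode) are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class SignatureDemo
{
   // Concatenates a signature (byte order mark) and the encoded text.
   static byte[] withSignature(int[] signature, byte[] body)
   {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      for (int b : signature)
         out.write(b);
      out.write(body, 0, body.length);
      return out.toByteArray();
   }

   // Assumed layout of a Notepad "UTF-8" file: EF BB BF, then UTF-8 bytes.
   static byte[] notepadUtf8(String s)
   {
      return withSignature(new int[] { 0xEF, 0xBB, 0xBF },
                           s.getBytes(StandardCharsets.UTF_8));
   }

   // Assumed layout of a Notepad "Unicode" file: FF FE, then UTF-16LE bytes.
   static byte[] notepadUnicode(String s)
   {
      return withSignature(new int[] { 0xFF, 0xFE },
                           s.getBytes(StandardCharsets.UTF_16LE));
   }

   // Formats bytes as space-separated two-digit hex values.
   static String toHex(byte[] bytes)
   {
      StringBuilder sb = new StringBuilder();
      for (byte b : bytes)
         sb.append(String.format("%02X ", b));
      return sb.toString().trim();
   }

   public static void main(String[] args)
   {
      System.out.println(toHex(notepadUtf8("Hi")));     // EF BB BF 48 69
      System.out.println(toHex(notepadUnicode("Hi")));  // FF FE 48 00 69 00
   }
}
```

The same two letters written with writeUTF would be 00 02 48 69: no signature at the start of the file, but a length in front of every string.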
 
