"UTF-8 is a character encoding capable of encoding all possible characters, or code points, defined by Unicode and originally designed by Ken Thompson and Rob Pike. The encoding is variable-length and uses 8-bit code units." ~ Ref: https://en.wikipedia.org/wiki/UTF-8
Use translate.google.com to generate simple test strings in different languages.
import java.io.IOException;
import java.io.FileOutputStream;
import java.io.DataOutputStream;

/**
 * This program opens a binary file and writes a series of strings
 * using UTF-8 encoding.
 */
public class WriteUTF
{
    public static void main(String[] args) throws IOException
    {
        // Sample names: plain ASCII plus a few Chinese strings.
        String[] names = { "Warren", "Becky", "Holly", "Chloe", "缅甸城", "炫耀的生", "你好" };

        // try-with-resources closes both streams, even if a write fails.
        try (DataOutputStream outputFile = new DataOutputStream(
                new FileOutputStream("c:\\temp\\UTFnames.dat")))
        {
            System.out.println("Writing the names to the file...");
            for (String name : names)
                outputFile.writeUTF(name); // length prefix first, then the encoded bytes
        }
        System.out.println("Done.");
    }
}
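To check what was written, the file can be read back with DataInputStream.readUTF, the counterpart of writeUTF. The following is only a sketch: the class name ReadUTF and the helper readNames are made up for this example, and it assumes the file contains nothing but strings written with writeUTF. On a Windows console the Chinese names may not display correctly, depending on the console code page.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * Reads back a file of strings written with DataOutputStream.writeUTF.
 * ReadUTF and readNames are hypothetical names used for illustration.
 */
public class ReadUTF
{
    public static List<String> readNames(String path) throws IOException
    {
        List<String> names = new ArrayList<>();
        try (DataInputStream inputFile = new DataInputStream(new FileInputStream(path)))
        {
            while (true)
            {
                try
                {
                    // readUTF reads the length prefix, then that many bytes.
                    names.add(inputFile.readUTF());
                }
                catch (EOFException e)
                {
                    break; // no more strings in the file
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException
    {
        // Defaults to the path used by WriteUTF; pass another path as the first argument.
        String path = args.length > 0 ? args[0] : "c:\\temp\\UTFnames.dat";
        for (String name : readNames(path))
            System.out.println(name);
    }
}
```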
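Because UTF-8 is variable length, a reader cannot tell from the bytes alone where one string ends and the next begins. Java's writeUTF therefore records the length first, as a 2-byte big-endian count of the encoded bytes (not of the characters), followed by the encoded characters. This limits a single string to 65,535 encoded bytes. Strictly speaking the encoding is Java's "modified UTF-8", which differs from standard UTF-8 only for the null character and for characters outside the Basic Multilingual Plane.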
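The layout is easy to see by writing to memory instead of a file and dumping the bytes in hex. This is a small sketch; the class name WriteUTFLayout and the helpers encode and toHex are invented for the example. The Chinese string is written with \u escapes so the result does not depend on the source file's encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/**
 * Shows the bytes produced by DataOutputStream.writeUTF.
 * WriteUTFLayout, encode and toHex are hypothetical names used for illustration.
 */
public class WriteUTFLayout
{
    // Returns exactly the bytes that writeUTF emits for s.
    public static byte[] encode(String s) throws IOException
    {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer))
        {
            out.writeUTF(s);
        }
        return buffer.toByteArray();
    }

    // Formats bytes as space-separated two-digit hex values.
    public static String toHex(byte[] bytes)
    {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes)
        {
            if (sb.length() > 0)
                sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException
    {
        // 5 ASCII characters -> length 5, one byte each.
        System.out.println(toHex(encode("Holly")));        // 00 05 48 6F 6C 6C 79
        // "你好": 2 characters -> length 6, three bytes each.
        System.out.println(toHex(encode("\u4F60\u597D"))); // 00 06 E4 BD A0 E5 A5 BD
    }
}
```

Note that the prefix for the Chinese string is 6, not 2: the length counts bytes, not characters.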
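Files saved in Notepad with the "Unicode" encoding are UTF-16 little-endian. They start with a Unicode signature (the byte order mark, bytes FF FE), and every character is stored as a 2-byte code unit; characters outside the Basic Multilingual Plane take two such units.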
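UTF-8 encoded files saved in Notepad also start with a signature (the three bytes EF BB BF; recent Windows versions write it only when "UTF-8 with BOM" is selected), but the file structure is very different from Java's writeUTF output: the text simply follows the signature as standard UTF-8, with no length prefix in front of each string. A file written with writeUTF is therefore a binary file, not a text file that Notepad can display cleanly.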
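For comparison, the two Notepad layouts can be reproduced in Java by prepending the signature by hand, since StandardCharsets.UTF_16LE and StandardCharsets.UTF_8 do not emit one when encoding. This is a sketch under that assumption about what Notepad writes; the class name NotepadLayouts and its helper methods are invented for the example.

```java
import java.nio.charset.StandardCharsets;

/**
 * Builds the byte sequences Notepad is assumed to write for a piece of text.
 * NotepadLayouts and its methods are hypothetical names used for illustration.
 */
public class NotepadLayouts
{
    // Notepad "Unicode": FF FE signature, then UTF-16 little-endian code units.
    public static byte[] unicodeWithSignature(String text)
    {
        byte[] bom = { (byte) 0xFF, (byte) 0xFE };
        return concat(bom, text.getBytes(StandardCharsets.UTF_16LE));
    }

    // Notepad UTF-8 with signature: EF BB BF, then standard UTF-8 bytes.
    public static byte[] utf8WithSignature(String text)
    {
        byte[] bom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };
        return concat(bom, text.getBytes(StandardCharsets.UTF_8));
    }

    private static byte[] concat(byte[] head, byte[] tail)
    {
        byte[] all = new byte[head.length + tail.length];
        System.arraycopy(head, 0, all, 0, head.length);
        System.arraycopy(tail, 0, all, head.length, tail.length);
        return all;
    }

    // Formats bytes as space-separated two-digit hex values.
    public static String toHex(byte[] bytes)
    {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes)
        {
            if (sb.length() > 0)
                sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args)
    {
        System.out.println(toHex(unicodeWithSignature("Hi"))); // FF FE 48 00 69 00
        System.out.println(toHex(utf8WithSignature("Hi")));    // EF BB BF 48 69
    }
}
```

The same string written with writeUTF would be 00 02 48 69: a length prefix instead of a signature, which is why the two kinds of file are not interchangeable.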