Loading data in Unicode / UTF-8
A document's encoding tells an application how the characters in the document are represented as bytes in the file. In essence, the encoding specifies which byte sequence stands for each character, and therefore how many bits each character occupies. In a standard ASCII document, every character is a 7-bit code that is normally stored in a single 8-bit byte. HTML files are often encoded with one byte per character, but with the globalization of the internet, this is not always the case. Many HTML documents use 16-bit code units, or a variable-width encoding in which different characters take different numbers of bytes.
A particularly common form of HTML document encoding is UTF-8, a variable-width encoding that uses one to four bytes per character. This is the encoding that we will examine.
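To make the idea of a variable-width encoding concrete, the following minimal sketch encodes a few sample characters as UTF-8 and reports how many bytes each one needs. The sample characters and the helper name utf8_width are illustrative assumptions and are not taken from unicode.html. The last lines show what happens when UTF-8 bytes are decoded with the wrong encoding.

```python
# Sample characters chosen for illustration only; each one sits in a
# different part of the Unicode range.
samples = ["A", "é", "€", "😀"]


def utf8_width(ch: str) -> int:
    """Return the number of bytes UTF-8 uses to encode a single character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    encoded = ch.encode("utf-8")
    # Show the code point, the byte count, and the raw bytes in hex.
    print(f"U+{ord(ch):04X} {ch!r}: {utf8_width(ch)} byte(s), hex {encoded.hex()}")

# Decoding the same bytes with a different encoding garbles the text,
# which is why an application needs to know the document's encoding.
raw = "é".encode("utf-8")
as_utf8 = raw.decode("utf-8")      # the original character
as_latin1 = raw.decode("latin-1")  # two unrelated characters
print(as_utf8, as_latin1)
```

The byte counts range from one to four, which is why an application cannot assume a fixed number of bits per character when it reads a UTF-8 document.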
Getting ready
We will read a file named unicode.html from our local web server, located at http://localhost:8080/unicode.html. This file is UTF-8 encoded and contains several sets of characters in different parts of the encoding space. The page looks as follows in your browser:

The Page in the Browser
Using an editor that supports UTF-8, we can see how the...