
Encoding an HTML Document for Unicode or Multiple Languages

In 1998 I began experimenting with different languages, both on the Web and in documents and applications stored on the PC itself. Back then it was practically impossible to ensure a document displayed all languages correctly in all browsers.

Returning to the task almost a decade later, I can see much has changed on the Web! With developments in HTML itself allowing us to specify language, here is my updated guide to writing standards-compliant multilingual HTML documents with maximum compatibility.

File Format, Character Encoding of HTML Documents

It is important to recognise that the computer itself can store text in several formats. Notepad in Microsoft Windows will typically try to save text (including HTML) as ANSI, an 8-bit extension of ASCII that supports Western European character sets and some extended characters.

Example: the letter 'a' in ASCII has character code 97. (NOTE: ASCII itself is a 7-bit code; 8-bit formats such as ANSI extend it with a further 128 characters.)

Note: If you change the character encoding of the document, you must update your HTML header information to let the browser know what format it is in. This is the line directly underneath the opening <head> tag that reads:

<meta http-equiv="Content-Type" content="text/html; charset=???" >

Where ??? is the name of the standard the text is saved in, for example ISO-8859-1 (Western European), ISO-8859-6 (Arabic) or ISO-8859-4 (Baltic) (see a complete list here). You need to select the one for your language. All are slightly different, and not all encodings use the same number of bytes per character.

If In Doubt, Use Unicode

If you are planning to use a language that is vastly different from European languages, such as an Asian language, or to use multiple languages within one document (even if they are all European), I strongly recommend using Unicode for maximum compatibility, specified by 'UTF-8':

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" >
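To show where that line sits, here is a minimal sketch of a complete page. The doctype, title and sample words are placeholders of my own choosing rather than anything you must copy, and the file itself still has to be saved as UTF-8 for the declaration to be true.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
  <!-- Declare the encoding first, before any text such as the title -->
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <title>Hello in three languages</title>
</head>
<body>
  <!-- Three different scripts in one file: only possible because it is saved as UTF-8 -->
  <p>Hello</p>
  <p>Привет</p>
  <p>こんにちは</p>
</body>
</html>
```

If the Russian or Japanese lines show up as question marks or garbage, the most likely cause is that the file was saved in a different encoding from the one named in the meta tag.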

Saving your Web Documents in UTF-8

Saving Text Files With A Different Character Encoding In Notepad

To save a document in UTF-8 using Notepad, go to File -> Save As, click the 'Encoding' drop-down box just below 'Save as type' and select UTF-8.

Changing the Character Encoding Format in Macromedia Dreamweaver

To achieve this in Dreamweaver, go to Edit -> Preferences -> New Document (tab) and choose UTF-8 as the default encoding.

As an extreme example, have a look at my 'hello' in 30 different languages page, which uses this technique. Each <li> element on that page overrides the document's language with one of its own! The page also contains an exhaustive table of language character codes in HTML.

HTML allows you to specify the language the document is written in, by including it in the opening HTML tag:

<html lang="en">

Or by including:
<meta http-equiv="Content-Language" content="??">

Where ?? is any of the two-letter HTML language codes (for example en, fr or ja). In theory, though, any element can carry its own language, which its child elements then inherit. So you could even divide your HTML document into regions using DIVs with different lang attributes, and all should work fine (although I've not actually tried this).
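A minimal sketch of that idea might look like the following. The languages and greetings are purely illustrative, and I am assuming the file is saved as UTF-8 as described earlier; treat it as a starting point rather than something tested in every browser.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <title>One page, several languages</title>
</head>
<body>
  <!-- No lang attribute here, so this paragraph inherits "en" from <html> -->
  <p>Hello</p>

  <!-- Everything inside this DIV is marked as French -->
  <div lang="fr">
    <p>Bonjour</p>
  </div>

  <!-- Everything inside this DIV is marked as Japanese -->
  <div lang="ja">
    <p>こんにちは</p>
  </div>
</body>
</html>
```

Note that the lang attribute only labels the language of the text; it does not change the character encoding, so the charset declaration is still needed.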

If anyone has any thoughts about better ways to store a particular language in an HTML document, feel free to comment. I’m not saying this is perfect, but it seems to work quite well.