Unicode in D

Unicode is a global standard for representing text in computers. D fully supports Unicode in both the language and the standard library.

What and Why

Computers, at the lowest level, have no notion of what text is, as they only deal with numbers. As a result, computer code needs a way to take text data and transform it to and from a binary representation. The method of transformation is called an encoding scheme, and Unicode is one such scheme.

To see the numerical representations underlying the strings in the example, simply run the code.
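A minimal sketch of what such an example might look like, assuming a short sample string chosen here for illustration. The helper name `utf8Bytes` is hypothetical; `representation` is the Phobos function from `std.string` that exposes a string's underlying bytes.

```d
import std.stdio : writeln;
import std.string : representation;

// Hypothetical helper: returns the raw UTF-8 code units behind a D string.
immutable(ubyte)[] utf8Bytes(string s)
{
    return s.representation;
}

void main()
{
    // "h" followed by "é" (U+00E9), written with an escape so the source
    // file's own encoding does not matter.
    string word = "h\u00E9";

    // The bytes stored in memory: 'h' is one byte, 'é' is two.
    writeln(utf8Bytes(word)); // [104, 195, 169]

    // The Unicode code points those bytes encode.
    foreach (dchar c; word)
        writeln(cast(uint) c); // 104, then 233
}
```

The same two characters therefore appear as three numbers at the byte level and as two numbers at the code point level.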

Unicode is unique in that its design allows it to represent all the languages of the world using the same encoding scheme. Before Unicode, computers made by different companies or shipped in different areas had a hard time communicating, and in some cases an encoding scheme wasn't supported at all, making viewing the text on that computer impossible.

For more information on Unicode and the technical details, see the Wikipedia article on Unicode in the "In-Depth" section.

How

Unicode has solved most of those problems and is supported on every modern machine. In D, all strings are Unicode strings, whereas strings in languages such as C and C++ are just arrays of bytes.

The types string, wstring, and dstring are UTF-8, UTF-16, and UTF-32 encoded strings respectively. Their character types are char, wchar, and dchar.
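As an illustration of the three string types, the following sketch stores the same single code point (U+1F600, picked here only because it lies outside the Basic Multilingual Plane) in each type and compares how many code units it occupies.

```d
import std.stdio : writeln;

void main()
{
    string  s = "\U0001F600";  // UTF-8
    wstring w = "\U0001F600"w; // UTF-16
    dstring d = "\U0001F600"d; // UTF-32

    // Each string type is an array of immutable code units.
    static assert(is(typeof(s[0]) == immutable(char)));
    static assert(is(typeof(w[0]) == immutable(wchar)));
    static assert(is(typeof(d[0]) == immutable(dchar)));

    // Code unit sizes in bytes.
    static assert(char.sizeof == 1 && wchar.sizeof == 2 && dchar.sizeof == 4);

    // .length counts code units, not characters.
    writeln(s.length); // 4
    writeln(w.length); // 2
    writeln(d.length); // 1
}
```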

According to the specification, it is an error to store non-Unicode data in the D string types; if a string is improperly encoded, expect your program to fail in unpredictable ways.
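One way to check data of unknown origin before treating it as a string is `std.utf.validate`, which throws a `UTFException` on malformed input. The sketch below is only an illustration: the helper name `isValidUtf8` and the sample bytes are assumptions made for this example.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: reports whether s holds well-formed UTF-8.
bool isValidUtf8(string s)
{
    try
    {
        validate(s); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
    immutable(ubyte)[] raw = [0xC3, 0x28];
    string bad = cast(string) raw; // the cast itself performs no checking

    assert(!isValidUtf8(bad));
    assert(isValidUtf8("h\u00E9"));
}
```

The cast compiles without complaint, which is why malformed data tends to surface later, at the point where something tries to decode it.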

In order to store other string encodings, or to obtain C/C++ behavior, you can use raw bytes with types ubyte[] or char*.
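A rough sketch of both options, assuming ISO-8859-1 (Latin-1) as the foreign encoding. The helper `latin1ToUtf8` is hypothetical and written for this example; `toStringz` (from `std.string`) and `strlen` (from `core.stdc.string`) are existing library functions.

```d
import core.stdc.string : strlen;
import std.string : toStringz;

// Hypothetical helper: converts Latin-1 bytes to a UTF-8 string.
// Each Latin-1 byte maps to the Unicode code point of the same value.
string latin1ToUtf8(const(ubyte)[] bytes)
{
    string result;
    foreach (b; bytes)
        result ~= cast(dchar) b; // appending a dchar encodes it as UTF-8
    return result;
}

void main()
{
    // "café" in Latin-1. A lone 0xE9 is not valid UTF-8, so the data
    // is kept in ubyte[] rather than in a string.
    ubyte[] latin1 = [0x63, 0x61, 0x66, 0xE9];

    string utf8 = latin1ToUtf8(latin1);
    assert(utf8 == "caf\u00E9");
    assert(utf8.length == 5); // 'é' takes two UTF-8 code units

    // toStringz yields a zero-terminated pointer suitable for C APIs.
    const(char)* cstr = toStringz("hello");
    assert(strlen(cstr) == 5);
}
```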

Strings in Range Algorithms

Reading the gem on range algorithms first is recommended for this section.

There are some important caveats to keep in mind when working with Unicode in D.

First, as a convenience, when a string or wstring is iterated using the range functions, each element it yields is a decoded UTF-32 code point (a dchar) rather than a code unit. This practice, known as auto decoding, means that

static assert(is(typeof(utf8.front) == dchar));
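The assertion above can be expanded into a small program. This is a sketch with sample strings chosen for illustration; it contrasts what the range primitive `front` returns with what plain indexing returns.

```d
import std.range.primitives : ElementType, front;

void main()
{
    string  utf8  = "\u00E9t\u00E9";  // "été" as UTF-8
    wstring utf16 = "\u00E9t\u00E9"w; // the same text as UTF-16

    // The range primitives decode: front is a whole code point.
    static assert(is(typeof(utf8.front) == dchar));
    static assert(is(typeof(utf16.front) == dchar));
    static assert(is(ElementType!string == dchar));

    // Indexing does not decode: it yields a single code unit.
    static assert(is(typeof(utf8[0]) == immutable(char)));

    assert(utf8.front == '\u00E9'); // U+00E9, decoded from two bytes
    assert(utf8[0] == 0xC3);        // only the first of those two bytes
}
```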

This behavior has many implications. The one that confuses most people is that std.range.primitives.hasLength!(string) is false. Why? Because, in terms of the range API, a string's length property returns the number of code units stored in the string, rather than the number of elements (code points) the range functions will iterate over.

From the example, you can see why these two things might not always be equal. As such, the range algorithms act as if strings do not have length information.
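A minimal sketch of such an example, using a sample string chosen here for illustration. `walkLength` counts elements by iterating the range, so it reports what the range functions actually see.

```d
import std.range.primitives : hasLength, walkLength;
import std.stdio : writeln;

void main()
{
    string utf8 = "h\u00E9llo"; // "héllo"

    writeln(utf8.length);     // 6: UTF-8 code units ('é' needs two)
    writeln(utf8.walkLength); // 5: code points seen by the range functions

    // Narrow strings are therefore treated as ranges without a length...
    static assert(!hasLength!string);
    static assert(!hasLength!wstring);

    // ...while dstring and ordinary arrays keep theirs.
    static assert(hasLength!dstring);
    static assert(hasLength!(int[]));
}
```

For a dstring the two counts always agree, because every code point occupies exactly one code unit.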

For more information on the technical details of auto decoding, and what it means for your program, check the links in the "In-Depth" section.

In-Depth
