1.3 File Encoding Standard

Each source file must use the standard encoding for its source file type. Refer to the list of supported source file types to find the appropriate encoding.

Rationale

The preferred encoding is UTF-8. UTF-8 is the standard encoding for source file types whose parsers support it; where UTF-8 is not supported, the 8-bit ISO 8859-1 (Latin-1) character set is the standard encoding.

UTF-8 is preferred because it is widely supported and can encode every character in the Unicode standard, which covers practically every written language in current use. By contrast, UTF-16 is less widely supported, whilst the 8-bit ISO 8859-1 (Latin-1) character set has only 256 code values and so can represent at most 256 characters.
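
The difference in coverage between the two encodings can be illustrated with a minimal Python sketch. The helper name can_encode and the sample strings are hypothetical, chosen only for illustration; the codec names "latin-1" and "utf-8" are the standard Python codec names.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Python's latin-1 codec maps each of the 256 byte values to one
# code point in the range U+0000..U+00FF, so the repertoire is fixed.
latin1_repertoire = bytes(range(256)).decode("latin-1")

# Hypothetical sample strings.
western = "na\u00efve caf\u00e9"   # accented letters that Latin-1 covers
euro = "price: 10 \u20ac"          # U+20AC (euro sign) lies outside Latin-1

print(len(latin1_repertoire))
print(can_encode(western, "latin-1"), can_encode(euro, "latin-1"))
print(can_encode(western, "utf-8"), can_encode(euro, "utf-8"))
```

A check of this kind could be used to decide whether a file's content would survive conversion to Latin-1 for a parser that does not support UTF-8.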

A UTF-8 encoded file can be recognised by the UTF-8 byte order mark (BOM) at the start of the file. The UTF-8 BOM is the UTF-8 encoding of the Unicode character U+FEFF ("zero width no-break space"): the byte sequence 0xEF, 0xBB, 0xBF. A text file that does not start with this three-byte sequence will not necessarily be recognised as a UTF-8 encoded file. The BOM requirement is waived for UTF-8 encoded files where the BOM causes problems for parsers, as is the case with all GNU C++ compilers prior to release 4.4.
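
As an illustration of the BOM check described above, the following Python sketch tests whether data begins with the three-byte sequence. The function names starts_with_utf8_bom and has_utf8_bom are hypothetical; codecs.BOM_UTF8 and the "utf-8-sig" codec are part of the Python standard library. This is one possible check under those assumptions, not a prescribed tool.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if data begins with the UTF-8 BOM (0xEF, 0xBB, 0xBF)."""
    return data.startswith(codecs.BOM_UTF8)


def has_utf8_bom(path: str) -> bool:
    """Return True if the file at path begins with the UTF-8 BOM."""
    with open(path, "rb") as source_file:
        # Only the first three bytes are needed for the comparison.
        return source_file.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8


# The BOM is simply U+FEFF encoded as UTF-8.
bom_from_code_point = "\ufeff".encode("utf-8")

# The "utf-8-sig" codec prepends the BOM when encoding; plain "utf-8" does not.
with_bom = "int main() {}".encode("utf-8-sig")
without_bom = "int main() {}".encode("utf-8")

print(starts_with_utf8_bom(with_bom), starts_with_utf8_bom(without_bom))
```

Note that the absence of a BOM does not prove a file is not UTF-8; it only means the file cannot be recognised as UTF-8 by this signature alone.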