# 5 Lexical conventions [[lex]](./#lex)

## 5.3 Characters [[lex.char]](lex.char#lex.charset)

### 5.3.1 Character sets [lex.charset]

[1](#1) The *translation character set* consists of the following elements:

- (1.1) each abstract character assigned a code point in the Unicode codespace as specified in the Unicode Standard, and
- (1.2) a distinct character for each Unicode scalar value not assigned to an abstract character.

[*Note 1*: Unicode code points are integers in the range [0, 10FFFF] (hexadecimal). A surrogate code point is a value in the range [D800, DFFF] (hexadecimal). A Unicode scalar value is any code point that is not a surrogate code point. — *end note*]

[2](#2) The *basic character set* is a subset of the translation character set, consisting of 99 characters as specified in Table 1.

[*Note 2*: Unicode short names are given only as a means to identifying the character; the numerical value has no other meaning in this context. — *end note*]

Table 1 — Basic character set [[tab:lex.charset.basic]](./tab:lex.charset.basic)

| **character** | | **glyph** |
| --- | --- | --- |
| U+0009 | character tabulation | |
| U+000b | line tabulation | |
| U+000c | form feed | |
| U+0020 | space | |
| U+000a | line feed | new-line |
| U+0021 | exclamation mark | ! |
| U+0022 | quotation mark | " |
| U+0023 | number sign | # |
| U+0024 | dollar sign | $ |
| U+0025 | percent sign | % |
| U+0026 | ampersand | & |
| U+0027 | apostrophe | ' |
| U+0028 | left parenthesis | ( |
| U+0029 | right parenthesis | ) |
| U+002a | asterisk | \* |
| U+002b | plus sign | + |
| U+002c | comma | , |
| U+002d | hyphen-minus | - |
| U+002e | full stop | . |
| U+002f | solidus | / |
| U+0030 .. U+0039 | digit zero .. nine | 0 1 2 3 4 5 6 7 8 9 |
| U+003a | colon | : |
| U+003b | semicolon | ; |
| U+003c | less-than sign | < |
| U+003d | equals sign | = |
| U+003e | greater-than sign | > |
| U+003f | question mark | ? |
| U+0040 | commercial at | @ |
| U+0041 .. U+005a | latin capital letter a .. z | A B C D E F G H I J K L M N O P Q R S T U V W X Y Z |
| U+005b | left square bracket | [ |
| U+005c | reverse solidus | \\ |
| U+005d | right square bracket | ] |
| U+005e | circumflex accent | ^ |
| U+005f | low line | _ |
| U+0060 | grave accent | \` |
| U+0061 .. U+007a | latin small letter a .. z | a b c d e f g h i j k l m n o p q r s t u v w x y z |
| U+007b | left curly bracket | { |
| U+007c | vertical line | \| |
| U+007d | right curly bracket | } |
| U+007e | tilde | ~ |

[3](#3) The *basic literal character set* consists of all characters of the basic character set, plus the control characters specified in Table 2.

Table 2 — Additional control characters in the basic literal character set [[tab:lex.charset.literal]](./tab:lex.charset.literal)

| **character** | |
| --- | --- |
| U+0000 | null |
| U+0007 | alert |
| U+0008 | backspace |
| U+000d | carriage return |

[4](#4) A *code unit* is an integer value of character type ([[basic.fundamental]](basic.fundamental "6.9.2 Fundamental types")). Characters in a [*character-literal*](lex.ccon#nt:character-literal "5.13.3 Character literals [lex.ccon]") other than a multicharacter or non-encodable character literal or in a [*string-literal*](lex.string#nt:string-literal "5.13.5 String literals [lex.string]") are encoded as a sequence of one or more code units, as determined by the [*encoding-prefix*](lex.ccon#nt:encoding-prefix "5.13.3 Character literals [lex.ccon]") ([[lex.ccon]](lex.ccon "5.13.3 Character literals"), [[lex.string]](lex.string "5.13.5 String literals")); this is termed the respective *literal encoding*. The *ordinary literal encoding* is the encoding applied to an ordinary character or string literal. The *wide literal encoding* is the encoding applied to a wide character or string literal.

[5](#5) A literal encoding or a locale-specific encoding of one of the execution character sets ([[character.seq]](character.seq "16.3.3.3.4 Character sequences")) encodes each element of the basic literal character set as a single code unit with non-negative value, distinct from the code unit for any other such element.

[*Note 3*: A character not in the basic literal character set can be encoded with more than one code unit; the value of such a code unit can be the same as that of a code unit for an element of the basic literal character set. — *end note*]

The U+0000 null character is encoded as the value 0. No other element of the translation character set is encoded with a code unit of value 0. The code unit value of each decimal digit character after the digit 0 (U+0030) shall be one greater than the value of the previous. The ordinary and wide literal encodings are otherwise implementation-defined. For a UTF-8, UTF-16, or UTF-32 literal, the implementation shall encode the Unicode scalar value corresponding to each character of the translation character set as specified in the Unicode Standard for the respective Unicode encoding form.
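
The guarantees in this subclause can be checked at compile time. The following is a minimal sketch, not text from the standard. It assumes C++14 or later and a compiler that accepts `$`, `@`, and the grave accent inside string literals. All identifiers (`is_surrogate`, `is_scalar_value`, `digit_value`, `basic_literal_chars`, `all_nonnegative_and_distinct`) are hypothetical names chosen for illustration.

```cpp
#include <cstddef>

// Hypothetical helper: surrogate code points lie in [D800, DFFF].
constexpr bool is_surrogate(char32_t cp) {
    return cp >= 0xD800 && cp <= 0xDFFF;
}

// Hypothetical helper: a Unicode scalar value is any code point in
// [0, 10FFFF] that is not a surrogate code point.
constexpr bool is_scalar_value(char32_t cp) {
    return cp <= 0x10FFFF && !is_surrogate(cp);
}

// Hypothetical helper: relies on the rule that each decimal digit after
// '0' has a code unit value one greater than the previous digit.
constexpr int digit_value(char c) {
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}

// The basic literal character set in the ordinary literal encoding:
// 102 explicit characters plus the terminating U+0000 null.
constexpr char basic_literal_chars[] =
    "\a\b\t\n\v\f\r !\"#$%&'()*+,-./0123456789:;<=>?@"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`"
    "abcdefghijklmnopqrstuvwxyz{|}~";

// Checks that every element is a single non-negative code unit and that
// no two elements share a code unit value.
constexpr bool all_nonnegative_and_distinct() {
    constexpr std::size_t n = sizeof(basic_literal_chars);
    for (std::size_t i = 0; i < n; ++i) {
        if (basic_literal_chars[i] < 0) return false;
        for (std::size_t j = i + 1; j < n; ++j) {
            if (basic_literal_chars[i] == basic_literal_chars[j]) return false;
        }
    }
    return true;
}

static_assert(sizeof(basic_literal_chars) == 99 + 4,
              "99 basic characters plus 4 additional control characters");
static_assert('\0' == 0 && L'\0' == 0, "null is encoded as the value 0");
static_assert('9' - '0' == 9 && L'9' - L'0' == 9, "digits are contiguous");
static_assert(digit_value('7') == 7, "digit conversion by subtraction");
static_assert(all_nonnegative_and_distinct(),
              "single, distinct, non-negative code units");
static_assert(is_scalar_value(0x10FFFF) && !is_scalar_value(0xD800) &&
                  !is_scalar_value(0x110000),
              "scalar values exclude surrogates and out-of-range values");
```

Because every check is a `static_assert`, compiling the file is the test: a conforming implementation should accept it whatever its ordinary and wide literal encodings are. The sketch checks only the ordinary literal encoding for distinctness; a wide-literal variant would follow the same pattern with `wchar_t`.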