[lex.char]
# 5 Lexical conventions [[lex]](./#lex)
## 5.3 Characters [lex.char]
### [5.3.1](#lex.charset) Character sets [[lex.charset]](lex.charset)
[1](#lex.charset-1)
[#](http://github.com/Eelis/draft/tree/9adde4bc1c62ec234483e63ea3b70a59724c745a/source/lex.tex#L298)
The [*translation character set*](#def:character_set,translation "5.3.1Character sets[lex.charset]") consists of the following elements:
- [(1.1)](#lex.charset-1.1)
each abstract character assigned a code point in the Unicode codespace
as specified in the Unicode Standard, and
- [(1.2)](#lex.charset-1.2)
a distinct character for each Unicode scalar value
not assigned to an abstract character[.](#lex.charset-1.sentence-1)
[*Note [1](#lex.charset-note-1)*:
Unicode code points are integers
in the range [0, 10FFFF] (hexadecimal)[.](#lex.charset-1.sentence-2)
A surrogate code point is a value
in the range [D800, DFFF] (hexadecimal)[.](#lex.charset-1.sentence-3)
A Unicode scalar value is any code point that is not a surrogate code point[.](#lex.charset-1.sentence-4)
— *end note*]
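
The three ranges in the note above can be sketched as compile-time predicates. The function names below are illustrative assumptions; they are not names defined by the standard.

```cpp
// Hypothetical helpers mirroring the definitions in the note.

// A Unicode code point is an integer in [0, 10FFFF] (hexadecimal).
constexpr bool is_code_point(char32_t v) { return v <= 0x10FFFF; }

// A surrogate code point lies in [D800, DFFF] (hexadecimal).
constexpr bool is_surrogate(char32_t v) { return v >= 0xD800 && v <= 0xDFFF; }

// A Unicode scalar value is any code point that is not a surrogate.
constexpr bool is_scalar_value(char32_t v) {
    return is_code_point(v) && !is_surrogate(v);
}

static_assert(is_scalar_value(0x0041), "U+0041 is a scalar value");
static_assert(is_scalar_value(0x10FFFF), "the last code point is a scalar value");
static_assert(!is_scalar_value(0xD800), "surrogates are not scalar values");
static_assert(!is_scalar_value(0x110000), "values past 10FFFF are not code points");
```
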
[2](#lex.charset-2)
[#](http://github.com/Eelis/draft/tree/9adde4bc1c62ec234483e63ea3b70a59724c745a/source/lex.tex#L317)
The [*basic character set*](#def:character_set,basic "5.3.1Character sets[lex.charset]") is a subset of the translation character set,
consisting of 99 characters as specified in Table [1](#tab:lex.charset.basic "Table 1: Basic character set")[.](#lex.charset-2.sentence-1)
[*Note [2](#lex.charset-note-2)*:
Unicode short names are given only as a means of identifying the character;
the numerical value has no other meaning in this context[.](#lex.charset-2.sentence-2)
— *end note*]
Table [1](#tab:lex.charset.basic) — Basic character set [[tab:lex.charset.basic]](./tab:lex.charset.basic)
| [🔗](#tab:lex.charset.basic-row-1)<br>**character** | | **glyph** |
| --- | --- | --- |
| [🔗](#tab:lex.charset.basic-row-2)<br>U+0009 | character tabulation | |
| [🔗](#tab:lex.charset.basic-row-3)<br>U+000b | line tabulation | |
| [🔗](#tab:lex.charset.basic-row-4)<br>U+000c | form feed | |
| [🔗](#tab:lex.charset.basic-row-5)<br>U+0020 | space | |
| [🔗](#tab:lex.charset.basic-row-6)<br>U+000a | line feed | new-line |
| [🔗](#tab:lex.charset.basic-row-7)<br>U+0021 | exclamation mark | ! |
| [🔗](#tab:lex.charset.basic-row-8)<br>U+0022 | quotation mark | " |
| [🔗](#tab:lex.charset.basic-row-9)<br>U+0023 | number sign | # |
| [🔗](#tab:lex.charset.basic-row-10)<br>U+0024 | dollar sign | $ |
| [🔗](#tab:lex.charset.basic-row-11)<br>U+0025 | percent sign | % |
| [🔗](#tab:lex.charset.basic-row-12)<br>U+0026 | ampersand | & |
| [🔗](#tab:lex.charset.basic-row-13)<br>U+0027 | apostrophe | ' |
| [🔗](#tab:lex.charset.basic-row-14)<br>U+0028 | left parenthesis | ( |
| [🔗](#tab:lex.charset.basic-row-15)<br>U+0029 | right parenthesis | ) |
| [🔗](#tab:lex.charset.basic-row-16)<br>U+002a | asterisk | * |
| [🔗](#tab:lex.charset.basic-row-17)<br>U+002b | plus sign | + |
| [🔗](#tab:lex.charset.basic-row-18)<br>U+002c | comma | , |
| [🔗](#tab:lex.charset.basic-row-19)<br>U+002d | hyphen-minus | - |
| [🔗](#tab:lex.charset.basic-row-20)<br>U+002e | full stop | . |
| [🔗](#tab:lex.charset.basic-row-21)<br>U+002f | solidus | / |
| [🔗](#tab:lex.charset.basic-row-22)<br>U+0030 .. U+0039 | digit zero .. nine | 0 1 2 3 4 5 6 7 8 9 |
| [🔗](#tab:lex.charset.basic-row-23)<br>U+003a | colon | : |
| [🔗](#tab:lex.charset.basic-row-24)<br>U+003b | semicolon | ; |
| [🔗](#tab:lex.charset.basic-row-25)<br>U+003c | less-than sign | < |
| [🔗](#tab:lex.charset.basic-row-26)<br>U+003d | equals sign | = |
| [🔗](#tab:lex.charset.basic-row-27)<br>U+003e | greater-than sign | > |
| [🔗](#tab:lex.charset.basic-row-28)<br>U+003f | question mark | ? |
| [🔗](#tab:lex.charset.basic-row-29)<br>U+0040 | commercial at | @ |
| [🔗](#tab:lex.charset.basic-row-30)<br>U+0041 .. U+005a | latin capital letter a .. z | A B C D E F G H I J K L M |
| [🔗](#tab:lex.charset.basic-row-31) | | N O P Q R S T U V W X Y Z |
| [🔗](#tab:lex.charset.basic-row-32)<br>U+005b | left square bracket | [ |
| [🔗](#tab:lex.charset.basic-row-33)<br>U+005c | reverse solidus | \ |
| [🔗](#tab:lex.charset.basic-row-34)<br>U+005d | right square bracket | ] |
| [🔗](#tab:lex.charset.basic-row-35)<br>U+005e | circumflex accent | ^ |
| [🔗](#tab:lex.charset.basic-row-36)<br>U+005f | low line | _ |
| [🔗](#tab:lex.charset.basic-row-37)<br>U+0060 | grave accent | ` |
| [🔗](#tab:lex.charset.basic-row-38)<br>U+0061 .. U+007a | latin small letter a .. z | a b c d e f g h i j k l m |
| [🔗](#tab:lex.charset.basic-row-39) | | n o p q r s t u v w x y z |
| [🔗](#tab:lex.charset.basic-row-40)<br>U+007b | left curly bracket | { |
| [🔗](#tab:lex.charset.basic-row-41)<br>U+007c | vertical line | \| |
| [🔗](#tab:lex.charset.basic-row-42)<br>U+007d | right curly bracket | } |
| [🔗](#tab:lex.charset.basic-row-43)<br>U+007e | tilde | ~ |
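
As a cross-check of Table 1, membership in the basic character set can be sketched as a predicate on code points. The names `is_basic_character` and `count_basic_characters` are assumptions made for this illustration. The table lists four control characters plus every code point from U+0020 through U+007e, which gives the 99 characters stated above.

```cpp
// Hypothetical predicate: true if the code point appears in Table 1.
constexpr bool is_basic_character(char32_t c) {
    return c == 0x09 || c == 0x0A || c == 0x0B || c == 0x0C  // tab, LF, VT, FF
        || (c >= 0x20 && c <= 0x7E);                          // space through tilde
}

// Counts the members of the set. The predicate is false above U+007e by
// construction, so scanning the first 128 code points is sufficient.
constexpr int count_basic_characters() {
    int n = 0;
    for (char32_t c = 0; c <= 0x7F; ++c) {
        if (is_basic_character(c)) ++n;
    }
    return n;
}

static_assert(count_basic_characters() == 99, "Table 1 lists 99 characters");
static_assert(is_basic_character(0x24), "dollar sign is in the table");
static_assert(!is_basic_character(0x0D), "carriage return is not in Table 1");
```
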
[3](#lex.charset-3)
[#](http://github.com/Eelis/draft/tree/9adde4bc1c62ec234483e63ea3b70a59724c745a/source/lex.tex#L372)
The [*basic literal character set*](#def:character_set,basic_literal "5.3.1Character sets[lex.charset]") consists of
all characters of the basic character set,
plus the control characters specified in Table [2](#tab:lex.charset.literal "Table 2: Additional control characters in the basic literal character set")[.](#lex.charset-3.sentence-1)
Table [2](#tab:lex.charset.literal) — Additional control characters in the basic literal character set [[tab:lex.charset.literal]](./tab:lex.charset.literal)
| [🔗](#tab:lex.charset.literal-row-1)<br>**character** | |
| --- | --- |
| [🔗](#tab:lex.charset.literal-row-2)<br>U+0000 | null |
| [🔗](#tab:lex.charset.literal-row-3)<br>U+0007 | alert |
| [🔗](#tab:lex.charset.literal-row-4)<br>U+0008 | backspace |
| [🔗](#tab:lex.charset.literal-row-5)<br>U+000d | carriage return |
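
The four additional control characters in Table 2 are the ones reachable through the `\0`, `\a`, `\b`, and `\r` escapes. A minimal sketch, using UTF-32 character literals so that the code unit values equal the code points; `is_additional_control` is a name invented for this example.

```cpp
// Hypothetical predicate: true if the code point appears in Table 2.
constexpr bool is_additional_control(char32_t c) {
    return c == 0x00 || c == 0x07 || c == 0x08 || c == 0x0D;
}

static_assert(is_additional_control(U'\0'), "null");
static_assert(is_additional_control(U'\a'), "alert");
static_assert(is_additional_control(U'\b'), "backspace");
static_assert(is_additional_control(U'\r'), "carriage return");
static_assert(!is_additional_control(U'\n'), "line feed is in Table 1, not Table 2");
```
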
[4](#lex.charset-4)
[#](http://github.com/Eelis/draft/tree/9adde4bc1c62ec234483e63ea3b70a59724c745a/source/lex.tex#L386)
A [*code unit*](#def:code_unit "5.3.1Character sets[lex.charset]") is an integer value
of character type ([[basic.fundamental]](basic.fundamental "6.9.2Fundamental types"))[.](#lex.charset-4.sentence-1)
Characters in a [*character-literal*](lex.ccon#nt:character-literal "5.13.3Character literals[lex.ccon]") other than a multicharacter or non-encodable character literal or
in a [*string-literal*](lex.string#nt:string-literal "5.13.5String literals[lex.string]") are encoded as
a sequence of one or more code units, as determined
by the [*encoding-prefix*](lex.ccon#nt:encoding-prefix "5.13.3Character literals[lex.ccon]") ([[lex.ccon]](lex.ccon "5.13.3Character literals"), [[lex.string]](lex.string "5.13.5String literals"));
this is termed the respective [*literal encoding*](#def:encoding,literal "5.3.1Character sets[lex.charset]")[.](#lex.charset-4.sentence-2)
The [*ordinary literal encoding*](#def:encoding,ordinary_literal "5.3.1Character sets[lex.charset]") is
the encoding applied to an ordinary character or string literal[.](#lex.charset-4.sentence-3)
The [*wide literal encoding*](#def:encoding,wide_literal "5.3.1Character sets[lex.charset]") is the encoding applied
to a wide character or string literal[.](#lex.charset-4.sentence-4)
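
To illustrate how the encoding-prefix determines the code unit sequence, the sketch below counts the code units of the same character under the UTF-8, UTF-16, and UTF-32 prefixes. `code_units` is a hypothetical helper. Ordinary and wide literals are left out because their encodings are implementation-defined.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null code unit.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+00E9 (latin small letter e with acute)
static_assert(code_units(u8"\u00e9") == 2, "UTF-8 needs two code units");
static_assert(code_units(u"\u00e9") == 1, "UTF-16 needs one code unit");
static_assert(code_units(U"\u00e9") == 1, "UTF-32 needs one code unit");

// U+1F600, a character outside the Basic Multilingual Plane
static_assert(code_units(u8"\U0001F600") == 4, "UTF-8 needs four code units");
static_assert(code_units(u"\U0001F600") == 2, "UTF-16 needs a surrogate pair");
static_assert(code_units(U"\U0001F600") == 1, "UTF-32 needs one code unit");
```
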
[5](#lex.charset-5)
[#](http://github.com/Eelis/draft/tree/9adde4bc1c62ec234483e63ea3b70a59724c745a/source/lex.tex#L400)
A literal encoding or a locale-specific encoding of one of
the execution character sets ([[character.seq]](character.seq "16.3.3.3.4Character sequences"))
encodes each element of the basic literal character set as
a single code unit with non-negative value,
distinct from the code unit for any other such element[.](#lex.charset-5.sentence-1)
[*Note [3](#lex.charset-note-3)*:
A character not in the basic literal character set
can be encoded with more than one code unit;
the value of such a code unit can be the same as
that of a code unit for an element of the basic literal character set[.](#lex.charset-5.sentence-2)
— *end note*]
The U+0000 null character is encoded as the value 0[.](#lex.charset-5.sentence-3)
No other element of the translation character set
is encoded with a code unit of value 0[.](#lex.charset-5.sentence-4)
The code unit value of each decimal digit character after the digit 0 (U+0030)
shall be one greater than the value of the previous[.](#lex.charset-5.sentence-5)
The ordinary and wide literal encodings are otherwise implementation-defined[.](#lex.charset-5.sentence-6)
For a UTF-8, UTF-16, or UTF-32 literal,
the implementation shall encode
the Unicode scalar value
corresponding to each character of the translation character set
as specified in the Unicode Standard
for the respective Unicode encoding form[.](#lex.charset-5.sentence-7)
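
Two of the guarantees in this paragraph are commonly relied on in code: the null character has value 0, and the decimal digits are contiguous and ascending. A minimal sketch, with `is_decimal_digit` and `digit_value` as illustrative names:

```cpp
// Because '0'..'9' occupy ten consecutive code unit values, a range check
// identifies exactly the decimal digits in any conforming literal encoding.
constexpr bool is_decimal_digit(char c) { return c >= '0' && c <= '9'; }

// Subtracting '0' yields the numeric value of a decimal digit.
// Precondition: is_decimal_digit(c).
constexpr int digit_value(char c) { return c - '0'; }

static_assert('\0' == 0, "U+0000 is encoded as the value 0");
static_assert(digit_value('7') == 7, "digits are contiguous and ascending");
static_assert(L'9' - L'0' == 9, "the same holds for the wide literal encoding");
```

No similar guarantee exists for letters, so `c - 'a'` is portable only where the encoding is known.
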
### [5.3.2](#lex.universal.char) Universal character names [[lex.universal.char]](lex.universal.char)
[n-char:](#nt:n-char "5.3.2Universal character names[lex.universal.char]")
any member of the translation character set except the U+007d right curly bracket or new-line character
[n-char-sequence:](#nt:n-char-sequence "5.3.2Universal character names[lex.universal.char]")
[*n-char*](#nt:n-char "5.3.2Universal character names[lex.universal.char]") [*n-char-sequence*](#nt:n-char-sequence "5.3.2Universal character names[lex.universal.char]")<sub>opt</sub>
[named-universal-character:](#nt:named-universal-character "5.3.2Universal character names[lex.universal.char]")
\N{ [*n-char-sequence*](#nt:n-char-sequence "5.3.2Universal character names[lex.universal.char]") }
[hex-quad:](#nt:hex-quad "5.3.2Universal character names[lex.universal.char]")
[*hexadecimal-digit*](lex.icon#nt:hexadecimal-digit "5.13.2Integer literals[lex.icon]") [*hexadecimal-digit*](lex.icon#nt:hexadecimal-digit "5.13.2Integer literals[lex.icon]") [*hexadecimal-digit*](lex.icon#nt:hexadecimal-digit "5.13.2Integer literals[lex.icon]") [*hexadecimal-digit*](lex.icon#nt:hexadecimal-digit "5.13.2Integer literals[lex.icon]")
[simple-hexadecimal-digit-sequence:](#nt:simple-hexadecimal-digit-sequence "5.3.2Universal character names[lex.universal.char]")
[*hexadecimal-digit*](lex.icon#nt:hexadecimal-digit "5.13.2Integer literals[lex.icon]") [*simple-hexadecimal-digit-sequence*](#nt:simple-hexadecimal-digit-sequence "5.3.2Universal character names[lex.universal.char]")<sub>opt</sub>
[universal-character-name:](#nt:universal-character-name "5.3.2Universal character names[lex.universal.char]")
\u [*hex-quad*](#nt:hex-quad "5.3.2Universal character names[lex.universal.char]")
\U [*hex-quad*](#nt:hex-quad "5.3.2Universal character names[lex.universal.char]") [*hex-quad*](#nt:hex-quad "5.3.2Universal character names[lex.universal.char]")
\u{ [*simple-hexadecimal-digit-sequence*](#nt:simple-hexadecimal-digit-sequence "5.3.2Universal character names[lex.universal.char]") }
[*named-universal-character*](#nt:named-universal-character "5.3.2Universal character names[lex.universal.char]")
[1](#lex.universal.char-1)
[#](http://github.com/Eelis/draft/tree/9adde4bc1c62ec234483e63ea3b70a59724c745a/source/lex.tex#L467)
The [*universal-character-name*](#nt:universal-character-name "5.3.2Universal character names[lex.universal.char]") construct provides a way to name any
element in the translation character set using just the basic character set[.](#lex.universal.char-1.sentence-1)
If a [*universal-character-name*](#nt:universal-character-name "5.3.2Universal character names[lex.universal.char]") outside
the [*c-char-sequence*](lex.ccon#nt:c-char-sequence "5.13.3Character literals[lex.ccon]"), [*s-char-sequence*](lex.string#nt:s-char-sequence "5.13.5String literals[lex.string]"), or [*r-char-sequence*](lex.string#nt:r-char-sequence "5.13.5String literals[lex.string]") of a [*character-literal*](lex.ccon#nt:character-literal "5.13.3Character literals[lex.ccon]") or [*string-literal*](lex.string#nt:string-literal "5.13.5String literals[lex.string]") (in either case, including within a [*user-defined-literal*](lex.ext#nt:user-defined-literal "5.13.9User-defined literals[lex.ext]"))
corresponds to a control character or to a character in the basic character set,
the program is ill-formed[.](#lex.universal.char-1.sentence-2)
[*Note [1](#lex.universal.char-note-1)*:
A sequence of characters resembling a [*universal-character-name*](#nt:universal-character-name "5.3.2Universal character names[lex.universal.char]") in an [*r-char-sequence*](lex.string#nt:r-char-sequence "5.13.5String literals[lex.string]") ([[lex.string]](lex.string "5.13.5String literals")) does not form a [*universal-character-name*](#nt:universal-character-name "5.3.2Universal character names[lex.universal.char]")[.](#lex.universal.char-1.sentence-3)
— *end note*]
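
The effect described in the note can be observed through literal lengths. In this sketch the raw string keeps the six characters backslash, `u`, `0`, `0`, `e`, `9`, while the non-raw UTF-32 literal holds the single character U+00E9. The variable names are assumptions made for the example.

```cpp
#include <cstddef>

// Raw string: the backslash sequence is not a universal-character-name,
// so the literal contains six basic characters (one code unit each).
constexpr std::size_t raw_length = sizeof(R"(\u00e9)") - 1;

// Non-raw UTF-32 string: the universal-character-name designates one character.
constexpr std::size_t cooked_length = sizeof(U"\u00e9") / sizeof(char32_t) - 1;

static_assert(raw_length == 6, "six characters survive unchanged in the raw string");
static_assert(cooked_length == 1, "one character in the non-raw string");
```
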
[2](#lex.universal.char-2)
[#](http://github.com/Eelis/draft/tree/9adde4bc1c62ec234483e63ea3b70a59724c745a/source/lex.tex#L483)
A [*universal-character-name*](#nt:universal-character-name "5.3.2Universal character names[lex.universal.char]") of the form \u [*hex-quad*](#nt:hex-quad "5.3.2Universal character names[lex.universal.char]"), \U [*hex-quad*](#nt:hex-quad "5.3.2Universal character names[lex.universal.char]") [*hex-quad*](#nt:hex-quad "5.3.2Universal character names[lex.universal.char]"), or \u{[*simple-hexadecimal-digit-sequence*](#nt:simple-hexadecimal-digit-sequence "5.3.2Universal character names[lex.universal.char]")} designates the character in the translation character set
whose Unicode scalar value is the hexadecimal number represented by
the sequence of [*hexadecimal-digit*](lex.icon#nt:hexadecimal-digit "5.13.2Integer literals[lex.icon]")*s* in the [*universal-character-name*](#nt:universal-character-name "5.3.2Universal character names[lex.universal.char]")[.](#lex.universal.char-2.sentence-1)
The program is ill-formed if that number is not a Unicode scalar value[.](#lex.universal.char-2.sentence-2)
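
The rule above can be sketched as a small compile-time routine that turns the hexadecimal digits of a universal-character-name into a scalar value and rejects anything that is not one. This is not how any particular compiler implements it. `ScalarResult`, `hex_digit_value`, and `scalar_from_hex` are hypothetical names, and the letter arithmetic assumes that `a`..`f` and `A`..`F` are contiguous in the ordinary literal encoding, which holds for ASCII-based encodings but is not guaranteed by the standard.

```cpp
#include <cstddef>

struct ScalarResult {
    bool ok;         // false if the digits do not name a Unicode scalar value
    char32_t value;  // meaningful only when ok is true
};

// Value of one hexadecimal digit, or -1 if c is not a hexadecimal digit.
// Assumption: the letters a-f and A-F are contiguous in the encoding.
constexpr int hex_digit_value(char c) {
    return (c >= '0' && c <= '9') ? c - '0'
         : (c >= 'a' && c <= 'f') ? c - 'a' + 10
         : (c >= 'A' && c <= 'F') ? c - 'A' + 10
         : -1;
}

// Interprets n hexadecimal digits as a number and checks that it is a scalar value.
constexpr ScalarResult scalar_from_hex(const char* digits, std::size_t n) {
    if (n == 0) return {false, 0};
    char32_t v = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const int d = hex_digit_value(digits[i]);
        if (d < 0) return {false, 0};
        v = v * 16 + static_cast<char32_t>(d);
        if (v > 0x10FFFF) return {false, 0};  // outside the codespace
    }
    if (v >= 0xD800 && v <= 0xDFFF) return {false, 0};  // surrogate code point
    return {true, v};
}

// Convenience overload for string literals (drops the terminating null).
template <std::size_t N>
constexpr ScalarResult scalar_from_hex(const char (&digits)[N]) {
    return scalar_from_hex(digits, N - 1);
}

static_assert(scalar_from_hex("00e9").value == U'\u00e9', "matches the compiler");
static_assert(!scalar_from_hex("D800").ok, "a surrogate is rejected");
static_assert(!scalar_from_hex("110000").ok, "a value past 10FFFF is rejected");
```
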
[3](#lex.universal.char-3)
[#](http://github.com/Eelis/draft/tree/9adde4bc1c62ec234483e63ea3b70a59724c745a/source/lex.tex#L494)
A [*universal-character-name*](#nt:universal-character-name "5.3.2Universal character names[lex.universal.char]") that is a [*named-universal-character*](#nt:named-universal-character "5.3.2Universal character names[lex.universal.char]") designates the corresponding character
in the Unicode Standard (chapter 4.8 Name)
if the [*n-char-sequence*](#nt:n-char-sequence "5.3.2Universal character names[lex.universal.char]") is equal
to its character name or
to one of its character name aliases of
type “control”, “correction”, or “alternate”;
otherwise, the program is ill-formed[.](#lex.universal.char-3.sentence-1)
[*Note [2](#lex.universal.char-note-2)*:
These aliases are listed in
the Unicode Character Database's NameAliases.txt[.](#lex.universal.char-3.sentence-2)
None of these names or aliases have leading or trailing spaces[.](#lex.universal.char-3.sentence-3)
— *end note*]
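
As an illustration, a named-universal-character and the numeric forms can designate the same character. The sketch below guards the `\N{...}` checks with the `__cpp_named_character_escapes` feature-test macro, on the assumption that the compiler defines that macro when it supports the syntax. `named_escape_matches` is a name chosen for this example, and it is simply `true` when the feature is unavailable.

```cpp
// The \u and \U forms designate the same character.
static_assert(U'\u00e9' == U'\U000000e9', "both numeric forms name U+00E9");

#if defined(__cpp_named_character_escapes) && __cpp_named_character_escapes >= 202207L
// A character name, and a name alias of type "control".
constexpr bool named_escape_matches =
    U'\N{LATIN SMALL LETTER E WITH ACUTE}' == U'\u00e9' &&
    U'\N{LINE FEED}' == U'\n';
#else
// The compiler does not advertise named escapes, so nothing is checked here.
constexpr bool named_escape_matches = true;
#endif

static_assert(named_escape_matches, "named and numeric forms agree");
```
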