[lex.char]

# 5 Lexical conventions [[lex]](./#lex)

## 5.3 Characters [lex.char]

### [5.3.1](#lex.charset) Character sets [[lex.charset]](lex.charset)

[1](#lex.charset-1)

[#](http://github.com/Eelis/draft/tree/9adde4bc1c62ec234483e63ea3b70a59724c745a/source/lex.tex#L298)

The [*translation character set*](#def:character_set,translation "5.3.1 Character sets [lex.charset]") consists of the following elements:

- [(1.1)](#lex.charset-1.1) each abstract character assigned a code point in the Unicode codespace
  as specified in the Unicode Standard, and

- [(1.2)](#lex.charset-1.2) a distinct character for each Unicode scalar value
  not assigned to an abstract character[.](#lex.charset-1.sentence-1)

[*Note [1](#lex.charset-note-1)*:

Unicode code points are integers
in the range [0, 10FFFF] (hexadecimal)[.](#lex.charset-1.sentence-2)

A surrogate code point is a value
in the range [D800, DFFF] (hexadecimal)[.](#lex.charset-1.sentence-3)

A Unicode scalar value is any code point that is not a surrogate code point[.](#lex.charset-1.sentence-4)

— *end note*]

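The ranges in Note 1 can be written down as compile-time predicates. This is a minimal sketch: the helper names `is_code_point`, `is_surrogate`, and `is_scalar_value` are illustrative assumptions and do not appear in the standard.

```cpp
#include <cstdint>

// Hypothetical helpers; the names are illustrative only.

// Unicode code points are the integers in [0, 10FFFF] (hexadecimal).
constexpr bool is_code_point(std::uint32_t v) { return v <= 0x10FFFF; }

// Surrogate code points are the values in [D800, DFFF] (hexadecimal).
constexpr bool is_surrogate(std::uint32_t v) { return v >= 0xD800 && v <= 0xDFFF; }

// A Unicode scalar value is any code point that is not a surrogate code point.
constexpr bool is_scalar_value(std::uint32_t v) {
    return is_code_point(v) && !is_surrogate(v);
}

static_assert(is_scalar_value(0x0041), "U+0041 is a scalar value");
static_assert(!is_scalar_value(0xD800), "surrogate code points are excluded");
static_assert(is_scalar_value(0xE000), "first code point after the surrogate range");
static_assert(!is_scalar_value(0x110000), "outside the Unicode codespace");
```

The functions are written as single-return `constexpr` functions so that the sketch also compiles as C++11.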
[2](#lex.charset-2)

[#](http://github.com/Eelis/draft/tree/9adde4bc1c62ec234483e63ea3b70a59724c745a/source/lex.tex#L317)

The [*basic character set*](#def:character_set,basic "5.3.1 Character sets [lex.charset]") is a subset of the translation character set,
consisting of 99 characters as specified in Table [1](#tab:lex.charset.basic "Table 1: Basic character set")[.](#lex.charset-2.sentence-1)

[*Note [2](#lex.charset-note-2)*:

Unicode short names are given only as a means to identifying the character;
the numerical value has no other meaning in this context[.](#lex.charset-2.sentence-2)

— *end note*]

Table [1](#tab:lex.charset.basic) — Basic character set [[tab:lex.charset.basic]](./tab:lex.charset.basic)

| [🔗](#tab:lex.charset.basic-row-1)<br>**character** | | **glyph** |
| --- | --- | --- |
| [🔗](#tab:lex.charset.basic-row-2)<br>U+0009 | character tabulation | |
| [🔗](#tab:lex.charset.basic-row-3)<br>U+000b | line tabulation | |
| [🔗](#tab:lex.charset.basic-row-4)<br>U+000c | form feed | |
| [🔗](#tab:lex.charset.basic-row-5)<br>U+0020 | space | |
| [🔗](#tab:lex.charset.basic-row-6)<br>U+000a | line feed | new-line |
| [🔗](#tab:lex.charset.basic-row-7)<br>U+0021 | exclamation mark | ! |
| [🔗](#tab:lex.charset.basic-row-8)<br>U+0022 | quotation mark | " |
| [🔗](#tab:lex.charset.basic-row-9)<br>U+0023 | number sign | # |
| [🔗](#tab:lex.charset.basic-row-10)<br>U+0024 | dollar sign | $ |
| [🔗](#tab:lex.charset.basic-row-11)<br>U+0025 | percent sign | % |
| [🔗](#tab:lex.charset.basic-row-12)<br>U+0026 | ampersand | & |
| [🔗](#tab:lex.charset.basic-row-13)<br>U+0027 | apostrophe | ' |
| [🔗](#tab:lex.charset.basic-row-14)<br>U+0028 | left parenthesis | ( |
| [🔗](#tab:lex.charset.basic-row-15)<br>U+0029 | right parenthesis | ) |
| [🔗](#tab:lex.charset.basic-row-16)<br>U+002a | asterisk | * |
| [🔗](#tab:lex.charset.basic-row-17)<br>U+002b | plus sign | + |
| [🔗](#tab:lex.charset.basic-row-18)<br>U+002c | comma | , |
| [🔗](#tab:lex.charset.basic-row-19)<br>U+002d | hyphen-minus | - |
| [🔗](#tab:lex.charset.basic-row-20)<br>U+002e | full stop | . |
| [🔗](#tab:lex.charset.basic-row-21)<br>U+002f | solidus | / |
| [🔗](#tab:lex.charset.basic-row-22)<br>U+0030 .. U+0039 | digit zero .. nine | 0 1 2 3 4 5 6 7 8 9 |
| [🔗](#tab:lex.charset.basic-row-23)<br>U+003a | colon | : |
| [🔗](#tab:lex.charset.basic-row-24)<br>U+003b | semicolon | ; |
| [🔗](#tab:lex.charset.basic-row-25)<br>U+003c | less-than sign | < |
| [🔗](#tab:lex.charset.basic-row-26)<br>U+003d | equals sign | = |
| [🔗](#tab:lex.charset.basic-row-27)<br>U+003e | greater-than sign | > |
| [🔗](#tab:lex.charset.basic-row-28)<br>U+003f | question mark | ? |
| [🔗](#tab:lex.charset.basic-row-29)<br>U+0040 | commercial at | @ |
| [🔗](#tab:lex.charset.basic-row-30)<br>U+0041 .. U+005a | latin capital letter a .. z | A B C D E F G H I J K L M |
| [🔗](#tab:lex.charset.basic-row-31) | | N O P Q R S T U V W X Y Z |
| [🔗](#tab:lex.charset.basic-row-32)<br>U+005b | left square bracket | [ |
| [🔗](#tab:lex.charset.basic-row-33)<br>U+005c | reverse solidus | \ |
| [🔗](#tab:lex.charset.basic-row-34)<br>U+005d | right square bracket | ] |
| [🔗](#tab:lex.charset.basic-row-35)<br>U+005e | circumflex accent | ^ |
| [🔗](#tab:lex.charset.basic-row-36)<br>U+005f | low line | _ |
| [🔗](#tab:lex.charset.basic-row-37)<br>U+0060 | grave accent | \` |
| [🔗](#tab:lex.charset.basic-row-38)<br>U+0061 .. U+007a | latin small letter a .. z | a b c d e f g h i j k l m |
| [🔗](#tab:lex.charset.basic-row-39) | | n o p q r s t u v w x y z |
| [🔗](#tab:lex.charset.basic-row-40)<br>U+007b | left curly bracket | { |
| [🔗](#tab:lex.charset.basic-row-41)<br>U+007c | vertical line | \| |
| [🔗](#tab:lex.charset.basic-row-42)<br>U+007d | right curly bracket | } |
| [🔗](#tab:lex.charset.basic-row-43)<br>U+007e | tilde | ~ |

[3](#lex.charset-3)

[#](http://github.com/Eelis/draft/tree/9adde4bc1c62ec234483e63ea3b70a59724c745a/source/lex.tex#L372)

The [*basic literal character set*](#def:character_set,basic_literal "5.3.1 Character sets [lex.charset]") consists of
all characters of the basic character set,
plus the control characters specified in Table [2](#tab:lex.charset.literal "Table 2: Additional control characters in the basic literal character set")[.](#lex.charset-3.sentence-1)

Table [2](#tab:lex.charset.literal) — Additional control characters in the basic literal character set [[tab:lex.charset.literal]](./tab:lex.charset.literal)

| [🔗](#tab:lex.charset.literal-row-1)<br>**character** | |
| --- | --- |
| [🔗](#tab:lex.charset.literal-row-2)<br>U+0000 | null |
| [🔗](#tab:lex.charset.literal-row-3)<br>U+0007 | alert |
| [🔗](#tab:lex.charset.literal-row-4)<br>U+0008 | backspace |
| [🔗](#tab:lex.charset.literal-row-5)<br>U+000d | carriage return |

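Membership in the two sets described by Table 1 and Table 2 can be sketched as predicates over code points. The function names below are hypothetical and chosen only for illustration; the counting helpers exist to cross-check the figure of 99 characters stated above.

```cpp
// A minimal sketch; the function names are illustrative, not from the standard.

// True if cp is a member of the basic character set (Table 1).
constexpr bool in_basic_character_set(unsigned cp) {
    return cp == 0x09 || cp == 0x0A || cp == 0x0B || cp == 0x0C  // tab, line feed, line tab, form feed
        || (cp >= 0x20 && cp <= 0x7E);                            // space through tilde
}

// The basic literal character set adds null, alert, backspace,
// and carriage return (Table 2).
constexpr bool in_basic_literal_character_set(unsigned cp) {
    return in_basic_character_set(cp)
        || cp == 0x00 || cp == 0x07 || cp == 0x08 || cp == 0x0D;
}

// Count the members in [first, 0x7F] by recursion (C++11-style constexpr).
constexpr unsigned count_basic(unsigned first) {
    return first > 0x7F
        ? 0u
        : (in_basic_character_set(first) ? 1u : 0u) + count_basic(first + 1);
}

constexpr unsigned count_basic_literal(unsigned first) {
    return first > 0x7F
        ? 0u
        : (in_basic_literal_character_set(first) ? 1u : 0u) + count_basic_literal(first + 1);
}

static_assert(count_basic(0) == 99, "Table 1 lists 99 characters");
static_assert(count_basic_literal(0) == 103, "Table 2 adds four control characters");
```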
[4](#lex.charset-4)

[#](http://github.com/Eelis/draft/tree/9adde4bc1c62ec234483e63ea3b70a59724c745a/source/lex.tex#L386)

A [*code unit*](#def:code_unit "5.3.1 Character sets [lex.charset]") is an integer value
of character type ([[basic.fundamental]](basic.fundamental "6.9.2 Fundamental types"))[.](#lex.charset-4.sentence-1)

Characters in a [*character-literal*](lex.ccon#nt:character-literal "5.13.3 Character literals [lex.ccon]") other than a multicharacter or non-encodable character literal or
in a [*string-literal*](lex.string#nt:string-literal "5.13.5 String literals [lex.string]") are encoded as
a sequence of one or more code units, as determined
by the [*encoding-prefix*](lex.ccon#nt:encoding-prefix "5.13.3 Character literals [lex.ccon]") ([[lex.ccon]](lex.ccon "5.13.3 Character literals"), [[lex.string]](lex.string "5.13.5 String literals"));
this is termed the respective [*literal encoding*](#def:encoding,literal "5.3.1 Character sets [lex.charset]")[.](#lex.charset-4.sentence-2)

The [*ordinary literal encoding*](#def:encoding,ordinary_literal "5.3.1 Character sets [lex.charset]") is
the encoding applied to an ordinary character or string literal[.](#lex.charset-4.sentence-3)

The [*wide literal encoding*](#def:encoding,wide_literal "5.3.1 Character sets [lex.charset]") is the encoding applied
to a wide character or string literal[.](#lex.charset-4.sentence-4)

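As an illustration of how the encoding-prefix determines the code unit type and the number of code units, the following sketch inspects a few literals at compile time. `code_unit_count` is a hypothetical helper introduced here, not a standard facility. The `u8` character-literal case is left out because its type differs between language revisions.

```cpp
#include <cstddef>
#include <type_traits>

// The encoding-prefix (none, L, u, U) selects the code unit type of a character literal.
static_assert(std::is_same<decltype('a'), char>::value, "ordinary literal: char");
static_assert(std::is_same<decltype(L'a'), wchar_t>::value, "wide literal: wchar_t");
static_assert(std::is_same<decltype(u'a'), char16_t>::value, "UTF-16 literal: char16_t");
static_assert(std::is_same<decltype(U'a'), char32_t>::value, "UTF-32 literal: char32_t");

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null code unit.
template <class CodeUnit, std::size_t N>
constexpr std::size_t code_unit_count(const CodeUnit (&)[N]) { return N - 1; }

// U+00E9 takes two UTF-8 code units, one UTF-16 code unit, one UTF-32 code unit.
static_assert(code_unit_count(u8"\u00e9") == 2, "UTF-8");
static_assert(code_unit_count(u"\u00e9") == 1, "UTF-16");
static_assert(code_unit_count(U"\u00e9") == 1, "UTF-32");

// U+1F600 takes four UTF-8 code units and two UTF-16 code units (a surrogate pair).
static_assert(code_unit_count(u8"\U0001F600") == 4, "UTF-8");
static_assert(code_unit_count(u"\U0001F600") == 2, "UTF-16");
static_assert(code_unit_count(U"\U0001F600") == 1, "UTF-32");
```

Only the UTF-8, UTF-16, and UTF-32 counts are checked, since the ordinary and wide literal encodings of characters outside the basic literal character set are implementation-defined.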
[5](#lex.charset-5)

[#](http://github.com/Eelis/draft/tree/9adde4bc1c62ec234483e63ea3b70a59724c745a/source/lex.tex#L400)

A literal encoding or a locale-specific encoding of one of
the execution character sets ([[character.seq]](character.seq "16.3.3.3.4 Character sequences"))
encodes each element of the basic literal character set as
a single code unit with non-negative value,
distinct from the code unit for any other such element[.](#lex.charset-5.sentence-1)

[*Note [3](#lex.charset-note-3)*:

A character not in the basic literal character set
can be encoded with more than one code unit;
the value of such a code unit can be the same as
that of a code unit for an element of the basic literal character set[.](#lex.charset-5.sentence-2)

— *end note*]

The U+0000 null character is encoded as the value 0[.](#lex.charset-5.sentence-3)

No other element of the translation character set
is encoded with a code unit of value 0[.](#lex.charset-5.sentence-4)

The code unit value of each decimal digit character after the digit 0 (U+0030)
shall be one greater than the value of the previous[.](#lex.charset-5.sentence-5)

The ordinary and wide literal encodings are otherwise implementation-defined[.](#lex.charset-5.sentence-6)

For a UTF-8, UTF-16, or UTF-32 literal,
the implementation shall encode
the Unicode scalar value
corresponding to each character of the translation character set
as specified in the Unicode Standard
for the respective Unicode encoding form[.](#lex.charset-5.sentence-7)

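The guarantees in this paragraph can be checked at compile time. The sketch below is illustrative only; `digit_value` is a hypothetical helper, and the UTF code unit values are the ones the Unicode encoding forms give for U+00E9 and U+1F600.

```cpp
// The null character is encoded as the value 0 in every literal encoding.
static_assert('\0' == 0, "ordinary literal encoding");
static_assert(L'\0' == 0, "wide literal encoding");

// Decimal digits have consecutive code unit values, so subtracting '0'
// yields the numeric value of a digit in any conforming encoding.
constexpr int digit_value(char c) { return c - '0'; }
static_assert(digit_value('0') == 0, "first digit");
static_assert(digit_value('7') == 7, "digits are contiguous");
static_assert('9' - '0' == 9, "last digit");

// A basic literal character is a single code unit with a non-negative value.
static_assert(sizeof("a") == 2, "one code unit plus the terminating null");
static_assert('a' > 0 && '~' > 0, "non-negative and distinct from null");

// UTF-8, UTF-16, and UTF-32 literals follow the Unicode encoding forms.
// The cast makes the UTF-8 checks independent of whether the code unit
// type is char or char8_t.
static_assert(static_cast<unsigned char>(u8"\u00e9"[0]) == 0xC3, "UTF-8 lead byte");
static_assert(static_cast<unsigned char>(u8"\u00e9"[1]) == 0xA9, "UTF-8 continuation byte");
static_assert(u"\U0001F600"[0] == 0xD83D, "UTF-16 high surrogate");
static_assert(u"\U0001F600"[1] == 0xDE00, "UTF-16 low surrogate");
static_assert(U"\U0001F600"[0] == 0x1F600, "UTF-32 stores the scalar value directly");
```

Note that only the digits are guaranteed to be contiguous; code that computes `c - 'a'` for letters relies on the implementation-defined part of the ordinary literal encoding.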
### [5.3.2](#lex.universal.char) Universal character names [[lex.universal.char]](lex.universal.char)

[n-char:](#nt:n-char "5.3.2 Universal character names [lex.universal.char]")<br>
any member of the translation character set except the U+007d right curly bracket or new-line character

[n-char-sequence:](#nt:n-char-sequence "5.3.2 Universal character names [lex.universal.char]")<br>
[*n-char*](#nt:n-char "5.3.2 Universal character names [lex.universal.char]") [*n-char-sequence*](#nt:n-char-sequence "5.3.2 Universal character names [lex.universal.char]")<sub>opt</sub>

[named-universal-character:](#nt:named-universal-character "5.3.2 Universal character names [lex.universal.char]")<br>
\N{ [*n-char-sequence*](#nt:n-char-sequence "5.3.2 Universal character names [lex.universal.char]") }

[hex-quad:](#nt:hex-quad "5.3.2 Universal character names [lex.universal.char]")<br>
[*hexadecimal-digit*](lex.icon#nt:hexadecimal-digit "5.13.2 Integer literals [lex.icon]") [*hexadecimal-digit*](lex.icon#nt:hexadecimal-digit "5.13.2 Integer literals [lex.icon]") [*hexadecimal-digit*](lex.icon#nt:hexadecimal-digit "5.13.2 Integer literals [lex.icon]") [*hexadecimal-digit*](lex.icon#nt:hexadecimal-digit "5.13.2 Integer literals [lex.icon]")

[simple-hexadecimal-digit-sequence:](#nt:simple-hexadecimal-digit-sequence "5.3.2 Universal character names [lex.universal.char]")<br>
[*hexadecimal-digit*](lex.icon#nt:hexadecimal-digit "5.13.2 Integer literals [lex.icon]") [*simple-hexadecimal-digit-sequence*](#nt:simple-hexadecimal-digit-sequence "5.3.2 Universal character names [lex.universal.char]")<sub>opt</sub>

[universal-character-name:](#nt:universal-character-name "5.3.2 Universal character names [lex.universal.char]")<br>
\u [*hex-quad*](#nt:hex-quad "5.3.2 Universal character names [lex.universal.char]")<br>
\U [*hex-quad*](#nt:hex-quad "5.3.2 Universal character names [lex.universal.char]") [*hex-quad*](#nt:hex-quad "5.3.2 Universal character names [lex.universal.char]")<br>
\u{ [*simple-hexadecimal-digit-sequence*](#nt:simple-hexadecimal-digit-sequence "5.3.2 Universal character names [lex.universal.char]") }<br>
[*named-universal-character*](#nt:named-universal-character "5.3.2 Universal character names [lex.universal.char]")

[1](#lex.universal.char-1)

[#](http://github.com/Eelis/draft/tree/9adde4bc1c62ec234483e63ea3b70a59724c745a/source/lex.tex#L467)

The [*universal-character-name*](#nt:universal-character-name "5.3.2 Universal character names [lex.universal.char]") construct provides a way to name any
element in the translation character set using just the basic character set[.](#lex.universal.char-1.sentence-1)

If a [*universal-character-name*](#nt:universal-character-name "5.3.2 Universal character names [lex.universal.char]") outside
the [*c-char-sequence*](lex.ccon#nt:c-char-sequence "5.13.3 Character literals [lex.ccon]"), [*s-char-sequence*](lex.string#nt:s-char-sequence "5.13.5 String literals [lex.string]"), or [*r-char-sequence*](lex.string#nt:r-char-sequence "5.13.5 String literals [lex.string]") of a [*character-literal*](lex.ccon#nt:character-literal "5.13.3 Character literals [lex.ccon]") or [*string-literal*](lex.string#nt:string-literal "5.13.5 String literals [lex.string]") (in either case, including within a [*user-defined-literal*](lex.ext#nt:user-defined-literal "5.13.9 User-defined literals [lex.ext]"))
corresponds to a control character or to a character in the basic character set,
the program is ill-formed[.](#lex.universal.char-1.sentence-2)

[*Note [1](#lex.universal.char-note-1)*:

A sequence of characters resembling a [*universal-character-name*](#nt:universal-character-name "5.3.2 Universal character names [lex.universal.char]") in an [*r-char-sequence*](lex.string#nt:r-char-sequence "5.13.5 String literals [lex.string]") ([[lex.string]](lex.string "5.13.5 String literals")) does not form a [*universal-character-name*](#nt:universal-character-name "5.3.2 Universal character names [lex.universal.char]")[.](#lex.universal.char-1.sentence-3)

— *end note*]

[2](#lex.universal.char-2)

[#](http://github.com/Eelis/draft/tree/9adde4bc1c62ec234483e63ea3b70a59724c745a/source/lex.tex#L483)

A [*universal-character-name*](#nt:universal-character-name "5.3.2 Universal character names [lex.universal.char]") of the form \u [*hex-quad*](#nt:hex-quad "5.3.2 Universal character names [lex.universal.char]"), \U [*hex-quad*](#nt:hex-quad "5.3.2 Universal character names [lex.universal.char]") [*hex-quad*](#nt:hex-quad "5.3.2 Universal character names [lex.universal.char]"), or \u{[*simple-hexadecimal-digit-sequence*](#nt:simple-hexadecimal-digit-sequence "5.3.2 Universal character names [lex.universal.char]")} designates the character in the translation character set
whose Unicode scalar value is the hexadecimal number represented by
the sequence of [*hexadecimal-digit*](lex.icon#nt:hexadecimal-digit "5.13.2 Integer literals [lex.icon]")*s* in the [*universal-character-name*](#nt:universal-character-name "5.3.2 Universal character names [lex.universal.char]")[.](#lex.universal.char-2.sentence-1)

The program is ill-formed if that number is not a Unicode scalar value[.](#lex.universal.char-2.sentence-2)

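A short sketch of how the four-digit and eight-digit forms map to scalar values follows. `from_hex_quad` is a hypothetical helper that mirrors the "hexadecimal number represented by the sequence of digits" wording. The braced form and named characters are omitted because they require a recent language mode, and the ill-formed cases (for example a surrogate value such as D800) are not shown because they would not compile.

```cpp
// Hypothetical helper: the number represented by four hexadecimal digit values,
// most significant digit first.
constexpr char32_t from_hex_quad(unsigned d3, unsigned d2, unsigned d1, unsigned d0) {
    return static_cast<char32_t>((d3 << 12) | (d2 << 8) | (d1 << 4) | d0);
}

// The four-digit form designates the character whose scalar value is the hex number.
static_assert(U'\u00e9' == 0xE9, "four-digit form");
static_assert(U'\u00e9' == from_hex_quad(0x0, 0x0, 0xE, 0x9), "digits read as one number");

// The eight-digit form can designate the same character,
// and also characters above FFFF (hexadecimal).
static_assert(U'\U000000e9' == U'\u00e9', "eight-digit form, same character");
static_assert(U'\U0001F600' == 0x1F600, "eight-digit form, beyond four digits");

// Inside a character or string literal, a universal-character-name may name
// a member of the basic character set; outside a literal that is ill-formed.
static_assert(U'\u0041' == U'A', "U+0041 is LATIN CAPITAL LETTER A");
```

UTF-32 character literals are used throughout so that the checks do not depend on the implementation-defined ordinary literal encoding.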
[3](#lex.universal.char-3)

[#](http://github.com/Eelis/draft/tree/9adde4bc1c62ec234483e63ea3b70a59724c745a/source/lex.tex#L494)

A [*universal-character-name*](#nt:universal-character-name "5.3.2 Universal character names [lex.universal.char]") that is a [*named-universal-character*](#nt:named-universal-character "5.3.2 Universal character names [lex.universal.char]") designates the corresponding character
in the Unicode Standard (chapter 4.8 Name)
if the [*n-char-sequence*](#nt:n-char-sequence "5.3.2 Universal character names [lex.universal.char]") is equal
to its character name or
to one of its character name aliases of
type “control”, “correction”, or “alternate”;
otherwise, the program is ill-formed[.](#lex.universal.char-3.sentence-1)

[*Note [2](#lex.universal.char-note-2)*:

These aliases are listed in
the Unicode Character Database's NameAliases.txt[.](#lex.universal.char-3.sentence-2)

None of these names or aliases have leading or trailing spaces[.](#lex.universal.char-3.sentence-3)

— *end note*]