This commit is contained in:
2025-10-25 03:02:53 +03:00
commit 043225d523
3416 changed files with 681196 additions and 0 deletions

cppdraft/lex/string.md Normal file

@@ -0,0 +1,273 @@
[lex.string]
# 5 Lexical conventions [[lex]](./#lex)
## 5.13 Literals [[lex.literal]](lex.literal#lex.string)
### 5.13.5 String literals [lex.string]
[string-literal:](#nt:string-literal "5.13.5String literals[lex.string]")
[*encoding-prefix*](lex.ccon#nt:encoding-prefix "5.13.3Character literals[lex.ccon]")opt " [*s-char-sequence*](#nt:s-char-sequence "5.13.5String literals[lex.string]")opt "
[*encoding-prefix*](lex.ccon#nt:encoding-prefix "5.13.3Character literals[lex.ccon]")opt R [*raw-string*](#nt:raw-string "5.13.5String literals[lex.string]")
[s-char-sequence:](#nt:s-char-sequence "5.13.5String literals[lex.string]")
[*s-char*](#nt:s-char "5.13.5String literals[lex.string]") [*s-char-sequence*](#nt:s-char-sequence "5.13.5String literals[lex.string]")opt
[s-char:](#nt:s-char "5.13.5String literals[lex.string]")
[*basic-s-char*](#nt:basic-s-char "5.13.5String literals[lex.string]")
[*escape-sequence*](lex.ccon#nt:escape-sequence "5.13.3Character literals[lex.ccon]")
[*universal-character-name*](lex.universal.char#nt:universal-character-name "5.3.2Universal character names[lex.universal.char]")
[basic-s-char:](#nt:basic-s-char "5.13.5String literals[lex.string]")
any member of the translation character set except the U+0022 quotation mark,
U+005c reverse solidus, or new-line character
[raw-string:](#nt:raw-string "5.13.5String literals[lex.string]")
" [*d-char-sequence*](#nt:d-char-sequence "5.13.5String literals[lex.string]")opt ( [*r-char-sequence*](#nt:r-char-sequence "5.13.5String literals[lex.string]")opt ) [*d-char-sequence*](#nt:d-char-sequence "5.13.5String literals[lex.string]")opt "
[r-char-sequence:](#nt:r-char-sequence "5.13.5String literals[lex.string]")
[*r-char*](#nt:r-char "5.13.5String literals[lex.string]") [*r-char-sequence*](#nt:r-char-sequence "5.13.5String literals[lex.string]")opt
[r-char:](#nt:r-char "5.13.5String literals[lex.string]")
any member of the translation character set, except a U+0029 right parenthesis followed by
the initial [*d-char-sequence*](#nt:d-char-sequence "5.13.5String literals[lex.string]") (which may be empty) followed by a U+0022 quotation mark
[d-char-sequence:](#nt:d-char-sequence "5.13.5String literals[lex.string]")
[*d-char*](#nt:d-char "5.13.5String literals[lex.string]") [*d-char-sequence*](#nt:d-char-sequence "5.13.5String literals[lex.string]")opt
[d-char:](#nt:d-char "5.13.5String literals[lex.string]")
any member of the basic character set except:
U+0020 space, U+0028 left parenthesis, U+0029 right parenthesis, U+005c reverse solidus,
U+0009 character tabulation, U+000b line tabulation, U+000c form feed, and new-line
[1](#1)
[#](http://github.com/Eelis/draft/tree/9adde4bc1c62ec234483e63ea3b70a59724c745a/source/lex.tex#L1850)
The kind of a [*string-literal*](#nt:string-literal "5.13.5String literals[lex.string]"),
its type, and
its associated character encoding ([[lex.charset]](lex.charset "5.3.1Character sets"))
are determined by its encoding prefix and sequence of [*s-char*](#nt:s-char "5.13.5String literals[lex.string]")*s* or [*r-char*](#nt:r-char "5.13.5String literals[lex.string]")*s* as defined by Table [12](#tab:lex.string.literal "Table 12: String literals") where n is the number of encoded code units
that would result from an evaluation of the [*string-literal*](#nt:string-literal "5.13.5String literals[lex.string]") (see below)[.](#1.sentence-1)
Table [12](#tab:lex.string.literal) — String literals [[tab:lex.string.literal]](./tab:lex.string.literal)
| [🔗](#tab:lex.string.literal-row-1)<br>**Encoding prefix** | **Kind** | **Type** | **Associated character encoding** | **Examples** |
| --- | --- | --- | --- | --- |
| [🔗](#tab:lex.string.literal-row-4)<br>none | [*ordinary string literal*](#def:literal,string,ordinary "5.13.5String literals[lex.string]") | array of n const char | ordinary literal encoding | "ordinary string" R"(ordinary raw string)" |
| [🔗](#tab:lex.string.literal-row-5)<br>L | [*wide string literal*](#def:literal,string,wide "5.13.5String literals[lex.string]") | array of n const wchar_t | wide literal encoding | L"wide string" LR"w(wide raw string)w" |
| [🔗](#tab:lex.string.literal-row-6)<br>u8 | [*UTF-8 string literal*](#def:literal,string,UTF-8 "5.13.5String literals[lex.string]") | array of n const char8_t | UTF-8 | u8"UTF-8 string" u8R"x(UTF-8 raw string)x" |
| [🔗](#tab:lex.string.literal-row-7)<br>u | [*UTF-16 string literal*](#def:literal,string,UTF-16 "5.13.5String literals[lex.string]") | array of n const char16_t | UTF-16 | u"UTF-16 string" uR"y(UTF-16 raw string)y" |
| [🔗](#tab:lex.string.literal-row-8)<br>U | [*UTF-32 string literal*](#def:literal,string,UTF-32 "5.13.5String literals[lex.string]") | array of n const char32_t | UTF-32 | U"UTF-32 string" UR"z(UTF-32 raw string)z" |
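
The rows of Table 12 can be checked at compile time. The following is a minimal sketch, assuming a C++17 or later compiler (the `char8_t` check is guarded because that type only exists from C++20); the helper `element_count` is a hypothetical name introduced for illustration and is not part of the standard. It also illustrates that n counts encoded code units plus the terminating null, not characters.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical helper: number of array elements in a string literal,
// including the terminating null code unit.
template <class T, std::size_t N>
constexpr std::size_t element_count(const T (&)[N]) { return N; }

// The encoding prefix selects the array element type.
// A string literal is an lvalue, so decltype yields a reference to array.
static_assert(std::is_same_v<decltype("abc"),  const char (&)[4]>);
static_assert(std::is_same_v<decltype(L"abc"), const wchar_t (&)[4]>);
static_assert(std::is_same_v<decltype(u"abc"), const char16_t (&)[4]>);
static_assert(std::is_same_v<decltype(U"abc"), const char32_t (&)[4]>);
#ifdef __cpp_char8_t
static_assert(std::is_same_v<decltype(u8"abc"), const char8_t (&)[4]>);
#endif

// U+00E9 needs two UTF-8 code units; U+1F600 needs two UTF-16 code units
// but only one UTF-32 code unit. Each count includes the null terminator.
static_assert(element_count(u8"\u00E9") == 3);
static_assert(element_count(u"\U0001F600") == 3);
static_assert(element_count(U"\U0001F600") == 2);
```
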
[2](#2)
[#](http://github.com/Eelis/draft/tree/9adde4bc1c62ec234483e63ea3b70a59724c745a/source/lex.tex#L1909)
A [*string-literal*](#nt:string-literal "5.13.5String literals[lex.string]") that has an R in the prefix is a [*raw string literal*](#def:raw_string_literal "5.13.5String literals[lex.string]")[.](#2.sentence-1)
The [*d-char-sequence*](#nt:d-char-sequence "5.13.5String literals[lex.string]") serves as a delimiter[.](#2.sentence-2)
The terminating [*d-char-sequence*](#nt:d-char-sequence "5.13.5String literals[lex.string]") of a [*raw-string*](#nt:raw-string "5.13.5String literals[lex.string]") is the same sequence of
characters as the initial [*d-char-sequence*](#nt:d-char-sequence "5.13.5String literals[lex.string]")[.](#2.sentence-3)
A [*d-char-sequence*](#nt:d-char-sequence "5.13.5String literals[lex.string]") shall consist of at most 16 characters[.](#2.sentence-4)
[3](#3)
[#](http://github.com/Eelis/draft/tree/9adde4bc1c62ec234483e63ea3b70a59724c745a/source/lex.tex#L1919)
[*Note [1](#note-1)*:
The characters '(' and ')' can appear in a [*raw-string*](#nt:raw-string "5.13.5String literals[lex.string]")[.](#3.sentence-1)
Thus, R"delimiter((a|b))delimiter" is equivalent to "(a|b)"[.](#3.sentence-2)
— *end note*]
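
The delimiter rules can be illustrated with a short compile-time sketch, assuming C++17 for `constexpr` comparison of `std::string_view`. The variable names are hypothetical. The second literal shows the case where a non-empty delimiter is needed, because the content itself contains a right parenthesis followed by a quotation mark.

```cpp
#include <string_view>

// Same content written as a non-raw literal and as a raw literal.
constexpr std::string_view plain = "(a|b)";
constexpr std::string_view raw   = R"delimiter((a|b))delimiter";
static_assert(plain == raw);

// With an empty delimiter, the sequence )" would end the raw string early.
// The delimiter x moves the terminator to )x" instead.
constexpr std::string_view tricky = R"x(say )" here)x";
static_assert(tricky == "say )\" here");

// A backslash is an ordinary character inside a raw string:
// this literal holds a backslash and the letter n, not a new-line.
constexpr std::string_view not_an_escape = R"(\n)";
static_assert(not_an_escape.size() == 2);
```
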
[4](#4)
[#](http://github.com/Eelis/draft/tree/9adde4bc1c62ec234483e63ea3b70a59724c745a/source/lex.tex#L1926)
[*Note [2](#note-2)*:
A source-file new-line in a raw string literal results in a new-line in the
resulting execution string literal[.](#4.sentence-1)
Assuming no
whitespace at the beginning of lines in the following example, the assert will succeed:
const char* p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);
— *end note*]
[5](#5)
[#](http://github.com/Eelis/draft/tree/9adde4bc1c62ec234483e63ea3b70a59724c745a/source/lex.tex#L1939)
[*Example [1](#example-1)*:
The raw string
R"a(
)\
a"
)a"
is equivalent to "\n)\\\na\"\n"[.](#5.sentence-1)
The raw string R"(x = "\"y\"")" is equivalent to "x = \"\\\"y\\\"\""[.](#5.sentence-2)
— *end example*]
[6](#6)
[#](http://github.com/Eelis/draft/tree/9adde4bc1c62ec234483e63ea3b70a59724c745a/source/lex.tex#L1955)
Ordinary string literals and UTF-8 string literals are
also referred to as [*narrow string literals*](#def:literal,string,narrow "5.13.5String literals[lex.string]")[.](#6.sentence-1)
[7](#7)
[#](http://github.com/Eelis/draft/tree/9adde4bc1c62ec234483e63ea3b70a59724c745a/source/lex.tex#L1960)
The [*string-literal*](#nt:string-literal "5.13.5String literals[lex.string]")*s* in
any sequence of adjacent [*string-literal*](#nt:string-literal "5.13.5String literals[lex.string]")*s* shall have at most one unique [*encoding-prefix*](lex.ccon#nt:encoding-prefix "5.13.3Character literals[lex.ccon]") among them[.](#7.sentence-1)
The common [*encoding-prefix*](lex.ccon#nt:encoding-prefix "5.13.3Character literals[lex.ccon]") of the sequence is
that [*encoding-prefix*](lex.ccon#nt:encoding-prefix "5.13.3Character literals[lex.ccon]"), if any[.](#7.sentence-2)
[*Note [3](#note-3)*:
A [*string-literal*](#nt:string-literal "5.13.5String literals[lex.string]")'s rawness has
no effect on the determination of the common [*encoding-prefix*](lex.ccon#nt:encoding-prefix "5.13.3Character literals[lex.ccon]")[.](#7.sentence-3)
— *end note*]
[8](#8)
[#](http://github.com/Eelis/draft/tree/9adde4bc1c62ec234483e63ea3b70a59724c745a/source/lex.tex#L1972)
In translation phase 6 ([[lex.phases]](lex.phases "5.2Phases of translation")),
adjacent [*string-literal*](#nt:string-literal "5.13.5String literals[lex.string]")*s* are concatenated[.](#8.sentence-1)
The lexical structure and grouping of
the contents of the individual [*string-literal*](#nt:string-literal "5.13.5String literals[lex.string]")*s* is retained[.](#8.sentence-2)
[*Example [2](#example-2)*:
"\xA" "B" represents
the code unit '\xA' and the character 'B' after concatenation
(and not the single code unit '\xAB')[.](#8.sentence-3)
Similarly, R"(\u00)" "41" represents six characters,
starting with a backslash and ending with the digit 1 (and not the single character 'A' specified by a [*universal-character-name*](lex.universal.char#nt:universal-character-name "5.3.2Universal character names[lex.universal.char]"))[.](#8.sentence-4)
Table [13](#tab:lex.string.concat "Table 13: String literal concatenations") has some examples of valid concatenations[.](#8.sentence-5)
— *end example*]
Table [13](#tab:lex.string.concat) — String literal concatenations [[tab:lex.string.concat]](./tab:lex.string.concat)
| [🔗](#tab:lex.string.concat-row-1)<br>Source | | Means | Source | | Means | Source | | Means |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [🔗](#tab:lex.string.concat-row-2)<br>u"a" | u"b" | u"ab" | U"a" | U"b" | U"ab" | L"a" | L"b" | L"ab" |
| [🔗](#tab:lex.string.concat-row-3)<br>u"a" | "b" | u"ab" | U"a" | "b" | U"ab" | L"a" | "b" | L"ab" |
| [🔗](#tab:lex.string.concat-row-4)<br>"a" | u"b" | u"ab" | "a" | U"b" | U"ab" | "a" | L"b" | L"ab" |
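
The concatenation rules in Example 2 and Table 13 can be checked with a compile-time sketch, assuming a C++17 or later compiler. The helper `element_count` is a hypothetical name introduced for illustration. Element counts include the terminating null code unit.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical helper: number of array elements, including the null.
template <class T, std::size_t N>
constexpr std::size_t element_count(const T (&)[N]) { return N; }

// An unprefixed literal takes on the common encoding-prefix of the sequence.
static_assert(std::is_same_v<decltype(u"a" "b"),  const char16_t (&)[3]>);
static_assert(std::is_same_v<decltype("a" U"b"),  const char32_t (&)[3]>);
static_assert(std::is_same_v<decltype(L"a" L"b"), const wchar_t (&)[3]>);

// The lexical structure of each literal is retained:
// "\xA" "B" is the code unit 0xA followed by 'B', not the code unit 0xAB.
static_assert(element_count("\xA" "B") == 3);
static_assert(("\xA" "B")[0] == '\xA' && ("\xA" "B")[1] == 'B');
static_assert(element_count("\xAB") == 2);

// The raw part keeps its backslash, so no universal-character-name is formed:
// six characters plus the terminating null.
static_assert(element_count(R"(\u00)" "41") == 7);
```
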
[9](#9)
[#](http://github.com/Eelis/draft/tree/9adde4bc1c62ec234483e63ea3b70a59724c745a/source/lex.tex#L2017)
Evaluating a [*string-literal*](#nt:string-literal "5.13.5String literals[lex.string]") results in a string literal object
with static storage duration ([[basic.stc]](basic.stc "6.8.6Storage duration"))[.](#9.sentence-1)
[*Note [4](#note-4)*:
String literal objects are potentially non-unique ([[intro.object]](intro.object "6.8.2Object model"))[.](#9.sentence-2)
Whether successive evaluations of a [*string-literal*](#nt:string-literal "5.13.5String literals[lex.string]") yield the same or a different object is
unspecified[.](#9.sentence-3)
— *end note*]
[*Note [5](#note-5)*:
The effect of attempting to modify a string literal object is undefined[.](#9.sentence-4)
— *end note*]
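
A minimal sketch of what these properties imply for everyday code, assuming nothing beyond the standard library; the function names are hypothetical. A pointer to a string literal object remains valid after the function returns, contents rather than addresses should be compared, and a copy is needed before modifying.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper: the returned pointer stays valid after the call,
// because the string literal object has static storage duration.
const char* greeting() { return "hello"; }

// Hypothetical helper: to obtain modifiable characters, copy the literal
// into an array first. Writing through greeting() would be undefined.
std::string capitalized_greeting() {
    char copy[] = "hello";  // a distinct array initialized from the literal
    copy[0] = 'H';
    return std::string(copy);
}

// Compare contents, not addresses: whether two evaluations of "hello"
// yield the same object is unspecified.
void check_greeting() {
    assert(std::strcmp(greeting(), "hello") == 0);
}
```
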
[10](#10)
[#](http://github.com/Eelis/draft/tree/9adde4bc1c62ec234483e63ea3b70a59724c745a/source/lex.tex#L2031)
String literal objects are initialized with
the sequence of code unit values
corresponding to the [*string-literal*](#nt:string-literal "5.13.5String literals[lex.string]")'s sequence of [*s-char*](#nt:s-char "5.13.5String literals[lex.string]")*s* (originally from non-raw string literals) and [*r-char*](#nt:r-char "5.13.5String literals[lex.string]")*s* (originally from raw string literals),
plus a terminating U+0000 null character,
in order as follows:
- [(10.1)](#10.1)
The sequence of characters denoted by each contiguous sequence of [*basic-s-char*](#nt:basic-s-char "5.13.5String literals[lex.string]")*s*, [*r-char*](#nt:r-char "5.13.5String literals[lex.string]")*s*, [*simple-escape-sequence*](lex.ccon#nt:simple-escape-sequence "5.13.3Character literals[lex.ccon]")*s* ([[lex.ccon]](lex.ccon "5.13.3Character literals")), and [*universal-character-name*](lex.universal.char#nt:universal-character-name "5.3.2Universal character names[lex.universal.char]")*s* ([[lex.charset]](lex.charset "5.3.1Character sets"))
is encoded to a code unit sequence
using the [*string-literal*](#nt:string-literal "5.13.5String literals[lex.string]")'s associated character encoding[.](#10.1.sentence-1)
If a character lacks representation in the associated character encoding,
then the program is ill-formed[.](#10.1.sentence-2)
[*Note [6](#note-6)*:
No character lacks representation in any Unicode encoding form[.](#10.1.sentence-3)
— *end note*]
When encoding a stateful character encoding,
implementations should encode the first such sequence
beginning with the initial encoding state and
encode subsequent sequences
beginning with the final encoding state of the prior sequence[.](#10.1.sentence-4)
[*Note [7](#note-7)*:
The encoded code unit sequence can differ from
the sequence of code units that would be obtained by
encoding each character independently[.](#10.1.sentence-5)
— *end note*]
- [(10.2)](#10.2)
Each [*numeric-escape-sequence*](lex.ccon#nt:numeric-escape-sequence "5.13.3Character literals[lex.ccon]") ([[lex.ccon]](lex.ccon "5.13.3Character literals"))
contributes a single code unit with a value as follows:
* [(10.2.1)](#10.2.1)
Let v be the integer value represented by
the octal number comprising
the sequence of [*octal-digit*](lex.icon#nt:octal-digit "5.13.2Integer literals[lex.icon]")*s* in
an [*octal-escape-sequence*](lex.ccon#nt:octal-escape-sequence "5.13.3Character literals[lex.ccon]") or by
the hexadecimal number comprising
the sequence of [*hexadecimal-digit*](lex.icon#nt:hexadecimal-digit "5.13.2Integer literals[lex.icon]")*s* in
a [*hexadecimal-escape-sequence*](lex.ccon#nt:hexadecimal-escape-sequence "5.13.3Character literals[lex.ccon]")[.](#10.2.1.sentence-1)
* [(10.2.2)](#10.2.2)
If v does not exceed the range of representable values of
the [*string-literal*](#nt:string-literal "5.13.5String literals[lex.string]")'s array element type,
then the value is v[.](#10.2.2.sentence-1)
* [(10.2.3)](#10.2.3)
Otherwise,
if the [*string-literal*](#nt:string-literal "5.13.5String literals[lex.string]")'s [*encoding-prefix*](lex.ccon#nt:encoding-prefix "5.13.3Character literals[lex.ccon]") is absent or L, and v does not exceed the range of representable values of
the corresponding unsigned type for the underlying type of
the [*string-literal*](#nt:string-literal "5.13.5String literals[lex.string]")'s array element type,
then the value is the unique value of
the [*string-literal*](#nt:string-literal "5.13.5String literals[lex.string]")'s array element type T that is congruent to v modulo 2^N, where N is the width of T[.](#10.2.3.sentence-1)
* [(10.2.4)](#10.2.4)
Otherwise, the program is ill-formed[.](#10.2.4.sentence-1)
When encoding a stateful character encoding,
these sequences should have no effect on encoding state[.](#10.2.sentence-2)
- [(10.3)](#10.3)
Each [*conditional-escape-sequence*](lex.ccon#nt:conditional-escape-sequence "5.13.3Character literals[lex.ccon]") ([[lex.ccon]](lex.ccon "5.13.3Character literals"))
contributes an implementation-defined
code unit sequence[.](#10.3.sentence-1)
When encoding a stateful character encoding,
it is implementation-defined
what effect these sequences have on encoding state[.](#10.3.sentence-2)
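
The rules for numeric escape sequences in (10.2) can be sketched with compile-time checks, assuming a C++17 or later compiler. Each escape contributes exactly one code unit. The cast to `unsigned char` is used so that the check holds whether plain `char` is signed or unsigned on the implementation.

```cpp
// (10.2.2): v fits in the array element type, so the value is v.
static_assert("\101"[0] == 65);              // octal 101 is 65
static_assert(u"\xFFFF"[0] == 0xFFFF);       // fits in char16_t
static_assert(U"\x10FFFF"[0] == 0x10FFFF);   // fits in char32_t

// (10.2.3): no prefix, and 0xFF fits in unsigned char. The stored value is
// the char value congruent to 0xFF modulo 2^N, where N is the width of char.
static_assert(static_cast<unsigned char>("\xFF"[0]) == 0xFF);

// Each numeric escape yields a single code unit: two escapes plus the null.
static_assert(sizeof("\x41\x42") == 3);

// (10.2.4): under these rules a literal such as u8"\x100" is expected to be
// rejected, since 0x100 exceeds the range of char8_t and the prefix is
// neither absent nor L. It is left commented out so the sketch compiles.
// constexpr auto bad = u8"\x100";
```
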