5 Lexical conventions [lex]
5.13 Literals [lex.literal]
5.13.5 String literals [lex.string]
string-literal:
    encoding-prefix_opt " s-char-sequence_opt "
    encoding-prefix_opt R raw-string
s-char-sequence:
    s-char s-char-sequence_opt
s-char:
    basic-s-char
    escape-sequence
    universal-character-name
basic-s-char:
    any member of the translation character set except the U+0022 quotation mark,
    U+005c reverse solidus, or new-line character
raw-string:
    " d-char-sequence_opt ( r-char-sequence_opt ) d-char-sequence_opt "
r-char-sequence:
    r-char r-char-sequence_opt
r-char:
    any member of the translation character set, except a U+0029 right parenthesis followed by
    the initial d-char-sequence (which may be empty) followed by a U+0022 quotation mark
d-char-sequence:
    d-char d-char-sequence_opt
d-char:
    any member of the basic character set except:
    U+0020 space, U+0028 left parenthesis, U+0029 right parenthesis, U+005c reverse solidus,
    U+0009 character tabulation, U+000b line tabulation, U+000c form feed, and new-line
The kind of a string-literal, its type, and its associated character encoding ([lex.charset]) are determined by its encoding prefix and sequence of s-chars or r-chars as defined by Table 12, where n is the number of encoded code units that would result from an evaluation of the string-literal (see below).
Table 12 — String literals [tab:lex.string.literal]
| Encoding prefix | Kind | Type | Associated character encoding | Examples |
|---|---|---|---|---|
| none | ordinary string literal | array of n const char | ordinary literal encoding | "ordinary string", R"(ordinary raw string)" |
| L | wide string literal | array of n const wchar_t | wide literal encoding | L"wide string", LR"w(wide raw string)w" |
| u8 | UTF-8 string literal | array of n const char8_t | UTF-8 | u8"UTF-8 string", u8R"x(UTF-8 raw string)x" |
| u | UTF-16 string literal | array of n const char16_t | UTF-16 | u"UTF-16 string", uR"y(UTF-16 raw string)y" |
| U | UTF-32 string literal | array of n const char32_t | UTF-32 | U"UTF-32 string", UR"z(UTF-32 raw string)z" |
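The types in Table 12 can be checked at compile time. The following is a minimal sketch, assuming a C++17 or later compiler; the helper names `code_units` and `check_code_unit_counts` are hypothetical and not part of the standard. Because a string-literal is an lvalue, `decltype` yields a reference to the array type, and n includes the terminating null code unit.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// A string-literal is an lvalue, so decltype yields "reference to array of n const T".
static_assert(std::is_same_v<decltype("abc"),    const char (&)[4]>);
static_assert(std::is_same_v<decltype(R"(abc)"), const char (&)[4]>);
static_assert(std::is_same_v<decltype(L"abc"),   const wchar_t (&)[4]>);
static_assert(std::is_same_v<decltype(u"abc"),   const char16_t (&)[4]>);
static_assert(std::is_same_v<decltype(U"abc"),   const char32_t (&)[4]>);
#ifdef __cpp_char8_t
// char8_t exists only from C++20 on; earlier, u8"abc" has element type char.
static_assert(std::is_same_v<decltype(u8"abc"),  const char8_t (&)[4]>);
#endif

// Hypothetical helper: number of code units in the array, terminator included.
template <class T, std::size_t N>
constexpr std::size_t code_units(const T (&)[N]) { return N; }

// Hypothetical run-time check: n counts code units, not characters.
inline void check_code_unit_counts() {
    assert(code_units("abc") == 4);            // 3 code units + null
    assert(code_units(u8"\u00e9") == 3);       // 2 UTF-8 code units + null
    assert(code_units(u"\U0001F600") == 3);    // surrogate pair + null
    assert(code_units(U"\U0001F600") == 2);    // 1 UTF-32 code unit + null
}
```

The last three checks show why n is defined in terms of encoded code units: the same single character occupies a different number of array elements under each encoding.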
A string-literal that has an R in the prefix is a raw string literal.
The d-char-sequence serves as a delimiter.
The terminating d-char-sequence of a raw-string is the same sequence of characters as the initial d-char-sequence.
A d-char-sequence shall consist of at most 16 characters.
[Note 1:
The characters '(' and ')' can appear in a raw-string.
Thus, R"delimiter((a|b))delimiter" is equivalent to "(a|b)".
-- end note]
[Note 2:
A source-file new-line in a raw string literal results in a new-line in the resulting execution string literal.
Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:
const char* p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);
-- end note]
[Example 1:
The raw string
R"a(
)\
a"
)a"
is equivalent to "\n)\\\na\"\n".
The raw string
R"(x = "\"y\"")"
is equivalent to "x = \"\\\"y\\\"\"".
-- end example]
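The delimiter rules for raw string literals can be exercised with a few comparisons. This is a minimal sketch; the helper names `same` and `raw_examples_hold` are hypothetical. Each raw string literal is paired with the ordinary string literal that should have the same contents.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: true if two narrow strings have identical contents.
inline bool same(const char* a, const char* b) { return std::strcmp(a, b) == 0; }

// Hypothetical check: each raw string literal matches its escaped counterpart.
inline bool raw_examples_hold() {
    return same(R"delimiter((a|b))delimiter", "(a|b)")  // parentheses inside the raw-string
        && same(R"(C:\temp\new)", "C:\\temp\\new")      // backslashes are not escape sequences
        && same(R"(say "hi")", "say \"hi\"")            // quotation marks need no escaping
        && same(R"x(a)"b)x", "a)\"b");                  // )" does not terminate; only )x" does
}
```

The last case is the reason the d-char-sequence exists: without the delimiter x, the sequence `)"` inside the contents would end the literal early.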
Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals.
The string-literals in any sequence of adjacent string-literals shall have at most one unique encoding-prefix among them.
The common encoding-prefix of the sequence is that encoding-prefix, if any.
[Note 3:
A string-literal's rawness has no effect on the determination of the common encoding-prefix.
-- end note]
In translation phase 6 ([lex.phases]), adjacent string-literals are concatenated.
The lexical structure and grouping of the contents of the individual string-literals is retained.
[Example 2:
"\xA" "B" represents the code unit '\xA' and the character 'B' after concatenation (and not the single code unit '\xAB').
Similarly, R"(\u00)" "41" represents six characters, starting with a backslash and ending with the digit 1 (and not the single character 'A' specified by a universal-character-name).
Table 13 has some examples of valid concatenations.
-- end example]
Table 13 — String literal concatenations [tab:lex.string.concat]
| Source | Means | Source | Means | Source | Means |
|---|---|---|---|---|---|
| u"a" u"b" | u"ab" | U"a" U"b" | U"ab" | L"a" L"b" | L"ab" |
| u"a" "b" | u"ab" | U"a" "b" | U"ab" | L"a" "b" | L"ab" |
| "a" u"b" | u"ab" | "a" U"b" | U"ab" | "a" L"b" | L"ab" |
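The concatenation rules can be confirmed at compile time. The sketch below assumes a C++17 or later compiler; the names `joined`, `raw_joined`, and `check_concatenation` are hypothetical. It covers both cases from Example 2 and three prefix combinations in the style of Table 13.

```cpp
#include <cassert>
#include <type_traits>

// Escape sequences are interpreted per literal, before concatenation.
constexpr char joined[] = "\xA" "B";
static_assert(sizeof(joined) == 3);                     // '\xA', 'B', '\0'
static_assert(joined[0] == '\xA' && joined[1] == 'B');

// The raw literal contributes \u00 verbatim; no universal-character-name is formed.
constexpr char raw_joined[] = R"(\u00)" "41";
static_assert(sizeof(raw_joined) == 7);                 // six characters plus '\0'
static_assert(raw_joined[0] == '\\' && raw_joined[1] == 'u' && raw_joined[5] == '1');

// An unprefixed literal adopts the common encoding-prefix of the sequence;
// rawness plays no part in that determination.
static_assert(std::is_same_v<decltype(u"a" "b"),    const char16_t (&)[3]>);
static_assert(std::is_same_v<decltype("a" L"b"),    const wchar_t (&)[3]>);
static_assert(std::is_same_v<decltype(U"a" R"(b)"), const char32_t (&)[3]>);

// Hypothetical run-time check of the same facts.
inline void check_concatenation() {
    assert(joined[1] == 'B');
    assert(raw_joined[4] == '4');
}
```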
Evaluating a string-literal results in a string literal object with static storage duration ([basic.stc]).
[Note 4:
String literal objects are potentially non-unique ([intro.object]).
Whether successive evaluations of a string-literal yield the same or a different object is unspecified.
-- end note]
[Note 5:
The effect of attempting to modify a string literal object is undefined.
-- end note]
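Two practical consequences of Notes 4 and 5 are sketched below; the function names `greeting` and `literal_object_demo` are hypothetical. A pointer to a string literal object remains valid after the function that evaluated the literal returns, because the object has static storage duration. Code that needs modifiable text can initialize its own array from the literal instead of writing through a pointer to the literal object. The sketch deliberately avoids comparing addresses of string literal objects, since whether two evaluations yield the same object is unspecified.

```cpp
#include <cassert>
#include <cstring>

// The returned pointer stays valid: the string literal object has
// static storage duration, not the lifetime of this call.
inline const char* greeting() { return "hello"; }

// Hypothetical demo: read through the pointer, modify only a separate array.
inline bool literal_object_demo() {
    const char* p = greeting();   // still points to a live object
    char buf[] = "hello";         // distinct array initialized from the literal
    buf[0] = 'j';                 // fine: buf is not a string literal object
    return std::strcmp(p, "hello") == 0 && std::strcmp(buf, "jello") == 0;
}
```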
String literal objects are initialized with the sequence of code unit values corresponding to the string-literal's sequence of s-chars (originally from non-raw string literals) and r-chars (originally from raw string literals), plus a terminating U+0000 null character, in order as follows:
- The sequence of characters denoted by each contiguous sequence of basic-s-chars, r-chars, simple-escape-sequences ([lex.ccon]), and universal-character-names ([lex.charset]) is encoded to a code unit sequence using the string-literal's associated character encoding. If a character lacks representation in the associated character encoding, then the program is ill-formed.
  [Note 6: No character lacks representation in any Unicode encoding form. -- end note]
  When encoding a stateful character encoding, implementations should encode the first such sequence beginning with the initial encoding state and encode subsequent sequences beginning with the final encoding state of the prior sequence.
  [Note 7: The encoded code unit sequence can differ from the sequence of code units that would be obtained by encoding each character independently. -- end note]
- Each numeric-escape-sequence ([lex.ccon]) contributes a single code unit with a value as follows:
  - Let v be the integer value represented by the octal number comprising the sequence of octal-digits in an octal-escape-sequence or by the hexadecimal number comprising the sequence of hexadecimal-digits in a hexadecimal-escape-sequence.
  - If v does not exceed the range of representable values of the string-literal's array element type, then the value is v.
  - Otherwise, if the string-literal's encoding-prefix is absent or L, and v does not exceed the range of representable values of the corresponding unsigned type for the underlying type of the string-literal's array element type, then the value is the unique value of the string-literal's array element type T that is congruent to v modulo 2^N, where N is the width of T.
  - Otherwise, the program is ill-formed.

  When encoding a stateful character encoding, these sequences should have no effect on encoding state.
- Each conditional-escape-sequence ([lex.ccon]) contributes an implementation-defined code unit sequence. When encoding a stateful character encoding, it is implementation-defined what effect these sequences have on encoding state.
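The rules for numeric-escape-sequences can be illustrated with a short sketch, assuming a C++17 or later compiler; the names `ordinary`, `octal`, `utf16`, and `check_numeric_escapes` are hypothetical. Each numeric escape contributes exactly one code unit whose value is given directly, independent of the associated character encoding.

```cpp
#include <cassert>

// v = 255. Where char is signed and 8 bits wide, 255 is not representable in
// char but is representable in unsigned char, so the code unit is the value
// congruent to 255 modulo 2^8. Viewed as unsigned char it is 0xFF either way.
constexpr char ordinary[] = "\xFF";
static_assert(sizeof(ordinary) == 2);                            // one code unit + null
static_assert(static_cast<unsigned char>(ordinary[0]) == 0xFF);

// v = 65 (octal 101) fits in char, so the value is v itself.
constexpr char octal[] = "\101";
static_assert(octal[0] == 65);

// v = 0xFFFF fits in char16_t and contributes a single UTF-16 code unit.
constexpr char16_t utf16[] = u"\xFFFF";
static_assert(sizeof(utf16) == 2 * sizeof(char16_t));
static_assert(utf16[0] == 0xFFFF);

// Ill-formed where char is 8 bits wide: 0x100 exceeds the range of both
// char and unsigned char.
// constexpr char bad[] = "\x100";

// Hypothetical run-time check of the same facts.
inline void check_numeric_escapes() {
    assert(static_cast<unsigned char>(ordinary[0]) == 0xFF);
    assert(utf16[0] == 0xFFFF);
}
```

The cast to `unsigned char` keeps the first check independent of whether plain `char` is signed, which is implementation-defined.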