2025-10-25 03:02:53 +03:00

[text.encoding.class]

28 Text processing library [text]

28.4 Text encodings identification [text.encoding]

28.4.2 Class text_encoding [text.encoding.class]

28.4.2.1 Overview [text.encoding.overview]

1

The class text_encoding describes an interface for accessing the IANA Character Sets registry[bib].

namespace std {
  struct text_encoding {
    static constexpr size_t max_name_length = 63;

    // [text.encoding.id], enumeration text_encoding::id
    enum class id : int_least32_t {
      see below
    };
    using enum id;

    constexpr text_encoding() = default;
    constexpr explicit text_encoding(string_view enc) noexcept;
    constexpr text_encoding(id i) noexcept;

    constexpr id mib() const noexcept;
    constexpr const char* name() const noexcept;

    // [text.encoding.aliases], class text_encoding::aliases_view
    struct aliases_view;
    constexpr aliases_view aliases() const noexcept;

    friend constexpr bool operator==(const text_encoding& a,
                                     const text_encoding& b) noexcept;
    friend constexpr bool operator==(const text_encoding& encoding, id i) noexcept;

    static consteval text_encoding literal() noexcept;
    static text_encoding environment();
    template<id i> static bool environment_is();

  private:
    id mib_ = id::unknown;                                          // exposition only
    char name_[max_name_length + 1] = {0};                          // exposition only
    static constexpr bool comp-name(string_view a, string_view b);  // exposition only
  };
}

2

Class text_encoding is a trivially copyable type ([basic.types.general]).
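As a hedged illustration of the interface summarized above, the following sketch queries the ordinary literal encoding when the library ships `<text_encoding>`. The helper name `describe_literal_encoding` and the fallback string are assumptions made for this example; only `literal()` and `name()` come from the synopsis.

```cpp
#include <string>
#include <version>

#if defined(__cpp_lib_text_encoding)
#include <text_encoding>
#endif

// Hypothetical helper: returns the name of the ordinary literal encoding,
// or a placeholder when the library does not provide std::text_encoding.
std::string describe_literal_encoding() {
#if defined(__cpp_lib_text_encoding)
    // literal() is consteval, so the object can be a compile-time constant.
    constexpr std::text_encoding lit = std::text_encoding::literal();
    return lit.name();
#else
    return "(std::text_encoding unavailable)";
#endif
}
```

The feature-test macro guard is a design choice for portability: the same translation unit still compiles on toolchains that predate the facility.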

28.4.2.2 General [text.encoding.general]

1

A registered character encoding is a character encoding scheme in the IANA Character Sets registry.

[Note 1:

The IANA Character Sets registry uses the term “character sets” to refer to character encodings.

— end note]

The primary name of a registered character encoding is the name of that encoding specified in the IANA Character Sets registry.

2

The set of known registered character encodings contains every registered character encoding specified in the IANA Character Sets registry except for the following:

  • NATS-DANO (33)
  • NATS-DANO-ADD (34)

3

Each known registered character encoding is identified by an enumerator in text_encoding::id, and has a set of zero or more aliases.

4

The set of aliases of a known registered character encoding is an implementation-defined superset of the aliases specified in the IANA Character Sets registry.

The set of aliases for US-ASCII includes “ASCII”.

No two aliases or primary names of distinct registered character encodings are equivalent when compared by text_encoding::comp-name.

5

How a text_encoding object is determined to be representative of a character encoding scheme implemented in the translation or execution environment is implementation-defined.

6

An object e of type text_encoding such that e.mib() == text_encoding::id::unknown is false and e.mib() == text_encoding::id::other is false maintains the following invariants:

  • *e.name() == '\0' is false, and
  • e.mib() == text_encoding(e.name()).mib() is true.

7

Recommended practice:

  • (7.1)

    Implementations should not consider registered encodings to be interchangeable. [Example 1: Shift_JIS and Windows-31J denote different encodings. — end example]

  • (7.2)

    Implementations should not use the name of a registered encoding to describe another similar yet different non-registered encoding unless there is a precedent on that implementation. [Example 2: Big5 — end example]

28.4.2.3 Members [text.encoding.members]

constexpr explicit text_encoding(string_view enc) noexcept;

1

Preconditions:

  • (1.1)

    enc represents a string in the ordinary literal encoding consisting only of elements of the basic character set ([lex.charset]).

  • (1.2)

    enc.size() <= max_name_length is true.

  • (1.3)

    enc.contains('\0') is false.

2

Postconditions:

  • (2.1)

    If there exists a primary name or alias a of a known registered character encoding such that comp-name(a, enc) is true, mib_ has the value of the enumerator of id associated with that registered character encoding. Otherwise, mib_ == id::other is true.

  • (2.2)

    enc.compare(name_) == 0 is true.
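The postconditions above can be sketched with a toy model. Everything in namespace `sketch` is hypothetical: `mini_encoding` is not the library type, the table holds only a handful of names, and `comp_name` is an illustrative rendering of the exposition-only comp-name that assumes an ASCII-compatible literal encoding.

```cpp
#include <cstddef>
#include <string_view>

namespace sketch {

enum class id : int { other = 1, unknown = 2, ASCII = 3, UTF8 = 106 };

// Yields the next significant character of s (lower-cased), or '\0' at the end.
// Assumes ASCII-style contiguous letter ranges.
constexpr char next_significant(std::string_view s, std::size_t& pos, bool& in_number) {
    while (pos < s.size()) {
        char c = s[pos++];
        if (c >= 'A' && c <= 'Z') c = static_cast<char>(c - 'A' + 'a');
        if (c >= 'a' && c <= 'z') { in_number = false; return c; }
        if (c >= '1' && c <= '9') { in_number = true; return c; }
        if (c == '0' && in_number) return c;  // zero inside a number is kept
        // Anything else (punctuation, leading zeros) is skipped.
    }
    return '\0';
}

constexpr bool comp_name(std::string_view a, std::string_view b) {
    std::size_t i = 0, j = 0;
    bool na = false, nb = false;
    for (;;) {
        const char x = next_significant(a, i, na);
        const char y = next_significant(b, j, nb);
        if (x != y) return false;
        if (x == '\0') return true;
    }
}

struct known_name { std::string_view name; id mib; };

// Tiny illustrative table; a real implementation covers the whole registry.
inline constexpr known_name known_names[] = {
    {"US-ASCII", id::ASCII}, {"ASCII", id::ASCII}, {"csASCII", id::ASCII},
    {"UTF-8", id::UTF8},     {"csUTF8", id::UTF8},
};

struct mini_encoding {
    static constexpr std::size_t max_name_length = 63;
    id mib_ = id::unknown;
    char name_[max_name_length + 1] = {0};

    constexpr mini_encoding() = default;

    // Precondition (as in the text): enc.size() <= max_name_length.
    constexpr explicit mini_encoding(std::string_view enc) noexcept : mib_(id::other) {
        // (2.2) name_ stores enc verbatim, not the matched registry name.
        for (std::size_t k = 0; k < enc.size() && k < max_name_length; ++k) name_[k] = enc[k];
        // (2.1) mib_ is taken from the first matching primary name or alias.
        for (const known_name& e : known_names) {
            if (comp_name(e.name, enc)) { mib_ = e.mib; break; }
        }
    }
};

static_assert(mini_encoding("utf8").mib_ == id::UTF8);
static_assert(mini_encoding("x-local").mib_ == id::other);

}  // namespace sketch
```

Note that in this model the stored name is the caller's spelling, so `mini_encoding("u.t.f-008")` keeps `"u.t.f-008"` while still resolving to `id::UTF8`.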

constexpr text_encoding(id i) noexcept;

3

Preconditions: i has the value of one of the enumerators of id.

4

Postconditions:

  • (4.1)

    mib_ == i is true.

  • (4.2)

    If (mib_ == id::unknown || mib_ == id::other) is true, strlen(name_) == 0 is true. Otherwise, ranges::contains(aliases(), string_view(name_)) is true.

constexpr id mib() const noexcept;

5

Returns: mib_.

constexpr const char* name() const noexcept;

6

Returns: name_.

7

Remarks: name() is an ntbs and accessing elements of name_ outside of the range name()+[0, strlen(name()) + 1) is undefined behavior.

constexpr aliases_view aliases() const noexcept;

Let r denote an instance of aliases_view.

If *this represents a known registered character encoding, then:

  • r.front() is the primary name of the registered character encoding,
  • r contains the aliases of the registered character encoding, and
  • r does not contain duplicate values when compared with strcmp.

Otherwise, r is an empty range.

8

Each element in r is a non-null, non-empty ntbs encoded in the literal character encoding and comprising only characters from the basic character set.

9

Returns: r.

10

[Note 1:

The order of aliases in r is unspecified.

— end note]

static consteval text_encoding literal() noexcept;

11

Mandates: CHAR_BIT == 8 is true.

12

Returns: A text_encoding object representing the ordinary character literal encoding ([lex.charset]).

static text_encoding environment();

13

Mandates: CHAR_BIT == 8 is true.

14

Returns: A text_encoding object representing the implementation-defined character encoding scheme of the environment.

On a POSIX implementation, this is the encoding scheme associated with the POSIX locale denoted by the empty string "".

15

[Note 2:

This function is not affected by calls to setlocale.

— end note]

16

Recommended practice: Implementations should return a value that is not affected by calls to the POSIX function setenv and other functions which can modify the environment ([support.runtime]).

template<id i> static bool environment_is();

17

Mandates: CHAR_BIT == 8 is true.

18

Returns: environment() == i.

static constexpr bool comp-name(string_view a, string_view b);

19

Returns: true if the two strings a and b encoded in the ordinary literal encoding are equal, ignoring, from left-to-right,

  • all elements that are not digits or letters ([character.seq.general]),
  • character case, and
  • any sequence of one or more 0 characters not immediately preceded by a numeric prefix, where a numeric prefix is a sequence consisting of a digit in the range [1, 9] optionally followed by one or more elements which are not digits or letters,

and false otherwise.

[Note 3:

This comparison is identical to the “Charset Alias Matching” algorithm described in the Unicode Technical Standard 22[bib].

— end note]

[Example 1:

static_assert(comp-name("UTF-8", "utf8") == true);
static_assert(comp-name("u.t.f-008", "utf8") == true);
static_assert(comp-name("ut8", "utf8") == false);
static_assert(comp-name("utf-80", "utf8") == false);

— end example]
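The matching rules for comp-name can be sketched as a two-cursor comparison. This is one possible rendering, not the specified implementation: the names `name_cursor` and `comp_name` are hypothetical, and the letter tests assume an ASCII-compatible ordinary literal encoding with contiguous letter ranges.

```cpp
#include <cstddef>
#include <string_view>

namespace sketch {

constexpr bool is_digit(char c) { return c >= '0' && c <= '9'; }
constexpr bool is_alpha(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}
constexpr char to_lower(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Walks a name and yields only the characters that take part in the comparison.
struct name_cursor {
    std::string_view text;
    std::size_t pos = 0;
    bool in_number = false;  // true once a digit of the current number was kept

    // Returns the next significant character, or '\0' when the name is exhausted.
    constexpr char next() {
        while (pos < text.size()) {
            const char c = text[pos++];
            if (is_alpha(c)) {
                in_number = false;  // a letter ends any numeric prefix
                return to_lower(c); // character case is ignored
            }
            if (is_digit(c)) {
                if (c == '0' && !in_number) continue;  // zeros with no numeric prefix
                in_number = true;
                return c;
            }
            // Other elements are ignored and do not end a numeric prefix.
        }
        return '\0';
    }
};

constexpr bool comp_name(std::string_view a, std::string_view b) {
    name_cursor ca{a};
    name_cursor cb{b};
    for (;;) {
        const char x = ca.next();
        const char y = cb.next();
        if (x != y) return false;
        if (x == '\0') return true;  // both names ended together
    }
}

// The four cases from the example in the text.
static_assert(comp_name("UTF-8", "utf8"));
static_assert(comp_name("u.t.f-008", "utf8"));
static_assert(!comp_name("ut8", "utf8"));
static_assert(!comp_name("utf-80", "utf8"));

}  // namespace sketch
```

Comparing through cursors avoids building normalized copies of either name, which keeps the function usable in constant expressions without allocation.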

28.4.2.4 Comparison functions [text.encoding.cmp]

friend constexpr bool operator==(const text_encoding& a, const text_encoding& b) noexcept;

1

Returns: If a.mib_ == id::other && b.mib_ == id::other is true, then comp-name(a.name_, b.name_).

Otherwise, a.mib_ == b.mib_.

friend constexpr bool operator==(const text_encoding& encoding, id i) noexcept;

2

Returns: encoding.mib_ == i.

3

Remarks: This operator induces an equivalence relation on its arguments if and only if i != id::other is true.
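The two comparison operators, and the reason the remark singles out id::other, can be illustrated with a small model. The type `encoding`, the objects `x`, `y`, `z`, and the helper `comp_name` are hypothetical stand-ins (assuming an ASCII-compatible literal encoding), not the library's definitions.

```cpp
#include <cstddef>
#include <string_view>

namespace sketch {

enum class id : int { other = 1, unknown = 2, UTF8 = 106 };

// Yields the next significant character of s (lower-cased), or '\0' at the end.
constexpr char next_significant(std::string_view s, std::size_t& pos, bool& in_number) {
    while (pos < s.size()) {
        char c = s[pos++];
        if (c >= 'A' && c <= 'Z') c = static_cast<char>(c - 'A' + 'a');
        if (c >= 'a' && c <= 'z') { in_number = false; return c; }
        if (c >= '1' && c <= '9') { in_number = true; return c; }
        if (c == '0' && in_number) return c;
    }
    return '\0';
}

constexpr bool comp_name(std::string_view a, std::string_view b) {
    std::size_t i = 0, j = 0;
    bool na = false, nb = false;
    for (;;) {
        const char x = next_significant(a, i, na);
        const char y = next_significant(b, j, nb);
        if (x != y) return false;
        if (x == '\0') return true;
    }
}

struct encoding {
    id mib;
    std::string_view name;
};

// Two unregistered encodings are equal only if their names match;
// every other pair is compared by MIB value alone.
constexpr bool operator==(const encoding& a, const encoding& b) noexcept {
    if (a.mib == id::other && b.mib == id::other) return comp_name(a.name, b.name);
    return a.mib == b.mib;
}

constexpr bool operator==(const encoding& e, id i) noexcept { return e.mib == i; }

inline constexpr encoding x{id::other, "x-foo"};
inline constexpr encoding y{id::other, "x-bar"};
inline constexpr encoding z{id::other, "X_FOO"};

static_assert(x == z);     // names match under comp_name
static_assert(!(x == y));  // same MIB, different names
// Both compare equal to id::other even though x and y differ from each other,
// which is why comparison against id::other is not an equivalence relation.
static_assert(x == id::other && y == id::other);

}  // namespace sketch
```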

28.4.2.5 Class text_encoding::aliases_view [text.encoding.aliases]

struct text_encoding::aliases_view : ranges::view_interface<text_encoding::aliases_view> {
  constexpr implementation-defined begin() const;
  constexpr implementation-defined end() const;
};

1

text_encoding::aliases_view models copyable, ranges::view, ranges::random_access_range, and ranges::borrowed_range.

[Note 1:

text_encoding::aliases_view is not required to satisfy ranges::common_range, nor default_initializable.

— end note]

2

Both ranges::range_value_t<text_encoding::aliases_view> and ranges::range_reference_t<text_encoding::aliases_view> denote const char*.

3

ranges::iterator_t<text_encoding::aliases_view> is a constexpr iterator ([iterator.requirements.general]).
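Because the view's elements are const char* and it need not be a common range, a range-based for loop is a convenient way to consume it. The following sketch collects the aliases of UTF-8 when the library provides `<text_encoding>`. The function name `utf8_aliases` and the empty-vector fallback are assumptions made for this example.

```cpp
#include <string>
#include <vector>
#include <version>

#if defined(__cpp_lib_text_encoding)
#include <text_encoding>
#endif

// Hypothetical helper: copies the aliases of UTF-8 into owning strings.
// Returns an empty vector when std::text_encoding is not available.
std::vector<std::string> utf8_aliases() {
    std::vector<std::string> out;
#if defined(__cpp_lib_text_encoding)
    const std::text_encoding enc(std::text_encoding::id::UTF8);
    // The first element is the primary name; the order of the rest is unspecified.
    for (const char* alias : enc.aliases()) {
        out.emplace_back(alias);
    }
#endif
    return out;
}
```

Copying into `std::string` is only for convenience here; since the view models `ranges::borrowed_range`, the pointers themselves can also be retained after the view object goes away.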

28.4.2.6 Enumeration text_encoding::id [text.encoding.id]

namespace std {
  enum class text_encoding::id : int_least32_t {
    other = 1, unknown = 2, ASCII = 3, ISOLatin1 = 4,
    ISOLatin2 = 5, ISOLatin3 = 6, ISOLatin4 = 7, ISOLatinCyrillic = 8,
    ISOLatinArabic = 9, ISOLatinGreek = 10, ISOLatinHebrew = 11, ISOLatin5 = 12,
    ISOLatin6 = 13, ISOTextComm = 14, HalfWidthKatakana = 15, JISEncoding = 16,
    ShiftJIS = 17, EUCPkdFmtJapanese = 18, EUCFixWidJapanese = 19, ISO4UnitedKingdom = 20,
    ISO11SwedishForNames = 21, ISO15Italian = 22, ISO17Spanish = 23, ISO21German = 24,
    ISO60DanishNorwegian = 25, ISO69French = 26, ISO10646UTF1 = 27, ISO646basic1983 = 28,
    INVARIANT = 29, ISO2IntlRefVersion = 30, NATSSEFI = 31, NATSSEFIADD = 32,
    ISO10Swedish = 35, KSC56011987 = 36, ISO2022KR = 37, EUCKR = 38,
    ISO2022JP = 39, ISO2022JP2 = 40, ISO13JISC6220jp = 41, ISO14JISC6220ro = 42,
    ISO16Portuguese = 43, ISO18Greek7Old = 44, ISO19LatinGreek = 45, ISO25French = 46,
    ISO27LatinGreek1 = 47, ISO5427Cyrillic = 48, ISO42JISC62261978 = 49, ISO47BSViewdata = 50,
    ISO49INIS = 51, ISO50INIS8 = 52, ISO51INISCyrillic = 53, ISO54271981 = 54,
    ISO5428Greek = 55, ISO57GB1988 = 56, ISO58GB231280 = 57, ISO61Norwegian2 = 58,
    ISO70VideotexSupp1 = 59, ISO84Portuguese2 = 60, ISO85Spanish2 = 61, ISO86Hungarian = 62,
    ISO87JISX0208 = 63, ISO88Greek7 = 64, ISO89ASMO449 = 65, ISO90 = 66,
    ISO91JISC62291984a = 67, ISO92JISC62991984b = 68, ISO93JIS62291984badd = 69, ISO94JIS62291984hand = 70,
    ISO95JIS62291984handadd = 71, ISO96JISC62291984kana = 72, ISO2033 = 73, ISO99NAPLPS = 74,
    ISO102T617bit = 75, ISO103T618bit = 76, ISO111ECMACyrillic = 77, ISO121Canadian1 = 78,
    ISO122Canadian2 = 79, ISO123CSAZ24341985gr = 80, ISO88596E = 81, ISO88596I = 82,
    ISO128T101G2 = 83, ISO88598E = 84, ISO88598I = 85, ISO139CSN369103 = 86,
    ISO141JUSIB1002 = 87, ISO143IECP271 = 88, ISO146Serbian = 89, ISO147Macedonian = 90,
    ISO150 = 91, ISO151Cuba = 92, ISO6937Add = 93, ISO153GOST1976874 = 94,
    ISO8859Supp = 95, ISO10367Box = 96, ISO158Lap = 97, ISO159JISX02121990 = 98,
    ISO646Danish = 99, USDK = 100, DKUS = 101, KSC5636 = 102,
    Unicode11UTF7 = 103, ISO2022CN = 104, ISO2022CNEXT = 105, UTF8 = 106,
    ISO885913 = 109, ISO885914 = 110, ISO885915 = 111, ISO885916 = 112,
    GBK = 113, GB18030 = 114, OSDEBCDICDF0415 = 115, OSDEBCDICDF03IRV = 116,
    OSDEBCDICDF041 = 117, ISO115481 = 118, KZ1048 = 119,
    UCS2 = 1000, UCS4 = 1001, UnicodeASCII = 1002, UnicodeLatin1 = 1003,
    UnicodeJapanese = 1004, UnicodeIBM1261 = 1005, UnicodeIBM1268 = 1006, UnicodeIBM1276 = 1007,
    UnicodeIBM1264 = 1008, UnicodeIBM1265 = 1009, Unicode11 = 1010, SCSU = 1011,
    UTF7 = 1012, UTF16BE = 1013, UTF16LE = 1014, UTF16 = 1015,
    CESU8 = 1016, UTF32 = 1017, UTF32BE = 1018, UTF32LE = 1019,
    BOCU1 = 1020, UTF7IMAP = 1021,
    Windows30Latin1 = 2000, Windows31Latin1 = 2001, Windows31Latin2 = 2002, Windows31Latin5 = 2003,
    HPRoman8 = 2004, AdobeStandardEncoding = 2005, VenturaUS = 2006, VenturaInternational = 2007,
    DECMCS = 2008, PC850Multilingual = 2009, PCp852 = 2010, PC8CodePage437 = 2011,
    PC8DanishNorwegian = 2012, PC862LatinHebrew = 2013, PC8Turkish = 2014, IBMSymbols = 2015,
    IBMThai = 2016, HPLegal = 2017, HPPiFont = 2018, HPMath8 = 2019,
    HPPSMath = 2020, HPDesktop = 2021, VenturaMath = 2022, MicrosoftPublishing = 2023,
    Windows31J = 2024, GB2312 = 2025, Big5 = 2026, Macintosh = 2027,
    IBM037 = 2028, IBM038 = 2029, IBM273 = 2030, IBM274 = 2031,
    IBM275 = 2032, IBM277 = 2033, IBM278 = 2034, IBM280 = 2035,
    IBM281 = 2036, IBM284 = 2037, IBM285 = 2038, IBM290 = 2039,
    IBM297 = 2040, IBM420 = 2041, IBM423 = 2042, IBM424 = 2043,
    IBM500 = 2044, IBM851 = 2045, IBM855 = 2046, IBM857 = 2047,
    IBM860 = 2048, IBM861 = 2049, IBM863 = 2050, IBM864 = 2051,
    IBM865 = 2052, IBM868 = 2053, IBM869 = 2054, IBM870 = 2055,
    IBM871 = 2056, IBM880 = 2057, IBM891 = 2058, IBM903 = 2059,
    IBM904 = 2060, IBM905 = 2061, IBM918 = 2062, IBM1026 = 2063,
    IBMEBCDICATDE = 2064, EBCDICATDEA = 2065, EBCDICCAFR = 2066, EBCDICDKNO = 2067,
    EBCDICDKNOA = 2068, EBCDICFISE = 2069, EBCDICFISEA = 2070, EBCDICFR = 2071,
    EBCDICIT = 2072, EBCDICPT = 2073, EBCDICES = 2074, EBCDICESA = 2075,
    EBCDICESS = 2076, EBCDICUK = 2077, EBCDICUS = 2078, Unknown8BiT = 2079,
    Mnemonic = 2080, Mnem = 2081, VISCII = 2082, VIQR = 2083,
    KOI8R = 2084, HZGB2312 = 2085, IBM866 = 2086, PC775Baltic = 2087,
    KOI8U = 2088, IBM00858 = 2089, IBM00924 = 2090, IBM01140 = 2091,
    IBM01141 = 2092, IBM01142 = 2093, IBM01143 = 2094, IBM01144 = 2095,
    IBM01145 = 2096, IBM01146 = 2097, IBM01147 = 2098, IBM01148 = 2099,
    IBM01149 = 2100, Big5HKSCS = 2101, IBM1047 = 2102, PTCP154 = 2103,
    Amiga1251 = 2104, KOI7switched = 2105, BRF = 2106, TSCII = 2107,
    CP51932 = 2108, windows874 = 2109,
    windows1250 = 2250, windows1251 = 2251, windows1252 = 2252, windows1253 = 2253,
    windows1254 = 2254, windows1255 = 2255, windows1256 = 2256, windows1257 = 2257,
    windows1258 = 2258, TIS620 = 2259, CP50220 = 2260
  };
}

[Note 1:

The text_encoding::id enumeration contains an enumerator for each known registered character encoding.

For each encoding, the corresponding enumerator is derived from the alias beginning with “cs”, as follows

  • csUnicode is mapped to text_encoding::id::UCS2,
  • csIBBM904 is mapped to text_encoding::id::IBM904, and
  • the “cs” prefix is removed from other names.

— end note]
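The derivation rule in the note can be written out as a small function. The name `enumerator_from_cs_alias` is hypothetical, and the function only models the naming rule; it does not check that the resulting enumerator actually exists.

```cpp
#include <string>
#include <string_view>

// Maps a registry alias beginning with "cs" to the spelling of the
// corresponding text_encoding::id enumerator, following the note's three rules.
std::string enumerator_from_cs_alias(std::string_view alias) {
    if (alias == "csUnicode") return "UCS2";    // special case named in the note
    if (alias == "csIBBM904") return "IBM904";  // special case named in the note
    if (alias.substr(0, 2) == "cs") alias.remove_prefix(2);  // general rule
    return std::string(alias);
}
```

For instance, `enumerator_from_cs_alias("csUTF8")` yields `"UTF8"`, matching the enumerator `UTF8 = 106` in the list above.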

28.4.2.7 Hash support [text.encoding.hash]

template<> struct hash<text_encoding>;

1

The specialization is enabled ([unord.hash]).