Stroika Library 3.0d16
 
Stroika::Foundation::Characters::CodeCvt< CHAR_T > Class Template Reference

CodeCvt unifies byte <-> unicode conversions, vaguely inspired by (and wraps) std::codecvt, as well as UTFConvert etc, to map between span<bytes> and a span<UNICODE code-point>

#include <CodeCvt.h>

Public Member Functions

 CodeCvt (const Options &options=Options{})
 
nonvirtual size_t Bytes2Characters (span< const byte > from) const
 convert span byte (external serialized format) parameters to characters (like std::codecvt<>::in () - but with spans, and simpler api)
 
template<constructible_from< const CHAR_T *, const CHAR_T * > STRINGISH>
nonvirtual STRINGISH Bytes2String (span< const byte > from) const
 
template<constructible_from< const byte *, const byte * > BLOBISH>
nonvirtual BLOBISH String2Bytes (span< const CHAR_T > from) const
 

Static Public Member Functions

template<IStdCodeCVT STD_CODECVT, typename... ARGS>
requires (same_as<CHAR_T, typename STD_CODECVT::intern_type>)
static CodeCvt mkFromStdCodeCvt (const Options &options={}, ARGS... args)
 

Detailed Description

template<IUNICODECanAlwaysConvertTo CHAR_T = Character>
class Stroika::Foundation::Characters::CodeCvt< CHAR_T >

CodeCvt unifies byte <-> unicode conversions, vaguely inspired by (and wraps) std::codecvt, as well as UTFConvert etc, to map between span<bytes> and a span<UNICODE code-point>

Note, UTFConvert is probably a slightly better API, better designed, and faster. HOWEVER, it ONLY converts to/from UNICODE. std::codecvt can convert to/from any locale code page, and so is more general.

Use the CodeCvt<> API when your code conversions may involve non UNICODE byte representations.

Note that this class - like codecvt - can be used to 'page' over an input, and incrementally convert it (though how it does this differs from codecvt - not maintaining a partial state - but instead adjusting the amount consumed from the input to reflect full-character conversions).

Note
- if encountering invalid data in the input (invalid characters) - this will 'THROW' and not just fill in special bogus replacement characters.
- the BINARY format character is OPAQUE given this API (you get/set bytes). The CHAR_T in the template argument refers to the 'CHARACTER' format you map to/from binary format (so typically wchar_t, or char32_t maybe).

Enhancements over std::codecvt:
- This is a span<> based API.
- You can subclass IRep (to provide your own CodeCvt implementation) and copy CodeCvt objects. (Unless I'm missing something, you can do one or the other with std::codecvt, but not both.)
- Simpler backend virtual API, so it is easier to create your own compliant CodeCvt object.
- CodeCvt leverages these two things via UTFConvert (which uses different library backends to do the UTF code conversion, hopefully enough faster to make up for the virtual call overhead this class introduces).
- No support for 'partial' conversion. If there is insufficient space in the target buffer, this is an ASSERTION error - UNSUPPORTED. ALL 'srcSpan' CHARACTER data MUST be consumed/converted (for byte data, we allow only a single partial character at the end: the Bytes2Characters overload taking a pointer to a span updates that span to reflect the remaining bytes).
- Doesn't bother templating the output byte type (std::codecvt supports all the useless ones but misses the most useful, at least for file IO and binary IO).
- No support for mbstate_t. It's opaque, and a PITA. And redundant.
- Lots of templated combinations of std::codecvt don't make sense and don't work, and std::codecvt gives no hint or validation about which ones you can use and which you cannot. Hopefully this class will make more sense. It can be used to convert (abstract API) between ANY combination of 'target hidden in implementation' and exposed CHAR_T characters (reading or writing). DEFAULT CTORS only provide the combinations supported by stdc++ (and a little more). To get other combinations, you must subclass.
- 'Equivalent' code-point types are automatically supported (e.g. wchar_t == char16_t or char32_t, and Character == char32_t).
- No explicit 'external_type' exposed. Just bytes go in and out vs (CHAR_T) UNICODE characters. This erasure of the 'encoding' type from CodeCvt<CHAR_T> allows it to be used generically, with the kind of encoding hidden in the 'rep'.

Difference:
- Maybe enhancement, maybe step back: You must call ComputeTargetCharacterBufferSize/ComputeTargetByteBufferSize and provide an output buffer large enough. This way, you can NEVER get a partial conversion due to lack of output buffer space (which simplifies a lot within this API). NOTE - 'large enough' doesn't necessarily mean as large as ComputeTargetCharacterBufferSize/ComputeTargetByteBufferSize would say, as those provide a safe estimate. If you know for special reasons that less is needed, you can use a smaller size, but the call must always FIT - no 'targetExhausted' exceptions are thrown.
- No 'noconv' error code (better in that it is simpler, but worse in that it forces a throw on bad characters).
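To make "safe estimate" concrete, here is a minimal sketch of worst-case sizing for the UTF-8 / char32_t pairing only. The function names are hypothetical and this is not how ComputeTargetCharacterBufferSize or ComputeTargetByteBufferSize are implemented; it only shows why an estimate can be larger than what a given input needs.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers (names are illustrative, not Stroika's).

// Every UTF-8 sequence occupies at least one byte, so N bytes decode to at most N code points.
constexpr std::size_t WorstCaseCharsFromUTF8Bytes (std::size_t byteCount)
{
    return byteCount;
}

// Every code point encodes to at most four UTF-8 bytes.
constexpr std::size_t WorstCaseUTF8BytesFromChars32 (std::size_t charCount)
{
    return charCount * 4;
}

// A target that is too small is treated as a programming error (assertion), never as an exception.
inline void CheckFits (std::size_t needed, std::size_t provided)
{
    assert (needed <= provided);
}
```

For pure ASCII input the estimate is exact in the first direction and four times too large in the second, which is the sense in which a caller with extra knowledge may pass a smaller buffer.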

Enhancements over UTFConvert:
- UTFConvert only supports UNICODE <-> UNICODE translations, even if in different UNICODE encodings. This API supports UNICODE <-> any arbitrary output binary format.
- So in particular, it supports translating between UNICODE characters and locale encodings (e.g. SHIFT_JIS, or whatever).

And:
- All the existing codecvt objects (which map to/from UNICODE) can easily be wrapped in a CodeCvt.

CodeCvt is a smart-pointer class, and an 'abstract class' (IRep), in that it can be instantiated directly only for some CHAR_T types (the ones std C++ supports: char16_t, char32_t, and wchar_t with locale).

Definition at line 118 of file CodeCvt.h.

Constructor & Destructor Documentation

◆ CodeCvt()

template<IUNICODECanAlwaysConvertTo CHAR_T>
Stroika::Foundation::Characters::CodeCvt< CHAR_T >::CodeCvt ( const Options &  options = Options{})

Default CTOR: Produces the fastest available CodeCvt(), between the templated UNICODE code-point and UTF-8 (as the binary format).

CodeCvt (const locale& l): Produces a CodeCvt which maps (back and forth) between bytes in the 'locale' character set, and UNICODE Characters.

CodeCvt (const string& localeName): Is equivalent to mkFromStdCodeCvt<...> (std::codecvt_byname {localeName}) - so it can throw if no such locale name

CodeCvt (span<const byte>* guessFormatFrom) the initial part of the span data (up to kMaxBOMSize bytes) are examined and used to select the CodeCvt to create (else default CodeCvt created). If a BOM is found, guessFormatFrom is adjusted to skip it.

CodeCvt (CodePage): Can throw if the code page is not recognized. NOTE - CodePage is a Windows concept, and though many code pages are provided portable (

To use (wrap) existing std::codecvt<A,B,C> class: Quirky, because classes not generally directly instantiable, so instead specify CLASS as template param and ARGS to CTOR. CodeCvt<CHAR_T,std::codecvt<CHAR_T, BINARY_T, MBSTATE_T>> {args to that class} Note works with subclasses of std::codecvt like std::codecvt_byname

To get OTHER conversions, say between char16_t and char32_t (combines/chains CodeCvt's):
- CodeCvt<CHAR_T>{UnicodeExternalEncodings} - Uses UTFConvert, along with any needed byte swapping
- CodeCvt<CHAR_T>{const CodeCvt<OTHER_CHAR_T> basedOn} - Use this to combine CodeCvt's (helpful for locale one)

Example Usage:
CodeCvt cc{"en_US.UTF8"};
constexpr char8_t someRandomText[] = u8"hello mom";
span<const byte> someRandomTextBinarySpan = as_bytes (span<const char8_t> {someRandomText, Characters::CString::Length (someRandomText)});
StackBuffer<Character> buf{cc.ComputeTargetCharacterBufferSize (someRandomTextBinarySpan)};
auto b = cc.Bytes2Characters (&someRandomTextBinarySpan, span{buf});
EXPECT_TRUE (someRandomTextBinarySpan.size () == 0); // ALL CONSUMED
EXPECT_TRUE (b.size () == 9 and b[0] == 'h');
Example Usage:
// codeCvt Between UTF16 Characters And UTF8BinaryFormat, best/fastest way
CodeCvt<char16_t> codeCvt1{};
// codeCvt Between UTF16 Characters And UTF8BinaryFormat using std::codecvt<char16_t, char8_t, std::mbstate_t>
CodeCvt<char16_t> codeCvt2 = CodeCvt<char16_t>::mkFromStdCodeCvt<std::codecvt<char16_t, char8_t, std::mbstate_t>> ();
// codeCvt Between UTF16 Characters using codecvt_byname
CodeCvt<char16_t> codeCvt3 = CodeCvt<char16_t>::mkFromStdCodeCvt<std::codecvt_byname<char16_t, char8_t, std::mbstate_t>> ({}, "en_US.UTF8");
// or equivalently
CodeCvt<char16_t> codeCvt4{"en_US.UTF8"};

Definition at line 596 of file CodeCvt.inl.

Member Function Documentation

◆ mkFromStdCodeCvt()

template<IUNICODECanAlwaysConvertTo CHAR_T = Character>
template<IStdCodeCVT STD_CODECVT, typename... ARGS>
requires (same_as<CHAR_T, typename STD_CODECVT::intern_type>)
static CodeCvt Stroika::Foundation::Characters::CodeCvt< CHAR_T >::mkFromStdCodeCvt ( const Options &  options = {},
ARGS...  args 
)
static

Note, though logically this should be a CodeCvt constructor, since you cannot directly construct the STD_CODECVT, it cannot be passed as an argument to the constructor. And so there appears to be no way to deduce or specify those constructor template arguments. But that can be done explicitly with a static function, and that is what we do with mkFromStdCodeCvt.

Note
- everything else has options last argument, but since we use ... parameter pack, options must be first here.

◆ Bytes2Characters()

template<IUNICODECanAlwaysConvertTo CHAR_T>
auto Stroika::Foundation::Characters::CodeCvt< CHAR_T >::Bytes2Characters ( span< const byte >  from) const

convert span byte (external serialized format) parameters to characters (like std::codecvt<>::in () - but with spans, and simpler api)

Convert bytes 'from' to characters 'to'.

Arguments:
- span<byte> from - initially, all of which will be converted or an exception thrown (only if data corrupt/unconvertible) (updated to point to bytes which form part of a single additional character)
- span<CHAR_T> to - buffer to have data converted 'into'

NOTE - all we require is that the result fit into 'to'. We offer a quick way to compute a buffer 'large enough' (call ComputeTargetCharacterBufferSize). A more expensive way is to call Bytes2Characters/1, which will tell you exactly how many characters are needed.

Returns: subspan of 'to', with converted characters. Throws on failure (corrupt source content). And '*from' is updated to reflect any remaining bytes that are part of the next character.

Source bytes must begin on a valid character boundary (unlike codecvt - no mbstate). If the input buffer ends with any incomplete characters, *from will refer to those characters on function completion.

The overload taking pointer to from returns the amount left. The overload taking span<> - not pointer - throws if not all consumed.

The caller typically will wish to save any remaining bytes, and resubmit their Bytes2Characters call with a new buffer, starting with those bytes (but there is no requirement to do so).

No state is maintained. ALL the input is converted except possibly a few bytes at the end of the input which constitute a partial character.

This implies that given a 'lead byte' as argument to Bytes2Characters, this function can return an EMPTY span, and that would not be an error (so no throw).

Note
we use the name 'Bytes' because it's suggestive of meaning, and in every case I'm aware of the target type will be char, or char8_t, or byte. But it's certainly not guaranteed to be serialized to byte, and the codecvt API calls this extern_type

/2 overload

Precondition
to.size () >= min(Bytes2Characters(*from), ComputeTargetCharacterBufferSize (*from)) on input. span<const byte>
Postcondition
from->size () very small on return (at most partial character)
See also
Bytes2String for a similar function, but operating on strings

Definition at line 750 of file CodeCvt.inl.

◆ Bytes2String()

template<IUNICODECanAlwaysConvertTo CHAR_T = Character>
template<constructible_from< const CHAR_T *, const CHAR_T * > STRINGISH>
nonvirtual STRINGISH Stroika::Foundation::Characters::CodeCvt< CHAR_T >::Bytes2String ( span< const byte >  from) const

Convert a span of bytes (in a coding defined by the constructor to CodeCvt) to a 'string' like object - anything constructible from a 'span' of characters (e.g. String or wstring)

NOTE - when converting Bytes2String, the String must be encoded using CHAR_T characters. The binary rep - can be anything - of course.

Example Usage
span<const byte> bytes = from_somewhere;
static const CodeCvt<wchar_t> kCvt_{UnicodeExternalEncodings::eUTF8};
wstring result = kCvt_.Bytes2String<wstring> (bytes);
Example Usage
span<const byte> bytes = from_somewhere;
wstring result = CodeCvt<wchar_t>{locale{}}.Bytes2String<wstring> (bytes);

◆ String2Bytes()

template<IUNICODECanAlwaysConvertTo CHAR_T = Character>
template<constructible_from< const byte *, const byte * > BLOBISH>
nonvirtual BLOBISH Stroika::Foundation::Characters::CodeCvt< CHAR_T >::String2Bytes ( span< const CHAR_T >  from) const

Convert a span of characters ('string') to a BLOB-like object - anything constructible from a 'span' of bytes; note that the container of bytes may itself be a 'string' (special case).

NOTE - when converting String2Bytes, the String must be encoded using CHAR_T characters. The binary rep - can be anything - of course.

Example Usage
span<const wchar_t> s = from_somewhere;
static const CodeCvt<wchar_t> kCvt_{UnicodeExternalEncodings::eUTF8};
string utf8String = kCvt_.String2Bytes<string> (s);
Example Usage
span<const wchar_t> s = from_somewhere;
Memory::BLOB localeFormatRenderingOfUnicodeInputAsLocaleFormatByteStream = CodeCvt<wchar_t>{locale{}}.String2Bytes<Memory::BLOB> (s);
Example Usage
span<const wchar_t> s = from_somewhere;
string localeFormatRenderingOfUnicodeInputAsLocaleFormatByteStream = CodeCvt<wchar_t>{locale{}}.String2Bytes<string> (s);

The documentation for this class was generated from the following files: