CodeCvt unifies byte <-> unicode conversions, vaguely inspired by (and wraps) std::codecvt, as well as UTFConvert etc, to map between span<bytes> and a span<UNICODE code-point>
#include <CodeCvt.h>
Public Member Functions | |
CodeCvt (const Options &options=Options{}) | |
nonvirtual size_t | Bytes2Characters (span< const byte > from) const |
convert span byte (external serialized format) parameters to characters (like std::codecvt<>::in () - but with spans, and simpler api) | |
template<constructible_from< const CHAR_T *, const CHAR_T * > STRINGISH> | |
nonvirtual STRINGISH | Bytes2String (span< const byte > from) const |
template<constructible_from< const byte *, const byte * > BLOBISH> | |
nonvirtual BLOBISH | String2Bytes (span< const CHAR_T > from) const |
Static Public Member Functions | |
template<IStdCodeCVT STD_CODECVT, typename... ARGS> requires (same_as<CHAR_T, typename STD_CODECVT::intern_type>) | |
static CodeCvt | mkFromStdCodeCvt (const Options &options={}, ARGS... args) |
CodeCvt unifies byte <-> unicode conversions, vaguely inspired by (and wraps) std::codecvt, as well as UTFConvert etc, to map between span<bytes> and a span<UNICODE code-point>
Note, UTFConvert is probably a slightly better API, and better designed, and faster. HOWEVER, it ONLY converts to/from UNICODE. std::codecvt can convert to/from any locale code page, and is more general.
Use the CodeCvt<> API when your code conversions may involve non UNICODE byte representations.
Note that this class - like codecvt - can be used to 'page' over an input and convert it incrementally. How it does this differs from codecvt: it maintains no partial state, and instead adjusts the amount consumed from the input to reflect full-character conversions.
Enhancements over std::codecvt:
o This is a span<> based API.
o You can subclass IRep (to provide your own CodeCvt implementation) and copy CodeCvt objects (unless I'm missing something, you can do one or the other with std::codecvt, but not both).
o Simpler backend virtual API, so it is easier to create your own compliant CodeCvt object.
o CodeCvt leverages these two things via UTFConvert (which uses different library backends to do the UTF code conversion, hopefully enough faster to make up for the virtual call overhead this class introduces).
o Doesn't support 'partial' conversion. If there is insufficient space in the target buffer, this is an ASSERTION error - UNSUPPORTED. ALL 'srcSpan' CHARACTER data MUST be consumed/converted (for byte data, we allow only a single partial character at the end: the Bytes2Characters overload that takes a pointer to a span updates that span to reflect the remaining bytes).
o Doesn't bother templating the output byte type (std::codecvt supports all the useless ones but misses the most useful, at least for file IO and binary IO).
o Doesn't support mbstate_t. It's opaque, and a PITA. And redundant.
o With std::codecvt, lots of templated combinations don't make sense and don't work, and there is no hint or validation about which ones you can use. Hopefully this class will make more sense. It can be used to convert (abstract API) between ANY combination of 'target hidden in implementation' and exposed CHAR_T characters (reading or writing). The DEFAULT CTORS only provide the combinations supported by stdc++ (and a little more). To get other combinations, you must subclass.
o 'Equivalent' code-point types are automatically supported (e.g. wchar_t == char16_t or char32_t, and Character == char32_t).
o No explicit 'external_type' is exposed. Just bytes go in and out vs (CHAR_T) UNICODE characters. This erasure of the 'encoding' type from CodeCvt<CHAR_T> allows it to be used generically, where the kind of encoding used is hidden in the 'rep'.
Difference:
o Maybe enhancement, maybe step back: You must call ComputeTargetCharacterBufferSize/ComputeTargetByteBufferSize and provide an output buffer that is large enough. This way, you can NEVER get a partial conversion due to lack of output buffer space (which simplifies a lot within this API). NOTE - 'large enough' doesn't necessarily mean as large as ComputeTargetCharacterBufferSize/ComputeTargetByteBufferSize would say, as those provide a safe estimate. If you know for special reasons that less space suffices, you can use a smaller size, but the result must always FIT - no 'targetExhausted' exceptions are thrown.
o No 'noconv' error code (better in that it is simpler, but worse in that it forces a throw on bad characters).
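The idea of a 'safe estimate' for the output buffer can be illustrated for the special case where the byte format is UTF-8. The sketch below is not Stroika's ComputeTargetCharacterBufferSize/ComputeTargetByteBufferSize; the two function names are hypothetical, and the bounds are just the standard worst cases for UTF-8 against 16-bit and 32-bit code units.

```cpp
#include <cstddef>

// Hypothetical helper (not part of Stroika): upper bound on the number of
// CHAR_T code units produced from 'byteCount' bytes of UTF-8.
// Every UTF-8 sequence of N bytes yields at most N code units
// (a 4-byte sequence yields one char32_t or two char16_t), so the byte
// count itself is always a safe size.
template <typename CHAR_T>
constexpr std::size_t SafeCharacterBufferSizeFromUTF8 (std::size_t byteCount)
{
    static_assert (sizeof (CHAR_T) == 2 || sizeof (CHAR_T) == 4, "sketch covers UTF-16/UTF-32 code units only");
    return byteCount;
}

// Hypothetical helper (not part of Stroika): upper bound on the number of
// UTF-8 bytes produced from 'charCount' CHAR_T code units.
// A single UTF-16 code unit needs at most 3 bytes (a surrogate pair is
// 2 units -> 4 bytes); a UTF-32 code unit needs at most 4 bytes.
template <typename CHAR_T>
constexpr std::size_t SafeUTF8BufferSizeFromCharacters (std::size_t charCount)
{
    static_assert (sizeof (CHAR_T) == 2 || sizeof (CHAR_T) == 4, "sketch covers UTF-16/UTF-32 code units only");
    return charCount * (sizeof (CHAR_T) == 2 ? 3 : 4);
}
```

These bounds usually overshoot, which is the point made above: a caller with extra knowledge (say, pure ASCII input) may pass a smaller buffer, as long as the result still fits.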
Enhancements over UTFConvert:
o UTFConvert only supports UNICODE <-> UNICODE translations, even if in different UNICODE encodings. This API supports UNICODE <-> any arbitrary output binary format.
o So in particular, it supports translating between UNICODE characters and locale encodings (e.g. SHIFT_JIS, or whatever).
And:
o All the existing codecvt objects (which map to/from UNICODE) can easily be wrapped in a CodeCvt.
CodeCvt is a smart pointer class and an 'abstract class' (IRep), in that it can be instantiated directly only for some CHAR_T types (the ones std c++ supports: char16_t, char32_t, and wchar_t with locale).
Stroika::Foundation::Characters::CodeCvt< CHAR_T >::CodeCvt | ( | const Options & | options = Options{} | ) |
Default CTOR: Produces the fastest available CodeCvt(), between the templated UNICODE code-point and UTF-8 (as the binary format).
CodeCvt (const locale& l): Produces a CodeCvt which maps (back and forth) between bytes in the 'locale' character set, and UNICODE Characters.
CodeCvt (const string& localeName): Is equivalent to mkFromStdCodeCvt<...> (std::codecvt_byname {localeName}) - so it can throw if no such locale name
CodeCvt (span<const byte>* guessFormatFrom): The initial part of the span data (up to kMaxBOMSize bytes) is examined and used to select the CodeCvt to create (else a default CodeCvt is created). If a BOM is found, guessFormatFrom is adjusted to skip it.
CodeCvt (CodePage): Can throw if the code page is not recognized. NOTE - CodePage is a Windows concept, and though many code pages are provided portable (
To use (wrap) an existing std::codecvt<A,B,C> class: Quirky, because these classes are not generally directly instantiable, so instead specify the CLASS as a template param and the ARGS to its CTOR:
    CodeCvt<CHAR_T,std::codecvt<CHAR_T, BINARY_T, MBSTATE_T>> {args to that class}
Note this works with subclasses of std::codecvt, like std::codecvt_byname.
To get OTHER conversions, say between char16_t and char32_t (combines/chains CodeCvts):
    CodeCvt<CHAR_T>{UnicodeExternalEncodings} - Uses UTFConvert, along with any needed byte swapping
    CodeCvt<CHAR_T>{const CodeCvt<OTHER_CHAR_T> basedOn} - Use this to combine CodeCvts (helpful for the locale one)
Definition at line 596 of file CodeCvt.inl.
static CodeCvt | mkFromStdCodeCvt (const Options &options={}, ARGS... args) |
Note, though logically this should be a CodeCvt constructor, since you cannot directly construct the STD_CODECVT, it cannot be passed by argument to the constructor. And so there appears to be no way to deduce or specify those constructor template arguments. But that can be done explicitly with a static function, and that is what we do with mkFromStdCodeCvt.
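The reason a std::codecvt cannot simply be constructed and passed in is that its destructor is protected. A common standard-library idiom works around this with a thin derived class, taking the facet class as a template parameter and forwarding the constructor arguments, which has the same shape as the factory described above. The sketch below only illustrates that idiom; DeletableFacet, MakeFacet, and WidenASCII are hypothetical names, and nothing here is taken from Stroika's implementation.

```cpp
#include <cwchar>
#include <locale>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical adapter: derives from the facet so that it gains a public
// destructor, and forwards any constructor arguments to the facet.
template <typename STD_CODECVT>
struct DeletableFacet : STD_CODECVT {
    template <typename... ARGS>
    explicit DeletableFacet (ARGS&&... args)
        : STD_CODECVT (std::forward<ARGS> (args)...)
    {
    }
    ~DeletableFacet () {}
};

// Hypothetical factory: the facet CLASS is a template argument, and the
// constructor ARGS are passed separately.
template <typename STD_CODECVT, typename... ARGS>
std::unique_ptr<DeletableFacet<STD_CODECVT>> MakeFacet (ARGS&&... args)
{
    return std::make_unique<DeletableFacet<STD_CODECVT>> (std::forward<ARGS> (args)...);
}

// Example use: convert plain ASCII bytes to wchar_t with the default
// (classic "C" locale) codecvt facet.
inline std::wstring WidenASCII (const std::string& s)
{
    auto           facet = MakeFacet<std::codecvt<wchar_t, char, std::mbstate_t>> ();
    std::mbstate_t state{};
    std::wstring   out (s.size (), L'\0'); // one wchar_t per input byte is enough for ASCII
    const char*    fromNext = nullptr;
    wchar_t*       toNext   = nullptr;
    auto r = facet->in (state, s.data (), s.data () + s.size (), fromNext, out.data (), out.data () + out.size (), toNext);
    if (r != std::codecvt_base::ok) {
        throw std::runtime_error ("conversion failed");
    }
    out.resize (static_cast<std::size_t> (toNext - out.data ()));
    return out;
}
```

The same MakeFacet call would accept a name for a std::codecvt_byname facet, since the arguments are simply forwarded to whichever facet class is named.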
auto Stroika::Foundation::Characters::CodeCvt< CHAR_T >::Bytes2Characters | ( | span< const byte > | from | ) | const |
convert span byte (external serialized format) parameters to characters (like std::codecvt<>::in () - but with spans, and simpler api)
Convert bytes 'from' to characters 'to'.
Arguments:
o span<byte> from - all of which will be converted, or an exception thrown (only if the data is corrupt/unconvertible); on return, updated to refer to any trailing bytes that form part of a single additional (incomplete) character.
o span<CHAR_T> to - buffer to have data converted 'into'.
NOTE - all we require is that the result fit into 'to'. We offer a quick way to compute a buffer that is 'large enough' (call ComputeTargetCharacterBufferSize). A more expensive way is to call Bytes2Characters/1, which will tell you exactly how many characters are needed.
Returns: subspan of 'to', with the converted characters. Throws on failure (corrupt source content). '*from' is updated to reflect any remaining bytes that are part of the next character.
Source bytes must begin on a valid character boundary (unlike codecvt - no mbstate). If the input buffer ends with any incomplete characters, *from will refer to those characters on function completion.
The overload taking pointer to from returns the amount left. The overload taking span<> - not pointer - throws if not all consumed.
The caller will typically wish to save those remaining bytes, and resubmit their Bytes2Characters call with a new buffer, starting with those bytes (but there is no requirement to do so).
No state is maintained. ALL the input is converted except possibly a few bytes at the end of the input which constitute a partial character.
This implies that given a 'lead byte' as argument to Bytes2Characters, this function can return an EMPTY span, and that would not be an error (so no throw).
/2 overload
Definition at line 750 of file CodeCvt.inl.
nonvirtual STRINGISH Stroika::Foundation::Characters::CodeCvt< CHAR_T >::Bytes2String | ( | span< const byte > | from | ) | const |
Convert a span of bytes (in an encoding defined by the CodeCvt constructor) to a 'string'-like object - anything constructible from a 'span' of characters (e.g. String or wstring).
NOTE - when converting Bytes2String, the String must be encoded using CHAR_T characters. The binary rep - can be anything - of course.
nonvirtual BLOBISH Stroika::Foundation::Characters::CodeCvt< CHAR_T >::String2Bytes | ( | span< const CHAR_T > | from | ) | const |
Convert a span of characters ('string') to a BLOB-like object - anything constructible from a 'span' of bytes; note that the container of a span of bytes may be 'string' (special case).
NOTE - when converting String2Bytes, the String must be encoded using CHAR_T characters. The binary rep - can be anything - of course.