Namespaces | |
namespace | FloatConversion |
namespace | Literals |
Create a format-string (see std::wformat_string or Stroika FormatString, or Python 'f' strings). | |
namespace | ToStringDefaults |
namespace | WellKnownCharsets |
namespace | WellKnownCodePages |
Classes | |
class | Character |
class | CharacterEncodingException |
| |
class | Charset |
class | CodeCvt |
CodeCvt unifies byte <-> unicode conversions, vaguely inspired by (and wraps) std::codecvt, as well as UTFConvert etc, to map between span<bytes> and a span<UNICODE code-point> More... | |
class | CodePageNotSupportedException |
struct | FormatString |
Roughly equivalent to std::wformat_string, except that it can be constructed from a 'char' string; when built from 'char', the format string must contain only ASCII characters. More... | |
struct | Latin1 |
class | RegularExpression |
RegularExpression is a compiled regular expression which can be used to match on a String class. More... | |
class | RegularExpressionMatch |
class | String |
String is like std::u32string, except it is much easier to use, often much more space efficient, and more easily interoperates with other string types. More... | |
class | StringBuilder |
Similar to String, but intended to more efficiently construct a String. Mutable type (String is largely immutable). More... | |
struct | StringBuilder_Options |
rarely used directly - defaults generally fine More... | |
struct | StringCombiner |
StringCombiner is a simple function object used to combine two strings visually - used in Iterable<>::Join () More... | |
struct | ToStringFormatter |
class | UTFConvert |
UTFConvert is designed to provide mappings between various UTF encodings of UNICODE characters. More... | |
Concepts | |
concept | IBasicUNICODECodePoint |
check if T is char8_t, char16_t, char32_t - one of the three possible UNICODE UTF code-point classes. | |
concept | IUNICODECodePoint |
check if T is IBasicUNICODECodePoint or wchar_t (any basic code-point class) | |
concept | IStdBasicStringCompatibleCharacter |
concept IStdBasicStringCompatibleCharacter tests if the 'T' argument is a legit CHARACTER argument to std::basic_string, and basic_string_view (char,char8_t,char16_t,char32_t,wchar_t). | |
concept | IUNICODECanAlwaysConvertTo |
A UNICODE string can always be converted into an array of this type. | |
concept | IUNICODECanUnambiguouslyConvertFrom |
IUNICODECanUnambiguouslyConvertFrom is any 'character representation type' where an array of them is unambiguously convertible to a UNICODE string. | |
concept | IUNICODECanUnambiguouslyConvertTo |
IUNICODECanUnambiguouslyConvertTo is any 'character representation type' you can unambiguously convert a UNICODE string into. | |
concept | IStdCodeCVT |
concept | IBasicUNICODEStdString |
returns true iff T == u8string, u16string, u32string, or wstring - which std::string types can be unambiguously converted to UNICODE | |
concept | IStdPathLike2UNICODEString |
anything with a 'special .STRINGTYPE conversion' method to UNICODE string, such as filesystem::path | |
concept | IConvertibleToString |
concept | IToString |
Check if legal to call Characters::ToString(T)... | |
Typedefs | |
using | ASCII = char |
Stroika's string/character classes treat 'char' as being an ASCII character. | |
using | CodePage = uint32_t |
using | SDKChar = conditional_t< qTargetPlatformSDKUseswchar_t, wchar_t, char > |
using | SDKString = basic_string< SDKChar > |
template<typename OUTPUT_CHAR_T > | |
using | UTFCodeConverter = function< UTFConvert::ConversionResult(span< const byte > source, span< OUTPUT_CHAR_T > targetBuffer)> |
Enumerations | |
enum | |
DEPRECATED. | |
enum class | AllowMissingCharacterErrorsFlag |
enum class | StringShorteningPreference |
enum class | ByteOrderMark |
enum class | UnicodeExternalEncodings |
list of external UNICODE character encodings, for file IO (eDEFAULT = eUTF8) More... | |
Functions | |
wstring | GetCharsetString (CodePage cp) |
Returns a character encoding name registered by the IANA - for the given CodePage. | |
DISABLE_COMPILER_MSC_WARNING_START (4996) | |
template<typename CHAR_T > | |
String | VFormat (const FormatString< CHAR_T > &f, const Common::StdCompat::wformat_args &args) |
same as std::vformat, except always uses wformat_args, and produces Stroika String (and maybe more - soon - ??? - add extra conversions if I can find how?) | |
template<typename CHAR_T , Common::StdCompat::formattable< wchar_t >... ARGS> | |
String | Format (const FormatString< CHAR_T > &f, ARGS &&... args) |
Like std::format, except returning stroika String, and taking _f (FormatString) string as argument (which can be ASCII, but still produce UNICODE output). | |
template<typename TCHAR > | |
size_t | CRLFToNL (const TCHAR *srcText, size_t srcTextBytes, TCHAR *outBuf, size_t outBufSize) |
Convert the argument srcText buffer from CRLF format line endings, to NL format line endings. | |
template<typename TCHAR > | |
size_t | NLToCRLF (const TCHAR *srcText, size_t srcTextBytes, TCHAR *outBuf, size_t outBufSize) |
Convert the argument srcText buffer from NL format line endings, to CRLF format line endings. Note - even if input is already CRLF, this then is a no-op, changing nothing. | |
template<typename TCHAR > | |
size_t | NormalizeTextToNL (const TCHAR *srcText, size_t srcTextBytes, TCHAR *outBuf, size_t outBufSize) |
SDKString | Narrow2SDK (span< const char > s) |
wstring | NarrowSDK2Wide (span< const char > s) |
string | SDK2Narrow (span< const SDKChar > s) |
wstring | SDK2Wide (span< const SDKChar > s) |
SDKString | Wide2SDK (span< const wchar_t > s) |
wostream & | operator<< (wostream &out, const String &s) |
template<IConvertibleToString LHS_T, IConvertibleToString RHS_T> requires (derived_from<remove_cvref_t<LHS_T>, String> or derived_from<remove_cvref_t<RHS_T>, String>) | |
String | operator+ (LHS_T &&lhs, RHS_T &&rhs) |
template<integral T = int, IUNICODECodePoint CHAR_T> | |
T | String2Int (span< const CHAR_T > s) |
unsigned int | HexString2Int (const String &s) |
template<typename T > | |
String | UnoverloadedToString (const T &t) |
same as ToString()/1 - but without the potentially confusing multi-arg overloads (confused some template expansions). | |
constexpr span< const byte > | GetByteOrderMark (UnicodeExternalEncodings e) noexcept |
constexpr optional< tuple< UnicodeExternalEncodings, size_t > > | ReadByteOrderMark (span< const byte > d) noexcept |
span< byte > | WriteByteOrderMark (UnicodeExternalEncodings e, span< byte > into) |
template<typename T , typename... ARGS> | |
String | ToString (T &&t, ARGS... args) |
Return a debug-friendly, display version of the argument: not guaranteed parsable or usable except for debugging. | |
Variables | |
template<IPossibleCharacterRepresentation T> | |
static constexpr T | kEOL [] |
null-terminated String constant for the current compiled platform - Windows (CRLF) or POSIX (NL). macOS probably no longer uses \r. | |
const function< String(String, String, bool)> | kDefaultStringCombiner = StringCombiner<String>{.fSeparator = ", "_k} |
constexpr size_t | kMaxBOMSize = 3 |
Each platform SDK has its own policy for representing characters. Some use narrow characters (char), and a predefined code page (often configured via locale), and others use wide-characters (wchar_t unicode).
SDKChar is the underlying representation of the SDK's characters - whether it be narrow or wide.
TODO:
@todo See http://stroika-bugs.sophists.com/browse/STK-768 - major refactor of this module.
@todo Add Int2String () module? Like FloatConversion::ToString, and this String2Int?
@todo Document the behavior of String2Int () for bad strings. What does it do? And similarly for HexString2Int. And for both - probably rewrite to use strtoul/strtol etc.
@todo Same changes to HexString2Int () as we did with String2Int () - template on return value. Or maybe get rid of HexString2Int () - and just have an optional radix param?
@todo Consider if we should have variants of these functions taking a locale, or always using the C/current locale. For the most part - I find it best to use the C locale. But DOCUMENT in all cases!!! And maybe implement variants...
using Stroika::Foundation::Characters::ASCII = typedef char |
Stroika's string/character classes treat 'char' as being an ASCII character.
This using declaration just documents that fact, without really enforcing anything. Prior to Stroika v3, the Stroika String classes basically prohibited the use of char because it was always UNCLEAR what character set to interpret it as.
But a safe (and quite useful) assumption is just that it is ASCII. If you assume it's always ASCII, you can simplify a lot of pragmatic usage. So Stroika v3 does that, with checks to enforce.
So generally - Stroika String (and Character) APIs - if given a 'char' REQUIRE that it be ASCII (unless otherwise documented in that API). Use u8string, or something else if you don't want to assume ASCII.
Definition at line 59 of file Character.h.
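The "char means ASCII" rule described above can be illustrated with a minimal standalone sketch. IsAllASCII is a hypothetical helper written for this example, not a Stroika API; it only shows the kind of check such a requirement implies.

```cpp
#include <string_view>

// Hypothetical helper (not part of Stroika): true iff every char is 7-bit ASCII.
constexpr bool IsAllASCII(std::string_view s) noexcept
{
    for (char c : s) {
        // Cast first: plain 'char' may be signed, so bytes >= 0x80 can be negative.
        if (static_cast<unsigned char>(c) > 0x7F) {
            return false;
        }
    }
    return true;
}
```

A UTF-8 encoded string such as "caf\xC3\xA9" fails this check, which is why the text above recommends u8string (or another type) when the data is not known to be ASCII.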
using Stroika::Foundation::Characters::CodePage = typedef uint32_t |
A codePage is a Win32 (really DOS) concept which describes a particular single or
multibyte (narrow) character set encoding.
Definition at line 36 of file CodePage.h.
using Stroika::Foundation::Characters::SDKChar = typedef conditional_t<qTargetPlatformSDKUseswchar_t, wchar_t, char> |
SDKChar is the kind of character passed to most/default platform SDK APIs.
Platform-Specific Meaning:
o Windows: Typically this is wchar_t, which is char16_t-ish. Windows SDK also supports an older "A" API (active Code page single byte) which Stroika probably still supports, but this has not been tested in a while (not very useful, not used much anymore).
o Unix: There is no standard. This could be locale-dependent (often EUC based multibyte character sets). Or could be UTF-8. These aren't totally incompatible possibilities.
o MacOS: Same as 'Unix' above, but most typically UTF-8. So Stroika assumes UTF-8. See https://stackoverflow.com/questions/3071377/saner-way-to-get-character-encoding-of-the-cli-in-mac-os-x
o Linux: Same as 'Unix' above - default to assume locale{} based.
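The compile-time selection behind SDKChar can be sketched without Stroika as follows. kSDKUsesWideChars, SDKCharSketch and SDKStringSketch are hypothetical names for this example, and keying the choice on _WIN32 is an assumption standing in for qTargetPlatformSDKUseswchar_t.

```cpp
#include <string>
#include <type_traits>

// Hypothetical stand-in for qTargetPlatformSDKUseswchar_t (assumed true only on Windows).
#if defined(_WIN32)
inline constexpr bool kSDKUsesWideChars = true;
#else
inline constexpr bool kSDKUsesWideChars = false;
#endif

// Same shape as the SDKChar / SDKString aliases documented above.
using SDKCharSketch   = std::conditional_t<kSDKUsesWideChars, wchar_t, char>;
using SDKStringSketch = std::basic_string<SDKCharSketch>;

// The string alias always uses the selected character type.
static_assert(std::is_same_v<SDKStringSketch::value_type, SDKCharSketch>);
```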
using Stroika::Foundation::Characters::SDKString = typedef basic_string<SDKChar> |
This is the kind of String passed to most platform APIs.
The easiest way to convert between a String and SDKString, is via the String class APIs: AsSDKString, AsNarrowSDKString, FromSDKString, FromNarrowSDKString.
For std::string (etc) interop, that works, but see also SDK2Narrow and Narrow2SDK.
Notes: NOTE - in the context of this file, the word "Narrow" refers to char-based (single- or multi-byte) encodings of UNICODE characters (such as SJIS, UTF-8, or ISO-Latin-1, for example).
NOTE - in the context of this file, the word "Wide" refers to wchar_t based encoding of UNICODE characters.
Definition at line 38 of file SDKString.h.
using Stroika::Foundation::Characters::UTFCodeConverter = typedef function<UTFConvert::ConversionResult (span<const byte> source, span<OUTPUT_CHAR_T> targetBuffer)> |
This is a function that takes a span of bytes, and an OPTIONAL mbstate_t (TBD), and targetBuffer, translates into targetBuffer, and returns the changes. This utility wrapper function is meant to capture what you can easily put together from a (configured or default) UTFConvert, but in a form more easily used/consumed by the BinaryToText::Reader code.
Definition at line 500 of file UTFConvert.h.
enum class Stroika::Foundation::Characters::AllowMissingCharacterErrorsFlag |
strong |
This flag ignores missing code points (when transforming from UNICODE to some character set that might not contain them), and does the best possible to map characters. Needed for things like translating a UNICODE error message to a locale{} character set which might not contain some of those UNICODE characters.
Definition at line 54 of file SDKString.h.
|
strong |
enum class Stroika::Foundation::Characters::ByteOrderMark |
strong |
Flag used to indicate if a ByteOrderMark should be included (in other Stroika modules).
Definition at line 29 of file TextConvert.h.
enum class Stroika::Foundation::Characters::UnicodeExternalEncodings |
strong |
list of external UNICODE character encodings, for file IO (eDEFAULT = eUTF8)
Definition at line 31 of file UTFConvert.h.
wstring Stroika::Foundation::Characters::GetCharsetString | ( | CodePage | cp | ) |
Returns a character encoding name registered by the IANA - for the given CodePage.
See https://www.w3.org/International/articles/http-charset/index#charset
This works poorly, but is used in the HTTP Response generation, so cannot be removed for now.
Definition at line 83 of file CodePage.cpp.
Stroika::Foundation::Characters::DISABLE_COMPILER_MSC_WARNING_START | ( | 4996 | ) |
DEPRECATED BELOW...
String Stroika::Foundation::Characters::VFormat | ( | const FormatString< CHAR_T > & | f, |
const Common::StdCompat::wformat_args & | args | ||
) |
same as std::vformat, except always uses wformat_args, and produces Stroika String (and maybe more - soon - ??? - add extra conversions if I can find how?)
Same as vformat, except always produces valid UNICODE Stroika String...
Definition at line 112 of file Characters/Format.inl.
size_t Stroika::Foundation::Characters::CRLFToNL | ( | const TCHAR * | srcText, |
size_t | srcTextBytes, | ||
TCHAR * | outBuf, | ||
size_t | outBufSize | ||
) |
Convert the argument srcText buffer from CRLF format line endings, to NL format line endings.
Returns the #bytes in the output buffer (no NUL terminator is written). Asserts the output buffer is big enough; an output buffer as big as the input buffer is always big enough. It is OK for srcText and outBuf to be the SAME pointer.
Definition at line 65 of file LineEndings.inl.
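The contract above (count returned, no terminator, source and destination may alias) can be sketched as a standalone function. CRLFToNLSketch is a hypothetical name and this is not the Stroika implementation; it counts characters rather than bytes, and leaving a lone '\r' untouched is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: copy src to out, dropping the '\r' of every "\r\n" pair.
// Returns the number of characters written; no terminator is appended.
// src and out may point at the same buffer.
template <typename CharT>
std::size_t CRLFToNLSketch(const CharT* src, std::size_t srcLen, CharT* out, std::size_t outLen)
{
    assert(outLen >= srcLen); // the output is never longer than the input
    std::size_t written = 0;
    for (std::size_t i = 0; i < srcLen; ++i) {
        const CharT c = src[i];
        if (c == '\r' && i + 1 < srcLen && src[i + 1] == '\n') {
            continue; // skip the CR; the LF is copied on the next iteration
        }
        out[written++] = c; // written <= i, so unread input is never overwritten
    }
    return written;
}
```

In-place use works because the write position never gets ahead of the read position.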
size_t Stroika::Foundation::Characters::NormalizeTextToNL | ( | const TCHAR * | srcText, |
size_t | srcTextBytes, | ||
TCHAR * | outBuf, | ||
size_t | outBufSize | ||
) |
Returns the #bytes in the output buffer (no NUL terminator is written). Asserts the output buffer is big enough; an output buffer as big as the input buffer is always big enough. It is OK for srcText and outBuf to be the SAME pointer.
Definition at line 230 of file LineEndings.inl.
SDKString Stroika::Foundation::Characters::Narrow2SDK | ( | span< const char > | s | ) |
Convert string/span of 'char' - interpreting the char in the locale/active code page of the current operating systems (
On most platforms, this does nothing, but on Windows, it maps strings to wstring using code-page CP_ACP
Characters with (detectably) missing code-points will generate an exception, unless AllowMissingCharacterErrorsFlag is specified (but exceptions can happen in any case due to possible bad_alloc).
Definition at line 19 of file SDKString.inl.
wstring Stroika::Foundation::Characters::NarrowSDK2Wide | ( | span< const char > | s | ) |
Interpret the narrow string in the SDKChar manner (locale/charset) and convert to UNICODE wstring.
This is identical to SDK2Wide () if SDKChar==char (e.g. on Unix).
Characters with (detectably) missing code-points will generate an exception, unless AllowMissingCharacterErrorsFlag is specified (but exceptions can happen in any case due to possible bad_alloc).
Definition at line 81 of file SDKString.inl.
string Stroika::Foundation::Characters::SDK2Narrow | ( | span< const SDKChar > | s | ) |
Interpret the string/span of SDKChar (
On most platforms, this does nothing, but on Windows, it maps wstrings to string using code-page CP_ACP
Characters with (detectably) missing code-points will generate an exception, unless AllowMissingCharacterErrorsFlag is specified (but exceptions can happen in any case due to possible bad_alloc).
Definition at line 105 of file SDKString.inl.
wstring Stroika::Foundation::Characters::SDK2Wide | ( | span< const SDKChar > | s | ) |
Interpret the string/span of SDKChar (
On Windows, this does nothing as SDKString==wstring, but on other platforms it follows the rules of SDKChar to map it to wstring.
Characters with (detectably) missing code-points will generate an exception, unless AllowMissingCharacterErrorsFlag is specified (but exceptions can happen in any case due to possible bad_alloc).
SDKString Stroika::Foundation::Characters::Wide2SDK | ( | span< const wchar_t > | s | ) |
Interpret the string/span of UNICODE wchar_t - and convert it to an SDKString, using the current locale/SDKChar/SDKString rules.
On Windows, this does nothing as SDKString==wstring, but on other platforms it follows the rules of SDKChar to map it from wstring.
Characters with (detectably) missing code-points will generate an exception, unless AllowMissingCharacterErrorsFlag is specified (but exceptions can happen in any case due to possible bad_alloc).
Definition at line 58 of file SDKString.cpp.
wostream & Stroika::Foundation::Characters::operator<< | ( | wostream & | out, |
const String & | s | ||
) |
operator<< ostream adapters work as you would expect and allow writing Stroika strings easily to ostreams such as cout.
The only catch - is that Stroika strings are UNICODE based, and so may not fit perfectly with 'char' based basic_ostream<>. To address this, Stroika strings are mapped to 'narrow sdk strings' - ignoring any errors. As this is generally not a very good practice to do (lossy) - and generally just done for debugging/diagnostic output, this was deemed acceptable (as of Stroika v3.0d6).
Definition at line 2035 of file String.cpp.
String Stroika::Foundation::Characters::operator+ | ( | LHS_T && | lhs, |
RHS_T && | rhs | ||
) |
Basic operator overload with the obvious meaning; it simply indirects to String::Concatenate (const String& rhs).
Both arguments must be convertible to a String, and at least one must be String or derived from String.
Definition at line 1288 of file String.inl.
T Stroika::Foundation::Characters::String2Int | ( | span< const CHAR_T > | s | ) |
Convert the given decimal-format integral string to any integer type (e.g. signed char, unsigned short int, long long int, uint32_t etc).
String2Int will return 0 if no valid parse, and numeric_limits<T>::min on underflow, numeric_limits<T>::max on overflow.
CONSIDER!
TODO:
Definition at line 52 of file String2Int.inl.
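The documented result rules for String2Int (0 when nothing valid is parsed, saturation at numeric_limits<T>::min / max) can be sketched with std::from_chars. String2IntSketch is a hypothetical standalone function, not the Stroika implementation; treating trailing garbage as "no valid parse" and accepting only narrow char input are assumptions of this sketch.

```cpp
#include <cassert>
#include <charconv>
#include <limits>
#include <string_view>
#include <system_error>

// Hypothetical sketch of the documented String2Int result rules.
template <typename T = int>
T String2IntSketch(std::string_view s)
{
    T           value{};
    const char* first = s.data();
    const char* last  = s.data() + s.size();
    const auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec == std::errc::result_out_of_range) {
        // from_chars does not say which way it overflowed; the sign tells us.
        const bool negative = !s.empty() && s.front() == '-';
        return negative ? std::numeric_limits<T>::min() : std::numeric_limits<T>::max();
    }
    if (ec != std::errc{} || ptr != last) {
        return T{0}; // assumption: partial parses count as "no valid parse"
    }
    return value;
}
```

Because 0 is also a legitimate parse result, callers of a function with these rules cannot distinguish "0" from a bad string without an extra check.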
unsigned int Stroika::Foundation::Characters::HexString2Int | ( | const String & | s | ) |
Convert the given hex-format string to an unsigned integer. HexString2Int will return 0 if no valid parse, and UINT_MAX on overflow.
Definition at line 23 of file String2Int.cpp.
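constexpr span< const byte > Stroika::Foundation::Characters::GetByteOrderMark | ( | UnicodeExternalEncodings | e | ) |
constexpr noexcept |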
returns the byte order mark for the given unicode encoding. Size is always <= kMaxBOMSize
Definition at line 21 of file TextConvert.inl.
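constexpr optional< tuple< UnicodeExternalEncodings, size_t > > Stroika::Foundation::Characters::ReadByteOrderMark | ( | span< const byte > | d | ) |
constexpr noexcept |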
Returns the guessed encoding, and the number of bytes consumed. If 'd' doesn't contain a BOM (possibly because it is not large enough), returns nullopt.
Pass in any size span, but it is recommended to use size kMaxBOMSize (with less you may miss some; with more, the extra data won't be used).
Definition at line 71 of file TextConvert.inl.
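The BOM lookup described above can be sketched as a standalone function. EncodingSketch and ReadBOMSketch are hypothetical names, and the enumerators of the real UnicodeExternalEncodings may differ. Since kMaxBOMSize is 3, the sketch assumes only the UTF-8 and UTF-16 marks are recognized and ignores the 4-byte UTF-32 marks.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>

// Hypothetical enum for this sketch only.
enum class EncodingSketch { UTF8, UTF16LittleEndian, UTF16BigEndian };

// Returns the detected encoding and the BOM length in bytes,
// or nullopt when the data is too short or starts with no known BOM.
inline std::optional<std::pair<EncodingSketch, std::size_t>> ReadBOMSketch(const unsigned char* d, std::size_t n) noexcept
{
    if (n >= 3 && d[0] == 0xEF && d[1] == 0xBB && d[2] == 0xBF) {
        return std::pair{EncodingSketch::UTF8, std::size_t{3}};
    }
    if (n >= 2 && d[0] == 0xFF && d[1] == 0xFE) {
        return std::pair{EncodingSketch::UTF16LittleEndian, std::size_t{2}};
    }
    if (n >= 2 && d[0] == 0xFE && d[1] == 0xFF) {
        return std::pair{EncodingSketch::UTF16BigEndian, std::size_t{2}};
    }
    return std::nullopt;
}
```

The returned length lets a caller skip past the mark before decoding the remaining bytes.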
span< byte > Stroika::Foundation::Characters::WriteByteOrderMark | ( | UnicodeExternalEncodings | e, |
span< byte > | into | ||
) |
Returns the remaining span to write into (basically just into.subspan (SizeOfByteOrderMark (e))), so the caller can continue writing.
Definition at line 105 of file TextConvert.inl.
String Stroika::Foundation::Characters::ToString | ( | T && | t, |
ARGS... | args | ||
) |
Return a debug-friendly, display version of the argument: not guaranteed parsable or usable except for debugging.
Convert an instance of the given object to a printable string representation. This representation is not guaranteed, pretty, or parsable. This feature is generally for debugging purposes, but can be used to render/emit objects in any informal setting where you just need a rough sense of the object (again, the only case I can think of here would be for debugging).
Patterned after:
o Java: From the Object.toString() docs: Returns a string representation of the object. In general, the toString method returns a string that "textually represents" this object.
o Javascript: The toString() method returns a string representing object. Every object has a toString() method that is automatically called when the object is to be represented as a text value or when an object is referred to in a manner in which a string is expected.
Definition at line 465 of file ToString.inl.
const function< String(String, String, bool)> Stroika::Foundation::Characters::kDefaultStringCombiner = StringCombiner<String>{.fSeparator = ", "_k} |
kDefaultStringCombiner is just StringCombiner{}, rendered as a function object, so that it can be externed/imported in the Iterable code without imposing a dependency on the String code.
Definition at line 1313 of file String.inl.
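The idea of exposing a combiner as a type-erased function object can be sketched with std::string in place of String. CombinerSketch, kCombinerSketch and JoinSketch are hypothetical names; in particular, reading the bool argument as "this is the last pair" is an assumption of this sketch, not something stated above.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical std::string analogue of StringCombiner with a ", " separator.
struct CombinerSketch {
    std::string fSeparator{", "};
    std::string operator()(const std::string& lhs, const std::string& rhs, bool /*isLast: assumed meaning*/) const
    {
        return lhs + fSeparator + rhs;
    }
};

// Type-erased form: code that joins items only needs this function signature,
// not the definition of the combiner type.
inline const std::function<std::string(std::string, std::string, bool)> kCombinerSketch = CombinerSketch{};

// Fold the items left to right through the combiner.
inline std::string JoinSketch(const std::vector<std::string>& items)
{
    if (items.empty()) {
        return {};
    }
    std::string result = items.front();
    for (std::size_t i = 1; i < items.size(); ++i) {
        result = kCombinerSketch(result, items[i], i + 1 == items.size());
    }
    return result;
}
```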
constexpr size_t Stroika::Foundation::Characters::kMaxBOMSize = 3 |
constexpr |
Max size of span returned by GetByteOrderMark ()
Definition at line 42 of file TextConvert.h.