Stroika Library 3.0d16
 
Stroika::Foundation::Characters Namespace Reference

Namespaces

namespace  FloatConversion
 
namespace  Literals
 Create a format-string (see std::wformat_string or Stroika FormatString, or python 'f' strings).
 
namespace  ToStringDefaults
 
namespace  WellKnownCharsets
 
namespace  WellKnownCodePages
 

Classes

class  Character
 
class  CharacterEncodingException
 
  • An error occurred encoding or decoding a character
More...
 
class  Charset
 
class  CodeCvt
 CodeCvt unifies byte <-> unicode conversions, vaguely inspired by (and wraps) std::codecvt, as well as UTFConvert etc, to map between span<bytes> and a span<UNICODE code-point> More...
 
class  CodePageNotSupportedException
 
struct  FormatString
 Roughly equivalent to std::wformat_string, except that it can be constructed from a 'char' string, in which case the format string must contain only ASCII characters. More...
 
struct  Latin1
 
class  RegularExpression
 RegularExpression is a compiled regular expression which can be used to match on a String class. More...
 
class  RegularExpressionMatch
 
class  String
 String is like std::u32string, except it is much easier to use, often much more space efficient, and more easily interoperates with other string types. More...
 
class  StringBuilder
 Similar to String, but intended to more efficiently construct a String. Mutable type (String is largely immutable). More...
 
struct  StringBuilder_Options
 rarely used directly - defaults generally fine More...
 
struct  StringCombiner
 StringCombiner is a simple function object used to combine two strings visually - used in Iterable<>::Join () More...
 
struct  ToStringFormatter
 
class  UTFConvert
 UTFConvert is designed to provide mappings between various UTF encodings of UNICODE characters. More...
 

Concepts

concept  IBasicUNICODECodePoint
 check if T is char8_t, char16_t, char32_t - one of the three possible UNICODE UTF code-point classes.
 
concept  IUNICODECodePoint
 check if T is IBasicUNICODECodePoint or wchar_t (any basic code-point class)
 
concept  IStdBasicStringCompatibleCharacter
 concept IStdBasicStringCompatibleCharacter tests if the 'T' argument is a legit CHARACTER argument to std::basic_string, and basic_string_view (char,char8_t,char16_t,char32_t,wchar_t).
 
concept  IUNICODECanAlwaysConvertTo
 A UNICODE string can always be converted into an array of this type.
 
concept  IUNICODECanUnambiguouslyConvertFrom
 IUNICODECanUnambiguouslyConvertFrom is any 'character representation type' where an array of them is unambiguously convertible to a UNICODE string.
 
concept  IUNICODECanUnambiguouslyConvertTo
 IUNICODECanUnambiguouslyConvertTo is any 'character representation type' you can unambiguously convert a UNICODE string into.
 
concept  IStdCodeCVT
 
concept  IBasicUNICODEStdString
 returns true iff T == u8string, u16string, u32string, or wstring - which std::string types can be unambiguously converted to UNICODE
 
concept  IStdPathLike2UNICODEString
 anything with a 'special .STRINGTYPE conversion' method to UNICODE string, such as filesystem::path
 
concept  IConvertibleToString
 
concept  IToString
 Check if legal to call Characters::ToString(T)...
 

Typedefs

using ASCII = char
 Stroika's string/character classes treat 'char' as being an ASCII character.
 
using CodePage = uint32_t
 
using SDKChar = conditional_t< qTargetPlatformSDKUseswchar_t, wchar_t, char >
 
using SDKString = basic_string< SDKChar >
 
template<typename OUTPUT_CHAR_T >
using UTFCodeConverter = function< UTFConvert::ConversionResult(span< const byte > source, span< OUTPUT_CHAR_T > targetBuffer)>
 

Enumerations

enum  
 DEPRECATED.
 
enum class  AllowMissingCharacterErrorsFlag
 
enum class  StringShorteningPreference
 
enum class  ByteOrderMark
 
enum class  UnicodeExternalEncodings
 list of external UNICODE character encodings, for file IO (eDEFAULT = eUTF8) More...
 

Functions

wstring GetCharsetString (CodePage cp)
 Returns a character encoding name registered by the IANA - for the given CodePage.
 
 DISABLE_COMPILER_MSC_WARNING_START (4996)
 
template<typename CHAR_T >
String VFormat (const FormatString< CHAR_T > &f, const Common::StdCompat::wformat_args &args)
 same as std::vformat, except always uses wformat_args, and produces Stroika String (and maybe more - soon - ??? - add extra conversions if I can find how?)
 
template<typename CHAR_T , Common::StdCompat::formattable< wchar_t >... ARGS>
String Format (const FormatString< CHAR_T > &f, ARGS &&... args)
 Like std::format, except returning stroika String, and taking _f (FormatString) string as argument (which can be ASCII, but still produce UNICODE output).
 
template<typename TCHAR >
size_t CRLFToNL (const TCHAR *srcText, size_t srcTextBytes, TCHAR *outBuf, size_t outBufSize)
 Convert the argument srcText buffer from CRLF format line endings, to NL format line endings.
 
template<typename TCHAR >
size_t NLToCRLF (const TCHAR *srcText, size_t srcTextBytes, TCHAR *outBuf, size_t outBufSize)
 Convert the argument srcText buffer from NL format line endings, to CRLF format line endings. Note - if the input already uses CRLF, this is a no-op and changes nothing.
 
template<typename TCHAR >
size_t NormalizeTextToNL (const TCHAR *srcText, size_t srcTextBytes, TCHAR *outBuf, size_t outBufSize)
 
SDKString Narrow2SDK (span< const char > s)
 
wstring NarrowSDK2Wide (span< const char > s)
 
string SDK2Narrow (span< const SDKChar > s)
 
wstring SDK2Wide (span< const SDKChar > s)
 
SDKString Wide2SDK (span< const wchar_t > s)
 
wostream & operator<< (wostream &out, const String &s)
 
template<IConvertibleToString LHS_T, IConvertibleToString RHS_T>
requires (derived_from<remove_cvref_t<LHS_T>, String> or derived_from<remove_cvref_t<RHS_T>, String>)
String operator+ (LHS_T &&lhs, RHS_T &&rhs)
 
template<integral T = int, IUNICODECodePoint CHAR_T>
String2Int (span< const CHAR_T > s)
 
unsigned int HexString2Int (const String &s)
 
template<typename T >
String UnoverloadedToString (const T &t)
 same as ToString()/1 - but without the potentially confusing multi-arg overloads (confused some template expansions).
 
constexpr span< const byte > GetByteOrderMark (UnicodeExternalEncodings e) noexcept
 
constexpr optional< tuple< UnicodeExternalEncodings, size_t > > ReadByteOrderMark (span< const byte > d) noexcept
 
span< byte > WriteByteOrderMark (UnicodeExternalEncodings e, span< byte > into)
 
template<typename T , typename... ARGS>
String ToString (T &&t, ARGS... args)
 Return a debug-friendly, display version of the argument: not guaranteed parsable or usable except for debugging.
 

Variables

template<IPossibleCharacterRepresentation T>
static constexpr T kEOL []
 null-terminated string constant holding the line ending for the currently compiled platform - Windows (CRLF) or POSIX (NL). macOS is believed to no longer use a bare \r.
 
const function< String(String, String, bool)> kDefaultStringCombiner = StringCombiner<String>{.fSeparator = ", "_k}
 
constexpr size_t kMaxBOMSize = 3
 

Detailed Description

Each platform SDK has its own policy for representing characters. Some use narrow characters (char), and a predefined code page (often configured via locale), and others use wide-characters (wchar_t unicode).

SDKChar is the underlying representation of the SDK's characters - whether it be narrow or wide.

TODO:

 @todo   SEE http://stroika-bugs.sophists.com/browse/STK-768 - major refactor of this module

 @todo   Add Int2String () module? Like FloatConversion::ToString, and this String2Int?

 @todo   DOCUMENT BEHAVIOR OF STRING2INT() for bad strings. What does it do?
         AND SIMILARLY FOR HexString2Int. And for both - probably rewrite to use strtoul/strtol etc

 @todo   Same changes to HexString2Int() as we did with String2Int() - template on return value.
         Or maybe get rid of HexString2Int () - and just have optional radix param?

 @todo   Consider if we should have variants of these functions taking a locale, or
         always using C/current locale. For the most part - I find it best to use the C locale.
         But DOCUMENT in all cases!!! And maybe implement variants...

Typedef Documentation

◆ ASCII

Stroika's string/character classes treat 'char' as being an ASCII character.

This using declaration just documents that fact, without really enforcing anything. Prior to Stroika v3, the Stroika String classes basically prohibited the use of char because it was always UNCLEAR what character set to interpret it as.

But a safe (and quite useful) assumption, is just that it is ASCII. If you assume its always ASCII, you can simplify a lot of pragmatic usage. So Stroika v3 does that, with checks to enforce.

So generally - Stroika String (and Character) APIs - if given a 'char' REQUIRE that it be ASCII (unless otherwise documented in that API). Use u8string, or something else if you don't want to assume ASCII.
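
The ASCII requirement described above can be pictured as a simple predicate over a 'char' sequence. The following is only a sketch in standard C++: IsAllASCIISketch and RequireASCIISketch are hypothetical names, not Stroika APIs, and the real checks inside Stroika may differ.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical helper (not part of Stroika): true iff every char is 7-bit ASCII.
constexpr bool IsAllASCIISketch (std::string_view s) noexcept
{
    for (char c : s) {
        if (static_cast<unsigned char> (c) > 0x7F) {
            return false;
        }
    }
    return true;
}

// Hypothetical helper: a debug-time check in the spirit of "REQUIRE that it be ASCII".
inline std::string_view RequireASCIISketch (std::string_view s)
{
    assert (IsAllASCIISketch (s));
    return s;
}
```

Text that may contain non-ASCII bytes (for example UTF-8) would fail such a check, which is why the documentation points to u8string for that case.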

Definition at line 59 of file Character.h.

◆ CodePage

A CodePage is a Win32 (really DOS) concept which describes a particular single or multibyte (narrow) character set encoding.

Note
Maybe someday add a layer to map to/from Mac 'ScriptIDs' - which are basically analogous, just not as widely used.
This corresponds to UINT in the Windows SDK.

Definition at line 36 of file CodePage.h.

◆ SDKChar

using Stroika::Foundation::Characters::SDKChar = typedef conditional_t<qTargetPlatformSDKUseswchar_t, wchar_t, char>

SDKChar is the kind of character passed to most/default platform SDK APIs.

Platform-Specific Meaning:

o Windows Typically this is wchar_t, which is char16_t-ish. Windows SDK also supports an older "A" API (active Code page single byte) which Stroika probably still supports, but this has not been tested in a while (not very useful, not used much anymore).

o Unix There is no standard. This could be locale-dependent (often EUC based multibyte character sets). Or could be UTF-8. These aren't totally incompatible possibilities.

o MacOS Same as 'Unix' above, but most typically UTF-8. So Stroika assumes UTF-8. See https://stackoverflow.com/questions/3071377/saner-way-to-get-character-encoding-of-the-cli-in-mac-os-x

o Linux Same as 'Unix' above - default to assume locale{} based.
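
The platform-dependent choice of character type can be sketched with std::conditional_t, mirroring the typedef shown above. In this sketch kSDKUsesWCharSketch is a hypothetical stand-in for qTargetPlatformSDKUseswchar_t, and testing _WIN32 is an assumption about how that setting might be chosen, not Stroika's actual configuration logic.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical stand-in for qTargetPlatformSDKUseswchar_t (assumed true only on Windows).
#if defined(_WIN32)
inline constexpr bool kSDKUsesWCharSketch = true;
#else
inline constexpr bool kSDKUsesWCharSketch = false;
#endif

// Same shape as the SDKChar / SDKString typedefs, under sketch names.
using SDKCharSketch   = std::conditional_t<kSDKUsesWCharSketch, wchar_t, char>;
using SDKStringSketch = std::basic_string<SDKCharSketch>;

static_assert (std::is_same_v<SDKStringSketch::value_type, SDKCharSketch>);
```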

Definition at line 71 of file SDKChar.h.

◆ SDKString

using Stroika::Foundation::Characters::SDKString = typedef basic_string<SDKChar>

This is the kind of String passed to most platform APIs.

The easiest way to convert between a String and SDKString, is via the String class APIs: AsSDKString, AsNarrowSDKString, FromSDKString, FromNarrowSDKString.

For std::string (etc) interop, those String APIs also work, but see also SDK2Narrow and Narrow2SDK.

See also
SDKChar

Notes: NOTE - in the context of this file, the word "Narrow" refers to byte-oriented ('char' based) encodings of UNICODE characters (such as SJIS, UTF-8, or ISO-Latin-1, for example).

NOTE - in the context of this file, the word "Wide" refers to wchar_t based encoding of UNICODE characters.

Definition at line 38 of file SDKString.h.

◆ UTFCodeConverter

template<typename OUTPUT_CHAR_T >
using Stroika::Foundation::Characters::UTFCodeConverter = typedef function<UTFConvert::ConversionResult (span<const byte> source, span<OUTPUT_CHAR_T> targetBuffer)>

This is a function that takes a span of bytes, and an OPTIONAL mbstate_t (TBD), and a targetBuffer, translates into targetBuffer, and returns the result (a UTFConvert::ConversionResult). This utility wrapper function is meant to capture what you can easily put together from a (configured or default) UTFConvert, but in a form more easily used/consumed by the BinaryToText::Reader code.

Definition at line 500 of file UTFConvert.h.

Enumeration Type Documentation

◆ AllowMissingCharacterErrorsFlag

This flag ignores missing code points (when transforming from UNICODE to some character set that might not contain them), and does the best possible to map characters. Needed for things like translating a UNICODE error message to a locale{} character set which might not contain some of those UNICODE characters.

Definition at line 54 of file SDKString.h.

◆ StringShorteningPreference

Flag principally for LimitLength, but used elsewhere as well (e.g. ToString ()).

Definition at line 99 of file String.h.

◆ ByteOrderMark

Flag used to indicate if a ByteOrderMark should be included (in other Stroika modules).

Definition at line 29 of file TextConvert.h.

◆ UnicodeExternalEncodings

list of external UNICODE character encodings, for file IO (eDEFAULT = eUTF8)

Note
- UTF-7 is not supported because very few places support it or ever used it, and https://en.wikipedia.org/wiki/UTF-7 says it's obsolete. So don't bother.

Definition at line 31 of file UTFConvert.h.

Function Documentation

◆ GetCharsetString()

wstring Stroika::Foundation::Characters::GetCharsetString ( CodePage  cp)

Returns a character encoding name registered by the IANA - for the given CodePage.

See https://www.w3.org/International/articles/http-charset/index#charset

This works poorly, but is used in the HTTP Response generation, so cannot be removed for now.

Definition at line 83 of file CodePage.cpp.

◆ DISABLE_COMPILER_MSC_WARNING_START()

Stroika::Foundation::Characters::DISABLE_COMPILER_MSC_WARNING_START ( 4996  )

DEPRECATED BELOW...

◆ VFormat()

template<typename CHAR_T >
String Stroika::Foundation::Characters::VFormat ( const FormatString< CHAR_T > &  f,
const Common::StdCompat::wformat_args &  args 
)

same as std::vformat, except always uses wformat_args, and produces Stroika String (and maybe more - soon - ??? - add extra conversions if I can find how?)

Same as vformat, except always produces valid UNICODE Stroika String...

Note
FormatString typically created with _f, as in "foo={}"_f

Definition at line 112 of file Characters/Format.inl.

◆ CRLFToNL()

template<typename TCHAR >
size_t Stroika::Foundation::Characters::CRLFToNL ( const TCHAR *  srcText,
size_t  srcTextBytes,
TCHAR *  outBuf,
size_t  outBufSize 
)

Convert the argument srcText buffer from CRLF format line endings, to NL format line endings.

Returns the number of bytes in the output buffer (NOT null-terminated). Asserts that the output buffer is big enough; an output buffer as big as the input buffer is always big enough. It is OK for srcText and outBuf to be the SAME pointer.
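
The contract above (output never longer than input, in-place use allowed) can be illustrated with a small standard C++ sketch. CRLFToNLSketch is a hypothetical name and this is not Stroika's implementation; in particular, leaving a lone '\r' untouched is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>

// Sketch only: copy srcText to outBuf, dropping each '\r' that is immediately
// followed by '\n'. Returns the number of elements written; no terminator is
// appended. Safe when srcText == outBuf because the write index never passes
// the read index.
template <typename TCHAR>
std::size_t CRLFToNLSketch (const TCHAR* srcText, std::size_t srcLen, TCHAR* outBuf, std::size_t outBufSize)
{
    assert (outBufSize >= srcLen);    // the output never grows, so this is always enough
    std::size_t out = 0;
    for (std::size_t i = 0; i < srcLen; ++i) {
        TCHAR c = srcText[i];
        if (c == TCHAR ('\r') && i + 1 < srcLen && srcText[i + 1] == TCHAR ('\n')) {
            continue;    // skip the CR; the following LF is copied on the next iteration
        }
        outBuf[out++] = c;
    }
    return out;
}
```

Because each write lands at or before the position just read, the same buffer can be passed as both source and destination.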

Definition at line 65 of file LineEndings.inl.

◆ NormalizeTextToNL()

template<typename TCHAR >
size_t Stroika::Foundation::Characters::NormalizeTextToNL ( const TCHAR *  srcText,
size_t  srcTextBytes,
TCHAR *  outBuf,
size_t  outBufSize 
)

Returns the number of bytes in the output buffer (NOT null-terminated). Asserts that the output buffer is big enough; an output buffer as big as the input buffer is always big enough. It is OK for srcText and outBuf to be the SAME pointer.

See also
String::NormalizeTextToNL

Definition at line 230 of file LineEndings.inl.

◆ Narrow2SDK()

SDKString Stroika::Foundation::Characters::Narrow2SDK ( span< const char >  s)

Convert string/span of 'char' - interpreting the char in the locale/active code page of the current operating system (see SDKChar).

On most platforms, this does nothing, but on Windows, it maps strings to wstring using code-page CP_ACP

Characters with (detectably) missing code-points will generate an exception, unless AllowMissingCharacterErrorsFlag is specified (but exceptions can happen in any case due to possible bad_alloc).

Definition at line 19 of file SDKString.inl.

◆ NarrowSDK2Wide()

wstring Stroika::Foundation::Characters::NarrowSDK2Wide ( span< const char >  s)

Interpret the narrow string in the SDKChar manner (locale/charset) and convert to UNICODE wstring.

This is identical to SDK2Wide () if SDKChar==char (e.g. on Unix).

Characters with (detectably) missing code-points will generate an exception, unless AllowMissingCharacterErrorsFlag is specified (but exceptions can happen in any case due to possible bad_alloc).

Definition at line 81 of file SDKString.inl.

◆ SDK2Narrow()

string Stroika::Foundation::Characters::SDK2Narrow ( span< const SDKChar >  s)

Interpret the string/span of SDKChar (see SDKChar) - and convert it to narrow 'char' using the current code-page/locale.

On most platforms, this does nothing, but on Windows, it maps wstrings to string using code-page CP_ACP

Characters with (detectably) missing code-points will generate an exception, unless AllowMissingCharacterErrorsFlag is specified (but exceptions can happen in any case due to possible bad_alloc).

Definition at line 105 of file SDKString.inl.

◆ SDK2Wide()

wstring Stroika::Foundation::Characters::SDK2Wide ( span< const SDKChar >  s)

Interpret the string/span of SDKChar (see SDKChar) - and convert it to UNICODE 'wchar_t' string using the current code-page/locale.

On Windows, this does nothing as SDKString==wstring, but on other platforms it follows the rules of SDKChar to map it to wstring.

Characters with (detectably) missing code-points will generate an exception, unless AllowMissingCharacterErrorsFlag is specified (but exceptions can happen in any case due to possible bad_alloc).

◆ Wide2SDK()

SDKString Stroika::Foundation::Characters::Wide2SDK ( span< const wchar_t >  s)

Interpret the string/span of UNICODE wchar_t - and convert it to an SDKString, using the current locale/SDKChar/SDKString rules.

On Windows, this does nothing as SDKString==wstring, but on other platforms it follows the rules of SDKChar to map it from wstring.

Characters with (detectably) missing code-points will generate an exception, unless AllowMissingCharacterErrorsFlag is specified (but exceptions can happen in any case due to possible bad_alloc).

Definition at line 58 of file SDKString.cpp.

◆ operator<<()

wostream & Stroika::Foundation::Characters::operator<< ( wostream &  out,
const String &  s 
)

operator<< ostream adapters work as you would expect and allow writing Stroika strings easily to ostreams such as cout.

The only catch - is that Stroika strings are UNICODE based, and so may not fit perfectly with 'char' based basic_ostream<>. To address this, Stroika strings are mapped to 'narrow sdk strings' - ignoring any errors. As this is generally not a very good practice to do (lossy) - and generally just done for debugging/diagnostic output, this was deemed acceptable (as of Stroika v3.0d6).

Definition at line 2035 of file String.cpp.

◆ operator+()

template<IConvertibleToString LHS_T, IConvertibleToString RHS_T>
requires (derived_from<remove_cvref_t<LHS_T>, String> or derived_from<remove_cvref_t<RHS_T>, String>)
String Stroika::Foundation::Characters::operator+ ( LHS_T &&  lhs,
RHS_T &&  rhs 
)

Basic operator overload with the obvious meaning; it simply indirects to String::Concatenate (const String& rhs)

Note
Design Note Don't use member function so "x" + String{u"x"} works. Insist that EITHER LHS or RHS is a string (else operator applies too widely).

Both arguments must be convertible to a String, and at least one must be String or derived from String

Definition at line 1288 of file String.inl.

◆ String2Int()

template<integral T = int, IUNICODECodePoint CHAR_T>
T Stroika::Foundation::Characters::String2Int ( span< const CHAR_T >  s)

Convert the given decimal-format integral string to any integer type (e.g. signed char, unsigned short int, long long int, uint32_t etc).

String2Int will return 0 if no valid parse, and numeric_limits<T>::min on underflow, numeric_limits<T>::max on overflow.

CONSIDER!

See also
strtoll(), or wcstoll (). This is a simple wrapper on strtoll() / wcstoll (). strtoll() is more flexible. This is merely meant to be an often convenient wrapper. Use strtoll etc directly to see if the string parsed properly.
Precondition
No leading or trailing whitespace in string argument (unlike strtoll/wcstoll). (new requirement since Stroika 2.1b14)
Example Usage
uint32_t n1 = String2Int<uint32_t> ("33");
int n2 = String2Int (L"33");
int n3 = String2Int ("33aaa"); // invalid parse, so returns zero!
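
The documented return values (0 on an invalid parse, clamping to numeric_limits<T>::min / max on underflow / overflow) can be approximated on top of strtoll. This is a sketch in standard C++ only: String2IntSketch is a hypothetical name, it takes a std::string rather than a span, it handles signed types only, and it returns 0 for input that violates the whitespace precondition instead of asserting.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <cstdlib>
#include <limits>
#include <string>
#include <type_traits>

// Sketch only: approximates the documented String2Int contract using strtoll.
template <typename T = int>
T String2IntSketch (const std::string& s)
{
    static_assert (std::is_integral_v<T> && std::is_signed_v<T>, "sketch handles signed integers only");
    // strtoll would skip leading whitespace; the documented precondition forbids it.
    if (s.empty () || std::isspace (static_cast<unsigned char> (s.front ()))) {
        return 0;
    }
    errno         = 0;
    char*     end = nullptr;
    long long v   = std::strtoll (s.c_str (), &end, 10);
    if (end != s.c_str () + s.size ()) {
        return 0;    // nothing parsed, or trailing characters: treated as invalid
    }
    // On ERANGE strtoll already returns LLONG_MIN/LLONG_MAX, so clamping covers it.
    if (v < std::numeric_limits<T>::min ()) {
        return std::numeric_limits<T>::min ();
    }
    if (v > std::numeric_limits<T>::max ()) {
        return std::numeric_limits<T>::max ();
    }
    return static_cast<T> (v);
}
```

Callers that need to distinguish a genuine 0 from a failed parse would still use strtoll directly and inspect the end pointer.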

Definition at line 52 of file String2Int.inl.

◆ HexString2Int()

unsigned int Stroika::Foundation::Characters::HexString2Int ( const String &  s)

Convert the given hex-format string to an unsigned integer. HexString2Int will return 0 if no valid parse, and UINT_MAX on overflow.

See also
strtoul(), or wcstoul (). This is a simple wrapper on strtoul() or wcstoul(). strtoul() etc are more flexible. This is merely meant to be an often convenient wrapper. Use strtoul etc directly to see if the string parsed properly.
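
A comparable sketch for the hex case, again in standard C++ and not Stroika's code. HexString2IntSketch is a hypothetical name; treating trailing non-hex characters as an invalid parse is an assumption, since the behavior for bad strings is not documented here.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Sketch only: 0 when nothing valid is parsed, UINT_MAX on overflow.
inline unsigned int HexString2IntSketch (const std::string& s)
{
    // Reject empty input, leading whitespace, and signs (all of which strtoul would accept).
    if (s.empty () || !std::isxdigit (static_cast<unsigned char> (s.front ()))) {
        return 0;
    }
    errno             = 0;
    char*         end = nullptr;
    unsigned long v   = std::strtoul (s.c_str (), &end, 16);
    if (end != s.c_str () + s.size ()) {
        return 0;    // trailing junk: treated as invalid (assumption)
    }
    if (errno == ERANGE || v > UINT_MAX) {
        return UINT_MAX;
    }
    return static_cast<unsigned int> (v);
}
```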

Definition at line 23 of file String2Int.cpp.

◆ GetByteOrderMark()

constexpr span< const byte > Stroika::Foundation::Characters::GetByteOrderMark ( UnicodeExternalEncodings  e)
constexpr noexcept

returns the byte order mark for the given unicode encoding. Size is always <= kMaxBOMSize

Definition at line 21 of file TextConvert.inl.

◆ ReadByteOrderMark()

constexpr optional< tuple< UnicodeExternalEncodings, size_t > > Stroika::Foundation::Characters::ReadByteOrderMark ( span< const byte >  d)
constexpr noexcept

Returns the guessed encoding, and the number of bytes consumed. If 'd' doesn't contain a BOM (possibly because it is not large enough), returns nullopt.

Pass in a span of any size, but kMaxBOMSize is recommended (with less you may miss some BOMs; with more, the extra data won't be used).

Example Usage:
span<const byte> from = argument;
if (optional<tuple<UnicodeExternalEncodings, size_t>> o = ReadByteOrderMark (from)) {
    return make_tuple (Characters::CodeCvt<Character> (get<0> (*o)), get<1> (*o));
}
else {
    return make_tuple (Characters::CodeCvt<Character> (UnicodeExternalEncodings::eDEFAULT), 0);
}

Definition at line 71 of file TextConvert.inl.
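
The "guessed encoding plus bytes consumed, or nullopt" shape can be illustrated without Stroika. The sketch below is standard C++ and covers only the UTF-8 and UTF-16 byte order marks; BOMKindSketch and ReadBOMSketch are hypothetical names, and the real ReadByteOrderMark takes a span and returns UnicodeExternalEncodings.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>

// Hypothetical enum for the sketch (the real API uses UnicodeExternalEncodings).
enum class BOMKindSketch { eUTF8, eUTF16_BE, eUTF16_LE };

// Sketch only: returns (encoding, bytes consumed), or nullopt when no BOM is
// recognized, including when the buffer is too short to hold one.
inline std::optional<std::pair<BOMKindSketch, std::size_t>> ReadBOMSketch (const unsigned char* d, std::size_t n)
{
    if (n >= 3 && d[0] == 0xEF && d[1] == 0xBB && d[2] == 0xBF) {
        return std::make_pair (BOMKindSketch::eUTF8, std::size_t{3});
    }
    if (n >= 2 && d[0] == 0xFE && d[1] == 0xFF) {
        return std::make_pair (BOMKindSketch::eUTF16_BE, std::size_t{2});
    }
    if (n >= 2 && d[0] == 0xFF && d[1] == 0xFE) {
        return std::make_pair (BOMKindSketch::eUTF16_LE, std::size_t{2});
    }
    return std::nullopt;
}
```

The second element of the result tells the caller how many leading bytes to skip before decoding the remaining data.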

◆ WriteByteOrderMark()

span< byte > Stroika::Foundation::Characters::WriteByteOrderMark ( UnicodeExternalEncodings  e,
span< byte >  into 
)
Precondition
into.size () >= SizeOfByteOrderMark (e)

Returns the remaining span to write into (basically just into.subspan (SizeOfByteOrderMark (e))) so the caller can continue writing.

Definition at line 105 of file TextConvert.inl.

◆ ToString()

template<typename T , typename... ARGS>
String Stroika::Foundation::Characters::ToString ( T &&  t,
ARGS...  args 
)

Return a debug-friendly, display version of the argument: not guaranteed parsable or usable except for debugging.

Convert an instance of the given object to a printable string representation. This representation is not guaranteed, pretty, or parsable. This feature is generally for debugging purposes, but can be used to render/emit objects in any informal setting where you just need a rough sense of the object (again, the only case I can think of here would be for debugging).

Note
AKA:
o DUMP
o PrettyPrint

Patterned after:
o Java: From the Object.toString() docs: Returns a string representation of the object. In general, the toString method returns a string that "textually represents" this object.
o Javascript: The toString() method returns a string representing object. Every object has a toString() method that is automatically called when the object is to be represented as a text value or when an object is referred to in a manner in which a string is expected.

Note
Built-in or std types intrinsically supported:
o type_index, type_traits, or anything with a .name () which returns a const SDKChar* string
o is_array<T>
o is_enum<T>
o atomic<T> where T is a ToStringable type
o shared_ptr<T> (but not unique_ptr<> because it's not regular, which std::format requires)
o std::exception
o std::tuple
o std::variant
o std::pair
o std::optional
o std::wstring
o std::filesystem::path
o exception_ptr
o POD types (int, bool, double, etc)
o anything with .begin (), .end () - so any container/iterable
o any class (or struct) with a ToString () method

Other types automatically supported:
o KeyValuePair
o CountedValue
o String – printed as 'the-string' (possibly length limited)

Extra arguments/overloads:
o ToString (float, FloatConversion::ToStringOptions) // any floating point argument
o ToString (float, FloatConversion::Precision) // ''
o ToString (integral T, ios_base::fmtflags flags); // flags may be std::dec, std::oct, or std::hex; see https://en.cppreference.com/w/cpp/io/ios_base/fmtflags
o ToString (Duration, FloatConversion::Precision) // forwards to Duration::As<String> (Precision)
Note
Implementation Note This implementation defaults to calling T{}.ToString ().
See also
IToString to check if Characters::ToString () well defined.
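
The "call the object's own ToString () method when it has one" dispatch can be sketched in standard C++17 with a detection trait. Everything here (HasToStringSketch, ToStringSketch, PointSketch) is hypothetical and uses std::string; the real implementation returns a Stroika String, supports many more types, and the stream-insertion fallback is an assumption of this sketch only.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <type_traits>
#include <utility>

// Detects whether `t.ToString ()` is a valid expression for a const T.
template <typename T, typename = void>
struct HasToStringSketch : std::false_type {};
template <typename T>
struct HasToStringSketch<T, std::void_t<decltype (std::declval<const T&> ().ToString ())>> : std::true_type {};

// Sketch only: prefer the member ToString (), otherwise fall back to operator<<.
template <typename T>
std::string ToStringSketch (const T& t)
{
    if constexpr (HasToStringSketch<T>::value) {
        return t.ToString ();
    }
    else {
        std::ostringstream out;
        out << t;
        return out.str ();
    }
}

// Example user type that opts in by providing ToString ().
struct PointSketch {
    int x;
    int y;
    std::string ToString () const
    {
        return "{x: " + std::to_string (x) + ", y: " + std::to_string (y) + "}";
    }
};
```

With these definitions, ToStringSketch (PointSketch{1, 2}) goes through the member function, while ToStringSketch (42) takes the stream fallback.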

Definition at line 465 of file ToString.inl.

Variable Documentation

◆ kDefaultStringCombiner

const function< String(String, String, bool)> Stroika::Foundation::Characters::kDefaultStringCombiner = StringCombiner<String>{.fSeparator = ", "_k}

kDefaultStringCombiner is just StringCombiner{}, rendered as a function object, so that it can be externed/imported in the Iterable code without imposing a dependency on the String code.
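
How a combiner with this function signature might drive a join can be sketched in standard C++. StringCombinerSketch and JoinSketch are hypothetical names using std::string, and reading the bool argument as "this is the last pair" is an assumption; the sketch ignores it.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Sketch only: combines two strings with a separator, like a function object
// usable wherever function<String (String, String, bool)> is expected.
struct StringCombinerSketch {
    std::string fSeparator{", "};
    std::string operator() (const std::string& lhs, const std::string& rhs, bool /*isLast (assumed meaning)*/) const
    {
        return lhs + fSeparator + rhs;
    }
};

// Hypothetical join: folds the items left to right through the combiner.
inline std::string JoinSketch (const std::vector<std::string>&                                    items,
                               const std::function<std::string (std::string, std::string, bool)>& combiner = StringCombinerSketch{})
{
    std::string result;
    for (std::size_t i = 0; i < items.size (); ++i) {
        if (i == 0) {
            result = items[i];
        }
        else {
            result = combiner (result, items[i], i + 1 == items.size ());
        }
    }
    return result;
}
```

Storing the combiner as a std::function is what lets join code depend only on the function type rather than on the combiner's definition.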

Definition at line 1313 of file String.inl.

◆ kMaxBOMSize

constexpr size_t Stroika::Foundation::Characters::kMaxBOMSize = 3
constexpr

Max size of span returned by GetByteOrderMark ()

Definition at line 42 of file TextConvert.h.