Ion
The Utf8Iterator class iterates over characters in strings encoded with UTF-8, extracting the Unicode index for each character.
#include "utf8iterator.h"
Public Types
enum | State { kInString, kEndOfString, kInvalid }
Iterator states.
Public Member Functions
Utf8Iterator (const std::string &utf8_string)
The constructor is passed an std::string in UTF-8 format.
uint32 | Next ()
Returns the Unicode index (up to 21 bits) for the next character in the string, or kInvalidCharIndex if there are no characters remaining or an error occurred.
State | GetState () const
Returns the state of the iterator.
size_t | ComputeCharCount () const
Convenience function that computes and returns the number of Unicode characters in the string by iterating over it.
Static Public Attributes
static const uint32 | kInvalidCharIndex = 0x110000
An invalid Unicode character index.
The Utf8Iterator class iterates over characters in strings encoded with UTF-8, extracting the Unicode index for each character.
It can also compute the total number of characters in the string.
For easy reference, here are the ranges for UTF-8 characters.
Hex | Decimal | Meaning
---|---|---
00 - 7f | 0 - 127 | Only byte of a 1-byte character (ASCII)
80 - bf | 128 - 191 | Continuation byte of a multi-byte sequence
c0 - c1 | 192 - 193 | <Invalid bytes - should never appear>
c2 - df | 194 - 223 | First byte of a 2-byte sequence
e0 - ef | 224 - 239 | First byte of a 3-byte sequence
f0 - f4 | 240 - 244 | First byte of a 4-byte sequence
f5 - ff | 245 - 255 | <Invalid bytes - should never appear>
In addition, no Unicode index should exceed the maximum of 0x10ffff (4-byte sequence f4/8f/bf/bf). A sequence that starts with byte f4 (244) exceeds this limit, and is therefore illegal, when its second byte is greater than 8f.
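The byte ranges and the 0x10ffff limit can be checked with a few lines of code. The following is a minimal sketch, not taken from Ion: `ByteKind`, `ClassifyUtf8Byte`, and `Combine4` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper types, not part of Ion.
enum class ByteKind { kSingle, kContinuation, kLead2, kLead3, kLead4, kInvalid };

// Classifies one byte according to the ranges listed above.
ByteKind ClassifyUtf8Byte(uint8_t b) {
  if (b <= 0x7f) return ByteKind::kSingle;        // 00 - 7f
  if (b <= 0xbf) return ByteKind::kContinuation;  // 80 - bf
  if (b <= 0xc1) return ByteKind::kInvalid;       // c0 - c1
  if (b <= 0xdf) return ByteKind::kLead2;         // c2 - df
  if (b <= 0xef) return ByteKind::kLead3;         // e0 - ef
  if (b <= 0xf4) return ByteKind::kLead4;         // f0 - f4
  return ByteKind::kInvalid;                      // f5 - ff
}

// Combines the payload bits of a 4-byte sequence (3 + 6 + 6 + 6 bits).
// The result is deliberately not range-checked, so callers can see which
// sequences exceed 0x10ffff.
uint32_t Combine4(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3) {
  assert(ClassifyUtf8Byte(b0) == ByteKind::kLead4);
  return (static_cast<uint32_t>(b0 & 0x07) << 18) |
         (static_cast<uint32_t>(b1 & 0x3f) << 12) |
         (static_cast<uint32_t>(b2 & 0x3f) << 6) |
         static_cast<uint32_t>(b3 & 0x3f);
}
```

With these helpers, `Combine4(0xf4, 0x8f, 0xbf, 0xbf)` yields 0x10ffff, while `Combine4(0xf4, 0x90, 0x80, 0x80)` yields 0x110000, which is one past the maximum and equal to kInvalidCharIndex.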
Definition at line 49 of file utf8iterator.h.
Iterator states.
Enumerator | |
---|---|
kInString | |
kEndOfString | |
kInvalid |
Definition at line 52 of file utf8iterator.h.
explicit ion::base::Utf8Iterator::Utf8Iterator(const std::string &utf8_string)
The constructor is passed an std::string in UTF-8 format.
Definition at line 93 of file utf8iterator.cc.
size_t ion::base::Utf8Iterator::ComputeCharCount() const
Convenience function that computes and returns the number of Unicode characters in the string by iterating over it.
This returns 0 if there are any encoding errors in the string.
Definition at line 137 of file utf8iterator.cc.
References GetState(), kEndOfString, kInvalidCharIndex, and Next().
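The references listed above suggest a counting loop built from Next(), kInvalidCharIndex, GetState(), and kEndOfString. The sketch below shows one way such a loop could look; it is an assumption, not Ion's implementation. `CountChars` and `ReplayIterator` are hypothetical, and the `State` enum and `kInvalidCharIndex` constant are local stand-ins for the documented names so that the block compiles without Ion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Local stand-ins for the documented names in ion::base::Utf8Iterator.
enum State { kInString, kEndOfString, kInvalid };
const uint32_t kInvalidCharIndex = 0x110000;

// Hypothetical iterator that replays already-decoded indices and then
// stops in the given terminal state. It only exists to exercise CountChars.
class ReplayIterator {
 public:
  ReplayIterator(const std::vector<uint32_t>& indices, State final_state)
      : indices_(indices), final_state_(final_state), pos_(0),
        state_(kInString) {
    assert(final_state != kInString);  // Must end in a terminal state.
  }
  uint32_t Next() {
    if (pos_ < indices_.size()) return indices_[pos_++];
    state_ = final_state_;
    return kInvalidCharIndex;
  }
  State GetState() const { return state_; }

 private:
  std::vector<uint32_t> indices_;
  State final_state_;
  size_t pos_;
  State state_;
};

// Counts characters by draining a copy of the iterator. Returns 0 unless
// iteration stopped because the end of the string was reached.
template <typename Iterator>
size_t CountChars(Iterator it) {
  size_t count = 0;
  while (it.Next() != kInvalidCharIndex) ++count;
  return it.GetState() == kEndOfString ? count : 0;
}
```

Because the error case and the empty-string case both produce 0, a caller that needs to tell them apart would have to iterate and check GetState() directly.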
inline State ion::base::Utf8Iterator::GetState() const
Returns the state of the iterator.
This can be used once iteration terminates to determine whether an error occurred or the end of string was reached.
Definition at line 72 of file utf8iterator.h.
Referenced by ComputeCharCount().
uint32 ion::base::Utf8Iterator::Next()
Returns the Unicode index (up to 21 bits) for the next character in the string, or kInvalidCharIndex if there are no characters remaining or an error occurred.
Definition at line 99 of file utf8iterator.cc.
References kInString, kInvalid, and kInvalidCharIndex.
Referenced by ComputeCharCount().
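To make the contract of Next() concrete, here is a minimal standalone decoder with the same return convention. It is a sketch under stated assumptions, not the code in utf8iterator.cc: `DecodeNext` is a hypothetical free function, and `kInvalidCharIndex` is a local copy of the documented constant. The sketch checks lead-byte ranges, continuation bytes, truncation, and the 0x10ffff limit; it does not reject overlong 3- and 4-byte forms or surrogate values, and this page does not say whether Ion does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Local copy of the documented value of Utf8Iterator::kInvalidCharIndex.
const uint32_t kInvalidCharIndex = 0x110000;

// Hypothetical stand-in for Next(): decodes the character starting at *pos
// and advances *pos past it. Returns kInvalidCharIndex, leaving *pos
// unchanged, at the end of the string or on a malformed sequence.
uint32_t DecodeNext(const std::string& s, size_t* pos) {
  assert(pos != nullptr);
  if (*pos >= s.size()) return kInvalidCharIndex;

  const uint8_t lead = static_cast<uint8_t>(s[*pos]);
  size_t extra = 0;    // Number of continuation bytes expected.
  uint32_t index = 0;  // Accumulated Unicode index.
  if (lead <= 0x7f) {
    index = lead;
  } else if (lead >= 0xc2 && lead <= 0xdf) {
    extra = 1;
    index = lead & 0x1f;
  } else if (lead >= 0xe0 && lead <= 0xef) {
    extra = 2;
    index = lead & 0x0f;
  } else if (lead >= 0xf0 && lead <= 0xf4) {
    extra = 3;
    index = lead & 0x07;
  } else {
    return kInvalidCharIndex;  // Continuation or invalid byte in lead position.
  }

  // The sequence must not run past the end of the string.
  if (s.size() - *pos - 1 < extra) return kInvalidCharIndex;

  for (size_t i = 1; i <= extra; ++i) {
    const uint8_t b = static_cast<uint8_t>(s[*pos + i]);
    if (b < 0x80 || b > 0xbf) return kInvalidCharIndex;
    index = (index << 6) | (b & 0x3f);
  }
  if (index > 0x10ffff) return kInvalidCharIndex;

  *pos += extra + 1;
  return index;
}
```

Called repeatedly on the three-byte string "a" followed by c3 a9, this returns 0x61, then 0xe9, then kInvalidCharIndex. A free function like this cannot report why it stopped, which is the role GetState() plays in the class.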
static const uint32 ion::base::Utf8Iterator::kInvalidCharIndex = 0x110000
An invalid Unicode character index.
Definition at line 59 of file utf8iterator.h.
Referenced by ComputeCharCount(), ion::text::ComputeTextSize(), and Next().