Ion
ion::base::Utf8Iterator Class Reference

The Utf8Iterator class iterates over characters in strings encoded with UTF-8, extracting the Unicode index for each character.

#include "utf8iterator.h"


Public Types

enum  State {
  kInString,
  kEndOfString,
  kInvalid
}
 Iterator states.
 

Public Member Functions

 Utf8Iterator (const std::string &utf8_string)
 The constructor is passed an std::string in UTF-8 format.
 
uint32 Next ()
 Returns the Unicode index (up to 21 bits) for the next character in the string, or kInvalidCharIndex if there are no characters remaining or an error occurred.
 
State GetState () const
 Returns the state of the iterator.
 
size_t ComputeCharCount () const
 Convenience function that computes and returns the number of Unicode characters in the string by iterating over it.
 

Static Public Attributes

static const uint32 kInvalidCharIndex = 0x110000
 An invalid Unicode character index.
 

Detailed Description

The Utf8Iterator class iterates over characters in strings encoded with UTF-8, extracting the Unicode index for each character.

It can also compute the total number of characters in the string.

For easy reference, here are the ranges for UTF-8 characters.

Hex      Decimal    What
00 - 7f  0 - 127    Only byte of a 1-byte character (ASCII)
80 - bf  128 - 191  Continuation byte of a multi-byte sequence
c0 - c1  192 - 193  <Invalid bytes - should never appear>
c2 - df  194 - 223  First byte of a 2-byte sequence
e0 - ef  224 - 239  First byte of a 3-byte sequence
f0 - f4  240 - 244  First byte of a 4-byte sequence
f5 - ff  245 - 255  <Invalid bytes - should never appear>

In addition, no Unicode index should exceed the maximum of 0x10ffff (4-byte sequence f4/8f/bf/bf). A sequence that starts with byte f4 (244) exceeds this limit, and is therefore illegal, when its second byte is greater than 8f.
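
The byte ranges listed above can be expressed as a small classifier. The following is a minimal sketch for illustration only; the names Utf8ByteKind and ClassifyUtf8Byte are hypothetical and are not part of Ion, and this is not necessarily how Utf8Iterator is implemented.

```cpp
#include <cstdint>

// Hypothetical helper types for illustration; not part of Ion.
enum class Utf8ByteKind {
  kSingle,        // 00 - 7f: only byte of a 1-byte character
  kContinuation,  // 80 - bf: continuation byte
  kLead2,         // c2 - df: first byte of a 2-byte sequence
  kLead3,         // e0 - ef: first byte of a 3-byte sequence
  kLead4,         // f0 - f4: first byte of a 4-byte sequence
  kInvalid        // c0 - c1 and f5 - ff: should never appear
};

// Maps a single byte to its role according to the range table.
inline Utf8ByteKind ClassifyUtf8Byte(std::uint8_t b) {
  if (b <= 0x7f) return Utf8ByteKind::kSingle;
  if (b <= 0xbf) return Utf8ByteKind::kContinuation;
  if (b <= 0xc1) return Utf8ByteKind::kInvalid;
  if (b <= 0xdf) return Utf8ByteKind::kLead2;
  if (b <= 0xef) return Utf8ByteKind::kLead3;
  if (b <= 0xf4) return Utf8ByteKind::kLead4;
  return Utf8ByteKind::kInvalid;
}
```

Classifying a byte in isolation is not enough to validate a whole sequence: a lead byte of f4 is in range, but the sequence is still illegal if the decoded value exceeds 0x10ffff.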

Definition at line 49 of file utf8iterator.h.

Member Enumeration Documentation

Iterator states.

Enumerator
kInString 
kEndOfString 
kInvalid 

Definition at line 52 of file utf8iterator.h.

Constructor & Destructor Documentation

ion::base::Utf8Iterator::Utf8Iterator ( const std::string &  utf8_string)
explicit

The constructor is passed an std::string in UTF-8 format.


Definition at line 93 of file utf8iterator.cc.

Member Function Documentation

size_t ion::base::Utf8Iterator::ComputeCharCount ( ) const

Convenience function that computes and returns the number of Unicode characters in the string by iterating over it.

This returns 0 if there are any encoding errors in the string.
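
To illustrate the documented behavior (count the characters, return 0 on any encoding error), here is a standalone sketch that does not use Ion at all. The function name CountUtf8CharsSketch is hypothetical, and the validation shown is an assumption based on the byte-range table in the detailed description; the actual implementation in utf8iterator.cc may check more or fewer conditions.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for illustration; not Ion's implementation.
// Returns the number of UTF-8 characters in s, or 0 on any encoding error.
inline std::size_t CountUtf8CharsSketch(const std::string& s) {
  std::size_t count = 0;
  std::size_t i = 0;
  while (i < s.size()) {
    const std::uint8_t lead = static_cast<std::uint8_t>(s[i]);
    std::size_t len = 0;
    if (lead <= 0x7f) {
      len = 1;
    } else if (lead >= 0xc2 && lead <= 0xdf) {
      len = 2;
    } else if (lead >= 0xe0 && lead <= 0xef) {
      len = 3;
    } else if (lead >= 0xf0 && lead <= 0xf4) {
      len = 4;
    } else {
      return 0;  // Stray continuation byte or invalid byte.
    }
    if (s.size() - i < len) return 0;  // Sequence is truncated.
    for (std::size_t k = 1; k < len; ++k) {
      const std::uint8_t b = static_cast<std::uint8_t>(s[i + k]);
      if (b < 0x80 || b > 0xbf) return 0;  // Not a continuation byte.
    }
    // f4 followed by a byte above 8f would encode a value above 0x10ffff.
    if (lead == 0xf4 && static_cast<std::uint8_t>(s[i + 1]) > 0x8f) return 0;
    i += len;
    ++count;
  }
  return count;
}
```

Note that with this contract a return value of 0 is ambiguous: it is produced both for an empty string and for a string with an encoding error. This sketch also does not reject overlong 3-byte or 4-byte encodings or surrogate values.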

Definition at line 137 of file utf8iterator.cc.

References GetState(), kEndOfString, kInvalidCharIndex, and Next().

State ion::base::Utf8Iterator::GetState ( ) const
inline

Returns the state of the iterator.

This can be used once iteration terminates to determine whether an error occurred or the end of string was reached.

Definition at line 72 of file utf8iterator.h.

Referenced by ComputeCharCount().

uint32 ion::base::Utf8Iterator::Next ( )

Returns the Unicode index (up to 21 bits) for the next character in the string, or kInvalidCharIndex if there are no characters remaining or an error occurred.
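
The decoding step described here can be sketched without Ion as follows. DecodeNextSketch and kSketchInvalidCharIndex are hypothetical names; the sentinel value simply mirrors the documented kInvalidCharIndex (0x110000). This is an illustration of how a multi-byte sequence yields an index of up to 21 bits, not a copy of the code in utf8iterator.cc.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Mirrors the documented sentinel value; the name is hypothetical.
constexpr std::uint32_t kSketchInvalidCharIndex = 0x110000;

// Hypothetical stand-in for illustration; not Ion's implementation.
// Decodes the character starting at *pos and advances *pos past it.
// Returns kSketchInvalidCharIndex at end of string or on an error,
// in which case *pos is left unchanged.
inline std::uint32_t DecodeNextSketch(const std::string& s, std::size_t* pos) {
  if (*pos >= s.size()) return kSketchInvalidCharIndex;
  const std::uint8_t b0 = static_cast<std::uint8_t>(s[*pos]);
  std::size_t extra = 0;  // Number of continuation bytes expected.
  std::uint32_t cp = 0;   // Accumulated Unicode index.
  if (b0 <= 0x7f) {
    extra = 0; cp = b0;
  } else if (b0 >= 0xc2 && b0 <= 0xdf) {
    extra = 1; cp = b0 & 0x1f;  // 5 payload bits in the lead byte.
  } else if (b0 >= 0xe0 && b0 <= 0xef) {
    extra = 2; cp = b0 & 0x0f;  // 4 payload bits.
  } else if (b0 >= 0xf0 && b0 <= 0xf4) {
    extra = 3; cp = b0 & 0x07;  // 3 payload bits.
  } else {
    return kSketchInvalidCharIndex;  // Invalid or unexpected byte.
  }
  if (s.size() - *pos <= extra) return kSketchInvalidCharIndex;  // Truncated.
  for (std::size_t i = 1; i <= extra; ++i) {
    const std::uint8_t b = static_cast<std::uint8_t>(s[*pos + i]);
    if ((b & 0xc0) != 0x80) return kSketchInvalidCharIndex;
    cp = (cp << 6) | (b & 0x3f);  // Each continuation byte adds 6 bits.
  }
  if (cp > 0x10ffff) return kSketchInvalidCharIndex;  // Above Unicode maximum.
  *pos += extra + 1;
  return cp;
}
```

Because the sentinel is returned both at the end of the string and on an error, a caller of this sketch has to compare the position with the string length to tell the two apart. The documented class exposes GetState() for that purpose.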

Definition at line 99 of file utf8iterator.cc.

References kInString, kInvalid, and kInvalidCharIndex.

Referenced by ComputeCharCount().

Member Data Documentation

const uint32 ion::base::Utf8Iterator::kInvalidCharIndex = 0x110000
static

An invalid Unicode character index.

Definition at line 59 of file utf8iterator.h.

Referenced by ComputeCharCount(), ion::text::ComputeTextSize(), and Next().

