Exercise: Protobuf Parsing
In this exercise, you will build a parser for the protobuf binary encoding. Don’t worry, it’s simpler than it seems! This illustrates a common parsing pattern, passing slices of data. The underlying data itself is never copied.
Fully parsing a protobuf message requires knowing the types of the fields,
indexed by their field numbers. That is typically provided in a proto file. In
this exercise, we’ll encode that information into match statements in
functions that get called for each field.
We’ll use the following proto:
message PhoneNumber {
optional string number = 1;
optional string type = 2;
}
message Person {
optional string name = 1;
optional int32 id = 2;
repeated PhoneNumber phones = 3;
}
Messages
A proto message is encoded as a series of fields, one after the next. Each is
implemented as a “tag” followed by the value. The tag contains a field number
(e.g., 2 for the id field of a Person message) and a wire type defining
how the payload should be determined from the byte stream. These are combined
into a single integer, as decoded in unpack_tag below.
Varint
Integers, including the tag, are represented with a variable-length encoding
called VARINT. Luckily, parse_varint is defined for you below.
Wire Types
Proto defines several wire types, only two of which are used in this exercise.
The Varint wire type contains a single varint, and is used to encode proto
values of type int32 such as Person.id.
The Len wire type contains a length expressed as a varint, followed by a
payload of that number of bytes. This is used to encode proto values of type
string such as Person.name. It is also used to encode proto values
containing sub-messages such as Person.phones, where the payload contains an
encoding of the sub-message.
Exercise
The given code also defines callbacks to handle Person and PhoneNumber
fields, and to parse a message into a series of calls to those callbacks.
What remains for you is to implement the parse_field function and the
ProtoMessage trait for Person and PhoneNumber.
// Copyright 2024 Google LLC
// SPDX-License-Identifier: Apache-2.0
/// A wire type as seen on the wire.
enum WireType {
/// The Varint WireType indicates the value is a single VARINT.
Varint,
// The I64 WireType indicates that the value is precisely 8 bytes in
// little-endian order containing a 64-bit signed integer or double type.
//I64, -- not needed for this exercise
/// The Len WireType indicates that the value is a length represented as a
/// VARINT followed by exactly that number of bytes.
Len,
// The I32 WireType indicates that the value is precisely 4 bytes in
// little-endian order containing a 32-bit signed integer or float type.
//I32, -- not needed for this exercise
}
#[derive(Debug)]
/// A field's value, typed based on the wire type.
enum FieldValue<'a> {
Varint(u64),
//I64(i64), -- not needed for this exercise
Len(&'a [u8]),
//I32(i32), -- not needed for this exercise
}
#[derive(Debug)]
/// A field, containing the field number and its value.
struct Field<'a> {
field_num: u64,
value: FieldValue<'a>,
}
trait ProtoMessage<'a>: Default {
fn add_field(&mut self, field: Field<'a>);
}
impl From<u64> for WireType {
fn from(value: u64) -> Self {
match value {
0 => WireType::Varint,
//1 => WireType::I64, -- not needed for this exercise
2 => WireType::Len,
//5 => WireType::I32, -- not needed for this exercise
_ => panic!("Invalid wire type: {value}"),
}
}
}
impl<'a> FieldValue<'a> {
fn as_str(&self) -> &'a str {
let FieldValue::Len(data) = self else {
panic!("Expected string to be a `Len` field");
};
std::str::from_utf8(data).expect("Invalid string")
}
fn as_bytes(&self) -> &'a [u8] {
let FieldValue::Len(data) = self else {
panic!("Expected bytes to be a `Len` field");
};
data
}
fn as_u64(&self) -> u64 {
let FieldValue::Varint(value) = self else {
panic!("Expected `u64` to be a `Varint` field");
};
*value
}
}
/// Parse a VARINT, returning the parsed value and the remaining bytes.
fn parse_varint(data: &[u8]) -> (u64, &[u8]) {
for i in 0..7 {
let Some(b) = data.get(i) else {
panic!("Not enough bytes for varint");
};
if b & 0x80 == 0 {
// This is the last byte of the VARINT, so convert it to
// a u64 and return it.
let mut value = 0u64;
for b in data[..=i].iter().rev() {
value = (value << 7) | (b & 0x7f) as u64;
}
return (value, &data[i + 1..]);
}
}
// More than 7 bytes is invalid.
panic!("Too many bytes for varint");
}
/// Convert a tag into a field number and a WireType.
fn unpack_tag(tag: u64) -> (u64, WireType) {
let field_num = tag >> 3;
let wire_type = WireType::from(tag & 0x7);
(field_num, wire_type)
}
/// Parse a field, returning the remaining bytes
fn parse_field(data: &[u8]) -> (Field<'_>, &[u8]) {
let (tag, remainder) = parse_varint(data);
let (field_num, wire_type) = unpack_tag(tag);
let (fieldvalue, remainder) = match wire_type {
_ => todo!("Based on the wire type, build a Field, consuming as many bytes as necessary.")
};
todo!("Return the field, and any un-consumed bytes.")
}
/// Parse a message in the given data, calling `T::add_field` for each field in
/// the message.
///
/// The entire input is consumed.
fn parse_message<'a, T: ProtoMessage<'a>>(mut data: &'a [u8]) -> T {
let mut result = T::default();
while !data.is_empty() {
let parsed = parse_field(data);
result.add_field(parsed.0);
data = parsed.1;
}
result
}
#[derive(Debug, Default)]
struct PhoneNumber<'a> {
number: &'a str,
type_: &'a str,
}
#[derive(Debug, Default)]
struct Person<'a> {
name: &'a str,
id: u64,
phone: Vec<PhoneNumber<'a>>,
}
// TODO: Implement ProtoMessage for Person and PhoneNumber.
#[test]
fn test_id() {
let person_id: Person = parse_message(&[0x10, 0x2a]);
assert_eq!(person_id, Person { name: "", id: 42, phone: vec![] });
}
#[test]
fn test_name() {
let person_name: Person = parse_message(&[
0x0a, 0x0e, 0x62, 0x65, 0x61, 0x75, 0x74, 0x69, 0x66, 0x75, 0x6c, 0x20,
0x6e, 0x61, 0x6d, 0x65,
]);
assert_eq!(person_name, Person { name: "beautiful name", id: 0, phone: vec![] });
}
#[test]
fn test_just_person() {
let person_name_id: Person =
parse_message(&[0x0a, 0x04, 0x45, 0x76, 0x61, 0x6e, 0x10, 0x16]);
assert_eq!(person_name_id, Person { name: "Evan", id: 22, phone: vec![] });
}
#[test]
fn test_phone() {
let phone: Person = parse_message(&[
0x0a, 0x00, 0x10, 0x00, 0x1a, 0x16, 0x0a, 0x0e, 0x2b, 0x31, 0x32, 0x33,
0x34, 0x2d, 0x37, 0x37, 0x37, 0x2d, 0x39, 0x30, 0x39, 0x30, 0x12, 0x04,
0x68, 0x6f, 0x6d, 0x65,
]);
assert_eq!(
phone,
Person {
name: "",
id: 0,
phone: vec![PhoneNumber { number: "+1234-777-9090", type_: "home" },],
}
);
}
// Put that all together into a single parse.
#[test]
fn test_full_person() {
let person: Person = parse_message(&[
0x0a, 0x07, 0x6d, 0x61, 0x78, 0x77, 0x65, 0x6c, 0x6c, 0x10, 0x2a, 0x1a,
0x16, 0x0a, 0x0e, 0x2b, 0x31, 0x32, 0x30, 0x32, 0x2d, 0x35, 0x35, 0x35,
0x2d, 0x31, 0x32, 0x31, 0x32, 0x12, 0x04, 0x68, 0x6f, 0x6d, 0x65, 0x1a,
0x18, 0x0a, 0x0e, 0x2b, 0x31, 0x38, 0x30, 0x30, 0x2d, 0x38, 0x36, 0x37,
0x2d, 0x35, 0x33, 0x30, 0x38, 0x12, 0x06, 0x6d, 0x6f, 0x62, 0x69, 0x6c,
0x65,
]);
assert_eq!(
person,
Person {
name: "maxwell",
id: 42,
phone: vec![
PhoneNumber { number: "+1202-555-1212", type_: "home" },
PhoneNumber { number: "+1800-867-5308", type_: "mobile" },
]
}
);
}
- In this exercise there are various cases where protobuf parsing might fail,
e.g. if you try to parse an
i32when there are fewer than 4 bytes left in the data buffer. In normal Rust code we’d handle this with theResultenum, but for simplicity in this exercise we panic if any errors are encountered. On day 4 we’ll cover error handling in Rust in more detail.