Exercise: Protobuf Parsing
In this exercise, you will build a parser for the protobuf binary encoding. Don't worry, it's simpler than it seems! This illustrates a common parsing pattern, passing slices of data. The underlying data itself is never copied.
Fully parsing a protobuf message requires knowing the types of the fields,
indexed by their field numbers. That is typically provided in a proto
file. In
this exercise, we'll encode that information into match
statements in
functions that get called for each field.
We'll use the following proto:
message PhoneNumber {
optional string number = 1;
optional string type = 2;
}
message Person {
optional string name = 1;
optional int32 id = 2;
repeated PhoneNumber phones = 3;
}
Messages
A proto message is encoded as a series of fields, one after the next. Each is
implemented as a "tag" followed by the value. The tag contains a field number
(e.g., 2
for the id
field of a Person
message) and a wire type defining
how the payload should be determined from the byte stream. These are combined
into a single integer, as decoded in unpack_tag
below.
Varint
Integers, including the tag, are represented with a variable-length encoding
called VARINT. Luckily, parse_varint
is defined for you below.
Wire Types
Proto defines several wire types, only two of which are used in this exercise.
The Varint
wire type contains a single varint, and is used to encode proto
values of type int32
such as Person.id
.
The Len
wire type contains a length expressed as a varint, followed by a
payload of that number of bytes. This is used to encode proto values of type
string
such as Person.name
. It is also used to encode proto values
containing sub-messages such as Person.phones
, where the payload contains an
encoding of the sub-message.
Exercise
The given code also defines callbacks to handle Person
and PhoneNumber
fields, and to parse a message into a series of calls to those callbacks.
What remains for you is to implement the parse_field
function and the
ProtoMessage
trait for Person
and PhoneNumber
.
/// A wire type as seen on the wire. enum WireType { /// The Varint WireType indicates the value is a single VARINT. Varint, // The I64 WireType indicates that the value is precisely 8 bytes in // little-endian order containing a 64-bit signed integer or double type. //I64, -- not needed for this exercise /// The Len WireType indicates that the value is a length represented as a /// VARINT followed by exactly that number of bytes. Len, // The I32 WireType indicates that the value is precisely 4 bytes in // little-endian order containing a 32-bit signed integer or float type. //I32, -- not needed for this exercise } #[derive(Debug)] /// A field's value, typed based on the wire type. enum FieldValue<'a> { Varint(u64), //I64(i64), -- not needed for this exercise Len(&'a [u8]), //I32(i32), -- not needed for this exercise } #[derive(Debug)] /// A field, containing the field number and its value. struct Field<'a> { field_num: u64, value: FieldValue<'a>, } trait ProtoMessage<'a>: Default { fn add_field(&mut self, field: Field<'a>); } impl From<u64> for WireType { fn from(value: u64) -> Self { match value { 0 => WireType::Varint, //1 => WireType::I64, -- not needed for this exercise 2 => WireType::Len, //5 => WireType::I32, -- not needed for this exercise _ => panic!("Invalid wire type: {value}"), } } } impl<'a> FieldValue<'a> { fn as_str(&self) -> &'a str { let FieldValue::Len(data) = self else { panic!("Expected string to be a `Len` field"); }; std::str::from_utf8(data).expect("Invalid string") } fn as_bytes(&self) -> &'a [u8] { let FieldValue::Len(data) = self else { panic!("Expected bytes to be a `Len` field"); }; data } fn as_u64(&self) -> u64 { let FieldValue::Varint(value) = self else { panic!("Expected `u64` to be a `Varint` field"); }; *value } } /// Parse a VARINT, returning the parsed value and the remaining bytes. fn parse_varint(data: &[u8]) -> (u64, &[u8]) { for i in 0..7 { let Some(b) = data.get(i) else { panic!("Not enough bytes for varint"); }; if b & 0x80 == 0 { // This is the last byte of the VARINT, so convert it to // a u64 and return it. let mut value = 0u64; for b in data[..=i].iter().rev() { value = (value << 7) | (b & 0x7f) as u64; } return (value, &data[i + 1..]); } } // More than 7 bytes is invalid. panic!("Too many bytes for varint"); } /// Convert a tag into a field number and a WireType. fn unpack_tag(tag: u64) -> (u64, WireType) { let field_num = tag >> 3; let wire_type = WireType::from(tag & 0x7); (field_num, wire_type) } /// Parse a field, returning the remaining bytes fn parse_field(data: &[u8]) -> (Field, &[u8]) { let (tag, remainder) = parse_varint(data); let (field_num, wire_type) = unpack_tag(tag); let (fieldvalue, remainder) = match wire_type { _ => todo!("Based on the wire type, build a Field, consuming as many bytes as necessary.") }; todo!("Return the field, and any un-consumed bytes.") } /// Parse a message in the given data, calling `T::add_field` for each field in /// the message. /// /// The entire input is consumed. fn parse_message<'a, T: ProtoMessage<'a>>(mut data: &'a [u8]) -> T { let mut result = T::default(); while !data.is_empty() { let parsed = parse_field(data); result.add_field(parsed.0); data = parsed.1; } result } #[derive(Debug, Default)] struct PhoneNumber<'a> { number: &'a str, type_: &'a str, } #[derive(Debug, Default)] struct Person<'a> { name: &'a str, id: u64, phone: Vec<PhoneNumber<'a>>, } // TODO: Implement ProtoMessage for Person and PhoneNumber. fn main() { let person_id: Person = parse_message(&[0x10, 0x2a]); assert_eq!(person_id, Person { name: "", id: 42, phone: vec![] }); let person_name: Person = parse_message(&[ 0x0a, 0x0e, 0x62, 0x65, 0x61, 0x75, 0x74, 0x69, 0x66, 0x75, 0x6c, 0x20, 0x6e, 0x61, 0x6d, 0x65, ]); assert_eq!(person_name, Person { name: "beautiful name", id: 0, phone: vec![] }); let person_name_id: Person = parse_message(&[0x0a, 0x04, 0x45, 0x76, 0x61, 0x6e, 0x10, 0x16]); assert_eq!(person_name_id, Person { name: "Evan", id: 22, phone: vec![] }); let phone: Person = parse_message(&[ 0x0a, 0x00, 0x10, 0x00, 0x1a, 0x16, 0x0a, 0x0e, 0x2b, 0x31, 0x32, 0x33, 0x34, 0x2d, 0x37, 0x37, 0x37, 0x2d, 0x39, 0x30, 0x39, 0x30, 0x12, 0x04, 0x68, 0x6f, 0x6d, 0x65, ]); assert_eq!( phone, Person { name: "", id: 0, phone: vec![PhoneNumber { number: "+1234-777-9090", type_: "home" },], } ); // Put that all together into a single parse. let person: Person = parse_message(&[ 0x0a, 0x07, 0x6d, 0x61, 0x78, 0x77, 0x65, 0x6c, 0x6c, 0x10, 0x2a, 0x1a, 0x16, 0x0a, 0x0e, 0x2b, 0x31, 0x32, 0x30, 0x32, 0x2d, 0x35, 0x35, 0x35, 0x2d, 0x31, 0x32, 0x31, 0x32, 0x12, 0x04, 0x68, 0x6f, 0x6d, 0x65, 0x1a, 0x18, 0x0a, 0x0e, 0x2b, 0x31, 0x38, 0x30, 0x30, 0x2d, 0x38, 0x36, 0x37, 0x2d, 0x35, 0x33, 0x30, 0x38, 0x12, 0x06, 0x6d, 0x6f, 0x62, 0x69, 0x6c, 0x65, ]); assert_eq!( person, Person { name: "maxwell", id: 42, phone: vec![ PhoneNumber { number: "+1202-555-1212", type_: "home" }, PhoneNumber { number: "+1800-867-5308", type_: "mobile" }, ] } ); }
- In this exercise there are various cases where protobuf parsing might fail,
e.g. if you try to parse an
i32
when there are fewer than 4 bytes left in the data buffer. In normal Rust code we'd handle this with theResult
enum, but for simplicity in this exercise we panic if any errors are encountered. On day 4 we'll cover error handling in Rust in more detail.