Recently I’ve been writing Rust code to work with a third-party data source in TOML format. In other languages I’d just load the data with some standard TOML library and have my program rummage through it, but I’ve been hearing lovely things about the Rust serialization library serde, so I figured I’d try it out.
Here’s a cut-down example of the data I’m dealing with:

    manifest-version = "2"
    # ...other useful fields...

    [renames.oldpkg]
    to = "newpkg"

This is a pretty simple data format, and it’s pretty easy to write a Rust structure that can be serialized to and deserialized from it:

    use std::collections::BTreeMap;

    use serde::{Deserialize, Serialize};

    #[derive(Serialize, Deserialize)]
    struct ThirdPartyData {
        #[serde(rename = "manifest-version")]
        manifest_version: String,
        // ...other useful fields...
        renames: BTreeMap<String, BTreeMap<String, String>>,
    }

This struct corresponds exactly to the structure of the input data, and the only extra code I had to write was the serde(rename = "blah") attribute, because manifest-version is not a legal Rust identifier.
Among the communities of strongly-typed languages like Rust, there’s an old maxim: “make illegal states unrepresentable”. That means that if your program assumes something about the data it’s working with, you should use the type system to guarantee that assumption is true.
For example, take that manifest-version field. That’s not really part of the data I care about; it’s metadata, information about the data I want. When serializing, it must always be set to “2”. When deserializing, if it’s not “2” then this must be some other file format I don’t recognise, and I should give up reading it. The code that uses the rest of the data never needs to read or write that field, and if anything did change that field it would only mess things up later. The best way to make sure nothing ever reads or writes a field is to remove the field entirely; otherwise it’s just wasting space.
The renames field is problematic in a different way. It’s definitely data I care about, but it’s represented as a strange double-map. What would it mean if a key in the outer map were associated with an empty inner map? Or what if the inner map had keys other than “to”? A mapping from “old name” to “new name” should just be a BTreeMap<String, String>, and then such illegal states wouldn’t even be possible.
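To make those illegal states concrete, here is a small standard-library-only sketch. The type aliases and the `flatten` helper are hypothetical names used for illustration; they are not part of the original code.

```rust
use std::collections::BTreeMap;

// The shape the file format imposes: nothing stops an inner map from
// being empty, or from holding keys other than "to".
type NestedRenames = BTreeMap<String, BTreeMap<String, String>>;

// The shape that is pleasant to use: exactly one new name per old name.
type FlatRenames = BTreeMap<String, String>;

// Hypothetical helper: flatten the nested map, rejecting every state
// that the flat map cannot represent.
fn flatten(nested: NestedRenames) -> Result<FlatRenames, String> {
    let mut flat = FlatRenames::new();
    for (old, mut inner) in nested {
        // An empty inner map (or one without "to") is an error.
        let new = inner
            .remove("to")
            .ok_or_else(|| format!("{old}: missing \"to\" key"))?;
        // Anything left over is a key we do not understand.
        if let Some(extra) = inner.keys().next() {
            return Err(format!("{old}: unexpected key \"{extra}\""));
        }
        flat.insert(old, new);
    }
    Ok(flat)
}
```

Once the data is in the flat shape, none of the later code has to ask what an empty or over-full inner map means, because such values can no longer be constructed.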
Together, I want my Rust structure to look more like this:

    #[derive(Serialize, Deserialize)]
    struct ThirdPartyData {
        // no manifest_version field!
        // ...other useful fields...
        renames: BTreeMap<String, String>,
    }

Unfortunately, with serde this doesn’t do what I want: it doesn’t check that manifest-version is set to the correct value, and it cannot convert the renames field automatically. If serde’s standard derive macro can’t handle it, we’ll just have to do it manually, right?
So, I wrote my own implementations of the serde::Serialize and serde::Deserialize traits for my ThirdPartyData struct.
To cut a long story short, it worked!
However, it was also tedious to write and complex to understand.
The serde docs for serializing a struct are straightforward, and the process was easy: write a serialize method for your struct that calls the correct methods on a serde::Serializer, and you’re done.
However, the docs for deserializing are much more complex: as well as implementing Deserialize for your struct, you also need an extra helper struct and need to implement the serde::de::Visitor trait for it, with a bunch of extra methods.
Then it turns out the lengthy Deserialize example only shows how to write a deserializer for a primitive type like i32. Deserializing a struct gets its own page of documentation, which is vastly more complex.
Like I said, I got it working, but I wasn’t comfortable committing that code to my project.
Part of the problem was that implementing Serialize and Deserialize manually meant writing code to handle all the fields in my struct, even though serde could handle most of them automatically.
It turns out that one of the many per-field attributes serde provides is the serde(with = "module") attribute. This attribute names a Rust module containing serialize and deserialize functions, which will be used to serialize and deserialize that field specifically, while the rest of the struct is handled by the regular serde machinery.
For the renames field, this is great! I still had to do the dummy struct/Visitor dance, but I only had to do it for that one specific field, not all the fields in my struct.
For the manifest-version field, it didn’t help. Since I didn’t want a manifest_version field in my struct, there was nothing to apply the attribute to.
So I sighed and deleted that code too, and tried to think of another way.
To recap the problem:
Put like that, I’m sure you can see the solution too: use serde to convert the input format to Rust structs that match it exactly, then manually convert the data to Rust structs that are nice to use.
I’m using the “nice” version of ThirdPartyData I sketched above (minus the derive attribute, since the trait implementations are now written by hand), and the deserialization code now looks like this:

    impl<'de> serde::Deserialize<'de> for ThirdPartyData {
        fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
        where
            D: serde::Deserializer<'de>,
        {
            use serde::de::Error;

            // An intermediate struct that exactly matches the input schema.
            #[derive(Deserialize)]
            struct EncodedThirdPartyData {
                #[serde(rename = "manifest-version")]
                pub manifest_version: String,
                // ...other useful fields...
                pub renames: BTreeMap<String, BTreeMap<String, String>>,
            }

            // Because we derived Deserialize automatically,
            // serde does all the hard work for us.
            let input = EncodedThirdPartyData::deserialize(deserializer)?;

            // Validating the manifest_version field is straightforward.
            if input.manifest_version != "2" {
                return Err(D::Error::invalid_value(
                    ::serde::de::Unexpected::Str(&input.manifest_version),
                    &"2",
                ));
            }

            // Converting the structure of the renames field
            // is straightforward too.
            let mut renames = BTreeMap::new();
            for (old_pkg, mut inner_map) in input.renames {
                let new_pkg = inner_map
                    .remove("to")
                    .ok_or_else(|| D::Error::missing_field("to"))?;
                renames.insert(old_pkg, new_pkg);
            }

            // Finally, we move all the data into an instance
            // of our "nice" struct.
            Ok(ThirdPartyData {
                renames: renames,
            })
        }
    }

Our intermediate struct owns the deserialized data, so we can rip it apart to build the nice struct without any extra allocations… well, we need to allocate some BTreeMaps to change the structure of the renames map, but at least we don’t need to clone the keys and values.
To serialize a struct, we could use the same intermediate struct and work in reverse, but since that struct owns its data we’d have to pull apart our nice struct to get the data out, or clone all the data. Neither option is great, so instead we’ll use a different struct that replaces all the String types with &str. serde serializes both types the same way, but borrowing means we can do zero-allocation serializing (except for reshaping the renames map, again):

    impl serde::Serialize for ThirdPartyData {
        fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
        where
            S: serde::Serializer,
        {
            // An intermediate struct that exactly matches the input schema,
            // and uses &str instead of String.
            #[derive(Serialize)]
            struct EncodedThirdPartyData<'a> {
                #[serde(rename = "manifest-version")]
                manifest_version: &'a str,
                // ...other useful fields...
                renames: BTreeMap<&'a str, BTreeMap<&'a str, &'a str>>,
            }

            // Convert the structure of the renames field,
            // but take references to the original data.
            let mut renames = BTreeMap::new();
            for (old_pkg, new_pkg) in self.renames.iter() {
                let mut inner = BTreeMap::new();
                inner.insert("to", new_pkg.as_str());
                renames.insert(old_pkg.as_str(), inner);
            }

            let output = EncodedThirdPartyData {
                // We can hard-code the manifest version
                // we want to serialize
                manifest_version: "2",
                renames: renames,
            };

            // Once again, serde does all the hard work for us
            output.serialize(serializer)
        }
    }

Finally, we have our nice, robust data structure, with serialization and deserialization almost completely automated. The only handwritten code is a few lines that straightforwardly implement the checks and conversions that need to be done.