Screwtape's Notepad

Serializing awkward data with serde

Recently I’ve been writing Rust code to work with a third-party data source in TOML format. In other languages I’d just load the data with some standard TOML library and have my program rummage through it, but I’ve been hearing lovely things about the Rust serialization library serde, so I figured I’d try it out.

The basics

Here’s a cut-down example of the data I’m dealing with:

manifest-version = "2"
# ...other useful fields...
[renames.oldpkg]
to = "newpkg"

This is a pretty simple data format, and it’s pretty easy to write a Rust structure that can be serialized to and deserialized from it:

use serde::{Deserialize, Serialize};
use std::collections::BTreeMap;

#[derive(Serialize, Deserialize)]
struct ThirdPartyData {
    #[serde(rename = "manifest-version")]
    manifest_version: String,
    // ...other useful fields...
    renames: BTreeMap<String, BTreeMap<String, String>>,
}

This struct corresponds exactly to the structure of the input data, and the only extra code I had to write was the serde(rename = "blah") attribute because manifest-version is not a legal Rust identifier.

A better Rust structure

Among the communities of strongly-typed languages like Rust, there’s an old maxim: “make illegal states unrepresentable”. That means that if your program assumes something about the data it’s working with, you should use the type system to guarantee that assumption is true.

For example, take that manifest-version field. It isn’t really part of the data I care about; it’s metadata, information about the data I want. When serializing, it must always be set to “2”. When deserializing, if it’s not “2” then this must be some other file format I don’t recognise, and I should give up reading it. The code that uses the rest of the data never needs to read or write that field, and if anything did change it, that would only mess things up later. The best way to make sure nothing ever reads or writes a field is to remove it entirely; otherwise it’s just wasting space.

The renames field is problematic in a different way. It’s definitely data I care about, but it’s represented as a strange double-map. What would it mean if a key in the outer map was associated with an empty inner map? Or what if the inner map had keys other than “to”? A mapping from “old name” to “new name” should just be a BTreeMap<String, String>, and then such illegal states wouldn’t even be possible.
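To make those illegal states concrete, here is a small sketch using only the standard library. The type alias NestedRenames and the helper nonsense_renames are hypothetical names for illustration. The point is that the compiler happily accepts both nonsensical entries.

```rust
use std::collections::BTreeMap;

/// The shape the file format forces on us:
/// old name -> { "to": new name }.
type NestedRenames = BTreeMap<String, BTreeMap<String, String>>;

/// Builds a value that type-checks as `NestedRenames`
/// but makes no sense as a rename table.
fn nonsense_renames() -> NestedRenames {
    let mut renames = NestedRenames::new();

    // An old package with no new name at all.
    renames.insert("oldpkg".to_string(), BTreeMap::new());

    // An old package with an unexpected key and no "to" key.
    let mut inner = BTreeMap::new();
    inner.insert("from".to_string(), "whatever".to_string());
    renames.insert("otherpkg".to_string(), inner);

    renames
}
```

With a plain BTreeMap<String, String>, neither entry can be written down: every old name maps to exactly one new name.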

Together, I want my Rust structure to look more like this:

#[derive(Serialize, Deserialize)]
struct ThirdPartyData {
    // no manifest_version field!

    // ...other useful fields...

    renames: BTreeMap<String, String>,
}

Unfortunately, serde’s derive macro can’t produce what I want from this struct: the generated code doesn’t check that manifest-version is set to the correct value, and it cannot convert the renames field automatically.

Attempt #1: Do it yourself!

If serde’s standard derive macro can’t handle it, we’ll just have to do it manually, right? So, I wrote my own implementations of the serde::Serialize and serde::Deserialize traits for my ThirdPartyData struct. To cut a long story short, it worked! However, it was also tedious to write and complex to understand.

The serde docs for serializing a struct are straightforward, and the process was easy: write a serialize method for your struct that calls the correct methods on a serde::Serializer, and you’re done. However, the docs for deserializing are much more complex: as well as implementing Deserialize for your struct, you also need an extra helper struct and need to implement the serde::de::Visitor trait for it, with a bunch of extra methods.

Then it turns out the lengthy Deserialize example only shows how to write a deserializer for a primitive type like i32. Deserializing a struct gets its own page of documentation, which is vastly more complex.

Like I said, I got it working, but I wasn’t comfortable committing that code to my project.

Attempt #2: Field attributes

Part of the problem was that implementing Serialize and Deserialize manually meant writing code to handle all the fields in my struct, even though serde could handle most of them automatically.

It turns out, one of the many per-field attributes serde provides is the serde(with = "module") attribute. This attribute names a Rust module containing serialize and deserialize functions, which will be used to serialize and deserialize that field specifically, while the rest of the struct is handled by the regular serde machinery.

For the renames field, this is great! I still had to do the dummy struct/Visitor dance, but I only had to do it for that one specific field, not all the fields in my struct.

For the manifest-version field, it didn’t help. Since I didn’t want a manifest_version field in my struct, there was nothing to apply the attribute to.

So I sighed and deleted that code too, and tried to think of another way.

Success: Use intermediate structs

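To recap the problem: serde’s derive macro is great at converting between the input format and Rust structs that match it exactly, but the struct I actually want to use doesn’t match the format, and writing Serialize and Deserialize by hand to bridge the gap is tedious.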

Put like that, I’m sure you can see the solution too: Use serde to convert the input format to Rust structs that match it exactly, then manually convert the data to Rust structs that are nice to use.

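I’m using the “nice” version of ThirdPartyData I sketched above (minus the derive attribute, since the hand-written implementations below would conflict with the derived ones), but the deserialization code now looks like this: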

impl<'de> serde::Deserialize<'de> for ThirdPartyData {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: serde::Deserializer<'de>,
    {
        use serde::de::Error;

        // An intermediate struct that exactly matches the input schema.
        #[derive(Deserialize)]
        struct EncodedThirdPartyData {
            #[serde(rename = "manifest-version")]
            pub manifest_version: String,
            // ...other useful fields...
            pub renames: BTreeMap<String, BTreeMap<String, String>>,
        }

        // Because we derived Deserialize automatically,
        // serde does all the hard work for us.
        let input = EncodedThirdPartyData::deserialize(deserializer)?;

        // Validating the manifest_version field is straightforward.
        if input.manifest_version != "2" {
            return Err(D::Error::invalid_value(
                ::serde::de::Unexpected::Str(&input.manifest_version),
                &"2",
            ));
        }

        // Converting the structure of the renames field
        // is straightforward too.
        let mut renames = BTreeMap::new();
        for (old_pkg, mut inner_map) in input.renames {
            let new_pkg = inner_map
                .remove("to")
                .ok_or_else(|| D::Error::missing_field("to"))?;
            renames.insert(old_pkg, new_pkg);
        }

        // Finally, we move all the data into an instance
        // of our "nice" struct.
        Ok(ThirdPartyData {
            renames,
        })
    }
}

Our intermediate struct owns the deserialized data, so we can rip it apart to build the nice struct without any extra allocations… well, we need to allocate some BTreeMaps to change the structure of the renames map, but at least we don’t need to clone the keys and values.

To serialize a struct, we could use the same intermediate struct and work in reverse, but since that struct owns its data we’d have to pull apart our nice struct to get the data out, or clone all the data. Neither option is great, so instead we’ll use a different struct that replaces all the String types with &str. serde serializes both types the same way, but it means we can do zero-allocation serializing (except for reshaping the renames map, again):

impl serde::Serialize for ThirdPartyData {
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: serde::Serializer,
    {
        // An intermediate struct that exactly matches the input schema,
        // and uses &str instead of String.
        #[derive(Serialize)]
        struct EncodedThirdPartyData<'a> {
            #[serde(rename = "manifest-version")]
            manifest_version: &'a str,
            // ...other useful fields...
            renames: BTreeMap<&'a str, BTreeMap<&'a str, &'a str>>,
        }

        // Convert the structure of the renames field,
        // but take references to the original data.
        let mut renames = BTreeMap::new();
        for (old_pkg, new_pkg) in self.renames.iter() {
            let mut inner = BTreeMap::new();
            inner.insert("to", new_pkg.as_str());
            renames.insert(old_pkg.as_str(), inner);
        }

        let output = EncodedThirdPartyData {
            // We can hard-code the manifest version
            // we want to serialize
            manifest_version: "2",
            renames,
        };

        // Once again, serde does all the hard work for us
        output.serialize(serializer)
    }
}

Finally, we have our nice, robust data structure, with serialization and deserialization almost completely automated. The few lines of hand-written code that remain are a straightforward implementation of the checks and conversions that need to be done.