24 Jun 2016

Protocol Buffers Are Needlessly Complex

Protocol buffers are far too complex for the little benefit they offer over JSON, or over binary JSON if you want more efficiency.

To begin with, every field in protobufs needs a tag:

string address = 1;

Here, 1 is the tag for this field. When the protobuf is serialised, the tag is stored, not the name of the field (address, in this case).

One reason given for this is efficiency: if you have a million addresses, you don’t need to store the key “address” a million times. But that’s a premature optimisation, and a compression trick besides. There’s no need to expose tags to the user: the serialiser can synthesise tags on the fly as it serialises each field, and store the name-to-tag lookup table along with the data. That way, instead of storing “address” a million times, you store the number 1 a million times. Tags then become an implementation detail of the protobuf library, like the dictionary in a compression algorithm. Programmers who deal with protos never need to think about tags.
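Here’s a minimal sketch of that idea in Java. TaggingWriter and everything in it is hypothetical, invented purely for illustration; it is not a real protobuf API:

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical: the serialiser allocates tags on the fly, the way a
// compression algorithm builds its dictionary. Users never see them.
class TaggingWriter {
    private final Map<String, Integer> tags = new LinkedHashMap<>();

    // Returns the tag for a field name, synthesising one on first use.
    int tagFor(String fieldName) {
        return tags.computeIfAbsent(fieldName, name -> tags.size() + 1);
    }

    // Written once per stream: "address" is stored once, and each
    // record stores only the small number 1.
    Map<String, Integer> nameToTagTable() {
        return tags;
    }
}

The stream would store this table once up front, then every record by tag: the same saving the protobuf designers wanted, with tags kept out of the programmer’s sight.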

The other reason given in support of tags is backward-compatibility. But you don’t need tags for backward-compatibility. You can use strings just as well. Any unique identifier will do. If anything, tags confuse rather than clarify things. Which of the following changes are backward-compatible:

- Adding a new field

- Removing an existing field

- Renaming an existing field

- Changing the type of an existing field

- Changing the tag of an existing field

- Changing a singular field to repeated or vice-versa

Unless you have a PhD in protocol buffers, you don’t know which of these are backward-compatible. Or forward-compatible.

One particular case is deleting a field, and then months later, adding a new field that happens to reuse the tag of the deleted field. This can cause severe problems. And the solution to that is to declare certain tags “reserved”. Which is yet another layer of band-aid on top of a needlessly complex system.
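For the record, this is what the band-aid looks like in a .proto file. The message and field names here are made up:

message Contact {
  reserved 2;              // tag of the deleted field, never to be reused
  reserved "old_address";  // the name can be reserved too
  string name = 1;
  string email = 3;
}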

Protobufs also have too many data types. Integers alone come in 32-bit and 64-bit versions, signed and unsigned, variable- and fixed-length, as listed below. That’s far too many options. And some of them are interchangeable: you can change a field from a 32-bit type to the corresponding 64-bit type, and it will still parse. The reverse works too, unless a value happens to exceed the range of a 32-bit integer, in which case it’s silently truncated. That’s dangerous, and should never happen implicitly and silently.
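Concretely, these are the integer types a .proto author has to choose between, all of them real protobuf scalar types:

int32, int64         // variable-length; inefficient for negative values
uint32, uint64       // variable-length, unsigned
sint32, sint64       // variable-length, zigzag-encoded for signed values
fixed32, fixed64     // fixed-length, unsigned
sfixed32, sfixed64   // fixed-length, signed

Ten types, just for integers.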

Another question: suppose you were to deserialise a proto, and then immediately re-serialise it without setting any fields. Will its state change? Surprisingly, yes. If the code that created the proto used a newer version of the proto definition, one with additional fields, those fields won’t be preserved when you deserialise and then re-serialise the proto.
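A sketch of the failure, where PersonV1 and PersonV2 stand in for the classes protoc would generate from the old and new definitions of a hypothetical Person message (the newer definition adds an email field); parseFrom and toByteArray are the real generated-class methods:

// Written by newer code, with the new email field set.
byte[] bytes = personV2.toByteArray();

// Parsed and re-serialised by older code that never touches a field.
PersonV1 p = PersonV1.parseFrom(bytes);
byte[] roundTripped = p.toByteArray();  // proto3, as of 2016, drops the unknown email field here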

JSON is simple, consistent and just works. There’s no need for all this complexity, which is more likely to cause bugs than fix them.

Protos aren’t self-describing, either — if I give you a proto without telling you what message it’s an instance of, you won’t be able to deserialise it. Or, if you use the wrong definition, you’ll get corrupted data. This is a step back from JSON, a premature optimisation [1].
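Concretely: nothing in the bytes names the message type, so the decoder will happily apply the wrong definition. Person and Invoice are hypothetical generated classes here:

byte[] bytes = person.toByteArray();
// This either throws, or "succeeds" and hands you garbage field values,
// because tags and wire types are all the decoder has to go on.
Invoice notAPerson = Invoice.parseFrom(bytes);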

Neither are protos self-delimiting. If you have a network connection across which the other side is streaming one proto after another, and you want to process each as it comes in, you can’t, because the decoder needs to be told where the end of each proto is. JSON, by contrast, is self-delimiting: when you find an opening { or [, just scan for the matching closing } or ].
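The Java protobuf library’s answer is to bolt framing on top: writeDelimitedTo prefixes each message with its length, and parseDelimitedFrom reads one message per prefix. Those two methods are real; Person is again a hypothetical generated class:

// Sender: you must length-prefix each message yourself.
person.writeDelimitedTo(outputStream);

// Receiver: the length prefix, not the encoding itself, says where each proto ends.
Person next = Person.parseDelimitedFrom(inputStream);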

JSON [2] can also be parsed by JavaScript running in a web app; protos can’t be [3].

It’s also hard to handle protos in a generic manner, because they’re statically typed. To work around that limitation, descriptors were introduced, which are a kind of reflection. JSON doesn’t need any of that complexity: a JSON object is effectively just a Map<String, Object>.
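Compare walking all the fields generically in each world. The protobuf half uses the real descriptor API from the Java library (message is a com.google.protobuf.Message); the JSON half needs nothing beyond a map:

// Protobuf: reflection-style access via descriptors.
for (Descriptors.FieldDescriptor fd : message.getDescriptorForType().getFields()) {
    System.out.println(fd.getName() + " = " + message.getField(fd));
}

// JSON: it really is just a Map<String, Object>.
for (Map.Entry<String, Object> e : json.entrySet()) {
    System.out.println(e.getKey() + " = " + e.getValue());
}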

To sidestep this entire mess, protobufs as they exist should be thrown away. Start with JSON. Use a binary encoding if performance is critical.

If you want static type checking (and you’re coding in a statically typed language), retrofit it on top of JSON. You’d write a definition that looks like JSON, except that each value is replaced by its type. That is, if a JSON object said name: “Kartick”, the type definition for that message would say name: string. Then you can build a compiler that generates a class with getters and setters. That way, you won’t have typos in key names, or type mismatches in values, like accidentally setting a field to a string when it should have been an integer. Finally, when you invoke toString() on that object, you’ll get well-formed JSON that you can send down the wire. On the receiving side, you’ll use a parser and again use statically typed getters to access the fields. Importantly, the class generated by the compiler will still let you use dynamic typing to get and set any field, as with a Map. That way, you have the best of both worlds: the flexibility and simplicity of JSON with the type checking of protocol buffers.
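A sketch of what the generated class might look like under this scheme. PersonMessage and the compiler that would emit it are hypothetical, and toString() below skips JSON string escaping for brevity:

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical output of the proposed compiler for the definition { name: string }.
class PersonMessage {
    private final Map<String, Object> fields = new LinkedHashMap<>();

    // Statically typed accessors: no typos in key names, no type mismatches.
    String getName() { return (String) fields.get("name"); }
    void setName(String name) { fields.put("name", name); }

    // Dynamic access is still available, as with a plain Map.
    Object get(String key) { return fields.get(key); }
    void set(String key, Object value) { fields.put(key, value); }

    // Emits well-formed JSON to send down the wire (escaping omitted for brevity).
    @Override public String toString() {
        StringBuilder sb = new StringBuilder("{");
        String sep = "";
        for (Map.Entry<String, Object> e : fields.entrySet()) {
            sb.append(sep).append('"').append(e.getKey()).append("\": ");
            Object v = e.getValue();
            sb.append(v instanceof String ? "\"" + v + "\"" : v);
            sep = ", ";
        }
        return sb.append('}').toString();
    }
}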

Get rid of protocol buffers and just use JSON, or build something like protobufs on top of JSON.

[1] This is like how objects in Java know what class they are an instance of: you can safely cast an Object reference to a particular class, with runtime checking. Not so in C++, where you can corrupt data and memory if you get the cast wrong.

[2] One thing protos do better than JSON is repeated fields. I once had a bug in my JSON parsing code where I had an array field phoneNumbers, like:

phoneNumbers: [1, 2, 3]

To check whether there are any phone numbers, I checked jsonObject.hasField('phoneNumbers'), but the creator of the JSON happened to encode it as:

phoneNumbers: []

The field did exist; it was just empty, so my code didn’t work right. Protos don’t have this gotcha. A repeated field can occur zero times, once or multiple times, and there’s no distinction between an empty array and the field not being encoded at all.
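The fix is to test for emptiness, not mere presence. A sketch, with the JSON object as the Map<String, Object> it effectively is:

List<?> numbers = (List<?>) json.get("phoneNumbers");
boolean hasNumbers = numbers != null && !numbers.isEmpty();  // presence alone isn't enough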

[3] Unless they are encoded as JSON, in which case, just use JSON everywhere to begin with.
