Interview: Google Open Sources Protocol Buffers

By Kurt Cagle
July 18, 2008 | Comments: 2

Data messaging formats are the lifeblood of any distributed application. The ability to pass information back and forth between disparate systems is crucial for any organization. For a company such as Google, the challenge of connecting the thousands of servers that host its various services created the need for a specialized format tailored to its own requirements.

Recently, Kenton Varda, an engineer working on search engine infrastructure at Google, became the point man for releasing Google's internal messaging format, called Protocol Buffers, as an open source project under the Apache license. I had a chance to talk with Kenton about Protocol Buffers: why they are important to Google, why Google uses an internal format rather than XML, JSON or a related technology, and why the decision was made to open source them.

What do you and your team do at Google?

My "primary" project is part of the search engine infrastructure -- enabling Universal Search in particular -- but I've been spending most of my time for the last several months just on getting Protocol Buffers released. The others who have contributed to this effort work on a variety of other projects. There isn't actually a "protocol buffers team" at Google -- people from other teams add features as needed.

You recently announced that you are open sourcing Google Protocol Buffers (PB), an internal data interchange format at Google. What role do Protocol Buffers serve at Google?

Practically all our internal data formats -- for both RPC and storage -- are based on Protocol Buffers. Many apps need them for performance reasons, but they are also often used just because it's the path of least resistance. Everyone knows what Protocol Buffers are and they're integrated well into our build system. Even when a human-readable format is needed -- for example, for config files -- the Protocol Buffers "text format" is frequently used just for the convenience of having generated data access classes, and because it is familiar.

Can you give an example of what a Protocol Buffer looks like?

You write a .proto file like this:

message Person {
  required int32 id = 1;
  required string name = 2;
  optional string email = 3;
}

Then you compile it with protoc, the protocol buffer compiler, to produce code in C++, Java, or Python.

Then, if you are using C++, you use that code like this:

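// Assumes the protoc-generated header for Person and <fstream> are
// included, with `using namespace std`.
// Fill in a Person using the generated setters.
Person person;
person.set_id(123);
person.set_name("Bob");
person.set_email("bob@example.com");

// Write the binary encoding to a file.
fstream out("person.pb", ios::out | ios::binary | ios::trunc);
person.SerializeToOstream(&out);
out.close();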

Or like this:

// Read the binary encoding back from the file.
Person person;
fstream in("person.pb", ios::in | ios::binary);
if (!person.ParseFromIstream(&in)) {
  cerr << "Failed to parse person.pb." << endl;
  exit(1);
}

cout << "ID: " << person.id() << endl;
cout << "name: " << person.name() << endl;
// email is optional, so check that it was set before printing it.
if (person.has_email()) {
  cout << "e-mail: " << person.email() << endl;
}

What do you see as the greatest strengths and weaknesses of PB? For which domains does PB work well, and for which do you find you need to turn to other structured formats?

Protocol Buffers are good when you have structured data which you need to encode in a way that is both efficient and extensible. The second point is important: a lot of people ask why we didn't just use various existing binary formats, and the answer is usually that those formats do not provide easy extensibility. It is very important to us that it be possible to add or remove fields of a message type easily without breaking compatibility with old software. It's impossible for us to update all our servers simultaneously, so new servers have to be able to send messages to old ones and vice versa without problems.
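As an editorial illustration of the extensibility point above, the following C++ sketch shows how a tagged binary encoding lets old code skip fields it does not recognize. It is loosely modelled on the publicly documented Protocol Buffers encoding (a key that combines field number and wire type, followed by a varint or a length-prefixed payload), but it is not the library's implementation. The names PutVarint, GetVarint, PutInt, PutBytes, OldPerson and ParseOldPerson are hypothetical, and the reader is assumed to predate the email field.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Append v as a base-128 varint: 7 payload bits per byte,
// high bit set on every byte except the last.
void PutVarint(std::string& out, uint64_t v) {
  while (v >= 0x80) {
    out.push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  out.push_back(static_cast<char>(v));
}

// Read a varint starting at pos; returns false if the input is truncated.
bool GetVarint(const std::string& in, size_t& pos, uint64_t& v) {
  v = 0;
  for (int shift = 0; shift < 64 && pos < in.size(); shift += 7) {
    uint8_t b = static_cast<uint8_t>(in[pos++]);
    v |= static_cast<uint64_t>(b & 0x7F) << shift;
    if ((b & 0x80) == 0) return true;
  }
  return false;
}

// Every value is preceded by a key: (field number << 3) | wire type.
// Wire type 0 = varint, wire type 2 = length-prefixed bytes.
void PutInt(std::string& out, uint32_t field, uint64_t v) {
  assert(field >= 1);  // field numbers start at 1
  PutVarint(out, (static_cast<uint64_t>(field) << 3) | 0);
  PutVarint(out, v);
}

void PutBytes(std::string& out, uint32_t field, const std::string& s) {
  assert(field >= 1);
  PutVarint(out, (static_cast<uint64_t>(field) << 3) | 2);
  PutVarint(out, s.size());
  out += s;
}

// Hypothetical "old" reader that only knows fields 1 (id) and 2 (name).
struct OldPerson {
  uint64_t id = 0;
  std::string name;
  int skipped = 0;  // number of fields this reader did not recognize
};

bool ParseOldPerson(const std::string& in, OldPerson& p) {
  size_t pos = 0;
  while (pos < in.size()) {
    uint64_t key = 0, n = 0;
    if (!GetVarint(in, pos, key)) return false;
    uint32_t field = static_cast<uint32_t>(key >> 3);
    int type = static_cast<int>(key & 7);
    if (type == 0) {
      if (!GetVarint(in, pos, n)) return false;
      if (field == 1) p.id = n; else ++p.skipped;
    } else if (type == 2) {
      if (!GetVarint(in, pos, n) || n > in.size() - pos) return false;
      if (field == 2) p.name = in.substr(pos, n); else ++p.skipped;
      pos += n;  // the length prefix lets us step over unknown payloads
    } else {
      return false;  // other wire types are not handled in this sketch
    }
  }
  return true;
}
```

Under these assumptions, a newer writer could emit id, name, email and some later-added field for a Person, and ParseOldPerson would still recover id and name while simply counting the other two fields as skipped.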

Of course, XML and JSON provide extensibility as well, but Protocol Buffers have an advantage over them in efficiency -- Protocol Buffers are both smaller and faster to parse. Furthermore, the data access classes generated by the Protocol Buffer compiler are often more convenient to use than typical SAX or DOM parsers. Of course, lack of human-readability can be a serious disadvantage depending on the use case.
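To give a rough feel for the size claim, this editorial C++ sketch computes how many bytes the Person record from the earlier example would occupy in a tagged varint encoding of the kind Protocol Buffers document, and builds one plausible XML rendering of the same data for comparison. The XML element names and the helper names (VarintSize, IntFieldSize, StringFieldSize, BinaryPersonSize, XmlPerson) are assumptions made for illustration, and the sketch says nothing about parsing speed.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Bytes needed to store v as a base-128 varint (7 payload bits per byte).
size_t VarintSize(uint64_t v) {
  size_t n = 1;
  while (v >= 0x80) {
    v >>= 7;
    ++n;
  }
  return n;
}

// Key (field number << 3 | wire type 0) plus the varint value.
size_t IntFieldSize(uint32_t field, uint64_t v) {
  return VarintSize(static_cast<uint64_t>(field) << 3) + VarintSize(v);
}

// Key (field number << 3 | wire type 2), length prefix, then the raw bytes.
size_t StringFieldSize(uint32_t field, const std::string& s) {
  return VarintSize((static_cast<uint64_t>(field) << 3) | 2) +
         VarintSize(s.size()) + s.size();
}

// Encoded size of a Person with id = 1, name = 2, email = 3.
size_t BinaryPersonSize(uint64_t id, const std::string& name,
                        const std::string& email) {
  return IntFieldSize(1, id) + StringFieldSize(2, name) +
         StringFieldSize(3, email);
}

// One hypothetical XML layout for the same record, without whitespace.
std::string XmlPerson(uint64_t id, const std::string& name,
                      const std::string& email) {
  return "<person><id>" + std::to_string(id) + "</id><name>" + name +
         "</name><email>" + email + "</email></person>";
}
```

For the record used earlier (123, "Bob", "bob@example.com"), BinaryPersonSize gives 24 bytes, while this particular XML rendering is 75 characters long. A different XML layout would give a different number, so treat the ratio as indicative only.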

That said, XML is a much better solution when you need to encode documents composed primarily of text with markup. Protocol Buffers provide no obvious way to interleave text with structured elements. XML and JSON are also better if you need a human-readable format -- although there is a standard way to encode Protocol Buffers in text, it provides no real advantages over JSON.

Why was the decision made to open source PB, and why now?

We've all felt for a long time that Protocol Buffers should be open source. It just seemed like the natural thing to do. However, someone needed to put in the effort to get them into a releasable state. Version 1 of Protocol Buffers had evolved slowly over many years and had become a bit of a mess, so we decided that we really needed to rewrite it before we could feel good about releasing it. I started this work in my 20% time in 2006 thinking it would take a couple months.

As with so many software projects, it took several times that, but we were finally able to get it out the door.

You indicate in your documentation for PB that you evaluated XML and found it inefficient for your needs. Why was this? Does it mean that Google sees XML as insufficient for this particular role, or in general?

Contrary to what many people are saying, our intent with this release is not to "kill XML". We simply believe that while XML works very well in the situations for which it was designed, it is not the ideal solution for every problem. XML is inherently inefficient both in terms of size and parsing speed since it is a text-based format. In many applications, these inefficiencies don't matter, but for us they make a big difference. Furthermore, XML, despite being a simplification of SGML, is still a very complicated standard, and many of its features just get in the way in a lot of cases. Protocol Buffers are designed to be very simple conceptually.

Why establish your own protocol rather than using one already in use, such as YAML, ASN.1, JSON or HDF5, most of which are accepted as common interchange formats?

ASN.1 and HDF5 are examples of extremely complicated standards. We believe that the more complicated you make the spec, the harder it is to implement it well. A high-quality implementation is worth much more to us than a lot of features, and is usually much easier to use. You might consider this analogous to the philosophy behind our home page.

YAML and JSON are commendably simple standards, but are still text-based, and therefore not as efficient. Protocol Buffers are very similar in structure to JSON, to the point where you could reasonably think of Protocol Buffers as being a binary format for JSON. Some projects at Google actually use them this way -- it's easy to write code that reads or writes a protocol message in JSON format. That said, this similarity wasn't intentional; as far as I know, Protocol Buffers predate JSON.

PB obviously has a fair amount of similarity to JSON in particular. Are there efforts to integrate these two formats within Google?

Yes, it's very easy to write code which reads or writes a protocol message object in JSON format by using protobuf reflection. Some Google projects do exactly this. This is particularly useful when communicating with AJAX clients.

Where do you see Protocol Buffers eventually being used within the open source community? How do they fit in with other Google software initiatives?

To be honest, we don't know. We have code that uses Protocol Buffers and that we would like to release as open source, so getting protobuf out there will allow us to do that. Other than that, though, this release is not a part of any grand scheme. We simply had a tool we found very useful internally and wanted to open it up for others to use too. We're eager to see what people decide to do with it.

Is there more information about Protocol Buffers available?

Check out the Google code page at http://code.google.com/apis/protocolbuffers/docs/overview.html.



2 Comments

You know, just an idea... since this was an interview why not use a recording of the interview for the audio version instead of a machine reading of the text copy?

Because it was an e-mail interview rather than an audio one ... and I wasn't thinking about that when I wrote up the title.
