Data messaging formats are the lifeblood of any distributed application. The ability to pass information back and forth between disparate systems is crucial for any organization, but for a company such as Google, the challenge of connecting the thousands of servers that host its various services created the need for a specialized format tailored to its own requirements.
Recently, Kenton Varda, an engineer working on search engine infrastructure at Google, became the point man for releasing Google's internal messaging format - called Protocol Buffers - as an open source project using the Apache license. I had a chance to talk with Kenton about Protocol Buffers, why they are important to Google and why the decision was made to open source them (and use an internal format rather than a format such as XML, JSON or related technology).
My "primary" project is part of the search engine infrastructure -- enabling Universal Search in particular -- but I've been spending most of my time for the last several months just on getting Protocol Buffers released. Others who have contributed to this work are on a variety of other projects. There isn't actually a "protocol buffers team" at Google -- people from other teams add features as needed.
You write a .proto file like this:
message Person {
required int32 id = 1;
required string name = 2;
optional string email = 3;
}
Then you compile it with protoc, the protocol buffer compiler, to produce code in C++, Java, or Python.
Then, if you are using C++, you use that code like this:
Person person;
person.set_id(123);
person.set_name("Bob");
person.set_email("bob@example.com");
fstream out("person.pb", ios::out | ios::binary | ios::trunc);
person.SerializeToOstream(&out);
out.close();
Or like this:
Person person;
fstream in("person.pb", ios::in | ios::binary);
if (!person.ParseFromIstream(&in)) {
cerr << "Failed to parse person.pb." << endl;
exit(1);
}
cout << "ID: " << person.id() << endl;
cout << "name: " << person.name() << endl;
if (person.has_email()) {
cout << "e-mail: " << person.email() << endl;
}
What do you see as the greatest strengths and weaknesses of PB? With which domains does PB work well, and with which domains do you find you need to go to other structural formats?
Protocol Buffers are good when you have structured data which you need to encode in a way that is both efficient and extensible. The second point is important: a lot of people ask why we didn't just use various existing binary formats, and the answer is usually that those formats do not provide easy extensibility. It is very important to us that it be possible to add or remove fields of a message type easily without breaking compatibility with old software. It's impossible for us to update all our servers simultaneously, so new servers have to be able to send messages to old ones and vice versa without problems.
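To make the extensibility point concrete, here is a minimal sketch of how a reader can skip fields it does not recognize. It is not the Protocol Buffers library: every name in it (AppendVarint, ReadVarint, AppendIntField, AppendStringField, OldPerson, ParseOldPerson) is hypothetical. It assumes a simplified tag-plus-value encoding modeled on the scheme in the public protobuf documentation, where each field is written as a key (field number shifted left by three bits, plus a wire type) followed by its value, and it handles only two wire types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helpers for illustration only; not the protobuf library API.

// Append an unsigned integer as a base-128 varint (7 bits per byte, low bits first).
void AppendVarint(std::string& out, uint64_t value) {
    while (value >= 0x80) {
        out.push_back(static_cast<char>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<char>(value));
}

// Read a varint starting at pos; returns false if the input is truncated.
bool ReadVarint(const std::string& in, size_t& pos, uint64_t& value) {
    value = 0;
    for (int shift = 0; shift < 64 && pos < in.size(); shift += 7) {
        uint8_t byte = static_cast<uint8_t>(in[pos++]);
        value |= static_cast<uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) return true;
    }
    return false;
}

// The two wire types this sketch understands.
enum WireType : uint32_t { kVarint = 0, kLengthDelimited = 2 };

void AppendIntField(std::string& out, uint32_t field, uint64_t value) {
    assert(field > 0);  // field numbers start at 1
    AppendVarint(out, (static_cast<uint64_t>(field) << 3) | kVarint);
    AppendVarint(out, value);
}

void AppendStringField(std::string& out, uint32_t field, const std::string& value) {
    assert(field > 0);
    AppendVarint(out, (static_cast<uint64_t>(field) << 3) | kLengthDelimited);
    AppendVarint(out, value.size());
    out += value;
}

// An "old" reader that only knows fields 1 (id), 2 (name) and 3 (email).
struct OldPerson {
    uint64_t id = 0;
    std::string name;
    std::string email;
    bool has_email = false;
};

// Parse a message, silently skipping any field number the old reader does not know.
bool ParseOldPerson(const std::string& in, OldPerson& person) {
    size_t pos = 0;
    while (pos < in.size()) {
        uint64_t key = 0;
        if (!ReadVarint(in, pos, key)) return false;
        uint32_t field = static_cast<uint32_t>(key >> 3);
        uint32_t type = static_cast<uint32_t>(key & 7);
        if (type == kVarint) {
            uint64_t value = 0;
            if (!ReadVarint(in, pos, value)) return false;
            if (field == 1) person.id = value;  // unknown varint fields are dropped
        } else if (type == kLengthDelimited) {
            uint64_t len = 0;
            if (!ReadVarint(in, pos, len)) return false;
            if (len > in.size() - pos) return false;  // length runs past the buffer
            std::string value = in.substr(pos, static_cast<size_t>(len));
            pos += static_cast<size_t>(len);
            if (field == 2) {
                person.name = value;
            } else if (field == 3) {
                person.email = value;
                person.has_email = true;
            }  // unknown length-delimited fields are dropped
        } else {
            return false;  // other wire types are outside this sketch
        }
    }
    return true;
}
```

A newer writer could add, say, a phone number as field 4 with AppendStringField, and ParseOldPerson would still recover the id and name because the wire type alone tells it how many bytes to skip. The reverse direction works the same way: a message that omits email simply leaves has_email false.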
We've all felt for a long time that Protocol Buffers should be open source. It just seemed like the natural thing to do. However, someone needed to put in the effort to get them into a releasable state. Version 1 of Protocol Buffers had evolved slowly over many years and had become a bit of a mess, so we decided that we really needed to rewrite it before we could feel good about releasing it. I started this work in my 20% time in 2006 thinking it would take a couple months.
As with so many software projects, it took several times that, but we were finally able to get it out the door.
You indicate in your documentation for PB that you'd evaluated XML and found that it was inefficient for your needs. Why was this? Is this an indication that Google sees XML as insufficient for this role specifically, or in general?
Why establish your own protocol rather than utilizing one currently in use, such as YAML, ASN.1, JSON, HDF5 or others along those lines, most of which are accepted as common interchange formats?
ASN.1 and HDF5 are examples of extremely complicated standards. We believe that the more complicated you make the spec, the harder it is to implement it well. A high-quality implementation is worth much more to us than a lot of features, and is usually much easier to use. You might consider this analogous to the philosophy behind our home page.
PB obviously has a fair amount of similarity to JSON in particular -
are there efforts to integrate these two formats internal to Google?
Where do you see Protocol Buffers being used eventually within the open
source community? How does it factor in with other Google software
initiatives?
To be honest, we don't know. We have code which uses Protocol Buffers which we would like to release open source, so getting protobuf out there will allow us to do that. Other than that, though, this release is not a part of any grand scheme. We simply had a tool we found very useful internally and wanted to open it up for others to use too. We're eager to see what people decide to do with it.
Is there more information about Protocol Buffers available?
Check out the Google code page at http://code.google.com/apis/protocolbuffers/docs/overview.html.

