INTERNET-DRAFT S. Stuart
Intended Status: Proposed Standard Google
Expires: April 11, 2013 R. Fernando
Cisco
October 8, 2012
Encoding rules and MIME type for Protocol Buffers
draft-rfernando-protocol-buffers-00
Abstract
This document describes the encoding format for Protocol Buffers
encoded data and registers a MIME type associated with Protocol
Buffers encoded data.
Status of this Memo
This Internet-Draft is submitted to IETF in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/1id-abstracts.html
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
Copyright and License Notice
Copyright (c) 2012 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
Fernando and Stuart Expires April 11, 2013 [Page 1]
INTERNET DRAFT Protocol Buffers October 8, 2012
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction
1.1 Terminology
2. Message Structure
3. Encoding Rules
3.1 Numbers as VarInts
3.2 Encoding and Interpretation of Protobuf Messages
3.3 Wire Types
3.3.1 Wire Type 0
3.3.2 Wire Type 1
3.3.3 Wire Type 2
3.3.4 Wire Type 5
4. Embedded Messages
5. Optional and Repeated Elements
6. Field Order
7. IANA Considerations
8. Security Considerations
9. Acknowledgements
10. References
10.1 Informative References
Authors' Addresses
1. Introduction
Protocol buffers, referred to as protobuf in this document, is a
commonly used interchange format to serialize structured data for
storage and transmission between applications and systems. It
supports simple and composite data types and provides rules to
serialize those data types into a portable format that is both
language and platform neutral. Since it encodes data into binary
format, it is fast and efficient. It is also supported by a wide
variety of programming languages.
While protocol buffers has gained widespread use, it has so far been
described only informally and has not been standardized. This
document specifies the encoding rules for protobuf and registers the
MIME type 'application/protobuf' for it in accordance with RFC 2048.
This document borrows heavily from the web page [GPBENC].
1.1 Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
2. Message Structure
Protobuf defines all data elements in discrete units called
"messages" [GPBOVW]. A message is a logical collection of related
data items. It is similar to a "record" or a "structure" in a
traditional programming language. Many standard simple data types are
available as field types, including bool, int32, float, double and
string. One can also add further structure to the outer message by
using enums and other messages as field types.
The following is an example of a message definition in protobuf:
   message Person {
     enum PhoneType {
       MOBILE = 0;
       HOME = 1;
       WORK = 2;
     }
     required string name = 1;
     required int32 id = 2;
     optional string email = 3;
     message PhoneNumber {
       required string number = 1;
       optional PhoneType type = 2;
     }
     repeated PhoneNumber phone = 4;
   }
Note the presence of simple data types such as strings and int32s as
well as complex data types such as enums and messages in the above
message definition.
Each field is annotated with one of the following three modifiers:
1. required: a value for the field must be provided; otherwise the
message is considered malformed and the decoding entity will
throw an exception.
2. optional: the field may or may not be set. If an optional field
value isn't set, a default value is used.
3. repeated: the field may be repeated any number of times (including
zero). The order of the repeated values is preserved in the
protocol buffer encoding.
The integer token to the right of the "=" sign is the field number.
Field numbers uniquely identify the fields of a message and,
together with the wire type, are used to form the key of the
key-value pairs in the serialized data stream. Field numbers 1
through 15 produce a one-byte key, whereas field numbers 16 through
2047 need two bytes, so as an optimization one can reserve the low
field numbers for commonly used or repeated elements. Each element
of a repeated field (unless the field is packed) requires re-encoding
the field number, so repeated fields are particularly good
candidates for this optimization.
This document will not describe every syntactic element of the
protobuf language but will restrict discussion to only those elements
that are relevant to the encoding and decoding of data types.
3. Encoding Rules
This section describes the encoding rules for the different field
types.
3.1 Numbers as VarInts
To understand protobuf encoding, one first needs to understand
VarInts.
Integers in protobuf, including field keys and lengths, are
represented as base 128 variable-length integers (VarInts) unless a
fixed-width type is used. A VarInt uses only as many bytes as are
necessary to represent a number, and it can be used to encode
arbitrarily large numbers. It achieves this by using a continuation
bit in every byte. Each byte in a VarInt, except the last byte, has
the most significant bit (msb) set, indicating that there are more
bytes to come. The last byte has the msb set to zero. The remaining
seven bits of each byte carry the number in 7-bit groups, least
significant group first. To decode, the 7-bit quantities (after the
msb has been removed) are reversed and concatenated to produce one
single binary representation of the number. For example, the bytes
'96 01' yield the groups 0010110 and 0000001; reversed and
concatenated they give 0000001 0010110, which is 150.
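The scheme above can be sketched in a few lines of Python. This is
an illustrative sketch only, not a normative implementation; the
function names encode_varint and decode_varint are chosen for this
example and do not come from any protobuf library.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 VarInt."""
    if n < 0:
        raise ValueError("VarInt input must be non-negative")
    out = bytearray()
    while True:
        group = n & 0x7F                # lowest 7 bits go out first
        n >>= 7
        if n:
            out.append(group | 0x80)    # more bytes follow: set the msb
        else:
            out.append(group)           # last byte: msb is zero
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> tuple:
    """Decode one VarInt starting at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated VarInt")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift  # later groups are more significant
        shift += 7
        if not byte & 0x80:              # msb clear: this was the last byte
            return value, pos


print(encode_varint(150).hex())                # 9601
print(decode_varint(bytes.fromhex("9601")))    # (150, 2)
```

Returning the next position from the decoder makes it easy to read
several VarInts back to back, which is how keys and values appear in
a serialized message.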
3.2 Encoding and Interpretation of Protobuf Messages
Protobuf messages are not self-describing. In other words, the entity
decoding the binary representation of the message needs to refer to
the equivalent text definition of the message to interpret the
fields. The "tag" associated with a field (the number following the
"=" sign in the text definition) indicates to the decoder which field
it is currently looking at.
To achieve backward compatibility, a wire type is also included for
every field. Using the wire type, the decoder can determine the
length of a field's value and skip the field without interpreting it.
This is useful when the decoder is not aware of a particular field's
tag value.
Every field is encoded as a (key, value) pair. The key is a VarInt
with the value ((field-tag << 3) | wire-type). In other words, the
three least significant bits of the key are the wire type and the
remaining bits are the field tag.
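As a minimal sketch of the key construction, the following Python
fragment builds and splits keys. The helper names make_key and
split_key are hypothetical and used only for illustration;
encode_varint is repeated here so that the fragment is
self-contained.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 VarInt."""
    out = bytearray()
    while True:
        group = n & 0x7F
        n >>= 7
        if n:
            out.append(group | 0x80)
        else:
            out.append(group)
            return bytes(out)


def make_key(field_tag: int, wire_type: int) -> bytes:
    """Return the VarInt-encoded key ((field_tag << 3) | wire_type)."""
    if not 0 <= wire_type <= 7:
        raise ValueError("wire type must fit in three bits")
    return encode_varint((field_tag << 3) | wire_type)


def split_key(key: int) -> tuple:
    """Split a decoded key into (field_tag, wire_type)."""
    return key >> 3, key & 0x7


print(make_key(1, 0).hex())     # 08
print(make_key(2, 2).hex())     # 12
print(split_key(0x1A))          # (3, 2)
print(len(make_key(15, 0)), len(make_key(16, 0)))   # 1 2
```

The last line illustrates why low field tags are cheaper: a tag of
15 still fits in a one-byte key, while a tag of 16 needs two bytes.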
3.3 Wire Types
This document defines the following wire types, their interpretation
and the data types that they are used for.
3.3.1 Wire Type 0
If the wire type is 0, the value field is simply a VarInt. This
encoding is used to represent int32, int64, uint32, uint64, sint32,
sint64, bool and enum. For positive integers the interpretation of
the VarInt is straightforward, as explained in section 3.1.
For example, consider the following message:
   message Test1 {
     required int32 a = 1;
   }
If the field 'a' is set to 150, the message is serialized as
'08 96 01'. The key 08 is ((1 << 3) | 0), and '96 01' is the VarInt
encoding of 150.
If int32 or int64 is used to encode a negative integer, the
resulting VarInt is always a ten byte quantity, because the value is
sign-extended to 64 bits and treated as a large unsigned integer. If
a signed type (sint32 or sint64) is used, a zigzag encoding scheme is
applied first, which assigns small VarInt values to small negative
numbers. In this scheme, the numbers -2, -1, 0, 1, 2 are represented
as the VarInts 3, 1, 0, 2, 4 and so on. Mathematically, each value
'n' is encoded using (n << 1) ^ (n >> 31) for sint32 or (n << 1) ^
(n >> 63) for sint64, where '>>' is an arithmetic (sign-extending)
shift.
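The zigzag mapping and the ten byte encoding of negative int32
values can be checked with the following Python sketch. The function
names are hypothetical. Python integers have arbitrary precision and
its '>>' operator is an arithmetic shift, so the results are masked
to the declared width explicitly.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 VarInt."""
    out = bytearray()
    while True:
        group = n & 0x7F
        n >>= 7
        if n:
            out.append(group | 0x80)
        else:
            out.append(group)
            return bytes(out)


def zigzag32(n: int) -> int:
    """Map a signed 32-bit integer to an unsigned one (sint32)."""
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def zigzag64(n: int) -> int:
    """Map a signed 64-bit integer to an unsigned one (sint64)."""
    return ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF


def unzigzag(z: int) -> int:
    """Invert the zigzag mapping."""
    return (z >> 1) ^ -(z & 1)


def negative_as_int64_varint(n: int) -> bytes:
    """Encode n the way int32/int64 do: as a 64-bit two's complement value."""
    return encode_varint(n & 0xFFFFFFFFFFFFFFFF)


print([zigzag32(n) for n in (-2, -1, 0, 1, 2)])   # [3, 1, 0, 2, 4]
print(len(negative_as_int64_varint(-1)))          # 10
print(encode_varint(zigzag32(-1)).hex())          # 01
```

The last two lines show the trade-off: -1 costs ten bytes as an
int32 but a single byte as a sint32.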
3.3.2 Wire Type 1
This is a fixed length 64-bit quantity. This wire type is used to
represent fixed64, sfixed64 and double data types. The value is
stored in little-endian format.
3.3.3 Wire Type 2
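This is a length-delimited stream of bytes. The value field is a
VarInt-encoded length followed by the specified number of bytes of
data. This wire type is used to represent string, bytes, embedded
messages (section 4) and packed repeated fields (section 5).
As an example, consider the following message:
   message Test2 {
     required string b = 2;
   }
If the string 'b' is set to "hello world", the message is serialized
as '12 0b 68 65 6c 6c 6f 20 77 6f 72 6c 64'. The key 12 is
((2 << 3) | 2), and 0b is the length of the string (11 bytes).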
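A length-delimited field can be produced as in the following Python
sketch. The helper names are hypothetical, and the sketch assumes
that string values are encoded as UTF-8 before the length is taken.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 VarInt."""
    out = bytearray()
    while True:
        group = n & 0x7F
        n >>= 7
        if n:
            out.append(group | 0x80)
        else:
            out.append(group)
            return bytes(out)


WIRE_TYPE_LENGTH_DELIMITED = 2


def encode_bytes_field(field_tag: int, payload: bytes) -> bytes:
    """Encode key, VarInt length, then the raw payload bytes."""
    key = encode_varint((field_tag << 3) | WIRE_TYPE_LENGTH_DELIMITED)
    return key + encode_varint(len(payload)) + payload


def encode_string_field(field_tag: int, text: str) -> bytes:
    """Encode a string field; the length counts bytes, not characters."""
    return encode_bytes_field(field_tag, text.encode("utf-8"))


print(encode_string_field(2, "hello world").hex(" "))
# 12 0b 68 65 6c 6c 6f 20 77 6f 72 6c 64
```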
3.3.4 Wire Type 5
This is a fixed length 32-bit quantity. This wire type is used to
represent fixed32, sfixed32 and float data types. The value is stored
in little-endian format.
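The two fixed-length wire types (1 and 5) can be illustrated with
Python's struct module, where the "<" prefix selects little-endian
byte order. The helper names and the choice of field tag 1 are
assumptions made for this example only.

```python
import struct


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 VarInt."""
    out = bytearray()
    while True:
        group = n & 0x7F
        n >>= 7
        if n:
            out.append(group | 0x80)
        else:
            out.append(group)
            return bytes(out)


def encode_fixed64(field_tag: int, value: int) -> bytes:
    """Wire type 1: key followed by 8 little-endian bytes."""
    return encode_varint((field_tag << 3) | 1) + struct.pack("<Q", value)


def encode_double(field_tag: int, value: float) -> bytes:
    """Wire type 1: key followed by an IEEE 754 double, little-endian."""
    return encode_varint((field_tag << 3) | 1) + struct.pack("<d", value)


def encode_fixed32(field_tag: int, value: int) -> bytes:
    """Wire type 5: key followed by 4 little-endian bytes."""
    return encode_varint((field_tag << 3) | 5) + struct.pack("<I", value)


def encode_float(field_tag: int, value: float) -> bytes:
    """Wire type 5: key followed by an IEEE 754 float, little-endian."""
    return encode_varint((field_tag << 3) | 5) + struct.pack("<f", value)


print(encode_fixed32(1, 1).hex(" "))    # 0d 01 00 00 00
print(encode_double(1, 1.0).hex(" "))   # 09 00 00 00 00 00 00 f0 3f
```

Unlike VarInts, these encodings always occupy the same number of
bytes regardless of the value, which lets a decoder skip them
without inspecting their content.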
4. Embedded Messages
Embedded messages are encoded as follows. The inner (or embedded)
message is serialized first using the rules described above. The
resultant byte stream is then treated as a Wire Type 2 field in the
outer message and added to its encoding.
Consider the example:
   message Test1 {
     required int32 foo = 1;
   }
   message Test2 {
     required Test1 c = 3;
   }
If the field 'foo' takes the value 150, the resultant encoded byte
stream for the inner message is '08 96 01', and the encoding of Test2
is '1a 03 08 96 01'.
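The two-step procedure for the Test1 and Test2 example can be
sketched in Python as follows. The functions encode_test1 and
encode_test2 are hand-written for this illustration; they are not
generated code and not part of any protobuf library.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 VarInt."""
    out = bytearray()
    while True:
        group = n & 0x7F
        n >>= 7
        if n:
            out.append(group | 0x80)
        else:
            out.append(group)
            return bytes(out)


def encode_test1(foo: int) -> bytes:
    """Test1 { required int32 foo = 1; } for non-negative foo."""
    return encode_varint((1 << 3) | 0) + encode_varint(foo)


def encode_test2(foo: int) -> bytes:
    """Test2 { required Test1 c = 3; } with the inner message embedded."""
    inner = encode_test1(foo)                 # step 1: serialize the inner message
    key = encode_varint((3 << 3) | 2)         # step 2: wrap it as a wire type 2 field
    return key + encode_varint(len(inner)) + inner


print(encode_test1(150).hex(" "))   # 08 96 01
print(encode_test2(150).hex(" "))   # 1a 03 08 96 01
```

Because the length prefix depends on the size of the inner
encoding, the inner message has to be serialized (or at least
sized) before the outer field can be written.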
5. Optional and Repeated Elements
If the message definition has 'repeated' elements, then the encoded
message has zero or more key-value pairs with the same field number.
These repeated values do not have to appear consecutively; they may
be interleaved with other fields.
If the message definition has 'optional' elements, then the encoded
message may or may not have a key-value pair with that field number.
A repeated field of a primitive numeric type may be declared as a
'packed repeated field', in which case the encoding for the field is
slightly different. A packed repeated field containing zero elements
does not appear in the encoded message. Otherwise, all of the
elements of the field are packed into a single key-value pair with
the wire type 2 (length delimited). Each element is encoded the same
way it would be normally, except without a key preceding it.
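The difference between the two encodings of a repeated field can be
seen in the following Python sketch for VarInt-encoded elements. The
field tag 4 and the values 3, 270 and 86942 are example inputs
chosen for illustration, and the helper names are hypothetical.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 VarInt."""
    out = bytearray()
    while True:
        group = n & 0x7F
        n >>= 7
        if n:
            out.append(group | 0x80)
        else:
            out.append(group)
            return bytes(out)


def encode_repeated(field_tag: int, values: list) -> bytes:
    """Non-packed: one (key, value) pair per element, wire type 0."""
    key = encode_varint((field_tag << 3) | 0)
    return b"".join(key + encode_varint(v) for v in values)


def encode_packed(field_tag: int, values: list) -> bytes:
    """Packed: a single wire type 2 pair holding all elements."""
    if not values:
        return b""                       # empty packed fields are omitted
    payload = b"".join(encode_varint(v) for v in values)
    key = encode_varint((field_tag << 3) | 2)
    return key + encode_varint(len(payload)) + payload


values = [3, 270, 86942]
print(encode_repeated(4, values).hex(" "))  # 20 03 20 8e 02 20 9e a7 05
print(encode_packed(4, values).hex(" "))    # 22 06 03 8e 02 9e a7 05
```

In this example the packed form saves one byte; the saving grows
with the number of elements because the key is written only once.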
6. Field Order
When a message is serialized its known fields should be written
sequentially by field number. This allows parsing code to use
optimizations that rely on field numbers being in sequence. However,
protocol buffer parsers must be able to parse fields in any order, as
not all messages are created by simply serializing an object - for
instance, it's sometimes useful to merge two messages by simply
concatenating them.
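A decoder that accepts fields in any order can be sketched in Python
as a loop over (key, value) pairs that uses the wire type to find
the end of each value. The generator iter_fields is a hypothetical
helper written for this example. The merge at the end assumes that,
for a non-repeated scalar field that occurs more than once, the last
occurrence wins; that behavior follows the informal description in
[GPBENC] and is not specified in this document.

```python
def decode_varint(buf: bytes, pos: int) -> tuple:
    """Decode one VarInt starting at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated VarInt")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos


def iter_fields(buf: bytes):
    """Yield (field_tag, wire_type, value) for each pair, in stream order."""
    pos = 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        field_tag, wire_type = key >> 3, key & 0x7
        if wire_type == 0:                       # VarInt
            value, pos = decode_varint(buf, pos)
        elif wire_type == 1:                     # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:                     # length delimited
            length, pos = decode_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:                     # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError("unsupported wire type %d" % wire_type)
        if pos > len(buf):
            raise ValueError("truncated field")
        yield field_tag, wire_type, value


# Fields written out of order (field 2 before field 1) still parse.
out_of_order = bytes.fromhex("12026869") + bytes.fromhex("089601")
print(list(iter_fields(out_of_order)))   # [(2, 2, b'hi'), (1, 0, 150)]

# Merging by concatenation: later scalar values replace earlier ones.
merged = bytes.fromhex("0801") + bytes.fromhex("0802")
print({tag: value for tag, _, value in iter_fields(merged)})   # {1: 2}
```

A decoder built this way can also ignore a field whose tag it does
not recognize, since the wire type alone tells it how many bytes to
skip.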
7. IANA Considerations
The MIME media type for protobuf messages is application/protobuf.
Type name: application
Subtype name: protobuf
Required parameters: n/a
Optional parameters: n/a
Encoding considerations: 8 bit binary, UTF-8
Security considerations:
Generally there are security issues with serialization formats
if code is transmitted and executed on the decoder end. Since
protobuf binary encoding does not carry code, we consider the
encoding scheme itself to not introduce any security risks.
8. Security Considerations
See section 7.
9. Acknowledgements
We thank the engineers at Google for giving us the protocol buffers
serialization format. All the concepts described in this document
come from web pages [GPBENC, GPBOVW] defining protocol buffer
mechanisms. This document is merely an attempt to standardize those
mechanisms in the IETF and to assign a MIME type to protobuf-encoded
messages.
10. References
10.1 Informative References
[GPBENC] Google Protocol Buffer Encoding,
https://developers.google.com/protocol-buffers/docs/encoding
[GPBOVW] Google Protocol Buffer Overview,
https://developers.google.com/protocol-buffers/docs/overview
Authors' Addresses
Stephen Stuart
Google
1600 Amphitheatre Parkway
Mountain View, CA 94043
USA
EMail: sstuart@google.com
Rex Fernando
Cisco Systems
170 W. Tasman Dr.
San Jose, CA 95134
EMail: rex@cisco.com