---
title: Schema Evolution
weight: 3
---

This doc walks through a case study in writing and then updating a schema, from the perspective of a hypothetical *Person* service that stores data about people. The example aims to capture both the technical and the social aspects of schema evolution as services change.

## Aside: Thinking In Messages

Note that the way you think about schema for an API may differ from how you think about it for the Event Bus. In an API you can often make not-fully-compatible changes by, e.g., adding new API actions or option flags, or by "versioning up" the API. In a message stream this is less practical: even if subscribers update their code for an incompatible change, they still have to roll out a deploy, and not every instance can switch to the new code at the same instant. This leaves publishers with awkward workarounds, such as dual-publishing multiple event streams, that we mostly want to avoid.

For that reason, we emphasize careful consideration of potential future needs when designing event schema, so that breakage can be avoided as much as possible. This is also why Protobuf3 was chosen as the Event Bus team's schema definition language: it has many features that make schema evolution possible in a type-safe and subscriber-safe way (such as allowing receivers to continue operating with a much older schema with no risk).

## Rev. 1: Our First Schema {#r1}

The Person service wants to publish an event carrying some basic info whenever a person is created. So we [follow the NounAction formula]({{< ref "./style.md#naming" >}}) and come up with `PersonCreate` as our message type. We will start with a small set of fields that our subscribers care about:

```protobuf
message PersonCreate {
  // The id in the Person service database
  string id = 1;
  // The person's full name.
  string full_name = 2;
  // Person's home time zone as an IANA tz database name, e.g. "America/New_York"
  // https://en.wikipedia.org/wiki/Tz_database
  string timezone = 3;
}
```

This is a relatively simple message definition but it contains a good number of best practices, such as [commented fields]({{< ref "./style.md#comments" >}}) and [referencing code tables]({{< ref "./style.md#unnecessary-enums" >}}).

## Rev. 2: Adding address information {#r2}

(N.B. Protobuf has no actual version numbers for schemas; the revision numbers here exist only for the sake of this guide.)

The Person service wants to add the person's mailing address to this schema, so we go with something like this (comments on fields 1-3 omitted to make reading a bit easier):

```protobuf
message PersonCreate {
  string id = 1;
  string full_name = 2;
  string timezone = 3;
  // The person's home address. Omitted if the user elected not to provide it.
  Address address = 4;
}

message Address {
  // Street address may be 2 lines in some countries.
  repeated string street_lines = 1;
  // City/Town
  string city = 2;
  // The state/subregion of the country as defined in ISO 3166-2, e.g. "US-TX"
  string state = 3;

  // etc
}
```

So why did we use a [sub-message]({{< ref "./style.md#sub-messages" >}}) for the address info? The primary reason is clarity for the subscriber: when a `PersonCreate` arrives with no address, the subscriber can check whether the `address` field is nil/omitted instead of checking a number of address fields for blankness. As a secondary benefit, the sub-message also provides clean logical grouping.
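To make the presence check concrete, here is a minimal subscriber-side sketch in Python. It assumes hand-written dataclasses as stand-ins for the generated `PersonCreate` and `Address` classes, and `mailing_label` is a hypothetical helper, not part of any real service. With real generated Python code the equivalent check would be `event.HasField("address")`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Address:
    # Hand-written stand-in for the generated Address message.
    street_lines: List[str] = field(default_factory=list)
    city: str = ""
    state: str = ""


@dataclass
class PersonCreate:
    # Hand-written stand-in for the generated PersonCreate message.
    id: str = ""
    full_name: str = ""
    timezone: str = ""
    # None models a sub-message that the publisher omitted.
    address: Optional[Address] = None


def mailing_label(event: PersonCreate) -> Optional[str]:
    """Return a one-line label, or None when no address was provided."""
    if event.address is None:
        # A single presence check replaces per-field blank checks.
        return None
    addr = event.address
    return ", ".join([*addr.street_lines, addr.city, addr.state])
```

Had the address fields been flattened into `PersonCreate`, the same helper would have to decide which combination of blank strings means "no address", and every subscriber might decide differently.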

## Rev. 3a: Clarifying name field {#r3a}

So our hypothetical Person service ran into an issue: subscribers were parsing the `full_name` field by, e.g., calling `string.Split` on spaces, and for East Asian names that put the family name first, or for people with compound names, it was sometimes unclear which part of the name was which. So we make our first "deprecating" change to our schema:

```protobuf
message PersonCreate {
  string id = 1;
  string timezone = 3;
  Address address = 4;
  // Person's given name (known as "first name" in the US/Canada)
  string given_name = 5;
  // Person's family name/surname ("last name" in US/Canada)
  string family_name = 6;

  // Full name is still included in PersonCreate; we strongly recommend using
  // given_name/family_name instead, as full_name will stop being sent after 10/2019
  string full_name = 2 [deprecated=true];
}
```

In this update, we added new fields to the schema and moved `full_name` to the very bottom so it's visually out of the way for people reading the .proto file. The `[deprecated=true]` annotation still allows subscribers to use the field, but in some languages, such as Java, it marks the generated code as deprecated, so that IDEs and compilers surface the deprecation status.

With this update, the Person service continues to send the `full_name` field so that subscribers relying on it keep working. This is necessary because subscribers cannot all start using the given/family name fields at the very instant `full_name` goes away. We therefore recommend that services keep sending deprecated fields for as long as possible: at least 2-3 months, where technically feasible.
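On the subscriber side, the deprecation window is the time to switch over to the new fields while still tolerating events that only carry `full_name`. The following is a minimal sketch in Python under stated assumptions: the `PersonCreate` dataclass is a hand-written stand-in for generated code, and `name_parts` is a hypothetical helper. Because proto3 string fields default to the empty string, an empty value is treated here as "not sent".

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PersonCreate:
    # Hand-written stand-in for the generated Rev. 3a message.
    # proto3 string fields default to "" when the publisher omits them.
    id: str = ""
    timezone: str = ""
    given_name: str = ""
    family_name: str = ""
    full_name: str = ""  # deprecated (field 2)


def name_parts(event: PersonCreate) -> Optional[Tuple[str, str]]:
    """Return (given_name, family_name), or None if only full_name was sent."""
    if event.given_name or event.family_name:
        # Prefer the structured fields whenever the publisher provides them.
        return event.given_name, event.family_name
    # Older event: signal the caller to use full_name as a single unsplit
    # string rather than guess which part is the family name.
    return None
```

Once every subscriber reads names this way, the publisher can stop sending `full_name` without anyone noticing.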

## Rev. 3b: Finishing the deprecation {#r3b}

So it's October now, and we want to get rid of the `full_name` field that we deprecated. It looks something like this (comments elided again for brevity):

```protobuf
message PersonCreate {
  reserved 2;

  string id = 1;
  string timezone = 3;
  Address address = 4;
  string given_name = 5;
  string family_name = 6;
}
```

The addition of `reserved 2` is simply a way of encoding that we should [not use that field ID ever again](https://developers.google.com/protocol-buffers/docs/proto3#reserved). This ensures that we never break subscribers on older schemas by reusing the ID for a different field.
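To see why reuse is dangerous, recall that a protobuf message on the wire carries field numbers, not field names. The sketch below is a hand-rolled, string-fields-only subset of the wire format in plain Python, not a real protobuf library. The publisher's new field 7 and its reuse of field 2 for a nickname are hypothetical. It decodes two payloads the way a subscriber still built against Rev. 3a would.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string_field(number: int, value: str) -> bytes:
    """Encode one string field: tag, length, then UTF-8 bytes."""
    data = value.encode("utf-8")
    tag = (number << 3) | 2  # wire type 2 = length-delimited
    return encode_varint(tag) + encode_varint(len(data)) + data


def decode_varint(buf: bytes, pos: int):
    """Decode a varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


# String fields known to a subscriber built against Rev. 3a.
# (The Address sub-message, field 4, is left out of this sketch.)
REV_3A_FIELDS = {1: "id", 2: "full_name", 3: "timezone",
                 5: "given_name", 6: "family_name"}


def decode_as_rev_3a(buf: bytes) -> dict:
    """Decode known string fields by number; skip unknown numbers."""
    out = {}
    pos = 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 0x7
        if wire_type != 2:
            raise ValueError("sketch only handles length-delimited fields")
        length, pos = decode_varint(buf, pos)
        value = buf[pos:pos + length]
        pos += length
        name = REV_3A_FIELDS.get(number)
        if name is not None:
            out[name] = value.decode("utf-8")
    return out


# A brand-new field number is skipped by the old subscriber.
new_field = encode_string_field(1, "p-1") + encode_string_field(7, "hypothetical")
# Field 2 reused for a nickname is silently read as full_name.
reused = encode_string_field(1, "p-1") + encode_string_field(2, "Bobby")

print(decode_as_rev_3a(new_field))  # {'id': 'p-1'}
print(decode_as_rev_3a(reused))     # {'id': 'p-1', 'full_name': 'Bobby'}
```

The old subscriber has no way to tell that the second payload's field 2 is no longer a full name, which is exactly the mistake `reserved 2` makes the protobuf compiler reject.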

Now we can stop sending `full_name` following our deprecation plan.
