Schema change management for platform teams
What is Platform Engineering
A few months ago, an article on The New Stack made a lot of waves in our industry, claiming that “DevOps Is Dead” and that it will be superseded by Platform Engineering. I had the pleasure of discussing this topic with Armon Dadgar, who serves on Ariga’s advisory board (but more famously is the co-founder and CTO of HashiCorp).
The DevOps movement challenged the software industry with a bold vision, Armon recalled: “If only we could meld the functions of Dev and Ops into the same human, a super-engineer that writes backend applications for breakfast, configures build systems and CI pipelines for lunch, and then puts out a database production fire for dinner, we would finally get total ownership and accountability.”
However, Armon pointed out, in practice, things were more complex: “Outside of a few select companies that could hire and retain them, these people didn’t actually exist.” This realization has given rise to Platform Engineering. “We need developers to be able to self-serve and be responsible for their applications, but the complexity of modern cloud-native architectures is too big. To be efficient, organizations need to abstract these things away.”
We are seeing platform teams being created everywhere as a response to this complexity. These teams are chartered to maintain an Internal Developer Platform, which acts as a flexible abstraction over the complexity of managing applications in cloud-native architectures.
Schema change management for Platform Teams
A good platform team treats its Developer Platform as a product by constantly looking for measurable ways to improve the efficiency of the engineering organization it serves.
This is often done by one of the following:
- Making slow things fast – by solving technical bottlenecks or automating manual work.
- Making hard things easy – by providing simpler ways of retrieving information or providing declarative workflows.
- Making risky things safer – by catching human error during CI.
One overlooked area in which platform teams can create serious technical leverage is providing a strong schema change management story. Schema change management refers to the collection of tools and processes that support the evolution of an application’s data model and the ways in which it is stored in databases.
In the past, most applications were built around a single enterprise database, often supported by an enterprise vendor, serving a monolithic backend. These were developed using a waterfall methodology and managed by a professional, well-trained DBA. Today, applications are characterized by an explosion in microservices. Each microservice is backed by its own database (and sometimes multiple databases), and these are developed and maintained by multiple autonomous teams with varying (and sometimes very little) operational knowledge when it comes to managing their databases.
In other words, despite being a critical component of any architecture, the operational aspects of the backing databases are an afterthought in many cases. Organizations can easily spend hundreds of thousands of dollars a year to make sure devs have access to observability data, but when it comes to managing schema changes, developers are somehow expected to know all the intricacies of the database their team happens to work with.
What is the impact of not supporting schema change management?
Having interviewed engineers from dozens of companies, we’ve seen some serious issues repeat themselves in organizations that do not have a well-thought-out strategy when it comes to schema change management:
- Backward-incompatible changes to the database schema break the contract between the database and the application backend, causing downtime.
- Downstream consumers of the database schema (such as data engineering teams consuming CDC logs) are constantly surprised.
- Tables are accidentally locked for writes, causing application downtime – sometimes for hours or days.
- Developers connect to the database with root credentials to apply changes or troubleshoot them.
- Deployments fail halfway because of constraint violations that are discovered only on production data.
- Incidents and outages occur because of database behaviors unknown to many engineers.
- Simple refactorings become complex projects that require senior engineering leadership to plan and carefully execute, making them less frequent, and leading to increased technical debt.
- Frustration from (and fear of) database schema changes promotes anti-patterns such as pushing schema management to the application layer, effectively using SQL databases as NoSQL storage.
- And many more.
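To make one of the failure modes above concrete (a constraint violation that is discovered only on real data), here is a minimal sketch using Python's built-in `sqlite3` module. The `users` table, the sample rows, and the `duplicate_values` helper are hypothetical and exist only for illustration. The idea is that a pre-flight query can surface the offending rows before the migration is attempted.

```python
import sqlite3

# Hypothetical table and data, standing in for a production database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("a@example.com",), ("b@example.com",), ("a@example.com",)],
)
conn.commit()


def duplicate_values(conn, table, column):
    """Return the values that would violate a UNIQUE constraint on table.column."""
    # Identifiers are interpolated directly, which is only acceptable here
    # because they come from trusted migration code, never from user input.
    query = (
        f"SELECT {column} FROM {table} "
        f"WHERE {column} IS NOT NULL "
        f"GROUP BY {column} HAVING COUNT(*) > 1 ORDER BY {column}"
    )
    return [row[0] for row in conn.execute(query)]


# Pre-flight check: which rows would block the planned unique index?
blockers = duplicate_values(conn, "users", "email")

# Without the check, the migration simply fails when it meets the data.
try:
    conn.execute("CREATE UNIQUE INDEX users_email_uniq ON users (email)")
    migration_failed = False
except sqlite3.DatabaseError:
    migration_failed = True
```

Running a check like this in CI against a copy of production-like data is one way to move the failure from deployment time to review time.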
Evolving beyond schema migration tools
Most existing schema management tools (often called Schema Migration Tools) were created in an era with very different problems. At the dawn of the DevOps movement, the idea of describing all schema changes in files that are committed to source control and automatically applied by a tool that knows which of them have already been applied was revolutionary.
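The core mechanism of a classic migration tool can be sketched in a few lines. The example below is a simplified illustration, not the implementation of any particular tool: the `MIGRATIONS` list stands in for files committed to source control, and the `schema_revisions` table name is an assumption chosen for this sketch.

```python
import sqlite3

# Hypothetical migrations; in a real project each entry would be a file
# committed to source control.
MIGRATIONS = [
    ("001_create_users",
     "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"),
    ("002_add_email",
     "ALTER TABLE users ADD COLUMN email TEXT"),
]


def apply_pending(conn, migrations):
    """Apply, in order, every migration not yet recorded as applied."""
    # The revision table is how the tool "knows" what has already run.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_revisions (version TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_revisions")}
    newly_applied = []
    for version, statement in migrations:
        if version in applied:
            continue
        conn.execute(statement)
        conn.execute("INSERT INTO schema_revisions (version) VALUES (?)", (version,))
        conn.commit()
        newly_applied.append(version)
    return newly_applied


conn = sqlite3.connect(":memory:")
first_run = apply_pending(conn, MIGRATIONS)   # applies both migrations
second_run = apply_pending(conn, MIGRATIONS)  # nothing left to apply
```

Note what this sketch does not do: it does not plan the statements, check them for safety, or help when a statement fails halfway.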
However, as mentioned above, a few things have changed drastically in the way software is built today, compared to when these tools were conceived:
- The way we develop – Microservices brought an explosion in the number of databases and the diversity of storage technology organizations use, making it impractical for many organizations to have a professional DBA author or even review database schema changes. Teams are expected to be autonomous and self-sufficient so they can continuously make progress.
- The way we operate – In the past, it was acceptable to bring an application down for maintenance – your bank’s DBA had a few good hours between the last employee leaving the office and the first one arriving the following day, during which they could shut the system down, upgrade it, and spin it back up. Managing an always-on system that serves traffic 24/7 is a different ordeal.
- Who operates – In most cases, teams operate their own database, resulting in a situation where the person on-call for the system often has very basic knowledge when it comes to operational aspects of the database.
What can platform teams do to increase developer efficiency with schema change management?
Given these shifts, a modern solution for schema change management should address the following problems:
- Planning changes – Today’s tools expect developers of all technical backgrounds and levels of expertise to be able to plan correct, safe, and efficient changes to the database. Given the vast range of technology developers must deal with, this may not always be possible. Therefore, platforms can provide developers with an automated, declarative workflow for planning changes (a “terraform plan for databases”). Ideally, this workflow should support any ORM or framework developers use to build applications.
- Verifying safety – Once a change leaves a developer’s workstation and is submitted as a pull request, it becomes the team’s responsibility to review and approve the correctness and safety of the change. Existing tools offer no support in this area, leaving it completely to manual review. Platforms can provide teams with automated verification of changes (“CI for schema changes”), to detect risky changes before they reach production.
- Deploying changes – Existing tools are mostly centered on the machinery for describing and applying changes to the target database. This is a great start, but deployment machinery is seldom used in isolation anymore. Platforms need to figure out how to integrate these tools into their continuous delivery pipelines in a native way. In addition, delivery pipelines are responsible for verifying that the target environment is safe to deploy to before rolling out changes (“CD for schema changes”). With microservices architectures, managing and coordinating the schema migrations of multiple microservices within a single deployment unit is also crucial to ensure safe rollout and recovery from failures.
- Troubleshooting – Unfortunately, schema changes don’t always succeed. Existing tools offer little to no support when it comes to helping developers out of the mud when things fail. This often requires engineers to connect to the database to diagnose what went wrong and then perform risky operations such as manually editing metadata tables. Platform teams should consider what they can do to support engineers when planned changes don’t succeed or cause an outage.
- Drift detection and schema monitoring – Once changes are successfully rolled out, it is valuable for teams to be able to detect drift between the expected state of the system and its actual state. Schema drift can happen because of technical issues or in cases where manual access to the database is permitted, and it can cause both operational and compliance issues. Platform teams should consider how they can provide their teams with the confidence that there is no schema drift in their application.
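Several of the items above (planning, safety verification, and drift detection) rest on the same primitive: computing the difference between two schema states. The sketch below illustrates that idea under heavy simplifying assumptions. Schemas are modeled as plain `{table: {column: type}}` dictionaries, and the `Change`, `plan`, and `has_drift` names are hypothetical. It ignores indexes, constraints, and column type changes, and it is not how Atlas or any other tool is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    statement: str
    destructive: bool  # True when the change can lose data


def plan(current, desired):
    """Compute the changes that move `current` toward `desired`.

    Simplification: only added or removed tables and columns are detected.
    """
    changes = []
    for table in sorted(desired.keys() - current.keys()):
        cols = ", ".join(f"{c} {t}" for c, t in desired[table].items())
        changes.append(Change(f"CREATE TABLE {table} ({cols})", False))
    for table in sorted(current.keys() - desired.keys()):
        changes.append(Change(f"DROP TABLE {table}", True))
    for table in sorted(current.keys() & desired.keys()):
        have, want = current[table], desired[table]
        for col in sorted(want.keys() - have.keys()):
            changes.append(
                Change(f"ALTER TABLE {table} ADD COLUMN {col} {want[col]}", False)
            )
        for col in sorted(have.keys() - want.keys()):
            changes.append(Change(f"ALTER TABLE {table} DROP COLUMN {col}", True))
    return changes


def has_drift(actual, expected):
    """Drift exists when a non-empty plan is needed to reach the expected state."""
    return bool(plan(actual, expected))


# Hypothetical live and desired states.
live = {"users": {"id": "INTEGER", "name": "TEXT", "legacy_flag": "INTEGER"}}
desired = {
    "users": {"id": "INTEGER", "name": "TEXT", "email": "TEXT"},
    "orders": {"id": "INTEGER", "user_id": "INTEGER"},
}

planned = plan(live, desired)                             # "plan" step
risky = [c.statement for c in planned if c.destructive]   # "CI" step
```

A CI job could fail the pull request whenever `risky` is non-empty, and a scheduled job could call `has_drift` with the inspected live schema and the schema recorded in source control.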
Surprisingly, database schema change management is an area in which little change or innovation has occurred since we embarked on the DevOps journey as an industry. At Ariga, we are building Atlas (open-source) and Atlas Cloud to provide solutions for many of the problems we described above.
If you’re a member of a Platform Engineering team and want to learn more about how we can help, please ping me (or my co-founder Ariel) on our Discord Server.