Feature Request - Consider the data


Once you create data in production, you enter the world of permanence: you must strive to keep that data alive and healthy for as long as it is needed. This is challenging because our understanding of problems is always evolving, which means our data models are always changing, and sometimes a data model change requires changing the data itself. The questions below are tailored to databases, but they apply to any kind of persisted object, such as images.

How much data is involved?

Data model changes are easy to handle when they are low-impact locking operations, such as adding a nullable column, or when a table has very few rows. Adding a non-nullable column to a table with millions of active rows, however, is not easy.

If you’re making a data model change, check how much data exists in your production environment for the relevant tables. If it’s a large amount, you’ll likely need to split the deployment into stages to avoid downtime in the application.
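As a quick sanity check, you can count the rows in the affected tables before planning the migration. Here is a minimal sketch using Python's built-in sqlite3 module; the `users` table and its contents are hypothetical stand-ins, and on a real production database you'd usually prefer a cheap estimate (e.g. Postgres's `pg_class.reltuples`) over a full `COUNT(*)` scan of a huge table:

```python
import sqlite3

# In-memory stand-in for a production database; "users" is a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [(f"user{i}@example.com",) for i in range(1000)],
)

# An exact count; on a large production table prefer your database's
# cheap estimate (e.g. Postgres's pg_class.reltuples) over COUNT(*).
(row_count,) = conn.execute("SELECT COUNT(*) FROM users").fetchone()
print(row_count)  # 1000
```

The exact threshold for "a large amount" depends on your database and hardware, but knowing the number up front is what lets you decide whether a single-step migration is safe.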

For example, the typical way to add a non-null column is as follows:

  1. Add a nullable column to the table
  2. Change the application to create new rows with a value
  3. Update the existing null values to an appropriate value in batches
  4. Add the not-null constraint

While these steps are easy to understand, the work is tedious and stressful. Any time you update data across an entire table, you need to make sure you’re not holding locks for very long. Attempting to update all the rows in a table at once defeats the purpose of adding a nullable column first.
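The four steps above can be sketched end to end. This uses Python's sqlite3 with a hypothetical `orders` table and `status` column; step 4's syntax is database-specific (e.g. Postgres's `ALTER TABLE ... SET NOT NULL`), so it appears as a comment rather than executable code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO orders (id) VALUES (?)",
                 [(i,) for i in range(500)])

# Step 1: add the column as nullable, a cheap metadata-only change.
conn.execute("ALTER TABLE orders ADD COLUMN status TEXT")

# Step 2: the application now writes a value for every new row.
conn.execute("INSERT INTO orders (id, status) VALUES (?, ?)", (9999, "new"))

# Step 3: backfill existing NULLs in small batches so no single
# statement holds locks on the whole table for very long.
BATCH = 100
while True:
    cur = conn.execute(
        """UPDATE orders SET status = 'legacy'
           WHERE id IN (SELECT id FROM orders WHERE status IS NULL LIMIT ?)""",
        (BATCH,),
    )
    conn.commit()  # release locks between batches
    if cur.rowcount == 0:
        break

# Step 4: add the constraint; syntax varies by database. In Postgres:
#   ALTER TABLE orders ALTER COLUMN status SET NOT NULL;

(nulls,) = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE status IS NULL").fetchone()
print(nulls)  # 0
```

The batch size here is arbitrary; in practice you'd tune it so each batch commits quickly relative to your application's query latency.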

Since this work has nothing to do with the feature actually being implemented, the cost should be identified early on to understand the full impact of the change.

How will you maintain backwards compatibility for database updates?

Consider the scenario of needing to drop a non-nullable column. This is challenging because you need to make the column nullable first, so that the application can stop relying on its existence for reads and writes. You’ll need to time these database changes to line up with changes to your running application, especially if you’re running a web application on a fleet of servers that cycles out old servers only after new ones are cycled in. In that setup, your database always has to be compatible with both the new and the old version of the web app.
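To make that rolling-deploy constraint concrete, here is a minimal sketch with a hypothetical `users` table and `legacy_code` column, again using Python's sqlite3. Once the column is nullable, inserts from the old app version (which still writes the column) and the new version (which omits it) both succeed against the same schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The column has already been made nullable in an earlier migration step.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, legacy_code TEXT)")

def old_app_insert(conn, user_id):
    # Old servers still write the column they were built against.
    conn.execute("INSERT INTO users (id, legacy_code) VALUES (?, ?)",
                 (user_id, "ABC"))

def new_app_insert(conn, user_id):
    # New servers no longer know the column exists.
    conn.execute("INSERT INTO users (id) VALUES (?)", (user_id,))

# During the rolling deploy, both versions hit the same database.
old_app_insert(conn, 1)
new_app_insert(conn, 2)
(total,) = conn.execute("SELECT COUNT(*) FROM users").fetchone()
print(total)  # 2
# Only after every old server is gone is it safe to drop the column.
```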

A more challenging scenario is converting a many-to-one relationship into a many-to-many relationship. This requires the data to be transformed over some period of time, during which your application must keep both sets of data up to date while still returning correct responses. Once the transformation is complete, you can switch to updating only the new data and remove the code handling the old data. If that sounds complex, that’s because it is. Your goal here is to identify whether the requested change requires such a task, and then to communicate the effort involved.
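A sketch of the dual-write phase, with hypothetical `books` and `book_authors` tables in Python's sqlite3: while the transition is in flight, every write updates both the old `author_id` column and the new join table, so reads can come from either shape until the backfill finishes and the old column is dropped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER);  -- old shape
    CREATE TABLE book_authors (book_id INTEGER, author_id INTEGER);  -- new shape
""")

def set_author(conn, book_id, author_id):
    # Dual write: keep the old column and the new join table in sync
    # until the transformation is complete and the old code is removed.
    conn.execute("INSERT OR REPLACE INTO books (id, author_id) VALUES (?, ?)",
                 (book_id, author_id))
    conn.execute("DELETE FROM book_authors WHERE book_id = ?", (book_id,))
    conn.execute("INSERT INTO book_authors (book_id, author_id) VALUES (?, ?)",
                 (book_id, author_id))

set_author(conn, 1, 42)
old = conn.execute("SELECT author_id FROM books WHERE id = 1").fetchone()[0]
new = conn.execute(
    "SELECT author_id FROM book_authors WHERE book_id = 1").fetchone()[0]
print(old, new)  # 42 42
```

The delete-then-insert in `set_author` is a simplification; a real migration would also need a one-time backfill of the join table and handling for books that gain multiple authors during the transition.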

Here’s a helpful, if perhaps obvious, hint: the more steps in the transition and the larger the dataset, the more effort the work will take.