Recently I was part of a software project that was building a replacement for a legacy application. This legacy application contained a significant, but not outrageous, amount of data. As part of this replacement we had to migrate all of the existing data into our new data model. This task was thoughtlessly left until the end of the project, an iteration before the release. The reasoning for this decision was that the migration wasn't seen as a big overhead and that we were already testing our system with production-like quantities of data. That is, our generated data had around the same number of records that the legacy application had for certain entities. The following issues were observed once the data was completely migrated:

  1. Certain areas of the application performed terribly
  2. The legacy data did not contain various records that we expected to exist
  3. Our search re-indexing system performed significantly worse than it had with our generated data

Now, two of the above points required a bit of re-work, re-work that probably could have been identified and addressed much earlier in the project. The root cause of points one and three was that the complexity of our generated data did not match that of our production data. We didn't have the complex relationships that the "real" data had. Our generated data was not derived from production data; it was generated by hand and then duplicated until the record count matched what we had noted in production. I don't think there is anything wrong with that approach: it is certainly common practice and the simplest way to get a large data set to test against. What we didn't realize was the impact that data with more complex one-to-one or one-to-many relationships would have on certain features of our system.
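To make that difference concrete, here is a minimal sketch in Python. The `Customer` and `Order` entities and the Pareto-shaped order distribution are hypothetical stand-ins, not our actual schema; the point is only the contrast between duplicating a hand-written seed record and also sampling the relationship cardinality:

```python
import random
from dataclasses import dataclass, field

@dataclass
class Order:
    order_id: int
    amount: float

@dataclass
class Customer:
    customer_id: int
    name: str
    orders: list = field(default_factory=list)

def duplicate_seed(seed: Customer, count: int) -> list:
    """What we did: copy one hand-written record until the row count
    matches production. Every customer ends up with the same (tiny)
    number of orders, so the relationship structure stays trivial."""
    return [
        Customer(customer_id=i, name=seed.name, orders=list(seed.orders))
        for i in range(count)
    ]

def generate_with_cardinality(count: int) -> list:
    """What we should have done: also sample the one-to-many
    cardinality (orders per customer) from something resembling the
    skewed distribution observed in production."""
    customers = []
    for i in range(count):
        # Hypothetical skew: most customers have a few orders,
        # a long tail has hundreds.
        n_orders = min(int(random.paretovariate(1.2)), 500)
        orders = [Order(order_id=j, amount=random.uniform(5, 500))
                  for j in range(n_orders)]
        customers.append(Customer(customer_id=i, name=f"customer-{i}",
                                  orders=orders))
    return customers
```

Both approaches produce the same customer record count, but only the second produces the skewed one-to-many structure that, in our case, exposed the slow areas of the application and the re-indexing problem.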

The most obvious solution to this problem, which is considered an anti-pattern, is to start using production data as soon as possible. The reasons why this is an anti-pattern are:

  1. Privacy and confidentiality issues.
  2. Some production databases are just too big to replicate across multiple development environments.
  3. Production data often becomes stale, and people will typically want to "refresh" it on development environments. This renders the assumptions made about the old data invalid. New assumptions about how the system under test should perform with the new data are then made. This ambiguity in assumptions leads to incorrect decisions and over-optimization of perceived problem areas that disappear after each refresh.
  4. The need to use production data highlights a lack of understanding or lack of desire to understand the complex structure of the data itself.

Problems one and two can be addressed by using Production Like Data: production data that has been obfuscated to ensure privacy, and possibly reduced to a smaller but still meaningful subset (a sketch of the obfuscation step follows the list below). Production Like Data should always be used where production data is needed. However, it shouldn't be used everywhere:

  • Production Like Data should not be used for acceptance tests
  • Production Like Data should only be used for exploratory, performance or upgrade testing
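As promised above, here is a minimal sketch of the obfuscation step, assuming rows arrive as plain dictionaries and that the set of sensitive columns is known up front (both of those assumptions, and the column names, are hypothetical):

```python
import hashlib

# Hypothetical set of columns considered sensitive in the schema.
SENSITIVE_COLUMNS = {"name", "email", "phone", "address"}

def mask_value(column: str, value: str) -> str:
    """Replace a sensitive value with a stable pseudonym so that
    joins and equality checks across tables still line up."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:10]
    return f"{column}-{digest}"

def obfuscate_row(row: dict) -> dict:
    """Obfuscate one row: sensitive columns are pseudonymized;
    everything else (ids, dates, quantities, foreign keys) is kept
    so the shape and relationships of the data survive."""
    return {
        column: mask_value(column, value) if column in SENSITIVE_COLUMNS
        else value
        for column, value in row.items()
    }

# The customer id and any foreign keys survive; the identity does not.
row = {"customer_id": 42, "name": "Jane Citizen",
       "email": "jane@example.com"}
print(obfuscate_row(row))
```

Masking deterministically (hashing rather than randomizing) keeps the same real value mapped to the same pseudonym everywhere it appears, which preserves the joins and one-to-many relationships that make the data production like in the first place.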

Using Production Like Data in acceptance tests really highlights a lack of understanding of the data, which is what problems three and four above are really about. A priority of acceptance tests should be to exercise the functionality with as little setup and overhead as possible, to minimize the scope and complexity of what is being tested. Relying on Production Like Data violates this priority and will result in flaky, difficult-to-manage tests, especially as there should not be, and probably won't be, any control over the state or content of the Production Like Data.
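By way of contrast, an acceptance test should create exactly the records it asserts against, so the expectation is traceable to the setup. The following sketch uses hypothetical in-memory helpers in place of a real system under test:

```python
import unittest

# Hypothetical in-memory stand-ins for the system under test.
_ORDERS: dict = {}

def create_customer(name: str) -> str:
    _ORDERS.setdefault(name, [])
    return name

def place_order(customer: str, amount: float) -> None:
    _ORDERS[customer].append(amount)

def order_history_for(customer: str) -> list:
    return list(_ORDERS[customer])

class OrderHistoryAcceptanceTest(unittest.TestCase):
    def test_customer_sees_only_their_own_orders(self):
        # Arrange: build exactly the state this behaviour needs,
        # rather than depending on whatever happens to exist in a
        # Production Like Data snapshot.
        alice = create_customer("alice")
        bob = create_customer("bob")
        place_order(alice, 10.0)
        place_order(bob, 99.0)

        # Act
        history = order_history_for(alice)

        # Assert: the expectation is fully determined by the setup above.
        self.assertEqual(history, [10.0])

if __name__ == "__main__":
    unittest.main()
```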

For the other three forms of testing (exploratory, performance and upgrade), Production Like Data is typically seen as acceptable. The reasoning is that it is difficult to reach a sufficient level of confidence in the system without data that mimics what exists in the wild.

This article was first noted down on the 20th of July, 2012.