Avoiding unpleasant surprises with test data
The General Data Protection Regulation (GDPR) prescribes strict handling of personal data. Accordingly, it is prohibited to use data for purposes for which it was not originally collected. Violations can result in severe penalties. If the GDPR at times seemed like a dog that barks but doesn’t bite, the authorities have recently struck with great severity.
The airline British Airways is to pay more than 200 million euros, and the fine for the hotel chain Marriott amounts to 123 million euros. While the companies can still appeal, the message is clear: when the GDPR is enforced, it is enforced in earnest.
The use of personal data is a risky process
In addition to the legal requirements for data use, there are other reasons to use only synthetic data for testing. Many companies outsource application testing to specialized firms, some of them abroad. Both arrangements are problematic when personal data is involved. Contracts that supposedly ensure data protection are not enough on their own.
A data leak causes reputational damage for the contracting company, which under the GDPR must report the breach to the supervisory authority within 72 hours and inform affected customers when the risk to them is high. In addition, the transfer of customer data abroad is subject to strict legal rules, and these also apply to test data.
Another reason is that test environments may well be unstable, which increases the risk of a data leak. Development work is also often outsourced to external experts, so the data leaves the company along with the source code. Even if the data is safe within the company’s own IT infrastructure, an IT manager can no longer vouch for external environments.
Data is then copied and passed on, but the primary responsibility remains with the company that originally collected it. In other words, if the protection of personal data is violated, the company or management that originally assumed responsibility for the sensitive information is liable.
Data is essential
But developers need valid data for their tests. Creating a test database with type-compliant field content by hand is laborious and time-consuming. In addition, it often does not meet the changing requirements of the testers and is ultimately, like almost every manual process, more error-prone. A technical solution is available for this: tokenization. Unlike encryption, where the data is converted into a series of seemingly random characters, tokenization creates real-looking, but not real, values. For example, the e-mail address “firstname.lastname@example.org” becomes the non-real “email@example.com.” Thus, the data no longer has any meaning, but it has the correct format and can be tested in the application or database.
Done in a few steps
Creating a tokenized database is easily done in just a few steps using a test data generator. First, the source (e.g. a database) containing personal data must be determined. Then the desired format must be selected – i.e. whether it is a mail address, telephone number, or similar. After tokenization is complete, the source is available with dummy data.
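The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular test data generator works: the function names, the `FORMATS` table, and the sample row are all hypothetical, and the e-mail domain is replaced with the reserved `example.com` so that no generated address can reach a real mailbox.

```python
import random
import string


def _scramble(text: str, rng: random.Random) -> str:
    """Swap letters for random letters and digits for random digits; keep punctuation."""
    out = []
    for ch in text:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)
    return "".join(out)


def tokenize_email(value: str, rng: random.Random) -> str:
    # Keep the shape of the local part, drop the real domain entirely.
    local, _, _domain = value.partition("@")
    return f"{_scramble(local, rng)}@example.com"


def tokenize_phone(value: str, rng: random.Random) -> str:
    # Keep "+", spaces and length; replace every digit.
    return _scramble(value, rng)


# Step 2: the selectable formats (hypothetical names).
FORMATS = {"email": tokenize_email, "phone": tokenize_phone}


def tokenize_rows(rows, profiles, seed=None):
    """Step 3: return copies of the rows with the profiled columns replaced."""
    rng = random.Random(seed)
    result = []
    for row in rows:
        new_row = dict(row)
        for column, format_name in profiles.items():
            new_row[column] = FORMATS[format_name](row[column], rng)
        result.append(new_row)
    return result


# Step 1: the source containing personal data (here an in-memory stand-in for a table).
source = [{"id": 1, "email": "jane.doe@corp.test", "phone": "+49 170 1234567"}]
dummy = tokenize_rows(source, {"email": "email", "phone": "phone"}, seed=42)
print(dummy)
```

Columns without a profile, such as `id`, are copied unchanged, and the source rows themselves are not modified.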
Advantages of type-conformant test data generation
The advantages of test data generation are obvious. By eliminating the need for real data, all risks of violating privacy regulations are eliminated. This makes it easy for companies to meet regulatory compliance requirements because the test data does not allow for backward conversion to the original data.
In addition, the generator can be controlled via API, allowing it to be integrated with custom testing tools. It also provides consistent test data in all systems, such as databases, apps, files, and interfaces. This is indispensable for integration tests in particular, where the aim in complex system environments is to have consistent data in as many systems as possible in order to validate the correct interaction of the systems in the test.
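One common way to obtain the same substitute value in every system is to derive the token deterministically from the original value with a keyed hash. The sketch below assumes this approach; the article does not say how any specific generator achieves consistency, and `SECRET_KEY` and `consistent_token` are hypothetical names.

```python
import hashlib
import hmac

# Hypothetical key; in practice it would be kept secret and outside the test systems.
SECRET_KEY = b"demo-key-change-me"

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def consistent_token(value: str, length: int = 10) -> str:
    """Map a value to a lowercase token; the same input always yields the same token."""
    if not 1 <= length <= 32:
        raise ValueError("length must be between 1 and 32 (SHA-256 yields 32 bytes)")
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).digest()
    return "".join(LETTERS[b % 26] for b in digest[:length])


# The same customer name appears in two systems and receives the same token in both.
crm_token = consistent_token("Alice Smith")
billing_token = consistent_token("Alice Smith")
other_token = consistent_token("Bob Jones")
print(crm_token == billing_token)
```

Because the token depends only on the input and the key, a join between a tokenized database table and a tokenized export file still matches, which is what integration tests need.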
What is often forgotten: test data generation is not a one-time process but must adapt to the requirements of the application and the tests. Depending on the test scenario, different data is required, and for mass tests it must also be available in large volumes.
In addition, a good generator offers high performance, not least because test data generation can be parallelized. It tokenizes only sensitive data, which reduces administrative effort. It must also be fully configurable, including the preservation of proportions and logical relationships in the data.
Test data management requirements
To take advantage of these benefits, companies must make sure when choosing a tool that it can handle different formats, that is, not only the database currently in use but also data in files and applications. It must do so consistently across all platforms, especially if the systems under test exchange data with others.
In addition, the tool should automatically assign suitable token profiles to database columns while still allowing an individual profile to be defined for each column. Furthermore, the generator should offer the option of generating tokens as pronounceable strings to make them easier for testers to work with.
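As a rough illustration of what pronounceable tokens can look like, the following sketch simply alternates consonants and vowels. It is an assumption about one possible scheme, not a description of a specific product; the letter sets and the function name are hypothetical.

```python
import random

CONSONANTS = "bdfgklmnprstvz"
VOWELS = "aeiou"


def pronounceable_token(length: int, rng: random.Random) -> str:
    """Build a token of the given length that alternates consonant, vowel, consonant, ..."""
    return "".join(
        rng.choice(CONSONANTS if i % 2 == 0 else VOWELS) for i in range(length)
    )


rng = random.Random(7)
surname_token = pronounceable_token(8, rng)
print(surname_token.capitalize())
```

A tester can read such a value aloud or recognize it in a log far more easily than a random string of the same length.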
And of course, the quality of the test data must be right, because the requirements of the systems under test must be met. For some applications, for example, a random 16-digit number is sufficient as a credit card number; in other applications, it must be a valid credit card number with a correct check digit, generated with the Luhn algorithm.
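The difference between the two cases comes down to the final digit. The sketch below computes a Luhn check digit and uses it to build a 16-digit number that passes the check; the function names and the default prefix "4" are illustrative assumptions, and the result is only syntactically valid, not an issued card number.

```python
import random


def luhn_check_digit(payload: str) -> int:
    """Return the digit that makes payload + digit pass the Luhn check."""
    total = 0
    # Walk right to left; the rightmost payload digit is doubled first.
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10


def luhn_is_valid(number: str) -> bool:
    """Check a complete number, including its check digit."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:  # every second digit, starting left of the check digit
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def fake_card_number(rng: random.Random, prefix: str = "4", length: int = 16) -> str:
    """Random digits after the prefix, followed by the matching check digit."""
    body_len = length - len(prefix) - 1
    payload = prefix + "".join(str(rng.randrange(10)) for _ in range(body_len))
    return payload + str(luhn_check_digit(payload))


card = fake_card_number(random.Random(2024))
print(card, luhn_is_valid(card))
```

An application that validates card numbers on input will accept such a value, whereas a purely random 16-digit number passes the check only about one time in ten.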
Sometimes a fictitious street name is sufficient; sometimes the substitute value must be a valid address with a similar demographic structure. Here, the test data management application should also allow custom program code to define the valid value ranges of the substitute values.
And as always with applications, ease of use is key. So, how easy is it to define the tokens? Does a custom language need to be learned? Or does the application support common definition options such as regular expressions?
Considering the consequences of carelessly handling real data for testing purposes, any company involved in application or database development should invest in appropriate software for test data generation. Studies have shown that many companies do not even have an overview of which departments use real data for testing. Such a data classification and survey of data usage is also mandatory under the GDPR.
Acquiring such a system is a simple process, as is generating the test data. And companies should think not only about current requirements but also about the growing number of systems, on-premises or in the cloud, that come with ever faster release cycles. So if a company wants to play it safe in application or database development, there is no way around implementing a professional test data generator.