1. Show Me the Test Data
2. Run for Cover
3. Quit Bugging Me
4. Stop Reinventing the QA Wheel
5. Find the Needle in the Test Data Haystack
6. Begin Breaking Bad...Data
7. Assure All-You-Can-Eat Data Ingestion
8. ID Sensitive Data Automatically
9. Mask Data on the Fly
10. Maintain Referential Integrity
11. Keep Private Data Private yet Accessible
12. Test New Releases BEFORE they are Released
13. Move Test Data TO and FROM any Environment
14. Deliver Test Data Instantly
15. Prevent QA Data Collisions
16. Automate, Automate, Automate
Enterprise data is often siloed and fragmented across numerous data sources scattered throughout the organization. Using production data is the path of least resistance, but data protection and privacy legislation often limits its use in development and testing environments. Testers are challenged to obtain data with the appropriate characteristics for each test. According to analyst firm Gartner, QA engineers spend 46% of their time searching for, analyzing, and preparing test data.
A word of advice: Use test data automation to enable your testers to concentrate on doing the job they were hired to do – test software, not prepare data.
Test coverage measures the percentage of the application code exercised by test cases. Coverage will be low if the test data represents only a small percentage of production data, in both volume and variety. Even production data typically provides only a fraction of the functional coverage required to fully test an application.
With better software testing coverage, you can reap more test data management benefits:
Software defect density refers to the number of defects or bugs found in a piece of software code, typically measured per unit of code size (for example, per thousand lines of code) or per function point.
When test data is poorly designed or insufficient, it may not test all possible scenarios or paths through the code. This can lead to defects going unnoticed and remaining in the code, resulting in a higher defect density.
Continuous testing helps identify defects early in the software development lifecycle by allowing developers to test their code more frequently and more accurately, ultimately reducing defect density.
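For illustration, here is a minimal sketch of how defect density might be computed when defects are counted per thousand lines of code (KLOC); the function name and inputs are hypothetical.

```python
def defect_density(defect_count: int, lines_of_code: int) -> float:
    """Return defects per thousand lines of code (KLOC).

    Hypothetical helper: defect_density = defects / (LOC / 1000).
    """
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return defect_count / (lines_of_code / 1000)

# Example: 18 defects found in a 45,000-line module -> 0.4 defects per KLOC
print(defect_density(18, 45_000))  # 0.4
```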
The reusability (versioning) of test data leads to enhanced:
Test data subsetting is the process of dividing a given set of test data into smaller subsets. A subset is selected from the original test data to represent a specific scenario or part of the system under test, either to reduce the size of the test data set or to create specialized test cases that focus on specific aspects of the software. Subsetting delivers:
A word of advice: Subset test data based on business rules, without scripting. Ideally, test data subsets should hide the technical complexity of underlying source systems, preventing the need to know which databases/tables/columns contain the required data, and ensuring referential consistency.
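The advice above is to subset declaratively, without scripting; the sketch below only illustrates the underlying idea. It uses an in-memory SQLite database with hypothetical customers and orders tables, applies an illustrative business rule ("EU customers only"), and keeps only the child rows that reference the selected parents, so referential consistency is preserved.

```python
# Minimal sketch of rule-based subsetting with referential consistency.
# Table names, columns, and the "EU only" rule are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE orders    (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL,
                            FOREIGN KEY (customer_id) REFERENCES customers(id));
    INSERT INTO customers VALUES (1, 'Alice', 'EU'), (2, 'Bob', 'US'), (3, 'Chen', 'EU');
    INSERT INTO orders    VALUES (10, 1, 99.0), (11, 2, 45.0), (12, 3, 12.5);
""")

# Build the subset: EU customers, plus only the orders that reference them,
# so no child row points at a parent that was left out of the subset.
conn.executescript("""
    CREATE TABLE customers_subset AS
        SELECT * FROM customers WHERE region = 'EU';
    CREATE TABLE orders_subset AS
        SELECT o.* FROM orders o
        JOIN customers_subset c ON o.customer_id = c.id;
""")

print(conn.execute("SELECT COUNT(*) FROM customers_subset").fetchone())  # (2,)
print(conn.execute("SELECT COUNT(*) FROM orders_subset").fetchone())     # (2,)
```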
In software testing, bad data can produce inaccurate results, false positives, and false negatives, with serious consequences for the quality and reliability of the software. Bad data may result in:
A word of advice: Almost half of all test data automation failures can be attributed to bad data. High-quality test data is required to eliminate bad data conditions.
When we speak with large enterprises, we ask, “Which database technologies do you use?” The usual response is, “At least one of each!” Most large organizations today use legacy systems while striving to regularly upgrade their tech stack and integrate the latest technologies.
Some test data management tools support only a limited list of database vendors and versions, typically skewed toward the most popular ones. According to analyst firm Gartner: “Modern applications rely on an increasing number of interconnected data stores, applications, and APIs to function, requiring tools to coordinate and synchronize changes while ensuring relational consistency and addressing security and speed mandates.”
Consequently, enterprise applications rely on an ever-growing number of data sources.
A word of advice: For the highest quality test data, it’s important to be able to integrate with any kind of data source or technology, including those that may be adopted in the future.
Personally Identifiable Information (PII) discovery in production databases is the process of identifying and protecting sensitive information that can be used to identify an individual. This can include information such as name, social security number, driver's license number, passport number, email address, and other similar identifying information.
PII discovery in production databases is important because it helps to protect the privacy of individuals and prevent identity theft. Production databases may contain large amounts of sensitive data, and it can be difficult to identify all of the data that needs to be protected. PII discovery tools can be used to scan databases and identify any data that contains sensitive information.
Once the sensitive data is identified, appropriate steps can be taken to protect it, such as encrypting the data, restricting access to it, or deleting it entirely if it is no longer needed. PII discovery can also help organizations comply with regulatory requirements, such as the General Data Protection Regulation (GDPR) in the European Union or the California Consumer Privacy Act (CCPA) in the US.
A word of advice: Automatically discover sensitive data in any data store, and take the appropriate steps to protect it according to internal policies.
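As a rough illustration, here is a minimal sketch of pattern-based PII discovery that scans sampled column values with regular expressions for two common identifier types (email and US SSN). The column names, sample data, and patterns are hypothetical; real discovery tools also use column-name heuristics, dictionaries, and classifiers.

```python
# Minimal sketch of regex-based PII discovery over hypothetical column samples.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def discover_pii(columns: dict[str, list[str]]) -> dict[str, set[str]]:
    """Return, per column, the set of PII types detected in its sampled values."""
    findings: dict[str, set[str]] = {}
    for column, samples in columns.items():
        for pii_type, pattern in PII_PATTERNS.items():
            if any(pattern.search(value) for value in samples):
                findings.setdefault(column, set()).add(pii_type)
    return findings

sampled = {
    "contact":  ["alice@example.com", "bob@example.org"],
    "notes":    ["call back Tuesday", "SSN on file: 123-45-6789"],
    "order_id": ["A-1001", "A-1002"],
}
print(discover_pii(sampled))
# {'contact': {'email'}, 'notes': {'us_ssn'}}
```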
Masking data (like PII) is often a laborious, repetitive, and time-consuming task for DevOps personnel. Data masking techniques come in many forms; here are a few examples:
A word of advice: Make your test data available to a wide variety of data masking techniques, as well as the possibility to combine data masking tools with data tokenization tools, and/or synthetic data generation tools.
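To make the techniques concrete, here is a minimal sketch of three common masking approaches applied to hypothetical values: substitution (replace with a fixed placeholder), shuffling (permute values across rows), and partial redaction (keep only the last characters). The field values and rules are illustrative only.

```python
# Minimal sketch of three common data masking techniques.
import random

def substitute(value: str, placeholder: str = "REDACTED") -> str:
    # Substitution: replace the real value with a fixed placeholder.
    return placeholder

def shuffle_column(values: list[str], seed: int = 42) -> list[str]:
    # Shuffling: permute real values across rows so they no longer match their owners.
    shuffled = values[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled

def redact_partial(value: str, keep_last: int = 4) -> str:
    # Partial redaction: hide all but the last few characters.
    return "*" * max(len(value) - keep_last, 0) + value[-keep_last:]

names = ["Alice", "Bob", "Chen"]
print(substitute("Alice"))                 # REDACTED
print(shuffle_column(names))               # e.g. ['Chen', 'Alice', 'Bob']
print(redact_partial("4111111111111111"))  # ************1111
```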
Referential integrity, in this context, refers to masking each type of sensitive data with the same algorithm to ensure consistency across databases. Data masking tools and processes must be synchronized across the organization for each data type, to keep the data functional for analytics and other use cases. If data is masked in each source separately, it becomes difficult, if not impossible, to maintain consistency for the same business entity across systems, and creating the linkage between entities is a time-consuming and complex procedure.
According to analyst firm Gartner: “With modern applications relying on an increasing number of interconnected data stores (many of which are technologically vastly more complex), applications and APIs to function, testing has become more complex. Such complexity demands that tools support the ability to coordinate and synchronize changes across different data stores to ensure relational consistency while addressing security and speed mandates.”
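To illustrate the consistency point, here is a minimal sketch of deterministic masking using a keyed hash: the same input (say, a customer ID) always produces the same masked value, so two independently masked data stores still join on the same entity. The key, prefix, and format are hypothetical, and key management is out of scope.

```python
# Minimal sketch of deterministic (consistent) masking with an HMAC.
# The same input and key always yield the same output, so masking the same
# customer ID in two different data stores preserves referential integrity.
import hmac
import hashlib

MASKING_KEY = b"shared-secret-key"  # hypothetical; keep in a secrets manager in practice

def mask_id(value: str) -> str:
    digest = hmac.new(MASKING_KEY, value.encode(), hashlib.sha256).hexdigest()
    return "CUST-" + digest[:12]

# The same entity masked in two "systems" produces an identical masked value,
# so cross-system joins on the masked key still work.
print(mask_id("customer-8842"))
print(mask_id("customer-8842"))  # same output as the line above
```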
Dynamic data masking is a technique that protects sensitive data by masking it in flight. By hiding sensitive data from unauthorized users, but not from authorized ones, it helps enforce privacy regulations by:
A word of advice: Avoid static data masking, which provides a window of opportunity for hackers to breach the staging environment and retrieve production data before it’s masked. The process can also be cumbersome and complex, sometimes requiring coding for various integrations, transformations, and maintaining referential integrity.
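A minimal sketch of the dynamic masking idea follows: values are masked at read time based on the caller's role, so unmasked data is never handed to unauthorized users or persisted in a staging environment. The roles, field list, and masking rule are hypothetical.

```python
# Minimal sketch of dynamic (in-flight) data masking: sensitive fields are masked
# per request, depending on the caller's role, rather than rewritten at rest.

SENSITIVE_FIELDS = {"ssn", "email"}          # hypothetical masking policy
AUTHORIZED_ROLES = {"compliance_officer"}    # roles allowed to see raw values

def mask_value(value: str) -> str:
    return "*" * max(len(value) - 4, 0) + value[-4:]

def fetch_row(row: dict[str, str], role: str) -> dict[str, str]:
    """Return the row, masking sensitive fields unless the role is authorized."""
    if role in AUTHORIZED_ROLES:
        return dict(row)
    return {k: mask_value(v) if k in SENSITIVE_FIELDS else v for k, v in row.items()}

row = {"name": "Alice", "ssn": "123-45-6789", "email": "alice@example.com"}
print(fetch_row(row, role="qa_engineer"))         # ssn and email masked
print(fetch_row(row, role="compliance_officer"))  # returned unmasked
```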
As software applications evolve (or new applications are created), new versions require test data that isn't available in production databases. To support new functionality or requirements, applications may need updated schemas. Instead of crafting test data manually, a test data management tool can generate realistic, synthetic test data to fill coverage gaps or populate new or updated schemas. Synthetic data generation allows for greater:
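For a sense of what synthetic generation looks like, here is a minimal sketch that produces fake customer rows with only the standard library. The schema (id, name, email, signup_date) and value pools are hypothetical; real generators also honor constraints, distributions, and cross-table relationships.

```python
# Minimal sketch of synthetic test data generation for a hypothetical customers table.
import random
from datetime import date, timedelta

FIRST_NAMES = ["Alice", "Bob", "Chen", "Dana"]
DOMAINS = ["example.com", "example.org"]

def synthetic_customer(customer_id: int, rng: random.Random) -> dict:
    name = rng.choice(FIRST_NAMES)
    return {
        "id": customer_id,
        "name": name,
        "email": f"{name.lower()}{customer_id}@{rng.choice(DOMAINS)}",
        "signup_date": (date(2024, 1, 1) + timedelta(days=rng.randrange(365))).isoformat(),
    }

rng = random.Random(7)  # seeded, so test runs are repeatable
for customer_id in range(1, 4):
    print(synthetic_customer(customer_id, rng))
```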
Having spent much time and effort preparing the "perfect" test data set for a given sprint (or test cycle, such as Integration Test, UAT, or Performance Test), DevOps test data management teams are required to prepare a new testing environment for the next sprint (or test cycle).
A word of advice: Allow test data to be provisioned directly from any source (including non-production environments) to any target – without the need for complicated, expensive, and long setups.
Initial setup and refresh of the staging area for test environments is a time-consuming process: ingesting or refreshing data from the source systems into the staging environment, running masking jobs, and then provisioning the data to lower environments.
At times it becomes necessary for the QA team to test software functionality with up-to-date data from production sources. In many enterprises, test data is refreshed once a quarter, at best. In some testing scenarios, for example, when recreating bugs reported in production, having a near real-time refresh of source data is crucial.
A word of advice: Provision test data from production in seconds/minutes, rather than days/weeks.
It's not uncommon for testers to inadvertently overwrite each other's test data, resulting in corrupted test data, lost time, and wasted effort. Test data must then be provisioned again, and tests rerun. The right test data management tool can resolve this issue by segregating test data per individual tester, mitigating the risk of corruption.
A word of advice: Allow for test data versioning per individual tester.
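As a rough illustration of per-tester isolation, here is a sketch that gives each tester a private, versioned snapshot of a shared dataset; the naming scheme and in-memory storage are hypothetical stand-ins for database clones or virtual copies.

```python
# Minimal sketch of per-tester test data versioning: each tester checks out an
# isolated, versioned snapshot, so one tester's changes never overwrite another's.
import copy

shared_dataset = {"customers": [{"id": 1, "name": "Alice"}]}
snapshots: dict[str, dict] = {}

def checkout(tester: str, version: int) -> dict:
    """Give a tester an isolated copy of the shared dataset, keyed by tester/version."""
    key = f"{tester}/v{version}"
    snapshots[key] = copy.deepcopy(shared_dataset)
    return snapshots[key]

dana_data = checkout("dana", 1)
eli_data = checkout("eli", 1)
dana_data["customers"][0]["name"] = "CHANGED"

print(eli_data["customers"][0]["name"])        # Alice (untouched by Dana's edit)
print(shared_dataset["customers"][0]["name"])  # Alice (shared copy also untouched)
```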
Automated test data provisioning delivers better:
A word of advice: Make sure you can provision test data via APIs, as part of a CI/CD pipeline.
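As a sketch of what API-driven provisioning in a pipeline might look like, here is a hypothetical call to a test data provisioning endpoint made from a CI step. The URL, payload fields, and TEST_DATA_API_TOKEN environment variable are assumptions for illustration, not a real product API.

```python
# Minimal sketch of provisioning test data via an API call from a CI/CD step.
import json
import os
import urllib.request

def provision_test_data(environment: str, dataset: str) -> dict:
    payload = json.dumps({"environment": environment, "dataset": dataset}).encode()
    request = urllib.request.Request(
        url="https://testdata.example.internal/api/v1/provision",  # placeholder URL
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['TEST_DATA_API_TOKEN']}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)

if __name__ == "__main__":
    # Typically invoked as a pipeline step before the test stage runs.
    result = provision_test_data(environment="qa-2", dataset="masked-customers-subset")
    print(result)
```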
Practice what you preach with K2view Test Data Management software.