Continuous Integration is an established practice in software projects for continuously merging the source code changes made by different developers. Generally, the process can be divided into two steps: compilation and testing. This article highlights some of the challenges that can arise in the area of testing in large projects.
In modern projects, a large part of testing is fully automated. Techniques such as Test-Driven Development (TDD) have a proven positive influence on project progress, architecture and coverage. Thanks to such methods, the number of tests available for execution increases continuously. Ideally, every automated test would be executed with every change to the source code, ensuring that functionality already implemented is not broken by new changes. However, these new challenges are barely considered in the literature on Continuous Integration. Initially, all tests can and should be executed with each change. But this neglects the execution time of the test suite, which grows constantly, sometimes to several hours. The result is exactly the kind of long feedback loop we try to avoid. The following five factors can cause Continuous Testing to fail.
Test Level Not Used Correctly
Ideally, the distribution of tests across the different levels forms a pyramid, ascending from unit tests as the foundation, through integration tests, to system tests at the top. This is known as Mike Cohn's Test Pyramid metaphor.
However, in many projects we find an inverted pyramid, the so-called Ice-Cream Cone, in which most of the tests sit at a very high level. At worst, all tests are UI-based or manual. These tests should be the smallest part of the test suite. But why does this have a negative effect? Tests at higher levels have some or all of the following negative characteristics:
- High execution time
- High analysis effort in case of failure
- High maintenance requirements (especially for UI tests)
- Increased instability during test execution
On the other hand, it does not make sense for a test suite to consist only of unit tests. These verify very quickly whether a change has broken existing functionality in a local area, but not whether the individual parts of the software work together correctly. It is therefore essential to bring the number of tests on the respective levels into the form of a pyramid. Each level covers a separate aspect that should be tested, from the functionality of individual classes up to that of the entire system, including whether the system meets its non-functional requirements.
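Whether a suite actually has the pyramid shape can be checked mechanically from the test counts per level. A minimal sketch, assuming hypothetical level names and counts (your project's actual levels and numbers will differ):

```python
# Sketch: verify that test counts descend from unit to system level.
# The level names and counts below are illustrative, not from a real project.
from collections import Counter

def is_pyramid(counts: Counter) -> bool:
    """Return True if the suite is pyramid-shaped (unit > integration > system)."""
    return counts["unit"] > counts["integration"] > counts["system"]

suite = Counter({"unit": 1200, "integration": 150, "system": 20})
print(is_pyramid(suite))  # a healthy pyramid -> True
```

Such a check could run as part of the build itself, so that the suite's shape degrading toward an Ice-Cream Cone is noticed early rather than discovered later.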
Tests Are Not Continuously Improved
Unit tests are ideally adjusted and improved continuously together with the product code, so the aging of the code is usually not a problem there. As soon as you approach the system-level tests, however, you will increasingly see the phenomenon that the test code is no longer touched once it has been added to source code management. Over time, especially on the higher test levels, this leads to problems: test suites that initially ran in a few seconds can grow to take minutes or even hours.
In addition, technical debt can accumulate in the tests, such as duplicated code or a lack of refactoring. As a result, the test suite becomes increasingly difficult to maintain. You can then find yourself in a situation where it is more time-consuming to adapt the tests than to implement the product. This is why the quality of the automated tests must be continuously improved. For example, you can analyse the five slowest tests in each iteration to see whether you can speed them up.
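Identifying those slowest tests is straightforward once test durations are available. A minimal sketch, assuming hypothetical test names and timings; in practice the durations would come from your test runner's report (for example a JUnit XML file):

```python
# Hypothetical timing data: test name -> duration in seconds.
durations = {
    "test_login": 0.4,
    "test_checkout_flow": 42.0,
    "test_search": 3.1,
    "test_export_report": 95.5,
    "test_profile_update": 1.2,
    "test_invoice_pdf": 61.3,
}

def slowest(durations: dict, n: int = 5) -> list:
    """Return the n slowest tests, slowest first."""
    return sorted(durations.items(), key=lambda kv: kv[1], reverse=True)[:n]

for name, secs in slowest(durations):
    print(f"{name}: {secs:.1f}s")
```

Reviewing this short list once per iteration keeps the effort small while steadily attacking the largest contributors to suite runtime.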
Unstable Tests Are Not Treated Separately
In every large project there are tests that deliver a non-deterministic result (flaky tests). They should not exist, but they do! The automatically triggered build process in our Continuous Integration practice is therefore sporadically green or red, which reduces confidence in the Continuous Integration system.
This problem should not be ignored. The "Broken Window Theory" says that you start to neglect something as soon as the first flaws appear, and that is exactly what happens: teams start to ignore the Continuous Integration build results. Statements like "I'm done when 95% of my tests are green" or "These five tests fail sporadically, but everything works" are the result. But what if it's 94% instead of 95%? Or if six tests fail instead of five? Or five completely different tests fail, and nobody notices because everyone only looks at the number of failed tests?
In a system that continuously performs automated tests, unstable tests should therefore be handled by a separate process. Martin Fowler used the term "quarantine" for this. In principle, a test that fails sporadically should not remain in the normal test suite. Much like a highly contagious disease, where the infected must be quarantined, these tests should be removed from the normal execution to prevent the results of the build from being ignored. The tests must of course be examined, and then either the product repaired or the test corrected. If the problem cannot be solved within a few hours, this work should happen outside the normal build process so that it does not block the build. With the introduction of test quarantine, the number of quarantined tests should also be monitored, to avoid a situation where suddenly hardly any tests remain in the normal test execution.
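The quarantine idea, including the monitoring of how many tests it holds, can be sketched as a simple filter in the build pipeline. The quarantine list, the test names and the threshold below are illustrative assumptions, not part of any standard tool:

```python
# Sketch: keep quarantined (flaky) tests out of the gating test run,
# and alert before the quarantine quietly swallows the whole suite.
QUARANTINE = {"test_flaky_upload", "test_flaky_network"}
MAX_QUARANTINED = 10  # illustrative threshold for the monitoring rule

def split_suite(all_tests: list) -> tuple:
    """Separate quarantined tests from the normal (gating) run."""
    normal = [t for t in all_tests if t not in QUARANTINE]
    quarantined = [t for t in all_tests if t in QUARANTINE]
    if len(quarantined) > MAX_QUARANTINED:
        raise RuntimeError("Too many tests in quarantine - investigate!")
    return normal, quarantined

normal, quarantined = split_suite(
    ["test_login", "test_flaky_upload", "test_search"]
)
print(normal)       # runs in the gating build
print(quarantined)  # runs separately; results are tracked but not gating
```

The hard limit turns the monitoring requirement into an automatic signal: once too many tests sit in quarantine, the build itself demands attention instead of relying on someone to notice.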
Tools Are Too Complex
Another possible error is that the tools used for Continuous Integration and Continuous Testing are too complex. They should be as easy to use as possible, since they are used several times a day by almost all developers and testers. This also means that the tools cannot be considered separately for the "compile" and "test" areas; these aspects must go hand in hand to make daily work as easy as possible for everyone involved. It should be clearly defined what behaviour can be expected from a test and what is the task of the infrastructure. That way you can decide which measures should be taken by the tools and which are the responsibility of the tests. A typical example is a configuration change that a test makes in order to run: the test is responsible for resetting it, even in the event of an error.
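That last responsibility, restoring a configuration change even when the test fails, is easy to get wrong without a guaranteed cleanup step. A minimal sketch; the `config` dict stands in for whatever configuration store your system actually uses:

```python
# Sketch: a test that changes a configuration value must restore it,
# even if the test body raises. `config` is a stand-in for a real store.
from contextlib import contextmanager

config = {"feature_x_enabled": False}

@contextmanager
def temporary_config(key, value):
    """Apply a config change and guarantee it is reverted afterwards."""
    original = config[key]
    config[key] = value
    try:
        yield
    finally:
        config[key] = original  # runs even when the test fails

try:
    with temporary_config("feature_x_enabled", True):
        assert config["feature_x_enabled"] is True
        raise AssertionError("simulated test failure")
except AssertionError:
    pass

print(config["feature_x_enabled"])  # False - the change was rolled back
```

Most test frameworks offer an equivalent mechanism (fixtures with teardown, for example); the point is that cleanup belongs to the test, not to the infrastructure.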
No Test Architecture
In many projects, test code development is still not considered software development. But in fact, tests are first-class citizens alongside your product code. In complex projects, test code development is a big challenge and must be approached with the same software development methods as product development. If the quality of the product code is poor, an automated test suite of good quality is all the more important. It allows you to gradually increase the quality of the product code while relying on the safety harness of high-quality tests.
However, if the test code quality is not good either, correcting errors found in the field will cost a lot of time. This means that a high proportion of the costs are incurred after the actual development, i.e. during the maintenance phase. It is therefore important to set up an architecture with clear rules for the test code, and to measure the quality of the implemented code. Ideally, the architecture of the test code is described in the same way as the architecture of the product code, so that all project participants can read both documentations in the same way.
Continuous Testing in large projects is not impossible, but it is a challenge. These five points cover a large part of the possible difficulties. The challenge should be accepted, however, as Continuous Testing can have a significant positive impact on software development and reduce the fear of changes in an existing system. It is an investment in the future.