Test Case Prioritization
Imagine you have several thousand test cases in your test repository for regression testing. Now imagine you have to pick the best of those test cases, every single day, to run overnight. Let's make it more complex: imagine you can only run approximately a thousand test cases overnight. How do you do that? I may have a solution for you. Do note that I will only share an overview of the idea and leave readers to apply it, with flexibility, to their own environment.
Importance of a Test Case
We must first define how to calculate the importance of a test case. For any given day's overnight regression, is a test case TC1 more important than another test case TC2? If yes, how do we quantify that? Let us say every test case is assigned a score every day based on some attributes. How do we decide which attributes contribute to how important a test case is?
I have come up with three key attributes for calculating a test case's score:
- Historical Failure Score
- Test Case Complexity
- Defect Exposure Capability
So, if you plot the scores of all the test cases, it may look like this:
Let us dive a little deeper and understand what each of these attributes contributes.
Historical Failure Score
Assume that the test cases we are considering are regression test cases and have been executed at least a couple of times in the past. That gives us a historical trend of a test case in terms of its results, i.e., pass, fail, crash, etc. In this article we will only consider pass, fail and crash as the possible result categories for a test case. If we encode them as integer values, i.e., Pass: 0, Fail: 1 and Crash: 2, for most test cases we can get a trend line depicting the historical execution trend.
Below is a hypothetical example of such a trend.
The above plot comes from tabular data that may look like the sketch below.
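Here is a hypothetical version of that table as a pandas DataFrame; the dates and results are made up purely for illustration:

```python
import pandas as pd

# Hypothetical per-day result history for one test case,
# encoded as Pass: 0, Fail: 1, Crash: 2.
history = pd.DataFrame({
    "date":   pd.date_range("2024-01-01", periods=7, freq="D"),
    "result": [0, 0, 1, 0, 2, 1, 0],  # Pass, Pass, Fail, Pass, Crash, Fail, Pass
})
```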
Now let's suppose we have a test case getting executed on multiple test beds in parallel. In that case we may have data like below, with three different test beds on which the test case has been executed. What we can do in this case is take the mean of all the results every day and plot that as the trend line.
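A minimal sketch of that daily averaging, assuming a table with one row per (date, test bed) pair and the same 0/1/2 encoding; the bed names and values are made up:

```python
import pandas as pd

# One row per (date, test bed); the trend line is the per-day mean.
multi = pd.DataFrame({
    "date":     ["2024-01-01"] * 3 + ["2024-01-02"] * 3,
    "test_bed": ["TB1", "TB2", "TB3"] * 2,
    "result":   [0, 1, 0, 1, 1, 2],
})
trend = multi.groupby("date")["result"].mean()  # e.g., 0.33 then 1.33
```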
But why do we need this historical execution trend, you may ask? To which I will reply with another question: what kind of data do you see here? Is it time series data? Yes, of course it is. And what is the obvious machine learning problem that comes to mind when you look at time series data? To me, it is a regression problem.
Now imagine you employ an LSTM model that learns a test case's execution trend and predicts what is going to happen tomorrow; that prediction gives us what I call the Historical Failure Score. You may even call it the Historical Execution Score. For example, as shown below.
Since we may not have the same data dimensions for every test case (some test cases may be new or not run very often), we train a separate LSTM model for each test case. They are lightweight and can be trained in 2–3 seconds at most.
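A minimal sketch of such a per-test-case model, assuming Keras/TensorFlow; the window size, layer width and epoch count here are illustrative choices, not values from my setup:

```python
import numpy as np
from tensorflow import keras

def historical_failure_score(history, window=7):
    """Train a tiny LSTM on one test case's encoded history (0/1/2)
    and predict tomorrow's value as the Historical Failure Score."""
    x = np.asarray(history, dtype="float32") / 2.0        # scale 0..2 -> 0..1
    # Sliding windows: each `window`-day slice predicts the next day.
    X = np.stack([x[i:i + window] for i in range(len(x) - window)])
    y = x[window:]
    model = keras.Sequential([
        keras.layers.Input(shape=(window, 1)),
        keras.layers.LSTM(8),                             # small on purpose
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X[..., None], y, epochs=20, verbose=0)
    nxt = model.predict(x[-window:][None, :, None], verbose=0)[0, 0]
    return float(nxt) * 2.0                               # back to 0..2 range
```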
Test Case Complexity
The historical score discussed above will not be enough for scoring a test case. We add another attribute that defines the complexity of a test case. The purpose is to make sure we run enough complex test cases, irrespective of whether they have exposed any bugs in the past. We must not only look at the past but also select test cases that are complex in terms of how much code they exercise and therefore have the potential to expose bug(s) in the future.
This topic can be vague when it comes to calculating the complexity score of a test case. One can derive complexity using tools that measure a test case's code coverage, or, as in my case, use a custom formula based on the configuration of a test case (which is specific to domain knowledge, e.g., telecommunication testing). But in the end, you can define a Complexity Score for every test case, for example as sketched below.
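For illustration only, here is one hypothetical formula mixing coverage with configuration size; the attributes and weights are assumptions, not my actual domain-specific formula:

```python
def complexity_score(covered_lines, total_lines, num_params, max_params=50):
    """Blend code-coverage ratio with configuration size into [0, 1]."""
    coverage = covered_lines / total_lines if total_lines else 0.0
    config   = min(num_params / max_params, 1.0)
    return 0.7 * coverage + 0.3 * config   # weights are arbitrary here
```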
Defect Exposure Capability
Finally, we come to Defect Exposure Capability. The idea behind this attribute is to capture as much information as possible about which file(s) in the code have a high impact on which test case(s). There are two ways of calculating this score for every test case:
- Mapping historical failures to files changed: Imagine a test case failed or crashed on a given day, and you know from the git repo which files in the code were modified. We can map the failed test case against the files changed. We build a matrix where rows correspond to files and columns correspond to test cases, and the values get updated every day. For example, suppose TC1 failed 5 times in the past; file1 was modified every time it failed, and file2 was modified once. So we increase TC1's score for file1 to 5 and for file2 to 1. Now, we know this is blind correlation in a way, but if you build this data historically you will start observing genuine correlations developing. Note that there will always be some outliers; for example, a test case that has been failing for a very long time because of a hardware issue will still accumulate a high score for a regularly modified file in the code.
- Mapping historical defects: Now imagine there was a bug that one of your teams fixed in the past, and they used TC1 to validate their fix. Can we not map the files changed to fix that defect to the test case(s) used to validate the fix? This gives us a direct relation between files in the code and test cases.
Either way, the idea is to end up with a matrix of some sort, like the one sketched below.
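A minimal sketch of building that matrix with pandas, assuming hypothetical helper inputs (the day's failed test cases and changed files):

```python
import pandas as pd

# Rows: files, columns: test cases, values: co-occurrence counts.
matrix = pd.DataFrame(dtype=float)

def update_matrix(matrix, failed_tests, changed_files):
    """Credit every file changed today against every test that failed today."""
    for tc in failed_tests:
        for f in changed_files:
            prev = (matrix.loc[f, tc]
                    if f in matrix.index and tc in matrix.columns else 0.0)
            matrix.loc[f, tc] = (0.0 if pd.isna(prev) else prev) + 1
    return matrix

# Example: TC1 failed on a day when file1.py and file2.py were modified.
matrix = update_matrix(matrix, ["TC1"], ["file1.py", "file2.py"])
```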
Every day, when you have to run the overnight regression, you can check through git which files were changed for your regression label, filter those files in the above dataframe/matrix, and add up the scores for every test case. This gives a Defect Exposure Score for every test case, which tells us, for the files changed today, which test case has the highest potential to expose a defect or the highest chance of failing/crashing.
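Continuing the matrix sketch above, the daily score is then just a filtered column sum:

```python
def defect_exposure_scores(matrix, changed_today):
    """Sum, per test case, the counts of the rows for today's changed files."""
    rows = matrix.reindex(index=changed_today).fillna(0)
    return rows.sum(axis=0)          # a Series: one score per test case

de_scores = defect_exposure_scores(matrix, ["file1.py"])
```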
Test Case Score
Finally, every day we process the data to calculate the three different scores and combine them into the final Test Case Score for every test case. I leave it to the reader to experiment with different approaches; simple addition of the three values worked for me. For some, a weighted sum can be helpful, e.g., to give more weight to the Defect Exposure Score or to the Complexity Score.
Sort the list of test cases from highest to lowest score and you will have the prioritized list of test cases to run in your overnight regression. This list will change dynamically every day, as two of the attributes, the Historical Execution Score and the Defect Exposure Score, may change for a few test cases every day. A minimal sketch of this final step is shown below.
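Putting it together, assuming per-test-case dicts of the three scores (the names and values below are hypothetical); equal weights reproduce the simple addition described above:

```python
def prioritize(historical, complexity, exposure, w_h=1.0, w_c=1.0, w_d=1.0):
    """Weighted sum of the three scores; equal weights = simple addition."""
    tests = set(historical) | set(complexity) | set(exposure)
    total = {tc: w_h * historical.get(tc, 0.0)
                 + w_c * complexity.get(tc, 0.0)
                 + w_d * exposure.get(tc, 0.0)
             for tc in tests}
    return sorted(total, key=total.get, reverse=True)   # highest score first

# Hypothetical inputs; in practice these come from the three steps above.
hist_scores = {"TC1": 1.4, "TC2": 0.2}
cx_scores   = {"TC1": 0.6, "TC2": 0.9}
de_scores   = {"TC1": 5.0, "TC2": 0.0}
overnight   = prioritize(hist_scores, cx_scores, de_scores)[:1000]
```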
Conclusion
Test case prioritization in the current Agile way of working must not be static in nature. By applying the idea described above, one can maintain a dynamic, prioritized list of test cases that provides better testing and avoids fault slip-throughs.