An Exploratory Study of History-based Test Case Prioritization Techniques on Different Datasets



Introduction
The software development life cycle consists of many stages, and testing is one of its most important phases. Software testing comprises different activities, but test case generation and optimization are at the forefront 1. Test case optimization can be further divided into three activities: test case prioritization (TCP), test case selection (TCS), and test case minimization (TCM). TCP has proven more useful than its counterparts because it does not compromise the test cases. TCS selects some test cases from the available set according to some objective and may therefore miss important test cases, while TCM reduces the test suite to the minimal number of test cases required, which may compromise the ability to detect faults in the future because test cases are permanently removed 2-4.
Different techniques are used to prioritize test cases. These techniques use different data, procedures, and factors, such as coverage data, software requirements, software development history, search algorithms, and the similarity between test cases 2. Coverage-based approaches work on the principle of how much code is covered by a test case, but high coverage does not always guarantee good performance 5,6. Requirements-based approaches can be employed in the initial phases because they use user requirements that are available from the beginning 7. Similarly, history-based approaches use past data from software development, analogous to stock market prediction, where the future is predicted with the help of past data; they can also be used in the early phases. In search-based techniques, algorithms search for the best solutions according to the provided constraints. Similarity-based approaches use commonality among test cases and code to prioritize test cases 2. TCP techniques suffer from different problems such as equal priority, low APFD, running identical test cases repeatedly, enormous test suite size, resource scarcity, and incomplete code coverage 8,9.
History-based TCP techniques are gaining popularity among researchers, and the availability of historical data has also increased 10. Previously it was difficult to find historical data because it was not recorded, but with the efforts of researchers and industry, a few platforms are now available from which historical data can be acquired 2. Open-source datasets are more commonly available than industrial datasets 10. Historical data can also be generated by executing and analyzing the source code available in version control repositories, but with the introduction of newer versions of software tools, support for previous ones usually ends, so researchers executing old projects to acquire data may run into errors. Efforts are required to provide up-to-date datasets to the research community so that different techniques can be put to the test.
This research has three objectives: the primary objective is to implement the history-based TCP techniques dispersed in the literature under one roof, the secondary objective is to investigate the problem of equal priority in history-based TCP techniques, and the tertiary objective is to explore random sorting as a solution to that problem. Similar work was found in the literature but with different objectives 10: that study aims to establish which techniques work well for open- and closed-source projects and whether any technique works effectively for both types of projects. The study at hand, however, focuses on highlighting the problems of history-based TCP techniques.

Literature Review
Researchers have presented history-based techniques that use only one history-based factor 11-13. Other studies have utilized more than one history-based factor 14,15. Some researchers have combined one or more other types of TCP technique, such as coverage-based or similarity-based, with a history-based technique to improve the performance of TCP in regression testing 16,17. The drawback of one type of TCP technique may be reduced by combining it with another type.
The most recent failure (MRF) factor has been derived from an approach that combines history-based and similarity-based TCP techniques 16. Another approach combined factors such as failure rate (FR) with test case age to form an indicator for prioritizing test cases 15. A further approach applied exponential smoothing to calculate the priority of test cases based on their execution results 11; it can also be viewed in terms of coverage- and test-case-age-based TCP, although coverage is outside the scope of this study, which focuses on history-based TCP approaches only. The history-based approach is widely used by researchers. An industrial weightage scheme named ROCKET (R) was introduced to prioritize test cases using historical records and the execution time of test cases 14. The ROCKET weightage scheme gives the maximum importance to a failure in the last run, medium importance to a failure in the second last run, and the least importance to failures in the third last run and all previous runs.
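As an illustration of the two factors just described, the sketch below computes an exponentially smoothed failure score and a ROCKET-style recency-weighted score from a test case's pass/fail history. The smoothing constant and the weight values are illustrative assumptions, not the parameters used in the cited works.

```python
# A test case's history is a list of past run results, ordered oldest to newest,
# where 1 means the test failed in that run and 0 means it passed.

def exponential_decay_priority(history, alpha=0.8):
    """Exponentially smoothed failure score: recent runs dominate older ones.
    The smoothing constant alpha is illustrative, not taken from the cited work."""
    score = 0.0
    for result in history:                       # oldest -> newest
        score = alpha * result + (1 - alpha) * score
    return score

def rocket_style_priority(history, weights=(0.7, 0.2, 0.1)):
    """ROCKET-style weighting: the last run gets the largest weight, the second
    last a medium weight, and the third last plus all earlier runs the smallest.
    The weight values here are assumptions for illustration."""
    w_last, w_second, w_rest = weights
    score = 0.0
    for age, result in enumerate(reversed(history)):  # age 0 = most recent run
        if age == 0:
            score += w_last * result
        elif age == 1:
            score += w_second * result
        else:
            score += w_rest * result
    return score

# Example: a test case that failed in the two most recent runs scores higher
# than one whose only failure happened several runs ago.
print(rocket_style_priority([0, 0, 1, 1]))   # recent failures -> 0.9
print(rocket_style_priority([1, 0, 0, 0]))   # old failure only -> 0.1
```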
The co-failure-based (CoF) approach assigns failure probabilities to test cases according to which test cases are failing and then rearranges them accordingly 12; it is dynamic because it reprioritizes the test cases after every run. The flipping-history-based (FH) approach uses ROCKET to identify the first failure and then statistically assigns priority to the test cases based on the result of flipping 13. Flipping is when two test cases switch states simultaneously, that is, they pass or fail together. Terminator (T) was proposed in 17. History-based approaches also draw on factors such as test case age 11,15 and failure count 15. However, the most fundamental factor used in history-based approaches is test case execution results 11-17. Table 1 shows the factors used by history-based techniques.
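The following sketch illustrates, under simplifying assumptions, the two ideas just described: a co-failure estimate as the conditional probability that one test case fails given that another does, and a flip counter that records runs in which two test cases switch state together. Both are rough stand-ins for intuition, not the exact models of the cited techniques.

```python
# Histories are lists of results per run, 1 = fail, 0 = pass, oldest to newest.

def co_failure_probability(history_a, history_b):
    """Estimate P(test B fails | test A fails) from shared history -- a rough
    stand-in for the co-failure idea, not the cited technique's exact model."""
    runs = range(min(len(history_a), len(history_b)))
    a_failures = [k for k in runs if history_a[k] == 1]
    if not a_failures:
        return 0.0
    both = sum(1 for k in a_failures if history_b[k] == 1)
    return both / len(a_failures)

def flip_count(history_a, history_b):
    """Count the runs in which both test cases switch state simultaneously and
    end in the same state -- a simplified reading of the flipping notion."""
    flips = 0
    for k in range(1, min(len(history_a), len(history_b))):
        a_changed = history_a[k] != history_a[k - 1]
        b_changed = history_b[k] != history_b[k - 1]
        if a_changed and b_changed and history_a[k] == history_b[k]:
            flips += 1
    return flips

# Example: these two test cases fail and recover together at three runs.
print(co_failure_probability([0, 1, 1], [0, 1, 0]))      # -> 0.5
print(flip_count([0, 1, 1, 0, 1], [0, 1, 1, 0, 1]))      # -> 3
```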

Table 1. Factors used by history-based TCP techniques: Most Recent Failure (MRF) 16, Failure Rate (FR) 15, Exponential Decay (ED) 11, ROCKET (R) 14, Co-failure (CoF) 12, Flipping History (FH) 13, and Terminator (T) 17.

The most readily available data relate to test case executions. However, earlier datasets of historical test case execution data suffer from imbalance: there are many more pass instances than fail instances 2. This imbalance can affect the techniques that depend on such historical data, but it has decreased over time. This may be due to the modern way of developing software using version control platforms such as GitHub, where developers from around the world can work together and the platform records most of the data. Another factor is the introduction of bug-tracking systems such as Bugzilla and Jira, where the bugs found in the software are maintained. The nature and particulars of the data used may also change over time: earlier, researchers used mutation-based testing, so the bugs, and hence the historical data, were mutation-based; later the trend shifted to using historical data of real bugs.

Materials and Methods
The history-based TCP techniques considered in this study are MRF, FR, R, ED, CoF, FH, and T.
Their counterparts with random tie-breaking are the most recent failure with random (MRFR), failure rate with random (FRR), ROCKET metric with random (RR), Exponential Decay with random (EDR), co-failure with random (CoFR), and AFSAC flipping history with random (FHR). The history-based techniques, the history-based techniques with random, and random sorting will be applied to the dataset. The techniques will then calculate their factors, and the test cases will be arranged based on those factors. In the next step, the prioritized test cases are run against the historical results to calculate the APFD metric and the execution time of each technique. Finally, the APFD results will be plotted as graphs to make comparison easier, and the execution times will be displayed in tabular form.
The whole process is presented in Fig. 1.

Figure 1. Flow of activities.
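To make the flow in Fig. 1 concrete, the minimal sketch below shows how a history-based factor can be computed per test case, how test cases are ranked by it, and how ties can optionally be broken by random shuffling. The data layout (a mapping from test case names to pass/fail histories) and all function names are illustrative assumptions, not the study's actual implementation.

```python
import random

def prioritize(histories, factor_fn, random_tie_break=False, seed=0):
    """Rank test cases by a history-based factor, highest score first.
    With random_tie_break, test cases with equal factor values end up in a
    random order (shuffle first, then stable sort); otherwise ties keep an
    arbitrary but deterministic order."""
    rng = random.Random(seed)
    scored = [(name, factor_fn(history)) for name, history in histories.items()]
    if random_tie_break:
        rng.shuffle(scored)                      # randomize order within ties
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scored]

# Example with a trivial factor: the failure rate of each test case
# (1 = fail, 0 = pass, oldest run first).
histories = {
    "t1": [0, 0, 1, 1],
    "t2": [0, 0, 0, 1],
    "t3": [0, 0, 0, 0],
}
failure_rate = lambda h: sum(h) / len(h)
print(prioritize(histories, failure_rate))                       # ['t1', 't2', 't3']
print(prioritize(histories, failure_rate, random_tie_break=True))
```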
The dataset used contains test case execution results from a recent study 10. It provides data from 30 projects collected from GitHub and Travis CI, but only 12 projects are included in this study due to time and space constraints, because some of the projects have an enormous amount of data.
The execution time of history-based techniques on these projects may approach 18 hours or more 10, so projects with an appropriate amount of data were utilized in this study. These projects are mostly based on the Java and Ruby programming languages, while a few use Python and C++. The projects selected from the dataset 10 are deeplearning4j, structr, diaspora, okhttp, puppet, rspec, loomia, parsl, wicket bootstrap, radical, titan, and jetty. The duration of all the projects is more than one year, the total number of builds ranges from 118 to 1122, and the failed test cases range from 6 to 830.
Secondly, five datasets from the Software-artifact Infrastructure Repository (SIR), one of the oldest repositories, were selected based on the properties of the projects 18. SIR is a well-known repository that has been used extensively by researchers working on regression testing; it holds datasets of projects developed in the C programming language. In terms of test cases, the GitHub datasets have fewer test cases while the SIR datasets have many. The GitHub projects were tested more extensively than the SIR projects, so the GitHub datasets are enriched with extensive build data containing real faults, a feature of modern workflow systems, whereas the SIR datasets lack this feature because these modern tools were not available to software developers at the time of their inception.
The projects selected from the SIR repository are tcas, space, printtokens2, replace, and schedule2. Tcas has been used in regression testing studies 18-20, space has been used by researchers in software testing 20-23, replace was used in 20-23, printtokens2 has been used in 24-28, and schedule2 was utilized in 29-31. As this study focuses on history-based TCP, the similarity-based component was omitted and only the historical part of the combined approach was considered 16. Similarly, the dataset used does not include the execution time of each test case, so execution time was not considered while implementing ROCKET 14. Only the failure rate part of 15 was considered, as test case age is not explicitly provided in the datasets; it would be difficult to calculate without sufficient information at hand, and if all test cases were run it would not make a significant difference. The performance of TCP techniques is measured in terms of fault detection and is calculated with the help of APFD 32, which will be used to quantify the performance of the TCP techniques. Execution time 10 will be used to measure the overhead incurred by each technique. APFD has been used in studies 33-36 to measure the effectiveness of TCP techniques, and it can be calculated by the formula given in Eq. 1.

Results and Discussion
This section presents and discusses the performance and execution time of the history-based TCP techniques when they are applied to the different datasets. It was found that most history-based techniques suffer from the problem of equal priority, and that employing random sorting to solve this problem does not give optimal results. Fig. 2(a-i) shows box plots for the 12 selected projects and presents the results of random sorting, the history-based techniques with random sorting, and the history-based techniques without random sorting. APFD is plotted on the y-axis and the TCP techniques used in this study on the x-axis.
It can be observed that for all 12 projects, random sorting is the worst-performing technique, as it does not arrange test cases according to any heuristic but simply arranges them randomly. After random sorting, Terminator is the second worst-performing technique, a finding in line with 10. MRF and ED are among the best-performing techniques, as shown in Fig. 2(a-i). FR, R, CoF, and FH perform similarly; their performance lies in a mediocre range compared with the best- and worst-performing techniques. However, using random sorting with history-based techniques to solve equal priority does not yield better results; in fact, it slightly deteriorates the original performance of the history-based techniques. Moreover, including random sorting adds extra overhead in terms of execution time. It is better to merge two techniques only when their synergy offers better results than using the techniques separately.
The execution time of the history-based techniques when applied to the GitHub and SIR datasets is shown in Table 4 and Table 5. For the GitHub datasets, Terminator is the most expensive technique in terms of execution time; the co-failure-based approach is the second most expensive, as it assigns failure probabilities and rearranges test cases every time a test case fails; the flipping-history-based approach is the third most expensive; and ROCKET is the fourth. Random is the least expensive technique for most of the datasets, with an execution time of approximately zero, as can be viewed in Table 4. For the SIR datasets, Table 5 shows that random is the best-performing technique according to execution time and Terminator is the worst, which matches the GitHub results. For the space dataset, the execution times for CoFR, CoF, FHR, FH, and T are not available because the execution time grew into the exponential domain and execution was stopped; the reason is the large number of test cases compared with the other SIR datasets.

Conclusion
History-based TCP techniques encounter the problem of equal priority, as many other TCP techniques do, and using random ordering is not the best solution to this problem in regression testing. To get to the bottom of why history-based techniques encounter equal priority, the researchers examined the dataset closely and found that some test cases act alike, passing and failing simultaneously. Properties inherited by the datasets from the development processes employed also play a major role in how certain techniques react to these datasets, so individual techniques respond differently because of the features of the datasets. The experiments demonstrate that existing techniques are not capable enough to solve this problem. Code-inspection-based, coverage-based, and change-based approaches can be explored, separately and in combination, in the future to solve the problem of equal priority in history-based TCP techniques.

APFD = 1 - \frac{TF_1 + TF_2 + \cdots + TF_m}{n \times m} + \frac{1}{2n}    (Eq. 1)

where TF_1, TF_2, ..., TF_m represent the positions in the prioritized test suite of the first test case that detects each of the m faults; the smaller the position, the better, as it indicates that the fault was found earlier. The total number of test cases in the test suite is n, and the total number of faults in the system under test is m. Execution time is measured by counting the number of seconds elapsed during the execution of a given technique.
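To make Eq. 1 concrete, the sketch below computes APFD from a prioritized test order and a record of which test cases detect each fault. The data layout (a list of test names and, per fault, the set of detecting tests) is an assumption for illustration; faults detected by no test in the order are simply ignored here.

```python
def apfd(ordered_tests, fault_detectors):
    """ordered_tests: test case names in prioritized order.
    fault_detectors: for each fault, the set of test case names that detect it.
    Returns APFD as in Eq. 1."""
    n = len(ordered_tests)
    position = {name: i + 1 for i, name in enumerate(ordered_tests)}  # 1-based
    first_positions = []
    for detectors in fault_detectors:
        hits = [position[t] for t in detectors if t in position]
        if hits:
            first_positions.append(min(hits))    # TF_i: first test to catch fault i
    m = len(first_positions)
    if n == 0 or m == 0:
        return 0.0
    return 1 - sum(first_positions) / (n * m) + 1 / (2 * n)

# Example: 5 test cases, 2 faults; earlier detection yields a higher APFD.
faults = [{"t1"}, {"t4"}]
print(apfd(["t5", "t4", "t3", "t2", "t1"], faults))  # late detection  -> 0.4
print(apfd(["t1", "t4", "t3", "t2", "t5"], faults))  # early detection -> 0.8
```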

Figure 2(a-i). APFD of history-based approaches on GitHub datasets.
MRF = most recent failure, MRFR = most recent failure with random, FR = failure rate, FRR = failure rate with random, R = ROCKET, RR = ROCKET with random, ED = exponential decay, EDR = exponential decay with random, CoF = co-failure, CoFR = co-failure with random, FH = AFSAC flipping history, FHR = AFSAC flipping history with random, and T = terminator.
The variation in the performance of the techniques over the different datasets can be attributed to differences in software development methodology, which has evolved over time. However, the execution times send a clear message about the selection of a technique for a desired task: the decision of which technique works best for a dataset depends on both APFD and execution time. When both are considered, most recent failure and exponential decay are the most suitable techniques for the selected datasets. Combining random sorting with history-based TCP techniques also incurs a cost that may be insignificant for smaller datasets but becomes significant as the dataset grows. For the techniques that perform best in terms of execution time, adding random sorting increases execution time only slightly, but for the worst performers it worsens execution time further. It was found that most history-based TCP techniques face the problem of equal priority while prioritizing test cases, like other TCP techniques, and that when random sorting is used to solve this problem, favorable results are not achieved: the performance of history-based TCP approaches deteriorates in terms of APFD and more time is incurred. Besides equal priority, history-based TCP also faces the problems of unavailability of historical data, the small amount of available historical data, the proper formation of data, and imbalance in the historical data.

Table 2. Description of datasets selected from GitHub (columns: project name, total test cases, maximum number of failed test cases in one build, maximum number of times a single test case failed, total builds).
The dataset selected has properties that ensure suitable test cases are selected, such as the number of test cases, number of builds, maximum number of times a single test case fails, maximum number of test cases failed in one build, number of failed test cases, number of developers, and duration of the project. The properties of the selected projects from the GitHub repository are given in Table 2. Projects were selected that had enough failed test cases; the total number of test cases ranged from 54 to 188, and the projects have more than 5 developers.

Table 3. Description of datasets selected from SIR.
Table 3 contains the properties of the datasets collected from SIR: the number of test cases, number of versions, maximum number of times a single test case fails, maximum number of test cases failed in one version, and number of failed test cases.
It can be observed that there are outlier values in all 12 projects, which means the performance of the prioritization techniques can still be improved. Similarly, the SIR dataset results show that there is still room for improvement, although there are fewer outliers in the SIR results because the available data are limited. Upon careful inspection of the dataset, it was noticed that one reason behind the equal-priority problem can be test cases that produce similar results, that is, when one test case passes the other passes, and when one fails the other fails.
Fig. 3(a-e) shows box plots of the APFD of history-based approaches on the SIR datasets. For the tcas dataset, CoFR and CoF perform best, FHR