Guidelines on Replicating a Software Experiment

Software Experiment

To draw meaningful conclusions from a software experiment, we apply statistical analysis methods to the collected data in order to interpret the results. To get the most out of an experiment, it must be carefully planned and designed: which statistical analyses we can apply depends on the chosen design and on the measurement scales used. An experiment consists of a series of tests of the treatments, also known as trials.

Software Experimental Design

The design of an experiment describes how the tests are organized and run. We can define an experiment as a set of tests. The general design principles of any experiment are:

Randomization

All statistical methods used for analyzing the data require that the observations come from independent random variables. Randomization applies to the allocation of objects and subjects and to the order in which the tests are performed. It averages out the effect of factors that may otherwise be present, and it is also used to select subjects who are representative of the population of interest. E.g. the selection of the persons (subjects) is made representative of the designers in the company by randomly selecting among the available designers, and the treatment (object-oriented design or the standard company design principle) is assigned to each subject at random.
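
As an illustration, a minimal Python sketch of randomized assignment; the subject names and treatment labels are hypothetical, taken from the example above:

```python
import random

# Hypothetical pool of designers and the two treatments from the example.
subjects = [f"designer_{i}" for i in range(1, 9)]
treatments = ["object-oriented design", "standard company design"]

random.shuffle(subjects)  # randomize the order in which tests are performed
assignment = {s: random.choice(treatments) for s in subjects}
print(assignment)
```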

Blocking

Sometimes we have a factor that probably has an effect on the response, but we are not interested in that effect. Blocking is used only if the effect of the factor is known and controllable; it systematically eliminates the undesired effect from the comparison among the treatments. Within one block, the undesired effect is the same, so we can study the effect of the treatments on that block; the effects between the blocks are not studied. E.g. the persons (subjects) used for this experiment have different experience: some have used object-oriented design before and some have not. To minimize the effect of experience, the persons are grouped into two groups (blocks), one with experience of object-oriented design and one without.
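
Continuing the sketch above, blocking on experience might look like this (subject names and the experience flags are hypothetical):

```python
import random

# Hypothetical subjects tagged with the blocking factor (OO design experience).
subjects = {"alice": True, "bob": False, "carol": True,
            "dave": False, "erin": True, "frank": False}

blocks = {
    "experienced": [s for s, exp in subjects.items() if exp],
    "novice":      [s for s, exp in subjects.items() if not exp],
}

treatments = ["object-oriented design", "standard company design"]

# Treatments are assigned and compared within each block; the experience
# effect is constant inside a block, so it drops out of the comparison.
assignment = {}
for members in blocks.values():
    for subject in members:
        assignment[subject] = random.choice(treatments)
print(assignment)
```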

Balancing

If we assign the treatments so that each treatment has an equal number of subjects, we have a balanced design. Balancing is desirable because it both simplifies and strengthens the statistical analysis of the data. E.g. the experiment uses a balanced design, which means that there is the same number of persons in each group (block).
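
Extending the previous sketches, a balanced random assignment within one block could be done as follows (a minimal sketch, assuming each block's size is a multiple of the number of treatments; the names are hypothetical):

```python
import random

def balanced_assignment(members, treatments):
    """Randomly assign treatments so each treatment gets the same number
    of subjects. Assumes len(members) is a multiple of len(treatments)."""
    members = list(members)
    random.shuffle(members)
    per_treatment = len(members) // len(treatments)
    return {s: treatments[i // per_treatment] for i, s in enumerate(members)}

# Balanced assignment within one block of six subjects.
print(balanced_assignment(
    ["alice", "carol", "erin", "gina", "ivan", "kim"],
    ["object-oriented design", "standard company design"]))
```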

The experimental design should address the following aspects, even if the experiment is being replicated:

  • Objective of the Experiment

This defines the purpose of the experiment, e.g. whether imbalanced data in code affects the efficiency of a software change prediction model, or observing the effect of personality traits, as described by the five-factor model, on the performance of a software testing task.

  • Type of Experiment

An empirical study can be conducted using the following methods:
    • Controlled Experiments
    • Case Studies
    • Survey Research
    • Ethnographies
    • Article/archive analysis (mining)
    • Action Research

  • Subject Selection

    • Subject selection is important for the generalization of the results of the experiment
    • In order to generalize the results to the desired population, the selection must be representative of that population
    • The selection of subjects is also called a sample from a population
    • The sampling of the population can be either a probability or a non-probability sample (see the sketch below)
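
A probability sample gives every member of the population a known, non-zero chance of being selected, whereas a non-probability sample (e.g. a convenience sample) does not. A minimal Python sketch of the difference, with a hypothetical population:

```python
import random

population = [f"developer_{i}" for i in range(100)]   # hypothetical population

# Probability sample: every member has a known, non-zero chance of selection.
probability_sample = random.sample(population, k=10)

# Non-probability (convenience) sample: e.g. whoever is easiest to reach.
convenience_sample = population[:10]
```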

  • Hypothesis

    • Null Hypothesis

H0 states that there are no real underlying trends or patterns in the experiment setting; the only reasons for differences in our observations are coincidental. This is the hypothesis that the experimenter wants to reject with as high significance as possible.

    • Alternative Hypothesis

Ha (also written H1, etc.) is the hypothesis in favor of which the null hypothesis is rejected.

  • Instrumentation

These are chosen while planning the experiment and are developed before the execution of the specific experiment. Instrumentation consists of objects (e.g. specification or code documents), guidelines (e.g. process descriptions and checklists) to guide the participants in the experiment, and measurement instruments. The goal of the instrumentation is to provide the means for performing the experiment and for monitoring it, without affecting the control of the experiment:
    • The results of the experiment shall be the same regardless of how the experiment is instrumented
    • If the instrumentation affects the outcome of the experiment, the results are invalid.

  • Variables

    • Dependent Variables:

The effect of the treatments is measured in the dependent variable(s). Often there is only one dependent variable, and it should therefore be derived directly from the hypothesis. The variable is often not directly measurable, so we have to measure it via an indirect measure instead; this indirect measure must be carefully validated, because it affects the result of the experiment. The hypothesis can be refined once we have chosen the dependent variable.

    • Independent Variables:

Variables that we can control and change in the experiment. These variables should have some effect on the dependent variable and must be controllable. The choice of independent variables also includes choosing the measurement scales, the ranges for the variables and the specific levels at which tests will be made.

  • Statistical Test Selected

This requires us to define the statistical tests used to analyze the data collected during the experiment. For more details see https://advancesereadings.blogspot.com/2018/10/statistical-tests-in-empirical-software.html
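
For instance, a Wilcoxon signed-rank test on paired measurements takes only a few lines; a minimal sketch using Python's SciPy, where the G-mean scores are made-up values for illustration:

```python
from scipy.stats import wilcoxon

# Hypothetical G-mean scores of the same models with and without sampling.
with_sampling    = [0.71, 0.68, 0.75, 0.80, 0.66, 0.73, 0.78, 0.70]
without_sampling = [0.62, 0.65, 0.70, 0.74, 0.60, 0.69, 0.71, 0.64]

# Paired, non-parametric test of the null hypothesis that the median
# difference between the pairs is zero.
statistic, p_value = wilcoxon(with_sampling, without_sampling)
print(f"W = {statistic}, p = {p_value:.4f}")

# Reject H0 at the 0.05 significance level if p < 0.05.
```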

  • Data Collection Mechanism

This requires us to define the mechanism used to collect data, e.g. via interviews, questionnaires, analysis of integration activities, observations or tools.

  • Performance Measure Selected

The performance measures evaluated are based on the objective of the experiment, e.g. the usefulness and completeness of documentation; the input/output speed, workload and time consumption of a particular algorithm or process; or the APFD (Average Percentage of Fault-Detection) metric.
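
As a concrete example, APFD has the closed form APFD = 1 − (TF1 + … + TFm)/(n × m) + 1/(2n), where TFi is the position of the first test case that reveals fault i, n is the number of test cases and m the number of faults. A minimal Python sketch, with hypothetical positions:

```python
def apfd(first_failing_positions, num_tests):
    """Average Percentage of Fault-Detection for a test-case ordering.

    first_failing_positions: 1-based position, in the prioritized suite,
    of the first test case that reveals each fault.
    """
    m = len(first_failing_positions)   # number of faults
    n = num_tests                      # number of test cases
    return 1 - sum(first_failing_positions) / (n * m) + 1 / (2 * n)

# Hypothetical ordering: 10 tests, 4 faults first detected at positions 1, 2, 4, 7.
print(apfd([1, 2, 4, 7], 10))          # ≈ 0.7
```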

  • Threats to Validity

    • Conclusion validity: 

This validity is concerned with the relationship between the treatment and the outcome. We want to make sure that there is a statistical relationship, i.e. with a given significance.

    • Internal validity: 

If a relationship is observed between the treatment and the outcome, we must make sure that it is a causal relationship and not the result of a factor that we cannot control or have not measured. In other words, the treatment causes the outcome (the effect).

    • Construct validity:

This validity is concerned with the relation between theory and observation. If the relationship between cause and effect is causal, we must ensure two things: (1) that the treatment reflects the construct of the cause well, and (2) that the outcome reflects the construct of the effect well.

    • External validity:

The external validity is concerned with generalization: if there is a causal relationship between the construct of the cause and the effect, can the result of the study be generalized outside the scope of our study?

Some studies do not explicitly state the hypotheses being tested; instead they describe only the goals and, in some cases, the research questions to be answered with the help of the results obtained from the study. Most studies focus on empirically evaluating certain goals and hypotheses using case studies. The most frequently reported external threat is the extent to which the results obtained in the study can be generalized. Some of the most commonly used statistical tests are the Wilcoxon test and descriptive statistics such as medians or frequencies. Many recent empirical studies focus on change assessment in software from varying perspectives, e.g. the empirical evaluation of a software change prediction model.

Sample Software Experimental Design

Below is an example experimental design for replicating a published software experiment (the same sample can also be used when conducting a software experiment for the first time):
Focus on: Imbalanced data in code affecting the efficiency of a software change prediction model
Type of Experiment: Controlled experiment
Subject Selection:
  • Application: open-source data sets from object-oriented code written in C++ and an application written in Java, with previous versions of the application
  • Participants: two groups consisting of equal numbers of Master's students skilled in both C++ and Java
Hypothesis:
  • Null hypothesis: "Change prediction models developed using various machine learning techniques in WEKA do not show significant differences when various sampling methods are used for handling an imbalanced data set, as compared to the use of no sampling method, when evaluated using the G-mean measure."
  • Alternative hypothesis: "Change prediction models developed using various machine learning techniques in WEKA show significant differences when various sampling methods are used for handling an imbalanced data set, as compared to the use of no sampling method, when evaluated using the G-mean measure."
Instrumentation: Open-source code in Java and open-source object-oriented code written in C++ with older versions, the Defect Collection and Reporting System (DCRS) tool, and the WEKA tool
Independent Variables: Object-oriented metrics
Dependent Variable: Change proneness observed through code analysis across versions of the same software
Statistical Tests Selected: Friedman statistical test, Wilcoxon post-hoc test
Data Collection Mechanism: The Defect Collection and Reporting System (DCRS) tool, collecting change-logs between two consecutive versions of the software data set
Performance Measure Selected: Confusion matrix (and the G-mean derived from it)
Threats to Validity:
  • Internal validity: lack of evaluation of the causal effect
  • Construct validity: accurate measurement of the variables
  • Conclusion validity: sources leading to misjudgment of the true relationship between the variables
  • External validity: generalizability of the results
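
The hypotheses above are evaluated with the G-mean, which is derived from the confusion matrix as the geometric mean of sensitivity and specificity. A minimal Python sketch, with hypothetical confusion-matrix counts:

```python
import math

def g_mean(tp, fn, fp, tn):
    """Geometric mean of sensitivity and specificity from a confusion matrix."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return math.sqrt(sensitivity * specificity)

# Hypothetical confusion matrix for a change prediction model.
print(g_mean(tp=30, fn=10, fp=20, tn=140))   # ≈ 0.81
```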

Replication of Software Experiment

Replicating software experiments has gained popularity in recent years. A replication tests an existing set of hypotheses using the same experimental design, in order to test their validity. Software experiments have also been replicated to evaluate and compare an existing set of hypotheses with newly developed hypotheses, or to compare the hypotheses using a modified experimental design.

Guidelines to Replicating a Software Experiment

The following guidelines are drawn from recently published journal articles on software experiments:
  • For experimental replications to have scientific value, they must be published in the peer-reviewed literature
  • Replications reported must be consistent
  • The replication study must report the same type of information at the same level of detail
  • The replication study must include:
    • information about the original study's research questions, participants, design, artifacts, context variables and a summary of its results
    • information about the replication itself, explaining the motivation behind it, the level of interaction with the original experimenters, the changes made to the original experiment and other important replication-specific details
    • a comparison of the replication results with those reported in the original study
    • conclusions across the studies, providing readers with insights drawn from the series of studies that may not be obvious from any single study
  • For better use and generation of knowledge from a set of replicated studies, the replications must follow a reporting standard
  • Explicit information should be provided regarding the variations between the replication and the original study
  • The type of replication should be made clear, i.e. whether the replication is internal or external, as the factors used to analyze these replications differ
  • The time elapsed between the publication of the original study and the replication should be noted
