Guidelines on Replicating a Software Experiment

Software Experiment

To draw meaningful conclusions from a software experiment, we apply statistical analysis methods to the collected data in order to interpret the results. To get the most out of an experiment, it must be carefully planned and designed: which statistical analyses we can apply depends on the chosen design and on the measurement scales used. An experiment consists of a series of tests of the treatments, also known as trials.

Software Experimental Design

The design of an experiment describes how the tests are organized and run. We can define an experiment as a set of tests. The general design principles of any experiment are:

Randomization

All statistical methods used for analyzing the data require that the observations come from independent random variables. Randomization applies to the allocation of objects and subjects and to the order in which the tests are performed. It averages out the effect of factors that may otherwise be present, and it is also used to select subjects who are representative of the population of interest. E.g. the selection of the persons (subjects) is made representative of the designers in the company by randomly selecting among the available designers, and the treatment (object-oriented design or the standard company design principle) is assigned to each subject at random.
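
As an illustration, a minimal Python sketch of randomized assignment; the subject names and treatment labels are hypothetical, taken from the example above:

```python
import random

# Hypothetical pool of designers and the two treatments from the example.
subjects = [f"designer_{i}" for i in range(1, 9)]
treatments = ["object-oriented design", "standard company design"]

random.shuffle(subjects)  # randomize the order in which tests are performed
assignment = {s: random.choice(treatments) for s in subjects}
print(assignment)
```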

Blocking

Sometimes we have a factor that probably has an effect on the response, but we are not interested in that effect. Blocking is used only if the effect of the factor is known and controllable; it systematically eliminates the undesired effect from the comparison among the treatments. Within one block, the undesired effect is the same, so we can study the effect of the treatments on that block; the effects between the blocks are not studied. E.g. the persons (subjects) used for this experiment have different experience: some have used object-oriented design before and some have not. To minimize the effect of experience, the persons are grouped into two groups (blocks), one with experience of object-oriented design and one without.
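
Continuing the sketch above, blocking on experience might look like this (subject names and the experience flags are hypothetical):

```python
import random

# Hypothetical subjects tagged with the blocking factor (OO design experience).
subjects = {"alice": True, "bob": False, "carol": True,
            "dave": False, "erin": True, "frank": False}

blocks = {
    "experienced": [s for s, exp in subjects.items() if exp],
    "novice":      [s for s, exp in subjects.items() if not exp],
}

treatments = ["object-oriented design", "standard company design"]

# Treatments are assigned and compared within each block; the experience
# effect is constant inside a block, so it drops out of the comparison.
assignment = {}
for members in blocks.values():
    for subject in members:
        assignment[subject] = random.choice(treatments)
print(assignment)
```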

Balancing

If we assign the treatments so that each treatment has an equal number of subjects, we have a balanced design. Balancing is desirable because it both simplifies and strengthens the statistical analysis of the data. E.g. the experiment uses a balanced design, which means that there is the same number of persons in each group (block).
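
Extending the previous sketches, a balanced random assignment within one block could be done as follows (a minimal sketch, assuming each block's size is a multiple of the number of treatments; the names are hypothetical):

```python
import random

def balanced_assignment(members, treatments):
    """Randomly assign treatments so each treatment gets the same number
    of subjects. Assumes len(members) is a multiple of len(treatments)."""
    members = list(members)
    random.shuffle(members)
    per_treatment = len(members) // len(treatments)
    return {s: treatments[i // per_treatment] for i, s in enumerate(members)}

# Balanced assignment within one block of six subjects.
print(balanced_assignment(
    ["alice", "carol", "erin", "gina", "ivan", "kim"],
    ["object-oriented design", "standard company design"]))
```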

The experimental design should address the following aspects, even if the experiment is being replicated:

  • Objective of the Experiment

This defines the purpose of the experiment, e.g. whether imbalanced data in code affects the efficiency of a software change prediction model, or observing the effect of personality traits, as described by the five-factor model, on the performance of a software testing task.

  • Type of Experiment

An empirical study can be conducted using the following methods:
    • Controlled Experiments
    • Case Studies
    • Survey Research
    • Ethnographies
    • Article/archive analysis (mining)
    • Action Research

  • Subject Selection

    • Subject selection is important for the generalization of the results of the experiment
    • In order to generalize the results to the desired population, the selection must be representative of that population
    • The selection of subjects is also called a sample from a population
    • The sampling of the population can be either a probability or a non-probability sample (see the sketch below)
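
A probability sample gives every member of the population a known, non-zero chance of being selected, whereas a non-probability sample (e.g. a convenience sample) does not. A minimal Python sketch of the difference, with a hypothetical population:

```python
import random

population = [f"developer_{i}" for i in range(100)]   # hypothetical population

# Probability sample: every member has a known, non-zero chance of selection.
probability_sample = random.sample(population, k=10)

# Non-probability (convenience) sample: e.g. whoever is easiest to reach.
convenience_sample = population[:10]
```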

  • Hypothesis

    • Null Hypothesis

H0 states that there are no real underlying trends or patterns in the experiment setting; the only reasons for differences in our observations are coincidental. This is the hypothesis that the experimenter wants to reject with as high significance as possible.

    • Alternative Hypothesis

Ha (also written H1, etc.) is the hypothesis in favor of which the null hypothesis is rejected.

  • Instrumentation

These are chosen while planning the experiment and are developed before the execution of the specific experiment. Instrumentation consists of objects (e.g. specification or code documents), guidelines (e.g. process descriptions and checklists) to guide the participants in the experiment, and measurement instruments. The goal of the instrumentation is to provide the means for performing the experiment and for monitoring it, without affecting the control of the experiment:
    • The results of the experiment shall be the same regardless of how the experiment is instrumented
    • If the instrumentation affects the outcome of the experiment, the results are invalid.

  • Variables

    • Dependent Variables:

The effect of the treatments is measured in the dependent variable(s). Often there is only one dependent variable, and it should therefore be derived directly from the hypothesis. The variable is often not directly measurable, so we have to measure it via an indirect measure instead; this indirect measure must be carefully validated, because it affects the result of the experiment. The hypothesis can be refined once we have chosen the dependent variable.

    • Independent Variables:

Variables that we can control and change in the experiment. These variables should have some effect on the dependent variable and must be controllable. The choice of independent variables also includes choosing the measurement scales, the ranges for the variables and the specific levels at which tests will be made.

  • Statistical Test Selected

This requires us to define the statistical tests used to analyze the data collected during the experiment. For more details see https://advancesereadings.blogspot.com/2018/10/statistical-tests-in-empirical-software.html
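
For instance, a Wilcoxon signed-rank test on paired measurements takes only a few lines; a minimal sketch using Python's SciPy, where the G-mean scores are made-up values for illustration:

```python
from scipy.stats import wilcoxon

# Hypothetical G-mean scores of the same models with and without sampling.
with_sampling    = [0.71, 0.68, 0.75, 0.80, 0.66, 0.73, 0.78, 0.70]
without_sampling = [0.62, 0.65, 0.70, 0.74, 0.60, 0.69, 0.71, 0.64]

# Paired, non-parametric test of the null hypothesis that the median
# difference between the pairs is zero.
statistic, p_value = wilcoxon(with_sampling, without_sampling)
print(f"W = {statistic}, p = {p_value:.4f}")

# Reject H0 at the 0.05 significance level if p < 0.05.
```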

  • Data Collection Mechanism

This requires us to define the mechanism used to collect data, e.g. via interviews, questionnaires, analysis of integration activities, observations or tools.

  • Performance Measure Selected

The performance measures evaluated are based on the objective of the experiment, e.g. the usefulness and completeness of documentation; the input/output speed, workload and time consumption of a particular algorithm or process; or the APFD (Average Percentage of Fault-Detection) metric.
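
As a concrete example, APFD has the closed form APFD = 1 − (TF1 + … + TFm)/(n × m) + 1/(2n), where TFi is the position of the first test case that reveals fault i, n is the number of test cases and m the number of faults. A minimal Python sketch, with hypothetical positions:

```python
def apfd(first_failing_positions, num_tests):
    """Average Percentage of Fault-Detection for a test-case ordering.

    first_failing_positions: 1-based position, in the prioritized suite,
    of the first test case that reveals each fault.
    """
    m = len(first_failing_positions)   # number of faults
    n = num_tests                      # number of test cases
    return 1 - sum(first_failing_positions) / (n * m) + 1 / (2 * n)

# Hypothetical ordering: 10 tests, 4 faults first detected at positions 1, 2, 4, 7.
print(apfd([1, 2, 4, 7], 10))          # ≈ 0.7
```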

  • Threats to Validity

    • Conclusion validity: 

This validity is concerned with the relationship between the treatment and the outcome. We want to make sure that there is a statistical relationship, i.e. with a given significance.

    • Internal validity: 

If a relationship is observed between the treatment and the outcome, we must make sure that it is a causal relationship and not the result of a factor that we cannot control or have not measured. In other words, the treatment causes the outcome (the effect).

    • Construct validity:

This validity is concerned with the relation between theory and observation. If the relationship between cause and effect is causal, we must ensure two things: (1) that the treatment reflects the construct of the cause well, and (2) that the outcome reflects the construct of the effect well.

    • External validity:

The external validity is concerned with generalization: if there is a causal relationship between the construct of the cause and the effect, can the result of the study be generalized outside the scope of our study?

Some studies do not explicitly state the hypotheses being tested; instead they describe only the goals and, in some cases, the research questions to be answered with the help of the results obtained from the study. Most studies focus on empirically evaluating certain goals and hypotheses using case studies. The most frequently reported external threat is the extent to which the results obtained in the study can be generalized. Some of the most commonly used statistical tests are the Wilcoxon test and descriptive statistics such as medians or frequencies. Many recent empirical studies focus on change assessment in software from varying perspectives, e.g. the empirical evaluation of a software change prediction model.

Sample Software Experimental Design

Below is an example experimental design for replicating a published software experiment (the same sample can also be used when conducting a software experiment for the first time):
Focus on: Imbalanced data in code affecting the efficiency of a software change prediction model
Type of Experiment: Controlled experiment
Subject Selection:
  • Application: open-source data sets from object-oriented code written in C++ and an application written in Java, with previous versions of the application
  • Participants: two groups consisting of equal numbers of Master's students skilled in both C++ and Java
Hypothesis:
  • Null hypothesis: "Change prediction models developed using various machine learning techniques in WEKA do not show significant differences when various sampling methods are used for handling an imbalanced data set, as compared to the use of no sampling method, when evaluated using the G-mean measure."
  • Alternative hypothesis: "Change prediction models developed using various machine learning techniques in WEKA show significant differences when various sampling methods are used for handling an imbalanced data set, as compared to the use of no sampling method, when evaluated using the G-mean measure."
Instrumentation: Open-source code in Java and open-source object-oriented code written in C++ with older versions, the Defect Collection and Reporting System (DCRS) tool, and the WEKA tool
Independent Variables: Object-oriented metrics
Dependent Variable: Change proneness observed through code analysis across versions of the same software
Statistical Tests Selected: Friedman statistical test, Wilcoxon post-hoc test
Data Collection Mechanism: The Defect Collection and Reporting System (DCRS) tool, collecting change-logs between two consecutive versions of the software data set
Performance Measure Selected: Confusion matrix (and the G-mean derived from it)
Threats to Validity:
  • Internal validity: lack of evaluation of the causal effect
  • Construct validity: accurate measurement of the variables
  • Conclusion validity: sources leading to misjudgment of the true relationship between the variables
  • External validity: generalizability of the results
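
The hypotheses above are evaluated with the G-mean, which is derived from the confusion matrix as the geometric mean of sensitivity and specificity. A minimal Python sketch, with hypothetical confusion-matrix counts:

```python
import math

def g_mean(tp, fn, fp, tn):
    """Geometric mean of sensitivity and specificity from a confusion matrix."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return math.sqrt(sensitivity * specificity)

# Hypothetical confusion matrix for a change prediction model.
print(g_mean(tp=30, fn=10, fp=20, tn=140))   # ≈ 0.81
```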

Replication of Software Experiment

Replicating software experiments has gained popularity in recent years. A replication tests an existing set of hypotheses using the same experimental design, in order to test their validity. Software experiments have also been replicated to evaluate and compare an existing set of hypotheses with newly developed hypotheses, or to compare the hypotheses using a modified experimental design.

Guidelines to Replicating a Software Experiment

The following guidelines are drawn from recently published journal articles on software experiments:
  • For experimental replications to have scientific value, they must be published in the peer-reviewed literature
  • Replications reported must be consistent
  • The replication study must report the same type of information at the same level of detail
  • The replication study must include:
    • information about the original study's research questions, participants, design, artifacts, context variables and a summary of its results
    • information about the replication itself, explaining the motivation behind it, the level of interaction with the original experimenters, the changes made to the original experiment and other important replication-specific details
    • a comparison of the replication results with those reported in the original study
    • conclusions across the studies, providing readers with insights drawn from the series of studies that may not be obvious from any single study
  • For better use and generation of knowledge from a set of replicated studies, the replications must follow a reporting standard
  • Explicit information should be provided regarding the variations between the replication and the original study
  • The type of replication should be made clear, i.e. whether the replication is internal or external, as the factors used to analyze these replications differ
  • The time elapsed between the publication of the original study and the replication should be noted
