Software Experiment
To draw meaningful conclusions from a software experiment, we apply statistical analysis methods to the collected data in order to interpret the results. To get the most out of an experiment, it must be carefully planned and designed. Which statistical analyses we can apply depends on the chosen design and on the measurement scales used. An experiment consists of a series of tests of the treatments, also known as "trials".

Software Experimental Design
The design of an experiment describes how the tests are organized and run; we can define an experiment as a set of tests. The general design principles of any experiment are:

Randomization
All statistical methods used for analyzing the data require that the observations come from independent random variables. Randomization applies to the allocation of objects and subjects and to the order in which the tests are performed. It averages out the effect of factors that may otherwise be present, and it is also used to select subjects that are representative of the population of interest. E.g. the selection of the persons (subjects) is made representative of the designers in the company by randomly selecting from the available designers, and the treatment for each subject (object-oriented design or the standard company design principle) is assigned randomly.
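As a minimal sketch (the subject pool and treatment names are hypothetical), random assignment of subjects to treatments can be done as follows:

```python
import random

# Hypothetical pool of available designers (subjects)
subjects = [f"designer_{i}" for i in range(1, 9)]
treatments = ["object-oriented design", "standard company design"]

random.shuffle(subjects)  # random order averages out hidden factors

# Assign subjects to treatments in the shuffled order
assignment = {s: treatments[i % len(treatments)] for i, s in enumerate(subjects)}
for subject, treatment in assignment.items():
    print(subject, "->", treatment)
```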
Blocking

Sometimes we have a factor that probably has an effect on the response, but we are not interested in that effect. Blocking is used only if the effect of the factor is known and controllable: we systematically eliminate the undesired effect from the comparison among the treatments. Within one block, the undesired effect is the same, so we can study the effect of the treatments within that block; the effects between blocks are not studied. E.g. the persons (subjects) used for this experiment have different experience: some have used object-oriented design before and some have not. To minimize the effect of experience, the persons are grouped into two groups (blocks), one with experience of object-oriented design and one without.
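A minimal sketch of blocked assignment, assuming a hypothetical experience flag for each subject: subjects are shuffled within each block and split evenly between the treatments (which also makes the design balanced, as described next):

```python
import random

# Hypothetical subjects with a known, controllable factor: OO experience
subjects = {
    "p1": True, "p2": True, "p3": True, "p4": True,
    "p5": False, "p6": False, "p7": False, "p8": False,
}
treatments = ["object-oriented design", "standard company design"]

for experienced in (True, False):
    block = [s for s, exp in subjects.items() if exp == experienced]
    random.shuffle(block)  # randomize within the block
    # Alternating over the shuffled block gives each treatment
    # the same number of subjects per block (a balanced design)
    for i, subject in enumerate(block):
        print(f"block(experienced={experienced}): {subject} -> {treatments[i % 2]}")
```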
Balancing

If we assign the treatments so that each treatment has an equal number of subjects, we have a balanced design. Balancing is desirable because it both simplifies and strengthens the statistical analysis of the data. E.g. the experiment uses a balanced design, which means that there is the same number of persons in each group (block).

The experimental design should address the following aspects, even if the experiment is being replicated:
- Objective of the Experiment
This defines the purpose of the experiment, e.g. imbalanced data in code affecting the efficiency of a software change prediction model, or observing the effect of personality traits, as described by the five-factor model, on the performance of a software testing task.
- Type of Experiment
An empirical study can be conducted using the following methods:
- Controlled Experiments
- Case Studies
- Survey Research
- Ethnographies
- Article/archive analysis (mining)
- Action Research
- Subject Selection
- Subject selection is important for generalizing the results of the experiment
- In order to generalize the results to the desired population, the selection must be representative of that population
- The selection of subjects is also called a sample from a population
- The sampling of the population can be either a probability or a non-probability sample (both are sketched below)
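As a small illustration (the population list is hypothetical), a probability sample gives every member a known chance of selection, unlike a convenience sample:

```python
import random

# Hypothetical population of designers in the company
population = [f"designer_{i}" for i in range(1, 101)]

# Probability sample: every designer has the same chance of selection
probability_sample = random.sample(population, k=10)

# Non-probability (convenience) sample: e.g. whoever happens to be listed first
convenience_sample = population[:10]

print(probability_sample)
print(convenience_sample)
```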
- Hypothesis
- Null Hypothesis
- Alternative Hypothesis
- Instrumentation
These are chosen while planning an experiment and are developed before the execution of the specific experiment. Instrumentation consists of objects (e.g. specification or code documents), guidelines (e.g. process descriptions and checklists) to guide the participants in the experiment, and measurement instruments. The goal of the instrumentation is to provide the means for performing and monitoring the experiment without affecting its control.
- The results of the experiment shall be the same independently of how the experiment is instrumented
- If the instrumentation affects the outcome of the experiment, the results are invalid.
- Variables
- Dependent Variables:
The effect of the treatments is measured in the dependent variable(s). Often there is only one dependent variable, and it should therefore be derived directly from the hypothesis. The variable is often not directly measurable, so we have to measure it via an indirect measure instead, which must be carefully validated because it affects the result of the experiment. The hypothesis can be refined once we have chosen the dependent variable.
- Independent Variables:
These are the variables that we can control and change in the experiment, e.g. the choice of design method; the treatments are particular values of the independent variables (factors) that the experiment studies.
- Statistical Test Selected
This requires us to define the statistical tests used to analyze the data collected during the experiment. For more details see https://advancesereadings.blogspot.com/2018/10/statistical-tests-in-empirical-software.html
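For instance, a minimal sketch using Python's scipy, with made-up measurements: a Friedman test checks whether three treatments differ overall, and a Wilcoxon signed-rank test serves as a pairwise post-hoc comparison:

```python
from scipy import stats

# Made-up performance scores of the same subjects under three treatments
t1 = [0.71, 0.64, 0.80, 0.75, 0.69, 0.77, 0.73, 0.66]
t2 = [0.62, 0.60, 0.71, 0.68, 0.61, 0.70, 0.65, 0.59]
t3 = [0.70, 0.66, 0.78, 0.74, 0.67, 0.75, 0.72, 0.64]

# Friedman test: do the treatments differ overall?
stat, p = stats.friedmanchisquare(t1, t2, t3)
print(f"Friedman: statistic={stat:.3f}, p={p:.4f}")

# Wilcoxon signed-rank test as a post-hoc pairwise comparison
stat, p = stats.wilcoxon(t1, t2)
alpha = 0.05  # chosen significance level
print(f"Wilcoxon t1 vs t2: p={p:.4f}",
      "-> reject H0" if p < alpha else "-> fail to reject H0")
```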
- Data Collection Mechanism
This requires us to define the mechanism used to collect data, e.g. via interviews, questionnaires, analysis of integration activities, observations or tools.
- Performance Measure Selected
The performance measures evaluated are based on the objective of the experiment, e.g. usefulness and completeness of documentation; input/output speed, workload and time consumption of a particular algorithm or process; or the APFD (Average Percentage of Fault-Detection) metric.
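As an illustration, APFD can be computed from the 1-based position at which each fault is first detected in a test-case ordering, using the standard formula APFD = 1 − (TF₁ + … + TFₘ)/(n·m) + 1/(2n); the tiny data set below is made up:

```python
def apfd(first_detection_positions, num_tests):
    """Average Percentage of Fault-Detection for one test-case ordering.

    first_detection_positions: for each fault, the 1-based position of the
    first test case that detects it.
    """
    m = len(first_detection_positions)  # number of faults
    n = num_tests                       # number of test cases
    return 1 - sum(first_detection_positions) / (n * m) + 1 / (2 * n)

# Made-up example: 10 test cases, 4 faults first detected at these positions
print(apfd([1, 3, 3, 7], num_tests=10))  # -> 0.7
```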
- Threats to Validity
- Conclusion validity:
This validity is concerned with the relationship between the treatment and the outcome. We want to make sure that there is a statistical relationship, i.e. one with a given significance.
- Internal validity:
If a relationship is observed between the treatment and the outcome, we must make sure that it is a causal relationship and not the result of a factor over which we have no control or which we have not measured; in other words, that the treatment causes the outcome (the effect).
- Construct validity:
This validity is concerned with the relation between theory and observation. If the relationship between cause and effect is causal, we must ensure two things: (1) that the treatment reflects the construct of the cause well, and (2) that the outcome reflects the construct of the effect well.
- External validity:
External validity is concerned with generalization: if there is a causal relationship between the construct of the cause and the effect, can the result of the study be generalized outside the scope of our study?
Some studies do not explicitly state the hypotheses being tested; instead they only describe the goals and, in some cases, the research questions to be answered with the help of the results obtained from the study. Most studies focus on empirically evaluating certain goals and hypotheses using case studies. The most frequently reported external threat is the extent to which the results obtained in the study can be generalized. Some of the most commonly used statistical analyses are the Wilcoxon test and descriptive statistics such as medians or frequencies. Most recent empirical studies focus on change assessment in software from varying perspectives, e.g. the empirical evaluation of a software change prediction model.
Sample Software Experimental Design
Below is an example experimental design for replicating a published software experiment (the same sample can also be used when conducting a software experiment for the first time):
| Aspect | Description |
| --- | --- |
| Focus on | Imbalanced data in code affecting the efficiency of a software change prediction model |
| Type of Experiment | Controlled experiment |
| Subject Selection | Application: open source data sets from object-oriented code written in C++ and an application written in JAVA, with previous versions of the application. Participants: 2 groups consisting of an equal number of Master's students skilled in both C++ and JAVA |
| Hypothesis | Null hypothesis: "Change prediction models developed using various machine learning techniques in WEKA do not show significant differences when various sampling methods are used for handling an imbalanced data set, as compared to the use of no sampling method, when evaluated using the G-mean measure." Alternative hypothesis: "Change prediction models developed using machine learning techniques in WEKA show significant differences when various sampling methods are used for handling an imbalanced data set, as compared to the use of no sampling method, when evaluated using the G-mean measure." |
| Instrumentation | Open source code in JAVA and an open source object-oriented code base written in C++ with older versions, Defect Collection and Reporting System (DCRS) tool, WEKA tool |
| Independent Variable | Object-oriented metrics |
| Dependent Variable | Change proneness observed through code analysis across versions of the same software |
| Statistical Test Selected | Friedman statistical test, Wilcoxon post-hoc test |
| Data Collection Mechanism | Defect Collection and Reporting System (DCRS) tool collecting change-logs between two consecutive versions of the software data set |
| Performance Measures Selected | Confusion matrix |
| Threats to Validity | Internal threat: lack of evaluation of the causal effect. Construct validity: accurate measurement of variables. Conclusion validity: sources leading to misjudgment of the true relationship between the variables. External validity: generalizability of results |
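As a usage note, the hypotheses above are evaluated with the G-mean measure, which can be derived from the selected confusion matrix as the geometric mean of sensitivity and specificity. A minimal sketch, with made-up counts:

```python
import math

# Hypothetical confusion matrix for a change prediction model
tp, fn = 30, 10   # change-prone classes: correctly / wrongly predicted
tn, fp = 80, 20   # not change-prone classes: correctly / wrongly predicted

sensitivity = tp / (tp + fn)     # true positive rate
specificity = tn / (tn + fp)     # true negative rate
g_mean = math.sqrt(sensitivity * specificity)
print(f"G-mean = {g_mean:.3f}")  # -> 0.775 for these counts
```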
Replication of Software Experiment
Replicating a software experiment has gained popularity over recent years. Replication is done to test an existing set of hypotheses, using the same experimental design, in order to test their validity. Software experiments have also been replicated to evaluate and compare an existing set of hypotheses with newly developed hypotheses, or to compare a set of hypotheses using a modified experimental design.

Guidelines for Replicating a Software Experiment
The following are some guidelines provided by recently published journal articles on software experiments:
- For experimental replications to have scientific value, they must be published in peer-reviewed literature
- Replications reported must be consistent
- The replication study must report the same type of information at the same detail level
- The replication study must include:
  - information about the original study's research questions, participants, design, artifacts, context variables and a summary of its results
  - information about the replication itself, to develop an understanding of the motivation behind the replication study, the level of interaction with the original experimenters, the changes made to the original experiment and other important details specific to the replication
  - a comparison of the replication results with those reported in the original study
  - conclusions across the studies, providing readers with insights drawn from the series of studies that may not be obvious from a single study
- For better use and generation of knowledge from the studied set of replicated studies, the replication studies must follow a reporting standard.
- Explicit information should be provided regarding the variations between the replication and the original study
- The type of replication should be clear, i.e. whether the replication study is internal or external, as the factors used to analyze these replications differ
- The time elapsed between the publication of the original study and the replication should be noted