Statistical Tests in Empirical Software Engineering

Empirical Software Engineering

Software engineering requires a cycle of model building, experimentation, learning, and re-modeling. The researcher's role is to understand the nature of the processes and products, and the relationships between them, in context:
  • They (often) use laboratory settings to observe and manipulate the variables
  • What is the effect? Why is this so? ...
The practitioner's role is to build "improved" systems using the available knowledge:
  • They need to better understand how to build better systems
  • What is the problem? What are the potential solutions? What is the cost? To what extent do they solve the problem? ...
Empirical software engineering provides methods, techniques, and tools to obtain relevant information systematically. It applies scientific (research) methods to understand, evaluate, and model software engineering phenomena, e.g.:
  • Something is wrong with software development: what, why, and how?
  • There must be some room for improvement: what and where?
  • A specific decision was taken: why and how?

Measurement

A measure is a mapping from an attribute of an entity to a measurement value, usually a number, so that the attribute can be characterized and manipulated in a formal way. A basic requirement is therefore that a measure preserves the empirical observations of the attribute, i.e. if object A is longer than object B, the measure of A must be greater than the measure of B. We must be certain that the measure is valid:
  • The measure must not violate any necessary properties of the attribute it measures
  • It must be a proper mathematical characterization of the attribute
Types of measurements can be classified as:
  • Objective vs. subjective
  • Direct vs. indirect

The objects that are of interest in software engineering can be divided into three different classes:

Scales

The mapping from an attribute to a measurement value can be made in many different ways. Each mapping of an attribute is a scale, e.g. if the attribute is the length of an object, we can measure it in meters, centimeters, or inches, each of which is a different scale for the measure of length. In some cases a transformation is required to convert the measure from one scale to another. An admissible transformation, also known as rescaling, is one that preserves the relationships among objects. With the measures of an attribute, we make statements about an object or about the relation between different objects. If the statements remain true when the measures are rescaled, they are called meaningful; otherwise they are meaningless. Four types of scales are used in measurement: nominal, ordinal, interval, and ratio.
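As a minimal sketch of what makes a statement meaningful, the R snippet below rescales two lengths from meters to centimeters and checks which statements survive the rescaling. The object names and lengths are made-up values for illustration, not data from this post.

```r
# Hypothetical lengths of two objects, measured in meters
length_m <- c(A = 2.0, B = 0.5)

# Admissible transformation for a ratio scale: multiply by a positive constant
length_cm <- length_m * 100

# "A is longer than B" holds on both scales, so it is meaningful
longer_m  <- length_m[["A"]] > length_m[["B"]]
longer_cm <- length_cm[["A"]] > length_cm[["B"]]

# "A is four times as long as B" also survives the rescaling
ratio_m  <- length_m[["A"]] / length_m[["B"]]
ratio_cm <- length_cm[["A"]] / length_cm[["B"]]

# "A exceeds B by 1.5" changes with the unit, so without a stated unit
# the statement is not meaningful
diff_m  <- length_m[["A"]] - length_m[["B"]]
diff_cm <- length_cm[["A"]] - length_cm[["B"]]

print(c(longer_m = longer_m, longer_cm = longer_cm))
print(c(ratio_m = ratio_m, ratio_cm = ratio_cm))
print(c(diff_m = diff_m, diff_cm = diff_cm))
```

The ordering and the ratio are unchanged by the rescaling, while the raw difference is not.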

Statistical Tests

Quantitative analysis of a data set requires statistical tests. This kind of analysis deals with the presentation and numerical processing of the data, which may in turn be used to describe and graphically present interesting aspects of the data set. Its goal is to learn about the distribution of the data, understand its nature, and identify outliers (abnormal or false data points). The following are some of the types of statistical tests:

Parametric Tests

Parametric tests assume that the data are normally distributed and have equal variances. They use interval or ratio scales, require complete information about the population being tested, use the mean as the measure of central tendency, and are applicable only to variables. The following are some of the parametric tests used for empirically evaluating data:
  • Welch’s T-Test: A variant of the two-sample t-test that does not assume equal variances; it estimates the variance of each group separately and adjusts the degrees of freedom used in the test accordingly.
  • Dunnett’s Test / Williams Test: Instead of comparing all possible combinations of groups, the test compares each group to a reference group.
  • Permutation Student’s T-Test: A t-test whose p-value is obtained by repeatedly permuting the observations between the groups and recomputing the t statistic, instead of relying on the theoretical t distribution.
  • Jarque-Bera Test: Tests the normality of the data by checking whether the skewness and kurtosis of the sample match those of a normal distribution.
  • Pearson’s Correlation / Parametric Correlation: Evaluates the association between two variables by measuring the linear dependence between them. It assumes that the data are normally distributed.
  • Paired T-Test: A statistical procedure used to determine whether the mean difference between two sets of observations is 0. In a paired-sample t-test, each subject or entity is measured twice, resulting in pairs of observations.
  • Levene Test: Assesses the equality of variances of a variable calculated for two or more groups, i.e. it checks whether the populations from which the samples are drawn have equal variances.
  • Un-Paired T-Test: Compares the means of two unmatched groups, assuming that the values of both groups follow a Gaussian distribution.
  • One Way ANOVA: Also known as one-way analysis of variance, it is used to determine whether there are any statistically significant differences between the means of two or more independent (unrelated) groups.
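As a rough sketch of the paired t-test, un-paired t-test, and one-way ANOVA described above, the snippet below uses base R only. The built-in `sleep` and `PlantGrowth` datasets are chosen here purely for illustration and are not taken from this post.

```r
# Paired t-test: the built-in 'sleep' data holds two measurements
# (drug 1 and drug 2) for each of the same 10 subjects
drug1 <- sleep$extra[sleep$group == "1"]
drug2 <- sleep$extra[sleep$group == "2"]
paired_res <- t.test(drug1, drug2, paired = TRUE)

# Un-paired (Student's) t-test: treats the two groups as independent
# and assumes equal variances
unpaired_res <- t.test(drug1, drug2, var.equal = TRUE)

# One-way ANOVA: compares mean plant weight across the three groups
# of the built-in 'PlantGrowth' data
anova_fit <- aov(weight ~ group, data = PlantGrowth)

print(paired_res)
print(unpaired_res)
print(summary(anova_fit))
```

The paired test works on the 10 within-subject differences, while the un-paired test treats the same 20 values as two independent groups, so the two can give quite different p-values on the same data.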

Non-Parametric Tests


Non-parametric tests use ordinal or nominal scales, do not require complete information about the population being tested, and are applicable to both variables and attributes. They use the median as the measure of central tendency. The data need not be normally distributed or have equal variances. The following are some of the non-parametric tests used during empirical evaluation:
  • Binomial Test: A method for testing a null hypothesis about the success probability of a binomial distribution.
  • Wilcoxon Test / Mann-Whitney U-Test: The Wilcoxon signed-rank test compares two related samples, matched samples, or repeated measurements on a single sample to assess whether their population mean ranks differ. The Mann-Whitney U test (Wilcoxon rank-sum test) is the counterpart for two independent samples.
  • Kolmogorov-Smirnov Test: Tests whether two independent samples come from the same continuous distribution. It is also used as a test for normality of the variables used as predictors in a regression model before the fit.
  • Adhoc Modification of Original T-Test: Also known as Tukey's test, Tukey's procedure, Tukey's honestly significant difference test, or Tukey's HSD. It is used to determine which means among a set of means differ from the rest.
  • Discrete Cramer-Von Mises Goodness-Of-Fit Tests: A criterion for judging goodness of fit, based on the cumulative distribution function. It serves the same purpose as the Kolmogorov-Smirnov test but is more powerful against a large class of alternative hypotheses.
  • D’Agostino Test: Checks the normality of the data. Based on the D statistic, it gives an upper and a lower critical value.
  • F-Test: Most often used when comparing statistical models that have been fitted to a data set, in order to identify the model that best fits the population from which the data were sampled.
  • Spearman’s Correlation / Kendall Tau: Evaluates the association between two variables using ranks; it does not require the data to be normally distributed.
  • Bonferroni U-Test: A method to counteract the problem of inflated Type I errors when making multiple pairwise comparisons between different sub-groups; it is similar to Tukey’s procedure.
  • Bartlett’s Test: Compares the variances of two or more samples in order to determine whether they are drawn from populations with equal variance. The test is, however, applicable only to normally distributed data.
  • Kruskal-Wallis Test: Used for comparing two or more independent samples of equal or different sample sizes.
  • Fligner-Killeen Test: Similar to the Levene test; it checks whether the variances in each group are the same and remains usable when the data are not normally distributed.
  • Brown-Forsythe Test: Checks for homogeneity of variance.
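A hedged sketch of several rank-based and multiple-comparison procedures from the list above, again using only base R. The built-in `sleep` and `PlantGrowth` datasets are chosen for illustration, and `exact = FALSE` is set because these data contain ties.

```r
# Wilcoxon signed-rank test: two related samples (the same 10 subjects
# under two drugs, from the built-in 'sleep' data)
drug1 <- sleep$extra[sleep$group == "1"]
drug2 <- sleep$extra[sleep$group == "2"]
signed_rank <- wilcox.test(drug1, drug2, paired = TRUE, exact = FALSE)

# Mann-Whitney U test (Wilcoxon rank-sum): two independent samples
ctrl <- PlantGrowth$weight[PlantGrowth$group == "ctrl"]
trt1 <- PlantGrowth$weight[PlantGrowth$group == "trt1"]
rank_sum <- wilcox.test(ctrl, trt1, exact = FALSE)

# Kruskal-Wallis test: three independent groups
kw <- kruskal.test(weight ~ group, data = PlantGrowth)

# Pairwise rank-sum comparisons with Bonferroni-adjusted p-values
pairwise_bonf <- pairwise.wilcox.test(PlantGrowth$weight, PlantGrowth$group,
                                      p.adjust.method = "bonferroni",
                                      exact = FALSE)

# Tukey's HSD on a fitted one-way ANOVA model
tukey <- TukeyHSD(aov(weight ~ group, data = PlantGrowth))

print(signed_rank)
print(rank_sum)
print(kw)
print(pairwise_bonf)
print(tukey)
```

The Bonferroni adjustment and Tukey's HSD both address the same concern, an inflated Type I error rate across several pairwise comparisons, but the former adjusts p-values of separate tests while the latter works from a single fitted ANOVA model.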
There are some tests that do not fall into either of the types discussed above, since they can use any type of scale but are applicable only to variables:

  • Minkowski Distance: The function computes and returns the distance matrix between the rows of a data matrix using the p-norm, i.e. the pth root of the sum of the pth powers of the differences of the components.

  • Manhattan Metric: The function computes and returns the distance matrix between the rows of a data matrix using the absolute distance between the two vectors.
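The two distances can be written out by hand to see what `dist()` computes. This is a small sketch: the helper name `minkowski` and the choice p = 3 are illustrative assumptions, and the vectors are short made-up examples.

```r
# Minkowski distance written out by hand (hypothetical helper):
# the pth root of the sum of the pth powers of the absolute differences
minkowski <- function(x, y, p) sum(abs(x - y)^p)^(1 / p)

x <- c(0, 0, 1, 1, 1, 1)
y <- c(1, 0, 1, 1, 0, 1)

# p = 1 gives the Manhattan (absolute) distance
manhattan_by_hand <- minkowski(x, y, p = 1)
minkowski_by_hand <- minkowski(x, y, p = 3)

# The same distances via dist(), which works on the rows of a matrix
manhattan_dist <- as.numeric(dist(rbind(x, y), method = "manhattan"))
minkowski_dist <- as.numeric(dist(rbind(x, y), method = "minkowski", p = 3))

print(c(by_hand = manhattan_by_hand, dist = manhattan_dist))
print(c(by_hand = minkowski_by_hand, dist = minkowski_dist))
```

With p = 1 the Minkowski distance reduces to the Manhattan metric, and with p = 2 (the default for `method = "minkowski"`) it is the Euclidean distance.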

Examples of the Statistical Tests:

The examples presented below are taken from the source of information for each test. Each entry lists, where available, the test, the R command, example code, results, and result analysis:

Fligner-Killeen Test
R-Command: fligner.test(size~location, data=sample.dataframe)
Example Code:
  size <- c(25,22,28,24,26,24,22,21,23,25,26,30,25,24,21,27,28,23,25,24,20,22,24,23,22,24,20,19,21,22)
  location <- c(rep("ForestA",10), rep("ForestB",10), rep("ForestC",10))
  sample.dataframe <- data.frame(size,location)
  fligner.test(size~location, data=sample.dataframe)
Results:
  Fligner-Killeen test of homogeneity of variances
  data: size by location
  Fligner-Killeen:med chi-squared = 0.9556, df = 2, p-value = 0.6201
Result Analysis: The p-value is greater than 0.05, so the hypothesis that the variances are homogeneous cannot be rejected.
Bartlett’s Test
R-Command: bartlett.test(values~groups, dataset)
Example Code:
  attach(PlantGrowth)
  bartlett.test(weight~group, PlantGrowth)
Results:
  Bartlett test of homogeneity of variances
  data: weight by group
  Bartlett's K-squared = 2.8786, df = 2, p-value = 0.2371
Result Analysis: The p-value is greater than 0.05, so H0 (the variances are the same for all groups) cannot be rejected.
Binomial Test
R-Command: binom.test(x, n, p = 0.5, alternative = c("two.sided", "less", "greater"), conf.level = 0.95)
Example Code:
  Suppose that in coin tossing the chance of getting a head or a tail is 50%. In a real case we toss the coin 100 times and get 48 heads. Is our original hypothesis true?
  binom.test(48,100)
Results:
  Exact binomial test
  data: 48 and 100
  number of successes = 48, number of trials = 100, p-value = 0.7644
  alternative hypothesis: true probability of success is not equal to 0.5
  95 percent confidence interval: 0.3790055 0.5822102
  sample estimates:
  probability of success 0.48
Result Analysis: The p-value is greater than 0.05, so H0 (the probability of getting a head is 0.5) cannot be rejected.
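To show where the binomial test's p-value comes from, the sketch below recomputes it from the binomial distribution for the coin example (48 heads in 100 tosses). It relies on the symmetry of the distribution at p = 0.5, so it is not a general implementation of the test; the variable names are illustrative.

```r
n <- 100      # number of tosses
heads <- 48   # observed number of heads

# With p = 0.5 the distribution is symmetric around 50, so the outcomes
# at least as extreme as 48 heads are X <= 48 and X >= 52
p_by_hand <- pbinom(heads, n, 0.5) +
  pbinom(n - heads - 1, n, 0.5, lower.tail = FALSE)

# The same p-value as reported by binom.test()
p_from_test <- binom.test(heads, n)$p.value

print(round(c(by_hand = p_by_hand, binom.test = p_from_test), 4))
```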
Permutation Student’s T-Test
R-Command: perm.t.test(x, y, paired = FALSE, ...)
Example Code:
  library(RVAideMemoire)  # package that provides perm.t.test()
  response <- c(rnorm(5),rnorm(5,2,1))
  fact <- gl(2,5,labels=LETTERS[1:2])
  # Unpaired test
  perm.t.test(response~fact,nperm=49)
  # Paired test
  perm.t.test(response~fact,paired=TR

Kolmogorov-Smirnov Test
R-Command: ks.test(x,y)
Example Code:
  x <- c(1,2,2,3,3,3,3,4,5,6)
  y <- c(2,3,4,5,5,6,6,6,6,7)
  z <- c(12,13,14,15,15,16,16,16,16,17)
  ks.test(x,y)
  ks.test(y,z)
  ks.test(z,x)

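Cramer-von Mises Test for Normality
R-Command: cvm.test(x)
Example Code:
  library(nortest)  # assumption: the one-argument cvm.test(x) used here is the normality test from the nortest package
  cvm.test(rnorm(100, mean = 10, sd = 6))
  cvm.test(runif(100, min = 2, max = 4))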

Jarque–Bera Test
R-Command: jarqueberaTest(x, title = NULL, description = NULL)
The function returns the value of the test statistic and the p-value.
D’Agostino Test
R-Command: dagoTest(x, title = NULL, description = NULL)

Manhattan Metric
R-Command: dist(rbind(x, y), method = "manhattan")
Example Code:
  x <- c(0, 0, 1, 1, 1, 1)
  y <- c(1, 0, 1, 1, 0, 1)
  dist(rbind(x, y), method = "manhattan")
Results:
    x
  y 2
Result Analysis: The distance between the rows is 2.
Minkowski Distance
R-Command: dist(rbind(x, y), method = "minkowski")
Example Code:
  x <- c(0, 0, 1, 1, 1, 1)
  y <- c(1, 0, 1, 1, 0, 1)
  dist(rbind(x, y), method = "minkowski")
Results:
           x
  y 1.414214
Result Analysis: The distance between the rows is 1.41.
Parametric Correlation
R-Command:
  cor(x, y, method = c("pearson", "kendall", "spearman"))
  cor.test(x, y, method=c("pearson", "kendall", "spearman"))
Example Code:
  my_data <- mtcars  # assumption: the built-in mtcars data, which has the wt and mpg columns used here
  res <- cor.test(my_data$wt, my_data$mpg, method = "pearson")
  res
Results:
  Pearson's product-moment correlation
  data: my_data$wt and my_data$mpg
  t = -9.559, df = 30, p-value = 1.294e-10
  alternative hypothesis: true correlation is not equal to 0
  95 percent confidence interval:
  -0.9338264 -0.7440872
  sample estimates:
  cor -0.8676594
Result Analysis: The p-value of the test is 1.294 × 10^-10, which is less than the significance level alpha = 0.05. Thus wt and mpg are significantly correlated, with a correlation coefficient of -0.87.
Spearman Correlation
R-Command:
  cor(x, y, method = c("pearson", "kendall", "spearman"))
  cor.test(x, y, method=c("pearson", "kendall", "spearman"))
Example Code:
  res2 <- cor.test(my_data$wt, my_data$mpg, method = "spearman")
  res2
Results:
  Spearman's rank correlation rho
  data: my_data$wt and my_data$mpg
  S = 10292, p-value = 1.488e-11
  alternative hypothesis: true rho is not equal to 0
  sample estimates:
  rho -0.886422
Result Analysis: The correlation coefficient between wt and mpg is -0.8864 and the p-value is 1.488 × 10^-11.
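Spearman's coefficient can be understood as Pearson's coefficient applied to the ranks of the data. The sketch below illustrates this on the `wt` and `mpg` columns, assuming that `my_data` in the example above is R's built-in `mtcars` data (an assumption, since the post does not define it).

```r
# Assumption: the wt and mpg columns come from the built-in mtcars data
wt  <- mtcars$wt
mpg <- mtcars$mpg

# Spearman's rho as computed by cor()
rho_builtin <- cor(wt, mpg, method = "spearman")

# The same value obtained as Pearson's correlation of the ranks
# (rank() assigns average ranks to ties, as cor() does)
rho_by_hand <- cor(rank(wt), rank(mpg), method = "pearson")

print(c(builtin = rho_builtin, by_hand = rho_by_hand))
```

Because only the ranks enter the calculation, the coefficient measures monotonic rather than strictly linear association.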
Welch’s T-Test
R-Command: t.test(x,y)
Example Code:
  x = rnorm(10)
  y = rnorm(10)
  t.test(x,y)
Results:
  Welch Two Sample t-test
  data: x and y
  t = -0.8103, df = 17.277, p-value = 0.4288
  alternative hypothesis: true difference in means is not equal to 0
  95 percent confidence interval:
  -1.0012220 0.4450895
  sample estimates:
  mean of x mean of y
  0.2216045 0.4996707
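The non-integer degrees of freedom in the Welch output come from the Welch-Satterthwaite approximation. The sketch below recomputes them by hand and contrasts them with the pooled-variance Student's t-test; the seed is arbitrary and only there to make the sketch reproducible, so the numbers will differ from the results shown above.

```r
set.seed(1)  # arbitrary seed, chosen only for reproducibility
x <- rnorm(10)
y <- rnorm(10)

welch  <- t.test(x, y)                    # default: var.equal = FALSE
pooled <- t.test(x, y, var.equal = TRUE)  # classic Student's t-test

# Welch-Satterthwaite approximation of the degrees of freedom
vx <- var(x) / length(x)
vy <- var(y) / length(y)
df_by_hand <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

print(c(welch_df = unname(welch$parameter),
        by_hand  = df_by_hand,
        pooled_df = unname(pooled$parameter)))
```

The pooled test always uses n1 + n2 - 2 degrees of freedom, while the Welch value shrinks as the two sample variances become more unequal.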

Dunnett’s Test / Williams Test
R-Command: test.out = glht(out, linfct = mcp(ZNGROUP = "Dunnett"))
Example Code:
  library(multcomp)
  # model as given in the "Fit" line of the results; d is the source's data frame
  out = aov(DIVERSTY ~ ZNGROUP, data = d)
  test.out = glht(out, linfct = mcp(ZNGROUP = "Dunnett"))
  summary(test.out)
Results:
  Multiple Comparisons of Means: Dunnett Contrasts
  Fit: aov(formula = DIVERSTY ~ ZNGROUP, data = d)
  Linear Hypotheses:
             Estimate Std. Error t value Pr(>|t|)
  2 - 1 == 0  0.23500    0.23303   1.008   0.6195
  3 - 1 == 0 -0.07972    0.22647  -0.352   0.9701
  4 - 1 == 0 -0.51972    0.22647  -2.295   0.0725 .
  ---

References:

[1] https://stat.ethz.ch/R-manual/R-devel/library/stats/html/dist.html
[2] http://www.endmemo.com/program/R/binomial.php
[3] http://math.furman.edu/~dcs/courses/math47/R/library/fBasics/html/015D-OneSampleTests.html
[4] https://stat.ethz.ch/R-manual/R-devel/library/stats/html/ks.test.html
[5] https://www.rdocumentation.org/packages/RVAideMemoire/versions/0.9-68/topics/perm.t.test
[6] https://www.rdocumentation.org/packages/dgof/versions/1.2/topics/cvm.test
[7] https://www.rdocumentation.org/packages/tsoutliers/versions/0.3/topics/jarque.bera.test
[8] https://www.pinterest.com/APstatistics/chapter-7-sampling-distributions/?lp=true
[9] http://cw.routledge.com/textbooks/9780415368780/e/CH26box.asp
[10] http://www.jisponline.com/article.asp?issn=0972-124X;year=2013;volume=17;issue=5;spage=577;epage=582;aulast=Avula
[11] http://keydifferences.com/difference-between-parametric-and-nonparametric-test.html
[12] https://www.healthknowledge.org.uk/public-health-textbook/research-methods/1b-statistical-methods/parametric-nonparametric-tests
[13] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2881615/
[14] http://www.sthda.com/english/wiki/correlation-test-between-two-variables-in-r#pearson-correlation-formula
[15] https://statistics.berkeley.edu/computing/r-t-tests
[16] https://www.otexts.org/node/687 
