
Statistical Tests in Empirical Software Engineering

Empirical Software Engineering

Software engineering requires a cycle of model building, experimentation, learning, and re-modeling. The researcher's role is to understand the nature of the processes and products, and their relationships, in context:
  • They (often) use laboratory settings to observe and manipulate the variables
  • What is the effect? Why is this so? ...
The practitioner's role is to build "improved" systems using the available knowledge:
  • They need to better understand how to build better systems
  • What is the problem? What are the potential solutions? What is the cost? To what extent do they solve the problem? ...
Empirical software engineering provides methods, techniques, and tools to systematically obtain relevant information. It considers the systematic application of scientific (research) methods to understand, evaluate, and model software engineering phenomena, e.g.:
  • Something is gained with software development - what? why? and how?
  • There must be some room for improvement - what? and where?
  • A specific decision was taken - why? and how?

Measurement

A measure is a mapping from an attribute of an entity to a measurement value, usually a numerical value, used to characterize and manipulate the attribute in a formal way. One of the basic characteristics of a measure is therefore that it must preserve the empirical observations of the attribute, i.e. if object A is longer than object B, the measure of A must be greater than the measure of B. We must be certain that the measure is valid:
  • The measure must not violate any necessary properties of the attribute it measures
  • It must be a proper mathematical characterization of the attribute
Types of measurements can be classified as:
  • Objective vs. Subjective
  • Direct vs. Indirect
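For instance, a direct measure is obtained straight from the entity (e.g. lines of code or a defect count), while an indirect measure is derived from other measures (e.g. defect density). A minimal R sketch with hypothetical module data:

  # Direct measures: size and defect count observed directly per module (hypothetical data)
  loc     <- c(moduleA = 1200, moduleB = 450, moduleC = 3100)
  defects <- c(moduleA = 9,    moduleB = 2,   moduleC = 31)
  # Indirect measure: defect density, derived from the two direct measures above
  defect_density <- defects / (loc / 1000)     # defects per KLOC
  defect_density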


The objects that are of interest in software engineering can be divided into three different classes: processes, products, and resources.

Scales

The mapping from an attribute to a measurement value can be made in many different ways. Each different mapping of an attribute is a scale, e.g. if the attribute is the length of an object, we can measure it in meters, centimeters, or inches, each of which is a different scale for measuring length. In some cases a transformation is required to convert the measure from one scale to another. An admissible transformation, also known as rescaling, is one that preserves the relationships among the objects. With the measures of the attribute, we make statements about an object or the relation between different objects. If the statements remain true even when the measures are rescaled, they are called meaningful; otherwise they are meaningless. There are four types of scales used for measurement: nominal, ordinal, interval, and ratio.
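A minimal R sketch of an admissible transformation, using hypothetical lengths on a ratio scale: rescaling from meters to centimeters preserves the ordering and the ratios, so a statement such as "A is twice as long as B" stays meaningful after rescaling.

  length_m  <- c(A = 2.0, B = 1.0, C = 3.5)       # hypothetical lengths in meters
  length_cm <- length_m * 100                      # admissible transformation (rescaling to cm)
  all(order(length_m) == order(length_cm))         # TRUE: the empirical ordering is preserved
  length_m["A"]  / length_m["B"]                   # 2
  length_cm["A"] / length_cm["B"]                  # still 2: the ratio statement is meaningful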

Statistical Tests

Quantitative analysis of a particular set of data requires statistical tests. This type of analysis deals with the presentation and numerical processing of the data, which may in turn be used to describe and graphically present interesting aspects of the data set. The goal is to learn about the distribution of the data, understand its nature, and identify outliers (abnormal or erroneous data points). Following are some of the types of statistical tests:
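As a sketch of this descriptive step, the following R code summarises a hypothetical sample of task effort values and flags outliers using the usual 1.5 * IQR boxplot rule (the data and the variable name effort are illustrative only):

  effort <- c(12, 15, 14, 13, 90, 16, 14, 15, 13, 12)   # person-hours per task (hypothetical)
  summary(effort)                                        # location and spread of the distribution
  boxplot.stats(effort)$out                              # points flagged as outliers (here: 90)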

Parametric Tests

For these tests the data should have equal variances and be normally distributed. Parametric tests use interval or ratio scales and require complete information about the population being tested. The measure of central tendency used is the mean of the population, and the tests are applicable only to variables. Following are some of the parametric tests used for empirically evaluating data:
  • Welch's T-Test: A variant of the two-sample t-test that compares distributions while estimating the two variances separately and adjusting the degrees of freedom used in the test.
  • Dunnett's Test / Williams Test: Instead of comparing all possible combinations of groups, these tests compare each group against a reference (control) group.
  • Permutation Student's T-Test: A Student's t-test in which the p-value is obtained by permuting the observations rather than from the theoretical t distribution.
  • Jarque-Bera Test: Tests the normality of the data by checking whether the sample kurtosis and skewness match those of a normal distribution.
  • Pearson's Correlation / Parametric Correlation: Evaluates the association between two or more variables by measuring the linear dependence between them; it assumes the data are normally distributed.
  • Paired T-Test: A statistical procedure used to determine whether the mean difference between two sets of observations is zero. In a paired-sample t-test, each subject or entity is measured twice, resulting in pairs of observations.
  • Levene Test: Assesses the equality of variances for a variable calculated for two or more groups, i.e. it checks whether the variances of the populations from which the different samples are drawn are equal.
  • Unpaired T-Test: Compares the means of two unmatched groups, assuming that the values of both groups follow a Gaussian distribution.
  • One-Way ANOVA: Also known as One-Way Analysis of Variance; determines whether there are statistically significant differences between the means of two or more independent (unrelated) groups.
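As an illustration of two of the tests above, the following R sketch runs a paired t-test on hypothetical before/after defect counts and a one-way ANOVA on the built-in PlantGrowth data set (the same data set used for Bartlett's test in the examples section below); the variable names are illustrative only.

  before <- c(12.1, 14.3, 11.8, 13.0, 12.7)    # e.g. defect counts before a process change (hypothetical)
  after  <- c(10.9, 13.2, 11.1, 12.4, 11.8)    # the same modules measured after the change
  t.test(before, after, paired = TRUE)          # paired t-test, H0: mean difference is 0

  summary(aov(weight ~ group, data = PlantGrowth))   # one-way ANOVA across the three groups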

Non-Parametric Tests


Non-parametric tests use ordinal or nominal scales, do not require complete information about the population being tested, and are applicable to both variables and attributes. Such tests use the median as the measure of central tendency. For this type of test the data need not be normally distributed nor have equal variances. Following are some of the non-parametric tests used during empirical evaluation:
  • Binomial Test: A method for testing a null hypothesis about a binomial distribution.
  • Wilcoxon Test / Mann-Whitney U-Test: Also known as the Wilcoxon signed-rank test; used to compare two related samples, matched samples, or repeated measurements on a single sample to assess whether their population mean ranks differ.
  • Kolmogorov-Smirnov Test: Tests whether two independent samples come from the same continuous distribution. The function is also used as a test for normality of the variables used as predictors in a regression model before the fit.
  • Ad-hoc Modification of the Original T-Test: Also known as Tukey's test, Tukey's procedure, Tukey's honestly significant difference test, or Tukey's HSD. It is used to determine which means among a set of means differ from the rest.
  • Discrete Cramer-von Mises Goodness-of-Fit Test: A criterion for judging goodness of fit based on the cumulative distribution function. It serves the same purpose as the Kolmogorov-Smirnov test but is more powerful against a large class of alternative hypotheses.
  • D'Agostino Test: Checks the normality of the data. Based on the D statistic, it gives upper and lower critical values.
  • F-Test: Most often used when comparing statistical models that have been fitted to a data set, in order to identify the model that best fits the population from which the data were sampled.
  • Spearman's Correlation / Kendall Tau: Evaluates the association between two or more variables using ranks; it is applicable when the data are not normally distributed.
  • Bonferroni U-Test: A method to counteract the problem of inflated Type I error when making multiple pairwise comparisons between different sub-groups; similar in purpose to Tukey's procedure.
  • Bartlett's Test: Compares the variances of two or more samples to determine whether they are drawn from populations with equal variance. The test, however, is applicable only to normally distributed data.
  • Kruskal-Wallis Test: Used for comparing two or more independent samples of equal or different sample sizes.
  • Fligner-Killeen Test: Similar to the Levene test; it checks whether the variances in each group are the same and remains valid when the data are not normally distributed.
  • Brown-Forsythe Test: Checks for homogeneity of variance.
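As an illustration of two of the tests above, the following R sketch runs a Wilcoxon signed-rank test on two hypothetical paired samples and a Kruskal-Wallis test on the built-in InsectSprays data set; the data for the Wilcoxon test are illustrative only.

  x <- c(1.83, 0.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30)   # hypothetical paired measurements
  y <- c(0.88, 0.65, 0.60, 2.05, 1.06, 1.29, 1.06, 3.14, 1.29)
  wilcox.test(x, y, paired = TRUE)                  # Wilcoxon signed-rank test on the paired samples

  kruskal.test(count ~ spray, data = InsectSprays)  # Kruskal-Wallis test across the spray groups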
There are some functions that do not fall into either of the types of tests discussed above, since they can use any type of scale but are applicable only to variables:

  • Manhattan Metric: Computes and returns the distance matrix obtained by using the absolute distance between two vectors to compute the distance between the rows of a data matrix.
  • Minkowski Distance: Computes and returns the distance matrix obtained by using the p-norm, the pth root of the sum of the pth powers of the differences of the components, to compute the distance between the rows of a data matrix.

Examples of the Statistical Tests:

The examples presented below for each test are taken from the sources listed in the references:

For each test, the R command, example code, the results produced, and a short analysis of the results are given.
Fligner-Killeen Test
R-Command: fligner.test(size~location, data=sample.dataframe)
Example Code:
  • size <- c(25,22,28,24,26,24,22,21,23,25,26,30,25,24,21,27,28,23,25,24,20,22,24,23,22,24,20,19,21,22)
  • location <- c(rep("ForestA",10), rep("ForestB",10), rep("ForestC",10))
  • sample.dataframe <- data.frame(size, location)
  • fligner.test(size~location, data=sample.dataframe)
Results:
  Fligner-Killeen test of homogeneity of variances
  data: size by location
  Fligner-Killeen: med chi-squared = 0.9556, df = 2, p-value = 0.6201
Result Analysis: The p-value obtained through the test (0.6201 > 0.05) shows that the variances are homogeneous.
Bartlett's Test
R-Command: bartlett.test(values~groups, dataset)
Example Code:
  • attach(PlantGrowth)
  • bartlett.test(weight~group, PlantGrowth)
Results:
  Bartlett test of homogeneity of variances
  data: weight by group
  Bartlett's K-squared = 2.8786, df = 2, p-value = 0.2371
Result Analysis: The p-value being greater than 0.05 means that H0 (the variances are the same for all groups) cannot be rejected.
Binomial Test
R-Command: binom.test(x, n, p = 0.5, alternative = c("two.sided", "less", "greater"), conf.level = 0.95)
Example Code:
  • Suppose that in coin tossing the chance of getting a head or a tail is 50%. In 100 actual tosses we get 48 heads; is the original hypothesis true?
  • binom.test(48, 100)
Results:
  Exact binomial test
  data: 48 and 100
  number of successes = 48, number of trials = 100, p-value = 0.7644
  alternative hypothesis: true probability of success is not equal to 0.5
  95 percent confidence interval: 0.3790055 0.5822102
  sample estimates: probability of success 0.48
Result Analysis: The p-value being greater than 0.05 shows that H0 (the probability of getting a head is 0.5) is not rejected.
Permutation Student's T-Test
R-Command: perm.t.test(x, y, paired = FALSE, ...)
Example Code:
  • response <- c(rnorm(5), rnorm(5,2,1))
  • fact <- gl(2, 5, labels = LETTERS[1:2])
  • # Unpaired test
  • perm.t.test(response~fact, nperm = 49)
  • # Paired test
  • perm.t.test(response~fact, paired = TRUE, nperm = 49)


Kolmogorov-Smirnov Test
R-Command: ks.test(x, y)
Example Code:
  • x <- c(1,2,2,3,3,3,3,4,5,6)
  • y <- c(2,3,4,5,5,6,6,6,6,7)
  • z <- c(12,13,14,15,15,16,16,16,16,17)
  • ks.test(x,y)
  • ks.test(y,z)
  • ks.test(z,x)


Cramer-von Mises Test for Normality
R-Command: cvm.test(x)
Example Code:
  • cvm.test(rnorm(100, mean = 10, sd = 6))
  • cvm.test(runif(100, min = 2, max = 4))



Jarque-Bera Test
R-Command: jarqueberaTest(x, title = NULL, description = NULL)
The function returns the value of the test statistic and the corresponding p-value.
D'Agostino Test
R-Command: dagoTest(x, title = NULL, description = NULL)



Manhattan Metric
R-Command: dist(rbind(x, y), method = "manhattan")
Example Code:
  • x <- c(0, 0, 1, 1, 1, 1)
  • y <- c(1, 0, 1, 1, 0, 1)
  • dist(rbind(x, y), method = "manhattan")
Results:
    x
  y 2
Result Analysis: The Manhattan distance between the two rows is 2.
Minkowski Metric
R-Command: dist(rbind(x, y), method = "minkowski")
Example Code:
  • x <- c(0, 0, 1, 1, 1, 1)
  • y <- c(1, 0, 1, 1, 0, 1)
  • dist(rbind(x, y), method = "minkowski")
Results:
    x
  y 1.414214
Result Analysis: The distance between the rows is 1.41 (with the default p = 2, the Minkowski distance equals the Euclidean distance).
Parametric Correlation (Pearson)
R-Command: cor(x, y, method = c("pearson", "kendall", "spearman"))
           cor.test(x, y, method = c("pearson", "kendall", "spearman"))
Example Code:
  • res <- cor.test(my_data$wt, my_data$mpg, method = "pearson")
  • res
Results:
  Pearson's product-moment correlation
  data: my_data$wt and my_data$mpg
  t = -9.559, df = 30, p-value = 1.294e-10
  alternative hypothesis: true correlation is not equal to 0
  95 percent confidence interval: -0.9338264 -0.7440872
  sample estimates: cor -0.8676594
Result Analysis: The p-value of the test is 1.294e-10, which is less than the significance level alpha = 0.05. Thus wt and mpg are significantly correlated, with a correlation coefficient of -0.87.
Spearman Correlation
R-Command: cor(x, y, method = c("pearson", "kendall", "spearman"))
           cor.test(x, y, method = c("pearson", "kendall", "spearman"))
Example Code:
  • res2 <- cor.test(my_data$wt, my_data$mpg, method = "spearman")
  • res2
Results:
  Spearman's rank correlation rho
  data: my_data$wt and my_data$mpg
  S = 10292, p-value = 1.488e-11
  alternative hypothesis: true rho is not equal to 0
  sample estimates: rho -0.886422
Result Analysis: The correlation coefficient between wt and mpg is -0.8864 and the p-value is 1.488e-11, so the two variables are significantly (negatively) correlated.
Welch's T-Test
R-Command: t.test(x, y)
Example Code:
  • x = rnorm(10)
  • y = rnorm(10)
  • t.test(x, y)
Results:
  Welch Two Sample t-test
  data: x and y
  t = -0.8103, df = 17.277, p-value = 0.4288
  alternative hypothesis: true difference in means is not equal to 0
  95 percent confidence interval: -1.0012220 0.4450895
  sample estimates: mean of x 0.2216045, mean of y 0.4996707
Result Analysis: The p-value (0.4288) is greater than 0.05, so no significant difference between the means of x and y is detected.

Dunnett's Test / Williams Test
R-Command: test.out = glht(out, linfct = mcp(ZNGROUP = "Dunnett"))
Example Code:
  • library(multcomp)
  • test.out = glht(out, linfct = mcp(ZNGROUP = "Dunnett"))
  • summary(test.out)
Results:
  Multiple Comparisons of Means: Dunnett Contrasts
  Fit: aov(formula = DIVERSTY ~ ZNGROUP, data = d)
  Linear Hypotheses:
             Estimate Std. Error t value Pr(>|t|)
  2 - 1 == 0  0.23500    0.23303   1.008   0.6195
  3 - 1 == 0 -0.07972    0.22647  -0.352   0.9701
  4 - 1 == 0 -0.51972    0.22647  -2.295   0.0725 .
Result Analysis: None of groups 2-4 differ significantly from the reference group 1 at the 0.05 level (the smallest adjusted p-value is 0.0725).
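The table above gives only the command signatures for the Jarque-Bera and D'Agostino normality tests. The following sketch shows how they could be called on simulated data, assuming the fBasics package (reference [3]), which provides jarqueberaTest and dagoTest:

  library(fBasics)
  x <- rnorm(100, mean = 10, sd = 6)   # simulated, normally distributed sample
  jarqueberaTest(x)                    # skewness/kurtosis based normality test
  dagoTest(x)                          # D'Agostino normality test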

References:

[1] https://stat.ethz.ch/R-manual/R-devel/library/stats/html/dist.html
[2] http://www.endmemo.com/program/R/binomial.php
[3] http://math.furman.edu/~dcs/courses/math47/R/library/fBasics/html/015D-OneSampleTests.html
[4] https://stat.ethz.ch/R-manual/R-devel/library/stats/html/ks.test.html
[5] https://www.rdocumentation.org/packages/RVAideMemoire/versions/0.9-68/topics/perm.t.test
[6] https://www.rdocumentation.org/packages/dgof/versions/1.2/topics/cvm.test
[7] https://www.rdocumentation.org/packages/tsoutliers/versions/0.3/topics/jarque.bera.test
[8] https://www.pinterest.com/APstatistics/chapter-7-sampling-distributions/?lp=true
[9] http://cw.routledge.com/textbooks/9780415368780/e/CH26box.asp
[10] http://www.jisponline.com/article.asp?issn=0972-124X;year=2013;volume=17;issue=5;spage=577;epage=582;aulast=Avula
[11] http://keydifferences.com/difference-between-parametric-and-nonparametric-test.html
[12] https://www.healthknowledge.org.uk/public-health-textbook/research-methods/1b-statistical-methods/parametric-nonparametric-tests
[13] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2881615/
[14] http://www.sthda.com/english/wiki/correlation-test-between-two-variables-in-r#pearson-correlation-formula
[15] https://statistics.berkeley.edu/computing/r-t-tests
[16] https://www.otexts.org/node/687 
