Tuesday, 24 July 2012

Measuring effect size with the Vargha-Delaney A measure

Sometimes, you want to compare the performance of two techniques. That is, you want to quantify how much better one technique is than the other, in a statistically sound fashion. In this post, I will show you how to do just that, using the Vargha-Delaney A measure. The A measure is awesome because it is: agnostic to the underlying distribution of the data; easy to interpret; fast and easy to compute; and compatible with real-valued (continuous or not) and ordinal data alike.

For example, say you were trying to kill zombies that were trying to come in your house and eat your brains. You dream up two different techniques for defending your homestead:
  • Technique T1: Rocket launcher to their faces
  • Technique T2: Hurling fruit cake at their faces
Which technique can eliminate more zombies?


The Experiment

To evaluate your two techniques, you conduct a simple experiment. First, you invite 100 sets of zombies to try and break into your house. Let's say that each set contains exactly 10 zombies. For the sake of simplicity, let's assume that each set of zombies comes to your house twice, no matter what happens to them the first time. The first time they come, you unleash the rocket launcher (technique T1); the second time, the fruit cake (T2). After each set comes and you try each technique, you tally up the number of dead zombies.

When the dust settles, you will have a simple dataset that looks something like:

1, 3, 5
2, 9, 5
3, 0, 5
[97 more rows] 

[Download the sample dataset here.]

The first column shows the zombie set ID, an integer between 1 and 100. The second column shows the number of zombies eliminated by technique T1, and the third column shows the number of zombies eliminated by technique T2.
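If you don't have the sample file handy, you can simulate a dataset of the same shape. The kill probabilities below (0.48 for T1, 0.65 for T2) are made-up numbers chosen only to roughly match the summary statistics reported later; the real data may differ.

```r
# Simulate 100 zombie sets of 10 zombies each and write them out in the
# same headerless, comma-separated format that the post reads back in.
set.seed(42)
n_sets <- 100
zombies <- data.frame(
  id = 1:n_sets,
  t1 = rbinom(n_sets, size = 10, prob = 0.48),  # kills by rocket launcher
  t2 = rbinom(n_sets, size = 10, prob = 0.65)   # kills by fruit cake
)
write.table(zombies, "zombies.dat", sep = ",",
            row.names = FALSE, col.names = FALSE)
```

Since each set has exactly 10 zombies, a binomial draw with `size = 10` keeps every count between 0 and 10, just like the real experiment.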

Which Technique is Better?

One way to determine which technique is better is to simply compare the means and medians of columns 2 and 3: how many zombies did each technique eliminate, on average? In R, we could do something like this:


> zombies = read.csv("zombies.dat", sep=",", header=F)
> summary(zombies[,2])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    3.00    5.00    4.79    7.00    9.00 
> summary(zombies[,3])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    5.00    6.00    6.48    8.00   10.00 

Here, we see that the rocket launchers killed about 4.8 zombies on average, whereas the fruit cake killed about 6.5. However, averages are easily skewed by extreme values and can therefore be misleading. Moreover, they cannot handle ordinal data, and they do poorly with non-continuous variables.

Instead, we could use the Vargha-Delaney A measure, which tells us how often, on average, one technique outperforms the other. When applied to two populations (like the results of our two techniques), the A measure is a value between 0 and 1: it is, roughly, the probability that a randomly chosen result from the first population beats a randomly chosen result from the second. When the A measure is exactly 0.5, the two techniques achieve equal performance; when A is less than 0.5, the first technique is worse; and when A is more than 0.5, the first technique is better. The closer A is to 0.5, the smaller the difference between the techniques; the farther from 0.5, the larger the difference.
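To build intuition before applying it to the zombie data, here is the A measure on tiny hand-made samples. The snippet restates the one-liner from the Appendix so it runs on its own; the sample vectors are made up purely for illustration.

```r
# Vargha-Delaney A: A(a, b) = (R1/m - (m+1)/2) / n, where R1 is the
# rank sum of a in the combined sample, m = length(a), n = length(b).
AMeasure <- function(a, b) {
  r <- rank(c(a, b))           # ranks in the combined sample (ties averaged)
  r1 <- sum(r[seq_along(a)])   # rank sum of the first sample
  m <- length(a); n <- length(b)
  (r1 / m - (m + 1) / 2) / n
}

AMeasure(c(1, 2, 3), c(4, 5, 6))  # 0:   the first sample always loses
AMeasure(c(4, 5, 6), c(1, 2, 3))  # 1:   the first sample always wins
AMeasure(c(1, 2, 3), c(1, 2, 3))  # 0.5: identical samples, no difference
```

The three calls hit the two extremes and the midpoint, which is a quick sanity check for any implementation of the measure.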

Let's return to our example:

> AMeasure(zombies[,2], zombies[,3])
[1] 0.33555

[See the Appendix for the implementation of the function AMeasure().]

In our case, the A measure is about 0.34: because 0.34 is less than 0.5, rocket launchers (T1) performed worse than fruit cake (T2). We can interpret the value as follows: 34% of the time, rocket launchers will work better than fruit cake. Equivalently, 66% of the time, fruit cake will work better than rocket launchers. (Anyone who has ever tried fruit cake can understand its true destructive power.) Since 66% is roughly twice 34%, you have about twice the chance of surviving should you forgo the military equipment and instead put Aunt Gertrude's culinary creation to good use.

Conclusion

There are many statistical techniques to help you choose your zombie-fighting strategies, but the Vargha-Delaney A measure is simple, robust, and easy to interpret.

Note: My coauthors and I used this measure in an upcoming Empirical Software Engineering article.


Appendix: R code for Vargha-Delaney

Using Equations (13) and (14) in their paper, we get R code as follows:

##########################################################################
##########################################################################
#
# Computes the Vargha-Delaney A measure for two populations a and b.
#
# Equation numbers below refer to the paper:
# @article{vargha2000critique,
#  title={A critique and improvement of the CL common language effect size
#               statistics of McGraw and Wong},
#  author={Vargha, A. and Delaney, H.D.},
#  journal={Journal of Educational and Behavioral Statistics},
#  volume={25},
#  number={2},
#  pages={101--132},
#  year={2000},
#  publisher={Sage Publications}
# }
# 
# a: a vector of real numbers
# b: a vector of real numbers 
# Returns: A real number between 0 and 1 
AMeasure <- function(a,b){

    # Compute the rank sum of the first sample
    # (Eqn 13 relates this rank sum to the A measure)
    r = rank(c(a,b))
    r1 = sum(r[seq_along(a)])

    # Compute the measure (Eqn 14) 
    m = length(a)
    n = length(b)
    A = (r1/m - (m+1)/2)/n

    A
}
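One handy property for checking an implementation: swapping the two samples mirrors the measure around 0.5, i.e. A(a, b) + A(b, a) = 1, even in the presence of ties. A small sketch, with made-up kill counts and the function restated so the snippet runs standalone:

```r
# Restated from the Appendix so this snippet is self-contained.
AMeasure <- function(a, b) {
  r <- rank(c(a, b))
  r1 <- sum(r[seq_along(a)])
  m <- length(a); n <- length(b)
  (r1 / m - (m + 1) / 2) / n
}

a <- c(3, 9, 0, 5, 7)   # made-up kill counts for T1
b <- c(5, 5, 5, 8, 10)  # made-up kill counts for T2 (note the ties with a)
AMeasure(a, b)                   # 0.34: T1 worse than T2 on these samples
AMeasure(a, b) + AMeasure(b, a)  # 1: swapping samples mirrors A around 0.5
```

This identity follows from the fact that the two rank sums always add up to N(N+1)/2 for N = m + n, so it makes a cheap regression test.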

4 comments:

  1. Greetings, is this your one and only site or you personally own some others?

  2. Hello, in your R code, you refer to the rank sum computation as eq. 13 from Vargha-Delaney... It's not. Eq. 13 shows the relationship between the rank-sum and the A measure, leading to eq. 14. Your code is correct, only the reference to eq. 13 may be confusing.

  3. This is a nice intro to V&D A measure.

    The measure is currently implemented in the effsize R package.

    Just a minor comment. The original formula from V&D may introduce some accuracy error, a better formulation is:

    A = (2* r1 - m*(m+1))/(2*m*n)

    For more details see http://mtorchiano.wordpress.com/2014/05/19/effect-size-of-r-precision/

  4. Hello. I write this in hope someone might resolve this doubt. What happens if I'm comparing two conditions of an experiment where the condition that performs better should get a negative value? In other words, I want to see which condition has performed better in giving results with smaller values. In this case, shouldn't the resulting A measure be smaller for the group that performs better?

    Cheers
