# VolcanoCI: "A modified volcano plot using confidence intervals for simple presentation of microarray data"

Supplement to VolcanoCI by Hothorn LA, Ma D.

R code for making the CI-FOLD plot can be downloaded here: R-code.Rt.

The data examples (Hedenfalk & Shipp) used in this study can be found here. We used data from public sources. Importantly, we studied comparisons that were not made in the same way by the authors of the original studies. For this reason, we provide the primary data files that we actually used, so that it is clear exactly which data entered our analysis. The datasets can be traced back to the original sources on the authors' web sites (Hedenfalk & Shipp).

Take the cDNA array data (Hedenfalk et al., 2001) as an example: without normalization, the original data are non-normal for most of the genes. The joint distribution of skewness and kurtosis is shown in Figure 1. A log2 transformation is commonly applied to cDNA data to obtain a more symmetric distribution; the joint distribution of skewness and kurtosis for the transformed data is shown in Figure 2. There are also a few genes with variance heterogeneity between the two samples, as shown in Figure 3. Given these facts, parametric methods (such as the t test) that assume normality are not appropriate for this kind of dataset.
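The per-gene skewness and kurtosis discussed above can be computed along the following lines. This is only a minimal sketch: the moment-based definitions are one common choice (other estimators exist), the helper names `skewness` and `kurtosis` and the matrix `expr` are hypothetical, and simulated log-normal values stand in for the real array data.

```r
# Sample skewness and excess kurtosis based on central moments
# (one common moment-based definition; an assumption of this sketch).
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

kurtosis <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

# Hypothetical expression matrix: rows are genes, columns are arrays.
# Simulated log-normal intensities are used in place of real data.
set.seed(1)
expr <- matrix(rlnorm(200 * 15, meanlog = 0, sdlog = 1), nrow = 200)

# Per-gene skewness and kurtosis before and after the log2 transformation
raw_shape  <- data.frame(skew = apply(expr, 1, skewness),
                         kurt = apply(expr, 1, kurtosis))
log2_shape <- data.frame(skew = apply(log2(expr), 1, skewness),
                         kurt = apply(log2(expr), 1, kurtosis))

# Joint distribution: one point per gene (uncomment to draw)
# plot(raw_shape$skew,  raw_shape$kurt,  xlab = "Skewness", ylab = "Kurtosis")
# plot(log2_shape$skew, log2_shape$kurt, xlab = "Skewness", ylab = "Kurtosis")
```

For normally distributed data both quantities scatter around zero, so a cloud of points far from the origin indicates non-normality for many genes.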

With regard to Affymetrix data, it is worth remembering that the data have already been normalized and scaled to some extent by specific algorithms. Nevertheless, Affymetrix-type data are also non-normal. For example, the joint distribution of skewness and kurtosis of the Shipp data (Shipp et al., 2002) is shown in Figure 4. Further normalization is possible, but overfitting is likely to remove biologically relevant information and add noise. Thus Affymetrix data also call for robust procedures for detecting differentially expressed genes, rather than parametric tests (such as the t test).

Given the facts above, we propose presenting statistical significance by the lower or upper limit of the confidence interval instead of the p-value, because a confidence interval simultaneously conveys the distance from the null hypothesis (distance to 1), the direction of the effect (larger or smaller than 1), and the variability (width).
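To illustrate how a confidence interval for a ratio carries this information, the sketch below uses base R's `wilcox.test()` with `conf.int = TRUE`, which returns the Hodges-Lehmann estimate of a location shift together with a distribution-free confidence interval. Applying it to log2 values and back-transforming gives an estimate and limits on the ratio scale. The data and variable names are made up for illustration, and this is not necessarily how VolcanoCI computes its nonparametric limits.

```r
# Hypothetical log2 expression values of one gene in two groups
set.seed(42)
group1 <- rnorm(8, mean = 9, sd = 0.5)   # e.g. first class of arrays (log2 scale)
group2 <- rnorm(7, mean = 8, sd = 0.5)   # e.g. second class of arrays (log2 scale)

# Hodges-Lehmann estimate of the shift and its nonparametric confidence interval
fit <- wilcox.test(group1, group2, conf.int = TRUE, conf.level = 0.95)

# A shift on the log2 scale corresponds to a ratio on the original scale
ratio_hl    <- 2^unname(fit$estimate)
ratio_lower <- 2^fit$conf.int[1]
ratio_upper <- 2^fit$conf.int[2]

# Lower limit > 1: significantly over-expressed;
# upper limit < 1: significantly under-expressed.
c(estimate = ratio_hl, lower = ratio_lower, upper = ratio_upper)
```

The distance of the relevant limit from 1 shows how far the gene is from the null hypothesis, its side of 1 gives the direction, and the gap between the limits reflects the variability.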

The R package VolcanoCI can be used to produce such a plot. The plotting function has the following options:

method : method for constructing the CIs; two are available, the parametric Fieller's method and the nonparametric method ("nonpar").

alpha : error level of the confidence intervals; can be changed to control for multiplicity.

col : color

pch : plotting character

xlab : a title for the x axis

ylab : a title for the y axis

main : a title for the graph

aid.lines : whether to draw the two bisecting lines in quadrants I and II

fold.arc : a compromise measure of fold change: the Euclidean distance between the gene and the origin, namely $$\sqrt{X^2+Y^2}$$

VIG.names : whether to highlight the names of the Very Important Genes

global : whether to plot all four quadrants; if FALSE, quadrants III and IV are discarded
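For readers unfamiliar with the parametric option, the following sketch shows a textbook Fieller-type confidence interval for the ratio of two means. It assumes independent, normally distributed samples with a common variance; the function name `fieller_ci` and the example data are hypothetical, and the package's own implementation may differ in detail.

```r
# Fieller-type confidence interval for mean(x) / mean(y), two independent
# samples, assuming normality and a common variance (assumptions of this sketch).
fieller_ci <- function(x, y, alpha = 0.05) {
  n1 <- length(x); n2 <- length(y)
  m1 <- mean(x);   m2 <- mean(y)
  df <- n1 + n2 - 2
  s2 <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / df   # pooled variance
  tq <- qt(1 - alpha / 2, df)

  # The limits are the roots in rho of
  #   (m1 - rho * m2)^2 = tq^2 * s2 * (1 / n1 + rho^2 / n2)
  a  <- m2^2 - tq^2 * s2 / n2
  b  <- -2 * m1 * m2
  cc <- m1^2 - tq^2 * s2 / n1

  if (a <= 0) {
    # Denominator mean not significantly different from zero:
    # the confidence set is not a bounded interval.
    return(c(lower = NA_real_, estimate = m1 / m2, upper = NA_real_))
  }
  disc <- sqrt(b^2 - 4 * a * cc)
  c(lower = (-b - disc) / (2 * a),
    estimate = m1 / m2,
    upper = (-b + disc) / (2 * a))
}

# Made-up expression values of one gene in two groups
x <- c(10, 12, 11, 13, 12)
y <- c(5, 6, 5.5, 6.5, 6)
ci <- fieller_ci(x, y)
ci
```

Because the interval is built directly for the ratio, its limits can be compared with 1 without any transformation.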

The following compares a classical volcano plot, log2(Ratio) vs. -log10(p.value | t.test), with a modified volcano plot defined as $$\log_2(CL^{\text{overexpressed}}_{\text{lower}}) \ \text{or} \ -\log_2(CL^{\text{underexpressed}}_{\text{upper}}) = f(\log_2(\text{HL-estimator}))$$ produced by VolcanoCI with the following options for the Shipp dataset:

method = "nonpar",

alpha = 0.05,

col = "blue",

pch = 16,

xlab = "log2(Hodges-Lehmann Estimator)",

ylab = "(-)log(Lower(Upper) Limit | Over(Under)-expressed genes)",

main = "CI-FOLD PLOT ",

aid.lines = TRUE, # abline(c(0,1)) and abline(c(0,-1))

fold.arc = 4,

VIG.names = TRUE,

global = TRUE
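The coordinates of the modified plot defined above can be sketched as follows. The data frame `res` and its numbers are invented for illustration, and the code only mirrors the definition given here; it is not the package's implementation, and how the fold.arc value is compared with the distance is not assumed.

```r
# Hypothetical per-gene results on the ratio scale: Hodges-Lehmann estimate
# with lower and upper confidence limits (made-up numbers).
res <- data.frame(
  id    = c("g1", "g2", "g3", "g4"),
  hl    = c(8.0, 1.5, 0.10, 0.7),
  lower = c(4.5, 0.8, 0.05, 0.4),
  upper = c(14.0, 2.8, 0.20, 1.2)
)

over <- res$hl > 1   # over-expressed according to the point estimate

# x: log2 of the estimate
# y: log2(lower limit) for over-expressed genes,
#    -log2(upper limit) for under-expressed genes
res$x <- log2(res$hl)
res$y <- ifelse(over, log2(res$lower), -log2(res$upper))

# y > 0 means the interval excludes 1, i.e. the gene lies in quadrant I or II
res$significant <- res$y > 0

# Euclidean distance from the origin, as described for fold.arc
res$dist <- sqrt(res$x^2 + res$y^2)

res[, c("id", "x", "y", "significant", "dist")]
```

In this toy example g1 falls in the first quadrant and g3 in the second, while g2 and g4 have intervals that include 1 and therefore fall in the lower half of the plot.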

All genes that are more than 4-fold over-expressed, i.e. those in the first quadrant outside the sector, are highlighted with their IDs. In fact, the genes in the lower two quadrants are not of interest: the genes in the third quadrant are under-expressed genes with upper limits (for ratios) larger than one, and the genes in the fourth quadrant are over-expressed genes with lower limits (for ratios) smaller than one, which means they are not significantly differentially expressed. In practice, only the upper two quadrants need to be plotted, which can be done by setting the argument "global" of the CI.FOLD.plot function to FALSE. The figure is shown