A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Variance and Bias

B. M. Bolstad1, R. A. Irizarry2, M. Astrand3 and T. P. Speed4,5

1 Group in Biostatistics, University of California, Berkeley, CA 94720, USA, 2 Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA, 3 AstraZeneca R & D Mölndal, Sweden, 4 Department of Statistics, University of California, Berkeley, CA 94720, USA and 5 Division of Genetics and Bioinformatics, WEHI, Melbourne, Australia

ABSTRACT

Motivation: When running experiments that involve multiple high density oligonucleotide arrays, it is important to remove sources of variation between arrays of non-biological origin. Normalization is a process for reducing this variation. It is common to see non-linear relations between arrays and the standard normalization provided by Affymetrix does not perform well in these situations.

Results: We present three methods of performing normalization at the probe intensity level. These methods are called complete data methods because they make use of data from all arrays in an experiment to form the normalizing relation. These algorithms are compared to two methods that make use of a baseline array: a one number scaling based algorithm and a method that uses a non-linear normalizing relation by comparing the variability and bias of an expression measure. Two publicly available datasets are used to carry out the comparisons. The simplest and quickest complete data method is found to perform favorably.

Availability: Software implementing all three of the complete data normalization methods is available as part of the R package Affy, which is a part of the Bioconductor project.

Supplementary Information: Additional figures … bolstad/ …

INTRODUCTION

The high density oligonucleotide microarray technology, as provided by the Affymetrix GeneChip®, is being used in many areas of biomedical research. As described in Lipshutz et al. (1999) and Warrington et al. (2000), oligonucleotides of 25 base pairs in length are used to probe genes. There are two types of probes: reference probes that match a target sequence exactly, called the perfect match (PM), and partner probes which differ from the reference probes only by a single base in the center of the sequence. These are called the mismatch (MM) probes. Typically 16-20 of these probe pairs, each interrogating a different part of the sequence for a gene, make up what is known as a probeset. Some more recent arrays, such as the HG-U133 arrays, use as few as 11 probes in a probeset. The intensity information from the values of each of the probes in a probeset is combined to get an expression measure, for example, Average Difference (AvgDiff), the Model Based Expression Index (MBEI) of Li and Wong (2001), the MAS Statistical algorithm from Affymetrix (2001), and the Robust Multi-chip Average proposed in Irizarry et al. (2002).

The need for normalization arises naturally when dealing with experiments involving multiple arrays. There are two broad characterizations that could be used for the type of variation one might expect to see when comparing arrays: interesting variation and obscuring variation. We would classify biological differences, for example large differences in the expression level of particular genes between a diseased and a normal tissue source, as interesting variation. However, observed expression levels also include variation that is introduced during the process of carrying out the experiment, which could be classified as obscuring variation. Examples of this obscuring variation arise due to differences in sample preparation (for instance labeling differences), production of the arrays and the processing of the arrays (for instance scanner differences).

The purpose of normalization is to deal with this obscuring variation. A more complete discussion on the sources of this variation can be found in Hartemink et al. (2001).

Affymetrix has approached the normalization problem by proposing that intensities should be scaled so that each array has the same average value. The Affymetrix normalization is performed on expression summary values. This approach does not deal particularly well with cases where there are non-linear relationships between arrays. Approaches using non-linear smooth curves have been proposed in Schadt et al. (2001), Schadt et al. (2002) and Li and Wong (2001). Another approach is to transform the data so that the distribution of probe intensities is the same across a set of arrays. Sidorov et al. (2002) propose parametric and non-parametric methods to achieve this. All these approaches depend on the choice of a baseline array.

We propose three different methods of normalizing probe intensity level oligonucleotide data, none of which is dependent on the choice of a baseline array.
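For contrast with the baseline-free methods, the scaling idea attributed to Affymetrix above (give every array the same average value) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `scale_to_baseline`, the choice of the first array as baseline, the use of a plain untrimmed mean, and the toy intensities are all hypothetical and are not taken from the paper or from any Affymetrix software.

```python
from statistics import mean


def scale_to_baseline(arrays, baseline_index=0):
    """Rescale each array so that its mean equals the baseline array's mean.

    `arrays` is a list of intensity lists, one list per array.
    The baseline choice and the untrimmed mean are assumptions of this sketch.
    """
    target = mean(arrays[baseline_index])
    scaled = []
    for intensities in arrays:
        factor = target / mean(intensities)  # a single number per array
        scaled.append([factor * x for x in intensities])
    return scaled


# Toy data (hypothetical): three arrays that differ only in overall brightness.
raw = [
    [100.0, 200.0, 300.0],   # baseline array
    [50.0, 100.0, 150.0],    # dimmer array
    [400.0, 800.0, 1200.0],  # brighter array
]
scaled = scale_to_baseline(raw)
print([mean(a) for a in scaled])
```

Because each array is multiplied by one number, the adjustment is linear in the intensities, which is consistent with the remark above that scaling does not handle non-linear relationships between arrays well.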

Normalization is carried out at probe level for all the probes on an array. Typically we do not treat PM and MM separately, but instead consider them all as intensities that need to be normalized. The normalization methods do not account for saturation. We consider this a separate problem to be dealt with in a different way.

In this paper, we compare the performance of our three proposed complete data methods. These methods are then compared with two methods making use of a baseline array. The first method, which we shall refer to as the scaling method, mimics the Affymetrix approach. The second method, which we call the non-linear method, mimics the approaches of Schadt et al. Our assessment of the normalization procedures is based on empirical results demonstrating ability to reduce variance without increasing bias.

NORMALIZATION ALGORITHMS

Complete data methods

The complete data methods combine information from all arrays to form the normalization relation.

The first two methods, cyclic loess and contrast, are extensions of accepted normalization methods that have been used successfully with cDNA microarray data. The third method, based on quantiles, is both quicker and simpler than those methods.

Cyclic loess

This approach is based upon the idea of the $M$ versus $A$ plot, where $M$ is the difference in log expression values and $A$ is the average of the log expression values, presented in Dudoit et al. (2002). However, rather than being applied to two color channels on the same array, as is done in the cDNA case, it is applied to probe intensities from two arrays at a time. An $M$ versus $A$ plot for normalized data should show a point cloud scattered about the $M = 0$ axis.

For any two arrays $i, j$ with probe intensities $x_{ki}$ and $x_{kj}$, where $k = 1, \ldots, p$ represents the probe, we calculate $M_k = \log_2(x_{ki}/x_{kj})$ and $A_k = \frac{1}{2}\log_2(x_{ki} x_{kj})$. A normalization curve is fitted to this $M$ versus $A$ plot using loess.
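The $M$ and $A$ definitions above can be made concrete with a small sketch. It only computes the transform and its algebraic inverse for one pair of arrays; it does not fit a loess curve. The function names `ma_transform` and `ma_inverse` and the toy intensities are hypothetical and chosen for illustration.

```python
import math


def ma_transform(x_i, x_j):
    """Return (M, A) for two arrays of positive probe intensities.

    M_k = log2(x_ki / x_kj) and A_k = 0.5 * log2(x_ki * x_kj).
    """
    m = [math.log2(xi / xj) for xi, xj in zip(x_i, x_j)]
    a = [0.5 * math.log2(xi * xj) for xi, xj in zip(x_i, x_j)]
    return m, a


def ma_inverse(m, a):
    """Invert the transform: log2(x_ki) = A_k + M_k/2, log2(x_kj) = A_k - M_k/2."""
    x_i = [2 ** (ak + mk / 2) for mk, ak in zip(m, a)]
    x_j = [2 ** (ak - mk / 2) for mk, ak in zip(m, a)]
    return x_i, x_j


# Toy intensities (hypothetical) for probes k = 1..3 on arrays i and j.
x_i = [8.0, 16.0, 64.0]
x_j = [2.0, 16.0, 16.0]

m, a = ma_transform(x_i, x_j)
back_i, back_j = ma_inverse(m, a)
print(m, a)
```

A probe with equal intensity on both arrays has $M_k = 0$, which is why a well normalized pair is expected to scatter about the $M = 0$ axis.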

Loess is a method of local regression (see Cleveland and Devlin (1988) for details). The fits based on the normalization curve are $\hat{M}_k$ and thus the normalization adjustment is $M'_k = M_k - \hat{M}_k$. Adjusted probe intensities are given by $x'_{ki} = 2^{A_k + M'_k/2}$ and $x'_{kj} = 2^{A_k - M'_k/2}$. Normalization curves may be computed using rank invariant sets of probes.

To deal with more than two arrays, the method is extended to look at all distinct pairwise combinations. The normalizations are carried out in a pairwise manner, recording an adjustment for each of the two arrays in each pair. After looking at all pairs of arrays, we have a set of adjustments which may be applied to the set of arrays. This is applied and then we repeat the process. Typically, only 1 or 2 complete iterations through all pairwise combinations are needed to achieve useful results. However, because this method works in a pairwise manner, it is somewhat time consuming.

Contrast based method

The contrast based method is another extension of the $M$ versus $A$ method.

Full details can be found in Astrand (2001). The normalization is carried out by placing the data on a log-scale and transforming the basis. In the transformed basis, a series of $n - 1$ normalizing curves are fit in a similar manner to the $M$ versus $A$ approach of the cyclic loess method. The data is then adjusted by using a smooth transformation which adjusts the normalization curve so that it lies along the horizontal. Data in the normalized state is obtained by transforming back to the original basis and exponentiating. The contrast based method is faster than the cyclic method. However, the computation of the loess smoothers is still somewhat time consuming.

Quantile normalization

The goal of the quantile method is to make the distribution of probe intensities for each array in a set of arrays the same. The method is motivated by the idea that a quantile-quantile plot shows that the distribution of two data vectors is the same if the plot is a straight diagonal line and not the same if it is other than a diagonal line.

This concept is extended to $n$ dimensions so that if all $n$ data vectors have the same distribution, then plotting the quantiles in $n$ dimensions gives a straight line along the line given by the unit vector $\left(\frac{1}{\sqrt{n}}, \ldots, \frac{1}{\sqrt{n}}\right)$. This suggests we could make a set of data have the same distribution if we project the points of our $n$ dimensional quantile plot onto the diagonal.

Let $q_k = (q_{k1}, \ldots, q_{kn})$ for $k = 1, \ldots, p$ be the vector of the $k$th quantiles for all $n$ arrays and $d = \left(\frac{1}{\sqrt{n}}, \ldots, \frac{1}{\sqrt{n}}\right)$ be the unit diagonal. To transform from the quantiles so that they all lie along the diagonal, consider the projection of $q$ onto $d$:

$$\mathrm{proj}_d\, q_k = \left(\frac{1}{n}\sum_{j=1}^{n} q_{kj}, \ldots, \frac{1}{n}\sum_{j=1}^{n} q_{kj}\right)$$

This implies that we can give each array the same distribution by taking the mean quantile and substituting it as the value of the data item in the original dataset. This motivates the following algorithm for normalizing a set of data vectors by giving them the same distribution:

1. given $n$ arrays of length $p$, form $X$ of dimension $p \times n$ where each array is a column;
2. sort each column of $X$ to give $X_{\mathrm{sort}}$;
3. take the means across rows of $X_{\mathrm{sort}}$ and assign this mean to each element in the row to get $X'_{\mathrm{sort}}$;
4. get $X_{\mathrm{normalized}}$ by rearranging each column of $X'_{\mathrm{sort}}$ to have the same ordering as original $X$.

The quantile normalization method is a specific case of the transformation $x'_i = F^{-1}(G(x_i))$, where we estimate $G$ by the empirical distribution of each array and $F$ using the empirical distribution of the averaged sample quantiles.
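The four steps of the quantile algorithm above can be sketched in plain Python. This is a minimal illustration, not the implementation in the Affy package: the function name `quantile_normalize`, the list-of-columns layout, the handling of ties (broken by original position) and the toy data are assumptions of this sketch.

```python
def quantile_normalize(columns):
    """Quantile-normalize n arrays given as a list of n equal-length lists.

    Each inner list is one column of X (one array of p probe intensities).
    Ties within a column are broken by position, an assumption of this sketch.
    """
    n = len(columns)
    p = len(columns[0])
    if any(len(col) != p for col in columns):
        raise ValueError("all arrays must have the same length p")

    # Step 2: sort each column, remembering where each sorted value came from.
    orders = [sorted(range(p), key=col.__getitem__) for col in columns]

    # Step 3: mean across each row of X_sort, i.e. the mean kth quantile.
    row_means = [
        sum(columns[j][orders[j][k]] for j in range(n)) / n for k in range(p)
    ]

    # Step 4: put the mean kth quantile back where the kth smallest value was.
    normalized = [[0.0] * p for _ in range(n)]
    for j in range(n):
        for k, original_position in enumerate(orders[j]):
            normalized[j][original_position] = row_means[k]
    return normalized


# Toy data (hypothetical): n = 3 arrays with p = 4 probes each.
X = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 6.0, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
X_normalized = quantile_normalize(X)
for column in X_normalized:
    print(column)
```

After normalization every array contains exactly the same set of values (the row means of the sorted data), arranged according to the rank order of the original array, which is the sense in which all arrays are given the same distribution.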

