Data Normalization

Data normalization is used to remove systematic effects.
The IVTT control spots carry chip-, sample- and batch-level systematic effects, as well as the background reactivity of antibodies to the IVTT system itself. Normalizing against them therefore expresses specific antibody binding relative to the non-specific antibody binding to the IVTT controls (a.k.a. background).

[downloadable file here]

Some terminology

Background. Strictly speaking, this is the local (auto)fluorescence intensity immediately surrounding the spot, which is not part of the spot itself.

IVTT. In vitro transcription/translation. This is the E. coli-based cell lysate used for expression of the different proteins from T7 expression plasmids.

IVTT spots. These are the protein spots printed from IVTT reactions containing the expression plasmids. They are only called ‘antigens’ when reactive with antibody. The data from IVTT spots are sometimes called ‘IVTT foreground’ or simply ‘foreground’. 

IVTT control spots. Sometimes called ‘no DNA controls’, ‘IVTT background’ or simply ‘background’. These are the control spots on the array printed from IVTT reactions lacking a T7 plasmid as template for expression.
[Note: Although E. coli lysate is used to block antibodies to E. coli in human serum, some residual antibody to E. coli always remains and will react uniformly with every IVTT spot on the array. Unfortunately, this IVTT background value varies between individuals. Thus, before (raw) data for different individuals can be compared, the data for each sample need to be normalized using the median of the sample-specific IVTT control spots.]

Features. An alternative term for IVTT spots, particularly in eukaryotic proteome arrays. [Note: In the old days we used the term ‘ORFs’ (open reading frames) for the different IVTT spots. Recall that genomic DNA is used to amplify and clone individual ORFs for expression and printing on arrays. Arrays of prokaryotic organisms and viruses therefore do display the expression products of full-length ORFs. However, in some cases where the ORF is too long to be amplified by PCR, it is made instead as a set of overlapping ORF fragments. Moreover, for eukaryotes, most of the spots correspond to individual exons and only some (those without introns) are full-length. The term ‘ORF’ therefore fell into disuse.]

Background subtraction
The first step in any data normalization is to correct the spot intensities by subtracting the local background. Usually the quantification software will take care of this for you. For example, in the ScanArrayExpress output .csv file, the column of data headed Median-B is the median pixel intensity of the spot minus the local background.

log2 fold-over control (FOC) normalization
One method to normalize is to express the data as a ratio to the IVTT control. Log transformation is also applied to make the data less skewed and to prepare them for statistical analysis. To do this, first set a floor (a value in the range 1-100) to remove any zeros and negative values. Then calculate the median of the IVTT control spots for each sample. You can then normalize in one of two ways: either divide the foreground values by the sample-specific median of the IVTT control spots and take the base-2 logarithm (log2) of the ratio [i.e., log2(foreground over background)], or log2-transform the foreground values and subtract the log2-transformed median of the IVTT control spots [i.e., log2(foreground) minus log2(background)]. In the latter, since the subtraction is performed in log space, this is in effect also a fold-over control normalization. Either way, the end result is the same: a value of 0.0 means that the intensity is no different from the IVTT control background, and a value of 1.0 indicates a doubling with respect to the background.
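
The two routes described above can be sketched in Python as follows. This is a minimal illustration rather than the pipeline actually used here: the function names, the default floor of 1.0 (any value in the 1-100 range would do) and the example intensities are all assumptions made for the sketch.

```python
import math
from statistics import median


def log2_foc_ratio(foreground, controls, floor=1.0):
    """Route 1: divide by the control median, then take log2 of the ratio."""
    # Raise zeros and negative values to the floor so the logarithm is defined.
    fg = [max(v, floor) for v in foreground]
    ctrl = [max(v, floor) for v in controls]
    bg = median(ctrl)  # sample-specific median of the IVTT control spots
    return [math.log2(v / bg) for v in fg]


def log2_foc_difference(foreground, controls, floor=1.0):
    """Route 2: log2-transform first, then subtract log2 of the control median."""
    fg = [max(v, floor) for v in foreground]
    ctrl = [max(v, floor) for v in controls]
    log_bg = math.log2(median(ctrl))
    return [math.log2(v) - log_bg for v in fg]


# Hypothetical background-subtracted intensities for a single sample.
foreground = [500.0, 1000.0, 2000.0, -20.0]
controls = [900.0, 1000.0, 1100.0]

ratio = log2_foc_ratio(foreground, controls)
diff = log2_foc_difference(foreground, controls)
```

With a control median of 1000, the spot at 1000 normalizes to 0.0 and the spot at 2000 to 1.0, and both routes agree up to floating-point rounding. The negative value is raised to the floor before the logarithm is taken.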

Subtraction normalization
The other normalization method is to subtract the median of the sample-specific IVTT control spots from the foreground values. [expand].
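
As a minimal sketch of subtraction normalization (the function name and the example intensities are hypothetical, and no flooring or log transformation is assumed here):

```python
from statistics import median


def subtract_control_median(foreground, controls):
    """Subtract the sample-specific IVTT control median from each foreground value."""
    bg = median(controls)
    return [v - bg for v in foreground]


# Hypothetical background-subtracted intensities for a single sample.
foreground = [500.0, 1000.0, 2500.0]
controls = [900.0, 1000.0, 1100.0]

normalized = subtract_control_median(foreground, controls)
```

In this scheme a value of 0 means the spot is no different from the IVTT control background, and the result stays in raw intensity units instead of being expressed as a fold change.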

Seropositivity thresholds
Once the data have been normalized, calculation of a seropositivity threshold or cutoff allows antigens to be classified as seroreactive or non-seroreactive. This is a useful filtering step for reducing the size of data sets. Some argue that statistical comparisons between groups should be applied only to seroreactive antigens rather than to all the antigens, since many antigens with very low signal intensities may spuriously appear significant when two groups are compared statistically. There are several approaches that can be used to define seropositivity. One is to use two or three standard deviations (SD) above the mean of the IVTT controls. The cutoff is first calculated from the raw data using the mean+2SD (or mean+3SD) of the IVTT controls from all the samples probed, and then divided by that mean to express it as a FOC. A more conservative cutoff is twice the IVTT background, i.e., a log2 FOC-normalized signal of 1.0. If a control population is available (such as naives, healthy US controls, unvaccinated, etc.), a third method is to set a cutoff for each IVTT spot of two or three standard deviations above the mean signal for that spot in the control population.
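
The first and third cutoffs can be sketched as below. The function names and example values are hypothetical, and the use of the sample standard deviation (statistics.stdev) is an assumption, since the text does not specify which estimator is intended.

```python
from statistics import mean, stdev


def control_cutoff_foc(control_values, n_sd=2):
    """(mean + n_sd * SD) of the raw IVTT control values, divided by their mean.

    The result is a fold-over-control ratio; taking log2 of it would be needed
    before comparing it with log2-normalized data (an assumption, not stated above).
    """
    m = mean(control_values)
    return (m + n_sd * stdev(control_values)) / m


def per_spot_cutoffs(control_population, n_sd=2):
    """Cutoff for each spot: mean + n_sd * SD across the control individuals.

    control_population maps a spot identifier to that spot's normalized values,
    one value per control individual.
    """
    return {
        spot: mean(values) + n_sd * stdev(values)
        for spot, values in control_population.items()
    }


# Hypothetical raw IVTT control intensities pooled over all samples probed.
cutoff_foc = control_cutoff_foc([800.0, 1000.0, 1200.0], n_sd=2)

# Hypothetical normalized signals for one spot in three control individuals.
spot_cutoffs = per_spot_cutoffs({"spot_a": [0.0, 0.5, 1.0]}, n_sd=2)
```

In the example, the pooled controls have a mean of 1000 and an SD of 200, so the mean+2SD cutoff corresponds to a FOC of 1.4.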

An additional criterion for seropositivity might be reactivity above the cutoff in at least 10% of the study population. This prevents antigens that are recognized particularly strongly in one or two individuals from skewing the results.
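
A prevalence filter of this kind might look like the following sketch. The function name, the data layout and the example values are hypothetical; the default cutoff of 1.0 and the 10% minimum fraction follow the figures given in the text.

```python
def seropositive_spots(normalized, cutoff=1.0, min_fraction=0.10):
    """Return the spots whose signal exceeds the cutoff in enough samples.

    normalized maps a spot identifier to its normalized values, one per sample.
    """
    kept = []
    for spot, values in normalized.items():
        fraction = sum(v > cutoff for v in values) / len(values)
        if fraction >= min_fraction:
            kept.append(spot)
    return kept


# Hypothetical log2 FOC values for three spots across 20 samples.
normalized = {
    "spot_a": [3.5] + [0.1] * 19,       # very strong, but in one individual only
    "spot_b": [1.5] * 4 + [0.2] * 16,   # above the cutoff in 20% of samples
    "spot_c": [0.0] * 20,               # never above the cutoff
}

reactive = seropositive_spots(normalized)
```

Here spot_a is dropped despite its single strong signal (1 of 20 samples is 5%), while spot_b is retained.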

Distribution plots
Distribution plots for normalization of the data are included in Supplementary Figure 7.