Academia.eduAcademia.edu
Journal of Magnetic Resonance 202 (2010) 190–202 Contents lists available at ScienceDirect Journal of Magnetic Resonance journal homepage: www.elsevier.com/locate/jmr icoshift: A versatile tool for the rapid alignment of 1D NMR spectra F. Savorani, G. Tomasi, S.B. Engelsen * Quality & Technology, Department of Food Science, Faculty of Life Sciences, University of Copenhagen, Rolighedsvej 30, 1958 Frederiksberg C, Denmark a r t i c l e i n f o Article history: Received 25 August 2009 Revised 3 November 2009 Available online 18 November 2009 Keywords: Preprocessing NMR Alignment Metabonomics Chemometrics Algorithm a b s t r a c t The increasing scientific and industrial interest towards metabonomics takes advantage from the high qualitative and quantitative information level of nuclear magnetic resonance (NMR) spectroscopy. However, several chemical and physical factors can affect the absolute and the relative position of an NMR signal and it is not always possible or desirable to eliminate these effects a priori. To remove misalignment of NMR signals a posteriori, several algorithms have been proposed in the literature. The icoshift program presented here is an open source and highly efficient program designed for solving signal alignment problems in metabonomic NMR data analysis. The icoshift algorithm is based on correlation shifting of spectral intervals and employs an FFT engine that aligns all spectra simultaneously. The algorithm is demonstrated to be faster than similar methods found in the literature making full-resolution alignment of large datasets feasible and thus avoiding down-sampling steps such as binning. The algorithm uses missing values as a filling alternative in order to avoid spectral artifacts at the segment boundaries. The algorithm is made open source and the Matlab code including documentation can be downloaded from www.models.life.ku.dk. Ó 2009 Elsevier Inc. All rights reserved. 1. Introduction In recent years a constantly increasing interest has been devoted towards the – omics sciences in which Nuclear Magnetic Resonance Spectroscopy (NMR) plays a central role since it is able to provide, in a short time, a reliable and unique metabolic fingerprint of complex chemical and/or biological matrices. However, the large amount and the high complexity of the acquired data make it challenging to disentangle the sought meaningful information, and a major effort has been made lately for providing the researchers with mathematical and statistical tools able to cope, in a reasonable time, with such overwhelming information. Multivariate data analysis methods have been developed and tailored for dealing with this new complex problem, providing a step forward into data handling and interpretation of metabolomic and metabonomic studies which are becoming more and more common. Multivariate data-mining in spectroscopy has a long history in near infrared spectroscopy but has a weaker tradition in nuclear magnetic resonance spectroscopy. While the first application of multivariate chemometric analysis to NMR spectra appeared in the early eighties [1], it was not until the early nineties that the field of metabonomics emerged and the highly complex metabolic fingerprints in NMR spectra of body fluids made the need for powerful multivariate * Corresponding author. Fax: +45 3533 3245. E-mail addresses: frsa@life.ku.dk (F. Savorani), gto@life.ku.dk (G. Tomasi), se@life.ku.dk (S.B. Engelsen). 1090-7807/$ - see front matter Ó 2009 Elsevier Inc. All rights reserved. doi:10.1016/j.jmr.2009.11.012 data analysis obvious [2]. Chemometrics is now rapidly gaining momentum in the analysis of NMR spectra [3]. However, NMR data are not always readily suitable for the analysis by so-called bilinear chemometric methods. NMR spectra are generally a sequence of response signals, closely spaced Lorentzian peaks, differing in shape, intensity and position. While all these entities carry important information, changes in peak positions of the same analyte signal between samples does not conform to the bilinearity assumptions and thus deteriorate the chemometric modeling. In algebraic terms, bilinearity requires that each column (if samples are stored in the rows of a matrix) contain information about a signal originating from an identical common compound along all the samples in order the statistical approach to be able to work properly. However, especially due to small pH changes and intermolecular interactions, this is rarely the case with biological samples, wherefore it is imperative for multivariate exploratory metabonomics investigations aimed at biomarker profiling or pattern recognition studies, that the data be aligned before chemometric analysis. Of course, any effort should be made prior to and during the NMR analysis for assuring that the samples are being collected and prepared as homogeneously as possible (preparation protocols) and that the instrumental conditions and parameters are identical [4]. Despite standardization, spectral misalignments still occur and this is the reason why a posteriori aligning methods are required. The aim of this study is to provide the NMR spectroscopist with a simple, versatile and efficient algorithm for performing a posteriori alignment which uses the newest and fastest algorithmic techniques. F. Savorani et al. / Journal of Magnetic Resonance 202 (2010) 190–202 Different approaches have been explored for solving the alignment problem and have supplied appropriate solutions for many different experimental cases. The historical bases of such an approach relies on a very simple, but still largely used method, binning or bucketing, which involves a data reduction, performed through NMR signals integration within standardized spectral regions whose width commonly ranges between 0.01 and 0.05 ppm (bins or buckets) [5]. The major drawback of this solution is the loss of spectral resolution. When high resolution is required, other and more sophisticated alignment methods need to be considered such as dynamic time warping (DTW) or correlation optimized warping (COW) [6–8], which have been demonstrated to be effective on chromatographic data and also have been employed for solving simple NMR alignments with satisfactory results [9]. Apart from being computationally intensive, the main problem of these two approaches is that alignment is obtained by local stretching or compression. This is not really suitable for NMR signals because this model for correction works best when there is a positive correlation between peak width and shift. Another important class of alignment methods finds all the relevant peaks present in an NMR spectrum. This converts spectra into a list of peaks and relative attributes thereby dramatically decreasing the dimensions of the dataset. Torgrip et al. have introduced and developed these methods [10–12]. The major drawbacks of these methods are the elimination of the information carried by the fine structure of the signal shape and the need to define some meta-parameters for the peak-picking procedure and subsequent alignment. An alternative, more advanced, approach concerns designing algorithms able to perform an automatic NMR peak alignment with no or limited user intervention, trying meanwhile to keep all the relevant spectral information. The first attempts to develop such an algorithm involved the application of a genetic algorithm to align segments of spectra [13,14], the application of partial linear fit to align the spectra [15] and a search for the misaligned spectral region by a PCA based algorithm followed by a rigid shift [16]. None of these methods have been broadly adopted due to lack of alignment performance and/or high computational costs. Wong et al. [17] solved the problem of the computational inefficiency by employing a Fast Fourier Transformation (FFT) correlation engine to boost the algorithmic speed (PAFFT) and at the same time introduced the use of regular spectral intervals to be individually and rigidly aligned. Veskelov et al. [18] combined the properties of the peak picking methods with the FTT and interval features of PAFFT. The result is a rapid fully automated aligning process, able to recursively split the NMR spectra in meaningful intervening regions and to align their signal until a certain degree of goodness is reached. Similarly, a recently published method using a fuzzy Hough transform [12] is able to perform an automatic full spectral alignment. Although promising, these methods rely on a peak picking algorithm which need the user to set some meta-parameters which can dramatically affect the final result and especially those based on the Hough transform are computationally very expensive. The trend towards interval based algorithms represents a key step forward in the research of a definitive solution to the alignment problem. As a characteristic, the chemical shift of each NMR signal (or pattern of them) depends on several factors and can independently change in any direction. Peaks that are adjacent or even overlapped in one spectrum can be baseline separated in other spectra and, as an extreme consequence, their relative position inverted. When dealing with the spectral features it is therefore preferable to reduce the global problem to smaller localized ones that can be found in specific spectral intervals. In this way, the shift occurring in opposite directions can be easily solved and eventually a global, full resolution aligned NMR dataset can be reconstructed. In practice, it is often only a small percentage of 191 the total spectral width that suffers for misalignment problems and, besides increasing the chances of bad corrections, it is clearly inefficient to align what it is already aligned. An automatic search for the meaningful regions of intervention can be a desirable option for an aligning method, but what it is apparent for the average NMR spectroscopist can be very difficult (and time consuming) to implement algorithmically. In effect, it is rarely worth the effort of optimizing several meta-parameters in order to make a method work compared to manually select the regions of intervention. Fully automated procedures also normally do not allow user intervention and, if the achieved result is not satisfying, there are no alternatives. The new algorithm presented in this study, icoshift, has been designed to provide the user with a versatile open source tool for spectral alignment which is based on rigid shift of intervals and which uses the FFT to boost the simultaneous alignment of all spectra in a dataset. icoshift combines a rapid optimized FFT engine, which brings the calculation times for large metabonomic datasets from hours (e.g., with COW), or minutes [18] to seconds, with optional interactive facilities for interval definitions and for alignment automation. It introduces the use of missing values for solving the interpolation problems that still represent an open issue for other algorithms. Plot and interactive facilities are also provided for the user to be able to immediately evaluate the achieved result. Applications of icoshift to solve different misalignment problems in real experimental NMR datasets will be illustrated in this paper. The MatlabÒ (2008b, The Mathworks Inc., Natick, MA, USA) code of icoshift, including a help section and a demo, is freely available for download from www.models.life.ku.dk. 2. Results and discussion 2.1. The icoshift algorithm The icoshift algorithm, namely interval-correlation-shifting, which derives its name form the basic coshift algorithm [19], independently aligns each NMR signal to a target (which can optionally be an actual signal or a synthetic one like the average, or the median) by maximizing the cross-correlation between user-defined intervals. Fig. 1 shows an overview of icoshift results when applied to a misaligned set of human urine NMR spectra zoomed into a strongly misaligned region. The algorithm is comprised of three essential parts: (1) interval definition, (2) maximization of the cross-correlation of each interval by an FFT engine and (3) signal reconstruction (Fig. 2). For the three parts, several options are available depending on how the alignment problem is to be solved. Therefore, it is possible to define intervals automatically (e.g., allowing a full spectrum alignment with regularly spaced intervals, or adjacent intervals of user-defined length, or customized interval boundaries), set a boundary for the maximum local correction allowed for each interval and define the fill-in value for the reconstruction part (viz., a missing value or the first/last point in the segment). The basic principle of icoshift is quite similar to other published methods for the alignment of spectral and chromatographic signals: Peak Alignment by FFT (PAFFT [17]), Recursive Peak Alignment by FFT (RAFFT [17]) and Recursive Segment-wise Peak Alignment (RSPA [18]) and can in fact be used as a computation engine for these methods as it is efficient and numerically sound. The algorithm was designed having in mind the wide range of misalignment problems that can be encountered in NMR spectroscopy (e.g., it allows the registration to a reference signal and completely customized-interval definitions) and can help reduce the emergence of artifacts related to the insertion/deletion model used for the alignment (namely, by inserting missing values instead of 192 F. Savorani et al. / Journal of Magnetic Resonance 202 (2010) 190–202 Fig. 1. Overview of the algorithm results. icoshift provides aligned NMR datasets working on user-defined intervals. Fig. 2. Simplified flow chart of the algorithm. X is the dataset to be aligned. The ns intervals, whose optimal lag is found through the FFT engine, have form ½ai ; bi  for i ¼ 1; . . . ; ns . and X(t) is the target. The Seg variable defines which type of alignment is used. N is the signal length. repeating the value on the boundary). However, like the majority of current alignment methods, it cannot correct for a change in the order of peaks. The details of the implementation of icoshift are given in the experimental section. Three experimental NMR datasets, covering a multiplicity of alignment problems to be solved, were used to assess and illustrate the performance of the icoshift algorithm. The achieved results are presented and discussed case by case. The figures presented are based on the actual plot facilities of the icoshift program. 2.2. Case 1: the wine data The wine dataset consists of 40 spectra of different table wines [9] in which the spectra were processed and reduced to 8712 data points covering a 5.5 ppm spectral region (from 6.00 to 0.50 ppm). The wine samples included red, white and rosé wines that were not buffered and/or pH adjusted before NMR analysis. This introduced a broad range of shifts of all the pH dependent signals found in the organic acids/amino acids region. Simple alignment according to the TSP (3-(trimethylsilyl)-propionic acid-d4) reference signal cannot correct for the pH dependent shifts nor for the dominant ethanol signals. This can be observed in Fig. 3a for example by inspection of ethanol’s 13C satellites (signals No. 10 , 100 , 80 and 800 ). In the original publication Larsen et al. [9] used COW [20] for solving the misalignment problem, but since COW was only able to correct for the shift of the dominant ethanol signals, it was necessary to use an ad hoc multistep interval procedure, utilizing coshift, for improving the alignment of a much smaller signal such F. Savorani et al. / Journal of Magnetic Resonance 202 (2010) 190–202 193 Fig. 3. Comparison between raw (a) and icoshift aligned (b) NMR spectra for the table wine dataset. A regular splitting into 50 intervals was performed by the algorithm and the full alignment process took less than 1 s. The main resonance peaks are: 1. ethanol (–CH2–); 10 . and 100 . 13C satellites of peak 1; 2. glycerol; 3. methanol; 4. malic acid; 5. succinic acid; 6. acetic acid; 7. lactic acid; 8. ethanol (–CH3); 80 . and 800 . 13C satellites of peak 8. as the lactic acid one (signal No. 7 in Fig. 3). The subsequent multivariate quantitative modeling by iPLS (interval partial least squares) regression [21] showed greatly improved results making the authors able to distinguish among the different types of wine and to obtain reliable calibration models for the prediction of some indigenous wine metabolites. Just like the iPLS algorithm, the icoshift algorithm can split the dataset into a user-defined number of intervals (or into regular spaced intervals). Fig. 3 shows the 40 superimposed wine spectra before and after the icoshift correction using regular intervals. Evidently, the spectral alignment problems are corrected for both the dominant and the smaller peaks. The alignment of the spectra is produced by the icoshift algorithm by the Matlab command line #1 in Table 1, which applies icoshift to the spectra in WineData using the average spectrum as a reference spectrum and dividing the spectra up into 50 regularly spaced intervals. The result is shown in Fig. 3b with the 50 regular intervals marked by straight vertical lines. For this dataset, the icoshift alignment takes less than 1 s and the AlignedWineData is ready for subsequent multivariate data analysis. As apparent from the figure, almost all the pH dependent broadly misaligned signals are efficiently shift corrected. However, one signal remains evidently misaligned by this procedure (the lactic acid doublet centered at 3.95 ppm; #7), which illustrates one fundamental problem with unsupervised alignment algorithms: in this case, the spectral splitting into regular intervals left one part of the lactic acid signal in one interval and another part of the signal in the adjacent interval, which made it impossible for the algorithm to provide an effective correction for the shift. In icoshift, this problem can easily be overcome by using a different number of intervals. However, in many cases, a custom interval solution is preferable for optimal division of the sample spectra into baseline separated intervals. This can be done by providing the icoshift algorithm with a vector (‘‘custom_intervals” in Table 1) containing the boundaries of the desired intervals. The selected intervals do not necessarily have to be adjacent and can be of flexible size. The alignment of the WineData set to the average spectrum target was applied using the Matlab command line #2 in Table 1. The third element in the options parameter makes the icoshift algorithm performing an automatic Table 1 Simple command lines in Matlab for applying icoshift to the NMR datasets in the three cases. Command line #1 applies icoshift to the spectra in WineData using the average spectrum as a reference spectrum and dividing the spectra up into 50 regularly spaced intervals. Command line #2 applies icoshift to the WineData using the average spectrum as a reference spectrum, dividing the spectra up into intervals defined by the vector ‘‘custom_intervals” and using the x-axis definition (ppm) described in the vector ‘‘ppm_scale” for plotting. The fourth parameter ‘f’ is optional and makes the icoshift algorithm automatically determine the maximum allowed correction for each interval. Command line #3 applies icoshift to the spectra in NibathData using the average spectrum as a reference and the interval settings described by the ‘‘custom_intervals” vector. The automatic fast search for the suitable maximum allowed shift for each interval does not need to be specified since it is set as the default. Command line #4 applies icoshift to the PlasmaData using the average spectrum as a reference and using the signal contained in the interval described in ‘‘reference_region” as a guide for shifting the entire spectra. Command line #1 #2 #3 #4 AlignedWineData = icoshift(’average’, WineData, 50); AlignedWineData = icoshift(’average’, WineData, custom_intervals, ’f’, [2 0 1], ppm_scale); AlignedNibathData = icoshift(’average’, NibathData, custom_intervals); AlignedPlasmaData = icoshift(’average’, PlasmaData, reference_region); 194 F. Savorani et al. / Journal of Magnetic Resonance 202 (2010) 190–202 full-spectrum correlation-shift before the interval alignment. This additional step is an advantage (more in diminishing the risk of misalignment than computationally) in cases with large shifts as it allows the spectra to be roughly pre-aligned according to the dominant signals. The results of both alignment steps are shown in Fig. 4 for the challenging lactic acid region. The full-spectrum correlation-shifting step was effective in aligning the intense ethanol peaks as well as their ethanol 13C satellites (since they are shifted exactly like their main 1H signals) (Fig. 4b), the lactic acid doublet was not aligned because of its pH dependency. In order to align these signals the customized interval-correlation-shifting step was necessary for obtaining fully-aligned signals (Fig. 4c). Many efforts have been dedicated to optimize the algorithm speed and for this dataset the time of calculation for this two-step procedure is only about 0.2 s. On the contrary, computation time for COW is significantly higher, especially when the optimization step is taken into account. In terms of quality of the alignment, the results obtained with COW are acceptable, although the lactic acid region is not handled correctly (cf. Supplementary Fig. 2). Similar observations can be made about RSPA, which is also slower than icoshift because of the computation overhead related to the peak picking and the recursion. RSPA results are comparable with icoshift, but not quite as good for the lactic acid region. However, this might depend on the choice of the meta-parameters, and other values might lead to improved results. The optimization of these parameters and a detailed comparison between icoshift and RSPA is beyond the scope of this paper and is left for future research. A simple but illustrative way to demonstrate the successful alignment of the spectra is to compare the performances of multivariate PLS regression models to the content of chemical constituents. In this case, PLS prediction of lactic acid was calculated using mean-centered data and repeated-random cross-validation for determining the number of significant PLS components. Fig. 5 shows a comparison between the calibration curves obtained using either the unaligned raw spectra (Fig. 5a) or the icoshift-aligned spectra (Fig. 5b). Clearly, the signal alignment has improved the PLS model significantly by reducing its complexity and obtaining a better correlation between measured and predicted values of lactic acid content. A similar, but more dramatic improvement is obtained on the PLS model built for the ethanol calibration. In this case the Root Mean Square Error of Cross-Validation (RMSECV) lowered from 0.351 to 0.177 and the R2 improved from 0.79 to 0.95 (Supplementary Figs. 1a and b). It is important to bear in mind that spectral alignment also can be a destructive process as it can remove useful physical information related to the signal shifts in the spectra. This can be demonstrated by the ability of a full-spectrum PCA model (mean-centered data) to differentiate among the three different types of wine, which is clearly lost after the alignment process (compare Fig. 5c and d). The shifts induced by different acidity (pH) of the three kind of wine, carrying the information about their nature, are removed after the alignment. This is a fundamental characteristic of any spectral alignment process that always has to be taken into account. However, it is possible to preserve the information about the shift corrections for all samples and intervals using an optional output (‘‘ind”) which collects it into a table. If desired, this table can be appended to the NMR dataset for a multivariate data analysis that preserves and isolates the shift information. The wine data demonstrates how spectral alignment can largely be performed in a fully automated operator-blinded mode (unsupervised) but also may lead to erroneous results and/ or to deterioration of the information sought. In icoshift, these problems have been solved by offering an interactive mode that allows the user to define the intervals that are to be aligned leaving the rest unaffected. The use of the customized-interval definition has the benefit that user-written algorithms for automatic interval definition [18] can be directly interfaced to the icoshift program. 2.3. Case 2: the nickel-bath data Fig. 4. Customized interval alignment of the wine dataset zoomed into the lactic acid NMR spectral region. The figure clearly illustrates the two steps of the alignment, starting from the raw data (a), passing through the full-spectrum correlation-shifted spectra (b) and obtaining the customized interval-correlationshifted spectra (c). The intervals selected for the icoshift are highlighted with a background color. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this paper.) In order to test the speed and performance of the icoshift algorithm under more realistic metabonomic conditions, an unusually large NMR dataset (460  40,905) was investigated. The nickelbath dataset is part of a study aimed at following the evolution of organic additives in an electroplating nickel bath (unpublished data). The superimposed raw spectra (Fig. 7) show strong and broad signal misalignments along the whole spectral region. Succinic acid was added as an internal standard and its largely misaligned singlet can be found at approximately 2.5 ppm in the aliphatic region. The aqueous solutions were not buffered prior to NMR analysis causing also the lactic acid resonance peak to move within an almost 0.3 ppm wide region which is unusually broad for normal NMR datasets. Moreover, significant misalignments are also present in the aromatic region for the signals of some additives that were of great interest for the study of these nickel-bath samples. This dataset is particularly challenging because the multiple signal shifts present are of very different magnitude depending on the pH susceptibility of the molecules. This feature represents an insurmountable problem for the alignment algorithms based on FFT cross-correlation published so far [17,18,20] because they use a predetermined and invariable allowed maximum shift for the spectra to be aligned. The maximum allowed shift is thus identical for each interval in order to prevent excessive shifting and misalignment on a local scale. However, in practice, since the intervals can be of different size, the required maximum allowed shift for one interval can be wider than the size of another interval. This will seriously limit the algorithm’s flexibility and hamper its ability to achieve satisfactory results. In order to overcome this limitation, the icoshift algorithm is implemented with an (optional) automatic procedure for finding the maximum F. Savorani et al. / Journal of Magnetic Resonance 202 (2010) 190–202 195 Fig. 5. PLS performances and PCA score plots for the wine dataset. Subplots (a) and (b) show the full-random cross-validated PLS calibration curves for the content of lactic acid obtained before (a, RMSECV = 0.137 R2 = 0.93) and after (b, RMSECV = 0.104 R2 = 0.96) the icoshift alignment when only the NMR region containing the lactic acid doublet (1.45–1.35 ppm) is taken into consideration. Subplots (c) and (d) show the PCA score plots calculated on the whole NMR region before (c) and after (d) the icoshift alignment. allowed shift of each interval. The results of the icoshift alignment to the average spectrum for the Ni-bath dataset (NibathData in line #3 of Table 1) using custom intervals are illustrated in Fig. 6. Despite the large dimension of the dataset, the icoshift alignment was completed in only 4.74 s. After the icoshift alignment all the selected intervals appear fairly well aligned and in particular interval No. 11, containing the singlet peak of the succinic acid, is now well aligned (Fig. 6c). Such a broad misalignment could not be optimally corrected for using other aligning methods such as COW [20] and RSPA [18] (Fig. 7). The poor alignment performance of COW and RSPA are related to the automated segmentation (both) and in the compression/expansion model for alignment (COW). The problem is that the succinic acid peak does not always end up in the same segment in the target spectrum and in the sample to be aligned. For COW, the problem cannot be solved in an efficient way just by using custom segment boundaries because the compression/expansion model seeks the alignment of the succinic acid interval by intervening on the previous intervals. Therefore, the user-defined segmentation would also affect the other intervals making the overall procedure lengthy, tedious and without a guaranteed success. The problem of RSPA is essentially that the interval boundaries for the succinic acid are very close to the peak. When the intervals are not matched between sample and target, they cannot be corrected because no peaks are present in the target to estimate the correct shift. 2.4. Case 3: the rat plasma data It is not always optimal to align a dataset by splitting it into smaller intervals since complex peak shape can be fundamental to the sought information. This is normally the case in typical metabonomic investigations of human (or animal) body biofluids (blood, urine, fecal liquid, etc.) in which the complexity of the pattern of resonances makes it difficult to define customized intervals without increasing the risk of deteriorating the relevant information. As a common practice metabonomic datasets are acquired under strictly defined experimental conditions, listed by a protocol, limiting as much as possible the risk of acquiring noisy spectra [22]. Still, some unwanted variations that may arise both during the sample preparation and during NMR acquisition can lead to slightly misaligned spectra even when they have been referenced to an internal standard. Indeed, for this kind of metabonomic datasets (plasma and serum samples) the common referencing molecules TSP or DSS (2,2-dimethyl-2-silapentane-5-sulfonate) have proved to be unreliable because unpredictable interactions with blood proteins make their signal vary. Therefore, it has become common practice to align the spectra according to pH unaffected signals present in a suitable concentration in all the spectra [22]. A perfect candidate for animal blood samples is the a-D-glucopyranose anomeric doublet, centered at 5.23 ppm, just on the right side of a broad lipid olefinic resonance at about 5.27 ppm. A plot of the complete rat plasma spectral dataset is shown in Fig. 8. The a-D-glucose region is highlighted and amplified in the insets a and b representing the raw and the aligned spectra, respectively (alignment result produced by command line #4 of Table 1). In practice, icoshift searches for the best local cross-correlation for the provided reference region and then shifts the whole spectra left or right according to the found shift indexes. Although the correction is small, it improves the dataset homogeneity and helps the interpretation of the results of a subsequent multivariate data analysis. The relatively large metabolomic dataset, including 72 plasma samples and consisting of 21,727 data points covering a 6.86 ppm wide spectral region (from 5.86 to 1.00 ppm) required only a fraction of second to be aligned (Table 2). 196 F. Savorani et al. / Journal of Magnetic Resonance 202 (2010) 190–202 Fig. 6. Customized interval alignment of the nickel-bath data. The dataset is split up into 17 intervals with automatic determination of the maximum allowed shift for each interval. The 17 user-defined selected intervals are numbered on top of the figure and their background is colored. The peak pointed out by  is the singlet resonance peak of succinic acid which was used as an internal standard. (b) and (c) show details of the aromatic and the aliphatic NMR regions, respectively. By mouse-clicking on one spectrum in either the raw spectra window or in the aligned spectra window, the spectrum becomes highlighted in both windows allowing the user to inspect the alignment result for the individual selected spectrum. This feature is demonstrated in the inset of the aromatic region (b) in which spectrum No. 409 was selected. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this paper.) Fig. 7. Comparison of the alignment results of three different algorithms. The succinic acid region of the nickel-bath dataset is plotted after the full dataset has been aligned using different methods: (a) raw data; (b) icoshift; (c) COW (‘ ¼ 175 points and t ¼ 10) and (d) RSPA. Maximum allowed shift = 550 points for all methods. 197 F. Savorani et al. / Journal of Magnetic Resonance 202 (2010) 190–202 Fig. 8. icoshift alignment based on a reference signal of the rat plasma metabolomic dataset. The chosen reference region (5.25–5.21 ppm) contains the anomeric doublet of a-D-glucopyranose which was chosen for the algorithm to drive the alignment of the dataset according to a rigid shift of whole spectra. The average spectrum used as the target is highlighted in inset (a) which shows the raw data zoomed into the selected reference region as well as inset (b) shows the icoshift aligned data. Because of the compression/expansion model for the warping, COW is not capable of correcting the spectra according to one signal: the alignment is always global and the best compromise is found given the segment length and the slack. This can be seen in Supplementary Fig. 3, which clearly shows that the results for this method are suboptimal. RSPA, on the other hand, performs well on the reference signal, but, similarly to COW, it also aligns other parts of the spectra in order to maximize the local correlation. Table 3 Calculation times of icoshift for random simulated datasets (ns = 100). Times are in seconds. 2.5. Computational aspects The fact that RSPA requires higher computation times is both related to the sample-wise operation, without padding to N " (cf. Section 4.2.1), and to the peak picking and automatic interval definition. Obtaining the most efficient RSPA implementation was not the aim of this work and it is expected that zero padding to N " will improve the performance. Table 3 shows the computation times for a set of artificial sets of specific size, both in terms of number of samples and length of the signal (100 contiguous intervals were used). From the table it is apparent that icoshift can be used as a real time method even for the alignment of very large datasets. Table 2 shows the results for the alignment of the three different datasets presented above using the optimal definition of the intervals for each dataset. As it can be seen, the time consumption for COW is as expected much larger than for FFT based methods. Particularly expensive was the search for the optimal warping parameters: around 6 min were necessary to complete the alignment for the Ni-bath data and another 37 were employed to find the optimal ‘, and t using the simplex procedure. This is manifestly non-optimal, especially in view of the relatively poor alignment that was obtained. Computation times for COW and RSPA on dataset 3 are not meaningful (and therefore not reported) because both methods operate on a much larger problem (the shift is sought only on the reference signal in icoshift) and cannot handle this case correctly. Table 2 Speed comparison among different algorithms on the alignment of the presented datasets. Times are in seconds. COW (opt)a icoshift RSPA Wine Nibath Plasma 1.30 (38.3) 0.089 2.23 348 (2.23  103) 1.9 22.32 – 0.067 – a The value in parentheses is the time consumption for the simplex search of the optimal ‘ and t. No. samples 50 100 200 400 Sample length 4096 8192 16,384 32,768 0.58 0.68 0.83 1.10 1.10 1.24 1.44 1.98 2.06 2.32 2.74 3.89 4.01 4.47 5.54 8.75 3. Conclusions The icoshift algorithm presented in this work provides a computationally efficient and versatile tool for the one-dimensional alignment of NMR signals in large spectral datasets. The speed of alignment algorithms have previously been a hindrance for their extensive application, but in all the tests made so far icoshift proved to be sufficiently fast to allow the NMR spectroscopist to work with full-resolution datasets almost in real time and to introduce automation improvements. Although only three input parameters are required for solving the majority of alignment problems, a number of optional meta-parameters can be provided to allow the algorithm to handle special misalignment problems (see Table 1). The speed of icoshift allows the user to swiftly and interactively investigate different operative combinations of interval settings 198 F. Savorani et al. / Journal of Magnetic Resonance 202 (2010) 190–202 for achieving the optimal result; the program is designed to be user friendly and to provide helpful and interactive plot facilities. Paradoxically, the whole idea behind spectral alignment is to render the large NMR datasets bilinear and thus suitable for subsequent multivariate chemometric models such as PCA, PLS and multivariate curve resolution (MCR) [3], but with the wine example, we have also shown that a perfect alignment of NMR spectra might remove part of the information sought. This advocates for using a customizable tool such as icoshift when aligning NMR spectra. While icoshift can be operated in a fully automated mode, it has been specifically designed for allowing user-defined intervals boundaries. Indeed, the time spent for adjusting the meta-parameters required for tuning a fully-automated interval definition step is often badly spent compared to the time required for the spectroscopist to manually select the intervals. However, the icoshift algorithm has also been designed to be used as an engine for fully automated interventions as it has been demonstrated on this work providing that part of programming code (as a separate routine) that applies a recently published method [18]. Although no direct comparisons were made, the icoshift algorithm appears to be faster than other recent methods as seconds are required where others report minutes for a smaller set on an equivalent computer [6,17,20]. In order for the software to be widely tested and used and to make further customizations and improvements possible, the Matlab source code is freely available for download at www.models.life.ku.dk/algorithms. Preliminary results have shown that icoshift performs equally well on chromatographic data, an investigation that will be pursued in a future publication. for calculating PLS regression models. The table wine dataset, along with information concerning the icoshift intervals and the Y reference chemical values used for PLS calibrations, is available for download at http://www.models.life.ku.dk/research/data/ Wine_NMR/index.asp. 4.1.2. Case 2: the nickel-bath data The nickel-bath dataset is part of a study aimed at following the evolution of organic additives in an electroplating nickel bath (unpublished data). The samples analyzed were aqueous solutions containing approximately 10% of D2O. 1H NMR spectra were acquired on a Bruker Avance 500 spectrometer DRX-500 operating at a Larmor frequency of 500.13 MHz for 1H (11.75 T). All experiments were performed at 298 K suppressing the water resonance by using the standard Watergate pulse sequence (Bruker Biospin). All samples were individually tuned, matched and shimmed. A spectral window of 15.646 ppm was acquired in 2.2 s. The relaxation delay was 9 s. The FIDs were collected into 32 k complex data points and 128 scans were accumulated for each spectrum. Total acquisition time for each spectrum was about 25 min. Zero-filling to 64 k points and a 0.3 Hz line-broadening multiplication was performed prior to Fourier transformation. The spectra were baselineand phase corrected using MestRe-C 4.9.8.0 (www.mestrec.com, Mestreab Research SL, Santiago de Compostela, Spain). TSP was used as a chemical-shift reference (d = 0 ppm) and succinic acid was used as internal standard. In total 460 samples were analyzed and their NMR spectral region between 9.5 and 0.5 ppm was used for the present study. The Ni-bath dataset has dimensions (samples  variables): 460  40,905. 4. Experimental The primary aim of this study was to describe the new alignment algorithm and its performance on realistic NMR datasets. In the following a short description of each NMR dataset used in the performance test is provided. The second part of this section describes the detailed algorithmic aspects of icoshift (a stand-alone collection of algorithms) and provides comparisons with the previously published alignment algorithms highlighting advantages and drawbacks. Finally a comparison of the calculation speed is provided for all the investigated algorithms. 4.1. NMR datasets 4.1.1. Case 1: the wine data The wine data include the NMR spectra of 40 table wines of different origin and color (red, white and rosé) and is taken from a previous study [9]. The NMR samples were prepared from 495 ll wine and 55 ll of TSP-d4 (5.8 mM) in D2O. 1H NMR spectra were acquired on a Bruker Avance 400 spectrometer (Bruker Biospin GmbH, Rheinstetten, Germany), operating at 400.13 MHz Larmor’s frequency for protons, equipped with a 5 mm BBI probe with Zgradients. All experiments were performed at 298 K suppressing the water resonance by pre-saturation followed by a composite pulse. All samples were individually tuned, matched and shimmed and then acquired using a recycle delay of 5 s, 1536 scans and a dwell time of 60.4 ms for acquisition of 32 k data points, resulting in a total acquisition time of 1.979 s. Before Fourier transformation, the FIDs (Free Induction Decay) were zero-filled to 64 k points and apodized by Lorentzian line-broadening of 0.3 Hz. All spectra were individually baseline- and phase corrected using XWin-NMR (Bruker Biospin). The wine dataset has dimensions (samples  variables): 40  8712. Chemical analyses were also performed on the same wines in order to determine, among the others, their content of ethanol and lactic acid. These data have been used in the present study 4.1.3. Case 3: the rat plasma data This metabonomic dataset is from the study by Kristensen et al. [23] in which a combination of NMR analysis and chemometrics was used as a method for rapidly assessing the lipoprotein subfractions in rat plasma [24]. In that study 100 ll of plasma were transferred to a 5 mm NMR tube and 450 ll D2O were added. 1H NMR spectra were then acquired on a Bruker Avance 400 spectrometer (9.4 T) operating at 400.13 MHz. All samples were individually tuned, matched and shimmed and measured using a broad band inverse detection probe equipped with Z-gradients designed to 5 mm NMR tubes. Since the average body temperature of rats is 311 K, all experiments were performed at that temperature. Water pre-saturation was employed during the recycle period by a composite 90° pulse in order to suppress the intense water resonance. A spectral window of 8278.15 Hz was acquired in 1.97 s. The relaxation delay was 20 ls. The FIDs were collected into 32 k complex data points and 128 scans were accumulated for each sample. Total acquisition time was 15 min and prior to Fourier transformation the data were zero-filled to 64 k points and multiplied by a 0.3 Hz line-broadening function. The spectra were baseline- and phase corrected manually using Topspin™ (Bruker Biospin). A batch of this study, consisting of 72 plasma samples of rats fed with different diets, was used in the present study. The rat plasma data has the dimensions (samples  variables): 72  21,727. 4.1.4. Human urine data The human urine dataset is the target of a focused metabonomic study in which icoshift is applied. This study will be the subject of a forthcoming paper. The challenging region (because of the strong misalignment) between 2.74 and 2.64 ppm, containing an NMR doublet of citric acid together with a singlet of TMA (TriMethylAmine), is shown in Fig. 1 of the present study, demonstrating the capability of icoshift of solving intricate misalignments of signals belonging to different molecules. F. Savorani et al. / Journal of Magnetic Resonance 202 (2010) 190–202 1 H NMR spectra of buffered urine samples were acquired on a Bruker Avance Ultra Shield 500 spectrometer (Bruker Biospin) operating at 500.13 MHz (11.75 T) using a broadband inverse detection probe head equipped with a 120 ll flow-cell. Data were accumulated at 298 K employing a pulse sequence composed by a pre-saturation of the water resonance during the recycle period followed by a composite 90° pulse with an acquisition time of 1.57 s, a recycle delay of 5 s, 128 scans and a sweep width of 10416.67 Hz, resulting in 16 k complex data points. All samples were individually and automatically tuned, matched and shimmed. Prior to Fourier transformation, each FID was zero-filled to 64 k points and apodized by Lorentzian line-broadening of 0.30 Hz. The resulting spectra were manually phased and automatically baseline corrected using Topspin™ (Bruker Biospin) and the ppm scale was referenced towards the TSP peak at 0.00 ppm. Since the composition and the concentration were very similar for every sample, the receiver gain was initially set at a fixed value equal to 287.4 for all the experiments. The human urine dataset has the dimensions (samples  variables): 98  30,000. 4.2. Signal processing The flow chart of the algorithm is shown in Fig. 2. In this section, the three parts of the icoshift algorithm: interval definition (cf. Section 4.2.2), FFT engine (cf. Section 4.2.1) and reconstruction (cf. Section 4.2.4) are described. Other alignment methods related to icoshift and used for comparison are described in the subsequent sections (viz., PAFFT in Section 4.2.5, RSPA in Section 4.2.6 and COW in Section 4.2.7). 4.2.1. Cross-correlation by FFT The theory of the calculation of the cross-correlation coefficient using the Fourier Transform is well known and will only be outlined here [25]. Given two continuous functions XðtÞ and YðtÞ, the cross-correlation for a lag u between them can be defined as in Eq. (1) C XY ðuÞ ¼ Z 1 Yðt þ uÞXðtÞdt ð1Þ 1 xðf Þ ¼ FðXðtÞÞ ¼ Z 1 XðtÞe2pift dt ð2aÞ 1 XðtÞ ¼ F1 ðxðf ÞÞ ¼ Z 1 xðf Þe2pift df ð2bÞ 1 where t and f are generally referred to as time and frequency and xðtÞ and Xðf Þ are the representations of the same process in the time and in the frequency domain, respectively. However for this specific application and representation of the Fourier transform, t denotes the chemical shift (i.e., a frequency shift expressed in ppm). Correspondingly, f is another variable that can be expressed in ppm1, and therefore in some time measurement unit [25], but Xðf Þ is not the time domain representation of xðtÞ as generally intended by the NMR community. ~ in the domain W ¼ ½w; w, one If C XY ðuÞ has a maximum at u e ðtÞ ¼ Yðt þ u ~ Þ that is shifted along t and can define a function Y e ðtÞ is said to be has maximum cross-correlation to XðtÞ at t ¼ 0. Y ~ the optimal shift the aligned signal, XðtÞ the alignment target, u and w is the width of the search space for the optimal correction. The Fourier transform (F – Eq. (2a)) and its inverse (F1 – Eq. (2b)) allow the calculation of all the cross-correlation values for arbitrarily large values of w using the fact that C XY ðuÞ can be expressed as a function of F1 ðxðf Þyðf ÞÞ [25]. This holds also for discrete samples (x and y, respectively) of finite length N of the two functions, so long as XðtÞ and YðtÞ are periodic and that their period is of length N; that is, as long as they are completely determined by 199 x and y. However, this is not the case for sections of NMR spectra and, in order to avoid contamination from the end sections, it is necessary to pad x and y with a number of zeros equal to the largest correction allowed w [25]. The FFT algorithm makes it particularly efficient to calculate the C XY ðuÞ in the discrete case and is therefore often used for the signal alignment [17,18]. Several algorithmic techniques can be used to improve the efficiency of the FFT algorithm. In particular, a significant gain is achieved by aligning several samples at once rather than one at a time (Fig. 4 in the Supplementary Material). Note that this is currently possible in icoshift only if the boundaries of the intervals are common between spectra and that, for very large N and number of intervals, it may be too costly memory-wise to treat all the samples at once. It is also well known that the FFT algorithm is faster when N is equal to a power of two [25]. Therefore, in some cases, it is beneficial to pad the samples with zeros to a length of N " ¼ 2dlog2 Ne points, where dxe indicates the smallest integer larger than x. However, some numerical experiments showed that, while this is almost always true (in the Matlab environment) when samples are treated individually, it is no longer valid when samples are treated in blocks (cf. Supplementary Material Figs. 4 and 5). In particular, in the latter case the advantage in computation time appears to be appreciable (>5%) only when the length of this padding is limited compared to N or when N is small (below 64 points, Fig. 5 in the Supplementary Material). Because of the difficulty in predicting when zero padding to the next power of two is beneficial in the block case, this feature is currently not implemented in icoshift. 4.2.2. Interval definition and recursion The FFT engine can operate on the entire spectrum as well as on intervals of arbitrary length in sequence. Depending on how the intervals are defined, there are essentially three options available: to align the entire spectrum, to align separate, non-overlapping intervals independently, or to align the entire spectrum based on a reference interval. The first and the last options are in~ points, where deed trivial as it is sufficient to shift the signal by u ~ is calculated on the entire spectrum or on the reference interval, u respectively. The custom interval case is slightly more complex and, differently from other FFT-based alignment algorithms, adjacent intervals can share the common boundary but need not be contiguous. Moreover, if some regions in the samples are not ~ ¼ 0 for included in any interval, they are left untouched (i.e., u them). When automated options for interval definition are used and either the number of intervals, ns, or their length ‘ are fixed by the user, adjacent intervals share the common boundary, similarly to what is done with COW [20]. However, the treatment of the remainders is different in the two cases: if ns is fixed, ‘ is equal 1 to c ¼ bNn1 s c (i.e., the largest integer smaller than Nns ) for the last cð‘ þ 1Þ  N intervals and c þ 1 for the first N  ns c (that is, the remainders are equally split between intervals); on the contrary, if ‘ is fixed, the remainders are attached to the last interval, which has length ‘ þ N  ns ð‘  1Þ, where ns is bðN  1Þð‘  1Þ1 c. In order to keep the icoshift algorithm as general and versatile as possible, no elaborate segmentation routine based on, e.g., peak picking was implemented. This also implies that, when automated interval definition is employed, there is no safeguard against having a boundary occurring in the middle of a peak. icoshift implements a basic (optional) recursion for the customized interval case; namely, it is possible to perform a full-spectrum correction before the single intervals are aligned. This has proven to provide an improved alignment performance in certain cases at a limited additional computational cost. 200 F. Savorani et al. / Journal of Magnetic Resonance 202 (2010) 190–202 4.2.3. Data interpolation using missing values The icoshift algorithm is capable of handling missing values that occur at the beginning or at the end of the discrete intervals of XðtÞ and YðtÞ. This is possible because the length of x and y need not be the same. In particular, let N x þ w and N y þ w be the lengths, after the zero padding to avoid end section contamination, of x and y, respectively, and, without lack of generality, N x > N y ; then, in order to calculate C xy ðuÞ in W, it is sufficient to pad y with N x  N y zeros [25]. Thus, missing values are simply treated by removing them, zero padding the shortest interval to the same length as the longest, and by correcting the resulting optimal shift with the factor D ¼ mx  my (where mx and my is the number of missing values in the leading sections of x and y, respectively). Namely, if ~ miss is the optimal shift after the removal of the missing values u and the padding to the same length, the optimal shift for the origi~¼u ~ miss þ D. The number of trailing missing values nal intervals is u is irrelevant in this respect. Fig. 9 illustrates how the use of missing values can improve the reliability of the aligned spectra avoiding the introduction of artifacts that can mislead the successive data analysis. 4.2.4. Reconstruction Similarly to the other FFT-based algorithms described herein, an insertion/deletion model is used by icoshift; therefore, the warping path (i.e., the function relating the chemical shift axis in the sample and target spectra) obtained by icoshift is piece-wise linear, made up of segments of slope one that are connected by vertical, horizontal or empty sections (cf. Fig. 6 in the Supplementary Material). In particular, having the chemical shift axes for the target and for the sample (x and y, respectively) on the abscissae and the ordinates, respectively, an horizontal segment corresponds to replicating the boundary point or to inserting missing values (insertion), whereas a vertical one entails a deletion. The reconstruction of the samples is performed independently for each spectrum and interval, but no global optimality criterion is used. Hence, unlike COW, but similarly to the other FFT-based alignment algorithms, there is no measure of the goodness of the fit after the deletion/ insertion occurs. This is clearly suboptimal but works well under the assumption that the boundaries are located in regions that do not contain any relevant signal. Algorithmic techniques such as dynamic programming, breadth first search, beam search or Fig. 9. Effect of the imputed value on the quality of the aligned signals. (a) Imputation of first/last boundary point in each interval, (b) imputation of missing values. genetic algorithms have been suggested for alignment algorithms that do not use FFT and cross-correlation. The best options for such global optimization are currently being investigated. 4.2.5. Peak Alignment by FFT (PAFFT) and Recursive Peak Alignment by FFT (RAFFT) The PAFFT algorithm corresponds to the icoshift with automatic segmentation into intervals of equal length and insertion using the first/last point in each interval. From the computational point of view, PAFFT treats the samples separately (one at a time) and implements padding to the nearest power of two for each interval. However, compared to icoshift, there is no explicit zero padding to avoid end section contamination; in practice, in the current implementation (http://physchem.ox.ac.uk/~jwong/specalign/data/ PAFFT.m, download: 18th August, 2009), the contamination is avoided only if N "  N P w, but this need not always be the case. The RAFFT algorithm recursively calls PAFFT with intervals of decreasing length until some lower limit for N is reached or the similarity does not improve. This recursion step is not explicitly implemented in icoshift, which only allows the shift of the full spectrum as a preliminary step. RAFFT suffers from the same drawbacks described for PAFFT. 4.2.6. Recursive Segment-wise Peak Alignment (RSPA) The RSPA method, like PAFFT and RAFFT operates on one sample at a time and includes an automated interval definition part and a recursion routine. The automated segmentation is based on peak picking and on some user-defined constants that depend on the dataset (e.g., the noise level, the minimum distance between peaks that are merged to the same interval, the height of peaks that are considered intense within the set of all the detected peaks). The principals of the algorithm are only outlined in this section, as more details are available in the original publication [18]; more attention will be given to describing the choices made for the implementation as the source code of RSPA is currently not publicly available. As a first step a full spectrum shift is performed. Secondly, the interval boundaries are defined based on peak positions and on the positions of the corresponding bracketing minima. These are found as the points where the numerical derivative changes sign (i.e., no smoothing is used) of the signal. Subsequently, the height of each peak is determined as the difference between the value at the maxima and the smallest value at the bracketing minima and initial estimates of the intervals are taken as the bracketing minima of intense peaks. The result, at this stage is an ordered set of intervals Am ¼ f½a1 ; b1 ; . . . ; ½aim ; bim g and heights Hm ¼ fh1 ; . . . ; hnm g for each sample m and the target. Intervals relative to intense peaks (here determined as those whose height exceeds the 30th percentile of Hm ) that are closer than a predefined value s are merged; that is, if aiþ1  bi 6 s, then a new interval ½ai ; biþ1  is formed in substitution of the two original ones. In a second stage, the initial estimates for the intervals for each sample are compared to those for the target and vice versa. Intervals in a spectrum are considered overlapping with a segment in the target, if their mean falls within the central part of the target’s interval (that is if 0:5ðak;sam þ bk;sam Þ 2 ½0:75atar þ 0:25btar ; 0:25atar þ 0:75btar ) and if overlap occurs for more than one k, then they are merged and a new segment is formed for the sample as ½aminðkÞ;sam ; bmaxðkÞ;sam . The same procedure is then repeated for the target to merge target intervals matched to the same sample interval. Finally, the boundaries of the segments in the sample and the target are compared and matched (that is, if overlapping intervals in target and sample are ½atar ; btar  and ½asam ; bsam , respectively, the final boundaries are ½minðasam ; aref Þ; maxðbsam ; bref Þ). This expansion of the intervals may lead to some overlap between adjacent intervals within either the sample or the target, which is F. Savorani et al. / Journal of Magnetic Resonance 202 (2010) 190–202 eliminated by merging the overlapping intervals. In the optimal case, the result of this procedure is a set of matched intervals containing relevant peaks and a series of intervening regions with unmatched peaks or noise. In the simulation of the RSPA based on icoshift, the intervening regions are not aligned as there would be no information useful to establish the correctness of the alignment. The recursion part includes a function that evaluates the goodness of the alignment of each interval in terms of the piece-wise correlation coefficient qt ðxi ; yi Þ (i.e., the normalized sum of the Pearson’s correlation coefficients between segments of length t for the ith interval). If this value is lower than a predefined threshold, the interval is split. The splitting is done at the minimum in the spectrum being aligned closer to the minimum of the vector xi  yi , where  denotes the element-wise product. If a segment resulting from the split is smaller than the minimum length allowed (taken as equal to s, i.e., the minimum distance between merged peaks), it is disregarded. The segments resulting from the splitting are subject to further (separate) alignment in the following recursion step. In the implementation of the RSPA used herein, an additional constraint is applied to the number of recursions, as it was observed that artifacts could emerge if too many recursions are applied to markedly different intervals that are sufficiently long to allow several splitting without reaching the lower limit for the segment length. 4.2.7. Correlation optimized warping (COW) The COW algorithm has been extensively used for the alignment of chromatographic data and has been tested also on NMR signals [19,20,7]. The methods also finds piece-wise linear warping paths, but, rather than using a insertion/deletion model it uses a compression/expansion model. That is, the warping paths are formed by segments whose slope is allowed to take only a limited number of values determined by the length of the corresponding segments (‘) in the sample, and by the so-called slack parameter t. The COW algorithm aligns the samples with a target by dividing them in an equal number of segments, whose boundaries (ai for i ¼ 1; . . . ; ns ), in the target, are allowed to take all the integer values within predefined intervals (namely aiþ1  ai þ 1 2 ½‘  t; ‘ þ t, where t is the so-called slack parameter). The optimal warping is the one that maximizes the sum of the Pearsons correlation coefficient for all the intervals after the segments in the target are interpolated to the same length as the corresponding interval in the sample (typically ‘). In particular, dynamic programming is used to ensure that the global maximum, given the local constraints, is attained. The COW algorithm works spectrum-wise, but from the algorithmic point of view, it is possible to treat the spectra in blocks with considerable reduction in computational complexity [20]. 4.3. Computation tests All the calculations have been performed on a personal computer equipped with i7 core, 965 chipset, 3.2 GHz, 12 GB of memory, Windows XP-64 SP3 and MatlabÒ 7.7 (2008b, The Mathworks Inc., Natick, MA, USA). In-house optimized Matlab routines derived from the code available at www.models.life.ku.dk have been used. The optimal segment length and slack parameter for COW have been obtained using the simplex optimization routine by Skov et al. [20]. In the corrected version of COW, the calculations on the nodes in the initial grid search are computed in parallel on the 8 available cores on the machine. This allowed a 50% reduction in computation time compared to the non-parallel version. The RSPA routine (cf. Section 4.2.6) for Matlab has been implemented in-house using icoshift as a computation engine. The functions for peak picking, automatic segmentation and recursion are 201 optimized to the largest possible extent and will be available for download at www.models.life.ku.dk. The synthetic datasets (Table 3) are made of uniformly distributed random numbers as the computational complexity of icoshift is only dependent on the length of the intervals, on their number ns, on the maximum allowed correction w, and on the number of P samples M (that is, OðM ni s ðN i þ wÞlog2 ðN i þ wÞÞ, where N i þ w is the length of the ith interval zero padded to handle the end section contamination). Automated segmentation to 100 segments was used setting the maximum allowed correction as the minimum between 100 and half the interval length. Acknowledgments Flemming H. Larsen, Maider Vidal, Mette Kristensen are acknowledged for providing the datasets used for assessing the algorithm’s features. The Danish Ministry of Food, Agriculture and Fisheries is acknowledged for sponsoring the project ‘‘Urinary biomarkers for eating patterns using NMR spectroscopy and Chemometrics” (3304-FVFP-060706-01) under which the present research has been developed. The Villum Kann Rasmussen project (Metabonomic Cancer Diagnostic) and The EU FP6 ToK project ‘‘Food informatics” (EC Contract No: MTKI-CT-2005-030056) and gratefully acknowledged for support to Francesco Savorani and. Giorgio Tomasi. Eventually, Rasmus Bro, Frans van den Berg and Lars Nørgaard are acknowledged for valuable scientific discussions. Appendix A. Supplementary data Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.jmr.2009.11.012. References [1] D. Johnels, U. Edlund, E. Johansson, S. Wold, A multivariate method for carbon13 NMR chemical shift predictions using partial least-squares data analysis, J. Magn. Reson. 55 (1983) 316–321. [2] K.P. Gartland, C.R. Beddell, J.C. Lindon, J.K. Nicholson, Application of pattern recognition methods to the analysis and classification of toxicological data derived from proton nuclear magnetic resonance spectroscopy of urine, Mol. Pharmacol. 39 (1991) 629–642. [3] H. Winning, F.H. Larsen, R. Bro, S.B. Engelsen, Quantitative analysis of NMR spectra with chemometrics, J. Magn. Reson. 190 (2008) 26–32. [4] O. Beckonert, H.C. Keun, T.M. Ebbels, J. Bundy, E. Holmes, J.C. Lindon, J.K. Nicholson, Metabolic profiling, metabolomic and metabonomic procedures for NMR spectroscopy of urine, plasma, serum and tissue extracts, Nat. Protoc. 2 (2007) 2692–2703. [5] M. Spraul, P. Neidig, U. Klauck, P. Kessler, E. Holmes, J.K. Nicholson, B.C. Sweatman, S.R. Salman, R.D. Farrant, E. Rahr, C.R. Beddell, J.C. Lindon, Automatic reduction of NMR spectroscopic data for statistical and pattern recognition classification of samples, J. Pharm. Biomed. Anal. 12 (1994) 1215– 1225. [6] N.P.V. Nielsen, J.M. Carstensen, J. Smedsgaard, Aligning of single and multiple wavelength chromatographic profiles for chemometric data analysis using correlation optimised warping, J. Chromatogr. A 805 (1998) 17–35. [7] G. Tomasi, F. van den Berg, C. Andersson, Correlation optimized warping and dynamic time warping as preprocessing methods for chromatographic data, J. Chemom. 18 (2004) 231–241. [8] D. Bylund, R. Danielsson, G. Malmquist, K.E. Markides, Chromatographic alignment by warping and dynamic programming as a pre-processing tool for PARAFAC modelling of liquid chromatography–mass spectrometry data, J. Chromatogr. A 961 (2002) 237–244. [9] F.H. Larsen, F. van den Berg, S.B. Engelsen, An exploratory chemometric study of H-1 NMR spectra of table wines, J. Chemom. 20 (2006) 198–208. [10] R.J.O. Torgrip, M. Aberg, B. Karlberg, S.P. Jacobsson, Peak alignment using reduced set mapping, J. Chemom. 17 (2003) 573–582. [11] R.J.O. Torgrip, J. Lindberg, M. Linder, B. Karlberg, S.P. Jacobsson, J. Kolmert, I. Gustafsson, I. Schuppe-Koistinen, New modes of data partitioning based on PARS peak alignment for improved multivariate biomarker/biopattern detection in H-1-NMR spectroscopic metabolic profiling of urine, Metabolomics 2 (2006) 1–19. [12] E. Alm, R.J. Torgrip, K.M. Aberg, I. Schuppe-Koistinen, J. Lindberg, A solution to the 1D NMR alignment problem using an extended generalized fuzzy Hough transform and mode support, Anal. Bioanal. Chem. 395 (2009) 213–223. 202 F. Savorani et al. / Journal of Magnetic Resonance 202 (2010) 190–202 [13] J. Forshed, I. Schuppe-Koistinen, S.P. Jacobsson, Peak alignment of NMR signals by means of a genetic algorithm, Anal. Chim. Acta 487 (2003) 189–199. [14] G.C. Lee, D.L. Woodruff, Beam search for peak alignment of NMR signals, Anal. Chim. Acta 513 (2004) 413–416. [15] J.T.W.E. Vogels, A.C. Tas, J. Venekamp, J. Van Der Greef, Partial linear fit: a new NMR spectroscopy preprocessing tool for pattern recognition applications, J. Chemom. 10 (1996) 425–438. [16] R. Stoyanova, A.W. Nicholls, J.K. Nicholson, J.C. Lindon, T.R. Brown, Automatic alignment of individual peaks in large high-resolution spectral data sets, J. Magn. Reson. 170 (2004) 329–335. [17] J.W.H. Wong, C. Durante, H.M. Cartwright, Application of fast Fourier transform cross-correlation for the alignment of large chromatographic and spectral datasets, Anal. Chem. 77 (2005) 5655–5661. [18] K.A. Veselkov, J.C. Lindon, T.M.D. Ebbels, D. Crockford, V.V. Volynkin, E. Holmes, D.B. Davies, J.K. Nicholson, Recursive segment-wise peak alignment of biological H-1 NMR spectra for improved metabolic biomarker recovery, Anal. Chem. 81 (2009) 56–66. [19] F. van den Berg, G. Tomasi, N. Viereck, Warping: investigation of NMR preprocessing and correction, in: S.B. Engelsen, P.S. Belton, H.J. Jakobsen (Eds.), Magnetic Resonance in Food Science: The Multivariate Challenge, Royal Society of Chemistry, Cambridge, 2005, pp. 131–138. [20] T. Skov, F. van den Berg, G. Tomasi, R. Bro, Automated alignment of chromatographic data, J. Chemom. 20 (2006) 484–497. [21] L. Norgaard, A. Saudland, J. Wagner, J.P. Nielsen, L. Munck, S.B. Engelsen, Interval partial least-squares regression (iPLS): a comparative chemometric study with an example from near-infrared spectroscopy, Appl. Spectrosc. 54 (2000) 413–419. [22] J.T.M. Pearce, T.J. Athersuch, T.M.D. Ebbels, J.C. Lindon, J.K. Nicholson, H.C. Keun, Robust algorithms for automated chemical shift calibration of 1D H-1 NMR spectra of blood serum, Anal. Chem. 80 (2008) 7158–7162. [23] M. Kristensen, F. Savorani, G. Ravn-Haren, M. Poulsen, J. Markowski, F.H. Larsen, L. Dragsted, S.B. Engelsen, NMR and interval PLS as reliable methods for determination of cholesterol in rodent lipoprotein fractions, Metabolomics (2009), in press, doi:10.1007/s11306-009-0181-3. [24] M. Petersen, M. Dyrby, S. Toubro, S.B. Engelsen, L. Norgaard, H.T. Pedersen, J. Dyerberg, Quantification of lipoprotein subclasses by proton nuclear magnetic resonance-based partial least-squares regression models, Clin. Chem. 51 (2005) 1457–1461. [25] W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical Recipes in C, Cambridge University Press, Cambridge, UK, 2002.