# CorMID

Please load the CorMID package before running the examples within this document.

library(CorMID)

# The CorMID function to correct superimposed mass isotopologue distributions of 13C labelled molecules in GC-APCI-MS flux experiments

The package provides as the main functionality function CorMID which will estimate the fragment distribution, r, and the mass isotopologue distribution, M, for a numeric vector of measured ion intensities originating from a compound analyzed by GC-APCI-MS (see below).

## Definitions

### Gas chromatography (GC)

Gas chromatography is an analytical method to separate volatile compounds.

### Atmospheric pressure chemical ionization (APCI)

Atmospheric pressure chemical ionization (APCI) is an ionization method used in mass spectrometry which utilizes gas-phase ion-molecule reactions at atmospheric pressure (105 Pa) to convert molecules separated by GC into ions which can be analyzed by mass spectrometry.

### Mass spectrometry (MS)

Mass spectrometry is an analytical method to separate ions due to their mass, more specifically due to their difference ind mass to charge ratio.

### Mass Isotopologue Distributions (MID)

Compounds which have the same sum formula are termed isomers. Isomers can be structural, i.e. the arrangement of the atoms in the molecules are different. All isomers of a specific sum formula which are structurally identical and differ only in their isotopic composition are termed isotopomers. Here, again, several isotopomers may share the same sum formula in which case they are termed isotopologues.

Example: C2H6O would be a sum formula, with ethanol CH3-CH2-OH being one structural isomer of C2H6O. The molecules 13CH3-CH2-OH and CH3-13CH2-OH would be two isotopomers of ethanol. Together, they form one of three possible carbon isotopologues of ethanol, the M1 isotopologue with sum formula 13C12CH6O. The corresponding M2 isotopologue being 13C2H6O. In mass spectrometry we would measure the mass isotopologue distribution of C2H6O, more specifically of the structural isomer ethanol, by quantifiying 3 different ion masses for M0, M1 and M2. This vector is termed MID.

### Helper functions

Within the main function CorMID I make use of several helper functions, i.e. CountChemicalElements and CalcTheoreticalMDV. The first one simply counts the digit following a certain letter in a chemical sum formula. Here, we use it to determine the number of carbon, silicon and sulfor atoms (neglecting nitrogen, as the 15N isotope is of low abundance). As the anticipated user will probably work on derivatized compounds, I included two additional letters to the chemical alphabet, T for TMS and M for a MEOX substitution. In consequence for compound Glucose (5TMS 1MEOX) we would count:

fml <- "C6H12O6T5M1"
CountChemicalElements(x = fml)
#>  C  H  O  T  M
#>  6 12  6  5  1
CountChemicalElements(x = fml, ele=c("C","Si","T","Cl"))
#>  C Si  T Cl
#>  6  0  5  0

and receive as output a named vector for all present elements or only a selection of elements as specified by parameter ele.

The elements with a significant amount of natural occuring isotopes are relevant to calculate the theoretical mass distribution vector (or rather matrix respectively) of the compound. In the above example this is effectively Carbon and Silicon. As we have 5 TMS groups, we need to consider in total 21 C and 5 Si in our calculations:

fml <- "C21Si5"
td <- CalcTheoreticalMDV(fml=fml)
round(td,4)
#>       M+0    M+1    M+2    M+3    M+4    M+5    M+6
#> M0 0.5291 0.2575 0.1475 0.0471 0.0147 0.0034 0.0007
#> M1 0.0000 0.5354 0.2546 0.1464 0.0460 0.0143 0.0032
#> M2 0.0000 0.0000 0.5430 0.2522 0.1457 0.0450 0.0141
#> M3 0.0000 0.0000 0.0000 0.5566 0.2523 0.1466 0.0445
#> M4 0.0000 0.0000 0.0000 0.0000 0.5880 0.2600 0.1519
#> M5 0.0000 0.0000 0.0000 0.0000 0.0000 0.6987 0.3013
#> M6 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000

The first row of the matrix (M0) gives the relative amounts of all potential isotopes for C21Si5 assuming natural abundance conditions. The second row (M1) shows the relative amounts for isotopologue M1 (containing at least one 13C). The final row (M6) shows the relative amounts when all biological carbon atoms are assumed to be 13C. The amount of biological carbon is estimated based on the amount of Si within the function. This might be overwritten by function parameters specifying the number of C of biological origin nbio and the number of measured ion signals above the detection limit nmz:

round(CalcTheoreticalMDV(fml=fml, nbio=21, nmz=21)[-(5:19),-(5:19)],4)
#>       M+0    M+1    M+2    M+3  M+19   M+20   M+21
#> M0  0.529 0.2575 0.1475 0.0471 0.000 0.0000 0.0000
#> M1  0.000 0.5349 0.2544 0.1463 0.000 0.0000 0.0000
#> M2  0.000 0.0000 0.5409 0.2512 0.000 0.0000 0.0000
#> M3  0.000 0.0000 0.0000 0.5469 0.000 0.0000 0.0000
#> M19 0.000 0.0000 0.0000 0.0000 0.678 0.1867 0.1352
#> M20 0.000 0.0000 0.0000 0.0000 0.000 0.7910 0.2090
#> M21 0.000 0.0000 0.0000 0.0000 0.000 0.0000 1.0000

Further, the package contains the convenience function recMID to reconstruct a measured MID based on a given corMID, r and sum formula. recMID returns an object of the similarly named class to allow easy visualization.

fml <- "C9H20O3Si2"
mid <- c(0.9,0,0,0.1)
r <- list("M+H"=0.8, "M-H"=0.1, "M+H2O-CH4"=0.1)
(rMID <- CorMID::recMID(mid=mid, r=r, fml=fml))
#>        M-2        M-1        M+0        M+1        M+2        M+3        M+4
#> 0.06881282 0.01414192 0.55678069 0.12190316 0.12056111 0.08497195 0.01854376
#>        M+5
#> 0.01428459
#> attr(,"class")
#> [1] "recMID"
plot(rMID)

The spectrum shown in the above plot would be measured for lactic acid (2 TMS), assuming 10% of the fully labeled isotopologue M3, natural abundance, 10% proton loss and 10% of [M+H]+H2O-CH4. It is the task of CorMID to disentangle this overlay of superimposed MID and estimate both, corMID and r.

## Main function CorMID

### Idea

The problem in GC-APCI-MS that we try to overcome is the formation of fragments forming superimposed MIDs. The ones we identified so far are [M+H], [M+], [M+H]-H2 and [M+H]+H2O-CH4. If we assume [M+H] to be generally the most abundant and hence use it as our fix point (base MID, shift = 0), than we observe superimposed MIDs starting at -2, -1 and +2 relative to [M+H] for [M+], [M+H]-H2 and [M+H]+H2O-CH4 respectively.

The basic idea of the correction is that we measure a superimposed/composite MID of one to several fragments all derived from the same base MID. This base MID (or correct MID, corMID) is exactly what we are looking for. Estimating it is complicated because we do not know the distribution of fragments, i.e. the amount of the individually occurring fragments or their ratios to each other respectively. Hence, we have to estimate corMID and the ratio vector r giving the distribution of present fragments, which together represent our measurement data optimally.

### Example

Lets start with an artificial Glucose spectrum where 10% is M6 labeled:

fml <- "C21Si5"
td1 <- CalcTheoreticalMDV(fml = fml, nbio=6, nmz=8)
bMID <- c(0.9,rep(0,5),0.1)
md1 <- apply(td1*bMID,2,sum)
round(md1,4)
#>    M+0    M+1    M+2    M+3    M+4    M+5    M+6    M+7    M+8
#> 0.4761 0.2318 0.1327 0.0424 0.0132 0.0030 0.0606 0.0253 0.0149

to obtain the measure distribution md1 which is our measured intensity values expressed relative. Please note that the measured MID contains additional peaks at M+7 and M+8, being the natural abundant isotopes of carbon atomes attached during derivatization. Now we may use CorMID to decompose this back:

CorMID(int=md1, fml=fml, r=unlist(list("M+H"=1)))
#>       M0       M1       M2       M3       M4       M5       M6
#> 89.84375  0.00000  0.00000  0.00000  0.00000  0.00000 10.15625
#> attr(,"err")
#>         err
#> 0.002743298
#> attr(,"ratio")
#> M+H
#>   1
#> attr(,"ratio_status")
#> [1] "fixed"
#> attr(,"mid_status")
#> [1] "estimated"

Notice, that we allowed only [M+H] to be present in option r. The result is a labeled vector representing the corrected or baseMID together with information on the fitting error err and regarding the options used during the function call as attributes ratio, ratio_status and mid_status with mid being estimated and ratio being fixed during the function call.

We could achieve something similar testing for all currently defined fragments by omitting the r option:

CorMID(int=md1, fml=fml)
#>       M0       M1       M2       M3       M4       M5       M6
#> 89.84375  0.00000  0.00000  0.00000  0.00000  0.00000 10.15625
#> attr(,"err")
#>         err
#> 0.002743298
#> attr(,"ratio")
#>       M+H        M+       M-H M+H2O-CH4
#>         1         0         0         0
#> attr(,"ratio_status")
#> [1] "estimated"
#> attr(,"mid_status")
#> [1] "estimated"

We essentially get the same result as before (except for ratio related attributes) because there is no superimposition in our test data. Now lets generate more difficult composite data to be fit by including a 20% proton loss…

md2 <- unlist(list("M-1"=0,0.8*md1)) + c(0.2*md1,0)
round(md2,4)
#>    M-1    M+0    M+1    M+2    M+3    M+4    M+5    M+6    M+7    M+8
#> 0.0952 0.4273 0.2120 0.1147 0.0365 0.0112 0.0145 0.0535 0.0232 0.0119

and let CorMID decompose this back…

CorMID(int=md2, fml=fml)
#>       M0       M1       M2       M3       M4       M5       M6
#> 89.84375  0.00000  0.00000  0.00000  0.00000  0.00000 10.15625
#> attr(,"err")
#>         err
#> 0.002466178
#> attr(,"ratio")
#>       M+H        M+       M-H M+H2O-CH4
#>       0.8       0.2       0.0       0.0
#> attr(,"ratio_status")
#> [1] "estimated"
#> attr(,"mid_status")
#> [1] "estimated"

which is pretty close to the truth. :)

### Function Details

Let’s look into the details of the function. Appart from some sanity checks and data preparation steps done by the wrapper function CorMID the main idea is to model a theoretical distribution based on a provided sum formula and fit a base MID and fragment ratios according to measurement data by function FitMID which is discussed in the following. The approach is brute force using two nested estimators for r and corMID seperately. It builds on the idea to test a crude grid of parameters first, identify the best solution and iteratively approache the true value by minimizing the grid.

The grid is set by an internal function poss_local. Basically, if we have a two carbon molecule we expect a corMID of length=3 {M0, M1, M2}. Let’s assume that corMID = {0.9, 0, 0.1}. Using a wide grid we would than test the following possibilities:

CorMID:::poss_local(vec=c(0.5,0.5,0.5), d=0.5, length.out=3)
#>    Var1 Var2 Var3
#> 3   1.0  0.0  0.0
#> 5   0.5  0.5  0.0
#> 7   0.0  1.0  0.0
#> 11  0.5  0.0  0.5
#> 13  0.0  0.5  0.5
#> 19  0.0  0.0  1.0

and identify {1, 0, 0} as best match after subjecting to a testing function. We decrease the step size of the grid by 50% and test in the next iteration:

CorMID:::poss_local(vec=c(1,0,0), d=0.25, length.out=3)
#>     Var1  Var2  Var3
#> 3  1.000 0.000 0.000
#> 5  0.875 0.125 0.000
#> 7  0.750 0.250 0.000
#> 11 0.875 0.000 0.125
#> 13 0.750 0.125 0.125
#> 19 0.750 0.000 0.250

and will get closer to the truth and find {0.875, 0, 0.125} to give the lowest error.

In summary, using this approach we can approximate the optimal vectors of corMID and r in a finite number of iterations to reach a desired precision <0.1%. We can nest MID fitting inside ratio fitting and thereby do both in parallel.

The error function currently employed is simply the square root of the summed squared errors, comparing the provided measurement data and a reconstructed MID based on a specific corMID and r.