2 Metabolomics

2.1 Experimental methods and detailed description

Additional sampling notes

During the sampling on the Pâmiut cruises, the sponges remained on deck or in the laboratory for approximately 30–45 min before they were frozen to -20°C. The outside temperature usually oscillated around 4–5°C, and the sorting area was approximately 10°C. The sample consisted of both choanosome and coretx.

Laboratory methods and data acquisition

Metabolites were separated in connection to downstream mass spectrometry (HRMS) analysis using two different chromatographic columns (UPLC): a hydrophilic interaction liquid chromatography (HILIC) column that retains polar compounds, and a reversed-phase (RP) column that favours retention of non-polar compounds. All samples were processed in randomized using glass instruments during the extraction to avoid chemical contamination.

Mass spectrometry analysis

Dried extracts in glass vials were dissolved in 200 µL solvent (HILIC: 50 µL H₂O and 175 µL acetonitrile AcN; RP: 140 µL H₂O and 10 µL AcN). Upon addition of the organic solvent for HILIC chromatography, all samples separated into two immiscible layers. The vials were centrifuged for 3 min at 2000 x g to yield an even separation. Only the top layer (approximately 150 µL) was transferred to a Chromacol 03-FISV MS-vial (Thermo Scientific, Waltham, Massachusetts, USA) for MS analyses. For RP chromatography, no layers were observed, and the entire volume of the dissolved sample was used. A five µL aliquot from each individual MS-vial for HILIC and RP, respectively, was combined to produce a quality control (QC) sample.

High resolution MS analysis system and settings

The extracts were analyzed back-to-back in positive and negative ionization mode on an Acquity I-Class Ultra Performance Liquid Chromatography UPLC coupled to a G2S Synapt Q-TOF with an electrospray ionization (ESI) ion source (all Waters Corp., Milford, MA, USA).

Chromatographic separation in HILIC mode was performed on an Acquity UPLC BEH Amide column (1.7 µm, 2.1 mm inner diameter × 50 mm, Waters Corp.). Mobile phase A consisted of 95:5 acetonitrile/MQ water with 5 mM ammonium formate and 0.1 % formic acid (FA), and mobile phase B consisted of 40:60 acetonitrile/MQ water with 5 mM ammonium formate and 0.1 % FA. The gradient elution profile was as follows: mobile phase A was decreased non-linearly (slope factor 8, MassLynx) from 100 % A to 100 % B over 14 min, 100 % B was held for 2 min and then decreased back to 100 % A over 1 min. The column was re-equilibrated at 100 % A for 6 min for a total runtime of 23 min.

Chromatographic separation in RP was performed on an Acquity UPLC BEH C18 column (1.7 µ m, 2.1 mm inner diameter × 50 mm, Waters). Mobile phase A consisted of MQ water with 0.1 % FA, and mobile phase B was AcN with 0.1 % FA. The gradient elution profile started at 95 % A, was decreased linearly over 14 min to 5 % A, and 5 % A was held for 2 min before the column was re-equilibrated at 95 % A for 4 min. The flow rate was set to 0.4 mL/min, the column temperature was set to 40°C, the samples were kept at 8°C and the injection volume was 5 µL in all experiments.

Data acquisition

Data acquisition was performed using MS^E mode, and lock mass correction was applied using a solution of leucine enkephalin in both positive and negative mode. Ionization parameters were set as follows in positive/negative mode; the capillary voltage was 1kV/1.5 kV, the cone voltage was 30 V/25 V, the source offset was 50/60 and the source temperature was set to 120°C. Nitrogen was used as desolvation and cone gas with gas flows of 800 l/h and 50 l/h, respectively, and desolvation temperature was set to 500°C/450°C. For MS^E acquisition a collision energy ramp from 20–45 eV was used with argon as collision gas. The instrument was calibrated in the m/z range 50–1500 using sodium formate prior to each analysis. All study samples were analysed in both RP and HILIC, in positive and negative ionization mode, resulting in four metabolite datasets per sponge specimen. The column and sample cone was cleaned in between each analysis mode. Prior to each analysis ten QC injections were made to condition the column, and to ensure stable retention times and signal intensities. The study samples were analysed in randomized with QC injections interspaced every 6^th injection.

MS data processing

Raw files were converted to netCDF files by Databridge (part of MassLynx, Waters Corporation, Milford, Massachusetts, USA). The netCDF files with the chromatographic spectra were sorted into folders according to species and processed with XCMS in R. Peak picking was performed using the centWave function with parameters ppm=8, peakwidth set to c(5,45) and the noise parameter set to 2000. Retention time alignment was performed with the obiwarp function and the response factor set to 10, grouping was performed with the “group” function and the “fillPeaks” function was used to impute a signal in cases where no matching pseudospectra were detected.

The data set was curated to remove features eluting in the void (retention time less than 45 s). A raw data set as well as two normalized data sets (Log₁₀-transformed and median fold change normalized) were produced and filtered to only retain features with a coefficient of variation < 30% in the QC samples. After subsequent evaluation, raw data sets were used in subsequent statistics and modelling.

An overview of the number of features and PC groups annotated by CAMERA, as well as their exclusion is given in a table at the end of this document.

2.2 Data processing

2.2.1 Peak picking with XCMS and annotation with CAMERA

We processed samples from all three sponnge species in random with interspersed injection of a combined QC sample to monitor stability of the UPLC-HRMS run. The acquired signals/spectra were converted to netCDF format using the Program DataBridge, and thereafter sorted into four folders, three for the sponge species (Gb, Sf, Wb) and one for the QC samples (QC). Peak picking and combination of pseudospectra is performed with the R package xcms, the subsequent annotation of adducts and isotopes with the R package CAMERA.

2.2.2 HILIC (Hydrophilic interaction chromatography) column with positive ESI (electron spray ionisation)

# HILIC POS ALWAYS CHECK DATE, CHROMATOGRAPHIC CLOUMN (HILIC, RP) AND ESI-MODE (naming, CAMERA: pos, neg)

# setwd() # work in directory containing the sorted CDF files. getwd()
library(xcms)
xset <- xcmsSet(method = "centWave", ppm = 8, peakwidth = c(5, 45), noise = 2000)
save(xset, file = "HILIC_pos_xset_20190417.Rda")
# load(file='HILIC_pos_xset_20190417.Rda') #When resuming after a break
xset <- group(xset)
xset2 <- retcor(xset, method = "obiwarp", response = 10, plottype = "deviation")
xset2 <- group(xset2)
xset3 <- fillPeaks(xset2)
save(xset3, file = "HILIC_pos_xset3_20190417.Rda")
reporttab <- diffreport(xset3, "Gb", "Sf", "Gb_Sf_HILIC_pos_20190417", 10)

library(CAMERA)
xsa <- xsAnnotate(xset3)
xsaF <- groupFWHM(xsa, perfwhm = 0.3)
xsaC <- groupCorr(xsaF, cor_eic_th = 0.7)
xsaFI <- findIsotopes(xsaC)
rules <- read.csv("data/rules_jan_pos.csv", header = T, sep = ",")
xsaFA <- findAdducts(xsaFI, polarity = "positive", rules = rules)
# write.csv(getPeaklist(xsaFA), file='HILIC_pos_20190417.csv')

The other data experiments, HILIC with negative ESI and RP with positive and negative ESI are processed accordingly.

2.2.3 HILIC Chromatography with negative ESI

# HILIC NEG 20190421

# setwd() # work in directory containing the sorted CDF files. getwd()
library(xcms)
xset <- xcmsSet(method = "centWave", ppm = 8, peakwidth = c(5, 45), noise = 2000)
save(xset, file = "HILIC_neg_xset_20190421.Rda")
# load(file='HILIC_neg_xset_20190421.Rda') #When resuming after a break
xset <- group(xset)
xset2 <- retcor(xset, method = "obiwarp", response = 10, plottype = "deviation")
xset2 <- group(xset2)
xset3 <- fillPeaks(xset2)
save(xset3, file = "HILIC_neg_xset3_20190421.Rda")
reporttab <- diffreport(xset3, "Gb", "Sf", "Gb_Sf_HILIC_neg_20190421", 10)

library(CAMERA)
xsa <- xsAnnotate(xset3)
xsaF <- groupFWHM(xsa, perfwhm = 0.3)
xsaC <- groupCorr(xsaF, cor_eic_th = 0.7)
xsaFI <- findIsotopes(xsaC)
rules <- read.csv("data/rules_jan_neg.csv", header = T, sep = ",")
xsaFA <- findAdducts(xsaFI, polarity = "negative", rules = rules)
# write.csv(getPeaklist(xsaFA), file='HILIC_neg_20190421.csv')

2.2.4 RP (Reversed phase) Chromatography with positive ESI

# RP POS 20190421

# setwd() # work in directory containing the sorted CDF files. getwd()
library(xcms)
xset <- xcmsSet(method = "centWave", ppm = 8, peakwidth = c(5, 45), noise = 2000)
save(xset, file = "RP_pos_xset_20190421.Rda")
# load(file='RP_pos_xset_20190421.Rda') #When resuming after a break
xset <- group(xset)
xset2 <- retcor(xset, method = "obiwarp", response = 10, plottype = "deviation")
xset2 <- group(xset2)
xset3 <- fillPeaks(xset2)
save(xset3, file = "RP_pos_xset3_20190421.Rda")
reporttab <- diffreport(xset3, "Gb", "Sf", "Gb_Sf_RP_pos_20190421", 10)

library(CAMERA)
xsa <- xsAnnotate(xset3)
xsaF <- groupFWHM(xsa, perfwhm = 0.3)
xsaC <- groupCorr(xsaF, cor_eic_th = 0.7)
xsaFI <- findIsotopes(xsaC)
rules <- read.csv("data/rules_jan_pos.csv", header = T, sep = ",")
xsaFA <- findAdducts(xsaFI, polarity = "positive", rules = rules)
# write.csv(getPeaklist(xsaFA), file='RP_pos_20190421.csv')

2.2.5 RP Chromatography with negative ESI

# RP NEG 20190422

# setwd() # work in directory containing the sorted CDF files. getwd()
library(xcms)
xset <- xcmsSet(method = "centWave", ppm = 8, peakwidth = c(5, 45), noise = 2000)
save(xset, file = "RP_neg_xset_20190422.Rda")
# load(file='RP_neg_xset_20190422.Rda') #When resuming after a break
xset <- group(xset)
xset2 <- retcor(xset, method = "obiwarp", response = 10, plottype = "deviation")
xset2 <- group(xset2)
xset3 <- fillPeaks(xset2)
save(xset3, file = "RP_neg_xset3_20190422.Rda")
reporttab <- diffreport(xset3, "Gb", "Sf", "Gb_Sf_RP_neg_20190422", 10)

library(CAMERA)
xsa <- xsAnnotate(xset3)
xsaF <- groupFWHM(xsa, perfwhm = 0.3)
xsaC <- groupCorr(xsaF, cor_eic_th = 0.7)
xsaFI <- findIsotopes(xsaC)
rules <- read.csv("data/rules_jan_neg.csv", header = T, sep = ",")
xsaFA <- findAdducts(xsaFI, polarity = "negative", rules = rules)
# write.csv(getPeaklist(xsaFA), file='RP_neg_20190422.csv')

2.2.6 Filtering of the raw data set

Features with following properties were removed:

Eluting in the void (RT < 45 s)
CV>30 % in QC samples
Adducts as annotated by CAMERA

As the presence of the same parent feature in the the form of isotopes and adducts distorts downstream global metabolome analyses, we generated two versions of curated or selective feature tables. In both, void and unstably measured features are removed (“cleaned”). One retains only the feature with the largest signal per pc group (“cleaned_pcgroup”), the other is more stringent and only contains explicitly annotated parent ions (“cleaned_MH”), i.e. [M+H]+ for positive ESI and [M-H]- for negative ESI acquired data.

library(dplyr)

# ================================ HILIC pos ===================================
raw_peaks <- read.csv("data/HILIC_pos_20190417.csv")
hp_dim1 <- dim(raw_peaks)[1]
hp_pcg1 <- length(unique(raw_peaks$pcgroup))

# Removing features eluting in the void
raw_peaks <- raw_peaks[raw_peaks$rt > 45, ]
hp_dim2 <- dim(raw_peaks)[1]
hp_pcg2 <- length(unique(raw_peaks$pcgroup))

# Removing features with a CV < 30%
f <- which(colnames(raw_peaks) == "IE_20170918_02001")  # first QC HILIC pos
l <- which(colnames(raw_peaks) == "IE_20170918_07601")  # last QC HILIC pos
raw_peaks["SD"] <- apply(raw_peaks[, f:l], 1, sd)
raw_peaks["MEAN"] <- apply(raw_peaks[, f:l], 1, mean)
raw_peaks["CV"] <- raw_peaks$SD/raw_peaks$MEAN
raw_peaks <- raw_peaks[raw_peaks$CV < 0.3, ]
hp_dim3 <- dim(raw_peaks)[1]
hp_pcg3 <- length(unique(raw_peaks$pcgroup))
# write.csv(raw_peaks, 'HILIC_pos_20190417_cleaned.csv')

# Keep only the feature with the greatest sum of signal per pc group
l <- dim(raw_peaks)[2] - 6
raw_peaks["fss"] <- apply(raw_peaks[, 13:l], 1, sum)  #feature signal sum
raw_peaks <- raw_peaks[order(raw_peaks$pcgroup, -raw_peaks$fss), ]
# for every pc group, the first line has the strongest signal. The function distinct() [dplyr package] can be used to keep only unique/distinct rows from a
# data frame. If there are duplicate rows, _only the first row_ is preserved.
selected_peaks <- distinct(raw_peaks, raw_peaks$pcgroup, .keep_all = TRUE)
hp_dim4 <- dim(selected_peaks)[1]
hp_pcg4 <- length(unique(selected_peaks$pcgroup))
# write.csv(selected_peaks, 'HILIC_pos_20190417_cleaned_pcgroup.csv')
rm(selected_peaks)

# Keep only features with explicit annotations of [M+H]+
library(stringr)
raw_peaks["[M+H]+"] <- NA
pattern <- "[M+H]+"
raw_peaks["[M+H]+"] <- str_detect(raw_peaks$adduct, paste0("^\\Q", pattern, "\\E"))
selected_peaks <- raw_peaks[raw_peaks$`[M+H]+` == "TRUE", ]
hp_dim5 <- dim(selected_peaks)[1]
hp_pcg5 <- length(unique(selected_peaks$pcgroup))
# write.csv(selected_peaks, 'HILIC_pos_20190417_cleaned_MH.csv')

hilic_pos <- raw_peaks
rm(f, l, raw_peaks, selected_peaks)

# ================================ HILIC neg ===================================
raw_peaks <- read.csv("data/HILIC_neg_20190421.csv")
hn_dim1 <- dim(raw_peaks)[1]
hn_pcg1 <- length(unique(raw_peaks$pcgroup))

# Removing features eluting in the void
raw_peaks <- raw_peaks[raw_peaks$rt > 45, ]
hn_dim2 <- dim(raw_peaks)[1]
hn_pcg2 <- length(unique(raw_peaks$pcgroup))

# Removing features with a CV < 30%
f <- which(colnames(raw_peaks) == "IE_20170919_01301")  # first QC HILIC neg
l <- which(colnames(raw_peaks) == "IE_20170919_06901")  # last QC HILIC neg
raw_peaks["SD"] <- apply(raw_peaks[, f:l], 1, sd)
raw_peaks["MEAN"] <- apply(raw_peaks[, f:l], 1, mean)
raw_peaks["CV"] <- raw_peaks$SD/raw_peaks$MEAN
raw_peaks <- raw_peaks[raw_peaks$CV < 0.3, ]
hn_dim3 <- dim(raw_peaks)[1]
hn_pcg3 <- length(unique(raw_peaks$pcgroup))
# write.csv(raw_peaks, 'HILIC_neg_20190421_cleaned.csv')

# Keep only the feature with the greatest sum of signal per pc group
l <- dim(raw_peaks)[2] - 6
raw_peaks["fss"] <- apply(raw_peaks[, 13:l], 1, sum)  #feature signal sum
raw_peaks <- raw_peaks[order(raw_peaks$pcgroup, -raw_peaks$fss), ]
selected_peaks <- distinct(raw_peaks, raw_peaks$pcgroup, .keep_all = TRUE)
hn_dim4 <- dim(selected_peaks)[1]
hn_pcg4 <- length(unique(selected_peaks$pcgroup))
# write.csv(selected_peaks, 'HILIC_neg_20190421_cleaned_pcgroup.csv')
rm(selected_peaks)

# Keep only features with explicit annotations of [M-H]-
library(stringr)
raw_peaks["[M-H]-"] <- NA
pattern <- "[M-H]-"
raw_peaks["[M-H]-"] <- str_detect(raw_peaks$adduct, paste0("^\\Q", pattern, "\\E"))
selected_peaks <- raw_peaks[raw_peaks$`[M-H]-` == "TRUE", ]
hn_dim5 <- dim(selected_peaks)[1]
hn_pcg5 <- length(unique(selected_peaks$pcgroup))
# write.csv(selected_peaks, 'HILIC_neg_20190421_cleaned_MH.csv')

hilic_neg <- raw_peaks
rm(f, l, raw_peaks)

# ================================== RP pos ====================================
raw_peaks <- read.csv("data/RP_pos_20190421.csv")
rp_dim1 <- dim(raw_peaks)[1]
rp_pcg1 <- length(unique(raw_peaks$pcgroup))

# Removing features eluting in the void
raw_peaks <- raw_peaks[raw_peaks$rt > 45, ]
rp_dim2 <- dim(raw_peaks)[1]
rp_pcg2 <- length(unique(raw_peaks$pcgroup))

# Removing features with a CV < 30%
f <- which(colnames(raw_peaks) == "IE_20171002_01401")  # first QC RP neg
l <- which(colnames(raw_peaks) == "IE_20171002_07201")  # last QC RP neg
raw_peaks["SD"] <- apply(raw_peaks[, f:l], 1, sd)
raw_peaks["MEAN"] <- apply(raw_peaks[, f:l], 1, mean)
raw_peaks["CV"] <- raw_peaks$SD/raw_peaks$MEAN
raw_peaks <- raw_peaks[raw_peaks$CV < 0.3, ]
rp_dim3 <- dim(raw_peaks)[1]
rp_pcg3 <- length(unique(raw_peaks$pcgroup))
# write.csv(raw_peaks, 'RP_pos_20190421_cleaned.csv')

# Keep only the feature with the greatest sum of signal per pc group
l <- dim(raw_peaks)[2] - 6
raw_peaks["fss"] <- apply(raw_peaks[, 13:l], 1, sum)  #feature signal sum
raw_peaks <- raw_peaks[order(raw_peaks$pcgroup, -raw_peaks$fss), ]
selected_peaks <- distinct(raw_peaks, raw_peaks$pcgroup, .keep_all = TRUE)
rp_dim4 <- dim(selected_peaks)[1]
rp_pcg4 <- length(unique(selected_peaks$pcgroup))
# write.csv(selected_peaks, 'RP_pos_20190421_cleaned_pcgroup.csv')
rm(selected_peaks)

# Keep only features with explicit annotations of [M+H]+
library(stringr)
raw_peaks["[M+H]+"] <- NA
pattern <- "[M+H]+"
raw_peaks["[M+H]+"] <- str_detect(raw_peaks$adduct, paste0("^\\Q", pattern, "\\E"))
selected_peaks <- raw_peaks[raw_peaks$`[M+H]+` == "TRUE", ]
rp_dim5 <- dim(selected_peaks)[1]
rp_pcg5 <- length(unique(selected_peaks$pcgroup))
# write.csv(selected_peaks, 'RP_pos_20190421_cleaned_MH.csv')

rp_pos <- raw_peaks
rm(f, l, raw_peaks, selected_peaks)

# ================================== RP neg ====================================
raw_peaks <- read.csv("data/RP_neg_20190422.csv")
rn_dim1 <- dim(raw_peaks)[1]
rn_pcg1 <- length(unique(raw_peaks$pcgroup))

# Removing features eluting in the void
raw_peaks <- raw_peaks[raw_peaks$rt > 45, ]
rn_dim2 <- dim(raw_peaks)[1]
rn_pcg2 <- length(unique(raw_peaks$pcgroup))

# Removing features with a CV < 30%
f <- which(colnames(raw_peaks) == "IE_20171003_01301")  # first QC RP neg
l <- which(colnames(raw_peaks) == "IE_20171003_05701")  # 3rd last QC RP neg; QC empty after this injection
raw_peaks["SD"] <- apply(raw_peaks[, f:l], 1, sd)
raw_peaks["MEAN"] <- apply(raw_peaks[, f:l], 1, mean)
raw_peaks["CV"] <- raw_peaks$SD/raw_peaks$MEAN
raw_peaks <- raw_peaks[raw_peaks$CV < 0.3, ]
rn_dim3 <- dim(raw_peaks)[1]
rn_pcg3 <- length(unique(raw_peaks$pcgroup))
# write.csv(raw_peaks, 'RP_neg_20190422_cleaned.csv')

# Keep only the feature with the greatest sum of signal per pc group
l <- dim(raw_peaks)[2] - 6
raw_peaks["fss"] <- apply(raw_peaks[, 13:l], 1, sum)  #feature signal sum
raw_peaks <- raw_peaks[order(raw_peaks$pcgroup, -raw_peaks$fss), ]
selected_peaks <- distinct(raw_peaks, raw_peaks$pcgroup, .keep_all = TRUE)
rn_dim4 <- dim(selected_peaks)[1]
rn_pcg4 <- length(unique(selected_peaks$pcgroup))
# write.csv(selected_peaks, 'RP_neg_20190422_cleaned_pcgroup.csv')
rm(selected_peaks)

# Keep only features with explicit annotations of [M-H]-
library(stringr)
raw_peaks["[M-H]-"] <- NA
pattern <- "[M-H]-"
raw_peaks["[M-H]-"] <- str_detect(raw_peaks$adduct, paste0("^\\Q", pattern, "\\E"))
selected_peaks <- raw_peaks[raw_peaks$`[M-H]-` == "TRUE", ]
rn_dim5 <- dim(selected_peaks)[1]
rn_pcg5 <- length(unique(selected_peaks$pcgroup))
# write.csv(selected_peaks, 'RP_neg_20190422_cleaned_MH.csv')

hilic_neg <- raw_peaks
rm(f, l, raw_peaks, selected_peaks)

2.2.7 Data quality control

To monitor the stability of the signal during the UPLC-HRMS run, we plot the total signal per sample by injection order.

hp <- read.csv("data/HILIC_pos_20190417.csv", header = T, sep = ",")
hn <- read.csv("data/HILIC_neg_20190421.csv", header = T, sep = ",")
rp <- read.csv("data/RP_pos_20190421.csv", header = T, sep = ",")
rn <- read.csv("data/RP_neg_20190422.csv", header = T, sep = ",")

# subsetting, keep only sample data
hp <- hp[, 13:(dim(hp)[2] - 3)]  #14
hn <- hn[, 13:(dim(hn)[2] - 3)]
rp <- rp[, 13:(dim(rp)[2] - 3)]
rn <- rn[, 13:(dim(rn)[2] - 3)]

# sum signal of columns, i.e. per sample
hp[nrow(hp) + 1, ] <- apply(hp, 2, sum)
hn[nrow(hn) + 1, ] <- apply(hn, 2, sum)
rp[nrow(rp) + 1, ] <- apply(rp, 2, sum)
rn[nrow(rn) + 1, ] <- apply(rn, 2, sum)

# data frame gymnastics
hp_df <- data.frame(t((hp[dim(hp)[1], ])))
hn_df <- data.frame(t((hn[dim(hn)[1], ])))
rp_df <- data.frame(t((rp[dim(rp)[1], ])))
rn_df <- data.frame(t((rn[dim(rn)[1], ])))

colnames(hp_df) <- c("colsum_hp")
hp_df["id"] <- rownames(hp_df)
colnames(hn_df) <- c("colsum_hn")
hn_df["id"] <- rownames(hn_df)
colnames(rp_df) <- c("colsum_rp")
rp_df["id"] <- rownames(rp_df)
colnames(rn_df) <- c("colsum_rn")
rn_df["id"] <- rownames(rn_df)

library(stringr)
hp_df["hp_io"] <- as.integer(str_sub(hp_df$id, -4, -3))
hn_df["hn_io"] <- as.integer(str_sub(hn_df$id, -4, -3))
rp_df["rp_io"] <- as.integer(str_sub(rp_df$id, -4, -3))
rn_df["rn_io"] <- as.integer(str_sub(rn_df$id, -4, -3))

hp_df <- hp_df[order(hp_df$hp_io), ]
hp_df["n"] <- seq(nrow(hp_df))
hp_df["experiment"] <- "HILIC pos."

hn_df <- hn_df[order(hn_df$hn_io), ]
hn_df["n"] <- seq(nrow(hn_df))
hn_df["experiment"] <- "HILIC neg."

rp_df <- rp_df[order(rp_df$rp_io), ]
rp_df["n"] <- seq(nrow(rp_df))
rp_df["experiment"] <- "RP pos."

rn_df <- rn_df[order(rn_df$rn_io), ]
rn_df["n"] <- seq(nrow(rn_df))
rn_df["experiment"] <- "RP neg."

colnames(hp_df) <- c("colsum", "id", "io", "n", "experiment")
colnames(hn_df) <- c("colsum", "id", "io", "n", "experiment")
colnames(rp_df) <- c("colsum", "id", "io", "n", "experiment")
colnames(rn_df) <- c("colsum", "id", "io", "n", "experiment")

stability <- rbind(hp_df, hn_df, rp_df, rn_df)

library(ggplot2)
ggplot(stability, aes(x = n, y = colsum, color = experiment)) + geom_line(size = 1.5) + ggtitle("Signal intensity during UPLC-HRMS experiments") + xlab("sample injection order") + 
    ylab("total signal intensity per sample") + labs(colour = "Experiment") + theme(legend.position = "bottom") + geom_smooth(method = "lm", size = 0.5) + facet_grid(experiment ~ 
    ., scales = "free")

Figure 2.1: Cummulative signal intensity across all samples and chromatographic experiments shows only a minor effect of injection order.

rm(hp, hp_df, hn, hn_df, rp, rp_df, rn, rn_df, stability)

We see that while signal intensity in all experimental data seems fairly stable, the HILIC positive data set has the highest intensity and virtually no changes at all throughout data acquisition. Therefore, we mostly rely on analyses of the HILIC pos data set.

2.2.8 Summary of data processing

Below we outline the number of features removed in every step as shown in the table and corresponding figure. “CV>30%” is the data set subsequently labelled “cleaned”, “repr. feat.” are the PC-group data set and “ion”" are sometimes also labelled “MH”.

initial_f <- c(hp_dim1, hn_dim1, rp_dim1, rn_dim1)
rt_f <- c(hp_dim2, hn_dim2, rp_dim2, rn_dim2)
cv_f <- c(hp_dim3, hn_dim3, rp_dim3, rn_dim3)
representative_f <- c(hp_dim4, hn_dim4, rp_dim4, rn_dim4)
ion_f <- c(hp_dim5, hn_dim5, rp_dim5, rn_dim5)

initial_p <- c(hp_pcg1, hn_pcg1, rp_pcg1, rn_pcg1)
rt_p <- c(hp_pcg2, hn_pcg2, rp_pcg2, rn_pcg2)
cv_p <- c(hp_pcg3, hn_pcg3, rp_pcg3, rn_pcg3)
representative_p <- c(hp_pcg4, hn_pcg4, rp_pcg4, rn_pcg4)
ion_p <- c(hp_pcg5, hn_pcg5, rp_pcg5, rn_pcg5)

df <- cbind(initial_f, rt_f, cv_f, representative_f, ion_f, initial_p, rt_p, cv_p, representative_p, ion_p)
rownames(df) <- c("HILIC pos", "HILIC neg", "RP pos", "RP neg")

library(kableExtra)
options(kableExtra.html.bsTable = T)
kable(df, col.names = c("initial", "RT >45 s", "CV <30%", "repr. feat.", "ion", "initial", "RT >45 s", "CV <30%", "repr. feat.", "ion"), longtable = T, booktabs = T, 
    caption = "Number of features and pc groups 
      remaining after each filtering step.") %>% add_header_above(c("", features = 5, `pc groups` = 5)) %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "bordered", "condensed", "responsive"), full_width = F, latex_options = c("striped", "scale_down"))

	initial	RT >45 s	CV <30%	repr. feat.	ion	initial	RT >45 s	CV <30%	repr. feat.	ion
HILIC pos	5113	4389	3507	2212	105	3298	2781	2212	2212	79
HILIC neg	4049	3081	2808	1351	123	2044	1495	1351	1351	62
RP pos	5821	5588	4673	2736	171	3420	3277	2736	2736	128
RP neg	3867	3621	3166	1678	105	2069	1918	1678	1678	80

library(reshape2)
df <- melt(df)
df["filter"] <- c(rep(("initial"), 4), rep(("RT >45 s"), 4), rep(("CV <30%"), 4), rep(("rep"), 4), rep(("ion"), 4))
df$filter <- factor(df$filter, levels = c("initial", "RT >45 s", "CV <30%", "rep", "ion"))
df["category"] <- c(rep(("feature"), 20), rep(("pc group"), 20))
df$Var2 <- NULL
colnames(df) <- c("acquisition", "value", "filter", "category")
str(df)

## 'data.frame':    40 obs. of  4 variables:
##  $ acquisition: Factor w/ 4 levels "HILIC pos","HILIC neg",..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ value      : int  5113 4049 5821 3867 4389 3081 5588 3621 3507 2808 ...
##  $ filter     : Factor w/ 5 levels "initial","RT >45 s",..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ category   : chr  "feature" "feature" "feature" "feature" ...

# library(ggplot2) ggplot(df, aes(fill=filter, x=acquisition, y=value))+ geom_bar(position='dodge', stat = 'identity')+ facet_grid(.~category)+ xlab('Data
# acquisition mode')+ylab('Count')+labs(fill='Filtering step')+ theme_bw()+ scale_fill_grey()+ scale_y_continuous(breaks=seq(0,6000,1000))+ theme(axis.text.x
# = element_text(angle = 45, vjust = 0.5, hjust = 0.5), legend.position='bottom')

After preliminary analyses of metabolomics data, a few outliers were re-identified by P. Cárdenas, and one specimen was removed from the data set as it originated from another species.

2.3 Chromatorgrams

In G. barretti barettin and 8,9-dihydrobarettin are major compounds with distinct peaks in the chromatogram. We wanted to see whether S. fortis and W. bursa also had major compounds and whether differences in sample depth would be directly reflected in the chromatograms.

gb_chrom <- read.csv("data/Gb_all_chromatograms.csv", header = T, sep = ";")
# differences in RT negligible/insignificant between the different chromatograms/samples

gb_chrom[, c("Gb10_RT", "Gb12_RT", "Gb13_RT", "Gb14_RT", "Gb15_RT", "Gb16_RT", "Gb17_RT", "Gb18_RT", "Gb19_RT", "Gb2_RT", "Gb20_RT", "Gb21_RT", "Gb3_RT", "Gb7_RT", 
    "Gb8_RT", "Gb9_RT")] <- list(NULL)
colnames(gb_chrom)[colnames(gb_chrom) == "Gb1_RT"] <- "RT"
gb_chrom <- gb_chrom[gb_chrom$RT <= 12, ]  # removing tail w/o any signal, run ended at 12 min

# ugly but works
p_gb_chrom <- ggplot(gb_chrom, aes(x = RT, y = Gb1)) + geom_line() + theme_classic() + geom_rect(data = NULL, aes(xmin = 6.3, xmax = 6.5, ymin = -Inf, ymax = Inf), 
    fill = "lightgreen", alpha = 0.5) + geom_rect(data = NULL, aes(xmin = 6.95, xmax = 7.15, ymin = -Inf, ymax = Inf), fill = "lightblue", alpha = 0.5) + geom_line(aes(x = RT, 
    y = Gb2)) + geom_line(aes(x = RT, y = Gb3)) + geom_line(aes(x = RT, y = Gb7)) + geom_line(aes(x = RT, y = Gb8)) + geom_line(aes(x = RT, y = Gb9)) + geom_line(aes(x = RT, 
    y = Gb10)) + geom_line(aes(x = RT, y = Gb15)) + geom_line(aes(x = RT, y = Gb16)) + geom_line(aes(x = RT, y = Gb17)) + geom_line(aes(x = RT, y = Gb19)) + geom_line(aes(x = RT, 
    y = Gb21)) + geom_line(aes(x = RT, y = Gb12, color = "red", alpha = 0.5)) + geom_line(aes(x = RT, y = Gb13, color = "red", alpha = 0.5)) + geom_line(aes(x = RT, 
    y = Gb14, color = "red", alpha = 0.5)) + geom_line(aes(x = RT, y = Gb18, color = "red", alpha = 0.5)) + geom_line(aes(x = RT, y = Gb20, color = "red", alpha = 0.5)) + 
    xlab("RT") + ylab("Signal intensity") + ggtitle("Overlay chromatograms from G. barretti HILIC positive") + theme_classic(base_size = 7) + theme(legend.position = "none")

sf_chrom <- read.csv("data/Sf_all_chromatograms.csv", header = T, sep = ";")
sf_chrom[, c("Sf10_RT", "Sf11_RT", "Sf12_RT", "Sf13_RT", "Sf14_RT", "Sf15_RT", "Sf2_RT", "Sf3_RT", "Sf4_RT", "Sf5_RT", "Sf6_RT", "Sf7_RT")] <- list(NULL)
colnames(sf_chrom)[colnames(sf_chrom) == "Sf1_RT"] <- "RT"
sf_chrom <- sf_chrom[sf_chrom$RT <= 12, ]


p_sf_chrom <- ggplot(sf_chrom, aes(x = RT, y = Sf1)) + geom_line() + geom_line(aes(x = RT, y = Sf2)) + geom_line(aes(x = RT, y = Sf3)) + geom_line(aes(x = RT, 
    y = Sf4)) + geom_line(aes(x = RT, y = Sf5)) + geom_line(aes(x = RT, y = Sf6)) + geom_line(aes(x = RT, y = Sf7)) + geom_line(aes(x = RT, y = Sf10, color = "red", 
    alpha = 0.5)) + geom_line(aes(x = RT, y = Sf11, color = "red", alpha = 0.5)) + geom_line(aes(x = RT, y = Sf12, color = "red", alpha = 0.5)) + geom_line(aes(x = RT, 
    y = Sf13, color = "red", alpha = 0.5)) + geom_line(aes(x = RT, y = Sf14, color = "red", alpha = 0.5)) + geom_line(aes(x = RT, y = Sf15, color = "red", alpha = 0.5)) + 
    xlab("RT") + ylab("Signal intensity") + ggtitle("Overlay chromatograms from S. fortis HILIC positive") + theme_classic(base_size = 7) + theme(legend.position = "none")

wb_chrom <- read.csv("data/Wb_all_chromatograms.csv", header = T, sep = ";")
wb_chrom[, c("Wb10_RT", "Wb11_RT", "Wb12_RT", "Wb13_RT", "Wb14_RT", "Wb15_RT", "Wb16_RT", "Wb17_RT", "Wb2_RT", "Wb3_RT", "Wb4_RT", "Wb5_RT", "Wb6_RT", "Wb7_RT", 
    "Wb8_RT")] <- list(NULL)
colnames(wb_chrom)[colnames(wb_chrom) == "Wb1_RT"] <- "RT"
wb_chrom <- wb_chrom[wb_chrom$RT <= 12, ]

p_wb_chrom <- ggplot(wb_chrom, aes(x = RT, y = Wb1)) + geom_line() + geom_line(aes(x = RT, y = Wb2)) + geom_line(aes(x = RT, y = Wb3)) + geom_line(aes(x = RT, 
    y = Wb4)) + geom_line(aes(x = RT, y = Wb5)) + geom_line(aes(x = RT, y = Wb6)) + geom_line(aes(x = RT, y = Wb7)) + geom_line(aes(x = RT, y = Wb8)) + geom_line(aes(x = RT, 
    y = Wb10)) + geom_line(aes(x = RT, y = Wb11)) + geom_line(aes(x = RT, y = Wb12)) + geom_line(aes(x = RT, y = Wb13, color = "red", alpha = 0.5)) + geom_line(aes(x = RT, 
    y = Wb14, color = "red", alpha = 0.5)) + geom_line(aes(x = RT, y = Wb15, color = "red", alpha = 0.5)) + geom_line(aes(x = RT, y = Wb16, color = "red", alpha = 0.5)) + 
    geom_line(aes(x = RT, y = Wb17, color = "red", alpha = 0.5)) + xlab("RT") + ylab("Signal intensity") + ggtitle("Overlay chromatograms from W. bursa HILIC positive") + 
    theme_classic(base_size = 7) + theme(legend.position = "none")

library(gridExtra)
grid.arrange(p_gb_chrom, p_sf_chrom, p_wb_chrom, nrow = 3)

# k <- grid.arrange(p_gb_chrom, p_sf_chrom, p_wb_chrom, nrow=3) ggsave('chroms_20200701.pdf', plot=k, path='data/', device = 'pdf', units = 'mm', width = 175,
# height = 90, useDingbats=FALSE)

rm(p_gb_chrom, p_sf_chrom, p_wb_chrom, gb_chrom, sf_chrom, wb_chrom)

2.4 Multivariate analyses

We start by visualising all metabolomes with a PCA and a PERMANOVA to get a first impression of our data.

library(ropls)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(vegan)

# load data
hilic_pos <- read.csv("data/HILIC_pos_20190417_cleaned.csv", header = T, sep = ",")
hilic_neg <- read.csv("data/HILIC_neg_20190421_cleaned.csv", header = T, sep = ",")
rp_pos <- read.csv("data/RP_pos_20190421_cleaned.csv", header = T, sep = ",")
rp_neg <- read.csv("data/RP_neg_20190422_cleaned.csv", header = T, sep = ",")
meta_data <- read.csv("data/Steffen_et_al_metadata_PANGAEA.csv", header = T, sep = ";")

formatting <- function(metabolome, meta_data, r, my_colnames) {
    formatted <- metabolome
    formatted <- formatted[, 14:(dim(formatted)[2] - r)]
    formatted <- data.frame(t(formatted))
    formatted["ID"] <- rownames(formatted)
    formatted["unified_ID"] <- meta_data$unified_ID[match(formatted$ID, meta_data[[my_colnames]])]
    formatted["filter"] <- str_sub(formatted$unified_ID, 1, 2)
    formatted <- formatted[!formatted$filter == "QC", ]
    formatted <- na.omit(formatted)
    # formatted$filter <- NULL
    formatted$ID <- NULL
    formatted <- formatted[order(formatted$unified_ID), ]
    rownames(formatted) <- formatted$unified_ID
    return(formatted)
}

hilic_pos <- formatting(hilic_pos, meta_data, 6, "LC.MS.HILIC.positive")
hilic_neg <- formatting(hilic_neg, meta_data, 6, "LC.MS.HILIC.negative")
rp_pos <- formatting(rp_pos, meta_data, 6, "LC.MS.RP.positive")
rp_neg <- formatting(rp_neg, meta_data, 6, "LC.MS.RP.negative")

md <- meta_data[, c("Species", "Depth", "Latitude", "Longitude", "YEAR", "MeanBottomTemp_Cdeg", "MeanBotSalinity_PSU", "unified_ID")]

# Overview PCA illustrating the metabolomes differ by sponge species

pca_wrapper <- function(metabolome, md, my_title) {
    metabolome$unified_ID <- NULL
    metabolome$filter <- NULL
    metabolome <- prcomp(metabolome, scale = T)
    k <- summary(metabolome)[["importance"]]
    metabolome_df <- data.frame(metabolome$x)  #scores, i.e. principal components of the sponge sample 
    metabolome_df["unified_ID"] <- as.factor(rownames(metabolome_df))
    x1 <- paste("PC1", round(k[2, 1], digits = 3) * 100, "%")
    y1 <- paste("PC2", round(k[2, 2], digits = 3) * 100, "%")
    metabolome_df <- left_join(metabolome_df[, c("PC1", "PC2", "PC3", "unified_ID")], md[, c("Species", "Depth", "Latitude", "Longitude", "MeanBottomTemp_Cdeg", 
        "MeanBotSalinity_PSU", "unified_ID")])
    p <- ggplot(metabolome_df, aes(x = PC1, y = PC2)) + geom_point(size = 3, aes(shape = factor(Species))) + ggtitle(my_title) + xlab(x1) + ylab(y1) + theme(legend.position = "none")  #+labs(shape='Species')
    return(p)
}

hp_pca <- pca_wrapper(hilic_pos, md, "Hilic pos")
hn_pca <- pca_wrapper(hilic_neg, md, "Hilic neg")
rp_pca <- pca_wrapper(rp_pos, md, "RP pos")
rn_pca <- pca_wrapper(rp_neg, md, "RP neg")

library(gridExtra)
grid.arrange(hp_pca, hn_pca, rp_pca, rn_pca, nrow = 2)

Figure 2.2: PCA of the cleaned data sets. Dots are Gb samples, squares are Wb samples, traingles are Sf samples.

For a general overview of the metabolomic data, Fig. @ is an OPLS model predicting variation as a function of depth versus orthogonal variation in all three sponge metabolomes combined. It is based on the HILIC positive cleaned data.

library(ropls)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(vegan)

# load data
hilic_pos <- read.csv("data/HILIC_pos_20190417_cleaned.csv", header = T, sep = ",")
meta_data <- read.csv("data/Steffen_et_al_metadata_PANGAEA.csv", header = T, sep = ";")

formatting <- function(metabolome, meta_data, r, my_colnames) {
    formatted <- metabolome
    formatted <- formatted[, 14:(dim(formatted)[2] - r)]
    formatted <- data.frame(t(formatted))
    formatted["ID"] <- rownames(formatted)
    formatted["unified_ID"] <- meta_data$unified_ID[match(formatted$ID, meta_data[[my_colnames]])]
    formatted["filter"] <- str_sub(formatted$unified_ID, 1, 2)
    formatted <- formatted[!formatted$filter == "QC", ]
    formatted <- na.omit(formatted)
    # formatted$filter <- NULL
    formatted$ID <- NULL
    formatted <- formatted[order(formatted$unified_ID), ]
    rownames(formatted) <- formatted$unified_ID
    return(formatted)
}

hilic_pos <- formatting(hilic_pos, meta_data, 6, "LC.MS.HILIC.positive")
md <- meta_data[, c("Species", "Depth", "Latitude", "Longitude", "YEAR", "MeanBottomTemp_Cdeg", "MeanBotSalinity_PSU", "unified_ID")]
md <- md[md$unified_ID %in% rownames(hilic_pos), ]
md <- md[order(md$unified_ID), ]
all(rownames(hilic_pos) == md$unified_ID)
hilic_pos[, c("unified_ID", "filter")] <- list(NULL)
hilic.ropls.pca <- opls(hilic_pos, plotL = FALSE)
hilic.opls <- opls(hilic_pos, md[, "Depth"], permI = 1000, orthoI = NA, scaleC = "pareto", plotL = FALSE)
h.ortho.score <- getScoreMN(hilic.opls, orthoL = T)
h.pred.score <- getScoreMN(hilic.opls, orthoL = F)
h.df.3d <- data.frame(h.pred.score, h.ortho.score, md$Depth, md$unified_ID, md$Species)
# write.csv(h.df.3d, 'data/metabolome_opls_3d.csv') hilic.opls@summaryDF R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2 Total 0.534 0.987 0.662 43.7 1 4
# 0.001 0.001

library(plotly)
h.df.3d <- read.csv("data/metabolome_opls_3d.csv", sep = ",", header = T)
# format background and axes
axx <- list(backgroundcolor = "rgb(211,211,211)", gridcolor = "rgb(255,255,255)", title = "predicted", showbackground = TRUE)

axy <- list(backgroundcolor = "rgb(211,211,211)", gridcolor = "rgb(255,255,255)", title = "orthogonal 1", showbackground = TRUE)

axz <- list(backgroundcolor = "rgb(211,211,211)", gridcolor = "rgb(255,255,255)", title = "orthogonal 2", showbackground = TRUE)
met3D <- plot_ly(h.df.3d, x = ~h.df.3d$p1, y = ~h.df.3d$o1, z = ~h.df.3d$o2, symbol = ~md.Species, symbols = c("diamond", "x", "circle"), color = ~h.df.3d$md.Depth) %>% 
    add_markers() %>% layout(scene = list(xaxis = axx, yaxis = axy, zaxis = axz))
met3D

Figure 2.3: Multivariate analysis (OPLS) of all sponge metabolomes by depth. At this point, this is purely for overview. In depth methods and evaluation are provided in the section below.

# for saving f<- basename(tempfile('OPLS_metabolome_plotly', '.', '.html')) on.exit(unlink(f), add = TRUE) html <- htmlwidgets::saveWidget(met3D, f)

library(vegan)

permanova <- function(metabolome, md) {
    metabolome$filter <- NULL
    md <- md[md$unified_ID %in% metabolome$unified_ID, ]
    metabolome <- metabolome[order(metabolome$unified_ID), ]
    md <- md[order(md$unified_ID), ]
    metabolome$unified_ID <- NULL
    m <- adonis(metabolome ~ Species, md)
    return(m)
}

permanova(hilic_pos, md)

## 
## Call:
## adonis(formula = metabolome ~ Species, data = md) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## Species    2    5.6422 2.82112  34.302 0.62026  0.001 ***
## Residuals 42    3.4543 0.08224         0.37974           
## Total     44    9.0965                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

permanova(hilic_neg, md)

## 
## Call:
## adonis(formula = metabolome ~ Species, data = md) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## Species    2    2.0768 1.03838  26.046 0.55363  0.001 ***
## Residuals 42    1.6744 0.03987         0.44637           
## Total     44    3.7512                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

permanova(rp_pos, md)

## 
## Call:
## adonis(formula = metabolome ~ Species, data = md) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model     R2 Pr(>F)    
## Species    2    6.0969 3.04847  35.972 0.6314  0.001 ***
## Residuals 42    3.5593 0.08475         0.3686           
## Total     44    9.6563                 1.0000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

permanova(rp_neg, md)

## 
## Call:
## adonis(formula = metabolome ~ Species, data = md) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## Species    2    6.4179  3.2089  36.013 0.63725  0.001 ***
## Residuals 41    3.6533  0.0891         0.36275           
## Total     43   10.0712                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the PCA and the PERMANOVA (function adonis, Pr(>F) is the p-value), we gather that the metabolomes regardless of how they were acquired are significatly different in the three sponge species. Therefore, we separate the metabolomes for investigating the effect of depth on them individually.

In order to investigate whether depth has an effect on the species-specific metabolomes, we use OPLS orthogonal projections of latent structures as multivariate analysis approach. This method allows to separate variations in the data based on e.g. depth and orthogonal, i.e. unrelated variation.

# splitting df and adapting meta data
spl_ada <- function(metabolome, md) {
    metabolome_gb <- metabolome[metabolome$filter == "Gb", ]
    metabolome_sf <- metabolome[metabolome$filter == "Sf", ]
    metabolome_wb <- metabolome[metabolome$filter == "Wb", ]
    
    md_gb <- md[md$unified_ID %in% metabolome_gb$unified_ID, ]
    md_sf <- md[md$unified_ID %in% metabolome_sf$unified_ID, ]
    md_wb <- md[md$unified_ID %in% metabolome_wb$unified_ID, ]
    
    metabolome_gb <- metabolome_gb[order(metabolome_gb$unified_ID), ]
    metabolome_sf <- metabolome_sf[order(metabolome_sf$unified_ID), ]
    metabolome_wb <- metabolome_wb[order(metabolome_wb$unified_ID), ]
    
    metabolome_gb$filter <- NULL
    metabolome_gb$unified_ID <- NULL
    metabolome_sf$filter <- NULL
    metabolome_sf$unified_ID <- NULL
    metabolome_wb$filter <- NULL
    metabolome_wb$unified_ID <- NULL
    
    md_gb <- md_gb[order(md_gb$unified_ID), ]
    md_sf <- md_sf[order(md_sf$unified_ID), ]
    md_wb <- md_wb[order(md_wb$unified_ID), ]
    
    mva_data <- list(metabolome_gb = metabolome_gb, metabolome_sf = metabolome_sf, metabolome_wb = metabolome_wb, md_gb = md_gb, md_sf = md_sf, md_wb = md_wb)
    return(mva_data)
}

hp_opls_df <- spl_ada(hilic_pos, md)
hn_opls_df <- spl_ada(hilic_neg, md)
rp_opls_df <- spl_ada(rp_pos, md)
rn_opls_df <- spl_ada(rp_neg, md)

# P should be NA; or 1
ropls_wrapper <- function(opls_df, p1, p2, p3) {
    # ropls' PCA
    gb.pca <- opls(opls_df$metabolome_gb, plotL = F, printL = F)
    sf.pca <- opls(opls_df$metabolome_sf, plotL = F, printL = F)
    wb.pca <- opls(opls_df$metabolome_wb, plotL = F, printL = F)
    
    md_gb <- opls_df$md_gb
    md_sf <- opls_df$md_sf
    md_wb <- opls_df$md_wb
    
    # ropls' OPLS
    gb.opls <- opls(opls_df$metabolome_gb, md_gb[, "Depth"], permI = 1000, orthoI = p1, scaleC = "pareto", plotL = F)
    sf.opls <- opls(opls_df$metabolome_sf, md_sf[, "Depth"], permI = 1000, orthoI = p2, scaleC = "pareto", plotL = F)
    wb.opls <- opls(opls_df$metabolome_wb, md_wb[, "Depth"], permI = 1000, orthoI = p3, scaleC = "pareto", plotL = F)
    
    # summary
    gb.opls <- gb.opls@summaryDF
    sf.opls <- sf.opls@summaryDF
    wb.opls <- wb.opls@summaryDF
    
    opls_diagnostics <- rbind(gb.opls, sf.opls, wb.opls)
    opls_diagnostics["Species"] <- c("Geodia barretti", "Stryphnus fortis", "Weberella bursa")
    return(opls_diagnostics)
}

a <- Sys.time()
hp_cleaned_opls <- ropls_wrapper(hp_opls_df, NA, 1, NA)
hn_cleaned_opls <- ropls_wrapper(hn_opls_df, NA, 1, NA)
rp_cleaned_opls <- ropls_wrapper(rp_opls_df, NA, 1, NA)
rn_cleaned_opls <- ropls_wrapper(rn_opls_df, 1, 1, NA)
b <- Sys.time()
b - a

cleaned <- rbind(hp_cleaned_opls, hn_cleaned_opls, rp_cleaned_opls, rn_cleaned_opls)
cleaned["Experiment"] <- c(rep("HILIC pos", 3), rep("HILIC neg", 3), rep("RP pos", 3), rep("RP neg", 3))
write.csv(cleaned, "data/cleaned_opls_diagnostics.csv", row.names = F)

knitr::include_graphics("data/opls_Gb_Hp.jpeg")

Figure 2.4: Example output for OPLS models generated with ROPLS. This is HILIC positive cleaned data set for G. barretti.

# R2X: The fraction of the variation of the X variables explained by the model.  R2Y: The fraction of the variation of the Y variables explained by the model.
# Q2X: The fraction of the variation of the X variables predicted by the model.  Q2Y: The fraction of the variation of the Y variables predicted by the model.

In the same manner, we can analyse the two remaining sets produced by different filtering options.

# load data ION
hilic_pos <- read.csv("data/HILIC_pos_20190417_cleaned_MH.csv", header = T, sep = ",")
hilic_neg <- read.csv("data/HILIC_neg_20190421_cleaned_MH.csv", header = T, sep = ",")
rp_pos <- read.csv("data/RP_pos_20190421_cleaned_MH.csv", header = T, sep = ",")
rp_neg <- read.csv("data/RP_neg_20190422_cleaned_MH.csv", header = T, sep = ",")
meta_data <- read.csv("data/Steffen_et_al_metadata_PANGAEA.csv", header = T, sep = ";")

hilic_pos <- formatting(hilic_pos, meta_data, 8, "LC.MS.HILIC.positive")
hilic_neg <- formatting(hilic_neg, meta_data, 8, "LC.MS.HILIC.negative")
rp_pos <- formatting(rp_pos, meta_data, 8, "LC.MS.RP.positive")
rp_neg <- formatting(rp_neg, meta_data, 8, "LC.MS.RP.negative")

md <- meta_data[, c("Species", "Depth", "Latitude", "Longitude", "YEAR", "MeanBottomTemp_Cdeg", "MeanBotSalinity_PSU", "unified_ID")]

hp_pca <- pca_wrapper(hilic_pos, md, "Hilic pos")
hn_pca <- pca_wrapper(hilic_neg, md, "Hilic neg")
rp_pca <- pca_wrapper(rp_pos, md, "RP pos")
rn_pca <- pca_wrapper(rp_neg, md, "RP neg")

grid.arrange(hp_pca, hn_pca, rp_pca, rn_pca, nrow = 2)

PCA of the metabolome data set filtered to retain only annotated ions. Dots are Gb samples, squares are Wb samples, traingles are Sf samples.

Figure 2.5: PCA of the metabolome data set filtered to retain only annotated ions. Dots are Gb samples, squares are Wb samples, traingles are Sf samples.

permanova(hilic_pos, md)

## 
## Call:
## adonis(formula = metabolome ~ Species, data = md) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## Species    2    7.9121  3.9560  48.131 0.69623  0.001 ***
## Residuals 42    3.4521  0.0822         0.30377           
## Total     44   11.3642                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

permanova(hilic_neg, md)

## 
## Call:
## adonis(formula = metabolome ~ Species, data = md) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## Species    2   1.23799 0.61899  31.471 0.59978  0.001 ***
## Residuals 42   0.82609 0.01967         0.40022           
## Total     44   2.06408                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

permanova(rp_pos, md)

## 
## Call:
## adonis(formula = metabolome ~ Species, data = md) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## Species    2    8.9172  4.4586  52.824 0.71554  0.001 ***
## Residuals 42    3.5450  0.0844         0.28446           
## Total     44   12.4621                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

permanova(rp_neg, md)

## 
## Call:
## adonis(formula = metabolome ~ Species, data = md) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## Species    2    7.4863  3.7432  33.508 0.62043  0.001 ***
## Residuals 41    4.5801  0.1117         0.37957           
## Total     43   12.0665                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

hp_opls_df <- spl_ada(hilic_pos, md)
hn_opls_df <- spl_ada(hilic_neg, md)
rp_opls_df <- spl_ada(rp_pos, md)
rn_opls_df <- spl_ada(rp_neg, md)

a <- Sys.time()
hp_cleaned_opls <- ropls_wrapper(hp_opls_df, NA, 1, NA)
hn_cleaned_opls <- ropls_wrapper(hn_opls_df, NA, NA, NA)
rp_cleaned_opls <- ropls_wrapper(rp_opls_df, NA, 1, NA)
rn_cleaned_opls <- ropls_wrapper(rn_opls_df, 1, 1, NA)
b <- Sys.time()
b - a

ion <- rbind(hp_cleaned_opls, hn_cleaned_opls, rp_cleaned_opls, rn_cleaned_opls)
ion["Experiment"] <- c(rep("HILIC pos", 3), rep("HILIC neg", 3), rep("RP pos", 3), rep("RP neg", 3))
write.csv(ion, "data/cleaned_MH_opls_diagnostics.csv", row.names = F)

# load data PC_GROUP
hilic_pos <- read.csv("data/HILIC_pos_20190417_cleaned_pcgroup.csv", header = T, sep = ",")
hilic_neg <- read.csv("data/HILIC_neg_20190421_cleaned_pcgroup.csv", header = T, sep = ",")
rp_pos <- read.csv("data/RP_pos_20190421_cleaned_pcgroup.csv", header = T, sep = ",")
rp_neg <- read.csv("data/RP_neg_20190422_cleaned_pcgroup.csv", header = T, sep = ",")
meta_data <- read.csv("data/Steffen_et_al_metadata_PANGAEA.csv", header = T, sep = ";")

hilic_pos <- formatting(hilic_pos, meta_data, 8, "LC.MS.HILIC.positive")
hilic_neg <- formatting(hilic_neg, meta_data, 8, "LC.MS.HILIC.negative")
rp_pos <- formatting(rp_pos, meta_data, 8, "LC.MS.RP.positive")
rp_neg <- formatting(rp_neg, meta_data, 8, "LC.MS.RP.negative")

md <- meta_data[, c("Species", "Depth", "Latitude", "Longitude", "YEAR", "MeanBottomTemp_Cdeg", "MeanBotSalinity_PSU", "unified_ID")]

hp_pca <- pca_wrapper(hilic_pos, md, "Hilic pos")
hn_pca <- pca_wrapper(hilic_neg, md, "Hilic neg")
rp_pca <- pca_wrapper(rp_pos, md, "RP pos")
rn_pca <- pca_wrapper(rp_neg, md, "RP neg")

grid.arrange(hp_pca, hn_pca, rp_pca, rn_pca, nrow = 2)

PCA of the metabolome data set filtered to retain the feature with the strongest signal per PC group. Dots are Gb samples, squares are Wb samples, traingles are Sf samples.

Figure 2.6: PCA of the metabolome data set filtered to retain the feature with the strongest signal per PC group. Dots are Gb samples, squares are Wb samples, traingles are Sf samples.

permanova(hilic_pos, md)

## 
## Call:
## adonis(formula = metabolome ~ Species, data = md) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model     R2 Pr(>F)    
## Species    2    6.0660 3.03302  33.376 0.6138  0.001 ***
## Residuals 42    3.8167 0.09087         0.3862           
## Total     44    9.8828                 1.0000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

permanova(hilic_neg, md)

## 
## Call:
## adonis(formula = metabolome ~ Species, data = md) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model     R2 Pr(>F)    
## Species    2    5.1323 2.56616  32.819 0.6098  0.001 ***
## Residuals 42    3.2840 0.07819         0.3902           
## Total     44    8.4163                 1.0000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

permanova(rp_pos, md)

## 
## Call:
## adonis(formula = metabolome ~ Species, data = md) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## Species    2    5.7191 2.85957  35.582 0.62886  0.001 ***
## Residuals 42    3.3754 0.08037         0.37114           
## Total     44    9.0945                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

permanova(rp_neg, md)

## 
## Call:
## adonis(formula = metabolome ~ Species, data = md) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## Species    2    6.8700  3.4350  33.333 0.61919  0.001 ***
## Residuals 41    4.2251  0.1031         0.38081           
## Total     43   11.0952                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

hp_opls_df <- spl_ada(hilic_pos, md)
hn_opls_df <- spl_ada(hilic_neg, md)
rp_opls_df <- spl_ada(rp_pos, md)
rn_opls_df <- spl_ada(rp_neg, md)

a <- Sys.time()
hp_cleaned_opls <- ropls_wrapper(hp_opls_df, NA, NA, NA)
hn_cleaned_opls <- ropls_wrapper(hn_opls_df, NA, NA, NA)
rp_cleaned_opls <- ropls_wrapper(rp_opls_df, NA, NA, NA)
rn_cleaned_opls <- ropls_wrapper(rn_opls_df, NA, NA, NA)
b <- Sys.time()
b - a

pcg <- rbind(hp_cleaned_opls, hn_cleaned_opls, rp_cleaned_opls, rn_cleaned_opls)
pcg["Experiment"] <- c(rep("HILIC pos", 3), rep("HILIC neg", 3), rep("RP pos", 3), rep("RP neg", 3))
write.csv(pcg, "data/cleaned_PC_opls_diagnostics.csv", row.names = F)

In some cases, an OPLS model was not appropriate. Typically, the model chooses the number of orthogonal components by itself (orthI=NA) and when the first component is not significant, the model building will stop. However, one can force any number of orthogonal components and run the OPLS model anyway. You can spot those cases when the NA was set to 1 in the ropls_wrapper function

The multivariate analyses (OPLS) were quite time consuming, requiring 1.4 hours for the “cleaned” data sets, 46.3 min for the “PC-group” data stes and 4 min for the “ion” data sets on my system. Therefore, the combined model diagnostics are summarised in Tab ??.

library(kableExtra)
cleaned <- read.csv("data/cleaned_opls_diagnostics.csv", header = T)
pcgrp <- read.csv("data/cleaned_PC_opls_diagnostics.csv", header = T)
ion <- read.csv("data/cleaned_MH_opls_diagnostics.csv", header = T)

opls_diagnostics <- rbind(cleaned, pcgrp, ion)
opls_diagnostics["filtering"] <- c(rep("cleaned", 12), rep("pc group", 12), rep("ion", 12))

options(kableExtra.html.bsTable = T)
kable(opls_diagnostics, col.names = c("R2X cum", "R2Y cum", "Q2 cum", "RMSEE", "pre", "ort", "pR2Y", "pQ2", "Species", "Experiment", "data set"), longtable = T, 
    booktabs = T, caption = "Model diagnostics of the metabolome variations modelled by depth", row.names = FALSE) %>% add_header_above(c(Diagnostics = 8, `Data set attribution` = 3)) %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "bordered", "condensed", "responsive"), font_size = 12, full_width = F, latex_options = c("striped", 
        "scale_down"))

R2X cum	R2Y cum	Q2 cum	RMSEE	pre	ort	pR2Y	pQ2	Species	Experiment	data set
0.425	0.998	0.8670	19.7	1	2	0.001	0.001	Geodia barretti	HILIC pos	cleaned
0.257	0.988	0.6530	42.0	1	1	0.067	0.002	Stryphnus fortis	HILIC pos	cleaned
0.496	0.999	0.5490	13.7	1	3	0.002	0.028	Weberella bursa	HILIC pos	cleaned
0.464	0.988	0.7670	47.4	1	2	0.010	0.001	Geodia barretti	HILIC neg	cleaned
0.436	0.924	0.6290	106.0	1	1	0.074	0.002	Stryphnus fortis	HILIC neg	cleaned
0.366	0.985	0.5560	42.4	1	2	0.233	0.013	Weberella bursa	HILIC neg	cleaned
0.585	0.998	0.7960	22.6	1	3	0.003	0.001	Geodia barretti	RP pos	cleaned
0.328	0.965	0.5680	71.5	1	1	0.039	0.010	Stryphnus fortis	RP pos	cleaned
0.343	0.980	0.3350	48.8	1	2	0.113	0.108	Weberella bursa	RP pos	cleaned
0.304	0.937	0.6140	99.8	1	1	0.040	0.003	Geodia barretti	RP neg	cleaned
0.390	0.924	0.5060	106.0	1	1	0.089	0.015	Stryphnus fortis	RP neg	cleaned
0.412	0.985	0.4500	41.9	1	2	0.029	0.027	Weberella bursa	RP neg	cleaned
0.429	0.997	0.8660	23.7	1	2	0.003	0.001	Geodia barretti	HILIC pos	pc group
0.362	0.998	0.7020	19.9	1	2	0.049	0.004	Stryphnus fortis	HILIC pos	pc group
0.542	0.995	0.5820	25.9	1	3	0.114	0.015	Weberella bursa	HILIC pos	pc group
0.499	0.990	0.7500	43.2	1	2	0.004	0.002	Geodia barretti	HILIC neg	pc group
0.433	0.990	0.7230	40.9	1	2	0.068	0.002	Stryphnus fortis	HILIC neg	pc group
0.390	0.986	0.6000	39.8	1	2	0.132	0.005	Weberella bursa	HILIC neg	pc group
0.602	0.996	0.8090	28.9	1	3	0.014	0.001	Geodia barretti	RP pos	pc group
0.464	0.990	0.5850	40.9	1	2	0.122	0.017	Stryphnus fortis	RP pos	pc group
0.393	0.979	0.3590	49.8	1	2	0.045	0.115	Weberella bursa	RP pos	pc group
0.373	0.979	0.6850	59.9	1	2	0.137	0.002	Geodia barretti	RP neg	pc group
0.637	0.991	0.6140	39.9	1	3	0.455	0.035	Stryphnus fortis	RP neg	pc group
0.440	0.980	0.4170	48.6	1	2	0.089	0.049	Weberella bursa	RP neg	pc group
0.761	0.995	0.6610	34.2	1	5	0.072	0.018	Geodia barretti	HILIC pos	ion
0.294	0.915	0.3540	112.0	1	1	0.226	0.109	Stryphnus fortis	HILIC pos	ion
0.694	0.988	0.6570	40.2	1	4	0.032	0.019	Weberella bursa	HILIC pos	ion
0.401	0.942	0.6580	105.0	1	2	0.013	0.002	Geodia barretti	HILIC neg	ion
0.815	0.992	0.7570	39.5	1	3	0.007	0.008	Stryphnus fortis	HILIC neg	ion
0.556	0.949	0.6530	77.3	1	2	0.005	0.004	Weberella bursa	HILIC neg	ion
0.778	0.910	0.7940	126.0	1	1	0.001	0.001	Geodia barretti	RP pos	ion
0.368	0.874	0.2560	137.0	1	1	0.026	0.159	Stryphnus fortis	RP pos	ion
0.479	0.960	0.4060	68.1	1	2	0.005	0.032	Weberella bursa	RP pos	ion
0.430	0.624	0.0345	244.0	1	1	0.268	0.198	Geodia barretti	RP neg	ion
0.821	0.392	0.0936	300.0	1	1	0.585	0.190	Stryphnus fortis	RP neg	ion
0.804	0.973	0.5850	61.8	1	4	0.065	0.027	Weberella bursa	RP neg	ion

# write.csv(opls_diagnostics, 'data/opls_diagnostics.csv', row.names = F)

opls_diagnostics$Species <- as.character(opls_diagnostics$Species)
opls_diagnostics$Species[opls_diagnostics$Species == "Geodia barretti"] <- c("G. barretti")
opls_diagnostics$Species[opls_diagnostics$Species == "Stryphnus fortis"] <- c("S. fortis")
opls_diagnostics$Species[opls_diagnostics$Species == "Weberella bursa"] <- c("W. bursa")

k <- ggplot(opls_diagnostics, aes(x = R2X.cum., y = Q2.cum.)) + geom_point(aes(col = Species, shape = Experiment)) + theme_bw() + theme(legend.position = "bottom") + 
    xlab("R2") + ylab("Q2") + guides(col = guide_legend(title = ""), shape = guide_legend(title = ""))
library(ggExtra)
ggMarginal(k, groupColour = TRUE, groupFill = TRUE)

Figure 2.7: Model performance visualised. The larger R2 and Q2, the better the model fits with our data.

2.5 VIPS

From the multivariate model (OPLS) you can derive a vaiable importance in projection (VIF) score for every feature in this model. This is a approximation (but not a statistics test!) of features that support the model. A feature with a VIP score \(\geq\) 1 is considered important and the higher the score, the greater the contribution to the model.

2.6 Metabolites and correlations

A manually curated data set of signal intensities was produced for analysis of identified known and novel compounds.

library(ggplot2)
library(ggrepel)
library(reshape)
library(tidyverse)
library(forcats)

cmp <- read.csv("data/metabolite_master_20190605.csv", header = T, sep = ",")
meta_data <- read.csv("data/Steffen_et_al_metadata_PANGAEA.csv", header = T, sep = ";")
md <- meta_data[c("Depth", "unified_ID")]

# FUN prep: remove, QC samples format depth and IDs
prep <- function(metabolites, md) {
    metabolites$X <- NULL
    metabolites["spec"] <- str_sub(metabolites$unified_ID, 1, 2)
    metabolites <- metabolites[!metabolites$spec == "QC", ]
    metabolites <- left_join(metabolites, md)
    metabolites <- metabolites[-c(9), ]  #removes Gb16 (Geodia atlantica)
    metabolites$Depth <- as.numeric(metabolites$Depth)
    metabolites <- metabolites[with(metabolites, order(metabolites$spec, metabolites$Depth)), ]
    metabolites[] <- lapply(metabolites, function(x) if (is.factor(x)) 
        factor(x) else x)
    metabolites$unified_ID <- factor(metabolites$unified_ID, levels = metabolites$unified_ID)
    return(metabolites)
}

metabolites <- prep(cmp, md)

library(reshape2)
mets <- metabolites[, c("asb", "his", "cre", "pch", "crn", "acl", "ser", "mbc", "chol_s", "bet", "choline", "ura", "unified_ID", "spec", "Depth")]
colnames(mets) <- c("arseno- betaine", "histamine", "creatine", "phospho- choline", "carnitine", "acetyl- choline", "serotonin", "2-methyl- butyroyl- carnitine", 
    "choline sulfate", "betaine", "choline", "uranidine", "unified_ID", "spec", "Depth")

mets <- melt(mets, id.vars = c("spec", "Depth", "unified_ID"))
mets["Species"] <- NA
mets$Species[mets$spec == "Gb"] <- c("Geodia barretti")
mets$Species[mets$spec == "Sf"] <- c("Stryphnus fortis")
mets$Species[mets$spec == "Wb"] <- c("Weberella bursa")

# labeller needs formula interface
all <- ggplot(mets, aes(x = Depth, y = value)) + geom_point() + facet_grid(variable ~ Species, labeller = labeller(variable = label_wrap_gen(10, multi_line = TRUE)), 
    scales = "free")
all <- all + ggtitle("VIPs") + ylab("signal") + xlab("Depth") + theme_bw() + theme(strip.text.x = element_text(face = "italic"), panel.background = element_rect(fill = "white", 
    colour = "black"), panel.border = element_rect(colour = "black", fill = NA), axis.text.x = element_text(angle = 90, hjust = 1)) + scale_y_continuous(labels = scales::scientific) + 
    scale_x_continuous(breaks = scales::pretty_breaks(n = 12))
all

cmp <- read.csv("data/metabolite_master_20190605.csv", header = T, sep = ",")
meta_data <- read.csv("data/Steffen_et_al_metadata_PANGAEA.csv", header = T, sep = ";")
md <- meta_data[c("Depth", "unified_ID")]
metabolites <- prep(cmp, md)
mets <- metabolites[, c("asb", "his", "cre", "pch", "crn", "acl", "ser", "mbc", "chol_s", "bet", "choline", "ura", "spec", "Depth")]

mets_gb <- mets[mets$spec == "Gb", ]
mets_sf <- mets[mets$spec == "Sf", ]
mets_wb <- mets[mets$spec == "Wb", ]
mets_gb$spec <- NULL
mets_sf$spec <- NULL
mets_wb$spec <- NULL

mets_gb$ser <- NULL  # too many NAs

metabolite_stats <- function(mets) {
    # Prepare df for collecting results
    results <- data.frame(colnames(mets)[-length(colnames(mets))])
    colnames(results) <- c("metabolite")
    results[, c("cor_rho", "cor_p.value", "cor_fdr", "t.test_stat", "t.test_param", "t.test_p.val", "t.test_fdr")] <- NA
    
    n <- 0
    k <- dim(mets)[2] - 1
    while (n < k) {
        n <- n + 1
        corr <- cor.test(mets[, n], mets$Depth, method = "spearman")
        results$cor_rho[n] <- corr$estimate
        results$cor_p.value[n] <- corr$p.value
        tt <- t.test(mets[, n][mets$Depth < 1000], mets[, n][mets$Depth > 1000], var.equal = T)
        results$t.test_stat[n] <- tt$statistic
        results$t.test_param[n] <- tt$parameter
        results$t.test_p.val[n] <- tt$p.value
    }
    cor_fdr <- c(p.adjust(results$cor_p.value, method = "fdr"))
    # colnames(cor_fdr) <- c('cor_fdr')
    results$cor_fdr <- cor_fdr
    t.test_fdr <- c(p.adjust(results$t.test_p.val, method = "fdr"))
    # colnames(t.test_fdr) <- c('t.test_fdr')
    results$t.test_fdr <- t.test_fdr
    
    return(results)
}

gb_met_stats <- metabolite_stats(mets_gb)
sf_met_stats <- metabolite_stats(mets_sf)
wb_met_stats <- metabolite_stats(mets_wb)

gb_met_stats["Species"] <- c("Gb")
sf_met_stats["Species"] <- c("Sf")
wb_met_stats["Species"] <- c("Wb")

gb_met_stats$metabolite <- c("arsenobetaine", "histamine", "creatine", "phosphocholine", "carnitine", "acetylcholine", "2-methyl-butyroyl-carnitine", "choline sulfate", 
    "betaine", "choline", "uranidine")
sf_met_stats$metabolite <- c("arsenobetaine", "histamine", "creatine", "phosphocholine", "carnitine", "acetylcholine", "serotonin", "2-methyl-butyroyl-carnitine", 
    "choline sulfate", "betaine", "choline", "uranidine")
wb_met_stats$metabolite <- c("arsenobetaine", "histamine", "creatine", "phosphocholine", "carnitine", "acetylcholine", "serotonin", "2-methyl-butyroyl-carnitine", 
    "choline sulfate", "betaine", "choline", "uranidine")


results <- rbind(gb_met_stats, sf_met_stats, wb_met_stats)
kable(results, digits = 4, longtable = T, booktabs = T, caption = "Depth response for VIP metabolites as approximated by different methods.") %>% add_header_above(c("", 
    `depth correlation` = 3, `depth groups` = 4, "")) %>% kable_styling(bootstrap_options = c("striped", "hover", "bordered", "condensed", "responsive"), font_size = 12, 
    full_width = F, latex_options = c("striped", "scale_down"))

metabolite	cor_rho	cor_p.value	cor_fdr	t.test_stat	t.test_param	t.test_p.val	t.test_fdr	Species
arsenobetaine	-0.7214	0.0033	0.0123	3.0590	13	0.0091	0.0189	Gb
histamine	-0.1821	0.5151	0.5666	1.7978	13	0.0955	0.1313	Gb
creatine	-0.5879	0.0381	0.0838	2.3348	11	0.0395	0.0621	Gb
phosphocholine	0.6429	0.0117	0.0321	-4.6000	13	0.0005	0.0018	Gb
carnitine	-0.7214	0.0033	0.0123	7.0103	13	0.0000	0.0001	Gb
acetylcholine	-0.0750	0.7926	0.7926	0.4203	13	0.6811	0.7492	Gb
2-methyl-butyroyl-carnitine	-0.2357	0.3966	0.5308	1.2505	13	0.2332	0.2850	Gb
choline sulfate	-0.5036	0.0582	0.1067	4.8391	13	0.0003	0.0018	Gb
betaine	0.4786	0.0735	0.1154	-2.9955	13	0.0103	0.0189	Gb
choline	0.2179	0.4343	0.5308	0.1295	13	0.8990	0.8990	Gb
uranidine	-0.8321	0.0002	0.0017	4.2900	13	0.0009	0.0024	Gb
arsenobetaine	-0.1923	0.5292	0.6350	0.3574	11	0.7276	0.8273	Sf
histamine	-0.4945	0.0889	0.1524	2.2062	11	0.0495	0.0991	Sf
creatine	-0.7967	0.0018	0.0111	5.3100	11	0.0002	0.0028	Sf
phosphocholine	0.7637	0.0036	0.0144	-4.9112	11	0.0005	0.0028	Sf
carnitine	0.5275	0.0673	0.1347	-2.4433	11	0.0326	0.0783	Sf
acetylcholine	-0.0165	0.9639	0.9639	0.2234	11	0.8273	0.8273	Sf
serotonin	0.5824	0.0403	0.0967	-1.8692	11	0.0884	0.1516	Sf
2-methyl-butyroyl-carnitine	0.8681	0.0001	0.0015	-3.5643	11	0.0044	0.0178	Sf
choline sulfate	-0.3147	0.3195	0.4260	1.6828	10	0.1233	0.1850	Sf
betaine	-0.3516	0.2392	0.3588	-0.2609	11	0.7990	0.8273	Sf
choline	0.7198	0.0075	0.0225	-2.5136	11	0.0288	0.0783	Sf
uranidine	-0.1264	0.6826	0.7447	0.7272	11	0.4823	0.6431	Sf
arsenobetaine	0.6059	0.0148	0.1772	-2.1540	14	0.0492	0.2949	Wb
histamine	-0.1765	0.5122	0.6028	0.4515	14	0.6585	0.6585	Wb
creatine	0.4000	0.1259	0.3776	-1.3384	14	0.2021	0.4850	Wb
phosphocholine	0.2363	0.4367	0.6028	-0.7905	11	0.4460	0.6547	Wb
carnitine	0.5294	0.0373	0.2237	-2.3928	14	0.0313	0.2949	Wb
acetylcholine	-0.1882	0.4839	0.6028	0.6617	14	0.5189	0.6547	Wb
serotonin	-0.1736	0.5526	0.6028	0.6219	12	0.5456	0.6547	Wb
2-methyl-butyroyl-carnitine	0.2029	0.4496	0.6028	-1.7389	14	0.1040	0.3140	Wb
choline sulfate	-0.0143	0.9642	0.9642	-0.4725	13	0.6444	0.6585	Wb
betaine	0.3382	0.2000	0.4801	-0.6415	14	0.5316	0.6547	Wb
choline	0.2588	0.3319	0.6028	-1.7352	14	0.1047	0.3140	Wb
uranidine	-0.4294	0.0985	0.3776	1.1807	14	0.2574	0.5148	Wb

cmp <- read.csv("data/metabolite_master_20190605.csv", header = T, sep = ",")

meta_data <- read.csv("data/Steffen_et_al_metadata_PANGAEA.csv", header = T, sep = ";")
md <- meta_data[c("Depth", "unified_ID", "Species")]

# 6-bromo-8-hydroxy-conicamin = cp4 cyclo Pro-Arg = solterin or soelterin

# Prepare df for Gb compounds
metabolites <- prep(cmp, md)
mets <- metabolites[, c("bar", "dhb", "dhhb", "GeoA", "GeoAi", "GeoB", "sol", "brc", "cp4", "GeoC", "L6bhp", "barrettide_A", "barrettide_B", "unified_ID", "Depth", 
    "Species")]
mets <- mets[mets$Species == "Geodia barretti", ]
colnames(mets) <- c("barettin", "8,9-dihydrobarettin", "8,9−dihydro−8−hydroxy−barettin", "Geobarrettin A", "Geobarrettin A isomer", "Geobarrettin B", 
    "cyclo Pro-Arg", "6-bromo-conicamin", "6-bromo-8-hydroxy-conicamin", "Geobarrettin C", "L-6-bromo- \n hypaphorine", "Barrettide A", "Barrettide B", "unified_ID", 
    "Depth", "Species")
mets[, c("unified_ID", "Species")] <- list(NULL)

mets <- reshape2::melt(mets, id.vars = c("Depth"))
ggplot(mets, aes(x = Depth, y = value)) + geom_point() + facet_wrap(. ~ variable, scales = "free") + theme_bw()

mets <- metabolites[, c("sfn", "ian", "unified_ID", "Depth", "Species")]
mets <- mets[mets$Species == "Stryphnus fortis", ]
mets[, c("unified_ID", "Species")] <- list(NULL)
colnames(mets) <- c("Stryphnusin", "Ianthellin", "Depth")

mets <- reshape2::melt(mets, id.vars = c("Depth"))
ggplot(mets, aes(x = Depth, y = value)) + geom_point() + facet_wrap(. ~ variable, scales = "free") + theme_bw()

cmp <- read.csv("data/metabolite_master_20190605.csv", header = T, sep = ",")

meta_data <- read.csv("data/Steffen_et_al_metadata_PANGAEA.csv", header = T, sep = ";")
md <- meta_data[c("Depth", "unified_ID", "Species")]

# 6-bromo-8-hydroxy-conicamin = cp4 cyclo Pro-Arg = solterin or soelterin

# Prepare df for Gb compounds
metabolites <- prep(cmp, md)
mets <- metabolites[, c("bar", "dhb", "dhhb", "GeoA", "GeoAi", "GeoB", "sol", "brc", "cp4", "GeoC", "L6bhp", "barrettide_A", "barrettide_B", "unified_ID", "Depth", 
    "Species")]
mets <- mets[mets$Species == "Geodia barretti", ]
mets[, c("unified_ID", "Species")] <- list(NULL)


metabolite_stats <- function(mets) {
    # Prepare df for collecting results
    results <- data.frame(colnames(mets)[-length(colnames(mets))])
    colnames(results) <- c("metabolite")
    results[, c("cor_rho", "cor_p.value", "cor_fdr", "t.test_stat", "t.test_param", "t.test_p.val", "t.test_fdr")] <- NA
    
    n <- 0
    k <- dim(mets)[2] - 1
    while (n < k) {
        n <- n + 1
        corr <- cor.test(mets[, n], mets$Depth, method = "spearman")
        results$cor_rho[n] <- corr$estimate
        results$cor_p.value[n] <- corr$p.value
        tt <- t.test(mets[, n][mets$Depth < 1000], mets[, n][mets$Depth > 1000], var.equal = T)
        results$t.test_stat[n] <- tt$statistic
        results$t.test_param[n] <- tt$parameter
        results$t.test_p.val[n] <- tt$p.value
    }
    cor_fdr <- c(p.adjust(results$cor_p.value, method = "fdr"))
    # colnames(cor_fdr) <- c('cor_fdr')
    results$cor_fdr <- cor_fdr
    t.test_fdr <- c(p.adjust(results$t.test_p.val, method = "fdr"))
    # colnames(t.test_fdr) <- c('t.test_fdr')
    results$t.test_fdr <- t.test_fdr
    
    return(results)
}

mets_stats <- metabolite_stats(mets)

# better names
Metabolites <- c("barettin", "8,9-dihydrobarettin", "8,9−dihydro−8−hydroxy−barettin", "Geobarrettin A", "Geobarrettin A isomer", "Geobarrettin B", "cyclo Pro-Arg", 
    "6-bromo-conicamin", "6-bromo-8-hydroxy-conicamin", "Geobarrettin C", "L-6-bromohypaphorine", "Barrettide A", "Barrettide B")
mets_stats$metabolite <- Metabolites

kable(mets_stats, digits = 4, longtable = T, booktabs = T, caption = "Depth response for _G. barretti_ metabolites as approximated by different methods.") %>% 
    add_header_above(c("", `depth correlation` = 3, `depth groups` = 4)) %>% kable_styling(bootstrap_options = c("striped", "hover", "bordered", "condensed", 
    "responsive"), font_size = 12, full_width = F, latex_options = c("striped", "scale_down"))

metabolite	cor_rho	cor_p.value	cor_fdr	t.test_stat	t.test_param	t.test_p.val	t.test_fdr
barettin	-0.7714	0.0012	0.0106	17.2398	13	0.0000	0.0000
8,9-dihydrobarettin	-0.5607	0.0323	0.0699	2.2800	13	0.0401	0.0850
8,9−dihydro−8−hydroxy−barettin	-0.7571	0.0016	0.0106	3.8027	13	0.0022	0.0057
Geobarrettin A	-0.6542	0.0082	0.0212	4.4642	13	0.0006	0.0021
Geobarrettin A isomer	-0.6792	0.0054	0.0189	4.5859	13	0.0005	0.0021
Geobarrettin B	-0.4571	0.0888	0.1649	1.9183	13	0.0773	0.1256
cyclo Pro-Arg	-0.6893	0.0058	0.0189	19.1189	13	0.0000	0.0000
6-bromo-conicamin	-0.2143	0.4420	0.6385	2.2089	13	0.0457	0.0850
6-bromo-8-hydroxy-conicamin	0.2857	0.3011	0.4893	-0.0877	13	0.9314	0.9740
Geobarrettin C	-0.0179	0.9540	0.9540	0.8091	13	0.4330	0.5629
L-6-bromohypaphorine	-0.0857	0.7630	0.9017	1.0363	13	0.3190	0.4607
Barrettide A	-0.1571	0.5756	0.7482	0.5131	13	0.6165	0.7286
Barrettide B	0.0788	0.8380	0.9078	-0.0336	8	0.9740	0.9740

cmp <- read.csv("data/metabolite_master_20190605.csv", header = T, sep = ",")

meta_data <- read.csv("data/Steffen_et_al_metadata_PANGAEA.csv", header = T, sep = ";")
md <- meta_data[c("Depth", "unified_ID", "Species")]

# Prepare df for Gb compounds
metabolites <- prep(cmp, md)
mets <- metabolites[, c("sfn", "ian", "unified_ID", "Depth", "Species")]
mets <- mets[mets$Species == "Stryphnus fortis", ]
mets[, c("unified_ID", "Species")] <- list(NULL)

mets_stats <- metabolite_stats(mets)

Metabolites <- c("Stryphnusin", "Ianthellin")
mets_stats$metabolite <- Metabolites

kable(mets_stats, longtable = T, booktabs = T, caption = "Depth response fro _S. fortis_ metabolites as approximated by different methods.") %>% add_header_above(c("", 
    `depth correlation` = 3, `depth groups` = 4)) %>% kable_styling(bootstrap_options = c("striped", "hover", "bed", "condensed", "responsive"), font_size = 12, 
    full_width = F, latex_options = c("striped", "scale_down"))

metabolite	cor_rho	cor_p.value	cor_fdr	t.test_stat	t.test_param	t.test_p.val	t.test_fdr
Stryphnusin	0.9393939	0	0	-1.715332	8	0.1246247	0.1246247
Ianthellin	0.9393939	0	0	-1.931431	8	0.0895227	0.1246247

sessionInfo()

## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS  10.15.5
## 
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggExtra_0.9             plotly_4.9.2            ropls_1.14.1            reshape2_1.4.3          usdm_1.1-18             raster_3.0-12          
##  [7] sp_1.4-1                rnaturalearthdata_0.1.0 rnaturalearth_0.1.0     sf_0.8-1                marmap_1.0.3            mapdata_2.3.0          
## [13] maps_3.3.0              ggmap_3.0.0.901         kableExtra_1.1.0.9000   gridExtra_2.3           reshape_0.8.8           ggrepel_0.8.2          
## [19] igraph_1.2.5            metap_1.3               vegan_2.5-6             lattice_0.20-41         permute_0.9-5           forcats_0.5.0          
## [25] stringr_1.4.0           dplyr_0.8.5             purrr_0.3.3             readr_1.3.1             tidyr_1.0.2             tibble_3.0.0           
## [31] ggplot2_3.3.0           tidyverse_1.3.0         knitr_1.28             
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1        backports_1.1.5     sn_1.6-1            plyr_1.8.6          lazyeval_0.2.2      splines_3.5.1       crosstalk_1.1.0.1  
##   [8] TH.data_1.0-10      digest_0.6.25       htmltools_0.4.0     fansi_0.4.1         magrittr_1.5        memoise_1.1.0       cluster_2.1.0      
##  [15] modelr_0.1.6        sandwich_2.5-1      jpeg_0.1-8.1        colorspace_1.4-1    blob_1.2.1          rvest_0.3.5         haven_2.2.0        
##  [22] xfun_0.13.1         crayon_1.3.4        jsonlite_1.6.1      survival_3.1-11     zoo_1.8-7           glue_1.3.2          gtable_0.3.0       
##  [29] webshot_0.5.2       shape_1.4.4         BiocGenerics_0.28.0 scales_1.1.0        mvtnorm_1.1-0       DBI_1.1.0           bibtex_0.4.2.2     
##  [36] miniUI_0.1.1.1      Rcpp_1.0.4          isoband_0.2.0       plotrix_3.7-7       xtable_1.8-4        viridisLite_0.3.0   units_0.6-6        
##  [43] bit_1.1-15.2        stats4_3.5.1        htmlwidgets_1.5.1   httr_1.4.1          TFisher_0.2.0       ellipsis_0.3.0      pkgconfig_2.0.3    
##  [50] farver_2.0.3        dbplyr_1.4.2        later_1.0.0         tidyselect_1.0.0    labeling_0.3        rlang_0.4.5         munsell_0.5.0      
##  [57] cellranger_1.1.0    tools_3.5.1         cli_2.0.2           generics_0.0.2      RSQLite_2.2.0       broom_0.5.5         fastmap_1.0.1      
##  [64] evaluate_0.14       yaml_2.2.1          bit64_0.9-7         fs_1.4.0            RgoogleMaps_1.4.5.3 ncdf4_1.17          nlme_3.1-145       
##  [71] mime_0.9            formatR_1.7         xml2_1.3.0          compiler_3.5.1      rstudioapi_0.11     png_0.1-7           e1071_1.7-3        
##  [78] reprex_0.3.0        stringi_1.4.6       highr_0.8           rgeos_0.5-2         Matrix_1.2-18       classInt_0.4-2      multtest_2.38.0    
##  [85] vctrs_0.2.4         mutoss_0.1-12       pillar_1.4.3        lifecycle_0.2.0     Rdpack_0.11-1       data.table_1.12.8   bitops_1.0-6       
##  [92] gbRd_0.4-11         httpuv_1.5.2        R6_2.4.1            promises_1.1.0      bookdown_0.18       KernSmooth_2.23-16  codetools_0.2-16   
##  [99] MASS_7.3-51.5       assertthat_0.2.1    rjson_0.2.20        withr_2.1.2         mnormt_1.5-6        multcomp_1.4-12     adehabitatMA_0.3.14
## [106] mgcv_1.8-31         parallel_3.5.1      hms_0.5.3           class_7.3-16        rmarkdown_2.1       shiny_1.4.0.2       numDeriv_2016.8-1.1
## [113] Biobase_2.42.0      lubridate_1.7.4