# Introduction to Using ECHO

#### 2020-06-10

The echo.find package provides a function (echo_find()) designed to find rhythms from expression data using extended harmonic oscillators. To read more about our inital work on this project and cite us, see ECHO: an application for detection and analysis of oscillators identifies metabolic regulation on genome-wide circadian output by H. De los Santos, et al. (2020) Further, for users who prefer an interface more than coding, as well as built-in visualizations, our GitHub repository can be found here. There, you can find a shiny application for finding rhythms and automatically visualizing results, with features such as Venn diagrams, heat maps, gene expression plots (with or without replicates visualized), and parameter density graphs. A FAQ for possible user errors can also be found there.

Also, it should be noted that starting points have been improved since echo.find v4.0. For more information on these new starting points/algorithm improvements and comparisons with the Bioinformatics version, please see ECHO_V4_Update.pdf on the GitHub.

In this vignette, we’ll walk through an example of how to use echo.find, and how to choose from the several different built-in methods of preprocessing.

We’ll start by loading our library, which contains the echo_find() function. It also has an example dataframe, expressions, which we’ll be using throughout this vignette. Here we’ll look at the first few rows and columns of our dataset.

library(echo.find)
head(expressions[,1:5])
##   Gene.Name      CT2.1      CT2.2      CT2.3      CT4.1
## 1  Sample 1  1.6331179  1.4976053  1.5138102  1.3095535
## 2  Sample 2 -0.6303192 -0.6027464 -0.5105009 -0.5062033
## 3  Sample 3         NA -0.7802214 -0.7767950  0.2847617
## 4  Sample 4  0.4659923  0.4940659         NA  0.1018655
## 5  Sample 5  0.7026372  0.6405812  1.0235155  1.7199453
## 6  Sample 6  0.9261508  0.8858768  0.8035570         NA

Note the data format: its first column first column has gene labels/names, and all other columns have numerical expression data. This expression data is ordered by time point then by replicate, and has evenly spaced time points. Any missing data has cells left blank. In order to use the echo_find() function, data must be in this format. Now, let’s look at one the data expressions, Sample 2. Here we plot each of the replicates in a different color, then plot the difference between them in gray.

library(ggplot2)

tp <- seq(2,48,by=2) # our time points
num_reps <- 3 # number of replicates

samp <- 2 # sample we want to look at
ex.df <- expressions[samp,-1] # expression data for the first sample

# our visualization data frame
ribbon.df <- data.frame(matrix(ncol = 3+num_reps, nrow = length(tp)))
# assigning column names
colnames(ribbon.df) <- c("Times","Min","Max",
paste(rep("Rep",num_reps),c(1:num_reps), sep="."))
ribbon.df$Times <- tp # getting min values of replicates ribbon.df$Min <- sapply(seq(1,ncol(ex.df), by = num_reps),
function(x) min(unlist(ex.df[,c(x:(num_reps-1+x))]), na.rm = TRUE))
# getting max values of replicates
ribbon.df$Max <- sapply(seq(1,ncol(ex.df), by = num_reps), function(x) max(unlist(ex.df[,c(x:(num_reps-1+x))]), na.rm = TRUE)) # assign each of the replicates to the visualization data frame for (i in 1:num_reps){ ribbon.df[,3+i] <- t(ex.df[,seq(i,ncol(ex.df),by=num_reps)]) } # color names color_bar <- c("Rep.1"="red","Rep.2"="blue","Rep.3"="green") # visualize, with shading for each row p <- ggplot(data = ribbon.df,aes(x=Times))+ # declare the dataframe and main variables geom_ribbon(aes(x=Times, ymax=Max, ymin=Min, colour="Original"), fill = "gray", alpha = 0.5)+ # create shading ggtitle(expressions[samp,1])+ # gene name is title scale_color_manual("",values=color_bar)+ scale_fill_manual("",values=color_bar)+ theme(plot.title = element_text(hjust = .5), legend.position = "bottom",legend.direction = "horizontal")+ labs(x="Hours", y="Expression") #Label for axes # add specific replicate lines for (i in 1:num_reps){ p <- p + geom_line(data = ribbon.df, aes_string(x="Times",y=paste("Rep",i,sep = ".")), colour=color_bar[i]) } suppressWarnings(p) # to ignore warnings for missing values It very clearly has an oscillitory pattern with a small amount of damping, making echo_find() the perfect function for our dataset. ## Running echo_find() So we begin by assigning our parameters and running the echo_find() function. In this first run, we look for rhythms between 20 and 26 hours, with no preprocessing, assigning these results to a new dataframe. # echo_find() parameters genes <- expressions begin <- 2 # first time point end <- 48 # last time point resol <- 2 # time point resolution num_reps <- 3 # number of replicates low <- 20 # low period seeking high <- 26 # high period seeking run_all_per <- FALSE # we are not looking for all periods paired <- FALSE # these replicates are unrelated, that is, a replicate being # called "replicate 1" at time point 2 means nothing rem_unexpr <- FALSE # do not remove unexpressed genes # we do not assign rem_unexpr_amt, since we're not removing unexpressed genes is_normal <- FALSE # do not normalize is_de_linear_trend <- FALSE # do not remove linear trends is_smooth <- FALSE # do not smooth the data results <- echo_find(genes = genes, begin = begin, end = end, resol = resol, num_reps = num_reps, low = low, high = high, run_all_per = run_all_per, paired = paired, rem_unexpr = rem_unexpr, is_normal = is_normal, is_de_linear_trend = is_de_linear_trend, is_smooth = is_smooth) head(results[,1:16]) ## Gene Name Convergence Iterations Amplitude.Change.Coefficient ## 1 Sample 1 0 0 -0.05319493 ## 2 Sample 2 0 0 0.01320716 ## 3 Sample 3 0 0 -0.07006987 ## 4 Sample 4 0 0 -0.10549713 ## 5 Sample 5 0 0 -0.04007378 ## 6 Sample 6 0 0 0.11028233 ## Oscillation Type Initial.Amplitude Radian.Frequency Period Phase Shift ## 1 Forced 1.2113578 0.3141593 20.00000 -0.4740208 ## 2 Harmonic 0.7143799 0.2749550 22.85169 2.4479292 ## 3 Forced 1.6853526 0.2811149 22.35095 3.7599389 ## 4 Forced 0.7119285 0.2416610 26.00000 6.7165400 ## 5 Forced 1.4596162 0.3141593 20.00000 -1.6465263 ## 6 Damped 0.8471170 0.2416610 26.00000 10.8530577 ## Hours Shifted Equilibrium Value Slope Tau P-Value ## 1 1.508855 0.32913626 NA 0.8894112 4.750294e-25 ## 2 13.948668 0.12870158 NA 0.9011140 1.138728e-25 ## 3 8.975854 -0.09077893 NA 0.9781573 5.940921e-30 ## 4 24.206766 -0.08338100 NA 0.7870119 5.818796e-20 ## 5 5.241056 0.44750943 NA 0.8989084 6.154595e-26 ## 6 7.089738 0.43873442 NA 0.7967642 2.025281e-20 ## BH Adj P-Value BY Adj P-Value ## 1 1.425088e-24 4.422349e-24 ## 2 4.554914e-25 1.413486e-24 ## 3 7.129105e-29 2.212311e-28 ## 4 8.728194e-20 2.708542e-19 ## 5 3.692757e-25 1.145940e-24 ## 6 3.471911e-20 1.077407e-19 Now we can see that the results data frame has information about the parameters, including forcing coefficient values (whether the oscillation is damped, driven, harmonic, etc.) and p-values. Let’s look at how the fit and parameters turned out for our initial sample. Here we add the fitted values to our plot in black and print the parameters to the console. # assign the fit to the visualization data frame ribbon.df$Fit <- t(results[samp,(17+(length(tp)*num_reps)):ncol(results)])

# visualize, with shading for each row
p <- p +
geom_line(data = ribbon.df,
aes_string(x="Times",y="Fit"),
colour="black")

suppressWarnings(p) # to ignore warnings for missing values

# print sample's parameters
cat(paste0("Gene Name: ",results$Gene Name[samp],"\n", "Convergence:", results$Convergence[samp],"\n",
"Iterations:",results$Iterations[samp],"\n", "Forcing Coefficient:", results$Forcing.Coefficient[samp],"\n",
"Oscillation Type:",results$Oscillation Type[samp],"\n", "Amplitude", results$Amplitude[samp],"\n",
"Radian.Frequency:",results$Radian.Frequency[samp],"\n", "Period:",results$Period[samp],"\n",
"Phase Shift:",results$Phase Shift[samp],"\n", "Hours Shifted:",results$Hours Shifted[samp],"\n",
"P-Value:",results$P-Value[samp],"\n", "BH Adj P-Value:",results$BH Adj P-Value[samp],"\n",
"BY Adj P-Value:",results$BY Adj P-Value[samp],"\n")) ## Gene Name: Sample 2 ## Convergence:0 ## Iterations:0 ## Forcing Coefficient: ## Oscillation Type:Harmonic ## Amplitude0.0132071618056893 ## Radian.Frequency:0.274955003239582 ## Period:22.8516856691083 ## Phase Shift:2.44792917886537 ## Hours Shifted:13.9486682661758 ## P-Value:1.13872848886594e-25 ## BH Adj P-Value:4.55491395546377e-25 ## BY Adj P-Value:1.4134857624926e-24 This fit matches pretty closely to the trend, which is emphasized by the very low adjusted p-value. As we predicted, the oscillation is also damped, which is shown by the positive forcing coefficient and the designation of the oscillation type. Now let’s see how preprocessing affects the results. Here we search for all possible periods, using the default values for low and high, as well as allowing for all our preprocessing options: removing unexpressed genes, normalizing, removing linear trends, and smoothing. run_all_per <- TRUE # looking for all possible periods rem_unexpr <- TRUE # remove unexpressed genes rem_unexpr_amt <- 70 # percentage of unexpressed genes is_normal <- TRUE # normalize is_de_linear_trend <- TRUE # remove linear trends is_smooth <- TRUE # smooth the data # we're using the default values of low and high, since we're looking for all periods results <- echo_find(genes = genes, begin = begin, end = end, resol = resol, num_reps = num_reps, run_all_per = run_all_per, paired = paired, rem_unexpr = rem_unexpr, rem_unexpr_amt = rem_unexpr_amt, is_normal = is_normal, is_de_linear_trend = is_de_linear_trend, is_smooth = is_smooth) head(results[,1:16]) ## Gene Name Convergence Iterations Amplitude.Change.Coefficient ## 1 Sample 1 0 0 -0.05153645 ## 2 Sample 2 0 0 0.01158224 ## 3 Sample 3 0 0 -0.07418176 ## 4 Sample 4 0 0 -0.07811707 ## 5 Sample 5 0 0 -0.02671833 ## 6 Sample 6 0 0 0.03607892 ## Oscillation Type Initial.Amplitude Radian.Frequency Period Phase Shift ## 1 Forced 0.6164627 0.3243511 19.37155 -0.6585024 ## 2 Harmonic 1.5772249 0.2731878 22.99951 2.5290535 ## 3 Forced 0.4876710 0.2788346 22.53374 3.5663709 ## 4 Forced 0.4262501 0.2191192 28.67474 7.1947247 ## 5 Harmonic 0.9433863 0.3249428 19.33628 4.2981209 ## 6 Damped 1.6033815 0.1655005 37.96476 5.9626842 ## Hours Shifted Equilibrium Value Slope Tau P-Value ## 1 2.030215 0.002159044 0.003476106 0.9381728 1.095403e-27 ## 2 13.741944 0.095058426 0.005083800 0.9371975 1.240717e-27 ## 3 9.743463 -0.217673167 -0.024666260 0.7694577 3.766646e-19 ## 4 24.514726 0.110275406 -0.007324020 0.9157424 1.861411e-26 ## 5 6.108965 -0.267900933 0.010171317 0.8799940 6.327600e-25 ## 6 1.936557 -0.040058347 -0.035910190 0.7245971 3.695378e-17 ## BH Adj P-Value BY Adj P-Value ## 1 4.962868e-27 1.540083e-26 ## 2 4.962868e-27 1.540083e-26 ## 3 5.022195e-19 1.558493e-18 ## 4 5.045212e-26 1.565635e-25 ## 5 1.084731e-24 3.366150e-24 ## 6 4.031322e-17 1.251004e-16 Since we’ve now searched for all possible periods, periods can now fall outside our predetermined range of 20 to 26 that we set in our first run. Let’s see how this affected the fit and parameters of the sample we looked at. rep_genes <- results[samp,17:(16+(length(tp)*num_reps))] # getting min values of replicates ribbon.df$Min <- sapply(seq(1,ncol(rep_genes), by = num_reps),
function(x) min(unlist(rep_genes[,c(x:(num_reps-1+x))]),
na.rm = TRUE))
# getting max values of replicates
ribbon.df$Max <- sapply(seq(1,ncol(rep_genes), by = num_reps), function(x) max(unlist(rep_genes[,c(x:(num_reps-1+x))]), na.rm = TRUE)) for (i in 1:num_reps){ # assign each of the replicates ribbon.df[,3+i] <- t(rep_genes[,seq(i,ncol(rep_genes),by=num_reps)]) } # assign the fit to the visualization data frame ribbon.df$Fit <- t(results[samp,(17+(length(tp)*num_reps)):ncol(results)])

# visualize, with shading for each row
p <- ggplot(data = ribbon.df,aes(x=Times))+ # declare the dataframe and main variables
geom_ribbon(aes(x=Times, ymax=Max, ymin=Min, colour="Original"),
fill = "gray", alpha = 0.5)+ # create shading
ggtitle(expressions[samp,1])+ # gene name is title
scale_color_manual("",values=color_bar)+
scale_fill_manual("",values=color_bar)+
theme(plot.title = element_text(hjust = .5),
legend.position = "bottom",legend.direction = "horizontal")+
labs(x="Hours", y="Expression") #Label for axes
for (i in 1:num_reps){
p <- p +
geom_line(data = ribbon.df,
aes_string(x="Times",y=paste("Rep",i,sep = ".")),
colour=color_bar[i])
}

p <- p +
geom_line(data = ribbon.df,
aes_string(x="Times",y="Fit"),
colour="black")

suppressWarnings(p) # to ignore warnings for missing values

# print sample's parameters
cat(paste0("Gene Name: ",results$Gene Name[samp],"\n", "Convergence:", results$Convergence[samp],"\n",
"Iterations:",results$Iterations[samp],"\n", "Forcing Coefficient:", results$Forcing.Coefficient[samp],"\n",
"Oscillation Type:",results$Oscillation Type[samp],"\n", "Amplitude", results$Amplitude[samp],"\n",
"Radian.Frequency:",results$Radian.Frequency[samp],"\n", "Period:",results$Period[samp],"\n",
"Phase Shift:",results$Phase Shift[samp],"\n", "Hours Shifted:",results$Hours Shifted[samp],"\n",
"P-Value:",results$P-Value[samp],"\n", "BH Adj P-Value:",results$BH Adj P-Value[samp],"\n",
"BY Adj P-Value:",results\$BY Adj P-Value[samp],"\n"))
## Gene Name: Sample 2
## Convergence:0
## Iterations:0
## Forcing Coefficient:
## Oscillation Type:Harmonic
## Amplitude0.011582239502997
## BY Adj P-Value:1.54008261904479e-26