hpgltools was written to make working with high-throughput data analyses easier. These analyses generally fall into a few stages:
Before any of these tasks may be performed, the data must be loaded into memory. hpgltools attempts to make this easier with create_expt() and subset_expt().
The following examples will use a real data set from a recent experiment in our lab. The raw data was processed using a mix of trimmomatic, biopieces, bowtie, samtools, and htseq. The final count tables were deposited into the ‘preprocessing/count_tables/’ tree. The resulting data structure was named ‘most_v0M1,’ named because it is comprised of count tables with 0 mismatches and 1 randomly-placed multi-match.
The annotation file was mgas_5005.gff.xz resides in ‘reference/gff/’.
The count tables and meta-data were loaded through the create_expt() function and the genome annotations were loaded with gff2df().
library(hpgltools)
data_file <- system.file("cdm_expt.rda", package="hpgltools")
cdm <- new.env()
load(data_file, envir=cdm)
rm(data_file)
ls()
## [1] "cdm" "old_options" "rmd_file"
Two variables should exist now: rmd_file in case I want to knitr this file, cdm which is a list including the data required to make an expressionset. Using this information, I can create an expressionset.
expt <- create_expt(count_dataframe=cdm$cdm_counts,
metadata=cdm$cdm_metadata,
gene_info=cdm$gene_info)
## Reading the sample metadata.
## Matched 1926 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
knitr::kable(head(expt$design))
sampleid | type | stage | replicate | mutantname | media | exptdate | libdate | batch | condition | readspassed | ncrna | xncrna | remaining | genome | xgenome | otherbacterial | xother | colors | counts | intercounts | design | file | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
HPGL0418 | HPGL0418 | WT | LL | 1 | 5448-1 | CDMg | 20130430 | 20140409 | a | wt_ll_cg | 17425675 | 350440 | 2.01% | 17075235 | 16061860 | 94.07% | 356542 | 2.09% | #E12C8C | processed_data/count_tables/HPGL0418/HPGL0418_forward-trimmed-v1M1l20.count.xz | unknown | a | null |
HPGL0419 | HPGL0419 | WT | LL | 2 | 5448-2 | CDMg | 20130430 | 20140409 | b | wt_ll_cg | 18106361 | 398984 | 2.20% | 17707377 | 16619183 | 93.85% | 392803 | 2.22% | #E12C8C | processed_data/count_tables/HPGL0419/HPGL0419_forward-trimmed-v1M1l20.count.xz | unknown | b | null |
HPGL0420 | HPGL0420 | mga | LL | 1 | Mga-1 | CDMg | 20130430 | 20140409 | a | mga1_ll_cg | 20499936 | 417440 | 2.04% | 20082496 | 18062942 | 89.94% | 416260 | 2.07% | #BE5067 | processed_data/count_tables/HPGL0420/HPGL0420_forward-trimmed-v1M1l20.count.xz | unknown | a | null |
HPGL0421 | HPGL0421 | mga | LL | 2 | Mga-2 | CDMg | 20130430 | 20140409 | b | mga1_ll_cg | 17216381 | 397021 | 2.31% | 16819360 | 14795703 | 87.97% | 376295 | 2.24% | #BE5067 | processed_data/count_tables/HPGL0421/HPGL0421_forward-trimmed-v1M1l20.count.xz | unknown | b | null |
HPGL0422 | HPGL0422 | WT | LL | 1 | 5448-1 | CDMf | 20130430 | 20140415 | a | wt_ll_cf | 17112706 | 367791 | 2.15% | 16744915 | 15114930 | 90.27% | 367974 | 2.20% | #8E7E40 | processed_data/count_tables/HPGL0422/HPGL0422_forward-trimmed-v1M1l20.count.xz | unknown | a | null |
HPGL0423 | HPGL0423 | WT | LL | 2 | 5448-2 | CDMf | 20130430 | 20140415 | b | wt_ll_cf | 18782729 | 395748 | 2.11% | 18386981 | 17330239 | 94.25% | 386443 | 2.10% | #8E7E40 | processed_data/count_tables/HPGL0423/HPGL0423_forward-trimmed-v1M1l20.count.xz | unknown | b | null |
summary(expt)
## Length Class Mode
## title 1 -none- character
## notes 1 -none- character
## initial_metadata 23 data.frame list
## expressionset 1 ExpressionSet S4
## design 23 data.frame list
## conditions 8 -none- character
## batches 8 -none- character
## samplenames 8 -none- character
## colors 8 -none- character
## state 5 -none- list
## original_expressionset 1 ExpressionSet S4
## original_libsize 8 -none- numeric
## libsize 8 -none- numeric
## original_metadata 1 AnnotatedDataFrame S4
The data structure generated by create_expt() is a list containing the following slots:
The primary reason I created the expt object was that I did not fully understand how expressionSets worked. From my perspective, the documentation was rather opaque and therefore I decided to create a simpler version of the expressionset. As I learned how to manipulate S4 classes more efficiently, I gradually began to realize that they are not so bad. Nonetheless, I found it nice to be able to keep some extra state information with my expressionsets, along with a backup copy in case of mistakes, and some easier ways to extract information like conditions/batches/colors/etc. Thus my *expt() functions grew into wrappers to call the native functions for manipulating expressionsets.
With the above in mind, once we have the various metadata and count data loaded into memory, a most common next step is examine the data before molesting it:
raw_metrics <- sm(graph_metrics(expt, qq=TRUE, cis=NULL))
The function graph_metrics() performs all of the likely plots one might want. Some of which are not really appropriate for non-normalized data unless it is incredibly well behaved (after 30 years, I still want to spell behaved ‘behaived’, why is that?).
## View a raw library size plot
raw_metrics$libsize
## Or boxplot to see the data distribution
raw_metrics$boxplot
## The warning is because it automatically uses a log scale and there are some 0 count genes.
## Perhaps you prefer density plots
raw_metrics$density
## $plot
##
## $condition_summary
## condition min 1st median mean 3rd max
## 1: wt_ll_cg 0.5 121.5 1548 6331 5535 573423
## 2: mga1_ll_cg 0.5 93.0 1351 6186 5157 621451
## 3: wt_ll_cf 0.5 94.0 1262 6165 4804 763579
## 4: mga1_ll_cf 0.5 80.0 1176 5421 4489 559589
##
## $batch_summary
## batch min 1st median mean 3rd max
## 1: a 0.5 91.0 1300 6040 4879 621451
## 2: b 0.5 101.8 1351 6012 5082 763579
##
## $sample_summary
## sample min 1st median mean 3rd max
## 1: HPGL0418 0.5 112.25 1481 6265 5402 573423
## 2: HPGL0419 0.5 138.25 1595 6397 5656 539699
## 3: HPGL0420 0.5 96.50 1438 6893 5600 621451
## 4: HPGL0421 0.5 91.25 1256 5479 4753 475817
## 5: HPGL0422 0.5 80.50 1123 5548 4197 617111
## 6: HPGL0423 0.5 102.25 1447 6782 5521 763579
## 7: HPGL0424 0.5 80.25 1169 5454 4534 559589
## 8: HPGL0425 0.5 80.25 1181 5389 4434 538745
##
## $table
## id sample counts condition batch
## 1: M5005_Spy0001 HPGL0418 8783 wt_ll_cg a
## 2: M5005_Spy0002 HPGL0418 9489 wt_ll_cg a
## 3: M5005_Spy0003 HPGL0418 550 wt_ll_cg a
## 4: M5005_Spy0004 HPGL0418 9844 wt_ll_cg a
## 5: M5005_Spy0005 HPGL0418 1045 wt_ll_cg a
## ---
## 15404: M5005_SpyT0063 HPGL0425 86 mga1_ll_cf b
## 15405: M5005_SpyT0064 HPGL0425 87 mga1_ll_cf b
## 15406: M5005_SpyT0065 HPGL0425 80 mga1_ll_cf b
## 15407: M5005_SpyT0066 HPGL0425 57 mga1_ll_cf b
## 15408: M5005_SpyT0067 HPGL0425 17 mga1_ll_cf b
## quantile/quantile plots compared to the median of all samples
raw_metrics$qqrat
raw_metrics$tsneplot
## Here we can see some samples are differently 'shaped' compared to the median than others
## There are other plots one may view, but this data set is a bit too crowded as is.
## The following summary shows the other available plots:
summary(raw_metrics)
## Length Class Mode
## nonzero 9 gg list
## nonzero_table 6 data.frame list
## libsize 9 gg list
## libsizes 4 data.table list
## libsize_summary 7 data.table list
## boxplot 9 gg list
## corheat 3 recordedplot list
## smc 9 gg list
## disheat 3 recordedplot list
## smd 9 gg list
## pcaplot 9 gg list
## pcatable 8 data.frame list
## pcares 4 data.frame list
## pcavar 7 -none- numeric
## tsneplot 9 gg list
## tsnetable 8 data.frame list
## tsneres 4 -none- list
## tsnevar 8 -none- numeric
## density 5 -none- list
## legend 2 -none- list
## qqlog 3 recordedplot list
## qqrat 3 recordedplot list
## ma 0 -none- NULL
The plots are all generated by calling plot_something() where the somethings are:
On the other hand, we might take a subset of the data to focus on the late-log vs. early-log samples.
The expt_subset() function allows one to pull material from the experimental design.
Once we have a smaller data set, we can more easily use PCA to see how the sample separate.
head(expt$design)
## sampleid type stage replicate mutantname media exptdate libdate
## HPGL0418 HPGL0418 WT LL 1 5448-1 CDMg 20130430 20140409
## HPGL0419 HPGL0419 WT LL 2 5448-2 CDMg 20130430 20140409
## HPGL0420 HPGL0420 mga LL 1 Mga-1 CDMg 20130430 20140409
## HPGL0421 HPGL0421 mga LL 2 Mga-2 CDMg 20130430 20140409
## HPGL0422 HPGL0422 WT LL 1 5448-1 CDMf 20130430 20140415
## HPGL0423 HPGL0423 WT LL 2 5448-2 CDMf 20130430 20140415
## batch condition readspassed ncrna xncrna remaining genome
## HPGL0418 a wt_ll_cg 17425675 350440 2.01% 17075235 16061860
## HPGL0419 b wt_ll_cg 18106361 398984 2.20% 17707377 16619183
## HPGL0420 a mga1_ll_cg 20499936 417440 2.04% 20082496 18062942
## HPGL0421 b mga1_ll_cg 17216381 397021 2.31% 16819360 14795703
## HPGL0422 a wt_ll_cf 17112706 367791 2.15% 16744915 15114930
## HPGL0423 b wt_ll_cf 18782729 395748 2.11% 18386981 17330239
## xgenome otherbacterial xother colors
## HPGL0418 94.07% 356542 2.09% #E12C8C
## HPGL0419 93.85% 392803 2.22% #E12C8C
## HPGL0420 89.94% 416260 2.07% #BE5067
## HPGL0421 87.97% 376295 2.24% #BE5067
## HPGL0422 90.27% 367974 2.20% #8E7E40
## HPGL0423 94.25% 386443 2.10% #8E7E40
## counts
## HPGL0418 processed_data/count_tables/HPGL0418/HPGL0418_forward-trimmed-v1M1l20.count.xz
## HPGL0419 processed_data/count_tables/HPGL0419/HPGL0419_forward-trimmed-v1M1l20.count.xz
## HPGL0420 processed_data/count_tables/HPGL0420/HPGL0420_forward-trimmed-v1M1l20.count.xz
## HPGL0421 processed_data/count_tables/HPGL0421/HPGL0421_forward-trimmed-v1M1l20.count.xz
## HPGL0422 processed_data/count_tables/HPGL0422/HPGL0422_forward-trimmed-v1M1l20.count.xz
## HPGL0423 processed_data/count_tables/HPGL0423/HPGL0423_forward-trimmed-v1M1l20.count.xz
## intercounts design file
## HPGL0418 unknown a null
## HPGL0419 unknown b null
## HPGL0420 unknown a null
## HPGL0421 unknown b null
## HPGL0422 unknown a null
## HPGL0423 unknown b null
## elt stands for: "early/late in thy"
batch_a <- subset_expt(expt, subset="batch=='a'")
batch_b <- subset_expt(expt, subset="batch=='b'")
a_metrics <- sm(graph_metrics(batch_a, cis=NULL))
b_metrics <- sm(graph_metrics(batch_b, cis=NULL))
a_metrics$pcaplot
b_metrics$pcaplot
a_metrics$tsneplot
b_metrics$tsneplot
It is pretty obvious that the raw data is a bit jumbled according to PCA. This is not paricularly suprising since we didn’t normalize it at all. Therefore, in this block I will normalize it a few ways and follow up with some visualizations of showing how the apparent relationships change in the data.
## doing nothing to the data except log2 transforming it has a surprisingly large effect
norm_test <- normalize_expt(expt, transform="log2")
## This function will replace the expt$expressionset slot with:
## log2(data)
## It backs up the current data into a slot named:
## expt$backup_expressionset. It will also save copies of each step along the way
## in expt$normalized with the corresponding libsizes. Keep the libsizes in mind
## when invoking limma. The appropriate libsize is the non-log(cpm(normalized)).
## This is most likely kept at:
## 'new_expt$normalized$intermediate_counts$normalization$libsizes'
## A copy of this may also be found at:
## new_expt$best_libsize
## Filter is false, this should likely be set to something, good
## choices include cbcb, kofa, pofa (anything but FALSE). If you want this to
## stay FALSE, keep in mind that if other normalizations are performed, then the
## resulting libsizes are likely to be strange (potentially negative!)
## Leaving the data unconverted. It is often advisable to cpm/rpkm
## the data to normalize for sampling differences, keep in mind though that rpkm
## has some annoying biases, and voom() by default does a cpm (though hpgl_voom()
## will try to detect this).
## Leaving the data unnormalized. This is necessary for DESeq, but
## EdgeR/limma might benefit from normalization. Good choices include quantile,
## size-factor, tmm, etc.
## Not correcting the count-data for batch effects. If batch is
## included in EdgerR/limma's model, then this is probably wise; but in extreme
## batch effects this is a good parameter to play with.
## Step 1: not doing count filtering.
## Step 2: not normalizing the data.
## Step 3: not converting the data.
## Step 4: transforming the data with log2.
## transform_counts: Found 923 values equal to 0, adding 1 to the matrix.
## Step 5: not doing batch correction.
l2_metrics <- sm(graph_metrics(norm_test, cis=NULL))
## a quantile normalization alone affect some, but not all of the data
norm_test <- sm(normalize_expt(expt, norm="quant"))
q_metrics <- sm(graph_metrics(norm_test, cis=NULL)) ## q for quant, oh oh oh!
## cpm alone brings out some samples, too
norm_test <- sm(normalize_expt(expt, convert="cpm"))
c_metrics <- sm(graph_metrics(norm_test, cis=NULL)) ## c for cpm!
## low count filtering has some effect, too
norm_test <- sm(normalize_expt(expt, filter="pofa"))
f_metrics <- sm(graph_metrics(norm_test, cis=NULL)) ## f for filter!
## how about if we mix and match methods?
norm_test <- sm(normalize_expt(expt, transform="log2", convert="cpm",
norm="quant", batch="combat_scale", filter=TRUE,
batch_step=4, low_to_zero=TRUE))
## Some metrics are not very useful on (especially quantile) normalized data
norm_graphs <- sm(graph_metrics(norm_test, cis=NULL))
Now lets see some of the resulting metrics, in this case I will just compare some pca plots, as they are good at fooling our silly visual brains into seeing patterns.
l2_metrics$pcaplot
## Also viewable with plot_pca()$plot
## PCA plots seem (to me) to prefer log2 scale data.
q_metrics$pcaplot
## only normalizing on the quantiles leaves the data open to scale effects.
c_metrics$pcaplot
## but cpm alone is insufficient
f_metrics$pcaplot
## only filtering out low-count genes is helpful as well
norm_graphs$pcaplot
## The different batch effect testing methods have a pretty widely ranging effect on the clustering
## play with them by changing the batch= parameter to:
## "limma", "sva", "svaseq", "limmaresid", "ruvg", "combat", combatmod"
knitr::kable(norm_graphs$pcares)
propVar | cumPropVar | cond.R2 | batch.R2 |
---|---|---|---|
82.42 | 82.42 | 99.36 | 0.00 |
6.39 | 88.81 | 86.57 | 1.80 |
4.02 | 92.83 | 49.64 | 1.55 |
2.80 | 95.63 | 8.87 | 17.53 |
2.11 | 97.74 | 48.94 | 6.99 |
1.27 | 99.01 | 6.16 | 68.83 |
0.99 | 100.00 | 0.45 | 3.31 |
## Thus we see a dramatic decrease in variance accounted for
## by batch after applying limma's 'removebatcheffect'
## (see batch.R2 here vs. above)
norm_graphs$smc
norm_graphs$disheat ## svaseq's batch correction seems to draw out the signal quite nicely.
## It is worth noting that the wt, early log, thy, replicate c samples are still a bit weird.
norm_graphs$tsneplot
This is a relatively small data set, so performing some differential expression analyses really should not take long at all.
When performing these analyses with hpgltools, it will attempt to perform similar analyses with limma, edgeR, and DESeq2 via the all_pairwise() function. The most likely argument is ‘model_batch’ which may be used to explicitly include/exclude a batch factor in the model, or ask it to attempt including batch factors from sva/ruv/etc. By default it will attempt to include a column from the experimental design named ‘batch’.
spyogenes_de <- sm(all_pairwise(expt))
## Even the lowest correlations are quite high.
The result of all_pairwise() is a list of the results from limma, edger, and deseq. In addition, I implemented a very simplistic, differential expression function named ‘basic()’. It also provides some simple measurements of how well the various analyses agree (ergo the black and white heatmap).
Working with these separate tables can be more than a little annoying, combine_de_tables() attempts to simplify this. It will bring together the various tables, and if asked attempt to bring them together into a pretty-ified excel workbook.
all_pairwise() arbitrarily performs all possible pairwise comparisons. This is not necessarily what one actually wishes to see. Therefore, the argument keepers takes a list of contrasts:
my_keepers <- list(
"wt_media" = c("wt_ll_cf", "wt_ll_cg"),
"mga_media" = c("mga_ll_cf", "mga_ll_cg"))
In the above example, if the keepers argument to combine_de_tables() is given as my_keepers, then the resulting table will not have the set of 6 possible comparisons, but instead will only have 2 tables named ‘wt_media’ and ‘mga_media’, which if printed to excel will be sheets named accordingly.
Because these tables can get quickly extremely large, the excel workbooks may become too big for the zip(1) program to properly merge. If that happens, one may either remove some(or all) of the plots and/or have it print csv versions of the same tables.
spyogenes_tables <- sm(combine_de_tables(spyogenes_de, excel=FALSE))
summary(spyogenes_tables)
## Length Class Mode
## data 6 -none- list
## limma_plots 0 -none- list
## edger_plots 6 -none- list
## deseq_plots 6 -none- list
## limma_ma_plots 0 -none- list
## edger_ma_plots 6 -none- list
## deseq_ma_plots 6 -none- list
## limma_vol_plots 0 -none- list
## edger_vol_plots 6 -none- list
## deseq_vol_plots 6 -none- list
## comp_plot 0 -none- list
## venns 0 -none- list
## keepers 6 -none- list
## contrast_list 6 -none- character
## de_summary 22 data.frame list
## Try changing the p-adjustment
spyogenes_tables <- sm(combine_de_tables(spyogenes_de, excel=FALSE, padj_type="BH"))
head(spyogenes_tables$data[[1]])
## seqnames start end width strand source type score
## M5005_Spy0001 NC_007297 202 1557 1356 + RefSeq gene undefined
## M5005_Spy0002 NC_007297 1712 2848 1137 + RefSeq gene undefined
## M5005_Spy0003 NC_007297 2923 3120 198 + RefSeq gene undefined
## M5005_Spy0004 NC_007297 3450 4565 1116 + RefSeq gene undefined
## M5005_Spy0005 NC_007297 4635 5204 570 + RefSeq gene undefined
## M5005_Spy0006 NC_007297 5207 8710 3504 + RefSeq gene undefined
## phase id dbxref iscircular gbkey
## M5005_Spy0001 undefined M5005_Spy0001 GeneID:3571011 undefined Gene
## M5005_Spy0002 undefined M5005_Spy0002 GeneID:3571012 undefined Gene
## M5005_Spy0003 undefined M5005_Spy0003 GeneID:3571013 undefined Gene
## M5005_Spy0004 undefined M5005_Spy0004 GeneID:3571014 undefined Gene
## M5005_Spy0005 undefined M5005_Spy0005 GeneID:3571015 undefined Gene
## M5005_Spy0006 undefined M5005_Spy0006 GeneID:3571016 undefined Gene
## genome moltype strain name note
## M5005_Spy0001 undefined undefined undefined dnaA M5005_Spy0001
## M5005_Spy0002 undefined undefined undefined dnaN M5005_Spy0002
## M5005_Spy0003 undefined undefined undefined M5005_Spy_0003 M5005_Spy0003
## M5005_Spy0004 undefined undefined undefined ychF M5005_Spy0004
## M5005_Spy0005 undefined undefined undefined pth M5005_Spy0005
## M5005_Spy0006 undefined undefined undefined trcF M5005_Spy0006
## gene locustag parent product proteinid
## M5005_Spy0001 dnaA M5005_Spy_0001 character(0) undefined undefined
## M5005_Spy0002 dnaN M5005_Spy_0002 character(0) undefined undefined
## M5005_Spy0003 undefined M5005_Spy_0003 character(0) undefined undefined
## M5005_Spy0004 ychF M5005_Spy_0004 character(0) undefined undefined
## M5005_Spy0005 pth M5005_Spy_0005 character(0) undefined undefined
## M5005_Spy0006 trcF M5005_Spy_0006 character(0) undefined undefined
## transltable genesynonym limma_logfc limma_adjp deseq_logfc
## M5005_Spy0001 undefined character(0) 0.22320 0.5637 0.22190
## M5005_Spy0002 undefined character(0) 0.17990 0.7002 0.15950
## M5005_Spy0003 undefined character(0) -0.06607 0.9421 -0.04509
## M5005_Spy0004 undefined character(0) 0.06423 0.9447 0.05406
## M5005_Spy0005 undefined character(0) 0.07449 0.9169 0.05262
## M5005_Spy0006 undefined character(0) -0.11200 0.6719 -0.09324
## deseq_adjp edger_logfc edger_adjp limma_ave limma_t limma_b
## M5005_Spy0001 0.6396 0.23690 0.6469 9.525 1.3760 -6.173
## M5005_Spy0002 0.7992 0.17470 0.7889 9.621 0.9257 -6.657
## M5005_Spy0003 0.9705 -0.03071 1.0000 5.855 -0.3763 -6.771
## M5005_Spy0004 0.9578 0.06851 0.9715 10.060 0.3657 -7.081
## M5005_Spy0005 0.9646 0.06755 0.9980 6.452 0.4288 -6.768
## M5005_Spy0006 0.8751 -0.07858 0.9398 9.907 -0.9875 -6.613
## limma_p deseq_basemean deseq_lfcse deseq_stat deseq_p
## M5005_Spy0001 0.2286 8667.0 0.2008 1.1050 0.2691
## M5005_Spy0002 0.3981 9174.0 0.2241 0.7119 0.4765
## M5005_Spy0003 0.7225 699.7 0.2303 -0.1958 0.8448
## M5005_Spy0004 0.7299 12590.0 0.2149 0.2516 0.8014
## M5005_Spy0005 0.6863 1017.0 0.2201 0.2391 0.8110
## M5005_Spy0006 0.3698 11000.0 0.1784 -0.5226 0.6013
## edger_logcpm edger_lr edger_p basic_nummed basic_denmed
## M5005_Spy0001 9.562 1.48300 0.2234 9.801 9.581
## M5005_Spy0002 9.643 0.71140 0.3990 9.774 9.596
## M5005_Spy0003 5.935 0.01412 0.9054 6.242 6.311
## M5005_Spy0004 10.100 0.11900 0.7302 10.510 10.440
## M5005_Spy0005 6.473 0.07032 0.7909 6.651 6.576
## M5005_Spy0006 9.904 0.18440 0.6676 10.020 10.140
## basic_numvar basic_denvar basic_logfc basic_t basic_p
## M5005_Spy0001 7.389e-02 9.516e-03 0.21930 -1.0740 4.471e-01
## M5005_Spy0002 2.584e-02 1.152e-02 0.17810 -1.3030 3.383e-01
## M5005_Spy0003 5.987e-02 1.568e-04 -0.06930 0.4000 7.575e-01
## M5005_Spy0004 5.725e-02 2.709e-02 0.06281 -0.3059 7.918e-01
## M5005_Spy0005 1.134e-02 4.831e-03 0.07502 -0.8344 5.038e-01
## M5005_Spy0006 3.987e-02 2.776e-03 -0.11570 0.7922 5.600e-01
## basic_adjp limma_adjp_bh deseq_adjp_bh edger_adjp_bh
## M5005_Spy0001 7.350e-01 5.636e-01 7.297e-01 6.468e-01
## M5005_Spy0002 6.858e-01 7.002e-01 8.853e-01 7.890e-01
## M5005_Spy0003 9.025e-01 9.421e-01 1.000e+00 1.000e+00
## M5005_Spy0004 9.187e-01 9.447e-01 1.000e+00 9.715e-01
## M5005_Spy0005 7.660e-01 9.169e-01 1.000e+00 9.979e-01
## M5005_Spy0006 7.986e-01 6.719e-01 9.540e-01 9.397e-01
## basic_adjp_bh lfc_meta lfc_var lfc_varbymed p_meta
## M5005_Spy0001 6.791e-01 0.22720 7.468e-06 3.287e-05 2.404e-01
## M5005_Spy0002 6.066e-01 0.17230 2.446e-05 1.420e-04 4.245e-01
## M5005_Spy0003 8.810e-01 -0.05478 6.498e-04 -1.186e-02 8.242e-01
## M5005_Spy0004 9.002e-01 0.06218 2.212e-06 3.557e-05 7.538e-01
## M5005_Spy0005 7.182e-01 0.06376 1.499e-05 2.351e-04 7.627e-01
## M5005_Spy0006 7.574e-01 -0.09577 1.045e-03 -1.091e-02 5.462e-01
## p_var
## M5005_Spy0001 6.260e-04
## M5005_Spy0002 2.026e-03
## M5005_Spy0003 8.680e-03
## M5005_Spy0004 1.697e-03
## M5005_Spy0005 4.483e-03
## M5005_Spy0006 2.445e-02
Finally, extract_significant_genes() may choose ‘significant’ genes based upon a few metrics including z-score vs. the distribution of logFC; a logFC cutoff, (ajusted)p-value cutoff, and/or top/bottom n genes.
spyogenes_sig <- sm(extract_significant_genes(spyogenes_tables, excel=FALSE))
knitr::kable(head(spyogenes_sig$limma$ups[[1]]))
seqnames | start | end | width | strand | source | type | score | phase | id | dbxref | iscircular | gbkey | genome | moltype | strain | name | note | gene | locustag | parent | product | proteinid | transltable | genesynonym | limma_logfc | limma_adjp | deseq_logfc | deseq_adjp | edger_logfc | edger_adjp | limma_ave | limma_t | limma_b | limma_p | deseq_basemean | deseq_lfcse | deseq_stat | deseq_p | edger_logcpm | edger_lr | edger_p | basic_nummed | basic_denmed | basic_numvar | basic_denvar | basic_logfc | basic_t | basic_p | basic_adjp | limma_adjp_bh | deseq_adjp_bh | edger_adjp_bh | basic_adjp_bh | lfc_meta | lfc_var | lfc_varbymed | p_meta | p_var | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
M5005_Spy1735 | NC_007297 | 1696949 | 1698145 | 1197 | - | RefSeq | gene | undefined | undefined | M5005_Spy1735 | GeneID:3571136 | undefined | Gene | undefined | undefined | undefined | speB | M5005_Spy1735 | speB | M5005_Spy_1735 | character(0) | undefined | undefined | undefined | character(0) | 3.509 | 0.0356 | 3.691 | 0 | 3.696 | 0 | 2.346 | 13.17 | 1.865 | 1e-04 | 83.91 | 0.4335 | 8.514 | 0 | 2.925 | 101.70 | 0 | 4.349 | 1.082 | 3.503e-02 | 1.818e-02 | 3.266 | -20.02 | 3.833e-03 | 6.256e-01 | 3.555e-02 | 1.078e-14 | 6.258e-21 | 2.895e-02 | 3.710 | 1.708e-01 | 4.602e-02 | 1.778e-05 | 9.487e-10 |
M5005_Spy1575 | NC_007297 | 1535635 | 1536831 | 1197 | - | RefSeq | gene | undefined | undefined | M5005_Spy1575 | GeneID:3571303 | undefined | Gene | undefined | undefined | undefined | norA | M5005_Spy1575 | norA | M5005_Spy_1575 | character(0) | undefined | undefined | undefined | character(0) | 2.371 | 0.0356 | 2.406 | 0 | 2.422 | 0 | 6.834 | 12.96 | 2.727 | 1e-04 | 1849.00 | 0.2276 | 10.570 | 0 | 7.338 | 87.72 | 0 | 8.250 | 5.881 | 3.683e-02 | 5.610e-02 | 2.368 | -10.99 | 9.467e-03 | 6.256e-01 | 3.555e-02 | 3.902e-23 | 4.838e-18 | 6.812e-02 | 2.507 | 8.917e-02 | 3.557e-02 | 1.921e-05 | 1.107e-09 |
M5005_Spy1715 | NC_007297 | 1675257 | 1678751 | 3495 | - | RefSeq | gene | undefined | undefined | M5005_Spy1715 | GeneID:3571155 | undefined | Gene | undefined | undefined | undefined | scpA | M5005_Spy1715 | scpA | M5005_Spy_1715 | character(0) | undefined | undefined | undefined | character(0) | 1.780 | 0.0359 | 1.837 | 0 | 1.850 | 0 | 5.291 | 11.71 | 1.906 | 1e-04 | 1182.00 | 0.2667 | 6.888 | 0 | 6.681 | 39.32 | 0 | 3.909 | 2.155 | 3.255e-03 | 1.759e-03 | 1.754 | -35.03 | 1.314e-03 | 6.256e-01 | 3.594e-02 | 1.553e-09 | 7.695e-08 | 1.025e-02 | 1.820 | 1.114e-01 | 6.122e-02 | 3.110e-05 | 2.901e-09 |
Since most of my circos graphs are for pyogenes, it is likely that the defaults are appropriate for this particular organism.
Much(all) of the following is taken from the material in tests/testthat/test_70mga.R
microbe_ids <- as.character(sm(get_microbesonline_ids("pyogenes MGAS5005")))
mgas_df <- sm(load_microbesonline_annotations(microbe_ids[[1]])[[1]])
mgas_df$sysName <- gsub(pattern="Spy_", replacement="Spy", x=mgas_df$sysName)
rownames(mgas_df) <- make.names(mgas_df$sysName, unique=TRUE)
## First make a template configuration
circos_test <- sm(circos_prefix())
## Fill it in with the data for s.pyogenes
circos_kary <- circos_karyotype("mgas", length=1895017)
## Wrote karyotype to circos/conf/karyotypes/mgas.conf
## This should match the karyotype= line in mgas.conf
## Fill in the gene category annotations by gene-strand
circos_plus <- circos_plus_minus(mgas_df, circos_test)
## Writing data file: circos/data/mgas_plus_go.txt with the + strand GO data.
## Writing data file: circos/data/mgas_minus_go.txt with the - strand GO data.
## Wrote the +/- config files. Appending their inclusion to the master file.
## Returning the inner width: 0.84. Use it as the outer for the next ring.
circos_limma_hist <- circos_hist(spyogenes_de$limma$all_tables[[1]], mgas_df,
circos_test, outer=circos_plus)
## Writing data file: circos/data/mgas_logFC_hist.txt with the logFC column.
circos_deseq_hist <- circos_hist(spyogenes_de$deseq$all_tables[[1]], mgas_df,
circos_test, outer=circos_limma_hist)
## Writing data file: circos/data/mgas_logFC_hist.txt with the logFC column.
circos_edger_hist <- circos_hist(spyogenes_de$edger$all_tables[[1]], mgas_df,
circos_test, outer=circos_deseq_hist)
## Writing data file: circos/data/mgas_logFC_hist.txt with the logFC column.
circos_suffix(cfgout=circos_test)
circos_made <- circos_make(target="mgas")
## For some reason this fails weirdly when not run interactively.
getwd()
## [1] "/cbcb/nelsayed-scratch/atb/git/hpgltools/vignettes"
circos result
GenoplotR is new to me, but it seems to work?
genoplot_chromosome()
## Warning in strings[i] <- downloaded: number of items to replace is not a
## multiple of replacement length
Perhaps instead of looking at subsets of the data, we may want to consider wt/mga. If so, we can just change the condition to any column in the design matrix.
wt_mga_expt <- set_expt_condition(expt=expt, fact="type")
wt_mga_plots <- sm(graph_metrics(wt_mga_expt))
wt_mga_norm <- sm(normalize_expt(wt_mga_expt, transform="log2", convert="raw", filter=TRUE, norm="quant"))
wt_mga_nplots <- sm(graph_metrics(wt_mga_norm))
wt_mga_de <- sm(all_pairwise(input=wt_mga_expt,
combined_excel="wt_mga.xlsx",
sig_excel="wt_mga_sig.xlsx",
abundant_excel="wt_mga_abundant.xlsx"))
## How well do the various DE tools agree on this data?
wt_mga_de$combined$comp_plot
## $summary
## WT_vs_mga
## le 0.9578
## ld 0.9642
## ed 0.9978
## lb 0.9673
## eb 0.9215
## db 0.9245
wt_mga_plots$tsneplot
wt_mga_nplots$pcaplot
wt_mga_de$combined$limma_plots$WT_vs_mga$scatter
wt_mga_de$combined$limma_ma_plots$WT_vs_mga$plot
wt_mga_de$combined$limma_vol_plots$WT_vs_mga$plot
pander::pander(sessionInfo())
R version 3.4.3 (2017-11-30)
**Platform:** x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C
attached base packages: stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: Vennerable(v.3.1.0.9000), ruv(v.0.9.6) and hpgltools(v.2017.10)
loaded via a namespace (and not attached): Rtsne(v.0.13), colorspace(v.1.3-2), rprojroot(v.1.2), htmlTable(v.1.11.0), corpcor(v.1.6.9), XVector(v.0.18.0), GenomicRanges(v.1.30.0), base64enc(v.0.1-3), rstudioapi(v.0.7), ggrepel(v.0.7.0), bit64(v.0.9-7), AnnotationDbi(v.1.40.0), codetools(v.0.2-15), splines(v.3.4.3), doParallel(v.1.0.11), robustbase(v.0.92-8), geneplotter(v.1.56.0), knitr(v.1.17), ade4(v.1.7-8), Formula(v.1.2-2), packrat(v.0.4.8-1), annotate(v.1.56.1), cluster(v.2.0.6), graph(v.1.56.0), compiler(v.3.4.3), backports(v.1.1.1), assertthat(v.0.2.0), Matrix(v.1.2-12), lazyeval(v.0.2.1), limma(v.3.34.3), acepack(v.1.4.1), htmltools(v.0.3.6), tools(v.3.4.3), bindrcpp(v.0.2), gtable(v.0.2.0), glue(v.1.2.0), GenomeInfoDbData(v.0.99.1), reshape2(v.1.4.2), dplyr(v.0.7.4), Rcpp(v.0.12.14), Biobase(v.2.38.0), preprocessCore(v.1.40.0), nlme(v.3.1-131), iterators(v.1.0.8), stringr(v.1.2.0), openxlsx(v.4.0.17), testthat(v.1.0.2), gtools(v.3.5.0), devtools(v.1.13.4), XML(v.3.98-1.9), edgeR(v.3.20.1), DEoptimR(v.1.0-8), MASS(v.7.3-47), directlabels(v.2017.03.31), zlibbioc(v.1.24.0), scales(v.0.5.0), BiocInstaller(v.1.28.0), RBGL(v.1.54.0), parallel(v.3.4.3), SummarizedExperiment(v.1.8.0), RColorBrewer(v.1.1-2), yaml(v.2.1.15), memoise(v.1.1.0), gridExtra(v.2.3), pander(v.0.6.1), ggplot2(v.2.2.1), rpart(v.4.1-11), latticeExtra(v.0.6-28), stringi(v.1.1.6), RSQLite(v.2.0), highr(v.0.6), genefilter(v.1.60.0), S4Vectors(v.0.16.0), foreach(v.1.4.3), RMySQL(v.0.10.13), checkmate(v.1.8.5), BiocGenerics(v.0.24.0), BiocParallel(v.1.12.0), GenomeInfoDb(v.1.14.0), rlang(v.0.1.4), pkgconfig(v.2.0.1), matrixStats(v.0.52.2), bitops(v.1.0-6), evaluate(v.0.10.1), lattice(v.0.20-35), purrr(v.0.2.4), bindr(v.0.1), htmlwidgets(v.0.9), labeling(v.0.3), bit(v.1.1-12), plyr(v.1.8.4), magrittr(v.1.5), DESeq2(v.1.18.1), R6(v.2.2.2), IRanges(v.2.12.0), Hmisc(v.4.0-3), DelayedArray(v.0.4.1), DBI(v.0.7), foreign(v.0.8-69), withr(v.2.1.0), mgcv(v.1.8-22), survival(v.2.41-3), RCurl(v.1.95-4.8), nnet(v.7.3-12), tibble(v.1.3.4), crayon(v.1.3.4), KernSmooth(v.2.23-15), rmarkdown(v.1.8), locfit(v.1.5-9.1), grid(v.3.4.3), sva(v.3.26.0), data.table(v.1.10.4-3), blob(v.1.1.0), digest(v.0.6.12), xtable(v.1.8-2), tidyr(v.0.7.2), genoPlotR(v.0.8.7), stats4(v.3.4.3), munsell(v.0.4.3) and quadprog(v.1.5-5)