CloudMap

CloudMap: A Cloud-Based Pipeline for Analysis of Mutant Genome Sequences.

Genetics. 2012 Dec;192(4):1249-69. doi: 10.1534/genetics.112.144204. Epub 2012 Oct 10.

Department of Biochemistry and Molecular Biophysics, Howard Hughes Medical Institute, Columbia University Medical Center, New York, New York 10032.

Abstract

Whole genome sequencing (WGS) allows researchers to pinpoint genetic differences between individuals and significantly shortcuts the costly and time-consuming part of forward genetic analysis in model organism systems. Currently, the most effort-intensive part of WGS is the bioinformatic analysis of the relatively short reads generated by second generation sequencing platforms. We describe here a novel, easily accessible and cloud-based pipeline, called CloudMap, which greatly simplifies the analysis of mutant genome sequences. Available on the Galaxy web platform, CloudMap requires no software installation when run on the cloud, but it can also be run locally or via Amazon’s Elastic Compute Cloud (EC2) service. CloudMap uses a series of predefined workflows to pinpoint sequence variations in animal genomes, such as those of premutagenized and mutagenized Caenorhabditis elegans strains. In combination with a variant-based mapping procedure, CloudMap allows users to sharply define genetic map intervals graphically and to retrieve very short lists of candidate variants with a few simple clicks. Automated workflows and extensive video user guides are available to detail the individual analysis steps performed (http://usegalaxy.org/cloudmap). We demonstrate the utility of CloudMap for WGS analysis of C. elegans and Arabidopsis genomes and describe how other organisms (e.g., Zebrafish and Drosophila) can easily be accommodated by this software platform. To accommodate rapid analysis of many mutants from large-scale genetic screens, CloudMap contains an in silico complementation testing tool that allows users to rapidly identify instances where multiple alleles of the same gene are present in the mutant collection. Lastly, we describe the application of a novel mapping/WGS method (“Variant Discovery Mapping”) that does not rely on a defined polymorphic mapping strain, and we integrate the application of this method into CloudMap.
CloudMap tools and documentation are continually updated at http://usegalaxy.org/cloudmap.
 

Video User Guides

Galaxy Install steps and CloudMap Dependencies

Frequently Asked Questions (FAQs):

CloudMap Questions:
1) How much coverage do I need for this to work?
2) Why does my annotated variant not correspond to the same position in Wormbase? What version of the C. elegans genome do you use?
3) I sequenced a pooled sample and I see called homozygous variants outside my mapping region. Also, upon inspection of the alignment for variants within my mapping region, I see that some variants that were called homozygous are a bit questionable. What’s going on?
4) Why is my mapping not as tight as the one in the paper?
5) Are the Hawaiian variants subtracted from my final annotated set of variants in the Hawaiian Variant Mapping workflow?
6) For Variant Discovery Mapping, should I cross to my starting strain or to N2?
7) What’s the difference between Hawaiian Mapping and Variant Discovery Mapping and EMS density mapping?
8) I have a Hawaiian mapped sample and I ran the workflow that generates HA and VDM plots. There’s a discrepancy between the 4 plots for my linked chromosome, which plot do I trust?
9) I don’t have a candidate list, and I don’t want my variants filtered in any way. What should I do?
10) Do you provide a list of transgene silencers that I can use with the CloudMap Candidate Gene Checker tool?
11) I only see a signal on the left arm of chromosome I but there are no candidates in that region. Is this the correct mapping region?

 
General Galaxy Questions:
1) What type of FASTQ quality encoding should I have?
2) How do I know what type of FASTQ encoding I have?
3) Does the pipeline support paired-end data?
4) How do I edit a workflow?
5) My correct input file is in my history but the Galaxy tool won’t recognize it as an input
6) My job failed (turned red) within Galaxy and I’m sure the inputs are correct and this job has previously worked for me.
7) I’m having a problem with one of the GATK tools.

 
Other:
1) What do I do if my problem isn’t mentioned in the FAQ?

 

 

CloudMap-Specific Questions:
1) How much coverage do I need for this to work?

We’ve had good success with the mapping plots given 20x coverage. Results may vary based on the nature of the mutation, coverage of samples that you’re using for subtraction, and luck (i.e. whether the causal mutation has good coverage in the sample).
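As a rough way to check whether a sequencing run is likely to reach the 20x mentioned above, mean coverage can be estimated as number of reads times read length divided by genome size. The sketch below is only an illustration: the function names are hypothetical, the C. elegans genome size is rounded to about 100 Mb, and the estimate ignores unmapped and duplicate reads, so real coverage will be somewhat lower.

```python
import math

# Approximate C. elegans genome size in base pairs (rounded; an assumption for illustration).
C_ELEGANS_GENOME_BP = 100_000_000


def estimated_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Expected mean depth: total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads whose total bases reach the target mean depth."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    # 20 million single-end 100 bp reads against a ~100 Mb genome.
    print(estimated_coverage(20_000_000, 100, C_ELEGANS_GENOME_BP))
    print(reads_needed(20, 100, C_ELEGANS_GENOME_BP))
```

Because coverage at any one position varies around the mean, it is still worth checking depth at candidate variants directly in the alignment.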

2) Why does my annotated variant not correspond to the same position in Wormbase? What version of the C. elegans genome do you use?

We align to the WS220 release of Wormbase (2010) so that alignments can easily be viewed in the UCSC genome browser. Wormbase works off of a later release and doesn’t support viewing of large alignments in GBrowse. We have plans to update the genome release we align to soon.

3) I sequenced a pooled sample and I see called homozygous variants outside my mapping region. Also, upon inspection of the alignment for variants within my mapping region, I see that some variants that were called homozygous are a bit questionable. What’s going on?

We rely on the GATK variant caller to call variants. The variant caller is not optimized for calling variants in a pooled sample of recombinants from a mapping cross population. For this reason, we recommend examining variants within the mapping region in the BAM alignment file to validate prior to proceeding with experiments.

4) Why is my mapping not as tight as the one in the paper?

Several reasons:
1) You may have used fewer than the 50 pooled F2s that we used in the paper.
2) You may have selected for multiple mutations in your (broad) mapping interval (for example, a transgene).
3) When pooling the F2 recombinants to extract DNA, you may have used a disproportionate number of worms from a few plates. This has the same effect as sequencing fewer recombinants.

5) Are the Hawaiian variants subtracted from my final annotated set of variants in the Hawaiian Variant Mapping workflow?

Yes

6) For Variant Discovery Mapping, should I cross to my starting strain or to N2?

N2 is better because you will have more variants in the strain to use for mapping. These additional variants are background variants that the strain accumulated relative to N2 through genetic drift or introduction of the transgene. If you cross back to the starting strain and then subtract the starting strain variants in the course of Variant Discovery Mapping (VDM), you subtract variants that could have been used for mapping.

7) What’s the difference between Hawaiian Mapping and Variant Discovery Mapping and EMS density mapping?

Hawaiian Mapping uses variants in the crossing strain for mapping (crossing strains other than Hawaiian can be used if they have been whole genome sequenced and all their variants are known). Variant Discovery Mapping and EMS Density mapping use the variants in the mutagenized strain for mapping. Because the Hawaiian strain has many more mutations than are introduced by a mutagen such as EMS or genetic drift, Hawaiian Mapping is more accurate. Variant Discovery Mapping is more accurate and faster than EMS Density Mapping because it relies on a bulk segregant approach whereas EMS Density Mapping relies on serial backcrosses.
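The general idea behind mapping with crossing-strain variants can be sketched in a few lines. In a pool of homozygous mutant F2s, the region linked to the causal mutation should carry few or no Hawaiian alleles, so the fraction of Hawaiian reads at known SNP positions drops toward zero there. The code below is a simplified illustration of that principle, not CloudMap's implementation: the input layout (position, Hawaiian reads, total reads), the fixed bin size, and the function names are all assumptions.

```python
from collections import defaultdict


def hawaiian_ratio_by_bin(snps, bin_size=1_000_000):
    """Pool read counts per bin along one chromosome.

    snps: iterable of (position, hawaiian_reads, total_reads) tuples.
    Returns {bin_start: hawaiian_reads / total_reads} for bins with coverage.
    """
    hawaiian = defaultdict(int)
    total = defaultdict(int)
    for position, hawaiian_reads, total_reads in snps:
        if total_reads == 0:
            continue  # no coverage at this SNP, nothing to learn from it
        bin_start = (position // bin_size) * bin_size
        hawaiian[bin_start] += hawaiian_reads
        total[bin_start] += total_reads
    return {b: hawaiian[b] / total[b] for b in sorted(total)}


def most_linked_bin(ratios):
    """Bin with the lowest Hawaiian allele fraction (the best linkage candidate)."""
    return min(ratios, key=ratios.get)


if __name__ == "__main__":
    example = [
        (100_000, 10, 20),
        (1_500_000, 0, 20),
        (1_700_000, 1, 20),
        (2_200_000, 8, 20),
    ]
    ratios = hawaiian_ratio_by_bin(example)
    print(ratios, most_linked_bin(ratios))
```

Variant Discovery Mapping turns this around: it tracks variants of the mutagenized strain, so the linked region is where those variants approach a frequency of one rather than zero.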

8) I have a Hawaiian mapped sample and I ran the workflow that generates HA and VDM plots. There’s a discrepancy between the 4 plots for my linked chromosome, which plot do I trust?

We find that the Hawaiian Mapping plots are the most accurate, for the reason discussed in the answer to the previous question (the Hawaiian strain has more variants).

9) I don’t have a candidate list, and I don’t want my variants filtered in any way. What should I do?

The candidate gene checker does not filter results. It simply adds another column at the end of the annotated variants file where it labels any variants that you had in your candidate gene list.
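The label-without-filtering behavior described above can be illustrated as follows. This is a minimal sketch, not the tool's actual code: the row layout, the "candidate" label text, and the case-insensitive matching are assumptions made for the example.

```python
def label_candidates(rows, candidate_genes, gene_column):
    """Return a copy of rows with one extra trailing column.

    Every input row is kept; rows whose gene is in candidate_genes get the
    label "candidate" in the new column, all others get an empty string.
    """
    candidates = {gene.strip().lower() for gene in candidate_genes}
    labeled = []
    for row in rows:
        gene = row[gene_column].strip().lower()
        labeled.append(list(row) + ["candidate" if gene in candidates else ""])
    return labeled


if __name__ == "__main__":
    variants = [
        ["I", "1000", "unc-54"],
        ["II", "2000", "lin-12"],
    ]
    for row in label_candidates(variants, ["lin-12"], gene_column=2):
        print("\t".join(row))
```

With an empty candidate list the output has the same rows as the input, each with an empty final column, so nothing is lost by running the step.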

10) Do you provide a list of transgene silencers that I can use with the CloudMap Candidate Gene Checker tool?

We have found several mutants in screens for neuronal fate loss where the causal variant was actually in a gene affecting transgene silencing and neuronal cell fate was unaffected. Many of these transgene silencers affect chromatin regulation. However, we are not publishing a comprehensive list of transgene silencers because the list of transgene silencers available in Wormbase contains some genes that we believe are incorrectly annotated. In addition, we don’t want to mislead researchers into thinking that mutations in these transgene silencers are somehow “uninteresting” with regard to the phenotype being assayed. For those interested, you may use Worm Mine (http://www.wormbase.org/tools/wormmine/) to make a list of transgene silencers as annotated by Wormbase or to individually check variants in your mutant for possible transgene silencing effects.

11) I only see a signal on the left arm of chromosome I, but there are no candidates in that region. Is this the correct mapping region?

As mentioned in the CloudMap paper, there is a previously described pattern of genetic incompatibility between the Bristol and Hawaiian strains on the left arm of LG I (Seidel et al. 2008). In all mutant plots examined thus far, the loess regression line shows a dip in this region, although the normalized frequency plots do not show a linkage signal because they partially correct for this bias. However, if you made a mistake in selecting homozygous F2 mutants (more on this below) and instead selected for the first F2 progeny from a cross, you have inadvertently been selecting for linkage to the left arm of chromosome I. This is likely because there is a dose-dependent effect of the PEEL element resulting from hermaphrodite sperm, and younger hermaphrodites contribute more of the PEEL element than older hermaphrodites (Seidel et al. 2011).

Reasons for seeing no mapping signal include:
1) Incorrectly picking some percentage of WT F2s instead of homozygous mutant F2s. Often this happens when the F2 phenotype isn’t clear. It is recommended to single mutant F2s and then to further confirm that you correctly picked F2 mutants by examining F3 progeny. This mistake is quite common.
2) Picking what you believe are homozygous recessive F2 mutants when the mutation is actually dominant. This has the same effect as sequencing mutants heterozygous for the causal variant, and the mapping signal will be missing. Make sure you know whether your mutation is recessive or dominant prior to sequencing.

 
 
General Galaxy Questions:
1) What type of FASTQ quality encoding should I have?

FASTQ quality encoding should be FASTQ Sanger. If you have a different type of FASTQ quality encoding, you can convert to FASTQ Sanger using the FASTQ Groomer tool. See http://en.wikipedia.org/wiki/FASTQ_format

2) How do I know what type of FASTQ encoding I have?

The FastQC tool reports this information, and you will need it if you use the FASTQ Groomer tool to convert to FASTQ Sanger. We use the FastQC tool as part of the pipeline. See http://en.wikipedia.org/wiki/FASTQ_format
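For readers who want a quick check outside Galaxy, the encoding can often be guessed from the range of characters on the quality lines, since Sanger (Phred+33) qualities start at ASCII 33 while the older Illumina and Solexa encodings use an offset of 64. The sketch below is a heuristic under stated assumptions, not a replacement for FastQC: it assumes four-line FASTQ records, the function names and return labels are made up for the example, and a small sample of uniformly high-quality reads can be ambiguous.

```python
import io


def iter_quality_lines(handle, max_records=10_000):
    """Yield the quality line of each record, assuming 4 lines per record."""
    for index, line in enumerate(handle):
        if index // 4 >= max_records:
            break
        if index % 4 == 3:
            yield line.rstrip("\r\n")


def guess_fastq_encoding(quality_lines):
    """Guess the quality offset from the observed ASCII range.

    Returns "phred33", "phred64", "solexa64", or "ambiguous".
    """
    codes = [ord(char) for line in quality_lines for char in line]
    if not codes:
        return "ambiguous"
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        return "phred33"   # characters below ';' only occur with offset 33
    if highest > 74:
        # Above 'J' is unusual for offset 33 Illumina data, so assume offset 64.
        return "solexa64" if lowest < 64 else "phred64"
    return "ambiguous"     # range fits both offsets; sample more reads


if __name__ == "__main__":
    sample = io.StringIO("@read1\nACGT\n+\n!!II\n")
    print(guess_fastq_encoding(iter_quality_lines(sample)))
```

A result of "phred33" corresponds to FASTQ Sanger, the encoding expected by the pipeline; any other result suggests running FASTQ Groomer first.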

3) Does the pipeline support paired-end data?

Yes, but you need to edit the workflows. See the user guide at usegalaxy.org/cloudmap.

4) How do I edit a workflow?

See the tutorials here: http://wiki.galaxyproject.org/Learn

5) My correct input file is in my history but the Galaxy tool won’t recognize it as an input

The data type of the file is likely not one that the tool expects. You can change or convert the datatype of a file using the pencil icon in your history. See the tutorials here: http://wiki.galaxyproject.org/Learn

6) My job failed (turned red) within Galaxy and I’m sure the inputs are correct and this job has previously worked for me.

Try re-running the job. The Galaxy Main server often experiences high user volume, and jobs sometimes fail unexpectedly. If that doesn’t work, send a bug report to the Galaxy team by clicking on the bug icon on your failed job.

7) I’m having a problem with one of the GATK tools.

Check the GATK website: http://www.broadinstitute.org/gsa/wiki/index.php/Frequently_Asked_Questions

 

Other:

1) What do I do if my problem isn’t mentioned in the FAQ?

Sign up for the Galaxy mailing list (http://wiki.galaxyproject.org/MailingLists), read the Galaxy FAQs (http://wiki.galaxyproject.org/Learn/FAQ), contact the tool authors (mentioned on every tool page), watch the videos (usegalaxy.org) and read the user guides (usegalaxy.org/cloudmap).