Next Generation Sequencing (NGS)/De novo Genome Assembly (Method)
Objective
De novo genome assembly generates a genome reference. Depending on the biological question, different characteristics of that reference may matter:
- Reference contiguity: higher contiguity means larger assembled sequences, enabling certain types of downstream analysis.
- Reference completeness: how much of the genome is actually represented in the assembly.
- Reference accuracy: how faithfully the assembled sequence reflects the genome at the base and structural level.
It is important to always keep in mind the Tradeoffs on Genome Assembly.
Overview
The generation of short reads by next generation sequencers has led to an increased need to assemble vast numbers of short reads. This is no trivial problem, as the sheer number of reads makes it near impossible to use, for example, the overlap layout consensus (OLC) approach that had been used with longer reads. Therefore, most of the available assemblers that can cope with typical Illumina data use a de Bruijn graph (k-mer) based approach.
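To illustrate the k-mer idea, here is a minimal Python sketch of de Bruijn graph construction: nodes are (k-1)-mers and each k-mer contributes an edge. This is only a toy illustration; real assemblers add error correction, graph simplification and far more compact data structures.

```python
# Toy de Bruijn graph: nodes are (k-1)-mers, each k-mer adds an edge.
# Purely illustrative; real assemblers use compact data structures and
# perform error correction and graph simplification.
from collections import defaultdict

def build_debruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers following it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])   # edge labelled by this k-mer
    return graph

reads = ["ACGTACGA", "CGTACGAT"]             # toy error-free reads
for node, nexts in sorted(build_debruijn_graph(reads, 4).items()):
    print(node, "->", ", ".join(sorted(nexts)))
```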
A clear distinction has to be made based on the size of the genome to be assembled:
- small (e.g. bacterial genomes: few Megabases)
- medium (e.g. lower plant genomes: several hundred Megabases)
- large (e.g. mammalian and plant genomes: Gigabases)
All de novo assemblers will be able to cope with small genomes, and given decent sequencing libraries will produce relatively good results. Even for medium-sized genomes, most de novo assemblers mentioned here, and many others, will likely fare well and produce a decent assembly. That said, OLC based assemblers might take weeks to assemble a typical genome. Large genomes are still difficult to assemble with only short reads (such as those provided by Illumina). Assembling such a genome with Illumina reads will probably require a machine with about 256 GB, and potentially even 512 GB, of RAM, unless one is willing to use a small cluster (ABySS, Ray, Contrail) or invest in commercial software (e.g. the CLCbio Genomics Workbench).
Useful background
- Whole Genome Shotgun
- Paired End Sequencing
- Long Mate Paired libraries
- Long Range Information
- Overlap Layout Consensus assembly
- De Bruijn Graph assembly
- Sequence Scaffolding
Biological questions
Generating a reference sequence will not by itself answer many interesting biological questions, but it provides the basis for all kinds of downstream analysis.
Inputs and outputs
Inputs
- Genomic sequence reads
Outputs
- Assembled reference sequences
- Assembly metrics
  - Contiguity stats (e.g. N50; a computation sketch follows this list)
  - Completeness metrics
  - Accuracy metrics
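Contiguity statistics such as N50 (the length such that contigs of that size or longer contain half the assembled bases) can be computed directly from the contig lengths. A minimal sketch:

```python
# Minimal contiguity statistics from a list of contig lengths.
def contiguity_stats(lengths):
    lengths = sorted(lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for n, length in enumerate(lengths, start=1):
        running += length
        if running * 2 >= total:   # contigs >= this length hold half the bases
            return {"contigs": len(lengths), "total_bp": total,
                    "longest": lengths[0], "N50": length, "L50": n}

# 400 + 250 = 650 >= 500 (half of 1000), so N50 = 250 and L50 = 2.
print(contiguity_stats([100, 400, 250, 50, 200]))
```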
Experimental design
Like any project, a good de novo assembly starts with proper experimental design. Biological, experimental, technical and computational issues have to be considered:
- Biological issues: What is known about the genome?
  - How big is it? Bigger genomes will require more material.
  - How frequent, how long and how conserved are repeat copies? More repetitive genomes will possibly require longer reads or long-distance mate-pairs to resolve their structure.
  - How AT rich/poor is it? Genomes with a strong AT/GC imbalance (either way) are said to have low information content; in other words, spurious sequence similarities will be more frequent.
  - Is it haploid, diploid, or polyploid? Current genome assemblers deal best with haploid samples, and some provide a haploid assembly with annotated heterozygous sites. Polyploid genomes (e.g. plants) are still largely problematic.
- Experimental issues: What sample material is available?
  - Is it possible to extract a lot of DNA? If you have only a little material, you might have to amplify the sample (e.g. using MDA), thus introducing biases.
  - Does the DNA come from a single cell, a clonal population, or a heterogeneous collection of cells? Diversity in the sample creates more or less noise, which different assemblers handle differently.
- Technical issues: What sequencing technologies to use?
  - How much does each cost?
  - What is the sequence quality? The greater the noise, the more coverage depth you will need to correct for errors (a back-of-the-envelope coverage calculation is sketched after this list).
  - How long are the reads? The longer the reads, the more useful they will be for disambiguating repetitive sequence.
  - Can paired reads be produced cost-effectively and reliably? If so, what is the fragment length? As with long reads, reliable long-distance pairs can help disambiguate repeats and scaffold the assembly.
  - Can you use a hybrid approach, e.g. short, cheap reads mixed with long, expensive ones?
- Computational issues: What software to run?
  - How much memory do they require? This criterion can be decisive: if a computer does not have enough memory, it will either crash or slow down tremendously as it swaps data on and off the hard drive.
  - How fast are they? This criterion is generally less stringent, since assembly time is usually a minor part of a complete genome assembly and annotation project. However, some assemblers scale better than others.
  - Do they require specific hardware (e.g. a large-memory machine, or a cluster of machines)?
  - How robust are they? Are they prone to crash? Are they well supported?
  - How easy are they to install and run?
  - Do they require a special protocol? Can they handle the chosen sequencing technology?
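For the coverage question above, the Lander-Waterman relation gives a quick back-of-the-envelope estimate: expected coverage c = N * L / G for N reads of length L on a genome of size G. A minimal sketch in which the genome size, read length and target depths are illustrative assumptions:

```python
# Lander-Waterman back-of-the-envelope: coverage c = N * L / G,
# so N = c * G / L reads are needed for a target depth c.
def reads_needed(genome_size_bp, read_length_bp, target_coverage):
    return int(target_coverage * genome_size_bp / read_length_bp)

genome_size = 5_000_000   # e.g. a 5 Mb bacterial genome (assumption)
read_length = 100         # Illumina read length (assumption)
for depth in (30, 50, 100):
    n = reads_needed(genome_size, read_length, depth)
    print(f"{depth}x coverage needs ~{n:,} reads of {read_length} bp")
```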
Typical steps in the method
A genome assembly project, whatever its size, can generally be divided into stages:
- Experiment design
- Sample collection
- Sample preparation
- Sequencing
- Pre-processing
- Assembly
- Post-assembly analysis
Next steps
Discussion of where the method leads.
Workflows
[edit | edit source]Example galaxy workflow
Link to an example Galaxy workflow for the method (including example datasets) on a given Galaxy instance, or to the XML document describing the workflow.
Example command line workflow
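No worked example is given on this page yet; the Python sketch below drives one possible small-genome workflow from the command line (quality check with FastQC, assembly with Velvet, and a pointer to the resulting contigs). The file name and the k-mer size are placeholders, and it assumes fastqc, velveth and velvetg are installed and on your PATH.

```python
# Sketch of a small-genome command-line workflow driven from Python.
# Assumes fastqc, velveth and velvetg are on the PATH; reads.fastq and
# k = 31 are illustrative placeholders, not recommendations.
import subprocess

reads = "reads.fastq"   # placeholder single-end Illumina library
k = 31                  # placeholder k-mer size; try several values

# 1. Quality-check the library before assembling.
subprocess.run(["fastqc", reads], check=True)

# 2. Build the k-mer index, then assemble with Velvet.
subprocess.run(["velveth", f"asm_k{k}", str(k), "-fastq", "-short", reads],
               check=True)
subprocess.run(["velvetg", f"asm_k{k}", "-cov_cutoff", "auto"], check=True)

print(f"Contigs written to asm_k{k}/contigs.fa")
```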
Discussion
Key considerations
- If it is within reason and would not interfere with the biology, try to get DNA from haploid or at least mostly homozygous individuals.
- Make sure that all libraries are of acceptable quality and that there are no major concerns (e.g. check them with FastQC).
- For paired-end data you might also want to estimate the insert size by mapping reads back to a draft or previously made assembly (see the sketch below).
- Before submitting data to a de novo assembler it is often a good idea to clean the data, e.g. to trim away low-quality bases towards the read ends and/or to drop poor reads altogether (an illustrative trimming sketch is given below). Low-quality bases are more likely to contain errors, which can complicate the assembly process and lead to higher memory consumption; more data is not always better. That said, several general-purpose short-read assemblers such as SOAPdenovo and ALLPATHS-LG can perform read correction prior to assembly.
- Before running any large assembly, double and triple check the parameters you feed the assembler.
- Post assembly it is often advisable to check how well your read data actually agrees with the assembly and whether there are any problematic regions.
- If you run de Bruijn graph based assemblies you will want to try different k-mer sizes (see the sketch below). While there is no universal rule of thumb, with error-free reads a larger k yields a less tangled graph and a smaller k a more tangled one. In practice, however, a smaller k is more robust to sequencing errors, whereas a k that is too large may not yield enough edges in the graph and will therefore result in small contigs.
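For the insert-size estimation mentioned above, one lightweight approach is to map a sample of read pairs to a draft assembly (e.g. with BWA) and summarise the template lengths reported in column 9 (TLEN) of the SAM output. A minimal sketch; the file name and the 10 kb sanity cutoff are assumptions:

```python
# Estimate a paired-end library's insert size from a SAM file produced
# by mapping a read sample to a draft assembly. The file name and the
# 10 kb cutoff for aberrant pairs are illustrative assumptions.
import statistics

tlens = []
with open("pairs_vs_draft.sam") as sam:   # placeholder file name
    for line in sam:
        if line.startswith("@"):          # skip SAM header lines
            continue
        tlen = abs(int(line.split("\t")[8]))   # column 9: template length
        if 0 < tlen < 10_000:             # drop unmapped/aberrant pairs
            tlens.append(tlen)

print("median insert size:", statistics.median(tlens))
print("stdev:", round(statistics.stdev(tlens), 1))
```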
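For the clean-up step above, dedicated trimmers (e.g. Trimmomatic or cutadapt) are the usual choice, but the underlying idea is simple: cut each read once base quality drops below a threshold and discard what remains if it is too short. An illustrative sketch assuming Phred+33 encoded FASTQ qualities; the thresholds are arbitrary examples:

```python
# Illustrative 3' quality trimming of one read (Phred+33 qualities).
# Thresholds are example values; prefer dedicated tools in practice.
def trim_read(seq, qual, min_q=20, min_len=30):
    """Cut at the first base whose quality falls below min_q."""
    for i, q in enumerate(qual):
        if ord(q) - 33 < min_q:
            seq, qual = seq[:i], qual[:i]
            break
    return (seq, qual) if len(seq) >= min_len else None

seq = "ACGTACGTACGT" * 3        # 36 bp toy read
qual = "I" * 30 + "#" * 6       # Q40 bases, then a Q2 tail
print(trim_read(seq, qual))     # -> the 30 bp high-quality prefix
```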
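To put numbers on the error-tolerance argument above: a k-mer is only usable if all k of its bases are correct, so the fraction of error-free k-mers shrinks as k grows. A small sketch with an assumed 1% per-base error rate and 100 bp reads:

```python
# Error tolerance versus k: a k-mer survives only if all k bases are
# correct, i.e. with probability (1 - e)**k at per-base error rate e.
read_length = 100     # assumption
error_rate = 0.01     # 1% per-base error rate (assumption)
for k in (21, 31, 41, 51, 63):
    kmers_per_read = read_length - k + 1
    p_clean = (1 - error_rate) ** k
    print(f"k={k:2d}: {kmers_per_read:3d} k-mers/read, "
          f"{100 * p_clean:.0f}% expected error-free")
```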
Deciding on software tools
This is based both on personal experience and on published studies. Please note, however, that genomes differ and software packages are constantly evolving.
The Assemblathon 1 challenge, which used a synthetic diploid genome, was reported by Nature as naming SOAPdenovo, ABySS and ALLPATHS-LG the winners.
However, a talk on the results website http://assemblathon.org/assemblathon-1-results names SOAPdenovo, sanger-sga and ALLPATHS-LG as consistently amongst the best performers for this synthetic genome.
I want to assemble:
- mostly 454 or Ion Torrent data
  - small genome => MIRA, Newbler
  - all others => Newbler
- mixed data (454 and Illumina)
  - small genome => MIRA, but try other assemblers as well
  - medium genome => no clear recommendation
  - large genome => assemble the Illumina data with ALLPATHS-LG and SOAPdenovo, then add in the other reads or use them for scaffolding
- mostly Illumina (or colorspace) data
  - small genome => MIRA, Velvet
  - medium genome => no clear recommendation
  - large genome => assemble the Illumina data with ALLPATHS-LG and SOAPdenovo, then add in the other reads or use them for scaffolding
(For large genomes this recommendation reflects the fact that few assemblers can handle large genomes, plus the Assemblathon outcome. For 454 data it reflects Newbler's good general performance, MIRA's different outputs and versatility, and the theoretical consideration that de Bruijn based approaches might fare worse on such data.)
Post assembly, you might want to try the SEQuel software to improve assembly quality.
I want to start a large genome project for the least cost
- Use Illumina reads prepared to the ALLPATHS-LG specification (i.e. overlapping read pairs); the reads will also work in e.g. SOAPdenovo.
(This recommendation is based on the Assemblathon outcome, the original ALLPATHS publication (Gnerre et al., 2011), as well as a publication that used ALLPATHS for the assembly of Arabidopsis genomes (Schneeberger et al., 2011).)
Each software package has its particular strengths; if you have specific requirements, the Assemblathon results will guide you. Another comparison, GAGE, has also released its results (Salzberg et al. 2011). The QUAST tool is also available for assessing genome assembly quality.