Next Generation Sequencing (NGS)/Big Data

Big Data

Data Deluge

The first problem you are likely to face is the sheer size of NGS FASTQ files, the "data deluge" problem. You no longer deal only with microplate readings or digitized gel photos; NGS data sets can be huge. For example, the compressed FASTQ files from a single 60x human whole genome sequencing (WGS) run can still take up about 200 GB. A small project with 10–20 WGS samples can generate ~4 TB of raw data. These estimates do not include the disk space required for downstream analysis.
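As a rough sketch, the arithmetic behind these figures can be written out in a few lines of Python. The function name `project_storage_tb` and the `overhead_factor` parameter are made up for illustration; the 200 GB per sample is the figure quoted above, and decimal units (1 TB = 1000 GB) are assumed.

```python
def project_storage_tb(n_samples: int,
                       gb_per_sample: float = 200.0,
                       overhead_factor: float = 1.0) -> float:
    """Estimate project storage in TB (decimal units, 1 TB = 1000 GB).

    gb_per_sample defaults to the ~200 GB quoted for compressed 60x human
    WGS FASTQ files. overhead_factor is a hypothetical multiplier for
    downstream analysis files; 1.0 means raw data only.
    """
    if n_samples < 0 or gb_per_sample < 0 or overhead_factor < 0:
        raise ValueError("all arguments must be non-negative")
    return n_samples * gb_per_sample * overhead_factor / 1000.0


if __name__ == "__main__":
    for n in (10, 20):
        print(f"{n} samples: {project_storage_tb(n):.1f} TB of raw data")
```

With 200 GB per sample, 20 samples come to about 4 TB, the upper end of the range above. Setting `overhead_factor` above 1 is one way to budget for downstream analysis files, but how much to allow depends on the pipeline, so no value is suggested here.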

Storing data

A post on Biostars^[1] lists the following storage options by tier:

Very high end: enterprise cluster and SAN.
High end: two mirrored servers in separate buildings, or the cloud.
Typical: external hard drives and/or a NAS with RAID 5/6.
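If you go the RAID route, keep in mind that part of the raw disk capacity is used for parity. A minimal sketch, assuming identical drives and ignoring filesystem overhead and hot spares: RAID 5 gives up one drive's worth of capacity and RAID 6 gives up two. The function name `raid_usable_tb` is hypothetical.

```python
def raid_usable_tb(n_drives: int, drive_tb: float, level: int) -> float:
    """Usable capacity in TB of a RAID 5 or RAID 6 array of identical drives.

    Simplified model: ignores filesystem overhead, hot spares and
    the difference between decimal and binary units.
    """
    parity_drives = {5: 1, 6: 2}   # drives' worth of capacity lost to parity
    min_drives = {5: 3, 6: 4}      # smallest valid array for each level
    if level not in parity_drives:
        raise ValueError("only RAID 5 and RAID 6 are modelled here")
    if n_drives < min_drives[level]:
        raise ValueError(f"RAID {level} needs at least {min_drives[level]} drives")
    if drive_tb <= 0:
        raise ValueError("drive_tb must be positive")
    return (n_drives - parity_drives[level]) * drive_tb


if __name__ == "__main__":
    for level in (5, 6):
        usable = raid_usable_tb(n_drives=4, drive_tb=4.0, level=level)
        print(f"4 x 4 TB drives in RAID {level}: {usable:.0f} TB usable")
```

Under this model, four 4 TB drives give 12 TB usable in RAID 5 and 8 TB in RAID 6, so an array sized for raw data alone leaves little headroom for analysis output.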

Moving data

Moving data between collaborators is also non-trivial. For RNA-Seq samples, FTP may suffice, but for WGS data, shipping hard drives may be the only solution.
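To see why shipping drives can win, it helps to estimate the transfer time. The sketch below assumes a link that sustains its nominal rate (set `efficiency` below 1 for a more realistic guess) and decimal units; the link speeds of 100 Mbit/s and 1 Gbit/s are examples chosen purely for illustration, and `transfer_hours` is a hypothetical helper.

```python
def transfer_hours(size_gb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move size_gb gigabytes over a link_mbps megabit/s link.

    efficiency is the assumed fraction of the nominal rate actually
    achieved (0 < efficiency <= 1). Decimal units: 1 GB = 8e9 bits.
    """
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    if link_mbps <= 0 or not (0 < efficiency <= 1):
        raise ValueError("link_mbps must be positive and 0 < efficiency <= 1")
    bits = size_gb * 8e9
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 3600.0


if __name__ == "__main__":
    project_gb = 4000  # ~4 TB, the small WGS project mentioned earlier
    for mbps in (100, 1000):
        hours = transfer_hours(project_gb, mbps)
        print(f"{project_gb} GB at {mbps} Mbit/s: {hours:.1f} h")
```

Under these assumptions, 4 TB takes roughly 89 hours (almost four days) at 100 Mbit/s and about 9 hours at 1 Gbit/s, and real transfers rarely sustain the nominal rate.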

Externalizing compute requirements from the research group

It is difficult for a single lab to maintain sufficient computing facilities. A single lab will probably own some basic computing hardware; however, many tasks have computational demands (e.g. memory for de novo genome assembly) large enough that they must be run elsewhere. An institution or core facility may host a centralized cluster. Alternatively, you might consider running the task in the cloud.

NIH maintains a centralized computing cluster called Biowulf.
Cloud computing has been suggested for bioinformatics workloads.^[2]^[3] EBI has adopted a cloud-based platform called Helix Nebula.^[4]

References

[1] Wo, H. (24 March 2011). "Question: Huge Ngs Data Storage And Transferring". Biostars. Biostar Genomics, LLC. Retrieved 28 April 2016.
[2] Akhlaghpour, H. (3 July 2012). "Genomic Analysis in the Cloud". YouTube. Google. Retrieved 28 April 2016.
[3] Schadt, E.E.; Linderman, M.D.; Sorenson, J.; Lee, L.; Nolan, G.P. (2010). "Computational solutions to large-scale data management and analysis". Nature Reviews Genetics. 11 (9): 647–57. doi:10.1038/nrg2857. PMC 3124937. PMID 20717155.
[4] Lueck, R. (16 January 2013). "Big data and HPC on-demand: Large-scale genome analysis on Helix Nebula – the Science Cloud" (PDF). Trust-IT Services. Retrieved 28 April 2016.
