NGSomics (Next Generation Sequencing Genomics): An Intuitive, Parallelized NGS Pipeline for Analysis and Drug Correlation
Some content is outdated and will be updated later to reflect this project more accurately. Stay tuned!
Try it now at: http://www.app.ngsomics.com
Download dependencies on Docker
Abstract
Report
Question:
Can the tedious process of NGS (Next Generation Sequencing) analysis be reduced to a single, simple cloud platform that chains multiple tools together in parallel?
Goal:
To make the process of using multiple NGS tools more efficient and accessible (described in design criteria).
Hypothesis:
Chaining multiple NGS tools together on a cloud platform will increase both the portability and the efficiency of data processing through parallelization.
Materials:
- Whole-genome analysis pipeline (GATK/Churchill/Platypus/FreeBayes): maps the sequenced DNA data with high confidence for (1) variant calling and (2) differential gene expression analysis [TopHat/Cuffdiff].
- DNA microarray with analysis platform (GeneChip/ChIP-seq): interprets raw DNA and converts it to BAM files for later analysis.
- Cloud tools: the freely available Docker Toolbox allows cross-platform connection and syncing between containers.
Methods:
Data Cleanup:
The raw data must be cleaned up and reformatted before it can be used for variant calling analysis.
Raw reads:
After the DNA microarray has been scanned by the processing machine, billions of reads (strings of DNA text) are produced.
Map to Reference:
The raw reads from the patient are then compared to a reference genome database (Genome in a Bottle Consortium), which has been standardized and tested universally.
Mark Duplicates/Sort:
The sequencing process also produces duplicate copies of some reads, which show up once the reads are mapped against the reference. These additional reads are not relevant and are marked separately so that they can be ignored later in the process.
Indel Realignment:
The algorithms used to map the reads are prone to error. For example, an indel (a deletion or addition of bases in DNA) can be misaligned so that its mismatched bases appear as evidence for a SNP (a change of one base in DNA) instead. The realignment process finds the most reliable location of the reads. First, the intervals to be realigned are identified, and then the alignment with the highest confidence level (based on various statistical tests) is used.
Base Recalibration:
Because the algorithms used to identify variants rely heavily on the quality score of each base in the sequence, the base quality scores (a statistical estimate of the error per base) are recalibrated and checked. The scores produced may be over- or underestimated, so machine learning is used to account for these errors and give more accurate results.
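To make the idea of a base quality score concrete, here is a minimal sketch of the standard Phred scale, where a score Q corresponds to an error probability of 10^(-Q/10). The function names are illustrative and not part of this project, and the decoder assumes the common Phred+33 encoding of FASTQ quality strings. It does not attempt the recalibration itself.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_fastq_qualities(quality_string: str, offset: int = 33) -> list:
    """Turn a FASTQ quality string into per-base Phred scores.

    Assumes Phred+33 encoding (ASCII code minus 33) unless another
    offset is passed in.
    """
    return [ord(ch) - offset for ch in quality_string]


if __name__ == "__main__":
    scores = decode_fastq_qualities("I5!")
    for q in scores:
        print(q, phred_to_error_prob(q))
```

A score of 30 therefore means roughly one wrong call per thousand bases, which is why over- or underestimated scores directly distort the evidence a variant caller sees.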
Variant Discovery:
The “cleaned-up” data is used to identify variants, through a balance between sensitivity and specificity (minimizing false negatives and false positives respectively).
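The trade-off between sensitivity and specificity can be sketched by comparing a call set against a truth set. This is only an illustration: the function names are hypothetical, sites are represented as plain integers, and `all_sites` stands for every position that was evaluated.

```python
def confusion_counts(called: set, truth: set, all_sites: set):
    """Count true/false positives and negatives for a set of variant calls."""
    tp = len(called & truth)              # real variants that were called
    fp = len(called - truth)              # calls that are not real variants
    fn = len(truth - called)              # real variants that were missed
    tn = len(all_sites - called - truth)  # non-variant sites left uncalled
    return tp, fp, fn, tn


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of real variants that were detected (few false negatives)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-variant sites correctly left uncalled (few false positives)."""
    return tn / (tn + fp)


if __name__ == "__main__":
    all_sites = set(range(1, 11))
    truth = {2, 5, 7}
    called = {2, 5, 9}
    tp, fp, fn, tn = confusion_counts(called, truth, all_sites)
    print(sensitivity(tp, fn), specificity(tn, fp))
```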
Variant Calling:
The HaplotypeCaller determines which areas in the genome are likely to have a SNP or indel and converts the BAM file (binary sequence alignment data) into a VCF/gVCF file (variant call format) for later analysis. (A haplotype is a set of related DNA variations/polymorphisms found on the same chromosome.) [The genomic VCF has more information: it has records for all sites, whether or not a variant was present, whereas a normal VCF file only has records for the sites where a variant was detected.]
Joint Genotyping:
Running GenotypeGVCFs on the gVCFs produced in the previous step creates a set of raw SNP and indel calls and enables more sensitive detection of variants (variants are picked up more readily).
Variant Recalibration:
To assign a quality score to each variant call, machine learning is used again to give the most accurate and reliable result. The variant quality score is then used to winnow the raw set of variant calls to the desired level of quality, allowing for a more specific analysis (variants are accepted more carefully).
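The winnowing step can be sketched as a simple threshold filter over VCF records. This is a simplified illustration, not the recalibration model itself: it uses the `QUAL` column as a stand-in for the recalibrated variant quality score, assumes `QUAL` is numeric and each record has a single ALT allele, and the names `Variant`, `parse_vcf_line`, `winnow`, and `classify` are hypothetical.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: float


def parse_vcf_line(line: str) -> Variant:
    """Parse the first six tab-separated columns of a VCF data line.

    Column order follows the VCF format: CHROM, POS, ID, REF, ALT, QUAL.
    """
    chrom, pos, _id, ref, alt, qual = line.rstrip("\n").split("\t")[:6]
    return Variant(chrom, int(pos), ref, alt, float(qual))


def winnow(variants, min_qual: float):
    """Keep only the calls whose score meets the chosen threshold."""
    return [v for v in variants if v.qual >= min_qual]


def classify(variant: Variant) -> str:
    """Label a single-ALT call as a SNP (one base changed) or an indel."""
    if len(variant.ref) == 1 and len(variant.alt) == 1:
        return "SNP"
    return "indel"


if __name__ == "__main__":
    lines = [
        "chr1\t100\t.\tA\tG\t50.0\tPASS\t.",
        "chr1\t200\t.\tAT\tA\t12.5\tPASS\t.",
    ]
    calls = [parse_vcf_line(line) for line in lines]
    for v in winnow(calls, min_qual=30.0):
        print(v.chrom, v.pos, classify(v))
```

Raising `min_qual` makes the result more specific at the cost of sensitivity; lowering it does the opposite.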
Evaluation:
This last, optional step refines the genotype calls and evaluates the quality of the calls against standardized resources. Although optional, it is instrumental to the machine learning algorithms.
Detailed Procedure:
- Obtain DNA data via microarray analysis or online databases if applicable.
- Using genome analysis pipelines, map the experimental genome against the reference genome available in online databases.
- Use Docker Toolbox to set up and connect multiple disparate analysis tools.
- Set up Docker parallelization to allow for efficient and nonlinear use of chained tools.
- Connect data input and output from each tool to the next and display the results intuitively to the user.
- Analyze the pipeline output and link the phenotype to specific drugs in the database.
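The chaining and parallelization steps above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the project's implementation: each stage is a plain Python callable standing in for a containerized tool, the stage names `map_reads` and `call_variants` are hypothetical placeholders, and parallelism is applied across independent samples.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def run_chain(stages, sample):
    """Feed the output of each stage into the next one, in order."""
    return reduce(lambda data, stage: stage(data), stages, sample)


def run_parallel(stages, samples, max_workers: int = 4):
    """Run the whole chain for several samples at the same time.

    Results come back in the same order as the input samples.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda s: run_chain(stages, s), samples))


# Hypothetical placeholder stages. In a real setup each one would start
# a container for the corresponding tool and wait for it to finish.
def map_reads(sample_name: str) -> str:
    return sample_name + ".bam"


def call_variants(bam_name: str) -> str:
    return bam_name.replace(".bam", ".vcf")


if __name__ == "__main__":
    print(run_parallel([map_reads, call_variants], ["sample1", "sample2"]))
```

Because the stages are passed in as a list, swapping one tool for another only means replacing one callable, which is the loose coupling the design aims for.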
Design Criteria (Software Criteria):
Correctness: Refer to DNA Variation Criteria to determine the accuracy of this program.
Efficiency:
Understandability and Maintainability:
Robustness:
Reusability:
DNA Variation Criteria:
- Next Generation Sequencing: Use cutting-edge bioinformatics methods to map the genome sensitively and specifically while calling possible DNA base-pair variants and gene expression differences, in order to reliably determine which proteins are expressed.
- Parallelization: Use open-source software or build a custom method to run multiple tools in parallel.
- Flexibility: Connect the containers loosely so that users can easily swap a tool for another open-source tool.
- Unified Variant Calling: Detect indels (deletions/additions), SNPs (changes in a single base pair), and structural changes (large, high-impact alterations).
- Drug Correlation: Build an online drug database on a NoSQL (Mongo) database, which allows for quick access and querying.
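The drug correlation criterion can be illustrated with an in-memory stand-in for the database. Everything here is hypothetical: the document fields (`drug`, `gene`, `variant`), the placeholder values, and the function names are assumptions for illustration only. With a real Mongo collection the same lookup would be a query on the indexed field.

```python
from collections import defaultdict

# Hypothetical documents, shaped the way they might be stored in a
# document database. The names are placeholders, not real associations.
DRUG_DOCS = [
    {"drug": "drug_x", "gene": "GENE_A", "variant": "chr1:100:A>G"},
    {"drug": "drug_y", "gene": "GENE_A", "variant": "chr1:250:AT>A"},
    {"drug": "drug_z", "gene": "GENE_B", "variant": "chr2:42:C>T"},
]


def build_index(docs, key: str):
    """Group drug names by the given field so lookups avoid a full scan."""
    index = defaultdict(list)
    for doc in docs:
        index[doc[key]].append(doc["drug"])
    return index


def drugs_for_variants(index, called_variants):
    """Map each called variant that has an entry to its associated drugs."""
    return {v: index[v] for v in called_variants if v in index}


if __name__ == "__main__":
    by_variant = build_index(DRUG_DOCS, "variant")
    print(drugs_for_variants(by_variant, ["chr1:100:A>G", "chr9:1:G>C"]))
```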