Ngsomics

Cloud-based, Dockerized, Mobile-optimized next-generation sequencing and analysis solution

NGSomics (Next Generation Sequencing Genomics): An Intuitive, Parallelized NGS Pipeline for Analysis and Drug Correlation

Some content is outdated and will be updated later to reflect this project more accurately. Stay tuned!

Try it now at: http://www.app.ngsomics.com

Download dependencies on Docker

Abstract

Report

Question:

Can the tedious process of NGS (Next Generation Sequencing) analysis be reduced to a single, simple cloud platform that chains multiple tools together and runs them in parallel?

Goal:

To make the process of using multiple NGS tools more efficient and accessible (described in design criteria).

Hypothesis:

Chaining multiple NGS tools together on a cloud platform will increase both portability and efficiency of data processing through parallelization.

Materials:

Whole-genome analysis toolkit pipeline (GATK/Churchill/Platypus/Freebayes): map and process the necessary DNA data with high confidence for (1) variant calling and (2) differential gene expression [TopHat/Cuffdiff].

DNA microarray with analysis platform (Genechip/ChiPSeq): interpret raw DNA data and convert it to BAM files for later analysis.

Cloud Tools: The freely available Docker Toolbox allows cross-platform connection and synchronization between containers.

Methods:

Data Cleanup:

The raw data generated must be cleaned up and formatted before it can be used for variant calling analysis.

Raw reads:

After the DNA microarray has been scanned by the processing machine, billions of reads (strings of DNA text) are produced.

Map to Reference:

The raw reads from the patient are then compared to a reference genome database (Genome in a Bottle Consortium), which has been standardized and tested universally.

Mark Duplicates/Sort:

When the reads are mapped against the reference, duplicate copies of some reads are also present, because they were sequenced in the process. These additional reads are not relevant and are marked separately so that they can be ignored later in the process.

Indel Realignment:

Because the algorithms used to map the reads are prone to error, an indel (a deletion or addition of bases in DNA) can be mapped incorrectly, so that its bases appear as evidence for a SNP (a change of one base in DNA) instead. The realignment process finds the most reliable location for the reads. First, the intervals to be realigned are identified, and then the sequence with the highest confidence level (based on various statistical tests) is used.

Base Recalibration:

Because the algorithms used to identify variants rely heavily on the accuracy (based on quality scores) of each base in the sequence, base quality scores (a statistical estimate of the error per base) are recalibrated and checked. Since the reported scores may be over- or underestimated, machine learning is used to account for these errors and give more accurate results.
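The idea behind base quality scores and their recalibration can be illustrated with a small sketch. A Phred quality score Q corresponds to an error probability of 10^(-Q/10). The recalibration shown here is a deliberate simplification and not the GATK method: it only groups bases by their reported score and re-estimates the error rate from observed mismatches, whereas the real tools model several covariates. The function names and the add-one smoothing are assumptions made for illustration.

```python
import math
from collections import defaultdict


def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


def empirical_quality(observations):
    """Re-estimate quality per reported score from observed mismatches.

    observations: iterable of (reported_q, is_mismatch) pairs, where
    is_mismatch is True when the base disagrees with the reference at a
    site that is not a known variant.
    Returns a dict mapping reported_q to the empirical Phred score.
    """
    counts = defaultdict(lambda: [0, 0])  # reported_q -> [mismatches, total]
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += int(is_mismatch)
        counts[reported_q][1] += 1

    recalibrated = {}
    for reported_q, (mismatches, total) in counts.items():
        # Add-one smoothing keeps the estimate finite when no mismatch is seen.
        error_rate = (mismatches + 1) / (total + 2)
        recalibrated[reported_q] = error_to_phred(error_rate)
    return recalibrated
```

If bases reported at Q30 turn out to mismatch far more often than 1 in 1,000, the empirical score for that group comes out lower than 30, which is the kind of overestimation recalibration is meant to correct.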

Variant Discovery:

The “cleaned-up” data is used to identify variants, through a balance between sensitivity and specificity (minimizing false negatives and false positives respectively).
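The two measures can be made concrete with a short sketch that compares a set of called variant sites against a trusted truth set. The function names, and the representation of sites as hashable identifiers, are assumptions made for illustration.

```python
def confusion_counts(called, truth, evaluated_sites):
    """Count true/false positives and negatives for a set of variant calls.

    called: sites reported as variant by the pipeline
    truth: sites known to be variant
    evaluated_sites: every site that was examined
    """
    called, truth, evaluated_sites = set(called), set(truth), set(evaluated_sites)
    tp = len(called & truth)                      # called and truly variant
    fp = len(called - truth)                      # called but not variant
    fn = len(truth - called)                      # variant but missed
    tn = len(evaluated_sites - called - truth)    # correctly left uncalled
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    """Fraction of true variants that were detected."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-variant sites that were correctly left uncalled."""
    return tn / (tn + fp)
```

Raising sensitivity usually means accepting weaker evidence, which tends to add false positives and lower specificity, so the two have to be balanced.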

Variant Calling:

A haplotype is a set of DNA variations/polymorphisms that are related and found on the same chromosome. The HaplotypeCaller determines which areas in the genome are likely to have a SNP or indel and produces, from the BAM file (binary sequence alignment data), a VCF/gVCF file (variant call format) for later analysis. [The genomic VCF has more information: it has records for all sites, whether a variant was present or not, whereas a normal VCF file only has information for the sites where a variant was detected.]

Joint Genotyping:

Running GenotypeGVCFs on the VCF/gVCF files produced in the previous step creates a set of raw SNP and indel calls that enables more sensitive detection of variants (that is, variants are perceived more readily).

Variant Recalibration:

In order to assign a quality score to each variant call, machine learning is again used to give the most accurate and reliable result. The variant quality score is then used to winnow the raw set of variant calls to the desired level of quality, allowing for a more specific analysis (that is, variants are perceived more carefully).
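The winnowing step can be sketched as choosing a score cutoff and discarding calls below it. In this minimal example the cutoff is picked so that a target fraction of known-true variants is retained. The function names and the data layout are assumptions for illustration, and the sketch does not reproduce the statistical model that the real recalibration tools use to produce the scores.

```python
import math


def threshold_for_sensitivity(truth_scores, target_sensitivity):
    """Return the highest cutoff that keeps the target fraction of true variants.

    truth_scores: quality scores of calls that match a trusted truth set
    target_sensitivity: fraction of those calls to retain, in (0, 1]
    """
    if not truth_scores:
        raise ValueError("need at least one known-true variant score")
    if not 0 < target_sensitivity <= 1:
        raise ValueError("target_sensitivity must be in (0, 1]")
    ranked = sorted(truth_scores, reverse=True)
    keep = math.ceil(target_sensitivity * len(ranked))
    return ranked[keep - 1]


def filter_calls(calls, cutoff):
    """Keep (variant_id, score) pairs whose score is at least the cutoff."""
    return [call for call in calls if call[1] >= cutoff]
```

A lower target keeps fewer, higher-scoring calls (more specific); a higher target keeps more calls (more sensitive).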

Evaluation:

This last, optional step refines the genotype calls and evaluates the quality of the calls against standardized resources. Although optional, this step is instrumental to the machine learning algorithms.

Detailed Procedure:

Obtain DNA data via microarray analysis or online databases if applicable.

Using genome analysis pipelines, map the experimental genome against the reference genome available in online databases.

Use Docker Toolbox to set up and connect multiple disparate analysis tools.

Set up Docker parallelization to allow for efficient and nonlinear use of chained tools.

Connect data input and output from each tool to the next and display results intuitively to the user.

Analyze pipeline output and link the phenotype to a drug in the database.
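The chaining and parallelization steps above can be sketched in a few lines. This is only one possible arrangement, assuming each stage is a function that takes the previous stage's output, and that independent samples are processed concurrently. The stage functions here are placeholders that tag a string; in a real deployment each one would hand its input to the corresponding container.

```python
from concurrent.futures import ThreadPoolExecutor


def map_to_reference(reads):
    """Placeholder stage: stands in for a read-mapping tool."""
    return f"{reads}.mapped"


def call_variants(alignment):
    """Placeholder stage: stands in for a variant caller."""
    return f"{alignment}.vcf"


def run_pipeline(sample, stages):
    """Feed the output of each stage into the next, in order."""
    data = sample
    for stage in stages:
        data = stage(data)
    return data


def run_samples_in_parallel(samples, stages, max_workers=4):
    """Run the full chain for several samples concurrently.

    Returns a dict mapping each sample to its final pipeline output.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda sample: run_pipeline(sample, stages), samples)
        return dict(zip(samples, results))
```

For example, `run_samples_in_parallel(["s1", "s2"], [map_to_reference, call_variants])` runs both samples through the same two-stage chain at the same time.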

Design Criteria (Software Criteria):

Correctness: Refer to DNA Variation Criteria to determine the accuracy of this program.

Efficiency:

Using Docker Starter Edition allows up to 10 Docker Engines for our servers, each of which is capable of compiling and running the applications on our server.

The parallel efficiency of using all 10 of these engines can be calculated as S/(p*T(p)), where S is the execution time of the program when run serially, p is the number of engines used simultaneously, and T(p) is the execution time when using p engines.
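As a worked example of this formula, the sketch below computes the efficiency from three measured values. The function name and the timings in the usage note are hypothetical, not measurements from this project.

```python
def parallel_efficiency(serial_time, engines, parallel_time):
    """Compute S / (p * T(p)).

    serial_time: S, execution time when the program runs serially
    engines: p, number of engines used simultaneously
    parallel_time: T(p), execution time when using p engines
    """
    if engines <= 0:
        raise ValueError("engines must be positive")
    if parallel_time <= 0:
        raise ValueError("parallel_time must be positive")
    return serial_time / (engines * parallel_time)
```

With a hypothetical serial run of 100 minutes and a 10-engine run of 12.5 minutes, `parallel_efficiency(100, 10, 12.5)` gives 0.8, meaning the engines are used at 80% of the ideal linear speedup. A value of 1.0 would mean the run time dropped by exactly a factor of p.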

Understandability and Maintainability:

Each container can be stopped, started, and saved at any time. This allows for easy image duplication and scaling up with one command, so a single-image setup can easily be scaled up to several.

Each container image is listed separately in the setup so that an image can replace any other image in the pipeline at any time.

Dockerfiles, which hold container configuration, are automatically saved on Docker Hub, an online repository of Docker software, and can be duplicated or distributed freely.

The NoSQL database is easily accessible and can be updated in a few steps from any browser.
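Listing each container image separately, as described above, can be pictured as a mapping from pipeline stage to image name, where swapping a tool only changes one entry. The class below is a hypothetical sketch of that idea; the stage names and image names are invented placeholders, not the project's actual configuration.

```python
from dataclasses import dataclass, field


@dataclass
class PipelineConfig:
    """Ordered mapping of pipeline stage name to container image name."""

    images: dict = field(default_factory=dict)

    def swap(self, stage, image):
        """Replace the image used for an existing stage, keeping its position."""
        if stage not in self.images:
            raise KeyError(f"unknown stage: {stage}")
        self.images[stage] = image

    def ordered_images(self):
        """Return image names in the order the stages were declared."""
        return list(self.images.values())


# Hypothetical stage and image names, for illustration only.
config = PipelineConfig({
    "align": "example/aligner-a",
    "call": "example/caller-a",
})
config.swap("call", "example/caller-b")
```

Because the rest of the pipeline refers to stages by name rather than by image, replacing one tool with another that serves the same purpose does not require touching the other entries.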

Robustness:

Docker’s design allows for easy swapping of containers, meaning that different software tools which have the same purpose can be interchanged to allow for robustness of input

The drug database is compiled and scraped from various online sources, and a drug match can be determined using multiple factors: if some data is inaccurate, other data can be used as a substitute in determining a drug match.
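One way to picture matching on multiple factors with substitution is to score each drug only on the factors for which data is available, so that a missing or discarded field does not block a match. The sketch below is an assumption about how such scoring could work, not the project's matching algorithm; the field names and records in the test data are hypothetical.

```python
def match_score(drug, patient, factors):
    """Fraction of usable factors on which the drug and patient records agree.

    Factors with no value on either side are skipped, so the remaining
    factors act as substitutes. Returns None when nothing can be compared.
    """
    compared = 0
    agreed = 0
    for factor in factors:
        drug_value = drug.get(factor)
        patient_value = patient.get(factor)
        if drug_value is None or patient_value is None:
            continue  # unusable data: fall back to the other factors
        compared += 1
        agreed += int(drug_value == patient_value)
    return agreed / compared if compared else None


def best_match(drugs, patient, factors):
    """Return the name of the highest-scoring drug, or None if none is comparable."""
    scored = []
    for drug in drugs:
        score = match_score(drug, patient, factors)
        if score is not None:
            scored.append((score, drug["name"]))
    return max(scored)[1] if scored else None
```

A field judged inaccurate can be set to `None` before scoring, which makes the match rely on the remaining factors.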

Reusability:

Docker containers, by nature, can be downloaded and distributed with ease. Docker Hub allows for public distribution of image configurations. Thus, the project can easily be continued and developed further by future efforts.

The drug matching criteria can use user feedback to improve future drug matching results. Based on a user feedback system, the accuracy of the drug match can be determined and improved over time. This database and its metadata can also be open-sourced to allow for future development.

DNA Variation Criteria:

Next Generation Sequencing: Use cutting-edge bioinformatics methods to sensitively and specifically map the genome while calling possible DNA base pair variants and gene expression differences to reliably determine the proteins transcribed.

Parallelization: Either use open-source software or create our own method of running multiple tools in parallel.

Flexibility: Allow users to swap certain tools with other open-source tools by loosely connecting each container and allowing for easy swapping.

Unified Variant Calling: Seek out indels (deletions/additions), SNPs (changes in a single base pair), and structural changes (large, high-impact changes).

Drug Correlation: Create an online drug database backed by a NoSQL (Mongo) database, which allows for quick access and querying.
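The three variant types named under Unified Variant Calling can be told apart from the reference and alternate alleles of a call. The sketch below is a simplified classifier; the 50-base cutoff for structural changes is a commonly used convention that is assumed here rather than taken from this project, and the function name is hypothetical.

```python
def classify_variant(ref, alt, structural_min_length=50):
    """Classify a call by comparing its reference and alternate alleles.

    ref, alt: allele sequences, assumed to differ from each other
    structural_min_length: length change treated as a structural change
    (an assumed convention, adjustable by the caller)
    """
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"  # a single base replaced by another
    length_change = abs(len(ref) - len(alt))
    if length_change >= structural_min_length:
        return "structural"  # large insertion or deletion
    if length_change > 0:
        return "indel"  # small insertion or deletion
    return "other"  # same-length, multi-base substitution
```

For instance, `classify_variant("A", "G")` is a SNP, while `classify_variant("ATG", "A")` is an indel because two bases are deleted.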

References:

Understand the architecture. (n.d.). Retrieved October 1, 2015, from https://docs.docker.com/introduction/understandingdocker/

Parse. (n.d.). Retrieved October 13, 2015, from https://parse.com/docs

Next-Generation Sequencing (NGS). (n.d.). Retrieved October 13, 2015, from http://www.illumina.com/technology/nextgenerationsequencing.html

Inflammatory Bowel Disease: Crohn's Disease. (n.d.). Retrieved September 30, 2015, from http://www.agsaflorida.com/Home/PatientEducationLibrary/tabid/18426/ctl/View/mid/33464/Default.aspx?ContentPubID=325

GATK Best Practices: Recommended workflows for variant discovery analysis with GATK. (n.d.). Retrieved September 10, 2015, from https://www.broadinstitute.org/gatk/guide/bestpractices?bpm=DNAseq

A comparison of methods for differential expression analysis of RNA-seq data. (n.d.). Retrieved October 7, 2015, from http://www.biomedcentral.com/14712105/14/91

(n.d.). Retrieved September 2, 2015, from http://biorxiv.org/content/biorxiv/early/2014/05/28/005611.full.pdf

https://www.dnastar.com/blog/nextgensequencing/howaccurateisyourvariantcaller/