Last updated: May 22, 2018

Genomic pipelines

A bioinformatics pipeline is a series of algorithms (tools) used to process sequence data, typically to generate a list of variants or to assemble one or more genomes.

Pipeline development is rarely simple, quick or linear. If a step fails quality control or generates results that do not make biological sense, bioinformaticians will have to change the methodologies, tools or parameters used in their pipeline.

A given genomic pipeline may involve some or all of the following steps; however, pipelines, tools and workflows will differ from study to study. There are no gold standard genomic tools and many tools are only suitable for a specific genomic application.

Read alignment

Current sequencing technologies do not allow whole DNA or RNA molecules to be sequenced (with the exception of some microbial genomes and smaller RNAs). Instead, DNA or RNA is broken into a large number of smaller fragments that are then sequenced and mapped back with a tool called a sequence aligner. Aligners use a reference genome to assign where fragments (reads) are likely to originate from on the genome and relative to one another. Aligners trade off the need to be specific enough to exclude false positives against being flexible enough to allow reads carrying real variants to be aligned to the reference genome (Figure 1).

Figure 1: An example of reads aligned to a reference genome at a heterozygous single nucleotide variation (SNV). If the read aligner is too rigid, reads carrying the SNV would not be mapped because they do not perfectly match the reference genome. If the read aligner is too lenient, then false positives will be incorrectly aligned.
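The rigid-versus-lenient trade-off in Figure 1 can be illustrated with a toy mismatch-tolerant mapper. This is purely illustrative: real aligners use indexed data structures such as the Burrows-Wheeler transform rather than brute-force scanning, and the sequences below are invented.

```python
# Toy illustration (not a production aligner): place a read on a reference,
# allowing up to `max_mismatches` mismatches. A budget of 0 is "too rigid"
# (a heterozygous SNV prevents mapping); a very high budget is "too lenient"
# (reads start mapping to spurious positions).

def map_read(reference: str, read: str, max_mismatches: int) -> list:
    """Return all reference positions where the read fits within the mismatch budget."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits

reference = "ACGTACGTTAGCCGATACGA"
read = "TAGCCGTT"  # carries one SNV (A->T) relative to reference position 8

print(map_read(reference, read, 0))  # too rigid: the SNV blocks mapping -> []
print(map_read(reference, read, 1))  # tolerates the SNV -> [8]
```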

There are many metrics used to compare sequence aligners including:

  • Accuracy
  • Speed
  • Uniqueness of alignments
  • False positive rates
  • How well reads are aligned in repeat regions
  • Impact on downstream results

Read alignment is a computationally complex process and requires high-performance computing (HPC). The compute time depends on the sample complexity, read depth and tools used, as well as the compute resources available.

The ideal aligner will vary depending on the sequencing technology, experiment type (DNA, RNA, miRNA, methylation, cancer) and compute resources available (memory, storage and cloud compute), and may require several iterations or local realignments to fine-tune the alignment. If there is no reference genome, or it is of low quality, assembly becomes more complex and is referred to as de novo genome assembly.

Aligners generate a SAM or BAM file containing the aligned reads, which are typically then sorted and indexed. These files also include read information (read name header), nucleotide base calls (sequence), per-base quality scores (confidence that a base call is correct) and alignment quality scores. The quality of the read alignment will have an impact on downstream analysis and results.
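The fields described above map directly onto the mandatory columns of a SAM record. A minimal parsing sketch (real pipelines use a library such as pysam rather than hand-parsing, and the record below is invented for illustration):

```python
# Minimal sketch of reading the mandatory fields of one SAM alignment record.
# Column layout follows the SAM specification: QNAME, FLAG, RNAME, POS, MAPQ,
# CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL.

def parse_sam_record(line: str) -> dict:
    fields = line.rstrip("\n").split("\t")
    return {
        "read_name": fields[0],       # QNAME: read identifier
        "flag": int(fields[1]),       # bitwise flags (paired, reverse strand, ...)
        "chrom": fields[2],           # RNAME: reference sequence name
        "pos": int(fields[3]),        # POS: 1-based leftmost mapping position
        "mapq": int(fields[4]),       # MAPQ: alignment (mapping) quality
        "cigar": fields[5],           # CIGAR: how the read aligns (matches, indels)
        "sequence": fields[9],        # SEQ: nucleotide base calls
        "base_qualities": fields[10], # QUAL: per-base quality scores (Phred+33)
    }

# Invented example record:
record = parse_sam_record("read1\t0\tchr1\t10468\t60\t4M\t*\t0\t0\tACGT\tIIII")
print(record["chrom"], record["pos"], record["mapq"])  # chr1 10468 60
```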

Quality control

  • Alignments can be improved by using realignment, bias correction and error correction tools that look at factors such as coverage distribution, the number of independent reads aligned, strand bias, insert sizes and other alignment metrics.
  • It is important to visualise variants using a tool such as the Integrative Genomics Viewer (IGV) to confirm that the alignment has been successful.
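Strand bias, one of the alignment metrics mentioned above, can be screened with a very simple proportion check. This is illustrative only, with invented read counts; production tools use formal statistical tests rather than a raw fraction.

```python
# Illustrative strand-bias check: a genuine variant is usually supported by
# reads from both strands, whereas an alignment or sequencing artifact often
# appears almost exclusively on one strand.

def strand_bias_fraction(forward_alt: int, reverse_alt: int) -> float:
    """Fraction of alt-supporting reads on the forward strand (0.5 = balanced)."""
    total = forward_alt + reverse_alt
    if total == 0:
        raise ValueError("no alt-supporting reads")
    return forward_alt / total

# Balanced support (plausible variant) vs one-sided support (suspicious):
print(strand_bias_fraction(14, 12))  # ~0.54, balanced
print(strand_bias_fraction(25, 1))   # ~0.96, strongly biased
```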

Common tools

There are many alignment tools available for a range of sequencing methodologies and technologies. Some common aligners include:

  • BWA (Burrows-Wheeler Aligner) and BigBWA (Illumina)
  • STAR (Illumina)
  • Bowtie2 (Illumina)
  • TopHat2 (Illumina)
  • BLASR (PacBio)
  • Torrent Suite (Ion Torrent)

Please note this is not a recommended list of tools but a small subset of frequently used tools. We recommend consulting with an expert or literature before selecting a tool.

Have you thought about?

  • Is the aligner you selected suitable for your experiment?
  • Have you organised computational resources and personnel to perform the alignment?
  • How have you balanced the trade-off between accuracy, sensitivity and speed in the aligner you selected?

If you have any questions regarding read alignment, speak with your sequencing provider and/or a bioinformatician.


Variant calling

Variant calling is the process of identifying differences between the aligned reads from a sequencing run and the reference sequence to which they are aligned (or control samples).

Genomic variant callers are optimised to detect specific types of variants such as single nucleotide variation (SNV), insertions and deletions (indels), alternative splicing, methylation patterns, copy number variation (CNV), structural variations (SV) or differential RNA/epigenetic expression.

The process of variant calling generates a Variant Call Format (VCF) file. This is a tab-delimited text file that lists variants with their confidence scores, chromosome positions, unique variant identifiers, quality scores, and sequence information for the variant and reference genome; it can also include other study-specific information.
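The fixed VCF columns can be read with a few lines of code. A minimal sketch (real pipelines typically use a dedicated library such as pysam or cyvcf2; the record below is invented for illustration):

```python
# Minimal sketch of parsing the fixed columns of one VCF data line.
# Columns per the VCF specification: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO.

def parse_vcf_record(line: str) -> dict:
    chrom, pos, vid, ref, alt, qual, filt, info = line.rstrip("\n").split("\t")[:8]
    return {
        "chrom": chrom,          # chromosome
        "pos": int(pos),         # 1-based position
        "id": vid,               # unique variant identifier ('.' if none)
        "ref": ref,              # reference allele
        "alt": alt.split(","),   # alternate allele(s)
        "qual": float(qual),     # variant confidence score
        "filter": filt,          # PASS, or reason(s) for failing filters
        "info": dict(kv.split("=", 1) if "=" in kv else (kv, True)
                     for kv in info.split(";")),
    }

# Invented example record:
rec = parse_vcf_record("chr1\t12345\t.\tA\tG\t50\tPASS\tDP=80;AF=0.5")
print(rec["pos"], rec["alt"], rec["info"]["DP"])  # 12345 ['G'] 80
```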

There are no gold standard variant callers, and it is generally best practice to use multiple tools. Variant callers, similar to read aligners, make a trade-off between factors such as sensitivity, precision and speed. This trade-off is particularly important when identifying low frequency variants from random sequencing errors. Variant callers also have issues detecting variants in repetitive or homopolymeric regions, as the read depth is often too low. Long read sequencing technologies can improve read alignment and variant calling in such regions. 
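The trade-off between real low-frequency variants and random sequencing errors can be made concrete with a back-of-envelope binomial calculation. This is a deliberate simplification with invented numbers; real variant callers use far more sophisticated statistical models.

```python
# Given a per-base sequencing error rate, how likely is it that k or more
# alt-supporting reads out of n arise from errors alone? If that probability
# is high, the "variant" may just be noise.
from math import comb

def prob_at_least_k_errors(n: int, k: int, error_rate: float) -> float:
    """P(>= k of n reads show the alt base purely through sequencing error)."""
    return sum(comb(n, i) * error_rate**i * (1 - error_rate)**(n - i)
               for i in range(k, n + 1))

# At 100x depth with a 1% error rate, 3 alt reads are unremarkable (~8% chance
# from error alone), while 10 alt reads are very unlikely to be error alone:
print(prob_at_least_k_errors(100, 3, 0.01))
print(prob_at_least_k_errors(100, 10, 0.01))
```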

Quality control

  • Compare findings from multiple variant calling tools.
  • Visualise read alignment and variants with tools such as the Integrative Genomics Viewer (IGV). In particular, check the read depth, base and alignment quality observed at the site of the variant.

Common tools

There are many variant callers available for a range of sequencing methodologies and technologies. Some common variant callers include:

  • Genome Analysis Toolkit (GATK) (multiple aligners and variant callers)
  • Platypus
  • MuTect
  • Strelka
  • VarScan2
  • Indelocator
  • SAMtools

Please note this is not a recommended list of tools but a small subset of frequently used tools. We recommend consulting with an expert or literature before selecting a tool.

Have you thought about?

  • Will you compare or use multiple variant callers?
  • Did you use the same reference genome to align and to call variants?
  • Is the variant caller suitable for your experiment?

If you have any questions regarding variant calling, contact a bioinformatician with experience in the field.


Variant annotation

Variant calling can generate thousands, and sometimes millions, of variants per experiment, most of which have no biological significance and fewer still have clinical significance. By understanding the biology of each variant, researchers can filter out irrelevant variants and prioritise the remainder.

Annotation gives variants a biological context by capturing or predicting the structure and function of the gene product. These gene products, or genomic elements, are coding and non-coding genomic regions of biological interest, e.g. a transcript or protein, mRNA, miRNA, promoter, enhancer, transcription factor DNA-binding site, and so on [1].

Annotation uses biological databases and/or scientific literature to assign or predict:

  • Gene products or function - transcript or protein, promoter, enhancer, etc.
  • How the variant may impact function (nonsense variations, frameshifts, splicing, truncation events, or loss of essential regulatory elements)
  • Which regulatory pathway the variant is associated with
  • Whether the variant is in an evolutionarily conserved region or is a common variant (high conservation indicates that a region is biologically important, although this rule does not always hold in non-coding regions [2])
  • Potential pathogenicity or biological impact (disease-associated genes)

Automatic annotation uses algorithms to capture or predict information on gene elements associated with a given variant [2]. This is an efficient and consistent process that is often incorporated into a pipeline. Manual annotation is still considered to be the ‘gold standard’ for more precise, accurate and comprehensive annotation, and is more effective at identifying particular elements (e.g. splice variants and pseudogenes) [2].
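At its simplest, automated annotation is a coordinate lookup: each variant position is intersected with tables of known genomic elements. A toy sketch follows; all intervals and gene names below are hypothetical, and real annotators (e.g. VEP, ANNOVAR, SnpEff) use curated transcript databases and richer consequence logic.

```python
# Toy coordinate-based annotation: look up each variant position in a table of
# genomic elements. Intervals and gene names are invented for illustration.

# (chrom, start, end, element) - 1-based, inclusive, hypothetical intervals
ELEMENTS = [
    ("chr1", 1000, 5000, "GENE_A exon"),
    ("chr1", 5001, 8000, "GENE_A intron"),
    ("chr2", 2000, 2500, "GENE_B promoter"),
]

def annotate(chrom: str, pos: int) -> str:
    """Return the overlapping element, or 'intergenic' if none is found."""
    for c, start, end, element in ELEMENTS:
        if c == chrom and start <= pos <= end:
            return element
    return "intergenic"

print(annotate("chr1", 1500))  # GENE_A exon
print(annotate("chr2", 2100))  # GENE_B promoter
print(annotate("chr3", 99))    # intergenic
```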

Annotation has limitations due to: 

  • Errors in the sequencing or analysis (if the variant is not called then it cannot be annotated)
  • Limited available information in databases
  • Incorrect or poorly validated information in databases (a particular issue with public databases)
  • The human reference genome and transcriptome being incomplete
  • Poor annotation in non-coding regions

Common tools and databases

Please note this is not a recommended list of tools and databases but a small subset of frequently used tools. We recommend consulting with an expert or literature before selecting a tool or database.

If you have any questions regarding variant annotation, speak to a bioinformatician with experience in the field.


Clinical interpretation of results

Interpretation of variants according to the available evidence, and in the context of the participants, is essential in clinical genomics research. Once a list of variants has been generated and annotated with biological information, experts or teams of experts interpret the findings and ascertain which variants (if any) are implicated in a participant's phenotype or disease.

To evaluate whether a variant is potentially disease-causing, it must be assessed in the context of the experiment, the participant's clinical presentation and family history. Experts use information from a wide range of sources and variant classification criteria or frameworks to evaluate variants. Experts will assess candidate variants using (but not limited to) the following parameters [2]:

  • Is the variant in a read of low quality or of low read depth (i.e. is this a real result)? Suspect variants often:
    • Fall within regions that are difficult to align to, e.g. repetitive or homopolymeric regions (note: coverage is an average across the whole genome)
    • Have low base or alignment quality scores
  • Is the variant in a known disease-associated gene?
  • Is the variant rare? i.e. not seen in the normal, healthy target population at rates greater than the disease prevalence or the predicted carrier rate (for a recessive condition).
  • Does the variant fit the known modes of inheritance for the condition? (e.g. homozygous, or a de novo mutation, i.e. not present in either parent).
  • Is the variant in an evolutionarily conserved region of the protein?
  • Is the variant predicted to have a high functional impact that may affect gene function? (If so, is it in a relevant gene and/or pathway?)
  • Is there agreement between various in silico prediction tools? (i.e. do multiple tools predict similar pathogenicity?)
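Several of the criteria above (rarity, predicted impact) are routinely applied as automated filters before expert review. A toy sketch with hypothetical variants and thresholds; real classification follows a framework such as ACMG-AMP together with expert judgement:

```python
# Illustrative prioritisation of candidate variants: keep rare variants with a
# high predicted functional impact. All records and thresholds are invented.

variants = [
    {"id": "var1", "population_freq": 0.30,   "impact": "synonymous"},
    {"id": "var2", "population_freq": 0.0001, "impact": "frameshift"},
    {"id": "var3", "population_freq": 0.0002, "impact": "missense"},
]

def is_candidate(v, max_freq=0.001, high_impact=("frameshift", "nonsense", "splice")):
    """Keep variants that are rare AND have a high predicted impact."""
    return v["population_freq"] <= max_freq and v["impact"] in high_impact

candidates = [v["id"] for v in variants if is_candidate(v)]
print(candidates)  # ['var2']
```

Note that var3 is rare but its missense impact does not clear the (hypothetical) high-impact bar, illustrating why hard filters should be tuned per study rather than reused blindly.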

If the research is part of a family study, it is important to check whether the variant segregates with the condition within the family. That is, do all affected people carry the variant? Is there an affected person who does not carry it? (If so, this is strong evidence against the pathogenicity of the variant.)
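The segregation check described above can be sketched as follows. Genotypes here are hypothetical, and a fully penetrant dominant model is assumed for simplicity; real pedigrees require the caveats discussed in the next paragraph.

```python
# Toy segregation check for a family study: under a fully penetrant dominant
# model, every affected member should carry the variant and no unaffected
# member should. Genotypes below are invented for illustration.

def segregates(family: dict) -> bool:
    """True if carrier status matches affected status for every family member."""
    return all(m["carries_variant"] == m["affected"] for m in family.values())

family = {
    "proband": {"affected": True,  "carries_variant": True},
    "mother":  {"affected": True,  "carries_variant": True},
    "father":  {"affected": False, "carries_variant": False},
    "sibling": {"affected": True,  "carries_variant": False},  # affected non-carrier
}
print(segregates(family))  # False: strong evidence against pathogenicity
```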

Evaluating candidate variants becomes more complicated if the disease has reduced penetrance or does not follow Mendelian inheritance e.g. is caused by the effect of multiple variants, complex rearrangements and so on. Consultation with a clinical expert or expert in the field of research is recommended. 

It may also be prudent to gain further specific consent at this point, as participant recall is often poor. This is particularly important if broad consent was initially used.

How are results reported?

In clinical genomics practice and research, variants are often classified on a five-point scale, ranging from pathogenic to benign, which indicates the likelihood that a particular variant is associated with disease (Figure 2) [3]. However, there are a number of different frameworks that can be used, including the American College of Medical Genetics and Genomics–Association for Molecular Pathology (ACMG–AMP) guidelines [4]. Experts may also use criteria such as clinical utility and actionability to assess whether a variant should be returned to participants. This is discussed in Returning results.

Figure 2: Five-point scale indicating the likelihood that a particular variant is associated with a disease.

Variant of unknown or uncertain significance (VUS): an uncertain result known as a VUS indicates that a variant has been found in an individual but has not been reliably characterised as benign or pathogenic; in other words, there is uncertainty about what the variant means [5].

Laboratory reporting of VUS in clinical practice varies internationally, and concern has been raised that a VUS may be treated as a definitive result rather than an uncertain one [6].

In addition to these classifications for the disease of interest, findings may also be:

  • An incidental finding: a finding that is unexpected, not related to the primary indication of the test and may or may not be relevant to the patient's health [7]
  • A secondary finding: often used interchangeably with incidental finding, but refers to findings that are actively sought [5]
  • An uninformative result: No variant is found that could explain the condition. This could be due to:
    • Participants’ condition having no underlying genomic cause
    • Limits of current technology: relevant gene not tested (in non-coding region etc.) or causative variants not reported
    • Complex biology of the disease: e.g. low penetrance, complex or variable phenotypes, disease that involves multiple genes or variants

As new data and information emerge, it is increasingly possible for a variant to be reclassified. There is currently little data regarding the frequency of variant reclassification and its clinical impact, and in a research context reclassification may not be feasible or appropriate. However, given the fast-moving nature of clinical research, it may be important for some research studies to discuss the viability of revisiting variant classification in the EDP or management plan [8].

If a candidate variant or incidental finding is identified, the result may be returned to a participant if appropriate consent has been obtained. This is discussed in Returning results.

Quality control

  • Visualising variants with tools such as the Integrative Genomics Viewer (IGV) assists in identifying unusual sequencing artifacts and can help exclude false positives.
  • Does the variant make biological sense?
  • In a family study, does the variant segregate with the condition within the family?
  • Are functional studies required to test if the variant is disease-causing?

Have you thought about?

  • Do the results make biological and clinical sense?
  • Could the variant be a sequencing artifact or misalignment?
  • Is the read depth suitable and does it have good base and alignment quality?
  • Is your variant of interest in a location in the genome that is difficult to sequence (homopolymeric or other repetitive regions) or in a low-quality area (e.g. the ends of reads)?
  • How will the return of results be resourced? (e.g. access to clinical review committees and appropriate referral pathways for genetic counselling)
  • What are your obligations and responsibilities regarding the management of results, patient information and samples, not just processes and interactions?
  • How will you confirm the results?

If you have any questions about the interpretation or validation, talk with a pathologist, clinical geneticist or genetic counsellor with experience in the field. 


Functional studies

Analysis of genomic information can suggest that a variant has a role in a particular phenotype or disorder, but unless there are many cases to support this, such a connection may need to be confirmed by functional studies.

For more information see - Functional studies (Understanding technology)


[1] Steward C A, Parker A P J, Minassian B A, Sisodiya S M, Frankish A, and Harrow J. (2017) Genome annotation for clinical genomic diagnostics: strengths and weaknesses. Genome Medicine, 9:49.

[2] Quintáns B, Ordóñez-Ugalde A, Cacheiro P, Carracedo A and Sobrido M J. (2014). Medical genomics: The intricate path from genetic variant identification to clinical interpretation. Applied & Translational Genomics, 3(3), 60–67.

[3] Introduction and Overview to Genomic Test Reports. (2017) The American Society of Human Genetics. https://www.ashg.org/education/csertoolkit/intro.html

[4] Richards S, Aziz N, Bale S, Bick D, Das S, Gastier-Foster J, et al., on behalf of the ACMG Laboratory Quality Assurance Committee. (2015) Standards and Guidelines for the Interpretation of Sequence Variants: A Joint Consensus Recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genetics in Medicine, 17(5), 405–424.

[5] Principles for the translation of ‘omics’-based tests from discovery to health care (2015) https://www.nhmrc.gov.au/guidelines-publications/g10 National Health and Medical Research Council (NHMRC) 

[6] Vears D F, Senecal K, Borry P. (2017) Reporting practices for variants of uncertain significance from next generation sequencing technologies. Eur J Med Genet, 60(10), 553–558.

[7] Souzeau E, Burdon K P, Mackey D A, Hewitt A W, Savarirayan R, Otlowski M, and Craig J E. (2016) Ethical Considerations for the Return of Incidental Findings in Ophthalmic Genomic Research. Translational Vision Science & Technology, 5(1), 1–11.

[8] Macklin S, Durand N, Atwal P, and Hines S. (2018) Observed frequency and challenges of variant reclassification in a hereditary cancer clinic. Genetics in Medicine, 20, 346–350.