Glossary
- 1 based coordinate system
- A coordinate system where the first base of a sequence is one. In this coordinate system both start and end coordinates are inclusive. The total length of a token is given by end - start + 1.
- 0 based coordinate system
- A coordinate system where the first base of a sequence is zero. In this coordinate system the start coordinate is inclusive and the end coordinate is exclusive. The total length of a token is given by end - start.
- Template sequence
- A DNA/RNA sequence part of which is sequenced on a sequencing machine or assembled from raw sequences.
- Technical sequence
- A synthetic nucleotide sequence often ligated to a template sequence for the purpose of classification, quantification or some other quality control need. Some common examples of a technical sequence are a sequencing adapter, sample barcode or a molecular barcode.
- Sequence Read
- A raw sequence that comes off a sequencing machine. A read may consist of multiple ordered segments.
- Sequence Read Segment
- A contiguous IUPAC encoded sequence optionally accompanied by the corresponding Phred encoded quality scores and auxiliary meta data tags. A segment can contain a template sequence, technical sequence or a combination of the two.
- Input segment
- An enumerated segment of the input read. Input segment indexes are zero based.
- Output segment
- An enumerated segment of the output read. Output segment indexes are zero based.
- Sample barcode
- A technical sequence used to classify a read into read groups. A sample barcode can have multiple segments, and the decoder requires a match to all segments to declare a successful classification. The sequence and quality of the concatenated sample barcode are written in SAM records to the standardized BC and QT auxiliary tags. When decoding with Phred-adjusted maximum likelihood, Pheniqs also writes the decoding error probability to the XB tag. Sample barcode segment indexes are zero based.
- Molecular barcode
- Also known as a UMI (Unique Molecular Identifier), a molecular barcode is a technical sequence used to identify unique DNA fragments. Duplicate read elimination is often used to assist in quantification and to improve read confidence. The sequence and quality of the molecular barcode are written in SAM records to the community proposed RX and QX auxiliary tags.
- Split file layout
- Read segments are split into separate files by read group and segment index. All segments in a file have the same read group. A file contains only a single segment from each read. This is a common layout for FASTQ files.
- Interleaved file layout
- Read segments are split into separate files by read group. All segments in a file have the same read group. A file contains multiple segments from each read. Read segments are consecutive and in order, which implies the GO property of the @HD header tag is set to query. This is a common layout for SAM based formats but is also sometimes used with FASTQ.
- Combined file layout
- Read segments from multiple read groups are written to the same file. A file contains multiple segments from each read. Read segments are consecutive and in order, which implies the GO property of the @HD header tag is set to query. This layout can be more efficient for SAM based formats. Although writing combined output to FASTQ is supported, the lack of standardized meta data encoding in FASTQ makes this impractical in real world scenarios.
- Read Group
- A set of reads that were generated from a single run of a sequencing instrument. Read groups are identified in the SAM/BAM/CRAM file by a number of tags that are defined in the official SAM specification. In the simple case where a single library preparation derived from a single biological sample was run on a single lane of a flowcell, all the reads from that lane run belong to the same read group. When multiplexing is involved, then each subset of reads originating from a separate library run on the same lane will constitute a separate read group. For detailed semantics of the various read group fields you should consult the sequence alignment map format specification.
- FI auxiliary tag
- The one based index of a segment in the read.
- TC auxiliary tag
- The number of segments in the read.
- RG auxiliary tag
- Read group identifier. The value must be consistent with the ID property of an @RG header tag.
- BC auxiliary tag
- Raw uncorrected sample barcode sequence with any quality scores stored in the QT tag. The BC tag should match the QT tag in length. In the case of multiple sample barcode segments, all segments are concatenated with a hyphen (‘-’) separator, following the sequence alignment map format specification recommendation.
- QT auxiliary tag
- Phred quality of the sample barcode sequence in the BC tag. Phred scores are encoded as ASCII with an offset of 33. In the case of multiple sample barcode segments, all segments are concatenated with a space (‘ ’) separator, following the sequence alignment map format specification recommendation.
- XB auxiliary tag
- The probability that the decoding of the sample barcode stored in BC tag was incorrect.
- RX auxiliary tag
- Sequence bases from the unique molecular identifier with any quality scores stored in the QX tag. The RX tag should match the QX tag in length. The bases may be either corrected or uncorrected. Unlike the MI tag, the value may be non-unique in the file. In the case of multiple molecular barcode segments, all segments are concatenated with a hyphen (‘-’) separator, following the sequence alignment map format specification recommendation. If the bases have been corrected, the original sequence can be stored in the OX tag.
- QX auxiliary tag
- Phred quality of the unique molecular identifier sequence in the RX tag. Phred scores are encoded as ASCII with an offset of 33. The qualities may have been corrected, in which case the raw bases and qualities can be stored in the OX and BZ tags respectively. In the case of multiple molecular barcode segments, all segments are concatenated with a space (‘ ’) separator, following the sequence alignment map format specification recommendation.
- OX auxiliary tag
- Raw uncorrected unique molecular identifier bases, with any quality scores stored in the BZ tag. In the case of multiple molecular barcode segments, all segments are concatenated with a hyphen (‘-’) separator, following the sequence alignment map format specification recommendation.
- BZ auxiliary tag
- Phred quality of the uncorrected unique molecular identifier sequence in the OX tag. Phred scores are encoded as ASCII with an offset of 33. In the case of multiple molecular barcode segments, all segments are concatenated with a space (‘ ’) separator, following the sequence alignment map format specification recommendation.
- MI auxiliary tag
- Molecular Identifier. A unique ID within the SAM file for the source molecule from which this read is derived. All reads with the same MI tag represent the group of reads derived from the same source molecule.
- XM auxiliary tag
- The probability that the decoding of the molecular barcode stored in RX tag was incorrect.
- CB auxiliary tag
- Unique cell identifier. Pheniqs populates this tag with the corrected sequence bases of the cellular barcode.
- CR auxiliary tag
- Raw uncorrected cellular identifier bases, with any quality scores stored in the CY tag. In the case of multiple cellular barcode segments, all segments are concatenated with a hyphen (‘-’) separator, following the sequence alignment map format specification recommendation.
- CY auxiliary tag
- Phred quality of the uncorrected cellular identifier sequence in the CR tag. Phred scores are encoded as ASCII with an offset of 33. In the case of multiple cellular barcode segments, all segments are concatenated with a space (‘ ’) separator, following the sequence alignment map format specification recommendation.
- XC auxiliary tag
- The probability that the decoding of the cellular barcode stored in CB tag was incorrect.
- XE auxiliary tag
- The expected number of base call errors in the segment as computed from the quality scores.
- LB auxiliary tag
- Library name.
- SM auxiliary tag
- Sample name.
- PG auxiliary tag
- Program record identifier. The value must be consistent with the ID property of an @PG header tag.
- PU auxiliary tag
- Platform unit. The value must be consistent with the PU property of an @RG header tag.
- CO auxiliary tag
- Free-text comment.
- @HD header tag
- Defines the format version, record sort order and grouping in a SAM header.
- @RG header tag
- Defines a read group in a SAM header.
- @RG:ID header tag
- Read group identifier. Each @RG line must have an ID that is unique among all read groups in the header section. The value of ID is used in the RG tags of alignment records.
- @RG:BC header tag
- Barcode sequence identifying the sample or library. This value is the expected barcode bases as read by the sequencing machine in the absence of errors.
- @RG:CN header tag
- Name of sequencing center producing the read.
- @RG:DS header tag
- Description. UTF-8 encoding may be used.
- @RG:DT header tag
- Date the run was produced (ISO8601 date or date/time).
- @RG:LB header tag
- Library associated with the read group.
- @RG:PG header tag
- Programs used for processing the read group.
- @RG:PI header tag
- Predicted median insert size.
- @RG:PL header tag
- Platform/technology used to produce the reads. Valid values: CAPILLARY, DNBSEQ (MGI/BGI), HELICOS, ILLUMINA, IONTORRENT, LS454, ONT (Oxford Nanopore), PACBIO (Pacific Biosciences), and SOLID. Pheniqs uses this tag to interpret metadata from the read ID in FASTQ formatted reads.
- @RG:PM header tag
- Platform model. Free-form text providing further details of the platform/technology used.
- @RG:PU header tag
- Platform unit unique identifier.
- @RG:SM header tag
- Sample name.
- @PG header tag
- Defines a program in a SAM header.
- Phred adjusted maximum likelihood decoding
- A maximum likelihood decoder that directly estimates the decoding error probability from the base calling error probabilities provided by the sequencing platform. Abbreviated PAMLD.
- Minimum distance decoding
- A minimum distance decoder, abbreviated MDD, consults the edit distance between the expected and observed sequence. It does not take base calling error probabilities into consideration. For more information see minimum distance decoding on Wikipedia. For a good review of the application of minimum distance decoding to sequencing see Short Barcodes for Next Generation Sequencing by Mir, K. et al.
- HTSlib
- HTSlib is an Open Source implementation of a unified multi-threaded C library for accessing high-throughput sequencing data encoded according to the sequence alignment map format specification in either the SAM, BAM or CRAM format. It also provides efficient and standardized means of manipulating FASTQ files. SAM is a human readable TAB-delimited text format superior to FASTQ in the sense that it supports standardized meta data annotations and is easier to parse. BAM is a BGZF compressed binary encoding of SAM data that provides efficient random access for indexed queries. CRAM is a newer, more advanced, binary encoding that is fully compatible with BAM and offers significantly better lossless compression, a lossy compression scheme and improved IO but does not support encoding degenerate IUPAC nucleotides. HTSlib is licensed under the MIT license.
- FASTQ
- A text-based file format for storing both a biological sequence and its corresponding quality scores. Each sequence letter and each quality score is encoded with a single ASCII character for brevity. SAM encoded files have several major advantages over FASTQ files, such as standardized metadata tags, better compression, efficient random access and easier parsing, and are slowly replacing FASTQ files. See the FASTQ Wikipedia page for more details about the FASTQ file format.
- Sanger Format
- A Phred quality encoding that encodes quality scores from 0 to 93 using ASCII 33 to 126. Since encoding to ASCII involves adding 33 to the Phred value, the encoding offset is said to be 33. Current Illumina platforms directly produce FASTQ in Sanger format and it is the only quality encoding allowed in SAM records. See the FASTQ Wikipedia page for more details.
- QC Fail Flag
- Some reads are marked as failing quality control. This is signaled in the comment portion of the read identifier in FASTQ files or by the 512 (0x200) bit of the flag field on a SAM record. Reads that pass quality control are called PF for pass-filter (SAM flag bit unset), while those that fail are often referred to as QC fail reads (SAM flag bit set).
- Relative path
- A relative path is one that does not begin with /. Relative paths must be resolved against a base directory path. By default Pheniqs resolves them against the working directory where Pheniqs was executed. If base input url or base output url is specified, input and output paths are resolved against those base directories instead.
- Absolute path
- An absolute path is one that begins with /. Absolute paths ignore base input url and base output url.
- Standard stream
- POSIX systems use the standard streams stdin, stdout and stderr to transfer data between applications without writing it to a file on a hard drive. This allows the output of one utility to be piped into the input of the next in the pipeline without the need for additional storage. Avoiding writing to the disk can also significantly accelerate some pipelines since drives are often orders of magnitude slower than computer memory.
- Static linking
- A statically linked build copies all or most of the library dependencies into the binary executable, producing a standalone executable that can be moved around the system and to compatible systems. This method, although yielding a larger executable, resolves conflicts with different versions of the dependencies that might already be present on the system. Building a portable binary from scratch using the built-in package manager makes any Pheniqs revision easily available on cluster environments, where the user rarely has the elevated permissions required to perform a standard build.
- Feed resolution
- The number of consecutive read segments in the feed that have the same read identifier. In most interleaved files all segments of a read are adjacent and in the same segment order.
- Closed class decoding
- Closed class decoding refers to a decoding regime in which a discrete list of classes, nucleotide sequences in our case, is known in advance and so a prior distribution is available. In open class decoding, by contrast, the list of classes is unknown and so the prior distribution is hidden.
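
The two coordinate system entries at the top of this glossary differ only in where counting starts and whether the end is included. The following is a minimal Python sketch of the two length formulas and a conversion between them. The function names are hypothetical and are not part of Pheniqs.

```python
from typing import Tuple


def one_based_length(start: int, end: int) -> int:
    """Length of a token in 1 based coordinates (start and end inclusive)."""
    return end - start + 1


def zero_based_length(start: int, end: int) -> int:
    """Length of a token in 0 based coordinates (start inclusive, end exclusive)."""
    return end - start


def one_to_zero_based(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1 based inclusive interval to a 0 based half-open interval."""
    return start - 1, end


# The first four bases of a sequence, described in each system.
print(one_based_length(1, 4))   # 1 based: bases 1 through 4
print(zero_based_length(0, 4))  # 0 based: offsets 0 up to, not including, 4
print(one_to_zero_based(1, 4))
```

Both calls describe the same four bases, which is why the converted interval has the same length under the 0 based formula.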
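
The Sanger Format and XE auxiliary tag entries can be illustrated together. The sketch below decodes an offset 33 quality string into Phred values and sums the per-base error probabilities, using the standard relation p = 10^(-Q/10). It assumes that the expected number of errors is the plain sum of those probabilities; the exact computation Pheniqs performs for the XE tag is not stated in this glossary, and the function names are hypothetical.

```python
from typing import List

SANGER_OFFSET = 33  # ASCII offset used by the Sanger format


def phred_from_sanger(quality: str) -> List[int]:
    """Decode an offset 33 quality string into integer Phred scores."""
    return [ord(character) - SANGER_OFFSET for character in quality]


def error_probability(phred: int) -> float:
    """Base calling error probability implied by a Phred score."""
    return 10 ** (-phred / 10)


def expected_errors(quality: str) -> float:
    """Sum of per-base error probabilities (an assumed definition of expected errors)."""
    return sum(error_probability(q) for q in phred_from_sanger(quality))


# 'I' is ASCII 73, '5' is ASCII 53 and '+' is ASCII 43.
print(phred_from_sanger("I5+"))
print(expected_errors("I5+"))
```

With Phred scores of 40, 20 and 10, the per-base error probabilities are 0.0001, 0.01 and 0.1, so the lowest quality base dominates the total.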
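
The QC Fail Flag entry refers to a single bit of the SAM record flag field. A minimal sketch of testing that bit in Python follows. The constant and function names are hypothetical; the bit value 512 (0x200) comes from the sequence alignment map format specification.

```python
QC_FAIL = 0x200  # decimal 512: read fails platform or vendor quality checks


def fails_qc(flag: int) -> bool:
    """Return True when the QC fail bit is set in a SAM record flag."""
    return bool(flag & QC_FAIL)


# 77 = paired (1) + unmapped (4) + mate unmapped (8) + first segment (64).
pass_filter_flag = 77
qc_fail_flag = 77 | QC_FAIL

print(fails_qc(pass_filter_flag))
print(fails_qc(qc_fail_flag))
```

A bitwise test is used rather than an equality comparison because the flag field packs several independent bits into one integer.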
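
The Minimum distance decoding entry can be sketched in a few lines. This example is a simplification and not the Pheniqs implementation: it uses Hamming distance (substitutions only, equal length sequences) in place of a general edit distance, and the `max_distance` threshold and the rule of rejecting ties are assumptions made for illustration.

```python
from typing import Optional, Sequence


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def decode_min_distance(
    observed: str, barcodes: Sequence[str], max_distance: int = 1
) -> Optional[int]:
    """Return the index of the closest barcode, or None when unclassified.

    A read is left unclassified when there are no barcodes, when the closest
    barcode is further than max_distance, or when the closest distance is tied.
    """
    if not barcodes:
        return None
    distances = [hamming(observed, barcode) for barcode in barcodes]
    best = min(distances)
    if best > max_distance or distances.count(best) > 1:
        return None
    return distances.index(best)


barcodes = ["ACGT", "TGCA"]
print(decode_min_distance("ACGA", barcodes))  # one mismatch from the first barcode
print(decode_min_distance("AGCT", barcodes))  # two mismatches from both barcodes
```

Note that quality scores play no part here, which is the distinction the glossary draws between this decoder and Phred adjusted maximum likelihood decoding.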