Transition from FASTQ/SFF to unmapped BAM
One of the changes in the update from Torrent Suite Software 2.x to Torrent Suite Software 3.0 is a transition from FASTQ and SFF to BAM as the principal format in which primary analysis results are delivered. The new file is sometimes referred to as unmapped BAM to indicate that it can be produced even when no reference genome is available. In that case, the mapping fields of the BAM are set to the usual defaults for unmapped reads. When a reference genome is available, the mapping information is added to the BAM. In TS 3.0, the unmapped BAM file names have the extension `*.basecaller.bam`.
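As an illustration of what "defaults for unmapped reads" looks like in practice, the following minimal Python sketch inspects the mapping-related columns of a record in text SAM form. The two records are hypothetical, and the unmapped values shown (FLAG bit 0x4 set, `*` for the reference name and CIGAR, 0 for the position) follow general SAM conventions rather than output copied from Torrent Suite Software.

```python
# Sketch: recognise an unmapped record in text SAM form.
# The sample lines are hypothetical; the unmapped values follow general
# SAM conventions (an assumption), not verified Torrent Suite output.

UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def mapping_fields(sam_line):
    """Return the mapping-related mandatory fields of one SAM record."""
    cols = sam_line.rstrip("\n").split("\t")
    return {
        "flag": int(cols[1]),
        "rname": cols[2],
        "pos": int(cols[3]),
        "mapq": int(cols[4]),
        "cigar": cols[5],
    }


def is_unmapped(sam_line):
    """True when the unmapped bit (0x4) is set in the FLAG field."""
    return bool(mapping_fields(sam_line)["flag"] & UNMAPPED_FLAG)


# Hypothetical records for illustration only.
unmapped = "read1\t4\t*\t0\t0\t*\t*\t0\t0\tGTTC\tGGAG"
mapped = "read1\t0\tAmplicon11\t1\t92\t4M\t*\t0\t0\tGTTC\tGGAG"

print(is_unmapped(unmapped), is_unmapped(mapped))
```

The check relies only on the FLAG bit, since that is the field that formally marks a read as unmapped; the other columns are returned so they can be inspected alongside it.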
The FASTQ and SFF formats are still provided in Torrent Suite Software 3.0 and 3.2. In Torrent Suite Software 3.4, the FASTQ and SFF files are no longer produced by default, though for backward compatibility both plugins and command-line tools are provided to create FASTQ and SFF files on demand. Note, however, that although the SFF file created on demand in Torrent Suite Software 3.4 uses the same standard data format, the content of one of its data fields (the flow values) is modified, as discussed below.
Several reasons motivate the previously announced transition from SFF to BAM:
- Community-driven standard: The BAM format is a community-driven standard that has become the format of choice for representing next-generation sequence data for multiple sequencing technologies. Delivering the data in BAM format enables greater compatibility with current and future software tools.
- Native data compression: BAM supports native compression of the data, whereas SFF was not designed with compression in mind. An unmapped BAM from Torrent Suite Software 3.0 is typically less than half the size of the corresponding SFF file, even though it contains more information.
- SFF format constraints: The SFF format is rigid and cannot adapt to improvements in the underlying sequencing technology that it represents. For example, an SFF file cannot be used for runs that incorporate multiple flow orders or multiple sequencing keys.
- Extensibility: The BAM format is extensible. New kinds of information can be added to it without breaking existing parsers. The ability to record in the BAM file new kinds of data derived during primary analysis enables current and future improvements in the analysis methods used in downstream applications such as variant calling.
In Torrent Suite Software 3.0 and 3.2, the unmapped BAM file is generated directly from the BaseCaller, and the SFF and FASTQ files are then generated from the unmapped BAM. In Torrent Suite Software 3.4, SFF and FASTQ generation is removed from the default analysis pipeline, which has undergone significant changes. The BAM and SFF generated by Torrent Suite Software 3.0 and 3.2 store the same library insert flow values, bases, and quality scores that are available in the SFF from Torrent Suite Software 2.x, and all three would produce the same FASTQ file. There are some small differences in the auxiliary information relating to 5' clipping and 3' clipping.
In Torrent Suite Software 3.0 and 3.2, the corrected flow signals provided in the SFF and unmapped BAM correspond to previously-made base calls and include residual terms related to the closeness between actual measurements and measurements predicted for the base calls under a predictive model of phasing effects.
In Torrent Suite Software 3.4, the corrected flow signal is no longer generated by the analysis pipeline. The flow signals formerly stored in the read data section of the SFF and the FZ tag of the unmapped BAM files are replaced by raw normalized signal values, which include phasing effects.
The specific contents and relative differences are outlined in the table below.
Data Type | TS-2.x SFF | TS-3.0/3.2 SFF | TS-3.0/3.2 unmapped BAM | TS-3.4 unmapped BAM |
---|---|---|---|---|
Flow order | Specified in global header | Specified in global header | Specified in read group header's FO tag | Specified in read group header's FO tag |
Key sequence | Specified in global header | In non-barcoded runs, or in composite SFFs in barcoded runs, the key sequence is specified in the global header. In per-barcode SFFs, the global header contains the combined sequences of key + barcode. | In non-barcoded runs, the key sequence is specified in the read group header's KS tag. In barcoded runs, the KS tag contains the combined sequences of key + barcode. | Same as TS-3.0/3.2 unmapped BAM |
Flow values | Specified in read data section | Specified in read data section | The FZ tag stores the corrected flow signals for all flows, including those clipped away | Instead of storing the corrected flow signals, the new ZM tag stores normalized signals, which include phasing effects. The signal in the ZM tag may be 3' clipped, but is guaranteed to include information about all saved bases (see Bases & Qualities below) |
Flow index | Specified in read data section | Specified in read data section | Not directly available, but can be inferred from the flow order plus the ZF tag (which specifies the flow that generated the first base after the 5' clipped region) | Same as TS-3.0/3.2 unmapped BAM |
Bases & Qualities | Bases and qualities, including 5' and 3' clipped regions, are stored in the read data section | The read data section contains bases and qualities, including the 5' clipped region but excluding the 3' clipped region. The data for the 3' clipped region is excluded because in Torrent Suite 3.0 the SFF is derived from the unmapped BAM file, and the BAM file itself does not contain the 3'-clipped data | The usual BAM fields specify the base calls and quality values for all bases surviving 5' and 3' clipping. Base calls and quality values for clipped regions are not stored directly. Base calls for clipped regions can be derived from the flow signals and flow order. Quality values for clipped regions are not available. Note: Content of these fields may be changed during alignment processing. If the aligner maps a read to the reverse strand, the base sequence in the BAM file is reverse-complemented and the qualities are reversed. | Same as TS-3.0/3.2 unmapped BAM |
5' clipping data | Base position of key clipping is specified in the clip_qual_left field. If barcode clipping was performed, its position is stored in the clip_adapter_left field. | Base position of key + barcode clipping is specified in the clip_adapter_left field. The clip_qual_left field is unused and always equals zero. | Not directly available; can be derived from the flow values, flow order, and ZF tag | Same as TS-3.0/3.2 unmapped BAM |
3' clipping data | Base position of 3' adapter clipping is specified in the clip_adapter_right field; base position of 3' quality clipping is specified in the clip_qual_right field | The clip_adapter_right and clip_qual_right fields are set to zero. This is because the 3'-clipped bases are not included in the SFF, and specifying positions corresponding to bases that are absent from the SFF causes some downstream utilities to break. | The ZG tag specifies the flow (zero-based) during which the first base of the adapter was incorporated. The ZA tag specifies the number of insert bases (the insert starts after the key and barcode, and ends before the adapter). For reads where 3' adapter trimming was more restrictive than quality trimming, ZA equals the read length, and the base position of 3' quality clipping is only known to be greater than or equal to the read length. When the read length is less than the ZA tag or the adapter is not found, the read length corresponds to the base position of 3' quality clipping. | Same as TS-3.0/3.2 unmapped BAM |
Flow corresponding to the first base after the 5' clipped region | Can be inferred from flow index values | Can be inferred from flow index values | The ZF tag specifies the 0-indexed position of the flow generating the first base after the 5' clipped region | Same as TS-3.0/3.2 unmapped BAM |
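The table notes that the flow index is not stored in the BAM but can be inferred from the flow order plus the ZF tag. A minimal Python sketch of that inference follows. It assumes that the flow order repeats cyclically and that the read is first restored to its original sequencing orientation, since reverse-strand alignments are stored reverse-complemented. The function names, the four-flow order `TACG`, and the short read are hypothetical and chosen only for illustration.

```python
# Sketch: rebuild the per-base flow index of a BAM read from the flow
# order (read group FO tag) and the ZF tag.
# Assumptions: the flow order repeats cyclically, and the read has been
# put back into its original sequencing orientation.

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def original_orientation(seq, flag):
    """Undo the aligner's reverse-complement when FLAG bit 0x10 is set."""
    if flag & 0x10:
        return seq.translate(COMPLEMENT)[::-1]
    return seq


def flow_index(seq, flow_order, zf):
    """Return the 0-based flow that produced each base of seq."""
    flows = []
    flow = zf  # flow of the first base after the 5' clipped region
    for base in seq:
        # Advance until the nucleotide flowed matches the called base;
        # bases of a homopolymer stay on the same flow.
        steps = 0
        while flow_order[flow % len(flow_order)] != base:
            flow += 1
            steps += 1
            if steps > len(flow_order):
                raise ValueError(f"base {base!r} is not in the flow order")
        flows.append(flow)
    return flows


# Hypothetical values: a 4-flow cycle and a read whose first base
# after the 5' clipped region was generated by flow 4.
read = original_orientation("CTAA", flag=16)  # stored reverse-complemented
print(read, flow_index(read, "TACG", zf=4))
```

The same walk, started from flow 0 over the key and barcode bases, is one way to approach the 5' clipping position that the table describes as derivable rather than stored.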
Backward compatibility to create FASTQ and SFF output
In Torrent Suite Software 4.x, use the `FileExporter` plugin to make SFF or FASTQ files. The `FileExporter` plugin offers these output options:
In Torrent Suite Software 3.4, FASTQ and SFF files can be created either on the command line or with a plugin. See above for some important differences, introduced in Torrent Suite Software 3.4, in the interpretation of the flow values stored in the SFF.
The 3.4 plugins
- Use the `SFFCreator` plugin to make SFF files. Click the Select plugins to run button in the run report and select the `SFFCreator` plugin. Upon plugin completion, a download link appears in the plugin results section.
- Use the `FastqCreator` plugin to make FASTQ files. Click the Select plugins to run button in the run report and select the `FastqCreator` plugin. Upon plugin completion, a download link appears in the plugin results section.
The 3.4 command line
SFF and FASTQ can be created directly from Ion Torrent BAM files (either mapped or unmapped) with the following commands, which are the same as those used in the plugin method:
- SFF can be created with the `bam2sff` command-line utility that is installed with Torrent Suite Software. Running the command with no arguments prints the usage information to the screen. Example of a typical call:

      bam2sff -o out.sff in.bam

- FASTQ can be created with a call to the `SamToFastq` tool that is part of the Picard package. Example of a typical call using the Picard version installed with Torrent Suite Software:

      java -Xmx8g -jar /opt/picard/picard-tools-current/SamToFastq.jar I=in.bam F=out.fastq
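When many BAM files need converting, the two calls above can be wrapped in a small script. The Python sketch below builds exactly the argument lists shown in the examples and runs them with `subprocess`. The function names are hypothetical, the jar path is the one from the example call and may differ on other installations, and no options beyond those shown above are assumed.

```python
import subprocess

# Jar path taken from the example call above; adjust for other installs.
PICARD_SAMTOFASTQ_JAR = "/opt/picard/picard-tools-current/SamToFastq.jar"


def bam2sff_command(bam_path, sff_path):
    """Argument list equivalent to: bam2sff -o out.sff in.bam"""
    return ["bam2sff", "-o", sff_path, bam_path]


def sam_to_fastq_command(bam_path, fastq_path,
                         jar=PICARD_SAMTOFASTQ_JAR, heap="8g"):
    """Argument list for the Picard SamToFastq call shown above."""
    return ["java", f"-Xmx{heap}", "-jar", jar,
            f"I={bam_path}", f"F={fastq_path}"]


def convert(bam_path, sff_path, fastq_path):
    """Run both conversions; raises CalledProcessError if one fails."""
    for cmd in (bam2sff_command(bam_path, sff_path),
                sam_to_fastq_command(bam_path, fastq_path)):
        subprocess.run(cmd, check=True)


# Print the commands without executing them.
print(" ".join(bam2sff_command("in.bam", "out.sff")))
print(" ".join(sam_to_fastq_command("in.bam", "out.fastq")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.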
Sample data
There are no significant differences in the headers of the BAM files produced by TS-3.2 and TS-3.4. Below is an example of a single alignment from a mapped BAM file, converted into text SAM format, for each of TS-3.2 and TS-3.4. In this example the two software versions produced the same base calls, though that will not always be the case. The differences between the two entries are as follows:
- As usual, the read name differs in the initial 5-character hash, as every analysis run produces a unique base name for reads.
- The per-base predicted quality scores differ due to ongoing improvements in the quality score prediction method.
- As noted in the table above, the corrected flow values provided by the FZ tag in TS-3.2 are replaced by raw normalized values provided by the ZM tag in TS-3.4. The flow values in the ZM tag include phasing effects.
TS-3.2
97X4E:00028:00037 0 Amplicon11 1 92 86M * 0 0 GTTCCTCTGGGATGCAACATGAGAGAGCAGCACACTGAGGCTTTATGGGTTGCCCTGCCACAAGTGAACAGGTCCCAGCATGAAAG GGAGBFFEED?DDDDGAGNIIEDDDGDIDAABDEDBFII@BBF?FFFN?F?A@>5>77-5AA<BBCB@CBB?AAA=ADFFF777*7 XA:Z:map4-1 ZA:i:86 MD:Z:86 ZF:i:25 PG:Z:tmap RG:Z:97X4E.IonXpress_007 ZG:i:183 NM:i:0 AS:i:86 XS:i:-2147483647 FZ:B:S,100,1,95,1,2,96,0,108,205,87,0,100,0,0,0,1,113,1,105,106,196,95,96,106,109,96,209,0,202,1,0,0,101,0,95,0,98,3,0,315,6,0,2,0,102,1,7,0,98,0,85,0,0,94,0,209,0,0,1,3,102,102,0,0,98,1,1,105,2,99,0,110,1,1,1,0,108,96,0,108,0,0,105,0,0,100,6,101,0,94,0,0,118,93,0,106,8,96,85,0,106,8,7,97,4,0,13,0,96,205,113,2,293,0,10,98,99,0,299,0,197,88,0,0,280,0,0,3,106,6,0,106,0,1,155,20,1,0,3,2,97,0,108,191,2,1,101,2,101,9,111,197,0,0,0,1,112,108,196,5,99,0,278,0,2,104,7,102,0,100,0,5,103,8,10,0,100,6,103,255,11,0,96,97,102,4,0,0,111,104,9,178,4,0,10,88,0,110,99,5,92,0,0,86,8,0,273,86,107,5,6,107,0,8,90,100,9,93,0,99,19,7,174,95,106,3,0,92,0,110,18,95,7,0,0,0,114,4,93,10,98,0,80,21,0,179,7,183,4,165,0,5,89,78,16,78,9,83,71,24
TS-3.4
LLPMG:00028:00037 0 Amplicon11 1 92 86M * 0 0 GTTCCTCTGGGATGCAACATGAGAGAGCAGCACACTGAGGCTTTATGGGTTGCCCTGCCACAAGTGAACAGGTCCCAGCATGAAAG CC@C@EBBCB?CDAAE@BCBBB?@BBBDB>@>@???BDC?CCCACDFF?@@;991111-1<<:ACDDBGBBA<<>;=DBBA88819 XA:Z:map4-1 ZA:i:86 ZB:i:30 MD:Z:86 ZF:i:25 PG:Z:tmap RG:Z:LLPMG.IonXpress_007 ZG:i:183 NM:i:0 ZM:B:s,260,2,244,2,2,244,2,270,498,220,2,248,2,2,2,2,286,6,266,260,470,226,236,262,284,220,462,2,470,2,12,2,246,2,218,2,248,10,12,692,66,30,24,36,236,2,34,4,246,2,244,2,20,214,4,466,2,4,2,12,252,266,2,14,222,30,2,230,36,216,2,264,2,2,2,2,270,238,2,260,2,2,256,30,2,214,48,216,2,198,2,2,280,226,42,252,30,238,208,0,212,66,60,200,64,0,38,48,240,484,276,10,692,0,50,234,252,2,688,2,466,230,0,0,638,0,10,46,276,22,0,270,2,10,368,66,14,10,14,10,242,0,274,452,16,16,242,28,240,30,270,454,0,16,0,28,282,268,470,12,248,0,646,16,16,254,48,240,0,226,0,10,258,28,34,0,252,24,250,614,32,12,238,250,238,18,0,2,272,242,26,404,26,0,42,210,0,268,252,32,208,2,2,194,46,0,624,192,268,16,28 ZP:B:f,0.0047309,0.00782546,0.0023048 AS:i:86 XS:i:-2147483647
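To work with records like the two above, the optional tags can be parsed from the text SAM line. The Python sketch below is a minimal illustration: it splits a tab-delimited SAM record, decodes the `TAG:TYPE:VALUE` fields, and returns whichever flow-signal tag is present (ZM for TS-3.4, FZ for TS-3.0/3.2). The two records are abbreviated, hypothetical versions of the sample reads that keep only the first four flow values, and the function names are not part of Torrent Suite Software. Note that the listing above appears space-separated, whereas SAM files use tabs.

```python
# Sketch: decode the optional tags of a text SAM record and pick the
# flow-signal tag. Records below are abbreviated, hypothetical versions
# of the sample reads (first four flow values only).

INT_SUBTYPES = set("cCsSiI")  # integer array subtypes of SAM type B


def parse_tag(field):
    """Decode one TAG:TYPE:VALUE field into (tag, python_value)."""
    tag, typ, value = field.split(":", 2)
    if typ == "i":
        return tag, int(value)
    if typ == "f":
        return tag, float(value)
    if typ == "B":
        subtype, *items = value.split(",")
        cast = int if subtype in INT_SUBTYPES else float
        return tag, [cast(x) for x in items]
    return tag, value  # A, Z and H values are kept as text


def optional_tags(sam_line):
    """Return the optional fields (columns 12 onward) as a dict."""
    cols = sam_line.rstrip("\n").split("\t")
    return dict(parse_tag(f) for f in cols[11:])


def flow_signal(tags):
    """Return (tag_name, values), preferring ZM over FZ."""
    for name in ("ZM", "FZ"):
        if name in tags:
            return name, tags[name]
    return None, []


ts32 = ("97X4E:00028:00037\t0\tAmplicon11\t1\t92\t4M\t*\t0\t0\tGTTC\tGGAG"
        "\tZA:i:86\tZF:i:25\tRG:Z:97X4E.IonXpress_007\tFZ:B:S,100,1,95,1")
ts34 = ("LLPMG:00028:00037\t0\tAmplicon11\t1\t92\t4M\t*\t0\t0\tGTTC\tCC@C"
        "\tZA:i:86\tZF:i:25\tRG:Z:LLPMG.IonXpress_007\tZM:B:s,260,2,244,2")

for record in (ts32, ts34):
    print(flow_signal(optional_tags(record)))
```

Because the FZ values are corrected signals and the ZM values are normalized signals that include phasing effects, downstream code should treat the two as different quantities rather than interchangeable arrays.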