Technical Note

Base Recalibration

Base recalibration is a process that improves base calls by relearning the homopolymer flow signal distribution from the alignment of a fraction of the library reads. It then applies the learned rules to adjust flow signals and improve subsequent base calls.

Learn the homopolymer flow signal distribution

After an initial round of regular base calling, a random sample of up to 2 million reads is taken and aligned to the reference genome. The alignments are post-processed to determine the joint distribution of true homopolymer length and observed flow signals for each nucleotide. These distributions are determined separately for four quadrants on the chip and for two bins of flows (the first half of the flows makes up the first bin, the second half makes up the second bin).

The frequency table is further processed to generate the final flow quality value (QV) table, which is then used for recalibration of all reads produced by BaseCaller. The flow QV table contains the observed flow QV and homopolymer length for each flow signal. For a given flow signal, a list of homopolymers is observed with frequencies $C_i, \ldots, C_j$, where $i$ and $j$ represent homopolymer lengths. The flow QV is logarithmically related to the homopolymer-calling error probability and is defined as $-10 \log_{10}(1 - C_n / C)$, where $C_n$ is the maximum frequency, $C$ is the total of all frequencies, and $n$ is the assigned homopolymer length. This graph shows the flow QV distributions of the four nucleotide types.
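The flow QV definition above can be sketched in a few lines of Python. This is a minimal illustration, not the BaseCaller implementation: the function name `flow_qv`, the `max_qv` cap for the error-free case, and the example counts are all assumptions made for the example.

```python
import math

def flow_qv(hp_counts, max_qv=50.0):
    """Return (assigned_hp_length, flow_qv) for one flow-signal value.

    hp_counts maps true homopolymer length -> observed frequency.
    max_qv is a hypothetical cap, used when C_n == C (no observed errors).
    """
    total = sum(hp_counts.values())          # C: total of all frequencies
    if total == 0:
        raise ValueError("no observations for this flow signal")
    # n is the homopolymer length with the maximum frequency C_n
    n, c_n = max(hp_counts.items(), key=lambda kv: kv[1])
    error_prob = 1.0 - c_n / total
    if error_prob <= 0.0:
        return n, max_qv
    return n, min(max_qv, -10.0 * math.log10(error_prob))

# Hypothetical alignment counts for reads whose T flow signal was 649
counts = {6: 300, 7: 690, 8: 10}
n, qv = flow_qv(counts)
print(n, round(qv, 2))
```

With these made-up counts the most frequent true length is 7, so the signal is assigned to a 7-mer, and the low QV reflects that almost a third of the observations disagree with that assignment.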

The flow QV table has 32 sets of QV and homopolymer distributions, corresponding to the 4 nucleotide types, 4 regions from a 2-by-2 spatial stratification of the chip, and 2 equal partitions of flows. The difference between barcoded and non-barcoded runs is illustrated below:
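One way to picture the 32 strata (4 nucleotides x 4 chip regions x 2 flow bins) is as an index computed from the nucleotide, the well position, and the flow number. The sketch below is illustrative only: the nucleotide ordering, the function name `stratum_index`, its parameters, and the index layout are assumptions, not the layout used by the actual table.

```python
NUCLEOTIDES = "ACGT"  # ordering is an assumption made for this example

def stratum_index(nuc, x, y, flow, chip_width, chip_height, num_flows):
    """Map one flow observation to a stratum index in the range 0..31."""
    nuc_idx = NUCLEOTIDES.index(nuc)
    # 2-by-2 spatial split of the chip gives a region number 0..3
    region = (2 if y >= chip_height / 2 else 0) + (1 if x >= chip_width / 2 else 0)
    # first half of the flows -> bin 0, second half -> bin 1
    flow_bin = 1 if flow >= num_flows / 2 else 0
    return (nuc_idx * 4 + region) * 2 + flow_bin

# Hypothetical chip of 100 x 100 wells and a run of 500 flows
print(stratum_index("A", 0, 0, 0, 100, 100, 500))
print(stratum_index("T", 99, 99, 499, 100, 100, 500))
```

Each stratum would then hold its own flow QV and homopolymer distribution, so a flow signal is always recalibrated against reads from the same chip quadrant and the same half of the run.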

Read Recalibration

In pyrosequencing, base calling is constructed from the homopolymer calls of all nucleotide flows. During the flow signal distribution learning process, some populations of flow signals may turn out to under-call homopolymers while others over-call them. Any under-calling or over-calling changes the derived base calls. In the flow QV table, the learned homopolymer length for a population of flow signals may differ from the originally called length, in which case a new homopolymer (HP) length and new base calls are produced. The flow signals and predicted base quality scores must then be modified to match the newly called homopolymers (for example, if an original 6-mer T at flow signal 649 is recalibrated as a 7-mer T, the signal must change, because a 7-mer T with a flow signal of 649 is illegal).

The predicted base quality score adjustment simply ensures that the number of quality scores equals the number of recalibrated base calls. In the case of a deletion, the first predicted quality score is removed; in the case of an insertion that lengthens a homopolymer, the last quality score of the homopolymer is copied; and in the case of an insertion of a new 1-mer, the flow QV is used.
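The three quality-score rules above can be sketched as a small helper. This is an illustration under stated assumptions: the function name `adjust_quals` and its arguments are hypothetical, and applying the rules repeatedly when the length changes by more than one base is an extrapolation, since the text only describes single-step changes.

```python
def adjust_quals(quals, new_len, flow_qv):
    """Return quality scores for one homopolymer after recalibration.

    quals:   predicted quality scores of the originally called homopolymer
             (empty if no base was originally called at this flow).
    new_len: recalibrated homopolymer length.
    flow_qv: flow QV from the table, used only for a newly inserted 1-mer.
    """
    quals = list(quals)
    while len(quals) > new_len:
        quals.pop(0)                  # deletion: remove the first score
    while len(quals) < new_len:
        if quals:
            quals.append(quals[-1])   # longer homopolymer: copy the last score
        else:
            quals.append(flow_qv)     # new 1-mer: use the flow QV
    return quals

print(adjust_quals([30, 25, 20], 2, 17))   # 3-mer recalibrated as 2-mer
print(adjust_quals([30, 25, 20], 4, 17))   # 3-mer recalibrated as 4-mer
print(adjust_quals([], 1, 17))             # new 1-mer
```

In every case the returned list has exactly `new_len` entries, which is the invariant the adjustment is meant to guarantee.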

For flow signal adjustment, the flow signal boundaries ($FS_{Left}$ and $FS_{Right}$) of each homopolymer are identified from the flow QV table. The corrected flow signal is defined as $98 \times (fs - FS_{Left}) / (FS_{Right} - FS_{Left}) + HP \times 100 - 49$, where $HP$ is the calibrated homopolymer length and $fs$ is the reported signal, so that the corrected signal falls within 49 units of $HP \times 100$. Recalibration can produce illegal flow sequences, in which a positive flow is called at a later flow while no other positive flow is called in between (Case 1 in the table below), or the same positive flow is called twice in a row (Case 2). A linear algorithm identifies potentially illegal flows caused by recalibration, and these flows are skipped during recalibration.

                          Case 1                           Case 2

Flow order                C, G, T, C, T, G, A, G           C, G, T, C, T, G, A, G

Original flow signal      0, 100, 0, 0, 49, 0, 100, 0      0, 100, 100, 0, 49, 0, 100, 0

Calibrated flow signal    0, 100, 0, 0, 51, 0, 100, 0      0, 100, 100, 0, 51, 0, 100, 0
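The signal rescaling and the illegal-flow check can be sketched together as follows. This is a rough illustration, not the production algorithm. It assumes the offset in the correction formula is -49 (which keeps an HP-mer between HP * 100 - 49 and HP * 100 + 49), that a flow counts as positive when its signal is at least 50, and that a positive flow is illegal when the same nucleotide was already flowed and no legal positive flow has occurred since. The function names, the `threshold` parameter, and the example boundaries 640 and 700 are hypothetical.

```python
def correct_flow_signal(fs, fs_left, fs_right, hp):
    """Rescale a reported signal into the range of the calibrated HP length."""
    if fs_right == fs_left:           # degenerate boundary: assumed fallback
        return float(hp * 100)
    return 98.0 * (fs - fs_left) / (fs_right - fs_left) + hp * 100 - 49

def find_illegal_flows(flow_order, signals, threshold=50):
    """Return indices of positive flows that cannot occur, in a single pass."""
    last_seen = {}        # nucleotide -> index of its most recent flow
    last_positive = -1    # index of the most recent legal positive flow
    illegal = []
    for j, (nuc, sig) in enumerate(zip(flow_order, signals)):
        if sig >= threshold:
            prev = last_seen.get(nuc)
            # No positive flow strictly between prev and j: nuc cannot
            # incorporate now if it did not (or already did) at prev.
            if prev is not None and last_positive <= prev:
                illegal.append(j)
            else:
                last_positive = j
        last_seen[nuc] = j
    return illegal

order = "CGTCTGAG"
print(find_illegal_flows(order, [0, 100, 0, 0, 49, 0, 100, 0]))    # original, Case 1
print(find_illegal_flows(order, [0, 100, 0, 0, 51, 0, 100, 0]))    # calibrated, Case 1
print(find_illegal_flows(order, [0, 100, 100, 0, 51, 0, 100, 0]))  # calibrated, Case 2
print(correct_flow_signal(649, 640, 700, 7))
```

For both cases in the table, the check flags the second T flow (index 4) only after calibration raises its signal from 49 to 51, which is the kind of flow the text says is skipped during recalibration.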

Implementation

Base recalibration is implemented as part of the Base Calling module and is enabled by default if a reference library is associated with the run. For any given run, recalibration can be disabled by unchecking the Enable Base Recalibration box in the advanced section of the analysis launch page:

Performance

Because recalibration involves alignment, the run time depends on the complexity of the reference library. Performance from runs aligned to hg19 is shown in this table:

Chip type                     Ion 314™ Chip    Ion 316™ Chip    Ion 318™ Chip

Read count (millions)         0.63             2.11             5.73

Read sampling time (min)      NA               NA               4.1

Base alignment time (min)     5.5              15.2             14.8

Flow alignment time (min)     2.5              6.4              6.2

Read recalibration (min)      1.1              3.8              9.8

Total time (min)              9.1              25.4             34.9

Base recalibration significantly improves the accuracy of longer homopolymers, as shown in the following figures:

Original, before recalibration:

After recalibration:

The following table shows the improvement of mapping throughput and overall sequencing error of a typical Ion AmpliSeq Comprehensive Cancer Panel run on the Ion 318 Series Chip:

                  Original, before recalibration    After recalibration    Percent change

AQ17 (MB)         539                               552                    2.38

Q20 (MB)          479                               501                    4.61

Q47 (MB)          423                               447                    5.73

Error rate (%)    1.096                             0.871                  -20.5

Summary

Base recalibration remeasures the homopolymer flow signal distribution from a sample of reads. This process adds roughly 10 to 35 minutes to the reanalysis runtime on Ion PGM runs, but it improves mapping throughput and the accuracy of long homopolymer calls while reducing the overall sequencing error rate.