
The Solexa Pipeline

Some old notes on the Solexa pipeline, some of which may still be relevant. The latex conversion of this document is a work in progress, so bear with me.

A pdf version of this document is available here: pipeline

Firecrest

By default Firecrest operates on a per-tile basis; the main Firecrest binary is run_image_analysis. It reads an xml list of absolute paths to tile images. These are created by Goat and stored in the Data/CX-CC_Firecrest1.9.2_DD-MM-YYYY_username/Firecrest/LXX/s_X_XXXX_XX_jnl.xml files. The main Firecrest loop is shown in figure 1 and is largely straightforward: Firecrest loops over all cycles (typically 36) and performs image analysis on each of the four images for that cycle. The image analysis identifies clusters, along with their associated intensity and noise values. In the first cycle cluster reference positions are stored; later cycles are aligned against these positions.

Figure 1: The Firecrest main loop


firecrest-loop

\section{Firecrest runs in 2 passes}

When Firecrest is launched by the Makefiles in the runfolder it is called twice per tile. The first pass runs a full image analysis on every 50th tile. Its “useful” output is an offsets file, which specifies transformations (shift and scaling) that are applied to the image data in the second pass.

The offsets file is created from the xml output of the first Firecrest pass, stored in
(\texttt{Data/CX-CC\_Firecrest1.9.2\_DD-MM-YYYY\_username/Firecrest/LXX/s\_X\_XXXX\_XX\_qcm.xml}). When the first Firecrest pass has finished, \texttt{parse\_offset\_scaling.py} (a part of Goat) is called by the Makefile. This strips the scaling/offset data from the xml files and stores it in the \texttt{Offsets} directory on a per-lane basis.

\texttt{update\_offsets.py} is then called. This script creates \texttt{default\_offsets.txt}, which contains transformations that will be applied globally to identified clusters on the next Firecrest pass. It creates this offsets file as an average of the individual tile/image offsets while attempting to maintain a small standard deviation (NOTE: need to add more details here). Firecrest can also optionally use a precalculated per-device offsets file (NOTE: also more details here).

\section{Image Analysis}

The bulk of the image analysis takes place in the \texttt{ImageData} object (stored in \texttt{ImageProcess.cpp/h}). Stages of the image analysis process are illustrated in figure~\ref{firecrest-imageanalysis-stages}. To illustrate the image analysis we’ll be looking at the processing of an example image. The original input image is shown in figure~\ref{firecrest-image-input}.

Image analysis stages in Firecrest



firecrest-image-input: An example input image and magnified selection.

The first stage of processing applies a Mexican Hat filter to the image. (NOTE: I guess attenuating low and high frequencies makes the image smoother and the edges stronger? Need to add more details.)



firecrest-image-post-mexicanhat: Image after Mexican hat filter.
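To make the filtering step concrete, here is a rough Python sketch of a Mexican hat (Laplacian-of-Gaussian style) filter. This is my own illustration, not Firecrest's code, and the kernel size and sigma are arbitrary choices:

```python
import math

def mexican_hat_kernel(size, sigma):
    """Build a 2D Mexican hat kernel: positive at the centre, negative in a
    surrounding ring. Convolving with it suppresses the flat background
    (low frequencies) and pixel noise (high frequencies) while sharpening
    cluster-sized blobs."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            r2 = (x * x + y * y) / (2.0 * sigma * sigma)
            row.append((1.0 - r2) * math.exp(-r2))
        kernel.append(row)
    # normalise to zero mean so flat regions map to (roughly) zero
    mean = sum(sum(r) for r in kernel) / (size * size)
    return [[v - mean for v in row] for row in kernel]

def convolve(image, kernel):
    """Naive 2D convolution, zero-padded at the borders."""
    h, w = len(image), len(image[0])
    k = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-k, k + 1):
                for dx in range(-k, k + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += image[yy][xx] * kernel[dy + k][dx + k]
            out[y][x] = acc
    return out
```

Applied to a tile image, flat background regions come out near zero while cluster-sized bright spots are strongly enhanced.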

\subsection{Noise estimation}

For noise estimation the image is broken into a number of regions, and for each region a single noise value is calculated. The image is divided into the smallest even number of regions such that each region is at least 125 pixels square. This size was selected in order to provide enough samples to calculate noise (NOTE: from Klaus).

For each region an initial histogram with fine binning is created. The initial bin width is calculated as $w = \frac{\sigma}{7\times50}$, where the standard deviation is calculated from the filtered image (in this region). That is, according to the code comments, ten times finer than any interval, with an additional factor of 5 to account for the overestimation of noise dispersion in the original image. This value is rounded to the nearest power of 10. Seven was selected heuristically based on the sample size. The minimum value stored in the histogram is $min = (\frac{\mu-2.5\times\sigma}{w}-0.5)\times w$, where the average $\mu$ is calculated from the filtered image in this region. The number of bins is calculated as $\frac{5\times\sigma}{w}$.

The histogram is then smoothed in up to three passes, or until smoothing no longer results in a tighter standard deviation. Smoothing is by “mean filtering”: this iterates over the histogram, at each position subtracting the value one window size (half the size of the s.d.) below the current position and adding the value one window size above it.

A new histogram is then created based on these refined values. The bin width here is $\frac{\sigma}{7}$ (where $\sigma$ is the new s.d.). The minimum value stored in the histogram is calculated as $\mu-2\times\sigma-0.5w$. This histogram is then clipped such that only bins satisfying $v+\sqrt{v} > m-\sqrt{m}\times e^{-0.5}$ are retained (where $v$ is the value of a bin, and $m$ is the maximum value in the histogram).

If after clipping there are fewer than 5 bins, an additional bin is added to each side of the histogram. If there are still fewer than 5 bins, noise estimation fails. A gaussian is fitted to the resulting histogram. The mean from the gaussian is used to populate the “background” image in this region, and the standard deviation to populate the “noise” image. Example background and noise images can be seen in figures~\ref{firecrest-image-background} and \ref{firecrest-image-noise}.



firecrest-image-noise: Noise image.



firecrest-image-background: Background image.
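The histogram-and-fit procedure above can be sketched as follows. This is a simplified reconstruction, not the Firecrest code: the clipping condition is approximated by keeping bins above $e^{-0.5}$ of the peak, and the gaussian fit is replaced by a moment fit over the retained bins:

```python
import math

def region_noise_estimate(pixels, nbins=None):
    """Per-region background/noise estimate, loosely following the scheme
    described above (my simplified reconstruction): histogram the region
    with bin width sigma/7, clip bins far below the peak, then recover
    background and noise as the mean and s.d. of the clipped histogram."""
    n = len(pixels)
    mu = sum(pixels) / n
    sd = math.sqrt(sum((p - mu) ** 2 for p in pixels) / n) or 1.0
    w = sd / 7.0                       # refined bin width from the notes
    lo = mu - 2.0 * sd - 0.5 * w       # histogram lower edge
    nbins = nbins or int(5 * sd / w)   # ~35 bins spanning ~5 sigma
    hist = [0] * nbins
    for p in pixels:
        i = int((p - lo) / w)
        if 0 <= i < nbins:
            hist[i] += 1
    peak = max(hist)
    cut = peak * math.exp(-0.5)        # crude stand-in for the clipping rule
    kept = [(lo + (i + 0.5) * w, c) for i, c in enumerate(hist) if c > cut]
    if len(kept) < 5:
        return None                    # noise estimation fails
    total = sum(c for _, c in kept)
    bg = sum(v * c for v, c in kept) / total
    var = sum(c * (v - bg) ** 2 for v, c in kept) / total
    return bg, math.sqrt(var)
```

Note that a moment fit on a clipped histogram underestimates the true dispersion; the real code fits a gaussian to the retained bins instead.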

\subsection{Object Identification}

The background image (calculated during noise estimation) is subtracted from the filtered image (an example is shown in figure~\ref{firecrest-image-subtracted}). This image is then thresholded. Thresholding retains pixels whose value is greater than a “factor” multiplied by the associated pixel in the noise image. The factor currently used is 4. A thresholded image is shown in figure~\ref{firecrest-image-thresholded}.



firecrest-image-subtracted: Image with background subtracted.



firecrest-image-thresholded: Image after thresholding.

After thresholding, object detection takes place. The algorithm is reasonably straightforward: every pixel in the image is iterated over, and when a pixel above the threshold is encountered a boundary is created. The boundary is created by walking all adjacent above-threshold pixels clockwise. The area covered by the newly identified cluster is also calculated, and the pixels at this position in the original image are unset (to avoid detecting the cluster again). During this process the maximum pixel intensity in the cluster is also found. It is this maximum pixel intensity that is passed to the base caller (rather than a function of all the pixels in the cluster). Initially identified clusters are shown in figure~\ref{firecrest-image-identifiedobjects}.



firecrest-image-identifiedobjects: Image after object identification, gray pixels indicate the maximum pixel identified in each object.
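The thresholding and detection passes can be sketched as follows. This is my reconstruction, not Firecrest's: it flood-fills each above-threshold region rather than walking its boundary clockwise, but it records the same outputs (area, and the position and value of the brightest pixel, which is what goes on to basecalling):

```python
def find_objects(image, noise, factor=4.0):
    """Threshold at factor * noise (factor 4, per the notes above), then
    flood-fill each above-threshold region, unsetting pixels as we go so
    no cluster is detected twice."""
    h, w = len(image), len(image[0])
    mask = [[image[y][x] > factor * noise[y][x] for x in range(w)]
            for y in range(h)]
    objects = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            # flood-fill this cluster (8-connected neighbours)
            stack, area, best = [(y, x)], 0, (image[y][x], y, x)
            mask[y][x] = False
            while stack:
                cy, cx = stack.pop()
                area += 1
                if image[cy][cx] > best[0]:
                    best = (image[cy][cx], cy, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx]:
                            mask[ny][nx] = False
                            stack.append((ny, nx))
            objects.append({"area": area, "max": best[0], "pos": best[1:]})
    return objects
```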

\subsubsection{Deblending}

If an object’s area is greater than $2f^2$, where $f$ is an input parameter usually set to 2.7, the object is split. However, if an object is too large (greater than $16f^2$) it is simply ignored (i.e. it is kept; there is a TODO in the source suggesting it should be deleted). I’m not sure how this is squared with the “many” clusters that are usually identified in blobs.



firecrest-image-deblended: Image after object deblending, gray pixels indicate the maximum pixel in this object. Split objects no longer have boundaries, and so only the maximum pixel is shown.

\subsubsection{Local Background Compensation}

Once final objects have been identified local background can be compensated for. This is calculated in the function \texttt{get\_background\_dispersion} and subtracted from the cluster intensity in \texttt{integrate\_objects}. Local background compensation takes a 10×10 window around the pixel with maximum intensity in a cluster. All pixels that are not part of a cluster are extracted and sorted.

An iterative process then occurs such that (to quote the source): “The background is given by the median of the pixel distribution, and the noise is chosen such that 68\% of the pixels lie within a 2-sigma interval. The procedure is iterated and each time 3-sigma outliers are removed to avoid large contaminations by residual low-level signal.”

The background is therefore the median pixel; the noise is presumably calculated from the sorted pixel values $P$ as $\frac{P_{c/2+0.34c} - P_{c/2-0.34c}}{2}$, where $c$ is the number of pixels (so that 68\% of the pixels lie within a 2-sigma interval). This process occurs iteratively, removing maximum pixels if they are greater than background $+\,3\times$noise. The iteration terminates if fewer than 2 pixels are removed or fewer than 20 pixels remain.
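A sketch of this iteration, based on my reading of the quoted description rather than the actual get_background_dispersion code:

```python
def local_background(window_pixels):
    """Iterative local-background estimate: background is the median of
    the non-cluster pixels in the window; noise is half the spread of the
    central 68% of the sorted distribution; 3-sigma outliers are removed
    and the procedure repeats until fewer than 2 pixels are dropped or
    fewer than 20 remain."""
    px = sorted(window_pixels)
    while True:
        c = len(px)
        background = px[c // 2]                       # median pixel
        lo = px[c // 2 - int(0.34 * c)]
        hi = px[c // 2 + int(0.34 * c)]
        noise = (hi - lo) / 2.0                       # 68% within 2 sigma
        survivors = [p for p in px if p <= background + 3 * noise]
        removed = c - len(survivors)
        px = survivors
        if removed < 2 or len(px) < 20:
            return background, noise
```

On a window of mostly-flat background with a few residual bright pixels, the bright outliers are stripped and the median/noise settle on the flat population.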

Bustard

Bustard is the Illumina basecaller; the base call is simply the maximum intensity after crosstalk and phasing corrections have been applied. Quality score calculation is more involved (NOTE: add more)

\section{Crosstalk matrix correction}
Crosstalk correction is performed as described for capillary sequencing by Li and Speed~\cite{li98}.

\section{Phasing correction}

Phasing is a term used to describe the signal contribution of bases from the previous and next cycles. Because the labelled base is not always successfully incorporated, or two bases are incorporated instead of one, this “signal leakage” sometimes occurs.

Phasing and prephasing are calculated as follows:

$phasing_b$ = average intensity of $b$ in the previous cycle where $b$ is max $-$ average intensity of $b$ in the previous cycle where $b$ is NOT max

$prephasing_b$ = average intensity of $b$ in the next cycle where $b$ is max $-$ average intensity of $b$ in the next cycle where $b$ is NOT max

Where $b$ is a given base. Phasing and prephasing are averaged over the run and used to derive a correction matrix. (NOTE: add detail and check)
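A sketch of the phasing estimate. The data layout here (intensities[cycle][cluster][base], and per-cluster calls giving the index of the max-intensity base) is my assumption for illustration:

```python
def estimate_phasing(intensities, calls):
    """For each base b, phasing is the average previous-cycle intensity of
    b over clusters where b is the current max, minus the average over
    clusters where it is not."""
    ncycles = len(calls)
    nclusters = len(calls[0])
    phasing = []
    for b in range(4):
        hit, miss = [], []
        for cyc in range(1, ncycles):
            for cl in range(nclusters):
                prev = intensities[cyc - 1][cl][b]
                (hit if calls[cyc][cl] == b else miss).append(prev)
        if not hit or not miss:
            phasing.append(0.0)   # base never (or always) called: no estimate
        else:
            phasing.append(sum(hit) / len(hit) - sum(miss) / len(miss))
    return phasing
```

The prephasing estimate is identical except that it looks at the next cycle's intensities instead of the previous cycle's.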

\section{Basecalling}
(NOTE: add this)

Gerald

\section{Purity and Chastity}

Purity is defined as the ratio of the largest intensity to the sum of all intensities for this cycle/cluster. Chastity is a similar metric, and is defined as the ratio of the largest intensity to the sum of the largest and second largest intensities. As pointed out by Irina, the logic of this definition is somewhat faulty: if the second highest intensity is negative (as it can be due to background compensation) it will contribute to the sum incorrectly. To further complicate the situation, Chastity is also sometimes used to mean a limit on purity (for example in PhasingEstimate.cpp); e.g. a chastity of 0.6 means clusters with purity below 0.6 are not accepted.

Purity and chastity are used as filters to identify mixed clusters; these are clusters which have grown from more than one molecule.
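The two definitions are trivial to write down; for a single cluster/cycle with four channel intensities:

```python
def purity(intensities):
    """Ratio of the largest intensity to the sum of all four. Note the
    caveat above: negative intensities (possible after background
    compensation) make this ratio behave oddly."""
    s = sorted(intensities, reverse=True)
    return s[0] / sum(s)

def chastity(intensities):
    """Ratio of the largest intensity to the sum of the two largest."""
    s = sorted(intensities, reverse=True)
    return s[0] / (s[0] + s[1])
```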

File Descriptions

\section{Initial run folder}
Before processing by the pipeline the run folder need only contain an Images directory and a .params file. The params file should have the same name as the run folder.

The Images directory contains a subdirectory for each lane in the format \texttt{Lnnn}, where \texttt{nnn} is the lane number; for example, lane 1 would be \texttt{L001}. The lane directory contains a subdirectory for each cycle; the Solexa run folder documentation states that this is in \texttt{C.} format, and the version number is usually 1. Each cycle folder contains a set of tif images for the cycle in the format: \texttt{s\_\_\_.tif} (this differs from the Solexa documentation).
\subsection{.params file}

The \texttt{.params} file is in the root of the run folder and must have the same name as the run folder. It is an xml format file, and its general format can be seen in the following example:

\VerbatimInput{examples/myfake.params}
Figure~\ref{runfolder-initial} gives an overview of the initial run folder structure.


runfolder_initial: Initial run folder structure, folders are represented by boxes, unboxed items are files.

\section{Pipefile files, before run}
To run the pipeline, you first tell it to create a set of makefiles in the run folder. For our \texttt{myfakerun} folder we might execute the following command to do this:
\begin{center}
\texttt{./GAPipeline-0.3.0b3/Goat/goat\_pipeline.py --cycles=1-36 myfakerun --make}
\end{center}
This creates a \texttt{Data} directory and populates it with files, note you need to tell the pipeline how many cycles ran. The \texttt{Data} directory contains a directory with a name similar to the following:
\begin{center}
\texttt{C1-36\_Firecrest1.9.2\_07-01-2008\_username}
\end{center}

This is named after the cycle numbers, Firecrest version number (?), date and username of the user who ran the software. A breakdown of this directory structure with additional directories highlighted is shown in figure~\ref{runfolder-makefilescreated}.


runfolder-makefilescreated: Run folder, after the pipeline has created its initial Makefiles. New folders are highlighted in grey. Blue dots highlight files of unknown function.

\section{Firecrest}

After Firecrest (the Solexa image analysis tool) has run, a huge number of files are created; these are described in figure~\ref{runfolder-postfirecrest}. From our perspective the most useful of these files are the \texttt{s\_\_\_int.txt.gz} and \texttt{s\_\_\_nse.txt.gz} files created in the root Firecrest directory. These files contain intensity and noise information associated with each cluster identified during image analysis.


runfolder-postfirecrest: Run folder, after firecrest run. New files are highlighted in Olive. Blue dots indicate files of unknown function.

\section{Bustard}

Bustard, the Solexa base caller, also creates a large number of files; from a user perspective the \texttt{s\_\_\_seq.txt} files in the Bustard directory are the most relevant. These files contain the final base calls. \texttt{prb} and \texttt{qhg} files appear to contain information which is processed into the final phred-style quality score files by Gerald (the Solexa reporting tool). Figure~\ref{runfolder-postbustard} describes the files produced by Bustard during its run.


runfolder-postbustard: Run folder, after Bustard run. New files are highlighted in Olive. Blue dots indicate files of unknown/unclear function.

\section{Gerald}

In order to run Gerald, you need a Gerald configuration file. The format is quite straightforward and is described in \texttt{GERALD User Guide and FAQ.html} in the Solexa pipeline documentation. A simple example configuration might be:

\VerbatimInput{examples/gerald.config}

Once you have a configuration file, you can ask Gerald to create a set of Makefiles with the following command (one line):

~

\scriptsize
\texttt{./GAPipeline-0.3.0b3/Gerald/GERALD.pl ./myfake/gerald.config}

\texttt{--EXPT\_DIR}
\texttt{ ./myfake/Data/C1-36\_Firecrest1.9.2\_08-01-2008\_username/Bustard1.9.2\_08-01-2008\_username --FORCE}
\normalsize

~

This creates a directory with the name \texttt{GERALD\_08-01-2008\_username} (where 08-01-2008 is the current date), under the \texttt{Bustard1.9.2\_08-01-2008\_username} directory. This contains a copy of the configuration file originally provided (\texttt{config.txt}), a version of this file converted into xml format (\texttt{config.xml}), and a makefile. To run Gerald, simply run make. The \texttt{htm} files in this directory contain run reports: details of error rates, signal intensities etc. The most useful files in this directory are the \texttt{s\_\_sequence.txt} files. These contain the final sequence and quality scores produced by Gerald. Many other text files are created; files created by Gerald are summarised in figure~\ref{runfolder-postgerald}.


runfolder-postgerald: Run folder, after Gerald run. New files are highlighted in Olive. Blue dots indicate files of unknown/unclear function.

Nextgen sequencing primer

If you’ve reached this page, you should probably just email me: [email protected] I love talking about sequencing and DNA sequencing platform technologies.

I wrote this quite some time ago; though much of it is out of date, many of the basic principles still apply. The latex conversion is also currently incomplete, so bear with me…

 

A pdf version is available here: primer

Overview

This document is intended to be a brief overview of next-gen (2nd generation) sequencing technologies. We will cover the Illumina, 454 and SOLiD sequencing platforms. Up until 2006 the dominant form of sequencing used the underlying method of [sanger77], and had for the previous 30 years. In its final form Sanger sequencing produced reads on the order of 1000 bp. It is important to note that Sanger sequencing does not determine the sequence of a single region of the genome, but rather all regions starting with a given “primer” sequence. This means that when sequencing a repetitive or diploid genome there may be significant ambiguities, for example at SNP sites.

In comparison to Sanger sequencing, next-gen methods produce far shorter read lengths, on the order of 25 to 100~bp. Read length is a significant factor in both the assembly of new reference sequences and the mapping of variation [whiteford05]. It is fair to say that longer reads are better for almost all applications. However, next-gen technologies have several advantages. Firstly, they operate at a much higher throughput than Sanger sequencing. Secondly, the cost per base is far smaller than Sanger sequencing. Finally, many next-gen approaches result in sequence data from a single molecule, and therefore a single location on the genome. This allows for the more accurate determination of SNPs.

Solexa/Illumina Sequencing

The Solexa sequencing technology operates by sequencing by synthesis (SBS). Using this approach a complementary strand is synthesised along an existing single stranded template; the incorporated bases are labelled (in this case fluorescently) and detected. Each base has a different label, and therefore by reading the labels you can infer the sequence of the original template.
Illumina/Solexa sequencers currently produce reads in the range of 35 to 75 bp, with a throughput of up to 10~Gb per run (non-PF). The error rate in the total dataset is generally around 6%; however, most users never see this dataset. This is because the standard Illumina data analysis pipeline filters the reads to remove “mixed” clusters. These are clusters formed from more than one template, which therefore produce ambiguous basecalls. This filter removes approximately 40% of the data, resulting in between 4 and 5Gb of data per run, with an error rate of between 1 and 2%. Typically the Illumina sequencers produce reads of 36bp, which may be “paired end” (2 sequences a known distance apart, in this case usually 200bp, though long insert protocols are occasionally used). A typical 36bp single-end run takes 3 days, giving a throughput of approximately 1Gb per day.

The above describes the state of play with the Illumina Genome Analyzer 1 (sometimes called “Classic”). However a new instrument, the GA2 (previously known as the “Hyperkaster”), improves data quality and throughput significantly. Much of this is down to improved optics, which increase the signal to noise ratio. Using the GA2, 50bp reads produce error rates similar to 36~bp reads on the GA1. Reads as long as 75bp have been attempted, and have been produced without catastrophic error rates in late cycles (it remains to be seen if their utility justifies the increased runtime).

Library Preparation

Let’s examine the process from the beginning. Library preparation is not my area, but I’ll try to explain the process as I understand it. You start with a sample of DNA, for example a set of chromosomes from an individual. This is then broken into relatively large fragments (either by nebulisation or sonication; my understanding is that nebulisation is more common); these fragments are on the order of 10,000bp. The large fragments are PCR amplified to give enough starting material for sequencing.

A second round of nebulisation then occurs, which breaks the 10,000bp fragments into the size required for sequencing, typically around 200bp. A size selection step occurs to ensure only those fragments of the required size are used (an electrophoresis gel is run, and the region corresponding to the required size is extracted). This then forms the starting material for sequencing.

Sequencing Preparation

Once the sample DNA has been prepared, adapter sequences are added to both ends of the fragments. The fragments are then attached to primer sequences on the flowcell (see figure 1). The result is that these single molecules are randomly scattered across the flowcell (a flowcell is simply a series of 8 glass channels, each of which may contain a single (or multiplexed) sample). Sequencing by synthesis could theoretically occur at this point, on the single molecule level. However, it is unlikely that the fluorescence from a single molecule would provide enough signal to reliably sequence. In order to increase the signal, “clusters” are grown from these single molecules using a technique called “bridge amplification”; this is a non-trivial, patent-protected process. Once clusters have been grown the sample is loaded on to the device and sequencing can begin.

Figure 1: A GA1 Flowcell

Sequencing

The flowcell is loaded into the device and manually focussed (an often problematic process!). Sequencing can now begin, and occurs in a cyclic process, one cycle per read position. At the start of each cycle the flowcell is moved under a peltier. The peltier heats the flowcell while the sequencing chemistry is performed. First, any “blockers” and labels are removed, leaving any previously incorporated bases attached. Labelled nucleotides are then washed over the flowcell. The nucleotides also contain a blocker; this prevents any additional nucleotides from incorporating and therefore ensures that in a single cycle only one base is added to each molecule.


After the chemistry has been performed, imaging occurs. The flowcell is positioned under a CCD camera, and a laser is “bounced” off the bottom of the flowcell (so as to avoid it shining directly into the CCD). The laser excites the fluorescent labels while the CCD camera takes a picture. Imaging is performed under 2 lasers and using 2 filters on the camera. This gives a total of 4 images per cycle, each of which reflects the luminescence of one type of fluorophore/labelled nucleotide. The flowcell is imaged as a series of small, non-overlapping regions; in Solexa terminology these are called “tiles” (an example tile image is shown in figure 2). During a complete 36 cycle run a single tile is composed of 4*36 = 144 images (4 channels times 36 cycles). On a GA1 these images are 1002x1004 pixels; clusters are approximately 4 pixels in radius, which corresponds to around 1 micron.

The primary data analysis problem is therefore to segment and register each tile stack, and perform post-image-analysis corrections (basecalling); details of this can be found in the document entitled “The Solexa Pipeline” (I’ll have to upload this too sometime).

Error profile

Figure 3 shows the distribution of errors in a Solexa run. 76,000 reads from a $\phi$X~174 control lane were aligned using a brute force alignment (which will find the best match with any number of errors). It therefore includes contamination, an analysis of which is provided elsewhere (contact new at sgenomics dot org for more information). Quality scores were not used in alignment. It should also be noted that this is the error rate for the PF fraction of the data, which is all a user normally sees. In Solexa sequencing errors tend to follow this basic trend, increasing slowly with read length up to around 5% in the final cycle. Early cycle errors tend to be around 0.3 to 0.5% and we believe are mostly caused by contamination. Errors increase with cycle for two basic reasons. The first is signal loss: as the run progresses the fraction of molecules that no longer produce signal increases (through photobleaching, chemical, or physical effects). Secondly, the number of molecules that are “out of phase” with the current cycle increases and can no longer be corrected for by the primary data analysis software. Phasing is caused by the non-incorporation of a labelled base or non-blocking of a labelled base (and therefore the incorporation of too many bases); this manifests itself as signal leakage between cycles. The jump in error rate at cycle 12 is caused by purity filtering (the minimal purity over the first 12 bases is used).

Figure 3: Error rate by cycle in a Solexa $\phi$X~174 control lane (not a great run, GA1), this plot was created from the first 76,000 reads in this provided dataset.

Figure 4 shows the distribution of errors (35 being fully correct). As expected, errors are relatively evenly distributed throughout reads. If you look at the entire dataset (not just PF) a similar trend is seen; errors start at a much higher level and the error rate in the total dataset is around 6%.

Figure 4: Distribution of errors in a Solexa $\phi$X~174 control lane (not a great run, GA1), this plot was created from the first 76,000 reads in this provided dataset.

Focusing

During imaging the camera refocuses between tiles; this results in a change in the Z-height of the camera (as it moves up and down to focus). Guoying Qi at the Sanger Institute has made the following plots of the Z-height as imaging progresses. Figure 5 shows the Z-height for a single cycle. There are two interesting artifacts in these plots, which tell us something about the underlying technology. The first is the sawtooth pattern in Z-height. The peaks of these occur at the extreme ends of the lanes (channels on the flowcell). In this case the Z-height slowly decreases as the end of the lane approaches, then abruptly begins to increase as the camera walks back down the flowcell (images are taken in a zig-zag pattern). This indicates that the flowcell is not perfectly flat in its mounting and is sloped along its Y-axis.

Figure 5: The Z-Height for cycle 2 across all tiles/lanes

The second artifact is the gradual decay in Z-height as the cycle progresses. Our current theory is that this is a temperature dependent effect: the ambient temperature is higher at the beginning of the cycle, due to the heat produced by the peltier while base incorporation is performed. This causes the flowcell to warp slightly, altering the focusing. As the ambient temperature is approached the flowcell’s original shape is restored and the Z-height decreases. This is perhaps backed up by the fact that the first cycle often does not show this decay in Z-height (see figure 6).

Figure 6: The Z-Height for all tiles/lanes across the first six cycles

454

I know very little about 454 sequencers, other than that the primary data analysis is the most closed of the 3 platforms. Images are stored in a proprietary “pif” format, for which no documentation is available (update: if I remember correctly pifs are actually tiff files with a few bytes on the front).

454 sequencers produce reads of approximately 100~bp; however, the throughput is not as high as Illumina’s or ABI’s and the cost per base is higher (so I’m told). They also have significant problems with “homopolymer runs”, that is, runs of the same base. The 454 technology produces an intensity trace where a higher peak represents a longer run of the same base, and this causes some ambiguity.

Update: So, I really don’t know that much about the underlying technology of 454 sequencers, and I’ve never worked with the data, but essentially, like Ion Torrent machines and unlike Solexa sequencers, they don’t have dye terminators. This is what causes the homopolymer problems, as more than one base can incorporate in a single “flow”. The dye terminator technology is patented, and I assume difficult to license/get around. If you could use the dye terminator technology from Ilm on an Ion Torrent machine, that would probably improve things quite a lot (as always this is random speculation and guessing).

Figure 7: A 454 Tile image

Figure 8: Magnified region of a 454 tile image.

Solid

The ABI SOLiD is a next-gen, high-throughput, sequencing by ligation, DNA sequencer. The ABI SOLiD sequencing system is significantly different from both traditional Sanger sequencing and other high throughput DNA sequencers. In terms of informatics this is largely apparent by the novel 2 base encoding system employed by the device.

The technique first attaches prepared DNA samples to beads. Each bead contains PCR amplified copies of a single molecule. Beads are then attached to the surface of a flowcell. The sequencing then operates in a cyclic process through the ligation of successive “probe” sequences to the beads. Each probe is ligated, read (via its fluorescence) and then removed. This is further complicated by the use of five different primers to offset probes and allow the reads to completely cover the target; however, this is not a significant issue from an informatics perspective.

Each ligation probe is composed of 3 degenerate bases (all possible 3~bp sequences synthesised combinatorially (CHECK)) followed by a 2~bp read sequence, and then 3 fluorescently labelled universal bases. Degenerate bases are used to provide a discrimination advantage [drmanac02] during hybridisation. An example probe might be: 3′ NNNATZZZ 5′.

Ideally this probe would hybridise and be ligated against 5′ NNNTANNN 3′ in the sample. The probe is designed such that it fluoresces in a given channel. Table 1 shows the channel (colour) that is emitted by a given probe.

Table 1: ABI Solid base to colour chart. Different dyes are represented as numbers. In ABI documentation these are usually denoted by the following colours: 0. Blue 1. Green 2. Orange 3. Red. and represent these dyes: 0. 6-FAM 1. CY3 2. Texas Red 3. CY5

Basecalling

Table 2: From table 1 we can determine the colour to sequencing mapping shown.

As you can see, the sequencer adopts a novel 2 base encoding strategy. That is, a single 2bp combination does not emit a unique signal, but one that is shared with 3 others. In order to disambiguate these signals and obtain base calls you therefore need to trace a path from signal symbol to signal symbol (colour to colour), and in order to do this unambiguously you need to know the first base in the sequence. When sequencing, this is known, as the first base is part of the primer sequence. Table 2 shows the colour to base mapping which can be inferred from table 1. An example disambiguation of the read sequence 011023022 is shown in figure 9.

Figure 9: An example showing the conversion of the colour sequence 011023022 and a known starting base (T) in to a base sequence: TTGTTCGGAG
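The disambiguation in figure 9 can be reproduced with the standard SOLiD colour transition table. This sketch is mine, but it encodes the same mapping as table 2:

```python
# Colour transition table: colour 0 keeps the base, colour 1 swaps A<->C
# and G<->T, colour 2 swaps A<->G and C<->T, colour 3 swaps A<->T and C<->G.
# NEXT[base][colour] gives the following base.
NEXT = {"A": "ACGT", "C": "CATG", "G": "GTAC", "T": "TGCA"}

def decode(start, colours):
    """Decode a SOLiD colour-space read given the known starting base
    (which comes from the primer)."""
    seq, base = start, start
    for c in colours:
        base = NEXT[base][int(c)]
        seq += base
    return seq
```

For example, decode("T", "011023022") reproduces the figure 9 result, TTGTTCGGAG.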

SNP Calling

SNP calling is the process of identifying single nucleotide changes in a genome with respect to a reference sequence. Traditionally this is performed by aligning reads against a reference and noting differences between the alignment and reference. It is usual to require a large number of reads covering a given SNP in order to confirm the call.

This method can also be used with the ABI SOLiD. That is to say, a colour space sequence is translated into a base pair sequence, which is then aligned to the genome in order to identify SNPs. However, there may be certain advantages to aligning in colour space directly. In this case the reference sequence is translated into colour space, colour space reads are aligned to the colour space reference, and these regions are then translated back to sequence space and SNPs called. As we shall see, this has a number of advantages in the identification of miscalls.

When errors occur

Miscalled, inserted or deleted colours manifest themselves similarly in terms of base space: they significantly alter the base calls of all remaining bases in the read. This is because subsequent base calls are dependent on previous base calls, which causes the error from one call to propagate through all subsequent bases; figure 10 illustrates this effect.

Figure 10: The effect of a single colour space miscall on base pair sequence
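Going the other way, encoding shows why a true SNP looks different from a miscall: a single changed base alters exactly the two colours flanking it, whereas the single colour miscall of figure 10 corrupts every downstream base. A small sketch (my own, using the standard colour transition table):

```python
# Build the base-pair -> colour mapping from the transition table
# (colour 0: same base; 1: A<->C/G<->T; 2: A<->G/C<->T; 3: A<->T/C<->G).
COLOUR = {}
for base, nxt in {"A": "ACGT", "C": "CATG", "G": "GTAC", "T": "TGCA"}.items():
    for col, b2 in enumerate(nxt):
        COLOUR[base + b2] = col

def encode(seq):
    """Encode a base sequence into SOLiD colour space: one colour per
    overlapping pair of bases."""
    return "".join(str(COLOUR[seq[i:i + 2]]) for i in range(len(seq) - 1))
```

Encoding TTGTTCGGAG gives 011023022; changing the single base at position 4 changes exactly two adjacent colours, which is the signature colour-space aligners look for when calling SNPs.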

Advantages of miscall errors for SNP calling

One of the marketed features of the ABI SOLiD is that this error propagation is an advantage for SNP calling, and it is their main reason for employing colour space alignments. The reasoning is as follows: “if you align in colour space to a position that contains a single mismatch then you have a high degree of confidence that this read came from this location; however, because the read and reference are vastly different in terms of sequence, this mismatch is likely to be a miscall”. Let’s try to quantify this.

* Reads are assumed to align to a unique location; the number of allowable mismatches will need to be tuned such that this is the case, or significant problems will occur.
* If there is a single colour mismatch at position $t$ of a read of length $l$ then either this mismatch is a miscall, or this and subsequent bases have been mutated to one of $4^{l-t}$ sequences. The probability of this being a true call at this position in the sequence is therefore $c = \frac{1}{4^{l-t}-1}$, given that all alternate sequences are equally likely (some may contain more mismatches than others, but we take this as a close approximation). The probabilities for a 25~bp read are shown in table~\ref{miscall-probabilities}. If $d$ (the device miscall probability) is greater than $c$ (the probability of mismatches in the remainder of the read) then it is more likely that this base has been miscalled.

Table 3: Probabilities of a single colour space miscall being a true base pair sequence in a 25~bp read: %octave: for t = 1:24; l = 25; c = (1/(4^(l - t) - 1)); printf("%e\n", c); endfor
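The octave one-liner in the caption is easily replicated; this Python sketch prints the $c = 1/(4^{l-t}-1)$ column for a 25 bp read:

```python
# Probability c that a single colour mismatch at position t of an l bp
# read is a true call rather than a miscall: c = 1 / (4^(l-t) - 1).
l = 25
for t in range(1, l):
    c = 1.0 / (4 ** (l - t) - 1)
    print("t=%2d  c=%e" % (t, c))     # c grows toward the end of the read
```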

If the mismatch occurs in the last base then we have no other information on which to base the probability that this is a miscall; its likelihood is therefore $d$ (the probability of a miscall based on device characteristics). These probabilities are however based on the notion that the read truly came from this location on the reference. We must therefore tune the minimal number of allowable alignment mismatches such that it is unlikely that the read came from an alternate location. The number of times we would expect any given 25~bp read with $m$ or fewer mismatches to appear in a randomly distributed human genome sized sequence ($3.2\times10^9$~bp) is shown in table~\ref{human-mismatch-alignment}; the method used to generate this is described in the appendix. Even though it has previously been shown that substrings in the human genome are not randomly distributed [whiteford05] [whiteford08], we can only expect these values to degenerate as order increases, and they therefore provide an acceptable upper bound. From this table we can see that with 25~bp reads we can only allow up to 3 mismatches. For 35~bp reads this increases to 8~mismatches, and at 50~bp we reach 16~mismatches. This is confirmed in practice, where 25~bp reads with greater than 3 mismatches align poorly or to multiple locations (NOTE: multiple locations true?). It should also be noted that mismatches are not only due to miscalls but also to true SNPs, which as we shall see later cause 2 mismatches; with 25~bp reads we can therefore expect to identify 1 SNP and allow 1 miscall before the read is no longer alignable.

Given a device miscall rate $d$, the probability of a read of length $n$ containing $m$ or more miscalls is:

\begin{equation}
s = \sum\limits_{k=m}^{n}\frac{n!}{k!(n-k)!}d^k(1-d)^{n-k}
\end{equation}

and we can therefore expect to discard $100\times s$ percent of reads. This is tabulated in table~\ref{discarded} for varying error rates and read lengths, taking the maximal mismatch counts identified in table~\ref{human-mismatch-alignment}.
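A minimal Python version of this binomial tail (matching the octave snippet embedded in the table source); for $d=0.1$, $n=25$, $m=4$ it reproduces the 23.6409\% entry in table~\ref{discarded}:

```python
from math import comb

def p_at_least(n, m, d):
    """Probability that a read of length n contains m or more miscalls
    at per-colour miscall rate d (binomial tail sum)."""
    return sum(comb(n, k) * d**k * (1 - d)**(n - k) for k in range(m, n + 1))

s = p_at_least(25, 4, 0.1)
print("%.4f%%" % (100 * s))        # ~23.6409% of 25 bp reads discarded
```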

\begin{center}
\begin{table}[h]
\begin{center}
{\footnotesize
\begin{tabular}{ | r | c | c | c |}

%octave: d=0.1; n=50; m=50; s=0; for k=17:m; s=s+sum((factorial(n)/((factorial(k)*factorial(n-k))))*d^k*(1-d)^(n-k)); endfor; printf("%d %f\n",k,s);
\hline
Error rate & 25~bp read, $\geq$4~mismatches & 35~bp read, $\geq$9~mismatches & 50~bp read, $\geq$17~mismatches \\
\hline
0.005 & 0.0007\% & 0.0\% & 0.0\% \\
0.01 & 0.0107\% & 0.0\% & 0.0\% \\
0.02 & 0.1446\% & 0.0\% & 0.0\% \\
0.03 & 0.6186\% & 0.0001\% & 0.0\% \\
0.04 & 1.6522\% & 0.0007\% & 0.0\% \\
0.05 & 3.4091\% & 0.0042\% & 0.0\% \\
0.06 & 5.9757\% & 0.0170\% & 0.0\% \\
0.07 & 9.3612\% & 0.0533\% & 0.0\% \\
0.08 & 13.5092\% & 0.1388\% & 0.0\% \\
0.09 & 18.3146\% & 0.3130\% & 0.0001\% \\
0.1 & 23.6409\% & 0.6304\% & 0.0004\% \\
0.2 & 76.6007\% & 25.4988\% & 1.4442\% \\
\hline
\end{tabular}
}
\caption{Percentage of reads which will be unalignable, or misaligned due to high numbers of errors for varying error rate and read length.}
\label{discarded}
\end{center}
\end{table}
\end{center}

\begin{center}
\begin{table}[h]
\begin{center}
{\footnotesize
\begin{tabular}{ | r | c | c | c |}
\hline
%octave: l=3200000000; m=0; for k=1:25; n=25; g=(3^k)*(factorial(n)/(factorial(k)*factorial(n-k))); m=m+g; e=(m/(4^n))*(l-n+1); f=l/e; printf("%d %d %d %e %e\n",k,g,m,e,f); endfor
Mismatches in alignment $m$ & Alignments ($n$ = 25~bp) & Alignments ($n$ = 35~bp) & Alignments ($n$ = 50~bp) \\
1 & $2.131628\times10^{-4}$ & $2.846031\times10^{-10}$ & $3.786532\times10^{-19}$ \\
2 & $7.887024\times10^{-3}$ & $1.479936\times10^{-8}$ & $2.820967\times10^{-17}$ \\
3 & $1.843858\times10^{-1}$ & $4.937863\times10^{-7}$ & $1.364098\times10^{-15}$ \\
4 & $3.096616$ & $1.198947\times10^{-5}$ & $4.845417\times10^{-14}$ \\
5 & $3.979072\times10^{1}$ & $2.258093\times10^{-4}$ & $1.348140\times10^{-12}$ \\
6 & $4.067318\times10^{2}$ & $3.433106\times10^{-3}$ & $3.059108\times10^{-11}$ \\
7 & $3.394680\times10^{3}$ & $4.329522\times10^{-2}$ & $5.820293\times10^{-10}$ \\
8 & $2.356333\times10^{4}$ & $4.618474\times10^{-1}$ & $9.473970\times10^{-9}$ \\
9 & $1.378524\times10^{5}$ & $4.228817$ & $1.339611\times10^{-7}$ \\
10 & $6.864397\times10^{5}$ & $3.361118\times10^{1}$ & $1.665153\times10^{-6}$ \\
11 & $2.930661\times10^{6}$ & $2.339455\times10^{2}$ & $1.836907\times10^{-5}$ \\
12 & $1.078543\times10^{7}$ & $1.435951\times10^{3}$ & $1.812322\times10^{-4}$ \\
13 & $3.434975\times10^{7}$ & $7.815829\times10^{3}$ & $1.609417\times10^{-3}$ \\
14 & $9.494372\times10^{7}$ & $3.789239\times10^{4}$ & $1.293288\times10^{-2}$ \\
15 & $2.282504\times10^{8}$ & $1.642140\times10^{5}$ & $9.446183\times10^{-2}$ \\
16 & $4.782005\times10^{8}$ & $6.379199\times10^{5}$ & $6.294956\times10^{-1}$ \\
17 & $8.751801\times10^{8}$ & $2.226228\times10^{6}$ & $3.839698$ \\
18 & $1.404486\times10^{9}$ & $6.991152\times10^{6}$ & $2.149581\times10^{1}$ \\
19 & $1.989509\times10^{9}$ & $1.978121\times10^{7}$ & $1.107056\times10^{2}$ \\
20 & $2.516029\times10^{9}$ & $5.047735\times10^{7}$ & $5.255314\times10^{2}$ \\
21 & $2.892115\times10^{9}$ & $1.162548\times10^{8}$ & $2.303356\times10^{3}$ \\
22 & $3.097253\times10^{9}$ & $2.418299\times10^{8}$ & $9.333844\times10^{3}$ \\
23 & $3.177524\times10^{9}$ & $4.547617\times10^{8}$ & $3.501041\times10^{4}$ \\
24 & $3.197592\times10^{9}$ & $7.741593\times10^{8}$ & $1.216688\times10^{5}$ \\
25 & $3.200000\times10^{9}$ & $1.195764\times10^{9}$ & $3.920430\times10^{5}$ \\
26 & NA & $1.682231\times10^{9}$ & $1.171969\times10^{6}$ \\
27 & NA & $2.168698\times10^{9}$ & $3.251770\times10^{6}$ \\
28 & NA & $2.585670\times10^{9}$ & $8.376996\times10^{6}$ \\
29 & NA & $2.887615\times10^{9}$ & $2.004130\times10^{7}$ \\
30 & NA & $3.068782\times10^{9}$ & $4.453635\times10^{7}$ \\
31 & NA & $3.156444\times10^{9}$ & $9.194611\times10^{7}$ \\
32 & NA & $3.189317\times10^{9}$ & $1.763948\times10^{8}$ \\
33 & NA & $3.198282\times10^{9}$ & $3.145834\times10^{8}$ \\
34 & NA & $3.199864\times10^{9}$ & $5.218665\times10^{8}$ \\
35 & NA & $3.200000\times10^{9}$ & $8.061403\times10^{8}$ \\
36 & NA & NA & $1.161483\times10^{9}$ \\
37 & NA & NA & $1.564844\times10^{9}$ \\
38 & NA & NA & $1.978821\times10^{9}$ \\
39 & NA & NA & $2.360953\times10^{9}$ \\
40 & NA & NA & $2.676211\times10^{9}$ \\
41 & NA & NA & $2.906889\times10^{9}$ \\
42 & NA & NA & $3.055181\times10^{9}$ \\
43 & NA & NA & $3.137949\times10^{9}$ \\
44 & NA & NA & $3.177452\times10^{9}$ \\
45 & NA & NA & $3.193253\times10^{9}$ \\
46 & NA & NA & $3.198406\times10^{9}$ \\
47 & NA & NA & $3.199721\times10^{9}$ \\
48 & NA & NA & $3.199968\times10^{9}$ \\
49 & NA & NA & $3.199998\times10^{9}$ \\
50 & NA & NA & $3.200000\times10^{9}$ \\
\hline
\end{tabular}
}
\caption{Expected number of alignments with $m$ or fewer mismatches for reads of length $n$ against a genome of $3.2\times10^9$~bp. Values less than one mean that in a randomly distributed sequence of this size we would expect reads with $m$ mismatches to be unique.}
\label{human-mismatch-alignment}
\end{center}
\end{table}
\end{center}

So far we have determined how many miscalls will cause our reads to misalign. We have also shown that, given a good alignment, the probability of a colour mismatch being caused by a miscall decreases toward the end of a read. We now address the question of true SNPs: how they affect colour space sequence, and, given a particular colour space miscall rate, what the SNP miscall rate is.

As shown in table~\ref{miscall-probabilities}, in most read positions the probability of a single colour change being due to a true sequence polymorphism is low. In current protocols all single colour changes are therefore discarded. A true isolated SNP will cause two consecutive changes in colour space, to one of 3 other two colour combinations, as shown in figure 11. It should also be noted that a true SNP call cannot be registered incorrectly by a single miscall; two colour changes are required to change a valid SNP into an alternate valid single SNP colour sequence. The question is therefore: at a given device miscall rate, what is the probability of two consecutive miscalls causing a valid two colour change?

Figure 11: An example showing the two colour changes that are caused by single, real SNPs

The probability of two consecutive miscalls is $d^2$ (where $d$ is the device error rate). Each miscalled colour can take one of 3 incorrect values, giving $3^2$ possible miscalled colour combinations, of which 3 would be interpreted as valid SNP calls. The final single SNP miscall probability is therefore $f = \frac{3d^2}{3^2} = \frac{d^2}{3}$.

Theoretical Conclusions

We have shown that an ABI device with a miscall rate of $d$ has a SNP calling error rate of $\frac{3d^2}{3^2}$.
Current ABI devices have a miscall rate of 0.1 (based on the Q scores in the Yoruba dataset; these are not evenly distributed across reads, so this may be unfair). They therefore have a SNP calling error rate of 0.0033 (0.33\%, approximately Q25), which is comparable with that obtainable from Solexa devices. We have also determined the number of mismatches that may be tolerated before alignments repeat; it is crucial to the error correction process that the alignment maps against the correct position on the reference. For a 25~bp read against a human genome sized sequence of random distribution we have shown that a maximum of 3 mismatches can be tolerated (effectively 1 true SNP and 1 miscall), and that at an error rate of 0.1, 23\% of reads would have 4 or more errors and would therefore fail to align, or misalign. Reducing the device miscall rate to 0.05 would result in a final SNP call error rate of approximately Q31. Increasing the read length from 25~bp to 35~bp or more would decrease the probability of misalignment significantly.

In this document we have only discussed the use of ABI sequence data for SNP calling. DNA sequencers will typically also be used for \emph{de novo} assembly of read sequences. Here the ABI SOLiD will not be able to take advantage of its ability to correct against a reference, and we will be left with the raw error rate when overlapping reads.

Distribution of Errors in the E.Coli Dataset

76,000 reads from the SOLiD v.2 E.Coli dataset provided by ABI were aligned. Only the first end read was used, which in this dataset was 25~bp. I used a bruteforce aligner (which accepts any number of errors) to align this data; the aligner does not factor quality scores into its alignment. The reference sequence was translated into colour space, and the reads aligned. The purpose of this analysis is to get a handle on the single colour change error rate, and on the distribution of errors in the reads. Figure 12 shows the error rate by cycle. Cycles one and two have predictably high error rates: the first is not a colour space symbol (a “T”) and therefore results in a 100% error rate. The second cannot be interpreted correctly in colour space (as it partly covers the adapter and the real sequence); its alignment is therefore almost random and has a high error rate.

Interpretation of the error rates on the remaining cycles is difficult. I expected to see a more significant bias due to the order in which the ligation probes are added; when compared to the ligation order some correlation does appear to exist, but it is not hugely significant. A trend toward increased error with increasing cycle does exist, as does a significant peak in errors on the penultimate base (Erin: you told me why, but I can’t remember).

Figure 13 shows the distribution of errors in these reads. 2771 reads had 18 errors, while only 13 had 17 errors (these could be alignments that overlapped with “N”s, which were not converted into colour space sequence). Such a large drop is unlikely and suggests that the dataset may have been filtered to remove poorly aligning reads.

Figure 12: Error rate by cycle in SOLiD v.2 E.Coli Dataset, this plot was created from the first 76,000 reads in this provided dataset.

Figure 13: Distribution of errors per read in SOLiD v.2 E.Coli Dataset, this plot was created from the first 76,000 reads in this provided dataset. Reads of score 24 or 25 should be considered “fully correct”.

Appendix

Enumerating Mismatched Alignments

In order to enumerate the number of possible mismatches in a sequence of length $n$ we first determine the number of possible mismatch positions. This is a simple exercise in combinatorics but is perhaps best illustrated with an example. If we have a sequence of length 5 ($n = 5$) then mismatches can occur at positions $1, 2, 3, 4$ or $5$. We wish to know the number of possible sequences with 3 mismatches ($k = 3$). This is simply the binomial coefficient [knuth73a-binomial] (colloquially, the number of ways to pick $k$ objects out of $n$ options):

\begin{equation}
\frac{n!}{k!(n-k)!}
\end{equation}
If we are interested in the number of strings of length $n$ with $m$ or fewer mismatches we therefore have:
\begin{equation}
g_{n,m} = \sum\limits_{k=1}^{m}{3^k\times\frac{n!}{k!(n-k)!}}
\end{equation}

as each mismatch can be one of 3 alternate bases. If we would like to know the number of times we would expect to see a match within $m$ mismatches of a string of length $n$ against a randomly distributed target string of length $l$ we have:
\begin{equation}e = \frac{g_{n,m}}{4^n}\times(l-n+1)\end{equation}
as $g_{n,m}/4^n$ is the probability of such a match at any given alignment position and $(l-n+1)$ is the number of alignment positions in the target sequence.
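Putting the two formulas together in Python; for $n=25$, $m=4$ against a $3.2\times10^9$~bp target this reproduces the value of about 3.1 expected alignments from table~\ref{human-mismatch-alignment}:

```python
from math import comb

def expected_alignments(n, m, l):
    """Expected number of matches of an n bp read with up to m mismatches
    against a randomly distributed target sequence of length l."""
    g = sum(3**k * comb(n, k) for k in range(1, m + 1))  # g_{n,m}
    return g / 4.0**n * (l - n + 1)                      # e

print(expected_alignments(25, 4, 3.2e9))   # ~3.0966
```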

[drmanac02]: Sequencing by Hybridization (SBH): Advantages, Achievements, and Opportunities, R. Drmanac et al., Advances in Biochemical Engineering/Biotechnology, 77, 2002, 75
[sanger77]: DNA Sequencing with chain-terminating inhibitors, F. Sanger et al., Proceedings of the National Academy of Sciences U.S.A., 74, 1977, 5463–5467
[whiteford05]: An analysis of the feasibility of short read sequencing, N. Whiteford et al., Nucleic Acids Research, 33, 19, e171, 2005
[whiteford08]: Visualizing the Repeat Structure of Genomic Sequences, N. Whiteford et al., Complex Systems, 17, 4, 2008
[knuth73a-binomial]: The Art of Computer Programming (Volume 1): Fundamental Algorithms, D. E. Knuth, 51–73, 1973

Accessing the Common Crawl Dataset from the command line

The Common Crawl dataset is a crawl of the web which has been made freely available on S3 as a public dataset. There are a couple of guides out there for accessing the Common Crawl dataset from Hadoop, but I wanted to take a peek at the data before analysing it.

Here’s how to do this from EC2. This doesn’t incur any charges, but doing it from an external host will.

First, fire up an instance on EC2 and login. Then:

sudo su
yum install git
git clone git://github.com/s3tools/s3cmd.git

The current version (April 2012) of s3cmd is borked for “requester pays” datasets. It needs patches as described here: http://arxiv.org/help/bulk_data_s3

The instructions basically say add the following lines to S3/S3.py:

        if self.s3.config.extra_headers:
          self.headers.update(self.s3.config.extra_headers)

after:

class S3Request(object):
    def __init__(self, s3, method_string, resource, headers, params = {}):
        self.s3 = s3
        self.headers = SortedDict(headers or {}, ignore_case = True)

Then install it:

python setup.py install
s3cmd --configure

In the AWS management console, go to the top right where your name is, select “Security Credentials”, get your access key and secret key, and enter them into s3cmd. For the other options you can accept the defaults. You can then access the dataset.

List the bucket:

s3cmd ls --add-header=x-amz-request-payer:requester s3://aws-publicdatasets/common-crawl/crawl-002

Fetch a file:

s3cmd get --add-header=x-amz-request-payer:requester s3://aws-publicdatasets/common-crawl/crawl-002/2010/01/06/1/1262851198963_1.arc.gz

Once you’ve fetched a file you can decompress it as follows:

gunzip -c 1262851198963_1.arc.gz > text

Simple histogram in python,matplotlib (no display, write to png)

Reads values from a file called p, uses 10000 bins, and filters out values < -10000. The commented-out axis line would set a range of -10000 to 3500000 with a max count of 20.

[sourcecode language="python"]
#!/usr/bin/env python
import numpy as np
import matplotlib as mpl
mpl.use('Agg')  # must be selected before importing pyplot; no display needed
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt

inp = open("p", "r")
x = []
for line in inp.readlines():
    if int(line) > -10000:
        x.append(int(line))
print x

# the histogram of the data
n, bins, patches = plt.hist(x, 10000, normed=0, facecolor='green')
print bins
print n

# add a 'best fit' line
#y = mlab.normpdf(bins, mu, sigma)
#l = plt.plot(bins, y, 'r--', linewidth=1)

plt.xlabel('Position')
plt.ylabel('Population')
plt.title('My data')
#plt.axis([-10000, 3500000, 0, 20])
plt.grid(True)
plt.savefig('histogram.png')
[/sourcecode]