Manteia Predictive Medicine

Why was Manteia Important?

There are three main components to the sequencing approach used by Illumina, originally developed by Solexa in 2004:

  • Sequencing-by-synthesis (detection of nucleotides incorporated by a polymerase)
  • Reversibly terminated, labelled nucleotides
  • Cluster generation through bridge amplification.

The general concept of sequencing-by-synthesis had been around since at least 1989.

Reversible terminators, in particular the 3′-O-Azidomethyl terminators used by Solexa, were originally reported in 1991. Solexa modified these to incorporate a cleavable fluorescent label.

That leaves one final missing piece: cluster generation. Without an amplification approach, Solexa would have had to use single molecule detection, which was significantly harder in the late 1990s. Even now, single molecule sequencing shows error rates significantly higher than those achievable with clonal approaches.

Solexa acquired this technology from Swiss startup Manteia Predictive Medicine for an estimated 1.5M USD [1]. This is a shockingly low price for a technology now foundational to a billion dollar market… but reports suggest that Manteia owner Serono had lost strategic interest in sequencing, which may explain the low acquisition price.

What was Manteia?

Manteia was a spin out of Serono, founded by Dr Pascal Mayer. They appear to have been pretty forward thinking, and were part of a small crop of next-gen sequencing startups that appeared in the late 1990s (there are now over 40).

Most of the technical information here comes from a couple of presentations on SlideShare [2] [3]. But there’s also ample information in their patent filings.

The basic approach suggested by Manteia is pretty familiar: amplify single molecules on a solid surface, and sequence.

And indeed, the bridge amplification approach described closely matches that presented by Illumina today:

Bridge Amplification – From Manteia Presentation [2]
Bridge Amplification from Illumina.

The images also look very much like those from early Genome Analyzer 1 and 2 instruments. In fact, they look pretty similar to today’s MiSeq images. If anything, the Manteia images look slightly cleaner, reflecting the fact that they used a single channel chemistry, which shows no crosstalk.

Manteia (left) and MiSeq (right) images.

The Manteia sequencing approach was quite different from that used by Solexa, however. While Manteia were also proposing sequencing-by-synthesis, this presentation doesn’t show fluorophores being cleaved. In fact, it looks like the dyes stay attached, and the signal is ever increasing.

Manteia sequencing schematic.

Manteia would therefore have had to detect stepwise increases in signal. The chemistry appears to lack reversible terminators. Homopolymers would therefore cause larger jumps in signal intensity. Precise determination of homopolymer length would probably have been difficult, a problem that still exists in some approaches (Ion Torrent).
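
To make the homopolymer problem concrete, here is a minimal sketch of base calling from a cumulative, never-cleaved signal. It assumes one nucleotide type is flowed at a time (the presentation doesn't confirm the flow scheme), a fixed signal per incorporated dye, and noise that grows with the total signal. All function names and parameters are hypothetical, and this is not Manteia's actual method.

```python
import random

def simulate_cumulative_signal(template, flow_order="ACGT", n_flows=16,
                               unit=1.0, noise_frac=0.0, seed=0):
    """Return the cumulative intensity measured after each flow.

    Assumptions (not from the presentation): one nucleotide type per flow,
    no terminators, so a whole homopolymer run incorporates in one flow,
    and dyes are never cleaved, so the signal only ever goes up.
    """
    rng = random.Random(seed)
    pos, total, signal = 0, 0.0, []
    for i in range(n_flows):
        base = flow_order[i % len(flow_order)]
        run = 0
        while pos < len(template) and template[pos] == base:
            run += 1
            pos += 1
        total += run * unit
        # Noise is assumed proportional to the accumulated signal.
        signal.append(total + rng.gauss(0.0, noise_frac * total))
    return signal

def call_bases(signal, flow_order="ACGT", unit=1.0):
    """Call bases from the step between consecutive cumulative readings."""
    calls, previous = [], 0.0
    for i, value in enumerate(signal):
        run = max(0, round((value - previous) / unit))
        calls.append(flow_order[i % len(flow_order)] * run)
        previous = value
    return "".join(calls)

clean = simulate_cumulative_signal("AACGGGT")
print(call_bases(clean))
noisy = simulate_cumulative_signal("AACGGGT", noise_frac=0.1)
print(call_bases(noisy))
```

With no noise the step sizes recover the run lengths exactly. With noise proportional to the accumulated signal, the uncertainty on each step grows as the read gets longer, which is where both the homopolymer and read length problems come from.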

As with Illumina sequencing, phasing (clusters getting out of sync as they fail to incorporate) would be an issue. But the lack of reversible terminators makes phasing errors harder to compensate for algorithmically.

Photobleaching would affect not only the current base, but the signals from all prior incorporations. This would need to be compensated for and could be another source of error.

Ultimately, an ever increasing “background” intensity would also limit read length.

Overall, these issues seem challenging. Switching this chemistry over to a reversible terminator/cleavable dye approach was likely key to Solexa’s development of a viable sequencing platform.

But Manteia do appear to have generated proof of concept sequencing results:

Manteia sequencing results

Unfortunately, we’ll never know how Manteia might have progressed the platform based on these early sequencing results, as development at Manteia stopped before the platform was fully realized.

What does this mean for Illumina today?

Manteia initially reported cluster generation in 1998. The earliest IP I could find referring to the bridge amplification approach was from September 1998, and expired in 2020.

Illumina is currently using Solexa’s reversible terminator IP to block the sale of MGI instruments in the US. That IP appears to expire in 2024.

After this point, it seems likely that other players will be able to act on this IP and build what are essentially Solexa-style SBS sequencers.

Current Illumina instruments have progressed somewhat. The 2-color sequencing and exclusion amplification IP is still active as far as I’m aware.

However, even without this IP, it’s likely that a respectable Illumina-clone could be assembled based on the original Solexa approach.

It therefore seems likely that new “Solexa-style” approaches will appear in the not too distant future. And at the very least, MGI will be able to sell instruments in the US again.

This broadening of the sequencing market is likely what’s driving Illumina’s push into diagnostic applications. Recent acquisitions of GRAIL and Verinata Health give them a route into clinical diagnostics, where they not only own the core sequencing technology, but the diagnostic application too. This allows them to take a larger share of the profit, and more closely lock in a diagnostic customer base.

Notes/References

[1] Solexa’s 2004 accounts note that “These rights were purchased from Manteia SA under terms of a joint asset purchase agreement with Lynx Therapeutics”. Overall it looks like the Manteia rights cost them somewhere in the region of 1.5M USD, that is unless payments from the recently merged Lynx side of Solexa hide additional payments.

[2] “A very large scale, high throughput and low cost DNA sequencing method based on a new 2-dimensional DNA auto-patterning process”, P. Mayer, L. Farinelli, G. Matton, C. Adessi, G. Turcatti, J.J. Mermod, E. Kawashima, presented at the Fifth International Automation in Mapping and DNA Sequencing Conference, St. Louis (MO, USA), October 7-10, 1998. Invited presentation (P. Mayer). Colonies from the 1998 presentation:

[3] A non-confidential corporate presentation of “Manteia Predictive Médicine” as of September 2003. Presents DNA colony sequencing results, the instrument, and DNA preparation for genotyping.

GeneMind’s Single Molecule Sequencer

It’s been a while since we heard much from Direct Genomics. This Shenzhen based sequencing company was a reboot of early NGS player and Quake spinout Helicos.

The company is also notable for being founded by Jiankui He. He is the scientist responsible for the germ-line genetic editing of 3 human babies, and is reportedly serving a 3 year jail sentence.

One of the last papers referring to the Direct Genomics sequencing platform is this one from 2017, where they discuss the Direct Genomics GenoCare single molecule sequencing platform.

With He’s jail sentence, the future of Direct Genomics seemed uncertain. But it looks like the GenoCare platform still exists, now under development by GeneMind, a Shenzhen based biotech company founded in 2012. A couple of papers appearing this year describe the platform under their stewardship.

In particular, a medRxiv paper from September this year describes a new two color single molecule sequencing approach.

This seemingly uses the same virtual terminators as the original Helicos approach, with the dye and “terminator” sitting on the base:

GenoCare two color sequencing isn’t the same as the Illumina two color approach, and doesn’t appear to provide the same advantages. On GenoCare’s platform two terminators (C and T) are labelled with a green dye, and two others (G and A) with a red dye [1].

This means that for each cycle, 4 images still need to be acquired: one set after flowing in C and A, and another after cleaving the dyes and flowing in T and G [2]:

In contrast to this, Illumina two color sequencing incorporates information from “dark” bases. This means they can take only two images, and perform no intermediate chemistry in their two color approach:

Illumina’s 4,2 and 1 channel chemistries compared. Image from Illumina.

The GenoCare two color approach therefore doesn’t provide any advantage in terms of imaging speed/performance, though the optics are likely cheaper than a four color system. It’s unclear why they don’t use a two color approach similar to Illumina’s, which would allow them to take only two images, with no intermediate cleavage step, on the same optical system. Perhaps there are IP issues here.
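
The difference between the two schemes can be sketched as two decoding tables. This is only an illustration: the GenoCare base-to-dye pairing is my guess from the figures (see note [1]), the Illumina mapping (A in both channels, T green only, C red only, G dark) is as I understand Illumina's description of its 2-channel chemistry, and the function names are made up.

```python
# Assumed dye assignments, for illustration only.
GENOCARE_FLOWS = (
    {"green": "C", "red": "A"},  # first flow of the cycle
    {"green": "T", "red": "G"},  # second flow, after cleaving the dyes
)

ILLUMINA_2CH = {
    (True, True): "A",    # signal in both channels
    (True, False): "T",   # green only
    (False, True): "C",   # red only
    (False, False): "G",  # "dark" base, inferred from the absence of signal
}

def genocare_call(flow_index, green_on, red_on):
    """Call a base from one flow; None means nothing was incorporated."""
    if green_on and red_on:
        raise ValueError("a single molecule should not light up both channels")
    flow = GENOCARE_FLOWS[flow_index]
    if green_on:
        return flow["green"]
    if red_on:
        return flow["red"]
    return None

def illumina_call(green_on, red_on):
    """Call a base from a single pair of images."""
    return ILLUMINA_2CH[(green_on, red_on)]

def images_per_cycle(flows, channels=2):
    """Every flow has to be imaged once in each channel."""
    return flows * channels

print(images_per_cycle(len(GENOCARE_FLOWS)))  # two flows, two channels each
print(images_per_cycle(1))                    # one flow, two channels
```

In the GenoCare-style table a dark result just means "nothing incorporated in this flow", so it carries no base information. In the Illumina-style table dark is itself a base call, which is what removes the need for a second flow.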

The GenoCare optical system uses objective style TIRF and an sCMOS camera. This appears to generate reasonable quality images, showing an SNR of >10. There have been huge advances in single molecule imaging since Helicos and this should simplify imaging/data analysis.

However, this doesn’t seem to have translated into massive improvements in data quality, with the GenoCare system showing mismatch, insertion, and deletion error rates of 0.61%, 1.45%, and 2.76% respectively.

Reads are generally short, but usable, with a lopsided distribution heavy in short reads:

Overall, the GenoCare system has an error rate one to two orders of magnitude higher than the market dominating Illumina approach. Their optical system is likely more expensive (requiring single molecule sensitivity). And read lengths are shorter.

The sole advantage of this platform is that amplification is not required. This may simplify sample preparation, and result in lower bias for certain applications. This is, of course, also the case for other single molecule approaches (PacBio, nanopore).

Overall, it’s difficult to see how this platform could be competitive with Illumina (or the largely similar SBS approach used by MGI)… but perhaps they’ll manage to find a way forward.

Notes

[1] “Two terminators are labeled with a green dye, whose peak fluorescent emission wavelength is 552 nm, while the other two terminators are labeled with a red dye with peak fluorescent emission at 664 nm.”. The paper doesn’t appear to note which dyes are associated with which bases, but this combination seems likely from the figures.

[2] The exact combination isn’t clear from the paper, but they can only image one red and one green labelled terminator per imaging cycle to generate an unambiguous read.

The $15 Genome through reduced reagents

A couple of people have asked me about a new publication claiming a $15 genome. The paper seems to suggest that by reducing reagent costs, the cost of sequencing a genome can be dramatically reduced, by a factor of 10 to 100.

To support this claim the paper suggests that: “SCT has a much thinner reagent layer and unlike flow chips and has no tubing which decreases reagent usage by orders of magnitude. Since the reagent cost accounts for 80-90% of current NGS platforms this decrease in reagent usage of SCT drops the operating costs from about $1000 to about $15 for WGS.”

To support this statement they cite a paper which discusses an academic lab’s sequencing costs on a HiSeq 4000 [1].

The problem, of course, is that the authors don’t have access to Illumina’s cost of goods. They just go by the price Illumina sell their kits for. As such, you can’t use this in any way to justify the claim that reagent costs are 90% of the sequencing cost.

I’ve seen this a number of times from researchers who imagine that reducing the quantity of reagents used will have a dramatic effect on the cost of sequencing.

The truth is, Illumina likely sell consumables at 90% profit. So, if Illumina are selling a $1000 genome, we already know a $100 genome is possible.

Moreover, there’s a lot more than just reagents in Illumina’s consumables costs. Beyond the raw cost of the reagents there’s at least packaging, logistics, and the cost of a nano patterned flow cell.

It’s reasonable to imagine that this nano patterned flow cell is the bulk of Illumina’s consumable cost of goods. And as such using a smaller quantity of reagents likely has minimal effect on costs.
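
As a back-of-the-envelope sketch of this argument: the $1000 price and ~90% margin are the figures used above, while the 20% share of cost of goods attributed to raw reagents is purely a made-up number for illustration (the real split isn't public). The function names are hypothetical.

```python
def cost_of_goods(price, gross_margin):
    """What a genome costs the vendor to supply, given price and margin."""
    return price * (1.0 - gross_margin)

def cost_after_reagent_cut(cogs, reagent_share, reduction_factor):
    """Cost of goods after cutting reagent volume by `reduction_factor`.

    Only the raw reagent part of the cost shrinks; flow cell, packaging
    and logistics are assumed unchanged.
    """
    reagents = cogs * reagent_share
    everything_else = cogs - reagents
    return everything_else + reagents / reduction_factor

cogs = cost_of_goods(price=1000.0, gross_margin=0.90)   # the post's estimate
print(round(cogs, 2))
# Hypothetical: raw reagents are 20% of cost of goods, volume cut 100x.
print(round(cost_after_reagent_cut(cogs, reagent_share=0.20,
                                   reduction_factor=100.0), 2))
```

Under these assumptions a 100x cut in reagent volume takes the cost of goods from about $100 to about $80, nowhere near $15. The reduction only becomes dramatic if raw reagents really do dominate the cost of goods, which is the part the paper doesn't establish.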

What I did like about the paper is that it provides a potential method for simplifying the fluidics system. A simple, compact sequencing platform which just has to flow a single buffer over the flow cell [2] seems like a neat idea. But I don’t expect it to massively impact costs.

Talking of costs, Dante Labs have recently been offering 30x human genomes for 150 euros:

Some have suggested that Dante sell their sequencing at a loss, and make up for it elsewhere… however I suspect the answer is that Dante uses MGI sequencing which they can get far cheaper than Illumina.

Dante don’t specify the technology used, but data quality metrics they’ve previously presented looked very similar to MGI’s current offerings. Illumina have blocked the sale of MGI instruments in the US, and filed suits against them in the EU. This may in part explain why Dante are somewhat vague about who exactly is supplying their sequencing services.

Notes

[1] Some relevant quotes from this paper:

“When cost data were only available for a kit as a whole, kit costs were apportioned equally across all items in the kit.”

“The most expensive item was the HiSeq 4000 sequencing machine (Illumina), which cost £474,373, with an annual maintenance cost of £55,641. This sequencing system requires two consumable kits (a HiSeq 3000/4000 Sequencing by Synthesis [SBS] Kit costing £4207 and a HiSeq 3000/4000 Paired End [PE] Cluster Kit costing £2597), with half of each kit required per case.”

[2] They say not a flow cell but…

QuantumSi’s Protein Sequencing Approach

I’ve previously written about QuantumSi’s DNA sequencing work. Recently, QuantumSi have been promoting their protein sequencing platform. This may be a device that incorporates both DNA and protein sequencing functionality [1]. In this post I’m going to take a look at one of their protein sequencing patents [0] and review the approach.

Expectation Management

Before we dig into the technical details, let’s review the approach at a high level, in the context of DNA sequencing.

The basic process used to sequence proteins can be briefly described as follows:

  1. Isolate single proteins
  2. Attach a label to the terminal amino acid, and detect the label.
  3. Remove a single terminal amino acid.
  4. Go to step 2 to identify the next amino acid.

At a high level this is not unlike single molecule sequencing-by-synthesis, in that monomers are detected sequentially. The difference here is that rather than incorporating monomers, in this approach they are cleaved.

While the basic process is similar to DNA sequencing, building the machinery to sequence proteins is far more complex. In DNA sequencing we have a bunch of tools (proteins) developed by nature which we can harness to develop sequencing approaches.

DNA’s complementary nature provides a simple approach to both amplifying polymers and introducing labels. We have a vast array of proteins that incorporate nucleotides, degrade nucleotides, and modify DNA sequences. For the most part, none of this machinery exists when working with proteins.

As we can’t amplify proteins, we are stuck with single molecule approaches. From DNA sequencing, we’ve seen that this alone limits our accuracy, and in general single molecule approaches have an error rate of >10% whereas amplified approaches (Illumina) have error rates significantly less than 1%.

Not only this, but the “alphabet” of proteins is considerably larger than for DNA sequencing: ~20 standard amino acids versus 4 bases. Beyond the standard amino acids, a number of modified variants also exist, further complicating labeling and identification.

Our baseline expectation is therefore that initial data quality will be worse than DNA sequencing. This may not matter if the applications are compelling, as there’s not as much competition in the protein sequencing space.

Technical Approach

There are two technical approaches described in detail in the patent [2]. Both approaches use single molecule optical detection of a protein under sequencing attached to a surface (one example shows 18% occupancy [3]). The readout system appears to be similar to that mentioned in my previous post.

Sequencing Approach 1 – Label + Cleavage Enzyme

The first approach uses a labeled recognition enzyme. In this approach a (fluorescently) labeled recognition protein is used to detect the terminal amino acid. From what I can tell these proteins don’t bind strongly, so they are transiently binding on and off.

At the same time, a cleavage enzyme is in the mix. The cleavage enzyme, at some appropriately low concentration, comes in and removes terminal amino acids.

Example 6 shows what appears to be experimental data for this process. The example uses ATTO 542 labeled ClpS2 and an aminopeptidase (VPr). The peptide they are attempting to sequence is YAAWAAFADDDWK.

ClpS2 binds to Y, W and F and apparently doesn’t bind to other terminal amino acids.

Two raw traces for this sequencing experiment are shown:

These two figures (20A and 20C) show two independent sequencing runs. Transient binding of ClpS2 to Y, W and F is shown. The experiment starts with the Y exposed as the terminal amino acid. As such we see transient binding of ClpS2 to Y from time point zero. At some point the cleavage enzyme comes in and the “Y” gets chopped off. “A” is now exposed as the terminal amino acid. ClpS2 shows no binding to “A” and we therefore don’t see any binding.

“A” then gets cleaved, exposing yet another “A” (still no binding). This is cleaved, revealing W as the terminal amino acid, and we see transient binding again, and so on.

As far as it goes this is interesting, but it doesn’t really tell us anything about the sequence, which will naturally be runs of “F or W or Y” and not “F or W or Y”. It’s clear from the traces that the length of each binding (or non-binding) period provides little useful information. For example, the time taken to cleave two “A” differs by a factor of 5 between the two experiments.

So in order to determine which of F, W, and Y was detected we need to look more closely at the transient binding. Figures 20B and 20D show histograms of the pulse duration. The variation and average durations seem to differ significantly for each amino acid.

This provides a basic proof of concept for amino acid detection. But it’s quite limited. The issues are as follows:

  • There’s no mechanism for detecting runs of identical amino acids.
  • The demonstration shows detection of 4 out of > 20 amino acids.
  • While the durations are somewhat distinct for these amino acids, it seems likely that 20 such distributions would show significant overlap.
  • The sequence is quite short (so we don’t know how long we can go without damaging the protein/other issues occurring).

Table 1 of the patent lists 33 amino acid recognition proteins, with 13 different binding patterns [4]. These appear to cover 16 of the >20 amino acids ambiguously. A combination of these recognition proteins (and ideally a few more) might bring you closer to a full sequencing approach.

But overall, this work seems to provide a basic proof of concept of the detection process. With some additional work I could see this working as a protein fingerprinting technology, where an ambiguous protein sequence is compared against a database of known protein sequences. In the above example the fingerprint might be something like:

“One or more Y”, “One or more not F,W,Y”, “One or more W”, “One or more not F,W,Y”, “One or more F”
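
A minimal sketch of how such a fingerprint could be generated and matched against a database. It assumes the recognizer reports which of F, W or Y is at the terminus (via pulse durations) but cannot count repeats. The function names and the tiny database are hypothetical, not anything from the patent.

```python
from itertools import groupby

def fingerprint(peptide, recognized="FWY"):
    """Collapse a peptide (read from the cleaved terminus) into a fingerprint.

    Recognized residues keep their letter, everything else becomes "-",
    and runs of identical symbols collapse to one, since run lengths
    can't be determined from the traces.
    """
    symbols = (aa if aa in recognized else "-" for aa in peptide)
    return [symbol for symbol, _ in groupby(symbols)]

def candidates(observed, database, recognized="FWY"):
    """Names of database entries whose fingerprint matches the observation."""
    return [name for name, seq in database.items()
            if fingerprint(seq, recognized) == observed]

# Hypothetical database: the Example 6 peptide, a point mutant, and a decoy.
database = {
    "example6": "YAAWAAFADDDWK",
    "point_mutant": "YAGWAAFADDDWK",  # A -> G, invisible to ClpS2
    "decoy": "WAAYAAFADDDWK",
}

observed = fingerprint("YAAWAAFADDDWK")
print(observed)
print(candidates(observed, database))
```

The point mutant produces the same fingerprint as the original peptide, which is the single point mutation problem in miniature. Passing a longer `recognized` string stands in for adding more recognition proteins.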

With a long enough sequence, this may unambiguously identify a particular protein/class of proteins. However, single point mutations might be more challenging.

Using additional recognition proteins would improve this fingerprinting process. Progressing this toward a sequencing platform seems challenging, and might require development of the technical approach.

Sequencing Approach 2 – Labeled Amino acid specific cleavage enzyme

Some data is also shown for a second approach. Here the cleavage enzyme is specific to certain amino acids. The demonstration shows that they have a method for incorporating labels into these exopeptidases [5]. But there doesn’t seem to be any data demonstrating sequencing using this approach.

A list of amino acid specific exopeptidases is provided [6]. However these seem to be far more limited than the recognition enzymes in approach 1. Only three types of specific exopeptidases are listed: those specific to Glu/Asp, Met or Proline.

This approach therefore seems more challenging and less developed.

Summary

Overall this seems like an interesting approach to a problem that hasn’t received that much attention. I’d expect the initial platform to be nearer to a “protein fingerprinting technology” than a full protein sequencing instrument. This seems like an interesting tool in its own right. If the initial instrument is framed as a protein sequencing platform, I would expect the error rate to be far higher than we’re used to seeing for DNA sequencing (probably on the order of >20%).

However, all this speculation is based on a single patent. They may have developed beyond this patent and it will be interesting to see what is finally released.

Notes

That’s it, you can stop reading now…

Ok, well… this section contains a few other notes from the patent. It doesn’t really add much to the discussion above, but they are here in part for my own reference. The footnotes below also support some of the assertions in the text above and may be of interest.

As is often the case, this patent mentions a number of other approaches which could be used. The patent discusses nanopore readout briefly, indicating that the protein being sequenced could be immobilized on a nanopore. Recognition molecules (labels) are then detected through changes in conductance of the nanopore.

Other sections refer to “conductivity labels”.

Shielding elements. This is essentially a protein (or other element) that shields the recognition molecule from photo damage. Shield proteins are used in other single molecule sequencing approaches, and a number of methods are presented in the patent.

Other approaches to removing terminal amino acids are mentioned… Edman degradation, phenyl isothiocyanate.

Various other detection methods are briefly discussed, for example aptamers.

While I’ve not mentioned it in the text above there are some other nice plots of the pulse duration differences for ClpS2 in example 5:

Example 5: ClpS2 as recognition label. ClpS2 is labeled with dye. Single molecule intensity traces shown in figure 19B.

Similar plots are shown for ClpS, and ClpS1.

Footnotes

[0] https://www.freepatentsonline.com/y2020/0209257.html US20200209257A1

[1] “In some aspects, the application relates to the discovery of polypeptide sequencing techniques that allow both genomic and proteomic analyses to be performed using the same sequencing instrument.”

“Such strategies may require modification of an existing analytic instrument, such as a nucleic acid sequencing instrument, which may not be equipped with a flow cell or similar apparatus capable of reagent cycling. The inventors have recognized and appreciated that certain polypeptide sequencing techniques of the application do not require iterative reagent cycling, thereby permitting the use of existing instruments without significant modifications which might increase instrument size.”

[2] As always, a number of other approaches are also mentioned. But these are the approaches that have the best supporting data. I review some of the alternative approaches at the end of the post.

[3] One example shows proteins attached using a DNA linker, with 18% single protein occupancy. This seems lower than the Poisson limit. Only wells containing a single protein will be sequenceable, so this limits throughput. Example 2 (page 127).
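
For reference, a small sketch of what "Poisson" means here, assuming proteins land in wells independently so that the number per well is Poisson distributed (the patent example may not load this way). The function names are my own.

```python
import math

def single_occupancy(mean_per_well):
    """P(exactly one molecule in a well) under Poisson loading."""
    return mean_per_well * math.exp(-mean_per_well)

def best_single_occupancy(steps=1000, max_mean=5.0):
    """Scan loading densities; return (mean, fraction) at the optimum."""
    means = [max_mean * i / steps for i in range(1, steps + 1)]
    best = max(means, key=single_occupancy)
    return best, single_occupancy(best)

mean, fraction = best_single_occupancy()
print(mean, round(fraction, 3))
```

The best case under Poisson loading is 1/e, or roughly 37% of wells singly occupied, at an average of one molecule per well. The 18% reported is about half of that.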

[4] Amino acid recognition proteins. Table 1. Lists 33 amino acid recognition proteins and their preferred binding. Many of these appear to prefer the same amino acids. There are therefore 13 different types of “preferred binding” listed: FWY: 4, FWYL: 6, FWYLVI: 1, phospho-Y: 5, FWYLI: 1, KR: 1, DE: 1, KRH: 3, P: 2, KRHWFY: 5, PMV: 1, G: 2, A: 1.
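
The counts quoted in the text can be checked with a quick tally of this list. The dictionary is just my transcription of the numbers above, and "pY" is my own shorthand for the phosphorylated tyrosine entry.

```python
# Binding pattern -> number of recognition proteins, transcribed from the list above.
PATTERN_COUNTS = {
    "FWY": 4, "FWYL": 6, "FWYLVI": 1, "pY": 5, "FWYLI": 1,
    "KR": 1, "DE": 1, "KRH": 3, "P": 2, "KRHWFY": 5,
    "PMV": 1, "G": 2, "A": 1,
}

total_proteins = sum(PATTERN_COUNTS.values())
n_patterns = len(PATTERN_COUNTS)

# Standard amino acids that appear in at least one binding pattern.
standard_covered = set()
for pattern in PATTERN_COUNTS:
    if pattern != "pY":  # the modified residue is counted separately
        standard_covered.update(pattern)

n_targets = len(standard_covered) + 1  # plus phosphorylated tyrosine

print(total_proteins, n_patterns)
print("".join(sorted(standard_covered)), n_targets)
```

By this tally the 16 covered amino acids are 15 standard residues plus phosphorylated tyrosine.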

[5] They reference the following paper, which uses non-natural amino acids: Chin J. W. et al., J Am Chem Soc. 2002 Aug 7; 124(31):9026-9027.

[6] Aminopeptidases. Table 3 lists aminopeptidases. These should selectively cleave terminal amino acids. There seem to be 3 classes here with limited coverage of the amino acid space (Glu/Asp: 1, Met: 2, Proline: 6). Table 4 provides a much longer list of non-specific aminopeptidases.