Archive for July 2021

Nautilus Biotechnology

This post previously appeared on my substack.

Company Background

Nautilus (originally Ignite Biosciences) was founded in 2016 by Parag Mallick and Sujal Patel. Sujal was previously CEO of storage company Isilon. Isilon storage gained huge popularity in genomics for storing next-gen sequencing data. They had a deal with Illumina at one point, and were probably the easiest way of getting scalable storage up and running. More recently Isilon’s popularity in genomics seems to have waned, with users switching over to cloud-based solutions.

A YouTube interview with Sujal covers Nautilus’ background and how Sujal got involved in biotechnology coming from a tech background. Parag, like many others, was using Isilon’s platform for genomic applications. In his interviews Sujal draws parallels between Nautilus’ proteomics platform and the explosive growth of next-gen DNA sequencing.

However, Sujal also positions Nautilus as most appropriate for Pharma applications. This is very different than next-gen DNA sequencing where Pharma was not an early driver of growth (and still isn’t).

They raised a $76M series B in May 2020. And like seemingly everybody else, are doing a SPAC.

Technology

Nautilus Biotechnology is building a high throughput single molecule protein fingerprinting platform. There are a few other companies doing this (Encodia, QuantumSi, Dreampore, Erisyon (disclosure, I’ve worked with Erisyon in the past, but hold no equity)).

Looking over their patents, there seem to be 3 areas of innovation:

  1. A method for arraying single proteins on a surface.
  2. A method for identifying/fingerprinting proteins.
  3. A method for developing libraries of affinity reagents for use in fingerprinting.

I’ll cover each of these in turn below and then review the complete approach.

Arraying Single Proteins

If you randomly stick proteins (or anything else) to a surface there will be some probability that two or more proteins will be right next to each other. If that happens you won’t be able to resolve the single proteins and a mixed, unusable signal will be generated.

So, random attachment limits throughput. Many platforms run into this problem. In Solexa/Illumina sequencing on the Genome Analyzer, 40% of reads were from “mixed” clusters and were discarded. On Oxford Nanopore’s device you have bilayers/wells with multiple pores which are not easily usable. In general such approaches are “Poisson limited”: in a well-based system this means that at most about 37% of wells will have single occupancy.
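The 37% figure falls out of the Poisson distribution. As a minimal sketch, assuming molecules land in wells independently (the function name and loading values are illustrative only), the single-occupancy fraction peaks at 1/e when the mean loading is one molecule per well:

```python
import math

def occupancy_fraction(k: int, mean_per_well: float) -> float:
    """Poisson probability that a well holds exactly k molecules."""
    return math.exp(-mean_per_well) * mean_per_well ** k / math.factorial(k)

# Illustrative loading densities (mean molecules per well).
for mean_per_well in (0.5, 1.0, 2.0):
    empty = occupancy_fraction(0, mean_per_well)
    single = occupancy_fraction(1, mean_per_well)
    multiple = 1.0 - empty - single
    print(f"mean={mean_per_well}: empty={empty:.3f} "
          f"single={single:.3f} multiple={multiple:.3f}")
```

Loading more lightly wastes wells as empties, and loading more heavily wastes them as mixed sites, which is why patterned or one-site-per-feature approaches are attractive.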

Most platforms attempt to solve this at some point. Illumina introduced patterned flowcells and ExAmp. Genia have worked on a pore insertion approach, to ensure single occupancy.

But in general, it’s not something that makes or breaks a platform. It only limits throughput; it doesn’t affect data quality. So it’s usually addressed in a second generation product.

Nautilus however have been working on this issue for proteomics. Why they are focusing on this at an early stage isn’t clear. But one possibility is that mixed signals are particularly problematic in protein fingerprinting. That is to say, they can’t easily be classified as mixed. This could cause a significant fraction of proteins to be misclassified.

The Nautilus arraying approach works by creating a kind of adapter molecule which they call a SNAP (Structured Nucleic Acid Particle). This is a DNA nanoball (created using rolling circle amplification, similar to MGI in their sequencing platform). But it’s structured such that there’s a single site on the nanoball to which a protein can attach. The advantage here is that a SNAP can be relatively large and sit on a lithographically fabricated site on a surface, likely a few hundred nm across. The result is an array of easily separable sites on a surface, each of which presents a single protein attachment site.

Their patents suggest a number of methods for making SNAPs or similar structures. But the DNA nanoball approach seems like the most obvious and the only one which appears to have experimental support. It looks like they need to do size selection on the nanoballs, which complicates the process somewhat. But they seem to have some images showing single dyes attached to the nanoballs.

How well this might work in a complete platform, with a complex sample, is unclear.

Identifying/Fingerprinting Proteins 

To me the patents relating to this part of the Nautilus approach felt the weakest. Essentially they say use a number of different affinity reagents (aptamers, or antibodies) to generate binding signals. Then use those binding signals to determine which protein is present.

So, you’d flow in one reagent, get a binding signal, flow in a second, get another signal, etc. All this binding information is then compared to a database of known protein binding fingerprints. The patent seems to refer to between 50 and 1000 reagents. In a YouTube interview, Sujal suggested this generates 10 to 20Tb of image data.
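To make the database comparison concrete, here is a toy sketch. It is not Nautilus’ method: the protein names, the six-reagent binary fingerprints and the mismatch threshold are all invented for illustration, and a real system would presumably use probabilistic binding scores rather than a simple Hamming distance.

```python
from typing import Dict, Optional, Sequence

def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Count reagent cycles where two fingerprints disagree."""
    return sum(x != y for x, y in zip(a, b))

def classify(observed: Sequence[int],
             database: Dict[str, Sequence[int]],
             max_mismatches: int = 1) -> Optional[str]:
    """Return the closest database entry, or None if too distant or tied."""
    scored = sorted((hamming(observed, fp), name) for name, fp in database.items())
    best_distance, best_name = scored[0]
    if best_distance > max_mismatches:
        return None  # nothing in the database is close enough
    if len(scored) > 1 and scored[1][0] == best_distance:
        return None  # ambiguous: two entries fit equally well
    return best_name

# Hypothetical database: 1 = reagent bound, 0 = no binding, one slot per cycle.
DATABASE = {
    "protein_A": (1, 0, 1, 1, 0, 0),
    "protein_B": (0, 1, 1, 0, 1, 0),
    "protein_C": (1, 1, 0, 0, 0, 1),
}

print(classify((1, 0, 1, 1, 0, 1), DATABASE))  # one missed/extra binding event
print(classify((0, 0, 0, 0, 0, 0), DATABASE))  # no confident match
```

With more reagents per protein, the fingerprints sit further apart and more binding errors can be tolerated, which is presumably why the patent talks about tens to hundreds of cycles.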

They also suggest that you can use affinity reagents that specifically bind to trimers or some other short motif. This almost begins to look similar to a sequencing-by-hybridisation approach, where you generate short reads and overlap them to recover the original sequence.

The patent I’ve looked at is completely theoretical; all the examples are simulations. The idea itself seems relatively obvious. The claims and specification make a big deal out of the process being iterative. But this doesn’t seem hugely significant to me, and is itself somewhat obvious. The patent has a single claim. This claim is framed in terms of comparing binding measurements against a database. This suggests to me that they’re not seriously looking at sequencing-like applications.

Developing Libraries of Affinity Reagents

A third patent refers to methods of developing affinity reagents for use in the above process. Here they talk about methods for generating aptamers and other affinity reagents. The aptamer generation process seems to be a relatively standard aptamer evolution approach:

They seem to have performed a slightly smarter aptamer selection process than that described in the flowchart above. In this process they create candidate aptamers then sequence them on an Illumina sequencer. This gives them the sequence and location on the flowcell of each aptamer. They then wash a fluorescently labelled protein over the flowcell and measure binding. This gives them a high throughput way of measuring aptamer-protein binding efficiency. I suspect they’re not the first to do this however. The approach will likely mostly be complicated by the fact that Illumina have made it harder to modify the sequencing protocol on recent instruments.

They present some data from this approach, and show binding versus protein concentration for a few aptamers:

However, the aptamers they’ve discovered don’t seem to be covered in the patent. Perhaps there’s an unpublished filing which covers specific aptamers or other affinity reagents in more detail.

Elsewhere in the patent they discuss generating affinity reagents (likely antibodies) that specifically bind to 5mers. They propose doing this by first creating 2mer and 3mer specific affinity reagents and combining them.

The aptamer stuff in this patent is the most convincing, and I suspect they’re working on an aptamer based solution. However, aptamers have a somewhat troubled history, and it seems by no means easy to get an aptamer based platform working well. While the patent is interesting because it discloses some details of their approach, the patent itself doesn’t seem very strong. It originally had 131 claims. 111 of these have been cancelled, leaving a single independent claim.

Conclusion

In summary, they seem to be building an optical single molecule protein fingerprinting platform. Proteins are arrayed on a surface and exposed to a number (perhaps 100s) of fluorescently labeled affinity reagents. These are probably a combination of aptamers and antibodies.

By combining the binding information from all these different reagents, they can produce a unique fingerprint for a single protein. And by comparing this to a database of known proteins they can calculate the abundance of each protein in a sample. Because they’re using an optical approach this should be relatively high throughput. They also have IP on a chip based (QuantumSi/PacBio-like) platform, but to me this seems less scalable…

There are a number of applications for such a platform, but they mostly talk about Pharma (drug design, evaluation), where such an approach would provide a more sensitive method of evaluating the performance of a drug and how it affects protein expression.

For me, the most developed part of the approach is the arraying technology (using SNAPs). But this isn’t really required to get the platform up and running. It also doesn’t create any kind of IP barrier. It helps push throughput, but it isn’t clear to me that it’s of fundamental importance in building a protein fingerprinting platform.

The other parts of the approach (from the published patents) seem less well developed. I’d also note that while Sujal draws parallels with DNA sequencing, this approach is closer to qPCR or DNA microarrays, where you’re comparing detection events against a known database.

I’ll be watching with interest, but at the moment I’m more excited about approaches that provide “de novo” information that’s a little closer to sequencing than fingerprinting.

Twinstrand Biosciences

This post originally appeared on my substack newsletter.

Business

Twinstrand is a University of Washington spinout based in Seattle. Crunchbase lists Twinstrand as being founded in 2015, which is not long after the foundational work was done (in 2012). They’ve recently raised a series B of $50M, bringing their total to $73.2M. I see 62 employees on LinkedIn.

Approach and Applications

The basic play is that Illumina sequencing has an error rate that’s too high for some applications. To me, this was kind of surprising. In Illumina sequencing, around 90% of bases are Q30. That’s an error rate of 1 in 1000. Do you really need an error rate lower than this? Twinstrand propose a number of applications, largely around detecting very low level mutations.

  • Detecting residual acute myeloid leukemia (AML) after treatment.
  • Mutagenesis assays, for chemical and drug safety testing.
  • Cellular Immunotherapy Monitoring

In general, I’m used to seeing plays (like GRAIL) around cancer screening. But this is aimed more at cancer monitoring. The US national cost of cancer care is $150B, and there are around 1.8M cancer cases. So, if we assume that this test will be required for cancer monitoring of every patient, and yields $1000 in profit per patient, that’s $1.8B in profit. Probably enough to support the company, and make investors happy…

But for the Twinstrand play to work, and justify a healthy valuation, at least the following needs to be true:

  1. “ultra-high accuracy” is needed for cancer monitoring.
  2. The Twinstrand approach is a practical method of generating “ultra-high accuracy” reads.
  3. The Twinstrand approach is the only and best way to get “ultra-high accuracy”.

The first may be true, but it’s obviously not what GRAIL and other players have been working on for early stage cancer screening, where the focus has shifted toward base modification/methylation.

As to Twinstrand’s practicality? Hopefully we can gain some insight by reviewing the approach.

Technology

The technique relies on adding two pieces of information to double stranded DNA. The first is a unique index (a UMI) which uniquely identifies each double stranded fragment. The second is a strand-defining element (SDE). This is a marker that allows the two strands forming a double stranded fragment to be distinguished.

Twinstrand use two UMIs, one on each end of the original double stranded fragment. They call these two UMIs “α” and “β” in the figure below.

The Y shaped adapters (labelled Arm 1,2) in the diagram above introduce an asymmetry between the strands. This provides the strand-defining element (SDE) described above.

To make this clearer I decided to break the diagram down further, showing the individual amplification steps involved:

Post amplification, and in 5’ orientation you will get 4 distinct read types as shown above. Each of these can be classified as coming from either the forward or reverse strand of the original dsDNA fragment.

From there it’s obvious that you can use this information to filter out errors that occurred during amplification (including bridge amplification):
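As a rough illustration of that filtering, here is a minimal sketch of duplex consensus calling. It is not Twinstrand’s pipeline: the tag strings, strand labels and reads are made up, reads are assumed to be pre-aligned, equal length and already flipped to a common orientation, and real tools also handle quality scores and minimum family sizes.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Read = Tuple[str, str, str]  # (UMI pair tag, strand label "fwd"/"rev", sequence)

def strand_consensus(seqs: List[str]) -> str:
    """Majority base at each position across copies of one original strand."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*seqs))

def duplex_consensus(reads: Iterable[Read]) -> Dict[str, str]:
    """Keep a base only where both strands of the same fragment agree."""
    families: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for tag, strand, seq in reads:
        families[(tag, strand)].append(seq)
    result = {}
    for tag in {t for t, _ in families}:
        if (tag, "fwd") not in families or (tag, "rev") not in families:
            continue  # no partner strand: fragment cannot be error corrected
        fwd = strand_consensus(families[(tag, "fwd")])
        rev = strand_consensus(families[(tag, "rev")])
        result[tag] = "".join(a if a == b else "N" for a, b in zip(fwd, rev))
    return result

reads = [
    ("AAC-GTT", "fwd", "ACGTAC"),
    ("AAC-GTT", "fwd", "ACGTAC"),
    ("AAC-GTT", "fwd", "ACTTAC"),  # late amplification error, outvoted
    ("AAC-GTT", "rev", "ACGTAC"),
    ("AAC-GTT", "rev", "ACGTAC"),
    ("TTG-CAA", "fwd", "ACGTAC"),
    ("TTG-CAA", "rev", "ACGAAC"),  # strands disagree, position masked as N
    ("CCA-TGA", "fwd", "GGATCC"),  # partner strand never sequenced
]
print(duplex_consensus(reads))
```

An error only survives if it appears in the consensus of both strand families at the same position, which is the property the estimate in the next paragraph relies on.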

For amplification errors to propagate they’d need to occur at the same position, and of the same base. So, I’d assume a ballpark estimate is somewhere around Q60… and their reports include identifying mutation frequency down to a rate of 10^-5.
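The Q60 ballpark can be sanity checked with a crude model. The sketch below assumes each strand independently carries a Q30 error rate and that an error is equally likely to be any of the three wrong bases; both are my simplifying assumptions, not figures from Twinstrand.

```python
import math

def phred_to_error(q: float) -> float:
    """Phred quality score to error probability."""
    return 10 ** (-q / 10)

def error_to_phred(p: float) -> float:
    """Error probability to Phred quality score."""
    return -10 * math.log10(p)

single_strand_error = phred_to_error(30)  # 1 in 1000

# Both strands must be wrong at the same position AND show the same wrong
# base: p for the first strand, then p/3 for the second to match it.
duplex_error = single_strand_error * (single_strand_error / 3)

print(f"single strand: {single_strand_error:.1e} (Q30)")
print(f"duplex:        {duplex_error:.1e} (Q{error_to_phred(duplex_error):.0f})")
```

Under these assumptions the duplex error rate lands a little better than Q60. Damage that affects both strands, or errors introduced before tagging, would not be removed and would set the practical floor.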

Problems

Wow, great! Q60 reads, who wouldn’t want that!

Well the major problem is that you’re going to throw away a lot of throughput. At best you will need to sequence each strand 2 to 4 times. This might be fine if you have an amplification step in your protocol anyway. Much like UMIs the Twinstrand process will just provide additional information removing error and bias.

But unlike UMIs you want to optimize for duplicates. And not just duplicates but duplicating starting material a fixed number of times. I.e. the ideal is probably to see ~4 different sequences for every original fragment of dsDNA (one of each type).

In practice, this is problematic. In their patent they state “3.1% of the tags had a matching partner present in the library, resulting in 2.9 million nucleotides of sequence data”. As far as I can tell the input dataset was 390Mb of sequence data. Processed, corrected reads therefore represent about 0.75% of the input dataset. This is a huge hit to your throughput.
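The yield arithmetic, spelled out. This takes “390Mb” to mean 390 million bases of raw sequence, which is my reading of the patent rather than something it states in those terms:

```python
input_bases = 390e6      # raw sequence data (assumed to be in bases)
corrected_bases = 2.9e6  # duplex-corrected output quoted in the patent

yield_fraction = corrected_bases / input_bases
raw_per_corrected = input_bases / corrected_bases

print(f"corrected yield: {yield_fraction:.2%}")
print(f"raw bases sequenced per corrected base: {raw_per_corrected:.0f}")
```

Put another way, on these numbers you sequence well over a hundred raw bases for every corrected base you keep.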

The above describes the original IP, from ~2012. Most of their patents appear to be based around this basic process. However a patent from 2018 looks like it might be worth digging into in more detail. In this patent it looks like they try to more closely model errors that occur during the sequencing process (incorporating fluorescence intensity information into a two pass basecalling process).

Dreampore – Protein Sizing

This post was originally published on substack.

I came across another patent from the Dreampore folks when investigating the platform, and thought it might be interesting to write up. It’s actually not clear if this patent has been licensed to Dreampore, but 3 of the inventors are listed as working at Dreampore so it seems reasonably likely.

The patent appears to be based around work from a 2018 paper. However the patent itself is interesting to me in part because of the way it’s written and the scope of the claims.

The patent is framed around identifying impurities (protein fragments) in a 96% pure sample. They state that “it is impossible to identify these fragments by classical mass spectrometry or by HPLC”. As in their publications they use aerolysin nanopores. A three chamber device is presented:

It’s not clear to me why you want to use three chambers. They suggest that pure product could be removed from the second chamber… but don’t describe how or why this is useful. They also don’t appear to use a three chamber device in the paper.

They show what appears to be experimental data from translocations of a protein (RR10). While it’s not stated in the patent, it seems clear from the publication this is a 10 amino acid long arginine homopeptide (RRRRRRRRRR). What they show is that they can distinguish between homopeptides of lengths between 10 and 5 amino acids.

The paper further clarifies this showing samples containing single peptide types, as compared to the mixture:

They then take this further, and show relative concentrations of peptides in a sample from a supplier claiming 98% purity:

So, overall they show size determination at the single amino acid level and present an application in the classification of impurities during reagent production. As far as this goes, this is fine.

The patent then describes how this might be used to measure enzymatic activity. The idea is to quantify enzymatic activity on the single cell level. Enzymes are captured on a solid support. You then flow in the enzyme’s substrate and pass the product through a nanopore. This seems like an interesting research problem, but I’m less clear on the market for such a device, and there’s no experimental data.

The patent overall seems quite weak (I don’t blame the authors for this, it seems like it might have been rushed).

The single independent claim in this patent reads:

“The use of an aerolysin nanopore or a nanotube for the electrical detection of peptides, protein separated by at least one amino acid and other macromolecules such as polysaccharides or synthetic or natural polymers present in a preparation where said nanopore or nanotube is inserted into a lipid membrane which is subjected to a difference in potential of over −160 mV, in a reaction medium comprising an alkali metal halide electrolyte solution with a concentration of less than 6M and at a temperature of less than 40° C., and where said use is intended to differentiate said peptides, proteins and other molecules according to their length and their mass.”

Which seems very narrow. And what does over -160mV mean? If I look to the paper I can understand that they term -100mV as greater than -50mV, but it’s obviously not in terms of absolute magnitude. And shouldn’t this be specified in relation to the cis/trans side of the pore? The specification doesn’t provide much clarity here.

There are numerous errors in the patent, for example confusing voltage and current (“voltage of 25pA”), and other statements and typos that make me think the patent was rushed. Which seems like a shame…

In terms of the technical approach itself, I’m not sure if measuring reagent impurities is a compelling market. But being able to detect single amino acid differences seems like a step toward some more interesting applications.

Dreampore – Nanopore Protein Sequencing

This post was originally published on substack.

Company

Dreampore is a French Nanopore Protein sequencing company based in Paris.

There’s not much information available on Dreampore; most of it comes from a Genomeweb review from December 2019. In this article they state that Dreampore has raised €600,000, and that they have four employees. As far as I can tell from their company registration they were founded in 2018 and currently have 3 to 5 employees. According to LinkedIn, the CEO (Luc Lenglet) is also leading two other companies. I could only see one current employee on LinkedIn who appeared to be fully dedicated to the company.

Technology

Surprisingly I wasn’t able to find a patent covering the work presented in their Nature paper on protein sequencing. So this review is based on the publication only.

The work uses a protein nanopore, and detects molecules as they pass through and block a bias current. This is much like other forms of nanopore DNA sequencing. 

The publication builds on a previous work where they detect translocations of >7mer arginine (R) homopeptides. I’ll be covering this in a future post, because it’s kind of interesting in its own right. But essentially these RRRRRRR peptides block the aerolysin pore for a detectable duration. In the sequencing paper they use xRRRRRRR peptides where the x position varies. The poly-arginine region helps the peptide stick around in the pore long enough to be detected. But the idea is that the blockage current varies enough based on the single differing position.

And histograms suggest that in most cases it does:

When you look at the full set of amino acids, current blockages are less well separated:

The plots above use Ib/I0. This appears to be the signal normalized against the baseline current. It’s not super common to do this, and I wonder why they normalize against the baseline, rather than just measuring the offset against the baseline in pA. Possibly their measurements vary significantly with buffer concentration…

The raw ABF files (which suggests measurements were taken on an Axopatch) are available, so it’s possible to confirm this. But the scaling makes the plots a little harder to interpret. From example traces it looks like blockages are probably between 60 and 70pA (they all appear to be between 0.3 and 0.4 in scaled units, and a typical baseline current appears to be ~100pA). So, you’re cramming 20 states into ~10pA. The best you’re likely to do in terms of noise is ~1pA RMS at 10kHz.
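For reference, here is the conversion from scaled units back to pA as a small sketch. It assumes Ib/I0 is the residual current as a fraction of the open-pore baseline (so the blockage depth is the remainder) and that the 20 amino acid levels are evenly spaced; both are my assumptions, as is the 100pA baseline estimated from the traces.

```python
BASELINE_PA = 100.0  # approximate open-pore current read off example traces

def blockade_depth_pa(ratio: float, baseline_pa: float) -> float:
    """Current removed by the blockage, given the Ib/I0 ratio."""
    return baseline_pa * (1.0 - ratio)

deepest = blockade_depth_pa(0.3, BASELINE_PA)
shallowest = blockade_depth_pa(0.4, BASELINE_PA)
window_pa = deepest - shallowest

# 20 evenly spaced levels leave 19 gaps between neighbours.
spacing_pa = window_pa / (20 - 1)

print(f"blockage depth: {shallowest:.0f} to {deepest:.0f} pA")
print(f"window: {window_pa:.0f} pA, mean spacing: {spacing_pa:.2f} pA")
```

So even in the idealized evenly spaced case, neighbouring amino acids are separated by roughly half a picoamp.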

They have a plot in the supplementary information which shows that in practice, they get about 10 pA of peak-to-peak noise on blockages.

From the supplementary information the average dwell time seems to be ~5ms (which, remember, is for 8 amino acids). So, let’s say 1ms per AA. If we average down to 1kHz, we can probably get this to ~1pA of noise.
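A back-of-envelope version of that averaging argument is sketched below. It leans on several assumptions that are mine, not the paper’s: the noise is treated as white and Gaussian, peak-to-peak is taken as roughly six standard deviations, and the 10pA peak-to-peak figure is assumed to have been measured at about 10kHz bandwidth.

```python
import math

def rms_from_peak_to_peak(peak_to_peak_pa: float, sigmas_spanned: float = 6.0) -> float:
    """Rough RMS estimate, treating peak-to-peak as about +/- 3 sigma."""
    return peak_to_peak_pa / sigmas_spanned

def rms_after_averaging(rms_pa: float, samples_averaged: int) -> float:
    """White noise RMS falls with the square root of the samples averaged."""
    return rms_pa / math.sqrt(samples_averaged)

ASSUMED_BANDWIDTH_HZ = 10_000  # assumption: bandwidth of the 10 pA p-p figure
TARGET_BANDWIDTH_HZ = 1_000    # roughly 1 ms per amino acid

rms_raw = rms_from_peak_to_peak(10.0)
samples = ASSUMED_BANDWIDTH_HZ // TARGET_BANDWIDTH_HZ
rms_averaged = rms_after_averaging(rms_raw, samples)

print(f"raw RMS ~{rms_raw:.2f} pA, after averaging ~{rms_averaged:.2f} pA")
```

This lands in the same sub-pA to ~1pA range as the estimate above. Real nanopore noise is not white (there are 1/f and capacitive components), so averaging will generally help less than this suggests.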

It seems likely that if they attempted sequencing, multiple positions would contribute to the signal. Let’s be conservative and say 3 positions. For 20 AAs that means 8000 possible combinations. So I’d speculate this comes down to:

0.001pA difference between each state and 1pA of noise

Which seems like a very hard problem to solve. Certainly one or two orders of magnitude harder than nanopore DNA sequencing.
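The state-count arithmetic behind that estimate, as a short sketch. The even spacing of states across the ~10pA window is an idealization, and the function name is just for illustration:

```python
AMINO_ACIDS = 20

def state_spacing(window_pa: float, positions_in_pore: int):
    """Number of distinct k-mer states and their mean spacing in pA."""
    states = AMINO_ACIDS ** positions_in_pore
    return states, window_pa / states

for positions in (1, 2, 3):
    states, spacing = state_spacing(10.0, positions)
    print(f"{positions} position(s): {states} states, ~{spacing:.5f} pA apart")
```

For three contributing positions this gives 8000 states at a little over 0.001pA apart, against roughly 1pA of noise.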

Conclusion

The positive side of this paper is that they’ve clearly shown differences between most amino acids. In practice, I don’t think these differences are good enough to clearly differentiate between all 20 AAs. But it does indicate that if you had a way of sufficiently slowing the translocation of a protein you might be able to show some kind of characteristic signal.

The remaining problems are however twofold:

  • How do you slow the translocation of proteins sufficiently?
  • How do you deal with contributions from adjacent amino acids?

Both these problems are pretty tough. On the plus side, we likely only need to generate a characteristic fingerprint for a protein to be able to address useful applications. But even to get to that point, the above problems likely need to be addressed.

This paper suggests that with further work, it might just be possible. I’ll be keeping an eye on this and other nanopore protein sequencing approaches, as any kind of usable data from such a platform would be pretty exciting.