An Encoding And Correction Approach for DNA Data Storage

I’ve previously noted that there’s significant interest in using DNA as a data storage medium. In my previous post, I discussed a correction/selective amplification approach which might help remove errored strands from the output of error-prone synthesis platforms.

In this post I consider an encoding and selective amplification approach that might work particularly well for DNA data storage.

In this approach only a subset of the bases is used to encode information. The remaining bases are used to provide synchronisation points.

For example, we might use the bases A, T, and C to encode information, with G used as a synchronisation base. We might, for example, have 9 bases of information followed by a synchronisation “G” [1].
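As a rough sketch of what such an encoding could look like, the Python below maps base-3 digits onto A/T/C and appends a G after every 9 information bases. The digit-to-base mapping and the names `encode`, `decode`, `INFO_BASES` and `BLOCK` are my own assumptions for illustration; any other mapping or spacing would work equally well.

```python
INFO_BASES = "ATC"   # bases that carry data (assumed mapping: 0->A, 1->T, 2->C)
SYNC_BASE = "G"      # reserved synchronisation base
BLOCK = 9            # information bases between synchronisation points


def encode(trits):
    """Map base-3 digits onto A/T/C and append a G after each full block."""
    out = []
    for i, trit in enumerate(trits, start=1):
        out.append(INFO_BASES[trit])
        if i % BLOCK == 0:
            out.append(SYNC_BASE)
    return "".join(out)


def decode(strand):
    """Check that G appears only at sync positions, then recover the digits."""
    trits = []
    for pos, base in enumerate(strand):
        is_sync_slot = (pos + 1) % (BLOCK + 1) == 0
        if is_sync_slot != (base == SYNC_BASE):
            raise ValueError(f"lost synchronisation at position {pos}")
        if not is_sync_slot:
            trits.append(INFO_BASES.index(base))
    return trits


digits = [1, 0, 2] * 6          # 18 base-3 digits = two full blocks
strand = encode(digits)
print(strand)
print(decode(strand) == digits)
```

A strand with an insertion or deletion shifts a G out of its expected slot, so `decode` raises rather than silently returning shifted data.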

We can see how this could work by way of the following example:

True sequence:
0123456789012345678901234567890123456789
TACTACTATCGTCATCATCTGCTAATCATTGACTTTACTA

Our synchronisation “G”s will allow us to selectively amplify those synthesized strands matching the “true” (desired) sequence which do not contain insertion errors.

For example, the following strand contains an insertion error (an extra C) at position 7. We would use the technique previously described: that is, we would use a normal polymerase and perform stepwise incorporation by flowing in bases in the “true”/desired order. The last line shows the truncated complementary strand that results.

Error at position 7:
01234567890123456789012345678901234567890
TACTACTCATCGTCATCATCTGCTAATCATTGACTTTACTA
ATGATGAGTAGCAGTAGTA

The presence of regular synchronisation “G”s makes it harder for an errored strand to advance when undergoing stepwise synthesis, as the strand needs to wait for up to 9 bases to flow through the system until it can start to advance when out of sync.
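The example above can be reproduced with a small simulation. This is only a sketch under simplifying assumptions: one reversibly terminated base is offered per flow, incorporation is perfect when the base pairs, and strand orientation is ignored. The function name `stepwise_copy` is mine.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def stepwise_copy(template, desired):
    """Flow in the complement of `desired`, one terminated base per step.

    A base is incorporated only when it pairs with the next uncopied
    template position; otherwise the strand stalls for that flow.
    """
    product = []
    pos = 0
    for base in desired:
        flowed = COMPLEMENT[base]
        if pos < len(template) and COMPLEMENT[template[pos]] == flowed:
            product.append(flowed)
            pos += 1
    return "".join(product)


true_seq = "TACTACTATCGTCATCATCTGCTAATCATTGACTTTACTA"
errored = true_seq[:7] + "C" + true_seq[7:]   # extra C inserted at position 7

full = stepwise_copy(true_seq, true_seq)
short = stepwise_copy(errored, true_seq)
print(len(full), full)
print(len(short), short)
```

Under these assumptions the correct strand is copied in full (40 bases), while the errored strand only reaches 19 bases. Once out of sync, the copy of the first synchronisation G has to wait for the flow intended for the next one.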

As previously noted, this scheme can be used to selectively amplify strands without insertion errors (between rounds of melting). The amplification scheme could be applied at regular intervals to remove errored strands from the system.

This amplification scheme does not help with deletion errors. These are possibly less critical here, as they appear as a length error (which may be eliminated through size selection). The most critical errors may be a combination of insertions and deletions which result in strands of the same length as our desired strand. This scheme could help remove these.

Notes

[1] Naturally, different bases, and different spacing could be used. Potentially you might want to switch between using different sets of bases to encode information, and for synchronisation throughout a strand.

The encoding used could be one of a number of schemes. Of particular interest might be an encoding that minimises the impact of deletion errors with respect to the desired sequence (for example, one that uses longer homopolymers to encode data).

Using an SBS-like approach to selectively amplify

Today I was pondering the fact that there are DNA synthesis approaches that may result in high error rates.

One significant class of errors is insertions, in particular homopolymer errors. One of the issues with enzymatic DNA synthesis is that, so far, it’s been difficult to incorporate bases with reversible terminators.

One approach could be to limit the number of bases incorporated purely through the concentration present. This is likely to result in a highly errored product, however. Even if your per-base error rate is 5%, after incorporating 100 bases well under 1% of your product will be fully correct (0.95^100 is roughly 0.6%).
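To put numbers on this, here is a quick calculation of the fraction of strands expected to be fully correct at a given length. It assumes errors are independent with a fixed per-base rate, which is a simplification; the function name is mine.

```python
def fraction_fully_correct(per_base_error, length):
    """Probability that every one of `length` incorporations is correct,
    assuming independent errors at a fixed per-base rate."""
    return (1.0 - per_base_error) ** length


for n in (20, 50, 100):
    print(n, round(fraction_fully_correct(0.05, n), 4))
```

At a 5% error rate roughly a third of strands are still fully correct at 20 bases, but the fraction drops to around half a percent by 100 bases.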

If we could selectively amplify only the correct strands, this might give us more utility out of an inefficient/errored synthesis platform.

Let’s say we get some reasonable fraction of fully correct strands at 20 bases [1]. Size selection might be problematic [2] as many errors will be either the same length, or nearly the same length. We assume that insertion errors dominate, and it’s these errors that we’re mostly interested in removing.

One approach might be to selectively amplify, to completion, only those strands which don’t contain insertions. You can do this by step-wise synthesis of a complementary strand, exposing the strand to reversibly terminated [3] bases in the correct order only. The scheme is somewhat similar to sequencing-by-synthesis, but here it is used for selective amplification.

To take an example, say we have attempted to synthesize the sequence CGTCCCTAGTCGACTGACGT. We would expose the synthesized strands to complementary bases in the correct order [4] during stepwise synthesis. This stepwise process would be similar to sequencing-by-synthesis (incorporate, wash/remove, cleave terminators, etc.).

A fully correct strand, or one containing deletions only will incorporate a base at every position. A complete complementary strand will therefore be created.

A strand with an insertion, however, will become out of sync with the correct/desired bases. It will therefore not incorporate a base at every position.

In the example below, we can see how a single insertion error will result in a strand around half the size of the original. Insertion errors are therefore converted to larger fragment size errors (and produce significantly smaller fragments in many cases).

In this example bases are flowed into the pool in the order G,C,A,G etc. and incorporated from the 3′ end of the template.

In the errored strand, bases incorporate correctly for the first six positions (0 to 5). At this point, the synthesis process gets out of sync: an A, T, C, and A fail to incorporate before another G is encountered. The final synthesised strand is roughly half the length of the fully correct template (11 bases rather than 20).

True sequence
   01234567890123456789
3' CGTCCCTAGTCGACTGACGT 5'
5' GCAGGGATCAGCTGACTGCA 3'

Insertion
   012345678901234567890
3' CGTCCCCTAGTCGACTGACGT 5'
5' GCAGGGGATCA           3'
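The same process can be sketched in a few lines of Python. This is an idealised model (one terminated base per flow, perfect incorporation, orientation ignored), and `stepwise_copy` is a name I’ve made up for it. It compares the correct strand, the insertion strand above, and a strand with a single deletion (the deletion position is chosen arbitrarily).

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def stepwise_copy(template, desired):
    """Offer the complement of each desired base in turn; a base is added
    only if it pairs with the next uncopied position of the template."""
    product = []
    pos = 0
    for base in desired:
        flowed = COMPLEMENT[base]
        if pos < len(template) and COMPLEMENT[template[pos]] == flowed:
            product.append(flowed)
            pos += 1
    return "".join(product)


desired = "CGTCCCTAGTCGACTGACGT"
strands = {
    "correct": desired,
    "insertion": desired[:6] + "C" + desired[6:],   # extra C in the C run
    "deletion": desired[:8] + desired[9:],          # base at position 8 removed
}
products = {name: stepwise_copy(t, desired) for name, t in strands.items()}
for name, product in products.items():
    print(f"{name:10s} template={len(strands[name]):2d} "
          f"product={len(product):2d} {product}")
```

In this model the deletion strand misses one flow and then stays in step, so it is copied over its whole (shorter) length, while the insertion strand yields only the 11 base product shown above.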

The process described would most likely need to be performed cyclically (between rounds of melting) to amplify the pool of strands sufficiently. After this selective amplification process, size selection [6] could take place to select the correct strands (or a “more correct” subset). This subset might be used for downstream applications, or as a substrate for further synthesis [5].

This amplification process might remove the most problematic errored strands from the synthesis process [6], as well as potentially allowing us to gain more utility from an errored synthesis process.

Notes

[1] I’m selecting 20 bases to keep the examples simple.

[2] Again, size selection of short fragments is problematic anyway, but this is just an example.

[3] Or maybe without terminators, if you don’t care so much about homopolymer errors and are only interested in removing other insertions.

[4] Appropriately primed, with a normal polymerase suitable for incorporating the bases we are using.

[5] Effectively you might try and “reset” the synthesis process periodically, by removing errors from the pool.

[6] The most problematic errored strands might be those that are the same size as the fully correct template. These strands would need to be the result of at least an insertion and a deletion. The above scheme will not completely amplify these strands, and could therefore help mitigate against this issue.

Eve Biomedical

Today I’m going to briefly review a slightly less well known DNA sequencing company, Eve Biomedical. Check out the complete list of sequencing companies for links to all other posts.

Business

Eve Biomedical is a Californian company (C2934369) founded on 11/13/2006. They received ~1MUSD of SBIR grants, starting in 2013. Crunchbase lists them as having raised 7.7MUSD (from DFJ) most recently in December of 2012. I could find only 3 current employees on LinkedIn, and around 10 previous employees.

Technology

I could find 2 patents assigned to Eve Biomedical [1]. I’m going to focus on the earliest, because it’s the most fascinating to me [1a]. The patent describes a process they call “Rotation-dependent transcriptional sequencing”.

The sequencing approach relies on the fact that as an RNA polymerase incorporates a base, it rotates the template DNA strand. The approach is therefore sequencing-by-synthesis. However, rather than synthesising a complementary strand of DNA (as in Illumina and other approaches), DNA is being transcribed to RNA and we are detecting the incorporation process as it takes place.

How do we detect the rotation of single strands of DNA? Stick a bead on it… An asymmetric magnetic bead is used. This allows us to put the strand under slight tension using a magnetic field (so it doesn’t move around under Brownian motion). The patent shows 2.7 micron beads. It’s slightly surprising to me that a single polymerase/strand is able to move what I assume is a comparatively huge bead (if anyone has dimensions for polymerases I’d be very interested!).

A schematic of the system (from the patent) which I’ve annotated is shown below:

With this basic system in place, there are a couple of questions remaining. The first is: what exactly do these tags look like? The patent describes tags that look like a bigger bead with a smaller one stuck to it. I didn’t look into the construction details, but I’d imagine these can easily be fabricated:

The second question is: how do we use these parts to build a DNA sequencing platform?

If we are able to detect incorporation events, it’s hopefully clear that we can use this system to sequence DNA. One option is to simply flow bases in one at a time, and detect when bases are incorporated. This approach is similar to other single-channel, unterminated sequencing-by-synthesis approaches (Direct Genomics comes to mind; Ion Torrent would be a related bulk approach).

The patent mentions this of course, but focuses on a different approach. Here they supply three of the four bases in equal quantities, and the 4th base in a limiting quantity. This means that every time the polymerase needs to incorporate the limiting base, it pauses and hangs around waiting for one to come along.

Assuming that the polymerase otherwise incorporates at a constant rate, you can use these pauses to detect where that base occurs on a strand.

You then need to perform the sequencing experiment 4 times [6], limiting one base each time. The plot below shows an example trace where slower incorporation indicates the incorporation of a “G”:

 

The patent also shows what appears to be a complete worked through real dataset (the text implies it’s real). In this example, 4 experiments are performed each reaction containing a limited quantity of one base. The rotation traces are captured and then combined to determine the template sequence:

The sequence contains a couple of homopolymers; I guess the incorporation process slows down for twice as long at these points. Overall, it’s difficult to tell exactly how well the process is working from the trace above (or get any idea of what the error rate might be). But if this is proof-of-concept real data, it looks pretty reasonable!
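The four-run, limiting-base scheme described above can be sketched as a toy simulation. Everything here is an assumption for illustration: the dwell times and threshold are arbitrary, the jitter is bounded so pauses never overlap, and each trace is assumed to be already aligned to base positions (in real rotation data that alignment is presumably the hard part). None of the names come from the patent.

```python
import random

BASES = "ACGU"          # RNA bases incorporated by the polymerase
FAST, SLOW = 1.0, 5.0   # assumed mean dwell times (arbitrary units)


def run_trace(transcript, limiting, rng):
    """Dwell time at each position when `limiting` is the scarce nucleotide."""
    trace = []
    for base in transcript:
        mean = SLOW if base == limiting else FAST
        trace.append(mean * rng.uniform(0.8, 1.2))   # bounded jitter
    return trace


def call_sequence(traces, threshold=3.0):
    """Combine one trace per limiting base into a base call per position.

    If exactly one base has no run, positions with no pause are assigned
    to that base (the "3 runs instead of 4" shortcut).
    """
    missing = [b for b in BASES if b not in traces]
    length = len(next(iter(traces.values())))
    calls = []
    for i in range(length):
        paused = [b for b, trace in traces.items() if trace[i] > threshold]
        if len(paused) == 1:
            calls.append(paused[0])
        elif not paused and len(missing) == 1:
            calls.append(missing[0])
        else:
            calls.append("N")   # ambiguous position
    return "".join(calls)


rng = random.Random(0)
transcript = "GGAUCCGAUUAGC"
traces = {b: run_trace(transcript, b, rng) for b in BASES}
three_runs = {b: t for b, t in traces.items() if b != "U"}
print(call_sequence(traces))
print(call_sequence(three_runs))
```

With noise this well behaved both calls recover the transcript. Real traces would need pause detection on a rotation-versus-time signal, and homopolymers would show up as longer pauses rather than separate positions.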

The approach seems attractive because it vastly reduces the number of cycles/washes that would need to be performed (just 4 versus 1 for every base position sequenced).

The Eve Biomedical approach has other advantages too. Because the beads are big, the optical system should be relatively simple and cheap (commodity mobile phone CMOS sensors?). But there are disadvantages too. Because the sequencing is occurring in real time, each run is limited to a single field of view. This might make it harder to scale the platform.

Eve Biomedical have a second patent [1]. On my brief reading, this refers to using the same system: RNA polymerase with a limited quantity of one base. However, in this patent they suggest using it with a nanopore/nanostructure platform. The patent doesn’t seem to include real data/rig images.

The Eve Biomedical approach is really unique, and I’ve not seen anyone else suggest using rotation to detect incorporation. While a very different approach, it reminds me somewhat of the scheme Depixus have presented.

I’d love to see it played out a little more, if only because it’s so different.

Notes

[1] https://patents.google.com/patent/US20180010181A1

[1a] https://patents.google.com/patent/US20120214171A1
“As a consequence of transcription, the RNA polymerase exerts torque on the nucleic acid, which, in turn, manifests itself as rotation of a tag attached to the nucleic acid.”

“Such a method generally includes contacting an RNA polymerase with a target nucleic acid molecule under sequencing conditions, detecting the rotational pattern of the rotation tag, and repeating the contacting and detecting steps a plurality of times.”

“acid molecule comprises a rotation tag. The sequence of the target nucleic acid molecules is based, sequentially, on the presence or absence of a change in the rotational pattern in the presence of the at least one nucleoside triphosphate”

“FIG. 3 are graphs showing two modes of nucleic acid sequencing described herein: Panel A shows an asynchronous, real-time “nucleotide pattern” sequencing strategy, where a limited concentration of a single nucleoside triphosphate (guanine (G) in this Panel) causes the polymerase to pause when incorporating G nucleotides into the nascent strand. Panel B shows asynchronous sequencing strategy, where a “base-by-base” introduction of nucleoside triphosphates results in a continuous decoding of the nucleotide sequence.”

“Rotation-dependent transcriptional sequencing relies upon transcription of target nucleic acid molecules by RNA polymerase. The RNA polymerase is immobilized on a solid surface, and a rotation tag is bound to the target nucleic acid molecules. During transcription, RNA polymerase establishes a transcription bubble in the template nucleic acid that contains within it an RNA:DNA hybrid of approximately 8 bases. As the RNA polymerase advances along the double-stranded nucleic acid template, it must unwind the helix at the leading edge of the bubble and reanneal the strands at the trailing edge. The torque produced as a result of the unwinding of the double-stranded helix results in rotation of the template nucleic acid relative to the RNA polymerase of about 36° per nucleotide incorporated. Therefore, when the RNA polymerase is immobilized on a solid surface and a rotation tag is attached to the template nucleic acid, the rotation of the template nucleic acid can be observed and is indicative of transcriptional activity (i.e., incorporation of a nucleoside triphosphate) by the enzyme.”

[2] https://www.genome.gov/27554929/

[3] ~1MUSD in SBIR grants: https://www.sbir.gov/sbirsearch/detail/671328

[4] Crunchbase lists a total of 7.7MUSD raised, from one disclosed investor DFJ. https://www.crunchbase.com/organization/eve-biomedical

[5] Here is a figure from the patent which, as I understand it, shows magnetic beads going in and out of focus as the field strength is varied. This process isn’t used in sequencing, but shows how the beads/strands can be put under slight tension during the sequencing process.

[6] Strictly speaking 3 times, as you could obviously infer one of the bases as “not being any of the others”. You could also probably use the Cygnus-type “mixtures of bases” error correction schemes with this approach.

Apton Biosystems

Another day, another sequencing company. This time Apton Biosystems. Check out the complete list of sequencing companies for links to all other posts. Unfortunately there’s not much to say about Apton, but this post contains what information seems to be available.

Business

According to their site Apton Biosystems was founded in 2012. SEC filings [1] indicate they raised ~10.5MUSD in 2015. The website lists investors as including Khosla Ventures, Cenova Capital, Samsung Catalyst Fund, and Cowin Venture [5].

From archived copies of their website at the Internet Archive, the site appears to have been updated sometime in 2018 (before April?) to provide more information and explicitly mention sequencing (seemingly as the focus). Existing patents, however, only mention sequencing incidentally (by my reading) and focus on protein detection.

Technology

There’s very little information on the Apton sequencing approach, which I would assume is still in development. Current patents do not refer to a specific sequencing approach (for example, sequencing by synthesis or sequencing by hybridisation), and only mention sequencing as one possible application.

The website is more explicit, saying “Apton Biosystems is developing a high throughput system that can sequence the human genome for $10.” [5]. A number of DNA sequencing focused jobs have also been posted [3] [4].

A recent poster provides a little more information [2]. In particular, it’s clear that they are developing a single molecule optical approach (like Helicos, Direct Genomics, SeqLL), but 4-color (one would assume one dye per base, but see below). They say the approach can give 20nm resolution, and is therefore a super-resolution approach (beating the diffraction limit). However, there’s still no mention of a particular sequencing chemistry, just that SNPs were detected using hybridisation probes and single base extension reactions.

It’s also not clear what super-resolution approach is being used…

In Illumina sequencing, cluster positions have long been identified with sub-pixel resolution (you do a Gaussian fit around the maximum intensity to get a sub-pixel cluster location). As such, it could be said that the Illumina approach can/does have super-resolution qualities.
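To illustrate the idea of sub-pixel localisation (this is a generic textbook trick, not Illumina’s or Apton’s actual algorithm), the sketch below fits a Gaussian through the brightest pixel and its two neighbours in a 1D intensity profile. The function name and the noise-free test spot are my own assumptions.

```python
import math


def subpixel_peak(profile):
    """Estimate a peak position by fitting a Gaussian through the brightest
    pixel and its two neighbours (a parabola in log-intensity).

    Assumes strictly positive intensities and a peak away from the edges.
    """
    i = max(range(1, len(profile) - 1), key=lambda k: profile[k])
    left, mid, right = (math.log(profile[k]) for k in (i - 1, i, i + 1))
    offset = 0.5 * (left - right) / (left - 2.0 * mid + right)
    return i + offset


# Synthetic spot: Gaussian centred between pixels 5 and 6.
true_centre, sigma = 5.3, 1.2
pixels = [100.0 * math.exp(-((x - true_centre) ** 2) / (2 * sigma ** 2))
          for x in range(11)]
estimate = subpixel_peak(pixels)
print(estimate)
```

For a noise-free Gaussian the three-point fit recovers the centre essentially exactly. With real images the achievable precision depends on photon counts and background, which is where claims like "20nm" would need to be checked.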

However, without more details of the sequencing approach used by Apton it’s unclear what advantages this brings.

The poster and patents also mention error correction, but it’s unclear how they would apply this to sequencing. The use of error correction in the platform obviously brings to mind Cygnus’ approach.

I’ll be on the lookout for more patents, and more information as it appears.

Notes

[1] SEC filings appear to indicate they raised in the order of 10.5MUSD in 2015: https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001557398

[2] Poster/talk abstract?

http://cancerres.aacrjournals.org/content/78/13_Supplement/415

“The purpose of this study was to evaluate the system capabilities of a new single-molecule detection platform capable of both genomic and proteomic analysis of cellular pathways using very small amounts of tumor material. The system has 4-color optics, single-fluorophore detection capability, localization of molecules to within a 20nm area, a flow cell with an area of 940 mm² and the ability to detect > 10⁹ molecules on the surface.”

[3] Job posting

https://www.ventureloop.com/ventureloop/jobdetail.php?jobid=893846&utm_source=joraus&utm_campaign=joraus&utm_medium=organic

“Apton Biosystems Inc. is based in Pleasanton, CA.  The company was started in 2012 with the goal of revolutionizing the way cellular processes and pathways are characterized. We are developing an ultra-low-cost and high-throughput sequencing platform based on our 4-color single molecule detection system.  This system was developed for the detection and quantification of multiple analytes (DNA, RNA and Proteins) on the same platform and uses proprietary optical and authentication technologies to push the limits of molecular detection and quantification.”

[4] https://www.glassdoor.com/job-listing/lead-data-analyst-bioinformatic-scientist-apton-biosystems-JV_IC1147390_KO0,41_KE42,58.htm?jl=2740301303

“We are developing an ultra-low-cost and high-throughput sequencing platform based on our 4-color single molecule detection system.”

[5] Website Aptionbio.com.

Old/Demo website? ganaraajassociates.

[6]

https://patents.google.com/?assignee=APTON+BIOSYSTEMS+Inc

https://patents.google.com/patent/US20150330974A1

“In one embodiment, the method is computer implemented. In another embodiment, K is one bit of information per cycle. In other embodiments, K is two bits of information per cycle. K can also be three or more bits of information per cycle.”

Long section describing the use of aptamer probes with a long tail. The aptamers are detected by synthesising a complementary strand, and detecting the associated incorporations using an ISFET sensor (this seems odd in an otherwise optical system).