An Encoding And Correction Approach for DNA Data Storage

I’ve previously noted that there’s significant interest in using DNA as a data storage medium. In my previous post, I discussed a correction/selective amplification approach which might help remove errors on error-prone synthesis platforms.

In this post I consider an encoding and selective amplification approach that might work particularly well for DNA data storage.

In this approach only a subset of the bases is used to encode information. The remaining bases are used to provide synchronisation points.

For example, we might use the bases A, T, and C to encode information. G would be used as a synchronisation base. We might, for example, have 9 bases of information followed by a synchronisation “G” [1].
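As a rough illustration of the layout described above, here is a minimal Python sketch. The digit-to-base mapping, the choice of 6 ternary digits per byte, and the function names are all assumptions made for the example; the post itself does not fix a particular encoding. The sync base is placed between 9-base blocks, so a strand ends on an information block.

```python
# Hypothetical mapping from ternary digits to information bases (an assumption).
DIGIT_TO_BASE = {0: "A", 1: "T", 2: "C"}
SYNC_BASE = "G"        # reserved for synchronisation, never carries data
BLOCK_SIZE = 9         # information bases between sync bases
TRITS_PER_BYTE = 6     # 3**6 = 729 >= 256, so 6 ternary digits cover one byte


def byte_to_trits(value):
    """Return one byte as 6 base-3 digits, most significant first."""
    trits = []
    for _ in range(TRITS_PER_BYTE):
        value, digit = divmod(value, 3)
        trits.append(digit)
    return trits[::-1]


def encode(data):
    """Encode bytes as A/T/C bases with a G between every 9-base block."""
    bases = "".join(DIGIT_TO_BASE[t] for b in data for t in byte_to_trits(b))
    blocks = [bases[i:i + BLOCK_SIZE] for i in range(0, len(bases), BLOCK_SIZE)]
    return SYNC_BASE.join(blocks)


def decode(strand):
    """Drop the sync bases and rebuild the original bytes."""
    base_to_digit = {base: digit for digit, base in DIGIT_TO_BASE.items()}
    trits = [base_to_digit[b] for b in strand if b != SYNC_BASE]
    out = bytearray()
    for i in range(0, len(trits), TRITS_PER_BYTE):
        value = 0
        for t in trits[i:i + TRITS_PER_BYTE]:
            value = value * 3 + t
        out.append(value)
    return bytes(out)


strand = encode(b"Hi")
print(strand)           # AACCAAATAGCCA
print(decode(strand))   # b'Hi'
```

Because G never appears in the payload, every G in a read strand can be treated as a block boundary.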

We can see how this could work by way of the following example:

True sequence:
0123456789012345678901234567890123456789
TACTACTATCGTCATCATCTGCTAATCATTGACTTTACTA

Our synchronisation “G”s will allow us to selectively amplify those synthesized strands matching the “true” (desired) sequence which do not contain insertion errors.

For example, the following strand contains an insertion error at position 7 (an extra C). We would use the technique previously described: a normal polymerase performs stepwise incorporation as bases are flowed in the “true”/desired order. The last line shows the complementary strand synthesized on the errored template.

Error at position 7:
01234567890123456789012345678901234567890
TACTACTCATCGTCATCATCTGCTAATCATTGACTTTACTA
ATGATGAGTAGCAGTAGTA

The presence of regular synchronisation “G”s makes it harder for an errored strand to advance during stepwise synthesis. Once out of sync, the strand may have to wait for up to 9 bases to flow through the system before it can advance again.
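The stalling behaviour can be checked with a small simulation. This is only a sketch under simplifying assumptions: one flow per base of the desired sequence, a polymerase that incorporates every matching base in a flow (so homopolymers extend in one step), no misincorporation, and strand orientation ignored as in the example above. The function names are made up for the illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(seq):
    """Base-by-base complement (orientation is ignored in this sketch)."""
    return "".join(COMPLEMENT[base] for base in seq)


def stepwise_extend(template, desired):
    """Return the strand synthesized on `template` when nucleotides are
    flowed one at a time in the order needed to copy `desired`."""
    needed = complement(template)    # bases the growing strand must add
    pos = 0
    for flow in complement(desired):
        # The polymerase extends only while the flowed base pairs with
        # the next template position; otherwise this flow does nothing.
        while pos < len(needed) and needed[pos] == flow:
            pos += 1
    return needed[:pos]


TRUE = "TACTACTATCGTCATCATCTGCTAATCATTGACTTTACTA"
ERRORED = "TACTACTCATCGTCATCATCTGCTAATCATTGACTTTACTA"

print(len(stepwise_extend(TRUE, TRUE)))   # 40
print(stepwise_extend(ERRORED, TRUE))     # ATGATGAGTAGCAGTAGTA
```

Under these assumptions the true strand is copied in full, while the strand with the insertion reaches only the 19 bases shown in the example above.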

As previously noted, this scheme can be used to selectively amplify strands without insertion errors (between rounds of melting). The amplification scheme could be applied at regular intervals to remove errored strands from the system.

This amplification scheme does not help with deletion errors. These are possibly less critical here, as they appear as a length error (which may be eliminated through size selection). The most critical errors may be a combination of insertions and deletions which results in strands of the same length as our desired strand. This scheme could help remove these.
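To make the distinction concrete, the sketch below compares a deletion-only strand with a strand carrying one insertion and one deletion. It uses the same simplified flow model (one flow per desired base, all matching bases incorporated per flow, no misincorporation), and the two error strands are constructed purely for illustration.

```python
PAIR = str.maketrans("ATCG", "TAGC")   # complement lookup table


def fully_extends(template, desired):
    """True if flowing bases in the order for `desired` copies all of `template`."""
    needed = template.translate(PAIR)
    pos = 0
    for flow in desired.translate(PAIR):
        while pos < len(needed) and needed[pos] == flow:
            pos += 1
    return pos == len(needed)


desired = "TACTACTATCGTCATCATCTGCTAATCATTGACTTTACTA"

# Illustrative error strands (assumptions, not taken from real data):
deletion_only = desired[:7] + desired[8:]   # base 7 missing, 39 bases
indel = desired[:7] + "C" + desired[7:-1]   # extra C at 7, last base missing, 40 bases

for name, strand in [("desired", desired),
                     ("deletion only", deletion_only),
                     ("insertion + deletion", indel)]:
    print(name, len(strand), fully_extends(strand, desired))
```

In this model the deletion-only strand still extends fully, because the flow for the missing base is simply skipped, but its length differs from the desired 40 bases. The insertion plus deletion strand has the right length yet fails to extend, which is the case where the flow-based selection adds something over size selection.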

Notes

[1] Naturally, different bases and different spacing could be used. Potentially you might want to switch between different sets of bases for encoding information and for synchronisation throughout a strand.

The encoding used could be one of a number of schemes. Of particular interest might be an encoding that minimises the impact of deletion errors with respect to the desired sequence (for example, one that uses longer homopolymers to encode data).