Scripts to download SARS-CoV-2 replacements

I wanted to download a set of mutations in SARS-CoV-2. CoV-GLUE seems to be a reasonable database of mutations in SARS-CoV-2. However, the web interface doesn’t seem to have an option to download a dataset, and there isn’t a published API. So I threw together some ugly bash/awk to get what I wanted. I don’t imagine this will work for long, as the website appears to be under active development. But here are my notes anyway.

The website works off an (undocumented?) JSON API. I used the following JSON template to get replacements (non-synonymous substitutions) which occur in 2 or more sequences:

{"multi-render":{"tableName":"cov_replacement","allObjects":false,"whereClause":"(true) and  (((num_seqs >= 2)))","rendererModuleName":"covListReplacementsRenderer","pageSize":500,"fetchLimit":500,"fetchOffset":FETCHOFFSET,"sortProperties":"-num_seqs,+variation.featureLoc.feature.name,+codon_label_int,+replacement_aa"}}

The above goes in a file called templ. I then just modify “FETCHOFFSET” using sed and download the first 4500 mutations (at the time of writing there are 4000-odd mutations). You’d want to stick this all in a loop… but I didn’t bother:

rm all
rm index.html;cp templ c;sed -i 's/FETCHOFFSET/0/g' c;wget http://cov-glue.cvr.gla.ac.uk/gluetools-ws/project/cov/ --post-file=./c --header "Cookie: _cle=accepted" --header "Content-Type: application/json";cat index.html >> all
rm index.html;cp templ c;sed -i 's/FETCHOFFSET/500/g' c;wget http://cov-glue.cvr.gla.ac.uk/gluetools-ws/project/cov/ --post-file=./c --header "Cookie: _cle=accepted" --header "Content-Type: application/json";cat index.html >> all
rm index.html;cp templ c;sed -i 's/FETCHOFFSET/1000/g' c;wget http://cov-glue.cvr.gla.ac.uk/gluetools-ws/project/cov/ --post-file=./c --header "Cookie: _cle=accepted" --header "Content-Type: application/json";cat index.html >> all
rm index.html;cp templ c;sed -i 's/FETCHOFFSET/1500/g' c;wget http://cov-glue.cvr.gla.ac.uk/gluetools-ws/project/cov/ --post-file=./c --header "Cookie: _cle=accepted" --header "Content-Type: application/json";cat index.html >> all
rm index.html;cp templ c;sed -i 's/FETCHOFFSET/2000/g' c;wget http://cov-glue.cvr.gla.ac.uk/gluetools-ws/project/cov/ --post-file=./c --header "Cookie: _cle=accepted" --header "Content-Type: application/json";cat index.html >> all
rm index.html;cp templ c;sed -i 's/FETCHOFFSET/2500/g' c;wget http://cov-glue.cvr.gla.ac.uk/gluetools-ws/project/cov/ --post-file=./c --header "Cookie: _cle=accepted" --header "Content-Type: application/json";cat index.html >> all
rm index.html;cp templ c;sed -i 's/FETCHOFFSET/3000/g' c;wget http://cov-glue.cvr.gla.ac.uk/gluetools-ws/project/cov/ --post-file=./c --header "Cookie: _cle=accepted" --header "Content-Type: application/json";cat index.html >> all
rm index.html;cp templ c;sed -i 's/FETCHOFFSET/3500/g' c;wget http://cov-glue.cvr.gla.ac.uk/gluetools-ws/project/cov/ --post-file=./c --header "Cookie: _cle=accepted" --header "Content-Type: application/json";cat index.html >> all
rm index.html;cp templ c;sed -i 's/FETCHOFFSET/4000/g' c;wget http://cov-glue.cvr.gla.ac.uk/gluetools-ws/project/cov/ --post-file=./c --header "Cookie: _cle=accepted" --header "Content-Type: application/json";cat index.html >> all
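
For what it’s worth, the loop version might look something like the sketch below. It assumes the same templ file and the same endpoint and headers as the commands above; the function names render_request and fetch_all_pages are just mine, and I’ve swapped the index.html shuffle for wget’s -O - so each response is appended straight to all. I haven’t tried to handle errors or an API change.

```shell
#!/bin/sh
# Sketch only: page through the replacement list 500 rows at a time.
# Assumes the template file "templ" described above is in the current directory.

URL="http://cov-glue.cvr.gla.ac.uk/gluetools-ws/project/cov/"

# Print the template with FETCHOFFSET replaced by the offset given in $1.
# $2 is an optional template path (defaults to ./templ).
render_request() {
    sed "s/FETCHOFFSET/$1/g" "${2:-templ}"
}

# Fetch offsets 0, 500, ..., 4000 and append every response to "all".
fetch_all_pages() {
    rm -f all
    offset=0
    while [ "$offset" -le 4000 ]; do
        render_request "$offset" > c
        wget -q -O - "$URL" --post-file=./c \
            --header "Cookie: _cle=accepted" \
            --header "Content-Type: application/json" >> all
        offset=$((offset + 500))
    done
}
```

Calling fetch_all_pages then does the same nine requests as above. If the number of mutations grows past 4500, the upper bound in the while loop needs bumping.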


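Then we extract all the mutation IDs:

# Split records on id":" and print everything up to the next quote.
# NR>1 skips the text before the first id; a multi-character RS needs gawk or mawk.
awk 'BEGIN{RS="id\":\"";FS="\""} NR>1{print $1}' all > allmuts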
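
If your awk doesn’t support a multi-character record separator, a grep/cut pipeline is one possible alternative. The sketch below runs on a made-up response fragment: the field names other than id, and the id values themselves, are placeholders, and I’m assuming the real responses carry the IDs as "id":"..." string fields.

```shell
#!/bin/sh
# Hypothetical response fragment; the real CoV-GLUE layout may well differ.
sample='{"rows":[{"id":"example_id_one","n":5},{"id":"example_id_two","n":2}]}'

# Read JSON on stdin and print one id value per line.
extract_ids() {
    # grep -o emits each "id":"value" match on its own line;
    # splitting on double quotes, the value is the 4th field.
    grep -o '"id":"[^"]*"' | cut -d'"' -f4
}

printf '%s\n' "$sample" | extract_ids
```

On the real data this would be extract_ids < all > allmuts. One difference from the awk version: this only matches a key that is exactly "id", whereas splitting on id":" would also split on any longer key that happens to end in id.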

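And then fetch them from the server. The awk below writes one wget command per mutation ID into allmuts.get; running that file (for example with sh allmuts.get) creates 4000-odd files, which we can then parse further. The POST body, info.json, isn’t reproduced in these notes:

awk '{print "wget --header \"Cookie: _cle=accepted\" --header \"Content-Type: application/json\" --post-file=./info.json http://cov-glue.cvr.gla.ac.uk/gluetools-ws/project/cov/custom-table-row/cov_replacement/" $1}' allmuts  > allmuts.get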

The CoV-GLUE database seems like a great resource. I hope they add a feature to download sequences/results soon. I’ve seen database results presented in a few preprints. It would be nice if those papers could also include the raw data, otherwise they’re unfortunately going to end up being difficult to replicate…

Singular Genomics Systems

Company

Singular Genomics Systems was founded in 2016. Pitchbook lists them as having raised 45.5MUSD and as being at Series B. They list Coatue Management, Domain Associates, F-Prime Capital, Revelation Partners and Arch Ventures [1] as investors [3]. They are a member of JLabs [2]. There appear to be ~60 employees listed on LinkedIn.

Technology

It appears that Singular Genomics have licensed technology from Jingyue Ju’s lab [4]. Jingyue Ju’s lab has generated a huge amount of IP around various approaches to DNA sequencing, so this doesn’t really narrow things down very much.

Singular’s patents also describe two different optical sequencing approaches. One is a single molecule “real time” sequencing approach (the closest similar commercial platform would be PacBio). The other is an ensemble approach (with an example showing amplified DNA on beads). I’ll briefly review these two patents below. But the main takeaway is that they appear to be working on an optical approach. I suspect it’s slightly more likely that they are working on an ensemble approach (as these are more common, and easier to get working).

The ensemble approach also shows the closest to what could be real data. So let’s look at this first:

Ensemble Approach

From 20200102609 – Represents the first 10 cycles of four color SBS data for a fragment of the PhiX 174 DNA immobilized on beads in a flow cell. The graph shows fluorescence emission intensity obtained by using a mixture of 4 labeled, blocked dNTPs: dCTP-Bodipy, dTTP-R6G, dATP-AF568, dGTP-AF647. The fluorescence images were taken during the chase step, as dark, blocked dNTPs were being incorporated into any remaining previously unextended complementary DNA strands.

The above figure from [6] shows one innovation they describe on the basic sequencing-by-synthesis approach. Essentially what they’re suggesting is that after flowing in your standard labelled reversible terminators you flow in a “chasing mix”. This chasing mix is a set of all 4 nucleotides, with reversible terminators, but no labels. What this means is that you give all the strands a second chance to incorporate a nucleotide. This “second chance” doesn’t give you any more signal. But it does hopefully mean that unextended strands are kept “in step”.

So-called “phasing” (strands failing to incorporate a base, and getting out of step) is a major source of error in sequencing-by-synthesis. I guess the idea here is that an unlabelled nucleotide might incorporate with better efficiency than a labelled one.

Beyond this, the patent discusses methods of speeding up imaging, potentially by taking images during the “chasing step”. This is interesting in the sense that, where the imaging time would otherwise be wasted, you can use it here to help extend unextended strands, without otherwise altering the signal.

The graph above shows the first 10 cycles of a fragment of PhiX. To properly understand this data it would need normalization, but there doesn’t seem to be much in the way of phasing. This work appears to have been performed on beads. This seems to suggest that they have something up and running. Unfortunately it doesn’t tell us much about their proposed amplification/cluster/polony generation approach. I’d guess they are using a bead based platform to evaluate the chemistry and have other ideas around amplification. But it’s also possible that they are designing a bead based platform (like Ion Torrent/454).
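
To get a feel for why a small per-cycle miss rate matters, here is a toy calculation. It assumes each strand independently fails to extend with probability p in every cycle and never catches up, so the fraction still in step after n cycles is (1 - p)^n. The miss rates are invented for illustration and don’t come from the patent, and in_phase is just a name I made up.

```shell
#!/bin/sh
# Toy phasing model: in_phase P N prints (1 - P)^N, the fraction of strands
# still in step after N cycles if each strand misses a cycle with probability P.
in_phase() {
    LC_ALL=C awk -v p="$1" -v n="$2" \
        'BEGIN { printf "%.3f\n", exp(n * log(1 - p)) }'
}

in_phase 0.01 100     # 1% miss rate per cycle (made-up figure)
in_phase 0.001 100    # same, if a chase step rescued 90% of the misses (made-up)
```

With these numbers roughly 37% of strands are still in step after 100 cycles in the first case and roughly 90% in the second, which is the kind of difference a “second chance” step would be aiming for.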

Single Molecule Approach

Another patent [5] discusses a single molecule approach. In this approach they’re watching a polymerase incorporate nucleotides in real time. Here they’re suggesting detection through FRET. One option appears to be to have a couple of FRET acceptor/donor sites on the polymerase: as the polymerase incorporates a nucleotide, a conformational change occurs and you get a FRET signal. You can also use a label on the nucleotide to then observe incorporation using FRET.

The patent suggests that observation via grating style TIRF is desirable, where the grating could be incorporated into the flowcell. There are a few other bits and pieces of interest in the patent, such as attachment methods, but nothing that looked like an experimental setup or data to me.

Overall these patents don’t give a clear picture of where Singular Genomics is heading. It seems that the approach is likely optical, and I suspect not single molecule based on the lack of experimental data. Will be interesting to see how things develop!

Notes

[1] https://www.archventure.com/portfolio/

[2] https://jlabs.jnjinnovation.com/sites/jlabs/files/JLABSPortfolioSocial.pdf

[3] https://pitchbook.com/profiles/company/226128-25#investors

[4] https://www.cbinsights.com/company/singular-genomics

[5] Single molecule approach: https://patents.google.com/patent/US20180258472A1

[6] http://www.freepatentsonline.com/y2020/0102609.html

In order to decrease SBS cycle times, in embodiments of the present disclosure the identity of distinguishable, blocked dNTP analogue incorporated into the labeled, blocked extension product(s) generated in the sequencing reaction of such cycle is assessed while the sequencing reaction is running, i.e., before the chasing reaction is initiated. In other embodiments, such assessment is conducted during the chasing reaction. In embodiments, such assessment is conducted about less than 60 seconds before termination of the sequencing reaction, about less than 60 seconds before initiation of the chasing reaction, about less than 300 seconds after initiation of the chasing reaction, or about less than 60 to about less than 10 seconds before termination of the chasing reaction. In embodiments, such assessment is conducted substantially simultaneously with initiation of the chasing reaction. In embodiments, such assessment is conducted at the conclusion of or after the chasing reaction.

chasing conditions (i.e., conditions under which an unlabeled, blocked dNTP analogue species can be incorporated into a primed template DNA molecule that was not extended to include a distinguishable, blocked dNTP analogue species), thereby forming the unlabeled, blocked extension product(s).

DNA Sequencing with Simultaneous Imaging and Chase Steps

Provided here is an example of the embodiment of a sequencing-by-synthesis method where the DNA bases were identified during the chase step. In this example, identical DNA fragments derived from the PhiX 174 genome were immobilized on 1 micron beads. The beads were tethered to a glass coverslip which was part of a flow cell. All necessary reagents for SBS were sequentially delivered into the flow cell. At first, four distinguishable, blocked dNTP analogues were presented into the flow cell. Each dNTP was labeled with a different fluorophore as follows: dCTP-Bodipy, dTTP-R6G, dATP-AF568, dGTP-AF647. A sequencing polymerase was used to incorporate these dNTPs into the complementary strand. A small volume of buffer was then used to remove any excess dye-labeled, blocked dNTPs. As a second step, dark, blocked dNTPs were introduced into the flow cell. During this second step, a set of four images was taken, one for each of the colors corresponding to each dye-labeled dNTP, while the dark dNTPs continued to be incorporated into any unextended DNA templates on the bead. The images were obtained using a Nikon microscope, with a 20×0.75 NA objective, and standard filter sets corresponding to each of the dyes. Note that the images were taken simultaneously with the chasing step, at a temperature of 60° C., demonstrating the compatibility of the two processes in terms of reaction conditions. The excess dark, blocked dNTPs were then washed out and a deprotection reagent was brought in. This reagent cleaved the blocking group and the dye from the incorporated dNTPs. The cycle was then repeated as many times as desirable. The results from the first 10 cycles obtained using this method are shown in FIG. 1. Each bar represents the fluorescent signal from the corresponding base. Spectral cross-talk correction, which compensates for the spectral bleed through from one emission channel into another, was applied to these data. No additional corrections have been applied. 
We have shown sequencing read lengths of >75 bases in this manner.


Single Technologies

Company

Single Technologies is based in Stockholm and was founded in 2012 [1]. Crunchbase lists them as having raised 4.7MUSD from JovB Holding, Sciety, and KTH Holdings [3]. There are ~10 employees on LinkedIn. The majority of the co-founders appear to come from an optical background.

Technology

Single have a few patents which largely refer to the optical system. The patents do not explicitly mention sequencing. What follows is based on one of their patents, and I’ll then try and frame this based on what they say on their website.

Single Technologies imaging setup from [2].

In the Single Technologies imaging system [2], the sample sits on a rotating sample holder. Essentially, it appears to be a drum that rotates under the objective lens. They suggest rotation means that the sample is only subjected to constant forces. In a sense, this is similar to the TDI imaging that Illumina does on their instruments. The sample moves at a constant speed under the optical system and you essentially “scan” it. A curved sample surface is, however, a big departure from a traditional flat flowcell moving on an XY/XYZ stage.

The rotation trajectory appears to need to be very well defined, and they suggest using air bearings could help, and talk about precision (between laps) of 100nm. The patent describes the use of confocal microscopy, so rather than using a line scan imager (as in TDI) they will likely be scanning point-by-point. The instrument is called the “Theta” so it seems like Confocal Theta Microscopy may also be a possibility.

From the patent we get the sense that they are innovating around the imaging system. The website more explicitly says that they are looking at single molecule detection. From the patent, and explicit mention of confocal microscopy on the site, I would not expect this imaging configuration to be compatible with “real-time” observation of nucleotide incorporation (PacBio-style).

The website mentions a patterned flowcell, but I didn’t see a patent referring to this. It would be interesting to better understand how this works, particularly in a single molecule context. One of the big issues with Illumina chemistry has been avoiding “mixed clusters”. These are clusters which are formed from more than one template. Because the templates (DNA to be sequenced) randomly attach to the flowcell, there is some probability that two templates are very near each other and form a single “mixed” cluster. On the Genome Analyzer 2, these “mixed clusters” accounted for ~50% of data (from memory), significantly limiting throughput.

If you have a number of wells/sites and are trying to optimize for “single occupancy”, randomly flowing templates into wells limits you to ~36% [4] of wells containing a single template (with many containing none, or multiple templates).

Illumina solved this issue with their “exclusion amplification” chemistry [5]. In this approach, as templates attach they are rapidly amplified and quickly fill the well. This means that there isn’t any space left for a second template to enter a well. My understanding is that with this approach, most of the “mixed clusters” disappear, and you get a dramatic increase in reads.
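
The limit falls out of Poisson statistics. As a sketch, assume templates land in wells independently, so the number per well is Poisson distributed with mean lambda (the average number of templates per well). The fraction of wells with exactly one template is then lambda * exp(-lambda), which peaks at lambda = 1 at 1/e, or about 36.8%. The function name occupancy is mine, purely for illustration.

```shell
#!/bin/sh
# occupancy LAMBDA: Poisson fractions of empty, singly and multiply occupied
# wells when the mean number of templates per well is LAMBDA.
occupancy() {
    LC_ALL=C awk -v lambda="$1" 'BEGIN {
        p0 = exp(-lambda)            # P(k = 0): empty wells
        p1 = lambda * exp(-lambda)   # P(k = 1): single occupancy
        printf "empty=%.3f single=%.3f multiple=%.3f\n", p0, p1, 1 - p0 - p1
    }'
}

occupancy 0.5
occupancy 1      # prints: empty=0.368 single=0.368 multiple=0.264
occupancy 2
```

Loading less than one template per well on average leaves more wells empty, loading more pushes wells into multiple occupancy, and either way the single-occupancy fraction drops below the value at lambda = 1.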

Getting back to Single Technologies. My question is: are you limited to only being able to use 36% of sites/wells? Or is there a single molecule approach to ensuring that each site only contains a single template? This would be quite interesting.

Outside of the optical system there doesn’t seem to be much to say about Single Technologies. They state that their approach “can be applied to almost any fluorescent based sequencing chemistry”, which strongly suggests to me that they don’t have any innovation around chemistry. Looking at their team, I also don’t see employees with the background required to develop a new sequencing chemistry.

So I suspect their plan is to innovate around the optics/flowcell only. Perhaps they can partner with someone else for the chemistry, or be acquired by an existing sequencing company [6], where they would provide a throughput advantage with improved optics/flowcells.

Notes

[1] https://www.singletechnologies.com/about-us

[2] http://www.freepatentsonline.com/y2019/0049382.html

[3] https://www.crunchbase.com/organization/single-technologies#section-funding-rounds

[4] https://books.google.co.jp/books?id=AJm7CwAAQBAJ&pg=PA43&lpg=PA43&dq=poisson+limit++single+occupancy+well&source=bl&ots=b33KVRG8fO&sig=ACfU3U2jGUbwzdGwzAqZ_LTHSiO5DZwJMg&hl=en&sa=X&ved=2ahUKEwj77dej68zpAhVVA4gKHcT0AJUQ6AEwCnoECAcQAQ#v=onepage&q=poisson%20limit%20%20single%20occupancy%20well&f=false

[5] https://www.illumina.com/science/technology/next-generation-sequencing/sequencing-technology/patterned-flow-cells.html

[6] I suspect that, of current players, only Illumina’s and BGI’s chemistry would be compatible. In both cases this would mean removing amplification from their workflow so that they became single molecule approaches. The question then is whether the Single Technologies approach can sufficiently reduce issues around photo-bleaching to make this worth it. Moving to single molecule might help increase read length in some cases (by removing phasing issues), but if it reduces average read length/throughput through photo-bleaching, this might not be worth it.

Website Quotes

Left: DNA fragments located in SINGLE’s patterned flow cells. Middle: the Theta Sequencer. Right: sequencing data generated.

“The sequencer consists of the Worlds fastest scanner, a new type of fluidics adopted for large areas and sequencing chemistry.”

“Theta contains the world’s fastest single molecule-sensitive confocal imaging system”

“Single Technologies is pushing the limits of genomics by combining single molecule imaging, fast large area confocal scanning, grating techniques, fluidics, nanotechnology and a large portion of out of the box thinking. Our technology is being explored by leaders and industry in the genomics field who cares about Big Data generation.”

“Single Technologies was founded by Johan Strömqvist, Bengt Sahlgren, Annika Bolind Bågenholm and Raoul Stubbe in 2012/2013. The origin of the company is a unique combination of PhD research in single molecule imaging and biotechnology at the Royal Institute of Technology and R&D in the fiber optical grating industry by the founders of Proximion in Stockholm, Sweden.”

SCANNER

Theta contains the world’s fastest single molecule-sensitive confocal imaging system. The technology digitizes the samples simply and intuitively at diffraction limited resolution with negligible bleaching. And it’s ludicrously fast, a 15×15 mm area can be scanned in just a few seconds, and a total area of 125×65 mm could be scanned without any compromises.

FLUIDICS

Theta contains a new type of automated fluidics which avoids micro channels, allowing rapid exchange of liquids over large areas and using less reagents, setting new standards for optimized reactions. It is enabled by a combination of Single’s revolutionary scanning technology and unique approaches to effective flow and diffusion.

CHEMISTRY

Theta’s fast large area scanning and automated effective fluidics can be applied to almost any fluorescent based sequencing chemistry, supported by Single’s new patterned surface methods, giving the benefits of speed and capacity compared with other systems, backing the increased need of sequenced data.

Element Biosciences

Company

The company was founded in 2017 [1] and has raised >95MUSD from Fidelity, Venrock, JSCapital, and Foresight Capital (among others?).

The founding team is all ex-Illumina. The CEO (Molly He) was Senior Director of Protein Engineering and Enzymology at Illumina. The co-founding CTO (Michael Previte) was Associate Principal Scientist at Illumina. And co-founder Matt Kellinger was Staff Scientist at Illumina. There are currently 69 employees listed on LinkedIn.

Technology

There’s not much publicly disclosed on Element’s technological approach. However, we can get some idea from patents and job adverts. Job adverts [2] show positions in “Optical and Systems Engineering” and “Image and System Processing”.

They also have a couple of patents published; one refers to “Low non-specific binding supports and formulations for performing solid-phase nucleic acid hybridization and amplification” [3].

Essentially this patent refers to a process for creating a surface which oligos can be attached to. This serves a similar function to the flowcell surface in Illumina sequencing. The surface described appears to be compatible with clonal amplification.

So essentially this looks very similar to the Illumina/Solexa approach: a surface to which oligos are attached, which then undergo cluster growth (clonal amplification) [5]. Sequencing could then occur using sequencing-by-synthesis.

The patent claims better CNR (contrast-to-noise ratio) than other surfaces, and shows a figure comparing various surfaces:

These plots look similar to the pairwise plots you see when comparing the overlapping emissions in Illumina sequencing [4]; in particular, they look similar to the AC pairwise intensity plots. The Reference surface (which appears to be a commercially available surface) looks pretty bad, while the high CNR surface looks broadly similar to what I’ve seen in the past (Genome Analyzer 2 data [4]).

Improved CNR is clearly of benefit. However, in terms of data quality, other factors likely limit accuracy and read length. If improved CNR allows for shorter imaging times/less illumination, this could help with photo-bleaching issues. However, it seems likely that phasing would still be an issue and limit read lengths.

Overall, going by the patent and job adverts, my best guess would be that Element are building an optical (likely not single molecule) sequencing-by-synthesis platform. As such, projected accuracy and read length would be in the same ballpark as Illumina. Illumina’s margins are quite high and there’s certainly room to compete on pricing. Many of the original Solexa patents are expiring, so they probably have freedom to use the same basic clonal amplification/sequencing-by-synthesis approach. I would guess that there are also reversible terminators that are available to them (or they could just do without).

Notes

[1] According to Crunchbase. https://www.crunchbase.com/organization/element-bioscience#section-overview Their website lists an 80.3MUSD round on Jan 9th 2020, and a total of “more than $100 million”. They also list a series A of 15MUSD on July 19th 2019. This would suggest that sometime prior to the series A they raised ~5MUSD (100 - 80.3 - 15 = 4.7), which would make sense as a large seed round.

[2] https://jobs.lever.co/elembio

[3] http://www.freepatentsonline.com/y2020/0149095.html

[4] See https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2734321/ figure 2.

[5] There are a number of “spotty” images in the patent. These look broadly similar to the random cluster images you’d see from Genome Analyzer 2 era Illumina instruments.