Scripts to download SARS-CoV-2 replacements
I wanted to download a set of mutations in SARS-CoV-2. CoV-GLUE seems to be a reasonable database of these mutations. However, the web interface doesn’t seem to have an option to download a dataset, and there isn’t a published API. So I threw together some ugly bash/awk to get what I wanted. I don’t imagine this will work for long, as the website appears to be under active development. But here are my notes anyway.
The website works off an (undocumented?) JSON API. I used the following JSON template to get replacements (non-synonymous substitutions) which occur in 2 or more sequences:
{"multi-render":{"tableName":"cov_replacement","allObjects":false,"whereClause":"(true) and (((num_seqs >= 2)))","rendererModuleName":"covListReplacementsRenderer","pageSize":500,"fetchLimit":500,"fetchOffset":FETCHOFFSET,"sortProperties":"-num_seqs,+variation.featureLoc.feature.name,+codon_label_int,+replacement_aa"}}
The above goes in a file called templ. I then just modify “FETCHOFFSET” using sed and download the first 4500 mutations (at the time of writing there are 4000-odd mutations). You’d want to stick this all in a loop… but I didn’t bother:
rm all
rm index.html;cp templ c;sed -i 's/FETCHOFFSET/0/g' c;wget http://cov-glue.cvr.gla.ac.uk/gluetools-ws/project/cov/ --post-file=./c --header "Cookie: _cle=accepted" --header "Content-Type: application/json";cat index.html >> all
rm index.html;cp templ c;sed -i 's/FETCHOFFSET/500/g' c;wget http://cov-glue.cvr.gla.ac.uk/gluetools-ws/project/cov/ --post-file=./c --header "Cookie: _cle=accepted" --header "Content-Type: application/json";cat index.html >> all
rm index.html;cp templ c;sed -i 's/FETCHOFFSET/1000/g' c;wget http://cov-glue.cvr.gla.ac.uk/gluetools-ws/project/cov/ --post-file=./c --header "Cookie: _cle=accepted" --header "Content-Type: application/json";cat index.html >> all
rm index.html;cp templ c;sed -i 's/FETCHOFFSET/1500/g' c;wget http://cov-glue.cvr.gla.ac.uk/gluetools-ws/project/cov/ --post-file=./c --header "Cookie: _cle=accepted" --header "Content-Type: application/json";cat index.html >> all
rm index.html;cp templ c;sed -i 's/FETCHOFFSET/2000/g' c;wget http://cov-glue.cvr.gla.ac.uk/gluetools-ws/project/cov/ --post-file=./c --header "Cookie: _cle=accepted" --header "Content-Type: application/json";cat index.html >> all
rm index.html;cp templ c;sed -i 's/FETCHOFFSET/2500/g' c;wget http://cov-glue.cvr.gla.ac.uk/gluetools-ws/project/cov/ --post-file=./c --header "Cookie: _cle=accepted" --header "Content-Type: application/json";cat index.html >> all
rm index.html;cp templ c;sed -i 's/FETCHOFFSET/3000/g' c;wget http://cov-glue.cvr.gla.ac.uk/gluetools-ws/project/cov/ --post-file=./c --header "Cookie: _cle=accepted" --header "Content-Type: application/json";cat index.html >> all
rm index.html;cp templ c;sed -i 's/FETCHOFFSET/3500/g' c;wget http://cov-glue.cvr.gla.ac.uk/gluetools-ws/project/cov/ --post-file=./c --header "Cookie: _cle=accepted" --header "Content-Type: application/json";cat index.html >> all
rm index.html;cp templ c;sed -i 's/FETCHOFFSET/4000/g' c;wget http://cov-glue.cvr.gla.ac.uk/gluetools-ws/project/cov/ --post-file=./c --header "Cookie: _cle=accepted" --header "Content-Type: application/json";cat index.html >> all
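For what it’s worth, here is a rough sketch of how the same paging could be written as a loop. The URL, headers, and the FETCHOFFSET placeholder are the ones from the commands above; make_request and fetch_all are helper names I made up for illustration, and I’m assuming a wget that supports -q and -O - (write the response to stdout) so there is no index.html to clean up between requests.

```shell
# Sketch only: page through the replacement list 500 rows at a time.
# make_request and fetch_all are hypothetical helper names.

URL="http://cov-glue.cvr.gla.ac.uk/gluetools-ws/project/cov/"

# Print the request body for one page: the template with FETCHOFFSET filled in.
make_request() {
    local template=$1 offset=$2
    sed "s/FETCHOFFSET/${offset}/g" "$template"
}

# Request offsets 0, 500, ... up to and including $2,
# appending every response to the file "all".
fetch_all() {
    local template=$1 last=$2 offset
    : > all                      # start with an empty output file
    for offset in $(seq 0 500 "$last"); do
        make_request "$template" "$offset" > c
        wget -q -O - "$URL" --post-file=./c \
            --header "Cookie: _cle=accepted" \
            --header "Content-Type: application/json" >> all
    done
}

# Usage (same range as the nine commands above):
#   fetch_all templ 4000
```

The last offset would need bumping as the database grows past 4500 entries.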
Then we extract all the mutation IDs:
awk 'BEGIN{RS="id\":\"";FS="\""}{print $1}' all > allmuts
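Two caveats on the awk line, as far as I can tell: a multi-character RS relies on an awk that supports it (gawk does), and the first record is whatever precedes the first id":" in the file, so the first line of allmuts is not an ID. A possible alternative, sketched below, uses grep -o and cut instead. extract_ids is a made-up helper name, and it assumes the responses contain "id":"…" pairs with no whitespace around the colon, which is the same assumption the awk line makes. It is slightly stricter than the awk version, since it only matches a key that is exactly "id".

```shell
# Sketch only: print the value of every "id":"..." pair found in a file.
# extract_ids is a hypothetical helper name.
extract_ids() {
    # grep -o prints each match on its own line, e.g. "id":"VALUE";
    # splitting that on double quotes puts VALUE in field 4.
    grep -o '"id":"[^"]*"' "$1" | cut -d'"' -f4
}

# Usage:
#   extract_ids all > allmuts
```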
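And then fetch them from the server. The awk line below (commented out here) writes one wget command per ID to allmuts.get; running those commands creates 4000-odd files, which we can then parse further: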
#awk '{print "wget --header \"Cookie: _cle=accepted\" --header \"Content-Type: application/json\" --post-file=./info.json http://cov-glue.cvr.gla.ac.uk/gluetools-ws/project/cov/custom-table-row/cov_replacement/" $1}' allmuts > allmuts.get
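Instead of generating a file of wget commands, the IDs could also be fetched directly in a loop. This is only a sketch: row_url and fetch_rows are names I invented, the URL and headers are copied from the command above, and it assumes that info.json (the POST body used above) already exists in the current directory. Like the original, it lets wget pick the output file names.

```shell
# Sketch only: fetch one row per ID listed in a file.
# row_url and fetch_rows are hypothetical helper names.

BASE="http://cov-glue.cvr.gla.ac.uk/gluetools-ws/project/cov/custom-table-row/cov_replacement"

# Build the per-row URL for one mutation ID.
row_url() {
    printf '%s/%s\n' "$BASE" "$1"
}

# Read IDs line by line from the file in $1 and request each row.
# Assumes ./info.json exists, as in the command above.
fetch_rows() {
    local ids=$1 id
    while IFS= read -r id; do
        [ -n "$id" ] || continue          # skip blank lines
        wget -q --header "Cookie: _cle=accepted" \
             --header "Content-Type: application/json" \
             --post-file=./info.json "$(row_url "$id")"
    done < "$ids"
}

# Usage:
#   fetch_rows allmuts
```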
The CoV-GLUE database seems like a great resource. I hope they add a feature to download sequences/results soon. I’ve seen database results presented in a few preprints. It would be nice if those papers could also include the raw data, otherwise they’re unfortunately going to end up being difficult to replicate…