An attempt to read Pfam database dumps…
Edit: See update 3 below for the final method I used.
I’ve been trying to extract data from an older release of Pfam. This can be important when trying to replicate the results of a Bioinformatics paper that used a dataset generated from an older Pfam release.
Old releases of Pfam do not appear to be online. The Wayback Machine has mirrors of Pfam pages going back a few years… but this isn’t always helpful.
Pfam have database dumps online e.g.
ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam24.0/database_files/
These are MySQL dumps… however, much of the data appears to have been stored in longblobs. In particular I was looking at the alignments_and_trees table. The SQL dump creates this table. There is then a separate file called alignments_and_trees.txt.gz; the format of this file is unclear to me.
The file is not a compressed text file.
Here are the first few bytes of the compressed file:
00000000: 2731 3135 3730 2709 271f 8b08 0838 5dc3 '11570'.'....8].
00000010: 4a5c 3003 5345 4544 5c30 cd5b 5d73 e336 J\0.SEED\0.[]s.6
00000020: 967d f7af 50d5 bccc d68c d896 2c7f 556a .}..P.......,.Uj
00000030: 1f60 e183 0840 84f8 6868 9017 57a7 e3c4 .`...@..hh..W...
00000040: 5de3 fe98 b633 bbf9 f77b 2e48 4a74 87d9 ]....3...{.HJt..
00000050: dac7 654d a5bb 0508 3c3c 3cf7 de73 01cd ..eM....<<<..s..
00000060: 5f56 31fd b037 ed0f b65b 6d9a f3b3 bffc _V1..7...[m.....
00000070: a792 2bcd 57ab 157f 2b2f b6b7 37c3 5c27 ..+.W...+/..7.\'
00000080: 6c8f 4f7a b9d9 5c5c 5fdc 349b e133 2ee8 l.Oz..\\_.4..3..
00000090: b3af 9f5f 1e3e 7c5a 7dfe 65f5 dba7 7f7e ..._.>|Z}.e....~
A note on the Pfam site refers to various tables containing gzip-compressed blobs. So I adapted some code I found online to search for gzip headers/streams. This found some potential streams but failed to decode them… Here’s the code I used to search for gzip streams:
import zlib
from glob import glob
def zipstreams(filename):
    """Return all zip streams and their positions in file."""
    with open(filename, 'rb') as fh:
        data = fh.read()
    i = 0
    while i < len(data):
        try:
            #zo = zlib.decompressobj(16+zlib.MAX_WBITS)
            zo = zlib.decompressobj(32+zlib.MAX_WBITS)
            yield i, zo.decompress(data[i:])
            i += 1
        except zlib.error:
            i += 1

for filename in glob("inputfile"):
    print(filename)
    for i, data in zipstreams(filename):
        print(i, data)
And here is the data trimmed to the start of the stream:
00000000: 1f8b 0808 385d c34a 5c30 0353 4545 445c ....8].J\0.SEED\
00000010: 30cd 5b5d 73e3 3696 7df7 af50 d5bc ccd6 0.[]s.6.}..P....
00000020: 8cd8 962c 7f55 6a1f 60e1 8308 4084 f868 ...,.Uj.`...@..h
00000030: 6890 1757 a7e3 c45d e3fe 98b6 33bb f9f7 h..W...]....3...
00000040: 7b2e 484a 7487 d9da c765 4da5 bb05 083c {.HJt....eM....<
00000050: 3c3c f7de 7301 cd5f 5631 fdb0 37ed 0fb6 <<..s.._V1..7...
00000060: 5b6d 9af3 b3bf fca7 922b cd57 ab15 7f2b [m.......+.W...+
00000070: 2fb6 b737 c35c 276c 8f4f 7ab9 d95c 5c5f /..7.\'l.Oz..\\_
00000080: dc34 9be1 332e e8b3 af9f 5f1e 3e7c 5a7d .4..3....._.>|Z}
00000090: fe65 f5db a77f 7efa fc5f 9f56 bffc f6e9 .e....~.._.V....
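For reference, the trimming can be done by just slicing at the first gzip magic. A quick sketch (using the same "inputfile" placeholder as the script above; trimmed.bin is simply a name for the result, reused in later sketches):

with open("inputfile", "rb") as fh:
    data = fh.read()

# slice from the first occurrence of the gzip magic (1f 8b) plus CM=08
offset = data.find(b"\x1f\x8b\x08")
with open("trimmed.bin", "wb") as out:
    out.write(data[offset:])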
I also hacked together some code to attempt to decompress every offset using deflate:
import zlib
import io
import sys
from glob import glob
import deflatedecompress
def zipstreams(filename):
    """Return all zip streams and their positions in file."""
    with open(filename, 'rb') as fh:
        data = fh.read()
    i = 0
    while i < len(data):
        try:
            file = io.BytesIO(data[i:])
            bitin = deflatedecompress.BitInputStream(file)
            decomp = deflatedecompress.Decompressor.decompress_to_bytes(bitin)
            yield i, decomp
            i += 1
        except ValueError as e:
            i += 1

for filename in glob("inputfile"):
    print(filename)
    for i, data in zipstreams(filename):
        print(i, data)
        sys.stdout.buffer.write(data)
This uses the deflate implementation from git@github.com:nayuki/Simple-DEFLATE-decompressor.git; unfortunately this also fails to find anything useful.
Looking at the hexdump output, the file appears to contain no literal 0 bytes. Looking at the first two lines of the hexdump above:
00000000: 1f8b 0808 385d c34a 5c30 0353 4545 445c ....8].J\0.SEED\
00000010: 30cd 5b5d 73e3 3696 7df7 af50 d5bc ccd6 0.[]s.6.}..P....
There are two instances of the ASCII sequence “\0” (bytes 5c 30). Both are at positions where a single 0 byte would make more sense. Replacing each pair with a single 0 byte creates a more plausible gzip header: the OS code is UNIX (03 is now at the correct offset for the OS field), and what appears to be the filename is correctly terminated. The Huffman-coded block begins with the two bits 11, which is expected. This was confirmed across a couple of instances.
Using file or other tools to identify the modified file shows a plausible timestamp. With the modifications above, the filename is also shown and correctly terminated.
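For reference, here’s a quick sketch of that substitution and header check (trimmed.bin is the stream trimmed to the gzip magic, as in the earlier sketch; only the first two escaped sequences are patched):

import struct

# patch the first two escaped "\0" sequences (bytes 5c 30) back to single zero bytes
with open("trimmed.bin", "rb") as fh:
    data = fh.read()
patched = data.replace(b"\\0", b"\x00", 2)

# unpack the fixed 10-byte gzip header
magic, method, flags, mtime, xfl, os_code = struct.unpack("<HBBIBB", patched[:10])
print(hex(magic), method, flags, mtime, xfl, os_code)  # expect 0x8b1f, 8, 8 (FNAME), sane mtime, 0, 3 (UNIX)

# with FNAME set, a NUL-terminated filename follows the header
name_end = patched.index(b"\x00", 10)
print(patched[10:name_end])  # b'SEED' in the dump above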
However, the deflate stream still fails to decode. It fails before another instance of “5c 30” (\0), so perhaps there is some other mangling going on which is preventing the stream from parsing…
Source code for Pfam does not appear to be available, so I can’t see if there’s any de-mangling happening in their system, or if the data was somehow mangled as part of the database dump process.
Update:
I tried another block, and I could get this to decode after converting the first two \0s to 0s. Both the CRC and length checks fail however. I used a modified version of gzip-decompress.py removing the CRC and length checks:
# Check decompressed data's length and CRC
#if size != len(decomp):
#    return "Size mismatch: expected={}, actual={}".format(size, len(decomp))
#actualcrc = zlib.crc32(decomp) & 0xFFFFFFFF
#if crc != actualcrc:
#    return "CRC-32 mismatch: expected={:08X}, actual={:08X}".format(crc, actualcrc)
A tarball of the code used is here: http://41j.com/blog/wp-content/uploads/2020/02/code.tar.gz
Update 2: Googling around turned up a codebase which suggests that the file can be imported into MySQL using “LOAD DATA LOCAL INFILE”. So… the binary data is encoded as some kind of string literal, with various escape codes perhaps? It looks like the binary data string is delimited with '. If this is the case, ' would also need to be escaped…
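If that’s what’s going on, undoing the escaping by hand might look roughly like this (an untested sketch based on the escape sequences the MySQL LOAD DATA documentation lists, plus the ' delimiter; in the end I let MySQL do the un-escaping itself, see update 3 below):

# escape sequences documented for LOAD DATA, plus the ' field delimiter
MYSQL_ESCAPES = {
    b"0": b"\x00",   # \0 -> NUL
    b"b": b"\x08",   # \b -> backspace
    b"n": b"\n",
    b"r": b"\r",
    b"t": b"\t",
    b"Z": b"\x1a",   # \Z -> Ctrl+Z
    b"\\": b"\\",
    b"'": b"'",
}

def mysql_unescape(data):
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i:i + 1] == b"\\" and i + 1 < len(data):
            nxt = data[i + 1:i + 2]
            out += MYSQL_ESCAPES.get(nxt, nxt)  # unknown escape: keep the literal character
            i += 2
        else:
            out += data[i:i + 1]
            i += 1
    return bytes(out)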
Update 3: While I’m not clear how alignments_and_trees.txt.gz was generated, it’s possible to import it into a current (5.7.27) version of MySQL. I first created a table to hold the data. This uses the same structure as alignments_and_trees.sql, but without the constraints:
CREATE DATABASE test;
USE test;
DROP TABLE IF EXISTS `alignments_and_trees`;
CREATE TABLE `alignments_and_trees` (
  `auto_pfamA` int(5) NOT NULL,
  `alignment` longblob,
  `tree` longblob,
  `jtml` longblob,
  `post` longblob,
  `type` enum('full','seed','ncbi','meta') default NULL,
  UNIQUE KEY `UQ_alignments_and_trees_1` (`auto_pfamA`,`type`),
  KEY `auto_pfamA` (`auto_pfamA`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
I then imported the complete dump (after decompressing alignments_and_trees.txt.gz) with:
LOAD DATA INFILE 'alignments_and_trees.txt' INTO TABLE alignments_and_trees CHARACTER SET latin1 FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY "'" LINES TERMINATED BY '\n';
It’s possible to copy a subset of this data into a new table e.g.:
CREATE TABLE Seltab AS SELECT * FROM alignments_and_trees WHERE auto_pfamA=XXXX;
Where XXXX is an auto_pfamA ID from the pfamA SQL dump. You can download this from the database dump directory and read it in a text editor. This allows you to map between the IDs shown on the Pfam website and those used to link between tables (for example you can use zcat pfamA.txt.gz | grep PF00XXX).
It’s then possible to dump this using mysqldump in a format that’s slightly easier to parse than that in the current dump files (which, as noted above, appear to contain escape sequences in the binary data). This is done using the --hex-blob option:
mysqldump test Seltab --opt -u root -p --hex-blob > dump
If you cut out a section of hex-encoded binary data (everything after a 0x and before a “,”) and put this in its own file, you can then convert it to a binary file using xxd:
xxd -p -r datain > datain.gz
Which can then be uncompressed and read as usual…
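The manual cut-and-xxd step can also be scripted. Here’s a rough sketch that pulls every 0x… literal out of the dump and writes each one to its own file (the blob_NNN names are just placeholders, and it doesn’t track which column or auto_pfamA each blob belongs to):

import gzip
import re

with open("dump") as fh:
    text = fh.read()

for n, match in enumerate(re.finditer(r"0x([0-9A-Fa-f]+)", text)):
    blob = bytes.fromhex(match.group(1))
    if blob[:2] == b"\x1f\x8b":                          # gzip-compressed column
        with open("blob_{:03d}.txt".format(n), "wb") as out:
            out.write(gzip.decompress(blob))
    else:                                                # not gzip data; write it raw
        with open("blob_{:03d}.bin".format(n), "wb") as out:
            out.write(blob)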