Accessing the Common Crawl Dataset from the command line

The Common Crawl dataset is a crawl of the web that has been made freely available on S3 as a public dataset. There are a couple of guides out there for accessing it from Hadoop, but I wanted to take a peek at the data before analysing it.

Here’s how to do it from EC2. Done this way it doesn’t incur any charges, but doing it from an external host will.

First, fire up an instance on EC2 and login. Then:

sudo su
yum install git
git clone git://github.com/s3tools/s3cmd.git
cd s3cmd

The current version (April 2012) of s3cmd is borked for “requester pays” datasets. It needs patches as described here: http://arxiv.org/help/bulk_data_s3

The instructions basically say to add the following lines to S3/S3.py:

        if self.s3.config.extra_headers:
            self.headers.update(self.s3.config.extra_headers)

after:

class S3Request(object):
    def __init__(self, s3, method_string, resource, headers, params = {}):
        self.s3 = s3
        self.headers = SortedDict(headers or {}, ignore_case = True)
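
Putting the two snippets together, the start of S3Request.__init__ should end up looking roughly like this (your checkout may differ slightly):

class S3Request(object):
    def __init__(self, s3, method_string, resource, headers, params = {}):
        self.s3 = s3
        self.headers = SortedDict(headers or {}, ignore_case = True)
        # Merge in extra headers from the config, e.g. the requester-pays
        # header passed on the command line below.
        if self.s3.config.extra_headers:
            self.headers.update(self.s3.config.extra_headers)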

Then install it:

python setup.py install
s3cmd --configure

In the AWS management console, go to the top right where your name is, select “Security Credentials”, get your access key and secret key, and enter them into s3cmd. For the other options you can accept the defaults. You can then access the dataset.
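
s3cmd writes these settings to ~/.s3cfg, so if you want to check or change them later, the relevant entries look something like this (the values below are just placeholders):

[default]
access_key = YOUR_ACCESS_KEY
secret_key = YOUR_SECRET_KEY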

List the bucket:

s3cmd ls --add-header=x-amz-request-payer:requester s3://aws-publicdatasets/common-crawl/crawl-002

Fetch a file:

s3cmd get --add-header=x-amz-request-payer:requester s3://aws-publicdatasets/common-crawl/crawl-002/2010/01/06/1/1262851198963_1.arc.gz

Once you’ve fetched a file you can decompress it as follows:

gunzip -c 1262851198963_1.arc.gz > text
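
Each record in the decompressed ARC file starts with a space-separated header line (URL, IP address and archive date first, record length in bytes last) followed by that many bytes of content and a blank line. Here’s a minimal Python sketch, assuming that ARC layout (arc_records is just a throwaway helper, not part of any library), which prints the URL, IP and date of each record in the file we just decompressed:

def arc_records(path):
    """Yield (header_fields, body_bytes) for each record in an ARC file."""
    with open(path, 'rb') as f:
        while True:
            line = f.readline()
            if not line:
                break                 # end of file
            if not line.strip():
                continue              # blank separator between records
            fields = line.decode('utf-8', 'replace').split()
            length = int(fields[-1])  # last header field is the body length
            yield fields, f.read(length)

for fields, body in arc_records('text'):
    print(' '.join(fields[:3]))       # URL, IP address, archive date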