{"id":867,"date":"2012-04-15T09:28:50","date_gmt":"2012-04-15T09:28:50","guid":{"rendered":"http:\/\/41j.com\/blog\/?p=867"},"modified":"2012-04-15T09:28:50","modified_gmt":"2012-04-15T09:28:50","slug":"accessing-the-common-crawl-dataset-from-the-command-line","status":"publish","type":"post","link":"https:\/\/41j.com\/blog\/2012\/04\/accessing-the-common-crawl-dataset-from-the-command-line\/","title":{"rendered":"Accessing the Common Crawl Dataset from the command line"},"content":{"rendered":"<p>The common crawl dataset is a crawl of the web which has been made freely available on S3 as a public dataset. There are a couple of guides out there for accessing the common crawl dataset from Hadoop, but I wanted to take a peak at the data before analysing it.<\/p>\n<p>Here&#8217;s how you do this from EC2, this doesn&#8217;t incur any charges, but doing this from an external host will.<\/p>\n<p>First, fire up an instance on EC2 and login. Then:<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\nsudo su\r\nyum install git\r\ngit clone git:\/\/github.com\/s3tools\/s3cmd.git\r\n<\/pre>\n<p>The current version (April 2012) of s3cmd is borked for &#8220;requester pays&#8221; datasets. It needs patches as described here: http:\/\/arxiv.org\/help\/bulk_data_s3<\/p>\n<p>The instructions basically say add the following lines to S3\/S3.py:<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n        if self.s3.config.extra_headers:\r\n          self.headers.update(self.s3.config.extra_headers)\r\n<\/pre>\n<p>after:<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\nclass S3Request(object):\r\n    def __init__(self, s3, method_string, resource, headers, params = {}):\r\n        self.s3 = s3\r\n        self.headers = SortedDict(headers or {}, ignore_case = True)\r\n<\/pre>\n<p>Then install it:<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\npython setup.py install\r\ns3cmd --configure\r\n<\/pre>\n<p>In the AWS management console, go to the top right where your name is, select, select &#8220;Security Credentials&#8221; get your access key and secret key and enter them in to s3cmd. For the other options you can accept the defaults. You can then access the dataset.<\/p>\n<p>List the bucket:<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\ns3cmd ls --add-header=x-amz-request-payer:requester s3:\/\/aws-publicdatasets\/common-crawl\/crawl-002\r\n<\/pre>\n<p>Fetch a file:<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\ns3cmd get --add-header=x-amz-request-payer:requester s3:\/\/aws-publicdatasets\/common-crawl\/crawl-002\/2010\/01\/06\/1\/1262851198963_1.arc.gz\r\n<\/pre>\n<p>Once you&#8217;ve fetched a file you can decompress it as follows:<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\ngunzip -c 1262851198963_1.arc.gz &gt; text\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>The common crawl dataset is a crawl of the web which has been made freely available on S3 as a public dataset. There are a couple of guides out there for accessing the common crawl dataset from Hadoop, but I wanted to take a peak at the data before analysing it. 
Here&#8217;s how you do [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[1],"tags":[],"class_list":["post-867","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p1RRoU-dZ","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/41j.com\/blog\/wp-json\/wp\/v2\/posts\/867","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/41j.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/41j.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/41j.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/41j.com\/blog\/wp-json\/wp\/v2\/comments?post=867"}],"version-history":[{"count":1,"href":"https:\/\/41j.com\/blog\/wp-json\/wp\/v2\/posts\/867\/revisions"}],"predecessor-version":[{"id":868,"href":"https:\/\/41j.com\/blog\/wp-json\/wp\/v2\/posts\/867\/revisions\/868"}],"wp:attachment":[{"href":"https:\/\/41j.com\/blog\/wp-json\/wp\/v2\/media?parent=867"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/41j.com\/blog\/wp-json\/wp\/v2\/categories?post=867"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/41j.com\/blog\/wp-json\/wp\/v2\/tags?post=867"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}