{"id":50,"date":"2011-09-12T20:56:52","date_gmt":"2011-09-12T20:56:52","guid":{"rendered":"http:\/\/41j.com\/blog\/?p=50"},"modified":"2011-09-28T16:03:17","modified_gmt":"2011-09-28T16:03:17","slug":"aol-search-data-leak","status":"publish","type":"post","link":"https:\/\/41j.com\/blog\/2011\/09\/aol-search-data-leak\/","title":{"rendered":"AOL Search Data Leak"},"content":{"rendered":"<p>In 2006 there was a &#8220;leak&#8221; of AOL search data. A large set of anonymized search queries was released publicly by AOL for research purposes. AOL, however, quickly realised that the data was a public relations nightmare, and that some users could be deanonymized through their search queries. The dataset was removed from their website, and the incident led to the resignation of the CTO of AOL. Anyway, the dataset is still freely available on the internet and I&#8217;m interested in playing with it a little.<\/p>\n<p>First of all I want to pull out the top search queries. UNIX sort and uniq handle this quite well, as shown in the Notes section. 
Here are the top 50 queries (only the first word of each query is shown, which is why some entries repeat):<\/p>\n<table border=\"1\">\n<tr>\n<th>Rank<\/th>\n<th>Count<\/th>\n<th>Query<\/th>\n<\/tr>\n<tr>\n<td>1<\/td>\n<td>711111<\/td>\n<td>&#8211;<\/td>\n<\/tr>\n<tr>\n<td>2<\/td>\n<td>233670<\/td>\n<td>google<\/td>\n<\/tr>\n<tr>\n<td>3<\/td>\n<td>98713<\/td>\n<td>ebay<\/td>\n<\/tr>\n<tr>\n<td>4<\/td>\n<td>91234<\/td>\n<td>yahoo<\/td>\n<\/tr>\n<tr>\n<td>5<\/td>\n<td>69314<\/td>\n<td>yahoo.com<\/td>\n<\/tr>\n<tr>\n<td>6<\/td>\n<td>61890<\/td>\n<td>mapquest<\/td>\n<\/tr>\n<tr>\n<td>7<\/td>\n<td>55832<\/td>\n<td>google.com<\/td>\n<\/tr>\n<tr>\n<td>8<\/td>\n<td>54985<\/td>\n<td>myspace.com<\/td>\n<\/tr>\n<tr>\n<td>9<\/td>\n<td>51433<\/td>\n<td>myspace<\/td>\n<\/tr>\n<tr>\n<td>10<\/td>\n<td>30835<\/td>\n<td>www.yahoo.com<\/td>\n<\/tr>\n<tr>\n<td>11<\/td>\n<td>30045<\/td>\n<td>www.google.com<\/td>\n<\/tr>\n<tr>\n<td>12<\/td>\n<td>28185<\/td>\n<td>internet<\/td>\n<\/tr>\n<tr>\n<td>13<\/td>\n<td>21353<\/td>\n<td>http<\/td>\n<\/tr>\n<tr>\n<td>14<\/td>\n<td>20046<\/td>\n<td>www.myspace.com<\/td>\n<\/tr>\n<tr>\n<td>15<\/td>\n<td>19578<\/td>\n<td>map<\/td>\n<\/tr>\n<tr>\n<td>16<\/td>\n<td>19381<\/td>\n<td>weather<\/td>\n<\/tr>\n<tr>\n<td>17<\/td>\n<td>18577<\/td>\n<td>my<\/td>\n<\/tr>\n<tr>\n<td>18<\/td>\n<td>18560<\/td>\n<td>ebay.com<\/td>\n<\/tr>\n<tr>\n<td>19<\/td>\n<td>16191<\/td>\n<td>bank<\/td>\n<\/tr>\n<tr>\n<td>20<\/td>\n<td>15940<\/td>\n<td>american<\/td>\n<\/tr>\n<tr>\n<td>21<\/td>\n<td>14512<\/td>\n<td>pogo<\/td>\n<\/tr>\n<tr>\n<td>22<\/td>\n<td>14063<\/td>\n<td>hotmail<\/td>\n<\/tr>\n<tr>\n<td>23<\/td>\n<td>13824<\/td>\n<td>msn<\/td>\n<\/tr>\n<tr>\n<td>24<\/td>\n<td>13575<\/td>\n<td>ask<\/td>\n<\/tr>\n<tr>\n<td>25<\/td>\n<td>13521<\/td>\n<td>craigslist<\/td>\n<\/tr>\n<tr>\n<td>26<\/td>\n<td>13294<\/td>\n<td>hotmail.com<\/td>\n<\/tr>\n<tr>\n<td>27<\/td>\n<td>12825<\/td>\n<td>dictionary<\/td>\n<\/tr>\n<tr>\n<td>28<\/td>\n<td>12512<\/td>\n<td>yahoo<\/td>\n<\/tr>\n<tr>\n<td>29<\/td>\n<td>11877<\/td>\n<td>msn.com<\/td>\n<\/tr>\n<tr>\n<td>30<\/td>\n<td>11757<\/td>\n<td>mycl.cravelyrics.com<\/td>\n<\/tr>\n<tr>\n<td>31<\/td>\n<td>11091<\/td>\n<
td>bankofamerica<\/td>\n<\/tr>\n<tr>\n<td>32<\/td>\n<td>10360<\/td>\n<td>mapquest.com<\/td>\n<\/tr>\n<tr>\n<td>33<\/td>\n<td>10158<\/td>\n<td>walmart<\/td>\n<\/tr>\n<tr>\n<td>34<\/td>\n<td>10126<\/td>\n<td>my<\/td>\n<\/tr>\n<tr>\n<td>35<\/td>\n<td>10117<\/td>\n<td>ask.com<\/td>\n<\/tr>\n<tr>\n<td>36<\/td>\n<td>9943<\/td>\n<td>tattoo<\/td>\n<\/tr>\n<tr>\n<td>37<\/td>\n<td>9530<\/td>\n<td>.com<\/td>\n<\/tr>\n<tr>\n<td>38<\/td>\n<td>9059<\/td>\n<td>southwest<\/td>\n<\/tr>\n<tr>\n<td>39<\/td>\n<td>8926<\/td>\n<td>myspace<\/td>\n<\/tr>\n<tr>\n<td>40<\/td>\n<td>8906<\/td>\n<td>white<\/td>\n<\/tr>\n<tr>\n<td>41<\/td>\n<td>8510<\/td>\n<td>maps<\/td>\n<\/tr>\n<tr>\n<td>42<\/td>\n<td>8342<\/td>\n<td>sex<\/td>\n<\/tr>\n<tr>\n<td>43<\/td>\n<td>8315<\/td>\n<td>porn<\/td>\n<\/tr>\n<tr>\n<td>44<\/td>\n<td>7791<\/td>\n<td>mailbox<\/td>\n<\/tr>\n<tr>\n<td>45<\/td>\n<td>7714<\/td>\n<td>home<\/td>\n<\/tr>\n<tr>\n<td>46<\/td>\n<td>7709<\/td>\n<td>www.google<\/td>\n<\/tr>\n<tr>\n<td>47<\/td>\n<td>7615<\/td>\n<td>fidelity.com<\/td>\n<\/tr>\n<tr>\n<td>48<\/td>\n<td>7506<\/td>\n<td>pogo.com<\/td>\n<\/tr>\n<tr>\n<td>49<\/td>\n<td>7459<\/td>\n<td>target<\/td>\n<\/tr>\n<tr>\n<td>50<\/td>\n<td>7372<\/td>\n<td>match.com<\/td>\n<\/tr>\n<\/table>\n<h2>Notes<\/h2>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n# Download and unpack the dataset\r\nwget http:\/\/www.atrus.org\/hosted\/AOL-data.tgz\r\ntar xvzf AOL-data.tgz\r\n# Concatenate all files, dropping the first (header) line of each\r\nfor i in .\/user*; do sed '1 d' &quot;$i&quot; &gt;&gt; aol_all.txt; done\r\n# Count each distinct query (tab-separated column 2), most frequent first\r\nawk 'BEGIN{FS=&quot;\\t&quot;;}{print $2}' aol_all.txt | sort | uniq -c | sort -n -r -k 1 &gt; aol_all.txt.top\r\n# Emit the top 50 as HTML table rows; $2 here is only the first word of each query\r\nhead -n 50 aol_all.txt.top | awk 'BEGIN{n=1;}{print &quot;&lt;tr&gt;&lt;td&gt;&quot; n &quot;&lt;\/td&gt;&lt;td&gt;&quot; $1 &quot;&lt;\/td&gt;&lt;td&gt;&quot; $2 &quot;&lt;\/td&gt;&lt;\/tr&gt;&quot;;n++;}'\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>In 2006 there was a &#8220;leak&#8221; of AOL search data. A large set of anonymized search queries was released publicly by AOL for research purposes. 
AOL, however, quickly realised that the data was a public relations nightmare, and that some users could be deanonymized through their search queries. The dataset was removed from their website, [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false},"categories":[1],"tags":[],"class_list":["post-50","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p1RRoU-O","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/41j.com\/blog\/wp-json\/wp\/v2\/posts\/50","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/41j.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/41j.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/41j.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/41j.com\/blog\/wp-json\/wp\/v2\/comments?post=50"}],"version-history":[{"count":8,"href":"https:\/\/41j.com\/blog\/wp-json\/wp\/v2\/posts\/50\/revisions"}],"predecessor-version":[{"id":209,"href":"https:\/\/41j.com\/blog\/wp-json\/wp\/v2\/posts\/50\/revisions\/209"}],"wp:attachment":[{"href":"https:\/\/41j.com\/blog\/wp-json\/wp\/v2\/media?parent=50"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/41j.co
m\/blog\/wp-json\/wp\/v2\/categories?post=50"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/41j.com\/blog\/wp-json\/wp\/v2\/tags?post=50"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}