AOL Search Data Leak
In 2006 there was a “leak” of AOL search data. A large set of anonymized search queries was released publicly by AOL for research purposes. AOL quickly realised, however, that the data was a public relations nightmare, and that some users could be deanonymized through their search queries. The dataset was removed from their website, and the incident led to the resignation of AOL’s CTO. The dataset is still freely available on the internet, though, and I’m interested in playing with it a little.
First of all, I want to pull out the top search queries. UNIX sort and uniq handle this quite well, as shown in the Notes section below. Here are the top 50 queries:
Rank | Count | Query |
1 | 711111 | – |
2 | 233670 | |
3 | 98713 | ebay |
4 | 91234 | yahoo |
5 | 69314 | yahoo.com |
6 | 61890 | mapquest |
7 | 55832 | google.com |
8 | 54985 | myspace.com |
9 | 51433 | myspace |
10 | 30835 | www.yahoo.com |
11 | 30045 | www.google.com |
12 | 28185 | internet |
13 | 21353 | http |
14 | 20046 | www.myspace.com |
15 | 19578 | map |
16 | 19381 | weather |
17 | 18577 | my |
18 | 18560 | ebay.com |
19 | 16191 | bank |
20 | 15940 | american |
21 | 14512 | pogo |
22 | 14063 | hotmail |
23 | 13824 | msn |
24 | 13575 | ask |
25 | 13521 | craigslist |
26 | 13294 | hotmail.com |
27 | 12825 | dictionary |
28 | 12512 | yahoo |
29 | 11877 | msn.com |
30 | 11757 | mycl.cravelyrics.com |
31 | 11091 | bankofamerica |
32 | 10360 | mapquest.com |
33 | 10158 | walmart |
34 | 10126 | my |
35 | 10117 | ask.com |
36 | 9943 | tattoo |
37 | 9530 | .com |
38 | 9059 | southwest |
39 | 8926 | myspace |
40 | 8906 | white |
41 | 8510 | maps |
42 | 8342 | sex |
43 | 8315 | porn |
44 | 7791 | mailbox |
45 | 7714 | home |
46 | 7709 | www.google |
47 | 7615 | fidelity.com |
48 | 7506 | pogo.com |
49 | 7459 | target |
50 | 7372 | match.com |
Notes
# Download and unpack the dataset
wget http://www.atrus.org/hosted/AOL-data.tgz
tar xvzf AOL-data.tgz
# Strip the header line from each file and concatenate everything into one file
for i in ./user*; do sed '1d' "$i" >> aol_all.txt; done
# Extract the query column, then count and rank the unique queries
awk 'BEGIN{FS="\t"}{print $2}' aol_all.txt | sort | uniq -c | sort -n -r -k 1 > aol_all.txt.top
# Render the top 50 as HTML table rows (uniq -c output is "count query ...")
head -n 50 aol_all.txt.top | awk '{n++; count=$1; $1=""; sub(/^ +/, ""); print "<tr><td>" n "</td><td>" count "</td><td>" $0 "</td></tr>"}'
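For reference, each record in the collection is tab-separated with five fields (AnonID, Query, QueryTime, ItemRank, ClickURL), which is why the awk above picks out $2. As a rough sketch, assuming the extracted files follow the user-ct-test-collection-*.txt naming and using a made-up AnonID purely for illustration, you could inspect the layout and pull a single user's history like this:

# Peek at the header line and first couple of records of one file
head -n 3 user-ct-test-collection-01.txt
# List every query (and its timestamp) issued by a hypothetical anonymous user ID
awk -F'\t' '$1 == "12345" {print $3, $2}' aol_all.txt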