AOL Search Data Leak
In 2006 there was a “leak” of AOL search data. AOL publicly released a large set of anonymized search queries for research purposes. However, AOL quickly realised that the data was a public relations nightmare, and that some users could be deanonymized through their search queries. The dataset was removed from their website, and the incident led to the resignation of AOL’s CTO. Anyway, the dataset is still freely available on the internet, and I’m interested in playing with it a little.
First of all I want to pull out the top search queries. UNIX sort and uniq handle this quite well as shown in the Notes section. Here are the top 50 queries:
Rank | Count | Query |
1 | 711111 | – |
2 | 233670 | |
3 | 98713 | ebay |
4 | 91234 | yahoo |
5 | 69314 | yahoo.com |
6 | 61890 | mapquest |
7 | 55832 | google.com |
8 | 54985 | myspace.com |
9 | 51433 | myspace |
10 | 30835 | www.yahoo.com |
11 | 30045 | www.google.com |
12 | 28185 | internet |
13 | 21353 | http |
14 | 20046 | www.myspace.com |
15 | 19578 | map |
16 | 19381 | weather |
17 | 18577 | my |
18 | 18560 | ebay.com |
19 | 16191 | bank |
20 | 15940 | american |
21 | 14512 | pogo |
22 | 14063 | hotmail |
23 | 13824 | msn |
24 | 13575 | ask |
25 | 13521 | craigslist |
26 | 13294 | hotmail.com |
27 | 12825 | dictionary |
28 | 12512 | yahoo |
29 | 11877 | msn.com |
30 | 11757 | mycl.cravelyrics.com |
31 | 11091 | bankofamerica |
32 | 10360 | mapquest.com |
33 | 10158 | walmart |
34 | 10126 | my |
35 | 10117 | ask.com |
36 | 9943 | tattoo |
37 | 9530 | .com |
38 | 9059 | southwest |
39 | 8926 | myspace |
40 | 8906 | white |
41 | 8510 | maps |
42 | 8342 | sex |
43 | 8315 | porn |
44 | 7791 | mailbox |
45 | 7714 | home |
46 | 7709 | www.google |
47 | 7615 | fidelity.com |
48 | 7506 | pogo.com |
49 | 7459 | target |
50 | 7372 | match.com |
Notes
wget http://www.atrus.org/hosted/AOL-data.tgz
tar xvzf AOL-data.tgz
# Strip the header line from each file and concatenate everything.
for i in ./user*; do sed '1d' "$i" >> aol_all.txt; done
# Count identical queries (second tab-separated column), most frequent first.
awk 'BEGIN{FS="\t";}{print $2}' aol_all.txt | sort | uniq -c | sort -n -r -k 1 > aol_all.txt.top
# Emit the top 50 as HTML table rows.
head -n 50 aol_all.txt.top | awk 'BEGIN{n=1;}{print "<tr><td>" n "</td><td>" $1 "</td><td>" $2 "</td></tr>";n++;}'
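One caveat with the last command in the Notes: once `uniq -c` has prefixed each query with its count, awk's default whitespace splitting means `$2` is only the first word of the query. That is probably why entries such as "yahoo" and "my" show up more than once in the table; they are likely different multi-word queries cut down to their first word. Below is a minimal sketch of a variant that keeps the whole query. The function name `top_queries` and the sample rows are made up for illustration; the only assumption taken from the Notes is that the query sits in the second tab-separated column.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): read tab-separated rows
# on stdin, count identical values of column 2, and print the top N as
# "rank<TAB>count<TAB>full query".
top_queries() {
    n=${1:-50}
    awk -F '\t' '{ print $2 }' |
        sort | uniq -c | sort -k1,1nr |
        head -n "$n" |
        awk '{
            count = $1
            # Remove the count that uniq -c prepended; the rest of the
            # line is the complete query, spaces included.
            sub(/^[ \t]*[0-9]+ ?/, "")
            printf "%d\t%d\t%s\n", NR, count, $0
        }'
}

# Made-up sample rows (id, query, timestamp placeholder); not real AOL data.
printf '1\tyahoo mail\tt\n2\tyahoo mail\tt\n3\tyahoo\tt\n4\tyahoo\tt\n5\tyahoo\tt\n6\tebay\tt\n' |
    top_queries 3
```

On the sample rows, "yahoo" and "yahoo mail" come out as separate entries with their own counts instead of both appearing as "yahoo". Against the real data it would be run as `top_queries 50 < aol_all.txt`.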
I have been looking around for this data for a while, but have not found it. Could you point me to it? Or, if it has been taken down from all public sites (something AOL has been working at), do you still have it, and could you send it to me?
Thanks!