AOL Search Data Leak

In 2006 there was a “leak” of AOL search data: AOL publicly released a large set of anonymized search queries for research purposes. However, AOL quickly realised that the data was a public relations nightmare and that some users could be deanonymized through their search queries. The dataset was removed from AOL's website, and the incident led to the resignation of AOL's CTO. The dataset is still freely available on the internet, though, and I’m interested in playing with it a little.

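First of all I want to pull out the top search queries. UNIX sort and uniq handle this quite well, as shown in the Notes section. Here are the top 50 queries with their counts. Only the first word of each query is shown, which is why terms such as yahoo, my and myspace appear more than once: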

Rank  Count   Query
1     711111
2     233670  google
3     98713   ebay
4     91234   yahoo
5     69314   yahoo.com
6     61890   mapquest
7     55832   google.com
8     54985   myspace.com
9     51433   myspace
10    30835   www.yahoo.com
11    30045   www.google.com
12    28185   internet
13    21353   http
14    20046   www.myspace.com
15    19578   map
16    19381   weather
17    18577   my
18    18560   ebay.com
19    16191   bank
20    15940   american
21    14512   pogo
22    14063   hotmail
23    13824   msn
24    13575   ask
25    13521   craigslist
26    13294   hotmail.com
27    12825   dictionary
28    12512   yahoo
29    11877   msn.com
30    11757   mycl.cravelyrics.com
31    11091   bankofamerica
32    10360   mapquest.com
33    10158   walmart
34    10126   my
35    10117   ask.com
36    9943    tattoo
37    9530    .com
38    9059    southwest
39    8926    myspace
40    8906    white
41    8510    maps
42    8342    sex
43    8315    porn
44    7791    mailbox
45    7714    home
46    7709    www.google
47    7615    fidelity.com
48    7506    pogo.com
49    7459    target
50    7372    match.com

Notes

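# Download and unpack the dataset.
wget http://www.atrus.org/hosted/AOL-data.tgz
tar xvzf AOL-data.tgz
# Concatenate the data files, dropping the first (header) line of each.
for i in ./user*; do sed '1d' "$i" >> aol_all.txt; done
# Count each distinct query (tab-separated column 2), most frequent first.
awk 'BEGIN{FS="\t";}{print $2}' aol_all.txt | sort | uniq -c | sort -n -r -k 1 > aol_all.txt.top
# Emit the top 50 as HTML table rows. Note that $2 is only the first word of each query.
head -n 50 aol_all.txt.top | awk 'BEGIN{n=1;}{print "<tr><td>" n "</td><td>" $1 "</td><td>" $2 "</td></tr>";n++;}'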
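
The last command above splits each line of the uniq output on whitespace, so a multi-word query is cut down to its first word. As a rough sketch of one way to keep the whole query text, the function below strips the count that uniq -c prepends and prints the rest of the line unchanged. The function name top_queries, the tab-separated output format and the sample rows are my own assumptions for illustration and are not part of the dataset or the commands above.

```shell
#!/bin/sh
# top_queries FILE [N]  (hypothetical helper, not from the original notes)
# Prints the N most frequent values of tab-separated column 2 of FILE
# as "rank<TAB>count<TAB>query". N defaults to 50.
top_queries() {
    file=$1
    n=${2:-50}
    # Split on tabs only, so multi-word queries stay in one field.
    awk -F'\t' '{ print $2 }' "$file" |
        sort |
        uniq -c |
        sort -rn |
        head -n "$n" |
        awk '{
            count = $1
            sub(/^ *[0-9]+ /, "")   # drop the count that uniq -c prepends
            printf "%d\t%d\t%s\n", NR, count, $0
        }'
}

# Usage on a few made-up rows (user id, query), not real AOL data.
sample=$(mktemp)
printf '1\tyahoo mail\n2\tgoogle\n3\tyahoo mail\n4\tmap quest\n5\tgoogle\n6\tyahoo mail\n' > "$sample"
top_queries "$sample" 2
rm -f "$sample"
```

On the real data this would be called as top_queries aol_all.txt 50, assuming aol_all.txt was built as in the commands above. The order of queries with equal counts depends on the locale used by sort.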