AOL Search Data Leak
In 2006 there was a “leak” of AOL search data. A large set of anonymized search queries was released publicly by AOL for research purposes. AOL quickly realised, however, that the data was a public relations nightmare, and that some users could be deanonymized through their search queries. The dataset was removed from their website, and the incident led to the resignation of AOL’s CTO. The dataset is still freely available on the internet, though, and I’m interested in playing with it a little.
First of all, I want to pull out the top search queries. UNIX sort and uniq handle this quite well, as shown in the Notes section below. Here are the top 50 queries:
Rank | Count | Query |
1 | 711111 | – |
2 | 233670 | |
3 | 98713 | ebay |
4 | 91234 | yahoo |
5 | 69314 | yahoo.com |
6 | 61890 | mapquest |
7 | 55832 | google.com |
8 | 54985 | myspace.com |
9 | 51433 | myspace |
10 | 30835 | www.yahoo.com |
11 | 30045 | www.google.com |
12 | 28185 | internet |
13 | 21353 | http |
14 | 20046 | www.myspace.com |
15 | 19578 | map |
16 | 19381 | weather |
17 | 18577 | my |
18 | 18560 | ebay.com |
19 | 16191 | bank |
20 | 15940 | american |
21 | 14512 | pogo |
22 | 14063 | hotmail |
23 | 13824 | msn |
24 | 13575 | ask |
25 | 13521 | craigslist |
26 | 13294 | hotmail.com |
27 | 12825 | dictionary |
28 | 12512 | yahoo |
29 | 11877 | msn.com |
30 | 11757 | mycl.cravelyrics.com |
31 | 11091 | bankofamerica |
32 | 10360 | mapquest.com |
33 | 10158 | walmart |
34 | 10126 | my |
35 | 10117 | ask.com |
36 | 9943 | tattoo |
37 | 9530 | .com |
38 | 9059 | southwest |
39 | 8926 | myspace |
40 | 8906 | white |
41 | 8510 | maps |
42 | 8342 | sex |
43 | 8315 | porn |
44 | 7791 | mailbox |
45 | 7714 | home |
46 | 7709 | www.google |
47 | 7615 | fidelity.com |
48 | 7506 | pogo.com |
49 | 7459 | target |
50 | 7372 | match.com |
Notes
# Download and unpack the dataset
wget http://www.atrus.org/hosted/AOL-data.tgz
tar xvzf AOL-data.tgz
# Strip the header line from each file and concatenate everything into one file
for i in ./user*; do sed '1d' "$i" >> aol_all.txt; done
# Extract the query column, then count and rank the unique queries
awk 'BEGIN{FS="\t"}{print $2}' aol_all.txt | sort | uniq -c | sort -n -r -k 1 > aol_all.txt.top
# Render the top 50 as HTML table rows (uniq -c output is "count query ...")
head -n 50 aol_all.txt.top | awk '{n++; count=$1; $1=""; sub(/^ +/, ""); print "<tr><td>" n "</td><td>" count "</td><td>" $0 "</td></tr>"}'
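For reference, each record in the collection is tab-separated with five fields (AnonID, Query, QueryTime, ItemRank, ClickURL), which is why the awk above picks out $2. As a rough sketch, assuming the extracted files follow the user-ct-test-collection-*.txt naming and using a made-up AnonID purely for illustration, you could inspect the layout and pull a single user's history like this:

# Peek at the header line and first couple of records of one file
head -n 3 user-ct-test-collection-01.txt
# List every query (and its timestamp) issued by a hypothetical anonymous user ID
awk -F'\t' '$1 == "12345" {print $3, $2}' aol_all.txt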