Paradoxia

I like paradoxes I think they tell us something meanful about the nature of logic. Here are the most succinct paradoxes I know of. If you have any more please leave a comment.

Berry’s paradox: The smallest number that can not be expressed in less than 20 words.

Russell’s paradox: Does the set of all sets that do not contain themselves, contain itself?

The liar paradox: This statement is untrue.

Quine’s paradox: “Yields falsehood when preceded by its quotation” yields falsehood when preceded by it’s quotation.

Unknowable truths

Godel’s incompleteness theorem is one of the most important and beautiful results in Computer Science. It tells us that even if we know exactly how a system is constructed there are statements about that system which are true, but we can never prove. They are in some sense unknowable truths.

However the usual presentation of Godel’s theorem is complicated and unwieldy. I could never really get my head around it, or find the time to dedicate to doing so. Even after gaining the jist of it, Godel’s description doesn’t leave me with the sense of awe that it perhaps should.

There is however another, less well known, description formulated in terms of Kolmogorov Complexity. It’s simple, beautiful, and certainly leaves me with a sense of awe. It somehow feels much more like a proof ‘should’. And runs as follows [1]:

1. Let x denote a finite binary string. We write “x is random” if the shortest description of x is at least as long as x. That is, we can’t write a computer program to generate x, that’s shorter than x. It’s incompressible.

2. We can see by a simple counting argument that there are random strings of every length. See note 2.

3. We have a language (formal system to use the precise terminology). In that language we can make statements like ‘x is random’. Now, suppose we can describe that language (F) in f bits of data. For example, this could be the number of bits required for an exhaustive description of the textbook ‘the language of F’. F can be a complicated language, but it must be consistent. No statement in F can be both true AND false, and only true statements can be proven to be true using F.

4. Now we’re going to show that there are statements in F which are true, but we can’t prove them ! Truths which exists that we can never know. We so this by contradiction. Given F we can start to exhaustively search for a proof that some string (which is a lot bigger than f) is random, and then print it when it’s found.

But wait! If you can prove it’s random it can’t be random! Why? Because the procedure to print it only required f + log n bits of data. Which is much smaller then n. Therefore we’ve proven something true, which is not true. Our language F is not consistent, and our axiom is violated.

So in short, you can have a system that’s consistent (things you can prove to be true, are really true) or a system which contains true statements but you can never know they are true…

I think it’s a stunning result! It tells us that whatever logical tool you use to look at the universal you will still fail to understand it completely. We are in some sense doomed to live in a state of unknowing, and must make peace with a certain degree of confusion.

——

Note 1: This form of the proof can be found on page 5 of “An Introduction to Kolmogorov Complexity and Its Applications”. Though I’m not sure that it originates here.

Note 2: That’s because there are 2^n strings of length n and sumof 2^(n-i) for i=1 to n-1 strings of length n-1 or smaller. As the number of strings of length n-1 is less than the number of strings of length n and our computer programs can be expressed as binary strings. It’s clear to see that there can’t be a 1 to 1 mapping between strings of length n and programs of length n-1 or less. There must be strings of length n with are random or ‘uncompressable’.

AOL Search Data Leak

In 2006 there was a “leak” of AOL search data. A large set of anonymized search queries was released publicly by AOL for research purposes. AOL however quickly realised that the data was a public relations nightmare, and that some users could be deanonymized though their search queries. The dataset was removed from their website, and the incident lead to the resignation of the CTO of AOL. Anyway, the dataset is still freely available on the internet and I’m interested in playing with it a little.

First of all I want to pull out the top search queries. UNIX sort and uniq handle this quite well as shown in the Notes section. Here are the top 50 queries:

1 711111

2 233670

google
3 98713

ebay
4 91234

yahoo
5 69314

yahoo.com
6 61890

mapquest
7 55832

google.com
8 54985

myspace.com
9 51433

myspace
10 30835

www.yahoo.com
11 30045

www.google.com
12 28185

internet
13 21353

http
14 20046

www.myspace.com
15 19578

map
16 19381

weather
17 18577

my
18 18560

ebay.com
19 16191

bank
20 15940

american
21 14512

pogo
22 14063

hotmail
23 13824

msn
24 13575

ask
25 13521

craigslist
26 13294

hotmail.com
27 12825

dictionary
28 12512

yahoo
29 11877

msn.com
30 11757

mycl.cravelyrics.com
31 11091

bankofamerica
32 10360

mapquest.com
33 10158

walmart
34 10126

my
35 10117

ask.com
36 9943

tattoo
37 9530

.com
38 9059

southwest
39 8926

myspace
40 8906

white
41 8510

maps
42 8342

sex
43 8315

porn
44 7791

mailbox
45 7714

home
46 7709

www.google
47 7615

fidelity.com
48 7506

pogo.com
49 7459

target
50 7372

match.com

Notes

wget http://www.atrus.org/hosted/AOL-data.tgz
tar xvzf AOL-data.tgz
for i in ./user*;do sed '1 d' $i >> aol_all.txt;done
awk 'BEGIN{FS="\t";}{print $2}' aol_all.txt | sort | uniq -c | sort -n -r -k 1 > aol_all.txt.top
head aol_all.txt.top -n 50 | awk 'BEGIN{n=1;}{print "<tr><td>" n "</td><td>" $1 "<td>" $2 "</td></tr>";n++;}'

September 11 Pager Message Analysis

I wanted to do some basic analysis on the September 11 Pager Message dump from wikileaks. The data is about 2 years old now, and I’m sure it’s already been analysed to death but I wanted to have a play anyway.

First I binned the messages in to 5min chunks and looks for occurrences of the word “plane”. You can see the results in the graph below:

There’s an unexpected blip at around 7am, this is due to messages about a US spy drone which was shot down over Iraq before the WTC planes hit.

A lot of people initially thought it was a bomb, here are the occurrences of “bomb” in text messages:

The peak at 3pm seems to represent reports of a “Car bomb explodes outside State Department, senior law enforceement officials say” mostly from yahoo news alerts.

Next I wanted to look for server issues so I search for any message containing: offline, timeout, error (but not terror) or “not responding”:

Not sure what the peak at 5am is, there are a lot of: ” adhocdb TEXT: adhocdb unix: WARNING: vxvm:vxio: Subdisk disk02-01 block 240384: Uncorrectable write error” and “[email protected]||elgweb8 login content match error on step 2”. It appears to be mostly the later. But you can certainly see the rate of errors pick up from about 8:00 onward. To know how much of this was due to the attacks I’d need more of a baseline. Some of the more interesting ones:

2001-09-11 09:05:20 Arch [0912377] C  ALPHA  8628**Customer reported outage* Internet is unavailable. Impact:User unable to connect to internet.  Multiple users state either extremely slow connection or connection timed out error. Occurred:08:55
2001-09-11 09:05:31 Skytel [003920778] C  ALPHA  SEV1 DesMoines as PICS  H0864464 SLI=Y ETA=Na 7:44  I:Pics is down, unable to access orders - error 202 no records found.  E:PICS Helpdesk  Janie ATCHelpdesk 512-248-4967-ATC Helpdesk
2001-09-11 09:15:46 Skytel [007607560] C  ALPHA  [email protected]||NSSW3/General Tire/T3149925/1-888-212-5447/Continental GT users in Charlotte are unable to connect to IBM IIN. NSCOMSOFT seeing errors. Please call ADVNETO.

Notes

I downloaded the wikileaks 911 pager torrent from: http://file.wikileaks.org/torrent/9-11_all_messages.7z.torrent. The pager messages are broken down by minute but for my analysis I wanted them all in a single file. So I started by concatenating them all.

cat 2001* > all_messages

The file contains lines that look like:

2001-09-11 03:00:00 Metrocall [1060278] B ALPHA 09/12@03:03:50 BETA13:Service (Oracle Web Lsnr 4.0(admin -DEFAULT_HOME,Intranet)) is not responding. Stopped. 1
2001-09-11 03:00:00 Metrocall [1421210] C ALPHA LAKEJob exceeded 4 hours on Lake
2001-09-11 03:00:00 Metrocall [1421210] C ALPHA LAKEJob 378304/VSIOWNER/OMSJRNMGR has MSGW status.
2001-09-11 03:00:00 Metrocall [0007690] C ALPHA THIS IS A TEST PERIODIC PAGE SEQUENTIAL NUMBER 4719
2001-09-11 03:00:01 Arch [0485957] B ALPHA (24)[email protected]|10.134.192.34 VARESAAPP03 is UP at 01:59:12
2001-09-11 03:00:01 Arch [0987275] C ALPHA s0191: 09/11 12:28:34 Reboot NT machine gblnetnt05 in cabinet 311R at 13/1CMP:CRITICAL:Sep 11 12:28:34
2001-09-11 03:00:01 Arch [1425048] C ALPHA 300~MPfetchData:openConnectionToManager:ERROR CONNECTING:192.168.35.97 : www36 connectToServerPort:socket/socket timed out at /home/crdtdrv/creditderivatives/script/MPfetchData.pl line 342, <SOCK_192.168.35.19> chunk 193021.

I then wrote some C++ code to analyse the data, here’s the “bomb” count in 5min bins across the time period:

stringify.h header:

#include <sstream>
#include <iostream>
#include <string>
#include <stdexcept>

#ifndef STRINGIFY_H
#define STRINGIFY_H

class BadConversion : public std::runtime_error {
public:
 BadConversion(const std::string&amp;amp;amp;amp;amp; s)
      : std::runtime_error(s) {}
};

template<class _type>
inline std::string stringify(_type x)
{
  std::ostringstream o;
  if (!(o << std::fixed << x))
    throw BadConversion("stringify()");
  return o.str();
}

template<class _type>
inline _type convertTo(const std::string&amp;amp;amp;amp;amp; s)
{
  std::istringstream i(s);
  _type x;
  if (!(i >> x))
    throw BadConversion("convertTo(\"" + s + "\")");
  return x;
}

#endif

Analysis code:

#include
#include
#include
#include
#include “stringify.h”
#include
#include

using namespace std;

// Example line: 2001-09-11 03:00:00 Metrocall [1060278] B ALPHA 09/12@03:03:50 BETA13:Service (Oracle Web Lsnr 4.0(admin -DEFAULT_HOME,Intranet)) is not responding. Stopped. 1

int time_to_int(string date,string time) {

int seconds_per_day = 86400;
int seconds_per_hour = 3600;
int seconds_per_min = 60;

int day = convertTo(date.substr(8,2));
int hour = convertTo(time.substr(0,2));
int min = convertTo(time.substr(3,2));
int sec = convertTo(time.substr(6,2));

int itime = 0;
if(day == 12) itime += seconds_per_day;
itime += hour * seconds_per_hour;
itime += min * seconds_per_min;
itime += sec;

return itime;
}

int main() {

vector > message_bins;

int bin_size = 5; // mins

ifstream data_file(“all_messages”);

int start_time;
int next_time;

bool first=true;
vector current_bin;
for(;!data_file.eof();) {
char in_line[10000];
data_file.getline(in_line,10000);

if(data_file.eof()) break;

string line(in_line);

istringstream ss(line);

string date;
string time;
ss >> date;
ss >> time;

int i_time = time_to_int(date,time);

if(first) { start_time = i_time; next_time = start_time + 60*bin_size; first=false;}

if(i_time >= next_time) {
message_bins.push_back(current_bin);
current_bin.clear();
next_time += 60*bin_size;
}

std::transform(line.begin(), line.end(), line.begin(), ::tolower);
current_bin.push_back(line);
}

for(size_t n=0;n