Unknowable truths

Gödel’s incompleteness theorem is one of the most important and beautiful results in computer science. It tells us that even if we know exactly how a system is constructed, there are statements about that system which are true but which we can never prove. They are, in some sense, unknowable truths.

However, the usual presentation of Gödel’s theorem is complicated and unwieldy. I could never really get my head around it, or find the time to dedicate to doing so. Even after gaining the gist of it, Gödel’s description doesn’t leave me with the sense of awe that it perhaps should.

There is, however, another, less well known formulation in terms of Kolmogorov complexity. It’s simple, beautiful, and certainly leaves me with a sense of awe. It somehow feels much more like a proof ‘should’ feel, and runs as follows (see Note 1):

1. Let x denote a finite binary string. We say “x is random” if the shortest description of x is at least as long as x. That is, we can’t write a computer program shorter than x that generates x. It’s incompressible.

2. We can see by a simple counting argument that there are random strings of every length. See note 2.

3. We have a language (a formal system, to use the precise terminology) in which we can make statements like ‘x is random’. Now, suppose we can describe that language, F, in f bits of data. For example, f could be the number of bits required for an exhaustive description of F in a textbook, ‘The Language of F’. F can be a complicated language, but it must be consistent: no statement in F can be both true AND false, and only true statements can be proven to be true using F.

4. Now we’re going to show that there are statements in F which are true, but which we can’t prove! Truths which exist that we can never know. We do this by contradiction. Given F, we can exhaustively search for a proof that some specific string of length n (where n is much larger than f) is random, and then print that string when the proof is found.

But wait! If you can prove it’s random, it can’t be random! Why? Because the procedure that prints it only requires about f + log n bits of data: the f bits describing F, plus roughly log n bits to specify n. That is much smaller than n, so the string is compressible after all. We’ve therefore proven something to be true which is not true. Our language F is not consistent, and our assumption is violated.

So, in short: any consistent system (one in which the things you can prove to be true really are true) must contain true statements which you can never know to be true…

I think it’s a stunning result! It tells us that whatever logical tool you use to look at the universe, you will still fail to understand it completely. We are, in some sense, doomed to live in a state of unknowing, and must make peace with a certain degree of confusion.

——

Note 1: This form of the proof can be found on page 5 of “An Introduction to Kolmogorov Complexity and Its Applications”, though I’m not sure that it originates there.

Note 2: That’s because there are 2^n strings of length n, but only 2^0 + 2^1 + … + 2^(n-1) = 2^n − 1 strings of length n−1 or smaller. Since computer programs can themselves be expressed as binary strings, there are fewer programs of length n−1 or less than there are strings of length n. So there can’t be a distinct program of length n−1 or less for every string of length n: there must be strings of length n which are random, or ‘incompressible’.
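The inequality behind this counting argument is easy to check numerically. Here’s a small Python sketch (the function names are mine, purely for illustration):

```python
# There are 2^n binary strings of length exactly n, but only
# 2^0 + 2^1 + ... + 2^(n-1) = 2^n - 1 strings (and so at most 2^n - 1
# candidate programs) of length strictly less than n -- too few to
# compress every length-n string.

def strings_of_length(n):
    return 2 ** n

def strings_shorter_than(n):
    # Includes the empty string (2^0 = 1).
    return sum(2 ** i for i in range(n))

for n in range(1, 20):
    assert strings_shorter_than(n) == 2 ** n - 1      # the geometric sum
    assert strings_shorter_than(n) < strings_of_length(n)
```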

AOL Search Data Leak

In 2006 there was a “leak” of AOL search data: a large set of anonymized search queries was released publicly by AOL for research purposes. AOL quickly realised that the data was a public relations nightmare, and that some users could be deanonymized through their search queries. The dataset was removed from their website, and the incident led to the resignation of AOL’s CTO. Anyway, the dataset is still freely available on the internet and I’m interested in playing with it a little.

First of all I want to pull out the top search queries. UNIX sort and uniq handle this quite well as shown in the Notes section. Here are the top 50 queries:

Position Count Query
1 711111
2 233670 google
3 98713 ebay
4 91234 yahoo
5 69314 yahoo.com
6 61890 mapquest
7 55832 google.com
8 54985 myspace.com
9 51433 myspace
10 30835 www.yahoo.com
11 30045 www.google.com
12 28185 internet
13 21353 http
14 20046 www.myspace.com
15 19578 map
16 19381 weather
17 18577 my
18 18560 ebay.com
19 16191 bank
20 15940 american
21 14512 pogo
22 14063 hotmail
23 13824 msn
24 13575 ask
25 13521 craigslist
26 13294 hotmail.com
27 12825 dictionary
28 12512 yahoo
29 11877 msn.com
30 11757 mycl.cravelyrics.com
31 11091 bankofamerica
32 10360 mapquest.com
33 10158 walmart
34 10126 my
35 10117 ask.com
36 9943 tattoo
37 9530 .com
38 9059 southwest
39 8926 myspace
40 8906 white
41 8510 maps
42 8342 sex
43 8315 porn
44 7791 mailbox
45 7714 home
46 7709 www.google
47 7615 fidelity.com
48 7506 pogo.com
49 7459 target
50 7372 match.com

Notes

wget http://www.atrus.org/hosted/AOL-data.tgz
tar xvzf AOL-data.tgz
for i in ./user*; do sed '1d' "$i" >> aol_all.txt; done
awk 'BEGIN{FS="\t";}{print $2}' aol_all.txt | sort | uniq -c | sort -n -r -k 1 > aol_all.txt.top
head -n 50 aol_all.txt.top | awk 'BEGIN{n=1;}{print "<tr><td>" n "</td><td>" $1 "</td><td>" $2 "</td></tr>";n++;}'
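Out of interest, the same count can be reproduced in a few lines of Python with collections.Counter. This is just a sketch: it assumes the tab-separated layout with the query in the second column, and the sample lines below are made up:

```python
from collections import Counter

def top_queries(lines, n=50):
    """Count the second tab-separated field (the query) and return the top n."""
    counts = Counter(
        line.rstrip("\n").split("\t")[1]
        for line in lines
        if line.count("\t") >= 1
    )
    return counts.most_common(n)

# Made-up sample lines in the AnonID <tab> Query <tab> QueryTime layout:
sample = [
    "217\tgoogle\t2006-03-01 11:58:51",
    "217\tgoogle\t2006-03-02 09:10:00",
    "411\tebay\t2006-03-01 12:00:00",
]
```

Calling top_queries(open('aol_all.txt')) should give the same ranking as the sort | uniq -c pipeline above.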

September 11 Pager Message Analysis

I wanted to do some basic analysis on the September 11 pager message dump from WikiLeaks. The data is about 2 years old now, and I’m sure it’s already been analysed to death, but I wanted to have a play anyway.

First I binned the messages into 5-minute chunks and looked for occurrences of the word “plane”. You can see the results in the graph below:

There’s an unexpected blip at around 7am; this is due to messages about a US spy drone which was shot down over Iraq before the WTC planes hit.

A lot of people initially thought it was a bomb, here are the occurrences of “bomb” in text messages:

The peak at 3pm seems to represent reports of a “Car bomb explodes outside State Department, senior law enforceement officials say” mostly from yahoo news alerts.

Next I wanted to look for server issues, so I searched for any message containing: offline, timeout, error (but not terror) or “not responding”:

I’m not sure what the peak at 5am is. There are a lot of “adhocdb TEXT: adhocdb unix: WARNING: vxvm:vxio: Subdisk disk02-01 block 240384: Uncorrectable write error” and “[email protected]||elgweb8 login content match error on step 2” messages, and it appears to be mostly the latter. But you can certainly see the rate of errors pick up from about 8:00 onward. To know how much of this was due to the attacks I’d need more of a baseline. Some of the more interesting ones:

2001-09-11 09:05:20 Arch [0912377] C  ALPHA  8628**Customer reported outage* Internet is unavailable. Impact:User unable to connect to internet.  Multiple users state either extremely slow connection or connection timed out error. Occurred:08:55
2001-09-11 09:05:31 Skytel [003920778] C  ALPHA  SEV1 DesMoines as PICS  H0864464 SLI=Y ETA=Na 7:44  I:Pics is down, unable to access orders - error 202 no records found.  E:PICS Helpdesk  Janie ATCHelpdesk 512-248-4967-ATC Helpdesk
2001-09-11 09:15:46 Skytel [007607560] C  ALPHA  [email protected]||NSSW3/General Tire/T3149925/1-888-212-5447/Continental GT users in Charlotte are unable to connect to IBM IIN. NSCOMSOFT seeing errors. Please call ADVNETO.

Notes

I downloaded the WikiLeaks 9/11 pager torrent from: http://file.wikileaks.org/torrent/9-11_all_messages.7z.torrent. The pager messages are broken down by minute, but for my analysis I wanted them all in a single file, so I started by concatenating them:

cat 2001* > all_messages

The file contains lines that look like:

2001-09-11 03:00:00 Metrocall [1060278] B ALPHA 09/12@03:03:50 BETA13:Service (Oracle Web Lsnr 4.0(admin -DEFAULT_HOME,Intranet)) is not responding. Stopped. 1
2001-09-11 03:00:00 Metrocall [1421210] C ALPHA LAKEJob exceeded 4 hours on Lake
2001-09-11 03:00:00 Metrocall [1421210] C ALPHA LAKEJob 378304/VSIOWNER/OMSJRNMGR has MSGW status.
2001-09-11 03:00:00 Metrocall [0007690] C ALPHA THIS IS A TEST PERIODIC PAGE SEQUENTIAL NUMBER 4719
2001-09-11 03:00:01 Arch [0485957] B ALPHA (24)[email protected]|10.134.192.34 VARESAAPP03 is UP at 01:59:12
2001-09-11 03:00:01 Arch [0987275] C ALPHA s0191: 09/11 12:28:34 Reboot NT machine gblnetnt05 in cabinet 311R at 13/1CMP:CRITICAL:Sep 11 12:28:34
2001-09-11 03:00:01 Arch [1425048] C ALPHA 300~MPfetchData:openConnectionToManager:ERROR CONNECTING:192.168.35.97 : www36 connectToServerPort:socket/socket timed out at /home/crdtdrv/creditderivatives/script/MPfetchData.pl line 342, <SOCK_192.168.35.19> chunk 193021.

I then wrote some C++ code to analyse the data; here’s how I produced the “bomb” count in 5-minute bins across the time period.

stringify.h header:

#include <sstream>
#include <iostream>
#include <string>
#include <stdexcept>

#ifndef STRINGIFY_H
#define STRINGIFY_H

class BadConversion : public std::runtime_error {
public:
 BadConversion(const std::string& s)
      : std::runtime_error(s) {}
};

template<class _type>
inline std::string stringify(_type x)
{
  std::ostringstream o;
  if (!(o << std::fixed << x))
    throw BadConversion("stringify()");
  return o.str();
}

template<class _type>
inline _type convertTo(const std::string& s)
{
  std::istringstream i(s);
  _type x;
  if (!(i >> x))
    throw BadConversion("convertTo(\"" + s + "\")");
  return x;
}

#endif

Analysis code:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include "stringify.h"
#include <vector>
#include <algorithm>

using namespace std;

// Example line: 2001-09-11 03:00:00 Metrocall [1060278] B ALPHA 09/12@03:03:50 BETA13:Service (Oracle Web Lsnr 4.0(admin -DEFAULT_HOME,Intranet)) is not responding. Stopped. 1

int time_to_int(string date, string time) {

  int seconds_per_day = 86400;
  int seconds_per_hour = 3600;
  int seconds_per_min = 60;

  int day = convertTo<int>(date.substr(8,2));
  int hour = convertTo<int>(time.substr(0,2));
  int min = convertTo<int>(time.substr(3,2));
  int sec = convertTo<int>(time.substr(6,2));

  int itime = 0;
  if(day == 12) itime += seconds_per_day; // messages roll over into September 12
  itime += hour * seconds_per_hour;
  itime += min * seconds_per_min;
  itime += sec;

  return itime;
}

int main() {

  vector<vector<string> > message_bins;

  int bin_size = 5; // mins

  ifstream data_file("all_messages");

  int start_time;
  int next_time;

  bool first = true;
  vector<string> current_bin;
  for(;!data_file.eof();) {
    char in_line[10000];
    data_file.getline(in_line,10000);

    if(data_file.eof()) break;

    string line(in_line);

    istringstream ss(line);

    string date;
    string time;
    ss >> date;
    ss >> time;

    int i_time = time_to_int(date,time);

    if(first) { start_time = i_time; next_time = start_time + 60*bin_size; first = false; }

    if(i_time >= next_time) {
      message_bins.push_back(current_bin);
      current_bin.clear();
      next_time += 60*bin_size;
    }

    std::transform(line.begin(), line.end(), line.begin(), ::tolower);
    current_bin.push_back(line);
  }
  message_bins.push_back(current_bin); // don't drop the final partial bin

  // Count occurrences of the search term (here "bomb") in each bin and
  // print "minutes-from-start <tab> count".
  for(size_t n = 0; n < message_bins.size(); n++) {
    int count = 0;
    for(size_t i = 0; i < message_bins[n].size(); i++) {
      if(message_bins[n][i].find("bomb") != string::npos) count++;
    }
    cout << n * bin_size << "\t" << count << endl;
  }

  return 0;
}
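As a quick sanity check on the timestamp handling, the same conversion as time_to_int above can be sketched in Python (assuming, as the C++ does, that all messages fall on September 11 or 12):

```python
def time_to_int(date, time):
    """Seconds since 2001-09-11 00:00:00 for 'YYYY-MM-DD' / 'HH:MM:SS' strings."""
    day = int(date[8:10])
    hour, minute, sec = int(time[0:2]), int(time[3:5]), int(time[6:8])
    offset = 86400 if day == 12 else 0  # roll over into September 12
    return offset + hour * 3600 + minute * 60 + sec

def bin_index(date, time, bin_size_mins=5):
    """Which fixed-size bin a message timestamp falls into."""
    return time_to_int(date, time) // (bin_size_mins * 60)
```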

The Distribution of Bitcoin users

I wanted to get an idea of the distribution of bitcoin clients throughout the world. While bitcoin is a distributed network, clients currently connect to a central IRC channel to find their initial peers. I connected an IRC client to the channel and monitored connections for a couple of days, then ran the results through a GeoIP lookup tool.

The 10 countries with the most bitcoin clients are:

Position Count Country
1 2769 US, United States
2 868 DE, Germany
3 518 RU, Russian Federation
4 478 GB, United Kingdom
5 368 CA, Canada
6 332 PL, Poland
7 227 CN, China
8 227 AU, Australia
9 179 SE, Sweden
10 171 UA, Ukraine

I was slightly surprised to see the US so far ahead of everyone else, and to find Japan so far down the list, considering the technology is said to have originated there and that the largest bitcoin exchange is hosted there.

Notes

The irc log file contains lines which look like this:

10:25 -!- u4qBM3jsFJbVbpo [[email protected]] has quit [Ping timeout: 600 seconds]
10:25 -!- u5R1WUESwqDJAH4 [[email protected]] has quit [Ping timeout: 600 seconds]
10:25 -!- uCj5PhnxQtuXvRE [[email protected]] has quit [Ping timeout: 600 seconds]
10:25 -!- u5vMMajjqPPNxqT [[email protected]] has joined #bitcoin
10:25 -!- uDEdr5uhnbT2e4f [[email protected]] has quit [Ping timeout: 600 seconds]
10:26 -!- uAiz7fBo8XnZZ1i [[email protected]] has quit [Ping timeout: 600 seconds]
10:26 -!- u5jSfdNqpwjJ9eV [[email protected]] has quit [Ping timeout: 600 seconds]
10:26 -!- u7LamexRgFDoHdu [[email protected]] has quit [Ping timeout: 600 seconds]
10:26 -!- u69NdViycJWVKYm [[email protected]] has joined #bitcoin
10:26 -!- u4W2ih7yRg7TH72 [[email protected]] has joined #bitcoin
10:26 -!- uB2dfUQ2HC94KkT [[email protected]] has joined #bitcoin
10:26 -!- u6NXdjDMZoatVNi [[email protected]] has quit [Ping timeout: 600 seconds]
10:26 -!- uBrRc777VmroaKp [[email protected]] has quit [Ping timeout: 600 seconds]
10:26 -!- u7yETCjt7iCtDL1 [[email protected]] has joined #bitcoin
10:26 -!- u6M6WhB6r7ucQTM [[email protected]] has quit [Ping timeout: 600 seconds]
10:26 -!- u6DPECS8Ti8uSkD [[email protected]] has joined #bitcoin
10:26 -!- u5Dp18HMdzYBwjs [[email protected]] has quit [Ping timeout: 600 seconds]

I used the following procedure to extract the join lines, perform the GeoIP lookups, and count by country. I only used unique addresses. The GeoIP database was from MaxMind, updated on the 9th of September.

grep join irc.log > bitcoin.joins
... I then lazily replaced all the ]s with @s in the file bitcoin.joins
awk 'BEGIN{FS="@";}{print $2}' bitcoin.joins > bitcoin.addrs
awk '{print "geoiplookup " $0}' bitcoin.addrs | sort | uniq | sh > bitcoin.countries.1
cat bitcoin.countries.1 | awk 'BEGIN{FS=":"}{print $2}' | sort | uniq -c | sort -k 1 -n -r > bitcoin.cntcount
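The same extraction can be done without the ]-to-@ replacement trick using a small Python regex. A sketch, assuming join lines of the form “nick [user@host] has joined #bitcoin” (the hosts below are made-up example addresses, since the user fields in the excerpt above are redacted):

```python
import re

# Capture the host part of "[user@host] has joined ..." lines.
join_re = re.compile(r'\[[^@\]]+@([^\]]+)\]\s+has joined')

def unique_join_hosts(lines):
    """Return the unique set of hosts seen joining the channel."""
    hosts = set()
    for line in lines:
        m = join_re.search(line)
        if m:
            hosts.add(m.group(1))
    return hosts

# Made-up example log lines:
sample = [
    "10:25 -!- uAbc [x@198.51.100.7] has joined #bitcoin",
    "10:25 -!- uDef [y@198.51.100.7] has quit [Ping timeout: 600 seconds]",
    "10:26 -!- uGhi [z@203.0.113.9] has joined #bitcoin",
]
```

Each unique host can then be fed to geoiplookup and the countries tallied, exactly as in the shell pipeline above.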

The full breakdown of countries is as follows:

   2769  US, United States
    868  DE, Germany
    518  RU, Russian Federation
    478  GB, United Kingdom
    368  CA, Canada
    332  PL, Poland
    227  CN, China
    227  AU, Australia
    179  SE, Sweden
    171  UA, Ukraine
    160  NL, Netherlands
    143  FR, France
    109  IT, Italy
     95  DK, Denmark
     95  AR, Argentina
     90  FI, Finland
     85  BR, Brazil
     82  RO, Romania
     72  ES, Spain
     72  AT, Austria
     71  CH, Switzerland
     64  ZA, South Africa
     59  MX, Mexico
     56  --, N/A
     48  IE, Ireland
     48  CZ, Czech Republic
     45  IL, Israel
     44  BE, Belgium
     43  NZ, New Zealand
     42  CO, Colombia
     41  NO, Norway
     35  IN, India
     27  SG, Singapore
     27  RS, Serbia
     27  BG, Bulgaria
     26  BY, Belarus
     24  TW, Taiwan
     20  TH, Thailand
     20  CL, Chile
     18  SI, Slovenia
     17  KZ, Kazakhstan
     17  JP, Japan
     17  HU, Hungary
     16  PT, Portugal
     16  GR, Greece
     15  PH, Philippines
     14  SK, Slovakia
     14  LT, Lithuania
     13  LV, Latvia
     12  MD, Moldova, Republic of
     12  HK, Hong Kong
     11  SA, Saudi Arabia
     11  MY, Malaysia
     10  MA, Morocco
     10  BA, Bosnia and Herzegovina
      9  HR, Croatia
      8  VN, Vietnam
      8  DO, Dominican Republic
      7  PE, Peru
      7  KR, Korea, Republic of
      7  A1, Anonymous Proxy
      6  VE, Venezuela
      6  IS, Iceland
      6  ID, Indonesia
      6  EC, Ecuador
      6  AE, United Arab Emirates
      5  PG, Papua New Guinea
      5  IR, Iran, Islamic Republic of
      4  TT, Trinidad and Tobago
      4  LK, Sri Lanka
      4  HN, Honduras
      4  EG, Egypt
      4  EE, Estonia
      4  BS, Bahamas
      3  UY, Uruguay
      3  TR, Turkey
      3  PA, Panama
      3  MK, Macedonia
      3  LU, Luxembourg
      3  JM, Jamaica
      3  JE, Jersey
      3  CR, Costa Rica
      3  BO, Bolivia
      2  ZM, Zambia
      2  TN, Tunisia
      2  MT, Malta
      2  LB, Lebanon
      2  ET, Ethiopia
      2  DZ, Algeria
      1  TZ, Tanzania, United Republic of
      1  SY, Syrian Arab Republic
      1  QA, Qatar
      1  PY, Paraguay
      1  PR, Puerto Rico
      1  PK, Pakistan
      1  OM, Oman
      1  MN, Mongolia
      1  KE, Kenya
      1  JO, Jordan
      1  GY, Guyana
      1  GU, Guam
      1  GT, Guatemala
      1  GI, Gibraltar
      1  CI, Cote D'Ivoire
      1  BN, Brunei Darussalam
      1  BB, Barbados
      1  AZ, Azerbaijan
      1  AL, Albania
      1  AG, Antigua and Barbuda