Tuesday, January 01, 2008

Inconsistencies in the Reality Mining dataset?

The MIT Reality Mining dataset has become a touchstone for DTN and social network type research, and I am exploring using that for some of my own simulations. My intended use is to use it as a source trace for when the experiment subjects were in bluetooth proximity with each other.

Imagine my surprise, therefore, when I ran the following SQL query, for average duration of bluetooth proximity, and found that this average was a negative number!

I chose the start and end dates to coincide with a period when Media Lab is frantically preparing for their annual showcase to Industry sponsors. Nathan Eagle, whose PhD consisted of gathering and mining this data set talks about this time in his thesis.

mysql> select avg(endtime-starttime) from (select ds.starttime,ds.endtime,ds.person_oid as src,d.person_oid as dst from devicespan ds left outer join device d on ds.device_oid=d.oid where d.person_oid != 0 and ds.starttime >'2004-10-01' and ds.endtime < '2004-11-01' order by ds.starttime) as xxx;
+------------------------+
| avg(endtime-starttime) |
+------------------------+
| -948.28116818633 |
+------------------------+
1 row in set (0.00 sec)


It turns out that this is because there are 4 records which is erroneous in this period, including one eggregious mistake, which looks like a typo (a date of 7/10 rather than 10/7):
mysql> select * from (select ds.starttime,ds.endtime,ds.person_oid as src,d.person_oid as dst from devicespan ds left outer join device d on ds.device_oid=d.oid where d.person_oid != 0 and ds.starttime >'2004-10-01' and ds.endtime < '2004-11-01' order by ds.starttime) as xxx where endtime < starttime;
+---------------------+---------------------+-----+------+
| starttime | endtime | src | dst |
+---------------------+---------------------+-----+------+
| 2004-10-07 15:51:34 | 2004-07-10 16:02:55 | 92 | 36 |
| 2004-10-31 02:02:02 | 2004-10-31 01:23:40 | 46 | 86 |
| 2004-10-31 02:12:46 | 2004-10-31 01:23:40 | 46 | 22 |
| 2004-10-31 02:18:12 | 2004-10-31 01:44:40 | 46 | 73 |
+---------------------+---------------------+-----+------+


Now, if we exclude these records, we get:
mysql> select avg(endtime-starttime) from (select ds.starttime,ds.endtime,ds.person_oid as src,d.person_oid as dst from devicespan ds left outer join device d on ds.device_oid=d.oid where d.person_oid != 0 and ds.starttime >'2004-10-01' and ds.endtime < '2004-11-01' order by ds.starttime) as xxx where endtime >=starttime;
+------------------------+
| avg(endtime-starttime) |
+------------------------+
| 12460.984108352 |
+------------------------+

There are a total of 22154 records in this period, so if you add in the effect of one mistake (bolded above), averaged over each record, you get roughly 12460+948=13405:
mysql> select (timestamp('2004-07-10 16:02:55')-timestamp('2004-10-07 15:51:34'))/22154;
+---------------------------------------------------------------------------+
| (timestamp('2004-07-10 16:02:55')-timestamp('2004-10-07 15:51:34'))/22154 |
+---------------------------------------------------------------------------+
| -13405.9257470434 |
+---------------------------------------------------------------------------+

(The other three records contribute 3 seconds to the wrong average).