This tutorial expects that you have TrailDB installed and working. If you haven’t installed TrailDB yet, see Getting Started for instructions.
Part I: Create a simple TrailDB
In this example, we will create a tiny TrailDB that includes events from three users. You can find the full Python source code in the traildb-python repo and the C source in the main traildb repo.
Note that opening a new TrailDB constructor fails if there is an existing TrailDB with the same name. If you run this example multiple times, you should delete the tiny directory, which may contain partial results, and tiny.tdb before running the example.
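A small cleanup helper can make the example safe to re-run. The sketch below assumes the layout described above (a working directory named tiny and a finished file named tiny.tdb). The function name `remove_old_traildb` is hypothetical and not part of the TrailDB API.

```python
import os
import shutil


def remove_old_traildb(name):
    """Delete leftovers of a previous run: the `name` directory and `name.tdb`.

    Hypothetical helper, not part of the TrailDB API.
    Returns the list of paths that were actually removed.
    """
    removed = []
    # The constructor's working directory may hold partial results.
    if os.path.isdir(name):
        shutil.rmtree(name)
        removed.append(name)
    # The finalized TrailDB file.
    tdb_file = name + '.tdb'
    if os.path.isfile(tdb_file):
        os.remove(tdb_file)
        removed.append(tdb_file)
    return removed
```

Calling `remove_old_traildb('tiny')` before creating the constructor would clear both leftovers if they exist and do nothing otherwise.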
First, let’s create a new constructor that we will use to populate the TrailDB.
The TrailDB will have two fields, username and action, which we will specify when creating the constructor.
from traildb import TrailDBConstructor, TrailDB
from uuid import uuid4
from datetime import datetime
cons = TrailDBConstructor('tiny', ['username', 'action'])
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <traildb.h>

int main(int argc, char **argv)
{
    const char *fields[] = {"username", "action"};
    tdb_error err;
    tdb_cons* cons = tdb_cons_init();
    if ((err = tdb_cons_open(cons, "tiny", fields, 2))){
        printf("Opening TrailDB constructor failed: %s\n", tdb_error_str(err));
        exit(1);
    }
Now we can populate the TrailDB with events. We are going to create three dummy users, each of which will have three events. Note that the primary key identifying the user is a UUID. We can use the uuid module in Python to generate UUIDs, or you can create your own identifiers like the C code does.
for i in range(3):
    uuid = uuid4().hex
    username = 'user%d' % i
    for day, action in enumerate(['open', 'save', 'close']):
        cons.add(uuid, datetime(2016, i + 1, day + 1), (username, action))
    static char username[6];
    static uint8_t uuid[16];
    const char *EVENTS[] = {"open", "save", "close"};
    uint32_t i, j;

    /* create three users */
    for (i = 0; i < 3; i++){
        memcpy(uuid, &i, 4);
        sprintf(username, "user%d", i);
        /* every user has three events */
        for (j = 0; j < 3; j++){
            const char *values[] = {username, EVENTS[j]};
            uint64_t lengths[] = {strlen(username), strlen(EVENTS[j])};
            /* generate a dummy timestamp */
            uint64_t timestamp = i * 10 + j;
            if ((err = tdb_cons_add(cons, uuid, timestamp, values, lengths))){
                printf("Adding an event failed: %s\n", tdb_error_str(err));
                exit(1);
            }
        }
    }
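If you prefer trail keys that stay the same across runs, one option is to derive them from a stable identifier instead of generating random ones. The sketch below is an illustration only, not what the tutorial code does: it uses `uuid.uuid5` from Python's standard library, and the helper name `user_key`, the `user:` prefix, and the namespace choice are all assumptions. It returns a 32-character hex string, the same form as `uuid4().hex`.

```python
import uuid


def user_key(username):
    """Derive a stable 32-character hex key from a username.

    Hypothetical helper; the namespace and the 'user:' prefix are arbitrary.
    """
    # uuid5 hashes the name within a namespace, so equal names give equal keys.
    return uuid.uuid5(uuid.NAMESPACE_URL, 'user:' + username).hex


for i in range(3):
    print(user_key('user%d' % i))
```

Unlike random UUIDs, these keys let events added in separate runs for the same username end up under the same identifier.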
Once you are done adding events to the TrailDB, you have to finalize it. Finalization takes care of compacting the events and creating a valid TrailDB file.
cons.finalize()
    if ((err = tdb_cons_finalize(cons))){
        printf("Closing TrailDB constructor failed: %s\n", tdb_error_str(err));
        exit(1);
    }
    tdb_cons_close(cons);
You can check the contents of the new TrailDB using the tdb tool by running tdb dump -i tiny. We can easily print out its contents using the API too:
for uuid, trail in TrailDB('tiny').trails():
    print(uuid, list(trail))
    tdb* db = tdb_init();
    if ((err = tdb_open(db, "tiny"))){
        printf("Opening TrailDB failed: %s\n", tdb_error_str(err));
        exit(1);
    }
    tdb_cursor *cursor = tdb_cursor_new(db);

    /* loop over all trails */
    for (i = 0; i < tdb_num_trails(db); i++){
        const tdb_event *event;
        uint8_t hexuuid[32];
        tdb_uuid_hex(tdb_get_uuid(db, i), hexuuid);
        printf("%.32s ", hexuuid);
        tdb_get_trail(cursor, i);
        /* loop over all events of this trail */
        while ((event = tdb_cursor_next(cursor))){
            /* cast so the argument matches %llu on every platform */
            printf("[ timestamp=%llu", (unsigned long long)event->timestamp);
            for (j = 0; j < event->num_items; j++){
                uint64_t len;
                const char *val = tdb_get_item_value(db, event->items[j], &len);
                /* the precision argument of %.*s must be an int */
                printf(" %s=%.*s", fields[j], (int)len, val);
            }
            printf(" ] ");
        }
        printf("\n");
    }
    /* release the cursor and the TrailDB handle */
    tdb_cursor_free(cursor);
    tdb_close(db);
    return 0;
}
That’s it! You can easily extend this example for creating TrailDBs based on event sources of your own.
Part II: Analyze a large TrailDB of Wikipedia edits
Wikipedia provides a database dump of the full edit history of Wikipedia pages. This is a treasure trove of data that can be used to analyze, for instance, behavior of individual contributors or edit history of individual pages.
We converted the 50GB compressed dump to a TrailDB. For this tutorial, you should download the pre-made TrailDB. Two versions are provided:
- wikipedia-history.tdb contains the full edit history of Wikipedia between January 2001 and May 2016. This TrailDB contains trails for 44M contributors, covering 663M edit actions. The size of the file is 5.8GB.
- wikipedia-history-small.tdb contains a random sample of 1% of contributors (103MB). If you are curious, this script was used to produce a random extract of the full TrailDB.
First, download the smaller snapshot above, wikipedia-history-small.tdb, which allows you to verify quickly that the code works. Python is convenient for small and medium-scale analysis, but it tends to be slow with larger amounts of data. For analyzing the full wikipedia-history.tdb, we recommend that you use the C, D, Go, or Haskell bindings of TrailDB.
Number of sessions by contributor
Trails in the Wikipedia TrailDBs include all edit actions of each Wikipedia contributor. Contributors include both anonymous contributors, who are identified by IP address (field ip), and registered contributors, who have a username (field user). Each event also includes the title of the page that was edited (field title) and the timestamp of the edit action.
To measure contributor activity, it is useful to count the number of edit sessions, in addition to the raw number of edits. We define a session as a sequence of actions where actions are at most 30 minutes apart, similar to how sessions are defined in web analytics. Counting the number of sessions by contributor is easy with TrailDB.
You can find the full Python source code in the traildb-python repo and the C source in the main traildb repo.
SESSION_LIMIT = 30 * 60  # 30 minutes, assuming timestamps are in seconds

def sessions(tdb):
    for i, (uuid, trail) in enumerate(tdb.trails(only_timestamp=True)):
        # the first event starts the first session
        prev_time = next(trail)
        num_events = 1
        num_sessions = 1
        for timestamp in trail:
            # a gap longer than the limit starts a new session
            if timestamp - prev_time > SESSION_LIMIT:
                num_sessions += 1
            prev_time = timestamp
            num_events += 1
        print('Trail[%d] Number of Sessions: %d Number of Events: %d' %
              (i, num_sessions, num_events))
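/* SESSION_LIMIT is assumed to be defined elsewhere as 30 * 60 (seconds) */
tdb_cursor *cursor = tdb_cursor_new(db);
uint64_t i;
for (i = 0; i < tdb_num_trails(db); i++){
    const tdb_event *event;
    tdb_get_trail(cursor, i);
    /* the first event starts the first session */
    event = tdb_cursor_next(cursor);
    uint64_t prev_time = event->timestamp;
    uint64_t num_sessions = 1;
    uint64_t num_events = 1;
    while ((event = tdb_cursor_next(cursor))){
        /* a gap longer than the limit starts a new session */
        if (event->timestamp - prev_time > SESSION_LIMIT)
            ++num_sessions;
        prev_time = event->timestamp;
        ++num_events;
    }
    /* cast so the arguments match %llu on every platform */
    printf("Trail[%llu] Number of Sessions: %llu Number of Events: %llu\n",
           (unsigned long long)i,
           (unsigned long long)num_sessions,
           (unsigned long long)num_events);
}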
The code loops over all trails and measures the time between actions. If the time exceeds 30 minutes, we increment the session counter. Note that the Python code sets only_timestamp=True, which makes the cursor return only timestamps instead of the full events. This is a performance optimization that removes unnecessary allocations in the inner loop, which are particularly expensive in Python.
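To see the session rule in isolation, here is a minimal sketch that applies it to a plain list of timestamps, without TrailDB. The function name `count_sessions` is hypothetical, and the code assumes timestamps are in seconds and sorted in ascending order.

```python
SESSION_LIMIT = 30 * 60  # 30 minutes, assuming timestamps are in seconds


def count_sessions(timestamps, limit=SESSION_LIMIT):
    """Return (num_sessions, num_events) for one trail's sorted timestamps."""
    timestamps = iter(timestamps)
    try:
        prev_time = next(timestamps)
    except StopIteration:
        # an empty trail has no sessions and no events
        return 0, 0
    num_sessions = 1
    num_events = 1
    for timestamp in timestamps:
        # a gap strictly longer than the limit starts a new session
        if timestamp - prev_time > limit:
            num_sessions += 1
        prev_time = timestamp
        num_events += 1
    return num_sessions, num_events


# Edits at 0 s, 10 min, 50 min, 55 min and 3 h.
example = [0, 600, 3000, 3300, 10800]
print(count_sessions(example))  # → (3, 5)
```

The gaps in the example are 10, 40, 5 and 125 minutes. Only the 40-minute and 125-minute gaps exceed the limit, so the five edits form three sessions. A gap of exactly 30 minutes does not start a new session, because the comparison is strict.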
The code outputs the number of sessions and the number of events for each contributor. We can plot a histogram of the results:
Unsurprisingly, the vast majority of contributors have only one session. However, there is a very long tail of contributors who have over 200 sessions.
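One way to tabulate such a histogram without a plotting library is to bucket the per-contributor session counts. The sketch below runs on made-up counts, not on results from the Wikipedia data. The bucket boundaries and the names `bucket_label` and `session_histogram` are assumptions chosen for illustration.

```python
from collections import Counter

# (inclusive upper bound, label) pairs; larger values fall into OVERFLOW.
BUCKETS = [(1, '1'), (10, '2-10'), (50, '11-50'), (200, '51-200')]
OVERFLOW = '>200'


def bucket_label(num_sessions):
    """Map a session count to the label of the bucket it falls into."""
    for upper, label in BUCKETS:
        if num_sessions <= upper:
            return label
    return OVERFLOW


def session_histogram(session_counts):
    """Count how many contributors fall into each bucket."""
    return Counter(bucket_label(n) for n in session_counts)


# Made-up session counts, one per contributor.
counts = [1, 1, 1, 2, 7, 35, 250]
hist = session_histogram(counts)
for label in [label for _, label in BUCKETS] + [OVERFLOW]:
    # one '#' per contributor in the bucket
    print('%-7s %s' % (label, '#' * hist[label]))
```

Widening buckets suit a long-tailed distribution, where most values sit at the low end and a few are very large.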
Not all contributors are human beings. A number of benevolent bots make routine edits in Wikipedia, such as maintaining basic statistics. In fact, in wikipedia-history.tdb you can find over 4500 users whose name ends with bot. As a fun follow-up exercise, you can write a script that tries to detect bots based on their behavior, which is often very characteristic and easy to distinguish from that of human contributors.
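As a possible starting point for the exercise, the sketch below flags trails whose gaps between edits are unusually regular. This is only one heuristic with arbitrary thresholds, not a method provided by TrailDB or used by Wikipedia. The function name `looks_automated` and its parameters are hypothetical, and timestamps are assumed to be sorted.

```python
from statistics import mean, pstdev


def looks_automated(timestamps, min_events=20, max_cv=0.1):
    """Heuristic: many edits with nearly constant gaps between them.

    `max_cv` bounds the coefficient of variation (stddev / mean) of the gaps.
    Both thresholds are arbitrary illustration values.
    """
    if len(timestamps) < min_events:
        # too little evidence to call the trail automated
        return False
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    avg = mean(gaps)
    if avg == 0:
        # every edit shares one timestamp, which is treated as automated here
        return True
    return pstdev(gaps) / avg <= max_cv


regular = list(range(0, 3000, 60))         # one edit every 60 seconds
irregular = [7 * i * i for i in range(30)]  # steadily growing gaps
print(looks_automated(regular), looks_automated(irregular))
```

A real detector would likely combine several signals, such as edit volume, time-of-day patterns, and the username suffix mentioned above.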