There are addictions to all kinds of things. Currently, i’m addicted to generating timeseries of all kinds of public data. I just finished setting up a data archive for a couple of API responses from the Deutsche Bahn, which was basically just a procrastination tactic to pull myself away from refactoring the office schedule data archive, which is quite elaborate. More procrastination followed and i started using the github search to look for other repositories that collect historic data. There are quite a few interesting ones. It did not take long to stumble across the github archive, which i hadn’t heard of before.

The github archive is a bunch of compressed, newline-delimited json files containing all the events of the github events API. Well, maybe not entirely all (there are statements about missing events in the discussions) but enough to make people stumble.
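Each line in one of those files is a single JSON event object. A minimal sketch of peeking into one hourly file (using, for example, the 2021-03-01-15.json.gz file downloaded below) might look like this:

import gzip
import json

# iterate the events of a single gharchive hour file
with gzip.open("2021-03-01-15.json.gz", "rt") as fp:
    for line in fp:
        event = json.loads(line)
        # each line is one event with "id", "type", "actor", "repo", "payload", ...
        print(event["type"], event["repo"]["name"])
        break  # just peek at the first one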

To get a sense of the sheer amount of data, download a single hour from each year since availability (Feb. 2011):

$ wget https://data.gharchive.org/20{11..21}-03-01-15.json.gz

3 PM might actually be a busy hour but the numbers are straight out of a horror movie!

$ ls -l
    450433  2011-03-01-15.json.gz
   2329840  2012-03-01-15.json.gz
   2648943  2013-03-01-15.json.gz
   3369069  2014-03-01-15.json.gz
   6365599  2015-03-01-15.json.gz
  21268062  2016-03-01-15.json.gz
  25974997  2017-03-01-15.json.gz
  37577449  2018-03-01-15.json.gz
  42593332  2019-03-01-15.json.gz
  35538137  2020-03-01-15.json.gz
 104531294  2021-03-01-15.json.gz

While the files for the year 2011 might fit into 3 GB, 2016 probably needs more than 100 GB and 2021 will certainly amount to over 500 GB!
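These are just back-of-the-envelope extrapolations from the single sampled hour, roughly along these lines:

# naive extrapolation: one sampled hour times the number of hours in a year
hour_sizes = {2011: 450_433, 2016: 21_268_062, 2021: 104_531_294}

for year, size in hour_sizes.items():
    print(f"{year}: ~{size * 24 * 365 / 1024**3:.0f} GB per year")

Since 3 PM is on the busier side, the real yearly totals should come out somewhat lower than that.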

So well, i listened to The Changelog episode #144, as suggested by gharchive.org, and the author Ilya Grigorik was obviously struggling with the same problems. The solution at the time was to put everything on Google BigQuery. As Ilya states in the podcast: a BigQuery query over the whole dataset takes between 1 and 10 seconds, while just reading all the files from disk takes an hour.

Now, i’m getting older and my not-to-do list is growing. I’m not willing to simply repeat “Oh, you work at Google? That’s amazing!” and whenever i see Google BigQuery or Google Colab, i simply skip the link and look for other ways to access what i want. It may just be stubbornness but that’s how it is. Google APIs are blocked in my browser and i don’t feel compelled enough to allow them just for recreational purposes.

Anyways, i can parse this data myself! How difficult can it be?

It’s November 2021 right now, and i have this 2nd-hand hetzner server with about 1.7 TB of free hard disk space, so i’m deciding to download the years 2018 to 2021, as far as i get.

du -h
175G	./2018
253G	./2019
419G	./2020
443G	./2021
1,3T	.

Monstrous! The download took 4 hours or so. In the meantime i created some utility code that i can run on the server to bucket the data for further processing.

As an example let’s look at the WatchEvents which represent the starring of a repository. Here’s the first recorded WatchEvent of 2018:

{
  "id": "7044401123",
  "type": "WatchEvent",
  "actor": {
    "id": 1710912,
    "login": "yangwenmai",
    "display_login": "yangwenmai",
    "gravatar_id": "",
    "url": "https://api.github.com/users/yangwenmai",
    "avatar_url": "https://avatars.githubusercontent.com/u/1710912?"
  },
  "repo": {
    "id": 75951828,
    "name": "wainshine/Chinese-Names-Corpus",
    "url": "https://api.github.com/repos/wainshine/Chinese-Names-Corpus"
  },
  "payload": {
    "action": "started"
  },
  "public": true,
  "created_at": "2018-01-01T00:00:02Z"
}

The bucketing script produces a CSV like this:

date,user,repo,action,events
2018-01-01T00:00:00Z,yangwenmai,wainshine/Chinese-Names-Corpus,started,1

The date is floored to the nearest hour and every occurrence of an event with the same date, user, repo and action is bucketed into a single line, incrementing the events counter. Apart from losing the event ID and the particular minute of the hour, this keeps all relevant information. Counting the number of similar events is more relevant for other event types like PushEvent.
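The actual utility code running on the server is a bit more involved; a minimal sketch of the bucketing idea for WatchEvents (filenames and exact column handling are just assumptions) could look like this:

import collections
import csv
import gzip
import json

def bucket_watch_events(filename: str, out_filename: str):
    # count WatchEvents per (hour, user, repo, action)
    buckets = collections.Counter()
    with gzip.open(filename, "rt") as fp:
        for line in fp:
            event = json.loads(line)
            if event["type"] != "WatchEvent":
                continue
            # floor the timestamp to the hour
            date = event["created_at"][:13] + ":00:00Z"
            key = (date, event["actor"]["login"], event["repo"]["name"], event["payload"]["action"])
            buckets[key] += 1

    with open(out_filename, "w", newline="") as fp:
        writer = csv.writer(fp)
        writer.writerow(["date", "user", "repo", "action", "events"])
        for (date, user, repo, action), count in sorted(buckets.items()):
            writer.writerow([date, user, repo, action, count])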

The conversion of all the WatchEvents in 2018

  • took eleven hours (while other events were bucketed in parallel)
  • produced a CSV with
    • 36,585,735 rows
    • and a compressed size of 628,047,892 bytes (2,381,261,647 bytes uncompressed)

Now, that’s not a CSV you can just open but at least i was able to copy it from the server to my laptop.

Python’s generators are quite useful to walk through a CSV or ndjson file which might not fit into memory in its entirety. Let’s measure the time it takes to find all of yangwenmai’s WatchEvents with vanilla python:

import csv
import gzip
import time
import io
import pandas as pd

# per-line iterator through the gzipped csv
def iter_lines(filename: str = "../../../gharchive-stats/server-all/2018/watch_h.csv.gz"):
    with io.TextIOWrapper(io.BufferedReader(gzip.open(filename))) as fp:
        yield from fp

start_time = time.time()

# parse csv and filter for user
rows = []
for row in csv.DictReader(iter_lines()):
    if row["user"] == "yangwenmai":
        rows.append(row)

duration = time.time() - start_time
print(f"took {duration:.2f} seconds")

# convert to pandas
df = pd.DataFrame(rows)
df
took 112.14 seconds
date user repo action events
0 2018-01-01T00:00:00Z yangwenmai wainshine/Chinese-Names-Corpus started 1
1 2018-01-01T07:00:00Z yangwenmai yangwenmai/jaeger-opentracing-examples started 1
2 2018-01-01T15:00:00Z yangwenmai faceair/youjumpijump started 1
3 2018-01-02T23:00:00Z yangwenmai RangelReale/osin started 1
4 2018-01-02T23:00:00Z yangwenmai chai2010/advanced-go-programming-book started 1
... ... ... ... ... ...
496 2018-11-28T14:00:00Z yangwenmai mholt/archiver started 1
497 2018-11-30T02:00:00Z yangwenmai henrylee2cn/aster started 1
498 2018-12-11T14:00:00Z yangwenmai developer-learning/telegram-bot-go started 1
499 2018-12-13T00:00:00Z yangwenmai bragfoo/TiPrometheus started 1
500 2018-12-17T01:00:00Z yangwenmai cmu-db/ottertune started 1

501 rows × 5 columns

Over a hundred seconds just to parse through the CSV, no matter what we are actually looking for. For the whole dataset this would probably add up to a query time of around 20 minutes. Far from optimal, but it requires no complicated stuff, google accounts or lots of memory.

For reasons, i quite like Elasticsearch, so i moved the CSV file into an elasticsearch index, which took 90 minutes and resulted in an index size of 4.2 GB.
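The import itself is not shown here; a minimal sketch of such a bulk import, written against the official python client rather than the exact code i ran (index name as above, the derived timestamp fields mirror the result columns further down), could look like this:

import csv
import datetime

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch()

def generate_documents():
    # re-use the iter_lines() generator from above to stream the gzipped CSV
    for row in csv.DictReader(iter_lines()):
        date = datetime.datetime.strptime(row["date"], "%Y-%m-%dT%H:%M:%SZ")
        yield {
            "_index": "gharchive-watch-2018",
            "_source": {
                "timestamp": date,
                # derived fields for convenient aggregations
                "timestamp_hour": date.hour,
                "timestamp_weekday": date.strftime("%w %A"),
                "user": row["user"],
                "repo": row["repo"],
                "action": row["action"],
                "events": int(row["events"]),
            },
        }

bulk(es, generate_documents(), chunk_size=10_000)

Repeating the above query using elastipy: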

from elastipy import Search

response = (Search("gharchive-watch-2018")
 .term("user", "yangwenmai")
 .size(1000)
 .execute()
)
print("took", response["took"], "ms")
pd.DataFrame(response.documents)
took 20 ms
timestamp timestamp_hour timestamp_weekday user repo action events
0 2018-05-25T23:00:00 23 5 Friday yangwenmai developer-learning/learning-kubernetes started 1
1 2018-05-26T09:00:00 9 6 Saturday yangwenmai danistefanovic/build-your-own-x started 1
2 2018-05-27T03:00:00 3 0 Sunday yangwenmai yongman/tidis started 1
3 2018-05-28T02:00:00 2 1 Monday yangwenmai muesli/kmeans started 1
4 2018-05-28T06:00:00 6 1 Monday yangwenmai appoptics/appoptics-apm-go started 1
... ... ... ... ... ... ... ...
496 2018-11-28T14:00:00 14 3 Wednesday yangwenmai mholt/archiver started 1
497 2018-11-30T02:00:00 2 5 Friday yangwenmai henrylee2cn/aster started 1
498 2018-12-11T14:00:00 14 2 Tuesday yangwenmai developer-learning/telegram-bot-go started 1
499 2018-12-13T00:00:00 0 4 Thursday yangwenmai bragfoo/TiPrometheus started 1
500 2018-12-17T01:00:00 1 1 Monday yangwenmai cmu-db/ottertune started 1

501 rows × 7 columns

Yes! Praise “Bob!”. Who needs the google cloud?

Well, apart from hosting gharchive.org i mean…

And let’s be fair and do a query that requires all documents to be inspected:

agg = (Search("gharchive-watch-2018")
 .agg_terms("action", field="action")
 .metric_sum("events", field="events")
 .metric_cardinality("users", field="user")
 .metric_cardinality("repos", field="repo")
 .execute()
)
print(f'{agg.search.response["took"]:,} ms')
agg.df().style.format("{:,}", ["action.doc_count", "events", "users", "repos"])
35,968 ms
action action.doc_count events users repos
0 started 36,585,734 37,066,146.0 2,956,456 3,392,387

30+ seconds on a single-shard elasticsearch index on my laptop. Although elasticsearch can aggregate over the indices for the separate years in parallel, this might still lead to a range of minutes for the whole dataset. We’ll see. It’s kind of manageable, though.
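Assuming the other years end up in indices following the same naming scheme, such a query across all of them should be as simple as using a wildcard index pattern:

# hypothetical query across gharchive-watch-2018, gharchive-watch-2019, ...
agg_all = (Search("gharchive-watch-*")
 .agg_terms("action", field="action")
 .metric_sum("events", field="events")
 .execute()
)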

Have you ever un-starred somebody’s repository on github? I think this might happen once or twice. It does not seem to be reported by the events API, though.

There’s certainly procrastination potential for more posts. I’ll just wrap up some basic findings about the dataset right now.

  • Some events are listed more than once. Their ID simply repeats. I skip those by maintaining a set of processed IDs; once the size of the set exceeds a million, i remove the oldest 500,000 (a minimal sketch follows after this list).
  • Some timestamps can be a bit out of sequential order. I keep a stash of buckets such that only buckets that are two hours behind the currently processed events are written out to the export.
  • Some events are listed more than once with distinct event IDs. That’s why i count the number of each specific event. In the 2018 WatchEvents there seem to be almost half a million duplicates as can be seen by comparing events and action.doc_count above.
  • … Or, these are not duplicates at all but star - unstar - star sequences where the unstar is not reported.
  • … But, from what i’ve seen in the PushEvent data, there are manifold, particular reasons why events repeat, and i don’t think all of them can be determined. Sometimes there is a correlation with the actual commits, sometimes not. Some people push the same things to 7,000 cryptically named branches… One guy has somehow issued 30,000 push events over a month on a repository which only has 14 commits or so.
  • In other words: it’s quite messy. And getting to that realisation already requires quite intense data shoveling. Just look at the PushEvent file:
    • 145,893,474 rows
    • 3,018,256,231 bytes compressed
    • 12,934,378,008 bytes uncompressed

    I mean, 12 GB is better than 200 but still…
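The deduplication mentioned in the first point above is nothing fancy; a minimal sketch of the bounded ID set (numbers as stated, everything else assumed) could look like this:

# a dict keeps insertion order, so it doubles as a FIFO of processed event IDs
processed_ids: dict = {}

def is_duplicate(event_id: str) -> bool:
    if event_id in processed_ids:
        return True
    processed_ids[event_id] = None
    # once the set grows beyond a million IDs, drop the oldest 500,000
    if len(processed_ids) > 1_000_000:
        for old_id in list(processed_ids)[:500_000]:
            del processed_ids[old_id]
    return False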

Anyways, one nice plot at the end:

print("determine the top 10 starred repos per week")
start_time = time.time()

df_10_per_week = (Search("gharchive-watch-2018")
 .agg_date_histogram("date", calendar_interval="1w")
 .agg_terms("repo", field="repo")
 .execute().df().set_index("date")
)

duration = time.time() - start_time
print(f"took {duration:.2f} seconds")

# build the set of all repo names
top_repos = sorted(df_10_per_week["repo"].unique())

print("\nget the timelines of", len(top_repos), "repos, resampled to a day")
start_time = time.time()

df_timelines = (Search("gharchive-watch-2018")
 .terms("repo", top_repos)
 .agg_terms("repo", field="repo", size=len(top_repos))
 .agg_date_histogram("date", calendar_interval="1d")
 .execute().df()
 .set_index(["date", "repo"])
 ["date.doc_count"].unstack("repo")
)

duration = time.time() - start_time
print(f"took {duration:.2f} seconds")

# truncate the name of the repos (because of this particular eeeeeeeee one!)
df_timelines.columns = df_timelines.columns.map(lambda c: c[:40])

df_timelines.plot(
    title="2018 timelines of the top-10 starred repos per week",
    labels={"value": "number of star events", "date": "day"}
)
determine the top 10 starred repos per week
took 12.54 seconds

get the timelines of 386 repos, resampled to a day
took 1.79 seconds

So, that’s the usual trendy stuff that they want us to be informed about. The query times are not too bad, though ;) You can drag & zoom and enable/disable the repos on the right side.

By the way. The javascript for this plot you just downloaded weighs about 4 megabytes.