So sometimes I visit yc hackernews. Maybe just because it still has that anachronistic CSS which nowadays looks like a script blocker has interfered. There is this API link at the bottom, and the first thing I tried was the Max Item ID.

>>> import requests
>>> requests.get("https://hacker-news.firebaseio.com/v0/maxitem.json").json()
26319868

This is actually large enough to be a little challenging. So I downloaded them all using asyncio and aiohttp at a rate of about 100 requests per second and exported them into Elasticsearch. The whole process took about three days.
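The download loop can be sketched roughly like this. This is not the original script, only a sketch: the item endpoint is the official one, but the function names and the concurrency value are my assumptions.

```python
import asyncio

import aiohttp  # third-party: pip install aiohttp

# official endpoint for a single item
ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{}.json"

async def fetch_item(session, sem, item_id):
    # the semaphore caps the number of in-flight requests; with
    # typical response latency this settles around ~100 requests/second
    async with sem:
        async with session.get(ITEM_URL.format(item_id)) as resp:
            return await resp.json()

async def fetch_range(first_id, last_id, concurrency=100):
    sem = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_item(session, sem, i) for i in range(first_id, last_id + 1)]
        return await asyncio.gather(*tasks)

# usage: items = asyncio.run(fetch_range(1, 1000))
```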

They started in late 2006, and most of the stuff that we have become completely used to did not even exist then.

protocol

96.6% of all posted URLs are HTTPS nowadays. And currently about 30,000 URLs are posted each month, as can be seen in the next figure.
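The per-month HTTPS share behind that figure boils down to counting schemes. A minimal sketch (the function name and its plain-list input are mine, not the original pipeline):

```python
from urllib.parse import urlsplit

def https_share(urls):
    """Percentage of URLs whose scheme is https."""
    schemes = [urlsplit(u).scheme.lower() for u in urls]
    return 100.0 * schemes.count("https") / len(schemes)
```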

It’s just interesting how smooth the above graph is. HTTPS was specified in 2000, Let’s Encrypt started in 2016, Google deprecated HTTP in 2018, and a lot more milestones happened in between, but the figure shows none of it. Only a continuous, steady climb.

One can still find a thousand HTTP links per month ;-)

Here is a top ten of Jan/Feb 2021:

 37x diego-pacheco.blogspot.com
 30x m.nautil.us
 17x bit.ly
 12x www.bbc.com
 11x nautil.us
 11x web.archive.org
 11x www.paulgraham.com
 10x eepurl.com
 10x www.tr0lly.com
  9x blogs.quovantis.com

There are actually people posting links like http://www.bbc.com/travel/story/20210103-englands-sleepy-scientology-town. Is this on purpose, or have they disabled the automatic redirect to HTTPS without knowing it? At least in the case of the BBC page, there actually is no redirect when visiting the link (March 2021). Maybe someone posts a lot of links like this and then hangs around at public Wi-Fi spots, hoping somebody clicks the links and visits a logged-in paid site via HTTP..

What are actually the first HTTPS pages posted on hackernews?

2007-06-11T18:20:29 https://loopt.com/loopt/jobs.aspx
2008-03-31T15:43:35 https://blog.codinghorror.com/revisiting-keyboard-vs-the-mouse-pt-1/
2008-07-06T03:27:50 https://privnote.com/
2008-07-08T01:41:47 https://16systems.com/digit/
2008-07-09T21:23:34 https://adwords.google.com/select/KeywordToolExternal
2008-07-10T08:32:05 https://visualvm.dev.java.net/
2008-07-12T04:48:01 https://addons.mozilla.org/en-US/firefox/search?q=hack&cat=all&show=100
2008-07-17T08:49:40 https://www.godaddy.com/gdshop/tlds/me.asp
2008-07-17T09:45:15 https://www.godaddy.com/gdshop/tlds/me.asp?ci=12389&domainToCheck=amaze%2Eme&me%5Fsearch=1
2008-07-18T18:17:52 https://privnote.com/

domains

Here’s another historical indicator, not quite as smooth as the HTTP/S graph above: the percentage of URLs starting with www. or ending with .io or .org. (You can click or double-click on the variables on the right to show/hide them individually.)
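The three indicator variables can be computed directly from the hostname. A sketch, assuming the plot simply counts these boolean flags per month (the helper name is mine):

```python
from urllib.parse import urlsplit

def domain_flags(url):
    # .hostname strips the port and lowercases, so the
    # comparisons below run against a normalized string
    host = urlsplit(url).hostname or ""
    return {
        "www": host.startswith("www."),
        "io": host.endswith(".io"),
        "org": host.endswith(".org"),
    }
```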

query parameters

Hackernews is a marketing target (well, just my guess; if you disagree, please create an issue so we can try to prove/disprove it numerically), and the following figure shows the percentage of URLs containing UTM tracking parameters. It basically means someone is using Google Analytics to count the number of visitors that arrive on their page by following a specific link, like a post on hackernews.

The popular query parameters p and id are included for comparison. They are generally not used for campaign tracking. (Again, you can click the variables on the right to view/hide them to get a better overview.)
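Detecting a UTM-tagged URL is just a query-string check. A minimal sketch (the five utm_* keys are the standard Google Analytics set; the helper name is mine):

```python
from urllib.parse import urlsplit, parse_qs

UTM_KEYS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"}

def has_utm(url):
    # parse_qs turns "a=1&b=2" into {"a": ["1"], "b": ["2"]}
    params = parse_qs(urlsplit(url).query)
    return any(key in UTM_KEYS for key in params)
```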

Regarding the marketing allegation above, I must admit that it’s only 12 tracking-code URLs per day on average. But why is it limited in time like this?

We can split by the query values of the UTM parameters. Below are roughly all major UTM trackers. Note that a value like hackernews usually means: someone not affiliated with hackernews posted a link there, putting something like ?utm_source=hackernews in the URL to count how many people click the posted link.
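Splitting by query value amounts to collecting the utm_source values into a counter. A sketch (lowercasing is my assumption, to merge spellings like hackernews/HackerNews):

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs

def utm_sources(urls):
    counts = Counter()
    for url in urls:
        # parse_qs returns a list per key, usually of length one
        for value in parse_qs(urlsplit(url).query).get("utm_source", []):
            counts[value.lower()] += 1
    return counts
```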

(What this also implies is that google attaches hackernews to their profile of you, but hey, it’s free!)

So increasingly diverse actors are posting their tracking stuff, but in March 2018 they all stop? Does hackernews remove the parameters automatically in most cases?

Let’s compare this with the URLs posted in comments. The percentage of utm tracking links versus utm-free links in the posts’ text is plotted below. It’s not much, really, about a tenth of the url-field numbers. And it’s completely undisturbed by the March 2018 cut-off.

(These numbers are not as accurate as the url-field numbers above. The utm-urls are filtered with this lucene-style query text: href AND utm AND (source OR campaign) on a text field tokenized by the default stop analyzer. So there might be a few false positives in there.)
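For reference, the Elasticsearch stop analyzer tokenizes on non-letter characters, so the query above matches tokens rather than substrings. A local approximation of that filter (the function name is mine), which also shows where the false positives come from:

```python
import re

def matches_comment_filter(comment_html):
    # the stop analyzer splits on non-letters, so "utm_source"
    # becomes the two tokens "utm" and "source"
    tokens = set(re.split(r"[^a-z]+", comment_html.lower()))
    # href AND utm AND (source OR campaign)
    return "href" in tokens and "utm" in tokens and ("source" in tokens or "campaign" in tokens)
```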

A stupid post, but one made genuinely out of a hacker’s curiosity, confirms the suspicion that hackernews is actually removing those parameters now.

And how come there are 1236 utm-tagged URLs since April 2018? Are these privileged people? Or do they just have another parameter in front, as tested in this post?

Well, most of them do, but not these:

16761221 16839650 16878635 17655816 17902736 18289606 18441840 18874173 19419903 19752699 19884561 19946334 19958159 19958172 20421131 21212971 21405060 21522311 21629313 21841941 21948142 22137159 22294100 22351502 22474890 22593882 22671486 22717222 22830353 22861403 22958452 23033802 23116973 23235813 23315007 23397824 23475179 23557404 23922720 23994334 24381670 24392756 24425843 24771056 24793331 24821398 25139187 26171224

For historical charm, here are the first URLs on hackernews containing utm parameters:

2007-03-07T21:31:06 http://www.theonion.com/content/node/59345?&utm_source=digg_1
2007-04-24T17:54:27 http://www.theonion.com/content/node/60924?utm_source=onion_rss_daily
2008-01-18T21:47:45 http://www.theonion.com/content/news/failure_now_an_option?utm_source=EMTF_Onion
2008-03-26T01:29:28 http://www.fiercemobilecontent.com/story/the-new-wave-of-mobile-web-surfing/2008-03-18?utm_medium=nl&utm_source=link
2008-04-12T15:32:23 http://www.theonion.com/content/news/area_man_makes_it_through_day?utm_source=onion_rss_daily
2008-09-01T04:36:26 http://hubpages.com/_3svwsez9at5zx/hub/space_tourism_robertbigelow?utm_source=fanclub&utm_campaign=evite&utm_medium=email
2008-10-01T04:13:16 http://scienceblogs.com/builtonfacts/2008/09/experimental_mathematics.php?utm_source=sbhomepage&utm_medium=link&utm_content=channellink
2008-10-01T04:57:24 http://dotank.nyls.edu/communitypatent/applications.html?utm_source=Peer-to-Patent+Email+Announcements&utm_campaign=915d3c11ef-New_P2P_Apps_6_30_2008&utm_medium=email
2008-11-12T00:19:59 http://scienceblogs.com/cortex/2008/11/the_cognitive_benefits_of_natu.php?utm_source=Seed+Subscribers&utm_campaign=67d7c2d18e-Recap_7_22_to_7_287_28_2008&utm_medium=email
2008-12-15T18:56:25 http://discovermagazine.com/2009/jan/051?utm_campaign=DISCOVER%20Magazine%20Technology%20Newsletter%2012.15.2008&utm_content=darmaniiii@yahoo.com&utm_medium=Email&utm_source=VerticalResponse&utm_term=%2351%3A%20Physicists%20Build%20the%20World%27s%20Smallest%20Transistor

Next would be the part about hostnames and their usage over time but it’s actually a bit boring.

Findings like “medium.com started in 2012 and is now the most popular link target” do not catch my attention too much. In case of objection, grab the download script and explore for yourself.