Alice visits a website (DRAFT)
Alice wants to visit a website. She clicks a link in her browser and up comes the page.
For the sake of clarity, the Internet Service Providers that actually deliver the content are not shown.
The website’s server delivers some HTML and other files to show the page. Huge files, or stuff that is needed quickly everywhere, may come from a Content Delivery Network (CDN) that the website runs, pays for, or simply relies on for free.
For example jQuery: a JavaScript library that helps make web pages more interactive and fancy, and which may have saved JavaScript as the language of web browsers from being overrun by something more convenient. This library can be downloaded at any time from
https://cdnjs.cloudflare.com/ajax/libs/jquery/3.5.1/jquery.min.js
https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js
- or a dozen other places.
So when someone creates a website using jQuery, the file could be delivered along with the rest of the page, or the browser might get it from one of the CDNs. They say that it is
Good for web performance because the browser might already have the file in its cache and need not load it again.
If the browser does load it, though, it sends some information to the CDN, such as
- Alice’s current IP address,
- any associated cookies,
- and potentially the name of the website Alice is visiting, for example https://anonyme-alkoholiker.de.
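What the CDN gets to see can be sketched as a plain HTTP request. The following is only an illustration, not a capture from a real browser: the function name build_cdn_request and the cookie value are made up, and the Referer value assumes that the browser sends only the origin of the page, which is a common default for cross-site requests. Alice’s IP address does not appear in the headers at all; the CDN learns it from the network connection itself.

```python
from urllib.parse import urlsplit


def build_cdn_request(page_url: str, cdn_url: str, cookie: str = "") -> str:
    """Return the text of a GET request a browser might send to a CDN.

    Illustrative only: real browsers add many more headers.
    """
    page = urlsplit(page_url)
    cdn = urlsplit(cdn_url)
    lines = [
        f"GET {cdn.path} HTTP/1.1",
        f"Host: {cdn.netloc}",
        # Assumption: only the origin of the visited page is sent, not the full URL.
        f"Referer: {page.scheme}://{page.netloc}/",
    ]
    if cookie:
        # Cookies previously set by the CDN's domain travel with every request to it.
        lines.append(f"Cookie: {cookie}")
    return "\r\n".join(lines) + "\r\n\r\n"


request_text = build_cdn_request(
    page_url="https://anonyme-alkoholiker.de",
    cdn_url="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js",
    cookie="id=hypothetical-value",  # made-up cookie for illustration
)
print(request_text)
```

Even in this reduced form, the Referer line alone tells the CDN which site Alice is looking at.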
Some websites simply don’t do this. They deliver everything by themselves. Other websites do this extensively and require 10 other services to deliver CSS styles, fonts, images, scripts and whatever. The Alcoholics Anonymous website mentioned above actually delivers its own jQuery, but a sub-library of jQuery requires the CSS from https://code.jquery.com/ui/1.11.4/themes/smoothness/jquery-ui.css?ver=5.6.2.
But that’s how websites get delivered. And, by the way, web development is complicated and expensive and websites need some revenue, so they place advertising on their pages.
The Urchin might actually be a multinational corporation, but in any case the website owner allows this other network to place banner ads, web analytics and possibly unknown things into its own page.
The owner of the website, let’s call him Bob, gets some money for each of Alice’s clicks on an ad. He can also study Alice’s reasons for visiting his page and what she looked at there. If he wants to see the statistics of all the visits to his page, he goes to Urchin’s website.
Bob can see on Urchin’s website some of the data that has been gathered for him. But not more. While Urchin may have data about a million websites, it only shows Bob the stuff related to his own website (and only the stuff that is legal). Bob’s browser has no direct connection to Urchin’s secret data. But, of course, Bob’s clicks on Urchin’s website are collected just like anybody else’s.
Bob thinks:
This is great, I get some money for running my website and I can see what people typed into Google before they came to my website, how long they stayed on each page, etc…
Urchin thinks:
This is great, another website included our advertising/analytics framework, so we can build better statistics, track more web users and deliver ads that individual people click more often, so the customers who place ads through our system will pay more.
Alice probably just thinks:
I like the website, only the ads are a bit annoying.
Personally, I am thinking:
This is a downward spiral to corporate hell
And in fact, the internet looks like this right now:
Alice just opened reddit.com without any script/ad-blockers, clicked the Yes-allow-all-just-leave-me-alone button and browsed for 5 minutes. You can get an HTTP Archive (HAR) file of Alice’s session here.
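A HAR file is plain JSON, so a first overview takes only a few lines. The sketch below counts requests per host. It relies only on the log.entries[].request.url fields of the HAR format; the sample entries are hand-made and stand in for the real file.

```python
import json
from collections import Counter
from urllib.parse import urlsplit


def requests_per_host(har: dict) -> Counter:
    """Count how many requests in a HAR document went to each host."""
    counts = Counter()
    for entry in har["log"]["entries"]:
        host = urlsplit(entry["request"]["url"]).hostname
        if host:
            counts[host] += 1
    return counts


# Tiny hand-made HAR fragment, not taken from Alice's session.
sample_har = json.loads("""
{"log": {"entries": [
  {"request": {"url": "https://www.reddit.com/"}},
  {"request": {"url": "https://www.reddit.com/r/all"}},
  {"request": {"url": "https://doubleclick.net/pixel?x=1"}}
]}}
""")

host_counts = requests_per_host(sample_har)
for host, count in host_counts.most_common():
    print(count, host)
```

For a real session, the sample would be replaced by json.load on the downloaded file, for example a hypothetical alice-session.har.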
The red connection lines mean that the content loaded from server A requested more content from server B. For example, the contents loaded from googlesyndication.com (whatever that actually is), requested more content from retailads.net, quantserver.com, webgains.com, etc…
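One way to approximate such connection lines from a HAR file is to look at the Referer header of each request: if a request to server B names a page of server A as its referer, draw a line from A to B. This is an assumption about how such a graph could be built, not necessarily how the picture was made, and the sample entries below are hand-made.

```python
from urllib.parse import urlsplit


def referer_edges(har: dict) -> set:
    """Return (from_host, to_host) pairs derived from Referer headers."""
    edges = set()
    for entry in har["log"]["entries"]:
        request = entry["request"]
        to_host = urlsplit(request["url"]).hostname
        for header in request.get("headers", []):
            if header["name"].lower() == "referer":
                from_host = urlsplit(header["value"]).hostname
                # Only keep lines that cross from one host to another.
                if from_host and to_host and from_host != to_host:
                    edges.add((from_host, to_host))
    return edges


# Hand-made entries for illustration; the paths are not from Alice's session.
sample_har = {"log": {"entries": [
    {"request": {
        "url": "https://googlesyndication.com/frame.html",
        "headers": [{"name": "Referer", "value": "https://www.reddit.com/"}],
    }},
    {"request": {
        "url": "https://webgains.com/track?x=1",
        "headers": [{"name": "Referer",
                     "value": "https://googlesyndication.com/frame.html"}],
    }},
]}}

edges = referer_edges(sample_har)
for from_host, to_host in sorted(edges):
    print(from_host, "->", to_host)
```

Referer headers can be shortened or missing, so a graph built this way shows a lower bound of the connections rather than all of them.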
The more intense the color of an Urchin, the more information got transmitted to that server. This includes tracking pixels: the browser requests an image, but the image itself is meaningless. The information in the request is the true content of that transmission. It might contain canvas fingerprints and the results of other techniques to truly identify Alice’s browsing device, regardless of deleted cookies or a changed browser profile.
Some top-notch methods for profiling are presented on browserleaks.com (you should either be okay with Google ads or have your ad blockers up to date before visiting the page).
So what is this information that is passed? Here is what Google’s doubleclick.net regularly receives from Alice’s browser:
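Seen from the tracker’s side, a tracking pixel can be sketched like this. The host tracker.example, the parameter names and the function handle_pixel_request are made up for illustration; real trackers differ in detail. The point is that the record is the product and the image is just the excuse.

```python
import base64
from urllib.parse import parse_qs, urlsplit

# A commonly used 1x1 GIF; what it shows is irrelevant.
PIXEL_GIF = base64.b64decode(
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
)


def handle_pixel_request(url: str, headers: dict, client_ip: str):
    """Return (record, image_bytes) for one tracking-pixel request.

    The record is what the tracker keeps; the image is what the browser gets.
    """
    query = parse_qs(urlsplit(url).query)
    record = {
        "ip": client_ip,
        "referer": headers.get("Referer"),
        "cookie": headers.get("Cookie"),
        "user_agent": headers.get("User-Agent"),
        # Anything a script on the page chose to append, e.g. a fingerprint hash.
        "payload": {key: values[0] for key, values in query.items()},
    }
    return record, PIXEL_GIF


record, image = handle_pixel_request(
    url="https://tracker.example/p.gif?uid=abc123&screen=1280x617",
    headers={"Referer": "https://www.reddit.com/", "User-Agent": "ExampleBrowser/1.0"},
    client_ip="203.0.113.7",  # documentation address, not a real user
)
print(record)
print(len(image), "bytes of image")
```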
xai = AKAOjsvbFEnw9d44qIKiGbLQnehTRfDxz5CoAAn6OnbsoelS7slvu7Rpox3J1a1oKw2jYP-vykNA9NOEljMZEuncJEj920OzdGNfaYCxJNeBUzvWWf0WMnlIAfow3GbMd7k5CR8ojdUWsCqvq-2jeAoMA1ZOZl6R7QMlFpxRs1oaz70eb2Hk4QsPu29f_Ingx1_hGnsgU3PePlBDbixnV7Lb8FCQ25iuUEwam1uQTy83kmnsbcyLdylrks9_GHJ_OfbvmtQy9N42eC6K2Ye-PF4coLQg15C9VIq3ZZxrf9IGlOziUNAJTb8cqQINLv04gj-BgCxlk_sU
sai = AMfl-YSqvI-2l3RwDTbIbkzSIzQwB1byK-wALfER95GT6QFW4OGumcGp-lY1SWzKNkedtVw66oU_JYDzDd6x1Pa6JJmGyQiYmapdvTLh5Ry-5tS8mTyEPfsl4YQ8s1hqfeo
sig = Cg0ArKJSzKjrJDYGyymXEAE
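These values use only letters, digits, "-" and "_", which is the alphabet of URL-safe base64, so they are probably binary data in that encoding. That is an assumption, but it is easy to try. The sketch below decodes the sig value; it yields a handful of raw bytes, which says nothing yet about what they mean.

```python
import base64


def decode_base64url(value: str) -> bytes:
    """Decode URL-safe base64 that has had its '=' padding stripped."""
    padding = "=" * (-len(value) % 4)
    return base64.urlsafe_b64decode(value + padding)


sig = "Cg0ArKJSzKjrJDYGyymXEAE"  # the sig value shown above
raw = decode_base64url(sig)
print(len(raw), "bytes:", raw.hex())
```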
And webgains.com:
callback = hitCallback
wgpayload = FOa44iFBBNlY5Du4UXuKrnZ2CI9XkPrwXjm_YrJFW73AuyPB884akiEocEcEJ1w.Cs5uQ1szHVyVxFAk.rpwoNJ9z4oYYLzZKyJcbZpMIrkJXTiEocEcEJ1w.7bhpfze1r6zdstlDJFW73E4QCwby91Sp0alnjk3nKxUC54725H5UWBL6hqeFV.Ld_lHVxX_AD_AKtgtIzZzQmpRnoyDDbbaMrjbQKBcCdDSI6KUMnGWpwoNSUC56MnGWVQdg3ZLQ0F42p9..DgcOQ_i.uJtHoqvynx9MsFyxYMAqJkL6f1BSypw.5B0KB8D1Re4GSr_U_9zWuz3YMJ5tTma1kW0SX3NlY5DtTpuy.DQJ
wgcookie = {"wgifp12595":["99582","12595","723181","","1614816668","https%3A%2F%2F365534f451099c0661cae2249111d71b.safeframe.googlesyndication.com%2Fsafeframe%2F1-0-37%2Fhtml%2Fcontainer.html","","","1770336668","62681100006646100710680011523028"]}
wgchecksum = da59b14c869d4fb06f7ea8f903687b18
userIP = 78.54.127.247
wgtime = 1614816668
(wgchecksum could actually be a canvas fingerprint, judging by the name, because their fingerprinting code comes from https://analytics-wg.webgains.io/tech-essence-clk.min.js)
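Not everything in the webgains.com request is opaque. Assuming that wgtime is a Unix timestamp in seconds (a guess based on its size), it falls on 4 March 2021. The second ten-digit number inside wgcookie lies exactly 1800 days later, which might be an expiry date, but that is a guess as well. The URL inside wgcookie is merely percent-encoded. A small sketch to check this:

```python
from datetime import datetime, timezone
from urllib.parse import unquote

wgtime = 1614816668       # value of wgtime above
later_value = 1770336668  # second ten-digit number inside wgcookie
encoded_url = (
    "https%3A%2F%2F365534f451099c0661cae2249111d71b.safeframe."
    "googlesyndication.com%2Fsafeframe%2F1-0-37%2Fhtml%2Fcontainer.html"
)

# Interpret wgtime as seconds since 1970-01-01 UTC (an assumption).
request_time = datetime.fromtimestamp(wgtime, tz=timezone.utc)
days_between = (later_value - wgtime) / 86400  # 86400 seconds per day
frame_url = unquote(encoded_url)

print(request_time.isoformat())
print(days_between)
print(frame_url)
```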
aaxadds.com:
___stu13p = aveoaamactga5dnnuee25ti2rm86bcrodqacb
lwbsh = AAX
dewh = SSP_CLIENT_gcp_eweu
dgw = desktop
flg = AAX763KC6
fw = BERLIN
skw = 617
slg = 8PR6YK195
gq = reddit.com
vhuyqdph = rtb-nv-dcos-ssp-10-6-34-207-6203
vyu = 030308_203_030312_71_ssp
yk = 617
yz = 1280
ylg = 00001614816541085013121943041473
vvsDeExfnhw = CONTROL
gdss = green
jgivwu = Y-N
xvs_ogi = false
xvs_vwulqj = 1YN-
jixqgo = 1200
jwg = 100
qjixqgo = 1200
ugo = 800
ghqg = 535
uhtxuo = https://www.reddit.com/
nzui = https://www.google.com/?&
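The odd key names in the aaxadds.com request are less mysterious than they look. Shifting every letter back by three positions in the alphabet turns uhtxuo into requrl, whose value is indeed the requested URL, and gq into dn, whose value is a domain name. This is an observation on the data shown here, not documented behaviour of the service, and what the decoded names stand for remains a guess.

```python
def shift_letters(text: str, shift: int) -> str:
    """Shift ASCII letters by `shift` positions, leaving other characters alone."""
    result = []
    for ch in text:
        if "a" <= ch <= "z":
            result.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif "A" <= ch <= "Z":
            result.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            result.append(ch)
    return "".join(result)


# A few of the key names from the request above.
observed_keys = [
    "uhtxuo", "gq", "fw", "yz", "yk",
    "vhuyqdph", "xvs_vwulqj", "vvsDeExfnhw",
]
decoded_keys = {key: shift_letters(key, -3) for key in observed_keys}
for key, decoded in decoded_keys.items():
    print(f"{key:12} -> {decoded}")
```

So this particular request is obscured rather than encrypted: the values are in plain sight and only the labels are scrambled.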
googlesyndication.com:
id = sodar2
v = 221
li = gpt_2021022501
jk = 1725576623835237
bg = !1tWl1ZbNAAWsVXnBrDsAKQB2-DxaIGAokhMs2ErfKFwljupXH0xkdsydCe9lAnMCojNvN6PKv4uhAgAAANNSAAAACGgBBwoA7V6tQWriezhBsWP1wv5WVoejfVT3YzYgFzhTVpjH3BoKgKTnDckqhfRSjjcrOgQnhKnDVRop-dQfRmYWRJdFlPwrIXCtL_RohVpoWCLtpp3o42m4yGGp6qkRAxEsoCh7ZAUMaEL3O6m27BJuvpgeUWZdJkJoGWszTvsaE0ULl4ApaDUSZzw_xaPc1iP9YXAJ_oRDB1PuTquxS0pZ4hz1Dgwfdcyk0PHLVMMTJsbRzE2eHHpXafMy-mcB0CCuNy87z2Svux-aNIp8lLhSntyFwf2UJQB1M0-o_STNlc6XTHiwCCZxdQugmLrBfkqzeJkB5e8ilj1hzcMZE0_4CQHo_OylhSyBApPFyFKzpFEIUHDmAvZWI22eotbX0fjMn7_3DHt-LJFk2mjjAmFpDlRnKNhhNaFU74JpZP1-dXfPAUZuf63cFZtSt0u7UZwi1VKepWhqTwz4a8fHVWdAxICs0EiRwFL1u8MFQiNBZV8nXVsGZQ-6q9-vljfQKJ-bveeeBalPRZs7uMEVhswHMhTsJJWElOpnROf4E87UtvGs9QaPBfcJZxtiBGDwCZ-OUT-LRos_q-DN4Ek7Nvdt--taDkmOO1TPlIpKl9pwIwQYKPBcjDE3GS1S4m9YlKfBg14pZCQr0e3wjBqKavfJhsndJt_ySAk4b7Qm1lB4nZbRPQb5CJxzwlN83xEXxTHAm64iv24GmEGYx-MOwO4wQx6p6kalywvxy0UDAssrJffDymYic5u4zYx3fCcgugEFA24SCgwvUx8eH59ndCR1cdHlrE8JEoMT2lY7KDWoNBpnAYus85In2S7mukCgLvp4BgemOKF-cNY_aUFz8LKqw_IgQeBxgryz_zT2EW8anRZW8_erTRefiZC66R1GJ6jwwRXjxeNRnZrtxf6SDT6EPuWMj7rc8wKGJP9h1MXWoYevGOHSBq6v3nMW5D3LXMaJBly7qcGl6yJI
And so on…
There is no way to tell what the encrypted messages actually contain without reverse engineering the service, which is not entirely legal and very cumbersome, because the JavaScript code is almost always minified and obfuscated, as if they had something to hide:
var ba,aa,da,ea,fa,ja,la,na,ka,oa,pa,qa,ta,va,xa,za,Ba,Da,Ea,Ga,Ia,Ja,Ka,La,Ma,Oa,Pa,Ya,ab,ib,lb,ub,vb,xb,Eb,Hb,Ib,Kb,Nb,Ob,Lb,Sb,Ub,cc,ec,gc,ic,jc,mc,nc,oc,pc,rc,sc,tc,vc,wc,yc,Ac,Ec,Hc,Ic,Nc,Qc,Rc,Sc,Vc,Xc,Zc,$c,ad,dd,id,jd,kd,ld,md,nd,od,qd,td,yd,I,zd,Ad,Bd,Cd,Dd,y,Ed,Fd,Gd,Kc,Hd,Id,Md,Nd,Od,ce,de,be,ae,ee,fe,ge,ia,he,ie,je,ke,Dc,L;ba=function(a,b){b=aa(a,b);return 0>b?null:"string"===typeof a?a.charAt(b):a[b]};aa=function(a,b){for(var c=a.length,d="string"===typeof a?a.split(""):a,e=0;e<c;e++)if(e in d&&b.call(void 0,d[e],e,a))return e;return-1};_.ca=function(a,b){return 0<=Array.prototype.indexOf.call(a,b,void 0)};da=function(a,b){b=Array.prototype.indexOf.call(a,b,void 0);var c;(c=0<=b)&&Array.prototype.splice.call(a,b,1);return c};ea=function(a){var b=a.length;if(0<b){for(var c=Array(b),d=0;d<b;d++)c[d]=a[d];return c}return[]};fa=function(a,b,c){return 2>=arguments.length?Array.proto...
Am I alone with my concern?
It seems that Tim Berners-Lee, the inventor of the World Wide Web (the so-called inventor of the Internet), is concerned as well. His idea is that Alice’s personal data, like browsing and shopping behaviour or even medical data, is voluntarily stored in a Pod, which is secured storage on some server.
Alice registers a Pod and can then give read or write access for particular details to other entities via Access Control Policies (ACP). Next time Alice visits the website, it might say:
Urchin wants to have access to the following personal data:
- Visits to this website and all affiliated websites
- Clicks on this website and all affiliated websites
- List of items you bought in the last 30 days
- Your calendar
- Your Steam records
- Your medical history
Allow?
Alice might not like that at all and decline. She’s still annoyed by the ads on the website, maybe even more so because they are no longer personalized, but the list looked too frightening.
Let’s assume that the technology of Pods is completely secure and only Alice herself decides what goes in and what comes out and to whom. Still we have Urchin lingering on the page, freely executing JavaScript and collecting data about Alice as usual.
So this is not merely a technological issue. A community like Tim Berners-Lee’s Inrupt must convince the major website owners to restrict Urchin’s actions to those allowed by Alice’s profile. This has started with the General Data Protection Regulation in Europe, but it is nowhere near useful at the moment.
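The grant-or-decline idea behind a Pod can be pictured with a toy model. This is not the Solid or Inrupt API: the Pod class, its methods and the category names are invented for illustration, and a real Pod expresses its policies in a different form. The sketch only shows the principle of default deny, where nothing leaves the Pod unless Alice has granted it.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    """Toy model of a Pod: data per category plus Alice's grants."""
    data: dict = field(default_factory=dict)
    # grants maps an agent name to the set of categories it may read.
    grants: dict = field(default_factory=dict)

    def allow(self, agent: str, category: str) -> None:
        self.grants.setdefault(agent, set()).add(category)

    def read(self, agent: str, category: str):
        # Default deny: without an explicit grant nothing leaves the Pod.
        if category not in self.grants.get(agent, set()):
            raise PermissionError(f"{agent} may not read {category}")
        return self.data[category]


pod = Pod(data={
    "site_visits": ["2021-03-04 /index.html"],
    "medical_history": ["(private)"],
})
pod.allow("Bob", "site_visits")  # Alice grants Bob one category, Urchin none

print(pod.read("Bob", "site_visits"))
try:
    pod.read("Urchin", "medical_history")
except PermissionError as error:
    print("denied:", error)
```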
How can we trust Urchin or Bob? Well, if Alice’s Pod eventually contains more data than Urchin is able to collect, it will want to have access, so it will need to play nice. We will see.
What other options are there?