web image clusters

Hi there, this is probably more of a proof-of-concept than an interesting post. Again..

Below are tSNE plots of web image properties. I thought i’d do something beautiful with all the embarrassing data from the automated website browsing experiments, so i grabbed all the images that were received and built some histograms of image properties like resolution, number of channels and mean red/green/blue/alpha values.
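To make the histogram step a bit more concrete, here is a minimal sketch of how per-domain histograms could be collected. The record layout (`domain`, `width`, `channels`) and the function name `build_histograms` are assumptions for illustration, not the notebook's actual code.

```python
from collections import Counter, defaultdict

# Hypothetical records, one dict per received image.
IMAGES = [
    {"domain": "a.example", "width": 1, "channels": 4},
    {"domain": "a.example", "width": 1, "channels": 4},
    {"domain": "a.example", "width": 300, "channels": 3},
    {"domain": "a.example", "width": 300, "channels": 3},
    {"domain": "a.example", "width": 300, "channels": 3},
    {"domain": "b.example", "width": 64, "channels": 3},
    {"domain": "b.example", "width": 64, "channels": 1},
]


def build_histograms(images, properties, min_images=5):
    """Return {domain: {property: Counter}} for domains with enough images."""
    per_domain = defaultdict(list)
    for image in images:
        per_domain[image["domain"]].append(image)

    histograms = {}
    for domain, domain_images in per_domain.items():
        # Skip domains with too few images to say anything useful.
        if len(domain_images) < min_images:
            continue
        histograms[domain] = {
            prop: Counter(image[prop] for image in domain_images)
            for prop in properties
        }
    return histograms


HISTOGRAMS = build_histograms(IMAGES, ["width", "channels"])
```

Continuous properties like the mean color values would need to be binned first (for example into 16 buckets) before counting them this way.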

The pairwise distances between the histograms were calculated using the bergi value histogram distance algorithm. The resulting distance matrix was then embedded with the TSNE solver from scikit-learn and drawn as a plotly scattergl plot.
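Roughly, the pipeline from histograms to 2d points could look like the sketch below. Note that `histogram_distance` is a plain stand-in (half the L1 distance between normalized histograms), not the actual distance algorithm used for the plots, and all function names are made up for this example. Histograms are assumed to be non-empty.

```python
def histogram_distance(hist_a, hist_b):
    """Stand-in distance in the range 0..1 between two non-empty histograms."""
    total_a = sum(hist_a.values())
    total_b = sum(hist_b.values())
    keys = set(hist_a) | set(hist_b)
    return 0.5 * sum(
        abs(hist_a.get(key, 0) / total_a - hist_b.get(key, 0) / total_b)
        for key in keys
    )


def distance_matrix(histograms):
    """Symmetric matrix (list of lists) of all pairwise distances."""
    n = len(histograms)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = histogram_distance(histograms[i], histograms[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


def embed(matrix, perplexity=30.0, seed=0):
    """Project a precomputed distance matrix to 2d with scikit-learn's TSNE."""
    # Imported lazily so the distance code above runs without these packages.
    import numpy as np
    from sklearn.manifold import TSNE

    tsne = TSNE(
        n_components=2,
        metric="precomputed",   # the input is a distance matrix, not features
        init="random",          # PCA init is not available for precomputed input
        perplexity=perplexity,  # must be smaller than the number of domains
        random_state=seed,
    )
    return tsne.fit_transform(np.asarray(matrix))
```

Each row of the returned array is the x/y position of one domain, in the same order as the input histograms.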

Each dot in the plots shows a web domain from which at least 5 images were received. The color represents the top mean value for each color histogram. You can click a dot and the first 10 urls of images are displayed below the plot. If you click a url, the image will be loaded. (That potentially also loads tracking pixels and stuff. Oh, and there is porn in there. It is best found via the entropy plot.)

And this is actually the proof-of-concept: calculate/load all data in the python notebook, pass it to javascript (together with the associated image urls) and call plotly inside javascript to render each plot, merging in the respective x/y data from each python call. Then export everything to html. My ranting is below.
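The hand-over from python to javascript can be sketched like this: serialize everything with `json`, paste it into a script string and let plotly draw it. This is a simplified illustration, not the notebook's code. It assumes plotly.js is already loaded on the page as `Plotly`, the function name `plot_script` is made up, and the click handler only logs the urls to the console instead of listing them below the plot.

```python
import json


def plot_script(div_id, xy, domains, image_urls):
    """Return an html snippet that renders one scattergl plot with plotly.js."""
    payload = json.dumps({
        "x": [point[0] for point in xy],
        "y": [point[1] for point in xy],
        "domains": domains,
        "urls": image_urls,  # one list of image urls per domain
    })
    # Keep a stray "</script>" inside the data from closing the script tag.
    payload = payload.replace("</", "<\\/")
    return f"""
<div id="{div_id}"></div>
<script>
(function() {{
    const data = {payload};
    const trace = {{
        type: "scattergl", mode: "markers",
        x: data.x, y: data.y, text: data.domains,
    }};
    Plotly.newPlot("{div_id}", [trace]);
    document.getElementById("{div_id}").on("plotly_click", function(event) {{
        const index = event.points[0].pointNumber;
        console.log(data.urls[index].slice(0, 10));
    }});
}})();
</script>
"""
```

Wrapping the javascript in a function keeps the `data` of one plot from clashing with the next one on the same page.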

resolution

Hovering over a dot with the mouse shows the domain name and the top 3 encountered image widths and the respective counts.
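A hover label like that falls out of a `Counter` almost for free. A small sketch, where the function name and the exact label format are just assumptions:

```python
from collections import Counter


def hover_label(domain, width_histogram, top=3):
    """Domain name plus the most common image widths and their counts."""
    parts = [
        f"{width}px: {count}x"
        for width, count in width_histogram.most_common(top)
    ]
    # plotly hover text accepts <br> as a line break.
    return f"{domain}<br>" + "<br>".join(parts)


LABEL = hover_label("a.example", Counter({300: 5, 1: 3, 64: 2, 10: 1}))
```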

mean color

Here the label contains the top entries from the mean red histogram.

number of channels

entropy

No histogram was used here, just the entropy of the image pixels. In other words, the information (or even randomness) in an image. A little bit of the color histogram distances is mixed in, though.
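For reference, this is one common way to measure it: the Shannon entropy of the raw pixel bytes, in bits per byte. Whether the plots use exactly this variant (per byte, per channel, or something else) is an assumption of this sketch.

```python
import math
from collections import Counter


def shannon_entropy(pixel_bytes):
    """Entropy of a bytes object in bits per byte, between 0.0 and 8.0."""
    if not pixel_bytes:
        return 0.0
    total = len(pixel_bytes)
    counts = Counter(pixel_bytes)
    return sum(
        (count / total) * math.log2(total / count)
        for count in counts.values()
    )


FLAT = shannon_entropy(bytes(100))         # a single repeated value
NOISY = shannon_entropy(bytes(range(256)))  # every value exactly once
```

A single-color image like a 1x1 tracking pixel ends up at the very bottom of the scale, while photos and noise land near the top.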

the rant

Well… it’s all exciting.. Jupyter notebook and all these possibilities. Just, the amount of work and testing until those possibilities actually work… It’s posted now and i do not want to touch that notebook again soon. The %%javascript cell magic actually only works in the notebook and not when exporting. So the fallback is to write a javascript-containing string in python and then display(f"<script>{js_code}</script>"). When loading the notebook it crashes with a javascript error and half of its content is missing… It works after reloading the page, fortunately. And by the way, the tab-auto-completion does not function properly so i’m constantly doing things like [name for name in dir(some_object) if "something i look for" in name]. Eventually i search the internet and might be challenged again with fine-tuning the script blocker. But that’s my own fault, naturally. Also this hassle with delivering everything from one server, even if it’s a microsoft one, nowadays…
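For the record, the two workarounds could be wrapped up like this. It is only a sketch: the function names are made up, and it assumes the script string goes through `IPython.display.HTML` so that it ends up as real html in the exported page.

```python
def script_html(js_code):
    """Wrap javascript source in a script tag."""
    return f"<script>{js_code}</script>"


def run_js(js_code):
    """Emit the script as html output so it survives the html export."""
    # Imported lazily; only available inside an IPython/Jupyter environment.
    from IPython.display import HTML, display

    display(HTML(script_html(js_code)))


def find_names(obj, needle):
    """Poor man's tab completion: attribute names containing a substring."""
    return [name for name in dir(obj) if needle in name]
```

For example, `find_names("", "upper")` lists the string methods that have "upper" in their name.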

Then is this actually beautiful? Well, i like those tSNE plots. And i can actually see the images if i want to. Do i want to? Not really. And those tracking pixels are the most obvious thing in each plot. All this effort just to find them again! I am tired.

Well, thanks for listening.