
A tiny test with the CLIP network: what happens if I feed it three inputs, "sphere", "a sphere" and "the sphere"?

At first glance, the results differ. Sometimes smooth, low-frequency features are very visible. Then the amount of noisy, higher-frequency features and textures grows.

Another way to think about it is to tinker with GIMP's "wavelet decompose" plugin, or to look at the image here: docs.gimp.org/2.10/en/plug-in-
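To make the low-frequency vs. high-frequency idea concrete, here is a minimal sketch in pure Python. It is not GIMP's algorithm and not part of CLIP+VQGAN: it just splits a grayscale image into detail layers plus a smooth residual using repeated box blurs, which is an assumption chosen for simplicity. The names `box_blur`, `decompose` and `energy`, and the two synthetic test images, are all hypothetical.

```python
# Sketch only: a blur-difference decomposition in the spirit of a
# wavelet-style split. Box blur is an assumption, not what GIMP uses.

def box_blur(img, radius):
    """Average each pixel over a (2*radius+1)^2 window, using only in-bounds pixels."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    total += img[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out

def decompose(img, scales=3):
    """Return (detail_layers, residual). Layers go from fine to coarse;
    adding all layers to the residual gives back the original image."""
    layers = []
    current = img
    for level in range(scales):
        blurred = box_blur(current, 2 ** level)
        detail = [[c - b for c, b in zip(c_row, b_row)]
                  for c_row, b_row in zip(current, blurred)]
        layers.append(detail)
        current = blurred
    return layers, current

def energy(layer):
    """Mean squared value of a layer: a rough measure of how much detail it holds."""
    return sum(v * v for row in layer for v in row) / (len(layer) * len(layer[0]))

# Two synthetic 16x16 images: a smooth ramp and a noisy-looking checkerboard.
SIZE = 16
ramp = [[x / (SIZE - 1) for x in range(SIZE)] for _ in range(SIZE)]
checker = [[float((x + y) % 2) for x in range(SIZE)] for y in range(SIZE)]

ramp_layers, ramp_residual = decompose(ramp)
checker_layers, checker_residual = decompose(checker)

print("finest-layer energy, ramp:   ", energy(ramp_layers[0]))
print("finest-layer energy, checker:", energy(checker_layers[0]))
```

The smooth ramp keeps almost all of its content in the residual, while the checkerboard puts most of its energy into the finest detail layer. Running the same comparison on the three generated images would be one rough way to put numbers on "smooth" vs. "noisy".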

Often only one side of the object, or just a section of it, gets filled in, which is interesting in itself.

CLIP is very good at filling edges with all kinds of textures, but in general, for a given token, it picks a few zones and keeps raising the frequency detail there.

These three images are in no way a representative sample, but it's fun to play with. Trying cubes and cylinders with the same three phrasings may help to test more.
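A small sketch of how that wider test could be set up: build every article/shape combination up front, then feed each prompt to the generator by hand. The names `ARTICLES`, `SHAPES` and `build_prompts` are hypothetical and not part of any CLIP+VQGAN notebook; the code only produces the list of prompt strings.

```python
from itertools import product

# Bare noun, indefinite article, definite article: the three variants from the test.
ARTICLES = ["", "a", "the"]
# "sphere" is the original test; cube and cylinder are the suggested additions.
SHAPES = ["sphere", "cube", "cylinder"]

def build_prompts(articles=ARTICLES, shapes=SHAPES):
    """Return prompts grouped by shape, e.g. 'sphere', 'a sphere', 'the sphere'."""
    return [
        " ".join(part for part in (article, shape) if part)
        for shape, article in product(shapes, articles)
    ]

prompts = build_prompts()
for prompt in prompts:
    print(prompt)
```

Keeping the seed and iteration count fixed across all nine runs would make the article the only thing that changes within each shape.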

For context: I am using BoneAmputee's CLIP+VQGAN.
