Currently trying to fine-tune SD1.5 to take patches from CLIP ViT-L/14 as the prompt rather than text tokens, using about 2k SD images I generated in October as the dataset. That should allow things like prompting with a pair of images to get something in between. Eventually, multi-modal prompts (text + images) would be nice. #AIArt #StableDiffusion #machinelearning #AI
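(For context, a rough sketch of the conditioning path being described, using diffusers/transformers. The model ids, the 1024→768 projection layer, and the dummy inputs are assumptions for illustration, not the actual training code.)

```python
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor
from diffusers import UNet2DConditionModel

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

image = Image.new("RGB", (512, 512))  # placeholder for a dataset image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Full patch sequence: (1, 257, 1024) = CLS token + 16x16 patches at 224x224
patch_tokens = vision(pixel_values).last_hidden_state

# SD1.5 cross-attention expects 768-dim context, so a small learned projection
# (the part that would get trained, with or without the UNet) maps 1024 -> 768
proj = torch.nn.Linear(1024, 768)
context = proj(patch_tokens)  # (1, 257, 768), used in place of the text tokens

noisy_latents = torch.randn(1, 4, 64, 64)  # dummy latents for illustration
timestep = torch.tensor([500])
noise_pred = unet(noisy_latents, timestep, encoder_hidden_states=context).sample
```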
@nichg like this model here? https://huggingface.co/lambdalabs/sd-image-variations-diffusers
@GaggiX exactly like that I guess 😅
@nichg I don't think 2k images are nearly enough for this job
@GaggiX interesting if that's the case, since Dreambooth seems to work so well with only 10-100 images. What if one just trained the projection layer and not the UNet?
@nichg with Dreambooth you're not trying to shift the entire conditioning distribution
@GaggiX Well anyhow, this probably saved me three months of training time, so thanks!
@GaggiX Oh, hm, it's actually not quite the same as the Image Variations thing. What I was trying to do was use the whole set of ViT patches as tokens, whereas it looks like Variations uses the pooled and projected CLIP vector. So unfortunately you can't just use two images as input and get a third with Variations...
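(Roughly the contrast being drawn, as a sketch; the model class and shapes are assumed from the public ViT-L/14 checkpoint, not taken from the Variations repo.)

```python
import torch
from transformers import CLIPVisionModelWithProjection

vision_proj = CLIPVisionModelWithProjection.from_pretrained(
    "openai/clip-vit-large-patch14"
)
pixel_values = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image

# Pooled-and-projected route (what Variations appears to use):
# one (1, 768) vector per image -> effectively a single conditioning "token"
image_embeds = vision_proj(pixel_values).image_embeds
context_single = image_embeds.unsqueeze(1)  # (1, 1, 768)

# vs. the full patch route above, which keeps a (1, 257, 768) context
```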
@nichg I thought it would be possible to interpolate between two CLIP embeddings and get in-between variations. Isn't that how DALL-E 2 was conditioned?
@GaggiX Yeah but that means something different. Interpolating between two vectors gives you the concept halfway between the two. But if you use cross-attention over a token set and add additional tokens, it's more like an 'and'.
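(A toy illustration of that distinction, with made-up tensors in place of real CLIP outputs.)

```python
import torch

emb_a = torch.randn(1, 768)             # pooled CLIP vector for image A
emb_b = torch.randn(1, 768)             # pooled CLIP vector for image B
halfway = 0.5 * emb_a + 0.5 * emb_b     # "the concept halfway between A and B"

tokens_a = torch.randn(1, 257, 768)     # projected patch tokens for image A
tokens_b = torch.randn(1, 257, 768)     # projected patch tokens for image B
both = torch.cat([tokens_a, tokens_b], dim=1)  # (1, 514, 768): more like "A and B"
```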
@nichg I don't understand what you're trying to do; I thought it was something like Midjourney when it's conditioned on two or more images