**Nicholas Guttenberg** @nichg@sigmoid.social · Dec 21, 2022, 21:59

**Nicholas Guttenberg** @nichg@sigmoid.social · Dec 21, 2022, 21:59

Nicholas Guttenberg @nichg@sigmoid.social

Dec 21, 2022, 21:59

Nicholas Guttenberg @nichg@sigmoid.social

Currently trying to fine-tune SD1.5 to use patches from CLIP ViT/14 as the prompt rather than text tokens, using about 2k SD images I generated in October as the dataset. Should allow things like prompting with an image pair to get something in between. Eventual multi-modal (text + images as prompts) would be nice. #AIArt #StableDiffusion #machinelearning #AI

**GaggiX** @GaggiX@qoto.org · Dec 21, 2022, 23:26

**GaggiX** @GaggiX@qoto.org · Dec 21, 2022, 23:26

Dec 21, 2022, 23:26

GaggiX @GaggiX@qoto.org

@nichg like this model here? https://huggingface.co/lambdalabs/sd-image-variations-diffusers

**Nicholas Guttenberg** @nichg@sigmoid.social · Dec 21, 2022, 23:49

**Nicholas Guttenberg** @nichg@sigmoid.social · Dec 21, 2022, 23:49

Dec 21, 2022, 23:49

Nicholas Guttenberg @nichg@sigmoid.social

@GaggiX exactly like that I guess 😅

**GaggiX** @GaggiX@qoto.org · Dec 22, 2022, 00:20

**GaggiX** @GaggiX@qoto.org · Dec 22, 2022, 00:20

Dec 22, 2022, 00:20

GaggiX @GaggiX@qoto.org

@nichg I don't think that 2k images are nearly enough for this job

**Nicholas Guttenberg** @nichg@sigmoid.social · Dec 22, 2022, 00:36

**Nicholas Guttenberg** @nichg@sigmoid.social · Dec 22, 2022, 00:36

Dec 22, 2022, 00:36

Nicholas Guttenberg @nichg@sigmoid.social

@GaggiX interesting if that's the case, since Dreambooth seems to work so well with only 10-100. What if one just trained the projection layer and not the UNet?

**GaggiX** @GaggiX@qoto.org · Dec 22, 2022, 01:10

**GaggiX** @GaggiX@qoto.org · Dec 22, 2022, 01:10

Dec 22, 2022, 01:10

GaggiX @GaggiX@qoto.org

@nichg with Dreambooth you're not trying to shift the entire conditioning distribution

**Nicholas Guttenberg** @nichg@sigmoid.social · Dec 22, 2022, 01:37

**Nicholas Guttenberg** @nichg@sigmoid.social · Dec 22, 2022, 01:37

Dec 22, 2022, 01:37

Nicholas Guttenberg @nichg@sigmoid.social

@GaggiX Well anyhow, this probably saved me three months of training time, so thanks!

**Nicholas Guttenberg** @nichg@sigmoid.social · Dec 22, 2022, 01:53

**Nicholas Guttenberg** @nichg@sigmoid.social · Dec 22, 2022, 01:53

Dec 22, 2022, 01:53

Nicholas Guttenberg @nichg@sigmoid.social

@GaggiX Oh, hm, its actually not quite the same is the Image Variations thing. What I was trying to do was to use the whole set of ViT patches as tokens, whereas it looks like Variations uses the pooled and projected CLIP vector. So you can't just use two images as input and get a third with Variations unfortunately...

**GaggiX** @GaggiX@qoto.org · 2022-12-22T02:47:56Z

GaggiX @GaggiX@qoto.org

@nichg I thought it would be possible to interpolate between two CLIP embeddings and get in-between variations, Isn't this the same way Dalle 2 was conditioned on?

Dec 22, 2022, 02:47 · · Tusky · · ·

**Nicholas Guttenberg** @nichg@sigmoid.social · Dec 22, 2022, 03:40

**Nicholas Guttenberg** @nichg@sigmoid.social · Dec 22, 2022, 03:40

Dec 22, 2022, 03:40

Nicholas Guttenberg @nichg@sigmoid.social

@GaggiX Yeah but that means something different. Interpolating between two vectors gives you the concept halfway between the two. But if you use cross-attention over a token set and add additional tokens, its more like an 'and'.

**GaggiX** @GaggiX@qoto.org · Dec 22, 2022, 03:51

**GaggiX** @GaggiX@qoto.org · Dec 22, 2022, 03:51

Dec 22, 2022, 03:51

GaggiX @GaggiX@qoto.org

@nichg I don't understand what are you trying to do, I thought it was something like Midjourney when it's conditioned on two or more images

**Nicholas Guttenberg** @nichg@sigmoid.social · Dec 22, 2022, 03:58

**Nicholas Guttenberg** @nichg@sigmoid.social · Dec 22, 2022, 03:58

Dec 22, 2022, 03:58

Nicholas Guttenberg @nichg@sigmoid.social

@GaggiX Broadly, if fine-tuning to new modalities is easy, you could have a multi-modality model where you choose what to drop in and what to remove. Use the ViT tokens except in an area to inpaint. Combine with audio because why not. Have some text be descriptive, other be seen as tags, one image for a mask, one for a depth map, segmentation map, and a texture reference, etc.

**GaggiX** @GaggiX@qoto.org · Dec 22, 2022, 04:17

**GaggiX** @GaggiX@qoto.org · Dec 22, 2022, 04:17

Dec 22, 2022, 04:17

GaggiX @GaggiX@qoto.org

@nichg I still have no idea what you are trying to do but I guess good luck ahah

**hans** @wav@sigmoid.social · Dec 23, 2022, 07:35

**hans** @wav@sigmoid.social · Dec 23, 2022, 07:35

Dec 23, 2022, 07:35

hans @wav@sigmoid.social

@nichg @GaggiX does the cross attention operation accept arbitrary length sequences? Sounds like a really cool idea.

Would be especially cool if you could do something like LoRA on the cross attention weights to separately fine-tune different conditioning modalities and then merge the ones you need at inference time.

**Nicholas Guttenberg** @nichg@sigmoid.social · Dec 23, 2022, 07:50

**Nicholas Guttenberg** @nichg@sigmoid.social · Dec 23, 2022, 07:50

Dec 23, 2022, 07:50

Nicholas Guttenberg @nichg@sigmoid.social

@wav @GaggiX yep, that's one advantage! Though practically if you get too far from the token count used in training it can wash out and just sort of average.

Trending now

Resources

Developers

What is Mastodon?

qoto.org

More…