
What's the difference between Consistent Self-Attention and Cross-Frame Attention in Text2Video-Zero? #99

Open
Yushuyang1994 opened this issue May 17, 2024 · 2 comments

Comments

@Yushuyang1994

They seem quite similar; could you please describe the difference between them? Thank you!

@brentjohnston

brentjohnston commented May 17, 2024

I searched the entire repo and code for "Text2Video-Zero"; the video weights have still not been released, and I don't see any code related to text-to-video yet. The dev said in another comment that it's just for comic generation for now. Not sure where you are seeing this?

@Z-YuPeng
Collaborator

Thank you for your attention. Both Consistent Self-Attention and Cross-Frame Attention make use of the key and value from self-attention, an approach that was also introduced in Imagen. However, the subjects and purposes of their self-attention operations differ. Cross-Frame Attention is applied to video generation models and uses the first frame as a reference image, while Consistent Self-Attention is built on image generation models and leverages tokens sampled from the various character images to let character features interact, thus ensuring character consistency. We will update our paper to make this distinction clearer to readers.
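
For readers who find code easier to compare, here is a minimal sketch contrasting the two mechanisms as described above. This is not the repository's actual implementation; the tensor shapes, the sampling ratio, and the function names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def cross_frame_attention(q, k, v):
    """Cross-Frame Attention (Text2Video-Zero style): every frame's queries
    attend to the key/value of the FIRST frame, which serves as a reference.

    q, k, v: (num_frames, num_tokens, dim)
    """
    f, n, d = k.shape
    k_ref = k[:1].expand(f, n, d).contiguous()  # reuse frame 0's keys for all frames
    v_ref = v[:1].expand(f, n, d).contiguous()  # reuse frame 0's values for all frames
    return F.scaled_dot_product_attention(q, k_ref, v_ref)


def consistent_self_attention(q, k, v, sample_ratio=0.5):
    """Consistent Self-Attention (StoryDiffusion style, simplified): each
    image's queries attend to its own tokens PLUS tokens randomly sampled
    from the other character images in the batch, so character features
    interact across images.

    q, k, v: (batch_of_images, num_tokens, dim)
    sample_ratio: illustrative fraction of cross-image tokens to sample.
    """
    b, n, d = k.shape
    num_sampled = int(n * sample_ratio)
    outputs = []
    for i in range(b):
        # pool the tokens of all other images in the batch
        others_k = torch.cat([k[j] for j in range(b) if j != i], dim=0)
        others_v = torch.cat([v[j] for j in range(b) if j != i], dim=0)
        idx = torch.randperm(others_k.shape[0])[:num_sampled]
        # augment this image's own key/value with the sampled tokens
        k_aug = torch.cat([k[i], others_k[idx]], dim=0).unsqueeze(0)
        v_aug = torch.cat([v[i], others_v[idx]], dim=0).unsqueeze(0)
        outputs.append(F.scaled_dot_product_attention(q[i:i + 1], k_aug, v_aug))
    return torch.cat(outputs, dim=0)


if __name__ == "__main__":
    q = torch.randn(4, 16, 64)  # 4 frames/images, 16 tokens each, dim 64
    k = torch.randn(4, 16, 64)
    v = torch.randn(4, 16, 64)
    print(cross_frame_attention(q, k, v).shape)      # torch.Size([4, 16, 64])
    print(consistent_self_attention(q, k, v).shape)  # torch.Size([4, 16, 64])
```

The key contrast: the cross-frame variant replaces every frame's key/value with those of a single reference frame, while the consistent variant keeps each image's own key/value and augments them with tokens drawn from the other images in the batch.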
