Text-to-video model

A video generated using OpenAI's unreleased Sora text-to-video model, using the prompt:

A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.

A text-to-video model is a machine learning model that uses a natural language description as input to produce a video relevant to the input text.^[1] Advancements during the 2020s in the generation of high-quality, text-conditioned videos have largely been driven by the development of video diffusion models.^[2]

Models

Several text-to-video models exist, including open-source ones. CogVideo, which takes Chinese-language input,^[3] is the earliest text-to-video model "of 9.4 billion parameters" to be developed; a demo version of its open-source code was first published on GitHub in 2022.^[4] That year, Meta Platforms released a partial text-to-video model called "Make-A-Video",^[5]^[6]^[7] and Google Brain (later Google DeepMind) introduced Imagen Video, a text-to-video model built on a 3D U-Net.^[8]^[9]^[10]^[11]^[12]

In March 2023, a research paper titled "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation" was published, presenting a novel approach to video generation.^[13] The VideoFusion model decomposes the per-frame noise of the diffusion process into two components: a base noise that is shared across frames to ensure temporal coherence, and a residual noise that varies from frame to frame. By using a pre-trained image diffusion model as the base generator, the model efficiently generates high-quality and coherent videos. Fine-tuning the pre-trained model on video data addresses the domain gap between image and video data, enhancing the model's ability to produce realistic and consistent video sequences.^[14]
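The noise decomposition can be illustrated with a small sketch. It assumes that each frame's noise is a variance-preserving mix of one base noise vector, sampled once and shared by every frame, and a residual noise vector sampled independently per frame. The function names and the mixing weight `lam` are hypothetical, chosen for illustration; this is not the paper's implementation and it contains no denoising network.

```python
import math
import random


def decomposed_noise(num_frames, dim, lam, rng):
    """Build per-frame noise from a shared base noise and per-frame residuals.

    lam is a hypothetical mixing weight in [0, 1]: the fraction of each
    frame's noise variance that comes from the shared base noise.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    # Base noise: sampled once and reused by every frame.
    base = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    frames = []
    for _ in range(num_frames):
        # Residual noise: sampled independently for each frame.
        residual = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Variance-preserving mix, so each frame's noise stays unit-variance.
        frames.append([
            math.sqrt(lam) * b + math.sqrt(1.0 - lam) * r
            for b, r in zip(base, residual)
        ])
    return frames


def correlation(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


if __name__ == "__main__":
    rng = random.Random(0)
    frames = decomposed_noise(num_frames=3, dim=20000, lam=0.8, rng=rng)
    # The correlation between two frames' noise should be close to lam.
    print(round(correlation(frames[0], frames[1]), 2))
```

Under these assumptions, a larger `lam` makes the noise in different frames more strongly correlated, which is the intuition behind sharing a base component for temporal coherence; with `lam = 0` the frames receive fully independent noise.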

Matthias Niessner and Lourdes Agapito at AI company Synthesia work on developing 3D neural rendering techniques that can synthesise realistic video by using 2D and 3D neural representations of shape, appearance, and motion for controllable video synthesis of avatars.^[15] In June 2024, Luma Labs launched its Dream Machine video tool.^[16]^[17] That same month, Kuaishou extended its Kling AI text-to-video model to international users.^[18] In July 2024, TikTok owner ByteDance released Jimeng AI in China through its subsidiary, Faceu Technology.^[19]

Other text-to-video models include^[20] Google's Phenaki, Hour One, Colossyan,^[21] Runway's Gen-3 Alpha,^[22]^[23] and OpenAI's Sora,^[24] which was unreleased as of August 2024 and available only to alpha testers.^[25]

Text-to-Video AI Models Comparison


Model/Product	Company	Year Released	Status	Key Features	Capabilities	Pricing	Video Length	Supported Languages
Synthesia	Synthesia	2019	Released	AI avatars, multilingual support for 60+ languages, customization options^[26]	Specialized in realistic AI avatars for corporate training and marketing^[26]	Subscription-based, starting around $30/month	Varies based on subscription	60+
InVideo AI	InVideo	2021	Released	AI-powered video creation, large stock library, AI talking avatars^[26]	Tailored for social media content with platform-specific templates^[26]	Free plan available, Paid plans starting at $16/month	Varies depending on content type	Multiple (not specified)
Fliki	Fliki AI	2022	Released	Text-to-video with AI avatars and voices, extensive language and voice support^[26]	Supports 65+ AI avatars and 2,000+ voices in 70 languages^[26]	Free plan available, Paid plans starting at $30/month	Varies based on subscription	70+
Runway Gen-2	Runway AI	2023	Released	Multimodal video generation from text, images, or videos^[27]	High-quality visuals, various modes like stylization and storyboard^[27]	Free trial, Paid plans (details not specified)	Up to 16 seconds	Multiple (not specified)
Pika Labs	Pika Labs	2024	Beta	Dynamic video generation, camera and motion customization^[28]	User-friendly, focused on natural dynamic generation^[28]	Currently free during beta	Flexible, supports longer videos with frame continuation	Multiple (not specified)
Runway Gen-3 Alpha	Runway AI	2024	Alpha	Enhanced visual fidelity, photorealistic humans, fine-grained temporal control^[29]	Ultra-realistic video generation with precise key-framing and industry-level customization^[29]	Free trial available, custom pricing for enterprises	Up to 10 seconds per clip, extendable	Multiple (not specified)
OpenAI Sora	OpenAI	2024 (expected)	Alpha	Deep language understanding, high-quality cinematic visuals, multi-shot videos^[30]	Capable of creating detailed, dynamic, and emotionally expressive videos; still under development with safety measures^[30]	Pricing not yet disclosed	Expected to generate longer videos; duration specifics TBD	Multiple (not specified)

References

[1] Artificial Intelligence Index Report 2023 (PDF) (Report). Stanford Institute for Human-Centered Artificial Intelligence. p. 98. Multiple high quality text-to-video models, AI systems that can generate video clips from prompted text, were released in 2022.
[2] Melnik, Andrew; Ljubljanac, Michal; Lu, Cong; Yan, Qi; Ren, Weiming; Ritter, Helge (2024-05-06). "Video Diffusion Models: A Survey". arXiv:2405.03150 [cs.CV].
[3] "Text-to-Video Generative AI Models: The Definitive List". AI Business. Retrieved 2024-08-19.
[4] CogVideo, THUDM, 2022-10-12, retrieved 2022-10-12.
[5] Davies, Teli (2022-09-29). "Make-A-Video: Meta AI's New Model For Text-To-Video Generation". Weights & Biases. Retrieved 2022-10-12.
[6] Monge, Jim Clyde (2022-08-03). "This AI Can Create Video From Text Prompt". Medium. Retrieved 2022-10-12.
[7] "Meta's Make-A-Video AI creates videos from text". www.fonearena.com. Retrieved 2022-10-12.
[8] "google: Google takes on Meta, introduces own video-generating AI". The Economic Times. 6 October 2022. Retrieved 2022-10-12.
[9] Monge, Jim Clyde (2022-08-03). "This AI Can Create Video From Text Prompt". Medium. Retrieved 2022-10-12.
[10] "Nuh-uh, Meta, we can do text-to-video AI, too, says Google". www.theregister.com. Retrieved 2022-10-12.
[11] "Papers with Code - See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction". paperswithcode.com. Retrieved 2022-10-12.
[12] "Papers with Code - Text-driven Video Prediction". paperswithcode.com. Retrieved 2022-10-12.
[13] Luo, Zhengxiong; Chen, Dayou; Zhang, Yingya; Huang, Yan; Wang, Liang; Shen, Yujun; Zhao, Deli; Zhou, Jingren; Tan, Tieniu (2023). "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation". arXiv:2303.08320 [cs.CV].
[14] "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation". ar5iv. Retrieved 2024-08-30.
[15] "Text to Speech for Videos". Retrieved 2023-10-17.
[16] "Luma AI debuts 'Dream Machine' for realistic video generation, heating up AI media race". VentureBeat. Retrieved 2024-08-16.
[17] "Apple Debuts Intelligence, Mistral Raises $600 Million, New AI Text-To-Video". Forbes. Retrieved 2024-08-16.
[18] "What you need to know about Kling, the AI video generator rival to Sora that’s wowing creators". VentureBeat. Retrieved 2024-08-16.
[19] "ByteDance joins OpenAI's Sora rivals with AI video app launch". Reuters. Retrieved 2024-08-16.
[20] Text2Video-Zero, Picsart AI Research (PAIR), 2023-08-12, retrieved 2023-08-12.
[21] "Text-to-Video Generative AI Models: The Definitive List". AI Business. Retrieved 2024-08-16.
[22] "Runway's Sora competitor Gen-3 Alpha now available". The Decoder. Retrieved 2024-08-16.
[23] "Generative AI's Next Frontier Is Video". Bloomberg. Retrieved 2024-08-16.
[24] "OpenAI teases 'Sora,' its new text-to-video AI model". NBC News. Retrieved 2024-08-16.
[25] "Toys R Us creates first brand film to use OpenAI’s text-to-video tool". Marketing Dive. Retrieved 2024-08-16.
[26] "Top AI Video Generation Models of 2024". Deepgram. Retrieved 2024-08-30.
[27] "Runway Research | Gen-2: Generate novel videos with text, images or video clips". runwayml.com. Retrieved 2024-08-30.
[28] Sharma, Shubham (2023-12-26). "Pika Labs' text-to-video AI platform opens to all: Here's how to use it". VentureBeat. Retrieved 2024-08-30.
[29] "Runway Research | Introducing Gen-3 Alpha: A New Frontier for Video Generation". runwayml.com. Retrieved 2024-08-30.
[30] "Sora | OpenAI". openai.com. Retrieved 2024-08-30.
