Unveiling of Large Multimodal Models: Shaping the Landscape of Language Models in 2024

As we experience the world, our senses (vision, sounds, smells) provide a diverse array of information, and we express ourselves using different communication methods, such as facial expressions and gestures. These senses and communication methods are collectively called modalities, representing the different ways we perceive and communicate. Drawing inspiration from this human capability, large multimodal model (LMM), a combination of generative and multimodal AI, are being developed to understand and create content using different types like text, images, and audio. In this article, we delve into this newly emerging field, exploring what LMMs (Large Multimodal Models) are, how they're constructed, existing examples, the challenges they face, and potential applications.

Evolution of Generative AI in 2024: From Large Language Models to Large Multimodal Models

In its latest report, McKinsey designated 2023 as a breakout year for generative AI, leading to many advancements in the field. We have witnessed a notable rise in the prevalence of large language models (LLMs) adept at understanding and generating human-like language. Furthermore, image generation models are significantly evolved, demonstrating their ability to create visuals from textual prompts. However, despite significant progress in individual modalities like text, images, or audio, generative AI has encountered challenges in seamlessly combining these modalities in the generation process. As the world is inherently multimodal in nature, it is crucial for AI to grapple with multimodal information. This is essential for meaningful engagement with humans and successful operation in real-world scenarios.

Consequently, many AI researchers anticipate the rise of LMMs as the next frontier in AI research and development in 2024. This evolving frontier focuses on enhancing the capacity of generative AI to process and produce diverse outputs, spanning text, images, audio, video, and other modalities. It is essential to emphasize that not all multimodal systems qualify as LMMs. Models like Midjourney and Stable Diffusion, despite being multimodal, do not fit into the LMM category mainly because they lack the presence of LLMs, which are a fundamental component of LMMs. In other words, we can describe LMMs as an extension of LLMs, providing them with the capability to proficiently handle various modalities.

How do LMMs Work?

While researchers have explored various approaches to constructing LMMs, they typically involve three essential components and operations. First, encoders are employed for each data modality to generate data representations (referred to as embeddings) specific to that modality. Second, different mechanisms are used for aligning embeddings from different modalities into a unified multimodal embedding space. Third, for generative models, an LLM is employed to generate text responses. As inputs may consist of text, images, videos and audios, researchers are working on new ways to make language models consider different modalities when giving responses.

Development of LMMs in 2023

Below, I have briefly outlined some of the notable LMMs developed in 2023.

LLaVA is an open-source LMM, jointly developed by the University of Wisconsin-Madison, Microsoft Research, and Columbia University. The model aims to offer an open-source version of multimodal GPT4. Leveraging Meta's Llama LLM, it incorporates the CLIP visual encoder for robust visual comprehension. The healthcare-focused variant of LLaVa, termed as LLaVA-Med, can answer inquiries related to biomedical images.
ImageBind is an open-source model crafted by Meta, emulating the ability of human perception to relate multimodal data. The model integrates six modalities—text, images/videos, audio, 3D measurements, temperature data, and motion data—learning a unified representation across these diverse data types. ImageBind can connect objects in photos with attributes like sound, 3D shapes, temperature, and motion. The model can be used, for instance, to generate scene from text or sounds.
SeamlessM4T is a multimodal model designed by Meta to foster communication among multilingual communities. SeamlessM4T excels in translation and transcription tasks, supporting speech-to-speech, speech-to-text, text-to-speech, and text-to-text translations. The model employs non-autoregressive text-to-unit decoder to perform these translations. The enhanced version, SeamlessM4T v2, forms the basis for models like SeamlessExpressive and SeamlessStreaming, emphasizing the preservation of expression across languages and delivering translations with minimal latency.
GPT4, launched by OpenAI, is an advancement of its predecessor, GPT3.5. Although detailed architectural specifics are not fully disclosed, GPT4 is well-regarded for its smooth integration of text-only, vision-only, and audio-only models. The model can generate text from both written and graphical inputs. It excels in various tasks, including humor description in images, summarization of text from screenshots, and responding adeptly to exam questions featuring diagrams. GPT4 is also recognized for its adaptability in effectively processing a wide range of input data formats.
Gemini, created by Google DeepMind, distinguishes itself by being inherently multimodal, allowing seamless interaction across various tasks without relying on stitching together single-modality components. This model effortlessly manages both text and diverse audio-visual inputs, showcasing its capability to generate outputs in both text and image formats.

Challenges of Large Multimodal Models

Incorporating More Data Modalities: Most of existing LMMs operate with text and images. However, LMMs need to evolve beyond text and images, accommodating modalities like videos, music, and 3D.
Diverse Dataset Availability: One of the key challenges in developing and training multimodal generative AI models is the need for large and diverse datasets that include multiple modalities. For example, to train a model to generate text and images together, the dataset needs to include both text and image inputs that are related to each other.
Generating Multimodal Outputs: While LMMs can handle multimodal inputs, generating diverse outputs, such as combining text with graphics or animations, remains a challenge.
Following Instructions: LMMs face the challenge of mastering dialogue and instruction-following tasks, moving beyond mere completion.
Multimodal Reasoning: While current LMMs excel at transforming one modality into another, the seamless integration of multimodal data for complex reasoning tasks, like solving written word problems based on auditory instructions, remains a challenging endeavor.
Compressing LMMs: The resource-intensive nature of LMMs poses a significant obstacle, rendering them impractical for edge devices with limited computational resources. Compressing LMMs to enhance efficiency and make them suitable for deployment on resource-constrained devices is a crucial area of ongoing research.

Potential Use Cases

Education: LMMs have the potential to transform education by generating diverse and engaging learning materials that combine text, images, and audio. LMMs provide comprehensive feedback on assignments, promote collaborative learning platforms, and enhance skill development through interactive simulations and real-world examples.
Healthcare: In contrast to traditional AI diagnostic systems that target a single modality, LMMs improve medical diagnostics by integrating multiple modalities. They also support communication across language barriers among healthcare providers and patients, acting as a centralized repository for various AI applications within hospitals.
Art and Music Generation: LMMs could excel in art and music creation by combining different modalities for unique and expressive outputs. For example, an art LMM can blend visual and auditory elements, providing an immersive experience. Likewise, a music LMM can integrate instrumental and vocal elements, resulting in dynamic and expressive compositions.
Personalized Recommendations: LMMs can analyze user preferences across various modalities to provide personalized recommendations for content consumption, such as movies, music, articles, or products.

Weather Prediction and Environmental Monitoring: LMMs can analyze various modalities of data, such as satellite images, atmospheric conditions, and historical patterns, to improve accuracy in weather prediction and environmental monitoring.

The Bottom Line

The landscape of Large Multimodal Models (LMMs) marks a significant breakthrough in generative AI, promising advancements in various fields. As these models seamlessly integrate different modalities, such as text, images, and audio, their development opens doors to transformative applications in healthcare, education, art, and personalized recommendations. However, challenges, including accommodating more data modalities and compressing resource-intensive models, underscore the ongoing research efforts needed for the full realization of LMMs' potential.

The post Unveiling of Large Multimodal Models: Shaping the Landscape of Language Models in 2024 appeared first on Unite.AI.