Visual Instruction Tuning for Pixel-Level Understanding with Osprey
With recent advances in visual instruction tuning, Multimodal Large Language Models (MLLMs) have demonstrated remarkable general-purpose vision-language capabilities, which make them key building blocks for modern visual assistants. Recent models, including MiniGPT-4, LLaVA, InstructBLIP, and others, …