Ferret: Refer and Ground at Any Granularity
Enabling spatial understanding in vision-language models remains a core research challenge. This understanding underpins two crucial capabilities: referring and grounding. Referring requires the model to accurately interpret the semantics of specific regions, while grounding involves using semantic descriptions to …