![Multimodal technique for analyzing audio and visual data improves performance of machine-learning models](https://search.ai.wiki/wp-content/uploads/2023/06/multimodal-technique-for-analyzing-audio-and-visual-data-improves-performance-of-machine-learning-models.jpg)
Multimodal technique for analyzing audio and visual data improves performance of machine-learning models
Researchers from MIT, the MIT-IBM Watson AI Lab, IBM Research, and elsewhere have developed a new technique for analyzing unlabeled audio and visual data that could improve the performance of machine-learning models used in applications like speech recognition and object detection. The work, for the first time, combines two self-supervised learning approaches, contrastive learning and masked data modeling, in an effort to scale machine-learning tasks like event classification in single- and multimodal data without the need for annotation. In doing so, it mimics how humans perceive and understand the world.
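To make the combination concrete, here is a minimal sketch of how the two objectives can be summed into one training loss: an InfoNCE-style contrastive term pulls paired audio/visual embeddings together, while a masked-reconstruction term scores how well the model fills in hidden patches. This is an illustrative toy in NumPy, not the authors' implementation; the function names, temperature value, and random data are assumptions.

```python
import numpy as np

def contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    # InfoNCE-style loss: each audio embedding should be most similar
    # to its paired visual embedding (the diagonal of the logit matrix).
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature                      # pairwise similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(a))
    return -log_probs[idx, idx].mean()

def masked_reconstruction_loss(original, reconstructed, mask):
    # Mean-squared error computed only on the masked-out patches
    # that the model had to predict.
    diff = (original - reconstructed) ** 2
    return (diff * mask).sum() / mask.sum()

# Toy batch: 4 paired audio/visual embeddings and a patch reconstruction.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 16))
visual = audio + 0.1 * rng.normal(size=(4, 16))         # paired views correlate
patches = rng.normal(size=(4, 32))
mask = (rng.random((4, 32)) < 0.75).astype(float)       # hide ~75% of patches
recon = patches + 0.05 * rng.normal(size=(4, 32))       # imperfect reconstruction

# The joint self-supervised objective is simply the sum of both terms.
total = contrastive_loss(audio, visual) + masked_reconstruction_loss(patches, recon, mask)
print(round(float(total), 4))
```

In practice the two terms are usually weighted against each other, but the key idea survives in this sketch: neither loss requires any human-provided labels, since the pairing of audio with video and the masking pattern both come for free from the data itself.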