ByteDance Introduces Seed1.5-VL: A Vision-Language Foundation Model Designed to Advance General-Purpose Multimodal Understanding and Reasoning

VLMs have become central to building general-purpose AI systems capable of understanding and interacting in digital and real-world settings. By integrating visual and textual data, VLMs have driven advances in multimodal reasoning, image editing, GUI agents, robotics, and more, influencing sectors such as education and healthcare. Despite this progress, VLMs still lag behind human capabilities, particularly in tasks involving 3D reasoning, object counting, creative visual interpretation, and interactive gameplay. A key challenge lies in the scarcity of rich, diverse multimodal datasets, unlike the abundant textual resources available to LLMs. Additionally, the complexity of multimodal data poses significant training and evaluation hurdles.

Researchers at ByteDance have developed Seed1.5-VL, a compact yet powerful vision-language foundation model featuring a 532M-parameter vision encoder and a 20B-parameter Mixture-of-Experts LLM. Despite its efficient architecture, Seed1.5-VL achieves top results on 38 out of 60 public VLM benchmarks, excelling in tasks such as GUI control, video understanding, and visual reasoning. It is trained on trillions of multimodal tokens using advanced data synthesis and post-training techniques, including human feedback. Training innovations such as hybrid parallelism and vision token redistribution optimize performance. The model’s efficiency and strong reasoning capabilities make it well suited to real-world interactive applications such as chatbots.
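To give a feel for the load-balancing idea behind vision token redistribution, here is a minimal Python sketch that spreads variable-length vision-token workloads across parallel workers. The greedy bin-packing heuristic, function names, and token counts are illustrative assumptions, not ByteDance's actual redistribution algorithm.

```python
# Hypothetical sketch: balance variable-length vision-token workloads across workers.
# The greedy bin-packing heuristic is an assumption for illustration only.
from heapq import heappush, heappop

def redistribute_vision_tokens(samples, num_workers):
    """Assign samples (each with a vision-token count) to workers so that
    per-worker token totals stay roughly equal."""
    heap = [(0, w) for w in range(num_workers)]          # (current load, worker id)
    assignments = {w: [] for w in range(num_workers)}
    # Place the largest samples first (longest-processing-time heuristic).
    for sample_id, token_count in sorted(samples, key=lambda s: -s[1]):
        load, worker = heappop(heap)
        assignments[worker].append(sample_id)
        heappush(heap, (load + token_count, worker))
    return assignments

# Example: six images with very different native resolutions (token counts).
batch = [("img0", 4096), ("img1", 256), ("img2", 1024),
         ("img3", 2048), ("img4", 512), ("img5", 3072)]
print(redistribute_vision_tokens(batch, num_workers=2))
```

Without some such rebalancing step, workers that happen to receive high-resolution images would stall the rest of the data-parallel group at every synchronization point.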

The Seed1.5-VL architecture consists of a vision encoder, an MLP adapter, and an LLM. Its custom vision encoder, Seed-ViT, supports native-resolution image input using 2D RoPE and processes images as 14×14 patches, followed by average pooling and an MLP. Pretraining involves masked image modeling, contrastive learning, and omni-modal alignment using images, text, and video-audio-caption pairs. For video encoding, the model uses a Dynamic Frame-Resolution Sampling approach that adapts frame rates and resolutions based on content complexity, balancing efficiency and detail. This method enables effective spatial-temporal understanding within a token budget, ensuring comprehensive video representation across varied lengths and complexities.
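To make the token-budget trade-off concrete, here is a minimal Python sketch of how a dynamic frame-resolution scheduler might trade frame rate against per-frame resolution. The budget, candidate frame rates, and candidate resolutions are illustrative assumptions, not Seed1.5-VL's actual settings; only the 14×14 patch size comes from the description above.

```python
# Hypothetical sketch of dynamic frame-resolution sampling under a fixed token budget.
# Numbers (budget, resolutions, fps) are illustrative, not Seed1.5-VL's settings.
PATCH = 14  # ViT patch size (14x14), as described for Seed-ViT

def tokens_per_frame(height, width, patch=PATCH):
    """Vision tokens produced by one frame at a given resolution."""
    return (height // patch) * (width // patch)

def plan_video_sampling(duration_s, token_budget=16384,
                        candidate_fps=(2.0, 1.0, 0.5),
                        candidate_res=((448, 448), (336, 336), (224, 224))):
    """Pick the highest frame rate and resolution whose total token cost
    fits the budget, degrading gracefully for long videos."""
    for fps in candidate_fps:
        for h, w in candidate_res:
            n_frames = max(1, int(duration_s * fps))
            cost = n_frames * tokens_per_frame(h, w)
            if cost <= token_budget:
                return {"fps": fps, "resolution": (h, w),
                        "frames": n_frames, "tokens": cost}
    # Fall back to the cheapest setting, truncating frames to fit the budget.
    h, w = candidate_res[-1]
    n_frames = max(1, token_budget // tokens_per_frame(h, w))
    return {"fps": candidate_fps[-1], "resolution": (h, w),
            "frames": n_frames, "tokens": n_frames * tokens_per_frame(h, w)}

print(plan_video_sampling(duration_s=30))    # short clip: fits at 2 fps
print(plan_video_sampling(duration_s=1800))  # long video: frames truncated to fit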

Pre-training of Seed1.5-VL involved curating 3 trillion high-quality tokens across diverse domains. Image-text pairs from the web were filtered using CLIP scores, size/aspect-ratio checks, and deduplication to reduce noise. Using domain-based sampling and duplication strategies, rare visual concepts were overrepresented to address class imbalance. Specialized datasets were added for OCR using annotated and synthetic text-rich images, charts, and tables; object grounding and counting tasks used bounding boxes, points, and auto-labeled web data. Additional tasks included 3D spatial understanding using depth annotations, and video understanding through multi-frame captioning, QA, and temporal grounding to support dynamic content analysis.
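As a rough illustration of the web-data filtering step, the sketch below scores image-caption pairs with the open_clip library and applies simple size and aspect-ratio checks. The ViT-B-32 checkpoint, thresholds, and function names are assumptions for illustration; they are not the paper's actual filtering pipeline, and deduplication is omitted here.

```python
# Hypothetical filtering sketch using the open_clip library; thresholds and the
# ViT-B-32 checkpoint are illustrative choices, not the paper's actual pipeline.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def keep_pair(image_path, caption, min_clip=0.28, min_side=64, max_ratio=3.0):
    """Return True if an image-text pair passes CLIP-score and geometry checks."""
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    # Size and aspect-ratio checks drop thumbnails and banner-like images.
    if min(w, h) < min_side or max(w, h) / min(w, h) > max_ratio:
        return False
    with torch.no_grad():
        img_feat = model.encode_image(preprocess(img).unsqueeze(0))
        txt_feat = model.encode_text(tokenizer([caption]))
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        score = (img_feat @ txt_feat.T).item()
    # Keep only pairs whose caption plausibly describes the image.
    return score >= min_clip

# Example: keep_pair("cat.jpg", "a tabby cat sleeping on a sofa")
```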

The evaluation highlights the competitive performance of Seed-ViT and Seed1.5-VL across vision-language tasks. Seed-ViT, despite having significantly fewer parameters, matches or outperforms larger models such as InternVL-C and EVA-CLIP on zero-shot image classification, showing high accuracy and robustness on datasets such as ImageNet-A and ObjectNet. Seed1.5-VL demonstrates strong capabilities in multimodal reasoning, general VQA, document understanding, and grounding. It achieves state-of-the-art results, particularly in complex reasoning, counting, and chart interpretation tasks. The model’s “thinking” mode, which incorporates longer reasoning chains, further enhances performance, indicating strong ability in detailed visual understanding and task generalization.

In conclusion, Seed1.5-VL is a vision-language foundation model featuring a 532M-parameter vision encoder and a 20B-parameter Mixture-of-Experts language model. Despite its compact size, it achieves state-of-the-art results on 38 of 60 public benchmarks and excels in complex reasoning, OCR, diagram interpretation, 3D spatial understanding, and video analysis. It also performs well in agent-driven tasks such as GUI control and gameplay, surpassing models such as OpenAI CUA and Claude 3.7. The model shows strong generalization to tasks beyond its training scope. The report outlines its architecture, data pipeline, and training methods, and identifies future directions, including enhancing tool-use and visual reasoning capabilities.


Check out the Paper and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit.

Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
