Discussion about this post

Lisa:

Hi Charlie, could you please check your DMs? We'd really love to work with you!

suman suhag:

Just as we have large language models (LLMs) serving as general-purpose foundation models for natural language processing (NLP), computer vision is following a similar path with foundation models that can be reused across many tasks. One of the most widely adopted over the past two years has been Meta AI's DINOv2, which has become one of the go-to backbones in the community.

Now, Meta AI has released the next step in this family: DINOv3. Much like what we've seen with LLMs, DINOv3 is both larger and trained on more data, expanding from 1 billion to 7 billion parameters and from 142 million to 1.7 billion images. More parameters, more data. But scaling to this level also introduces new challenges, and some fascinating solutions, which we'll explore in this post. Specifically, at the end we'll dive into one of the key innovations that makes DINOv3 work: a method called Gram Anchoring.

Before the rise of foundation models, building a computer vision system often meant starting from scratch, which involved:

- Collecting or curating a dataset.
- Choosing an architecture.
- Training a model specifically for that task.

This process could be both time-consuming and computationally expensive, especially for complex applications.

DINOv2 changed that. It is a pretrained, large-scale Vision Transformer (ViT) that showed you don't necessarily need to design or train a specialized, complex model for every new problem. Instead, you can rely on a single general-purpose backbone, and DINOv3 takes this idea to a new level.
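In practice, "relying on a single general-purpose backbone" usually means extracting frozen features and training only a small head on top. Here's a minimal sketch of that workflow with DINOv2 via torch.hub; the checkpoint name (`dinov2_vits14`) comes from the facebookresearch/dinov2 hub repo, while the image path and preprocessing choices are illustrative assumptions:

```python
# Minimal sketch: a pretrained DINOv2 ViT as a frozen, general-purpose
# feature extractor. Assumes torch, torchvision, and Pillow are installed.
import torch
from torchvision import transforms
from PIL import Image

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()

# Standard ImageNet normalization; 224 is divisible by the ViT's 14px patches.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    embedding = backbone(img)  # (1, 384) global embedding for ViT-S/14

# The frozen embedding can feed a small linear head for classification,
# retrieval, or other downstream tasks, with no backbone training at all.
```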
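As for the Gram Anchoring teaser: at a high level, the idea is to keep the patch-to-patch similarity structure (the Gram matrix) of the model's dense features from drifting over very long training runs, by anchoring it to an earlier checkpoint. The following is a hypothetical sketch of what such a loss could look like; the function name, shapes, and squared-Frobenius penalty are my assumptions, not Meta AI's actual implementation:

```python
# Hypothetical sketch of a Gram-anchoring-style loss: penalize drift of the
# student's patch-to-patch similarities from those of an earlier "Gram
# teacher" checkpoint. Shapes and reduction are assumptions for illustration.
import torch
import torch.nn.functional as F

def gram_anchoring_loss(student_patches: torch.Tensor,
                        teacher_patches: torch.Tensor) -> torch.Tensor:
    """Both inputs: (batch, num_patches, dim) patch features."""
    s = F.normalize(student_patches, dim=-1)  # L2-normalize each patch
    t = F.normalize(teacher_patches, dim=-1)
    gram_s = s @ s.transpose(1, 2)            # (batch, patches, patches)
    gram_t = t @ t.transpose(1, 2)
    # Squared Frobenius-style distance between the two similarity matrices.
    return (gram_s - gram_t).pow(2).mean()
```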
