1. Semantically Coherent Image Regions
The first step towards cross-image object-level bootstrapping is to identify semantically coherent regions within the images. To that end, we rely on an online joint-space clustering algorithm from [Stegmuller_2023_CVPR].
Joint-space clustering.
The clustering algorithm takes as input the dense representations $z^1$ and $z^2$ associated with the two augmented views and concatenates them along the token axis to obtain the dense representation of the joint view, $z = [z^1; z^2] \in \mathbb{R}^{2N \times d}$. The tokens are subsequently partitioned into $K$ clusters to produce semantically coherent and spatially compact clusters. A hyperparameter $\lambda_{\text{pos}}$ modulates the importance of spatiality against that of semantics. The resulting assignment is encoded in a matrix $Q \in \mathbb{R}^{2N \times K}$, which can be split in two to obtain the view-wise assignments $Q^1, Q^2 \in \mathbb{R}^{N \times K}$. Here $Q^{v}_{i,k}$ denotes the soft-assignment of token $i$ to object-cluster $k$ in view $v \in \{1, 2\}$. These can in turn be used to compute the object representations in each view as follows:
$$o^{v}_{k} = \sum_{i=1}^{N} \tilde{Q}^{v}_{i,k}\, z^{v}_{i}, \qquad \tilde{Q}^{v}_{i,k} = \frac{Q^{v}_{i,k}}{\sum_{j=1}^{N} Q^{v}_{j,k}}$$
Note that the columns of $Q^{v}$ undergo normalization, rendering the object representations affine combinations of local representations. [Stegmuller_2023_CVPR] observed that the cross-view correspondence is provided for free: the object representation $o^{1}_{k}$ from the first view is matched to $o^{2}_{k}$ in the second one as a consequence of the joint-space clustering. In practice, it can be the case that some objects are exclusive to one of the two views, i.e., $\sum_{i} Q^{v}_{i,k} = 0$ for one of the views. In that case, the object is simply discarded from the view where it is present.
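The column-normalized pooling above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the function name and shapes are assumptions.

```python
import numpy as np

def object_representations(z, Q):
    """Pool dense token features into per-object features.

    z: (N, d) dense token representations of one view.
    Q: (N, K) soft assignments of tokens to object-clusters.
    Columns of Q are normalized so that each object representation
    is an affine combination of the local token representations.
    Objects with an all-zero column (absent from this view) are dropped.
    """
    col_sums = Q.sum(axis=0)                    # (K,)
    present = col_sums > 0                      # objects visible in this view
    Q_norm = Q[:, present] / col_sums[present]  # columns now sum to 1
    return Q_norm.T @ z                         # (K_present, d)
```

Because the columns sum to one, each row of the output lies in the affine hull of the view's token representations, matching the description above.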
3. Self-Supervised Training Objectives
CrIBo's final training objective involves three distinct types of self-supervision: cross-view object-level, cross-image object-level, and global cross-view. All levels of self-supervision are based on the self-distillation mechanism, which employs a student-teacher pair of siamese networks $f_s$ and $f_t$, respectively. Given two augmented views $x^1$ and $x^2$, global representations $\bar{z}^{v}$ and object representations $o^{v}_{k}$ can be obtained for each view and each network within the siamese pair.
Cross-View Object-Level Self-Supervision
The object representations from the teacher and student are fed to their respective heads $h_t$ and $h_s$ to output probability mass functions whose sharpness can be modulated via temperature parameters ($\tau_t$ and $\tau_s$). For the teacher, this translates to:
$$p^{v}_{k,t} = \mathrm{softmax}\!\left(\frac{h_t\!\left(o^{v}_{k,t}\right)}{\tau_t}\right)$$
The student's projections $p^{v}_{k,s}$ are obtained analogously. The cross-view object-level loss is expressed as the averaged cross-entropy over all matched pairs of object representations $(o^{1}_{k}, o^{2}_{k})$, making the loss symmetric with respect to the student/teacher:
$$\mathcal{L}_{\text{cv}} = \frac{1}{2K} \sum_{k=1}^{K} \left[ H\!\left(p^{1}_{k,t},\, p^{2}_{k,s}\right) + H\!\left(p^{2}_{k,t},\, p^{1}_{k,s}\right) \right]$$
where $H(a, b) = -\sum_{j} a_j \log b_j$ denotes the cross-entropy.
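The symmetric objective can be sketched as follows; this is a minimal NumPy illustration in which the temperature values and logit shapes are assumptions.

```python
import numpy as np

def softmax(x, tau):
    """Temperature-scaled softmax over the last axis (numerically stable)."""
    x = x / tau
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_view_object_loss(t1, t2, s1, s2, tau_t=0.04, tau_s=0.1):
    """Average cross-entropy over matched object pairs, symmetrized.

    t1, t2: (K, D) teacher head outputs (logits) for views 1 and 2.
    s1, s2: (K, D) student head outputs for views 1 and 2.
    Object k in view 1 is matched to object k in view 2, the
    correspondence being provided by the joint-space clustering.
    """
    p_t1, p_t2 = softmax(t1, tau_t), softmax(t2, tau_t)
    log_s1 = np.log(softmax(s1, tau_s))
    log_s2 = np.log(softmax(s2, tau_s))
    ce_12 = -(p_t1 * log_s2).sum(axis=-1)  # teacher view 1 vs student view 2
    ce_21 = -(p_t2 * log_s1).sum(axis=-1)  # teacher view 2 vs student view 1
    return float(0.5 * (ce_12.mean() + ce_21.mean()))
```

Swapping the roles of the two views leaves the loss unchanged, which is the symmetry the text refers to.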
Cross-Image Object-Level Self-Supervision
Once all object representations in the batch have been retrieved, the cross-image object-level loss is enforced in a manner analogous to its cross-view counterpart. The nearest neighbors $\mathrm{NN}(o^{v}_{k,t})$ are first fed to the projection head as follows:
$$\hat{p}^{v}_{k,t} = \mathrm{softmax}\!\left(\frac{h_t\!\left(\mathrm{NN}\!\left(o^{v}_{k,t}\right)\right)}{\tau_t}\right)$$
Note that these projections are only defined for the teacher since the $\mathrm{NN}(\cdot)$ operator is not differentiable. The formulation of the cross-image object-level loss aligns with that in the cross-view loss up to the filtering of invalid positive pairs:
$$\mathcal{L}_{\text{ci}} = \frac{1}{N_1} \sum_{k=1}^{K} \mathbb{1}\!\left[o^{1}_{k}\ \text{is valid}\right] H\!\left(\hat{p}^{1}_{k,t},\, p^{1}_{k,s}\right) + \frac{1}{N_2} \sum_{k=1}^{K} \mathbb{1}\!\left[o^{2}_{k}\ \text{is valid}\right] H\!\left(\hat{p}^{2}_{k,t},\, p^{2}_{k,s}\right)$$
where $\mathbb{1}[\cdot]$ denotes the indicator function and where $N_1$ and $N_2$ are normalization constants, i.e., $N_1 = \sum_{k} \mathbb{1}[o^{1}_{k}\ \text{is valid}]$ and $N_2 = \sum_{k} \mathbb{1}[o^{2}_{k}\ \text{is valid}]$, respectively.
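The retrieval and filtered objective for one view can be sketched as follows. This is a minimal NumPy illustration: cosine similarity for the nearest-neighbor search is an assumption, and the validity mask is taken as given since the filtering criterion is defined elsewhere in the paper.

```python
import numpy as np

def nearest_neighbors(objects, bank):
    """Indices of the most similar bank entries (cosine similarity).

    objects: (K, d) object representations from the current batch.
    bank:    (M, d) object representations from other images.
    """
    o = objects / np.linalg.norm(objects, axis=1, keepdims=True)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    return np.argmax(o @ b.T, axis=1)

def filtered_cross_image_loss(p_t_nn, log_p_s, valid):
    """Cross-entropy over (object, nearest-neighbor) pairs for one view,
    restricted to valid positive pairs via an indicator mask.

    p_t_nn : (K, D) teacher projections of the retrieved neighbors.
    log_p_s: (K, D) log student projections of the current objects.
    valid  : (K,) boolean indicator of valid positive pairs.
    """
    ce = -(p_t_nn * log_p_s).sum(axis=-1)  # (K,)
    n = max(int(valid.sum()), 1)           # normalization constant
    return float((ce * valid).sum() / n)
```

Dividing by the number of valid pairs rather than by $K$ mirrors the normalization constants in the loss above.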
Global Cross-View Self-Supervision
Importantly, both the clustering and the bootstrapping steps are contingent on the quality of the representations. To avoid potential instabilities, we incorporate a global cross-view loss into our framework, offering meaningful self-supervision at any stage of the pretraining. This loss is analogous to the cross-view object-level loss except that it is applied to the global representations ($\bar{z}^{1}_{s}$, $\bar{z}^{2}_{s}$, $\bar{z}^{1}_{t}$, and $\bar{z}^{2}_{t}$), which are fed to a dedicated head $h^{g}$.
The global cross-view projections are defined as follows:
$$p^{v}_{g,t} = \mathrm{softmax}\!\left(\frac{h^{g}_{t}\!\left(\bar{z}^{v}_{t}\right)}{\tau^{g}_{t}}\right), \qquad p^{v}_{g,s} = \mathrm{softmax}\!\left(\frac{h^{g}_{s}\!\left(\bar{z}^{v}_{s}\right)}{\tau^{g}_{s}}\right)$$
where $\tau^{g}_{t}$ and $\tau^{g}_{s}$ are the corresponding temperature parameters for the $D$-dimensional output distribution. The global cross-view loss is defined as:
$$\mathcal{L}_{\text{g}} = \frac{1}{2}\left[ H\!\left(p^{1}_{g,t},\, p^{2}_{g,s}\right) + H\!\left(p^{2}_{g,t},\, p^{1}_{g,s}\right) \right]$$
where $H$ again denotes the cross-entropy.
Finally, we obtain the overall training objective of CrIBo as the sum of its individual components:
$$\mathcal{L}_{\text{CrIBo}} = \mathcal{L}_{\text{cv}} + \mathcal{L}_{\text{ci}} + \mathcal{L}_{\text{g}}$$
The student's parameters are optimized to minimize the above loss whereas the parameters of the teacher are updated via an exponential moving average of the student's weights.
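The teacher update can be sketched as follows; this is a minimal illustration in which the momentum value is an assumption (in practice it typically follows a schedule).

```python
def ema_update(teacher, student, momentum=0.996):
    """Parameter-wise update: teacher <- m * teacher + (1 - m) * student.

    teacher, student: matching lists of parameter values.
    The teacher receives no gradients; it only tracks the student.
    """
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher, student)]
```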