CAT is a specialized dataset for co-saliency detection, one of the core tasks in computer vision. The dataset is intended both to help assess the performance of vision algorithms and to support research that exploits large volumes of annotated data, e.g., for training deep neural networks. Click here or read our paper for more details.

Basic Statistics

Scale & Features

  • A total of 33,500 image samples.
  • 280 semantic groups belonging to 15 superclasses.
  • High-quality mask annotations.
  • Diverse visual context with multiple foreground objects.


CAT consists of 15 superclasses, i.e., aves, electronic product (elec.), food, fruit, insect, instrument (instru.), kitchenware (kitch.), mammal (mamm.), marine, other, reptile (rept.), sports, tool, transportation (trans.), and vegetable (vege.). Mammal is the largest superclass, covering 73 of the 280 semantic groups (26%). The second-largest is aves, with 56 semantic groups (20%). The remaining 13 superclasses together account for 54% of the subclasses; the two smallest, marine and vegetable, contain four semantic groups each. Overall, the 15 superclasses in our dataset cover a wide range of categories across different fields. More details are available here.
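The superclass shares can be recomputed from the group counts alone. The following is a minimal Python sketch that uses only the counts stated in this section; the helper name `share_of_groups` is ours, chosen for illustration, and is not part of any CAT tooling.

```python
TOTAL_GROUPS = 280  # number of semantic groups in CAT, as stated in this section


def share_of_groups(count: int, total: int = TOTAL_GROUPS) -> float:
    """Return the percentage of all semantic groups covered by `count` groups."""
    return 100.0 * count / total


mammal = share_of_groups(73)        # largest superclass
aves = share_of_groups(56)          # second-largest superclass
remaining = 100.0 - mammal - aves   # all other superclasses combined

print(f"mammal: {mammal:.0f}%, aves: {aves:.0f}%, remaining: {remaining:.0f}%")
```

Rounded to whole percentages, the three values sum to 100, which is a quick sanity check on the reported split.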


CAT contains 280 subclasses (semantic groups), which range from biological species to daily necessities. Each semantic group in our dataset is a collection of one specific item or species. These items/species usually have a uniform appearance, such as a specific shape, color, or texture. Thanks to powerful pre-trained models, these species are automatically classified to form diverse semantic groups. The average number of images per group is 119.6. Among all the semantic groups, apple is the largest with 824 images, while boston terrier and cell phone are the smallest, with 28 images each. Such a distribution is in line with that of recent evaluation datasets, namely CoSOD3k (Fan et al. 2020) and CoCA (Zhang et al. 2020). The maximum/minimum numbers of images per group for these two datasets are 30/4 and 40/8, respectively. More details on these semantic groups and their representative examples are available on our webpage.
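The per-group average follows from the dataset totals. Below is a small Python sketch of that arithmetic, using only figures quoted in this section; the max-to-min ratio is an extra quantity we derive here for illustration and is not reported by the authors.

```python
TOTAL_IMAGES = 33500   # image samples in CAT
TOTAL_GROUPS = 280     # semantic groups in CAT
LARGEST_GROUP = 824    # images in the largest group (apple)
SMALLEST_GROUP = 28    # images in the smallest groups (boston terrier, cell phone)

# Mean number of images per semantic group.
average = TOTAL_IMAGES / TOTAL_GROUPS

# Illustrative imbalance measure: how many times larger the biggest group is
# than the smallest one (derived here, not reported on the page).
ratio = LARGEST_GROUP / SMALLEST_GROUP

print(f"average images per group: {average:.1f}")
print(f"largest/smallest group ratio: {ratio:.1f}")
```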

Statistical Comparisons

Dataset  Year  # Img.  # Cat.  # Avg.  # Max.  # Min.  Mul.  Sal.  Larg.  H.Q.  Type
CAT      2021  33500   280     119.6   824     28      yes   yes   yes    yes   CO
Notations: # Img.: Number of images; # Cat.: Number of categories (semantic groups); # Avg.: Average number of images per category; # Max.: Maximum number of images per category; # Min.: Minimum number of images per category; Mul.: Whether the dataset contains multiple foreground objects; Sal.: Whether the dataset maintains saliency; Larg.: Whether the dataset is large-scale (more than 10k images); H.Q.: Whether the dataset has high-quality annotations. CO: Co-saliency detection dataset; SD: Saliency detection dataset; SS: Semantic segmentation dataset. “-” denotes “not available”.

Visual Comparisons

Semantic Group: 'White Stork'

Semantic Group: 'Zebra'


  title={Free Lunch for Co-Saliency Detection: Context Adjustment},
  author={Kong, Lingdong and Ganesh, Prakhar and Wang, Tan and Liu, Junhao and Chen, Yao and Zhang, Le},
  journal={arXiv preprint arXiv:2108.02093},


We welcome all feedback and are open to discussion! Please contact us at cat.cosaliency@gmail.com.


Lingdong Kong

Nanyang Technological University

Prakhar Ganesh

National University of Singapore

Tan Wang

Nanyang Technological University

Junhao Liu

SIAT, Chinese Academy of Sciences

Le Zhang

University of Electronic Science and Technology of China

Yao Chen

Advanced Digital Sciences Center