The Auto Arborist dataset is a multiview, fine-grained visual categorization dataset containing over 2 million trees from more than 300 genus-level categories across 23 cities in the US and Canada. It was built to foster the development of robust methods for large-scale urban forest monitoring and was initially released as part of a CVPR 2022 publication. Data use and access information is provided in the Data Use and Access section below.

Dataset Overview

The 23 cities in our dataset are spread across the US and Canada and are grouped into West, Central, and East regions, enabling us to study how well models generalize both spatially and hierarchically (from city to region).

Tree records per city. Our dataset contains >300 tree genus classes. Note the shifts in distribution of common genera across cities (visualized by color).

We propose urban forest monitoring as an ideal testbed for several computer vision challenges (domain generalization, fine-grained categorization, long-tail learning, multiview vision), while also addressing a crucial environmental and societal need. Urban forests provide significant benefits to urban societies, including cleaner air and water, carbon sequestration, and energy savings. However, planning and maintaining these forests is expensive. One particularly costly aspect of urban forest management is monitoring the existing trees in a city: e.g., tracking tree locations, species, and health. Monitoring currently relies on tree censuses compiled by human experts, which cost cities millions of dollars per census and are therefore conducted infrequently.

The number of tree records in each city, with the held-out cities in bold.

Previous investigations into automating urban forest monitoring have focused on small datasets from single cities, covering only common categories. To address these shortcomings, we introduce a new large-scale dataset that joins public tree censuses from 23 cities with a large collection of street-level and aerial imagery. Our Auto Arborist dataset contains over 2.5M trees spanning more than 300 genera and is more than two orders of magnitude larger than the closest dataset in the literature. In our paper we present baseline results across modalities, as well as metrics for the detailed analysis of generalization with respect to geographic distribution shifts, which is vital for deploying such a system at scale.
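Independently of the paper's exact metrics, one common way to examine geographic distribution shift is to break classification accuracy down by city or region. The Python sketch below illustrates that idea only; the city-to-region mapping and the prediction records are hypothetical placeholders, not part of the dataset release.

```python
from collections import defaultdict

# Hypothetical city-to-region assignments; the dataset's actual grouping of
# its 23 cities into West/Central/East is defined in the release, not here.
CITY_TO_REGION = {"Seattle": "West", "Calgary": "West",
                  "Denver": "Central", "New York": "East"}

def accuracy_by_region(predictions):
    """predictions: iterable of (city, true_genus, predicted_genus) triples."""
    correct, total = defaultdict(int), defaultdict(int)
    for city, true_genus, predicted_genus in predictions:
        region = CITY_TO_REGION[city]
        total[region] += 1
        correct[region] += int(true_genus == predicted_genus)
    return {region: correct[region] / total[region] for region in total}

# Toy example: a model trained on West-region cities, evaluated everywhere.
predictions = [
    ("Seattle", "Acer", "Acer"),
    ("Calgary", "Ulmus", "Ulmus"),
    ("Denver", "Quercus", "Acer"),
    ("New York", "Platanus", "Platanus"),
]
print(accuracy_by_region(predictions))
# {'West': 1.0, 'Central': 0.0, 'East': 1.0}
```

Comparing per-region numbers like these against in-distribution accuracy gives a simple picture of how much performance degrades under geographic shift.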

Dataset Challenges

Generalization to novel domains is a fundamental challenge for computer vision. Models routinely achieve near-perfect accuracy on benchmarks, yet often fail to perform as expected when deployed outside their training distribution. To build computer vision systems that truly solve real-world problems at global scale, we need benchmarks that fully capture real-world complexity, including geographic domain shift, long-tailed distributions, and data noise.

The data has a long-tailed distribution across categories: the majority of examples in the dataset come from just a few frequent categories, while most categories have far fewer examples. We characterize each genus as frequent, common, or rare based on the number of training examples available for that genus. Note that our test data is split spatially from our training data within each city, so not all rare categories are seen in the test set.
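As an illustration of the frequent/common/rare breakdown, here is a minimal Python sketch that counts training examples per genus and assigns each genus to a bucket. The count thresholds and toy records below are placeholders; the actual cutoffs are defined in the paper.

```python
from collections import Counter

# Toy training records (genus labels only); real records also carry imagery,
# geolocation, and city metadata.
train_genera = ["Acer", "Acer", "Acer", "Quercus", "Quercus", "Ulmus"]

# Illustrative thresholds only -- the real frequent/common/rare cutoffs come
# from the Auto Arborist paper, not from this sketch.
FREQUENT_MIN = 3
COMMON_MIN = 2

def bucket_genera(genus_labels):
    """Map each genus to 'frequent', 'common', or 'rare' by training count."""
    counts = Counter(genus_labels)
    def bucket(n):
        if n >= FREQUENT_MIN:
            return "frequent"
        if n >= COMMON_MIN:
            return "common"
        return "rare"
    return {genus: bucket(n) for genus, n in counts.items()}

print(bucket_genera(train_genera))
# {'Acer': 'frequent', 'Quercus': 'common', 'Ulmus': 'rare'}
```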

This dendrogram shows the taxonomic structure of the genera in Auto Arborist. The dataset is taxonomically diverse, with >300 different genera represented.
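Because the categories are genus-level, census records that list full species names need to be collapsed to the genus. The sketch below shows one simple way this could be done, assuming standard binomial nomenclature where the genus is the first token of the name; it is an illustration only, not the dataset's actual label-processing pipeline.

```python
# Minimal sketch: collapse species-level names to genus-level labels, assuming
# the genus is the first token of a binomial name. Illustrative only; this is
# not the dataset's actual processing pipeline.
def species_to_genus(species_name: str) -> str:
    return species_name.strip().split()[0].capitalize()

census_names = ["Acer rubrum", "Acer platanoides", "Quercus alba", "ulmus americana"]
genus_labels = [species_to_genus(name) for name in census_names]
print(genus_labels)  # ['Acer', 'Acer', 'Quercus', 'Ulmus']
```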

Examples of street-level imagery after blurring for privacy.

Data Use and Access

We may post updates about the project and dataset on our Google Group: https://groups.google.com/g/auto-arborist.

We would love to hear from you if you have questions, suggestions, or success stories relating to this dataset. You can reach us at: auto-arborist+managers@googlegroups.com.

If you are interested in accessing the dataset, please fill out the following form. We are releasing the dataset in phases, and we are manually verifying that PII is obscured in all images before release. A data card for the dataset can be downloaded here.