Image search on a very large scale, where three constraints have to be considered jointly:
the accuracy of the search, its efficiency, and the memory usage of the representation
Author propose aggregating local image descriptors into a vector of limited dimension, which can be viewed as a simplification of the Fisher kernel representation.
Obtains a significantly higher accuracy with, typically, a 20-byte representation. This is obtained by optimizing:
- the representation, i.e., how to aggregate local image descriptors into a vector representation;
- the dimensionality reduction of these vectors;
- the indexing algorithm.
Two contribution of this work:
- proposing a representation that provides excellent search accuracy with a reasonable vector dimensionality
- show the advantage of jointly optimizing the trade-off between the dimensionality reduction and the indexation algorithm
VLAD: vector of locally aggregated descriptors
- learn a codebook C = {c1, ...ck} of k visual words with k-means. Each local descriptor x is associated to its nearest visual word
- The idea of the VLAD descriptor is to accumulate, for each visual word c , the differences x−ci of the vectors x assigned to ci, This characterizes the distribution of the vectors with respect to the center.
- A Component of v is obtained as a sum over all the image descriptors:
Images and corresponding VLAD descriptors, fork=16 centroids (D=16×128). The components of the descriptor are represented like SIFT, with negative components (see Equation 1) in red
VLAD performance and dimensionality reduction
Compare VLAD descriptors with BoF: INRIA Holidays Dataset (mAP,%)
Dimension is reduced to from D to D’ dimensions with PCA
Observations:
- VLAD better than BoF for a given descriptor size
- comparable to Fisher kernels for these operating points
- Choose a small D if output dimension D’ is small
Indexing algorithm: searching with quantization
- Search/Indexing = distance approximation problem
- The distance between a query vector x and a database vector y is estimated by
where q(.) is a quantizer
- vector-to-code distance
- The choice of the quantizer is critical
- needs many centroids
- regular k-means and approximate k-means can not be used
- we typically want k=2^
64for 64-bit codes
Product quantization for nearest neighbor search
- Vector split into m subvectors:
- Subvectors are quantized separately by quantizers
where each qi is learned by k-means with a limited number of centroids
- Example: y = 128-dim vector split in 8 subvectors of dimension 16