Wednesday, March 13, 2013

Aggregating local descriptors into a compact image representation



The paper addresses image search on a very large scale, where three constraints have to be considered jointly:
the accuracy of the search, its efficiency, and the memory usage of the representation.


The authors propose aggregating local image descriptors into a vector of limited dimension, which can be viewed as a simplification of the Fisher kernel representation.

The approach obtains significantly higher accuracy with, typically, a 20-byte representation. This is achieved by jointly optimizing:

  1. the representation, i.e., how to aggregate local image descriptors into a vector representation;
  2. the dimensionality reduction of these vectors;
  3. the indexing algorithm.

Two contributions of this work:


  1. proposing a representation that provides excellent search accuracy with a reasonable vector dimensionality;
  2. showing the advantage of jointly optimizing the trade-off between dimensionality reduction and the indexing algorithm.

VLAD: vector of locally aggregated descriptors


  1. Learn a codebook C = {c_1, ..., c_k} of k visual words with k-means. Each local descriptor x is assigned to its nearest visual word.
  2. The idea of the VLAD descriptor is to accumulate, for each visual word c_i, the differences x − c_i of the vectors x assigned to c_i. This characterizes the distribution of the vectors with respect to the center.
  3. A component of v is obtained as a sum over all the image descriptors assigned to c_i: v_{i,j} = sum over {x : NN(x) = c_i} of (x_j − c_{i,j}), where x_j and c_{i,j} denote the j-th components of x and c_i (Equation 1). The vector is subsequently L2-normalized. (See the sketch below.)
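
As a rough illustration of steps 1-3, here is a minimal NumPy sketch (my own, not the authors' code; the function names and the use of scikit-learn's KMeans are assumptions):

    import numpy as np
    from sklearn.cluster import KMeans

    def learn_codebook(train_descriptors, k=16):
        # Step 1: learn k visual words with k-means on training descriptors (n, d).
        return KMeans(n_clusters=k, n_init=4).fit(train_descriptors).cluster_centers_

    def vlad(descriptors, centroids):
        # Steps 2-3: accumulate the residuals x - c_i per visual word,
        # then flatten to a single k*d-dimensional vector.
        k, d = centroids.shape
        dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)   # nearest visual word for each descriptor
        v = np.zeros((k, d))
        for i in range(k):
            residuals = descriptors[assignment == i] - centroids[i]
            v[i] = residuals.sum(axis=0)    # v_{i,j} = sum of (x_j - c_{i,j})
        v = v.ravel()
        norm = np.linalg.norm(v)
        return v / norm if norm > 0 else v  # L2-normalization, as in the paper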


Figure: images and corresponding VLAD descriptors, for k = 16 centroids (D = 16×128). The components of the descriptor are represented like SIFT, with negative components (see Equation 1) in red.




VLAD performance and dimensionality reduction 



Table: comparison of VLAD descriptors with BoF on the INRIA Holidays dataset (mAP, %).

Dimensionality is reduced from D to D’ dimensions with PCA (a sketch follows the observations below).


Observations:

  • VLAD better than BoF for a given descriptor size 
    • comparable to Fisher kernels for these operating points 
  • Choose a small D if the output dimension D’ is small
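
A minimal sketch of this reduction step, assuming scikit-learn's PCA and re-normalization after projection (the concrete calls are mine, not the paper's):

    import numpy as np
    from sklearn.decomposition import PCA

    def reduce_dimension(train_vlads, database_vlads, d_out):
        # Learn the D -> D' projection on a training set of VLAD vectors.
        pca = PCA(n_components=d_out).fit(train_vlads)
        reduced = pca.transform(database_vlads)
        # Re-normalize so Euclidean comparisons remain on the unit sphere.
        return reduced / np.linalg.norm(reduced, axis=1, keepdims=True)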


Indexing algorithm: searching with quantization 


  • Search/Indexing = distance approximation problem 
  • The distance between a query vector x and a database vector y is estimated by d(x, y) ≈ d(x, q(y)), where q(·) is a quantizer
    • a vector-to-code distance (a toy sketch follows this list)
  • The choice of the quantizer is critical 
    • needs many centroids 
    • regular k-means and approximate k-means cannot be used
      • we typically want k = 2^64 for 64-bit codes
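
To make the vector-to-code idea concrete, here is a toy sketch with a single flat quantizer (illustrative only; the names are mine):

    import numpy as np

    def encode(y, centroids):
        # Store only the index of the nearest centroid instead of y itself.
        return np.linalg.norm(centroids - y, axis=1).argmin()

    def approx_distance(x, code, centroids):
        # d(x, q(y)): distance from the raw query to the reconstructed vector.
        return np.linalg.norm(x - centroids[code])

In this flat scheme a 64-bit code would require 2^64 centroids, which is infeasible to learn or store; product quantization (next section) factorizes the quantizer to get around this.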


Product quantization for nearest neighbor search

  • The vector y is split into m subvectors: y = [y_1, ..., y_m]
  • The subvectors are quantized separately by quantizers q_1, ..., q_m: q(y) = [q_1(y_1), ..., q_m(y_m)], where each q_i is learned by k-means with a limited number of centroids
  • Example: a 128-dim vector y split into 8 subvectors of dimension 16; with, e.g., 256 centroids per subquantizer, the code takes 8 bytes and the effective number of centroids is 256^8 = 2^64 (see the sketch below)
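
A compact sketch of product quantization with the resulting asymmetric (vector-to-code) distance; the parameter names and scikit-learn calls are my assumptions, with m = 8 subquantizers of 256 centroids each giving 8-byte codes:

    import numpy as np
    from sklearn.cluster import KMeans

    def train_pq(train, m=8, ks=256):
        # One small codebook per subspace: m codebooks of shape (ks, D/m).
        sub = train.shape[1] // m
        return [KMeans(n_clusters=ks, n_init=4)
                .fit(train[:, i*sub:(i+1)*sub]).cluster_centers_
                for i in range(m)]

    def encode_pq(y, codebooks):
        # Quantize each subvector separately; the code is m small indices.
        sub = codebooks[0].shape[1]
        return np.array([np.linalg.norm(cb - y[i*sub:(i+1)*sub], axis=1).argmin()
                         for i, cb in enumerate(codebooks)], dtype=np.uint8)

    def adc_distance(x, code, codebooks):
        # Asymmetric distance: compare raw query subvectors with the centroids
        # selected by the stored code, summing squared distances per subspace.
        sub = codebooks[0].shape[1]
        sq = 0.0
        for i, cb in enumerate(codebooks):
            diff = x[i*sub:(i+1)*sub] - cb[code[i]]
            sq += diff @ diff
        return np.sqrt(sq)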