Thursday, June 27, 2013

Efficient visual search of videos cast as text retrieval


This paper describes object retrieval: searching for and localizing all occurrences of an object in
a video, given a query image of the object. The object is represented by a set of viewpoint invariant region descriptors.
Efficient retrieval is achieved by employing methods from statistical text retrieval.




Steps:
VIEWPOINT INVARIANT DESCRIPTION

1. Employ the technology of viewpoint invariant segmentation developed for
wide baseline matching

Regions are detected in a viewpoint invariant manner, so the same surface patch gives rise to corresponding regions across frames. The first type of region, Shape Adapted (SA), is constructed by elliptical shape adaptation about interest points.

2. The second type of region, Maximally Stable (MS), is constructed by selecting areas from an intensity watershed image segmentation.

The SA regions tend to be centred on corner-like features, and the MS regions correspond to blobs of high contrast with respect to their surroundings, such as a dark window on a grey wall.


3. Each elliptical affine covariant region is represented by a 128-dimensional vector using the SIFT descriptor.

Combining the SIFT descriptor with affine covariant regions gives region description vectors which are invariant to affine transformations of the image.

To reduce noise and reject unstable regions, information is aggregated over a sequence of frames. The regions detected in each frame of the video are tracked using a simple constant velocity dynamical model and correlation.
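The constant-velocity tracking step can be sketched as below; the data layout, gating radius, and nearest-neighbour matching rule are illustrative assumptions, not the paper's implementation:

```python
# Sketch of constant-velocity region tracking over frames (hypothetical data):
# each live track predicts its next position as position + velocity, and is
# matched to the nearest detection in the next frame within a gating radius.

def track_regions(frames, gate=5.0):
    """frames: list of lists of (x, y) region centres, one list per frame.
    Returns the tracks surviving >= 3 frames, each a list of (frame_idx, (x, y))."""
    tracks = []  # each: {'points': [(frame_idx, (x, y)), ...], 'vel': (vx, vy)}
    for t, detections in enumerate(frames):
        unused = list(detections)
        for tr in tracks:
            if tr['points'][-1][0] != t - 1:
                continue  # track was not alive in the previous frame
            px, py = tr['points'][-1][1]
            vx, vy = tr['vel']
            pred = (px + vx, py + vy)  # constant-velocity prediction
            best, best_d = None, gate
            for d in unused:  # nearest unclaimed detection to the prediction
                dist = ((d[0] - pred[0]) ** 2 + (d[1] - pred[1]) ** 2) ** 0.5
                if dist < best_d:
                    best, best_d = d, dist
            if best is not None:
                unused.remove(best)
                tr['vel'] = (best[0] - px, best[1] - py)
                tr['points'].append((t, best))
        for d in unused:  # unmatched detections start new tracks
            tracks.append({'points': [(t, d)], 'vel': (0.0, 0.0)})
    # keeping only tracks that survive several frames rejects unstable regions
    return [tr['points'] for tr in tracks if len(tr['points']) >= 3]

frames = [[(0, 0), (50, 50)], [(1, 0), (50, 51)], [(2, 0), (50, 52)], [(3, 0)]]
tracks = track_regions(frames)
```

Here the short-lived detections are discarded, which plays the noise-rejection role described above.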

BUILDING A VISUAL VOCABULARY

The objective here is to vector quantize the descriptors into clusters which will be the visual ‘words’ for text retrieval.
The vector quantization is carried out by K-means clustering.
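A minimal sketch of the vector quantization step, assuming toy 8-D blobs in place of real 128-D SIFT descriptors (the farthest-point initialization is an illustrative choice, not the paper's):

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=20):
    """K-means clustering of descriptors; the k centres are the visual words."""
    # farthest-point initialization keeps the sketch deterministic
    centres = [descriptors[0]]
    for _ in range(k - 1):
        d2 = np.min([((descriptors - c) ** 2).sum(1) for c in centres], axis=0)
        centres.append(descriptors[d2.argmax()])
    centres = np.array(centres)
    for _ in range(iters):
        # assign each descriptor to its nearest cluster centre
        d2 = ((descriptors[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # move each centre to the mean of its assigned descriptors
        for j in range(k):
            if (labels == j).any():
                centres[j] = descriptors[labels == j].mean(axis=0)
    return centres, labels

def quantize(descriptor, centres):
    # a new descriptor's visual word is the index of its nearest centre
    return int(((centres - descriptor) ** 2).sum(axis=1).argmin())

rng = np.random.default_rng(1)
# two well-separated blobs standing in for SIFT descriptors
descs = np.vstack([rng.normal(0.0, 0.1, (50, 8)),
                   rng.normal(5.0, 0.1, (50, 8))])
centres, labels = build_vocabulary(descs, k=2)
```

Once the vocabulary is built, `quantize` maps any descriptor from any frame to a visual word, exactly as a text tokenizer maps strings to vocabulary terms.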

Figure: samples of normalized affine covariant regions from clusters
corresponding to a single visual word: (a–c) Shape Adapted regions; (d–f)
Maximally Stable regions.


A. Term frequency–inverse document frequency weighting

Each document (frame) is represented by a vector of word weights

t_i = (n_id / n_d) · log(N / N_i)

where n_id is the number of occurrences of word i in document
d, n_d is the total number of words in document d, N_i is the
number of documents containing term i, and N is the number of
documents in the whole database.

The weighting is a product of two terms: the word frequency, n_id/n_d, and the inverse document
frequency, log(N/N_i).

At the retrieval stage, documents are ranked by the normalized scalar product (cosine of angle) between the query vector and each document vector.
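The tf-idf weighting and cosine ranking can be sketched as below; the visual words are plain integers and the toy "documents" are assumptions for illustration:

```python
import math
from collections import Counter

def tfidf_vector(words, df, n_docs, vocab_size):
    """Weight t_i = (n_id / n_d) * log(N / N_i) for each visual word i."""
    counts = Counter(words)
    nd = len(words)
    vec = [0.0] * vocab_size
    for i, nid in counts.items():
        if df[i] > 0:
            vec[i] = (nid / nd) * math.log(n_docs / df[i])
    return vec

def cosine(u, v):
    """Normalized scalar product used to rank documents against the query."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# three frames as bags of visual words, over a 5-word vocabulary
docs = [[0, 1, 1, 2], [0, 3, 3, 4], [1, 1, 2, 2]]
vocab_size = 5
df = [sum(1 for d in docs if i in d) for i in range(vocab_size)]  # N_i
vecs = [tfidf_vector(d, df, len(docs), vocab_size) for d in docs]
query = tfidf_vector([1, 2], df, len(docs), vocab_size)
ranking = sorted(range(len(docs)),
                 key=lambda j: cosine(query, vecs[j]), reverse=True)
```

The frame whose word distribution best matches the query ranks first; frames sharing no words with the query score zero.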

B. Stop list
By analogy with a text-retrieval stop list, the most frequent visual words, which occur in almost all images, are suppressed.
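A minimal sketch of a visual stop list; the cut-off fraction is an illustrative assumption (the paper tunes how many of the most frequent words to suppress):

```python
from collections import Counter

def build_stop_list(docs, top_fraction=0.05):
    """Collect the top fraction of most frequent visual words across the corpus."""
    freq = Counter(w for d in docs for w in d)
    n_stop = max(1, int(len(freq) * top_fraction))
    return {w for w, _ in freq.most_common(n_stop)}

def apply_stop_list(doc, stop):
    # drop stopped words before weighting and matching
    return [w for w in doc if w not in stop]

# word 7 occurs in every frame, like a common texture; it gets suppressed
docs = [[7, 7, 1, 2], [7, 3, 7, 4], [7, 5, 7, 6]]
stop = build_stop_list(docs, top_fraction=0.2)
```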

C. Spatial consistency
Spatial consistency can be measured quite loosely by requiring that neighbouring matches in the query region lie in a surrounding area in the retrieved frame.

Figure: illustration of spatial consistency voting, used to verify a pair of matching
regions (A, B).

The final score of a frame is determined by summing the spatial consistency votes and adding
the frequency score sim(v_q, v_d).
Including the frequency score (which ranges between 0 and 1) disambiguates the ranking amongst frames that receive the same number of spatial consistency votes.
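The loose spatial consistency check can be sketched as below; the fixed search radius and the match format are illustrative assumptions, not the paper's exact voting scheme:

```python
# Sketch of spatial-consistency voting: a match (A, B) earns one vote for
# every other match whose query region lies near A and whose retrieved
# region lies near B, so geometrically coherent matches reinforce each other.

def spatial_votes(matches, radius=10.0):
    """matches: list of ((qx, qy), (rx, ry)) matched region centres.
    Returns the total spatial consistency votes for the frame."""
    def near(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5 <= radius
    votes = 0
    for i, (qa, ra) in enumerate(matches):
        for j, (qb, rb) in enumerate(matches):
            # a neighbour in the query region must also be a neighbour
            # in the retrieved frame for the pair to cast a vote
            if i != j and near(qa, qb) and near(ra, rb):
                votes += 1
    return votes

# three mutually consistent matches, plus one whose retrieved region
# lands far away and therefore earns and contributes no votes
good = [((0, 0), (100, 100)), ((3, 0), (103, 100)), ((0, 4), (100, 104))]
outlier = [((1, 1), (300, 5))]
```

Mismatches with no consistent neighbours score zero, so they can be rejected even when their descriptors match well.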



Comparing document retrieval using a bag-of-words with frame retrieval using a bag-of-visual-words:

1. Visual features overlap in the image, so some spatial information is implicitly preserved.
2. An image query typically contains many more visual words than a text query.
3. Internet search engines exploit query-independent cues such as the link structure of the web. This query-independent rank provides a general indicator of the quality of a web page and enables more efficient, and in some cases more accurate, retrieval.

