
Product quantization for nearest neighbor search - Inria


The k-means clustering algorithm finds a near-optimal codebook by iteratively assigning the vectors of a training set to centroids and re-estimating these centroids from the assigned vectors. In the following, we assume that the two Lloyd conditions hold, as we learn the quantizer using k-means. Note, however, that k-means only finds a local optimum in terms of quantization error.
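A minimal sketch of this iterative procedure (assignment step followed by centroid re-estimation) in NumPy; the array names, initialization scheme, and fixed iteration count are illustrative assumptions, not part of the paper:

```python
import numpy as np

def kmeans_codebook(train, k, n_iter=20, seed=0):
    """Learn a k-centroid codebook with plain Lloyd iterations (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen training vectors.
    centroids = train[rng.choice(len(train), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: map every training vector to its nearest centroid.
        d2 = ((train[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # Update step: re-estimate each centroid as the mean of its assigned vectors.
        for i in range(k):
            members = train[assign == i]
            if len(members) > 0:
                centroids[i] = members.mean(axis=0)
    return centroids
```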


Transcription of Product quantization for nearest neighbor search - Inria

Product quantization for nearest neighbor search

Hervé Jégou, Matthijs Douze, Cordelia Schmid

Abstract — This paper introduces a product quantization-based approach for approximate nearest neighbor search. The idea is to decompose the space into a Cartesian product of low-dimensional subspaces and to quantize each subspace separately. A vector is represented by a short code composed of its subspace quantization indices. The Euclidean distance between two vectors can be efficiently estimated from their codes. An asymmetric version increases precision, as it computes the approximate distance between a vector and a code. Experimental results show that our approach searches for nearest neighbors efficiently, in particular in combination with an inverted file system. Results for SIFT and GIST image descriptors show excellent search accuracy, outperforming three state-of-the-art approaches.

The scalability of our approach is validated on a dataset of two billion vectors.

Index Terms — High-dimensional indexing, image indexing, very large databases, approximate search.

I. INTRODUCTION

Computing Euclidean distances between high-dimensional vectors is a fundamental requirement in many applications. It is used, in particular, for nearest neighbor (NN) search. Nearest neighbor search is inherently expensive due to the curse of dimensionality [3], [4]. Focusing on the D-dimensional Euclidean space R^D, the problem is to find the element NN(x), in a finite set Y ⊂ R^D of n vectors, minimizing the distance to the query vector x ∈ R^D:

NN(x) = \arg\min_{y \in Y} d(x, y).   (1)

Several multi-dimensional indexing methods, such as the popular KD-tree [5] or other branch-and-bound techniques, have been proposed to reduce the search time. However, for high dimensions it turns out [6] that such approaches are not more efficient than the brute-force exhaustive distance calculation, whose complexity is O(nD).
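For reference, the brute-force search of equation (1) can be written directly; the following NumPy sketch (variable names and the random test data are illustrative) scans all n database vectors and therefore costs O(nD) per query:

```python
import numpy as np

def nearest_neighbor(x, Y):
    """Exact NN(x) over database Y (n x D) by exhaustive O(nD) distance computation."""
    # Squared Euclidean distance to every database vector.
    d2 = ((Y - x) ** 2).sum(axis=1)
    return d2.argmin(), np.sqrt(d2.min())

# Example: D = 128 (SIFT-sized vectors), n = 10000 random database vectors.
rng = np.random.default_rng(0)
Y = rng.standard_normal((10000, 128)).astype(np.float32)
x = rng.standard_normal(128).astype(np.float32)
idx, dist = nearest_neighbor(x, Y)
```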

There is a large body of literature [7], [8], [9] on algorithms that overcome this issue by performing approximate nearest neighbor (ANN) search. The key idea shared by these algorithms is to find the NN with high probability only, instead of probability 1. Most of the effort has been devoted to the Euclidean distance, though recent generalizations have been proposed for other metrics [10]. In this paper, we consider the Euclidean distance, which is relevant for many applications. In this case, one of the most popular ANN algorithms is the Euclidean Locality-Sensitive Hashing (E2LSH) [7], [11], which provides theoretical guarantees on the search quality with limited assumptions.

(This work was partly realized as part of the Quaero Programme, funded by OSEO, French State agency for innovation. It was originally published as a technical report [1] in August 2009. It is also related to the work [2] on source coding for nearest neighbor search.)

It has been successfully used for local descriptors [12] and 3D object indexing [13], [11]. However, for real data, LSH is outperformed by heuristic methods, which exploit the distribution of the vectors. These methods include randomized KD-trees [14] and hierarchical k-means [15], both of which are implemented in the FLANN selection algorithm [9].

ANN algorithms are typically compared based on the trade-off between search quality and efficiency. However, this trade-off does not take into account the memory requirements of the indexing structure. In the case of E2LSH, the memory usage may even be higher than that of the original vectors. Moreover, both E2LSH and FLANN need to perform a final re-ranking step based on exact L2 distances, which requires the indexed vectors to be stored in main memory if access speed is important. This constraint seriously limits the number of vectors that can be handled by these algorithms.

Only recently, researchers came up with methods limiting the memory usage. This is a key criterion for problems involving large amounts of data [16], e.g., in large-scale scene recognition [17], where millions to billions of images have to be indexed. In [17], Torralba et al. represent an image by a single global GIST descriptor [18] which is mapped to a short binary code. When no supervision is used, this mapping is learned such that the neighborhood in the embedded space defined by the Hamming distance reflects the neighborhood in the Euclidean space of the original features. The search of the Euclidean nearest neighbors is then approximated by the search of the nearest neighbors in terms of Hamming distances between codes. In [19], spectral hashing (SH) is shown to outperform the binary codes generated by the restricted Boltzmann machine [17], boosting and LSH. Similarly, the Hamming embedding method of Jégou et al. [20], [21] uses a binary signature to refine quantized SIFT or GIST descriptors in a bag-of-features image search framework.
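To make the binary-code search scheme of [17], [19], [20] concrete, the following sketch ranks packed binary codes by Hamming distance to a query code; it illustrates the general idea only, not any particular method from those papers, and the code sizes and names are assumptions:

```python
import numpy as np

def hamming_search(query_code, db_codes):
    """Rank binary codes (packed as uint8 rows) by Hamming distance to the query."""
    # XOR isolates differing bits; unpack and sum to count them (popcount) per code.
    diff = np.bitwise_xor(db_codes, query_code)
    dists = np.unpackbits(diff, axis=1).sum(axis=1)
    return np.argsort(dists)

# Example with 64-bit codes: 8 bytes per code, 1000 database codes.
rng = np.random.default_rng(0)
db_codes = rng.integers(0, 256, size=(1000, 8), dtype=np.uint8)
query_code = rng.integers(0, 256, size=8, dtype=np.uint8)
ranking = hamming_search(query_code, db_codes)
```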

In this paper, we construct short codes using quantization. The goal is to estimate distances using vector-to-centroid distances, i.e., the query vector is not quantized, codes are assigned to the database vectors only. This reduces the quantization noise and subsequently improves the search quality. To obtain precise distances, the quantization error must be limited. Therefore, the total number k of centroids should be sufficiently large, e.g., k = 2^64 for 64-bit codes. This raises several issues on how to learn the codebook and assign a vector. First, the number of samples required to learn the quantizer is huge, i.e., several times k. Second, the complexity of the algorithm itself is prohibitive. Finally, the amount of computer memory available on Earth is not sufficient to store the floating point values representing the centroids.

The hierarchical k-means (HKM) improves the efficiency of the learning stage and of the corresponding assignment procedure [15].
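To make the memory argument concrete (assuming, for illustration only, SIFT-sized vectors with D = 128 and 4-byte floats; these figures are not taken from the paper): storing k = 2^64 explicit centroids would require 2^64 × 128 × 4 ≈ 9.4 × 10^21 bytes, i.e., on the order of ten zettabytes, far beyond any available storage.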

However, the aforementioned limitations still apply, in particular with respect to memory usage and size of the learning set. Another possibility are scalar quantizers, but they offer poor quantization error properties in terms of the trade-off between memory and reconstruction error. Lattice quantizers offer better quantization properties for uniform vector distributions, but this condition is rarely satisfied by real world vectors. In practice, these quantizers perform significantly worse than k-means in indexing tasks [22]. In this paper, we focus on product quantizers. To our knowledge, such a semi-structured quantizer has never been considered in any nearest neighbor search method.

The advantages of our method are twofold. First, the number of possible distances is significantly higher than for competing Hamming embedding methods [20], [17], [19], as the Hamming space used in these techniques allows for a few distinct distances only.

Second, as a byproduct of the method, we get an estimation of the expected squared distance, which is required for ε-radius search or for using Lowe's distance ratio criterion [23]. The motivation of using the Hamming space in [20], [17], [19] is to compute distances efficiently. Note, however, that one of the fastest ways to compute Hamming distances consists in using table lookups. Our method uses a similar number of table lookups, resulting in comparable efficiency.

An exhaustive comparison of the query vector with all codes is prohibitive for very large datasets. We, therefore, introduce a modified inverted file structure to rapidly access the most relevant vectors. A coarse quantizer is used to implement this inverted file structure, where vectors corresponding to a cluster (index) are stored in the associated list. The vectors in the list are represented by short codes, computed by our product quantizer, which is used here to encode the residual vector with respect to the cluster center.

The interest of our method is validated on two kinds of vectors, namely local SIFT [23] and global GIST [18] descriptors.
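The following NumPy sketch illustrates the kind of table-lookup distance computation described above in its simplest, exhaustive form (no inverted file, no residual encoding): database vectors are encoded by m independent sub-quantizers, and an unquantized query is compared to all codes by summing per-subspace squared distances read from precomputed tables. The function names, the assumption that D is divisible by m, and the uint8 codes (at most 256 centroids per sub-quantizer) are illustrative choices, not the paper's exact formulation:

```python
import numpy as np

def pq_encode(X, codebooks):
    """Encode vectors (n x D) as m sub-quantizer indices using per-subspace codebooks."""
    m = len(codebooks)
    d_sub = X.shape[1] // m
    codes = np.empty((X.shape[0], m), dtype=np.uint8)  # assumes <= 256 centroids per subspace
    for j, cb in enumerate(codebooks):                 # cb: (k_sub, d_sub) centroids of subspace j
        sub = X[:, j * d_sub:(j + 1) * d_sub]
        d2 = ((sub[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)
        codes[:, j] = d2.argmin(axis=1)
    return codes

def pq_asymmetric_distances(x, codes, codebooks):
    """Approximate squared distances between an unquantized query x and all PQ codes."""
    m = len(codebooks)
    d_sub = x.shape[0] // m
    # One lookup table per subspace: squared distance from the query sub-vector to each centroid.
    tables = [((codebooks[j] - x[j * d_sub:(j + 1) * d_sub]) ** 2).sum(axis=1) for j in range(m)]
    # Distance estimate for each database code = sum of m table lookups.
    return sum(tables[j][codes[:, j]] for j in range(m))
```

In this sketch the per-subspace codebooks could be learned, for example, with k-means run independently on each block of coordinates; the query itself is never quantized, which is the asymmetric aspect mentioned in the abstract.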

A comparison with the state of the art shows that our approach outperforms existing techniques, in particular spectral hashing [19], Hamming embedding [20] and FLANN [9].

Our paper is organized as follows. Section II introduces the notations for quantization as well as the product quantizer used by our method. Section III presents our approach for NN search and Section IV introduces the structure used to avoid exhaustive search. An evaluation of the parameters of our approach and a comparison with the state of the art is given in Section V.

II. BACKGROUND: QUANTIZATION, PRODUCT QUANTIZER

A large body of literature is available on vector quantization, see [24] for a survey. In this section, we restrict our presentation to the notations and concepts used in the rest of the paper.

A. Vector quantization

Quantization is a destructive process which has been extensively studied in information theory [24].

Its purpose is to reduce the cardinality of the representation space, in particular when the input data is real-valued. Formally, a quantizer is a function q mapping a D-dimensional vector x ∈ R^D to a vector q(x) ∈ C = {c_i ; i ∈ I}, where the index set I is from now on assumed to be finite: I = 0 ... k−1. The reproduction values c_i are called centroids. The set of reproduction values C is the codebook of size k.

The set V_i of vectors mapped to a given index i is referred to as a (Voronoi) cell, and defined as

V_i \triangleq \{ x \in R^D : q(x) = c_i \}.   (2)

The k cells of a quantizer form a partition of R^D. By definition, all the vectors lying in the same cell V_i are reconstructed by the same centroid c_i. The quality of a quantizer is usually measured by the mean squared error between the input vector x and its reproduction value q(x):

MSE(q) = E_X[ d(q(x), x)^2 ] = \int p(x) \, d(q(x), x)^2 \, dx,   (3)

where d(x, y) = ||x − y|| is the Euclidean distance between x and y, and where p(x) is the probability distribution function corresponding to the random variable X.
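As an illustration of equation (3), the MSE of a learned quantizer can be estimated empirically by replacing the expectation with an average over a sample of held-out vectors. A minimal sketch, assuming a nearest-centroid quantizer and illustrative function names:

```python
import numpy as np

def quantize(X, centroids):
    """Map each vector to the index of its nearest centroid (Voronoi cell assignment)."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def empirical_mse(X, centroids):
    """Monte-Carlo estimate of MSE(q): mean squared error between x and its reproduction q(x)."""
    idx = quantize(X, centroids)
    return ((X - centroids[idx]) ** 2).sum(axis=1).mean()
```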

