Querying Spatial Patterns by Keyword

Keyword based search in text rich spatial datasets leads to many novel applications and services. It also enhances the functionality of existing map search services, image search applications, and GIS systems. Currently, we are studying the problem of searching top-k sets of spatially close spatial objects for a given set of keywords such that each set contains objects for each keyword and no proper subset of the set does so. We are also developing algorithms to mine interesting connected object sets in spatial datasets.

  • "ProMiSH: Nearest Keyword Set Search in Very High Dimensional Spaces", Vishwakarma Singh and Ambuj K. Singh.
  • "Geo-Clustering of Images With Missing GeoTags", Vishwakarma Singh, Sharath Venkatesha, and Ambuj K. Singh, in IEEE Granular Computing, 2010.    Geo-Cluster Examples

Nearest Neighbor Search in Very High Dimensional Spaces

Nearest neighbor search in very high dimensional spaces is useful in many applications. Existing techniques solve this problem efficiently only for the approximate case. These solutions are designed to solve r-near neighbor queries only for a fixed query range or a set of query ranges with probabilistic guarantees and then, extended for nearest neighbor queries. Solutions supporting a set of query ranges suffer from prohibitive space cost. There are many applications which are quality sensitive and need to efficiently and accurately support near neighbor queries for all query ranges. We design a novel indexing and querying scheme called Spatial Intersection and Metric Pruning (SIMP) that efficiently supports r-near neighbor queries in very high dimensional spaces for all query ranges with 100% quality guarantee and with practical storage costs. We also provide a statistical cost model for SIMP. SIMP outperforms LSH, Multi-Probe LSH, LSB tree, and iDistance on two real datasets of dimensions 128 and 256 and sizes 1 million and 1.08 million respectively. We show on a 128 dimensional real dataset of 10 million points that SIMP scales linearly with increasing query range.

We also develop a vote count based algorithm using p-stable distribution for approximate similarity search. Approximate similarity search effectively serves purpose for many real applications. Our algorithm is efficient and scalable with both dimension and database size. We also propose a novel hardware implementation of the algorithm using simple modification to Random Access Memory(RAM). The hardware design gives real time search for millions of points at practical cost. We empirically achieve high accuracy for query results using our algorithm on both synthetic and real datasets.

  • "SIMP: Accurate and Efficient Near Neighbor Search in Very High Dimensional Spaces", Vishwakarma Singh and Ambuj K. Singh, UCSB CS-Technical Report. 
  • "An Algorithm and Hardware Design for Very Fast Similarity Search in High Dimensional Space", Vishwakarma Singh and Wenyu Jiang, in IEEE International Conference on Granular Computing, 2010. 

Similarity Search in Images and Videos

Duplicate Video Detection

Copyright infringements and data piracy have recently become serious concerns for the ever growing online video repositories. Approaches based on content-based copy detection and watermarking are used to detect such infringements. In this work we develop an efficient and accurate method for duplicate video detection in a large database using video fingerprints. We use vector quantized color layout descriptor to create fingerprints. We propose a new nonmetric distance measure to find the similarity between the query and a database video fingerprint. Efficient search cannot be performed for high-dimensional data using a nonmetric distance measure with existing indexing techniques. Therefore, we develop novel search algorithms based on precomputed distances and new dataset pruning techniques yielding practical retrieval times.

  • "Efficient and Robust Detection of Duplicate Videos in a Large Database", Anindya Sarkar, Vishwakarma Singh, Pratim Ghosh, B. S. Manjunath, and Ambuj K. Singh, in IEEE CSVT, 2010. 
Sub-Image Similarity Search

Sub-image search with high accuracy in natural images still remains a challenging problem. We develop a new feature vector called profile for a keypoint in a bag of visual words model of an image. The profile of a keypoint captures the spatial geometry of all the other keypoints in an image with respect to itself, and is very effective in discriminating true matches from false matches. Sub-image search using profiles is a single-phase process requiring no geometric validation, yields high precision on natural images, and works well on small visual codebook. The proposed search technique differs from traditional methods that first generate a set of candidates disregarding spatial information and then verify them geometrically. Conventional methods also use large codebooks.

  • "Profile Based Sub-Image Search in Image Databases", Vishwakarma Singh and Ambuj K. Singh, UCSB CS-Technical Report. 

Querying Spatial Patterns by Feature Similarity

Spatial data are common in many scientific and commercial domains such as geographical information systems and gene/protein expression profiles. Querying for distribution patterns on such data can discover underlying spatial relationships and suggest avenues for further scientific exploration. Supporting such pattern retrieval requires not only the formulation of an appropriate scoring function for defining relevant connected subregions but also the design of new access methods that can scale to large databases. In this project, we propose a solution to this problem of querying significant subregions on spatial data provided as raster images. We design a scoring scheme to measure the similarity of subregions. All the raster images are tiled and each alignment of the query and a database image produces a tile score matrix. We show that the problem of finding the best connected subregion from this matrix is NP-hard and develop a dynamic programming heuristic. With this heuristic, we develop two index-based scalable search strategies, TARS and SPARS, to query patterns in large data repositories.

Patterns queried from spatial/image datasets just based on distance/score are not always interesting and sufficient for domain scientists. Ranking of results based on statistical significance or p-value is more useful. This can be computed using an appropriate model of random objects. The problem of computing p-values becomes more acute when queries have multiple components. In this case, the returned score is an aggregate of individual scores. The simple way of calculating the p-value by enumerating all random possibilities fails for large database and query sizes. We design an efficient method to calculate the approximate p-value of a multi-attribute result when the distribution of scores for the database objects is non-parametric.

  • "Querying Spatial Patterns", Vishwakarma Singh, Arnab Bhattacharya, and Ambuj K. Singh, in EDBT, 2010. 
  • "Efficient Computation of Statistical Significance of Query Results in Databases", Vishwakarma Singh, Arnab Bhattacharya, and Ambuj K. Singh, in SSDBM, 2008. 
  • "QUIP: Querying Significant Patterns from Image Databases", Vishwakarma Singh, Arnab Bhattacharya, Ambuj K. Singh, Chris Banna, Geoffrey P. Lewis, Steven K. Fisher, UCSB CS-Technical Report.