Scalable Similarity Computing


Project Overview

Similarity comparison is a key operation in many data-intensive mining and search applications and in cloud systems. Conducting similarity search on large datasets is time-consuming and becomes more challenging when the data is updated continuously. This project studies scalable algorithms and system support for high-performance similarity computing on modern computer architectures.

Techniques for data partitioning, data layout design, and computation balancing are developed to optimize communication, memory hierarchy performance, and computing resource usage. The project starts with incremental duplicate detection for web data analysis and search, and continues with similarity computing in other applications and cloud storage systems.
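To make the idea of partition-based all-pairs similarity search concrete, the following is a minimal sketch in Python. It is not the project's software or the algorithms from the publications below: the function names (`normalize`, `cosine`, `partition`, `all_pairs`), the round-robin partitioning, and the cosine threshold are all assumptions chosen for illustration. It only shows the general pattern of splitting documents into partitions and comparing each pair of partitions once.

```python
from math import sqrt


def normalize(vec):
    """Scale a sparse vector (term -> weight dict) to unit length."""
    norm = sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(u, v):
    """Dot product of two unit-length sparse vectors (their cosine similarity)."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def partition(ids, num_parts):
    """Hypothetical round-robin split; a real system would balance by cost."""
    return [ids[i::num_parts] for i in range(num_parts)]


def all_pairs(vectors, threshold, num_parts=2):
    """Return sorted (id_a, id_b, score) for all pairs with score >= threshold."""
    unit = {doc_id: normalize(v) for doc_id, v in vectors.items()}
    parts = partition(sorted(unit), num_parts)
    results = []
    for i in range(len(parts)):
        # Each unordered pair of partitions (i, j) is an independent task
        # that could be assigned to a separate worker.
        for j in range(i, len(parts)):
            for pos, a in enumerate(parts[i]):
                # Within one partition, compare each document only with later ones.
                candidates = parts[i][pos + 1:] if i == j else parts[j]
                for b in candidates:
                    score = cosine(unit[a], unit[b])
                    if score >= threshold:
                        results.append((min(a, b), max(a, b), score))
    return sorted(results)
```

Calling `all_pairs(docs, 0.9, num_parts=4)` on a dictionary that maps document ids to term-weight dictionaries returns the pairs whose cosine similarity is at least 0.9. The set of reported pairs does not depend on `num_parts`; only the way the work is divided changes, which is where load balancing and cache-conscious data layout become relevant.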

We expect to make the developed all-pairs similarity search software available on this web page by 2014.

Publication


  1. X. Tang, M. Alabduljalil, X. Jin, T. Yang. Load Balancing for Partition-based Similarity Search. To appear in Proc. of 2014 ACM SIGIR conference on Research and Development in Information Retrieval.

  2. X. Tang, X. Jin, T. Yang. Cache-Conscious Runtime Optimization for Ranking Ensembles. To appear in Proc. of 2014 ACM SIGIR conference on Research and Development in Information Retrieval.

  3. M. Alabduljalil, X. Tang, T. Yang. Cache-Conscious Performance Optimization for Similarity Search. SIGIR'2013 (Proc. of 36th ACM SIGIR conference on Research and Development in Information Retrieval). Slides.

  4. Maha Alabduljalil, Xun Tang, Tao Yang. Optimizing Parallel Algorithms for All Pairs Similarity Search. WSDM'2013 (6th ACM International Conference on Web Search and Data Mining). Slides.

  5. W. Zhang, T. Yang, G. Narayanasamy, H. Tang. Low-Cost Data Deduplication for Virtual Machine Backup in Cloud Storage. USENIX HotStorage'2013. Slides.

  6. T. Yang, A. Gerasoulis, Web Search Engines: Practice and Experience. To appear in Computer Science Handbook (T. Gonzalez. Eds), Chapman & Hall/CRC Press.

  7. S. Qiu, J. Zhou, and T. Yang. Versioned File Backup and Synchronization in Serverless Clouds. To appear in the 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2013.

  8. Shanzhong Zhu, Alexandra Potapova, Maha Alabduljalil, Xin Liu and Tao Yang. Clustering and Load Balancing Optimization for Redundant Content Removal, in Proc. of 22nd International World Wide Web Conference (WWW 2012). Lyon, April 2012.

  9. W. Zhang, H. Tang, H. Jiang, T. Yang, X. Li, Y. Zeng, Multi-level Selective Deduplication for VM Snapshots in Cloud Storage, in Proc. of IEEE Cloud 2012.

  10. H. Guan, J. Zhou, B. Xiao, M. Guo, and T. Yang. Fast Dimension Reduction for Document Classification Based on Imprecise Spectrum Analysis. Information Sciences, 2012.


People

Faculty and students

This material is based upon work supported by the National Science Foundation under Grant No. IIS-1118106 (2011-2014). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.