Scalable Similarity Computing


Project Overview

Similarity comparison is a key operation in many data-intensive mining and search applications and in cloud systems. Similarity search on large datasets is time-consuming, and it becomes more challenging when the data is updated continuously. This project studies scalable algorithms and system support for high-performance similarity computing on modern computer architectures.

Techniques for data partitioning, data layout design, and computation balancing are developed to optimize communication, memory hierarchy performance, and computing resource usage. The project starts with incremental duplicate detection for web data analysis and search, and continues with similarity computing in other applications and cloud storage systems.
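To make the idea of partitioning concrete, here is a minimal sketch of partition-based all-pairs cosine similarity search. It is an illustration under stated assumptions, not the project's released code: the function names (`normalize`, `make_partitions`, `may_be_similar`, `similar_pairs`) are hypothetical, vectors are assumed to be non-empty sparse dictionaries with nonnegative weights, and the partition-level filter relies only on the generic bound cos(x, y) <= ||x||_1 * max_i(y_i) for unit-length vectors.

```python
import math

def normalize(vec):
    """Scale a sparse vector (dict: term -> weight) to unit L2 norm."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

def dot(a, b):
    """Dot product of two sparse vectors; iterate over the shorter one."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b[t] for t, w in a.items() if t in b)

def make_partitions(vectors, num_parts):
    """Group vector ids by increasing 1-norm and record per-partition bounds."""
    ids = sorted(vectors, key=lambda i: sum(vectors[i].values()))
    if not ids:
        return []
    size = math.ceil(len(ids) / num_parts)
    parts = []
    for start in range(0, len(ids), size):
        members = ids[start:start + size]
        parts.append({
            "ids": members,
            # Largest 1-norm and largest single weight among the members.
            "max_l1": max(sum(vectors[i].values()) for i in members),
            "max_w": max(max(vectors[i].values()) for i in members),
        })
    return parts

def may_be_similar(p, q, tau):
    """False only if no pair drawn from partitions p and q can reach tau."""
    bound = min(p["max_l1"] * q["max_w"], q["max_l1"] * p["max_w"])
    return bound + 1e-12 >= tau  # small slack for floating-point rounding

def similar_pairs(raw_vectors, tau, num_parts=2):
    """Return the set of id pairs (i, j), i < j, with cosine similarity >= tau."""
    vectors = {i: normalize(v) for i, v in raw_vectors.items()}
    parts = make_partitions(vectors, num_parts)
    results = set()
    for a, p in enumerate(parts):
        for q in parts[a:]:
            if not may_be_similar(p, q, tau):
                continue  # the whole partition pair is skipped without I/O or compute
            for i in p["ids"]:
                for j in q["ids"]:
                    if i == j or (p is q and i > j):
                        continue
                    if dot(vectors[i], vectors[j]) >= tau:
                        results.add((min(i, j), max(i, j)))
    return results
```

In a distributed setting, each surviving partition pair would become an independent task, which is where load balancing and cache-aware data layout come into play; this sketch runs the tasks sequentially in one process.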

Software prototypes:

Partitioned Similarity Search for Hadoop/MapReduce (Release 1.0)

Partitioned Similarity Search for the Spark platform

Publications


  1. W. Zhang, D. Agun, T. Yang, R. Wolski and H. Tang, VM-Centric Snapshot Deduplication for Cloud Data Backup. Proceedings of the 31st International Conference on Massive Storage Systems and Technologies. 2015.

  2. Xun Tang, Efficient Similarity Search with Cache-Conscious Data Traversal. PhD Thesis, University of California at Santa Barbara, 2015.

  3. X. Tang, M. Alabduljalil, X. Jin, T. Yang, Load Balancing for Partition-based Similarity Search. Proceedings of 2014 ACM SIGIR conference on Research and Development in Information Retrieval. Slides.

  4. X. Tang, X. Jin, T. Yang. Cache-Conscious Runtime Optimization for Ranking Ensembles. Proc. of 2014 ACM SIGIR conference on Research and Development in Information Retrieval. Slides.

  5. T. Yang, A. Gerasoulis. Author Retrospective for PYRROS: Static Task Scheduling and Code Generation for Message Passing Multiprocessors. In "25 Years of International Conference on Supercomputing", ACM ICS 25th Anniversary Volume, 2014.

  6. M. Alabduljalil, X. Tang, T. Yang. Cache-Conscious Performance Optimization for Similarity Search. SIGIR'2013 (Proc. of 36th ACM SIGIR conference on Research and Development in Information Retrieval). Slides.

  7. Wei Zhang, Collocated Deduplication for Virtual Machine Backup in Cloud Storage. PhD Thesis, University of California at Santa Barbara, 2014.

  8. Maha Ahmed Alabduljalil, Efficient Parallel Optimizations for All Pairs Similarity Search. PhD Thesis, University of California at Santa Barbara, 2014.

  9. Maha Alabduljalil, Xun Tang, Tao Yang, Optimizing Parallel Algorithms for All Pairs Similarity Search. WSDM'2013 (6th ACM International Conference on Web Search and Data Mining). Finalist for the best student paper award. Slides.

  10. W. Zhang, T. Yang, G. Narayanasamy, H. Tang. Low-Cost Data Deduplication for Virtual Machine Backup in Cloud Storage. USENIX HotStorage'2013. Slides.

  11. T. Yang, A. Gerasoulis, Web Search Engines: Practice and Experience. To appear in Computer Science Handbook (T. Gonzalez, Ed.), Chapman & Hall/CRC Press.

  12. S. Qiu, J. Zhou, and T. Yang, Versioned File Backup and Synchronization in Serverless Clouds, To appear in 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2013.

  13. Shanzhong Zhu, Alexandra Potapova, Maha Alabduljalil, Xin Liu and Tao Yang. Clustering and Load Balancing Optimization for Redundant Content Removal, in Proc. of 21st International World Wide Web Conference (WWW 2012). Lyon, April 2012.

  14. W. Zhang, H. Tang, H. Jiang, T. Yang, X. Li, Y. Zeng, Multi-level Selective Deduplication for VM Snapshots in Cloud Storage, in Proc. of IEEE Cloud 2012.

  15. H. Guan, J. Zhou, B. Xiao, M. Guo, and T. Yang. Fast Dimension Reduction for Document Classification Based on Imprecise Spectrum Analysis. Information Sciences, 2012.


People

Faculty Students

This material is based upon work supported by the National Science Foundation under Grant No. IIS-1118106 (2011-2014). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.