Scalable algorithm design has become central in the era of large-scale data analysis. The vast amounts of data pouring in from diverse application domains, such as bioinformatics, recommender systems, sensor systems, and social networks, cannot be analyzed efficiently with the many data mining and statistical tools that were designed for small-scale settings. Designing new methods for efficient data analysis is an ongoing challenge for the data mining, machine learning, and statistics communities. Compounding this challenge is the noisy and incomplete nature of real-world data sets. Research scientists and industry practitioners alike need to find meaningful patterns in data with missing value rates often as high as 99%, in addition to errors in the data that can obstruct accurate analysis.
My contribution to this line of research is the design of new algorithms for scalable clustering, data reduction, and similarity evaluation that exploit inherent clustering structure in the input data to overcome the challenges posed by large numbers of missing entries. I demonstrate that, by focusing on underlying clustering properties of the data, we can improve the efficiency of several data analysis methods on sparse, discrete-valued data sets. I will highlight new methods that I have developed with my collaborators for three diverse knowledge discovery tasks: (1) clustering genetic markers into linkage groups, (2) reducing large-scale genetic data to a much smaller, more accurate representative data set, and (3) computing similarity between users in recommender systems. In each case, I will point out how the underlying clustering structure can be used to design more efficient algorithms, even when high missing value rates are present.
Veronika Strnadová-Neeley is an incoming assistant professor of computer science at Montana State University, and she is completing her Ph.D. at the University of California, Santa Barbara. She holds a master’s degree in Computer Science from UCSB and a bachelor’s degree in Applied Mathematics from the University of New Mexico. Her work has been published in the IEEE International Conference on Bioinformatics and Biomedicine, the ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, Genome Biology, the AAAI Spring Symposium Series, and the Workshop on Large Scale Recommendation Systems. She has served as a reviewer and program committee member for several journals and workshops. She has previously been selected as an M.I.T. Rising Star in EECS and a Google Anita Borg Memorial Scholarship Finalist.