Knowledge is being produced at an unprecedented rate. According to the prominent STM report, about 3 million scholarly journal articles are published each year, with an annual growth rate of 5%. Advanced techniques for digesting such knowledge are in great demand. Motivated by recent advances in text mining, neural natural language processing, and large-scale data analytics, I will focus my proposal talk on my work on effectively mining and analyzing technical knowledge based on concepts, i.e., meaning-bearing units such as “support vector machine” and “generative adversarial networks”.
Specifically, I will discuss my work on mining concepts from a relational table corpus, i.e., grouping table values such as “Microsoft SQL 2016” and “Microsoft Excel 2016” to form a consistent concept hierarchy, e.g., “Microsoft Office Software”, “Software”, “Things”. We first exhaustively enumerate candidate coherent clusters by performing a batch-version distributed agglomerative clustering, whose number of communication rounds matches the number of batches up to a logarithmic factor, and then leverage the minimum description length principle to select those clusters that correspond to real concepts. We prove the problem is APX-hard in general, and provide a distributed tree induction procedure that solves the task optimally and efficiently in practice.
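To make the selection step concrete, here is a minimal toy sketch, not the paper's algorithm, of choosing candidate clusters by a minimum-description-length criterion: a cluster is kept only when encoding its members through a shared label compresses them. The cost model (shared tokens as the label, leftover tokens as residuals) is entirely illustrative.

```python
def description_length(cluster):
    # Hypothetical cost model: the cluster label costs its shared tokens,
    # and each member costs the tokens not covered by the label.
    token_sets = [set(v.split()) for v in cluster]
    common = set.intersection(*token_sets)
    return len(common) + sum(len(ts - common) for ts in token_sets)

def baseline_length(cluster):
    # Cost of encoding every value independently, token by token.
    return sum(len(v.split()) for v in cluster)

def select_clusters(candidates):
    # Greedy MDL selection: prefer candidates with the largest compression
    # gain, and keep one only if it compresses and does not overlap
    # previously chosen clusters.
    chosen, covered = [], set()
    by_gain = sorted(candidates,
                     key=lambda c: baseline_length(c) - description_length(c),
                     reverse=True)
    for c in by_gain:
        if covered & set(c):
            continue
        if description_length(c) < baseline_length(c):
            chosen.append(c)
            covered |= set(c)
    return chosen

values = ["Microsoft SQL 2016", "Microsoft Excel 2016",
          "Microsoft Word 2016", "Ubuntu 18.04"]
candidates = [values[:3], values[:2], [values[3]]]
print(select_clusters(candidates))
# keeps only the three-member "Microsoft ... 2016" cluster
```

The singleton candidate is rejected because a cluster label saves nothing over encoding the value directly, mirroring how MDL filters out clusters that do not correspond to real concepts.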
I will also talk about my work on concept mining from raw text, which incorporates existing approaches such as knowledge-base matching, grammatical-pattern-based methods, phrase chunking, and statistical phrase mining into a unified framework. Specifically, we first use these approaches to produce a set of noisy and possibly overlapping textual occurrences, and then select the appropriate ones based on the quality of their distributed semantics as well as their fitness to the local context. We propose a generalized word embedding model as our optimization objective, which is then used to infer both the global concept vocabulary and the concept occurrences recognized in each local context.
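One illustrative piece of this pipeline is resolving the overlapping candidate occurrences: given scored spans from multiple recognizers, keep a non-overlapping subset of maximum total quality. The sketch below uses weighted interval scheduling with made-up scores; in the actual framework the scores would come from embedding quality and contextual fitness, not these constants.

```python
def best_non_overlapping(spans):
    # spans: list of (start, end, score) over token offsets.
    # Weighted interval scheduling DP: process spans in order of end
    # position; for each, either take it (plus the best compatible prefix)
    # or skip it.
    spans = sorted(spans, key=lambda s: s[1])
    best = [(0.0, [])]  # best[i] = (total score, chosen spans) using spans[:i]
    for i, (s, e, w) in enumerate(spans):
        j = 0  # index of the best solution ending at or before position s
        for k in range(i, 0, -1):
            if spans[k - 1][1] <= s:
                j = k
                break
        take = (best[j][0] + w, best[j][1] + [(s, e, w)])
        skip = best[i]
        best.append(max(take, skip, key=lambda t: t[0]))
    return best[-1][1]

tokens = "we train a support vector machine model".split()
candidates = [(3, 5, 0.4),   # "support vector"
              (3, 6, 0.9),   # "support vector machine"
              (5, 6, 0.3)]   # "machine"
print(best_non_overlapping(candidates))
# the full phrase wins over the two shorter overlapping spans
```

The DP runs in quadratic time as written; with binary search over end positions it becomes O(n log n), which matters when many recognizers fire on the same sentence.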
Next, I will shed light on some of my ongoing work that leverages the mined concepts for downstream analytical tasks, specifically inferring the scientific topics of a document. I will start by introducing a novel cascade embedding model that first embeds concepts into a continuous vector space capturing concept semantics, and then further embeds them into a hidden category space where the category information becomes explicit. We then leverage concept semantics to categorize documents into a fine-grained multi-level taxonomy, by effectively representing both documents and taxonomy nodes in a common concept space, and efficiently aggregating concept-level similarities into document–taxonomy similarities.
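The aggregation step can be sketched as follows, under purely illustrative assumptions: toy 2-D concept vectors stand in for learned embeddings, each taxonomy node is a small set of concept vectors, and a document scores against a node by averaging each document concept's best cosine match in that node. None of these choices come from the actual model.

```python
import math

def cos(u, v):
    # Cosine similarity between two 2-D vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Toy concept embeddings (hypothetical, for illustration only).
concept_vec = {
    "support vector machine": (0.9, 0.1),
    "generative adversarial networks": (0.8, 0.3),
    "protein folding": (0.1, 0.9),
}

# Each taxonomy node represented by concept vectors in the same space.
taxonomy = {
    "machine learning": [(1.0, 0.0)],
    "biology": [(0.0, 1.0)],
}

def doc_node_similarity(doc_concepts, node_vecs):
    # Aggregate concept-level similarity: for each document concept,
    # take its best match among the node's vectors, then average.
    return sum(max(cos(concept_vec[c], v) for v in node_vecs)
               for c in doc_concepts) / len(doc_concepts)

doc = ["support vector machine", "generative adversarial networks"]
scores = {node: doc_node_similarity(doc, vecs)
          for node, vecs in taxonomy.items()}
print(max(scores, key=scores.get))  # → machine learning
```

Because documents and taxonomy nodes live in one concept space, scoring a document against every node of a multi-level taxonomy reduces to these cheap vector comparisons, which is what makes the fine-grained categorization efficient.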
Finally, I will close the talk by briefly discussing some of my collaborative work on other analytical tasks, such as concept relation mining and web content mining, and by discussing how the mined concepts can benefit broader downstream natural language processing tasks.