Software, Tutorials, and Data

KGPT: Knowledge-Grounded Data-to-Text Pretraining

CODE DATA

AACL Tutorial on Self-Supervised Learning for NLP with Xin Wang

SLIDES VIDEO

ProQA: Progressively Pretrained Dense Corpus Index for Open-Domain Question Answering

CODE

SSCR: Iterative Language-Based Image Editing via Self-Supervised Counterfactual Reasoning

CODE

Logic2Text: High-Fidelity Natural Language Generation from Logical Forms

CODE DATA

HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data

WEBSITE

Counterfactual Vision-and-Language Navigation via Adversarial Path Sampler

MODEL

Logical Natural Language Generation from Open-Domain Tables

CODE DATA

Few-Shot NLG with Pre-Trained Language Model

CODE

TabFact: A Large-scale Dataset for Table-based Fact Verification

WEBSITE

REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments

DATA

Generative Adversarial Zero-Shot Relational Learning for Knowledge Graphs

CODE DATA

Fakeddit: A New Multimodal Benchmark Dataset for Fine-grained Fake News Detection

DATA

DOLORES: Deep Contextualized Knowledge Graph Embeddings

CODE

A Benchmark Dataset for Learning to Intervene in Online Hate Speech

DATA

VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research

WEBSITE

Knowledge-Aware Reader

PyTorch implementation of the ACL 2019 paper "Improving Question Answering over Incomplete KBs with Knowledge-Aware Reader". CODE

Self-Supervised Extractive Summarization (ACL 2019)

Code and Data for ACL 2019 "Self-Supervised Learning for Contextualized Extractive Summarization". CODE

Hierarchically Disentangled Self Attention

Code and Data for ACL 2019 "Semantically Conditioned Dialog Response Generation via Hierarchical Disentangled Self-Attention". CODE

Lifelong Relation Extraction

Code for our NAACL 2019 paper: Sentence Embedding Alignment for Lifelong Relation Extraction. CODE

Riemannian Normalizing Flow for Variational Wasserstein Autoencoder

PyTorch implementation for our NAACL 2019 paper "Riemannian Normalizing Flow on Variational Wasserstein Autoencoder for Text Modeling". CODE

Variational Vocabulary Reduction

Code for NAACL19 Paper "How Large a Vocabulary Does Text Classification Need? A Variational Approach to Vocabulary Selection". CODE

Extremely Fine-Grained Entity Typing

PyTorch implementation of our paper "Imposing Label-Relational Inductive Bias for Extremely Fine-Grained Entity Typing" (NAACL19). CODE

Deep Adversarial Learning for NLP

I gave a tutorial on Deep Adversarial Learning for NLP at the NAACL 2019 conference with Sameer Singh (UCI). Slides are available here.

XL-NBT: A Cross-lingual Neural Belief Tracking Framework

Arxiv preprint: PDF CODE

One-Shot Relational Learning for Knowledge Graphs

Arxiv preprint: PDF CODE

WikiHow: A Large Scale Text Summarization Dataset

Arxiv preprint: PDF DATA

CIPS Summer School Slides

PART 1: Recent Advances in Distant Supervision IE PDF
PART 2: Recent Advances in Knowledge Graph Embeddings PDF
PART 3: Recent Advances in Knowledge Graph Reasoning PDF

MOJITALK: Generating Emotional Responses at Scale

Xianda Zhou and William Yang Wang, "MOJITALK: Generating Emotional Responses at Scale", to appear in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), full paper, Melbourne, Australia, July 15-20, 2018, ACL. Preprint arxiv PDF BIB CODE and DATA

ACL 2018 Tutorial on Deep Reinforcement Learning for NLP

William Wang, Jiwei Li, and Xiaodong He. PDF

Scheduled Policy Optimization

Wenhan Xiong, Xiaoxiao Guo, Mo Yu, Shiyu Chang, Bowen Zhou, and William Yang Wang, "Scheduled Policy Optimization for Natural Language Communication with Intelligent Agents", to appear in Proceedings of the 27th International Joint Conference on Artificial Intelligence and the 23rd European Conference on Artificial Intelligence (IJCAI-ECAI 2018), full oral paper, Stockholm, Sweden, July 13-19, 2018, IJCAI. Preprint arxiv PDF BIB CODE

Deep Reinforcement Learning for Chinese Zero Pronoun Resolution

Qingyu Yin, Yu Zhang, Wei-Nan Zhang, Ting Liu, and William Yang Wang, "Deep Reinforcement Learning for Chinese Zero Pronoun Resolution", to appear in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), full paper, Melbourne, Australia, July 15-20, 2018, ACL. Preprint arxiv PDF BIB CODE

NAACL 2018 Tutorial on Knowledge Construction and Reasoning

Part 1: Xiang Ren (USC)
Part 2: Nanyun Peng (USC)
Part 3: William Wang (UCSB) PDF

Simple Models for Word Formation in Slang

Vivek Kulkarni and William Yang Wang, "Simple Models for Word Formation in Slang", to appear in Proceedings of The 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2018), long paper, New Orleans, LA, USA, June 1 - June 6, 2018, ACL. PDF BIB CODE

KBGAN: Adversarial Learning for Knowledge Graph Embeddings

Liwei Cai and William Yang Wang, "KBGAN: Adversarial Learning for Knowledge Graph Embeddings", to appear in Proceedings of The 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2018), long oral paper, New Orleans, LA, USA, June 1 - June 6, 2018, ACL. Preprint arxiv PDF. CODE

CHARADES-Caption Dataset for Video Captioning

Xin Wang, Wenhu Chen, Jiawei Wu, Yuan-Fang Wang, and William Yang Wang, "Video Captioning via Hierarchical Reinforcement Learning", preprint arxiv PDF DATA

DeepPath: Reinforcement Learning for Knowledge Graph Reasoning

See Wenhan Xiong's code and his prepared NELL-995 dataset from the paper "DeepPath: A Reinforcement Learning Method for Knowledge Graph Reasoning". PDF | CODE | NELL-995 DATASET

Learning to Generate Explanations

Ke Ni and William Yang Wang, "Learning to Explain Non-Standard English Words and Phrases", to appear in Proceedings of the 8th International Joint Conference on Natural Language Processing (IJCNLP 2017), short paper, Taipei, Taiwan, Nov. 27-Dec. 1, AFNLP. PDF BIB DATA

Deep Residual Learning for Weakly-Supervised Relation Extraction

See Darren Huang's code and his EMNLP 2017 paper. PDF BIB CODE

Liar: a benchmark dataset for fake news detection

William Yang Wang, "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection, to appear in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), short paper, Vancouver, BC, Canada, July 30-August 4, ACL. DATA PDF

How to Do Research?

I gave a short talk on how to do research with my undergraduate students.

NAACL 2016 Tutorial on Statistical Relational Learning for NLP

Part 1: Overview of logic, probability, MLNs, and probabilistic DDBs
Part 2: ProPPR and applications
Part 3: TensorLog and other recent and current work

Annotated Annoying Behaviors from Twitter

William Yang Wang and Diyi Yang, "That's So Annoying!!!: A Lexical and Frame-Semantic Embedding Based Data Augmentation Approach to Automatic Categorization of Annoying Behaviors using #petpeeve Tweets", in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), short paper, Lisbon, Portugal, Sept. 17-21, ACL. PDF BIB DATA

Information Extraction Tutorial at Peking University

CIPS Summer School IE Course Homepage Slides: PPTX PDF July 25, 2015

Three Wikipedia Datasets for Joint IE and Reasoning

William Yang Wang and William W. Cohen, "Joint Information Extraction and Reasoning: A Scalable Statistical Relational Learning Approach", to appear in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and The 7th International Joint Conference of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015), long paper for oral presentation, Beijing, China, July 26-31, ACL. PDF BIB DATA

ProPPR: a scalable probabilistic first-order logic

William Yang Wang, Kathryn Mazaitis, Ni Lao, and William W. Cohen, "Efficient Inference and Learning in a Large Knowledge Base: Reasoning with Extracted Information using a Locally Groundable First-Order Probabilistic Logic", to appear in Machine Learning Journal (MLJ 2015), Springer. Preprint version: PDF BIB CODE

A large European family dataset for relational learning

William Yang Wang, Kathryn Mazaitis, and William W. Cohen, "A Soft Version of Predicate Invention Based on Structured Sparsity", in Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI 2015), full paper for oral presentation, Buenos Aires, Argentina, July 25-31, IJCAI. Preprint version: PDF BIB DATA

The meme descriptions dataset

William Yang Wang and Miaomiao Wen, "I Can Has Cheezburger? A Nonparanormal Approach to Combining Textual and Visual Information for Predicting and Generating Popular Meme Descriptions", to appear in the 2015 Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL HLT 2015), long paper, Denver, CO, USA, May 31-June 5, ACL. Preprint version: PDF BIB DATA

The earnings calls dataset

William Yang Wang and Zhenhao Hua, "A Semiparametric Gaussian Copula Regression Model for Predicting Financial Risks from Earnings Calls", in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), long paper, Baltimore, MD, June 22-27, ACL. Preprint version: PDF BIB DATA

The Yelp computational branding analytics (CBA) data

William Yang Wang, Ed Lin, and John Kominek, "This Text has the Scent of Starbucks: A Laplacian Structured Sparsity Model for Computational Branding Analytics", in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), full paper, Seattle, WA, USA, Oct. 18-21, ACL. PDF BIB DATA

The Columbia Summarization Corpus (CSC)

William Yang Wang, Kapil Thadani, and Kathleen R. McKeown, "Identifying Event Descriptions using Co-training with Online News Summaries", in Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, Nov. 8-13, ACL-AFNLP. PDF BIB


The Columbia Summarization Corpus (CSC) was retrieved from the output of the Newsblaster online news summarization system, which crawls the Web for news articles, clusters them by topic, and produces a multidocument summary for each cluster. We collected a total of 166,435 summaries containing 2.5 million sentences and covering 2,129 days in the 2003-2011 period. Additional references on the Columbia Newsblaster summarizer can be found on the Columbia NLP group's publication page. The CSC corpus can be used in, but is not limited to, the following areas:

* Event Mining
* Language generation
* Summarization
* Information retrieval
* Information extraction
* Sentiment analysis and opinion mining
* Question answering
* Text mining and natural language processing applications
* Language modeling for text processing
* Lexicon and ontology development
* Machine learning (supervised, semi-supervised, and unsupervised learning)

Click here to download the CSC corpus.