| About Me |
I am currently a second-year M.S. student in Computer Science at the University of California, Santa Barbara. Before coming to UCSB, I received my B.S. degree in Computer Science and Technology from Beijing University of Posts and Telecommunications. I am interested in data mining and social networks.
* Programming Language Skills:
Python, Ruby, C, C++, Java, JavaScript
| Selected Projects |
* Web Spam Detection Using Support Vector Machines
Build classifiers to predict whether a webpage is spam. Features include link-based features such as in-degree and out-degree; content-based features such as the fraction of visible text; text-based features such as the TF-IDF score of each word; and neighbor-based features computed from neighbors' features. Use F-score to select features. Compare the performance of different kernel functions combined with different feature selections.
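The F-score selection step can be illustrated with a minimal sketch. It assumes "F-score" here means the Fisher-style score often paired with SVMs (between-class separation of a feature divided by its within-class variance); the project may have used a different definition. The function names and the toy data are hypothetical.

```python
from statistics import mean, variance

def f_score(pos, neg):
    """F-score of one feature, given its values on positive and negative samples.

    Assumed definition: squared distances of the class means from the overall
    mean, divided by the sum of the per-class sample variances.
    """
    all_mean = mean(pos + neg)
    between = (mean(pos) - all_mean) ** 2 + (mean(neg) - all_mean) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1 denominator)
    if within == 0:
        # Constant within both classes: useless if the means coincide,
        # perfectly separating otherwise.
        return 0.0 if between == 0 else float("inf")
    return between / within

def rank_features(rows, labels):
    """Return feature indices sorted by descending F-score.

    rows: equal-length numeric feature vectors; labels: 1 = spam, 0 = non-spam.
    Each class needs at least two samples for the sample variance.
    """
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        pos = [r[j] for r, y in zip(rows, labels) if y == 1]
        neg = [r[j] for r, y in zip(rows, labels) if y == 0]
        scores.append(f_score(pos, neg))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)

# Toy data (hypothetical): column 0 = out-degree, column 1 = visible-text fraction.
rows = [[10, 0.5], [12, 0.4], [1, 0.5], [2, 0.6]]
labels = [1, 1, 0, 0]
ranking = rank_features(rows, labels)
```

Keeping only the first few indices of `ranking` gives one feature subset; repeating the experiment with different cutoffs is one way to compare feature selections across kernels.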
* Managing Data With Errors
Detect two types of errors in a utility fee collection database: (1) duplicate addresses written in different ways; (2) violations of integrity constraints among contacts, companies, and addresses. Remove the detected errors without introducing new ones, and generate clean SQL query answers.
* Semi-automatic Extraction of Navigation State Machines of Web Applications
Based on user-provided information for input fields, automatically extract navigation paths and navigation constraints of web applications; build and visualize the applications' navigation state machines.
* Wikipedia Web-page Classification
Build single-label, multi-class, content-based classifiers for Wikipedia web pages; evaluate three factors that influence SVM classifier performance: feature term selection method, feature vector length, and feature vector score range.
The two feature term selection methods compared are: (1) a term is selected by its TF-IDF score combined with a term weight, where the weight depends on whether the term appears in the HTML title, headers, or content; (2) a term is selected by a combination of an inner-class term index and an inter-class term index.
The feature vector lengths tested are 40 (5 feature terms from each category), 80, 120, 160, and 200.
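The first selection method, TF-IDF combined with a location-based term weight, might look roughly like the sketch below. The weight values, the field names, and the choice to sum scores over a category's documents are all assumptions made for illustration; the project's actual settings are not stated.

```python
import math
from collections import Counter

# Hypothetical location weights; the project's actual values are not stated.
FIELD_WEIGHTS = {"title": 3.0, "header": 2.0, "content": 1.0}

def weighted_tf(doc):
    """doc maps a field name to its list of tokens; returns weighted term counts."""
    tf = Counter()
    for field, tokens in doc.items():
        for tok in tokens:
            tf[tok] += FIELD_WEIGHTS.get(field, 1.0)
    return tf

def top_terms(docs, k):
    """Return the k terms with the highest summed weighted TF-IDF over docs."""
    n = len(docs)
    tfs = [weighted_tf(d) for d in docs]
    df = Counter(term for tf in tfs for term in tf)  # document frequency
    scores = Counter()
    for tf in tfs:
        for term, w in tf.items():
            scores[term] += w * math.log(n / df[term])
    return [t for t, _ in scores.most_common(k)]

# Toy pages (hypothetical), already tokenized by HTML location.
docs = [
    {"title": ["python"], "content": ["python", "code", "the"]},
    {"title": ["soccer"], "content": ["the", "goal"]},
]
selected = top_terms(docs, 2)
```

In this toy run, "the" appears in every page and scores zero, while terms that also appear in a title are boosted. Calling `top_terms` once per category with `k=5` would mirror the "5 feature terms from each category" setup.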
* Wildcard Subgraph Feature Mining in Annotation Graphs
In a subgraph pattern, a vertex with a wildcard label matches a vertex with any label. The task is to extend the graph pattern mining algorithm gSpan to mine wildcard patterns; the number of wildcards is limited either by setting a minimum and maximum wildcard count, or by setting the ratio of wildcards to subgraph pattern size. The extended gSpan is used to find frequent subgraph patterns in annotation graphs extracted from online movie reviews.
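The wildcard matching rule and the two kinds of wildcard limits can be sketched as two small predicates. This is not the extended gSpan itself, only the checks it would need; the `"*"` marker, the function names, and the reading of "pattern size" as vertex count are assumptions.

```python
WILDCARD = "*"  # hypothetical marker for a wildcard vertex label

def labels_match(pattern_label, graph_label):
    """A wildcard pattern label matches any graph label; otherwise labels must be equal."""
    return pattern_label == WILDCARD or pattern_label == graph_label

def within_wildcard_limits(pattern_labels, min_wc=0, max_wc=None, max_ratio=None):
    """Check a pattern's wildcard count against count bounds and/or a ratio bound.

    Pattern size is taken to be the number of vertices, which is an assumption.
    """
    count = sum(1 for label in pattern_labels if label == WILDCARD)
    if count < min_wc:
        return False
    if max_wc is not None and count > max_wc:
        return False
    if max_ratio is not None and pattern_labels:
        if count / len(pattern_labels) > max_ratio:
            return False
    return True
```

One design point worth noting: a pattern with too few wildcards can still gain more as it grows, so a minimum bound is probably best checked when a pattern is reported rather than used to prune the search.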
* Web-Based Data Mining Service System
(Bachelor's Thesis at BUPT)
Design a web-based data mining service system framework; design and implement the system's Classification Analysis Module, including the C4.5 algorithm; implement the user interface and classification result visualization with Flex.
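C4.5 chooses split attributes by gain ratio, that is, information gain divided by split information. The sketch below shows that criterion for categorical attributes only; it is an illustration, not the thesis implementation, and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(attr_values, labels):
    """Information gain of a categorical attribute divided by its split information."""
    n = len(labels)
    groups = {}
    for a, y in zip(attr_values, labels):
        groups.setdefault(a, []).append(y)
    # Expected entropy of the labels after splitting on the attribute.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(attr_values)  # entropy of the partition itself
    return gain / split_info if split_info > 0 else 0.0
```

A tree builder would evaluate `gain_ratio` for every candidate attribute at a node and split on the one with the highest value; continuous attributes and pruning, which C4.5 also handles, are left out here.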
| Courses |
CMPSC 231: Topics in Combinatorial Algorithms
CMPSC 225: Information Theory
CMPSC 273: Data and Knowledge Base
CMPSC 290C: Program Analysis
CMPSC 290C: Formal Models for Web Software
CMPSC 290D: Advanced Data Mining
CMPSC 290I: Introduction to Pattern Recognition, Artificial Neural Networks and Machine Learning
CMPSC 290N: Information Retrieval, Web Search, and Mining
CMPSC 595F: Online Social Networks
CMPSC 595D: Information Network
CMPSC 595N: MAT Seminar Series
| TA Courses |
CMPSC 138: Automata and Formal Languages
CMPSC 170: Operating Systems
PHYS 1: Basic Physics