290N Spring 2013 HW1

290N Winter 2015 HW1

Due during the first project meeting.

You are asked to plan a web service using Lucene/Solr to answer user questions by searching an NSF abstract database. Data files are in /cs/sandbox/faculty/tyang/290N/NSFabstract/NSFabs (Part1.zip, Part2.zip, and Part3.zip). Details about the dataset and Lucene/Solr can be found in this readme file. If you plan to use the Lucene Java API and the Eclipse IDE, this readme file can be useful.

What to do

Implement a simple web service that searches this dataset and displays results with the query words highlighted in the record descriptions.
Also demonstrate that you can search for specific keywords in a selected field (e.g. title). To do that, you will need to input the data by fields; you may use any simplification that saves you effort.
Index all 3 parts of this dataset, then estimate the indexing time and disk storage required if the database held 1 million, 10 million, and 100 million records.
Investigate whether you can place the index partitions for these 3 NSF data partitions on different machines and serve a query using these machines in parallel. Distribution of a search index is discussed here.
Record the average response time for answering a search query on one machine and on multiple machines with a distributed index.
Investigate the ranking results of 3 queries and study Lucene's ranking formula. Explain why Lucene's ranking does or does not match your expectation. Propose some features that could improve the ranking if you were able to modify the Lucene search engine.
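
As a starting point for studying the ranking formula, the sketch below scores documents in the style of Lucene's classic TF-IDF similarity: tf is the square root of the term frequency, idf is 1 + ln(numDocs / (docFreq + 1)), and the length norm is 1 / sqrt(number of terms in the field). It is a simplification under stated assumptions, not Lucene's implementation: it leaves out the coord, queryNorm, and boost factors, the toy documents are made up, and the Lucene version you use may default to a different similarity (newer releases use BM25).

```python
import math


def classic_score(query_terms, doc_tokens, doc_freq, num_docs):
    """Simplified TF-IDF score: sum over query terms of tf * idf^2 * length norm.

    Assumption: coord, queryNorm, and boosts from Lucene's classic formula
    are omitted to keep the sketch short.
    """
    length_norm = 1.0 / math.sqrt(len(doc_tokens))
    score = 0.0
    for term in query_terms:
        freq = doc_tokens.count(term)
        if freq == 0:
            continue  # term absent from this document contributes nothing
        tf = math.sqrt(freq)
        idf = 1.0 + math.log(num_docs / (doc_freq[term] + 1))
        score += tf * idf * idf * length_norm
    return score


# Toy corpus (hypothetical text, not taken from the NSF dataset).
docs = {
    "a": "parallel search engine for parallel computing".split(),
    "b": "search methods in biology".split(),
    "c": "protein folding study".split(),
}

# Document frequency: number of documents that contain each term.
doc_freq = {}
for tokens in docs.values():
    for term in set(tokens):
        doc_freq[term] = doc_freq.get(term, 0) + 1

query = ["parallel", "search"]
ranked = sorted(
    docs,
    key=lambda d: classic_score(query, docs[d], doc_freq, len(docs)),
    reverse=True,
)
print(ranked)
```

Document "a" ranks first because it contains both query terms, and the rarer term ("parallel") carries a larger idf. Comparing hand-computed scores like these with the explanation output of your own Lucene/Solr setup is one way to check whether a ranking matches your expectation.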

What to submit during demo

A report containing your group name(s) and the performance numbers for indexing and response time. Explain how you configured the machine resources and data partitions, whether the ranked results make sense for the 3 queries, and whether the performance numbers you obtained are reasonable. Describe the proposed features and how to integrate them into Lucene's ranking formula.
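
One possible way to produce the performance numbers is sketched below: a linear extrapolation from a measured indexing run to larger record counts, plus a helper that averages the wall-clock time of repeated queries. The linear-scaling assumption, the function names, and every number shown are placeholders for illustration; replace them with your own measurements and check whether linear scaling actually holds for your index.

```python
import time


def extrapolate(measured_records, measured_seconds, measured_bytes, target_records):
    """Linearly scale measured indexing time and index size to target_records.

    Assumption: cost grows in proportion to the number of records.
    """
    scale = target_records / measured_records
    return measured_seconds * scale, measured_bytes * scale


def average_response_time(search_fn, queries, repeats=5):
    """Return the mean wall-clock seconds per call of search_fn over all queries."""
    start = time.perf_counter()
    for _ in range(repeats):
        for query in queries:
            search_fn(query)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(queries))


# Placeholder measurements (NOT real results): substitute your own.
measured_records = 100_000
measured_seconds = 50.0
measured_bytes = 200 * 1024**2

for target in (1_000_000, 10_000_000, 100_000_000):
    seconds, size = extrapolate(
        measured_records, measured_seconds, measured_bytes, target
    )
    print(f"{target:>11,} records: ~{seconds / 60:.1f} min, ~{size / 1024**3:.2f} GiB")
```

Here `search_fn` stands for whatever function sends one query to your service, for example an HTTP request to your search endpoint. Averaging over several repeats reduces the effect of cold caches on the first query.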

What to show during demo

Show the indexing process for a sample data file and the processing of a query against the NSF dataset you have set up.
Show the ranked results for a few queries, explaining why they make sense.
Explain your performance numbers.
Explain your findings on distributed processing of a query using multiple machines.
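
The idea behind serving one query from several index partitions can be sketched as a scatter-gather step: send the query to every partition in parallel, collect each partition's top-k hits, and merge them into a global top-k. The code below is a toy illustration only. The in-memory dictionaries stand in for the three machines, the function names are hypothetical, and the score is a plain term count rather than Lucene's ranking.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_partition(partition, query, k):
    """Stand-in for one machine: return its top-k (score, doc_id) hits."""
    hits = []
    for doc_id, text in partition.items():
        score = text.lower().split().count(query.lower())
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)


def distributed_search(partitions, query, k=3):
    """Query all partitions in parallel, then merge into a global top-k."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        per_partition = pool.map(
            lambda p: search_partition(p, query, k), partitions
        )
        all_hits = [hit for hits in per_partition for hit in hits]
    return heapq.nlargest(k, all_hits)


# Three toy partitions (hypothetical text, not taken from the NSF dataset).
partitions = [
    {"p1-1": "parallel search on clusters", "p1-2": "marine biology survey"},
    {"p2-1": "search search search quality", "p2-2": "graph theory"},
    {"p3-1": "web search and search ranking"},
]

print(distributed_search(partitions, "search", k=2))
```

Each partition only needs to return k hits, because no document outside a partition's own top-k can appear in the global top-k. One point worth investigating in a real setup: with a TF-IDF style formula, each partition may compute document frequencies from its own data only, so scores from different partitions are not necessarily comparable.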