290N Spring 2013 HW1

290N Winter 2015 HW1

Due during the first project meeting.

You are asked to plan a web service that uses Lucene/Solr to answer user questions by searching an NSF abstract database. The data files are in /cs/sandbox/faculty/tyang/290N/NSFabstract/NSFabs (Part1.zip, Part2.zip, and Part3.zip). Details about the dataset and Lucene/Solr can be found in this readme file. If you plan to use the Lucene Java API and the Eclipse IDE, this readme file can be useful.

What to do

  1. Implement a simple web service that searches this dataset and displays results with the query words highlighted in the record descriptions.

    Also demonstrate that you can search for specific keywords in a selected field (e.g. title). To do that, you will need to index the data by field; you may use any simplification that saves you effort.

  2. Index all 3 parts of this dataset, and estimate the indexing time and disk storage required for a database of 1 million, 10 million, and 100 million records.
  3. Investigate whether you can place the index partitions for these 3 NSF data partitions on different machines and serve a query using these machines in parallel. Distribution of the search index is discussed here.
  4. Record the average response time for answering a search query from one machine and from multiple machines with a distributed index.
  5. Investigate the ranking results of 3 queries and study Lucene's ranking formula. Explain whether Lucene's ranking matches your expectation, and why. Propose some features that could improve the ranking if you were able to modify the Lucene search engine.
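
For the estimates in item 2, one possible starting point is a linear extrapolation from what you measure on the three parts. The sketch below is only an illustration under that assumption: the function name `extrapolate` and the sample numbers are hypothetical, not part of the assignment, and real indexing time may grow faster than linearly (for example because of segment merging), so compare the prediction against your measurements for 1, 2, and 3 parts before trusting it.

```python
def extrapolate(sample_records, sample_seconds, sample_bytes, target_records):
    """Scale measured indexing time and index size linearly to target_records.

    Assumes cost per record stays constant, which is only a first approximation.
    """
    scale = target_records / sample_records
    return sample_seconds * scale, sample_bytes * scale


if __name__ == "__main__":
    # Hypothetical placeholder measurements; replace with your own numbers.
    measured_records = 100_000
    measured_seconds = 60.0
    measured_bytes = 250 * 1024 ** 2

    for target in (1_000_000, 10_000_000, 100_000_000):
        seconds, size = extrapolate(
            measured_records, measured_seconds, measured_bytes, target
        )
        print(f"{target:>11,} records: ~{seconds / 60:.1f} min, ~{size / 1024 ** 3:.2f} GiB")
```

If the per-record cost changes noticeably between one part and all three parts, a fitted curve over those measurements is likely a better basis than a single scale factor.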
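
As a starting point for item 5, the sketch below reproduces the per-term factors of Lucene's classic TF-IDF similarity in plain Python. It is a simplification rather than Lucene's actual code: it assumes the classic TF-IDF similarity is in use (later Lucene releases default to BM25 instead), and it leaves out the coordination factor, query normalization, boosts, and the lossy one-byte encoding of the length norm. The function names are illustrative only.

```python
import math


def tf(freq):
    """Term-frequency factor: square root of the term's count in the field."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    """Inverse document frequency: 1 + ln(num_docs / (doc_freq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms):
    """Length normalization: shorter fields get a higher weight."""
    return 1.0 / math.sqrt(num_terms)


def term_score(freq, doc_freq, num_docs, field_len):
    """Contribution of one query term to a document's score.

    idf is squared because it enters once through the query weight
    and once through the document weight.
    """
    return tf(freq) * idf(doc_freq, num_docs) ** 2 * length_norm(field_len)


def score(query_terms, doc_term_freqs, doc_freqs, num_docs, field_len):
    """Sum the contributions of the query terms that occur in the document."""
    return sum(
        term_score(doc_term_freqs[t], doc_freqs[t], num_docs, field_len)
        for t in query_terms
        if doc_term_freqs.get(t, 0) > 0
    )
```

Comparing these hand-computed factors with the explanation output that Lucene or Solr produces for a query can help show which factor dominates the ranking of a given record.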

What to submit during demo

What to show during demo