240A Winter 2014 HW2

290N Winter 2014 HW2 (Web Traffic Analysis with Hadoop/MapReduce)

TA: Xin Jin (xin_jin@cs)

Due on Feb 27, 2014.

This exercise is to analyze an Apache log using Hadoop MapReduce.

Sample code and Hadoop usage on the TSCC cluster

What to do

You may use Java or C++. For Java, modify the sample code or write your own code to analyze the sample Apache log dataset under /home/tyang/log/apache?.splunk.com. This traffic dataset contains 3 Apache log files covering one week of page views. Replicate the dataset 10 times to enlarge it so that the benefits of parallel processing can be observed.
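The following is a minimal standalone sketch of the kind of map and reduce logic such an analysis might use, not the course sample code. It assumes each log line starts with the client IP address as its first whitespace-delimited field (as in the Apache Common Log Format) and counts requests per IP. The class name IpHitCount, its methods, and the sample lines are hypothetical, and no Hadoop classes are used so that the logic can be tried locally before it is moved into a real Mapper and Reducer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical standalone sketch; a real job would put this logic
// into Hadoop Mapper and Reducer classes.
class IpHitCount {

    // "Map" step: return the first whitespace-delimited field of the line,
    // assumed to be the client IP address; return null for blank lines.
    static String extractIp(String logLine) {
        String trimmed = logLine.trim();
        if (trimmed.isEmpty()) {
            return null;
        }
        return trimmed.split("\\s+", 2)[0];
    }

    // "Reduce" step: add 1 per occurrence of each IP across all lines.
    static Map<String, Integer> countHits(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String ip = extractIp(line);
            if (ip != null) {
                counts.merge(ip, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines for illustration only, not taken from the dataset.
        List<String> sample = Arrays.asList(
            "10.0.0.1 - - [20/Feb/2014:10:00:00 -0800] \"GET /index.html HTTP/1.1\" 200 512",
            "10.0.0.2 - - [20/Feb/2014:10:00:01 -0800] \"GET /about.html HTTP/1.1\" 200 128",
            "10.0.0.1 - - [20/Feb/2014:10:00:02 -0800] \"GET /index.html HTTP/1.1\" 304 0");
        System.out.println(countHits(sample));
    }
}
```

In a Hadoop job, extractIp would correspond to the key emitted by the mapper, and the summation in countHits to the work done by the reducer for each key.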

Record the overall execution time of the above tasks as the number of allocated machine nodes varies from 1 to 3. Include only the time spent executing the MapReduce tasks (e.g. use the Unix utility "time" as illustrated in the sample script).
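If you prefer to measure from inside a Java driver instead of using "time", one possible approach is to wrap only the call that runs the job. The sketch below assumes a hypothetical helper class JobTimer; the Runnable stands in for whatever call submits the MapReduce job and blocks until it finishes (for example, a call to Hadoop's Job.waitForCompletion(true)), and the sleep in main is only a placeholder workload.

```java
// Hypothetical timing helper; not part of the sample code or of Hadoop.
class JobTimer {

    // Runs the task once and returns the elapsed wall-clock time in milliseconds.
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) {
        // Placeholder workload; replace with the blocking call that runs the job.
        long elapsed = timeMillis(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println("elapsed ms: " + elapsed);
    }
}
```

Timing only the blocking job call keeps setup work such as copying input files out of the measurement, which matches the requirement to include only MapReduce execution time.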

You may choose to investigate and implement the analysis in C/C++. See the references below on how to use Hadoop Pipes, the C++ interface to Hadoop MapReduce. Some sample C++ code to extract the IP address from an Apache log is here.

You can allocate only 1 core per node due to a TSCC constraint.

What to submit and demo

Additional References