240A/290N Winter 2014 HW2 (Web Traffic Analysis with Hadoop/MapReduce)
TA: Xin Jin (xin_jin@cs)
Due on Feb 27, 2014.
This exercise is to analyze an Apache log using Hadoop MapReduce.
Sample code and Hadoop usage on the TSCC cluster
- Information on job execution on the TSCC cluster is here.
- An old document on how to use Triton's Hadoop MapReduce is here.
Note that consistent node allocation is required when executing a MapReduce job:
- The processors per node (ppn) must be set to 1. For example: qsub -I -l nodes=2:ppn=1 -l walltime=00:10:00
- The "-n" option must be set consistently in the commands $MY_HADOOP_HOME/bin/pbs-configure.sh
and $MY_HADOOP_HOME/bin/pbs-cleanup.sh used in the script execution.
The Hadoop package, including binaries and Java/C++ libraries,
is accessible under /opt/hadoop when you use the computing nodes at TSCC.
- The Java word-counting example is available on Triton under /home/tyang/wc1.
To compile, log in to a computing node and type make under the wc1 directory.
To run, use the script "wordcount5.sh" under the wc1 directory (modify the email address in the script).
Note that jar and Hadoop are only accessible from the computing nodes,
so run "make" after using qsub. If you forget to exit from an interactive
"qsub -I" session, you may use "showq -u <username>" to find your job
and "qdel <jobid>" to terminate it.
- Hadoop MapReduce Java sample code for log analysis is available on Triton under /home/tyang/log.
Copy this directory to your own directory. The sample code is based on this article.
To compile, use "make" under the log directory.
To run, use the script "log.sh" under the log directory (modify the email address in the script).
- Sample code on how to parse the Apache log.
You can search for "apache log java parser" on the web to find more samples.
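As a starting point, a minimal parser for the Apache Common Log Format might look like the following sketch (the class name and the returned field layout are illustrative, not part of the provided sample code):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogParser {
    // Apache Common Log Format, e.g.:
    // 127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
    // Groups: 1 = client IP, 2 = timestamp, 3 = method, 4 = URL, 5 = status code
    private static final Pattern LOG_PATTERN = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+)[^\"]*\" (\\d{3}) \\S+");

    /** Returns {ip, day, url}, or null if the line does not match the format. */
    public static String[] parse(String line) {
        Matcher m = LOG_PATTERN.matcher(line);
        if (!m.find()) return null;
        // Keep only the day portion of the timestamp, e.g. "10/Oct/2000"
        String day = m.group(2).split(":")[0];
        return new String[] { m.group(1), day, m.group(4) };
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "
                    + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
        String[] f = parse(line);
        System.out.println(f[0] + " " + f[1] + " " + f[2]);
    }
}
```

In a MapReduce job, a mapper would call such a parse routine on each input line and emit keys built from the day, IP, or URL fields.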
What to do
You may use Java or C++.
For Java, modify the sample code or use your own code to analyze the sample Apache log dataset under
/home/tyang/log/apache?.splunk.com. This traffic dataset contains 3 Apache log files for page views
in a week. Replicate this dataset 10 times to increase its size and
observe the benefits of parallel processing.
- Report the daily traffic for this website (number of unique users and number of page views).
- Report the top 5 most frequent URLs for each day, along with their access frequencies.
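The two reports boil down to a group-by-day aggregation. The plain-Java sketch below shows the reduce-side logic on records of the form {day, ip, url}; the class and method names are hypothetical, and in the actual job this work would be split between mappers (key emission) and reducers (counting):

```java
import java.util.*;

public class DailyStats {
    /**
     * For each day, computes {unique users, page views}:
     * unique users = distinct IPs that day, page views = record count that day.
     * Each record is {day, ip, url}.
     */
    public static Map<String, long[]> traffic(List<String[]> records) {
        Map<String, Set<String>> users = new TreeMap<>();
        Map<String, Long> views = new TreeMap<>();
        for (String[] r : records) {
            users.computeIfAbsent(r[0], k -> new HashSet<>()).add(r[1]);
            views.merge(r[0], 1L, Long::sum);
        }
        Map<String, long[]> out = new TreeMap<>();
        for (String day : views.keySet())
            out.put(day, new long[] { users.get(day).size(), views.get(day) });
        return out;
    }

    /** The 5 most frequently accessed URLs on a given day, most frequent first. */
    public static List<String> top5(List<String[]> records, String day) {
        Map<String, Long> freq = new HashMap<>();
        for (String[] r : records)
            if (r[0].equals(day)) freq.merge(r[2], 1L, Long::sum);
        return freq.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(5)
            .map(Map.Entry::getKey)
            .collect(java.util.stream.Collectors.toList());
    }
}
```

A natural MapReduce decomposition is one job keyed on (day, ip) for unique users, one keyed on day for page views, and one keyed on (day, url) whose output is sorted by count for the top-5 report.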
Record the overall execution time of the above tasks as the number of allocated machine nodes
varies from 1 to 3. Only include the time to
execute the MapReduce tasks (e.g., use the Unix utility "time" as illustrated in the sample script).
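For the 10x dataset replication mentioned above, one option is a small local-filesystem helper like the sketch below before uploading the copies to HDFS (the class name and directory layout are assumptions; a shell loop over cp or "hadoop fs -put" works equally well):

```java
import java.io.IOException;
import java.nio.file.*;

public class Replicate {
    /**
     * Copies each regular file in srcDir into destDir `copies` times,
     * suffixing each copy with its index (e.g. access.log.0 ... access.log.9),
     * so the enlarged input yields more splits for Hadoop to process in parallel.
     * Returns the number of copies created.
     */
    public static int replicate(Path srcDir, Path destDir, int copies) throws IOException {
        Files.createDirectories(destDir);
        int made = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(srcDir)) {
            for (Path f : files) {
                if (!Files.isRegularFile(f)) continue;
                for (int i = 0; i < copies; i++) {
                    Files.copy(f, destDir.resolve(f.getFileName() + "." + i),
                               StandardCopyOption.REPLACE_EXISTING);
                    made++;
                }
            }
        }
        return made;
    }
}
```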
Alternatively, you may choose to investigate and implement the assignment in C/C++.
See the references below on how
to use Hadoop C++ Pipes. Some sample C++ code to extract the IP address from an Apache log is
here.
You can only allocate 1 core per node due to a TSCC constraint.
What to submit and demo
- Turn in the source code WITHOUT binary files and log data,
using the turnin program (turnin HW2@cs290n directory-name).
The execution script specifies a path that contains the data; the TA may change this path for testing.
The code directory should contain instructions on how to compile (if needed),
how to test, and how to collect performance numbers, plus a short text
report containing your group member name(s), program components, compilation steps,
test process, and the performance numbers.
- Demonstrate that you can run and produce the results of log analysis using MapReduce.
Additional References