19 May 2003

Progress Report: Fault Tolerant Matlab*P
- Nisheeth Shrivastava
- Rachit Chawla

-----------------------------------------------
The project consists of the following features:
-----------------------------------------------

* A shadow Matlab*P server running on a fault-free machine.
  - It will work as a mediator between the client and the MatlabXP server.

  -----------------------
  The main tasks will be:
  -----------------------
  - Open the client (Matlab) connection.
  - Start the MatlabXP servers.
  - Take periodic checkpoints.
  - Recover the servers in case of any fault.

* Checkpointing library integrated into the MatlabXP code.
  - Take a checkpoint on a "CKPT" message from the shadow server.
  - Recover if the shadow server detects any process failure.

-----------
Tasks Done:
-----------

-- Shadow Server Ready
----------------------

-------
Part 1:
-------
-- The Matlab client connects to the Shadow Server.
-- The Shadow Server starts MatlabXP as one of its child processes.
-- The Shadow Server connects to MatlabXP.
-- The Matlab client sends a command to the Shadow Server.
-- The Shadow Server sends the command to the MatlabXP server.
-- The Shadow Server gets the results and sends them to the Matlab client.

-------
Part 2:
-------
-- The Shadow Server checkpoints the MatlabXP server, (currently) when an external signal is sent.
-- If it detects any failure, it restarts the MatlabXP server from the last checkpoint.
-- It reopens all connections with the new MatlabXP server, to keep the recovery transparent to the Matlab client.
-- It resends the last command after the failure, gets the results and sends them to the Matlab client.

-- Checkpoint library
---------------------
We have implemented a checkpointing library for Linux systems. It has been
tested to successfully checkpoint and recover a normal process. We are
currently making it work for any MPI process.
-- Checkpointing can be done for an MPI process.
-- The recovery part is in progress.

-----------------
Tasks to be done:
-----------------
-- Recovering an MPI process from the saved checkpoint.
-- Checkpointing the MatlabXP server periodically without any external signal, on a "CKPT" message from the Shadow Server.
-- Resending the last command after a failure, getting the results and sending them to the Matlab client.
-- Further testing.
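
The recovery flow described in this report (forward a command, checkpoint
on request, restart from the last checkpoint on failure, resend the last
command) can be illustrated with a small in-process simulation. This is
only a sketch under stated assumptions, not the project's code: the names
SimulatedServer, ShadowServer, BackendFailure and the tuple command format
are hypothetical, the "server" is a Python object rather than a child
process, and a checkpoint is a copy of a dictionary rather than a process
image.

```python
import copy


class BackendFailure(Exception):
    """Raised when the simulated compute server dies while handling a command."""


class SimulatedServer:
    """Hypothetical stand-in for the compute server: holds named variables."""

    def __init__(self, state=None):
        # Start empty, or from a private copy of a saved checkpoint.
        self.state = copy.deepcopy(state) if state else {}
        self.fail_next = False  # test hook: make the next command crash

    def execute(self, command):
        # Assumed command format: ("set", name, value) or ("add", name, delta).
        if self.fail_next:
            self.fail_next = False
            raise BackendFailure("server process died")
        op, name, value = command
        if op == "set":
            self.state[name] = value
        elif op == "add":
            self.state[name] = self.state[name] + value
        else:
            raise ValueError(f"unknown op: {op}")
        return self.state[name]

    def snapshot(self):
        return copy.deepcopy(self.state)


class ShadowServer:
    """Mediator between client and server: forwards, checkpoints, recovers."""

    def __init__(self):
        self.server = SimulatedServer()
        self.checkpoint = self.server.snapshot()
        self.restarts = 0

    def take_checkpoint(self):
        # Stands in for sending a "CKPT" message to the server.
        self.checkpoint = self.server.snapshot()

    def handle(self, command):
        try:
            return self.server.execute(command)
        except BackendFailure:
            # Restart from the last checkpoint, then resend the failed command
            # so the client only ever sees a normal result.
            self.server = SimulatedServer(self.checkpoint)
            self.restarts += 1
            return self.server.execute(command)


shadow = ShadowServer()
shadow.handle(("set", "x", 10))
shadow.take_checkpoint()
shadow.server.fail_next = True           # simulate a crash on the next command
print(shadow.handle(("add", "x", 5)))    # 15
print(shadow.restarts)                   # 1
```

Note that in this sketch only the failed command is resent. Any commands
executed between the last checkpoint and the failure are lost on recovery,
so the result matches a fault-free run only when the checkpoint covers
everything before the failed command.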