CS263 AppScale Project Options
The goals of this project are to
- Give you hands-on experience with a large-scale distributed systems application
- Expose you to development based on an existing, large, open-source code base
- Open your mind to systems development in which multiple components, written in multiple languages, interoperate (efficiently and scalably, if possible)
- Give you a better understanding of virtualization and how to use it (build images, deploy full system instances from images)
- Get you thinking about the challenges and opportunities related to cloud computing
The first step of the project is to write a Google App Engine (GAE) app in both Python and Java, debug and test each with the SDK, and deploy it on Google's resources (free of charge). This will give you insight into the GAE process and what it does and does not support. You can find instructions on how to do this on Google's GAE pages. Writing the app in both languages will give you a feel for how the two differ.
Concurrently, you should begin investigating the background you need for your main project. Project options are listed below with descriptions and details. Also below are the specs and the list of important dates (when things are due, etc.). There are due dates for progress reports throughout the quarter. The dates are also on Chandra's shared Google Class Calendar. See the cs263 page for the link to add this calendar to your calendar set.
Project Specs and Due Dates
- Work on a project with one partner, i.e. in groups of two. No more, no fewer.
- If you need a machine (virtual or physical) or cluster to work on, and the GSL machines will not work for you, contact Chandra for access.
- Project topic and partner are due to the TA via email by noon, Oct 2nd. Unless you have spoken to Chandra in advance, pick a couple of different project options in case your first choice is already taken by another group.
- Project progress report #1 is due Oct 14th by noon to the TA via email. This consists of a 2-page write-up of what you have done, what issues, if any, you are having, and what your plans are to pursue and complete the project.
- Project progress report #2 is due Nov 9th by noon to the TA via email. This consists of a 2-page write-up of what you have done, what issues, if any, you are having, and what your plans are to complete the project.
- Book a time (15-20mins) and meet with the TA to demo your project during the week of Nov 30-Dec 4. You must complete your demo with the TA by Noon on Dec. 4th. Email the TA early as the best slots will fill up.
- The project is due Dec 11th by noon. You should turn in a 6-page (no more, no fewer) report on your accomplishments, findings, and contributions. This should take the form of a research paper (intro, background, details of the contributions, evaluation, conclusions). In addition, turn in your code. Your code must contain documentation and a README that describes in detail how to install and run your code. If your code is part of an AppScale extension, move the image to one of Chandra's systems if it's not there already, and let the TA know where it is. Your documentation should describe how to access and use the extensions.
Background and Reading Material
Choose accordingly:
Project Options
Below is the list of project options with some details/suggestions. We have left
many intentionally blank so that you have some freedom to pursue your interests and
to create a project that is unique.
- Add an additional database to AppScale. The steps to take with this project are:
- Install the database of your choice (see choices below) on the GSL machines (grad student lab) or other Linux machine you have available to you. Figure out and understand how to install it, initialize, and use it. Record your steps b/c as part of this project, you will automate the install, initialization, and access to the database. If it is distributed, identify how you set various parameters (number of replicas, number of slaves/peers, ports that the components listen on, whether they support ssl, etc.).
- Next, investigate the Datastore API on the GAE web site. You should understand what calls a client can make to the datastore.
- Next, write a client/server program that communicates over HTTP/HTTPS and a socket and implements the datastore API. The client should send requests to the server; the server should post to the database. You should next extend this to have the client send the server Google protocol buffers. The server should parse the buffer to determine the type of request, then act accordingly.
- Next, get an AppScale instance and run it in single-node mode. Go through the code (/root/appscale). You should investigate how the system works and how the databases interface with the system. Chandra will send additional details on this to those who choose the project.
- Finally, you should extend the AppScale code (AppDB, and the AppController front end that starts/initializes the DB) to support your database.
You may want to start with a simple one like PostgreSQL. You can use the AppScale thrift interface for MySQL almost as it is. If this database can be made distributed/scalable and/or if it has options for replication for availability (fault tolerance), you can use the database as your project.
Ideas for databases to consider include Oracle's free database and in-memory datastore, CouchDB, Tokyo Cabinet, Dynomite, MemcacheD, and Kai. You can use others that you find, provided AppScale doesn't already implement them (we currently implement HBase, Hypertable, MySQL, Cassandra, Voldemort, and MongoDB) and they are distributed (or highly available).
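The client/server step of the database project above could start from something like the following. This is a minimal sketch under stated assumptions, not AppScale's AppDB interface or the GAE Datastore API: JSON stands in for Google protocol buffers, an in-memory dict stands in for the database, and the names `InMemoryStore`, `dispatch`, `make_handler`, `make_server` and the `op`/`key`/`value` request fields are all hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class InMemoryStore:
    """Hypothetical stand-in for the database you installed."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = value

    def get(self, key):
        return self._rows.get(key)

    def delete(self, key):
        self._rows.pop(key, None)


def dispatch(store, request):
    """Look at the request type and act on the store accordingly."""
    op = request.get("op")
    if op == "put":
        store.put(request["key"], request["value"])
        return {"ok": True}
    if op == "get":
        return {"ok": True, "value": store.get(request["key"])}
    if op == "delete":
        store.delete(request["key"])
        return {"ok": True}
    return {"ok": False, "error": "unknown op: %r" % (op,)}


def make_handler(store):
    """Build an HTTP handler class bound to one store instance."""

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Assumption: the request body is JSON; a later version
            # would parse a protocol buffer here instead.
            length = int(self.headers.get("Content-Length", 0))
            request = json.loads(self.rfile.read(length))
            body = json.dumps(dispatch(store, request)).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return Handler


def make_server(port=8080):
    """Create (but do not start) a server; call serve_forever() to run it."""
    return HTTPServer(("127.0.0.1", port), make_handler(InMemoryStore()))
```

Keeping `dispatch` separate from the HTTP handler means the same request logic can later sit behind a raw socket or a protocol buffer parser without changes to the database code.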
- Java and Python GAE apps that exercise the AppScale system. Identify when the Google restrictions get in your way, and when/if AppScale also does so. Chandra will set up an AppScale cloud for you to test your Python apps on. Java/AppScale is not yet available. You can still implement this project for Java -- you will only use Google resources for this.
- Tool to move data in and out of an AppScale cloud and in and out of a GAE app running on Google resources. The tool would connect to an AppScale database, grab all the protocol buffer entities for a given app, and write them to a file, potentially in some other format of your choosing (XML, YAML, other). The same tool could then parse that file and convert it back to protocol buffers, reinserting them into the table (or even into another backend, e.g. going from HBase to Hypertable). When inserting data into a table, you will update the Apps table as well. For the Google side, you will investigate the Google pages for instructions and ideas on how to go about this. Contact Chandra for details.
Information about the Google protocol buffers can be found here.
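The export/import round trip for the data-movement tool could look roughly like this. It is a sketch only: plain dicts stand in for protocol buffer entities, XML is just one of the formats mentioned above, and the entity layout (`kind`, `key`, string-valued `properties`) and the function names are assumptions made for illustration. Reading from and writing to the actual AppScale tables is left out.

```python
import xml.etree.ElementTree as ET


def entities_to_xml(app_id, entities):
    """Serialize a list of entity dicts for one app into an XML string."""
    root = ET.Element("entities", {"app": app_id})
    for entity in entities:
        node = ET.SubElement(
            root, "entity", {"kind": entity["kind"], "key": entity["key"]}
        )
        # Sort property names so the output is stable across runs.
        for name, value in sorted(entity["properties"].items()):
            prop = ET.SubElement(node, "property", {"name": name})
            prop.text = value
    return ET.tostring(root, encoding="unicode")


def xml_to_entities(text):
    """Parse the XML back into (app_id, list of entity dicts)."""
    root = ET.fromstring(text)
    entities = []
    for node in root.findall("entity"):
        props = {p.get("name"): (p.text or "") for p in node.findall("property")}
        entities.append(
            {"kind": node.get("kind"), "key": node.get("key"), "properties": props}
        )
    return root.get("app"), entities
```

A real tool would also need to carry property types, since this sketch flattens every value to a string.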
- AppScale over VMware. First, download and install VMware on a system to which you have access (including the GSL). If this turns out to be a problem, email Chandra to request remote access to a machine. Learn what format VMware expects for the images it runs. Investigate how to convert a Xen image (single partition) or a KVM image (disk image with partitions) for use under VMware. Contact Chandra for additional details.
- Python profiling/optimization. Download an open-source runtime for Python (CPython or another open-source one that you find interesting). Investigate the internals -- whether it does interpretation (if so, which optimizations does it perform?) or code generation (compilation/jitting) or other. Empirically evaluate the performance of the system using different configurations (with optimizations turned off/on). Profile the Python code -- use extant profiling tools (contact Chandra for some) and write your own by extending the runtime you downloaded. Use the benchmarks from the Unladen Swallow project to do your empirical evaluations. You can also find benchmarks for different languages via the Shootout project. The project can be purely profiling -- so that you can get to know how the runtime works -- or profiling+optimization.
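As a starting point for the "extant profiling tools" part, CPython ships with the `cProfile` and `pstats` modules. The sketch below profiles one call and returns a text report; the helper name `profile_call` and the `fib` workload are made up for illustration and are not one of the benchmarks mentioned above.

```python
import cProfile
import io
import pstats


def fib(n):
    """Deliberately slow recursive workload to give the profiler something to see."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show the five entries with the highest cumulative time.
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()
```

Calling `profile_call(fib, 20)` and printing the second element of the result shows per-function call counts and times. This only observes Python-level function calls; anything finer-grained (per-bytecode costs, interpreter dispatch overhead) requires extending the runtime itself.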
- Ruby (or some other dynamic language) profiling/optimization. Use the instructions for Python above. You will need to find the runtime and benchmarks for evaluating. The project can be purely profiling -- so that you can get to know how the runtime works -- or profiling+optimization.
- Distributed debugging across different languages. Investigate what debugging tools are out there -- look into ways of combining them in a distributed environment. Alternatively, you can build your own by extending existing systems.
- Interface (component interoperation) analysis. Profile the AppScale system at the interface boundaries (contact Chandra for help with this). Collect data to determine whether you can estimate the performance of the components simply by gathering data at the interface boundaries. Start with the datastore/thrift interface.
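One simple way to gather data at an interface boundary is to wrap each call that crosses it and record call counts and elapsed time. The sketch below shows the idea in Python; `BoundaryProfiler` and the `datastore_put` stand-in are hypothetical names, and a real study would wrap the actual AppScale interface calls instead.

```python
import functools
import time
from collections import defaultdict


class BoundaryProfiler:
    """Collect call counts and total elapsed time per named interface call."""

    def __init__(self):
        self.calls = defaultdict(int)
        self.seconds = defaultdict(float)

    def wrap(self, name):
        """Decorator that records every call to the wrapped function under name."""

        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                try:
                    return func(*args, **kwargs)
                finally:
                    # Record the call even if it raised an exception.
                    self.calls[name] += 1
                    self.seconds[name] += time.perf_counter() - start

            return wrapper

        return decorator

    def report(self):
        """Return {name: (call count, mean seconds per call)}."""
        return {
            name: (count, self.seconds[name] / count)
            for name, count in self.calls.items()
        }


profiler = BoundaryProfiler()


@profiler.wrap("datastore.put")
def datastore_put(key, value):
    """Hypothetical stand-in for a call that crosses the datastore interface."""
    return True
```

After a workload runs, `profiler.report()` gives per-interface counts and mean latencies, which is the raw data needed to test whether boundary measurements alone predict component performance.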
- Ruby, C#, PHP, JavaScript AppScale frontends -- or other frontend support (authentication, web page support). The key here is for your front end to interface with the datastore of AppScale. Ask Chandra to share the Google Doc discussion by her team on this topic. Read about the Google Datastore API. Think about how a GAE app makes calls to this API (do so in the GAE app that you write to get started with this project). These are the calls your front end must make to the AppScale datastore (over HTTPS via Google protocol buffers).
- Initial investigation of a new computational model: streaming. See the papers under Streaming on this page to get a feel for what others are doing in this space. For this project, you will extend an existing system written in Python (it's OK if you also use this project to learn Python). The project is called XSPY. Chandra will give you access to the code/system/documentation on XSPY if you choose this project. XSPY is simple streaming support for Python programs. The stream elements will ultimately be Google protocol buffers; currently it is simple Python object communication. The projects are to extend this system to:
- Implement streaming connectivity.
This involves implementing PB and writing a good API to read other streams. This should almost exclusively involve working in Stream, and potentially StreamReader. Additionally, the interaction between Stream and StreamReader should be solidified, especially where the data goes and who buffers it. Optionally the actual network communication protocol between the Stream and the data source could be formalized.
- Implement an actual datastore behind the DataServer.
This only requires modifying read, write, push, pop, and shift of DataServer. Supporting multiple backends is likely a plus. Writing a standard database class to be used by DataServer and subclassing it for each backend would be a good plan.
- Make the network connectivity between the DataServer and DataClient actually robust.
Currently it is just enough to get by, but it needs to be multiplexed, threaded, and somewhat fault-tolerant. (Retransmitting lost packets would probably be a good first step.) Starting with some existing library is a good idea.
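The "standard database class, subclassed for each backend" plan for the DataServer could be sketched as below. The XSPY code is not shown here, so the signatures and semantics are assumptions: `read`/`write` are treated as keyed access, `push`/`pop` as operating on the tail of a named stream, and `shift` as removing from its head. The class names are hypothetical.

```python
from abc import ABC, abstractmethod
from collections import deque


class Backend(ABC):
    """Interface the DataServer would call; one subclass per datastore."""

    @abstractmethod
    def read(self, key): ...

    @abstractmethod
    def write(self, key, value): ...

    @abstractmethod
    def push(self, stream, item): ...

    @abstractmethod
    def pop(self, stream): ...

    @abstractmethod
    def shift(self, stream): ...


class MemoryBackend(Backend):
    """In-process backend, useful as a reference and for tests."""

    def __init__(self):
        self._values = {}
        self._streams = {}

    def read(self, key):
        return self._values.get(key)

    def write(self, key, value):
        self._values[key] = value

    def push(self, stream, item):
        # Append to the tail of the named stream, creating it on first use.
        self._streams.setdefault(stream, deque()).append(item)

    def pop(self, stream):
        # Assumed semantics: remove and return the newest item.
        return self._streams[stream].pop()

    def shift(self, stream):
        # Assumed semantics: remove and return the oldest item.
        return self._streams[stream].popleft()
```

With this shape, the DataServer holds a single `Backend` reference, and supporting another datastore means adding one more subclass.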
- A language-implementation project: Implement direct-threaded interpretation (DTI) for Java. There is an easy way and a hard way. The easy way is via bytecode rewriting; the hard way is by directly modifying the interpreter's intermediate representation of the bytecode to update/fix jump targets/exceptions. Either path is fine to pursue. Chandra can give you references to DTI and other forms of interpreter optimization. You can use the Java IcedTea/Zero interpreter for the HotSpot JVM or some other runtime/interpreter of your choosing. Once it is implemented, measure the performance gains you achieve for a set of benchmarks (Chandra will provide you with access).
Some links to papers on background (threading) for this topic can be found here.
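To make the idea concrete, the toy stack machine below contrasts a switch-style interpreter with one that first rewrites the program so each opcode is replaced by a direct reference to its handler. This is only an illustration in Python, not a JVM implementation: Python has no computed goto, so the second interpreter still has a central loop and is closer to call threading than to true direct threading. The instruction set and all names are invented for this sketch.

```python
# Toy instruction set: opcodes are small integers, operands follow inline.
PUSH, ADD, MUL, HALT = range(4)


def run_switch(code):
    """Baseline: decode every opcode with a chain of comparisons."""
    stack, pc = [], 0
    while True:
        op = code[pc]
        if op == PUSH:
            stack.append(code[pc + 1])
            pc += 2
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
            pc += 1
        elif op == MUL:
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
            pc += 1
        elif op == HALT:
            return stack.pop()
        else:
            raise ValueError("bad opcode %r at %d" % (op, pc))


def op_push(stack, code, pc):
    stack.append(code[pc + 1])
    return pc + 2


def op_add(stack, code, pc):
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)
    return pc + 1


def op_mul(stack, code, pc):
    b, a = stack.pop(), stack.pop()
    stack.append(a * b)
    return pc + 1


def op_halt(stack, code, pc):
    return -1  # a negative pc stops the loop


# opcode -> (handler, number of inline operands)
HANDLERS = {PUSH: (op_push, 1), ADD: (op_add, 0), MUL: (op_mul, 0), HALT: (op_halt, 0)}


def translate(code):
    """Rewrite step: replace each opcode with its handler, keep operands in place."""
    out, pc = [], 0
    while pc < len(code):
        handler, n_operands = HANDLERS[code[pc]]
        out.append(handler)
        out.extend(code[pc + 1 : pc + 1 + n_operands])
        pc += 1 + n_operands
    return out


def run_threaded(threaded):
    """Execute pre-translated code: no opcode decoding, just call the handler."""
    stack, pc = [], 0
    while pc >= 0:
        pc = threaded[pc](stack, threaded, pc)
    return stack.pop()
```

Because the translation keeps every instruction at its original index, jump targets would not need fixing in this toy; in a real bytecode rewriter, where instruction sizes can change, they would.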
- A language-implementation project: The same as above, but for any interpreted language and runtime, potentially one that already has some form of threading. Chandra can give you suggestions. Extend this system to implement one or more of the following optimizations: context threading, superinstruction formation, direct-threaded interpretation. Compare the results with and without your optimization.
Some links to papers on background (threading) for this topic can be found here.
- A language-implementation project: Find an open source language runtime that implements interpretation for some high level language (Java, Python, Ruby). Implement a set of profile collection strategies to collect and characterize the bytecodes executed. Examples include histogram of most popular, common dynamic subsequences, number of instance/static field accesses, number of instance/static method accesses, number of call targets per virtual call site, others. Chandra can help with ideas and suggestions on runtimes (Java, Python, you'll need to find ones for other languages).
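As a first look at what such profiles contain, CPython's standard `dis` module can list the bytecodes of a compiled function. Note the limitation: the sketch below counts instructions as they appear in the code object (a static histogram), not the instructions actually executed, which is what the project asks for and which needs hooks inside the interpreter loop. The function names and the `sample` workload are invented for illustration.

```python
import dis
from collections import Counter


def opcode_histogram(func):
    """Count how often each opcode name appears in func's bytecode (static)."""
    return Counter(ins.opname for ins in dis.get_instructions(func))


def opcode_pairs(func):
    """Count adjacent opcode pairs, a static stand-in for common subsequences."""
    names = [ins.opname for ins in dis.get_instructions(func)]
    return Counter(zip(names, names[1:]))


def sample(values):
    """Small workload whose bytecode we inspect."""
    total = 0
    for v in values:
        total += v * v
    return total
```

`opcode_histogram(sample).most_common(5)` lists the most frequent opcodes; the exact names and counts depend on the CPython version, since the bytecode set changes between releases.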
- Pick a topic from the related work listed here. Re-implement the approach and measure it for a set of modern benchmarks. Implement the idea for a modern language and evaluate whether it works. Extend the idea in a novel way.