The problem
The ocean contains many tons of gold.
But, the gold atoms are too diffuse to extract usefully. Idle cycles on
the Internet, like gold atoms in the ocean, seem too diffuse to extract
usefully. If we could harness effectively the vast quantities of idle cycles,
we could greatly accelerate our acquisition of scientific knowledge, successfully
undertake grand challenge computations, and reap the rewards in physics,
chemistry, bioinformatics, and medicine, among other fields of knowledge.
The opportunity
Several trends, when combined, point
to an opportunity:
-
The number of networked computing devices
is increasing: Computation is getting faster and cheaper: The number of
unused cycles per second is growing rapidly
-
Bandwidth is increasing and getting cheaper
-
Communication latency is not decreasing
-
Humans are getting neither faster nor
cheaper
These trends and other technological advances
lead to opportunities whose surface we have barely scratched. It now is
technically feasible to undertake "Internet computations" that are technically
infeasible
for a network of supercomputers in the same time frame. The maximum feasible
problem size for ``Internet computations" is growing more rapidly
than that for supercomputer networks. The SETI@home project discloses an
emerging global computational organism, bringing "life" to Sun Microsystem's
phrase "The network is the computer". The underlying concept holds the
promise of a huge computational capacity, in which users pay only for the
computational capacity actually used, increasing the utilization of existing
computers.
Project Goals
In the CX project, we are designing
an open, extensible Computation eXchange that
can be instantiated privately, within a single organization (e.g., a university,
distributed set of researchers, or corporation), or publicly as part of
a market in computation, including charitable computations (e.g., AIDS
or cancer research, SETI). Application-specific computation services constitute
one kind of extension, in which computational consumers directly
contact specialized computational
producers, which provide computational support for particular applications.
The system must enable application
programmers to design, implement, and deploy large computations, using
computers on the Internet. It must reduce human administrative costs, such
as costs associated with:
-
downloading and executing a program
on heterogeneous sets of machines and operating systems
-
distributing software component upgrades.
It should reduce application design costs
by:
-
giving the application programmer a simple
but general programming abstraction
-
freeing the application programmer from
the non-application concerns of interprocessor communication and fault
tolerance.
System performance must scale both up
and down, despite communication latency, to a set of computation producers
whose size varies widely even within the execution of a single computation.
It must serve several consumers concurrently, associating different consumers
with different priorities. It should support computations of widely varying
lifetimes, from a few minutes to several months. Producers must be secure
from the code they execute. Discriminating among consumers is supported,
both for security and privacy, and for prioritizing the allocation of resources,
such as compute producers.
After initial installation of system
software, no human intervention is required to upgrade those components.
The computational model must enable general task [de]composition
with a restrictive shared state that is appropriate to the medium. The
API must be simple but general. Communication and fault tolerance
must be transparent to the user. Producers' interests must be aligned with
their consumer's interests: computations are completed according to how
highly they are valued.
Some Fundamental Issues
It is a challenge to achieve the goals
of this system with respect to performance, correctness, ease of use, incentive
to participate, security, and privacy. Although in the current phase we
are not focusing on security and privacy, the Java security model {Gong}
and, in particular, the Java 2 RMI API for network security {Scheifler}
(covering authentication, confidentiality, and integrity) clearly are intended
to support such concerns. Our choice of the Java programming system reflects
these benefits implicitly.
Application programming complexity
is managed by presenting the programmer with a simple, compact, general
API, briefly presented in the next section. Administrative complexity is
managed by using the Java programming system: its virtual machine provides
a homogeneous platform on top of otherwise heterogeneous sets of machines
and operating systems. We use a small set of interrelated Jini clients
and services to further
simplify the administration of system
components, such as the distribution of software component upgrades. The
Production Network is a Jini service that interfaces with every other CX
Jini client and service.
Performance issues can be decomposed
into several sub-issues.
Heterogeneity of machines & OS
The goal is to overcome the administrative
complexity associated with multiple hardware platforms and operating systems,
incurring an acceptable loss of execution performance. The tradeoff is
between the efficiency of native machine code vs. the universality of virtual
machine code. For the applications targetted (not, e.g., real-time applications)
the benefits of Java JITs reduce the benefits of native machine code: Java
wins by reducing application programming complexity and administrative
complexity, whose costs are not declining as fast as execution times.
Communication latency
There is little reason to believe
that technological advances will significantly decrease communication latency.
Hiding latency, to the extent that it is possible, thus is central to our
design.
Scalability
The architecture must scale to a higher
degree than existing multiprocessor architectures, such as workstation
clusters.
Robustness
An architecture that scales to thousands
of computational producers must tolerate faults, particularly when participating
machines, in addition
to failing, can dynamically disassociate
from an ongoing computation.
Ease of use
The computation consumer distributes
code/data to a heterogeneous set of machines/OSs. This motivates using
a {\em virtual} machine, in particular,
the JVM. Computational producers must
download/install/upgrade system software (not just application code). Use
of a screensaver/daemon obviates the need for human administration beyond
the one-time installation of producer software. The screensaver/daemon
is a wrapper or a Jini client, which downloads a ``task server" service
proxy every time it starts, automatically distributing system software
upgrades.