Tapestry Behavioral Specification
Ben Y. Zhao
Version 1.0
August 28, 2002
Outline
- Introduction
- Purpose of this document
- Tapestry summary
- Naming
- Node Guid Assignment
- Object Guid Assignment
- Application Level API: Core Routing
- Route to defined ID: TapestryRouteMsg
- Proximity route to ID: TapestryPrefixRouteMsg
- Application Level API: Object Location
- Publish / advertising objects: TapestryPublishMsg
- Locate / route to objects: TapestryLocateMsg
- Object multicast behavior
- Internal Message Handling
- Object Location: Publish / Unpublish
- Routing: RouteMsg
- Dynamic Node Membership
- Node insertion
- Directed multicast
- Nearest neighbor
- Voluntary node deletion
- Involuntary node deletion
- Upcall Module: Multicast - Bayeux
- Multicast Session Announcement
- Node join protocol
- Node leave protocol
- Session Teardown Announcement
- Upcall Module: Anycast
- Anycast Group Announcement
- Anycast Join
- Anycast Leave
- Anycast Routing
- Upcall Module: Mobile Tapestry
1. Introduction
A. Purpose of this document
This document is written in order to clearly define the expected
behavior of in implementation of the Tapestry overlay network.
We try to clearly outline the expected behavior from the perspective
of the application interface, such that a given application written
for one given Tapestry implementation should require minimal modifications
in order to work with a separate implementation.
B. Tapestry summary
Tapestry is a structured Peer-to-Peer overlay network infrastructure
that offers location independent routing and object location with
deterministic performance in an efficient manner.
As a network infrastructure, applications can be written on top
of Tapestry, and interact with it through a well-defined interface
(most of which is defined in this document).
In a Tapestry overlay network, participating nodes are assigned
a pseudo-random node identifier. These are Globally Unique Identifiers,
or GUIDs for short, and are typically randomized bit strings in a
linear namespace from 0 to 2^160-1. These nodes then self-organize
into a Tapestry mesh, defined by distributed routing tables located
at each node that route
to a logarithmic number of other nodes in the mesh. Given these
routing tables, a Tapestry message can be routed from any node to
and destination node given the GUID at the destination node.
In the context of object location, objects can be any piece
of data or metadata that has some notion of location inside the network.
Each object also has a Globally Unique IDentifier in the same linear
namespace as the node GUIDs. For our purposes, object location is
the process in which a node that has access to some data or resource advertises
its availability through a publication process, and clients searching for
that resource or data routes a message through the Tapestry network layer
to the location of the data or resource.
2. Naming
A. Node GUID assignment
There are a number of ways to generate the desired pseudo-random
GUID for each node (machine) participating in the Tapestry network.
Any secure one-way hash algorithm (such as SHA-1, MD5) will provide
the pseudo-randomness necessary to avoid collisions. The data
to be hashed should provide for some basic deterrence against Sybil
attacks, where malicious nodes can assume arbitrarily large number of
identities by continuously generating new GUIDs. An intuitive choice
is the IP address of the machine, which guarantees uniqueness across
machines in the network. While IP address spoofing is not trivial,
it can be done with some effort.
A more secure solution is to require a public/private key pair,
where the GUID is generated from a hash of the public key. This
allows anyone to verify the identity of the node in a cryptographically
secure verification process. Generating public/private key pairs
is not difficult, however. A more secure scheme would involve the use
of one or more centralized certificate authorities which would grant
naming certificates once it has verified the requestor. Verification
might require some non-trivial computation task to deter continuous requests
for new GUIDs. Furthermore, the resulting certificate might map
the GUID to a public key and IP address combination, both of which can
be verified on demand.
B. Object GUID assignment
There is also several ways of choosing the GUID of a given piece
of data or resource. Again, the way of assigning a GUID has implications
on the authentication of data or resources. One approach is to
apply the one way hash function to the entire contents of a piece of data
to get the GUID. While this allows for relatively simple verification,
it restricts the ability of objects to change their contents over time.
While acceptable for static resources or immutable data, this naming scheme
may be too limiting for applications dealing with more dynamic data, such
as writable file systems. An alternate scheme is to assign a static
and unique descriptive metadata field to the data, and assign the GUID as
a hash of the metadata.
3. Core Routing
We begin the discussion of Tapestry functionality by looking at
node-to-node routing. This functionality depends solely on the distributed
routing tables available on each Tapestry node to forward messages from
any node to a destination node with the desired GUID. The routing algorithm
uses a modified form of prefix routing, and does not involve object
location lookups anywhere along the route.
A. Route to defined ID: TapestryRouteMsg
This is the message used for communication between two nodes that
know each other by their GUID names. The assumption is that the sender
of the message knows the existence of the destination node.
Key fields:
- Message Type (int, Size = 1 Byte): Message or type code defining
this as a TapestryRouteMsg
- Hops (int, Size = 1): The # of overlay hops taken via the Tapestry
mesh. Set by the Router before delivery to the final destination
- TTL (int, Size = 1): The "Time-to-live" flag on the route message
that defines the number of maximum hops taken by this message en route to
the destination before the routing terminates and the payload delivered
to the local router node at TTL=0. The TTL is decremented by one
upon leaving every intermediate hop.
- QoS (int, Size = 1): A quality of service flag set by the application
to indicate its preference for basic (unreliable) message delivery, or
enhanced (better than best-effort) message delivery. Depending on
the implementation, better than best effort may include limited number of
message retransmissions at the Tapestry level.
- Upcall Bitmap and Termination Flags (2 Bytes = 3 + 13 bits). The
upcall bitmap is an index into the upcall function table. The first
3 bits are flags: UPCALLPERNODE, DELIVERONTERM, ERRORONTERM
- UPCALLPERNODE: If set, this message upcall is triggered on every
Tapestry hop. If not set, it will only trigger the upcall on special
termination conditions (TTL terminating at 0, or reaching final root node
in route).
- DELIVERONTERM: If set, this message delivers the payload to the
local node when a TTL expires
- ERRORONTERM: If set, this message sends back an error message to
the message source when a TTL expires
- Peer (ostore.util.SecureHash, Size = 20): The GUID of the destination
node to deliver to.
- Payloadsize (int, Size = 4): size of the following payload
- Message Payload (byte [ ], Size = Payloadsize): This is the
content of the application level payload, and may include data of any size,
as long as it is packaged inside TapestryRouteMsg as a byte array.
When a node's Router receives a TapestryRouteMsg, it serializes the
headers and concatenates the result to the application payload to generate
a single byte array. It then inserts the resulting full byte array
as the payload of an internal RouteMsg. If necessary, Router should
fragment the full byte array into multiple message fragments, each smaller
than or equal to the set MTU size for the outgoing link. For fragmented
messages, each resulting fragment becomes the payload of a separate RouteMsg
and gets the appropriate RouteMsg headers attached, including fragmentation
and reassembly information. If in the course of routing, this message
finds that no node GUID is an exact match of the destination GUID, the message
is dropped, silently.
From the application's perspective, the message routing is successful
if the full reassembled TapestryRouteMsg is received by a node with the
GUID that exactly matches the Peer field (if it is found within TTL hops,
otherwise delivered to some node that is TTL-hops away from the source
en route to the destination). The number of hops taken (as long as
it is <= TTL) is irrelevant, and should be set in TapestryRouteMsg.Hops
upon delivery. Internally, Tapestry does not need to use a preset
route to deliver the message, and can route around node/link failures where
appropriate.
B. Proximity route to ID: TapestryPrefixRouteMsg
The namespace used for nodes is necessarily large in order to accomodate
potentially very large networks while avoiding name collisions. For
reasonably-sized or smaller networks, however, this results in a very sparse
namespace of nodes. Therefore, simply routing a message to a random
GUID would likely not reach a physical node with an exact match in name.
This message routes messages to some existing node in the network
as a replacement or surrogate of the GUID we'd like to match.
- Message Type (int, Size = 1 Byte): Message or type code defining this
as a TapestryPrefixRouteMsg
- Hops (int, Size = 1): The # of overlay hops taken via the Tapestry
mesh. Set by the Router before delivery to the final destination
- TTL (int, Size = 1): The "Time-to-live" flag on the route message
that defines the number of maximum hops taken by this message en route to
the destination before the routing terminates and the payload delivered
to the local router node at TTL=0. The TTL is decremented by one
upon leaving every intermediate hop.
- QoS (int, Size = 1): A quality of service flag set by the application
to indicate its preference for basic (unreliable) message delivery, or
enhanced (better than best-effort) message delivery. Depending on
the implementation, better than best effort may include limited number of
message retransmissions at the Tapestry level.
- Upcall Bitmap and Termination Flags (2 Bytes = 3 + 13 bits). The
upcall bitmap is an index into the upcall function table. The first
3 bits are flags: UPCALLPERNODE, DELIVERONTERM, ERRORONTERM
- UPCALLPERNODE: If set, this message upcall is triggered on every
Tapestry hop. If not set, it will only trigger the upcall on special
termination conditions (TTL terminating at 0, or reaching final root node
in route).
- DELIVERONTERM: If set, this message delivers the payload to the
local node when a TTL expires
- ERRORONTERM: If set, this message sends back an error message to
the message source when a TTL expires
- Peer (ostore.util.SecureHash, Size = 20): The GUID of the destination
node to route towards.
- Payloadsize (int, Size = 4): size of the following payload
- Message Payload (byte [ ], Size = Payloadsize): This is the
content of the application level payload, and may include data of any size,
as long as it is packaged inside TapestryRouteMsg as a byte array.
When a node's Tapestry Router receives a TapestryPrefixMsg, it serializes
the headers and concatenates the result to the application payload to generate
a single byte array. It then inserts the resulting full byte array
as the payload of an internal RouteMsg. A flag in the RouteMsg
is turned on to turn on "fuzzy routing." Long messages are fragmented
in the same fashion as TapestryRouteMsgs. As this message is routed
between nodes, if an empty routing entry is encountered, meaning no exact
match exists for the destination GUID, the message uses surrogate routing
to map the destination GUID to some existing node in the network. Barring
unrecoverable failures or corruption, this message should always be delivered
to some node in the network.
4. Object Location
Object location and routing is an integral part of the Tapestry functionality
set. The basic routines allow a node to advertise or make known
its access to a local resource, and clients to route messages to that node
given the unique identifier (GUID) of the resource or data. This
object location mechanism is a general layer of indirection which can be
leveraged for a variety of applications, the most general of which is routing
to higher level names. These names are higher level in the sense that
they are not tied down to a particular network node. Any particular
node in the network can advertise an object name, and receive messages on
its behalf. Furthermore, there can be multiple instances of a single
name, providing for a flexible way to do receiver controlled multicast.
A. Publish / advertising objects: TapestryPublishMsg
Before any routing to object or name can occur, the nodes responsible
for the name or object must first make its availability known. This
is done in Tapestry by sending a TapestryPublishMsg into the Tapestry mesh.
This message routes to a deterministically chosen physical node in
the network, using surrogate routing where necessary. The invariant should
be that, no matter where in the network TapestryPublishMsgs originate from,
they all reach the same final "root" node if they are advertising the same
object GUID.
- Message Type (int, Size = 1 Byte): Message or type code defining
this as a TapestryPublishMsg
- Tag (ostore.tapestry.api.TapestryTag, Size = 12): This is essentially
an unique identifier to distinquish different types of objects being advertised
and searched for. The current Berkeley implementation extends the
TapestryTag class to create a new object type, which is then used to do
the differentiation. Only one use of the extension included an additional
field (Depth in ReplicaTag). Therefore, 12 bytes should be enough to represent
all possible extensions of that type, where 8 bytes can be derived from an
encoding of the class name, and the last 4 bytes can be used to distinquish
further inside a particular tag type. Note that for tag types which
do not further differentiate between themselves, the last 4 bytes should
be set to 0. Ideally, this field would be replaced by a simple unique
typecode that differentiates between different types of objects.
- QoS (int, Size = 1): A quality of service flag set by the application
to indicate its preference for basic (unreliable) message delivery, or
enhanced (better than best-effort) message delivery. Depending on
the implementation, better than best effort may include limited number
of message retransmissions at the Tapestry level.
- Upcall Bitmap and Termination Flags (2 Bytes = 3 + 13 bits). The
upcall bitmap is an index into the upcall function table. The first
3 bits are flags: UPCALLPERNODE, DELIVERONTERM, ERRORONTERM
- UPCALLPERNODE: If set, this message upcall is triggered on every
Tapestry hop. If not set, it will only trigger the upcall on special
termination conditions (reaching final root node in route).
- DELIVERONTERM: This flag has no effect in this message
- ERRORONTERM: This flag has no effect in this message
- Guid (ostore.util.SecureHash, Size = 20): The GUID name of the object
or data being advertised
The handling of a TapestryPublishMsg is relatively forward. Since
there is no payload other than the fields listed above, the Router translates
the fields into those of a ostore.tapestry.impl.PublishMsg and forwards it
on towards the root of the object GUID. At each hop, the local router
reads the fields of PublishMsg and stores a mapping between the object GUID
and the node GUID of the original source of the TapestryPublishMsg. The
TapestryTag is also kept alongside the node GUID. The message terminates
when it has reached a root node.
B. Locate / route to objects: TapestryLocateMsg
For a client to locate an object, it must send a TapestryLocateMsg to
the GUID of the desired object. Through Tapestry object location, the
infrastructure will embed this message in an internal message which is then
forwarded towards the object GUID's root node. Along the way, it will
query each router for the desired object / query, and find a location mapping
showing the node GUID of the object's location. The message is then
redirected to that node via Tapestry routing.
- Message Type (int, Size = 1 Byte): Message or type code defining
this as a TapestryLocateMsg
- Hops (int, Size = 1): The number of hops taken by this message from
the client to the eventual destination (node that advertised the data/resource).
This field is set by Router upon exit from Tapestry and just before
final delivery to the application layer at the destination.
- TTL (int, Size = 1): The "Time-to-live" flag on the locate message
that defines the number of maximum hops taken by this message en route
to the final location of the object before the routing terminates at TTL=0.
The TTL is decremented by one upon leaving every intermediate hop.
- Query (byte [ ], Size = 5): Byte string that defines the type of desired
metadata type in addition to the basic object GUID match.
- QueryState (byte [ ], Size = 30): State to be taken along with the
locate message. Can be used to embed previously seen replica locations
by abbreviating location GUIDs down to 2 or 4 bytes each.
- Upcall Bitmap and Termination Flags (2 Bytes = 3 + 13 bits). The
upcall bitmap is an index into the upcall function table. The first
3 bits are flags: UPCALLPERNODE, DELIVERONTERM, ERRORONTERM
- UPCALLPERNODE: If set, this message upcall is triggered on every
Tapestry hop. If not set, it will only trigger the upcall on special
termination conditions (finding a matching object Guid, TTL terminating at
0, or reaching final root node in route).
- DELIVERONTERM: If set, this message delivers the payload to the
local node when a TTL expire
- ERRORONTERM: If set, this message sends back an error message to
the message source when a TTL expires
- QoS (int, Size = 1): A quality of service flag set by the application
to indicate its preference for basic (unreliable) message delivery, or
enhanced (better than best-effort) message delivery. Depending on
the implementation, better than best effort may include limited number
of message retransmissions at the Tapestry level.
- ID (bitmap, Size = 3 bytes = 24 bits): A bit array identifier for
this particular message. Allows responses and failure messages to match
up against the original message at the original message source.
- Peer (ostore.util.SecureHash, Size = 20): The node GUID of the message
origin.
- Guid (ostore.util.SecureHash, Size = 20): The object GUID this message
is querying for.
When a TapestryLocateMsgs is received from the application layer, the Router
routes towards the root node of the object Guid, while at each hop searching
for a locally cached location pointer which matches the desired object Guid.
If the Query field is non-zero, replicas that match the object Guid
must also match the Query tag in order to satisfy the query. LocateMsgs
contain a QueryState field which is accessible to upcall functions.
C. Object multicast behavior
In the basic object location case, a TapestryLocateMsg is forwarded to the
first object replica which matches the desired object Guid, and also matches
its Query. In a core "upcall module" provided by Tapestry, LocateMsgs
invoke their upcalls when they find any objects with matching Guids or terminate
at the root of the object.
The default behavior is for QueryState to encode the number of unique replicas
(N) to route this message to. At each hop where a matching object Guid
is found, an upcall is triggered, which then checks in the list of matching
object replicas to see if the required number of replicas has been found.
If so, the message is replicated and routed to the N closest object
replicas.
5. Internal Message Handling
The argument types specified in Section 4 are those used by applications
to interface with Tapestry, either through a function call or through a message
passing scheme. In this Section, we describe the message types and
formats used for Tapestry internal messages. In many cases, these messages
encapsulate and generalize the specific functionality sets specified by the
API messages.
There are two types of internal messages: Publish/Unpublish Messages, and
Route to Node/Object Messages
A. Routing: RouteMsg
B. Object Location: Publish / Unpublish