Token Server Protocol Requirements
10-jan-2000 / gmt

General Idea:
  a robust, distributed token service for use by Swarm filesystem clients
  clients communicate with servers to request and release tokens
  service survives multiple server crashes and arbitrary client crashes
  on a server crash, noncritical data (unassigned tokens) may be lost

Platform:
  Unix and Java, plus a C client interface
  UDP, IP multicast

Failure model:
  random messages may be dropped or reordered
  communication links may fail totally, and may later recover
  clients and servers crash "hard"
    they behave correctly, or else they crash (no Byzantine failures)
    they do not send any more messages after a malfunction
    they do not crash and recover without *knowing* they're restarting
  clients and servers may hang
    they may send otherwise valid messages after appearing to have crashed
  servers are basically reliable
    crashes and hangs are infrequent
    it's okay if recovery is relatively expensive
  clients may be more flaky
    crashes are more common than with servers, but still unusual

Clients do not intercommunicate:
  they can get all necessary state from servers

Only servers share state:
  list of servers, which implies delegation of tokens
  list of clients
  identity of leader, if one is used

Most actions do not involve shared state
  exchanges between clients and servers are much more frequent
  on server failure, token redistribution may outweigh global state changes

Why Paxos is insufficient
  too heavyweight for use with every single action
    ... so must add another protocol for common operations
  no provision for electing a single leader
    ... so must add more code to elect one and keep Paxos from thrashing

Why Paxos is overkill
  its basic premise is that participant list can be constantly changing
  it guards against Byzantine failures and malicious participants
  it is *very* robust and needs only a simple majority to do anything
    need not even have the same majority for a single full round

What We Really Need Is A Good 5c Leader Election
  many things get simpler if we put a leader in charge of global state
  any server can act as leader, a relatively minor additional responsibility
  must be sure to allow for & recover from all possible failure scenarios
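
Sketch (illustrative only, not part of the requirements):
  These notes do not pick an election algorithm. As a minimal sketch of how
  "any server can act as leader" might work, assume each server tracks when
  it last heard from every server on the shared server list and treats the
  lowest-numbered server still inside a timeout window as leader. The names
  ServerView, heard_from, live_servers, leader, and the timeout value are all
  hypothetical. Time is passed in explicitly so the rule is easy to check.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ServerView:
    """One server's local view of the shared server list (hypothetical)."""
    timeout: float  # seconds of silence before a peer is suspected dead
    last_heard: Dict[int, float] = field(default_factory=dict)

    def heard_from(self, server_id: int, now: float) -> None:
        # Record a heartbeat (or any other valid message) from a peer.
        self.last_heard[server_id] = now

    def live_servers(self, now: float) -> List[int]:
        # Peers heard from within the timeout window, in id order.
        return sorted(sid for sid, t in self.last_heard.items()
                      if now - t <= self.timeout)

    def leader(self, now: float) -> Optional[int]:
        # Assumed rule: the lowest-numbered live server leads.
        live = self.live_servers(now)
        return live[0] if live else None


if __name__ == "__main__":
    view = ServerView(timeout=3.0)
    for sid in (1, 2, 3):
        view.heard_from(sid, now=0.0)
    print(view.leader(now=1.0))   # server 1 is live and lowest

    # Server 1 goes silent; servers 2 and 3 keep sending heartbeats.
    view.heard_from(2, now=4.0)
    view.heard_from(3, now=4.0)
    print(view.leader(now=4.5))   # server 1 has timed out
```

  This rule alone does not meet the last requirement above. Under the failure
  model, a hung server may resume sending valid messages, and a failed link
  can leave two servers with different views, so a real protocol would still
  need some way for servers to agree on which view is current.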