Hi folks. I don't usually watch this forum (nothing against it -- just not one of my habits), so if anyone is interested in helping implement this, they should contact me elsewhere and we can chat.
I do intend to implement this myself if nobody else does, but I don't have time to do it all myself right now.
I'd like the implementation of this feature to be as independent as possible of the many locks in bitcoind (especially cs_main, but also mempool.cs). To minimize lock contention, I think it will be a good idea to implement our own transaction cache outside the purview of any main or mempool locks.
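To make the idea concrete, here's a rough sketch of what such a standalone cache might look like. The class name and API are my own invention for illustration, not anything in bitcoind; the point is simply that it guards its own data with its own lock and never touches cs_main or mempool.cs:

```python
import threading

class TxCache:
    """Hypothetical standalone transaction cache (sketch only).

    Guarded by its own lock so blocktorrent lookups never have to
    acquire cs_main or mempool.cs. Names/API are assumptions, not
    actual bitcoind code.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._txs = {}  # txid (bytes) -> raw tx (bytes)

    def insert(self, txid, raw_tx):
        with self._lock:
            self._txs[txid] = raw_tx

    def get(self, txid):
        with self._lock:
            return self._txs.get(txid)  # None if unknown

cache = TxCache()
cache.insert(b"\x01" * 32, b"rawtx")
assert cache.get(b"\x01" * 32) == b"rawtx"
```

How this cache gets populated (and how it interacts with mempool eviction) is the hard part, of course; this only shows the locking boundary.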
I also looked into SCTP, but since we can get the same data from multiple peers, I think we should instead make redundant requests for the same missing data from several peers at once.
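By redundant requests I mean something like the following toy scheduler: each missing chunk gets assigned to more than one peer, round-robin, so a single slow or dropped response can't stall the download. (Illustrative only; names and the redundancy factor are made up.)

```python
from itertools import cycle

def assign_requests(missing_chunks, peers, redundancy=2):
    """Assign each missing chunk to `redundancy` distinct peers,
    round-robin, so one dropped response doesn't stall us."""
    assignments = {chunk: [] for chunk in missing_chunks}
    peer_cycle = cycle(peers)
    for chunk in missing_chunks:
        while len(assignments[chunk]) < redundancy:
            peer = next(peer_cycle)
            if peer not in assignments[chunk]:
                assignments[chunk].append(peer)
    return assignments

plan = assign_requests(["c0", "c1", "c2"], ["A", "B", "C"], redundancy=2)
assert all(len(v) == 2 for v in plan.values())
```

A real version would weight peers by measured latency and cancel outstanding requests once a chunk arrives.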
I also think there's a ton of cool stuff we could do with forward error correction so that it doesn't matter which packets get dropped, as long as enough get through.
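The simplest possible illustration of that FEC idea is a single XOR parity packet: with one parity packet over N equal-sized data packets, any one lost packet can be reconstructed from the survivors. A real implementation would use something stronger, like Reed-Solomon or a fountain code, but the toy version shows the principle:

```python
def xor_parity(packets):
    """XOR equal-sized packets together into one parity packet.
    (Toy FEC; real code would use Reed-Solomon or a fountain code.)"""
    parity = bytearray(len(packets[0]))
    for pkt in packets:
        for i, b in enumerate(pkt):
            parity[i] ^= b
    return bytes(parity)

def recover(received, parity):
    """Recover the single missing packet from the survivors + parity."""
    return xor_parity(list(received) + [parity])

data = [b"aaaa", b"bbbb", b"cccc"]
p = xor_parity(data)
# pretend data[1] was dropped on the wire; rebuild it:
assert recover([data[0], data[2]], p) == b"bbbb"
```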
I think that each peer connection should be implemented as a TCP connection for handshake/control information and little else (in part to avoid some DoS scenarios, especially being complicit in an amplification attack), with a UDP port open for the actual data transfer. The TCP connection could be the standard bitcoind p2p TCP connection, but I think it would be preferable to keep it completely separate.
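To show what the TCP control side might carry, here's a hypothetical handshake message: the field names, sizes, and magic bytes below are pure assumptions on my part, not a spec. The key point is that the UDP data port is advertised over the authenticated-by-connection TCP channel rather than discovered blindly:

```python
import struct

# Hypothetical handshake layout (all fields are assumptions):
#   magic (4 bytes) | version (2) | flags (2) | udp_port (2)
HANDSHAKE = struct.Struct("!4sHHH")  # network byte order
MAGIC = b"BTRT"

def encode_handshake(version, flags, udp_port):
    return HANDSHAKE.pack(MAGIC, version, flags, udp_port)

def decode_handshake(payload):
    magic, version, flags, udp_port = HANDSHAKE.unpack(payload)
    if magic != MAGIC:
        raise ValueError("not a blocktorrent handshake")
    return version, flags, udp_port

msg = encode_handshake(1, 0b0001, 8335)
assert decode_handshake(msg) == (1, 1, 8335)
```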
I expect each peer to be connected to a larger number of peers than usual, around 50 to 200, through this protocol in order to achieve optimal performance. The main p2p protocol suffers performance hits if the number of peers is too high, due to the way block and transaction invs are broadcast to everyone and the way blocks are uploaded to peers in parallel instead of in series (http://rusty.ozlabs.org/?p=509). That problem won't occur with blocktorrent, and the invs come only once every 10 minutes instead of several times per second.
I also have some ideas for peer preference flags. A peer could identify itself as a miner (optimize for early delivery to me, please!), a supernode (I have a lot of bandwidth and will upload a lot!), or a low-bandwidth user (don't send me more than you have to, and don't prioritize my latency!).
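These preferences compose naturally as a bit-flag field; something like this (flag names are illustrative, not a proposed wire format):

```python
from enum import IntFlag

class PeerPrefs(IntFlag):
    """Hypothetical peer preference flags (names are illustrative)."""
    MINER = 1          # optimize for early delivery to me
    SUPERNODE = 2      # I have lots of bandwidth and will upload a lot
    LOW_BANDWIDTH = 4  # send me as little as possible

# a mining pool with good connectivity might set both of these:
prefs = PeerPrefs.MINER | PeerPrefs.SUPERNODE
assert PeerPrefs.MINER in prefs
assert PeerPrefs.LOW_BANDWIDTH not in prefs
```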
Further options could be for whether it is permitted to start sending data without it being requested first, and if so, how much (an optimistic transmission limit). As a well-connected miner, I might be willing to accept 50 kB of data from you without asking for it, just to reduce latency. (That should be enough to encode a typical 1 MB block, given the compression tools we have.) If I tell all of my peers that, I might end up getting 2.5 MB of data sent to me in total when a new block is published. That would be okay with me, because it gets me the block 300 ms faster. Others might only be willing to accept about 2 kB optimistically -- enough for the header and one intermediate level of the merkle tree. That saves on the number of round trips, since from that point you can start requesting the descendants of those nodes in parallel from different peers.
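A quick sanity check on those numbers (32-byte hashes, 80-byte header; the depth-6 level and the 50-peer count are just the illustrative figures from above):

```python
HASH_SIZE = 32     # bytes per tx/merkle hash
BLOCK_HEADER = 80  # bytes

def merkle_level_bytes(depth):
    """Bytes needed to send all 2**depth hashes at one merkle depth."""
    return (2 ** depth) * HASH_SIZE

# Header plus one intermediate level of 64 hashes is about 2 kB:
assert BLOCK_HEADER + merkle_level_bytes(6) == 2128  # ~2 kB

# 50 kB accepted optimistically from each of ~50 peers is ~2.5 MB total:
assert 50 * 50_000 == 2_500_000
```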
It will be very useful to have accurate information about what each peer has in their mempool. That way, we can guess which transactions they have and don't have, and volunteer some of the missing ones. Bloom filters could be useful for this task. It could also be worthwhile and affordable to keep a bitmap or boolean array for each peer indicating which transactions they have. That might be easier to program with and more reliable than Bloom filters. If you have 40 peers, and you can keep track of mempools with up to 200k transactions each, that would require 1 MB of RAM. Probably affordable, right? That bitmap could be populated by listening to each peer's INV messages for transactions. This could be complicated by eviction policies, though, and would involve more interaction with the main mempool and p2p code (since you don't get INVs for txs over the blocktorrent ports and connections as I currently envision it).
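The 1 MB figure works out like this: 200k transactions at one bit each is 25 kB per peer, times 40 peers. A minimal bitmap sketch (the class and its slot-numbering scheme are my own illustration; how mempool slots get assigned is out of scope here):

```python
class MempoolBitmap:
    """Hypothetical per-peer bitmap: one bit per mempool slot,
    set if we believe that peer has the corresponding tx."""

    def __init__(self, capacity):
        self._bits = bytearray((capacity + 7) // 8)

    def mark(self, slot):
        self._bits[slot // 8] |= 1 << (slot % 8)

    def has(self, slot):
        return bool(self._bits[slot // 8] & (1 << (slot % 8)))

# Memory check for the numbers above: 40 peers x 200k-tx bitmaps = 1 MB.
per_peer = len(MempoolBitmap(200_000)._bits)
assert per_peer == 25_000          # 25 kB per peer
assert 40 * per_peer == 1_000_000  # 1 MB total

bm = MempoolBitmap(200_000)
bm.mark(123_456)
assert bm.has(123_456) and not bm.has(123_457)
```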
If we know that a peer has a tx, for example, we can identify it pretty reliably by using just the last 4 or 5 bytes of its hash. We can also check their other mempool txs to see whether any share the same short hash. If the peer runs into a collision that we didn't expect (very rare by accident, though easy to force intentionally), they can detect it and request the full tx hash, or the full tx, either from us or from one of their other (maybe lower-latency) peers. This means that we can usually encode a 1 MB block with 3000 transactions using a bit more than 12 kB of traffic, if the peer already has the transactions. (Thanks to Felix Weis for this idea.)
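In code, the short-ID table with collision detection might look roughly like this (function name and 4-byte default are illustrative):

```python
def short_ids(txids, length=4):
    """Map each full txid to its last `length` bytes, flagging any
    short IDs shared by two different txs so the full hash can be
    requested instead. (Illustrative sketch.)"""
    seen = {}
    collisions = set()
    for txid in txids:
        sid = txid[-length:]
        if sid in seen and seen[sid] != txid:
            collisions.add(sid)
        seen[sid] = txid
    return seen, collisions

txids = [bytes([i]) * 32 for i in range(3)]
table, collisions = short_ids(txids)
assert not collisions

# 3000 txs at 4 bytes each is 12 kB, matching the estimate above:
assert 3000 * 4 == 12_000
```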