BUIP006: (closed) Blocktorrent – a torrent-style block data transport

freetrader

Moderator
Staff member
Dec 16, 2015
2,806
6,088
BUIP006: Blocktorrent – a torrent-style block data transport
Status: Draft (under discussion)
Proposer: freetrader

Revision: 0
Submitted on: 07/Jan/2016
Revision 1
Funding Proposal: 26/Dec/2016 by solex

Background

On the bitcoin-dev mailing list, Jonathan Toomim proposed the "blocktorrent" concept to speed up new block propagation:


A quick survey [1] on this forum showed strong interest for implementing such a mechanism in Bitcoin Unlimited due to potential improvement in block propagation speeds.

Especially when faced with the future scenario of larger block sizes and under certain adverse network conditions existing today (packet/connection loss), the current protocol may be significantly out-performed by a more fine-grained p2p algorithm.

The main problem with the current algorithm is that it requires a peer to have a full block before it can upload it to other peers. Breaking up blocks into smaller chunks and distributing these out of order in a Bittorrent-like fashion can potentially yield a great speed improvement.

Proposal

A torrent-style distribution of new block data (named 'blocktorrent) shall be implemented as an optional feature in addition to the existing p2p block distribution based on 'inv' and 'getdata' (henceforth referred to in this document as the 'standard p2p network protocol' - SPNP).

The proposed blocktorrent protocol shall be referred to in this document as 'blocktorrent p2p network protocol - BPNP.

BPNP shall be designed to augment the existing SPNP-based block propagation in cases where this is deemed beneficial from a propagation performance viewpoint (through configurable parameters describing block size and composition) or enforced by the node operator in lieu of SPNP.

It shall be possible to completely disable BPNP so that the client remains 100% compatible with implementations not supporting BPNP.

A peer shall be able to advertise its BPNP capabilities to other peers so that these can decide their optimal methods of exchanging data with it.

High level Design

As this document is a draft, what follows are provisional high-level design ideas for further discussion and elaboration.

The existing p2p network code shall be altered such that it if BPNP is enabled, a set of BPNP-capable peers is established and used based on settings to be decided.

[Note: @theZerg advised to create a transport plugin layer which would be able to accommodate various kinds of transports, including presumably the current SBNP, future BPNP, perhaps some more like SCTP-based transport etc. Such a plugin layer seems a wise architectural decision and I support it, but would prefer if it was address in a separate BUIP to cleanly distinguish the layering (and separate the implementation/testing). It would probably be good to serialize the BUIPs then, starting with the plugin transport layer one.]

There shall be a module (e.g. blocktorrent.cpp) which implements functionality needed for communicating with the BPNP-capable peers:
  • management of peer states and network connections (TCP/UDP sockets)
  • block assembly/disassembly routines for incoming/outgoing block data
  • interfaces to existing block validation and generation routines
  • a separate cache of mempool transactions with additional data fields used in the BPNP protocol (e.g. hashes)
  • gathering of metrics on peer connection and block propagation, which may be used to optimize the protocol's performance (e.g. dynamic chunk sizes) or fallback to SBNP where this would be better (e.g. very small blocks)

Control/data connections

Each logical BPNP peer connection shall consist of two separate network connections:

- a TCP connection for handshake/control information
- a UDP socket for actual block (chunk) data transfer

The control information and UDP message data formats remain to be specified in detail.

[Note: If advisable Google protobuf shall be used to define BPNP message formats and make the protocol more robust for future versioning.]


BPNP-capability advertisement

This remains to be specified in detail.

The current intention is to use version bits to advertise BPNP support.

Simultaneously, until version bits become part of the official reference client,
other forms of advertisement may be needed to bootstrap valid peer discovery.
This could include the user agent information or a query/response protocol after
successfully opening a control connection.


Peer sets, discovery and connection/disconnection

[NOTE: the description below contains a number of possible parameters and pools.
This may be vastly constrained an initial implementation to reduce complexity,
and subsequently refined. Please treat it as a set of ideas for now which are
subject to discussion!]


The set of BPNP peers is defined as the set of peers (identified by IP addresses) that the client knows or assumes are BPNP-capable. This set may explicitly be larger than the set of regular (SPNP) peers.

Optional: Depending on a user-configurable parameter, this set may be restricted to
a subset of the set of regular peers known to SPNP. Such a parameter would be useful to explicitly limit the BPNP to be maximally the same as the SBNP, for certain comparative performance tests between SBNP/BPNP).

Those members of the set of SBNP peers which advertise themselves as BPNP-capable shall form an initial set of eligible BPNP peers.

There shall be a parameter limiting the maximum number of eligible BPNP peers (as obtained through discovery) which are retained in memory.

If the configured maximum number of eligible BPNP peers is larger than the maximum number of SBNP peers, an yet-to-be-specified discovery protocol shall be run to find additional BPNP-capable peers. For now, in the remainder of this document it shall be assumed that such a discovery
protocol is available.

Additionally, there shall be a parameter limiting the maximum number of active BPNP peers, i.e. those with established connections and indicating that they are able to actively provide block data via BPNP.

When an active peer control connection is closed (either due to a controlled disconnect of the remote peer or due to a network timeout), the BPNP module shall consider the peer as temporarily disconnected and seek to re-open the connection.

[TODO: decide if a reconnection attempt should be made in case of orderly disconnect of the remote. Perhaps rather not...]

Optional: If re-connection fails after a user-configurable number of retries (which may be 0), a new peer shall be chosen from the set of inactive eligible peers.

If the connection to the new peer can be opened successful within a configurable number of retries, the failed peer is returned to the inactive eligible pool ("swapped out" for the newly activated peer) and its failure metrics are updated.

Optional: If a connection to an replacement peer cannot be established after a configurable number of retries, or if connection metrics for a peer indicate an excessive number or rate of unexpected disconnects, such a peer may be declared unreliable and deprioritised or removed completely from the eligible pool.

Optional: The BPNP protocol shall attempt to 'top up' the eligible pool when peers are removed from it. This could be controlled by some 'low water mark' parameter.


Data Provisioning Algorithm

Details to be worked out.

Intention is to take the algorithm outlined by @jtoomim in his mail to bitcoin-dev as a rough starting guideline.

The smallest sensible chunk granularity would presumably be single transactions, but in practice larger chunks would probably be more useful. Probably under parameter control or dynamic in response to performance metrics.


Transaction Cache

Details to be worked out.


Other options under consideration

This is mostly still a wish list based on input gathered so far.

In addition to advertising BPNP-capability in a simple way, peers could advertise preference flags which could be used to:
  • prioritize sending of data to high-bandwidth nodes and to avoid saturating low-bandwidth peers
  • enable opportunistic transmission (do not wait for a request but send a defined amount of data straightaway in the assumption that the receiver can handle it)

Forward error correction could be employed in environments where packet loss is known to be a significant problem - i.e. introduce redundancy in to the data so that as long as enough packets get through the block can be reassembled.

--

[1] https://bitco.in/forum/threads/towards-faster-block-propagation-jtoomims-blocktorrent-proposal.742

Funding Proposal
[added by solex]

A quote to complete this development has been made by @someone42.
Therefore, this BUIP is updated to include US$12,000 of funding:
  • someone42 has until 31st July 2017 to fully complete the work detailed,
  • If not completed at the above date the funding will be available to anyone else who applies.
  • If someone42 wishes to abandon the project then a message posted in this thread opens the task to anyone else to apply sooner.
  • someone42 can split the work with other parties but will be responsible for splitting the payment in that situation.
 
Last edited by a moderator:

theZerg

Moderator
Staff member
Aug 28, 2015
1,012
2,327
Since this BUIP is not complete, I'm going to open this BUIP up to public commentary right now and reserve the "Developers time" to occur when the BUIP is completed.

The "bittorrent" metaphor seems to only refer to the idea of breaking a block up into sections by transaction. This is pretty easy to do and is a necessary ingredient in ANY rework of the P2P proposal because its going to deliver the biggest performance gains, since most nodes will already have the transactions.

As I said in the other thread, my personal opinion right now is that if something like this landed in our laps then sure, I'd support it. But if I had to write it myself, I'd lean towards subchains and a lot of optimizations (many of which I discussed in the other thread) that are pretty obvious to any network protocol engineer.

I'd like to see a P2P protocol that can compete against the relay network for performance and can help reduce orphans (subchains).
 

freetrader

Moderator
Staff member
Dec 16, 2015
2,806
6,088
@theZerg:

I've taken note of your suggestions for a rework of the network layer and opened a call for input (before I would take that subject further to a separate BUIP):

https://bitco.in/forum/threads/call-for-input-abstracting-bitcoins-network-transport.756/

If the response to that CFI is encouraging and does not require a re-think of the entire protocol or re-work of the entire application design, I might attempt that in a separate BUIP before going on to blocktorrent. I definitely see the enabling potential this would have for other transport protocols.

As I've stated to @Peter R , I feel subchains are too big a fish for me to tackle, given that I'm only starting to familiarise myself with the code. But hopefully these things can go on in parallel.
 

freetrader

Moderator
Staff member
Dec 16, 2015
2,806
6,088
Update: am in touch with @jtoomim by email and we are collaborating on a BIP which he will publish (I guess via the usual Core channels) when ready.

I intend to update this BUIP to align it with the BIP content as that matures except in cases where we (BU) propose our own improvements or optimizations that don't break interoperability with Jonathan's BIP.

Please let me know if you see any problems with that. It will certainly not delay but speed up initial implementation of this BUIP.
 

solex

Moderator
Staff member
Aug 22, 2015
1,558
4,694
@freetrader do you have a further update?

I am going to open up a number of BUIPs for voting before the end of January.
Because BUIP006 is still being defined, provisionally I will leave this one out of the first vote, as there will certainly be another in February when it can be included.
 

freetrader

Moderator
Staff member
Dec 16, 2015
2,806
6,088
@solex I am currently overseas until about 5 Feb and unable to progress this much. I am trying to track the developments of J.Toomim et al though and will update a few days after I get back. I still have to look into whether blocktorrent can provide enough additional benefits alongside extreme thinblocks (which seem to offer an excellent improvment in itself), or whether it should be implemented as an alternative to it.

Addition: To clarify my preference, I would like to see choices in block propagation protocols available. I have expressed to @jtoomim during early BIP drafting that it would be nice to have a choice even of block torrent algorithms, since it may be easier at first to implement a less sophisticated algorithm which works and then refine it. This was considering the urgency of offering an initial improved block propagation method. I think to an extent extreme thinblocks have alleviated this urgency, but my view is the future must offer configurability in the network protocols at various levels.
 
Last edited:

freetrader

Moderator
Staff member
Dec 16, 2015
2,806
6,088
@jtoomim:

Are you going for an approach of blocktorrent communicating with the daemon via RPC?
Or are you prototyping / simulating first via Python?
 

jtoomim

Active Member
Jan 2, 2016
130
253
Prototyping in Python, communicating with bitcoind via the p2p protocol using halfnode.py to send/receive blocks and receive transactions to/from localhost:8333. No RPC (except to do getblocktemplate to get fake blocks for testing propagation, and for weak blocks once we get to them). The goal is to eventually do a rewrite in C++ that gets embedded in bitcoind, but I think first it makes more sense to test out the algorithm and hammer out the details of the protocol in Python.
 

freetrader

Moderator
Staff member
Dec 16, 2015
2,806
6,088
@jtoomim: Thanks for the update, I should check my mail more frequently :)
I'll study the code and see where best I can help out.

My search for literature on algorithms for exchanging Merkle trees did not yield much of direct relevance, but I entered the forest of gossip protocols and network simulators.

I noted from some discussion (don't remember which forum or chat, a maybe their IRC meeting minutes) that Core is planning what looked like a big re-arrangement of the networking code post v0.12.
 

jtoomim

Active Member
Jan 2, 2016
130
253
There are a lot of tasks that need to be done. Solving the "algorithm for exchanging Merkle trees" problem is not one of them. I think I've got that pretty well under control. Right now I'm working out a bug in the Merkle hash calculation algorithm -- my merkle root hashes aren't right for some reason -- but I'll probably get that worked out in a day or two.

One unsolved and probably self-contained issue is how to send transactions that can be up to 100 kB in size over UDP in a manner that is resistant to attacks. A single UDP datagram is limited to 65535 bytes, though it's better to keep them under around 1472 bytes (the MTU) to avoid fragmentation. It would be great if we could split up large transactions into chunks that could be downloaded from multiple peers the same way regular transactions are. If we don't do that, then a large-unpublished-transaction-attack is the worst-case scenario for Blocktorrent and could adversely affect the performance. But how do you make sure that each chunk belongs to the transaction as a whole? One idea I had for this was to split the tx into e.g. 1300 byte chunks, and use sha256 to hash each chunk, but instead of using the normal/default initialization vector (IV)/internal state for sha256, we save and load the initialization vector from the other chunks. That is, you hash chunk 0 as normal, but save the IV and transmit the IV separately. When hashing chunk 1, you start with the IV that you ended with at the end of chunk 0, etc. Before you send the transaction, you send the IVs and end hashes for each chunk. That should let you verify that each chunk is contiguous with the previous chunk once you've downloaded it, and will let you identify which chunk is corrupted in case an attacker decides to sabotage your node by sending you junk data.

The python hashlib apparently allows saving and loading of sha256s, but I'm not sure if it has a clear method to extract the internal state out so it can be sent over the network. We might have to implement our own sha256 function in order to get that functionality, or edit the source code to add that functionality. A pure-python sha256 function might be fast enough for our uses, since this is mostly just supposed to be a python proof of concept.

Or maybe there's another way we could do it.

There are a lot of other sub-projects that can be worked on. I'll try to add them to the github as issues when I get a chance.
 
Last edited:

freetrader

Moderator
Staff member
Dec 16, 2015
2,806
6,088
@jtoomim: w.r.t. the transaction transfer via UDP, have you looked at UDT and if so, what are your thoughts?

Presentation: http://udt.sourceforge.net/doc/udt-2009.ppt

Overview poster: http://udt.sourceforge.net/doc/udt-sc08-poster.pdf

IETF Draft RFC (expired): https://tools.ietf.org/html/draft-gg-udt-03

Background info : https://en.wikipedia.org/wiki/UDP-based_Data_Transfer_Protocol

It looks like there are Python bindings, so we could test something like this quite easily during prototyping, and it should be quite easy to bring it over into bitcoind.
 
Last edited:

jtoomim

Active Member
Jan 2, 2016
130
253
Chris Chua said this via email:

> I don't have a bitco.in account, so I'll put this here: I did some searching and found this: https://pypi.python.org/pypi/sha256

>I think it does exactly what you're looking for.

Yes, I think it does. Thanks.

As for UDT, it does look interesting, but not quite what we're looking for for blocktorrent. I don't want a reliable transport method (blocktorrent should usually request the same data from a different peer if there is a delay or packet loss), and I don't want streams (delaying data until it can be ordered properly is unnecessary and undesirable). The blocktorrent protocol already provides reliability management and order independence; we don't need that out of our transport layer.

Congestion control is desirable, though. That's going to be tricky, though, because blocktorrent is going to have bursty traffic starting up with dozens of peers simultaneously. I'll look at UDT's congestion control algorithm more closely to see if it might be usable for our case.
 

freetrader

Moderator
Staff member
Dec 16, 2015
2,806
6,088
I don't want a reliable transport method (blocktorrent should usually request the same data from a different peer if there is a delay or packet loss), and I don't want streams (delaying data until it can be ordered properly is unnecessary and undesirable)
Fair enough, I agree on those criticisms.

Congestion control seems like a feature that can be added after a first prototype.
 

freetrader

Moderator
Staff member
Dec 16, 2015
2,806
6,088
@jtoomim:

This looks interesting w.r.t. boosting UDP resistance to high packet loss:

http://www.crewes.org/ForOurSponsors/ResearchReports/2004/2004-07.pdf

I realize that they state "we are willing to sacrifice data quality" (due to the nature of their application) but perhaps their method can be adapted slightly to perform well for our purposes in most cases (*), considering we will eventually be connected to many blocktorrent peers which seems a little unlike their scenario.

I'll see if I can get a Python implementation going along those lines and do some tests.

If this works, the underlying UDP transport could get a nice seismically-related name - something to do with quakes perhaps :)

Then again, perhaps I am getting too excited and this type of protocol is not suitable at all. who knows.

(*) Regardless of this topic, perhaps we need to think about the "only 1-blocktorrent-peer-available" case as well - should we simply fall back to the standard protocol to talk to that peer?
 
Last edited:

jtoomim

Active Member
Jan 2, 2016
130
253
No, we cannot interpolate data. The entropy of our data is much higher than for seismic data. We actually have to download it all. We just don't have to download it from the same peer.
 

freetrader

Moderator
Staff member
Dec 16, 2015
2,806
6,088
Thanks for looking at it. I realized that after a bit more reading, and have been looking a bit at Erasure Codes [1] . They look spiffy. In contrast to a "dumb" algorithm, I see the following:

Benefits:
  • save tremendously on bandwidth
Downsides:
  • eating a lot more CPU
  • complex implementations
Not sure what the best tradeoff is here.

[1] e.g. "*hair" projects of https://github.com/catid
 

jtoomim

Active Member
Jan 2, 2016
130
253
> (*) Regardless of this topic, perhaps we need to think about the "only 1-blocktorrent-peer-available" case as well - should we simply fall back to the standard protocol to talk to that peer?

If there's only one blocktorrent peer available, then you should download the full block from that one peer. Even in that case, blocktorrent should be faster than the standard bitcoin p2p protocol, since blocktorrent (a) doesn't require retransmission of data in mempool, and (b) uses UDP with retransmission of lost data.

Since people have been repeatedly bringing up techniques for doing reliable transport over UDP on the socket level, perhaps I should describe the application-level mechanism I had in mind for this in a bit more detail. The oversimplified algorithm I had in mind is this: every e.g. 50 ms, you check for parts of the block that you still need, and you send a request to download some of them from some of your peers. You preferentially request chunks that you have not requested before, or have requested the fewest times, or requested the longest amount of time in the past. You preferentially request from peers that you haven't requested that particular chunk from, that have high bandwidth, and low latency. You continue to make requests for a chunk until you get it. Early on in the block transfer, there will be a lot of chunks that you have not yet requested, so all chunks will be requested either 0 or 1 times. Later on in the transfer, you will have requested pretty much everything once, so you will start to make duplicate and triplicate requests, and you will start emphasizing the low-latency peers. The end result is that you download more data than you have to (maybe 20%, but it depends on configuration) in order to avoid having a few lost packets delay completion.

Later on, we can add some forward error correction to make it easier to reconstruct the block with partial data plus some FEC data.
[doublepost=1456867568][/doublepost]Another part of this algorithm: every so often, you send a packet to your peers telling them the current state of your building block so that they know what they can request from you. This is what TreeState.serialize() and .deserialize() is for: a peer can simply serialize that object and send it to each of his peers every now and then (assuming it has changed). The serialization only requires two bits per populated node, and compresses groups of nodes that all have the same code, so a 4000 tx block would have under 8000 nodes and take about 2000 bytes (barely two IP packets) in the extremely unlikely case in which the peer has a copy of every other transaction. It should be pretty easy to extend that class with the ability to serialize a subtree, so that if you only made changes to e.g. transactions 0 through 31, you don't have to send the state for transactions 32 through 1675.
 

freetrader

Moderator
Staff member
Dec 16, 2015
2,806
6,088
Keeping it simple sounds like a good plan.

I'll work along those lines.
 

jtoomim

Active Member
Jan 2, 2016
130
253
However, the mechanisms I described above -- send your peers your inventory, wait for a request, fulfill the request -- should not be how all (or maybe even most) of the data actually gets sent. Each peer has a volunteered data limit, which is the amount of unrequested and unacknowledged data that peer is willing to accept. If Alice mines a block, she will check to see how far away she is from the volunteered data limit on each one of her peers (e.g. Bob), and then she will send that amount of data to each peer. Once that data arrives, Bob will judge its usefulness and timeliness, and inform Alice that she may send more volunteered data. Meanwhile, Bob will take the data he just received from Alice and forward it unrequested to most of Bob's peers, including Carol and Dave, as long as the volunteered data limits for each are high enough. This way, Alice can send chunk 1 to Bob, chunk 2 to Carol, and chunk 3 to Dave, and all three peers will have all 3 chunks within about 1 round trip time (two 0.5 RTTs, actually). The volunteered data mechanism will result in a lot of duplicate data, so nodes that value latency over traffic efficiency (e.g. miners and supernodes) will allow much higher volunteered data limits than others (e.g. home users).