BUIP017 (passed): Datastream Compression

solex

Moderator
Staff member
Aug 22, 2015
1,558
4,693
I've heard this line before: Fast Relay Network is so good, let's not improve anything! What I don't understand is how you can say "small blocks because bandwidth" at the same time. :)
Datastream compression also helps with the alleged 88% of traffic which Maxwell says is not block data; even a 20% reduction of that is comparable to Xthin's reduction of bandwidth overhead. The bonus is that DC helps with both burst and non-burst data.
 
Last edited:

Peter Tschipper

Active Member
Jan 8, 2016
254
357
@solex It reduces the 88% (the alleged). By how much? It depends on how many historical blocks and tx's are sent/received. I don't know what the totals are; it will be a good question to answer as we get to testing this and tracking the actual daily savings. Overall I would say we'll save about 15% to 20% with DC plus another 15% to 20% from Xthin, but I don't have hard data.
 

solex

Moderator
Staff member
Aug 22, 2015
1,558
4,693
@Peter Tschipper
Thanks, those are useful estimates.
I was reading the reddit comments about DC and you are doing so much to build goodwill towards BU.
 
  • Like
Reactions: Peter R

Inca

Moderator
Staff member
Aug 28, 2015
517
1,679
Once again this is great stuff. BU and Classic need to start integrating these fantastic optimisations and leave Core behind.
 

Peter Tschipper

Active Member
Jan 8, 2016
254
357
Hello all,

A reference client for Datastream Compression is available for anybody who wants to compile and try it out. You'll need to
connect to a client that supports compression; these can be found on Bitnodes by searching for protocol version 80002.


Reference Client:

You will need LZO 2.09 to compile this.
https://github.com/ptschip/bitcoin/tree/BUIP017_compress


Bitnodes:

https://bitnodes.21.co/nodes/?q=80002


The client uses LZO compression with three settings: 0 is off, 1 is fast, and 2 is maximum compression. Be sure to run the compression.py Python test script after you compile to make sure it's working correctly for you.
(./rpc-tests.py compression)
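For anyone curious what the three settings mean in practice, here is a minimal sketch. LZO bindings aren't in the Python standard library, so this uses zlib purely as a stand-in; the level mapping and the `compress_payload` helper are illustrative, not the actual client code (which uses LZO's fast and maximum modes).

```python
import zlib

# Hypothetical mapping of the -compression setting onto zlib levels,
# purely to illustrate the off/fast/maximum trade-off.  The real client
# uses LZO, not zlib, and this helper is not its actual code.
LEVELS = {0: None, 1: zlib.Z_BEST_SPEED, 2: zlib.Z_BEST_COMPRESSION}

def compress_payload(payload: bytes, level: int) -> bytes:
    """Return the payload untouched at level 0, else compressed."""
    setting = LEVELS[level]
    if setting is None:
        return payload
    return zlib.compress(payload, setting)

sample = b"\x01\x00" * 4096              # repetitive data compresses well
print(len(compress_payload(sample, 0)))  # 8192: level 0 leaves it alone
print(len(compress_payload(sample, 2)) < len(sample))  # True
```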


Additional Findings:

1) Most of the compression benefits come from nodes doing IBD from our node. It's a frequent occurrence and is the biggest bandwidth consumer by far; next is transactions and then inv messages. However, we don't compress inv's, as they are not compressible with LZO; there will be a follow-up BUIP that will hopefully deal with those in the near future. Also, a Core bug was fixed that was constantly downloading headers from unsync'd nodes, which also accounted for a significant amount of bandwidth savings.

2) Bloom filters could not be compressed to any extent using LZO compression. We'll need some kind of adaptive range encoder or arithmetic encoding, for which there is no suitable portable open-source implementation at this time. It would be an interesting project for someone to produce such a compressor in C/C++ or asm.
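To see why an LZ77-family compressor gets nothing out of a loaded bloom filter, compare near-random bytes against repetitive data. This sketch uses zlib as a stand-in for LZO and deterministic hash output as a stand-in for filter bits; none of it is the actual client code.

```python
import hashlib
import zlib

# Stand-in for a dense bloom filter: 2048 pseudo-random bytes.
bloom_like = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(64))
repetitive = b"deadbeef" * 256  # 2048 bytes of repeating data

# An LZ-style compressor finds no back-references in random-looking bits,
# so the "compressed" output is actually slightly larger than the input.
print(len(zlib.compress(bloom_like, 9)) >= len(bloom_like))  # True
print(len(zlib.compress(repetitive, 9)) < len(repetitive))   # True
```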


Updates to getnetworkinfo:

getnetworkinfo now reports the Xthin % compression over the last 24hrs
as well as datastream compression stats.

Thinblock stats were also updated to include outgoing xthins, since there is a significant
bandwidth savings from those, which we were not factoring in before.
*Potential compression refers to how much additional compression would be realized had the
other connected peers also been supporting compression.

Sample results from getnetworkinfo:

"thinblockstats": {
  "enabled": true,
  "summary": "326 thin blocks have saved 214.99MB of bandwidth",
  "summary": "Compression (last 24hrs) is: 94.8%"
},
"compressionstats": {
  "enabled": true,
  "cmp level": 2,
  "summary": "Compression has saved 8.95MB of bandwidth",
  "summary": "Compression is: 20.1%",
  "summary": "Potential Compression could save an additional 354.16MB of bandwidth"
},
 
  • Like
Reactions: Chronos

Christoph Bergmann

Also, a Core bug was fixed that was constantly downloading headers from unsync'd nodes, which also accounted for a significant amount of bandwidth savings.
Great! Can you give some numbers?

And am I right that this again only comes into effect if you are connected with other nodes that do this?

I also have a question, maybe it's stupid, but ... is it a compression like the html/css compression or like .zip / .tar? So, does it make it more difficult for ISPs to see that you are sending and accepting bitcoin-things? And if so, is it possible to put a virus in the compressed packet that auto-executes if you decompress it?
 

Dusty

Active Member
Mar 14, 2016
362
1,172
Another related question: could it be useful to store compressed data on the disk?
 

Peter Tschipper

Active Member
Jan 8, 2016
254
357
@Christoph Bergmann

Yes, you have to be connected to another node that supports compression and has it turned on. It's a config option: setting -compression=0 turns off compression/decompression entirely.

Compression rates are generally 20%, sometimes up to 27% for the larger full blocks.

And yes it is .zip type compression but done with lzo.

I don't think it will make it more difficult for ISPs to see anything, since the message headers are not compressed; only the data portion of the message gets compressed.

As for viruses: you could put a virus in anything, but it needs to get executed. The only way to do that in Bitcoin would be to generate a buffer overflow. LZO has been around for 20 years and there are no known overflow errors in it. There was an integer overflow found a couple of years back, which was patched, but it could never be used for anything other than causing an application to lock up.
Another related question: could it be useful to store compressed data on the disk?
I believe the UTXO set is already compressed using Snappy compression, but blocks I don't think so; in any case, you can already compress your blocks by creating a compressed disk or compressed folder using your OS.
 
  • Like
Reactions: sickpig

adamstgbit

Well-Known Member
Mar 13, 2016
1,206
2,650
Does this work with Thinblocks?
It seems to me compressing blocks and thinblocks would be mutually exclusive, and since thinblocks achieve an avg of 90% reduction... this compression idea seems like a moot point.

Also, I don't understand how you can actually get 20% compression on a block.
Aren't blocks simply full of <500-byte TXs, which are basically perfectly random data that cannot be compressed?

Say a thinblock is being sent from one peer to the other and they need to send 100 <500-byte TXs; you're telling me concatenating and compressing these TXs would yield ~7.83% compression? Really!?

I think 7-20% bandwidth savings is huge; if it can be married with thinblocks then this is very worthwhile.
 

Peter Tschipper

Active Member
Jan 8, 2016
254
357
@adamstgbit The short answer is yes, it does work with thinblocks, in some cases. But if an xthin is truly thin, meaning it only has 1 or 2 tx's added to it, then there is very little to compress, since most of the xthin is just tx hashes which can't be compressed very well. However, xthins are not always thin. In two cases: 1) when the node first starts up, xthins can be almost as large as a full block, and 2) sometimes even during normal times, when the system is getting a lot of large spam-type tx's (and over the weekend), we tend to get larger xthins that are only 30 to 50% compressed. In those cases DC helps to compress the xthin further.

On your second question, you are correct: tx's <500 bytes don't compress well or at all, and unfortunately most tx's are <500 bytes. So what we do in DC is concatenate tx's together, by first concatenating the inv/getdata requests and then concatenating all the tx's in our inv queue before compressing them. This yields good compression rates and also saves the 24-byte message header, 50-byte TCP ACK, and 20-byte IP header per added tx. That's almost 100 bytes of savings per tx just from ACKs and headers. All that combined, and after compression, I typically see between 20 and 25% compression total on average.

Why does concatenating work when a single <500-byte tx won't compress? That has to do with the lack of repeating data in a small, single tx. But combine them into one block of data and you start to have repeating patterns that the compressor can use. The data is serialized and in binary format, so it doesn't compress as well as a text file; but then serializing data is also a form of compression, so we are effectively compressing already-compact data and still getting 20% out of it when done right.
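A rough illustration of the concatenation effect, assuming zlib as a stand-in for LZO and synthetic 61-byte "transactions" (the `fake_tx` helper is hypothetical, not real serialized tx data): a single small record carries compressor overhead and no repetition, while a batch exposes the shared structure.

```python
import hashlib
import zlib

def fake_tx(i: int) -> bytes:
    """Synthetic 61-byte record mimicking a small tx: a shared version
    field and script template, plus unique hash-derived bytes."""
    version = b"\x01\x00\x00\x00"
    script = (b"\x76\xa9\x14"
              + hashlib.sha256(str(i).encode()).digest()[:20]
              + b"\x88\xac")
    txid_ish = hashlib.sha256(str(i).encode() + b"x").digest()
    return version + script + txid_ish

txs = [fake_tx(i) for i in range(100)]

single_ratio = len(zlib.compress(txs[0], 9)) / len(txs[0])
batch = b"".join(txs)
batch_ratio = len(zlib.compress(batch, 9)) / len(batch)

# One tiny record won't shrink, but the concatenated batch does better
# because the repeated version/script bytes become back-references.
print(batch_ratio < single_ratio)  # True
```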

EDIT: come to think of it, that doesn't include all of the savings. There are also additional savings for every inv that gets concatenated. Because Bitcoin uses TCP_NODELAY, every message goes out as soon as it's put in the buffer, which incurs the additional overhead of the 20-byte IP header and TCP ACK as well as the message header. So again we save 94 bytes for each inv/getdata request that gets bundled. Although that's not strictly file compression, we do save those bytes as well.
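The per-message arithmetic works out like this (the figures are the ones quoted above, a 24-byte message header, a 20-byte IP-level header, and a ~50-byte TCP ACK; the helper is purely illustrative):

```python
# Per-message overhead figures quoted in the post above (assumptions,
# not measured on a live node).
MSG_HEADER_BYTES = 24
IP_HEADER_BYTES = 20
TCP_ACK_BYTES = 50

def bundling_savings(n_messages: int) -> int:
    """Bytes of overhead avoided by sending n small messages as one:
    the fixed per-message cost is paid once instead of n times."""
    per_message = MSG_HEADER_BYTES + IP_HEADER_BYTES + TCP_ACK_BYTES
    return (n_messages - 1) * per_message

print(bundling_savings(2))    # 94 -- matches the ~94 bytes per bundled inv
print(bundling_savings(100))  # 9306 saved when 100 invs share one message
```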
 
Last edited:
  • Like
Reactions: lunar and solex

adamstgbit

Well-Known Member
Mar 13, 2016
1,206
2,650
@Peter Tschipper

I see thanks for making it clear.

I don't understand how Core still has so much support. The BU team is definitely proving itself. If all nodes were BU nodes using all the improvements you guys have made, bandwidth requirements would drop dramatically!
 
  • Like
Reactions: solex and Dusty

Dusty

Active Member
Mar 14, 2016
362
1,172
I don't understand how Core still has so much support. The BU team is definitely proving itself. If all nodes were BU nodes using all the improvements you guys have made, bandwidth requirements would drop dramatically!
From what I can see, the problem is PR: since BU does not have a $100M propaganda machine like BS, there is very little knowledge of the work being done here.
So it's up to us to advertise those features every time we post outside this forum :)
 

bitcartel

Member
Nov 19, 2015
95
93
Some test data, syncing a fresh install, connected only to BU nodes with data compression:

"blocks": 209175,
...
"summary": "Datastream Compression has saved 414.01MB of bandwidth",
"summary": "Datastream Compression : 26.4%",
 

solex

Moderator
Staff member
Aug 22, 2015
1,558
4,693
Up to block 209175 (sometime in late 2012).