After battling performance problems in generating the spendvectors and balance files (I tried quite a few variations), I decided to get a monster machine to see how fast it would be if hardware were not a bottleneck at all:
BTC.RT405815 u.202 b.202/202 v.202/404 (1815+1815/1815 1st.202).s0 to 202 N[203] h.405815 r.405815 c.405815 s.405815 d.0 E.202 maxB.32 peers.56/32 Q.(0 0) L.405815 [202:1814] M.405814 000000000000000002c2e6d28b1bdb5224705d5e39055de261edaa86a7e16be5 bQ.1 0:23:27 stuck.0 max.47
E.202 -> 202 bundle files created for the first 404000 blocks
u.202 -> 202 utxo files (spend vectors) created
b.202 -> 202 balance files (accounts) created
BTC.RT405815 -> realtime blocks (current unfinished bundle) which matched blockchain.info
0:23:27 -> less than 24 minutes start to finish! At the time, that was faster than it took for block 405816 to be mined.
When was the last time anybody was able to do a full BTC sync faster than the next block was mined?
Bandwidth, CPU and HDD are all pretty close to saturated during the entire process, and achieving such a time required lots of RAM, an SSD and 1 Gbps bandwidth, but at least it shows there are no software bottlenecks.
To give an idea of what is happening under the hood to achieve such performance, I need to explain the various stages and files that are created. The naive direct creation of files ended up swapping to HDD if you didn't have enough RAM (64GB), which is almost all configurations, so even after optimizing via prefetches and minimizing the RAM footprint, it still took anywhere from half an hour to 4 hours to go from having all the bundle files to being ready for realtime processing.
The bundle files are explained in detail elsewhere and I will assume you are familiar with them. The tl;dr is that each bundle is the data from a set of 2000 blocks and is totally independent from all other bundles. The vin data and vout data are put into separate files, but all of these are permanent and never change after they are created.
The UTXO set is of course something that is always changing: at any time, any existing unspent output can be spent. This creates a horrible access pattern if you have a normal HDD, where each seek is 15 milliseconds. This is how the naive direct creation of the spendvector files ends up taking hours, i.e. millions of 15 millisecond delays. So even though it would be possible to independently generate each spendvectors file for each bundle (as soon as all the prior bundles are there), it is not practical...
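To put a rough number on it: at 15 ms per seek, one million random dereferences alone is 1,000,000 x 15 ms = 15,000 seconds, about four hours of pure seek time before any computation even happens.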
What I had to do was to change from an independent "vertically" parallel process to a dependent batch process.
First the problem:
Each bundle has all of its spends defined as (txid/vout), but a significant percentage of them are references to txids that are not in this bundle.
In order to search all the other bundles, portions of those bundles need to be accessed, but since a txid is a hash, the lookups span large areas of prior bundles in a random pattern. And to fully dereference a spend, you need not only the txid but also the amount, which is stored in the unspent data (yet another potential seek).
Even when processing just one bundle, this can easily overwhelm system resources, as 4KB paging loads an entire 4KB page even for a 64 byte retrieval. One million of these and 4GB of cache is replaced. Since the recent bundles have 5 million+ spends, RAM is consumed and thrashing to HDD ensues. Without an SSD this leads to hours of processing, and since we are getting all the bundle files created in 30 minutes, it is a very significant delay.
I tried to split things into two steps: resolve the txid first and then do the unspent access separately. That improved things a lot, but still not enough.
struct iguana_spendvector
{
    uint64_t value; uint32_t pkind,unspentind;
    int32_t fromheight;
    uint16_t hdrsi:15,tmpflag:1;
} __attribute__((packed));
A spendvector is 22 bytes and once it is fully resolved, it never changes. Given all the spendvectors, an entirely new set of balances can be calculated quite quickly, i.e. in a minute or two, since each spendvector has all the information needed included in itself.
I added a tmpflag to indicate whether it still needs to be resolved, and in the first pass I just set the hdrsi, unspentind and fromheight. The hdrsi is the bundle and the unspentind is relative to that bundle, so it will never overflow the 32 bits; a global unspentind would overflow in a few years! By storing indexes that are likely to be repeated, it lowers the entropy of the dataset, which allows for better compression. So whenever there is a choice of how to encode things, I used an encoding that is likely to have many occurrences. This is probably why .xz compression is working the best.
The fromheight field is not actually needed, but by having it, iguana will be able to quickly calculate the balance for any address at any height, so I felt it was well worth the extra "32 bits". In reality it costs much less, as all spends from a single block have the same value replicated across the dataset, so it should take 16 bits or less in compressed form.
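To make the two passes concrete, here is a minimal sketch of how the fields get filled in (function names and comments are mine for illustration, not iguana's actual API): the first pass stores hdrsi, unspentind and fromheight, the second pass completes value and pkind from the unspent data.

#include <stdint.h>
// uses struct iguana_spendvector as defined above

void spendvector_firstpass(struct iguana_spendvector *sv,int32_t hdrsi,uint32_t unspentind,int32_t fromheight)
{
    sv->hdrsi = hdrsi;            // bundle that contains the vout being spent
    sv->unspentind = unspentind;  // index relative to that bundle, so it never overflows 32 bits
    sv->fromheight = fromheight;  // height of the block containing the spend
    sv->tmpflag = 1;              // value and pkind still need to be resolved
}

void spendvector_secondpass(struct iguana_spendvector *sv,uint64_t value,uint32_t pkind)
{
    sv->value = value;            // amount taken from the unspent data
    sv->pkind = pkind;            // which address entry in that bundle the vout pays
    sv->tmpflag = 0;              // fully resolved, never changes again
}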
Doing this requires converting the txid/vout into hdrsi/unspentind, which is a parallel search across all the prior bundles. I take advantage of the fact that most spends are for relatively recent vouts, usually within 3 to 5 bundles back, but with millions of spends in each bundle, we still get the exponential tail of accesses into the far past chewing up precious cache space.
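A sketch of that backward search (the two per-bundle lookup helpers are assumptions for illustration, not iguana's real functions, and it assumes a txid's vouts are stored contiguously within its bundle):

#include <stdint.h>
extern uint32_t bundle_txid_lookup(int32_t hdrsi,const uint8_t txid[32]);   // hypothetical per-bundle txid hashtable lookup, 0 if not found
extern uint32_t bundle_first_unspentind(int32_t hdrsi,uint32_t txidind);    // hypothetical: first unspentind of that txid's vouts

// search bundles newest-first, since most spends reference recent vouts
int32_t resolve_spend(int32_t spend_hdrsi,const uint8_t txid[32],int32_t vout,int32_t *hdrsip,uint32_t *unspentindp)
{
    for (int32_t i = spend_hdrsi; i >= 0; i--)   // usually hits within 3 to 5 bundles back
    {
        uint32_t txidind = bundle_txid_lookup(i,txid);
        if ( txidind != 0 )
        {
            *hdrsip = i;
            *unspentindp = bundle_first_unspentind(i,txidind) + vout;
            return(0);
        }
    }
    return(-1); // txid not found in any prior bundle
}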
Even though it is possible to generate the above in parallel, the cost of a cache miss being 10,000x forces a loose serialization. What I do is allow each bundle to create its spendvectors only if the bundle H before it either does not exist (negative bundle index) or has already created its spendvectors, where H is the number of helper threads. This allows parallel processing within a window of H bundles, and given that there are 202 bundles and usually far fewer cores, it is not a bottleneck.
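Expressed as a predicate (a sketch; done[] here is a hypothetical array marking bundles whose spendvectors are finished):

#include <stdint.h>
// bundle hdrsi may start generating its spendvectors once the bundle H earlier
// is finished or does not exist, keeping at most H bundles in flight at a time
int32_t spendvectors_cangenerate(int32_t hdrsi,int32_t H,const uint8_t *done)
{
    return(hdrsi - H < 0 || done[hdrsi - H] != 0);
}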
Since each txid search goes backwards from the current bundle and the odds are good that the data is already in the cache, we get about 5 microseconds per search. The reason it is likely in the cache is that when spendvector processing starts, I prefetch the entire bundle to put it in the cache. If the data is not accessed, later prefetches and accesses will just overwrite the unused pages, so over time we end up with just the data that is needed. This allows the cache to be filled at 400 MB/sec vs 4 MB/sec, a 100x speedup.
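The prefetch itself can be as simple as touching one byte per page of the memory-mapped bundle, sequentially (a sketch of the idea, not iguana's actual code):

#include <stdint.h>
#include <stddef.h>
// touch one byte per 4KB page so the OS faults the mapping in at streaming
// speed instead of 4KB at a time during random lookups later
uint64_t bundle_prefetch(const uint8_t *mapped,size_t len)
{
    uint64_t nonz = 0;
    for (size_t off = 0; off < len; off += 4096)
        nonz += mapped[off];   // the sum is returned so the reads are not optimized away
    return(nonz);
}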
The net result is that all the spendvectors are generated in minutes instead of hours. But they are in temporary form and not ready for permanent saving to HDD. Serializing this part also boosts system performance, as we are not mixing HDD saves of the spendvectors into the calculations, i.e. fewer seeks. Always fewer seeks. Seeks are the enemy of performance.
The next step is to convert the temporary spendvectors into permanent form. To do this, we need to find the remaining fields: value and pkind. But isn't this the same problem as before? Not exactly; there is a subtle difference in that ALL the spends as of the last bundle are in RAM, so there is no seek penalty. More importantly, we can process horizontally vs. vertically. What I mean by that is that we can scan each bundle's in-memory spendvectors [s0, s1, s2, ... sn] and only process the ones that need data from a specific bundle. And this is the key. Even processing one bundle at a time will go pretty fast, but since each bundle processes only the spends of its own vouts, all the bundles can convert in parallel without any restrictions at all. Well, the only restriction is that the prefetched data for all the bundles being converted has to fit in memory, so at 1GB per bundle, 8GB of RAM could do 8 at a time.
Prefetch a bundle, scan all the other bundles' spendvectors and convert the ones referring to the bundle being processed.
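In code, the horizontal pass could look roughly like this (a sketch; unspent_stub is a stand-in for the prefetched bundle's real vout records, and the names are mine):

#include <stdint.h>
// uses struct iguana_spendvector as defined above
struct unspent_stub { uint64_t value; uint32_t pkind; };

// with bundle `target` prefetched, scan every bundle's in-memory spendvectors
// and finish only the entries that spend one of target's vouts
void convert_for_bundle(struct iguana_spendvector **vecs,int32_t *counts,int32_t numbundles,int32_t target,struct unspent_stub *unspents)
{
    for (int32_t i = 0; i < numbundles; i++)
        for (int32_t j = 0; j < counts[i]; j++)
        {
            struct iguana_spendvector *sv = &vecs[i][j];
            if ( sv->tmpflag != 0 && sv->hdrsi == target )
            {
                sv->value = unspents[sv->unspentind].value;   // amount of the vout being spent
                sv->pkind = unspents[sv->unspentind].pkind;   // address index of that vout
                sv->tmpflag = 0;                              // now permanent
            }
        }
}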
This does seem like a lot of work just to get a few fields set, but this is what it takes to achieve a half hour parallel sync, and it explains why nobody has done it before.
It is so fun to see what used to take hours happen in seconds. All this speedup allows it to complete in a few minutes on the lower caliber server, but on the monster server it finishes in less than a minute! Pretty good improvement over the naive approach.
Now that the spendvectors are fully resolved, save them all to disk. They will never change and can go into the permanent dataset section to be compressed 2:1.
Once all the permanent spendvectors are done, a single serial process scans them (they are likely still in RAM) and then all the accounts files are created. Since these account files are volatile, I take special care to make sure they are valid. I make a cumulative SHA256 calculation and save that to a file along with a CRC. This is a locally generated file, so if it is there (after the initial pass), the SHA256 calculation can be skipped and just the CRC is needed. That saves a minute on each startup and allows things to go to realtime mode in less than 30 seconds.
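A sketch of the validation idea, using OpenSSL's SHA256 and zlib's crc32 as stand-ins (which libraries iguana actually links against is an assumption here):

#include <stdint.h>
#include <stddef.h>
#include <openssl/sha.h>   // SHA256_Init/Update/Final
#include <zlib.h>          // crc32()

// cumulative SHA256 across all the accounts files plus a cheap CRC32 over the
// same data; after the first pass only the CRC32 needs rechecking on startup
void balances_checksum(uint8_t sha[SHA256_DIGEST_LENGTH],uint32_t *crcp,uint8_t **files,size_t *lens,int32_t n)
{
    SHA256_CTX ctx; uint32_t crc = (uint32_t)crc32(0L,Z_NULL,0);
    SHA256_Init(&ctx);
    for (int32_t i = 0; i < n; i++)
    {
        SHA256_Update(&ctx,files[i],lens[i]);
        crc = (uint32_t)crc32(crc,files[i],(uInt)lens[i]);
    }
    SHA256_Final(sha,&ctx);
    *crcp = crc;
}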
And this is where I am now... The good news is that on a monster server the realtime processing goes in a blink and it catches up to the realtime block in several seconds, but since the realtime processing is still using the naive approach, it is horribly slow on lesser servers.
Just getting the logic correct is insufficient when there are performance constraints. So I need to revamp the realtime processing of the spendvectors and balances, and at that point iguanacore will be ready for transactions.
James