After battling performance problems in generating the spendvectors and balance files (I tried quite a few variations), I decided to get a monster machine to see how fast it would be if hardware were not a bottleneck at all:
BTC.RT405815 u.202 b.202/202 v.202/404 (1815+1815/1815 1st.202).s0 to 202 N[203] h.405815 r.405815 c.405815 s.405815 d.0 E.202 maxB.32 peers.56/32 Q.(0 0) L.405815 [202:1814] M.405814 000000000000000002c2e6d28b1bdb5224705d5e39055de261edaa86a7e16be5 bQ.1 0:23:27 stuck.0 max.47
E.202 -> 202 bundle files created for the first 404000 blocks
u.202 -> 202 utxo files (spend vectors) created
b.202 -> 202 balance files (accounts) created
BTC.RT405815 -> realtime blocks (current unfinished bundle) which matched blockchain.info
0:23:27 -> less than 24 minutes start to finish! At the time, that was faster than it took for block 405816 to be mined.
When was the last time anybody was able to do a full BTC sync faster than the next block was mined?
Bandwidth, CPU and HDD are all pretty close to saturated during the entire process, and achieving such a time required lots of RAM, an SSD and 1 Gbps bandwidth, but at least it shows there are no software bottlenecks.
To give an idea of what is happening under the hood to achieve such performance, I need to explain the various stages and files that are created. The naive direct creation of files ended up swapping to HDD if you didn't have enough RAM (64GB), which is almost all configurations, so even after optimizing via prefetches and minimizing the RAM footprint, it still took anywhere from half an hour to 4 hours to go from having all the bundle files to being ready for realtime processing.
The bundle files are explained in detail elsewhere and I will assume you are familiar with them. The tl;dr is that each bundle is the data from a set of 2000 blocks and is totally independent from all other bundles. The vin data and vout data are put into separate files, but all of these are permanent and never change after they are created.
The UTXO set is of course something that is always changing: at any time, any existing unspent output can be spent. This creates a horrible access pattern if you have a normal HDD, where each seek is 15 milliseconds. This is how the naive direct creation of the spendvector files ends up taking hours, i.e. millions of 15 millisecond delays. So even though it would be possible to independently generate each spendvectors file for each bundle (as soon as all the prior bundles are there), it is not practical...
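To put a rough number on it: at 15 ms per seek, one million random dereferences alone is 1,000,000 x 15 ms = 15,000 seconds, about four hours of pure seek time before any computation even happens.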
What I had to do was to change from an independent "vertically" parallel process to a dependent batch process.
First the problem:
Each bundle has all of its spends defined as (txid/vout), but a significant percentage of them are references to txids that are not in this bundle.
In order to search all the other bundles, portions of those bundles need to be accessed, but since a txid is a hash, the lookups span large areas of prior bundles in a random pattern. And to fully dereference a spend, you need not only the txid but also the amount, which is stored in the unspent data (yet another potential seek).
Even when processing just one bundle, this can easily overwhelm system resources, as 4KB paging loads an entire 4KB page even for a 64 byte retrieval. One million of these and 4GB of cache is replaced. Since the recent bundles have 5 million+ spends, RAM is consumed and thrashing to HDD ensues. Without an SSD this leads to hours of processing, and since we are getting all the bundle files created in 30 minutes, it is a very significant delay.
I tried to split things into two steps: resolve the txid first and then do the unspent access separately. That improved things a lot, but still not enough.
struct iguana_spendvector
{
    uint64_t value; uint32_t pkind,unspentind;
    int32_t fromheight;
    uint16_t hdrsi:15,tmpflag:1;
} __attribute__((packed));
A spendvector is 22 bytes and once it is fully resolved, it never changes. Given all the spendvectors, an entirely new set of balances can be calculated quite quickly, i.e. in a minute or two, since each spendvector has all the information needed included in itself.
I added a tmpflag to indicate whether it still needs to be resolved, and in the first pass I just set the hdrsi, unspentind and fromheight. The hdrsi is the bundle and the unspentind is relative to that bundle, so it will never overflow the 32 bits; a global unspentind would overflow in a few years! By storing indexes that are likely to be repeated, it lowers the entropy of the dataset, which allows for better compression. So whenever there is a choice of how to encode things, I used an encoding that is likely to have many occurrences. This is probably why .xz compression is working the best.
The fromheight field is not actually needed, but by having it, iguana will be able to quickly calculate the balance for any address at any height, so I felt it was well worth the extra "32 bits". In reality it costs much less, as all spends from a single block have the same value replicated across the dataset, so it should take 16 bits or less in compressed form.
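To make the two passes concrete, here is a minimal sketch of how the fields get filled in (function names and comments are mine for illustration, not iguana's actual API): the first pass stores hdrsi, unspentind and fromheight, the second pass completes value and pkind from the unspent data.

#include <stdint.h>
// uses struct iguana_spendvector as defined above

void spendvector_firstpass(struct iguana_spendvector *sv,int32_t hdrsi,uint32_t unspentind,int32_t fromheight)
{
    sv->hdrsi = hdrsi;            // bundle that contains the vout being spent
    sv->unspentind = unspentind;  // index relative to that bundle, so it never overflows 32 bits
    sv->fromheight = fromheight;  // height of the block containing the spend
    sv->tmpflag = 1;              // value and pkind still need to be resolved
}

void spendvector_secondpass(struct iguana_spendvector *sv,uint64_t value,uint32_t pkind)
{
    sv->value = value;            // amount taken from the unspent data
    sv->pkind = pkind;            // which address entry in that bundle the vout pays
    sv->tmpflag = 0;              // fully resolved, never changes again
}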
Doing this requires converting the txid/vout into hdrsi/unspentind, which is a parallel search across all the prior bundles. I take advantage of the fact that most spends are for relatively recent vouts, usually within 3 to 5 bundles back, but with millions of spends in each bundle, we still get the exponential tail of accesses into the far past chewing up precious cache space.
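A sketch of that backward search (the two per-bundle lookup helpers are assumptions for illustration, not iguana's real functions, and it assumes a txid's vouts are stored contiguously within its bundle):

#include <stdint.h>
extern uint32_t bundle_txid_lookup(int32_t hdrsi,const uint8_t txid[32]);   // hypothetical per-bundle txid hashtable lookup, 0 if not found
extern uint32_t bundle_first_unspentind(int32_t hdrsi,uint32_t txidind);    // hypothetical: first unspentind of that txid's vouts

// search bundles newest-first, since most spends reference recent vouts
int32_t resolve_spend(int32_t spend_hdrsi,const uint8_t txid[32],int32_t vout,int32_t *hdrsip,uint32_t *unspentindp)
{
    for (int32_t i = spend_hdrsi; i >= 0; i--)   // usually hits within 3 to 5 bundles back
    {
        uint32_t txidind = bundle_txid_lookup(i,txid);
        if ( txidind != 0 )
        {
            *hdrsip = i;
            *unspentindp = bundle_first_unspentind(i,txidind) + vout;
            return(0);
        }
    }
    return(-1); // txid not found in any prior bundle
}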
Even though it is possible to generate the above in parallel, the cost of a cache miss being 10,000x forces a loose serialization. What I do is allow each bundle to create its spendvectors only if the bundle H before it either does not exist (negative bundle index) or has already created its spendvectors, where H is the number of helper threads. This allows parallel processing within a window of H bundles, and given that there are 202 bundles and usually far fewer cores, it is not a bottleneck.
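Expressed as a predicate (a sketch; done[] here is a hypothetical array marking bundles whose spendvectors are finished):

#include <stdint.h>
// bundle hdrsi may start generating its spendvectors once the bundle H earlier
// is finished or does not exist, keeping at most H bundles in flight at a time
int32_t spendvectors_cangenerate(int32_t hdrsi,int32_t H,const uint8_t *done)
{
    return(hdrsi - H < 0 || done[hdrsi - H] != 0);
}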
Since each txid search goes backwards from the current bundle and the odds are good that the data is already in the cache, we get about 5 microseconds per search. The reason it is likely in the cache is that when spendvector processing starts, I prefetch the entire bundle to put it in the cache. If the data is not accessed, later prefetches and accesses will just overwrite the unused pages, so over time we end up with just the data that is needed. This allows the cache to be filled at 400 MB/sec vs 4 MB/sec, a 100x speedup.
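The prefetch itself can be as simple as touching one byte per page of the memory-mapped bundle, sequentially (a sketch of the idea, not iguana's actual code):

#include <stdint.h>
#include <stddef.h>
// touch one byte per 4KB page so the OS faults the mapping in at streaming
// speed instead of 4KB at a time during random lookups later
uint64_t bundle_prefetch(const uint8_t *mapped,size_t len)
{
    uint64_t nonz = 0;
    for (size_t off = 0; off < len; off += 4096)
        nonz += mapped[off];   // the sum is returned so the reads are not optimized away
    return(nonz);
}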
The net result is that all the spendvectors are generated in minutes instead of hours. But they are in temporary form and not ready for permanent saving to HDD. Serializing this part also boosts system performance, as we are not mixing HDD saves of the spendvectors into the calculations, i.e. fewer seeks. Always fewer seeks. Seeks are the enemy of performance.
The next step is to convert the temporary spendvectors into permanent form. To do this, we need to find the remaining fields: value and pkind. But isn't this the same problem as before? Not exactly; there is a subtle difference in that ALL the spends as of the last bundle are in RAM, so there is no seek penalty. More importantly, we can process horizontally vs. vertically. What I mean by that is that we can scan each bundle's in-memory spendvectors [s0, s1, s2, ... sn] and only process the ones that need data from a specific bundle. And this is the key. Even processing one bundle at a time will go pretty fast, but since each bundle processes only the spends of its own vouts, all the bundles can convert in parallel without any restrictions at all. Well, the only restriction is that the prefetched data for all the bundles being converted has to fit in memory, so at 1GB per bundle, 8GB of RAM could do 8 at a time.
Prefetch a bundle, scan all the other bundles' spendvectors and convert the ones referring to the bundle being processed.
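In code, the horizontal pass could look roughly like this (a sketch; unspent_stub is a stand-in for the prefetched bundle's real vout records, and the names are mine):

#include <stdint.h>
// uses struct iguana_spendvector as defined above
struct unspent_stub { uint64_t value; uint32_t pkind; };

// with bundle `target` prefetched, scan every bundle's in-memory spendvectors
// and finish only the entries that spend one of target's vouts
void convert_for_bundle(struct iguana_spendvector **vecs,int32_t *counts,int32_t numbundles,int32_t target,struct unspent_stub *unspents)
{
    for (int32_t i = 0; i < numbundles; i++)
        for (int32_t j = 0; j < counts[i]; j++)
        {
            struct iguana_spendvector *sv = &vecs[i][j];
            if ( sv->tmpflag != 0 && sv->hdrsi == target )
            {
                sv->value = unspents[sv->unspentind].value;   // amount of the vout being spent
                sv->pkind = unspents[sv->unspentind].pkind;   // address index of that vout
                sv->tmpflag = 0;                              // now permanent
            }
        }
}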
This does seem like a lot of work just to get a few fields set, but this is what it takes to achieve a half hour parallel sync, and it explains why nobody has done it before.
It is so fun to see what used to take hours happen in seconds. All this speedup allows it to complete in a few minutes on the lower caliber server, but on the monster server it finishes in less than a minute! Pretty good improvement over the naive approach.
Now that the spendvectors are fully resolved, save them all to disk. They will never change and can go into the permanent dataset section to be compressed 2:1.
Once all the permanent spendvectors are done, a single serial process scans them (they are likely still in RAM) and then all the accounts files are created. Since these account files are volatile, I take special care to make sure they are valid. I make a cumulative SHA256 calculation and save that to a file along with a CRC. This is a locally generated file, so if it is there (after the initial pass), the SHA256 calculation can be skipped and just the CRC is needed. That saves a minute on each startup and allows things to go to realtime mode in less than 30 seconds.
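A sketch of the validation idea, using OpenSSL's SHA256 and zlib's crc32 as stand-ins (which libraries iguana actually links against is an assumption here):

#include <stdint.h>
#include <stddef.h>
#include <openssl/sha.h>   // SHA256_Init/Update/Final
#include <zlib.h>          // crc32()

// cumulative SHA256 across all the accounts files plus a cheap CRC32 over the
// same data; after the first pass only the CRC32 needs rechecking on startup
void balances_checksum(uint8_t sha[SHA256_DIGEST_LENGTH],uint32_t *crcp,uint8_t **files,size_t *lens,int32_t n)
{
    SHA256_CTX ctx; uint32_t crc = (uint32_t)crc32(0L,Z_NULL,0);
    SHA256_Init(&ctx);
    for (int32_t i = 0; i < n; i++)
    {
        SHA256_Update(&ctx,files[i],lens[i]);
        crc = (uint32_t)crc32(crc,files[i],(uInt)lens[i]);
    }
    SHA256_Final(sha,&ctx);
    *crcp = crc;
}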
And this is where I am now... The good news is that on a monster server the realtime processing goes in a blink and it catches up to the realtime block in several seconds, but since the realtime processing is still using the naive approach, it is horribly slow on lesser servers.
Just getting the logic correct is insufficient when there are performance constraints. So I need to revamp the realtime processing of the spendvectors and balances, and at that point iguanacore will be ready for transactions.
James