iguana "instant on" fast startup

jl777

Active Member
Feb 26, 2016
279
345
This post assumes you are somewhat familiar with iguana parallel sync as described in other threads. Each bundle is permanently fixed, and from historical data it appears that after 10 blocks it will be very rare for any block to change. https://bitcointalk.org/index.php?topic=1403436.0 and http://pastebin.com/LZxst5vD have reorg history as seen from one node, going back to 2012!

So this confirms my feeling that 10 blocks is plenty, especially since the cost of getting it wrong is just having to pause any tx and recalculate the data structures.

Since each bundle of 2000 blocks has a fixed set of blocks, txids, vins and vouts, what I do is create a hashtable for the txids and a bloom filter for the blockhashes. This allows a direct lookup for txids, and an average scan of 1000 blocks to find a blockhash. Since the data is coming out of RAM after the first access (they are in memory mapped files), this is quite fast. If needed, it would be possible to add another hash table for direct blockhash lookup, but I don't think that is such a common operation; the high volume things are oriented around txids.
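To make the idea concrete, here is a minimal sketch of the per-bundle lookup: a direct hashtable for txids and a bloom filter gating the blockhash scan. The struct layout, sizes and function names are illustrative, not the actual iguana code.

// Minimal sketch of the per-bundle search idea: a direct hashtable for txids
// and a bloom filter gating a scan for blockhashes. The layout, sizes and
// names are illustrative, not the actual iguana structures.
#include <stdint.h>
#include <string.h>

#define BUNDLE_BLOCKS 2000
#define BLOOM_BITS    (1 << 16)        // bloom filter bits per bundle
#define TXID_BUCKETS  (1 << 18)        // txid hashtable buckets per bundle (>> numtxids)

struct bundle
{
    uint8_t bloom[BLOOM_BITS / 8];     // a bit set for each blockhash in the bundle
    int32_t txid_table[TXID_BUCKETS];  // txid hash -> index into txids[], -1 if empty;
                                       // built once when the bundle is created
    uint8_t (*txids)[32];              // memory mapped array of txids
    uint8_t (*blockhashes)[32];        // memory mapped array of the 2000 blockhashes
    int32_t numtxids, numblocks;
};

// cheap hash over the first 4 bytes of a 32 byte hash (sketch only)
static uint32_t hash32(const uint8_t *hash) { uint32_t h; memcpy(&h, hash, 4); return(h); }

// txids get a direct lookup (open addressing with linear probing)
int32_t bundle_findtxid(struct bundle *bp, const uint8_t txid[32])
{
    for (uint32_t i = hash32(txid) % TXID_BUCKETS; ; i = (i + 1) % TXID_BUCKETS)
    {
        int32_t ind = bp->txid_table[i];
        if ( ind < 0 )
            return(-1);                               // empty slot -> not in this bundle
        if ( memcmp(bp->txids[ind], txid, 32) == 0 )
            return(ind);                              // index of the txid inside the bundle
    }
}

// blockhashes only get a bloom filter check; if it might be here, scan the
// at most 2000 blockhashes, which averages ~1000 comparisons per bundle
int32_t bundle_findblock(struct bundle *bp, const uint8_t blockhash[32])
{
    uint32_t bit = hash32(blockhash) % BLOOM_BITS;
    if ( (bp->bloom[bit >> 3] & (1 << (bit & 7))) == 0 )
        return(-1);                                   // definitely not in this bundle
    for (int32_t i = 0; i < bp->numblocks; i++)
        if ( memcmp(bp->blockhashes[i], blockhash, 32) == 0 )
            return(i);
    return(-1);                                       // bloom filter false positive
}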

There are about 200 bundles right now as we are close to block 400,000. To initialize things, these 200 bundle files need to be memory mapped, which takes about the same time as an fopen. Now they are ready! And by ready, I mean ready for parallel searches for blocks, txids, generating rawtxbytes, etc. Notice that after the initial one-time validation, there is never any need to re-verify the 400,000 blocks worth of bundles, as they never change, even by a single bit.
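Initialization is basically just a readonly mmap per bundle file, something like this sketch (the path layout and function name are made up for illustration):

// Sketch of mapping a readonly bundle file at startup. The path layout and
// names are illustrative, not the exact iguana ones.
#include <stdint.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void *map_bundle(const char *coin, int32_t bundleheight, size_t *sizep)
{
    char fname[512]; struct stat st; void *ptr; int fd;
    snprintf(fname, sizeof(fname), "DB/%s/bundle.%d", coin, bundleheight);
    if ( (fd = open(fname, O_RDONLY)) < 0 )
        return(0);
    if ( fstat(fd, &st) != 0 ) { close(fd); return(0); }
    // readonly MAP_SHARED: the OS pages it in on demand, so the "load" cost is
    // about the same as an fopen, and after first access it comes out of RAM
    ptr = mmap(0, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if ( ptr == MAP_FAILED )
        return(0);
    *sizep = (size_t)st.st_size;
    return(ptr);
}

Doing this 200 times is just 200 open+mmap calls, which is why startup can be nearly instant.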

The non-compressible vindata is not part of the read-only files, as it will either not be downloaded at all or be purged after initial validation.

What, you might ask, is the problem?
The issue is that the blocks from where the most recent bundle ends up to the current block are not in any readonly bundle. I had put off dealing with this issue until now. One approach is to just resync the latest blocks, but with an average of 1000 and a worst case of 2009 blocks, that is 1GB to 2GB of data, which is not acceptable.

I do have things set up with the first pass data file for each block still in the tmp directory. But unlike the readonly bundles, which are protected from tampering to a large degree, especially if they are literally readonly and put into a squashfs, the tmp files are vulnerable to tampering. I could reprocess each of these and validate them again, but that would end up with a bunch of firstpass files, which are more of a raw, self-contained format rather than an indexed one.

Now what I need is a way to combine search results from the parallel bundles and the realtime partial bundle. It takes a few minutes to create the most recent bundle, so it is not so practical to regenerate a new one with each new block, especially when there are a couple of fast blocks. So my plan is to generate sub-bundles of 10 and 100 blocks. This would be a worst case of 29 extra bundles, but since they are smaller, the blockhash search time won't suffer much; just having 229 bundles vs. 200 means a 15% increase in iterating through the bundles serially, so 2.5 milliseconds will go to 3 milliseconds on my laptop.

I think that seems reasonable, and now on startup it is a matter of loading at most 9 firstpass files and making a microbundle out of them. That won't take long at all; it should scale linearly vs. the full bundle of 2000, i.e. about 200x faster, so a few seconds even for 9 blocks' worth.

The total startup time cost would be to map 200 bundles, map 19 minibundles (100 blocks each), map 9 microbundles (10 blocks each), and to create a 9-block partial microbundle and map that.
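For concreteness, a little sketch of how the leftover blocks above the last full 2000-block bundle decompose into 100-block minibundles, 10-block microbundles and a sub-10 partial. The function is illustrative arithmetic only, not the actual iguana code:

// Sketch: split the blocks above the last full 2000-block bundle into
// 100-block minibundles, 10-block microbundles and a <10 block partial.
#include <stdint.h>
#include <stdio.h>

void split_leftover(int32_t height)
{
    int32_t leftover = height % 2000;         // blocks past the last full bundle
    int32_t mini     = leftover / 100;        // 100-block minibundles
    int32_t micro    = (leftover % 100) / 10; // 10-block microbundles
    int32_t partial  = leftover % 10;         // firstpass files -> partial microbundle
    printf("height.%d -> %d minibundles, %d microbundles, %d-block partial (%d extra maps)\n",
           height, mini, micro, partial, mini + micro + (partial > 0));
}

int main()
{
    split_leftover(401999);   // worst case leftover of 1999: 19 + 9 + a 9-block partial = 29 extra
    split_leftover(400010);   // just past a boundary: 0 + 1 + 0 = 1 extra
    return(0);
}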

Not sure if I can achieve a 3 second time for this, but I can always show a little dancing iguana animation during startup if I need 5 to 10 seconds.

I have made the changes to generate arbitrarily sized bundles, but I have a feeling I missed a few places that also need to change. My goal is to get the realtime partial bundle created and loading this weekend, and at that point it will be ready for more rigorous testing.

James
 

jl777

Active Member
Feb 26, 2016
279
345
Got stuck debugging things until it didn't crash, even while syncing BTC and BTCD at the same time.

Harder was getting it to transition from historical syncing to realtime mode, and it appears to be doing that now. Pretty cool to see it get the latest block for both BTC and BTCD as they come in.

I also added support for getting the readonly files from a different directory and, if not there, getting them from the current location. This way, once you get a validated readonly dataset, that takes precedence, and then you can keep updating it with normal files. At some point, a way to create a single compressed volume will probably be useful, but for now it shouldn't be so important.

My VPS has 32GB of RAM, so without compressing the data to 20GB it doesn't fit, starts swapping and becomes 100x slower. When it is in RAM, it can reconstruct the entire set of account balances in 2 minutes, but I made it so even the address balances and utxo can go into the readonly section.

Now, if you are paying attention, you will be saying "wait! you can't put the address balances or utxo into the read only section."

Well, turns out it is probably the most efficient thing to do!

After struggling for a while with how to efficiently deal with the realtime updates as new blocks come in, the problem is that each vin can spend any prior unspent vout, from any block/bundle. If I could use an r/w memory map, then maybe having the full dataset in an r/w file would be possible, but Chrome OS only supports readonly memory maps, plus once you allow an r/w memory map, any stray memory write could corrupt the dataset. So for both safety and portability, an r/w memory map is no good.

I am seeing around 10 million vins per recent bundle, so even at 100 bytes per vin it would fit in a GB of RAM, and on average half that. Now this is big, but not too big, and smaller than the lifetime balance/spent state. If space were really important, each vin could be encoded into 6 bytes or so and, combined with a bitmap, would be very small, but searching for things would be slow, and I don't like slow when it could be 1000x faster by using a bit of RAM.

What I did was make an in-memory hash table with all the information that would be in the bundle boundary balance/utxo files. However, only the blocks from the current bundle put things into this in-RAM hashtable. This means that regardless of how large the blockchain gets, nothing except the current bundle's utxo ever has to be in RAM. And from prior posts, we know that data in the readonly section is much lower cost, as it can be distributed via torrents.
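Roughly, that in-RAM layer could look like the following sketch: a plain hashtable keyed by address that only ever receives deltas from blocks in the current bundle, and gets cleared once the bundle's readonly balance files exist. All names and sizes here are hypothetical, not the actual iguana structures.

// Sketch of the in-RAM balance/utxo hashtable that only ever holds entries
// from the current (unfinished) bundle.
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define RT_BUCKETS (1 << 20)

struct rt_entry
{
    uint8_t rmd160[20];      // address hash
    int64_t balance;         // net change from blocks in the current bundle
    struct rt_entry *next;
};

static struct rt_entry *rt_table[RT_BUCKETS];

static uint32_t rt_hash(const uint8_t rmd160[20]) { uint32_t h; memcpy(&h, rmd160, 4); return(h % RT_BUCKETS); }

// called once per vout (credit) or vin (debit) of each new block
void rt_update(const uint8_t rmd160[20], int64_t delta)
{
    uint32_t h = rt_hash(rmd160);
    struct rt_entry *ep;
    for (ep = rt_table[h]; ep != 0; ep = ep->next)
        if ( memcmp(ep->rmd160, rmd160, 20) == 0 )
            { ep->balance += delta; return; }
    ep = calloc(1, sizeof(*ep));
    memcpy(ep->rmd160, rmd160, 20);
    ep->balance = delta;
    ep->next = rt_table[h];
    rt_table[h] = ep;
}

// when the bundle completes and its readonly balance files are written,
// the whole table is simply cleared and rebuilt from the next block onward
void rt_clear(void)
{
    for (int32_t i = 0; i < RT_BUCKETS; i++)
    {
        struct rt_entry *ep, *next;
        for (ep = rt_table[i]; ep != 0; ep = next)
            { next = ep->next; free(ep); }
        rt_table[i] = 0;
    }
}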

So now the gating issue is being able to do an up-to-date listunspent. I figured out a way to make things much simpler than the prior 100*n + 10*m + (1 to 9) scheme. I will make it so that as each block comes in, the ramchain data structure is incrementally updated, and by not saving it to disk, this data can be accessed as is. It will actually be the identical data structure as all the readonly datasets, so very few changes are needed to support it. I just need to make sure there are no assumptions that all bundles are full.

[read only bundles] + [partial realtime bundle] + [utxo RAM tables]

With the above three datasets, it will be possible to do parallel searching of address balances as of any block, in addition to all the normal block explorer queries and of course all the blockchain level RPC. For those who are worried about data consistency and atomic updates, I have a single thread handling RPC requests and updating the utxo RAM tables, and I will have that same thread do the updating of the realtime bundle. This way, when things are out of sync, there is nothing responding to RPC. Atomicity by avoidance of non-atomic timing, which doesn't even require setting locks.
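The shape of that is something like the sketch below: one loop owns both the updates and the RPC answering, so there is never a reader racing a writer. Every function here is a placeholder, not the actual iguana internals.

// Sketch of atomicity-by-single-thread: the one loop that applies new blocks
// to the realtime bundle and utxo RAM tables is also the only thing answering
// RPC, so a query can never observe a half-updated state and no locks are needed.
#include <stdint.h>
#include <unistd.h>

static int32_t poll_newblock(void)       { return(0); } // stub: nonzero if a new block arrived
static int32_t poll_rpcrequest(void)     { return(0); } // stub: nonzero if an RPC request is queued
static void update_realtime_bundle(void) { }            // stub: extend the partial bundle
static void update_utxo_ramtables(void)  { }            // stub: apply the block's vins/vouts
static void answer_rpcrequest(void)      { }            // stub: service one queued RPC call

void rpc_and_update_loop(void)
{
    while ( 1 )
    {
        if ( poll_newblock() != 0 )
        {
            // while this runs, nothing is answering RPC, so requests just queue up
            update_realtime_bundle();
            update_utxo_ramtables();
        }
        else if ( poll_rpcrequest() != 0 )
            answer_rpcrequest();
        else
            usleep(1000);                // idle: sleep ~1 millisecond between polls
    }
}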

I have a contractor working on generating a snapshot of all address balances via a modified bitcoind; it seems it takes 5+ hours just to scan the DB after it is already synced. I guess I shouldn't feel so bad that it now takes an hour to fully sync iguana.

After I get the incremental updating debugged, I will be able to compare against independently calculated balances to validate the integrity of the data.

The only issue I can see left would be the downtime as a bundle boundary is reached. It will take a minute or so to create the new bundle and then update the balance files. Maybe having a bit of downtime is ok, but of course people will complain, so I will need to be able to continue with more than a full bundle's worth of "partial" utxo data. Maybe I can just create bundles as they can be created, and until things are restarted, it just keeps adding to the RAM utxo data. Then on restart, I would need to check for the non-readonly data to get the latest balance/utxo state.

OK, I think that works.

James
 

jl777

Active Member
Feb 26, 2016
279
345
Exportable Squashfs 4.0 filesystem, xz compressed, data block size 131072
compressed data, compressed metadata, compressed fragments, compressed xattrs
duplicates are removed
Filesystem size 16515912.19 Kbytes (16128.82 Mbytes)
39.70% of uncompressed filesystem size (41600283.85 Kbytes)

The bundle files are compressible 2.5:1! And this includes all the search indexes, instant-on, even the extra tags to allow balance calcs at any height. If it was stripped down to the essence, it would probably come in at 12GB.
 

jl777

Active Member
Feb 26, 2016
279
345
Volatile files create the problem of potential tampering, so I added a SHA256 hash of the entire volatile dataset to detect this.

The problem is that it takes over a minute to do this and that really slows down startup.

So I added a CRC calculation in parallel, and after the first time the SHA256 is verified, it can use the CRC calculation instead. That was much faster, but still took 12 seconds.

I know it can be faster, and there is no need to even do a CRC if the data is in a readonly filesystem. So I debugged the "ro" mounting, where it looks for all the data files in DB/ro/BTC/... first and uses those. In the case where all the files in the dataset (up to where the balances are calculated) come from the "ro" mount, then since it is a readonly memory mapping of a readonly filesystem, we can just skip the 12 second CRC calculation.
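The startup check then reduces to something like this sketch (using zlib's crc32 and OpenSSL's SHA256 for illustration; the from_ro flag, stored hashes and overall flow are simplified placeholders, not the actual iguana code):

// Sketch of the startup integrity check for the volatile (non-readonly) files.
#include <stdint.h>
#include <string.h>
#include <zlib.h>            // crc32()
#include <openssl/sha.h>     // SHA256()

int32_t validate_volatile(const uint8_t *data, size_t len, int32_t from_ro,
                          uint32_t stored_crc, const uint8_t stored_sha[32],
                          int32_t sha_already_verified)
{
    if ( from_ro != 0 )
        return(0);   // readonly mmap of a readonly filesystem: nothing to check
    if ( sha_already_verified != 0 )
    {
        // fast path: CRC over the dataset (~12 seconds for the full volatile set)
        uint32_t crc = (uint32_t)crc32(0L, data, (uInt)len);
        return(crc == stored_crc ? 0 : -1);
    }
    // first run: full SHA256 (over a minute), after which the CRC is trusted
    uint8_t sha[32];
    SHA256(data, len, sha);
    return(memcmp(sha, stored_sha, 32) == 0 ? 0 : -1);
}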

For now you need to manually create the squashfs and mount it. I tested various settings and the xz compression was best:

-rw-r--r-- 1 dvb2 dvb2 16877101056 Mar 23 08:27 BTC.xz
-rw-r--r-- 1 dvb2 dvb2 16651587584 Mar 23 09:22 BTC.xz1m
-rw-r--r-- 1 dvb2 dvb2 21510365184 Mar 23 10:13 BTC.lzo
-rw-r--r-- 1 dvb2 dvb2 21462687744 Mar 23 11:06 BTC.lzo1m
-rw-r--r-- 1 dvb2 dvb2 19564527616 Mar 23 11:36 BTC.squash
-rw-r--r-- 1 dvb2 dvb2 19539980288 Mar 23 12:08 BTC.squash1M

The default settings for xz were faster and produced basically the same size as a much bigger block size and dictionary, so:

mksquashfs DB/BTC BTC.xz -comp xz
sudo mount BTC.xz DB/ro/BTC -t squashfs -o loop

The above takes quite a while, basically as long as iguana takes to generate the dataset, but we go from 40GB to 16GB and it is totally interchangeable with the normal files.

This did have the expected speedup, as the 12 seconds became instant:

have filecrc.382ed291 for ebf0e75c6e145a6276ae56b73b1b13fb5cfdacacc1d0b5a60bacb288f9b67890 milli.1458761902445

millis 1458761902445 from_ro.1 written.201 crc.382ed291/382ed291 balancehash.(ebf0e75c6e145a6276ae56b73b1b13fb5cfdacacc1d0b5a60bacb288f9b67890) vs (ebf0e75c6e145a6276ae56b73b1b13fb5cfdacacc1d0b5a60bacb288f9b67890)

MATCHED balancehash numhdrsi.201 crc.382ed291

BTC u.201 b.201 v.0/0 (0 1st.201) to 201 N[202] Q.0 h.402000 r.402000 c.0.000kb s.402000 d.201 E.201:2 est.64 0.000kb 0:00:06 477.758 peers.101/256 Q.(0 0) L.403965 M.401999 00000000000000000698ebac3b51c09608db7acca8ffbdcc3083545bc1dfd3e6

0:00:06 -> six seconds for startup.

I think that is fast enough. It printed things out right after init and before it found and added the realtime blocks; it isn't updating the utxo RAM set yet, so I need to get that debugged and then it is ready for RPC queries. My goal is 30 seconds to get things ready on startup, so that leaves a 24 second time budget.

James
 

jl777

Active Member
Feb 26, 2016
279
345
It took almost 1000 lines of code to deal with things, but I think it is working decently now. The regen time was just a few seconds, even for BTC, so I use the brute force approach of just regenerating the entire partial bundle if anything changes.

That solves the issue with everything but reorgs that invalidate an existing bundle.
I tested using a readonly filesystem and also normal files, with BTC and BTCD, and in all cases things seem to be doing what they should, but I still need more testing to be sure.

Now there are all the readonly datasets and a specific window of time where it is valid to make RPC calls, i.e. after all the updates of the block/bundle are done. 99% of the time it is ready; when a new block comes in, there is less than a second of latency. At a bundle boundary, maybe up to a minute or two, but that is once every few weeks, so not a practical problem.

Dealing with all the memory management and timing issues took a while, so I don't quite have the code that actually does the dataset query, but I already had half of that done dealing with the unspents side. The spends side is a mirror, so it shouldn't be too hard to add that part, and then the RAM based hash table won't have any performance constraints.

Due to static memory allocation, I am using about 1.5GB of fixed allocation for the realtime bundle. But it is only incrementally added to when a new block comes in, so on a memory-limited system most of that would be swapped out to disk.

Another day or two and I hope to have listunspent working, but with the added ability to work with any address, including multisig, p2sh, etc., and as of any height. I haven't seen a crash for a while, and in recent days I fixed a lot of little bugs. On systems where it was slowing down, it is now completing. I had caching of all blocks on by default, and that doesn't work so well with bitcoin.

Also fixed CPU usage when in realtime mode. It was about 10%, which seemed too high when just idling, so I increased latency a bit; now it is about 2%.
 

jl777

Active Member
Feb 26, 2016
279
345
The various threads poll for events; if a thread finds one, it iterates again,
but if there was no event, it sleeps for a while: 1 millisecond, or 100 microseconds, etc.

I increased this latency in a few places, so it will slow down responses by a bit, but it won't be humanly noticeable.
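As a rough illustration (the per-poll cost here is an assumption; only the ~10% and ~2% idle CPU figures from the previous post are measured): if an idle poll costs on the order of 100 microseconds of work and then sleeps for 1 millisecond, the thread spends roughly 100/(100+1000) ≈ 9% of a core doing nothing; stretching the sleep to around 5 milliseconds drops that to roughly 100/(100+5000) ≈ 2%, at the cost of a few extra milliseconds of worst-case response latency.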
 
