This post assumes you are somewhat familiar with iguana parallel sync as described in other threads. Each bundle is permanently fixed, and from historical data it appears that after 10 blocks it is very, very rare for any block to change. https://bitcointalk.org/index.php?topic=1403436.0 and http://pastebin.com/LZxst5vD have reorg history as seen from one node, going back to 2012!
So this confirms my feeling that 10 blocks is plenty of reorg depth, especially when the cost of handling a deeper reorg is having to pause all tx processing and recalculate the data structures.
Since each bundle of 2000 blocks has a fixed set of blocks, txids, vins, and vouts, what I do is create a hashtable for the txids and a bloom filter for the blockhashes. This allows a direct lookup for txids and an average scan of 1000 blocks to find a blockhash. Since the data comes out of RAM after the first access (the bundles are in memory-mapped files), this is quite fast. If needed, it would be possible to add another hash table for direct blockhash lookup, but I don't think that is such a common operation. The high-volume things are oriented around txids.
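To make that concrete, here is a minimal sketch of the per-bundle blockhash path; the structures and the toy 3-hash bloom filter are my own simplifications for illustration, not iguana's actual layout, and the txid hashtable is omitted:

```c
#include <stdint.h>
#include <string.h>

#define BUNDLE_BLOCKS 2000
#define BLOOM_BYTES 4096

// toy structures for illustration; iguana's real bundle layout differs
struct block_entry { uint8_t blockhash[32]; /* ...other header fields... */ };

struct bundle_ro
{
    struct block_entry blocks[BUNDLE_BLOCKS];   // fixed set of 2000 blocks, never changes
    uint8_t bloom[BLOOM_BYTES];                 // bloom filter over the 2000 blockhashes
    int32_t numblocks;
};

// derive 3 bit positions straight from the hash bytes (toy bloom hashing)
static int32_t bloom_maybe(const uint8_t *bloom,const uint8_t hash[32])
{
    int32_t i; uint32_t bit;
    for (i=0; i<3; i++)
    {
        memcpy(&bit,&hash[i * 4],sizeof(bit));
        bit %= (BLOOM_BYTES * 8);
        if ( (bloom[bit >> 3] & (1 << (bit & 7))) == 0 )
            return(0);                          // definitely not in this bundle
    }
    return(1);                                  // maybe here -> worth a linear scan
}

// scan one bundle for a blockhash; on average ~1000 comparisons when it is present
static int32_t bundle_findblock(const struct bundle_ro *bp,const uint8_t hash[32])
{
    int32_t i;
    if ( bloom_maybe(bp->bloom,hash) == 0 )
        return(-1);
    for (i=0; i<bp->numblocks; i++)
        if ( memcmp(bp->blocks[i].blockhash,hash,32) == 0 )
            return(i);                          // index (relative height) within the bundle
    return(-1);                                 // bloom false positive
}
```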
There are about 200 bundles right now as we are close to block 400,000. To initialize things, these 200 bundle files need to be memory mapped, which takes about the same time as an fopen. Then they are ready! And by ready, I mean ready for parallel searches for blocks and txids, generating rawtxbytes, etc. Notice that after the one-time initial validation, there is never a need to verify the 400,000 blocks' worth of bundles again, as the data never changes, not even by a single bit.
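For reference, mapping a bundle file is roughly this cheap; a standalone sketch, assuming a hypothetical DB/BTC/bundle.<height> naming scheme rather than iguana's actual file layout:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// map one read-only bundle file; the cost is roughly an open()+mmap(), nothing is read eagerly
static void *map_bundle(const char *fname,size_t *sizep)
{
    struct stat st; void *ptr; int32_t fd;
    if ( (fd= open(fname,O_RDONLY)) < 0 )
        return(NULL);
    if ( fstat(fd,&st) < 0 || st.st_size <= 0 )
    {
        close(fd);
        return(NULL);
    }
    ptr = mmap(NULL,st.st_size,PROT_READ,MAP_PRIVATE,fd,0);
    close(fd);                          // the mapping stays valid after close
    if ( ptr == MAP_FAILED )
        return(NULL);
    *sizep = st.st_size;
    return(ptr);                        // pages fault in from disk only when actually touched
}

int main()
{
    char fname[512]; size_t size; int32_t i,n = 0;
    for (i=0; i<200; i++)               // ~200 bundles of 2000 blocks near block 400,000
    {
        sprintf(fname,"DB/BTC/bundle.%d",i * 2000);     // hypothetical naming scheme
        if ( map_bundle(fname,&size) != NULL )
            n++;
    }
    printf("mapped %d bundle files\n",n);
    return(0);
}
```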
The non-compressible vindata is not part of the read-only files, as it will either not even be downloaded or will be purged after initial validation.
What, you might ask, is the problem?
The issue is that the blocks from where the most recent bundle finished up to the current block are not in any read-only bundle. I had left dealing with this until now. One approach is to just resync the latest blocks, but with an average of 1000 and a worst case of 2009 blocks, that is 1GB to 2GB of data, which is not acceptable.
I do have things set up so that the first-pass data file for each block is still in the tmp directory. But unlike the read-only bundles, which are protected from tampering to a large degree, especially if they are literally read-only and put into a squashfs, the tmp files are vulnerable to tampering. I could reprocess each of these and validate them again, but that would still leave me with a bunch of firstpass files, which are a raw, self-contained format rather than an indexed one.
Now what I need is a way to combine search results from the parallel bundles and the realtime partial bundle. It takes a few minutes to create a bundle, so it is not practical to regenerate the most recent one with each new block, especially when there are a couple of fast blocks in a row. So my plan is to generate sub-bundles of 10 and 100 blocks. That is a worst case of 29 extra bundles, but since they are smaller, the blockhash search time won't suffer much; having 229 bundles vs 200 just means a 15% increase in iterating through the bundles serially, so 2.5 milliseconds will go to about 3 milliseconds on my laptop.
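A sketch of how the combined lookup could walk that mixed-size bundle list; the struct and the newest-first order are my assumptions, and a real version would consult each bundle's bloom filter before scanning:

```c
#include <stdint.h>
#include <string.h>

// hypothetical handle for any mapped bundle: a full 2000-block bundle, a 100 or 10 block
// sub-bundle, or the partial realtime microbundle
struct bundle_ref
{
    const uint8_t (*blockhashes)[32];   // points into the memory-mapped file
    int32_t firstheight,numblocks;
};

// serial search across all bundles, newest first since recent blocks are queried most often;
// worst case is ~229 entries (200 full + 19 of 100 + 9 of 10 + 1 partial)
static int32_t chain_findblock(const struct bundle_ref *bundles,int32_t n,const uint8_t hash[32])
{
    int32_t i,j;
    for (i=n-1; i>=0; i--)
    {
        const struct bundle_ref *bp = &bundles[i];
        for (j=0; j<bp->numblocks; j++)         // bloom filter check omitted for brevity
            if ( memcmp(bp->blockhashes[j],hash,32) == 0 )
                return(bp->firstheight + j);    // absolute block height
    }
    return(-1);
}
```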
I think that seems reasonable, and now on startup it is a matter of loading at most 9 firstpass files and making a microbundle out of them. That won't take long at all; it should scale linearly vs the full bundle of 2000, or about 200x faster, so a few seconds even if it is the full 9 blocks' worth.
The total startup time cost would be to map 200 bundles, map up to 19 minibundles (100 blocks each), map up to 9 microbundles (10 blocks each), and then create a partial microbundle of up to 9 blocks and map that.
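Worked out as a quick sketch, the decomposition of the chain tip into full bundles, minibundles, microbundles, and a partial tail looks roughly like this; the exact cutoffs and rounding here are assumptions of mine, not the final logic:

```c
#include <stdint.h>
#include <stdio.h>

// given the current chain height, estimate what startup has to map or build
// (assumed decomposition: 2000-block bundles, then 100s, then 10s, then a partial tail)
static void startup_plan(int32_t height)
{
    int32_t numblocks = height + 1;             // blocks 0..height
    int32_t full = numblocks / 2000;            // read-only full bundles to map
    int32_t rem = numblocks % 2000;
    int32_t minis = rem / 100;                  // 100-block minibundles (at most 19)
    int32_t micros = (rem % 100) / 10;          // 10-block microbundles (at most 9)
    int32_t partial = rem % 10;                 // partial microbundle from firstpass files (at most 9 blocks)
    printf("height %d -> %d bundles, %d minibundles, %d microbundles, %d-block partial\n",
           height,full,minis,micros,partial);
}

int main()
{
    startup_plan(399198);   // example: 199 bundles, 11 minibundles, 9 microbundles, 9-block partial
    return(0);
}
```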
I'm not sure if I can achieve a 3-second startup, but I can always show a little dancing iguana animation during startup if I need 5 to 10 seconds.
I have made the changes to generate arbitrary-sized bundles, but I have a feeling I missed a few places that also need to change. My goal is to get the realtime partial bundle created and loading this weekend, and at that point it will be ready for more rigorous testing.
James