30MB UTXO bitmap (uncompressed)

jl777

Active Member
Feb 26, 2016
279
345
For the utxo, my approach with iguana creates a parallel set of datasets in read-only memory-mapped files. Arguably, this is the ultimate DB, as it simply can't get any faster than this.
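A minimal sketch of what one such read-only mapping could look like in C (illustrative only, not iguana's actual code); once a bundle file is mapped with PROT_READ, lookups are just pointer arithmetic into shared, kernel-cached pages:

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a precalculated bundle file read-only; every process/thread shares the
   same physical pages, and nothing can accidentally modify the dataset. */
void *map_readonly(const char *fname, size_t *lenp)
{
    struct stat st;
    int fd = open(fname, O_RDONLY);
    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) != 0 || st.st_size <= 0)
    {
        close(fd);
        return NULL;
    }
    void *ptr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (ptr == MAP_FAILED)
        return NULL;
    *lenp = (size_t)st.st_size;
    return ptr;
}
```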

All things needed for block explorer level queries are precalculated and put into the read-only files.

Because each bundle of 2000 blocks is mostly independent of all other blocks, this not only allows for parallel syncing of the blockchain but also lets multiple cores be used for all queries regarding tx history.

I also calculate all address balances for each bundle, so by adding up the values across all the bundles you end up with the current balance as of the most recent bundle. Special-case handling of the realtime bundle gets us up-to-date block explorer data for all blocks, tx, and addresses (even multisig).
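If each bundle's read-only file carries a precomputed net balance delta per address, the aggregation step is just a sum, which also parallelizes per bundle. A hedged sketch, with the per-bundle deltas assumed to have been extracted already (this is not iguana's actual API):

```c
#include <stdint.h>

/* Sum an address's precalculated per-bundle balance deltas; the realtime
   (still-growing) bundle is handled separately and added on top. */
int64_t address_balance(const int64_t *bundle_deltas, int32_t numbundles,
                        int64_t realtime_delta)
{
    int64_t total = realtime_delta;
    for (int32_t i = 0; i < numbundles; i++)
        total += bundle_deltas[i]; /* each term is independent -> parallelizable */
    return total;
}
```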

I don't quite have everything done, but the basic structure is showing very good results. I also remove as much redundancy as possible during the creation of the read-only files, so instead of taking more space than the raw blockchain, it ends up taking about half the space. About half of that is the signatures, which can be purged for non-relaying nodes after they have been validated.

The format of the utxo vector would just be the location of the unspent that is being spent; the location of the spend is implicit in the order of the vins.

[u0, u1, u2, u3, ...] is a list of 6-byte locators for the unspents that the corresponding vins in the bundle are spending, where each ui contains the bundle id and the unspentind. Each unspentind is simply the order in which the output appears in the bundle (starting at 1).
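One plausible packing of such a 6-byte locator, assuming a 16-bit bundle id and a 32-bit unspentind (the actual field widths in iguana may differ):

```c
#include <stdint.h>

#pragma pack(push, 1)
typedef struct
{
    uint16_t bundleid;   /* which 2000-block bundle contains the output */
    uint32_t unspentind; /* 1-based order of the output within that bundle */
} utxo_locator_t;        /* exactly 6 bytes when packed */
#pragma pack(pop)

/* A bundle's spend vector is then simply utxo_locator_t spends[numvins], where
   spends[i] is the output consumed by the bundle's i-th vin; the location of
   the spend itself is implicit in i. */
```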

The above is not directly usable, but it is invariant and can be created once and put into the read-only file(system). Using mksquashfs reduces the size of the index tables by almost 50%.

Given such a list for all bundles, a parallel multicore process can traverse them all and update a bitmap that indicates which outputs are spent, so a 0 bit means unspent. Maybe 1 bit of space per utxo is good enough, as that is seemingly as good as it gets, but it is 1 bit per vout, so more like 30MB in size. However, I think this can be compressed pretty well, with most of the early vouts already being spent (ignoring the satoshi ones). I haven't had a chance to get an up-to-date utxo bitmap, but I hope to get one in the next week.
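A hedged sketch of that sweep, reusing the illustrative locator layout from above (again, not iguana's actual structures). Each bundle's spend vector can be handed to a separate core; a 1 bit marks a spent output:

```c
#include <stdint.h>

typedef struct { uint16_t bundleid; uint32_t unspentind; } utxo_locator_t;

/* firstbit[b] is the global bit offset of bundle b's first output, so every
   (bundleid, unspentind) pair maps to a unique bit in the bitmap. */
static void mark_spent(uint8_t *bitmap, const uint64_t *firstbit, utxo_locator_t u)
{
    uint64_t bit = firstbit[u.bundleid] + (u.unspentind - 1);
    bitmap[bit >> 3] |= (uint8_t)(1 << (bit & 7));
}

/* Sweep one bundle's spend vector. Bundles can be swept on separate cores, but
   two bundles may spend outputs whose bits share a byte, so a parallel version
   would either use atomic ORs or give each core its own bitmap and OR the
   per-core bitmaps together at the end. */
void sweep_bundle(uint8_t *bitmap, const uint64_t *firstbit,
                  const utxo_locator_t *spends, int32_t numvins)
{
    for (int32_t i = 0; i < numvins; i++)
        mark_spent(bitmap, firstbit, spends[i]);
}
```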

At the 30MB size or smaller, the entire utxo bitmap will fit into the CPU cache and provide 10x faster performance than RAM. RAM is actually quite slow compared to things that can be done totally inside the CPU.

James
 

jl777

Active Member
Feb 26, 2016
279
345
Thanks for your support. I needed a place for a permanent record, and with the very supportive atmosphere here I am happy to jot down my notes as I continue the process.

It might well be that the old bitcoin devs are not able to scale bitcoin technically, but that does not mean there are no devs who can. Crypto-oriented devs are of course very smart with all the maths, but having a feel for overall system performance is not about math. It comes from years of experience and from using ancient, archaic tools like C.

But to me C is a high-level language that allows OS portability, as opposed to assembler. So it all depends on your reference point. If you want something that has the best performance, using big giant general-purpose libs is not the way. That way lets you make something work in the shortest amount of coding time, which is of course very important during the early stages. It is like packing 100kg of suitcases with everything in them for a camping trip. You are sure to find whatever you need, but you are also sure to be carrying 100kg of suitcases instead of the 5kg of things you actually need.

If you find some CUDA/OpenCL coders, the concepts of parallel processing are quite standard. Changing things that cost N to log2 N is standard decomposition and aggregation. So what I have done is not anything magical; anybody familiar with parallel coding and crypto could have done it. I guess the overlap is the thing that might make me a bit special.

Also, I work backwards from the desired result and stubbornly try all things until I find what works. If the theory says it is impossible, I change the problem space to one where there is no such theory, the old trick of changing the battlefield, as even I am not able to defeat mathematically proven things. What most people don't realize is that math proofs almost always assume certain things and usually aren't a 100% reflection of reality. This refusal to just believe that the impossible can't be done, and the stubbornness to keep working until it is achieved, ok, I guess that does make me a bit different.

By taking a holistic view of the entire system and working backwards from the desired result, it is usually just a matter of time to achieve it.

James
 

jl777

Active Member
Feb 26, 2016
279
345
Finally got past the initial set of bugs for the lossless codec behavior. Still more bugs to fix, but it was a big leap in functionality without any debugging along the way, so it had to be debugged painfully. I like making one change and testing it; that way I don't spend any time figuring out what broke things.

It is getting the first pass done without any problems, but that pass just needs to store each vin's script alongside the vin. It is horribly wasteful of space, but the first-pass files are deleted as soon as a bundle is created and the block-based files aren't needed anymore. But the vin refers to a vout, which is usually an unknown vout, as it is not in this bundle.

Sometimes the pubkey is in the output script, other times only in the vin script. That created a complication where things get out of sync with the canonical ordering, as vouts and vins are processed separately and not always in identical order in all places.

It seems I made every off-by-one error possible, plus some strange things with negative indexes, etc. But finally it is generating the lossless block-based files, verifying they have the proper structure, then combining them into a bundle, verifying it, saving it, and then testing it as a memory-mapped read-only file. That is where it currently has the wrong size and munches memory.

Usually once it works in the first cases, it works in 99% of cases, but there are those quirky places in the blockchain that it will have to get past. Due to the complexity of all this, I will be doing a brute-force validation against bitcoind data by iterating through all the blocks, tx, vins, and vouts and making sure they all match.
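The cross-check itself is conceptually a nested loop; a sketch of its skeleton is below, where bitcoind_numtx(), bitcoind_txid() and iguana_txid() are hypothetical placeholders for fetching the same data from bitcoind (e.g. via its RPC) and from the read-only dataset. The real check would descend into every vin and vout and compare scripts, values, and sequence numbers as well:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical accessors, declared only to show the shape of the loop. */
extern int32_t bitcoind_numtx(int32_t height);
extern int32_t bitcoind_txid(uint8_t txid[32], int32_t height, int32_t txi);
extern int32_t iguana_txid(uint8_t txid[32], int32_t height, int32_t txi);

/* Walk every block and every tx, confirming the parallel-synced dataset
   reproduces bitcoind exactly; returns 0 on success, -1 on any mismatch. */
int32_t validate_chain(int32_t maxheight)
{
    uint8_t a[32], b[32];
    for (int32_t height = 0; height <= maxheight; height++)
        for (int32_t txi = 0; txi < bitcoind_numtx(height); txi++)
        {
            if (iguana_txid(a, height, txi) < 0 || bitcoind_txid(b, height, txi) < 0)
                return -1;
            if (memcmp(a, b, sizeof(a)) != 0)
            {
                printf("mismatch at height %d txi %d\n", height, txi);
                return -1;
            }
        }
    return 0;
}
```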

Even then, anything that processes that data might have bugs, but without a 100% accurate low-level dataset, it is sure to have problems. So: lossless codecness, rawbytes validation, and then on to utxo optimizations. I might take a little detour and add in realtime signature validation, but it seems the best place to do it is in the third pass, when the utxo data is being generated. There are gaps in time as the serial sweeper builds the blockchain and finds missing blocks, so I think a lot of the sig validation can be squeezed in during this time.
 

Inca

Moderator
Staff member
Aug 28, 2015
517
1,679
I have written a blockchain parser in the past, and it is vital to endeavour to achieve 100% accuracy in this process. I remember there being oddities in the chain, so good luck :)
 

jl777

Active Member
Feb 26, 2016
279
345
MGW has been very stable for a while now, and I started the first version almost 2 years ago, so it is a pain to get working, but it is definitely possible.

The original MGW was totally rewritten to improve it.

Then that was rewritten totally to improve it more.

Then that was rewritten again.

Finally, from that codebase, the iguana bitcoin core was created. So it is about the fifth iteration, each time getting significant improvements, but for iguana I am making some speed tradeoffs, as being able to calculate a rich list in 4 seconds is not really a mainstream usecase and it isn't worth the extra data that is needed.
 
