vout encoding


Feb 26, 2016
struct iguana_unspent
{
    uint64_t value;
    uint32_t txidind,pkind,prevunspentind,scriptoffset;
    uint16_t hdrsi:14,type:4,vout:14;
} __attribute__((packed));

The vout encoding is 10x simpler than the vin encoding. Most (literally 99%+) of vout scripts currently fall into standard patterns and don't require any extra data to be encoded beyond the fixed structure above.

txidind/vout/value are the usual txid/vout/value, with the txid encoded as a 32-bit index. The txidind isn't strictly needed, as each vout can be determined by its offset from the transaction, but it helps with error checking and makes it possible to know instantly, without any searching, which txid an unspent is part of. I am also relying on the compression stage to squeeze out the redundancy of duplicating the txid (once per vout) and all the duplicated vouts for the low integers.

The scriptoffset is usually zero, as the type field covers most cases, but if a script doesn't verify when expanded from its type, it is simply appended to the script heap and its offset is stored here.

hdrsi is the id of the bundle this unspent is in. Again, this helps with error checking, and the massive redundancy will be removed by the compression step.

pkind is the (hopefully) canonical ordering of the address that the output script pays to. What this means is that a reused address using a standard script is encoded entirely in the above structure of 28 bytes, 12 of which are quite compressible.

prevunspentind is the most powerful field in this structure, and it took me a while to realize it was even possible for an append-only structure to maintain a linked list. I used to have it be a forward linked list, but that has the problem that you need to update an entry whenever a new unspent links to it. With append-only (think streaming as fast as the HDD goes, no seeking) you can't go back and change anything already written. Append-only DBs usually handle this by loading everything into memory, rearranging things, creating indexes, etc., which for a general-purpose DB that doesn't know what requests will be made of it means settling for a universally decent set of indexes.

I used to have a non-readonly data field for maintaining the linked list status, but when I stumbled on doing it backwards, I was able to update the linked list for each address during the same pass where the unspent info is created. When you have literally hundreds of millions of these, it is quite important to do it as efficiently as possible.

iguana builds the directly usable in-memory data structures with minimal reprocessing. The one big reprocessing step is that the block-based files (which are in a similar but bulkier format) are recombined into a bundle file, recalculating all the fields. However, with each bundle file having many more external references resolved, and the numbering changed for many of the fields, it was simply faster to fully reprocess. Sometimes the obvious way is best.

So, unlike the vin processing, the vout processing can complete in the second pass, and it has no external references. In less space than the standard scripts themselves, it encodes the txid it is part of, the bundle it is in, the address it pays, the location of any non-standard script, and even a linked list of all vouts that pay that address. Knowing where the linked list ends for an address in a bundle, finding all vouts that pay that address is a direct traversal of the linked list via prevunspentind.

And the 28 bytes will likely compress to less than 14. I have no idea how the current bitcoind sets things up (maybe it is all a blackbox inside the DB, so I can't post any comparative data), but not only is the iguana dataset smaller and all in one place, it doesn't have to rely on an external DB to process requests. If the data is marked read-only, the CPU and OS can optimize how it is accessed, and this is how the order-of-magnitude speed improvement is achieved.

Similar levels of optimization are applied throughout the entire blockchain sync process, and the end result is a block-explorer-level data set that takes half the size and syncs in much less time.