I got the final piece of the puzzle figured out. As I posted elsewhere, the vast majority of blockchain data is permanent and never changing, especially as limited within the scope of a bundle.
The first pass is a special case of a single block
The second pass combines 2000 blocks into a single bundle file, about 20% of vins reference external txids, so they are not resolvable until all prior blocks have the second pass data.
The third pass data creates a fixed vector of all the external vins, basically the (hdrsi, unspentind) of where the txid/vout is. This data is also invariant, it just has to wait for all prior bundles to have the second pass data.
Since the third pass data takes a bit of time to calculate, I generate all of them in parallel with the constraint that each needs to have all the prior level 2 bundles. But there is no need to process them serially or even one at a time. The less restrictions there are, the easier it is to fit into the system's overall data flow. This is now being generated.
After the third pass, there is now all the fixed data, and currently this data compresses about 2:1 from 36GB to 17GB. There is also about 35GB of vindata (I can probably reduce the size of this by 30% to 40%), but this is not needed after validation, unless it is a full relaying node. In that case having the HDD space shouldnt be an issue.
Now we are ready to generate the fourth pass data. This actually doesnt need to be done in order, but there is an atomic requirement for the updates, so it is simpler to just process it serially as there is not a lot of calculations that are needed. Just mostly iterating through the spend vectors, finding the txdata and then the unspent data.
Using the same method for the unspents, a linked list is created within each bundle, but since the order is random as to which future bundle is processed, each traversal will be in a different order. However after the same set of bundle's spend vectors are applied, the same result is created. Each unspent gets a bit for spent status and the previous spend's index (within the bundle). this allows totals for any specific address to be calculated like it can be for the outputs. This data I just call Uextra as it is the extra data needed for each unspent and it takes 32bits each. It has the property of writeonce, so that is why the order does not matter and double spends are detected.
So is there any dynamic data at all?
It turns out there is just one, the total amount each address has spent and the last spend as stored in the Uextra[] array. So within each bundle, all the unspents that are spent are organized by addresses into a reverse linked list in a somewhat random order.
if ( (spentbp->ramchain.Uextras[unspentind] & (1 << 31)) == 0 )
{
spentbp->ramchain.Uextras[unspentind] |= (A2[pkind].lastind & 0x7fffffff);
spentbp->ramchain.Uextras[unspentind] |= (1 << 31);
A2[pkind].total += u->value;
A2[pkind].lastind = spendind;
}
The above is all the code that is needed to update the account debits data. It checks to make sure it is still unspent and if it is, it updates the linked list entry and also the account balance.
Still debugging this, but it is almost working in a standalone fashion. After I get it woven into the realtime dataflow, then all spends for each address will be calculable in parallel at the sub millisecond speed.
Now I just need a way to compare all address balances at a specific bundle boundary to validate all this data. At that point, maintaining a realtime bundle and combining search results from the parallel dataset with the realtime bundle.
But still need to get the various cases like resume to work, but once unspents are quick to calculate for all address, well, basically blockexplorer level data is available at millisecond speeds for all addresses.
James
P.S. I still need to do the sig validation, but that is not needed for all the various tx creations
The first pass is a special case of a single block
The second pass combines 2000 blocks into a single bundle file, about 20% of vins reference external txids, so they are not resolvable until all prior blocks have the second pass data.
The third pass data creates a fixed vector of all the external vins, basically the (hdrsi, unspentind) of where the txid/vout is. This data is also invariant, it just has to wait for all prior bundles to have the second pass data.
Since the third pass data takes a bit of time to calculate, I generate all of them in parallel with the constraint that each needs to have all the prior level 2 bundles. But there is no need to process them serially or even one at a time. The less restrictions there are, the easier it is to fit into the system's overall data flow. This is now being generated.
After the third pass, there is now all the fixed data, and currently this data compresses about 2:1 from 36GB to 17GB. There is also about 35GB of vindata (I can probably reduce the size of this by 30% to 40%), but this is not needed after validation, unless it is a full relaying node. In that case having the HDD space shouldnt be an issue.
Now we are ready to generate the fourth pass data. This actually doesnt need to be done in order, but there is an atomic requirement for the updates, so it is simpler to just process it serially as there is not a lot of calculations that are needed. Just mostly iterating through the spend vectors, finding the txdata and then the unspent data.
Using the same method for the unspents, a linked list is created within each bundle, but since the order is random as to which future bundle is processed, each traversal will be in a different order. However after the same set of bundle's spend vectors are applied, the same result is created. Each unspent gets a bit for spent status and the previous spend's index (within the bundle). this allows totals for any specific address to be calculated like it can be for the outputs. This data I just call Uextra as it is the extra data needed for each unspent and it takes 32bits each. It has the property of writeonce, so that is why the order does not matter and double spends are detected.
So is there any dynamic data at all?
It turns out there is just one, the total amount each address has spent and the last spend as stored in the Uextra[] array. So within each bundle, all the unspents that are spent are organized by addresses into a reverse linked list in a somewhat random order.
if ( (spentbp->ramchain.Uextras[unspentind] & (1 << 31)) == 0 )
{
spentbp->ramchain.Uextras[unspentind] |= (A2[pkind].lastind & 0x7fffffff);
spentbp->ramchain.Uextras[unspentind] |= (1 << 31);
A2[pkind].total += u->value;
A2[pkind].lastind = spendind;
}
The above is all the code that is needed to update the account debits data. It checks to make sure it is still unspent and if it is, it updates the linked list entry and also the account balance.
Still debugging this, but it is almost working in a standalone fashion. After I get it woven into the realtime dataflow, then all spends for each address will be calculable in parallel at the sub millisecond speed.
Now I just need a way to compare all address balances at a specific bundle boundary to validate all this data. At that point, maintaining a realtime bundle and combining search results from the parallel dataset with the realtime bundle.
But still need to get the various cases like resume to work, but once unspents are quick to calculate for all address, well, basically blockexplorer level data is available at millisecond speeds for all addresses.
James
P.S. I still need to do the sig validation, but that is not needed for all the various tx creations