debugging status

jl777

Active Member
Feb 26, 2016
279
345
Getting the code to work with all blocks makes for a pretty good set of test vectors. Currently I have the script encoding mostly working for the second pass, and working with malloc for the first pass. I used to have a fixed space big enough for most scripts, with an overflow file to deal with the few percent of cases that did not fit.

The problem is even a few percent of cases is a lot of cases when there are hundreds of millions of scripts!

So it created a performance bottleneck and I decided to switch to memory allocation. And since I need to use memory allocation, I might as well do it for all cases, even the small ones. It is easier to debug things with a known working malloc first. Then I will change it to use a single large memory buffer for all the mallocs. That not only speeds things up a lot by eliminating system calls, it also makes freeing the memory costless.
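
A minimal sketch of the "single large memory buffer" idea, written in Python purely for illustration. The class name `BumpArena`, the 8-byte alignment and the 1 MB size are assumptions made for the example, not details of the actual memalloc. Every allocation is just an offset bump inside one preallocated buffer, and "freeing" is a single reset of that offset.

```python
class BumpArena:
    """Hands out slices of one preallocated buffer (hypothetical helper)."""

    def __init__(self, size: int) -> None:
        self.buf = bytearray(size)  # the one large allocation, made up front
        self.used = 0               # offset of the next free byte

    def alloc(self, nbytes: int, align: int = 8) -> memoryview:
        # Round the start up to the alignment, then bump the offset.
        start = (self.used + align - 1) // align * align
        end = start + nbytes
        if end > len(self.buf):
            raise MemoryError("arena exhausted")
        self.used = end
        return memoryview(self.buf)[start:end]

    def reset(self) -> None:
        # Releases every allocation at once; no per-object free is needed.
        self.used = 0


arena = BumpArena(1 << 20)
script = arena.alloc(25)        # space for one 25-byte script
script[:3] = b"\x76\xa9\x14"    # write into the arena through the view
other = arena.alloc(100)
arena.reset()                   # everything is "freed" in one step
```

The tradeoff is that individual allocations cannot be returned early, which is fine when everything allocated for a unit of work is discarded together.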

It is doing better on the speed side, but I have to exclude about 5% of bundles as they have scripts that confuse things. Still, it is working 99%+ of the time, so just need to track these cases down and change the malloc to use my memalloc.

N[201] Q.140 h.401663 r.170426 c.0:112000:0 s.170439 d.57 E.56:171050 M.401662 L.401663 est.8 262.8MB 0:04:20 4.097 peers.42/133 Q.(19822 60)

The above is a status line after 4 minutes 20 seconds. It got all 201 headers, sync'ed and saved 170439 blocks, saved 57 bundles and validated 401662 blockheaders, so it has the hashes for the entire blockchain. It is sustaining around 40 megabytes/sec, which is half the max speed, but since the bandwidth wasn't the bottleneck, I shifted the processing around to create more CPU time for the final signature validation. It used about 300% CPU during the above, but it is still mostly the early blockchain, so there were not many tx per block in those days.
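
For anyone who wants to track these numbers over time, here is a small parsing sketch. It only extracts the fields explained above (`N[...]` headers, `s.` blocks saved, `d.` bundles saved, `M.` blockheaders validated) plus the elapsed time. The function name and the output keys are made up for the example, and the meaning of the other fields is deliberately left alone.

```python
import re

STATUS = ("N[201] Q.140 h.401663 r.170426 c.0:112000:0 s.170439 d.57 "
          "E.56:171050 M.401662 L.401663 est.8 262.8MB 0:04:20 4.097 "
          "peers.42/133 Q.(19822 60)")


def parse_status(line: str) -> dict:
    """Extract the fields described in the post from one status line."""
    headers = re.search(r"N\[(\d+)\]", line)
    # Tokens shaped like "s.170439": letters, a dot, then digits.
    fields = dict(re.findall(r"(?<!\S)([A-Za-z]+)\.(\d+)", line))
    # Elapsed time is the standalone H:MM:SS token.
    h, m, s = re.search(r"(?<!\S)(\d+):(\d\d):(\d\d)(?!\S)", line).groups()
    return {
        "headers": int(headers.group(1)),
        "blocks_saved": int(fields["s"]),
        "bundles_saved": int(fields["d"]),
        "headers_validated": int(fields["M"]),
        "elapsed_s": int(h) * 3600 + int(m) * 60 + int(s),
    }


info = parse_status(STATUS)
print(info)
```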

The early blocks are very small, so the bandwidth needed is much lower, but the sig validation and utxo vectors need all prior blocks. Having several minutes of wasted bandwidth is probably worth being able to start the serial sweep that much sooner. I can make it use a lot more bandwidth during the entire time, but CPU usage goes up and it takes all 8 cores to process 100MB/sec. So the final tuning will be a careful balancing of the CPU allocated for sig validation and the CPU for processing the bandwidth.

I think I can dynamically change the CPU allocated to the two by estimating the time to completion for each, with a feedback mechanism to get them to finish as close to the same time as possible.
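
One way such a feedback step could look, as a sketch only: express the remaining work of each task in core-seconds (remaining items divided by the measured rate per core), then split the cores in proportion so both estimated completion times match. The function name and the example numbers are assumptions. Calling it periodically with fresh measurements is what would make it a feedback loop.

```python
def split_cores(total_cores, sig_core_seconds, net_core_seconds):
    """Divide cores so both tasks have the same estimated time to completion.

    Each *_core_seconds argument is the remaining work of that task as
    seconds on a single core. Time to completion is core_seconds / cores,
    so equal finish times mean cores proportional to remaining work.
    """
    total = sig_core_seconds + net_core_seconds
    if total <= 0:
        # Nothing left to estimate from, so fall back to an even split.
        return total_cores / 2, total_cores / 2
    sig_cores = total_cores * sig_core_seconds / total
    return sig_cores, total_cores - sig_cores


# Hypothetical measurement: sig validation has 3x the remaining work.
sig_cores, net_cores = split_cores(8, sig_core_seconds=300.0,
                                   net_core_seconds=100.0)
print(sig_cores, net_cores)
```

In practice the result would have to be rounded to whole threads and probably smoothed so the allocation does not oscillate between updates.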
 

jl777

fixed lots of little bugs, changed the malloc to an internal buffer allocation, split out the sigs into a separate file.

took a while to track down and fix some crashes, but it appears to be happy with all the blocks, so generating the entire dataset to make sure the file sizes are still ok.

The scripts in the rawfiles appear to be all 0 for some strange reason, though. I need to get that fixed before things will verify, but the first step is to get it to not crash through the entire scan.

Nope, something in the blocks from 32000 to 34000 has a nonstandard script that is confusing things. So things are 99.99% ok now, but until it is 100%, I can't go to the next step.
 

Chronos

Member
Mar 6, 2016
56
44
I enjoy reading these updates. Keep them coming.

Is your eventual plan to use a single memory buffer for all scripts, avoiding malloc completely? I wonder if the maximum amount of memory demanded by the world's most unfriendly-but-legal script would make that pretty inconvenient. On the other hand, if you need the memory for one script, you might as well reserve it for the entire process, since you'll need it eventually.
 

jl777

The memory management is quite complicated and there is no way to get all threads sharing the same memory buffer. I have already eliminated the runtime mallocs; the performance behavior is just too unpredictable when using malloc.

I have fixed memory blocks allocated for network buffer, block processing, hashtable pointers and the bundle itself. For the last, I need to do two passes and estimate the max size needed. Then the entire amount is allocated with the various data structures having a starting offset. Since all but two data structures are fixed size, I can know exactly how much space to use for them.

[FIXED SIZE STRUCTURES][HEAP grows up ...] <unused> [.... SIGNATURES STACK]

Then I iterate through the rawblock data, filling in all the fixed fields, allocating from the heap when needed and pushing onto the stack when there is a sig.
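
The layout above can be sketched as a double-ended arena: the heap grows up from the end of the fixed-size region while the signature stack grows down from the end of the buffer, and the two must never cross. This is an illustration in Python, with hypothetical names (`BundleArena`, `heap_alloc`, `push_sig`), not the actual code.

```python
class BundleArena:
    """[FIXED][HEAP grows up ...] <unused> [... SIGNATURES STACK]"""

    def __init__(self, fixed_size: int, total_size: int) -> None:
        self.buf = bytearray(total_size)
        self.heap_top = fixed_size      # heap starts right after fixed structures
        self.stack_bottom = total_size  # sig stack starts at the very end

    def heap_alloc(self, data: bytes) -> int:
        """Copy data onto the heap and return its offset."""
        start, end = self.heap_top, self.heap_top + len(data)
        if end > self.stack_bottom:
            raise MemoryError("heap would collide with the signature stack")
        self.buf[start:end] = data
        self.heap_top = end
        return start

    def push_sig(self, sig: bytes) -> int:
        """Push a signature onto the downward-growing stack, return its offset."""
        start = self.stack_bottom - len(sig)
        if start < self.heap_top:
            raise MemoryError("signature stack would collide with the heap")
        self.buf[start:self.stack_bottom] = sig
        self.stack_bottom = start
        return start

    def unused(self) -> int:
        """Size of the gap between the heap and the signature stack."""
        return self.stack_bottom - self.heap_top


bundle = BundleArena(fixed_size=16, total_size=64)
script_off = bundle.heap_alloc(b"abcd")
sig_off = bundle.push_sig(b"\x30" * 8)
print(script_off, sig_off, bundle.unused())
```

A `MemoryError` here corresponds to the "size estimate was too small" case, where the whole bundle has to be redone with a bigger buffer.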

That's the easy part...

Next step is to compact it so that there is no wasted space. When combining 2000 blocks, a lot of previously undefined references (vin's txid) get resolved, so instead of storing the rawtxid in the extra fixed space, I can use a 32bit index. Also, I err on the larger side for the HEAP size estimate, as it is just a temporary allocation: bigger just locks up mem for a while, while too small requires redoing the entire bundle.
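
A sketch of that resolution step, under the assumption that each txid created inside the bundle gets a sequential index. The function name and the tagged-tuple output are hypothetical. References that resolve inside the bundle shrink from a 32-byte txid to a 4-byte index, and the rest keep the raw txid.

```python
def resolve_vins(bundle_txids, vin_txids):
    """Replace vin txids found in this bundle by their index."""
    index_of = {txid: i for i, txid in enumerate(bundle_txids)}
    resolved = []
    for txid in vin_txids:
        i = index_of.get(txid)
        if i is not None:
            resolved.append(("index", i))   # fits in 32 bits
        else:
            resolved.append(("raw", txid))  # still unresolved, keep 32 bytes
    return resolved


# Fake 32-byte txids for the example.
txids = [bytes([n]) * 32 for n in range(4)]
external = bytes([200]) * 32
vins = [txids[2], external, txids[0]]

resolved = resolve_vins(txids, vins)
saved = sum(32 - 4 for kind, _ in resolved if kind == "index")
print(resolved[0], resolved[2], saved)
```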

Then there is the gap between the HEAP and SIGS, but instead of moving the sigs, I save them into a separate file. This allows purging the sigs just by deleting the sigs directory.

Now that it is compacted (but not compressed), I save it to HDD. Then I verify it by memory mapping it, creating pointers into the memory map for the various data structures and iterating through the bundle to make sure it is all happy. (Mostly it is now, but not quite 100% yet.)
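
A toy version of the save-then-verify step, using Python's `mmap` and `struct` modules. The two-field header (offset and length of a single section) is a made-up format for the example. The real file has many data structures, but the principle is the same: store offsets instead of pointers, so the file can be mapped read-only and used in place.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical header: offset and length of one section, both little-endian u32.
HEADER = struct.Struct("<II")


def save_bundle(path: str, payload: bytes) -> None:
    with open(path, "wb") as f:
        f.write(HEADER.pack(HEADER.size, len(payload)))
        f.write(payload)


def verify_bundle(path: str, expected: bytes) -> bool:
    """Map the file read-only and check the section the header points at."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            offset, length = HEADER.unpack_from(m, 0)
            return m[offset:offset + length] == expected


with tempfile.TemporaryDirectory() as tmpdir:
    bundle_path = os.path.join(tmpdir, "bundle.bin")
    save_bundle(bundle_path, b"compacted bundle data")
    ok = verify_bundle(bundle_path, b"compacted bundle data")
    bad = verify_bundle(bundle_path, b"something else")
print(ok, bad)
```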

Then I keep around the memory mapped pointers, but do a single free of the memory block.

The above is a simplified explanation. It basically combines aspects of linking object files and setting up runtime contexts for executables. The tricky part was figuring out how to get a data format that works totally read-only.

Oh, and the above is happening in parallel.

Then there are all the peer threads creating the inputs to the above, also all in parallel. Being able to do fast searches across the dataset as it is being built, without using a mutex, makes the memory management look easy, as that at least is limited to a single-threaded context at a time.

To get the max performance requires careful balancing of all system resources. Right now, I seem to have lost 50% of throughput and it is only getting 50MB/sec. Probably some part is taking a bit too long, causing a "traffic jam"

James
 

jl777

Slow progress, but I added stubs for the final pass calculations. There are a few things that can't be done without all the prior blocks, namely validating the vins and updating the utxo set.

It turns out you can split these final calculations into two parts, which allows doing the first half still in parallel. This part needs to calculate the vector of spends, probably a sparse vector based on the unspentinds. This requires all previous bundles to be there, but not all prior blocks' equivalent data. Also, with all the prior txids there along with the spend scripts, all sigs can be validated. Being able to do this in parallel is a big win: the less that requires the complete dataset, the faster the fourth pass (fully serial) can finish.

I think in addition to the spend vector, I can also create linked lists for the spends, along with total spends per address. Combined with the total outputs sent to each address, this basically means all balances for all addresses will be available within each bundle. The same method as for the unspents will be used, so no surprises.
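
The balance arithmetic is simple once both totals exist: per address, total outputs received minus total spends. A sketch with made-up addresses and values (the function name and the tuple inputs are assumptions). Since a spend in one bundle can consume an output from an earlier bundle, a single bundle's net figure can be negative, and the full balance is presumably the sum of the per-bundle nets.

```python
from collections import defaultdict


def bundle_net(outputs, spends):
    """Net change per address within one bundle.

    outputs and spends are iterables of (address, value) pairs.
    """
    received = defaultdict(int)
    spent = defaultdict(int)
    for addr, value in outputs:
        received[addr] += value
    for addr, value in spends:
        spent[addr] += value
    return {a: received[a] - spent[a] for a in received.keys() | spent.keys()}


# Hypothetical data for two bundles.
bundle0 = bundle_net(outputs=[("addrA", 50), ("addrB", 20)], spends=[])
bundle1 = bundle_net(outputs=[("addrB", 30)], spends=[("addrA", 50)])

balances = defaultdict(int)
for net in (bundle0, bundle1):
    for addr, delta in net.items():
        balances[addr] += delta
print(dict(balances))
```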

The nasty crash is still around. I fixed a few memory issues so it is rarer, but it is still quite annoying, as I can't get the full data set generated to get a clear idea of the total dataset size with all the latest data. I did find a java process using up a lot of RAM, which appears to have contributed to the slowdown, but things are still not where they used to be. It is kind of typical for there to be strange performance issues when things are still crashing, but it is doing a lot more than the last iteration, so maybe 50MB/sec will be the limit.

1st.2 N[201] Q.196 h.401972 r.48173 c.0:48000:0 s.199777 d.24 E.24:208586 M.401971 L.401972 est.8 439.028kb 0:04:03 90.884 peers.39/128 Q.(0 1693)

Just saw 60MB/sec and 199777 blocks in 4 minutes, so it is half the speed of before, but it feels much faster than earlier.
 

jl777

There are some really strange scripts in some parts of the blockchain. I put in a fallback handling: it tries the efficient encoding and verifies that it worked, and if not, it falls back to storing the entire script raw. Not the most efficient, but I think it is about 1% extra space wasted, so at this point I'd rather move to other areas and squeeze out the extra 1% of space later.

So the lossless encoding is guaranteed to work (assuming no bugs), since if it doesn't work, I store the whole thing. I know, not elegant, but the clock's ticking.
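
The try, verify, fall back pattern can be sketched like this. The compact form used here (a one-byte tag plus the 20-byte hash of a standard pay-to-pubkey-hash script) is only a stand-in chosen for the example, since the actual encoding is not described in this thread. All names are hypothetical.

```python
RAW, P2PKH = 0, 1  # hypothetical one-byte tags


def compact_encode(script: bytes):
    """Return a compact form for scripts this encoder understands, else None."""
    # Standard P2PKH: OP_DUP OP_HASH160 <20 bytes> OP_EQUALVERIFY OP_CHECKSIG
    if (len(script) == 25 and script[:3] == b"\x76\xa9\x14"
            and script[23:] == b"\x88\xac"):
        return bytes([P2PKH]) + script[3:23]
    return None


def decode(blob: bytes) -> bytes:
    tag, body = blob[0], blob[1:]
    if tag == P2PKH:
        return b"\x76\xa9\x14" + body + b"\x88\xac"
    return body  # RAW: the body is the script itself


def encode(script: bytes) -> bytes:
    blob = compact_encode(script)
    # Only trust the compact form if it round-trips to the exact script.
    if blob is not None and decode(blob) == script:
        return blob
    return bytes([RAW]) + script  # fallback: store the whole thing


standard = b"\x76\xa9\x14" + bytes(20) + b"\x88\xac"
strange = b"\x6a\x01\x02"
print(len(encode(standard)), len(encode(strange)))
```

The round-trip check is what makes the scheme lossless by construction: a script that the compact encoder gets wrong costs space, not correctness.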

Something is still strange though. I see the bandwidth just drop to 1MB/sec, but I see peak bandwidth at 70MB/sec+ with 59 peers

progress is slow, but eventually it will all be validated data in readonly files, so we can then distribute them via torrents for the most efficient sync and lighten the load on the peers.

Pretty sure there are a few more bugs to fix, as it is not running as smoothly as it should.
 

jl777

The slowdown appears to be due to HDD contention. I had too many helper threads and multiple bundles ended up being written at the same time. The helper threads are needed to do the processing, so I can't go too far below the number of cores. During the download phase, it is using up a full core per 10MB/sec, but since it is sustaining around 50MB/sec, after about 20 minutes there won't be much to download and then it would be best for all the cores to be busy processing. I have 8 cores on the server, so I set it to 6, which should minimize the contention problem and still get 6+ cores active, as there are many threads other than the helper threads.

The bundle for blocks 398000 came in spooky close to 4GB for the temp file. It would be an issue if it goes over the 32bit limit, so let's hope that is as large as the bundles get. Certainly I will need to go to a bit larger index when we go above 1MB/10 minutes.
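
To make the limit concrete: a 32bit offset can address at most 2^32 - 1 bytes, just under 4GB. The sketch below shows a guarded 32bit pack next to a wider variant. The 5-byte (40-bit) width is purely an assumption for the example, as the post only says "a bit larger".

```python
import struct

U32_MAX = 2**32 - 1  # largest offset a 32bit index can hold


def pack_offset(offset: int) -> bytes:
    """Pack an offset as a little-endian u32, refusing anything too large."""
    if 0 <= offset <= U32_MAX:
        return struct.pack("<I", offset)
    raise OverflowError(f"offset {offset} does not fit in 32 bits")


def pack_offset_wide(offset: int) -> bytes:
    """Hypothetical wider index: 5 bytes, good for offsets up to 2**40 - 1."""
    return offset.to_bytes(5, "little")


print(pack_offset(U32_MAX).hex())
print(pack_offset_wide(2**32).hex())
```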

A bit of bad news on the file sizes. I don't have a complete set yet, but it looks like uncompressed might be pushing 30GB with all the indexes, bloom filters and hash tables, and that is without counting the sigs files. At least the 30GB will compress down to less than 20GB, but still, it is due to the raw "encoding" used for the scripts. I think that might be using up more than 1% extra...
 

jl777

1st.159 N[202] Q.41 h.402137 r.336895 c.0:262000:0 s.400755 d.160 E.131:243336 M.402136 L.402137 est.8 3.32GB 0:24:30 19.752 peers.77/256 Q.(0 0)

It crashed once after 10 minutes, but in a combined 34 minutes it downloaded the entire blockchain, and now it needs to get a few more stragglers to be able to generate all the bundles. It seems something is stuck, though, as it isn't using all the cores.

ramchaindata have 151:1635 at 66528617 | 2529 blocks 613.8MB redundant xfers total 54.58GB 1.10% wasted

Very few redundant packets received, so that part is good.
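
As a sanity check on that log line, the wasted percentage is just redundant bytes over total bytes. The sketch assumes binary units (1GB = 1024MB), which is a guess about how the log computes it.

```python
def wasted_percent(redundant_mb: float, total_gb: float) -> float:
    """Redundant transfers as a percentage of everything received."""
    return 100.0 * redundant_mb / (total_gb * 1024.0)  # assumes 1GB = 1024MB


print(f"{wasted_percent(613.8, 54.58):.2f}% wasted")
```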
 

jl777

It has taken a lot of test versions to get it to work from a fresh state all the way to realtime sync. I optimized many parts, most importantly getting the average time for finding a txid down to 5 microseconds.

Added options to manage the parallel sync, so depending on whether you have an SSD or not, lots of RAM or not, it should be possible to get a non-thrashing parallel sync. Thrashing is the enemy of performance: instead of taking a few microseconds, an operation can take 10,000 times longer!

I made a new way to fetch blocks that allows for blasting out requests without getting many duplicate responses. For a while I was getting 30%+ duplicates, which doesn't sound so bad until you realize that means 20GB of extra data to transfer.

It is a tradeoff between waiting too long leading to things idling, versus requesting duplicate data, leading to wasted bandwidth and time.
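
A bare-bones sketch of that tradeoff, not the actual fetch logic: a block is only requested again once the previous request has been outstanding longer than a timeout. A short timeout means less idling but more duplicates, and a long one means the reverse. The class name, method names and the timeout value are assumptions.

```python
class BlockRequester:
    """Tracks outstanding block requests to limit duplicate responses."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.sent_at = {}      # block hash -> time of the latest request
        self.received = set()
        self.duplicates = 0

    def should_request(self, block_hash: str, now: float) -> bool:
        """True if a request should go out now (and records that it did)."""
        if block_hash in self.received:
            return False
        last = self.sent_at.get(block_hash)
        if last is not None and now - last < self.timeout_s:
            return False       # still waiting on the earlier request
        self.sent_at[block_hash] = now
        return True

    def on_block(self, block_hash: str) -> None:
        if block_hash in self.received:
            self.duplicates += 1   # wasted bandwidth
        self.received.add(block_hash)


req = BlockRequester(timeout_s=5.0)
first = req.should_request("blockA", now=0.0)   # first request goes out
early = req.should_request("blockA", now=2.0)   # too soon to re-request
retry = req.should_request("blockA", now=6.0)   # timed out, ask another peer
req.on_block("blockA")
req.on_block("blockA")                          # both responses arrived
print(first, early, retry, req.duplicates)
```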

Literally over a thousand builds later I think I finally have a decent solution that works across a large range of system parameters. The following is on a 32GB RAM 8 core server with normal HDD and a 1gbps connection. I think anything below 8GB of RAM will end up thrashing.

Getting the same code to work at the same time with BTC and BTCD, on both an old laptop and a fast VPS server, has been quite a challenge. These are pretty different cases, and the hard part was making all permutations work decently without having any of the other cases degenerate into horrible performance.

Time        eth0
HH:MM:SS    KB/s in    KB/s out
11:47:30    44213.75   415.87
11:48:30    54005.14   484.70
11:49:30    53039.15   489.42
11:50:30    56694.01   505.40
11:51:30    58557.15   496.95
11:52:30    51717.42   538.93
11:53:30    53661.14   518.72
11:54:30    50690.58   502.82
11:55:30    59350.67   559.82
11:56:30    49619.99   457.01

BTC.RT0 u.0 b.0 v.0/0 (1294+1294/2000 1st.10).s0 to 181 N[203] h.404001 r.148000 c.305420 s.305404 d.74 E.74 maxB.100 peers.245/256 Q.(38435 0) L.405518 [8:1992] M.17992 000000009a21bcc1625313a04350e9c5320ec94cb846d900b06dccb793ac4913 bQ.194 0:10:30 stuck.0 max.0

BTC.RT0 u.0 b.0 v.0/0 (365+362/2000 1st.182).s0 to 147 N[203] h.405523 r.256000 c.372392 s.372389 d.128 E.128 maxB.12 peers.245/256 Q.(4505 0) L.405523 [202:1522] M.405522 00000000000000000310996fb950b10e8a3db9fb8a30a28242316bbe3ea7d3d7 bQ.194 0:20:01 stuck.1 max.37
 

jl777

On a high end server, there are no bottlenecks:

BTC.RT405815 u.202 b.202/202 v.202/404 (1815+1815/1815 1st.202).s0 to 202 N[203] h.405815 r.405815 c.405815 s.405815 d.0 E.202 maxB.32 peers.56/32 Q.(0 0) L.405815 [202:1814] M.405814 000000000000000002c2e6d28b1bdb5224705d5e39055de261edaa86a7e16be5 bQ.1 0:23:27 stuck.0 max.47

24 minutes from start to finish, all the way to the realtime block, fully synced. I am not verifying all the sigs, though, only checking that the latest blockhash matches all the peers.
 