@hfinger We've been noticing that for a while now. The cause, as you've pointed out, is the very large mempools we now have due to the backlog of transactions. When the mempools normalize we won't see this issue, OR we can increase the size of the bloom filter. It's a trade-off. I've been working on another approach, which is to have better mempool limiting in place; it does two things: it keeps the bloom filters small, and it reduces bandwidth use. It's somewhat analogous to what you described in not adding old tx's to the filter... what about not adding tx's that are not necessarily old, but probably not going to be mined? (A rough sketch of what that admission check might look like is below.) The problem with old tx's is that they can and do get mined as they sit around and their priority rises over time.
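To make that concrete, here's a minimal sketch of such an admission policy, assuming a fee-rate cutoff derived from recently mined blocks. Every name here is hypothetical; none of this is existing BU code:

```cpp
// Sketch: skip adding a tx to the xthin bloom filter when its fee rate
// is below what recent blocks have actually been clearing at, since
// such a tx is unlikely to appear in the next block anyway.
#include <algorithm>
#include <cstdint>
#include <vector>

// Fee rate in satoshis per kB, computed from a tx's fee and size.
static int64_t FeeRatePerKB(int64_t feeSatoshis, size_t txBytes)
{
    return txBytes == 0 ? 0 : feeSatoshis * 1000 / static_cast<int64_t>(txBytes);
}

// Hypothetical estimator: the lowest fee rate seen in the last few
// mined blocks is a rough floor for "likely to be mined soon".
static int64_t EstimateMinMinedRate(const std::vector<int64_t>& recentBlockMinRates)
{
    if (recentBlockMinRates.empty())
        return 0; // no data: admit everything, same as today
    return *std::min_element(recentBlockMinRates.begin(), recentBlockMinRates.end());
}

// Only txs at or above the cutoff get added to the filter.
bool ShouldAddToThinBlockFilter(int64_t feeSatoshis, size_t txBytes,
                                const std::vector<int64_t>& recentBlockMinRates)
{
    return FeeRatePerKB(feeSatoshis, txBytes) >= EstimateMinMinedRate(recentBlockMinRates);
}
```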
Here are some thoughts:
1) Under normal conditions we don't see these re-requests. If the block size increases, this problem goes away.
2) It's important to keep the bloom filters small, since they themselves add to overall bandwidth if allowed to grow without bound. (The size vs. false-positive math is sketched after this list.)
3) Re-requests are generally pretty fast and don't add a great deal to overall transmission time. Saving bandwidth is great, but overall transmission time is the main goal; being able to transmit a block of any size quickly over the network is, in my view, the key to scalability. Another tenth of a second to a second isn't a big deal for p2p, but it would be for the miners.
4) Node operators who want minimal re-requests and speed can either run with a smaller mempool (100MB is the minimum) or with a very large mempool (1GB or more). Running with the default 300MB mempool is not optimal with all these tx's out there; I think we're discovering that now. The smaller mempool will have bloom filters with fewer false positives, but the xthins will be slightly larger. On the other hand, if you run with a 1GB pool or larger and keep your tx's for longer (9 days?), then you shouldn't be missing tx's. (I haven't tried this approach; if someone could verify it, that would be helpful. The default setting for eviction is 3 days, so that should be increased along with the mempool size. An example config follows the list.)
5) One other thing we could use is an eviction cache. If the mempool is very large and tx's are often getting evicted, then while we're waiting for an xthin we've requested, many tx's could get evicted and won't be there when the xthin arrives, causing a re-request. I originally had this in place, but it was slow and not well designed, so I took it out for the time being. Again, it's something we won't ever need under normal conditions, but depending on mempool size it could occasionally be what we're seeing here. Is it worth putting in something we'll never need if block size increases, but that has to be invoked for every single tx that enters the pool? I'm not sure, but it's something I had planned to work on for the next release. (A rough sketch follows the list as well.)
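On point 2, the standard bloom filter math shows the trade-off directly: holding n mempool txs at false-positive rate p takes roughly m = -n·ln(p)/(ln 2)² bits, so the filter grows linearly with mempool size if you want to keep re-requests down. A quick back-of-the-envelope calculator (a sketch with illustrative numbers, not BU code):

```cpp
// Sketch: standard bloom filter sizing, showing how filter size scales
// with the number of mempool txs at a target false-positive rate.
#include <cmath>
#include <cstdio>
#include <initializer_list>

// Optimal filter size in bits for n elements at false-positive rate p:
//   m = -n * ln(p) / (ln 2)^2
static double OptimalFilterBits(double n, double p)
{
    return -n * std::log(p) / (std::log(2.0) * std::log(2.0));
}

int main()
{
    // Illustrative tx counts only; a backed-up mempool can hold on the
    // order of a million txs.
    for (double n : {100000.0, 1000000.0}) {
        for (double p : {0.01, 0.001}) {
            std::printf("n=%8.0f txs, p=%.3f -> %6.0f KB filter\n",
                        n, p, OptimalFilterBits(n, p) / 8.0 / 1024.0);
        }
    }
    return 0;
}
```

At a million txs, even a 0.1% false-positive rate already means a filter approaching 2MB, which is why letting it grow without bound eats the bandwidth the xthin was supposed to save.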
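On point 4, assuming your node supports the Core-style mempool options (check your version's help output; names and units may differ), the "big pool, long expiry" setup would look something like this in bitcoin.conf:

```
# Hypothetical bitcoin.conf sketch: keep a large mempool and hold txs
# longer, so the txs an xthin references are still present on arrival.
maxmempool=1000      # mempool size in MB (default 300)
mempoolexpiry=216    # keep unmined txs for 9 days, in hours (default 72 = 3 days)
```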
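And on point 5, the eviction cache might look roughly like this: a small, bounded map of recently evicted txs that block reconstruction checks before falling back to a re-request. This is a sketch with hypothetical names, not the implementation I removed:

```cpp
// Sketch of an eviction cache: a bounded FIFO map of recently evicted
// txs, consulted during xthin reconstruction before re-requesting.
#include <cstddef>
#include <deque>
#include <string>
#include <unordered_map>

using TxId = std::string;   // stand-in for uint256
using TxData = std::string; // stand-in for a serialized transaction

class EvictionCache {
public:
    explicit EvictionCache(size_t maxEntries) : maxEntries_(maxEntries) {}

    // Called whenever the mempool evicts a tx; the oldest entries fall
    // out first so the cache stays bounded. This is the per-eviction
    // cost the whole idea has to justify.
    void Insert(const TxId& id, TxData tx)
    {
        if (maxEntries_ == 0 || cache_.count(id)) return;
        if (order_.size() >= maxEntries_) {
            cache_.erase(order_.front());
            order_.pop_front();
        }
        order_.push_back(id);
        cache_.emplace(id, std::move(tx));
    }

    // Called during xthin reconstruction when a tx is missing from the
    // mempool; a hit here avoids a round-trip re-request.
    const TxData* Lookup(const TxId& id) const
    {
        auto it = cache_.find(id);
        return it == cache_.end() ? nullptr : &it->second;
    }

private:
    size_t maxEntries_;
    std::deque<TxId> order_;
    std::unordered_map<TxId, TxData> cache_;
};
```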
As for what to do right now: I don't think this is an emergency. Nothing is breaking; things are just being slowed down under adverse conditions. We're still doing very well, but obviously we can do more optimizing.
Thanks again for such a good analysis; that's exactly the kind of work we need!
Join us on Slack if you can; we're usually there in the mornings. It's a small group right now, but a good one.