## Summary
I played with those numbers a bit locally and `sample_size=3,
sample_count=8` seemed like a rather stable setup. This means a single
sample consistents of 3 iterations of checking pydantic multithreaded.
And this is repeated 8 times for statistics. A single check took ~300 ms
previously on the runners, so this should only take 7 s.
The benchmark is currently very noisy (± 10%). This leads to codspeed
reports on PRs, because we often exceed the trigger threshold. This is
confusing to ty contributors who are not aware about the flakiness.
Let's disable it for now.