Why do people believe this is a slam dunk? Dynamic dispatch is 5x slower, with 0.70 IPC and a 12.81% branch-miss rate, versus the switch version's near-max IPC and 0.02% branch-miss rate on my i9-10940X. I'm truly baffled.
Some folks thought I should add a second derived class and dispatch between the two of them randomly. That actually made a difference. Now the dynamic dispatch takes a bit more than 4ns longer.
Between clients & writing my sections for the upcoming 2nd edition of "Performance Analysis & Tuning on Modern CPUs", I've had little time to post new articles. Here's one to close out 2023:
Fruitful (and civil) discussion between @cmuratori and @unclebobmartin (the author of the book "Clean Code") regarding the former's recent "Clean Code, Horrible Performance" video:
If you're a Perf Engineer who works in the nano-to-microsecond timescale, then there's no getting around needing to know Assembly. Here's a two-part series on it from a Security guy (see a theme forming here?)
Today, May 31st, marks 11 years since I began spookin' HFT firms away from using the first core of *any* CPU in a server, not just that of the first CPU (i.e., core 0), for latency-sensitive threads😊
"Overeager techies wanna tinker with magic knobs & secret tunables hidden behind names with leading underscores. They want the *tricks* of the trade w/o first understanding the trade itself. But that’s not Performance Engineering. . ."
Four tools have made a notable impact on my performance consulting business: FTrace, Coz, eBPF, and perf c2c. My crystal ball tells me that this new tool feature will provide the fifth!🔥
Intel cuts no. of dies in half, crams in more cores, cranks up DDR5 & UPI bandwidth, and dramatically increases shared LLC size (2.84x) all while taking up less space in Emerald Rapids when compared to Sapphire Rapids🤔
Colleagues think I'm an expert w/the perf tool. HA! I can't tell you how often I've stumbled upon new perf functionality that I had no idea existed (for YEARS in some cases). For example, check out this one:
QUICK TIP: I've noticed that many of my colleagues reach for "strace" for tracking syscalls in code. Do yourself a favor & use "perf trace" instead. All the same bells & whistles but FAR less overhead. And now back to your regularly scheduled programming. . .
PERFORMANCE TIP: If you publish microbenchmarks for public consumption, you'll do yourself & everyone else a big favor by submitting via established microbenchmark frameworks (e.g., Google Benchmark, JMH, etc.). These help avoid common pitfalls👍🏽
@wil_da_beast630
It will be Roland Fryer once he emerges from the hatchet job perpetrated by Harvard et al. The documentary about the whole debacle (which Glenn Loury is involved with) may help expedite the process.
Dmytro Vyazelenko from Aeron recently gave a presentation on designing for low latency. He made it a point to specify that "in preparation for this presentation, no LLM was used":
Faster hash maps, binary trees etc. through data layout modification
We investigate how to make faster hash maps, trees, linked lists, and vectors of pointers by changing their data layout.
On my JabPerf blog, I've written brief explanations about DRAM internals only as an intro into larger topics regarding crafting low latency software. But this Cloudflare article does a proper Deep Dive on DRAM organization:
This article cogently supports my firm belief that mastery of any one tool does *not* an expert Perf Engineer make. Fantastic breakdown of pitfalls & rules-of-thumb for perf analysis👍🏽
Always Measure One Level Deeper
Can't recall the last time I debugged a Linux kernel networking issue since most of my IT life has revolved around kernel bypass stacks & libibverbs. Still, can't help sharing this article from the always stellar Cloudflare Blog:
I don't typically post Job Openings here. But several of my PerfEng brethren have asked about breaking into the HFT industry, where pay is phenomenal & the security is much better than elsewhere in the IT industry. Check it out:
We often discuss TLB Miss Penalty of big, randomly-accessed working sets yet seldom do I see mention of the increased Cache Pollution (L1d thru L3 on Intel/L2 thru L3 on AMD) via HW Page Walkers traversing the Page Table in such cases. Adds insult to injury.
@PeterVeentjer
AMD uProf measures memory bandwidth utilization for AMD CPUs, while Intel PCM, intel-cmt-cat, and pmu-tools all do the same for Intel CPUs (I use the latter to track per-socket memory bandwidth into Grafana via the intel_rdt Telegraf plugin).
Good paper on dealing w/systems perf variability in benchmarking, why never to assume normally distributed results, how to test for normality, and how to use nonparametric tests (and why I'm not crazy for always rebooting btwn tests)😊
Firstly, Perf Engineers who want to get a better grasp of statistical concepts in digestible posts should follow this guy. Secondly, I find that few ppl talk about #6 - i.e., the fact that it's not only small sample sizes that can trip you up. Large ones can, too.
Common Statistics Mistakes in Published Papers - A Cautionary Tale 📊📝
1/ Publishing a paper is a monumental task, but it’s essential to get the statistics right. Let's dive into some frequent statistical missteps researchers make and how to avoid them!
2/ Misunderstanding
DB & IO EXPERTS: On Linux w/NVME storage, which multi-queue I/O Scheduler do you prefer for optimal performance? Consensus from a quick online search seems to lean toward "none" but I trust my carefully curated list of X Follows much more than Google😉
Shout out to our very own @fleming_matt for founding his new company Nyrkiö for the purpose of bringing a supported Change Point Detection solution to the masses:
JAVA PERF PEEPS: There's a new book out from Oracle Press entitled "JVM Performance Engineering" by Monica Beckwith. I'm interested in your thoughts on its contents. Feel free to comment here or via DM, wherever you're more comfortable.
Email from P99Conf '23:
"Since we expect a large spike of logins, be sure to be among the 1st who login to the keynote sessions. Be ready w/your wireshark, gdb, docker & bpftrace." I *told* you Wireshark's an important perf tool😊
This comical refusal by Uncle Bob Martin to stop & reconsider his hardline dynamic dispatch stance illustrates a common pitfall of those hailed as experts - "Expert Ego" sets in. Here are tips to avoid it:
Thorough article on the decision to enable frame pointers for Fedora, as well as a nice breakdown on the pros & cons of various stack unwinding methods, some of which are tried-and-true & others which are on the horizon:
EVERY man over 30 should get a full battery of blood tests at each annual physical. Even if you're not playing a sport. Make sure it includes Lipid, Total/Free Testosterone, DHEA, ApoB, HbA1c, Fasted Insulin/Glucose, LH/FSH/TSH/T4/T3, CRP, etc. Trust me👍🏽
@i_bogosavljevic
No, you're not hallucinating. It's part of the Intel Resource Director family of features called "Cache Pseudo Locking". The Linux kernel supports it, as well. I've never used it personally because the HFT game is won or lost in the L1d anyway🤷🏾♂️
One of the better, if not best, IT Performance Conferences based on how practical and immediately useful all the talks are. BONUS: You don't even have to book a Business Class seat to attend (it's virtual) 😉
“An amateur can be satisfied with knowing a fact; a professional must know the reason why. An amateur practices until he can do a thing right, a professional until he can’t do it wrong.”
I've noticed that clients find it tough to wrap their heads around the fact that splitting work btwn threads not only has diminishing returns but also the potential for *regressive* behavior. This is why we explain the USL in both of @dendibakh's perf book editions👍🏽
In yet another example of the symbiotic relationship between Perf & Sec Engineering, here's a new paper reverse engineering Intel's L1/L2 TLB, providing operation details you won't find in the Intel Arch SW Dev Manuals:
@lemire
This mirrors Myths #3 & #5 from my article: Sampling profilers can mislead, and mastering any one tool (e.g., perf or VTune or uProf) won't magically confer perf analysis expertise. Somehow that ruffled feathers on X:
PERF PEEPS: I value your expertise & camaraderie. I'd like it to last as long as possible. So plz do me a favor: Each weekday morning drop & do 30 pushups/30 squats. At some point each weekday do 30mins on the treadmill/bike at a moderate pace. That's all🙏
@GergelyOrosz
I won't lie - I used to *love* non-competes. In the HFT industry, you'd get paid between 75 - 100% of your salary for the entire duration, which typically lasts btwn 6 months and 1 year. I took some of my most memorable vacations during mine.
Continuous Profilers (the 4th Pillar of Observability) will become even more important as general purpose CPUs gain more ground from special-purpose co-processors in the Cloud AI/ML space:
There's the geeky side of me that loves each new advancement in x86-64, ARM, RISC-V, and specialized co-processors. But then there's my *other* side that wishes it all stagnates a bit (~5yrs or so) to force devs to embrace Mechanical Sympathy🤷🏾♂️
Oh, for the love of All That Is Holy & Highly Performant, would someone w/better Google-Fu than me plz tell me where I might find this tool? A system-wide causal profiler that employs virtual speedups against cores instead of threads🤯
Everyone dances in the streets any time an x86 CPU gets a bigger LLC. I felt like I was dancing alone on an abandoned street w/o any music besides what played in my head when the first one w/48KB L1d arrived. The game is won or lost in the L1d for me🤷🏾♂️
Very interesting performance investigation! I won't tell you much more so that I won't give it away. I'll just say that you shouldn't let the title make you jump to conclusions about possible culprits😉
Has anyone else managed to put Kanye West, Taylor Swift, the MTV Music Video Awards, and Linux FTrace all in the same article? You know what? Never mind. I'm just gonna take the credit anyway:
I know I've endorsed P99 Conf several times before. But for any of you still on the fence, lemme tell ya that I just noticed @trav_downs on the agenda (not sure I've ever seen him present before). Register now:
@axelgneiting
HFT apps eschew runtime allocs not just cuz of alloc cost (or the more expensive dealloc cost) but the minor pg fault cost (100s of ns). Allocators that prealloc/pin memory upfront (and prevent TLB Shootdowns by not unmapping on dealloc) avoid this.
Since RAM latency's only growing w/DRAM chip size & each DDR release, I see CPU DCA support as a no-brainer. Yet AMD *still* doesn't offer it, while Intel has DDIO & ARM has DynamIQ Cache Stashing. Are there any perf writeups available for the latter?
@majek04
Yep, this is a nice writeup on DRAM Refresh (I touch on it briefly at ). Interestingly, LPDDR uses per-bank refresh to allow concurrent R/W access during refresh. But that's typically only used in mobile platforms🤷🏾♂️
@disruptnhandlr
And therein lies the problem w/its applicability for HFT: the functions we want optimized *are* the cold, rarely executed ones😪 As a result, we resort to clever tricks to keep that code warm artificially. But we're a corner case industry anyway🤷🏾♂️
Wrote this little theoretical performance guide. It contains a bunch of ideas that have been super helpful while building most of the systems I've worked on as an engineer. It's fairly language and system agnostic, focusing on a number of timeless ideas.
Now *this* is a great example of proper benchmarking. Using runtime info alone requires relying too much on intuition, and performance on today's complex systems often defies our intuition.
In the game of Low Latency (and, dare I say, in general computing), if your monitoring stack (e.g., TICK, ELK, etc.) tracks %CPU and %MEM but ignores IPC and Memory BW, you're hamstringing yourself.
After several yrs working w/it now, I can confidently say that if you're not profiling your multithreaded code with @emeryberger's Coz profiler then you're costing yourself extra work: