Why do people believe this is a slam dunk? Dynamic dispatch is 5x slower, with 0.70 IPC and a 12.81% branch-miss rate, versus the switch version's near-max IPC and 0.02% branch-miss rate on my i9-10940X. I'm truly baffled.
Some folks thought I should add a second derived class and dispatch between the two of them randomly. That actually made a difference. Now the dynamic dispatch takes a bit more than 4ns longer.
Between clients & writing my sections for the upcoming 2nd edition of "Performance Analysis & Tuning on Modern CPUs", I've had little time to post new articles. Here's one to close out 2023:
Fruitful (and civil) discussion between @cmuratori and @unclebobmartin (the author of the book "Clean Code") regarding the former's recent "Clean Code, Horrible Performance" video:
If you're a Perf Engineer who works in the nano-to-microsecond timescale, then there's no getting around needing to know Assembly. Here's a two-part series on it from a Security guy (see a theme forming here?)
Today, May 31st, marks 11 years since I began spookin' HFT firms away from using the first core of *any* CPU in a server, not just that of the first CPU (i.e., core 0), for latency-sensitive threads😊
"Overeager techies wanna tinker with magic knobs & secret tunables hidden behind names with leading underscores. They want the *tricks* of the trade w/o first understanding the trade itself. But that’s not Performance Engineering. . ."
Four tools have made a notable impact on my performance consulting business: FTrace, Coz, eBPF, and perf c2c. My crystal ball tells me that this new tool feature will provide the fifth!🔥
Intel cuts no. of dies in half, crams in more cores, cranks up DDR5 & UPI bandwidth, and dramatically increases shared LLC size (2.84x) all while taking up less space in Emerald Rapids when compared to Sapphire Rapids🤔
Colleagues think I'm an expert w/the perf tool. HA! I can't tell you how often I've stumbled upon new perf functionality that I had no idea existed (for YEARS in some cases). For example, check out this one:
QUICK TIP: I've noticed that many of my colleagues reach for "strace" for tracking syscalls in code. Do yourself a favor & use "perf trace" instead. All the same bells & whistles but FAR less overhead. And now back to your regularly scheduled programming. . .
PERFORMANCE TIP: If you publish microbenchmarks for public consumption, you'll do yourself & everyone else a big favor by submitting via established microbenchmark frameworks (e.g., Google Benchmark, JMH, etc.). These help avoid common pitfalls👍🏽
@wil_da_beast630
It will be Roland Fryer once he emerges from the hatchet job perpetrated by Harvard et al. The documentary about the whole debacle (which Glenn Loury is involved with) may help expedite the process.
Dmytro Vyazelenko from Aeron recently gave a presentation on designing for low latency. He made it a point to specify that "in preparation for this presentation, no LLM was used":
Faster hash maps, binary trees etc. through data layout modification
We investigate how to make faster hash maps, trees, linked lists, and vectors of pointers by changing their data layout.
On my JabPerf blog, I've written brief explanations about DRAM internals only as an intro into larger topics regarding crafting low latency software. But this Cloudflare article does a proper Deep Dive on DRAM organization:
This article cogently supports my firm belief that mastery of any one tool does *not* an expert Perf Engineer make. Fantastic breakdown of pitfalls & rules-of-thumb for perf analysis👍🏽
Always Measure One Level Deeper
Can't recall the last time I debugged a Linux kernel networking issue since most of my IT life has revolved around kernel bypass stacks & libibverbs. Still, can't help sharing this article from the always stellar Cloudflare Blog:
I don't typically post Job Openings here. But several of my PerfEng brethren have asked about breaking into the HFT industry, where pay is phenomenal & the security is much better than elsewhere in the IT industry. Check it out:
We often discuss TLB Miss Penalty of big, randomly-accessed working sets yet seldom do I see mention of the increased Cache Pollution (L1d thru L3 on Intel/L2 thru L3 on AMD) via HW Page Walkers traversing the Page Table in such cases. Adds insult to injury.
@PeterVeentjer
AMD uProf measures memory bandwidth utilization for AMD CPUs, while Intel PCM, intel-cmt-cat, and pmu-tools all do the same for Intel CPUs (I use the latter to track per-socket memory bandwidth into Grafana via the intel_rdt Telegraf plugin).
Good paper on dealing w/systems perf variability in benchmarking, why never to assume normally distributed results, how to test for normality, and how to use nonparametric tests (and why I'm not crazy for always rebooting btwn tests)😊
Firstly, Perf Engineers who want to get a better grasp of statistical concepts in digestible posts should follow this guy. Secondly, I find that few ppl talk about #6 - i.e., the fact that it's not only small sample sizes that can trip you up. Large ones can, too.
Common Statistics Mistakes in Published Papers - A Cautionary Tale 📊📝
1/ Publishing a paper is a monumental task, but it’s essential to get the statistics right. Let's dive into some frequent statistical missteps researchers make and how to avoid them!
2/ Misunderstanding
DB & IO EXPERTS: On Linux w/NVME storage, which multi-queue I/O Scheduler do you prefer for optimal performance? Consensus from a quick online search seems to lean toward "none" but I trust my carefully curated list of X Follows much more than Google😉
Shout out to our very own @fleming_matt for founding his new company Nyrkiö for the purpose of bringing a supported Change Point Detection solution to the masses:
JAVA PERF PEEPS: There's a new book out from Oracle Press entitled "JVM Performance Engineering" by Monica Beckwith. I'm interested in your thoughts on its contents. Feel free to comment here or via DM, wherever you're more comfortable.
Email from P99Conf '23:
"Since we expect a large spike of logins, be sure to be among the 1st who login to the keynote sessions. Be ready w/your wireshark, gdb, docker & bpftrace." I *told* you Wireshark's an important perf tool😊
This comical refusal by Uncle Bob Martin to stop & reconsider his hardline dynamic dispatch stance illustrates a common pitfall of those hailed as experts - "Expert Ego" sets in. Here are tips to avoid it:
Thorough article on the decision to enable frame pointers for Fedora, as well as a nice breakdown on the pros & cons of various stack unwinding methods, some of which are tried-and-true & others which are on the horizon:
EVERY man over 30 should get a full battery of blood tests at each annual physical. Even if you're not playing a sport. Make sure it includes Lipid, Total/Free Testosterone, DHEA, ApoB, HbA1c, Fasted Insulin/Glucose, LH/FSH/TSH/T4/T3, CRP, etc. Trust me👍🏽
@i_bogosavljevic
No, you're not hallucinating. It's part of the Intel Resource Director family of features called "Cache Pseudo Locking". The Linux kernel supports it, as well. I've never used it personally because the HFT game is won or lost in the L1d anyway🤷🏾♂️
One of the better, if not best, IT Performance Conferences based on how practical and immediately useful all the talks are. BONUS: You don't even have to book a Business Class seat to attend (it's virtual) 😉
“An amateur can be satisfied with knowing a fact; a professional must know the reason why. An amateur practices until he can do a thing right, a professional until he can’t do it wrong.”
I've noticed that clients find it tough to wrap their heads around the fact that splitting work btwn threads not only has diminishing returns but also the potential for *regressive* behavior. This is why we explain the USL in both of @dendibakh's perf book editions👍🏽
In yet another example of the symbiotic relationship between Perf & Sec Engineering, here's a new paper reverse engineering Intel's L1/L2 TLB, providing operation details you won't find in the Intel Arch SW Dev Manuals:
@lemire
This mirrors Myths #3 & #5 from my article: Sampling profilers can mislead, and mastering any one tool (e.g., perf or VTune or uProf) won't magically confer perf analysis expertise. Somehow that ruffled feathers on X:
PERF PEEPS: I value your expertise & camaraderie. I'd like it to last as long as possible. So plz do me a favor: Each weekday morning drop & do 30 pushups/30 squats. At some point each weekday do 30mins on the treadmill/bike at a moderate pace. That's all🙏
@GergelyOrosz
I won't lie - I used to *love* non-competes. In the HFT industry, you'd get paid between 75 - 100% of your salary for the entire duration, which typically lasts btwn 6 months and 1 year. I took some of my most memorable vacations during mine.
Continuous Profilers (the 4th Pillar of Observability) will become even more important as general purpose CPUs gain more ground from special-purpose co-processors in the Cloud AI/ML space:
There's the geeky side of me that loves each new advancement in x86-64, ARM, RISC-V, and specialized co-processors. But then there's my *other* side that wishes it all stagnates a bit (~5yrs or so) to force devs to embrace Mechanical Sympathy🤷🏾♂️
Oh, for the love of All That Is Holy & Highly Performant, would someone w/better Google-Fu than me plz tell me where I might find this tool? A system-wide causal profiler that employs virtual speedups against cores instead of threads🤯
Everyone dances in the streets any time an x86 CPU gets a bigger LLC. I felt like I was dancing alone on an abandoned street w/o any music besides what played in my head when the first one w/48KB L1d arrived. The game is won or lost in the L1d for me🤷🏾♂️
Very interesting performance investigation! I won't tell you much more so that I won't give it away. I'll just say that you shouldn't let the title make you jump to conclusions about possible culprits😉
Has anyone else managed to put Kanye West, Taylor Swift, the MTV Music Video Awards, and Linux FTrace all in the same article? You know what? Never mind. I'm just gonna take the credit anyway:
I know I've endorsed P99 Conf several times before. But for any of you still on the fence, lemme tell ya that I just noticed @trav_downs on the agenda (not sure I've ever seen him present before). Register now:
@axelgneiting
HFT apps eschew runtime allocs not just cuz of alloc cost (or the more expensive dealloc cost) but the minor pg fault cost (100s of ns). Allocators that prealloc/pin memory upfront (and prevent TLB Shootdowns by not unmapping on dealloc) avoid this.
Since RAM latency's only growing w/DRAM chip size & each DDR release, I see CPU DCA support as a no-brainer. Yet AMD *still* doesn't offer it, while Intel has DDIO & ARM has DynamIQ Cache Stashing. Are there any perf writeups available for the latter?
@majek04
Yep, this is a nice writeup on DRAM Refresh (I touch on it briefly at ). Interestingly, LPDDR uses per-bank refresh to allow concurrent R/W access during refresh. But that's typically only used in mobile platforms🤷🏾♂️
@disruptnhandlr
And therein lies the problem w/its applicability for HFT: the functions we want optimized *are* the cold, rarely executed ones😪 As a result, we resort to clever tricks to keep that code warm artificially. But we're a corner case industry anyway🤷🏾♂️
Wrote this little theoretical performance guide. It contains a bunch of ideas that have been super helpful while building most of the systems I've worked on as an engineer. It's fairly language and system agnostic, focusing on a number of timeless ideas.
Now *this* is a great example of proper benchmarking. Using runtime info alone requires relying too much on intuition, and performance on today's complex systems often defies our intuition.
In the game of Low Latency (and, dare I say, in general computing), if your monitoring stack (e.g., TICK, ELK, etc.) tracks %CPU and %MEM but ignores IPC and Memory BW, you're hamstringing yourself.
After several yrs working w/it now, I can confidently say that if you're not profiling your multithreaded code with @emeryberger's Coz profiler then you're costing yourself extra work: