![Lucky Iyinbor Profile](https://pbs.twimg.com/profile_images/1737183090303250432/1Pe_com1_x96.jpg)
Lucky Iyinbor
@Luckyballa
Followers
3K
Following
2K
Statuses
511
Physics Simulation | Geometry Processing | Computer Graphics | AR/VR
Joined June 2012
A few weeks ago, I implemented a paper where an algorithm was described from a CPU standpoint. By rethinking it for GPU architecture, I was able to get almost a 100x speedup compared to the metrics from that paper. Here's how I did it:

The Problem: We have vertices randomly assigned to clusters (spheres). We need to compute per-cluster statistics where each vertex contributes to its cluster's metrics. With significantly more vertices than clusters, the main challenge was figuring out how to do parallel reduction efficiently while supporting scaling to millions of vertices and centroids.

I developed 3 methods:

Method 1: Direct Atomic
- Each thread handles one vertex with a cluster assignment
- Directly updates global cluster memory with atomics
It's nice and simple, but has high memory contention, random access, and in general, atomics to device memory are quite slow.

Method 2: Threadgroup Memory
- Each thread handles one vertex with a cluster assignment
- Each thread group allocates memory for all clusters
- First accumulates in thread group memory for all clusters, then to global memory
Here we have less atomic contention and good thread workload, but we're limited to ~512 clusters (32KB threadgroup memory limit), which is fine for most cases but not scalable.

Method 3: Range-Based
- Sorts vertices by cluster ID
- Finds start and end indices of sorted vertices for each cluster
- Computes cluster distribution given a desired thread group size
- First accumulates in thread group memory for a single cluster, then updates global memory
This one is my favorite - it scales to any cluster count, has perfect memory coalescing, and we have one cluster per group, so no limits on number of clusters! The downside is that it requires sorting and some threadgroups aren't fully utilized.

In practice: Method 2 and 3 outperform the first one significantly. Method 2 is fastest but limited, Method 3 is slightly slower but scales to any size.
Both methods allow you to go from 2-3 seconds stated in the paper to 30-40ms on M1 Ultra Mac.

That's it, have a good day :)
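The Range-Based steps can be sketched on the CPU as follows. This is only an illustration in plain Python, not the actual compute shader: the per-cluster statistic (a sum and a count of one scalar per vertex), the function names, and the `group_size` parameter are all assumptions made for the example. Each entry in `groups` stands in for one threadgroup that serves exactly one cluster.

```python
def direct_reduce(values, cluster_ids, num_clusters):
    """Method 1 analogue: every vertex adds straight into the global totals."""
    sums = [0.0] * num_clusters
    counts = [0] * num_clusters
    for value, cluster in zip(values, cluster_ids):
        sums[cluster] += value  # on a GPU this would be an atomic add
        counts[cluster] += 1
    return sums, counts


def range_based_reduce(values, cluster_ids, num_clusters, group_size=4):
    """Method 3 analogue (hypothetical names, scalar statistic assumed)."""
    # Step 1: sort vertex indices by cluster ID.
    order = sorted(range(len(values)), key=lambda i: cluster_ids[i])

    # Step 2: find the start index of each cluster in the sorted order
    # (exclusive prefix sum over per-cluster vertex counts).
    sizes = [0] * num_clusters
    for cluster in cluster_ids:
        sizes[cluster] += 1
    starts = []
    running = 0
    for size in sizes:
        starts.append(running)
        running += size

    # Step 3: split each cluster's range into chunks of at most group_size.
    # Every chunk plays the role of one threadgroup bound to one cluster.
    groups = []
    for cluster in range(num_clusters):
        end = starts[cluster] + sizes[cluster]
        for begin in range(starts[cluster], end, group_size):
            groups.append((cluster, begin, min(begin + group_size, end)))

    # Step 4: accumulate locally per group, then do one global update.
    sums = [0.0] * num_clusters
    counts = [0] * num_clusters
    for cluster, begin, stop in groups:
        local_sum = 0.0   # stands in for threadgroup memory
        local_count = 0
        for k in range(begin, stop):
            local_sum += values[order[k]]
            local_count += 1
        sums[cluster] += local_sum     # one global update per group
        counts[cluster] += local_count
    return sums, counts
```

Because a group never spans two clusters, the last chunk of a cluster can be shorter than `group_size`, which is the underutilization mentioned above.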
0
2
19
RT @ssh4net: A Radiance Field Loss for Fast and Simple Emissive Surface Reconstruction Ziyi Zhang, Nicolas Roussel, Thomas Müller, Tizian Z…
0
3
0
@Spiritandsoul23 No framework. Using this method in my compute shader playground.
I want to work more on high-dimensional optimizations. I'm bad at math, so deriving all gradients in a chain by hand is painful for me. I'm focused entirely on GPU development, so CPU auto-grad isn't an option. I hate using frameworks - what are my options? The answer is this -
0
0
0
Voronoi for the win 🏎️
📢📢📢 "𝐑𝐚𝐝𝐢𝐚𝐧𝐭 𝐅𝐨𝐚𝐦: Real-Time Differentiable Ray Tracing", a mesh-based 3D representation. Co-led by my PhD students Shrisudhan Govindarajan and Daniel Rebain, and w/ @kwangmoo_yi
0
0
13
@jmeseguerdepaz Another thing I am bad at is Python, so this is probably not the best option for me, ahah
0
0
2
RT @zianwang97: 🚀 Introducing DiffusionRenderer, a neural rendering engine powered by video diffusion models. 🎥 Estimates high-quality geo…
0
130
0
Another way to approach MAT is to describe it as a field.

The medial field M(x) is the radius of the medial sphere centered at projM(x), where projM(x) is the intersection of a ray from a surface point to the medial axis in the normal direction.

It can be defined as a function that satisfies these constraints:
M*(x) ≥ |Φ(x)|
M*(x) = |Φ(projM*(x))|
∇M*(x) · ∇Φ(x) = 0

This representation allows finding a projection point on the medial axis in O(1):
projM(x) = x + ∇|Φ(x)| · (M(x) - |Φ(x)|)

Key applications include faster ray marching, collision proxy building, and ambient occlusion computation.
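To make the O(1) projection concrete, here is a small sketch on a shape where everything is known in closed form. It assumes a 2D disc of radius `R` (a hypothetical example, not taken from the thread): the signed distance is Φ(x) = |x| - R, the medial axis is just the center, and M(x) = R for every interior point. The gradient of |Φ| is taken by central differences, and all function names are made up for the example.

```python
import math

R = 2.0  # disc radius (hypothetical example shape)


def phi(p):
    """Signed distance to a disc of radius R centered at the origin."""
    return math.hypot(p[0], p[1]) - R


def medial_field(p):
    """M(x) for the disc interior: one medial sphere of radius R."""
    return R


def grad_abs_phi(p, eps=1e-6):
    """Central-difference gradient of |phi| at p."""
    gx = (abs(phi((p[0] + eps, p[1]))) - abs(phi((p[0] - eps, p[1])))) / (2 * eps)
    gy = (abs(phi((p[0], p[1] + eps))) - abs(phi((p[0], p[1] - eps)))) / (2 * eps)
    return gx, gy


def project_to_medial_axis(p):
    """projM(x) = x + grad|phi(x)| * (M(x) - |phi(x)|), evaluated once."""
    g = grad_abs_phi(p)
    step = medial_field(p) - abs(phi(p))
    return (p[0] + g[0] * step, p[1] + g[1] * step)
```

For an interior point such as `(1.0, 0.5)`, the gradient of |Φ| points toward the center and the step length equals the distance to it, so the projection lands on the origin up to finite-difference error. The gradient is undefined exactly at the center, so that point is excluded.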
0
3
18
@miketuritzin Isn’t it a bit specialized? Like decomposition for arbitrary 3D shapes is probably not trivial
1
0
0
RT @miketuritzin: Ran across this great article on sampled SDFs that has *great* interactive WebGL illustrations that work really well for…
0
40
0
RT @QianqianWang5: Introducing CUT3R! An online 3D reasoning framework for many 3D tasks directly from just RGB. For static or dynamic sce…
0
98
0
@turtlespook Here I use the Gauss–Newton method. It has only 4 parameters per sphere, so the system matrix (approximate Hessian) is very small (4x4 per sphere). Fits nicely in threadgroup memory, which is 32KB on Apple Silicon.
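A rough CPU sketch of why the system stays 4x4: with parameters (cx, cy, cz, radius), each sample contributes one Jacobian row of length 4, so JᵀJ is 4x4 no matter how many samples there are. The residual used here (distance from a point to the sphere surface) and the function names are assumptions for illustration; the thread does not say which energy is actually minimized.

```python
import math


def solve_linear(a, b):
    """Solve a small dense system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def fit_sphere_gauss_newton(points, params, iterations=20):
    """params = [cx, cy, cz, radius]; assumed residual_i = |p_i - c| - radius."""
    cx, cy, cz, rad = params
    for _ in range(iterations):
        jtj = [[0.0] * 4 for _ in range(4)]  # approximate Hessian J^T J
        jtr = [0.0] * 4                      # gradient term J^T r
        for px, py, pz in points:
            dx, dy, dz = px - cx, py - cy, pz - cz
            dist = math.sqrt(dx * dx + dy * dy + dz * dz)  # assumes p != c
            res = dist - rad
            row = [-dx / dist, -dy / dist, -dz / dist, -1.0]
            for i in range(4):
                jtr[i] += row[i] * res
                for j in range(4):
                    jtj[i][j] += row[i] * row[j]
        # Gauss-Newton step: solve (J^T J) delta = -J^T r.
        delta = solve_linear(jtj, [-g for g in jtr])
        cx += delta[0]
        cy += delta[1]
        cz += delta[2]
        rad += delta[3]
    return [cx, cy, cz, rad]
```

On a GPU the accumulation into `jtj` and `jtr` would be the part done per sphere in threadgroup memory; the solve itself is tiny.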
0
0
2
@daveseidman Agree! You can use meshes or oriented point clouds, no additional input is needed
1
0
5