![Jay Chia - getdaft.io Profile](https://pbs.twimg.com/profile_images/1571350653884313600/95ka8cYU_x96.jpg)
Jay Chia - getdaft.io
@JayChia5
Followers
305
Following
143
Statuses
364
Cofounder @ Eventual. Works on Daft (https://t.co/i5vV81AuTj) the Distributed Python Dataframe. LESS OOM MORE ZOOM
San Francisco, CA
Joined August 2022
Late night rant: Spark is an awesome piece of software. But a horrible developer experience. What happened to OSS that was simply `apt install` and 🚀? Why should software be excused for slow local performance because it was built for "production scale"?

So much of "big data" JVM-based tooling was hacked together on the giant datacenters of tech giants. The world has changed, and so too must our big data tooling.

⭐️ Rust: self-contained compiled native binaries that have no dependencies. Hello, clean installs, my old friend.

🐍 Python: the undeniable winner of iterative plumbing for data/ML. Build with a Python API in mind. Using the JVM through a Py4J gateway should be an automatic disqualification.

☁️ Cloud: Build cloud-first, lightweight, ephemeral software. Cattle vs Pets. S3, not NFS/HDFS. Spot instances, not machines on a rack.

🤓 Dev UX: build for the single developer, on their laptop, then think about scaling. A docker-compose local dev story is lazy bundling of overly complicated software.

☀️ Open Formats: let software TALK to each other, so devs can choose the right tool for the right job, and so devs can keep building better tooling. This is why JSON is awesome. Arrow is awesome. Iceberg is awesome. Parquet and CSV are (I begrudgingly admit) somewhat awesome. And please build flexible SDKs for these formats, in C++ or Rust, not just for the JVM.
@criccomini @continuedev @cursor_ai @daft_dataframe Yes! Was shocked at how good it was at GitHub Actions specifically. I guess there’s a ton of training data out there that looks really similar since there’s just a finite set of ways to configure actions.
@haro_ca_ Plenty! Unstructured data clustering, GPU model batch inference, running Python UDFs (efficiently), dataset curation on unstructured data, video ingestion/indexing... Hint hint: Daft can do all that, and we're going to get a SICK data warehouse with daft as the engine :)
@mim_djo @daft_dataframe @duckdb Disk cache -- for repeated access of the same version of an iceberg/delta table?
We’ll be building some blogposts in the new year to talk about benchmarks and use-cases! Some good ones off the top of my head:
- streaming ML dataloading (URL downloads into image decoding/tensors)
- interactive data science: .show() is significantly snappier
- ETL: much better memory usage/stability