Jay Chia - getdaft.io Profile
Jay Chia - getdaft.io

@JayChia5

Followers
305
Following
143
Statuses
364

Cofounder @ Eventual. Works on Daft (https://t.co/i5vV81AuTj) the Distributed Python Dataframe. LESS OOM MORE ZOOM

San Francisco, CA
Joined August 2022
@JayChia5
Jay Chia - getdaft.io
11 months
Late night rant: Spark is an awesome piece of software. But a horrible developer experience.

What happened to OSS that was simply `apt install` and 🚀? Why should software be excused for slow local performance because it was built for "production scale"?

So much of "big data" JVM-based tooling was hacked together on the giant datacenters of tech giants. The world has changed, and so too must our big data tooling.

⭐️ Rust: self-contained compiled native binaries that have no dependencies. Hello, clean installs, my old friend.

🐍 Python: the undeniable winner of iterative plumbing for data/ML. Build with a Python API in mind. Using the JVM through a Py4J gateway should be an automatic disqualification.

☁️ Cloud: Build cloud-first, lightweight, ephemeral software. Cattle vs Pets. S3, not NFS/HDFS. Spot instances, not machines on a rack.

🤓 Dev UX: build for the single developer, on their laptop, then think about scaling. A docker-compose local dev story is lazy bundling of overly complicated software.

☀️ Open Formats: let software TALK to each other, so devs can choose the right tool for the right job, and so devs can keep building better tooling. This is why JSON is awesome. Arrow is awesome. Iceberg is awesome. Parquet and CSV are (I begrudgingly admit) somewhat awesome. And please build flexible SDKs for these formats, in C++ or Rust, not just for the JVM.
0
1
17
@JayChia5
Jay Chia - getdaft.io
1 day
@criccomini @continuedev @cursor_ai @daft_dataframe Yes! Was shocked at how good it was at GitHub Actions specifically. I guess there’s a ton of training data out there that looks really similar since there’s just a finite set of ways to configure actions.
1
0
4
@JayChia5
Jay Chia - getdaft.io
7 days
@haro_ca_ Plenty! Unstructured data clustering, GPU model batch inference, running Python UDFs (efficiently), dataset curation on unstructured data, video ingestion/indexing... Hint hint: Daft can do all that, and we're going to get a SICK data warehouse with Daft as the engine :)
2
0
4
@JayChia5
Jay Chia - getdaft.io
7 days
@Ubunta Why do you think tools like DuckDB are only discussed over lunch, but organizations are still using the big expensive data warehouses?
2
0
1
@JayChia5
Jay Chia - getdaft.io
8 days
@mim_djo @daft_dataframe @duckdb Disk cache, for repeated access of the same version of an Iceberg/Delta table?
1
0
2
@JayChia5
Jay Chia - getdaft.io
9 days
Beautiful dev documentation examples: OpenAI's docs: FastAPI: And of course, Daft:
0
0
1
@JayChia5
Jay Chia - getdaft.io
21 days
Daft cooks. We ate. Get yourself a CEO like @Sammy_Sidhu who can cook for the entire team??
0
2
11
@JayChia5
Jay Chia - getdaft.io
1 month
New year 2025 and Spark still makes me sad.
Had a 30m back-and-forth conversation with Claude to figure out wtf is s3:// vs s3a:// vs s3n://
Also all the magical `--confs` that need to be added to get this stuff working.
Jeez.
1
0
2
@JayChia5
Jay Chia - getdaft.io
2 months
We’ll be building some blogposts in the new year to talk about benchmarks and use-cases! Some good ones off the top of my head:
- streaming ML dataloading (URL downloads into image decoding/tensors)
- interactive data science: .show() is significantly snappier
- ETL: much better memory usage/stability
0
0
0
@JayChia5
Jay Chia - getdaft.io
2 months
We expect some instabilities initially (a BIG thank you to our beta testers!) but by and large you should see a big improvement in both the user experience and memory! HAPPY HOLIDAYS all, from the Daft team 🤗🤗🤗
0
0
2
@JayChia5
Jay Chia - getdaft.io
2 months
@mim_djo Cool. How would something like Daft be able to interact with PowerBI? Is this JDBC/ADBC or something similar?
1
0
1