![Nadeem Bitar Profile](https://pbs.twimg.com/profile_images/459367464283824128/VRLIGNde_x96.jpeg)
Nadeem Bitar
@shinzui
Followers: 546
Following: 4K
Statuses: 5K
Technologist • Software engineer • Engineering leader • Product developer • Photographer • Swimmer • Japanophile • Lover of beautiful things.
Los Angeles, CA
Joined April 2007
@devongovett @igoranikeev @getjustd We’re planning to adopt Justd for our next project. Right now, we use React Aria and style the components ourselves. I hope Justd gains more traction over time.
0
1
2
@devongovett @igoranikeev .@getjustd is the Shadcn equivalent and is built on top of React Aria, but as far as I know, no company currently uses it in production.
2
0
0
For anyone looking to understand Apache Iceberg, this is a great place to start.
Apache Iceberg is used at industry leaders like Apple, LinkedIn, Netflix, Stripe, Airbnb and Pinterest. But do you know how it works? Let's cover it in the simplest way.

Iceberg is basically a way to manage a collection of files in a larger abstraction called a table. 💡

It's an open table format. That's a standard that defines:
• a table abstraction
• a schema
• history
• and other metadata
all around a bunch of regular files in a data lake. By adding structure and extra capabilities to these otherwise random files, table formats help organize them all while actually increasing performance and giving you better functionality, like CRUD operations, ACID guarantees and scalability. 👌

File formats, like Parquet, help you modify data or skip reading certain data (so you read faster) in a SINGLE file. In the same way, table formats, like Iceberg, help you modify data or skip reading certain data for a LIST of files. 🤔

The way Iceberg achieves this is through a few layers. The state of the table is mainly held in files, stored in your data lake (e.g. S3). Read carefully, because their names are similar:
1. metadata root
2. metadata snapshot
3. manifest LIST
4. manifest FILE
5. data file

Let's start from the bottom. 👇

1️⃣ The data file is a simple Parquet file (other formats are supported too). It's literally just your data. 🙂 Typically you'd have many, many of these in your data lake. (too many) 🥵

2️⃣ Next is the manifest file - a sort of mini-metadata file that contains information about a set of data files. An Iceberg table consists of many of these mini-metadata files for a bunch of reasons, mainly scalability and performance. The information the manifest file contains is of crucial importance:
• partition info - if your data is partitioned by date, the manifest file can say "these 5,000 files are all for June". A query searching for May can then skip them entirely instead of inefficiently searching through them. 👌
• per-file column-level statistics - it holds each file's per-column min/max values. For example, if this were purchase data, it would hold the minimum and maximum price of a purchase (e.g. $10 - $100). A query searching for purchases above $200 would entirely skip that file. 💸

3️⃣ Next up, we have the manifest list. Obviously, if you have many manifest files, you have to store the list of them somewhere. This is where it's stored. 🤷‍♂️ It contains pointers to every manifest file, as well as higher-level summaries of the partitions covered by the manifests. This information is again used for efficient querying.

4️⃣ Metadata Snapshot
The latest state of an Iceberg table is defined by a metadata snapshot. This snapshot links to a manifest list with pointers to the latest data. The snapshot also stores other metadata, including the previous snapshot. (s1 to s0 in the graphic)
Notice that there are a couple of snapshot files - this is because Iceberg holds historical state so it can allow you to seamlessly roll back / time-travel to previous versions. 🕰️ This forms a sort of linked list representing snapshot lineage and is what allows Iceberg to "traverse back in time".
Every new write in Iceberg creates a new snapshot and a new manifest list. The new list tries to reuse as many old manifest files as possible, but the update may make it dereference old ones and reference freshly-created ones too.

5️⃣ Metadata Root
At any given time, there is only one metadata snapshot that's the latest. How do we know which one it is? It's whatever the Metadata Root points to.
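To make the pruning concrete, here is a minimal Python sketch of the layering described above. The names (`DataFile`, `ManifestFile`, `min_price`, and so on) are illustrative only, not the real Iceberg spec or any library's API; the point is just that partition summaries and per-file column stats let a scan skip files without ever opening them.

```python
from dataclasses import dataclass

# Toy model of the Iceberg layering -- names and fields are illustrative,
# not the actual Iceberg file formats.

@dataclass
class DataFile:
    path: str
    partition: str        # e.g. the month this file belongs to
    min_price: float      # per-file column-level statistics
    max_price: float

@dataclass
class ManifestFile:
    partition: str              # partition summary for the files it tracks
    data_files: list[DataFile]

@dataclass
class ManifestList:
    manifests: list[ManifestFile]

def scan(manifest_list: ManifestList, month: str, min_wanted: float) -> list[str]:
    """Return only the data files a query actually has to read."""
    hits = []
    for manifest in manifest_list.manifests:
        if manifest.partition != month:
            continue                      # skip the whole manifest (partition pruning)
        for f in manifest.data_files:
            if f.max_price < min_wanted:
                continue                  # skip the file (column-stats pruning)
            hits.append(f.path)
    return hits

june = ManifestFile("2024-06", [
    DataFile("s3://lake/a.parquet", "2024-06", 10.0, 100.0),
    DataFile("s3://lake/b.parquet", "2024-06", 150.0, 900.0),
])
may = ManifestFile("2024-05", [DataFile("s3://lake/c.parquet", "2024-05", 5.0, 50.0)])

print(scan(ManifestList([june, may]), month="2024-06", min_wanted=200.0))
# -> ['s3://lake/b.parquet']
```

The real format persists these layers as files in the data lake, but the pruning idea is the same: read a little metadata to avoid reading a lot of data.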
The metadata root is a simple JSON file that represents all the metadata around an Iceberg table. It tells us what the latest snapshot is, but also what the older snapshots are. It contains info about where the table is located (e.g. the S3 URL), info about partitions, historical schemas, etc.

This file is updated atomically, and that's how Iceberg serializes updates. The way clients write to an Iceberg table is the following:
1. They optimistically write the data/manifest files and attach them to a newly-created snapshot. Until they're linked from the metadata root, the files are useless - at this point, these files are essentially unknown.
2. Clients then perform an atomic compare-and-swap (usually via the Catalog) to update the metadata root with the new data.

As you can see, there is a TON of indirection here: Catalog -> Metadata Root -> Snapshot(s) -> Manifest List -> Manifest File(s) -> Data File(s). But that's the point. ☝️

Notice that clients never re-write the whole table state. They only write the minimal set of new data and then just update the pointers. When you have 100k data files and 20k manifest files, this is the most efficient way.

That's how Iceberg does it. Leave me a like and repost if you found this valuable!
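The commit path described above can be sketched in a few lines of Python. This is a toy model, not the real Iceberg or catalog API: a hypothetical `Catalog` holds a single pointer to the current metadata root, writers stage their new files first, then attempt an atomic compare-and-swap, retrying from the newer root if another writer won the race.

```python
import threading

class Catalog:
    """Toy catalog: holds one pointer -- the current metadata root location."""
    def __init__(self, root: str):
        self._root = root
        self._lock = threading.Lock()

    def current_root(self) -> str:
        return self._root

    def compare_and_swap(self, expected: str, new: str) -> bool:
        """Atomically replace the root only if it still equals `expected`."""
        with self._lock:
            if self._root != expected:
                return False          # someone else committed first
            self._root = new
            return True

def commit(catalog: Catalog, write_new_snapshot) -> str:
    """Optimistic commit loop: stage files, then CAS the root pointer."""
    while True:
        base = catalog.current_root()
        # 1. Optimistically write data files, manifests, a manifest list and a
        #    new metadata root. Until the catalog points at it, it's invisible.
        new_root = write_new_snapshot(base)
        # 2. Atomic compare-and-swap via the catalog.
        if catalog.compare_and_swap(base, new_root):
            return new_root
        # Lost the race: re-read the table state and try again.

catalog = Catalog("s3://lake/tbl/metadata/v1.json")
print(commit(catalog, lambda base: base.replace("v1", "v2")))
```

Because only the root pointer is contended, concurrent writers mostly re-stage a small amount of metadata rather than rewriting table data when they lose the race.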
0
0
2
@rhughesjones I’ve read that before and really appreciate your writing in general—always find it valuable and insightful.
1
0
1
Cab looks promising. Hopefully, it evolves beyond a proof of concept and replaces the Nix CLI.
The ideals of Nix are almost perfect. The implementation sucks, and it's only really usable if you know why it is the way it is (and why it's badly implemented). That's why I'm working on Cab, and I'm taking my time thinking everything through before committing to an implementation.

It's not exactly a build language either. It doesn't specialize in any concept such as derivations, units, resources or whatever. It only lets you compose expressions with contexts—the stuff that makes Nix magical in the first place.

What are expression contexts? They're when a subset of an expression implicitly carries the whole expression with it, as a context. It's how you don't have to explicitly specify what a Nix derivation depends on—it just works!

The problem with Nix, ignoring all the QoL stuff that's missing [1], is that Nix contexts can *only* be used for derivations. They're not generic. derivationStrict is a builtin function; you cannot emulate it in the language itself. That prevents you from using contextful expressions for other things, such as process management, resource management (like Terraform), or literally anything that forms a graph - it cannot be done cleanly! This is why a generic contextful-expression language is required.

Cab will fix this, and that won't be the only thing it fixes. Cab has structural types enforced with a super novel system that works super well for a dynamic build-system language. It has patterns rather than hard-coded identifiers. I'd argue that its type system is going to be more powerful than TypeScript's, since there is no runtime/comptime difference (it's all the same, and there's no IO either, so it's simple).

Anyway, stay tuned for an MVP. I also plan on supporting the Nixpkgs package set with the system I'm going to build on top of Cab (the Cull Build System), building on the efforts the Ekala project has been going through, eventually. However, I won't announce it until I have a working LSP, a flexible, unified documentation system, a properly designed, hack-free "project" abstraction for the Cull Build System, and a fast runtime overall (bye bye, NixOS module system and home-manager eval times).

[1] Flakes are a bad solution to the purity problem, nix-* commands shouldn't exist, FODs are hard to use, nixConfig can pwn you, too many tunables, too little separation of concerns between the distro, the module system, Nix itself, and the Nix daemon, literal oddities in Nix, and much, much more.
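To make "expression contexts" concrete, here is a toy Python sketch; it is an illustration of the idea, not how Nix or Cab is implemented. A string-like value carries the set of things it was derived from, and combining such values unions their contexts, so the final expression knows its dependencies without anyone declaring them. In Nix this mechanism is limited to strings destined for derivations; the post's argument is that making it generic would let the same trick drive process trees, Terraform-style resources, or any other graph.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Ctx:
    """A string-like value that implicitly carries its dependencies."""
    text: str
    deps: frozenset = field(default_factory=frozenset)

    def __add__(self, other: "Ctx") -> "Ctx":
        # Combining two values unions their contexts -- the dependency
        # graph is built as a side effect of ordinary expression building.
        return Ctx(self.text + other.text, self.deps | other.deps)

def artifact(name: str) -> Ctx:
    """Something that must be built/provisioned before it can be referenced.
    (Hypothetical helper for the sketch; not a Nix or Cab primitive.)"""
    return Ctx(f"/store/{name}", frozenset({name}))

# Build a command line by plain concatenation; nobody lists dependencies.
cmd = Ctx("exec ") + artifact("nginx") + Ctx(" -c ") + artifact("nginx.conf")
print(cmd.text)          # exec /store/nginx -c /store/nginx.conf
print(sorted(cmd.deps))  # ['nginx', 'nginx.conf'] -- recovered from the context
```

The difference a generic version makes is that `deps` would not be limited to derivation references: anything graph-shaped could ride along in the context.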
0
0
4
@kmett @Savlambda @tangled_zans @jvanbruegge There are a couple of production-quality packages that support row polymorphism, but it should be a first-class feature of the language. I’ve successfully used row-types in the past but stopped when the maintainers discontinued support for it.
0
0
1