Guilherme Penedo
8 months
With Datatrove 🏭, you can:
- quickly read data in different formats from disk, the cloud, or the Hugging Face Hub
- use SOTA filters out of the box
- deduplicate data at scale (minhash, exactsubstr, bloom filters, etc.)
- tokenize your data
- easily run and scale your custom processing logic
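
To make the idea of chaining these steps concrete, here is a minimal sketch in plain Python. It is not Datatrove's API: the function names (`min_length_filter`, `exact_dedup`, `whitespace_tokenize`, `run_pipeline`) and the `{"id", "text"}` document shape are hypothetical, chosen only to illustrate a filter, a deduplication step, and a tokenizer composed into one pipeline.

```python
import hashlib
from typing import Callable, Iterable, Iterator

# Hypothetical document shape for this sketch: {"id": str, "text": str}
Doc = dict
Step = Callable[[Iterable[Doc]], Iterator[Doc]]


def min_length_filter(min_chars: int) -> Step:
    """Build a step that drops documents shorter than min_chars characters."""
    def step(docs: Iterable[Doc]) -> Iterator[Doc]:
        for doc in docs:
            if len(doc["text"]) >= min_chars:
                yield doc
    return step


def exact_dedup(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Keep the first document for each normalized text (exact-match dedup)."""
    seen = set()
    for doc in docs:
        normalized = doc["text"].strip().lower()
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def whitespace_tokenize(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Attach a naive whitespace token list to each document."""
    for doc in docs:
        yield {**doc, "tokens": doc["text"].split()}


def run_pipeline(docs: Iterable[Doc], steps: list) -> list:
    """Feed the output of each step into the next, then materialize the result."""
    for step in steps:
        docs = step(docs)
    return list(docs)


docs = [
    {"id": "a", "text": "Datatrove processes text data at scale."},
    {"id": "b", "text": "too short"},
    {"id": "c", "text": "datatrove processes text data at scale. "},
]
result = run_pipeline(docs, [min_length_filter(20), exact_dedup, whitespace_tokenize])
print([doc["id"] for doc in result])
```

Because every step is a generator, documents stream through one at a time. A real library would add sharding, parallel executors, and fuzzy deduplication such as minhash on top of this basic shape.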