I'm starting a company, Datalab:
- Task-specific models that outperform frontier LLMs and existing tools
- Examples: my projects marker and surya (25k GH stars) with task-specific arch
- Goal: Train models, open source as much as possible, do hosted inference and on-prem
I wrote a blog post on going from knowing nothing about deep learning a year ago to training state-of-the-art OSS models - .
Hope it helps you.
tldr; read the deep learning book, implemented papers + taught, built open source tools
I'm excited to ship marker - a pdf to markdown converter that is 10x faster than nougat, more accurate outside arXiv, and has low hallucination risk. Marker is optimized for throughput, like converting LLM pretrain data.
Find it here - .
Cool to see a 500M param model I trained myself do better than Google cloud vision, Claude, and GPT-4V on this task. (look at the thread for the results)
It's a relatively narrow one (OCR), but feels nice to see that small open source models still have a place.
It's weird how we live in an age of miracles with respect to AI/ML, and yet when I want to extract some text from a screenshot the best (very bad) option is tesseract, last updated ~7 years ago.
Better data = better AI. That's why I've spent the last 3 months on:
- Marker - fast, accurate PDF to markdown (5k GH ⭐️s)
- Texify - SOTA math to LaTeX OCR
- Libgen to txt - get 3TB of HQ data
- Textbook quality - HQ synth data
Find them at .
I'm training a text line detection model for a document OCR pipeline.
It could also be useful on its own, but I'm not sure. Is anyone interested in a standalone release?
It works for every language I tried - it detects text bboxes and column breaks. ~2 second inference per page.
I'm tweaking my line detection model to get it ready for a Github release. This was a fun test case. It's not really designed for newspapers, so I was surprised this worked
I've shipped most of the models + libraries I wanted in the last few months:
- PDF to markdown - marker
- Text line detection, OCR in 93 languages, layout analysis, reading order - surya
- Equation to LaTeX
- PDF text extraction
Find them on Github - .
Announcing surya reading order! It predicts the order that a human would read a document in.
It's useful for RAG, accessibility, and text extraction. It works on a variety of documents, layouts, and languages.
Announcing surya layout! It detects tables, images, figures, section headers, and more. It works with any language, and a variety of document types.
Find it here - .
Thanks @LambdaAPI for sponsoring compute.
I can't get over @ylecun tweeting that surya was nice. Lifetime achievement unlock.
My next steps are:
- Improving old/scanned doc performance
- Seeing if I can do anything about rotations
Then on to the next recognition part! Here's the repo - .
Announcing texify - an OCR model that turns inline and block equations into markdown/LaTeX. It's more accurate at this than nougat and pix2tex.
Find it here - .
I spent a week improving the performance of surya (OCR, layout), and marker (PDF -> markdown).
OCR is now 2.2x faster, layout 1.35x, and marker 1.2x. Accuracy is the same as before (SoTA for open source/their task, see repos for details).
The biggest barrier to GPT-quality open source LLMs is data.
If you want 1TB of quality data, here's my repo that will convert libgen nonfiction to txt format - .
I made pdftext, a small tool that extracts text like pymupdf, but with an Apache license (mupdf is AGPL). It can pull out blocks and lines or plain text.
Find it here - .
Marker v2 is out! The main new features:
- Extracts images/figures
- Better table parsing
- Pip package install
- Can be used commercially
- Improved OCR with more languages
- Better ordering for complex docs
Get it here - .
Surya () has been updated with a new model checkpoint that is far better on scanned/old docs.
It works even with blurry/rotated complex layouts, like this one:
I just released new surya layout and text detection models:
- 30% faster on GPU, 4x faster on CPU, 12x faster on MPS
- Accuracy very slightly better
- When I merge this into marker, it will be 15% faster on GPU, 3x on CPU, 7x on MPS
Surya () didn't work well on scanned/rotated docs, so I decided to spend a couple of days on it this week.
I'm making good progress. It's still training, hopefully will have something out tomorrow.
I'm going to release my reading order model next week. I had to change the architecture to perform better with complex layouts.
It seems to be working, though (see the image). There are mistakes, but it's only 20% trained, and still improving.
Textbooks generated with finetuned mistral + search and wikipedia RAG are surprisingly good. They seem close to GPT-3.5.
See samples here - , and here - .
Working on a bigger set now! Please let me know if you can sponsor.
I have a beta version of a hosted API for marker and surya up at . It does OCR, layout + reading order, and PDF -> markdown.
Averages 40 seconds to convert a 50 page PDF to markdown.
I've generated 70M tokens of extremely high quality synthetic textbooks - , using retrieval and gpt-3.5.
Seriously, the quality is 💯.
I'm generating 1B tokens, but will use llama for $$ reasons. Please DM if you can sponsor compute or credits.
My reading order model is getting close to being release-ready. (it may not be immediately obvious, but this is a hard doc to order properly)
Working on fixing just a few remaining issues.
I released marker last week - .
Within 72 hours, marker got to #1 on HN, with 700 votes, and was starred 3.4k times on Github.
I didn't expect this kind of response - thank you so much for the support!
An update on surya text recognition - I'm happy with the data/architecture, and I'm ready to scale up training.
Here are some results from a (very) early checkpoint. Left is original, right is OCR (Malayalam)
I'm building a dataset of high quality synthetic textbooks for pretraining. Here's a 4M token preview - . The quality is incredibly high (it really surprised me).
I've been generating additional textbooks! is up to 115M high quality tokens, and is up to 85M.
I'm seeing promising humaneval results with models trained on this data.
As @jeremyphoward shared yesterday, I'll be joining @answerdotai! I'm excited to work with such a strong team.
Before I start, I'm going to finish some in-progress work:
- Integrate surya with marker
- Commercial version of marker
- Launch an API for both
Libgen to txt now supports marker for pdf -> markdown.
Turn libgen rs nonfiction into 3TB of high quality markdown. AI labs are using this data to train LLMs - now you can, too.
Full instructions and usage are here - .
Marker is now faster! 7x on MPS, 3x on CPU, and 10% on GPU, thanks to a more efficient architecture for 2 of the models.
Marker converts pdfs to markdown very effectively. I hope the speedup will let people create more high-quality datasets.
I built a dataset of every package on pypi. The quality of code is high, and I'm finding it great for finetuning and pretraining - .
I cleaned extra leading comments, and rendered notebooks, so this data should be ready to use.
I'm excited to release a 400m token synthetic programming textbook dataset - .
This is a mix of GPT-3.5 (great quality), and finetuned llama (good quality).
It was generated with the textbook quality repo - .
A timeline of @DataCamp, 2017-2020:
- CEO sexually harassed an employee
- The company covered it up
- After years of community pressure, the CEO stepped down
- They just BROUGHT THE CEO BACK 🤦🏾♀️
This is a repeated and ongoing failure of leadership and ethics.
Expectation: Data science is all about ML and deep learning.
Reality: It's 80% storytelling and data acquisition + cleaning. And these parts are actually quite interesting (I promise!)
I'm amazed by the quality of RAG-augmented books from finetuned mistral. The writing is higher quality than 34b codellama, but it does make subtle mistakes (see math below).
Mistral -
Codellama -
I've improved my synthetic textbook generator in collaboration with @ocolegro - . The books are now longer and a lot more detailed!
Here's a preview - . (the programming books were generated with this technique)
@Yampeleg Thank you! I have a finetuned model that can generate similar quality to GPT-3.5. Just need compute credits to scale to 1B+ tokens 🙏🏾.
LLM credits (OpenAI or other) are also nice!
Dataset is here, btw -
Excited to ship classified - a quality rater for LLM pretraining and instruct data - .
It can stream datasets from HF hub, or from disk.
It uses GPT-(4/3.5) now, but custom classifier training and dataset filtering are coming soon.
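A minimal sketch of the stream-and-rate idea, assuming a JSONL dataset on disk - the `rate_quality` stub and the threshold are illustrative placeholders, not classified's actual API (the real tool calls GPT-4/3.5 to score examples):

```python
import json
import os
import tempfile

def rate_quality(text):
    # Stub rater: a real pipeline would call an LLM (e.g. GPT-3.5/4)
    # or a trained classifier here. This just scores by word count.
    return min(len(text.split()) / 50.0, 1.0)

def stream_rated(path, threshold=0.5):
    # Stream records from disk one line at a time (JSONL), yielding
    # only examples whose quality score clears the threshold.
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            score = rate_quality(record["text"])
            if score >= threshold:
                record["quality"] = score
                yield record

# Tiny demo dataset: one low-quality and one long, "high-quality" record.
records = [
    {"text": "short"},
    {"text": " ".join(["word"] * 100)},
]
path = os.path.join(tempfile.mkdtemp(), "data.jsonl")
with open(path, "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

kept = list(stream_rated(path))  # only the long record survives
```

Because `stream_rated` is a generator, it never loads the whole dataset into memory - the same shape works for streaming from the HF hub.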
I have a very early commercial usage preview of marker on the dev branch.
This removes layoutlm and pymupdf, and swaps in new models I trained.
I'd love some help testing it. You can find it here - .
Surya was trained on a diverse set of documents, including scientific papers. It works with every language that I've tried.
It should work with good quality scanned documents as well due to image augmentation.
If you're learning data science, it can be exciting to jump straight to machine learning. But data cleaning, data visualization, and SQL will take up most of your time in entry-level roles. Don't neglect those skills.
@1littlecoder Note: this is a thin wrapper around marker - - but it strips out the marker commercial license.
Please see the marker repo for details about licensing.
@sterlingcrispin @peterthiel Too many people are fine-tuning generalist models, and too few people are building pipelines of models for specific tasks. I think niche data + pipeline will beat generalist models.
Text detection is step 1 in building a GPU-accelerated OCR model that is more accurate than tesseract. Step 2 is to build the text recognition system - I'll be working on that in the next couple of weeks.
Also, thank you @jeremyphoward - I joined with the mutual understanding that we'd see if there was a fit.
When it was clear there wasn't (I want to train/open source models), Jeremy was very gracious. It's hard to find people who genuinely want you to succeed.
Ok - looks like I will be releasing this one standalone. Note that this is just the text detection (drawing bboxes around the text). I'll be working on text recognition (turning the bboxes into text) next week.
At @dataquestio, we aren't flashy. We don't raise $$ from investors. What we do instead is build the best way to learn data science.
Students who finish >10 courses see an avg $16.6k salary boost, and we've created $103.9M in total salary gains. And all it costs is $49 a month.
I'm a self-taught data scientist. When I looked for jobs, I got rejected many times for not having credentials.
It was crushing. But I realized that the rejections only mattered if they stopped me from trying. Don't let them stop you.
When I first got into data science, I had impostor syndrome, and I dealt with insecurity by not engaging with people, or acting like I knew everything. This was a mistake. The best way through it is to humbly engage with people - I've learned a lot more this way!
Open source AI is very important to me, and will be a core part of this company.
You can find my current projects here - .
Hosted inference, which launched last month and has decent traction, is here - .
Benchmarking was a little tricky, since surya generates line-level bboxes, and tesseract generates word level. Most datasets are also word-level. I decided to benchmark using doclaynet.
I used to work in a UPS hub. I once thought I'd work there my whole career (until my boss told me they wouldn't promote me).
The fact that I've been able to find my own path, and that I'm able to help others do the same with @dataquestio, is something I never take for granted.
@kevinsxu This is a good thing - most architectural changes don't make a big difference (the training data does). This makes Yi compatible with all the existing llama inference tools. They also acknowledged the issue and will rename - .
Last year, I built Endless Academy - - a site for AI-generated personalized courses.
It has potential, and I'd love to see it grow, but I don't have the time. I'm looking for someone who's interested in taking it over.
Surya is built on some amazing open source work, including:
- transformers from @huggingface
- segformer from @nvidia
- CRAFT from the @official_naver team - an amazing paper and team
Thank you to everyone who makes open source AI great.
The niches where I'd train task-specific models (like OCR) have large enterprise demand.
These models will be faster, cheaper, and more customizable than existing tools and frontier LLMs, by fitting model architecture and data to each specific task.
I'm also planning to work on other PDF-related projects soon, like table/image detection/extraction, and reading order detection.
I will be porting all of these into marker (), my pdf to markdown converter, to improve accuracy.
1/ In this thread, I'll discuss @LambdaSchool, a bootcamp that charges 17% of your pre-tax income for up to 2 years (ISA).
tl;dr Lambda is much more expensive than the average bootcamp, and has similar outcomes. 75% of Lambda students could pay an avg of $9k less elsewhere.
A summary of 90% of management books:
1. Build trust
2. Build culture
3. Share context
4. Create process, but not too much
5. Give honest, caring, feedback
6. Delegate, but don't micromanage
7. Set actionable goals
8. Hold people accountable
9. Be a mentor
10. Solicit feedback
I'm excited to start shipping again tomorrow. Stay tuned for:
- General purpose OCR model
- Open version of layoutlmv3 (or vgt)
- Commercial version of marker
- Better support for non-European languages
I'm very excited about this direction - it's so much fun to understand a use-case, then find just the right architecture/data to make a SoTA model.
If you're interested in working together in some way (partner, collaborate, invest, etc), feel free to ping me!
Surya has limitations, including:
- It is specialized for document OCR. It will likely not work on photos or other images. It will also not work on handwritten text.
- Performance on scanned documents can be hit or miss.
- It doesn't work well with images that look like ads or
Find it here - .
By combining reading order with OCR and text detection in surya, it's easy to turn entire documents into readable plain text. Even complex ones like newspapers or magazines.
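The combination is conceptually simple: once each detected line has OCR'd text and a predicted reading-order position, plain text falls out of a sort. A minimal sketch (field names here are illustrative, not surya's actual output schema):

```python
# Each OCR'd line carries its text and a reading-order position
# predicted by the reading order model (0 = read first).
lines = [
    {"text": "continued from column one.", "order": 2},
    {"text": "A headline spanning the page", "order": 0},
    {"text": "First column text, which is", "order": 1},
]

def to_plain_text(ocr_lines):
    # Sort detected lines by predicted reading order, then join.
    # This is what makes multi-column layouts come out readable.
    ordered = sorted(ocr_lines, key=lambda l: l["order"])
    return "\n".join(l["text"] for l in ordered)

text = to_plain_text(lines)
```

For a newspaper, this is the difference between interleaved column fragments and text that reads top-to-bottom the way a human would.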
@adithya_s_k It looks like you copied all of my code out of the marker repo, but split it across several of "your" commits - . You then removed my commercial usage license.
Other code seems copied, too - your florence-2 code is from here -
I hope you find this useful! Please join the Discord - - if you'd like to discuss surya.
If you do try surya out, please let me know how it went for you. I've tried it across a range of images, but there are so many edge cases.
Surya uses a modified segformer architecture from @nvidia. I found that by changing some of the shapes in the decoder, I could cut inference RAM usage to 1/4 of the original without a performance degradation.
We announced scholarships for underrepresented groups @dataquestio. Here's why:
- Data skills unlock economic opportunity + widely distributing them keeps the field ethical
- Some groups have been excluded due to systemic bias
- Scholarships help level the playing field
Surya () already supports line detection, and I'm excited to have it do full end to end OCR.
The final model should support ~90 languages (all major languages in use today).
I want to spend the rest of my career working towards a world where only what you can do matters - not the logo on your degree, who you know, or what you look like.
One lesson that was hard for me to learn is that the success of those around me doesn't diminish my own.
It actually enhances it by building a stronger network.
Don't hoard knowledge. Help the people around you. Not only is it the right thing to do, it also helps you.
I just uploaded a new model checkpoint for texify, a math OCR tool.
The recognition quality is incredibly good. Left is the selected region of a PDF page, right is detected and rendered Markdown/LaTeX.
The benchmark is calculated by % coverage of predicted bboxes by references (precision), and vice versa (recall). Anything over a .5 threshold is a hit. There is a small penalty for overlapping multiple reference boxes in precision.
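A minimal sketch of that coverage metric, under stated assumptions: boxes are `(x0, y0, x1, y1)` tuples, coverage sums pairwise intersections (which can over-count overlapping boxes), and the small multi-overlap penalty mentioned above is omitted for brevity - the actual surya benchmark code differs in those details:

```python
def intersection_area(a, b):
    # Overlap area of two (x0, y0, x1, y1) boxes; 0 if disjoint.
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def coverage(box, others):
    # Fraction of `box` covered by `others`. Summing intersections
    # over-counts regions where `others` overlap; fine for a sketch.
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area == 0:
        return 0.0
    return min(sum(intersection_area(box, o) for o in others) / area, 1.0)

def detection_metrics(preds, refs, threshold=0.5):
    # Precision: share of predicted boxes sufficiently covered by references.
    # Recall: the same in reverse - references covered by predictions.
    precision = sum(coverage(p, refs) >= threshold for p in preds) / len(preds)
    recall = sum(coverage(r, preds) >= threshold for r in refs) / len(refs)
    return precision, recall

# One matching box and one spurious prediction / missed reference each.
preds = [(0, 0, 10, 10), (50, 50, 60, 60)]
refs = [(0, 0, 10, 10), (100, 100, 110, 110)]
p, r = detection_metrics(preds, refs)
```

Coverage-based matching is more forgiving than strict IoU when the predicted and reference boxes split text differently (e.g. line-level vs word-level), which is exactly the mismatch described above.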
Based on my experiences as a solo technical founder growing @dataquestio to 30+ people, I wrote a guide on quickly improving your management skills - .
This is how I went from having no idea what I was doing to kind of knowing what I'm doing :)
You can find surya here - .
Coming soon:
- Merge this into marker for a significant speedup
- Work on a more efficient OCR architecture (hoping for similar speedups)
- Table parsing and OCR heuristic improvements in marker
After I start, I'm planning to continue working on OSS data tools/models.
Early ideas are:
- Decode images from any language and doc type into markdown (like nougat, but faster/more general)
- A single chat model that can do OCR, layout analysis, reading order, etc
My next project is reading order detection. I will then be porting all of these into marker (), my pdf to markdown converter, to improve accuracy, and allow commercial usage.