Lots of people are wondering whether #GPT4 and #ChatGPT's performance has been changing over time, so Lingjiao Chen, @james_y_zou and I measured it. We found big changes, including some large decreases on some problem-solving tasks:
Building a ChatGPT-like LLM might be easier than anyone thought. At @Databricks, we tuned a 2-year-old open source model to follow instructions in just 3 hours, and are open-sourcing the code. We think this tech will quickly be democratized.
Interesting trend in AI: the best results are increasingly obtained by compound systems, not monolithic models.
AlphaCode, ChatGPT+, Gemini are examples.
In this post, we discuss why this is and emerging research on designing & optimizing such systems.
@Berkeley_EECS welcomes @matei_zaharia, who returns to Berkeley EECS as an Associate Professor. Matei’s research interests include computer systems and machine learning. He is also the co-founder and Chief Technologist of Databricks. Welcome back, Matei!
We're launching two comprehensive online courses on building and using Large Language Models! The first is on using LLMs in applications, covering topics like prompt engineering, embeddings, chains, and MLOps. The second teaches you to build your own LLMs.
At Databricks, we've built an awesome model training and tuning stack. We've now used it to release DBRX, the best open source LLM on standard benchmarks to date, exceeding GPT-3.5 while running 2x faster than Llama-70B.
We've just launched a version of Dolly on HuggingFace, with new examples showing its capabilities. This is all with just 50k training examples. Stay tuned for new versions with other datasets soon.
Who are the World Cup champions? I knew ChatGPT would get it wrong when it launched, but it's surprising that all the new search+LLM engines do too.
Combining retrieval+LMs won't just be a matter of prompting. That's why we've been building tools like DSP at Stanford to do it.
Thrilled to receive this award; the credit is due to my students, my mentors, my collaborators in academia and open source, and my colleagues at Databricks for making all this work happen!
The 2023 @ACMSIGOPS Mark Weiser Award was presented to @matei_zaharia for innovation and impact in large-scale data processing. The award was announced at @sospconf. From next year, awards will be announced annually, as @sospconf is now an annual conference. See you in #austin in 2024!
For example, GPT-4's success rate on "is this number prime? think step by step" fell from 97.6% to 2.4% from March to June, while GPT-3.5 improved. Behavior on sensitive inputs also changed. Other tasks changed less, but there are definitely significant changes in LLM behavior.
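Measuring a drop like this requires a grading harness. Below is a minimal illustrative sketch (not the paper's actual grader): compute ground-truth primality by trial division, then score a model's free-text chain-of-thought answer by its final yes/no verdict. The `grade_answer` heuristic is an assumption for illustration.

```python
def is_prime(n: int) -> bool:
    """Ground-truth primality check by trial division."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def grade_answer(model_output: str, n: int) -> bool:
    """Naive scoring heuristic: take whichever of 'yes'/'no' appears
    last in the output as the model's verdict, and compare it to the
    ground truth. Real evaluations use more careful answer extraction."""
    text = model_output.lower()
    said_yes = text.rfind("yes") > text.rfind("no")
    return said_yes == is_prime(n)
```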
One of my favorite announcements: English SDK for @ApacheSpark! No more need to remember weird syntax; just chain transformations in natural language with the familiar Spark API. So many fun examples.
We've started a great collaboration between
@PyTorch
and
@MLflow
, to bring a rich set
#MLOps
functionality to PyTorch users. We've been working on this with the PyTorch team for a while and we're super excited to release a first wave of integrations today:
Due to COVID19, we decided to make #SparkAISummit virtual and also *free* for anyone to attend this year! We still have the same great program with over 200 talks and keynotes from @NateSilver538, @jenniferchayes, @apaszke and more. Tune in for the largest data & AI summit ever.
We can’t wait to solve the world’s toughest problems — and it starts with #SparkAISummit, the world's largest data and machine learning conference. As a global virtual event, we'll converge to shape the future of big data, analytics and AI. Join us:
Our MOOC on Large Language Models: Application through Production started today! Join me, Sam Raymond, Chengyin Eng and Joseph Bradley from Databricks as we cover how to build end-to-end apps with LLMs, including components like vector DBs and chains.
Our new MOOC on #LLM Foundation Models from the Ground Up is now available! Join me, Chengyin Eng, @sjraymond, Joseph Bradley and @abhi_venigalla for a detailed look at how LLMs are built, how to improve them, and where the field is going.
Congrats to my student @codyaustun (with @pbailis) on defending his PhD today! Cody did amazing work improving the resource and data efficiency of deep learning, including widely used benchmarks (DAWNBench/MLPerf), perf analysis, and new 10-1000x faster algorithms (SVP & SEALS).
Does long context solve RAG? We found that many long-context models fail in specific and weird ways as you grow context length, making the optimal system design non-obvious. Some models tend to say there's a copyright issue, some tend to summarize, etc.
How can you efficiently evaluate RAG-based LLM applications like document question answering? We've tested several methods on our internal question answering applications at Databricks and found some effective ways to do this using LLMs.
I'm super honored to have received a #PECASE award this year. Percy Liang from @StanfordNLP also got one, which is great news for Stanford CS. Congrats to everyone else who received one!
This thread highlights a point we've been seeing for a while: you can't meaningfully talk about capabilities of a *language model*; you have to talk about capabilities of a *system*, including the inference algorithm. 32-CoT is not the same as 5-shot.
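For context, "32-CoT" style evaluation is self-consistency: sample many chain-of-thought completions and majority-vote their final answers, an inference-time choice that can shift benchmark numbers without touching the model weights. A minimal sketch, where `sample_fn` is a hypothetical stand-in for one stochastic model call:

```python
from collections import Counter

def self_consistency(sample_fn, n_samples: int = 32) -> str:
    """Self-consistency decoding: draw n chain-of-thought samples and
    return the most common final answer. `sample_fn` stands in for a
    single stochastic model call returning an answer string."""
    answers = [sample_fn() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

Two systems built on the same weights can thus report very different scores depending on how many samples they draw and aggregate.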
For @VLDB2020, we wrote a paper on @DeltaLakeOSS, one of the most exciting new technologies from Databricks. By adding ACID transactions over cloud object stores, we can provide data-warehouse-like capabilities & performance on low-cost, HA cloud storage.
Databricks just set a new record on the official TPC-DS data warehousing benchmark, showing that a lakehouse system based on open data formats can outperform previous DW systems. Don't listen to folks who say open means bad performance!
Excited to share our #Lakehouse technical paper published at #CIDR21. We describe a new class of data platforms that are (1) completely open, (2) efficiently support #MachineLearning, and (3) provide all traditional #DataWarehouse capabilities+performance.
#ApacheSpark 3.0 greatly simplifies writing Python user-defined functions through type hints, and makes it easier for your functions to process data efficiently in batches via Pandas and Apache Arrow. Check out how to use them:
This is a big release: we've spent the past 3 years working on LLM pipelines and retrieval-augmented apps in my group, and came up with this rich programming model based on our learnings. It not only defines but *automatically optimizes* pipelines for you to get great results.
🚨Announcing 𝗗𝗦𝗣𝘆, the framework for solving advanced tasks w/ LMs.
Express *any* pipeline as clean, Pythonic control flow.
Just ask DSPy to 𝗰𝗼𝗺𝗽𝗶𝗹𝗲 your modular code into auto-tuned chains of prompts or finetunes for GPT, Llama, and/or T5.🧵
#ApacheSpark 2.4 is out today! This release has tons of new features including barrier execution mode for ML applications, higher-order functions in SQL, optional eager evaluation for previewing DataFrames in Jupyter, Scala 2.12 support and more.
I'm co-organizing a new conference on Systems for Machine Learning starting in February; our first call for papers is up at , so submit your interesting SysML work by Jan 5th!
As we worked with customers using LLMs, a common pattern we saw was that everyone wanted to add a layer in front of the LLM API to manage credentials, rate limits, etc., and to easily swap between models. We've built this as the open source @MLflow AI Gateway:
We're very excited to be one of the launch partners for Meta's Llama 2 🦙! We got to test Llama 2 in advance and were very impressed. The new version also has a much more permissive license. We've set everything up so you can run it on Databricks today.
Cool to see this model from @MosaicML being trained on RedPajama and Dolly data. Fully open source AI is becoming a reality -- open source efficient training, a curated web dataset, and instruction data. Still an early and small model, but it will get better.
Super excited about the new Agent Framework, Tool Catalog, Vector Search, Evaluation and Training capabilities we launched today in Mosaic AI. We see more companies building compound AI systems, and we have created an end-to-end environment to do this.
We just posted ColBERTv2, which dramatically reduces the space usage of ColBERT and gets state-of-the-art information retrieval quality on MS MARCO as well as out-of-domain on BEIR🍺, Open-QA retrieval, and our new long-tail task benchmark LoTTE☕️.
Want to efficiently query a vector DB while filtering on structured attributes? My student Liana Patel, together with @petereliaskraft and @guestrin, modified HNSW to do this efficiently in ACORN, to appear at SIGMOD:
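To see why this is hard: the naive baseline applies the attribute filter first and then does an exact distance scan over the survivors, which is correct but scales linearly with collection size; ACORN instead makes the HNSW graph traversal itself predicate-aware. An illustrative brute-force reference (not ACORN's algorithm):

```python
import math

def filtered_nn(query, vectors, attrs, predicate, k=1):
    """Exact baseline for filtered vector search: keep only the vectors
    whose structured attributes satisfy the predicate, then rank the
    survivors by Euclidean distance to the query and return top-k ids."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    candidates = [i for i, a in enumerate(attrs) if predicate(a)]
    return sorted(candidates, key=lambda i: dist(query, vectors[i]))[:k]
```

An index-based approach has to beat this scan while avoiding the pitfalls of very selective predicates, where most graph neighbors get filtered out.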
Databricks just published our #StateofDataAI report, with interesting trends at our enterprise customers:
1. Adoption of LLMs is booming, with use of SaaS LLM APIs exploding since #ChatGPT launched, but the largest use (and growth) is still in custom LLMs.
Very cool to see Dolly-v2 hit #1 trending on HuggingFace Hub today. Stay tuned for a lot more LLM infra coming from Databricks soon. And register for our @Data_AI_Summit conference to hear the biggest things as they launch -- online attendance is free.
Large NLP models are expensive and opaque, but maybe it doesn't have to be that way. This exciting work with Omar Khattab and @ChrisGPotts uses retrieval to set SotA results in hard NLP tasks at low cost. Our Baleen paper will be a spotlight at NeurIPS.
Want to build your own chat AI from scratch? We're launching a Building LLMs course at @Data_AI_Summit to teach everyone how to build a Dolly clone: . Tiny model, big attitude, for anyone. #DemocratizeAI
It's hard to believe that #ApacheSpark was first released as a research project 10 years ago! My @SparkAISummit keynote (live now) goes through the lessons from the past 10 years and what's new in #ApacheSpark 3.0.
As good a time to say this as any: if you’re on the AI research job market, Databricks is hiring, with the mission to democratize AI. We power amazing customer use cases and we publish. Check or reach out.
Databricks is now available on @googlecloud! We've also built great integrations with BigQuery, Looker, GCS and Google AI services across the product.
2022 ACM SIGMOD Awards
Edgar F. Codd Innovations Award goes to Dan Suciu.
Contributions Award goes to Christian S. Jensen.
Test-of-Time Award goes to “NoDB: Efficient Query Execution on Raw Data Files”.
Systems Award goes to “Apache Spark”.
Congrats!
We’re actively updating the Dolly repo with model improvements! Make sure to pull the latest changes. At $30 / 30min per training run it’s dead simple to run multiple experiments.
Also, 688 stars in 20 hours! Neat!
I gave a keynote at @ACMSoCC about lessons from building a large-scale cloud service at @Databricks. Did you know that Databricks runs millions of VMs/day to process exabytes of data with <200 engineers? Slides here:
Congrats to the #ApacheSpark community on the 3.0 release! Over 440 developers contributed 3400 patches to this release, with big improvements in SQL performance, ANSI SQL support, Python usability and management features.
I'm excited to share that I will be joining MIT EECS as an assistant professor in Fall 2025!
I'll be recruiting PhD students from the December 2024 application pool. Indicate interest if you'd like to work with me on NLP, IR, or ML Systems! Stay tuned for more about my new lab.
Exciting times at @Databricks. We're hiring in all departments, so take a look if you want to help shape the next generation of infrastructure for data and AI.
Meet #LakehouseIQ: a knowledge engine for your enterprise that understands your business & data to power AI apps.
Every platform is adding an AI assistant, but in data, LLMs don't just work out of the box, because every org has its own jargon, data, etc.
I'm co-organizing the inaugural research workshop on Compound AI Systems on June 13th: . Send in your work on designing & optimizing such systems!
Thrilled to have @RichardSocher, @MonicaSLam and @polynoamial as speakers, and to host this at @Data_AI_Summit.
We also have a big announcement for @MLflow today: it's joining the @linuxfoundation as a long-term vendor-neutral home to host the project! We've been blown away by how fast MLflow has grown and hope this leads to even more contributors.
Really cool to see OpenAI o1 launched today. It's another example of the trend towards compound AI systems, not models, getting the best AI results. I'm sure that future versions will not only scale inference, but also use tools (coding, search, etc) for better results.
Second big announcement is open sourcing Databricks Delta as Delta Lake. Delta dramatically simplifies building reliable data lakes on HDFS and cloud storage through ACID transactions, indexes and scalable metadata handling. More info here:
Really cool to see @MLflow as the second-most popular ML tracking tool in this year's @kaggle survey (only behind TensorBoard), given that it started in 2018! We're excited to bring easy, open source observability to all ML workflows.
The great thing is that for customers wishing to build such models that natively understand their data, the cost could be even less. We have the checkpoints, data cleaning pipeline, instruction tuning pipeline, etc from DBRX — just apply these to your data.
Just $10M and two months to train a GPT-3.5/Llama2-level model from scratch. For context, it probably cost OpenAI 10-20x more just a year ago!
The more we improve as a field thanks to open-source, the cheaper & more efficient it gets!
All companies should now train their own
We just posted the first release of open source Unity Catalog! It supports tables, unstructured data, and AI, and we have a great set of partners across data and AI integrating with it. Read more at
How can you make LLM-as-judge reliable in specialized domains? Our applied AI team developed a simple but effective approach called Grading Notes that we've been using in Databricks Assistant. We think this can help anyone doing domain-specific AI!
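As a sketch of the idea (an illustrative reconstruction, not Databricks' actual implementation): instead of a generic rubric, attach short human-written grading notes to each question and include them in the judge prompt, so the judge scores against domain-specific criteria.

```python
def build_judge_prompt(question: str, answer: str, grading_notes: str) -> str:
    """Assemble an LLM-judge prompt that grades an answer against short,
    per-question grading notes rather than a generic quality rubric.
    All prompt wording here is hypothetical."""
    return (
        "You are grading an assistant's answer.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Grading notes: {grading_notes}\n"
        "Reply PASS if the answer satisfies the grading notes, else FAIL."
    )
```

The appeal is that writing a few grading notes per question is far cheaper than labeling reference answers, while still anchoring the judge to domain expertise.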
Congrats to my student @deepakn94 for defending his PhD! Deepak worked on a ton of exciting systems and ML research, including Weld, DAWNBench/MLPerf, and most recently pipelining methods for efficient DNN training, including PipeDream-2BW (ICML'21) and Megatron's 1T param model.
MLflow 2.8 is out today, with new support for LLM-based eval metrics among other features. Read about how we've been using it to improve our RAG apps at Databricks, like our docs assistant:
Everyone is doing RAG on unstructured docs, but what if you want to mix in structured business data? Databricks RAG can connect to feature tables & functions to query the latest data in your catalog, all with centralized governance, security and MLOps.
Really proud of my student @sppalkia who passed his (online) PhD defense today! He's the first of my students to graduate, and he did awesome work accelerating data applications with Weld, Mozart and other systems. You can see his talk and slides here:
Thrilled that Forrester named Databricks a Leader in their report on AI Foundation Models in enterprise! We help organizations build the best AI for *their* domain and data, using the best techniques available, with a world-class research team to back it.
Apache Spark (and Databricks) are getting first-class support in @HuggingFace! You can now rapidly load data from these engines for HuggingFace training and inference, giving up to 40% speedups.
One of my favorite features in the upcoming #ApacheSpark 3.0 is Adaptive Query Execution (AQE), which tunes the number of reduce tasks, join algorithms and skew joins automatically. Learn how it works and how it speeds up TPC-DS queries by up to 8x:
Everyone’s excited about vector DBs, but there’s a lot to do to get truly high quality retrieval systems! Check out this paper benchmarking quality, latency and cost.
#acl2023 findings paper for folks working on retrieval leaderboards. Read on:
✅ We show multi-dimensional tradeoffs, e.g. quality, latency & cost (instead of just F1)
✅ Metrics that include concrete efforts, e.g. DynaScore.
-- Code in PrimeQA:
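Quality comparisons like these typically rest on a handful of simple retrieval metrics. For instance, recall@k, the fraction of relevant documents that show up in the top-k results, can be sketched in plain Python (a standard textbook definition, not this paper's exact metric):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear among the top-k
    retrieved document ids. Returns 0.0 when there are no relevant docs."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)
```

The paper's point is that a single such number hides the latency and cost axes along which systems also differ.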
“How’s your sabbatical?” Well…DBRX is GREAT at RAG!
If you’ve been using Mixtral/Llama2/GPT3.5, then try DBRX! The combination of RAG with its SoTA capabilities on knowledge/code/reasoning will unlock new CompoundAI opportunities.
So excited about this -- bringing amazing platforms for data and AI together. @NaveenGRao, @hanlintang and @jefrankle have built an amazing team that has steadily reduced the cost of AI training and released breakthroughs like the first open source LLMs with >64K context.
Today we’re announcing plans for @MosaicML to join forces with @databricks! We are excited about the possibilities for this deal, including serving the growing number of enterprises interested in LLMs and diffusion models.
Sad about the chaos around OpenAI, which was crazier than anyone imagined, and how it’s affecting people, especially those on visas. I hope everyone lands on their feet!
Congrats to the whole team at Databricks for the continued ultra-fast growth! We're hiring in all roles to continue simplifying how organizations work with data through technologies such as @DeltaLakeOSS, @MLflow, @ApacheSpark and more.
We're excited to announce that we've raised $400 million to continue our rapid global growth and engineering expansion, an investment that brings our valuation to $6.2 billion. Learn more:
Some personal news: I'm thrilled to have joined @Databricks @DbrxMosaicAI as a Research Scientist last month, before I start as MIT faculty in July 2025!
Expect increased investment into the open-source DSPy community, new research, & a strong emphasis on production concerns 🧵.
We've just released a suite of awesome features for building high-quality RAG apps on Databricks: . In talking with enterprises, we found quality was often the top concern with RAG, so we help teams monitor and improve it at all levels of the stack.
#PySpark downloads are growing 3x year-on-year. As a result, the @ApacheSpark community is investing a lot in making its Python APIs easier as part of "Project Zen". Read about some of the work currently in progress, including type hints, viz and docs:
We're serious about an open, compatible foundation for all enterprise data. Very excited to work with the @tabulario team to make the open source data ecosystem even better.
Databricks to acquire @tabulario, a data platform from the original creators of Apache Iceberg. Together, we will bring format compatibility to the lakehouse for @DeltaLakeOSS and @ApacheIceberg.
Super excited about this work, and it's open source! One of the coolest open source frameworks from my research group. It lets developers use language-based models (including retrievers) in a composable way to build complex apps.
Introducing Demonstrate–Search–Predict (𝗗𝗦𝗣), a framework for composing search and LMs w/ up to 120% gains over GPT-3.5.
No more prompt engineering.❌
Describe a high-level strategy as imperative code and let 𝗗𝗦𝗣 deal with prompts and queries.🧵
I'm excited to participate in the LLMs in Production virtual conference on June 15-16! I will be speaking about "The Emerging Toolkit for Reliable, High-quality LLM Applications". Register here to join:
Proud to see Databricks named a leader in the Gartner CDBMS MQ for the 3rd year, advancing in both dimensions! We’ve made so many improvements to the platform this year and we’re just getting started with data intelligence, marketplace, cleanrooms & more.
A lot happened in Databricks SQL in 2023 -- no wonder it's one of the fastest growing data warehouse platforms. Read how we improved latency and concurrency, made it serverless, and began automatically optimizing most workloads with AI:
I'll be opening up @SparkAISummit tomorrow with a talk on what's new in @ApacheSpark 3.0. This release greatly improves SQL & Python support, including 2x speedup on TPC-DS, adaptive execution to reduce tuning, ANSI SQL, and new Python APIs. Short summary: