A lot of the insider knowledge on how to build an LLM has gone underground in the last 24 months.
We are going to build
#SnowflakeArctic
in the open
Model arch ablations, training and inference system performance, dataset and data composition ablations, post-training fun, big
Blown away by
@OpenAI
GTM. Great launch:
🚩 Many months of pre-launch buzz
🚩 Soft launch on Bing
🚩 Tons of embargoed launch partners
🚩 Paywall to drive ChatGPT-Plus signups
🚩 From arxiv to marketing white papers
🚩 s/SOTA on MMLU/GPT-4 can get into Stanford/g
Congrats ...
A lot of the insider knowledge on how to build an LLM has gone underground in the last 24 months.
We promised to build
#SnowflakeArctic
in the open, and here we are, with the third edition of our cookbook series, this time on data ...
Data ablations are the lifeblood of any LLM
I went to Google straight out of school, and spent 12 years there in a variety of engineering roles. Contrary to all the advice you'll read about levels and scope and responsibility and setting yourself up for success, growing your career in any eng org is pretty simple.
👇
World-class search needs world-class metrics 🚀
And great metrics need to be constantly evolved to avoid overfitting 🤟
As we build out a great search experience at
@SnowflakeDB
AI research, we are excited to join forces with
@lintool
and the University of Waterloo 🙌
Our
.
@SnowflakeDB
is thrilled to announce
#SnowflakeArctic
: A state-of-the-art large language model uniquely designed to be the most open, enterprise-grade LLM on the market.
This is a big step forward for open source LLMs. And it’s a big moment for Snowflake in our
#AI
journey as
We are thrilled to announce
@Neeva
is joining
@SnowflakeDB
.
We're bringing our expertise in search, AI + LLMs to Snowflake’s customers to help them safely & effectively realize the power of their data.
We remain forever grateful for all the support we received from all of you.
I asked GPT-3 to produce a funny limerick about privacy, search and ads. Here you go:
There once was a user who whined
"I don't want my data mined
I don't like being spied on
And I hate all these ads!"
But the search engines just smiled
And said "We'll do it anyway!"
Impressive
Excited to partner with
@DaniYogatama
@YiTayML
and the
@RekaAILabs
to bring them to
@SnowflakeDB
Cortex.
@RekaAILabs
is the sleeper in the LLM wars.
* Consistently top-tier models ✅
* Upcoming Reka Core model approaching GPT-4 ✅
* Top-tier team ✅
* Impressive execution ✅
As a part of our commitment to helping our customers unlock the power of
#AI
on all types of data, we’re furthering our partnership with
@RekaAILabs
to bring gen AI to images, video and more to Snowflake Cortex.
Learn more about our partnership:
More thinking re: the OpenAI ChatGPT GPT-4 plugin launch ...
The browsing plugin (WebGPT) uses Bing search.
Bing just increased prices on that API 30x (from $7 for 1k requests to $200 for 1k requests) for AI applications
OpenAI could not have pulled this off w/o MSFT ...
You can argue with
@tailopez
's methods. Not his results though. Before
@Neeva
, I led monetization for YouTube ads.
@tailopez
's "Here in my garage" video was among the most effective YouTube ads we ever encountered. (And I can't believe I am saying this).
At
@Neeva
, bi-encoders were 🔑 to great web retrieval performance.
Smaller enterprise corpora => optimizing for recall is even more important in enterprise AI search.
#AI
#MachineLearning
For the techniques that work, read the blog post, or get the tldr from the 🧵 ...
New blog post from me and my colleagues at Snowflake — an explainer on training text embedding models (a key technology behind modern search).
A moderately deep dive into the techniques that several top-scoring models use to improve performance.
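For readers new to the term, here is a toy sketch of what bi-encoder retrieval and recall@k look like. Everything in it is an assumption made for illustration: the bag-of-words `encode` function stands in for a trained embedding model, and the documents, queries, and relevance labels are made up. It is not the Neeva or Snowflake implementation.

```python
import numpy as np

def build_vocab(texts):
    """Map each whitespace token seen in the texts to a column index."""
    tokens = sorted({tok for t in texts for tok in t.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def encode(texts, vocab):
    """Toy encoder: L2-normalized bag-of-words (a real bi-encoder uses a trained model)."""
    vecs = np.zeros((len(texts), len(vocab)))
    for row, text in enumerate(texts):
        for tok in text.lower().split():
            if tok in vocab:
                vecs[row, vocab[tok]] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-12)

def recall_at_k(scores, relevant, k):
    """Fraction of queries whose relevant doc appears in the top-k results."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = [rel in row for rel, row in zip(relevant, top_k)]
    return float(np.mean(hits))

# Hypothetical corpus and queries, for illustration only.
docs = [
    "reset your account password",
    "quarterly revenue report",
    "employee vacation policy",
]
queries = ["how do i reset my password", "vacation policy"]
relevant = [0, 2]  # index of the single relevant doc per query (made-up labels)

vocab = build_vocab(docs + queries)
doc_vecs = encode(docs, vocab)       # documents are encoded once, offline
query_vecs = encode(queries, vocab)  # queries are encoded separately, at search time
scores = query_vecs @ doc_vecs.T     # cosine similarity, shape (n_queries, n_docs)

print(recall_at_k(scores, relevant, k=1))
```

The point of the bi-encoder shape is that documents and queries never see each other at encoding time, so document vectors can be precomputed and retrieval reduces to a similarity search.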
I’m honored and beyond excited as we start a new phase of
@SnowflakeDB
journey together.
Snowflake is a once-in-a-generation company and a truly special place. I love our customer-first obsession to deliver a tightly integrated and efficient platform.
I am excited by the
Snowflake has combined a fine-tuned Snowflake SQL generation model and
@MistralAI
state of the art Mistral Large model to create a new text-to-SQL engine for Snowflake Copilot. Read more below:
Most of your searches should have exactly one answer.
A summary that reads like a Wikipedia page for your query.
With real-time information.
With citations from sources so you can verify.
Sign up for a
@Neeva
account and tell us what you think of
#NeevaAI
1/
#NeevaAI
is here, powered by AI & LLMs and
@Neeva
's independent search stack to search in an authentic, believable way.
This is unlike anything we, or anyone, have built before.
U.S. users, try it out by logging into your Neeva account & searching ⤵️
2. Obsess about making the people around you successful. Your peers, your bosses, your team, your x-functional partners, anyone you can help. Actually care.
Big news! 💥
@Neeva
is on this year’s
@forbes
AI 50!
This list recognizes the most promising privately-held companies building businesses out of artificial intelligence. 💡
The hardest part is the value function -- crawl scheduling + index selection.
Essentially, you need to invert your ranking function.
LLMs don't help much, as they primarily capture topical relevance, and don't (by themselves) capture authority or popularity.
Building a high quality fresh web index is harder than building GPT 4. Coz it’s not largely money and resources that will solve the problem for you. There are more trade offs to make.
As promised, today we're going over our learnings from using LLMs to rank UGC content at
@Neeva
!
What is the key to it all, you ask? 🗝️
Well, let's get into it... 🧵 (and big shout-out to the
@Neeva
ranking team)
A sense of dread as I see this. We are slowly seeing the end of the open web.
- AI search powered experiences take away clicks from high quality publishers in aggregate +
- Crawlbots from AI companies not playing by the rules
leading to
- High quality publishers are looking to
As a publisher, you might be worried integrating LLMs into search will cause you to lose all your traffic.
You want to stop depending on referral traffic from search engines, and instead have users start their search on your website.
But how? Read on for how
@Neeva
can help 🧵
3rd week in a row, 3rd LLM from
@SnowflakeDB
...
Arctic-TILT is an 800M model that has GPT-4 quality performance on information extraction tasks, as measured by the DocVQA benchmark.
And it fits in an A10!
Snowflake’s Arctic-TILT model, powering our Document AI, beats GPT-4 with just 0.8B parameters, securing a top spot in the standard benchmark for document understanding, DocVQA.
Super excited to see
@neeva
and
@browsercompany
partner!
Been using Arc from
@browsercompany
as my exclusive desktop browser for the last 4 weeks with
@neeva
set as default search. This is how search & browsing should work!
ps. Great job on Arc
@joshm
!
1/
#Neeva
🤝
#Arc
users
We’re happy to announce Neeva is now available for Arc users, and can be easily set as their default
#searchengine
within Arc, the new browser from
@browsercompany
@lennysan
At
@neeva
, what we found was it was less "usage", and more "people were super passionate about dark mode". And when you have passionate users, you listen to them. So we rolled out dark mode.
Very excited to ship the Snowflake CoPilot to a public preview today 👨‍✈️
It's powered by the best data LLM in the world 🔝
A blend of Snowflake's text-to-SQL model and the conversational capabilities of Mistral-large 🎸
It outperforms GPT-4
It makes you more productive as an
@cdixon
(co-founder of
@Neeva
) Agree with Google search ad load. OTOH, I think the right hybrid model we'll converge to is "use LLMs for presentation, use search for retrieval" vs "use LLMs for search".
When analyzing a feature like Pages from Perplexity, wear the lens of "growth hacking" more than "user value" to understand why it makes sense.
- A sink for Google SEO traffic.
- Destination pages can get shared around; a way to create sticky community
I think it's a cool idea
The insane part of this -- real businesses depend on OpenAI / Azure OpenAI for real use cases.
If you are an enterprise, would you do business with a partner that could go under over a weekend?
It's been a wild ride. Just 20 of us, burning through thousands of H100s over the past months, we're glad to finally share this with the world! 💪
One of the goals we’ve had when starting Reka was to build cool innovative models at the frontier. Reaching GPT-4/Opus level was a
@A_Daneshzadeh
(founder
@Neeva
) To your list, I would add -- safer (ads-free), private, richer (widgets, full page experiences), more real (Reddit, Stack, Quora, Medium and other forums), and more personalized (site preferences, personal search over your apps)
1. Obsess about your effectiveness, your personal impact on the product and the company. Constantly ask yourself what your "wins above replacement" is.
@jeremyphoward
(Founder of
@neeva
) So we are speaking the same language, the core Google search systems barely used any ML for decades. If heuristics & intuition worked to build a $150B business, it's probably good enough for you. (Not saying don't use ML, just saying don't disregard intuition)
Today we announced an expanded collaboration with
@nvidia
to empower enterprise customers with a full-stack AI platform. Jensen and I are thrilled to be bringing together
@nvidia
's GPU-accelerated platform with the trusted data foundation and secure AI of the
@SnowflakeDB
Data
Excited to bring mistral-large to
@SnowflakeDB
Cortex and amp AI up
Starting today,
@SnowflakeDB
customers can use the most capable LLMs out there in Cortex, now in public preview!
Huge shout out to
@arthurmensch
and the
@MistralAI
team who have been a joy to work with.
We’re excited to announce a global partnership to bring
@MistralAI
's most powerful language models directly to Snowflake customers in the Data Cloud.
Learn more about how Snowflake users can leverage AI with their enterprise data:
@OpenAI
Similar awe-inspiring GTM motion during ChatGPT
🚩GPT3.5 pre-launch to drive buzz
🚩ChatGPT launch during NeurIPS to drive virality
🚩ChatGPT-Plus made lemonade ($) out of lemons (availability)
🚩API launch with 10x cheaper model
🚩Tons of enterprise launch partners
@sriramk
@neeva
's mobile browser lets you control how you want to handle all consent popups. On desktop, you can download our cookie cutter extension for chrome:
Say goodbye to complex SQL queries and hello to an era of data exploration made simple, secure, and efficient. Snowflake Copilot, our breakthrough AI-powered SQL assistant, is now generally available!
(1/) Competing in search starts from crawling the web. You can't serve up search results if you can't crawl the web.
Crawling the web is very hard today because of the discriminatory and
#anticompetitive
nature of how crawlers are treated on the web. 🧵
I jumped off my chair when I saw this data from the collaboration across
@HelloSurgeAI
/
@echen
and
@Neeva
.
It's why I am so excited to work w/
@Reddit
to make search awesome.
1/ Psst, hey did you hear
#Google
is dying? 👀
🗣 Users prefer
#Reddit
23-30% of the time over Google’s top 3 in externally commissioned surveys
🗣 Reddit only shows up 3% of the time in Google’s top 3
🗣 Reddit is 10x under-ranked relative to SEO shops on Google
Together AI and Snowflake partner to bring their state-of-the-art Arctic LLM to enterprise customers. Experience Arctic on Together Inference with best in class performance.
As I read about the OpenAI ChatGPT / GPT-4 history leak issue, I am not sure how this is not a bigger deal.
At a company like
@Google
, this would be a "save all your docs, add the lawyers in and needs disclosure to the FTC" level privacy incident
Consumer search alternatives to Google seem to have come and gone.
* Bing took a shot at the king and missed
* Brave has pivoted hard to an API biz
* DuckDuckGo continues to pretend to do search
* continues to ship; unclear if can build a subs biz.
WDYT?
📣 Thrilled to announce that starting today:
@Neeva
will offer a completely 🆓 version of its product with all the features users have come to love!💙
A premium version with exclusive benefits is also available at a low monthly cost!🎉
More here:
1/ Yesterday,
@vivek7ue
and I outlined the companies best positioned to take advantage of AI's breaking dam.
Today, we're taking you through what the market looks like today + where it's headed this year.
🔖 Read on...
On a tear at
@SnowflakeDB
...
* Text analytics on your data: Cortex functions in GA ✅
* Better semantic search: Arctic embed in Cortex, vector functions in PuPr ✅
* Higher performing LLMs in Cortex: reka-core, Llama3, Arctic ✅
* Enhanced AI safety: LlamaGuard2 + Arctic in
At
@SnowflakeDB
, we are on a mission to bring AI innovation to the enterprise with lightspeed.
So excited that
#SnowflakeCortex
is now generally available to our customers! And we also added easy access to the latest industry-leading AI models
#RekaCore
@RekaAILabs
and
#Llama3
Looks like I picked the wrong weekend to be Twitter-unplugged ...
Looking for a drama-free environment to do research on AI?
Our team at
@Snowflake
AI is building a world-class team of researchers to build serious models for serious use cases ...
DM me or
@RamaswmySridhar
...
Want to embed large corpora blazingly fast? ✅
Care about SOTA quality? ✅
And don't want to pay too much? ✅
@SnowflakeDB
's arctic-embed family is available on HuggingFace under an Apache2 license 🥁
It's the most practical family of text embeddings in the world,
@SnowflakeDB
is open sourcing the best embedding models in the world! 🚀🚀
They are now available open source in
@huggingface
We are releasing it under the Apache 2 license so that it is easy for the OSS community to experiment with them freely.🎁🎁
These impressive models
#Neeva
is now available for
#Android
! 👾
Experience:
☑️ Quick results w/o spammy ads
☑️ Customization of which news sources & retailers you prefer
☑️ More efficient searching with
#FastTap
☑️ Integrations of community forum content (
#Reddit
)
Download ➡️
GPT-4 is not a 100T parameter model. Notwithstanding what the AI hype cycle is predicting.
For an optimistic but more sensible take, read
@RamaswmySridhar
's thread below
Every year,
@TIME
highlights innovations that are making the world better, smarter, and even more fun. We’re very thrilled and honored to be on Time’s Best Inventions of 2021 list!
#TIMEBestInventions
#TryNeeva
Open web crawl always relied on a quid pro quo between aggregators and publishers ("you take my data; you send me clicks")
With LLMs, the value exchange has tipped and pubs like Reddit go after cash-rich aggregators for their "fair share".
May we live in interesting times ...
One of my favorite things about
@Neeva
is agency over the sources you trust. We started with news and recipe providers. I want result ranking to reflect your explicit preferences. Not the editorial decisions of an algorithm. Try it out at
Another example of how search is so hard
PFIX is down from ~$74 to ~$40 today.
Query: [why is pfix going down today?]
All the search engines, including the LLM-powered ones fail.
Read on for the answer at the end of the 🧵...
Today we are thrilled to share that we’ve raised $106M in a new round led by
@SalesforceVC
with participation from
@coatuemgmt
and our existing investors.
Our vision is to rapidly bring innovations from research to production and to ultimately build the best platform we can for
Announcing v1 of the Together Python SDK!
◆ More intuitive OpenAPI compatible API
◆ Async support for batching requests
◆ More robust with better error handling
About time Congress went after Google for running the casino and counting cards at the same time.
Next up: search. Or as
@lutherlowe
says: "Going after Google without touching search is like going after Standard Oil without touching oil."
For a fresh take on search, try
@Neeva
A huge shout out to
@darinwf
and the
@Neeva
mobile team on NeevaScope. NeevaScope is our first iteration of Darin's vision for how search and the browser can be reimagined together -- the idea that we can give the Neeva browser AI (ambient intelligence) using our search index.
📢 Introducing NeevaScope, your new guide to the web!
Browse smarter in the Neeva iOS app. 🌎
Tap on the Neeva icon to see a website’s related links, recommended products, chatter from
#social
, and more! 🔥
Learn more:
Exciting launches!!
Both the user facing product feature (
#NeevaAI
on Reddit seeking queries) and equally importantly, the new
#NeevaAI
ranking system for Reddit queries that made this possible.
Eager to continue to share our learnings on both in the coming weeks.
Two HUGE
#NeevaAI
features launched today:
1️⃣ You can now get instant
@Reddit
summaries on
@Neeva
w/
#NeevaAI
2️⃣ Our new AI ranking system now powers NeevaAI for UGC, starting w/ site:reddit searches
You trust Reddit to give you authentic answers from real people. So do we. 🧵
Excited to launch
#NeevaAI
on
@ProductHunt
today.
Our vision is a single answer to every query.
Cited and sourced so you can check the answer for yourself.
With citation cards that show summaries of the underlying sources.
Go to and give it a go ..
🚨 NeevaAI officially launched on
@ProductHunt
today!
✨Search powered by AI. Get answers. Not ads.✨
▫️ Real-time AI search
▫️ Authoritative answers, always with cited sources
▫️ Powered by our own LLMs & search stack
Share what you ♥️ about
#NeevaAI
:
There's an important missing perspective in the "GPT-4 is still unmatched" conversation:
It's a process (of good engineering at scale), not some secret sauce.
To understand, let's go back to 2000s/2010s when the gap between "open" IR and closed Google Search grew very large. 🧵
I’m excited to share
that Coda has entered a strategic partnership with
@Snowflakedb
, the world’s leading Data Cloud company, plus a large investment round in Coda led by Snowflake Ventures!
Thrilled where this partnership is headed, and here’s a few reasons why. 🧵
NewsGuard has partnered with
@Neeva
, the first search engine to provide its users with NewsGuard’s Reliability Ratings directly in their search results.
To try it out today and start getting more context for the news you encounter online, visit .
👋 Hello World,
Today, we're thrilled to introduce our new company, Augment, as we emerge from stealth to announce our funding milestone. We have raised $227 million at a $977 million valuation to empower software teams with AI.
My 4 yr old after the Warriors lost Game 1:
"You know why the Warriors lost?
Because of my Mom.
She put me to bed.
When I watch the game, they win.
When I don't, they lose."
Claim: t5-*, flan-t5-*, ul2 and flan-ul2 are Google's greatest gifts to the ecosystem.
@YiTayML
-- any update on the timing of when the decoder variants of flan-ul2 are due for open-sourcing?
@canadaduane
@Neeva
Duane -- I am a co-founder at
@Neeva
. Thanks so much for the feedback (and please keep it coming). On threejs docs, stay posted. We are hard at work on improving the quality of official documentation in our technical search results.
Dall-E is going to revolutionize a bunch of stuff.
My 10 yr old used it today to illustrate chapter 1 of a book of stories she is writing.
We would easily pay $10/mo for access.
(Will add link to book after she's done writing the rest of the chapters)
Folks who have been going "LLMs or search" are getting this wrong. The real angle is "LLMs and search". At
@Neeva
, we are super excited to be building that.
At
@neeva
, we've been revolutionizing search w/ an ad free, privacy-first model
But we’ve also been quietly upgrading the experience entirely w/cutting edge AI & LLMs.
ChatGPT cannot give you real time data or fact verification. In our upcoming upgrades,
@neeva
can
Announcement!📣
Neeva is changing our business model - we are now a 100% ALL ADS Search Engine!
We heard your feedback. You want:
✅ All the ads
✅ All the tracking
✅ No annoying real search results
Now you can rest easy knowing advertisers can find you anytime, anywhere.
With all the amazing improvements in LLMs for search, there's still so much work to do.
Query: [how much should i pay for home insurance in moss beach ca]
Context: I am trying to find out if I am getting stiffed.
To find out how the various search engines did, see 🧵
1/ It's not the size, it's the skill - now releasing
#Neeva
's Query Embedding Model!
Our query embedding model beats
@openai
’s Curie which is orders of magnitude bigger and 100000x more expensive. 🤯
Keep reading to find out how... 📖
Watching our competitor chatbots hallucinate, fail at basic math, & fall in love — we felt like something was missing at
@Neeva
.
That ends today with:
#NeevaAI
personalities 🪄
🚜 Feeling country?
🌐 On Bard's waitlist?
🫡 GenZ? Bet.
🏴‍☠️ Captured by pirates?
Follow along🧵
Often hear companies that want alignment as a service.
In other words, configurable bar on what to refrain from answering.
For example, my chatbot should only answer questions relating to my domain and not general purpose questions like [who is the president of the United
✨ You asked. We delivered. ✨
Today, NeevaAI is officially launched and ready to try in:
🇨🇦 Canada
🇫🇷 France
🇩🇪 Germany
🇪🇸 Spain
🇬🇧 UK
Just log in to your Neeva account and start searching at
(1/) Re-imagining search begins with ranking.
@Neeva
's ranking system is built using a mix of traditional IR techniques and deep learning.
Here's a recent example of how hard this is from
@yashpande98
's work applying deep learning to solve for out-of-order searches. See 👇
Another shoe drops ...
Wonder how Reddit and Stack will draw the line between who pays and who doesn't?
What about open-source LLM efforts like
@AiEleuther
GPT-J/GPT-Neo or
@StanfordCRFM
Alpaca or
@FacebookAI
Llama / OPT or
@databricks
Dolly?
What are the strategic implications of Llama3-400B?
- 400B x 15T tokens = about 2x compute of GPT-4 (200B active x 13T tokens).
- Zuck's very strategically lighting money on fire
- When you have an irrational player in the market, how does everyone else respond ...
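The "about 2x" figure can be sanity-checked with a back-of-envelope calculation. The sketch below assumes the common rule of thumb that training compute is roughly 6 x parameters x tokens; that factor is an assumption added here, not something stated in the tweet, and it cancels out of the ratio anyway. The parameter and token counts are the ones quoted above, and the GPT-4 numbers are unconfirmed.

```python
# Rule-of-thumb training compute: C ~ 6 * N * D
# (N = active parameters, D = training tokens). The 6x factor is an assumption;
# the N and D values are the figures quoted in the tweet, not verified numbers.

def train_flops(active_params: float, tokens: float) -> float:
    """Approximate training FLOPs under the 6 * N * D rule of thumb."""
    return 6.0 * active_params * tokens

llama3_400b = train_flops(400e9, 15e12)    # 400B params x 15T tokens
gpt4_as_quoted = train_flops(200e9, 13e12)  # 200B active params x 13T tokens

ratio = llama3_400b / gpt4_as_quoted
print(f"Llama3-400B:       {llama3_400b:.2e} FLOPs")
print(f"GPT-4 (as quoted): {gpt4_as_quoted:.2e} FLOPs")
print(f"ratio:             {ratio:.2f}x")
```

Under these assumptions the ratio comes out a bit above 2, consistent with "about 2x".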
1/ Ten blue links headed to a museum near you!!
@Neeva
is applying cutting edge AI to definitely change up the search experience!
Not only are we providing real time, cited AI, but we are also making the whole search experience a breeze!
🧵
<marketing_message>
Wonder how to filter web data for high quality?
Wonder how you compose the code part of your training data?
Read on for what
@nathanwiegand
and friends did as part of building
#SnowflakeArctic
...
</marketing_message>
We just published our next blog post in the Arctic Cookbook series about how we generated and managed our training data for Arctic. Up next, we'll talk about getting the most from your hardware.
My $0.02: These are/were hard tradeoff questions that very thoughtful people working on search struggle(d) with.
Alternative search engines are experimenting with a subscription approach but it's hard to get users to care when defaults are so powerful.
Super exciting to start talking about the LLMs that power our Snowflake CoPilot ...
It's been eye-opening to see how building these for the real world is so different from more simplistic published work ...