I’ll probably launch a new product later this year. Build in public. To developers. Still data infra. But with AI.
Incredibly exciting time to run startups.
Follow me. Stay tuned.
New data engineering trend (?): Several companies have told me that they are moving away from Kafka to S3 for message queuing use cases. The reason is that they think Kafka is too expensive, and it's not worth running Kafka instances just for system decoupling or connection.
We are officially a series-A startup now! 🎆🎆🎆 With $36M from the series-A round, we will accelerate the development of our flagship product, RisingWave Cloud, which is now open for private preview!
Press release:
It was my paper published in VLDB 2017. The original title was "This is the Best Paper Ever on In-Memory Multi-Version Concurrency Control." We changed the title 3 times as the chair threatened to desk reject our paper 🙃
I recently wrote a new blog post to share my thoughts about stream processing. Although I have been working on stream processing for 10+ years, I still feel pretty new to this domain. I am still learning, and any comments are greatly appreciated!
It's already 2023, and RisingWave will enter its 3rd year. RisingWave has grown from a personal project to one with 3.7K stars and 100+ contributors. I summarized RisingWave's 2022 in a blog post.
Any comment is welcome, and happy new year!
Who uses @duckdb for real? Very interesting discussion. Seems that DuckDB is gaining widespread popularity in the data science domain. Can we simply use SQL (instead of Python, like Pandas) to do data science???
People are talking about data lakes and lake houses these days. At @RisingWaveLabs, we've put lots of effort into integrating with data lakes. Here's an exciting project we are working on. All projects written in Rust will soon enjoy better integration
Register here ->
Topics covered:
✅ BYOC vs. managed cloud;
✅ Open data lake format;
✅ S3 Express and S3 as the primary storage;
✅ Transition from batch to streaming;
✅ many others!
See you this Thursday at 9am PST!
RisingLight is open-sourced! It is an OLAP database built by a group of talented students (initiated by @Cat99Vegetable, our company's previous intern, now PhD at @UMassAmherst) with the aim of helping people learn OLAP database internals using Rust! ...
If you are new to stream processing, or want to understand how to use SQL to write streaming applications, then you may be interested in this repo.
No BS - just runnable code. No cluster required - works on a laptop 😀😀
No matter how the data infra world evolves, three things will always remain constant, and I always bet on them:
* Postgres
* Kafka
* Iceberg
How are they connected? They are all open standards and essential building blocks for data persistence. Any system should be designed to
S3 is the universal storage layer for modern data systems, and RisingWave is the #1 stream processing system built on top of S3. I won't change my mind.
S3 is increasingly becoming the default storage layer for cloud infrastructure. I wrote notes on this trend, its benefits, its challenges, its early adopters, and the opportunity it presents for new startups to disrupt large infrastructure categories
.@ClickHouseDB is one of RisingWave’s best friends in the OLAP domain. There’s a cool project called chDB that embeds ClickHouse into applications.
Small data is the new trend, and I believe more and more cool projects will emerge in this space. Excited!
Just talked to our engineers and decided to change our slogan on GitHub:
🚀 SQL stream processing with @PostgreSQL-like experience.
🪄 10X faster and more cost-efficient than @ApacheFlink.
Announcements and reports coming soon.
GitHub: .
Just came back from Kafka Summit London 2023. It was a great event that brought together thousands of data enthusiasts. I wrote a blog post describing my takeaways from #kafkasummit.
TLDR:
* Cost efficiency is becoming the key thing. @redpandadata and
The purpose of distributed systems has changed drastically over the last 2 decades.
When MapReduce first emerged, the need for distributed systems was to get better performance - a single node wasn't powerful enough. But now, in the cloud era, we can easily rent machines with big DRAM from
I actually don't like the idea of Continuous Queries (@googlecloud BigQuery), Dynamic Tables (@SnowflakeDB), and Delta Live Tables (@databricks). It's not because the technology (which is stream processing) is wrong, but because stream processing in most cases still belongs to
I keep wondering why @duckdb has suddenly become so popular. While everyone is advocating for big data, how is it that DuckDB, a single-node OLAP database, is gaining so much traction in the 2020s? Can anyone explain?
cc @motherduck @PuffinDB @BoilingData @RillData
.@RocksetCloud was one of the few companies I interviewed with just after graduating from grad school. At that time, Rockset was a tiny 10-person company, and that's why I knew the Rockset founding team pretty well.
I could never have imagined two things:
1️⃣ It would
Data infra trend:
- @confluentinc: managed Kafka ➡️ event streaming platform
- @databricks: managed Spark ➡️ data lakehouse + AI platform
- @elastic: managed ElasticSearch ➡️ search + analytics platform
Data infra vendors are moving to the SaaS layer. Data product is the future.
The fun part of working at a startup is that we can always learn something new from our daily work. From a candidate I got to know that #Rust's energy consumption is 75X lower than Python, actually even lower than C++! We are essentially building an energy-efficient database! #rustlang
Finally! We open sourced RisingWave, the streaming database designed for the cloud! A big milestone for @SingularityData! #RisingWave's goal is to democratize stream processing: stream processing must be made simple, affordable, and accessible, for everyone!
🎉One year ago today, on April 8th, 2022, we open sourced RisingWave, a distributed SQL streaming database. It's been a journey filled with challenges, lessons, and accomplishments. As we celebrate RisingWave's 1st birthday, let's reflect on some milestones:
🚀 First production
That's a SUPER COOL feature!!! Let me try to understand: so @redpandadata's tiered storage data will be written and read using Iceberg format, correct? That essentially means Redpanda will be evolving into a streaming lake house, correct? @emaxerrno
1/4 There was no big announcement, but @emaxerrno mentioned something big during a Twitter Space yesterday. @redpandadata will add support for reading their Tiered Storage data with Iceberg format 🤯. This is HUGE.
An interesting move by IBM. Two things worth noting: 1) IBM, known for its heavy investment in Apache Spark (they even had a Spark Technology Center in San Francisco), has now acquired Ahana, a SaaS for Presto; 2) the investment seems to have been
We are hosting the 4th Int'l Workshop on Applied AI for Database Systems and Applications (AIDB 2022) at this year's VLDB. If you are a database/AI person, please do consider submitting a paper here! @vldb2022 #Database
some fun topics I'd love to discuss:
- is big data really dead? @duckdb
- vector search as a plugin or a database?
- scale out vs scale up?
...
Next Thursday!
Join @nikitabase from @neondatabase, @ryguyrg from @motherduck and @YingjunWu as they discuss key database trends to look out for in 2024.
▶ What's the future of vector databases?
▶ Is Postgres becoming the database lingua franca?
and more...
Sign up:
One of the most important things I learned from @CMUDB is that a great DBMS must support storing emoji 👻 I bore that in mind when designing #RisingWave. Now everyone can use it to do stream processing over their favorite emojis with low latency! 💩😊💩
David Maier was the person who led me into the stream processing world. He advised me on stream processing research during my PhD. Many years later, I founded my own startup, @SingularityData, focusing on stream processing. Thank you, Dave! #sigmod2022
In the field of stream processing, the performance and usability of @ApacheFlink have always been a widely discussed topic 🔥🔥🔥! That's why @AlibabaGroup, the world's largest investor in the Flink community, re-implemented Flink internally using cloud-native architecture,
Fun fact: the co-authors of the paper have launched three startups: Ran Xian founded Metabit Trading in 2019; @andy_pavlo founded @OtterTuneAI in 2020; I founded @SingularityData in 2021. @CMUDB is truly amazing!!!
It was my paper published in VLDB 2017. The original title was "This is the Best Paper Ever on In-Memory Multi-Version Concurrency Control." We changed the title 3 times as the chair threatened to desk reject our paper 🙃
To PhD candidates: TREAT YOUR PHD THESIS SERIOUSLY! I talked to 5 candidates yesterday and 3 of them asked me about my thesis! Candidates don't give a s**t about your startup if you don't treat your own business seriously! Thanks to everyone who pushed me forward during my PhD 😊
If a database vendor claims that they can beat Oracle because they have better technology, I am pretty sure they will fail. Oracle dominates not because they have the best tech, but because they have the best customer service, the best channel, and the best ecosystem.
@sv_techie @YingjunWu @CockroachDB @Yugabyte @PingCAP @PlanetScale A few that come to my mind:
1. Postgres is not yet ready for tier 0 and tier 1 enterprise-grade workloads, where even 1 second of downtime is costly. More foundational features such as faster failovers, reliable and simple to set up/manage active-active (across region) etc. are
Ten things to understand about your database:
1) High level Architecture
2) How writes work? (Replication, data distribution, internal organisation etc)
3) How reads work? (Consistency guarantees, tuning options, etc)
4) CAP theorem, e.g. CP or AP
5) Transactions and Concurrency
𝐇𝐨𝐰 𝐌𝐮𝐜𝐡 𝐃𝐨 𝐘𝐨𝐮 𝐏𝐚𝐲 𝐟𝐨𝐫 𝐘𝐨𝐮𝐫 𝐂𝐥𝐨𝐮𝐝 𝐃𝐚𝐭𝐚𝐛𝐚𝐬𝐞 𝐒𝐞𝐫𝐯𝐢𝐜𝐞𝐬?
While many cloud databases market themselves as low-cost options, a closer examination of their configurations and pricing reveals that prices across different vendors tend to fall
I wrote a blog post on why Kafka is the new data lake. I received lots of criticism after publishing it, and people argue that @apachekafka is just for streaming, and it’s at most a “data river”.
I won’t change my view. I believe streaming platform vendors
I’ve been working on RisingWave, the stream processing system, for over 3.5 years. During this time, we built everything from scratch, went through countless failed PoCs, and now have thousands of users processing event streaming data with RisingWave. But how are people using
Yesterday, we saw the launch of @warpstream_labs, a Kafka-compatible platform built on S3. No matter how good/bad their product is, here are some implications for the data infra space:
* Innovation must still comply with established protocols (e.g., Kafka protocol);
* Everyone is
.@rustlang doesn’t just bring better performance; it empowers engineers to be 10x more productive when working on a complex, collaborative project.
That’s the real reason we use Rust to build RisingWave for enterprise-grade stream processing.
#risingwave @RisingWaveLabs
We @RisingWaveLabs are pushing hard to enable developers to build stream processing solutions with high productivity and low cost. My TODO list for H1 2024:
⏹️Python interface
⏹️Standalone mode
⏹️Adaptive scaling
⏹️Unified stream&batch processing
Let's see if we can ship them!
Lessons I learned after running a startup for three full years: People do not need a “perfect” product. Instead, they require a product that satisfies all of the following three criteria:
✅ Addresses customers' pain points;
✅ Fits into customers' existing environments;
✅ Is
My friends in Singapore 🇸🇬: we are going to host the first physical meetup in our SG office (near Lau Pa Sat) on August 11. Our engineer will talk about the evolution of stream processing. Please join us if you are interested! link: . Dinner is served! 🇸🇬🇸🇬
When I was at AWS Redshift (C/C++ codebase), I spent 2 months developing a feature, followed by 3 months testing and debugging it. SIG11 was always the nightmare. Modern C++ does have cool features like smart pointers, but when you work in a big team, you still have to suffer -
RisingWave will soon make its metadata service pluggable 🔌🔌🔌. Right now, we use @etcdio to store our metadata. But unfortunately, it's hard for us to make things right when supporting large workloads. We have to find a solution.
Instead of using classic services like
#RisingWave was born in early 2021. I still remember the day I resigned from @awscloud Redshift and started coding alone at my home office. I am super lucky to have dozens of engineers join us in reinventing stream processing. Now it's time to open source the project! #startup
Stream processing technology is not black magic. I have no idea why companies pay so much to hire talented engineers simply to maintain hard-to-use streaming systems. Modern streaming systems must be simple, affordable, and accessible! #RisingWave #Database #StreamProcessing
Over the past few years, the amount of streaming data has grown rapidly and now constitutes an increasing share of the data that tech teams have to deal with daily. To make their lives easier, our talented engineers developed RisingWave.
Stay tuned to learn more!
#RisingWave
#Database
Glad to see more folks talking about @RisingWaveLabs, the stream processing system we've been working on for the last 4 years...
Build, ship, repeat... and with some luck, people will know your name one day 😀
(and no, we didn't hire any fake reviewers🙂)
When we first began building RisingWave, we used Calcite. But it turned out to be unsuitable, for reasons of compatibility, flexibility, and more. Now we are using our home-made optimizer to optimize streaming queries.
I am not trying to persuade anyone that
Every new database engine requires a query optimizer. And it's just a TON of work. We couldn't get one off the shelf in the past - Apache Calcite was an interesting attempt, but it didn't translate to OLTP or native languages. Hopeful for this effort!
Came back from #Current22. Takeaway: the stream processing area is booming 💥💥💥 2000 in-person + 5000 online attendees! Will write a blog post on what’s happening at Current22!
Database architecture thread. Technical. There have been several startups building operational relational databases focused on OLTP with a shared-nothing architecture. @neondatabase is using a different approach - shared storage. What's the difference?
.@neondatabase is one of the most popular serverless PostgreSQL providers in the data world. I'd like to learn from their CEO @nikitabase about: 1) their views on vector databases; 2) PostgreSQL's position; 3) how to scale out PostgreSQL; and 4) many more!
Save the date for this Thursday!
Join in to hear Neon's CEO @nikitabase discuss key database trends to look out for in 2024 along with @ryguyrg and @YingjunWu.
90% of RisingWave's business is b2b. We see the great value of community and have decided to double down on it. RisingWave's free tier is coming next week. Developers can use stream processing technology for free.
@ThePrimeagen 90% of our business is b2b. The avg customer pays us $271,000. The free tier is not our funnel. We’ve shed an immense amount of spending and gained infinite runway. Hundreds of millions MAU run on top of our tech. Can any of the database startups throwing shade claim anything
@easyAbi (Boston University) will give a talk on how to build persistent KV stores tailored for stream processing systems. Zoom talk is open to public at 8:00am ET, Mar 24, 2022. Details:
We are building an open and collaborative community for #RisingWave - everyone is welcome, and we are eager to partner with other communities! Right now, we are actively working with @redpandadata to unleash productivity in building real-time apps. Blog post to be released soon!
How many people are using RisingWave, the open-source streaming database? See chart below. The daily Kubernetes deployment has increased by 10X 🚀🚀🚀 over the last 2 months!!! Start building your real-time apps with @PostgreSQL SQL today!
I am always skeptical of any technology that *(self-)claims* to be the "gold standard." Technologies come and go; only protocols can last forever. @PostgreSQL protocol is the gold standard; @apachekafka protocol is the gold standard. A specific technology, however, is not.
@eatonphil I'm the top contributor to Peloton DB, and proud to see that my name gets mentioned in Andy's blog. Very few databases have been built from scratch since 2017, as there are already so many successful DBs. RisingWave is one of the very few that started after 2020.
The Kafka ecosystem has reached a pivotal moment, and several significant changes are either underway or have already occurred:
💸 𝐑𝐮𝐧𝐧𝐢𝐧𝐠 𝐊𝐚𝐟𝐤𝐚 𝐰𝐢𝐥𝐥 𝐛𝐞 10𝐗 𝐜𝐡𝐞𝐚𝐩𝐞𝐫 𝐭𝐡𝐚𝐧 𝐢𝐭 𝐰𝐚𝐬 𝐚 𝐟𝐞𝐰 𝐲𝐞𝐚𝐫𝐬 𝐚𝐠𝐨. For those using Kafka primarily for
In a professional technical presentation, should I include the names of all my competitors in my slide deck? It's NOT a sponsored talk; it's all about technology.
Interesting blog at @Medium: "You can also adopt streaming database like @RisingWaveLabs which can joining/aggregating streaming data with SQL syntax."
Yes, we do see the trend of using RisingWave in ML infra.
risingwavelabs / risingwave: The distributed streaming database: SQL stream processing with Postgres-like experience 🪄. 10X faster and more cost-efficient than Apache Flink 🚀. ★4968
I've been playing with @RisingWaveLabs. It's absolutely insane. The performance is out of this world. I haven't touched this thing and am just running the all-in-one version locally.
- 120k+ rows/s ingestion from Kafka (redpanda - 2G memory)
- Joined to output 240k+ rows/s!!
I’ll be talking about building stream processing applications with RisingWave and Apache Pulsar at #PulsarSummit Europe! Check out the schedule!
Today's news about @ApacheIceberg tells us that in the ever-changing data market, the key to success is finding consensus. Iceberg is the consensus for data lakes, and whether it's @SnowflakeDB, @databricks, or @awscloud, they all need to adhere to this consensus. Clearly,
Multi-versioning is a cornerstone technology in database systems. Even though it's not new, it's still used in almost every data product out there. At @RisingWaveLabs, we use multi-versioning to create a feature called "time-traveling," which lets users go back to a specific
Want to learn how database concurrency control really works?
Check out this paper from @YingjunWu and @andy_pavlo! It dives deep into the most widely used type of concurrency control today: multi-version concurrency control (MVCC). The basic idea behind MVCC is for a database