ABC

@Ubunta

Followers: 3,950 · Following: 3,157 · Media: 280 · Statuses: 5,210

Data & ML Infrastructure for Healthcare | Opinions are my neighbour's | DhanvantriAI 📍 🇩🇪 Berlin & 🇮🇳 Kolkata

Berlin, Germany
Joined August 2009
Pinned Tweet
@Ubunta
ABC
10 months
As a Senior Staff Data Engineer, my top five tasks over the past 2 years include: 1. Simplifying Kubernetes for Data Scientists/Engineers: Developed user-friendly libraries and containers, enabling Data Scientists to utilize Kubernetes effortlessly. Achieved a complete
12
40
335
@Ubunta
ABC
2 years
Lazydocker - a very useful terminal-UI application for managing Docker. This is a really brilliant application for simplifying Docker management.
22
133
636
@Ubunta
ABC
2 years
People are debating Snowflake vs Databricks, and I am rebuilding my Data/ML stack on @duckdb , Apache Arrow, @IbisData and @flyteorg
15
33
331
@Ubunta
ABC
2 months
DrawDB is an excellent tool for database design and ER modeling. I found it very user-friendly, and it also allows you to upload existing schemas. 📌You can check it out here: (). I used the generated SQL for PostgreSQL!
3
81
324
@Ubunta
ABC
2 months
Data Engineering and Machine Learning are currently in one of their most exciting phases: - Single-node data stacks, like @DataPolars and Apache Arrow, are now capable of handling 80% of data use cases, even with terabytes of data. - @duckdb is rapidly gaining traction, with
6
25
254
@Ubunta
ABC
2 years
Apache Arrow is on Fire 🔥🔥🔥 🙏 DataFusion 🔥 @duckdb ⚡ Polars. To me, @ApacheArrow is now the most important component in the data and ML community
7
21
244
@Ubunta
ABC
1 year
Data Engineering offers good pay if you're skilled in several technologies - Streaming engines: Flink & Kafka - DWH: Spark, Snowflake, Trino, ClickHouse - Distributed DBs: HBase, CockroachDB, YugabyteDB - Infrastructure: ELK stack, Docker + Python & SQL "The ability to explain ☝️ these"
3
37
214
@Ubunta
ABC
1 month
The most effective and modern approach to Data Engineering and Machine Learning involves using @duckdb , @DataPolars , or Data Fusion with external data formats like Delta, Iceberg, or Parquet. If you're still using CSV, the first step should be converting it to Parquet!
7
13
180
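A minimal sketch of that Parquet-first workflow with DuckDB; the file and column names (events.csv, user_id) are hypothetical:

```python
import duckdb

# One-time conversion: Parquet is columnar and compressed, so every later scan is cheaper.
duckdb.sql(
    "COPY (SELECT * FROM read_csv_auto('events.csv')) "
    "TO 'events.parquet' (FORMAT PARQUET)"
)

# Subsequent queries read the Parquet file directly; no warehouse, no cluster.
top_users = duckdb.sql("""
    SELECT user_id, count(*) AS n
    FROM read_parquet('events.parquet')
    GROUP BY user_id
    ORDER BY n DESC
    LIMIT 10
""").df()
print(top_users)
```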
@Ubunta
ABC
9 months
A very minimal Data Engineering technology stack designed to address the majority of data use cases for a team newly established in close collaboration with Data Science. 1. Apache Spark: This is our primary tool for data processing, ETL, and SQL queries. Apache Spark is highly
6
29
154
@Ubunta
ABC
1 month
Here are some potential shifts we may see in modern data engineering: - High-performance DataFrame technologies such as @DataPolars , @daft_dataframe , and DataFusion are set to gradually replace Pandas as the standard for data manipulation and analysis. - @duckdb will continue to
5
18
153
@Ubunta
ABC
2 years
The biggest problem with the current state of Data Engineering is that we are trying to do *almost everything with SQL. Previously, we tried everything to avoid SQL!
17
8
146
@Ubunta
ABC
10 months
Over the past two years, I have conducted over 100 interviews for Data Engineering positions and have noticed the following trends: 1. Over 80% of Data Engineers have not collaborated with Data Science teams, predominantly having backgrounds in analytics. 2. More than 90% of
6
17
139
@Ubunta
ABC
7 months
A very Data Engineering Problem: When working with data that contains rows with very similar content, it's more effective to deduplicate these rows with a certain degree of probability rather than attempting to process all the data. For this purpose, I utilized Splink, and the
3
10
136
@Ubunta
ABC
1 year
Avoid @ApacheSpark when: - Your dataset is smaller than 50GB; opt for @duckdb & @DataPolars - You lack experienced data engineers (infra) and a budget for Databricks - The priority is cost-saving - The project is short-term - You need only 2-3 medium nodes; explore vertical scaling.
5
21
134
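For scale: the single-node replacement for a small Spark job can be as short as this Polars sketch (path and column names hypothetical; the lazy API shown follows recent Polars releases):

```python
import polars as pl

# Lazy scan + aggregation on one node, standing in for a small Spark job.
result = (
    pl.scan_parquet("data/sales/*.parquet")
      .filter(pl.col("status") == "completed")
      .group_by("region")
      .agg(pl.col("amount").sum().alias("total_amount"))
      .collect()
)
print(result)
```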
@Ubunta
ABC
2 years
I demoed @duckdb to 10 data scientists last Friday & they used it regularly for the week. Feedback: * Speed is the most impressive thing about DuckDB. * It's not exactly a Pandas replacement (not yet). * Most of them discovered the real power of @ApacheArrow for the first time
9
17
128
@Ubunta
ABC
11 months
Inefficient Data Engineering Practices: - Apache Flink pipeline for handling 1kb data/sec - Apache Airflow for manual task execution - Switching to dbt without need, just following trends. - Handling massive CSV files inefficiently. - 1000s of Dashboards with no users
9
15
122
@Ubunta
ABC
2 years
When a majority of the data/ML community was busy building tools for large-scale systems, @duckdb built an extremely fast database & @ApacheArrow designed a memory format to optimize a single node. Both have a massive impact on the overall data/ML ecosystem.
3
17
121
@Ubunta
ABC
2 years
Technically we are trying to rediscover databases in Data Engineering
10
7
115
@Ubunta
ABC
1 year
Data Engineering involves: - Crafting SQL/dbt pipelines for dbs & dashboards. - Designing Spark jobs & orchestrating via Airflow. - Launching real-time tasks with Flink, Kafka, & KV stores. - Analyzing data using pandas & polars. - Building data infrastructure. - Converting notebooks to prod
2
20
117
@Ubunta
ABC
10 months
As a Data Engineer, working closely with many data scientists has provided me with valuable insights. Here are my 5 key takeaways: 1. Diverse Interests in Data: It's a misconception that all data scientists primarily focus on model building. Many are deeply interested in
3
10
109
@Ubunta
ABC
2 years
The biggest problem of @duckdb & the entire @ApacheArrow stack, including Ibis, is awareness. Most of the Data/ML community is not even aware of these technologies, let alone their use cases.
7
11
104
@Ubunta
ABC
3 months
The evolution of highly performant single-node data tools like @duckdb , @ApacheArrow @IbisData , and Polars, which can process vast amounts of data and exceed the limits of RAM, is a remarkable development for cloud deployment. With these tools, I can now design highly complex
3
14
102
@Ubunta
ABC
3 months
Today, I met with five senior managers from enterprise data teams. Here are the key points from our discussion: All of them are using data warehouses, but none have tried single-node data stacks like @duckdb , Arrow, or @IbisData . They argued that single-node technology cannot
7
9
98
@Ubunta
ABC
3 months
I had an amazing time yesterday in Kolkata debating on Data Engineering, Single Node Data Stack, and the confusing state of GenAI 😃 Many senior engineers argued that processing 1TB+ of data is too common nowadays, and that's why @duckdb isn't the solution. But later, we
4
8
96
@Ubunta
ABC
3 months
How will I process over 100GB+ of data on the cloud: Choose an instance with ~32GB+ RAM and SSD storage. Install @ApacheArrow , @duckdb , and/or DataFusion to utilize lazy evaluation and perform complex queries. Begin with the initial setup and run your data queries. If the
4
9
95
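A hedged sketch of that setup using DuckDB: cap memory below the instance's RAM and give it SSD spill space. The bucket path and exact limits are placeholders:

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")           # query S3 data directly
con.sql("SET memory_limit = '24GB';")             # stay under the ~32GB of instance RAM
con.sql("SET temp_directory = '/mnt/ssd/tmp';")   # spill to SSD for larger-than-RAM queries

# Lazy evaluation: only the referenced columns and row groups are fetched.
result = con.sql("""
    SELECT category, avg(price) AS avg_price
    FROM read_parquet('s3://my-bucket/big-dataset/*.parquet')
    GROUP BY category
""").df()
```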
@Ubunta
ABC
2 years
Despite only having a 30GB memory limit, @duckdb allows me to process 133 million records from S3 on an r5.xlarge machine. It was possible to run all my existing queries on a single machine; some of them had four layers of CTEs.
5
6
92
@Ubunta
ABC
7 months
In Data Engineering, Key Aspects to Focus On: 1. Dependable Database Systems 2. The choice of data format is critical; while CSV is widely recognized, Parquet offers superior efficiency and optimization. 3. Effective Logging: Even simple print statements can be invaluable. 4.
3
8
93
@Ubunta
ABC
9 months
Data Engineers continue to face the same major challenges as they did in 2023: 1. Clear Communication with Clients: Avoid technical jargon to prevent confusion. Understanding client needs clearly is vital for the success of data projects. 2. Explaining Data Pipelines to
1
20
89
@Ubunta
ABC
2 years
@AdiPolak The data is clean 🥹
0
0
82
@Ubunta
ABC
2 years
Deploying @duckdb on Kubernetes and interacting with it through @apachesuperset is impressive. It's by far the cheapest option (*for me) to explore data quickly for building data science applications. Native integration with #apachearrow made life much easier
5
11
84
@Ubunta
ABC
2 years
While building Data Engineering or Machine Learning pipelines using CSV data, you should first: 🔥 Convert the CSV data to Parquet and save it using @DataPolars . Using the Parquet + @ApacheArrow combination with almost any data engine is significantly faster than CSV
7
11
77
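The conversion itself is two lines of Polars (file names hypothetical):

```python
import polars as pl

# Read once as CSV, persist as compressed columnar Parquet for everything downstream.
df = pl.read_csv("raw_events.csv")
df.write_parquet("raw_events.parquet", compression="zstd")
```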
@Ubunta
ABC
1 month
The "Tools don't matter" argument falls flat in Data Engineering and ML. - Working with large datasets on a Spark cluster can quickly shift your focus from business problems to infrastructure challenges. - Using Pandas for 10GB+ data demands huge RAM, but switching to @duckdb
3
3
77
@Ubunta
ABC
1 year
You don't need any cloud subscription to start learning Data Engineering. - Set up Docker with @Minio @duckdb @apachesuperset - Use Minio to create an S3-compatible object store - Use DuckDB as a database by specifying the S3 endpoint - Use Apache Superset to create dashboards
6
10
75
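The DuckDB-to-MinIO wiring in that setup might look like this; localhost:9000 and the minioadmin credentials are the MinIO container defaults, used here as placeholders:

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")
# Point DuckDB's S3 layer at the local MinIO container instead of AWS.
con.sql("SET s3_endpoint = 'localhost:9000';")
con.sql("SET s3_access_key_id = 'minioadmin';")
con.sql("SET s3_secret_access_key = 'minioadmin';")
con.sql("SET s3_use_ssl = false;")
con.sql("SET s3_url_style = 'path';")

print(con.sql("SELECT count(*) FROM 's3://demo/events.parquet'").fetchall())
```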
@Ubunta
ABC
2 years
Several startups in Berlin 🇩🇪 are finding data warehouses extremely expensive & pulling out. I personally know of 3 companies from India that dropped their DWH licenses & moved entirely to object-store data. Speed is good, but it comes with a cost & doesn't always make 💵 .
15
5
73
@Ubunta
ABC
10 months
In data engineering, I encountered some issues due to overcomplicating things with advanced tools and technologies: 1. Instead of using complex programs like PySpark, a simple pipe could have been enough for the task. 2. Reading data directly from a storage service like Amazon
3
14
72
@Ubunta
ABC
10 months
A few strategies I implemented in my Data Engineering production systems to make them more "cost effective💰" and "less cool" 1. Maximize the use of caching wherever feasible: adding a cache layer between the client and the data warehouse resulted in over 60% cache hits weekly 2.
2
6
71
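A toy, in-process sketch of that caching idea; the real layer would sit between clients and the warehouse (e.g. Redis), and run_warehouse_query is a hypothetical stand-in:

```python
import time
from functools import lru_cache

def run_warehouse_query(sql: str) -> str:
    time.sleep(1)                       # stand-in for an expensive warehouse round-trip
    return f"rows for: {sql}"

@lru_cache(maxsize=256)
def _cached(sql: str, bucket: int) -> str:
    return run_warehouse_query(sql)

def query(sql: str, ttl_seconds: int = 3600) -> str:
    # The bucket value changes once per TTL window, forcing a periodic cache refresh.
    return _cached(sql, int(time.time() // ttl_seconds))
```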
@Ubunta
ABC
10 months
When a data scientist finds out that a big data warehouse isn't the only way and discovers the modern data stack: - Big data warehouses can be costly for testing and exploring data. - It might take a while for data scientists to switch to @duckdb or a new data stack, but they get there.
1
7
70
@Ubunta
ABC
10 months
Some interesting Data Engineering benchmarks for production 1. Measure how long different technologies take to write data to blob storage. Not all tech behaves the same while writing to the blob. 2. Track the time required to serialize and deserialize data. This is crucial for
0
13
71
@Ubunta
ABC
2 months
Guidelines for Data Engineers to collaborate easily with Data Scientists - Don't assume Data Scientists will like or be familiar with containers; automate containerization and deployment steps as much as possible. - Data Engineers may not prefer Notebooks in production, but for
1
5
71
@Ubunta
ABC
11 months
Tech Update: Over a million dashboards were created last month, and the total estimated number of views they received was 100.
2
6
70
@Ubunta
ABC
2 years
I blogged about how to set up and explore @duckdb using @apachesuperset on Kubernetes. This is becoming an important setup for my ML Platform, and I am convincing all my data scientists to use it.
3
11
70
@Ubunta
ABC
3 months
I typically default to PostgreSQL for most data use cases, but it’s not always convenient for data science applications. Setting up PostgreSQL involves significant effort in creating specific users, schemas, and loading backups. That's why @duckdb is becoming my preferred
3
5
68
@Ubunta
ABC
3 months
Using @duckdb , building data products for healthcare has become incredibly simple. I developed a straightforward product stack for medical analysis using MIMIC-IV data: - Utilized @ollama 's Open-webui to access local LLMs - Integrated DuckDB with Open-webui, unlocking the entire
0
8
67
@Ubunta
ABC
2 years
LinkedIn posts are all about Databricks or Snowflake, but Twitter is dominated by @duckdb Conclusion: LinkedIn is more enterprise-driven and Twitter is open source 😁
2
2
67
@Ubunta
ABC
8 months
Why aren't @ApacheArrow or Protocol Buffers the universal standard for data formats in Data Engineering and Machine Learning applications? - What factors lead Data Engineers and ML professionals to pay minimal attention to the data format? - What barriers prevent vendors from
9
6
66
@Ubunta
ABC
2 years
Polars DataFrame: How to work with larger-than-memory files 🎉 Polars allows streaming results that are larger than RAM to be written to disk. I have tested it on my old laptop and it worked. The sink_parquet function converts your CSV to a smaller #parquet file as well!
3
14
63
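A minimal sketch of that streaming path (file names hypothetical; the streaming engine's exact API has shifted slightly across Polars releases):

```python
import polars as pl

# scan_csv is lazy; sink_parquet runs the query in streaming mode,
# so the full dataset never has to fit in RAM.
(
    pl.scan_csv("huge_input.csv")
      .filter(pl.col("value").is_not_null())
      .sink_parquet("huge_output.parquet")
)
```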
@Ubunta
ABC
2 years
🍻 @clickhouse is fast but not designed for a single node 🦆 @duckdb is the best local OLAP DB but is not good with object storage ❄️ @SnowflakeDB & Databricks are highly scalable but not cheap 🐘 @PostgreSQL is the best but sharding is still a pain
12
5
63
@Ubunta
ABC
5 months
I've been using @SnowflakeDB Copilot for the past two days and have successfully onboarded my entire Data Science team using Copilot. Snowflake Copilot is pretty fast, way faster than OpenAI. Text-to-SQL is likely to be one of the first challenges successfully tackled by
3
7
63
@Ubunta
ABC
2 months
Apache Airflow remains one of the most widely used orchestration platforms for data engineering, primarily due to its simplicity. While you might encounter a bit more work and a somewhat sluggish UI, the Airflow operator model is very simple, and creating your own operators is
7
2
61
@Ubunta
ABC
11 months
New Data Engineering is Doing More by using Less 1. Choose @DataPolars / @duckdb instead of data warehouses. 2. Utilize Jupyter/ @streamlit for temporary dashboards. 3. Implement code from tools like Copilot/GenAI, avoiding rewrites. 4. Focus on core principles over tools.
1
10
61
@Ubunta
ABC
2 years
A simple Kubernetes open source based MLOps platform * @MLflow for tracking & versioning * @DVCorg for Data Versioning * Seldon Core for #MachineLearning Microservices * @bentomlai for Model Deployment * @flyteorg for k8s native orchestration
1
9
58
@Ubunta
ABC
2 years
In the field of data engineering and machine learning, Apache Parquet and Apache Arrow are becoming the cornerstones
2
5
57
@Ubunta
ABC
2 months
A small but efficient Data Engineering and Machine Learning infrastructure stack for seamless collaboration between engineers and data scientists. Data Stack: - Python-based with Apache Arrow, DuckDB, Polars, and Pandas - Jupyter Notebook endpoint - Integration with a separately
1
5
59
@Ubunta
ABC
2 years
Why do you need a large-scale data warehouse when 70% of data science work is data exploration on a few GBs of data stored in blob storage? You can simply use @duckdb locally or on a single node without needing to deal with any complicated setup & save significant cost
3
3
57
@Ubunta
ABC
10 months
Challenging Data Engineering Practices That Are Difficult to Avoid 1. Reliance on Spark UDF: Despite its inefficiency, Spark UDFs are widely used for their speed and ability to handle complex tasks, making them a tempting option. 2. Overusing @streamlit : Streamlit's ease of
0
1
54
@Ubunta
ABC
9 months
If you effectively utilize @ApacheIceberg or @apachehudi , you can bypass the need for a managed data warehouse for a medium-sized data engineering team. Incorporating @ApacheArrow as the foundational layer for data further enhances system efficiency, rivaling that of any
2
10
55
@Ubunta
ABC
5 months
Insights gained from Young Data Engineers - Many young DEs primarily focus on specific platforms like AWS or Azure, often missing the broader context of their field and concentrating solely on tool-specific functionalities. - They are influenced by popular posts on social media
7
7
53
@Ubunta
ABC
2 months
What a Data Engineer wants - Setting up @DataPolars and using a data stack around @ApacheArrow - Converting all code to PySpark and bringing in a cluster for production - Adding all possible alerts for metrics and resource utilization What a Data Scientist wants - Developing a
2
2
52
@Ubunta
ABC
1 year
As a data engineer, implement a centralized logging framework immediately if you're collaborating with data scientists. It streamlines the process, saving time, and prevents them from navigating the chaotic maze of sourcing logs from separate data tools.
3
8
51
@Ubunta
ABC
10 months
Very Data Engineering decisions 😅: 1. Overlooking Caching: Treating Spark's 'cache' as the sole caching mechanism. 2. Ignoring Query Optimization: Not considering query plans despite using large databases. 3. Minimizing Code: Favoring tools over high level programming,
0
7
52
@Ubunta
ABC
1 year
Peak Data Engineering moments: - Attempted loading a hefty CSV with pandas/spark - no luck. - Gave @duckdb @DataPolars a shot - still no go. - Went old-school: read line by line, fixed errors, then rewrote. - Transformed the polished CSV to Parquet. - Feeling lucky that Parquet exists 😅
4
4
51
@Ubunta
ABC
3 months
Over the past weeks, I've delved deeply into the evolving DataFrame ecosystem and have realized one fundamental truth: Apache Arrow is essential and default. @DataPolars is impressively fast, but the performance difference between Polars and Pandas 2.0 with Arrow integration
8
3
52
@Ubunta
ABC
8 months
Most of the advancements in Data and Machine Learning are now being led by Rust, with @ApacheArrow playing a crucial role in this development. Data stored in Arrow format works seamlessly with both Rust and Python thanks to zero-copy sharing!
3
4
51
@Ubunta
ABC
11 months
Challenges in Data Engineering and Data Science 1. Dockerfiles aren't popular among many. 2. GitHub workflows are still unknown territory 3. While data accessibility is crucial, data security poses challenges 4. Everyone thinks DE/DS is a gold mine 💰 5. No standard production setup
3
5
50
@Ubunta
ABC
11 months
Avoid data engineering pitfalls: 1. Don't migrate to a large database/dwh when simpler options like Postgres suffice. 2. Avoid costly clusters for non-revenue projects. 3. Never compromise data security for convenience. 4. Find your real customer 5. Setup the metrics
2
4
48
@Ubunta
ABC
1 year
Running a Data Engineering production pipeline with @duckdb - Create a container - Install DuckDB & other libraries - Run the queries inside the container - Update the dashboard This is not true for ad-hoc analysis, but production pipelines are just fine with only 🦆
4
4
50
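The "run the queries inside the container" step can be a single Python entrypoint, sketched here with hypothetical bucket and column names:

```python
# entrypoint.py, executed inside the pipeline container
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")

# Pull raw data, aggregate, and publish one artifact for the dashboard to pick up.
con.sql("""
    COPY (
        SELECT region, sum(revenue) AS revenue
        FROM read_parquet('s3://lake/daily/*.parquet')
        GROUP BY region
    ) TO 'summary.parquet' (FORMAT PARQUET)
""")
```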
@Ubunta
ABC
2 years
If you have a reasonable data size in s3 which can be easily tackled by @duckdb , then why can't you use DuckDB in Production? What is the exact problem? I don't see a point in starting a new cluster and spending more when DuckDB can easily do the job 🤔
9
1
48
@Ubunta
ABC
7 months
Unspoken Principles of Data Engineering 1. Once established in the production environment, a table is eternal. Whether it’s actively used or not, it’s destined to stay indefinitely. 2. The data platform that seems straightforward to you will invariably be perceived as the most
1
3
47
@Ubunta
ABC
2 years
The most groundbreaking and useful upcoming database feature is WASM. A database in the browser with exceptional speed is simply unbeatable. Another reason to love @duckdb
4
4
46
@Ubunta
ABC
11 months
My fine-tuned GenAI 7B model for Data Engineering is in prod: 1. Domain-specific text-to-SQL 2. PySpark, Airflow, Polars, Pandas & internal library code generation, tailored to your codebase 3. Explaining code/stats to non-technical people 4. Mining unique codes from domain docs
8
5
47
@Ubunta
ABC
3 months
Reflecting on my experience using a @streamlit app in production over the past six months, even with a limited user base of 20 users, I've learned a few interesting lessons: - From the very beginning, I found it crucial to use an external database for maintaining the state and
3
2
47
@Ubunta
ABC
11 months
Each of Iceberg, Hudi, and Delta has a dedicated audience: 1. Spark users who are Databricks customers lean towards Delta. 2. Snowflake users not keen on Delta (random reasons) often go for Iceberg. 3. Those with a long-term interest in this technology generally prefer Hudi.
8
4
46
@Ubunta
ABC
3 months
Why are new DataFrame technologies not developing APIs similar to Pandas? API comfort is a thing, and it's challenging to persuade Data Scientists to adopt new methods for DataFrame transformations when they can simply argue, "Pandas has been around forever; why should I change
7
4
46
@Ubunta
ABC
2 years
The future of databases is getting more interesting. - Create a decent large-size sample and deploy @duckdb so data scientists and others can access it - For more non-core users, create a read-only DuckDB WASM build and let users access the data in the browser
1
2
44
@Ubunta
ABC
3 months
Today, I had a call with three senior data leaders, and we discussed my preferred data warehouse solutions. I believe modern data warehouses are incredibly powerful and can address almost all use cases. However, they can also become the most expensive data assets within an
0
6
45
@Ubunta
ABC
2 years
Apache Airflow is a good orchestration platform for DE/ML, but it doesn't solve every problem, and it's really not easy; we just got used to it. But now that's changing, and looking for alternatives is the most obvious & common step.
16
5
43
@Ubunta
ABC
2 years
Using @duckdb with Streamlit is simply magical.
2
1
45
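For flavor, the whole combination fits in a few lines (file name hypothetical; run it with streamlit run app.py):

```python
import duckdb
import streamlit as st

st.title("Ad-hoc data explorer")
sql = st.text_area("SQL", "SELECT * FROM read_parquet('events.parquet') LIMIT 100")
if st.button("Run"):
    # DuckDB executes the query in-process; Streamlit renders the result.
    st.dataframe(duckdb.sql(sql).df())
```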
@Ubunta
ABC
9 months
The field of Data Engineering is currently grappling with a significant challenge related to Data Format. Generally, there are two primary types of data formats: 1. Human-Readable formats: These include familiar formats such as Excel, CSV, and JSON. They are highly preferred by
3
11
44
@Ubunta
ABC
2 years
If your data engineering system is composed of multiple platforms like Databricks, @SnowflakeDB , @duckdb , etc., just use SQL 🏋️‍♂️ Trying to use Python is not going to make your life easier; not everything speaks Pandas or PySpark, but SQL works everywhere
3
1
44
@Ubunta
ABC
2 years
If you don't have a very large dataset, consider Polars DataFrames, @duckdb & @ApacheArrow for cheaper but faster analytics. Local development is extremely fast with these packages, and you can avoid connecting to a data warehouse cluster for many use cases!
1
3
43
@Ubunta
ABC
11 months
Data Engineering Anti-patterns *Overkill: A data warehouse for tiny public datasets. *One-offs: Data pipelines for yearly tasks. *Misjudged Scale: Treating under 50GB as big data. *Underestimation: Overlooking Postgres/MySQL's capacity *Neglecting CSV->strict schema conversion
2
3
41
@Ubunta
ABC
8 months
Processing data locally with Pandas 2 and @DataPolars reveals Polars as the superior choice. When processing approximately 3.5GB of CSV text data row by row, including extracting a set of keywords, using regex, splitting and merging strings, and then writing the results back as
4
3
43
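A sketch of that kind of text pipeline as vectorized Polars expressions instead of Python-level row loops; the file, columns, and regex are hypothetical:

```python
import polars as pl

(
    pl.scan_csv("articles.csv")
      .with_columns(
          pl.col("body").str.extract_all(r"#\w+").alias("keywords"),               # regex extraction
          pl.concat_str(["title", "source"], separator=" | ").alias("headline"),   # merging strings
      )
      .sink_parquet("articles.parquet")   # stream the result back out as Parquet
)
```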
@Ubunta
ABC
9 months
Data compression plays a crucial role in establishing a robust data infrastructure and is vital for efficient data transmission over networks. zstd has emerged as a leading compression algorithm, reducing data size by 55-60% when transmitted. Notably, its 'Fastest' mode offers
4
3
43
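A quick way to sanity-check numbers like that with the zstandard package (the payload file is hypothetical):

```python
import zstandard as zstd

data = open("payload.json", "rb").read()

compressed = zstd.ZstdCompressor(level=3).compress(data)
print(f"compressed to {len(compressed) / len(data):.0%} of the original size")

# Round-trip to confirm the compression is lossless.
assert zstd.ZstdDecompressor().decompress(compressed) == data
```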
@Ubunta
ABC
2 years
Now the Data Engineering and ML stack is composed of 🏋️‍♂️ @databricks or Spark ❄️ @SnowflakeDB or another DWH 🐍 @getdbt & Python (pandas, numpy, scikit & other ML libraries) 🐻‍❄️ @DataPolars for pandas 🦢 @duckdb for OLAP ⏩ 1 orchestrator The missing part is @IbisData to bring consistency
1
8
43
@Ubunta
ABC
1 year
How I *finally persuaded data scientists to adopt @duckdb & @DataPolars 🛠️🔍 - Developed a streamlined library mirroring Pandas functions for Polars/DuckDB - Implemented an AI interface that recommends the exact functions in the Polars/DuckDB docs that are not in the lib (still refining)
4
3
41
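The library itself isn't public; the idea is roughly a pandas-flavoured facade over Polars. A hypothetical sketch (group_by().len() follows recent Polars releases):

```python
import polars as pl

# Hypothetical facade: pandas-style names, Polars underneath.
def read_csv(path: str) -> pl.DataFrame:
    return pl.read_csv(path)

def value_counts(df: pl.DataFrame, column: str) -> pl.DataFrame:
    # Mirrors pandas' Series.value_counts() with a Polars group-by.
    return df.group_by(column).len().sort("len", descending=True)
```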
@Ubunta
ABC
11 months
Created a basic Data Engineering pipeline in 10 mins using @OpenAI Vision & @phidatahq 's SDK. Smooth image-to-code transformation - futuristic! - Read CSV data and convert to Parquet with Polars - PySpark data processing - Snowflake data loading - Orchestrating it all from an image
3
7
42
@Ubunta
ABC
2 years
Some core insights on Apache Arrow vs Parquet: Arrow is primarily an in-memory format, whereas Parquet is a storage format. * #parquet is a storage format designed for maximum space efficiency, whereas Arrow is an in-memory format intended for operation by vectorized
1
8
42
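The distinction in a few lines of PyArrow: Parquet is what you write to disk, Arrow is what you compute on in memory (file name hypothetical):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "amount": [9.5, 3.2, 7.1]})  # Arrow: in-memory, columnar
pq.write_table(table, "amounts.parquet")                        # Parquet: compact on-disk storage
loaded = pq.read_table("amounts.parquet")                       # back into Arrow for vectorized work
print(loaded.schema)
```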
@Ubunta
ABC
7 months
A significant challenge in transitioning from Pandas to @DataPolars is the syntax difference. There's often debate over the numerous changes required. Then I realized that @IbisData functions as a DataFrame programming language, allowing the use of @duckdb , Polars, or any other
1
7
42
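What "DataFrame programming language" means in practice: the same Ibis expression can run on DuckDB today and another backend later. A sketch with hypothetical file and column names, following recent Ibis releases:

```python
import ibis

con = ibis.duckdb.connect()             # swap the backend without rewriting the query
t = con.read_parquet("events.parquet")
expr = t.group_by("country").agg(total=t.amount.sum())
print(expr.to_pandas())
```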
@Ubunta
ABC
10 months
The B+Tree is an indispensable data structure that underpins many of the databases we rely on daily. While it's often a topic of discussion in interviews, its practical implementation can be quite challenging. This blog post demystified the complexities of the B+Tree. Real-world
1
7
41
@Ubunta
ABC
2 years
Now, @duckdb is backed by investment. I really hope DuckDB won't take the distributed-database route and will remain the most optimized single-node OLAP database
3
0
40
@Ubunta
ABC
8 months
Which is quicker for processing a ~10GB text dataset locally: using @IbisData with @duckdb or @DataPolars ? There are numerous methods for processing text data, and while Pandas remains a convenient option, it's time to consider alternatives. 😇
13
5
40
@Ubunta
ABC
1 year
Engineers are usually more interested in advocating for their preferred technologies than in understanding the actual business requirements 😵‍💫 Many startups run a ❄️ DWH for an overall data size of less than 100GB on a revenue of 100k 💲
3
1
41
@Ubunta
ABC
12 days
Here are 3 Dashboards that integrate well with existing data and machine learning ecosystems: 1. Apache Superset: Offers excellent integration and is user-friendly. 2. Metabase: Provides an intuitive interface, especially for joining tables. 3. DataEase: A newer option I'm
6
2
40
@Ubunta
ABC
2 years
Why exactly is @duckdb so fast? 😕 Is it some very sophisticated query-plan generation, very optimized code, some unique way of extracting data, or what? Or is it not fast at all - just the first OLAP engine designed for the local system 🏋️‍♂️
4
7
40
@Ubunta
ABC
2 months
Another open-source database, @CockroachDB , has moved away from the open-source model. I believe PostgreSQL will continue to be the dominant choice now and in the future!
1
5
39
@Ubunta
ABC
2 years
ADBC is a single API for getting Arrow data in and out of different databases. JDBC and ODBC are widely popular drivers, but they are database-specific, and ODBC's columnar format representation is not a perfect match for Arrow.
1
6
40
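A small sketch of the ADBC shape, assuming the adbc_driver_sqlite package and its DB-API layer (an in-memory database by default):

```python
import adbc_driver_sqlite.dbapi as adbc

conn = adbc.connect()                 # in-memory SQLite via ADBC
cur = conn.cursor()
cur.execute("SELECT 1 AS x, 'arrow' AS fmt")
table = cur.fetch_arrow_table()       # results arrive as an Arrow table, not row tuples
print(table)
cur.close()
conn.close()
```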
@Ubunta
ABC
9 months
The key issues in Data Engineering: 1. Resource Optimization: Challenges include inefficient code reliant on scalability, premature adoption of new data technologies leading to increased costs and complexity, and a lack of deep optimization or understanding of these systems. 2.
0
2
39
@Ubunta
ABC
2 years
3 exciting developer-centric tools make Data Engineering & Machine Learning easy: @duckdb , Malloy & @IbisData DuckDB - if you *just want a simple, lightweight, fast OLAP DB. @duckdb is not for very huge data, but most businesses don't even need that
1
10
38
@Ubunta
ABC
1 month
More than half of Data Engineering challenges can be addressed more effectively through smart caching rather than by scaling.
2
3
39
@Ubunta
ABC
11 months
Developing a GPT-powered tool simplifies tasks and could revolutionize the engineering field: - Data Engineering Assistant - DE bootcamp - Domain-specific code generation Consider an SDK tailored for data engineering that could automate half of the workload.
1
1
39