ABC

@Ubunta

Followers
4K
Following
4K
Statuses
5K

Data & ML Infrastructure for Healthcare | Opinions are my neighbor's | DhanvantriAI 📍 🇩🇪 Berlin & 🇮🇳 Kolkata

Berlin, Germany
Joined August 2009
@Ubunta
ABC
4 months
Today in Basel, a team of 24 developers, including myself, built and deployed the following stack to handle 2 TB of offline data processing and a 15 Mbps real-time stream on a bare-metal Kubernetes cluster hosted on Hetzner. The entire setup was achieved on a budget of approximately $200, demonstrating a highly cost-effective, frugal data infrastructure:
- Launched a highly available Kubernetes cluster with 50 nodes.
- Deployed Apache Flink across 17 nodes for stream processing.
- Set up a Jupyter Notebook ecosystem, used concurrently by 16 developers to process datasets ranging from 20 to 50 GB.
- Deployed LocalStack to mimic AWS services.
- Implemented both Apache Spark and Ray for distributed computing.
- 50% of the ETL processes were initially done using @duckdb and then scaled to PySpark (a small sketch of that pattern follows below).
- Used @argoproj for workflow orchestration.
- Built six complex dashboards with @apachesuperset for visualization.
- Deployed @apachekafka on a small node to maximize efficiency in extracting data.
- @dify_ai is used for building LLM applications; more on this tomorrow.
System testing will take place over the next two weeks, and we expect to gather more detailed feedback from that process. Thanks 🙏 to the team Basel Infra, you all rock 🚀
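To make the "prototype in DuckDB, scale out in PySpark" handoff concrete, here is a minimal sketch assuming Parquet inputs; the paths, bucket, and column names are hypothetical and not from the original post.

```python
# Hypothetical sketch of the "prototype in DuckDB, scale out in PySpark" pattern.
# Paths, bucket, and column names are made up for illustration.
import duckdb
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# 1) Prototype the aggregation locally on a 20-50 GB Parquet sample with DuckDB.
con = duckdb.connect()
sample = con.execute("""
    SELECT device_id, date_trunc('hour', ts) AS hour, avg(value) AS avg_value
    FROM read_parquet('data/sample/*.parquet')
    GROUP BY 1, 2
""").df()
print(sample.head())

# 2) Re-express the same logic in PySpark once it has to run over the full dataset.
spark = SparkSession.builder.appName("etl-scale-out").getOrCreate()
full = (
    spark.read.parquet("s3a://datalake/raw/")   # hypothetical bucket (LocalStack or S3)
    .withColumn("hour", F.date_trunc("hour", F.col("ts")))
    .groupBy("device_id", "hour")
    .agg(F.avg("value").alias("avg_value"))
)
full.write.mode("overwrite").parquet("s3a://datalake/curated/hourly_avg/")
```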
7
13
165
@Ubunta
ABC
2 hours
@andfanilo It's pretty good. If you wanna use the full stack mentioned, or just NocoDB, try the Cost-Effective Analytics Stack
Tweet media one
0
0
1
@Ubunta
ABC
23 hours
Layoffs are often influenced more by personal relationships than by performance. In many cases, cultivating strong relationships with managers and maintaining a solid network proves more valuable than being a high-performing yet isolated coder.
1
2
11
@Ubunta
ABC
2 days
As a Data Engineer, I've spent years navigating the complexities of data infrastructure and AI engineering. That hands-on experience inspired me to create HotTechStack, a platform that comes with pre-deployed Data & AI tech stacks. With HotTechStack, you can skip the tedious setup and jump right into building your products. 🔥
What can you do with HotTechStack?
- Learn and understand the tech stack
- See how different technologies integrate and behave
- Integrate varying solutions in the tech stack
- Find the best tech stack for your preferred jobs
- Design your own tech stack. Feeling adventurous? Build your ideal setup from scratch. HotTechStack guides you through the selection and deployment of components, helping you craft a tech stack that perfectly matches your vision.
- Infrastructure with integrated AI (coming soon!)
By offering flexible, customizable tech stacks paired with comprehensive documentation, HotTechStack lets you focus on what truly matters: innovation and growth. Welcome to a smarter, faster way to build the future. Let's get started!
I won't only write about tech stacks and infrastructure; I'll make them available to try and go live! 🙏🙏
A huge shoutout to all my mentors who validated the idea and gave me early feedback, pushing me to launch HotTechStack sooner rather than later. Thank you for your support and belief in this project! 😁
Tweet media one
0
0
3
@Ubunta
ABC
3 days
When automating Data Engineering and ML tasks, I often ran into issues integrating multiple data sources to enhance RAG pipelines. For example, I needed to extract analytics using @duckdb, gain high-level business insights from Snowflake/DBC, and trigger pipelines seamlessly.
My Solution: I turned to n8n as an orchestration tool to manage my entire workflow. The platform let me interact with a broad data ecosystem while integrating generative AI (GenAI) tasks.
The results were phenomenal:
- Extensive integrations: n8n supports a wide range of data-centric applications, such as Snowflake, PostgreSQL, and AWS S3, making it easier to connect to the systems you already use.
- Any service that exposes an API can be integrated via HTTP requests. This means you can connect virtually any tool to n8n, including those needed for AI functions.
- With support for loops and conditional logic, n8n lets you route data to different endpoints based on specific scenarios. This flexibility is crucial for building complex workflows.
- Beyond basic AI functionalities, n8n enables you to connect data applications with GenAI services with minimal or even no coding.
Use cases: Using n8n as an orchestrator, I can automate tasks such as:
- Synchronizing data across systems
- Uploading files to S3
- Triggering events in data warehouses (DWH)
- Running analytics queries on DuckDB
- Migrating data to larger DWH environments
and many more. A small sketch of triggering such a workflow over HTTP follows below.
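As a minimal illustration of kicking off an n8n workflow from Python through a Webhook trigger node, here is a sketch; the webhook URL and payload fields are assumptions for illustration, not the actual workflow described above.

```python
# Hypothetical sketch: trigger an n8n workflow (exposed via a Webhook node)
# from Python, passing parameters for a downstream analytics/upload job.
# The webhook URL and payload shape are assumptions for illustration.
import requests

N8N_WEBHOOK_URL = "https://n8n.example.com/webhook/run-analytics"  # hypothetical

payload = {
    "source": "duckdb",
    "query": "SELECT country, count(*) AS orders FROM orders GROUP BY country",
    "destination": "s3://my-bucket/reports/orders_by_country.parquet",  # hypothetical
}

resp = requests.post(N8N_WEBHOOK_URL, json=payload, timeout=30)
resp.raise_for_status()
print("Workflow accepted:", resp.json())
```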
Tweet media one
1
6
47
@Ubunta
ABC
4 days
Why are so many tech Twitter users who once left now making a comeback? What changed?
3
0
1
@Ubunta
ABC
4 days
@Subhash_Peshwa @duckdb Kubernetes, with an auth layer in front of everything
0
0
1
@Ubunta
ABC
4 days
@_ashwanthkumar @duckdb The AI Router mainly redirects queries to replicas of different sizes. I designed it to keep costs very low, and the read replicas are also of different sizes. Seriously: I just wanna see how far I can go with it
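Purely to illustrate the general idea of routing queries to differently sized replicas (not the actual router described above), a toy sketch might look like this; the replica endpoints and the cost heuristic are invented.

```python
# Hypothetical sketch of routing queries to differently sized read replicas
# based on a crude cost estimate. Endpoints and heuristic are made up.
REPLICAS = {
    "small":  "replica-small.internal:5432",   # cheapest, for point lookups
    "medium": "replica-medium.internal:5432",
    "large":  "replica-large.internal:5432",   # for heavy scans and joins
}

def estimate_cost(sql: str) -> int:
    """Very rough heuristic: joins, aggregations, and unfiltered scans are expensive."""
    s = sql.lower()
    cost = 1
    cost += 3 * s.count(" join ")
    cost += 5 if "group by" in s else 0
    cost += 5 if " where " not in s else 0
    return cost

def route(sql: str) -> str:
    cost = estimate_cost(sql)
    if cost <= 2:
        return REPLICAS["small"]
    if cost <= 6:
        return REPLICAS["medium"]
    return REPLICAS["large"]

print(route("SELECT * FROM users WHERE id = 42"))                       # small replica
print(route("SELECT country, count(*) FROM orders GROUP BY country"))   # large replica
```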
0
0
0
@Ubunta
ABC
4 days
@dataenggdude @duckdb 300 TB is not a small size, so I'd recommend using managed services. In my case, the data size is within 1-3 TB, and the high-quality clean data is just within 100 GB
0
0
0
@Ubunta
ABC
6 days
@JayChia5 Marketing budget aside, let me focus on two technical problems. It's hard to calculate the exact memory requirement for tasks, so behavior in production is uncertain. And there's no easy way to debug queries, understand metrics, etc., so observability is challenging. I'll write more ✍️
0
0
0
@Ubunta
ABC
8 days
This week, I developed a deep reasoning agent tailored for data infrastructure, inspired by ChatGPT's research. I integrated a Mixture of Models approach combining GPT-4o and Deepseek-Reasoner.
Tech Stack & Deployment Tools
- Utilized Airflow and Dagster for streamlined task management.
- Deployed multiple JupyterHub instances equipped with Polars and DuckDB for fast data processing.
Key Implementation Details
- Instead of relying solely on the latest documentation, I introduced a semantic layer. This ensures that every model call is enriched with contextual understanding, improving response relevance.
- Both models function as conversational agents until a predefined completion criterion is met (a minimal sketch of this loop follows below). During initial interactions, they discuss and determine subsequent steps. Notably, insights from Deepseek-Reasoner prompted GPT-4o to refine its answers over the course of the conversation.
- I capped the interaction at a maximum of 10 calls. This threshold struck a balance between operational efficiency and resource cost.
- Internet search capabilities were enabled during model interactions.
Results & Observations
- A single run generated complex pipelines, an effective deployment strategy, a code server, comprehensive logging, and basic debugging tools, all within a 15-20 minute window.
- While the generated code required further fine-tuning and minor API bug fixes, the agent provided substantial guidance for resource discovery.
This implementation validates the potential of AI agents for Data Engineering workflows. While domain-specific adaptations still require significant engineering effort, the foundation is promising.
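A minimal sketch of the capped two-model conversation loop described above, assuming both models are reachable through OpenAI-compatible chat endpoints; the "DONE" completion marker, prompts, and API-key handling are simplifying assumptions rather than the actual implementation.

```python
# Minimal sketch of a capped two-model "mixture" loop, assuming both models are
# reachable through OpenAI-compatible chat endpoints. The "DONE" completion
# marker, prompts, and key handling are simplifying assumptions.
from openai import OpenAI

gpt = OpenAI()  # gpt-4o via the standard OpenAI endpoint (OPENAI_API_KEY in env)
reasoner = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_DEEPSEEK_KEY")

MAX_CALLS = 10  # hard cap on total model calls to bound cost
TASK = "Design a deployment plan for a JupyterHub + DuckDB data infrastructure stack."

def ask(client, model, prompt):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

transcript = TASK
calls = 0
while calls < MAX_CALLS:
    # The reasoner reviews the transcript and proposes or critiques the next step.
    plan = ask(reasoner, "deepseek-reasoner",
               f"{transcript}\n\nPlan or critique the next step. Say DONE when finished.")
    transcript += f"\n\n[reasoner]\n{plan}"
    calls += 1
    if "DONE" in plan or calls >= MAX_CALLS:
        break
    # GPT-4o turns the plan into concrete artifacts (configs, code outlines).
    draft = ask(gpt, "gpt-4o",
                f"{transcript}\n\nRefine the latest plan into concrete steps.")
    transcript += f"\n\n[gpt-4o]\n{draft}"
    calls += 1

print(transcript)
```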
0
0
4
@Ubunta
ABC
9 days
When it comes to Data or AI Infrastructure, logging is the unsung hero: vital, yet often overlooked. Most folks wouldn't choose to look through logs, but when things go south, that's where you turn. However, uncovering the root cause hidden in endless lines of logs demands more than just looking; it takes real skill and a whole lot of patience. While today's logging systems aren't exactly easy to use, I'm optimistic. With GenAI on the rise, there's an awesome opportunity to make logging not only simpler but also smarter.
0
0
7
@Ubunta
ABC
12 days
@rishdotblog It's more or less the same, but now it tries to listen to the user better, so if you give more hints, it may give you the desired outcome... but not always
0
0
1
@Ubunta
ABC
12 days
Indian Budget broke the DeepSeek trend 😀
0
0
1
@Ubunta
ABC
13 days
Data Engineering Documentation Problems 🤦‍♂️ that seem very hard to solve...
1. The docs paint a picture of a flawless Docker setup; everything runs like a Swiss watch. But the moment you deploy to production? Chaos. Suddenly, security rules clash, network policies twist your mind, and resource limits pop up out of nowhere. It's like trying to release a lab-raised mouse into the wild: good luck with that!
2. Documentation authors seem to live in a dreamland where their tool is the only one that exists. Meanwhile, back in reality, we're juggling Kafka, Spark, Airflow, dbt, and half a dozen other tools. Yet the docs give us single-tool examples, like handing out a spoon in the middle of a knife fight.
3. You copy a code snippet, only to realize it's like an IKEA shelf with half the screws missing. It shows you a nice, clean transformation, but what about error handling? Logging? Monitoring? Apparently, production code is supposed to run on hopes and prayers. (See the sketch after this list.)
4. "Just use version X.Y.Z." Reality: your production system is running on A.B.C, your staging on P.Q.R, and the examples in the docs? Those are from version ∞.∞.∞, which doesn't even exist yet! Trying to piece together compatibility information feels like solving a Rubik's cube in the dark.
5. The docs: "This operation is blazing fast! 🚀" Reality: sure, it's fast... when you're processing 10 rows on your laptop. But throw in a billion records, add some network latency, and suddenly your "blazing fast" operation is moving slower than a sloth in a meditation session.
6. The "Security? What Security?" approach. Most docs treat security like that one distant relative everyone forgets to invite to family gatherings. Authentication? Authorization? Data encryption? These crucial topics often get the "we'll add it later" treatment, leaving you to figure out how to keep your data pipeline from becoming a Swiss cheese of security holes.
7. When things go wrong (and they will!), most docs leave you feeling like a detective with a broken flashlight. The error messages are about as helpful as a chocolate teapot, and the debugging guides seem to assume everything fails in the most convenient way possible. Spoiler alert: it doesn't!
8. Every data pipeline is a complex recipe, mixing tools into a technical curry. But the docs? They barely teach you how to boil water. Want to handle retries when Tool A crashes, Tool B gets stuck, and Tool C starts having an existential crisis? Yeah, you're on your own, my friend.
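To illustrate point 3, here is a hedged sketch of the gap between the "docs version" of a transformation and what production actually needs; the function, field names, and retry policy are hypothetical, not from any particular tool's documentation.

```python
# Hypothetical illustration of point 3: the docs show the clean transformation,
# production code needs retries, logging, and a clear failure path around it.
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def transform(batch: list[dict]) -> list[dict]:
    # The part the docs show: a clean, happy-path transformation.
    return [{**row, "amount_eur": row["amount_cents"] / 100} for row in batch]

def run_with_retries(fetch_batch, max_attempts: int = 3):
    # The part the docs usually skip: retries, logging, and failure handling.
    for attempt in range(1, max_attempts + 1):
        try:
            batch = fetch_batch()
            result = transform(batch)
            log.info("Transformed %d rows", len(result))
            return result
        except Exception:
            log.exception("Attempt %d/%d failed", attempt, max_attempts)
            if attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)  # exponential backoff before retrying
```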
0
4
20
@Ubunta
ABC
14 days
If you're dealing with rare data engineering problems, GenAI is one of the least effective platforms for finding solutions. It will confuse you more and may end up wasting your time
1
0
11