Artsiom Yudovin

@ayudovin

Followers: 191
Following: 60
Statuses: 129

Principal Data Engineer | Open-Source contributor | Conference speaker | #Data | #DataEngineering

Warsaw, Poland
Joined May 2016
@ayudovin
Artsiom Yudovin
5 days
Temporal Joins in Apache Flink: the join that Apache Spark does not support! ⌛️

A Temporal Table in Flink evolves over time, capturing changes dynamically. It can be:
- A changing history table: tracks all changes (e.g., a database changelog).
- A changing dimension table: stores only the latest snapshot (e.g., a database table).

Event-Time Temporal Join
Flink's event-time temporal join enriches a table with evolving metadata by retrieving the historical value of a key as of a specific point in time, ensuring accurate joins.

Example use case: a table of orders in different currencies needs to be normalized to USD using historical exchange rates (see the sketch below).

Benefits of temporal joins:
- Accurate historical data: ensures time-consistent joins.
- Efficient processing: stores only the necessary snapshots.
- Streaming & batch compatibility: works in real-time and batch environments.
- SQL-standard compliant: uses SQL:2011 syntax.

Conclusion: Flink temporal joins enable robust, time-aware data merging with historical accuracy and efficient processing.

#dataengineering #streaming
Tweet media one
Tweet media two
0
0
0
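A minimal PyFlink sketch of the event-time temporal join described above. The orders and currency_rates tables, their columns, and the connector settings are hypothetical; this illustrates the FOR SYSTEM_TIME AS OF pattern rather than the exact queries behind the post.

```python
# Sketch of a Flink event-time temporal join via the PyFlink Table API.
# Table names, fields, and connectors are illustrative assumptions.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Append-only stream of orders with an event-time attribute and watermark.
t_env.execute_sql("""
    CREATE TABLE orders (
        order_id STRING,
        price DECIMAL(10, 2),
        currency STRING,
        order_time TIMESTAMP(3),
        WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
    ) WITH ('connector' = 'datagen')
""")

# Versioned table of exchange rates: primary key + watermark make it a temporal table.
t_env.execute_sql("""
    CREATE TABLE currency_rates (
        currency STRING,
        conversion_rate DECIMAL(10, 4),
        update_time TIMESTAMP(3),
        WATERMARK FOR update_time AS update_time - INTERVAL '5' SECOND,
        PRIMARY KEY (currency) NOT ENFORCED
    ) WITH ('connector' = 'upsert-kafka', 'topic' = 'rates',
            'properties.bootstrap.servers' = 'broker:9092',
            'key.format' = 'json', 'value.format' = 'json')
""")

# Each order is joined with the exchange rate that was valid at the order's event time.
result = t_env.sql_query("""
    SELECT o.order_id,
           o.price * r.conversion_rate AS usd_price,
           o.order_time
    FROM orders AS o
    JOIN currency_rates FOR SYSTEM_TIME AS OF o.order_time AS r
    ON o.currency = r.currency
""")
```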
@ayudovin
Artsiom Yudovin
5 days
A temporal join helps in streaming solutions because it tracks all changes and joins each record with the right version of the data.
@ayudovin
Artsiom Yudovin
14 days
🛠️ Comparison of Data Versioning Tools

A quick breakdown of Git LFS, DVC, Delta Lake, Quilt, lakeFS, and Apache Nessie:

1️⃣ Primary Purpose
•Git LFS: Large file versioning in Git
•DVC: Data & model versioning for ML
•Delta Lake: Data lakes & pipelines
•Quilt: Dataset management/sharing
•lakeFS: Git-like version control for object storage
•Nessie: Version control for tabular data

2️⃣ Storage Backend
•Git LFS: Git repos (with linked storage)
•DVC: S3, GCS, etc.
•Delta Lake: Cloud/HDFS
•Quilt: Cloud (S3)
•lakeFS: Object storage (S3, GCS, etc.)
•Nessie: Cloud/on-premise databases

3️⃣ Data Lineage
•Git LFS: Minimal
•DVC: Excellent
•Delta Lake: Excellent
•Quilt: Good
•lakeFS: Excellent (branch-based)
•Nessie: Excellent (branch-based)

4️⃣ Key Use Cases
•Git LFS: Binary versioning w/ Git
•DVC: ML workflows
•Delta Lake: Table analytics
•Quilt: Collaborative datasets
•lakeFS: Data lakes w/ Git-like branching
•Nessie: Table versioning (e.g., Iceberg/Delta)

Which fits your workflow best, or what comparison criteria do you want to investigate deeper? Let me know! 👇

#Data #dataengineering
Tweet media one
0
1
4
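To make the comparison above concrete for at least one of the tools, here is a small sketch of DVC's Python API reading the same file at two different Git revisions. The repository URL, file path, and tag are hypothetical.

```python
# Sketch of data versioning with DVC: read the same dataset at two revisions.
# Repo URL, file path, and tag are hypothetical.
import dvc.api

# The data as it was at the "v1.0" tag of the (hypothetical) repo.
with dvc.api.open("data/train.csv",
                  repo="https://github.com/example/ml-project",
                  rev="v1.0") as f:
    old_header = f.readline()

# The latest version of the same file on the default branch.
with dvc.api.open("data/train.csv",
                  repo="https://github.com/example/ml-project") as f:
    new_header = f.readline()

print(old_header, new_header)
```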
@ayudovin
Artsiom Yudovin
18 days
@venkat_s I agree. You need to be passionate and not afraid to express your thoughts.
0
0
0
@ayudovin
Artsiom Yudovin
18 days
I'm happy to share some exciting news! 🚀

My talk proposal, "The architecture of ClickStream solution," has been selected for the Data & AI Warsaw Tech Summit program, chosen from all the fantastic submissions in the Call for Presentations!

I'm excited to talk about our clickstream solution. ClickStream is a data analytics platform that tracks and analyzes every click and interaction users have on our website or app. We built it on AWS using technologies such as gRPC, Kubernetes, Apache Flink, AWS Redshift, and more, and these tools have been key to the results we achieved.

We'll explore how it all works, the ups and downs we discovered along the way, and how this can help a business thrive. I can't wait to share it with you!

I also have something special for you: a 10% discount code, FromSpeaker10. Use it to receive a 10% discount on registration!

Registration:

I hope to meet you there – April 10-11, 2025!

#DataAiWarsawTechSummit #DAWTS
Tweet media one
0
0
2
@ayudovin
Artsiom Yudovin
20 days
@Franc0Fernand0 Also, many engineers don't know what tail recursion means
0
0
1
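For readers who haven't met the term, a small Python illustration of the difference. Note that CPython itself does not perform tail-call elimination, so the optimization only pays off in languages and compilers that do (Scala, Scheme, etc.); the shape of the code is the point.

```python
# Illustration of what "tail recursion" means.

def factorial(n: int) -> int:
    # Not tail-recursive: after the recursive call returns we still multiply by n,
    # so every frame has to stay on the stack.
    if n <= 1:
        return 1
    return n * factorial(n - 1)

def factorial_tail(n: int, acc: int = 1) -> int:
    # Tail-recursive: the recursive call is the last thing the function does,
    # so a compiler with tail-call elimination can reuse the current stack frame.
    if n <= 1:
        return acc
    return factorial_tail(n - 1, acc * n)

print(factorial(5), factorial_tail(5))  # 120 120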
@ayudovin
Artsiom Yudovin
24 days
Many data professionals believe using One Big Table (OBT) is a bad practice, but that is not true. A One Big Table can enhance performance and simplify data management by consolidating everything into a single location. Dimensional data modeling often involves numerous joins and other operations to extract insights from the data. Therefore, one effective way to improve your data marts is to incorporate OBT into your dimensional data modeling. There is no need to be apprehensive about it. It can make your data more accessible for less technically inclined people. #data #datamodel #dataengineering
Tweet media one
0
1
2
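A tiny pandas sketch of the OBT idea above: pre-joining a fact table with its dimensions into one wide table that analysts can query without repeating the joins. The tables and columns are invented for illustration.

```python
# Sketch of building a One Big Table (OBT) from a small star schema with pandas.
import pandas as pd

fact_sales = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 11, 10],
    "product_id": [100, 101, 100],
    "amount": [25.0, 40.0, 15.0],
})
dim_customers = pd.DataFrame({
    "customer_id": [10, 11],
    "customer_name": ["Alice", "Bob"],
    "country": ["PL", "DE"],
})
dim_products = pd.DataFrame({
    "product_id": [100, 101],
    "product_name": ["Keyboard", "Mouse"],
    "category": ["Accessories", "Accessories"],
})

# Pre-join the dimensions once; downstream users query a single wide table.
obt_sales = (
    fact_sales
    .merge(dim_customers, on="customer_id", how="left")
    .merge(dim_products, on="product_id", how="left")
)
print(obt_sales)
```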
@ayudovin
Artsiom Yudovin
30 days
Static Duck Typing in Python: A Closer Look

We often mention Python's dynamic typing, where a variable's type is determined at runtime. However, an interesting concept called "static duck typing" combines Python's dynamic nature with static type checking.

What is static duck typing? Look at PEP 544 – Protocols: Structural subtyping (static duck typing).

Static duck typing refers to inferring types from an object's behavior or interface rather than its explicit type declaration, with the checking done before runtime by static type checkers (see the sketch below).

Do you know about static duck typing in other languages?

#Python
Tweet media one
0
1
2
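A minimal example of PEP 544 structural subtyping. The class and protocol names are invented, but the pattern is standard typing.Protocol usage that a static checker such as mypy verifies without any inheritance relationship.

```python
# Structural subtyping ("static duck typing") with typing.Protocol (PEP 544).
from typing import Protocol

class SupportsQuack(Protocol):
    def quack(self) -> str: ...

class Duck:
    def quack(self) -> str:
        return "quack"

class Person:
    def quack(self) -> str:
        return "I'm imitating a duck"

def make_noise(q: SupportsQuack) -> str:
    return q.quack()

# Both calls type-check statically, even though neither class declares SupportsQuack:
# they simply have the right method signature.
print(make_noise(Duck()), make_noise(Person()))
```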
@ayudovin
Artsiom Yudovin
30 days
@Ubunta I don’t hate “Bad Data.” That means I totally have a job.
0
0
1
@ayudovin
Artsiom Yudovin
1 month
@bigdatasumit Additionally, consider using ZSTD compression instead of Snappy, as it will likely be more efficient based on benchmarks.
0
0
0
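As a small illustration of the suggestion above, this is what switching Parquet compression from Snappy to ZSTD looks like with pandas/pyarrow; the file names and data are arbitrary.

```python
# Write the same DataFrame as Parquet with Snappy vs. ZSTD compression (pandas + pyarrow).
import pandas as pd

df = pd.DataFrame({"user_id": range(1_000), "event": ["click"] * 1_000})

df.to_parquet("events_snappy.parquet", compression="snappy")  # common default
df.to_parquet("events_zstd.parquet", compression="zstd")      # typically smaller files
```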
@ayudovin
Artsiom Yudovin
1 month
One step to improve latency in a streaming solution!

Latency is an essential element of any streaming solution, and writing results from your streaming job to storage adds a significant amount of it. Some storage systems, such as AWS Redshift and ClickHouse, offer deep integration with Apache Kafka, so accessing the data from Kafka takes minimal effort (see the sketch below for the ClickHouse side).

- Amazon Redshift supports data ingestion from Amazon MSK (Managed Streaming for Apache Kafka) or on-premises Apache Kafka. This enables low-latency, high-speed ingestion of streaming data into an Amazon Redshift materialized view. Because it does not require staging data in Amazon S3, Redshift can ingest streaming data with lower latency and reduced storage cost. Streaming ingestion is configured in a Redshift cluster with SQL statements that authenticate and connect to an Amazon MSK topic.

- In ClickHouse, you can create a Kafka consumer with the Kafka table engine and treat it as a data stream. First, create a table with the desired structure. Then create a materialized view that transforms data from the engine table and inserts it into the previously created table. Once the materialized view is attached to the engine, it collects data in the background, so you continuously receive messages from Kafka and convert them into the required format with SELECT statements. One Kafka table can feed multiple materialized views; they do not read the table directly but receive new records in blocks, which lets you write to several tables at different levels of detail, with or without aggregation.

#data #dataengineering
Tweet media one
Tweet media two
1
0
1
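A sketch of the ClickHouse setup described above (Kafka engine table, MergeTree target table, materialized view), with the DDL issued through clickhouse_driver. The broker address, topic, and schema are hypothetical.

```python
# Sketch of ClickHouse Kafka ingestion: Kafka engine table -> materialized view -> MergeTree table.
from clickhouse_driver import Client

client = Client(host="localhost")

# 1. Kafka engine table: a consumer that exposes the topic as a stream.
client.execute("""
    CREATE TABLE IF NOT EXISTS events_kafka (
        user_id UInt64,
        event_type String,
        event_time DateTime
    ) ENGINE = Kafka
    SETTINGS kafka_broker_list = 'broker:9092',
             kafka_topic_list = 'events',
             kafka_group_name = 'clickhouse_events',
             kafka_format = 'JSONEachRow'
""")

# 2. Target table with the desired structure.
client.execute("""
    CREATE TABLE IF NOT EXISTS events (
        user_id UInt64,
        event_type String,
        event_time DateTime
    ) ENGINE = MergeTree
    ORDER BY (event_time, user_id)
""")

# 3. Materialized view: continuously reads blocks from the Kafka table
#    and inserts them into the target table.
client.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS events_mv TO events AS
    SELECT user_id, event_type, event_time
    FROM events_kafka
""")
```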
@ayudovin
Artsiom Yudovin
1 month
@EcZachly This is a common scenario: stakeholders request streaming, but when you ask what latency they expect, they typically say 2-3 minutes. However, if you continue the conversation and inquire about how they plan to use this data, you often hear, "I am going to build daily analytics."
0
0
10
@ayudovin
Artsiom Yudovin
1 month
💡 The Journey of Developing My Technical Blog In early November, I created my technical blog to share my experiences with the world. After two months of dedication, I'm excited about my progress, no matter how small it may seem. Thank you for being so supportive! As a Senior Staff Data Engineer with a decade of experience in the Big Data arena, I thrive on contributing to open-source projects like Apache Spark and Apache Airflow. I also find great joy in inspiring others as a speaker at various conferences. 🚀
Tweet media one
1
0
6
@ayudovin
Artsiom Yudovin
1 month
@KaiWaehner Any advice on choosing between Kafka Streams and Apache Flink? In my experience, Kafka Streams offers more value than Apache Flink from a scalability perspective.
1
0
0
@ayudovin
Artsiom Yudovin
1 month
@Ubunta Bad data will find you wherever you are 👿
0
0
0
@ayudovin
Artsiom Yudovin
1 month
I believe the following challenges should be included:

8. Unstructured Data Chaos
- Extracting meaningful information from unstructured data, such as free-form text, images, or audio, requires advanced processing techniques like natural language processing (NLP), optical character recognition (OCR), or machine learning.
- Logs and event data often lack a consistent format, making them difficult to parse.
- Rich-text fields can contain embedded HTML, JSON, or other nested formats, which adds complexity to data handling.

9. Metadata Management Confusion
- Missing or incomplete metadata makes it challenging to understand the context and lineage of the data.
- Inconsistent or outdated documentation can lead to the misinterpretation of data fields.
- The absence of versioning in datasets creates uncertainty about which version of the data is in use.
0
0
0
@ayudovin
Artsiom Yudovin
2 months
Yes, I do. I always advise my team to approach refactoring with caution, as it can be an endless process. To manage this, we plan our refactoring efforts and establish clear goals. This helps us maintain a clear vision of when to stop and what specific improvements we want to achieve.
0
0
0
@ayudovin
Artsiom Yudovin
2 months
@bigdatasumit Nice post. I've found a good use case for RAG on my data team: we use RAG to let people talk to our data catalog in natural language through an LLM.
@ayudovin
Artsiom Yudovin
2 months
How can AI, LLMs, and RAG improve the availability of a data catalog for your users?

#RAG (Retrieval-Augmented Generation) can enhance your data catalog by improving its availability and usability. The data catalog serves as a knowledge base for the Data Warehouse (DWH), focused on metadata management and data discovery. Typically, DWH users find the data catalog uninteresting and are reluctant to engage with it.

Imagine applying the experience of interacting with large language models (LLMs) to the data catalog. Instead of sifting through the catalog, users could simply ask questions about what data is available in the DWH; the catalog becomes an accessible knowledge base behind your RAG system. As an additional benefit, you could request the generation of SQL queries that are aware of the DWH's metadata, allowing for more efficient exploration of the DWH (see the sketch below).

#Data #DWH #AI

What do you think about using RAG in this way?
Tweet media one
0
0
0
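A toy sketch of the retrieval step such a data-catalog assistant could use: rank catalog entries against a user question and assemble the LLM prompt. The catalog entries are invented, TF-IDF stands in for a real embedding model, and the LLM call itself is left out.

```python
# Toy retrieval step for a data-catalog RAG assistant:
# find the catalog entries most relevant to a question and build an LLM prompt.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

catalog = [
    "dwh.sales_orders: one row per order, columns order_id, customer_id, amount, order_date",
    "dwh.customers: customer dimension with customer_id, name, country, signup_date",
    "dwh.web_events: raw clickstream events with user_id, url, event_time",
]

question = "Which table should I use to analyze orders per country?"

# Retrieve: rank catalog entries by similarity to the question.
vectorizer = TfidfVectorizer().fit(catalog + [question])
scores = cosine_similarity(vectorizer.transform([question]),
                           vectorizer.transform(catalog))[0]
top_entries = [catalog[i] for i in scores.argsort()[::-1][:2]]

# Augment: the retrieved metadata becomes context for the LLM prompt.
prompt = (
    "You are a DWH assistant. Using only the catalog entries below, "
    "answer the question and, if helpful, propose a SQL query.\n\n"
    + "\n".join(top_entries)
    + f"\n\nQuestion: {question}"
)
print(prompt)  # this prompt would then be sent to an LLM of your choice
```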