![Artsiom Yudovin Profile](https://pbs.twimg.com/profile_images/1867290793066590208/gHEOe3EQ_x96.jpg)
Artsiom Yudovin
@ayudovin
Followers: 191 · Following: 60 · Statuses: 129
Principal Data Engineer | Open-Source contributor | Conference speaker | #Data | #DataEngineering
Warsaw, Poland
Joined May 2016
Temporal Joins in Apache Flink: the join that Apache Spark does not support! ⌛️

A Temporal Table in Flink evolves over time, capturing changes dynamically. It can be:
- A Changing History Table: tracks all changes (e.g., a database changelog).
- A Changing Dimension Table: stores only the latest snapshot (e.g., a database table).

Event-Time Temporal Join
Flink's event-time temporal joins enrich a table with evolving metadata by retrieving the value a key had at a specific point in time, ensuring accurate joins.

Example use case: a table of orders in different currencies needs to be normalized to USD using historical exchange rates (see the sketch below).

Benefits of Temporal Joins:
- Accurate Historical Data: ensures time-consistent joins.
- Efficient Processing: stores only the necessary snapshots.
- Streaming & Batch Compatibility: works in real-time and batch environments.
- SQL-Standard Compliant: uses SQL:2011 syntax.

Conclusion: Flink temporal joins enable robust, time-aware data merging, ensuring historical accuracy and efficient processing.

#dataengineering #streaming
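A minimal sketch of the currency-conversion use case, written with PyFlink. The table names, columns, watermarks, and connector settings are my own illustration, not from the original post; the event-time temporal join itself uses the SQL:2011 `FOR SYSTEM_TIME AS OF` syntax that Flink documents for versioned tables.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming TableEnvironment (assumes a local PyFlink installation).
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Hypothetical append-only orders stream with an event-time watermark.
t_env.execute_sql("""
    CREATE TABLE orders (
        order_id   STRING,
        price      DECIMAL(32, 2),
        currency   STRING,
        order_time TIMESTAMP(3),
        WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
    ) WITH ('connector' = 'datagen')
""")

# Hypothetical versioned table of exchange rates (a changing history table):
# the primary key plus the event-time watermark lets Flink track each
# rate's validity period. Broker address and topic are placeholders.
t_env.execute_sql("""
    CREATE TABLE currency_rates (
        currency    STRING,
        rate_to_usd DECIMAL(32, 10),
        update_time TIMESTAMP(3),
        WATERMARK FOR update_time AS update_time,
        PRIMARY KEY (currency) NOT ENFORCED
    ) WITH (
        'connector' = 'upsert-kafka',
        'topic' = 'currency_rates',
        'properties.bootstrap.servers' = 'localhost:9092',
        'key.format' = 'json',
        'value.format' = 'json'
    )
""")

# Event-time temporal join: each order is matched with the rate that was
# valid at the order's own event time, not with the latest rate.
result = t_env.sql_query("""
    SELECT o.order_id,
           o.price * r.rate_to_usd AS price_usd,
           o.order_time
    FROM orders AS o
    JOIN currency_rates FOR SYSTEM_TIME AS OF o.order_time AS r
    ON o.currency = r.currency
""")

result.execute().print()
```

The key detail is `FOR SYSTEM_TIME AS OF o.order_time`: it pins the lookup to the order's event time, which is what makes the join time-consistent for historical data.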
Temporal joins help streaming solutions because they track all changes and join each record with the right version of the data.
🛠️ Comparison of Data Versioning Tools

A quick breakdown of Git LFS, DVC, Delta Lake, Quilt, lakeFS, and Apache Nessie:

1️⃣ Primary Purpose
• Git LFS: Large file versioning in Git
• DVC: Data & model versioning for ML
• Delta Lake: Data lakes & pipelines
• Quilt: Dataset management/sharing
• lakeFS: Git-like version control for object storage
• Nessie: Version control for tabular data

2️⃣ Storage Backend
• Git LFS: Git repos (with linked storage)
• DVC: S3, GCS, etc.
• Delta Lake: Cloud/HDFS
• Quilt: Cloud (S3)
• lakeFS: Object storage (S3, GCS, etc.)
• Nessie: Cloud/on-premise databases

3️⃣ Data Lineage
• Git LFS: Minimal
• DVC: Excellent
• Delta Lake: Excellent
• Quilt: Good
• lakeFS: Excellent (branch-based)
• Nessie: Excellent (branch-based)

4️⃣ Key Use Cases
• Git LFS: Binary versioning w/ Git
• DVC: ML workflows
• Delta Lake: Table analytics
• Quilt: Collaborative datasets
• lakeFS: Data lakes w/ Git-like branching
• Nessie: Table versioning (e.g., Iceberg/Delta)

Which fits your workflow best, or which comparison criteria would you like to investigate more deeply? Let me know! 👇

#Data #dataengineering
I'm happy to share some exciting news! 🚀

My topic proposal, "The architecture of ClickStream solution", has been selected for the Data & AI Warsaw Tech Summit program, chosen from all the fantastic submissions during the Call for Presentations process!

I'm excited to talk about our clickstream solution. ClickStream is a data analytics platform that tracks and analyzes every click and interaction users have on our website and app. We built it on AWS using technologies like gRPC, Kubernetes, Apache Flink, AWS Redshift, and more, and these tools have been key to the results we achieved. Let's dive in together and explore how it all works, the ups and downs we discovered along the way, and how it can help a business thrive. I can't wait to share all of this with you!

I also have something special for you: the discount code FromSpeaker10 gives you 10% off registration.

Registration:

I hope to meet you there – April 10-11, 2025!

#DataAiWarsawTechSummit #DAWTS
Many data professionals believe using One Big Table (OBT) is a bad practice, but that is not true. A One Big Table can improve performance and simplify data management by consolidating everything into a single location.

Dimensional data modeling often involves numerous joins and other operations to extract insights from the data. One effective way to improve your data marts is therefore to incorporate OBT into your dimensional data model (see the sketch below). There is no need to be apprehensive about it: it can make your data more accessible to less technically inclined people.

#data #datamodel #dataengineering
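A minimal sketch of the idea in PySpark, assuming a hypothetical star schema (`fact_sales`, `dim_customer`, `dim_product`; all table and column names are illustrative, not from the original post). The dimensional model stays as-is, and an OBT is materialized on top of it so downstream users query one wide table instead of writing the joins themselves.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("obt-example").getOrCreate()

# Hypothetical dimensional model: one fact table and two dimensions.
fact_sales   = spark.table("dwh.fact_sales")    # sale_id, customer_id, product_id, amount, sold_at
dim_customer = spark.table("dwh.dim_customer")  # customer_id, customer_name, segment
dim_product  = spark.table("dwh.dim_product")   # product_id, product_name, category

# The One Big Table: pre-join the dimensions onto the fact once,
# so analysts can query a single wide table with no joins of their own.
obt_sales = (
    fact_sales
    .join(dim_customer, "customer_id", "left")
    .join(dim_product, "product_id", "left")
    .select(
        "sale_id", "sold_at", "amount",
        "customer_name", "segment",
        "product_name", "category",
    )
)

# Materialize it as a data-mart table for less technical consumers.
obt_sales.write.mode("overwrite").saveAsTable("mart.obt_sales")
```

Whether the OBT is rebuilt in full on a schedule or maintained incrementally is a separate choice; the point is that the dimensional model and the OBT can coexist, each serving a different audience.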
Static Duck Typing in Python: A Closer Look

We often mention Python's dynamically typed nature, where the type of a variable is determined at runtime. However, an interesting concept called "static duck typing" combines Python's dynamic nature with static type checking.

What is static duck typing? See PEP 544 – Protocols: Structural subtyping (static duck typing). Static duck typing means inferring types from an object's behavior or interface rather than from an explicit type declaration, with the check performed before runtime by a static type checker (a short example follows below).

Do you know about static duck typing in other languages?

#Python
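A minimal example of PEP 544 in practice (the class and function names are my own illustration): any object with a compatible `read()` method satisfies the `Readable` protocol without inheriting from it, and a static checker such as mypy verifies this before runtime.

```python
from typing import Protocol


class Readable(Protocol):
    """Structural type: anything with a read() -> str method."""

    def read(self) -> str: ...


class LocalFile:
    # Note: no inheritance from Readable — only a matching method signature.
    def read(self) -> str:
        return "contents of a local file"


def show(source: Readable) -> None:
    print(source.read())


# Accepted by a static checker because LocalFile structurally matches Readable.
show(LocalFile())
```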
@bigdatasumit Additionally, consider using ZSTD compression instead of Snappy, as it will likely be more efficient based on benchmarks.
One step to improve latency in a streaming solution!

Latency is an essential element of any streaming solution, and writing streaming results to storage adds a lot of it. Some storage solutions, such as AWS Redshift and ClickHouse, offer deep integration with Apache Kafka, so accessing the data from Kafka requires minimal effort.

- Amazon Redshift supports streaming ingestion from Amazon MSK (Managed Streaming for Apache Kafka) or on-premise Apache Kafka. This allows low-latency, high-speed ingestion of streaming data into an Amazon Redshift materialized view. Because it does not require staging data in Amazon S3, Redshift can ingest streaming data with lower latency and reduced storage costs. You configure streaming ingestion in a Redshift cluster with SQL statements that authenticate and connect to an Amazon MSK topic.

- In ClickHouse, you can create a Kafka consumer using the Kafka table engine and treat it as a data stream. First, create a table with the desired structure. Next, create a materialized view that transforms data from the engine table and inserts it into the previously created table. When the MATERIALIZED VIEW is attached to the engine, it collects data in the background, so you can continuously receive messages from Kafka and convert them into the required format using SELECT statements. One Kafka table can feed multiple materialized views; they do not read the table directly but receive new records in blocks, which lets you write to several tables at different levels of detail, with and without aggregation (a sketch of this setup follows below).

#data #dataengineering
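A minimal sketch of the ClickHouse path described above, assuming a hypothetical `events` JSON topic on a local Kafka broker (table names, columns, and connection details are all illustrative). It uses the `clickhouse-driver` Python client to run the DDL, but the same statements can be executed from any ClickHouse SQL client.

```python
from clickhouse_driver import Client

client = Client(host="localhost")  # assumes a local ClickHouse server

# 1. Kafka engine table: a consumer that exposes the topic as a stream.
client.execute("""
    CREATE TABLE IF NOT EXISTS events_queue (
        event_id String,
        user_id  String,
        ts       DateTime
    ) ENGINE = Kafka
    SETTINGS kafka_broker_list = 'localhost:9092',
             kafka_topic_list  = 'events',
             kafka_group_name  = 'clickhouse_events',
             kafka_format      = 'JSONEachRow'
""")

# 2. Target table with the desired structure and storage engine.
client.execute("""
    CREATE TABLE IF NOT EXISTS events (
        event_id String,
        user_id  String,
        ts       DateTime
    ) ENGINE = MergeTree
    ORDER BY ts
""")

# 3. Materialized view: pulls blocks from the Kafka engine table in the
#    background and inserts the transformed rows into the target table.
client.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS events_mv TO events AS
    SELECT event_id, user_id, ts
    FROM events_queue
""")
```

Additional materialized views reading from the same `events_queue` table could write aggregated rollups alongside the raw `events` table, which is the "several tables at different levels of detail" point from the post.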
@EcZachly This is a common scenario: stakeholders request streaming, but when you ask what latency they expect, they typically say 2-3 minutes. However, if you continue the conversation and inquire about how they plan to use this data, you often hear, "I am going to build daily analytics."
💡 The Journey of Developing My Technical Blog

In early November, I created my technical blog to share my experiences with the world. After two months of dedication, I'm excited about my progress, no matter how small it may seem. Thank you for being so supportive!

As a Senior Staff Data Engineer with a decade of experience in the Big Data arena, I thrive on contributing to open-source projects like Apache Spark and Apache Airflow. I also find great joy in inspiring others as a speaker at various conferences. 🚀
@KaiWaehner Any advice on choosing between Kafka Streams and Apache Flink? In my experience, Kafka Streams offers more value than Apache Flink from a scalability perspective.
I believe the following challenges should be included:

8. Unstructured Data Chaos
- Extracting meaningful information from unstructured data, such as free-form text, images, or audio, requires advanced processing techniques like natural language processing (NLP), optical character recognition (OCR), or machine learning.
- Logs and event data often lack a consistent format, making them difficult to parse.
- Rich-text fields can contain embedded HTML, JSON, or other nested formats, which adds complexity to data handling (see the small illustration after this list).

9. Metadata Management Confusion
- Missing or incomplete metadata makes it challenging to understand the context and lineage of the data.
- Inconsistent or outdated documentation can lead to misinterpretation of data fields.
- The absence of versioning in datasets creates uncertainty about which version of the data is in use.
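To make the unstructured-data point concrete, here is a tiny illustration (the log line and field names are invented): a single event mixes free text with an embedded JSON payload, so even "simple" log parsing already needs both pattern matching and format-specific decoding.

```python
import json
import re

# Invented example of a semi-structured log line: free text plus embedded JSON.
raw_line = '2025-01-15 12:03:44 INFO checkout succeeded payload={"order_id": "o-42", "amount": 19.99}'

# Pattern matching for the loosely structured prefix...
match = re.match(r"^(\S+ \S+) (\w+) (.*?) payload=(\{.*\})$", raw_line)
if match:
    timestamp, level, message, payload_json = match.groups()
    # ...and format-specific decoding for the nested part.
    payload = json.loads(payload_json)
    print(timestamp, level, message, payload["order_id"], payload["amount"])
```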
@bigdatasumit Nice post. I have found a good use for RAG on my data team: we implemented RAG to enable natural-language communication with our data catalog through an LLM.
How can AI, LLMs, and RAG improve the availability of a data catalog for your users?

#RAG (Retrieval-Augmented Generation) can enhance your data catalog by improving its availability and usability. The data catalog serves as a knowledge base for the Data Warehouse (DWH), focusing on metadata management and data discovery. Typically, DWH users find the data catalog uninteresting and are reluctant to engage with it.

Imagine applying the experience of interacting with large language models (LLMs) to the data catalog. Instead of sifting through the catalog, users could simply ask questions about what data is available in the DWH; the catalog becomes an accessible knowledge base behind your RAG system (a minimal sketch follows below). As an additional benefit, you could request the generation of SQL queries that are aware of the DWH's metadata, allowing for more efficient exploration of the DWH.

#Data #DWH #AI

What do you think about using RAG in such a way?
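A minimal, framework-agnostic sketch of the idea, with everything invented for illustration: the catalog entries, the naive keyword retriever, and the `call_llm` stub standing in for whatever LLM API you use. A real setup would use embeddings and a vector store for retrieval, but the flow — retrieve catalog metadata, then ground the LLM's answer in it — is the same.

```python
# Hypothetical in-memory data catalog: table name -> metadata description.
CATALOG = {
    "dwh.fact_sales": "Sales facts: sale_id, customer_id, amount, sold_at. Updated daily.",
    "dwh.dim_customer": "Customer dimension: customer_id, customer_name, segment, country.",
    "dwh.dim_product": "Product dimension: product_id, product_name, category, price.",
}


def retrieve(question: str, top_k: int = 2) -> list[str]:
    """Naive retrieval: rank catalog entries by keyword overlap with the question.

    A real system would run an embedding search over a vector store instead.
    """
    words = set(question.lower().split())
    scored = sorted(
        CATALOG.items(),
        key=lambda item: len(words & set(item[1].lower().split())),
        reverse=True,
    )
    return [f"{name}: {description}" for name, description in scored[:top_k]]


def call_llm(prompt: str) -> str:
    """Stub standing in for any LLM API call (OpenAI, Bedrock, a local model, ...)."""
    return f"[LLM answer grounded in:\n{prompt}]"


def ask_catalog(question: str) -> str:
    # RAG: augment the user's question with the retrieved catalog metadata.
    context = "\n".join(retrieve(question))
    prompt = (
        "Answer using only this data catalog metadata:\n"
        f"{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)


print(ask_catalog("Which table holds customer segments, and how do I join it to sales?"))
```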