![Artsiom Yudovin Profile](https://pbs.twimg.com/profile_images/1867290793066590208/gHEOe3EQ_x96.jpg)
Artsiom Yudovin
@ayudovin
Followers: 191 · Following: 60 · Statuses: 129
Principal Data Engineer | Open-Source contributor | Conference speaker | #Data | #DataEngineering
Warsaw, Poland
Joined May 2016
Temporal Joins in Apache Flink: the join that Apache Spark does not support! ⌛️

A Temporal Table in Flink evolves over time, capturing changes dynamically. It can be:
- A Changing History Table: tracks all changes (e.g., a database changelog).
- A Changing Dimension Table: stores only the latest snapshot (e.g., a database table).

Event-Time Temporal Join
Flink's event-time temporal joins enrich a table with evolving metadata by retrieving the value a key had at a specific point in time, ensuring accurate joins.

Example use case: a table of orders in different currencies needs to be normalized to USD using historical exchange rates (see the sketch below).

Benefits of Temporal Joins:
- Accurate Historical Data: ensures time-consistent joins.
- Efficient Processing: stores only the necessary snapshots.
- Streaming & Batch Compatibility: works in real-time and batch environments.
- SQL-Standard Compliant: uses SQL:2011 syntax.

Conclusion: Flink temporal joins enable robust, time-aware data merging, ensuring historical accuracy and efficient processing.

#dataengineering #streaming
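A minimal sketch of the currency-conversion use case, written with PyFlink. The table names, columns, watermarks, and connector settings are my own illustration, not from the original post; the event-time temporal join itself uses the SQL:2011 `FOR SYSTEM_TIME AS OF` syntax that Flink documents for versioned tables.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming TableEnvironment (assumes a local PyFlink installation).
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Hypothetical append-only orders stream with an event-time watermark.
t_env.execute_sql("""
    CREATE TABLE orders (
        order_id   STRING,
        price      DECIMAL(32, 2),
        currency   STRING,
        order_time TIMESTAMP(3),
        WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
    ) WITH ('connector' = 'datagen')
""")

# Hypothetical versioned table of exchange rates (a changing history table):
# the primary key plus the event-time watermark lets Flink track each
# rate's validity period. Broker address and topic are placeholders.
t_env.execute_sql("""
    CREATE TABLE currency_rates (
        currency    STRING,
        rate_to_usd DECIMAL(32, 10),
        update_time TIMESTAMP(3),
        WATERMARK FOR update_time AS update_time,
        PRIMARY KEY (currency) NOT ENFORCED
    ) WITH (
        'connector' = 'upsert-kafka',
        'topic' = 'currency_rates',
        'properties.bootstrap.servers' = 'localhost:9092',
        'key.format' = 'json',
        'value.format' = 'json'
    )
""")

# Event-time temporal join: each order is matched with the rate that was
# valid at the order's own event time, not with the latest rate.
result = t_env.sql_query("""
    SELECT o.order_id,
           o.price * r.rate_to_usd AS price_usd,
           o.order_time
    FROM orders AS o
    JOIN currency_rates FOR SYSTEM_TIME AS OF o.order_time AS r
    ON o.currency = r.currency
""")

result.execute().print()
```

The key detail is `FOR SYSTEM_TIME AS OF o.order_time`: it pins the lookup to the order's event time, which is what makes the join time-consistent for historical data.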
Temporal joins help streaming solutions because they track all changes and join each record with the right version of the data.
🛠️ Comparison of Data Versioning Tools

A quick breakdown of Git LFS, DVC, Delta Lake, Quilt, lakeFS, and Apache Nessie:

1️⃣ Primary Purpose
• Git LFS: Large file versioning in Git
• DVC: Data & model versioning for ML
• Delta Lake: Data lakes & pipelines
• Quilt: Dataset management/sharing
• lakeFS: Git-like version control for object storage
• Nessie: Version control for tabular data

2️⃣ Storage Backend
• Git LFS: Git repos (with linked storage)
• DVC: S3, GCS, etc.
• Delta Lake: Cloud/HDFS
• Quilt: Cloud (S3)
• lakeFS: Object storage (S3, GCS, etc.)
• Nessie: Cloud/on-premise databases

3️⃣ Data Lineage
• Git LFS: Minimal
• DVC: Excellent
• Delta Lake: Excellent
• Quilt: Good
• lakeFS: Excellent (branch-based)
• Nessie: Excellent (branch-based)

4️⃣ Key Use Cases
• Git LFS: Binary versioning w/ Git
• DVC: ML workflows
• Delta Lake: Table analytics
• Quilt: Collaborative datasets
• lakeFS: Data lakes w/ Git-like branching
• Nessie: Table versioning (e.g., Iceberg/Delta)

Which fits your workflow best, or which comparison criteria would you like to investigate more deeply? Let me know! 👇

#Data #dataengineering
I'm happy to share some exciting news! 🚀

My topic proposal, "The architecture of ClickStream solution", has been selected for the Data & AI Warsaw Tech Summit program, chosen from all the fantastic submissions during the Call for Presentations process!

I'm excited to talk about our clickstream solution. ClickStream is a data analytics platform that tracks and analyzes every click and interaction users have on our website and app. We built it on AWS using technologies like gRPC, Kubernetes, Apache Flink, AWS Redshift, and more, and these tools have been key to the results we achieved. Let's dive in together and explore how it all works, the ups and downs we discovered along the way, and how it can help a business thrive. I can't wait to share all of this with you!

I also have something special for you: the discount code FromSpeaker10 gives you 10% off registration.

Registration:

I hope to meet you there – April 10-11, 2025!

#DataAiWarsawTechSummit #DAWTS
Many data professionals believe using One Big Table (OBT) is a bad practice, but that is not true. A One Big Table can improve performance and simplify data management by consolidating everything into a single location.

Dimensional data modeling often involves numerous joins and other operations to extract insights from the data. One effective way to improve your data marts is therefore to incorporate OBT into your dimensional data model (see the sketch below). There is no need to be apprehensive about it: it can make your data more accessible to less technically inclined people.

#data #datamodel #dataengineering
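A minimal sketch of the idea in PySpark, assuming a hypothetical star schema (`fact_sales`, `dim_customer`, `dim_product`; all table and column names are illustrative, not from the original post). The dimensional model stays as-is, and an OBT is materialized on top of it so downstream users query one wide table instead of writing the joins themselves.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("obt-example").getOrCreate()

# Hypothetical dimensional model: one fact table and two dimensions.
fact_sales   = spark.table("dwh.fact_sales")    # sale_id, customer_id, product_id, amount, sold_at
dim_customer = spark.table("dwh.dim_customer")  # customer_id, customer_name, segment
dim_product  = spark.table("dwh.dim_product")   # product_id, product_name, category

# The One Big Table: pre-join the dimensions onto the fact once,
# so analysts can query a single wide table with no joins of their own.
obt_sales = (
    fact_sales
    .join(dim_customer, "customer_id", "left")
    .join(dim_product, "product_id", "left")
    .select(
        "sale_id", "sold_at", "amount",
        "customer_name", "segment",
        "product_name", "category",
    )
)

# Materialize it as a data-mart table for less technical consumers.
obt_sales.write.mode("overwrite").saveAsTable("mart.obt_sales")
```

Whether the OBT is rebuilt in full on a schedule or maintained incrementally is a separate choice; the point is that the dimensional model and the OBT can coexist, each serving a different audience.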
Static Duck Typing in Python: A Closer Look

We often mention Python's dynamically typed nature, where the type of a variable is determined at runtime. However, an interesting concept called "static duck typing" combines Python's dynamic nature with static type checking.

What is static duck typing? See PEP 544 – Protocols: Structural subtyping (static duck typing). Static duck typing means inferring types from an object's behavior or interface rather than from an explicit type declaration, with the check performed before runtime by a static type checker (a short example follows below).

Do you know about static duck typing in other languages?

#Python
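A minimal example of PEP 544 in practice (the class and function names are my own illustration): any object with a compatible `read()` method satisfies the `Readable` protocol without inheriting from it, and a static checker such as mypy verifies this before runtime.

```python
from typing import Protocol


class Readable(Protocol):
    """Structural type: anything with a read() -> str method."""

    def read(self) -> str: ...


class LocalFile:
    # Note: no inheritance from Readable — only a matching method signature.
    def read(self) -> str:
        return "contents of a local file"


def show(source: Readable) -> None:
    print(source.read())


# Accepted by a static checker because LocalFile structurally matches Readable.
show(LocalFile())
```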
@bigdatasumit Additionally, consider using ZSTD compression instead of Snappy, as it will likely be more efficient based on benchmarks.
One step to improve latency in a streaming solution!

Latency is an essential element of any streaming solution, and writing streaming results to storage adds a lot of it. Some storage solutions, such as AWS Redshift and ClickHouse, offer deep integration with Apache Kafka, so accessing the data from Kafka requires minimal effort.

- Amazon Redshift supports streaming ingestion from Amazon MSK (Managed Streaming for Apache Kafka) or on-premise Apache Kafka. This allows low-latency, high-speed ingestion of streaming data into an Amazon Redshift materialized view. Because it does not require staging data in Amazon S3, Redshift can ingest streaming data with lower latency and reduced storage costs. You configure streaming ingestion in a Redshift cluster with SQL statements that authenticate and connect to an Amazon MSK topic.

- In ClickHouse, you can create a Kafka consumer using the Kafka table engine and treat it as a data stream. First, create a table with the desired structure. Next, create a materialized view that transforms data from the engine table and inserts it into the previously created table. When the MATERIALIZED VIEW is attached to the engine, it collects data in the background, so you can continuously receive messages from Kafka and convert them into the required format using SELECT statements. One Kafka table can feed multiple materialized views; they do not read the table directly but receive new records in blocks, which lets you write to several tables at different levels of detail, with and without aggregation (a sketch of this setup follows below).

#data #dataengineering
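A minimal sketch of the ClickHouse path described above, assuming a hypothetical `events` JSON topic on a local Kafka broker (table names, columns, and connection details are all illustrative). It uses the `clickhouse-driver` Python client to run the DDL, but the same statements can be executed from any ClickHouse SQL client.

```python
from clickhouse_driver import Client

client = Client(host="localhost")  # assumes a local ClickHouse server

# 1. Kafka engine table: a consumer that exposes the topic as a stream.
client.execute("""
    CREATE TABLE IF NOT EXISTS events_queue (
        event_id String,
        user_id  String,
        ts       DateTime
    ) ENGINE = Kafka
    SETTINGS kafka_broker_list = 'localhost:9092',
             kafka_topic_list  = 'events',
             kafka_group_name  = 'clickhouse_events',
             kafka_format      = 'JSONEachRow'
""")

# 2. Target table with the desired structure and storage engine.
client.execute("""
    CREATE TABLE IF NOT EXISTS events (
        event_id String,
        user_id  String,
        ts       DateTime
    ) ENGINE = MergeTree
    ORDER BY ts
""")

# 3. Materialized view: pulls blocks from the Kafka engine table in the
#    background and inserts the transformed rows into the target table.
client.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS events_mv TO events AS
    SELECT event_id, user_id, ts
    FROM events_queue
""")
```

Additional materialized views reading from the same `events_queue` table could write aggregated rollups alongside the raw `events` table, which is the "several tables at different levels of detail" point from the post.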
@EcZachly This is a common scenario: stakeholders request streaming, but when you ask what latency they expect, they typically say 2-3 minutes. However, if you continue the conversation and inquire about how they plan to use this data, you often hear, "I am going to build daily analytics."
💡 The Journey of Developing My Technical Blog

In early November, I created my technical blog to share my experiences with the world. After two months of dedication, I'm excited about my progress, no matter how small it may seem. Thank you for being so supportive!

As a Senior Staff Data Engineer with a decade of experience in the Big Data arena, I thrive on contributing to open-source projects like Apache Spark and Apache Airflow. I also find great joy in inspiring others as a speaker at various conferences. 🚀
@KaiWaehner Any advice on choosing between Kafka Streams and Apache Flink? In my experience, Kafka Streams offers more value than Apache Flink from a scalability perspective.
I believe the following challenges should be included:

8. Unstructured Data Chaos
- Extracting meaningful information from unstructured data, such as free-form text, images, or audio, requires advanced processing techniques like natural language processing (NLP), optical character recognition (OCR), or machine learning.
- Logs and event data often lack a consistent format, making them difficult to parse.
- Rich-text fields can contain embedded HTML, JSON, or other nested formats, which adds complexity to data handling (see the small illustration after this list).

9. Metadata Management Confusion
- Missing or incomplete metadata makes it challenging to understand the context and lineage of the data.
- Inconsistent or outdated documentation can lead to misinterpretation of data fields.
- The absence of versioning in datasets creates uncertainty about which version of the data is in use.
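To make the unstructured-data point concrete, here is a tiny illustration (the log line and field names are invented): a single event mixes free text with an embedded JSON payload, so even "simple" log parsing already needs both pattern matching and format-specific decoding.

```python
import json
import re

# Invented example of a semi-structured log line: free text plus embedded JSON.
raw_line = '2025-01-15 12:03:44 INFO checkout succeeded payload={"order_id": "o-42", "amount": 19.99}'

# Pattern matching for the loosely structured prefix...
match = re.match(r"^(\S+ \S+) (\w+) (.*?) payload=(\{.*\})$", raw_line)
if match:
    timestamp, level, message, payload_json = match.groups()
    # ...and format-specific decoding for the nested part.
    payload = json.loads(payload_json)
    print(timestamp, level, message, payload["order_id"], payload["amount"])
```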
@bigdatasumit Nice post. I have found a good use for RAG on my data team: we implemented RAG to enable natural-language communication with our data catalog through an LLM.
How can AI, LLMs, and RAG improve the availability of a data catalog for your users?

#RAG (Retrieval-Augmented Generation) can enhance your data catalog by improving its availability and usability. The data catalog serves as a knowledge base for the Data Warehouse (DWH), focusing on metadata management and data discovery. Typically, DWH users find the data catalog uninteresting and are reluctant to engage with it.

Imagine applying the experience of interacting with large language models (LLMs) to the data catalog. Instead of sifting through the catalog, users could simply ask questions about what data is available in the DWH; the catalog becomes an accessible knowledge base behind your RAG system (a minimal sketch follows below). As an additional benefit, you could request the generation of SQL queries that are aware of the DWH's metadata, allowing for more efficient exploration of the DWH.

#Data #DWH #AI

What do you think about using RAG in such a way?
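A minimal, framework-agnostic sketch of the idea, with everything invented for illustration: the catalog entries, the naive keyword retriever, and the `call_llm` stub standing in for whatever LLM API you use. A real setup would use embeddings and a vector store for retrieval, but the flow — retrieve catalog metadata, then ground the LLM's answer in it — is the same.

```python
# Hypothetical in-memory data catalog: table name -> metadata description.
CATALOG = {
    "dwh.fact_sales": "Sales facts: sale_id, customer_id, amount, sold_at. Updated daily.",
    "dwh.dim_customer": "Customer dimension: customer_id, customer_name, segment, country.",
    "dwh.dim_product": "Product dimension: product_id, product_name, category, price.",
}


def retrieve(question: str, top_k: int = 2) -> list[str]:
    """Naive retrieval: rank catalog entries by keyword overlap with the question.

    A real system would run an embedding search over a vector store instead.
    """
    words = set(question.lower().split())
    scored = sorted(
        CATALOG.items(),
        key=lambda item: len(words & set(item[1].lower().split())),
        reverse=True,
    )
    return [f"{name}: {description}" for name, description in scored[:top_k]]


def call_llm(prompt: str) -> str:
    """Stub standing in for any LLM API call (OpenAI, Bedrock, a local model, ...)."""
    return f"[LLM answer grounded in:\n{prompt}]"


def ask_catalog(question: str) -> str:
    # RAG: augment the user's question with the retrieved catalog metadata.
    context = "\n".join(retrieve(question))
    prompt = (
        "Answer using only this data catalog metadata:\n"
        f"{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)


print(ask_catalog("Which table holds customer segments, and how do I join it to sales?"))
```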