Sumit Mittal

@bigdatasumit

Followers
10K
Following
1K
Statuses
2K

Big Data Trainer • Founder & CEO, TrendyTech • Tweets about #BigData & #DataEngineering. Helping you get a hike 💰 & find your dream job 🚀 Join 20,000+ students 👇

Joined March 2018
@bigdatasumit
Sumit Mittal
3 days
Excited to announce that I am conducting a live session on how to build a career in Data Engineering. The session will be held tomorrow, 8th Feb 2025, at 11:30 AM (IST).

Agenda for tomorrow's session:
1) A proven roadmap to become a highly paid Big Data Engineer.
2) Answers to the top 10 questions most asked by aspiring Big Data Engineers.
3) About my 32-week premium program, 'The Ultimate Big Data Masters Program (Cloud Focused)'.

If you want to know how you can master Data Engineering & Cloud, this webinar is for you.

Register for the session through the link below - Also join the WhatsApp community for updates.

Looking forward to meeting you in this live session.

Note - If you are my student, you can skip this, as you already know all of it.

#bigdata #dataengineering #azure #aws
0
1
20
@bigdatasumit
Sumit Mittal
4 days
My student Diwakar Reddy joined Tiger Analytics with a more than 300% hike.

He was rejected in 4 interviews, but he didn't give up. He then enrolled in my Ultimate Big Data Masters Program for structured & in-depth learning. He worked hard, followed everything I taught, and the result is here: this time he cracked 4 interviews, and he recently joined Tiger Analytics with a dream hike.

From back-to-back rejections to multiple offers with 3X the salary - that's what we can achieve if we have the right focus and work hard!

Always remember: your present circumstances don't determine where you can go, they merely determine where you start!

#dataengineering #bigdata
Tweet media one
0
0
27
@bigdatasumit
Sumit Mittal
4 days
7 Powerful Tips to 10X your Naukri Profile visibility

If you are finding it difficult to get calls through the Naukri platform, follow the points below:

1. Check the keywords you are being searched for under Profile Performance.
- If the keywords don't match, change the keywords in your profile.
- Keywords are essential in the Resume Headline, Profile Summary & Key Skills.
2. If you are an immediate joiner, mention 'Immediate Joiner' in the resume headline.
3. Avoid mentioning a notice period of more than 2 months.
4. Ensure your resume is ATS compliant. Check my post on ATS resumes.
5. Set a proper salary range. Check salary ranges in the Naukri portal under:
> Naukri -> Companies -> Compare Companies (Research by AmbitionBox)
> Filter by experience, profile, company, etc.
6. Keep applying for jobs on a routine basis using keywords for better visibility (algorithm).
7. Keep your profile active by making some changes at least twice a week (algorithm).

Also, in my next post, I will talk about a 'Checklist to avoid the most common mistakes'.

Let me know in the comments if you found this helpful!

#dataengineering #resume
Tweet media one
1
5
37
@bigdatasumit
Sumit Mittal
5 days
Data Engineers Interview Preparation - A Complete Package (Free)

Complete SQL Free Course -

Complete Python Free Course (6 videos released so far)
Lecture 1 - Lecture 2 - Lecture 3 - Lecture 4 - Lecture 5 - Lecture 6 -
More videos in the Python playlist are yet to come.

Data Engineering Mock Interviews (PySpark, SQL, DSA, Data Modeling, System Design, Azure Cloud, AWS Cloud)
Interview 1 - Interview 2 - Interview 3 - Interview 4 - Interview 5 - Interview 6 - Interview 7 - Interview 8 - Interview 9 - Interview 10 - Interview 11 - Interview 12 - Interview 13 - Interview 14 - Interview 15 - Interview 16 - Interview 17 - Interview 18 - Interview 19 - Interview 20 - Interview 21 - Interview 22 - Interview 23 - Interview 24 - Interview 25 - Interview 26 - Interview 27 - Interview 28 - Interview 29 -

Do subscribe to my channel so that you do not miss out on anything. Link to the channel -

To check out my Big Data program, you can visit

Happy Learning!

#bigdata #dataengineering #interview #apachespark #sql #python
0
4
14
@bigdatasumit
Sumit Mittal
5 days
What is meant by Data Modeling?

Data modeling is a way to structure your data so that it fits your needs in the best possible way. Those needs differ based on what system we are modeling and who the end user is: model the tables to reduce storage space, model them so that queries run faster, model them so that users can query them easily, and so on.

In a transactional system (OLTP) we generally use Normalization - a technique to divide one big table into multiple smaller tables with the intent of reducing redundancy.

However, OLTP systems are not meant for reporting - reporting workloads can overload them. A Data Warehouse (DWH) is the best fit for reporting purposes (OLAP).

What is a DWH? It's like a database, but the objective is to make your analytical queries faster.

Dimensional Modeling is one of the well-known techniques for modeling a DWH. Here are 2 definitions that you should know:
- "Dimensional Modeling is a design technique for databases intended to support end-user queries in a DWH" - Ralph Kimball
- The process of modeling a business process into a series of fact and dimension tables designed for analysis.

Key highlights of a transactional DB design:
- designed for fast maintenance of data - inserts and updates are quick
- very small sets of data are retrieved in a query
- data consistency is critical
- the focus is on the customers who are entering the data

Reporting DB design:
- a copy of the transactional data (not structured in exactly the same way)
- the resulting model reflects the kind of questions the business wants to ask rather than the functions of the underlying operational system
- descriptive data like customer name and customer address is separated from quantitative data such as order quantity and order amount
- larger datasets
- insert and update speed is not relevant
- the performance focus is on retrieving data quickly

Fun facts about fact & dimension tables:
=> Dimension tables contain more than 90% of the total columns, yet the fact table holds more data because new rows keep arriving.
=> In a fact table, the foreign keys might take more space than the actual fact information.

When storage was costly, dimensional modeling ruled - and it still rules today. Nowadays OBT (One Big Table) is also becoming popular, because storage space is no longer a big concern. We can pay a little extra for storage if that helps us build faster queries, and it's easier for end users to query.

Will OBT take the lead over dimensional modeling in the future?

PS~ I teach Data Engineering, DM to know more!

#bigdata #datawarehouse #datamodeling #dataengineering #database
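To make the star-schema idea above concrete, here is a minimal PySpark sketch of one fact table joined to one dimension table for an analytical query. The table and column names are invented for illustration, not taken from any real project.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("star_schema_demo").getOrCreate()

# Dimension table: descriptive attributes (few rows, many columns in practice)
dim_customer = spark.createDataFrame(
    [(1, "Asha", "Bangalore"), (2, "Ravi", "Pune")],
    ["customer_id", "customer_name", "customer_city"],
)

# Fact table: numeric measures plus foreign keys (grows continuously)
fact_orders = spark.createDataFrame(
    [(101, 1, 2, 450.0), (102, 2, 1, 120.0), (103, 1, 5, 990.0)],
    ["order_id", "customer_id", "order_qty", "order_amount"],
)

# Typical analytical query: join the fact to the dimension and aggregate
report = (
    fact_orders.join(dim_customer, "customer_id")
    .groupBy("customer_city")
    .agg(F.sum("order_amount").alias("total_amount"))
)
report.show()
```

Notice how the descriptive columns live in the dimension table while the fact table carries only keys and measures - exactly the separation described above.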
0
3
30
@bigdatasumit
Sumit Mittal
7 days
Many of my students have joined Walmart as Data Engineers in the past few months.

This time, it's another super woman - Shilpi Srivastava.

Her journey started with 3 amazing years at Cognizant, working on Python and Hadoop projects. But she knew she had to level up to grow in Big Data. Dedicating time after office hours, she mastered many things, from PySpark to Databricks, Azure Data Factory to Data Modeling, and a lot more. More importantly, she gained the confidence to take on bigger challenges.

When she writes "Enrolling in Sumit Sir's Big Data program was a game-changer", this one statement says enough about my in-depth program and its impact on my students' careers.

Shilpi's hard work & commitment led her to join Walmart as Data Engineer 3. Once again, congratulations on your success, Shilpi. Wishing you an incredible journey ahead.

When my first student joined Walmart, I never imagined the list would grow this long.

I would just say one thing to everyone: keep working hard. The harder you work, the luckier you get.

#dataengineering #walmart #success
Tweet media one
1
1
39
@bigdatasumit
Sumit Mittal
9 days
15 Medium to Hard Databricks Interview Questions

1. How does Databricks handle cluster management and resource allocation for optimized performance?
2. What are the best practices for optimizing Apache Spark jobs in Databricks?
3. How does Databricks autoscaling work, and what are the key considerations when enabling it?
4. What are the main performance bottlenecks in Databricks Spark jobs, and how do you troubleshoot them?
5. What is Delta Lake in Databricks, and how does it improve data reliability compared to Parquet or ORC?
6. Explain how ACID transactions work in Delta Lake and their benefits in a data engineering workflow.
7. How does Databricks handle schema evolution in Delta Lake, and what challenges can arise?
8. What is data skipping in Delta Lake, and how does it optimize query performance?
9. How can you implement data partitioning in Databricks, and when should you repartition data?
10. What are the advantages and disadvantages of using caching in Databricks, and when should you use CACHE TABLE vs. persist()?
11. How does Databricks handle data serialization, and which formats (Parquet, Avro, ORC) are best for different use cases?
12. How can Z-Ordering improve query performance in Delta Lake tables?
13. How can you implement Role-Based Access Control (RBAC) in Databricks for better security?
14. How does Unity Catalog in Databricks enhance data governance and security?
15. What are the best practices for managing secrets and credentials securely in Databricks?

Do mention your answers in the comments!

#databricks
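For questions 8, 9 and 12 above, here is a hedged sketch of partitioning plus Z-ordering on a Delta table. It assumes a Databricks (or Delta Lake-enabled) Spark session; the `events` table and its columns are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Partition on a low-cardinality column that queries usually filter on
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        event_id BIGINT,
        event_type STRING,
        event_date DATE
    )
    USING DELTA
    PARTITIONED BY (event_date)
""")

# Z-order within partitions on a higher-cardinality filter column, so that
# Delta's file-level statistics can skip irrelevant files at query time
spark.sql("OPTIMIZE events ZORDER BY (event_type)")
```

Partitioning prunes whole directories, while Z-ordering plus data skipping prunes individual files inside each partition.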
0
4
32
@bigdatasumit
Sumit Mittal
10 days
How to become a good Data Engineer

Prerequisites
---------------------
1. Programming fundamentals
2. SQL basics

You should learn the below things
--------------------------------------
1. Distributed computing fundamentals
2. Data Lake concepts
3. One data ingestion tool
4. DWH concepts
5. One NoSQL database (good to know, not mandatory)
6. In-memory computation using Apache Spark (PySpark) - see the small sketch after this post
7. Structured Streaming with Kafka for real-time data
8. One of the clouds - AWS/Azure/GCP (knowing multi-cloud is an add-on)
9. Integration of the various components
10. One scheduling/monitoring tool
11. CI/CD for production readiness
12. Do a couple of projects to get a good feel for it.

If you are looking to learn the AWS cloud, then learn the below technologies:
EMR
Redshift
Athena
Glue
S3
Lambda
EC2

If you are looking to learn Azure, then try learning the below:
ADLS Gen2
Azure Databricks
Azure Data Factory
Synapse

If you have 8+ years of experience, then focus on the performance tuning and design aspects.

If you are targeting top product-based companies, then Data Structures & Algorithms are also very important. Arrays, linked lists & trees should be good enough.

Just to gain confidence, check a few mock interviews on my YouTube channel, and that should be the final level of confidence you need.

Remember, don't just learn to prepare for interviews. Your objective should be to work effectively on projects. So it's important to focus more on the internals - and this will also be the best way to be interview ready!

If I am missing anything, please feel free to add it in the comments.

PS~ I follow a similar roadmap in my Ultimate Big Data Program. New batch starting tomorrow. DM to know more.

#bigdata #dataengineer #roadmap
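As a tiny taste of item 6 in the roadmap above (in-memory computation with PySpark), here is a minimal sketch that reads a file and aggregates it. The file path and column names are made up purely for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("roadmap_demo").getOrCreate()

# Hypothetical input: a CSV of sales with columns sale_date and amount
sales = spark.read.option("header", True).csv("/tmp/sales.csv")

daily_revenue = (
    sales.withColumn("amount", F.col("amount").cast("double"))
    .groupBy("sale_date")
    .agg(F.sum("amount").alias("revenue"))
    .orderBy("sale_date")
)
daily_revenue.show()
```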
0
41
246
@bigdatasumit
Sumit Mittal
10 days
From a maternity break to a 120% hike & 4 offer letters.

Yes, this is possible, and my student Shifa Saleem achieved this success. She recently joined PwC India as a Senior Associate Data Engineer.

Let's go through her challenges & success in her own words:

"Starting my career as a Data Engineer was challenging. I came across various instances in projects at LTIMindtree that required an in-depth understanding of things. I enrolled in Sumit Mittal's Big Data Masters Program. After gaining a lot of knowledge of Hadoop and Spark along with optimisation techniques, I started giving interviews. I cracked multiple technical rounds in one go & received 4 offers. Excited to share that I recently joined PwC India as a Senior Associate Data Engineer with a 120% hike."

Shifa, this incredible journey of yours is going to inspire many women around the globe. I am proud to have such passionate students like you, who show us that with the right dedication and hard work, anything is possible!
Tweet media one
1
1
25
@bigdatasumit
Sumit Mittal
11 days
23 Trending PySpark Interview Questions (Difficulty level - Medium to Hard)

1. How can you optimize PySpark jobs for better performance? Discuss techniques like partitioning, caching, and broadcasting.
2. What are accumulators and broadcast variables in PySpark? How are they used?
3. Describe how PySpark handles data serialization and the impact on performance.
4. How does PySpark manage memory, and what are some common issues related to memory management?
5. Explain the concept of checkpointing in PySpark and its importance in iterative algorithms.
6. How can you handle skewed data in PySpark to optimize performance?
7. Discuss the role of the DAG (Directed Acyclic Graph) in PySpark's execution model.
8. What are some common pitfalls when joining large datasets in PySpark, and how can they be mitigated?
9. Describe the process of writing and running unit tests for PySpark applications.
10. How does PySpark handle real-time data processing, and what are the key components involved?
11. Discuss the importance of schema enforcement in PySpark and how it can be implemented.
12. What is the Tungsten execution engine in PySpark, and how does it improve performance?
13. Explain the concept of window functions in PySpark and provide use cases where they are beneficial.
14. How can you implement custom partitioning in PySpark, and when would it be necessary?
15. Discuss the methods available in PySpark for handling missing or null values in datasets.
16. What are some strategies for debugging and troubleshooting PySpark applications?
17. What are some best practices for writing efficient PySpark code?
18. How can you monitor and tune the performance of PySpark applications in a production environment?
19. How can you implement custom UDFs (User-Defined Functions) in PySpark, and what are the performance considerations?
20. What are the key strategies for optimizing memory usage in PySpark applications, and how do you implement them?
21. How does PySpark's Tungsten execution engine improve memory and CPU efficiency?
22. What are the different persistence storage levels in PySpark, and how do they impact memory management?
23. How can you identify and resolve memory bottlenecks in a PySpark application?

These questions were asked in recent interviews at top companies. You can try answering them in the comments!

I am sure this post will help all the candidates planning to attend Data Engineering interviews.

Which topic should I pick next?

#bigdata #pyspark
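As a small illustration of questions 1 and 13 above, here is a sketch of a broadcast join plus a window function in PySpark. The data and column names are invented for illustration only.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import broadcast
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("pyspark_interview_demo").getOrCreate()

orders = spark.createDataFrame(
    [(1, "N", 100.0), (2, "S", 250.0), (3, "N", 80.0), (4, "S", 300.0)],
    ["order_id", "region_code", "amount"],
)
regions = spark.createDataFrame(
    [("N", "North"), ("S", "South")], ["region_code", "region_name"]
)

# Broadcast the small lookup table to avoid a shuffle-heavy join
enriched = orders.join(broadcast(regions), "region_code")

# Window function: rank orders by amount within each region
w = Window.partitionBy("region_name").orderBy(F.desc("amount"))
ranked = enriched.withColumn("rank_in_region", F.row_number().over(w))
ranked.show()
```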
1
11
41
@bigdatasumit
Sumit Mittal
11 days
For all those who don't know much about AI and might be wondering what the buzz around DeepSeek is, and how it affected Nvidia - let me explain.

You might be aware of ChatGPT. At a high level, building such a chatbot requires 2 steps.

Step 1: The Pretraining Stage

Let's see what happens in the pretraining stage. The model is built by taking a large chunk of the internet - Wikipedia, blogs, online books and a lot more. This is a really huge amount of data, hundreds of TBs.

You might have heard that GPT-3 is a 175-billion-parameter model. This means there are 175 billion floating point numbers, better referred to as weights or parameters.

Now the above data (hundreds of TBs) is used to train the model. When I say "train the model", it means tweaking those 175 billion parameters. The training keeps adjusting their values by seeing loads of examples from the input data. This is a very compute-intensive task and costs a lot. OpenAI, Facebook, Google - basically all the companies - were spending hundreds of millions of dollars on this pretraining stage. They use very high-end Nvidia H100 chips to do this processing. Remember, all of this is to get the 175 billion weights (decimal numbers).

This entire phase is called the pretraining stage, and the output we get is called the foundation model (for example GPT-3, GPT-4). This is something not everyone can afford to do, as it's a big financial burden.

Step 2: Fine-tuning the above foundation model

This is more like aligning the model to fit a specific task. For example, if we want to build a chatbot, we feed it around 100k high-quality questions and answers, and the system learns that it should respond to a question with an answer. This phase is not that compute-intensive - think of it as a little more training done on the foundation model. Many companies or even individuals can afford to do that.

From both these steps, you can see that getting a foundation model was out of reach for many companies because of the heavy infrastructure requirements and associated cost.

This is where DeepSeek, a Chinese startup, has done wonders. Whatever OpenAI did by spending hundreds of millions (nearing a billion), the DeepSeek team did by spending just 6 million dollars. Also, they used H800s, which is not the state-of-the-art GPU. They broke the notion that model pretraining requires a billion dollars. They showed how, by using resources efficiently and optimizing the process, things can be done at a much lower cost with minimal resources.

Not just this - they released the weights of the model and made it open source. This means anyone can download the foundation model and do the fine-tuning.

Now how does this impact Nvidia? If things can be done without using state-of-the-art H100 GPUs, and with fewer resources, then it will definitely impact sales.

I am sure that even if you are new to the field of AI, you will have learnt something from this post.
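To make the "keeps adjusting their values" idea above concrete, here is a toy Python sketch of gradient descent on a single weight. Real pretraining applies the same mechanic to billions of parameters over terabytes of text; the numbers here are invented purely for illustration.

```python
# Toy examples: inputs x and targets y = 2x, so the ideal weight is 2.0
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = 0.0    # a single "weight" the model will learn
lr = 0.05  # learning rate

for step in range(200):
    # Mean-squared-error gradient for the model y_hat = w * x
    grad = sum(2 * (w * x - y) * x for x, y in examples) / len(examples)
    w -= lr * grad  # nudge the weight a little, like pretraining does at scale

print(round(w, 3))  # converges close to 2.0
```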
1
6
42
@bigdatasumit
Sumit Mittal
13 days
Preparing for TOP product-based companies as a Data Engineer - here is a complete plan (all free resources)

Complete SQL Free Course -

Complete Python Free Course (6 videos released so far)
Lecture 1 - Lecture 2 - Lecture 3 - Lecture 4 - Lecture 5 - Lecture 6 -
More videos in the Python playlist are yet to come.

Data Engineering Mock Interviews (PySpark, SQL, DSA, Data Modeling, System Design, Azure Cloud, AWS Cloud)
Interview 1 - Interview 2 - Interview 3 - Interview 4 - Interview 5 - Interview 6 - Interview 7 - Interview 8 - Interview 9 - Interview 10 - Interview 11 - Interview 12 - Interview 13 - Interview 14 - Interview 15 - Interview 16 - Interview 17 - Interview 18 - Interview 19 - Interview 20 - Interview 21 - Interview 22 - Interview 23 - Interview 24 - Interview 25 - Interview 26 - Interview 27 - Interview 28 - Interview 29 -

Do subscribe to my channel so that you do not miss out on anything. Link to the channel -

To check out my Big Data program, you can visit

Happy Learning!

#bigdata #dataengineering #interview #apachespark #sql #python
2
30
121
@bigdatasumit
Sumit Mittal
13 days
All the things that you should know about DeepSeek

- It's a Chinese AI research lab founded in 2023.
- They created DeepSeek-R1, which competes with OpenAI's GPT series.
- It's said that they spent just $6 million to train the foundation model; the other companies - Google, Facebook, OpenAI - have used roughly 50 to 100X more funds to do the same.
- Not just the training cost - the inference cost is also just a fraction of what OpenAI charges.
- This significant cost reduction is due to efficient usage of resources (Nvidia GPUs).
- Earlier export restrictions to China banned the sale of Nvidia's H100 processors, preferred by U.S. AI firms. Chinese companies instead accessed lower-performance versions like the Nvidia H800 or A800.
- Innovative techniques like Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE) allow DeepSeek to achieve impressive results without relying on the most advanced and expensive hardware.
- DeepSeek is open source, unlike OpenAI, which is closed even with "open" in its name.
- Why Nvidia is impacted: DeepSeek-R1 achieved top-tier performance using far fewer computational resources. This efficiency reduces the reliance on Nvidia's high-performance GPUs, like the H100.
- DeepSeek-R1 employs a Mixture-of-Experts (MoE) architecture, a variant of the transformer model. In this design, only a subset of the model's parameters is activated for processing each token, enhancing computational efficiency.

What's the future of AI? It looks like the best foundation models will be available to us like a commodity, and the focus will shift to applications and business use cases. This era will be even better!

PS~ I offer an AI for Data Engineers program, DM to know more!

#deepseek #openai #ChatGPT
1
3
15
@bigdatasumit
Sumit Mittal
13 days
DeepSeek has about 50,000 NVIDIA H100s; however, they can't talk about it because of the US export controls that are in place. This is what Scale AI CEO Alexandr Wang said. Elon Musk thinks this is obvious.

This is getting super interesting. If the claims by DeepSeek are true, this could change the direction in which AI is heading, and we may see more open-source LLMs performing on par with or even better than the closed models.

So much is happening in this AI space - super excited to see what's next!

#deepseek #chatgpt
1
2
18
@bigdatasumit
Sumit Mittal
13 days
Let's talk about Delta Live Tables (DLT)

DLT is a framework for building ETL pipelines with a declarative approach. Basically, we tell it what we want, rather than how to do it. DLT simplifies traditional ETL.

Challenges with traditional ETL:
- infra management, scaling the infra up and down
- managing dependencies
- tracking lineage
- handling batch and streaming in the pipeline
- ensuring data quality
- handling failures and retries
- monitoring and optimizing the pipeline

So basically, time spent on tooling tends to dominate your efforts. DLT solves the above issues and makes it easy to build ETL pipelines.

Now why declarative programming? Understanding your objective allows DLT to take care of all the boring stuff and do it in the most optimized way.

DLT has 2 core abstractions:
- Streaming table (it is a Delta table) - it has a stream writing to it. It is used for ingesting huge amounts of data where we can't afford recomputation.
- Materialized view - the result of a query stored in a Delta table. It is the answer to the query.

Also, we can define data quality using Expectations:
1. do not drop records, just raise a warning
2. drop bad records
3. abort processing for a single bad record

We can use either the Hive metastore or Unity Catalog for handling table metadata. Unity Catalog is always the better choice, as the metadata can be shared across different workspaces.

A lot of performance tuning is built in. For example, when we create a materialized view, instead of doing a complete reprocessing of the data, which could be really costly, DLT tries to see whether it can get the right results without complete reprocessing. These optimizations are built as part of the Enzyme project, and it checks what can be done in the following order:
- monotonic append
- partition recompute
- merge updates
- full recompute

If you have ever implemented a merge or SCD2, then you know the pain - so many lines of code and so many chances to go wrong. With the declarative approach, if we want to implement SCD2 we just write:

APPLY CHANGES INTO live.customers_silver
FROM STREAM(live.customers_silver_cleaned)
KEYS (customer_id)
SEQUENCE BY load_time
STORED AS SCD TYPE 2;

In the above, we mention which key to merge on (it can be a combination of keys too), we define the ordering based on load_time, and we say what we want - SCD type 2 or anything else.

DLT is evolving every day, and I am super excited to keep track of what's coming next.

I hope you found this post helpful. If you have any doubts related to this or want to add any points, do mention them in the comments.

For those who do not know me, I offer an end-to-end Data Engineering Program. DM to know more!

#bigdata #dataengineering #databricks
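Here is a minimal, hedged sketch of the DLT ideas above using the Python API: a streaming bronze table, an expectation that drops bad records, and a cleaned table. It only runs inside a Databricks Delta Live Tables pipeline (where `spark` is provided by the runtime); the landing path and column names are hypothetical.

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw customer events ingested as a streaming table")
def customers_bronze():
    # Auto Loader ingestion from a hypothetical landing path
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/raw/customers")
    )

@dlt.table(comment="Cleaned customers with a load timestamp")
@dlt.expect_or_drop("valid_customer_id", "customer_id IS NOT NULL")  # drop bad records
def customers_silver_cleaned():
    return dlt.read_stream("customers_bronze").withColumn(
        "load_time", F.current_timestamp()
    )
```

The APPLY CHANGES statement shown in the post would then build the SCD2 table on top of this cleaned stream.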
0
4
40
@bigdatasumit
Sumit Mittal
17 days
After teaching thousands of students all across the world, I have gained the insight that each student is different.

Some of them work for top US-based MNCs, including FAANG. Some of them are satisfied working at startups.

Some of my students get hikes of 200-400%. Some of them are satisfied with 70-80%.

The average package is 23 LPA, but some of them push their boundaries to go up to 100 LPA.

Same mentor, same content, same guidance, same platform. Then why this kind of difference in outcomes?

Because all students have different capabilities and different needs from their jobs. Some of them prefer a better work culture, some prefer higher CTCs, and some just come to get a job. I can't put them in one slot and compare them on the same parameters.

Your learning potential is different, so never compare your success with anyone else's. The key to happiness is defining your own success metrics and achieving them.

P.S~ I teach big data and I am happy that my students are leading Big Data all around the globe. If you want to know more about my course, then DM me.
0
0
10
@bigdatasumit
Sumit Mittal
19 days
All Data Engineers should definitely read these 10 posts..

1. From 0 to Hero in SQL - Follow this Plan
2. Crunching Big Data in absolute layman terms 🔥
3. Normalization vs Denormalization
4. Super Interesting Conversation of 2 friends - OLTP & OLAP
5. Let's understand Partitioning using a deck of cards
6. End to End Big Data Pipeline - From Ingestion to Reporting
7. Delta Engine Optimizations in Databricks
8. Delta Engine Optimisation - Data skipping using stats
9. What is Databricks
10. Lakehouse Architecture

PS~ I teach big data. Visit my website to know more about my big data program.

#dataengineer #bigdata #apachespark #data #students
0
57
259
@bigdatasumit
Sumit Mittal
20 days
Normalization vs Denormalization

Normalization is the process of dividing data into multiple smaller tables with the intent of reducing data redundancy & inconsistency.

Denormalization is the exact opposite idea: it is the technique of combining the data into one big table. This definitely leads to redundancy in the data.

Note: redundancy causes inconsistency. Consider that the same data is repeated in 2 places; when updating, you update it in one place and forget the second. This leads to an inconsistent state.

When retrieving data from normalized tables, we need to read many tables and perform joins, which is a costly operation. However, reading data from denormalized tables is quite fast, as no joins are required.

When to use normalized tables vs denormalized ones?

For OLTP systems (Online Transaction Processing), where we deal with lots of inserts, deletes and updates, you should go for normalized tables. However, for OLAP (Online Analytical Processing) systems, where you need to analyse historical data, denormalized tables are the best fit. Since you won't be updating the data here, the redundancy won't leave us in an inconsistent state.

Let's take a simple example: when you make a purchase on Amazon, it goes through an OLTP system (an RDBMS kind of database). Here the tables should be normalized. When Amazon does data analysis on historical data, they will create denormalized tables, just to make sure analysis is faster and costly joins are avoided.

I hope you liked the explanation - feel free to add more in the comments!

PS~ My new Big Data batch is starting this coming Saturday. DM to know more!

#dataanalysis #dataengineering #normalization
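A small PySpark sketch of the trade-off described above, with invented data: normalized tables that need a join at read time versus one denormalized wide table that can be aggregated directly.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("norm_vs_denorm").getOrCreate()

# Normalized (OLTP style): customers and orders kept separate, no repetition
customers = spark.createDataFrame(
    [(1, "Asha", "Bangalore")], ["customer_id", "name", "city"]
)
orders = spark.createDataFrame(
    [(101, 1, 450.0), (102, 1, 120.0)], ["order_id", "customer_id", "amount"]
)

# Reading requires a join - the cost normalization pays at query time
joined = orders.join(customers, "customer_id")

# Denormalized (OLAP style): one wide table, customer details repeated per order
denormalized = joined.select("order_id", "customer_id", "name", "city", "amount")

# Analytical queries now run directly on the wide table, no join needed
denormalized.groupBy("city").agg(F.sum("amount").alias("total")).show()
```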
0
7
29
@bigdatasumit
Sumit Mittal
20 days
Data Engineers hiring process at Databricks

So, let's start with the basic questions you should know before applying.

Q: How many interview rounds are conducted, and which are they?
A: There are generally 5 interview rounds: 1 managerial and the remaining 4 technical.

Q: What topics are asked in each interview round?
A: Round 1 is generally a managerial round, though a few of my students mentioned that in some cases Round 1 is a screening round followed by a managerial round. The topics for Rounds 2 to 5 are almost the same, but the complexity and difficulty level increase with each round. The main topics covered include SQL, Spark internals, PySpark, database architecture, streaming internals and the entire DBMS. Round 2 will have more concept-oriented questions. Do note 1 important thing for the HR round, which is your resume: you need to be very clear on each and every word you mention in it.

Q: What are the hierarchy levels, experience required, band and CTC?
A: The nomenclature of designations varies from one vertical to another. I am mentioning the one I got to know from most of my students.
L3 > Associate Technical Solution Architect > (2 - 3 years) > (15-25 LPA Basic + Stocks)
L4 > Technical Solution Architect > (3 - 6 years) > (25-35 LPA Basic + Stocks)
L5 > Senior Technical Solution Architect > (6 - 10 years) > (35-50 LPA Basic + Stocks)
L6 > Resident Solution Architect > (10+ years) > (50-60 LPA Basic + Stocks)
L7 > Senior Resident Solution Architect > (15+ years) > (60-75 LPA Basic + Stocks)
There is a significant stock component in the CTC, ranging from around 60 lakhs to 1 crore projected over 4 years.

Q: How would you rate the difficulty level out of 5, with 5 being most difficult?
A: I got feedback rating the difficulty level at 4 out of 5.

Q: What was one common thing realised by most students while going through the different technical rounds?
A: The interviewer judges you on your fundamentals and clarity of thought. If your fundamentals are strong, it will be easy for you to answer even the complex questions of Rounds 4 & 5.

Q: Sir, do we need any separate preparation apart from your curriculum to be ready for this interview?
A: My 'Ultimate Big Data Masters Program' is designed in such a way that it covers all the essential technology stack required by the industry today.

Do let me know if you have experienced anything else that I have missed, so that others can also benefit from it.

#databricks #dataengineering
Tweet media one
0
12
42
@bigdatasumit
Sumit Mittal
21 days
Gen AI - Let's learn how a Transformer works

Let's say we give a sequence of words to a Transformer as the input:

"Sachin Tendulkar Plays the sport of" -> Transformer -> "Cricket"

The output from the Transformer is a probability distribution over the most likely next words, for example:
Cricket 91%
the 3%
a 2%
...

How does this work internally?

Step 1 - The input words are tokenized, for example:
Sach in Tend ul kar plays the sport of
So the above sequence of words gets divided into 9 tokens (let's assume).

Step 2 - For each token, take its static embedding from the embedding weight matrix. As per GPT-3, this is a vector of size 12,288, meaning each token is represented by a list of 12,288 floating point numbers. Each vector encodes the general meaning of that word (not the context). By the way, how are these embeddings calculated? They are learnt as part of the pretraining phase, based on many examples.

Step 3 - Refine the embedding so that each token has its position encoded; for example, the token "plays" should have position 6 encoded.

Step 4 - Go through the attention mechanism. Here the token vectors talk to each other and try to pick up context from the other tokens. The output is a new set of embeddings that hold the context.

Step 5 - Feed-forward layer (MLP). Here the vectors do not talk to each other; they draw on the knowledge/facts built up during the pretraining stage. For example: "Rohit Sharma, often called the 'Hitman,' is known for his effortless timing and elegance. During his innings, every shot he played was a treat to watch, filled with..." The next word might be "grace". The introduction emphasizes "elegance" and "effortless timing", signalling that the conclusion will align with these qualities.

Steps 4 & 5 are repeated many times. At the very end, consider the embedding of the last token. In our example, "Sachin Tendulkar Plays the sport of", the last token is "of", so this token now encodes the entire meaning of the sentence plus a lot of knowledge from the outside world.

Step 6 - Multiply the embedding of the last token "of" by the unembedding matrix. The unembedding matrix is also learnt as part of the pretraining stage. This maps the embedding to a score for every token in the vocabulary, technically referred to as logits.

Step 7 - A normalization step converts the above logits into a probability distribution; we use the softmax function. For example, if the logits are:
batsman - 105
king - 12
founder - 18
cricketer - 189
the probability distribution might be something like:
batsman - .2
king - .1
founder - .05
cricketer - .7

And finally, the next word is selected!

This is how a Transformer works! If you don't know it yet, the Transformer is one of the most important things to learn in Gen AI, because the Transformer architecture is one of the reasons for the AI boom today!
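Here is a tiny, concrete version of Step 7 above: converting logits into a probability distribution with softmax. The logit values are toy numbers chosen for readability, not real model outputs.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical logits for a handful of candidate next tokens
logits = {"batsman": 2.1, "king": 0.3, "founder": 0.6, "cricketer": 3.4}

probs = softmax(logits)
for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{tok}: {p:.2f}")   # cricketer gets most of the probability mass
```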
0
1
14