
Transcript of SQL Full Course for Beginners (30 Hours) – From Zero to Hero

Video Transcript:

Hello and welcome to this unique course to master SQL. My name is Barzalini, and I lead big data projects at Mercedes-Benz, with over a decade of experience in SQL, data engineering, building data warehouses, and data analytics. Now, of course, the first question is: what makes this course so special? Well, not only will you learn how to write SQL code, but more important than that, you will learn how exactly SQL works behind the scenes. I'm going to break down complex concepts in SQL using hundreds of animated visuals. This makes it really easy to understand SQL, and it is also more fun than me just sharing my screen and showing you code, right? The second reason is that this course is taught by me. I have industry experience, and I will be sharing with you everything that I know about SQL and how I use it in my real projects. So I will be sharing with you hundreds of best practices, tips and tricks, and I'm going to show you my decision-making process in SQL. By the end of this course, you will be ready to solve any complex task like I do using SQL.

Now, I designed this course to cover the basics, like writing your first SQL query, and then we're going to keep progressing through the course by covering advanced techniques in SQL, like window functions, stored procedures, indexes, and at the end we're even going to build a data warehouse using SQL. This course is suitable for anyone: data engineers, data analysts, data scientists, and even students. And by the way, the good news: everything is free. From start to finish I will be sharing with you a lot of materials as well — code, presentations, and animations — and there are no hidden costs, so you don't have to pay for anything. But my friends, in return I would really appreciate it if you support the channel so it can grow. All right my friends, I'm really excited about it; I don't know about you. If you are motivated, join me in learning SQL. This is going to be amazing. So let's go.

All right. Now I'm going to show you the road map for learning everything about SQL, starting from the very basics and then advancing step by step until we reach very advanced topics. At the start we have to understand a few things, like what SQL is, why to learn it, what databases are, and the types of databases, and after the theory we're going to prepare your PC with the data and the software. Once we have everything, we can go to the next chapter: the basics of how to query data using SQL. Here we're going to cover the basic components of each SQL query, like select, from, where — those basics. Once you understand how to query the data, how to get the data out of the database, the next step is to learn how to define the structure of the database: how to create a new table, add a new column, remove a column, and as well how to drop a table. With that you are defining new things in the database. Then, in the next chapter, you have to learn about data manipulation. This time we're going to go inside the table and learn how to insert new data, how to update the data, and how to delete a few rows from our database. So with that you have the basics: how to query data, how to define the structure of your tables, and how to manipulate your data. And I can say with that you have covered the basics of SQL. After that we start with the intermediate phase, where we're going to deep dive into topics like how to filter your data. Here we're going to learn about the comparison operators, logical operators, BETWEEN and LIKE.
So, all the operators that you can use to build a condition in order to filter your data. After that comes a very interesting topic: you have to learn how to combine tables. Here we have two mechanisms, either using joins or using the set operators. And oh my god, joining data — it's going to be a very interesting topic. We're going to cover a lot here: we start with the basic joins, then we go to advanced ones, then you have to learn how to choose the right join, and after that you have to learn about the set operators, where you have four methods: UNION, UNION ALL, EXCEPT, and INTERSECT. So with that you learn how to combine multiple tables, either by combining the columns or the rows of your tables. This is very important.

Now, moving on in our course. Using SQL you can do a lot of things: cleaning up the data, a lot of data preparation, and at the end a lot of analytics and aggregations. There are two families of functions. The first one is the row-level functions, and here we have a lot of material: you can transform your string values, numbers, date and time, you learn how to handle nulls in SQL, and at the end, the amazing case statement. All of those are transformations of one single value; we call them row-level functions. After you learn how to do data transformations, you have to learn how to do data analytics and aggregations using SQL functions. We're going to start with the very basics, the aggregate functions, and then deep dive into the window functions, the analytical functions. Here we have aggregate, ranking, and value functions. Those are very important tools for any data analyst or data scientist doing analytics tasks in SQL. So I can say the row-level functions are for data engineers, and the analytical functions are for data analysts. At chapter 8 we can say you have covered the intermediate level, and the last four chapters will be the advanced material in SQL.

So here there are a lot of techniques that you have to learn in SQL. The first one is the subquery, a query inside another query, and the very famous CTE, the common table expression — a lot of developers like this one. Then you will learn how to create views in the database; if you learn this technique, you're going to be really professional in SQL. Then we're going to learn how to create tables using select, and the temporary tables, and then we're going to learn about stored procedures, how to write a program in SQL, and after that, of course, come the triggers. Those are the advanced techniques that you have to learn in SQL in order to do advanced projects. Now, once you learn all those concepts and you start writing a lot of SQL code, you will notice that some queries are going to be really slow, and for that you have to learn how to optimize the performance of your queries. Here there are a lot of techniques; the most famous one is to create an index in the database, or to create a partition, and at the end I will be sharing with you the top 10 best practices that I have learned in my projects on how to optimize the performance of your queries. This is very important. Then we're going to move to a very interesting one: I will be sharing with you how I use AI, like ChatGPT or Copilot, as I'm using SQL in my projects. Here you have to learn how to write correct prompts to get assistance from AI as you are using SQL. And finally, my favorite one: SQL projects.
So my friends, here you have to bring everything that you have learned about SQL into hands-on projects. With real projects you will face challenges and struggle, and this is where the magic and the real learning happen. There are three types of projects. The first one is the data warehousing project. This is a very data-engineering-focused project where you're going to learn how to build a real data warehouse, taking the data from its raw format and processing it in different layers. Once you build it, you jump to another project, where you're going to start exploring the data and getting the first insights about the business. And the last project that you can do is the advanced data analytics project. So this is a very important section where you do SQL projects. My friends, this is the road map for learning SQL. As you can see, it takes you step by step from the basics to intermediate, and you will end up with advanced topics, and with that I can tell you: you will learn everything about SQL.

Okay, so now let's start with the first chapter, the introduction to SQL, and here we're going to cover a few topics. We have to understand first: what exactly is SQL? Why do we have to learn it? What are databases, and what are the different SQL commands that we have? So this is the basics, the theory about SQL. So what exactly is SQL? Let's go.

Everything generates data, and data is everywhere. Your first name is data; your mobile and everything inside the mobile is data. Your car generates a lot of data as well. Your bank, your finance statements — everything is data. And now, of course, the question is: where do we store our data? Personally, we store a lot of our data in things like Excel spreadsheets or text files. So you store a lot of your data in different files. Now how about companies? They have a lot of things that generate a lot of data: the products that they produce, their customers generating data as well, sales information, and a lot of other things. So companies generate massive amounts of data, and the big question is how they handle the data, how they store it. Of course, they cannot just use simple files. They need something bigger, stronger, and smarter. And this is where the database comes in. Think about the database as a container for storing data. But instead of just dumping files into folders, the database organizes the data so it is easy to access, to manage, and to search. So a database, simply, is a container that stores data.

Now you might ask: why are we using a database? Can't we just use files, like I do personally? Well, let me tell you why we use databases. Imagine that someone asks the following question: go and find the total spending in your data. Now, in order for Mike to find the total spending and the costs, he will be opening each of those files one by one, searching for the costs and trying to combine the data, and it's going to be a very long and messy process. But on the other side, if your data is in a database and you want to ask a question, it's going to be very easy. All you have to do is talk to the database, ask a question, and the database can answer your question with a result. And now, of course, comes the question: how do we talk to a database? Well, we use SQL. SQL is the language that you use in order to talk to the database. It stands for Structured Query Language. And here you have people that call it "sequel", like me, and others that call it "S-Q-L".
There is no right and wrong, but if you follow me through the course, I think you will start saying "sequel". So by using SQL you can ask the database — you can ask your data — and the database is going to answer your question by sending you a result. This process is very easy, simple, and fast, and it is way better than having your data stored in different files. Another reason why we use databases is that they can handle really huge amounts of data. Sometimes we have millions of rows inside our database. On the other side, if you are storing your data inside spreadsheets and you have a massive amount of data, what can happen? Your spreadsheets are going to just break; they simply can't handle big data. And another reason why we use databases is that they are secure. It is safer to store important and critical data inside a database than just storing it in spreadsheets and files. Databases are secure, and you can control who is accessing what. So it is just more professional to store the data inside a database.

All right my friends, so far what we have learned: most companies store their data inside a container called a database, and for you, in order to ask questions and talk to your database, you have to speak the language of SQL. Now I'm going to show you how it usually looks in companies. We have our data inside the database, and then you have multiple people with multiple roles who are writing different SQL queries in order to talk to the data. But not only employees and people interact with the database. You could build a website or an application that interacts with the database as well, by sending different SQL queries. And of course, depending on how many people are interacting with the application and the website, it might generate a really massive amount of SQL queries sent to the database. And not only that, you might have tools as well for data visualization, where you have a dashboard or reports, maybe created using Power BI or Tableau, used by stakeholders and managers in order to make decisions; those tools will also be connected to the database and generating SQL queries.

So, as you can see, we have a lot of interactions with the database — people, applications, tools; a lot of things are generating SQL and interacting with the database. But the database is just a container, a storage, right? So we need something, a piece of software, that manages all those requests, and that's why we have something called a database management system, DBMS. It is software that is going to manage all those different requests to our database, and it decides the priority: which SQL must be executed first. This software can manage the security as well, deciding whether the SQL is allowed to be executed in the first place. So my friends, the DBMS is the software that is going to manage the database. And we are not done yet; there is something missing. We have our data, we have the software. What is missing here is the hardware. In real companies, we cannot run all that on our PC, because first our PC is weak, and it also goes offline. That's why we need a server.
A server is like a very powerful PC, and it lives 24/7, so it is always available. Here we can decide whether we're going to have a server inside the company, or use cloud services in order to run our database. So my friends, so far what we have learned: the database is a container to store the data, SQL is the language to talk to the database, the DBMS is the manager that manages the database, and the server is the physical machine where the database lives. So this is how it looks.

And now my friends, there are different types of databases, so let's see what we have. The first and most famous one is the relational database. It is very simple: it is like spreadsheets, called tables, where we have columns and rows, and then there are relationships between those tables describing how they relate to each other — and that's why we call it a relational database. If people hear "database", they're going to think about this one. We have another type of database called key-value. This time the data is organized completely differently, where you have pairs of keys and values. Think about it like a big dictionary, where you have a word, the key, and the definition of the word, which is the value. Moving on to the next one, which is important as well: column-based. Instead of grouping the data by rows, this type of database groups the data into columns — that's why it's called column-based. This is a very advanced kind of database for handling huge amounts of data, where the main purpose is searching for data. Moving on to another type, the graph database. The main focus here is the relationships between objects, so the main idea is how to connect my data points. And finally, we have the document database. The data is stored as entire documents, where the structure of the data is not that important; what is more important is to fit everything in one page, one document. Now, if you look at those five types, we can group the document, graph, column-based, and key-value databases together and call them NoSQL databases, while the relational database is the SQL database. In this course, we will of course be focusing on the relational database. I'm sure you have heard of Microsoft SQL Server, MySQL, and Postgres — they are SQL, relational databases. For key-value you have Redis and Amazon DynamoDB, for column-based we have Cassandra and Redshift, for the graph database we have Neo4j, and the very famous MongoDB is a document database. For this course we're going to focus on the SQL relational databases, because they are the most famous and the most used in companies, and I will be focusing on Microsoft SQL Server. So those are the different types of databases.

Now, databases are very structured and organized. They have the following hierarchy. The starting point is the server; as we learned, it is a powerful PC, and it is where the database lives, and inside it we can have multiple databases. So maybe you have a database for sales and another one for HR. The server can host multiple databases, and as we learned, a database is a container for your data. Moving on to the next level: in each database we can have multiple schemas. A schema is like a category, or you can call it a logical container, that we can use in order to group related objects — say you have hundreds of tables.
You can put all the tables that have to do with orders in one schema, and then another group of tables in the schema customers, and so on. So it helps you organize your tables and your objects in the database. Now, if you go inside a schema, you can have multiple objects, like tables. Of course, the question is: what is a table? It is like a spreadsheet; it organizes your data into columns. The column defines the data that you store inside it. So you have one column for the customer ID, another column for the names, the scores, the birthday. Each column is about one type of data, and sometimes we call the columns fields. The other thing that we have in tables is the rows, or as we sometimes call them, records. This is where the data is actually stored. In this example, each record represents one customer, one person: we have one record for Maria, John, and Peter. Those we call rows. Now, in each table there is one very important column called the primary key. It is always very important to have one unique identifier for each customer, for each row, and we use it for different purposes: to combine the table with another table, or to identify one customer quickly. It is unique, like a fingerprint, and there are no two customers having the same ID.

At the overlap between a column and a row we have a single value, a cell, and each column stores a specific data type. A data type describes what kind of data we are storing, like an integer (1, 2, 30) or a decimal, where you have a decimal point (3.14). If you want to store characters, we have different data types for that — say you want to store the name or a description. Here we can use char or varchar, and you store inside them something like the first name, Maria. Now you might ask: what is char or varchar? The char is always fixed: if you define it as five characters, it is always going to reserve five characters of space. But if you want things more dynamic, then you go with varchar. Moving on, we have other data types for date and time. If you want to store a date, like birth dates, there is the date data type, and if you want to store time information, you can use the time data type. So all these things — int, decimal, char, date, time — are data types. So my friends, as you can see, SQL databases are very organized and structured.

Okay, so now let's focus more on the SQL itself. We have different types of commands in SQL. Let's say that we have a database, and this database is empty; we have nothing inside it. Of course, the first thing that you have to do is write SQL with the command create in order to create a brand new table in the database. Once you execute it, the database is going to go and build one, but this table is empty; we have nothing inside it. Now, what you have done here is define something new, right? We call this type of command the data definition language, DDL. We have create to create something new, alter to edit something that already exists, and drop to delete something — to drop a table, for example. So this is the first family of commands.
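To make this concrete, here is a minimal sketch of a create statement using the data types just described. The table and column names are illustrative, loosely based on the customers examples in this chapter, not the exact course scripts:

    CREATE TABLE customers (
        id INT PRIMARY KEY,        -- unique identifier, the 'fingerprint' of each row
        first_name VARCHAR(50),    -- variable-length characters, up to 50
        country_code CHAR(3),      -- fixed-length: always reserves exactly 3 characters
        score INT,                 -- whole numbers like 1, 2, 30
        height DECIMAL(5,2),       -- numbers with a decimal point, like 3.14
        birth_date DATE            -- date values, like a birthday
    );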
Now, if you look at our table, it is empty. What do we need? We need data. Let's say that we have a website or an application, and this application is generating a lot of data. In order for this application to move the data inside our new table, it must use the SQL command insert. If you execute insert, you can add new data to your table. This type of command we call the data manipulation language, and here we have three commands: insert in order to insert new data, update in order to update already existing data, and delete in order to go and delete data from your table. That's why we call it data manipulation language: because you are manipulating your data.

So what do we have now? We have a table, and we have data inside the table. Now we can start asking questions. Let's say that you have an analytical question about your data. All you have to do is write something called an SQL query, and inside it you use the command select — but the whole thing we call a query. You send a query to the database, you have a question, and the database returns for you the result, the data answering your query, your question. We call this type of activity the data query language. Here we have only one command, and it is very famous: the select. We can use it in order to query our data. So those are the three different families of commands in SQL. Of course, we're going to learn all of them, but we will spend most of our time learning how to write the correct query for the correct answer.
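As a compact recap of the three command families, here is one hedged example of each; the table is the illustrative customers sketch from above, and the values are made up:

    -- DDL: change the structure (create / alter / drop)
    ALTER TABLE customers ADD email VARCHAR(100);

    -- DML: change the data (insert / update / delete)
    INSERT INTO customers (id, first_name, score) VALUES (6, 'Anna', 420);
    UPDATE customers SET score = 430 WHERE id = 6;
    DELETE FROM customers WHERE id = 6;

    -- DQL: read the data (select)
    SELECT first_name, score FROM customers;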
And now you might ask me: Barra, why do we have to learn SQL? And if time went back, would you learn SQL again? Well, for sure, of course. Here are my top three reasons. The first one: you have to learn it in order to talk to the data. Most companies store their data in databases; this is the standard way, this is how they do it. If you want to work at a company in the data field and you want to talk to their data, then you have to use SQL. It's like moving to another country where they speak another language: if you want to live there for a long time, you have to speak their language. The same thing here: if you want to work with data, you have to learn the language you use to speak to the database — SQL. For me, this is the most important reason. Second, SQL is in high demand. If you go now and check the job descriptions for software developers, data analysts, data engineers, data scientists, I promise you, you will find that they ask for SQL. You will find SQL skills demanded in almost every job description. So if you check any data-related job, you will find that they ask for SQL skills. The third reason is that it is an industry standard. If you go and check modern data platforms and tools like Power BI, Tableau, Kafka, Spark, or Synapse, you will see that there is always a section where you can enter SQL code. Most of those vendors adopt SQL because it is the standard; it is widely used, and it is a selling point that their tools are easy. So those are my top three reasons why SQL is still relevant and why you have to learn it. Okay, my friends.

So with that, we now have a clear understanding of what SQL is, why we need it, what databases are and their different types, why we have a DBMS and servers, and as well how things are organized and structured inside databases. So that's all; this is SQL. All right, so with that we have covered the basics about SQL and databases. In the next step we're going to go and set up our environment: we're going to prepare your PC with the data, the databases, and all the tools that you need in order to learn SQL.

Okay, so now go to the link in the description and you will land here on my newsletter website, and you can subscribe if you want to get weekly news about my content; I make posts about data and many other projects as well. Once you do that, we're going to go to the downloads over here, and you will find all the materials for the different courses; the one that we want is the SQL ultimate course. Let's go over here. Once you do that, you will land on this page where I have listed all the important links. The first and most important one is to go and download the course materials. Here you can find everything: the code, the slides, the presentations of the whole course. Or, if you don't want that, you can go to my Git repository, and there you will find exactly the same materials. So let's go and download everything.

Okay, now go and put the downloaded folder somewhere safe, and let's go inside it. Here you can find three things. The first one is the datasets: if you go inside it, you will find the data for the course, the databases that we will be using in order to practice SQL. Everything is available here. In the second folder you can find all the documentation. That means all the visuals, the presentation slides, everything that I present during the course is available here as documentation notes for you. Moving on to the third one, we have the scripts. During the course we will be writing a lot of SQL code, and all of that code is available here. So those are all the scripts used in the course. Okay, so with that you have all the course materials.

All right, the next step is to go and download SQL Server Express, and you can find the link over here as well. So let's go there: SQL Server Express. We're going to land on the Microsoft page, where we can see the different offerings from Microsoft for SQL Server: either we have it on Azure, or we can download it on premises. But we don't want those; just scroll down to see these two options. The first option, on the left side, is the developer edition. You get all the features and services that Microsoft offers with SQL Server. It is free as well, but the installation here is a little bit complicated. The second option, on the right side, is the express edition. The installation here is going to be really fast and very easy, and you still get everything that you need for practicing and learning SQL. Both options are free; it's just a matter of the installation. We will go now for the express edition. So go and click download now; it's a very small file. Let's go and start it. And now the installation is going to begin. We have basic, custom, and download media. Download media means download now, and later we're going to do the installation. Custom means we have more control over how to download and install things.
The basic is the easiest and quickest one, so let's go with basic and click on that. Let's accept all those terms, and now let's click on install. It's going to install the applications, drivers, and so on; it may take a little bit of time. Once that is done, let's go and click on install SSMS, and we can find the link over here as well. So let's go to SQL Server Management Studio and click on that. You can, of course, find this link together with the other links that I have collected. Now we are again on a Microsoft page. Scroll down and you will see the following link: free download for SQL Server Management Studio (SSMS). Let's go and click on that, and then it's going to download it. Let's go and start it. The first thing that we have to define is the location; I will go with the defaults. So let's click on install. Okay, setup completed. We have just installed SSMS, so let's go and close it. Now let's go and start it: if you go to your menu over here and search for SQL Server, you will find it here — SQL Server Management Studio. Let's go and start it.

Okay, now we're going to get this window in order to connect to our server. So again, what is our server? It is the one we installed in the first step, SQL Server Express. That's why you're going to see in the server name your PC name — of course, it's not going to be my PC name — followed by something called SQLEXPRESS. This is the server we just installed. In the first option we have database engine, reporting services, and so on; those are different offerings from Microsoft. We're going to leave it as database engine, and it should look like this: SQLEXPRESS. Now, how do we access this database? We can do that using Windows authentication or SQL Server authentication. Let's stick with Windows authentication. The username is going to be the PC name together with the Windows user. If for some reason you don't have that information, you can go to the search, search for cmd, and then run the command whoami; with that you will get the PC name and the user that you are currently logged in as. And this is exactly what I'm seeing over here. One more thing: if you're having issues connecting to your database, make sure to check the encryption — it should be mandatory — and to click on trust server certificate. Once you do that, you will be able to connect.

Okay, so with that we have the server and we have the client. Now the last step: we have to go and create the database where we want to insert our data. If you look at the object explorer and open the databases, you can see that we don't have any database yet. So let's do something about it. Go back to the course materials; inside the datasets you will find three folders: MySQL, Postgres, and SQL Server. So if you want to follow this course using a different database, like MySQL or Postgres, you can find the exact same data for the database that you are using. But in this course we are using SQL Server, so if you are following along with that, go inside the SQL Server folder, and here you will find four files with different extensions. So what is going on here? For this course we have two databases: one that is very simple, called MyDatabase, and a second one that has more tables, called SalesDB. And in SQL Server there are multiple ways to create databases.
I will show you now two methods for creating the database. The first option: we create the database from a script. If you look at those files, we have here two files with the extension .sql; those are files with SQL code. Let's start with the first one, the init script for MyDatabase. Go inside it, and here we have the SQL code. Copy everything, and now let's go back to our studio, go to the menu, and click on new query. Here in the middle you can paste the code. So now we have the code for the first database, and all you have to do is go and execute it. Once we execute it, you will see that we don't get any error. On the left side we don't see our database yet, because we have to refresh: right click on the databases and click refresh. And now you can see it: MyDatabase. Let's see the content: expand it, and then expand the tables. Now you see here our two tables, customers and orders. Inside those tables we can find our data. In order to see the data, right click, for example, on the customers table and go with the option "Select Top 1000 Rows". Once you do that, you can see in the results that we have five customers. This is our data inside the table customers.

So here, again, about the interface: on the left side we have the object explorer, where you can see the whole structure of the database, from server to databases to tables. On the top we have a menu with a lot of icons. The place in the middle we call the SQL editor; this is where we're going to write our SQL code. And once you execute, at the bottom, below the SQL editor, we have the output: here you can see, for example, the data, the results, or different messages from the database. So the interface is very simple.

Now we have to go and get our second database. If you go back to our files, you can find a second SQL file, the init script for SalesDB. Open that, copy everything, and let's go back to our studio. Same thing: create a new query, then paste the whole code; this database is the SalesDB. Let's go and execute it, and again we don't get any errors. Now we go to the left side, do the same thing, refresh, and we can see the second database, SalesDB. We can go and explore it: expand it, go to the tables, and here you can see five tables, like customers, employees, orders, and products. This is the intermediate database for our course. Now let's go and check our data: for example, go to the orders, right click, and choose "Select Top 1000 Rows". And those are the orders in our database. Perfect, so everything is working. Those are the two main databases that we will be working with throughout the whole course.

And of course, if you want to go and practice using another database, that's totally fine. For example, at Microsoft there is a database called AdventureWorks. It is really amazing, and I'm going to show you now how to import it. We can go over here to the AdventureWorks link and click on it. Now we are again on a Microsoft page. If you scroll down, you can see here three different types of databases: the OLTP, the data warehouse, and the lightweight. They are different databases. The OLTP is the most complicated one — a lot of tables and transactions and so on. The data warehouse is a really nice one for doing data analysis and similar work. The lightweight is the simplest one.
So let's go, for example, and get the data warehouse. Click on that, and as you can see, the extension of this file is .bak. Now I'm going to show you the second way to create databases in SQL Server. All you have to do is go to the following path — it really depends on where you have installed SQL Server; for me, it is in Program Files, Microsoft SQL Server, then the SQLEXPRESS instance folder, then MSSQL, Backup. You have to go there. Here you can place all the files with the extension .bak, for example the AdventureWorks file that we just downloaded. This is a backup file for the database, and we want to go and restore it; with that, you are creating a database. So this is the second method for creating databases in SQL Server, by restoring a backup — useful if for some reason the script didn't work for you.

Now let me show you quickly how we can do that. Let's go back to our studio. Right click on the databases, and here we have an option called restore database; click on that. Now we have two options under the source: database and device. The default is going to be database, but we have to switch to device, because we want to import from files. Then we go to these three dots and click, and now we have to go to the option add. It's going to take you to the place where SQL Server keeps its backups. Here we can find our files, and what we want to restore is the AdventureWorks. Select that, then OK, one more OK, and one final OK. Now the database will be restored, and it was successful. On the left side we can now see our third database — if you don't see it, go and refresh, of course — and here you will find a lot of tables in the AdventureWorks. As usual, we can go and explore the data by selecting the top thousand rows. So my friends, now you have three databases, but of course our focus is only on the first two that we created, MyDatabase and SalesDB. And with that, you have learned two ways to import databases into SQL Server.
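By the way, the same restore can also be done in code instead of the wizard. Here is a rough sketch using the T-SQL RESTORE command; the database name and the file path are placeholders for your own setup, and depending on where your data and log files live, you may need extra options (such as WITH MOVE):

    RESTORE DATABASE AdventureWorksDW                          -- name is a placeholder
    FROM DISK = 'C:\...\MSSQL\Backup\AdventureWorksDW.bak'     -- use your actual backup path
    WITH RECOVERY;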
So with that, my friends, we have prepared everything. We have SQL Server Express running on your local PC, we have the studio — the client that we're going to use in order to interact with the database — and we have created the two databases that we will be using in order to practice SQL. So we are ready. All right my friends, with that we are done with the first chapter, the introduction to SQL, and now we're going to start learning the first thing in SQL: how to query our data. So let's go and start with that.

Okay, so now let's understand exactly what an SQL query is. Normally your data is inside a table, and your table is inside the database, and you might have a question from the business, like: what is the total sales? What is the total number of customers? For any question that you have in your mind, you want to go and ask your data; you want to retrieve data from the database, and in order to do that you have to talk to the database using its language, SQL. So you're going to go and write a query, and inside the query you write something called a select statement; with that, you are asking the database for data. Once you execute your query, the database is going to go and fetch your data, and then it prepares a result to be sent back to you. So you are asking the database a question by writing a query, and the database is going to process your query and answer your question by sending back data. With that, we are reading our data from the database; queries will not modify anything — they will not change the data inside your tables or even change the structure of the database. You use a select statement only in order to read something from the database; you just want to retrieve data. So this is what we mean by a query.

And now my friends, each SQL query usually has different sections, different components; we call them clauses. And this is amazing, because you're going to have enough tools to write a query that matches any question you have about your data. So we're going to cover all those clauses step by step, so that you can write any query that you need. We're going to start with the two clauses that make up the simplest query in SQL: the select and the from. So let's start with that.

All right. It's really important to me that you understand how SQL works with the code, with the queries. So what I'm going to do is show you on the right side the syntax of the query in SQL, and then on the left side I'm going to show you, step by step, exactly how SQL is going to execute your query. We have the table customers inside our database, and we will start with the easiest form, where we select everything: select the star. The select star is going to retrieve all the columns from your table — everything — and the from clause tells SQL where to find your data. So with the select you pick the columns that you want, and with the from you specify the table your data comes from. The syntax is going to be very simple. Each query starts with the select, and since we want all the columns, we write a star; with that, SQL understands: I want to see everything. After that comes the keyword from, and now we want to tell SQL where the data comes from, so we specify the table name. And that's it; this is all you need to do. Once you execute it, what's going to happen? SQL is going to execute the from clause first: it retrieves all the data from the database into the result. Then, in the next step, it checks the select statement: which columns do we have to keep in the result? Since you are saying star, SQL is going to keep everything, all the columns, and with that you will see everything in the result — all the columns and all the rows. So that's it; this is how it works. Now let's go back to SQL in order to select some data from our database.

Okay, back to our studio. Let's go and start a new query, and let's find our database — just expand it, and our tables. Now, it is very important to make sure that you are connected to the correct database. So go to the top left, in the menu over here, and make sure to select your database, MyDatabase, like this. Or we have a command for that, called use: you just write use and then the database name. With that I'm telling SQL: use MyDatabase, and SQL is going to switch to your database. Now, whenever you are learning a new programming language, it is very important to understand comments. Comments are like notes that you add to your code in order to understand what is going on, and of course the engine, the database, will not go and execute them.
It's going to ignore everything inside them. There are two ways to write comments. Either you make an inline comment by typing two dashes and then writing anything after them, like "-- this is a comment". In SQL, if you see it turn green, that means it is a comment. The other type is the multi-line comment: you write a slash and then a star, and everything after that, even across new lines, is a comment. As you can see, all the lines after the slash-star turn green, which means they are a comment. And when you are at the end, you close it by writing a star and then a slash; with that you are telling SQL: I'm done with my comment. So those are the two ways of writing comments in SQL.

Now back to our query. Let's say that we have the following task: retrieve all customer data. I would like to see in the results all the data about my customers — everything, all the rows and all the columns. Currently our data is stored inside the table called customers, and I need to see all the data in the output. In order to do that, we're going to write a query, and all our queries always start with a select. Since I need everything, all the columns, we write a star, and then on a new line let's specify for SQL where it's going to get the data from. So it's going to be from, and then we write the name of the table. It must be written exactly as it is in the database: the table is called customers, so you have to have it here as customers. That's it; let's go and execute it. Now, if you look at the results, you can see we have four columns and five rows. With that, you are seeing everything inside the table customers: you can see we have five customers, and you can see all the columns about the customers. This is very simple: we have asked the database a question using an SQL query, and the database answered our question by returning our data in the results.

All right, now let's move to another task. I'm going to go and create a new query, and this time we're going to retrieve all the order data. That means I would like to see all the data inside the orders. So let's write a very simple query: we start as usual with select, and since we want everything, it is select star from our table orders. That's it; let's go and execute. You can see in the output we have again four columns, but this time we have only four rows. That means in this table we have four orders, and we can see all the data inside this table. So with that we understand: we have five customers inside our database, and these customers generated four orders. As you can see, we are now talking to our database, and this is the simplest form of query in SQL.
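For reference, here is how those first steps look together as one script; the comment styles are the two we just covered, and MyDatabase, customers, and orders are the course tables created earlier:

    USE MyDatabase;

    -- Task 1: retrieve all customer data (an inline comment)
    SELECT *
    FROM customers;

    /* Task 2: retrieve all order data
       (a multi-line comment) */
    SELECT *
    FROM orders;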
All right, now let's move to the next step in our query, where you say: you know what, I don't want to see all the columns from the database; I want to be more specific. I would like to select exactly the columns that I need. So now we want to select a few columns from the database, picking only the columns that we need instead of everything. About the syntax, we're going to change one little thing: instead of using the star, we're going to make a list of the columns that we want to see in the output. So we're going to select column one, column two, and we separate them using a comma. We are just writing a list of columns right after the select, and the from stays as it is: from a table.

Now, if you execute this, what is going to happen? As usual, SQL starts with the from, so it's going to go and get the data from the database, and then the next step is to check the select. What's going to happen? SQL is going to keep only two columns — for example, the name and the country — and all the columns that are not mentioned in the select statement will be excluded. SQL is going to remove them from the results and keep only the columns that we mentioned in our query. So this time, instead of having four columns in the output, we have only two. With that, you are filtering the columns, and you are selecting exactly what you need. Now let's go back to SQL in order to practice this.

All right, so now we have the following task, and it says: retrieve each customer's name, country, and score. That means I don't want to see everything from the table customers; I need to see only the three columns. Let's see how we can do that. As usual we start with select, and I'm going to go with a star in order to see the whole table first, from the table customers. So it's exactly like before; let's go and execute it. Now I can see everything inside the table customers, but the task says I need only three columns. So instead of the star, we're going to make a list of columns. We start a new line and write the name of the first column, the first name, then a comma and a new line for the second column, the country, then again a comma, and then we write the score. With that, we have the three columns. Now, what I usually do is select them and give them a push using a tab; this just looks nicer and is easier to read. So with that we have, between the select and the from, a list of columns. Now, there is a mistake that happens a lot, where we type a comma after the last column. If you do that and execute, you will get an error, because SQL is going to expect a column after the comma, and since there is no column and immediately after it comes the from, you will get an error. So there is no need for a comma after the last column. Let's remove it and execute. Now you can see we don't have four columns in the output; we have only three: the first name, the country, and the score. And by the way, they are ordered exactly as you selected them in your query: first we have the first name, then the country, and then the last one, the score. That means if I now go and change the order — let's put the country at the end — and execute, you will see the country at the end. I'm going to put it back in between, to match the task exactly, and remove the last comma. Execute again, and with that we have selected a few columns from our table. So we are more specific about what we need.
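The finished query from this task looks roughly like this, assuming the columns are named first_name, country, and score, as shown in the results:

    -- retrieve each customer's name, country and score
    SELECT
        first_name,
        country,
        score      -- note: no comma after the last column
    FROM customers;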
Okay, so with that we have covered the two clauses select and from. Next, we're going to talk about the where clause, which you can use in order to filter your data. So let's go. What exactly is where? We use where in order to filter our data based on a condition: any data that fulfills the condition stays in the output, in the result, and the data that doesn't meet the condition will be filtered out of the results. A condition could be anything; for example, we say the score must be higher than 500, or you can say the country must be equal to Germany — any condition that you have in your question. Now let's see the syntax in SQL. As usual, we start with a select, and we select the columns that we need. Then we write the from, where the data comes from, and after the from we write the where, and exactly after that you specify your condition.

So now let's see how SQL is going to execute this. First, SQL starts as usual with the from: it goes and gets your data from the database. After that, SQL executes the where clause. Let's say the condition is: the score should be higher than 500. What's going to happen? SQL is going to check each row, whether it meets this condition or not. For example, Maria doesn't fulfill the condition, because her score, 350, is not higher than 500. So she doesn't fulfill the condition, and SQL is going to completely remove this row, this record, from the results. Then SQL goes to the second record: John is fulfilling the condition, so he stays in the result. The same thing for George. Moving on to the fourth one, Martin: this customer is not fulfilling the condition, and SQL is going to remove him from the results. The same thing happens for the last customer: the score is zero, not fulfilling the condition. That means if we apply this filter, SQL is going to return only two customers out of five. So with that, we are filtering the rows based on a condition, using the where clause. Now, as you can see, in the result we are getting all the columns. But if you specify in the query, for example, only two columns, like the name and the country, then SQL is going to remove columns from the results as well. This means in the output we will get only two columns and two rows. So with that, you are filtering both the columns and the rows of your results. Now let's go back to SQL in order to practice this.

All right, so let's take the following task, and it says: retrieve customers with a score not equal to zero. Now, if you look at our task, you see we have a condition here: the score must not be equal to zero. So I don't want to see all the customers; I want to see only the customers that fulfill this condition. It's like we have to filter the data. Let's go and solve the task. We start as usual: select star — there are no specifications about the columns — from our table customers. Okay, I'm going to start with this; let's go and execute it. Now, if you look at the result, you can see that almost all the customers fulfill the condition: their scores are not equal to zero. Only one, the last customer, has a score of zero, so this customer does not fulfill our condition. Now let's go and build a filter for that. We're going to say where — and there will be a whole section focusing only on how to build conditions and filters in SQL, so don't worry a lot about the syntax of conditions; we're going to cover that later, of course, but it is very simple. For the condition we need a column. Which column is our condition based on? It's going to be the score, so we write here score, and since we are saying not equal, there is an operator in SQL for not equal, and then we have to write a value after it, which is going to be zero. So again, the condition is: the score must not be equal to zero. It's very simple, right? With that, we have our condition, and we are using the where in order to filter the data. Let's go and execute it. As you can see, SQL removed the last customer, because he doesn't fulfill this condition, and we now have only the rows that fulfill our condition.
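Here is that filter as one complete query; I've written the not-equal operator as != (in SQL Server, <> works the same way):

    -- retrieve customers with a score not equal to 0
    SELECT *
    FROM customers
    WHERE score != 0;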
As you can see, it is very simple to filter the data: all you have to do is write a where clause after the from, and then write a condition after it. Now let's take another task, for example: retrieve customers from Germany. I don't want to see all customers from different countries; I just want to see the customers that come from Germany. That means we have a condition here: the country of the customer must be equal to Germany. So let's go and remove the current condition — it is not the one that we need — and execute. If you look at the results, we have two customers that come from Germany, and we are interested in showing only those two customers. Let's go and build a filter for that. We write the where clause, and after that we need a column. The column is going to be the country, so we write country, and this time the country must be equal to Germany, so we write the equals operator and then Germany, exactly like the value inside our data. But now, as you can see, we are getting an error here. That's because in SQL, if you want to write a value that contains characters, you have to put it between two single quotes: a single quote at the start and one at the end as well. And now, as you can see, the red underline is gone and the value is shown in red — that's because it is a string value, a value that contains characters — and with that you will not get an error. So if your values contain only numbers, you can write them without single quotes, but if your values contain characters, then you have to write them between two single quotes. Okay, back to our condition: the country must be equal to Germany. Let's go and execute it. And it is working. As you can see, we are now seeing in the output only the customers that fulfill my condition, where the country is equal to Germany. So this is exactly how we work with the where clause in order to filter our data.

So my friends, this is how you filter your rows. Now let's say that I would like to filter the rows together with the columns: I just want to keep the first name and the country, and I'm not interested in seeing the scores and the IDs. In order to do that, we go to the select and list the columns that we want to see: the first name, then a comma, then the country, and that's it. Let's give it a push and execute it. So we have two rows and two columns. Guys, as you can see, SQL is very simple. All right, so with that you have learned how to filter your data using the where clause. Next, we're going to talk about how to sort your data using the order by. Let's go.

Okay, so what exactly is order by? You can use this type of clause in order to sort your data. And of course, in order to sort your data, you have to decide on one of two mechanisms: either you sort your data ascending, from the lowest value to the highest value, or exactly the opposite way, descending, from the highest value to the lowest. The syntax looks like this: as usual, we start with the select and then the from, and after the from you can specify order by; with that you are telling SQL we have to sort the data, and you have to specify two things. First, you specify for SQL the column that should be used in order to sort the results — for example, you can say score — and after the column name you specify the mechanism, for example ascending, from the lowest to the highest.
In SQL, if you don't specify the mechanism, the default is going to be ascending, so you will not get an error if you don't specify anything after the column name. But my advice here is to always specify something after the column, because it's just straightforward and easier to understand: if someone reads it, they immediately understand it's going to be ascending, because maybe not everyone knows what the default is in SQL. So always specify a value, even if it is easier to skip it. And if you want to sort the data from the highest to the lowest, you specify descending. As usual, SQL starts with the from: it grabs your data from the database. Then, in the second step, SQL sorts the result: the order by is executed, and SQL sees, okay, I'm going to sort by the score using the descending mechanism. SQL starts moving your rows around, where the first row is going to be the customer with the highest score. In this example John has the highest score, the 900, so John is going to appear as the first row of the result, because of his score. After that, the second highest is going to be George, with 750, and SQL keeps sorting the data: then we have 500, then 350, and the last row is going to be the customer with the lowest score, the zero. So this is how SQL executes your order by. Now let's go back to SQL in order to practice.

All right, so now we have the following task, and it says: retrieve all customers and sort the result by the highest score first. By looking at the task, we need all the customers — there are no conditions or anything to filter — but we have to sort the results. Let's go and do that. We start as usual by selecting all the columns from the table customers. If you go and execute it, you will get all your customers, and you are seeing the data exactly as it is stored in the database. You can see the result is not sorted by the scores: we have here a low score, then a high score, then a low one, and so on. Now, the task says we have to sort the results, so we have to go and use the order by, and now you have to work out by which column — we can get that from the task. It says it should be sorted by the score, so we're going to define the score here. The final thing that you have to define is the mechanism, descending or ascending, and you can get that from the task as well: we have to sort the data by the highest score first. The highest first and then the lowest — that means we're going to use descending. That's all; let's go and execute it. Now, as you can see in the results, the first customer has the highest score, then we have the second one with the second highest, down to the last one with the lowest score. That's it; this is how you sort your data, and with that we have solved the task.

Now let's do exactly the opposite: we want to sort the results by the lowest score first. That means we want to see the customers with the lowest score first; in this example, we should see the ID number five as the first one, because he has the lowest score, the zero. In order to do that, all you have to do is switch the mechanism: instead of descending, you use ascending. Let's go and execute it. And that's it: as you can see, now we have the lowest score, then the second lowest score, down to the last row, which is the customer with the highest score. So the lowest score comes first.
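Both sorting tasks side by side; ASC is also what you get if you leave the mechanism out, but as mentioned, it is better to spell it out:

    -- retrieve all customers, highest score first
    SELECT *
    FROM customers
    ORDER BY score DESC;

    -- the opposite: lowest score first
    SELECT *
    FROM customers
    ORDER BY score ASC;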
So it is very simple: this is how you sort your data using SQL. Now I'm going to show you one more thing you can do with the ORDER BY: you can sort your data using multiple columns, and we call it nested sorting. Let's take this very simple example where you want to sort your data by country. We are saying ORDER BY the column country, and the mechanism is ascending, so from the lowest to the highest. If you do that, SQL sorts the data based on the country: the first two customers are from Germany, since SQL sorts alphabetically, then we have the UK, and the last two are from the USA.

Now, if you check the final results, you might say: you know what, something is wrong here, the data is not completely sorted correctly. If you look at the first two customers, the ones from Germany, you can see their scores are sorted in an ascending way, from the lowest to the highest: first 350, then 500. For the UK it's fine, because we have only one customer. But if you look at the customers from the USA, you see it is sorted the other way around, descending from the highest to the lowest: first the score 900, then zero. So there is no clean rule for how the data is sorted, and the result is not really clean. This issue usually happens if you are sorting your data by a column that has repetition, like here the country: we have Germany twice and the USA twice.

Now, in order to refine the sorting and make it consistent, we can include another column in the sorting, in this scenario for example the score. So we can make a list of columns in the ORDER BY and separate them with commas. And of course you can have a different mechanism for each column: for the country we say ascending, but for the score we say, you know what, let's make it descending. It doesn't have to be one mechanism for all columns. So what happens now is that SQL refines the sorting within each section. For the two customers from Germany, the sorting is now from the highest to the lowest, so SQL switches the two customers: Martin comes first, because he has a higher score than Maria. With that we are refining the order within the same country value. For the UK nothing happens, because we have only one row, and for the USA nothing happens either, because it is already sorted in the correct way, from the highest to the lowest. So as you can see, by including a second column you are refining your sorting, and, my friends, the order of the columns is very important. This is how you can do nested sorting in SQL. Let's go back to our SQL and start practicing.
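For reference, the nested sort we just walked through looks like this as a query (a sketch, assuming the same customers table):

  SELECT *
  FROM customers
  ORDER BY country ASC, score DESC;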
All right, so now we have the following task, and it says: retrieve all customers and sort the results by the country and then by the highest score. So again we need all customers: select everything from the customers table. The task says we have to sort the result by the country, so we start with the ORDER BY, and since it says by the country, we go with the country and sort it alphabetically, so it's going to be ascending. Let's execute it. Now you can see the data is sorted by the country: first Germany, then the UK, and then the USA. But that's not all; the task says: then by the highest score. So we have to include another column in the sorting, and we can do that by adding a comma and then mentioning another column, the score. Now we have to specify the mechanism. It says by the highest score, so the highest must come first, and with that we use descending.

Now, what is the current situation? If you look at the results, for those two customers from Germany we have 350 and then 500, which means the scores are sorted ascending, and the same for the USA: from the lowest to the highest. If you now execute it like this, SQL is going to switch them. You can see over here that for Germany the highest, the 500, now comes first and then the 350, and for the USA they switched as well: first the highest, then the lowest. And with that we have solved the task.

Now again, the order of those columns is very important. Since the score comes after the country, we will not get the highest scores first in the results, so we will not get the 900 as the first row, and that's because the scores are sorted after the country: the country has the higher priority. Now, if you flip that and say sort first by the score and then by the country, and execute it, SQL first has to sort the scores, so you get the 900 first, right? And only then the countries. And since there are no duplicates in the scores, the second column makes no sense at all, and you can skip it. So nested sorting only makes sense if you have repetition in your results, and you can use the help of a second column to make the sorting perfect. So that's it, and with that, of course, we have solved the task.

All right, so with that you have learned how to sort your data using ORDER BY. In the next step we're going to talk about how to aggregate and group your data using GROUP BY, and we're going to put it between the WHERE and the ORDER BY, because in the order of the query the GROUP BY comes between the WHERE and the ORDER BY. So let's go.

Okay, so what exactly is GROUP BY? It combines the rows with the same value. So it's going to compress your rows to make the data aggregated and more combined. All GROUP BY does is aggregate one column by another column. For example, if you want to find the total score by country, you aggregate all the score values for each country. If you have this kind of task, you can use the GROUP BY. Let's see the syntax. We start, as usual, with the SELECT, and what we want to see in the result is two columns. First, we specify a category, like the country: this is the value you want to group the data by. And then another column where we do the aggregation: for example, we say we would like to see the total score, so we use the function SUM in order to summarize the values of the score. After that, as usual, we use the FROM in order to select the data from a specific table. And now comes the magic: after the FROM we use GROUP BY, and now SQL understands: okay, I have to combine the data, I have to group it by something. And this time we are saying: group the data by the country. That means each value of the country must be presented in the output only once, and for each country we want to see the aggregation, the total score. So let's see how SQL is going to execute it.
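Before we trace the execution, here is the query we just described, as a sketch:

  SELECT country, SUM(score)
  FROM customers
  GROUP BY country;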
So SQL first starts with the FROM: it gets the data from the database. Then SQL executes the GROUP BY, and now SQL understands: okay, I have to group the data by the country, and I have to aggregate the scores for that. So it identifies the rows that share the same value. For example, here we have two rows for Germany, and it brings them to the results. So now we have two rows for the same country, but since we are saying GROUP BY country, SQL is going to combine them, smash them together into only one row, because each value of the country must exist at most once; we cannot leave it like this. So what do we do with the scores? We have two scores. SQL checks the aggregate function, which is the summation, so it adds those values, 350 + 500, and with that we get the total score of 850. And with that, as you can see, SQL is combining those two rows into one: in the output, Germany will exist only once, and for the scores we get the total score. The same thing happens for the next value: in the country column we have the USA twice, so we get two rows, and SQL combines those two rows into one, because the USA must exist only once, and for the scores we get the total: 900 plus zero gives 900. And with that SQL converted those two rows into one. For the last value in the countries, the UK, it stays as it is: there is no need to combine anything, because it is already a single row. So, my friends, if you look at the output, you can see we grouped the original data by the country, and that means we get one row for each value inside the country column. In the original data we have five rows; in the output, if you use GROUP BY like this, you get only three rows. So this is exactly how the GROUP BY works. Let's go back to SQL and practice.

Okay, so we have the following task, and it says: find the total score for each country. From reading this you can understand we have to do aggregations, and we have to combine the data by a column. Now, usually I start like this: I start by selecting the columns that I need in order to solve the task. So what do we need? We need the country and the score from our table customers. Let's start like this. Now you can see we have the countries and the scores. The task says we have to group the data by the country, so that is the column we're going to GROUP BY, and the scores will be aggregated. So what do we do? We use the GROUP BY, since the task says "for each country": GROUP BY country. And now we have to aggregate the scores; we cannot leave them as they are, so we say the SUM of the score. Let's execute it. And with that, as you can see, we are getting the total score for each country. So now, instead of having five customers, we have only three rows, because the country column has three unique values. Now, if you check the result, you can see something weird: it says "no column name". That's because we have changed the scores. It's not the original score anymore, it is the total score; we have summarized those values. So SQL doesn't know what to call it: those values don't come directly from the database, they are a manipulation that you have done here.
Now, in order to give that a nice name, we can add an alias. An alias is just a name that lives inside your query. We can do it like this: AS, and then you can specify any name you want, for example total_score. Now SQL understands: okay, this is the name for this column, and if you execute it, you will see the new name in the results. But you have to understand: this name exists only in this query. You are not renaming anything inside your database, and you cannot use it in any other queries. It is just something known inside this query and only for your results. And of course you can rename any column: for example, here you can say this is the customer_country, and if you execute it, you are just renaming the column in the output. So this is really nice in SQL.

Okay, so there is one more thing about the GROUP BY: the non-aggregated columns that you add in the SELECT must also be mentioned in the GROUP BY. For example, let's say: okay, I'm seeing the countries and the total scores, but I would like to see the first name as well. So you go over here and get the first name: country, first name, the total scores, and execute. You will get an error, because SQL is telling you: I accept only columns that you group the data by or that are aggregated. The first name is not aggregated and also not used in the GROUP BY, so it is just there to confuse SQL, and it will not work. If you bring in a column, either it should be inside an aggregation or it should be part of the GROUP BY. In order to fix this, if you really want to see the first name, you can add it to the GROUP BY and execute. This time it works, because all the columns mentioned in the SELECT are also part of the GROUP BY. So now we have the countries, the first names and the total scores, but notice we have five rows again, not the three countries, and that's because you are now combining the data by the country and the first name together. SQL is grouping the data by two columns, the combination of the country and the first name, and those two columns give five unique combinations, which means you get five rows. So you have to be really careful about what you define in the GROUP BY: the number of unique values those columns generate defines the output. If you remove the first name from the SELECT and from the GROUP BY as well, you are grouping by only one column again, and this column has only three values, which is why you get three rows. And with that, of course, we have solved the task.

Now let's extend the task and say: find the total score and the total number of customers for each country. That means we need two aggregations: the total score and the total number of customers. From reading this you can understand we still want to group the data by the country, but this time we need two types of aggregations. We have almost everything; what is missing is the second aggregation. So what you can do is go over here and add another aggregate function, called COUNT. And what we want to count is the number of customers, so we can add the ID over here and call it total_customers.
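So the extended query now looks like this, as a sketch (column names as used in the course tables):

  SELECT country,
         SUM(score) AS total_score,
         COUNT(id)  AS total_customers
  FROM customers
  GROUP BY country;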
Now, if you execute it, you will get the total customers by country as well. And as you can see, SQL has no problem with the ID, and that's because you are aggregating the ID, so SQL knows what to do with it and how to combine it. That means you don't have to mention the ID in the GROUP BY, because you are aggregating it. So that's all; with that we have solved this task as well.

All right, so with this you have learned how to group your data using the GROUP BY. Next we're going to talk about another technique for filtering your data, but this time using the HAVING clause. So let's go.

All right, so what exactly is HAVING? You can use it in order to filter your data, but after the aggregation. That means we can use the HAVING only together with the GROUP BY. Let's see the syntax. Like in the previous example, we are finding the total score by country, so we have our SELECT, FROM, GROUP BY, and now you say: you know what, I would like to filter the end results. In order to do that, we use the HAVING after the GROUP BY, and, like with the WHERE clause, you have to specify a condition. We have the following condition: we want to see in the results only the countries whose total score is higher than 800. So this is going to be our condition. Now you might notice something: with the GROUP BY we use the country, the column we are grouping the data by, but with the HAVING we use the aggregated column, the SUM of the score. So this is how the syntax works; now let's see how SQL is going to execute it.

As usual, SQL starts with the FROM and gets our data. The second step: SQL aggregates the data by the country. So, like before, it groups the rows with the same country value, and we get one row for each country; this is what happens if you use GROUP BY. With that we now have aggregated values, right? And after the GROUP BY, SQL executes the HAVING. The HAVING is like a filter. We have a nice condition: the total score must be higher than 800, and SQL checks the new results after the aggregation. For Germany we have a total score of 850, so it meets the condition and stays in the results. The same for the USA: 900 is higher than 800 as well. But the UK does not meet the condition, 750 is not higher than 800, so SQL filters out this row. That means after applying the HAVING we get only two countries, because they have values fulfilling the condition. So this is what happens if you use HAVING: it is simply filtering the data.

But now you might be confused and say: you know what, we already used the WHERE clause to filter the data, so why does SQL have another clause to filter my data? Can't we just use the WHERE? Well, in SQL there are different ways to filter your data depending on the scenario. So let's add both filters to my query. We are already using the HAVING after the GROUP BY, and now let's add the WHERE. The WHERE comes between the FROM and the GROUP BY, so directly after the FROM. And here we are saying the score must be higher than 400. So now we are filtering based on the scores twice, right? Once we say the score must be higher than 400, and with the HAVING we say the sum of the score must be higher than 800. So what is the big difference? It is when the filter happens.
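Here is the combined query we just built, with both filters, as a sketch:

  SELECT country, SUM(score) AS total_score
  FROM customers
  WHERE score > 400          -- filters rows before the aggregation
  GROUP BY country
  HAVING SUM(score) > 800;   -- filters groups after the aggregation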
If you want to filter the data before the aggregation, if you want to filter the original data, then you use the WHERE clause. But if you want to filter the data after the aggregation, after the GROUP BY, then you use the HAVING. So it's really all about when the filter happens. Let's see how SQL executes this. As usual, first the FROM is executed to get the data. After that, as the second step, the WHERE is executed: this is our first filter. SQL filters the data using the WHERE before doing any aggregations, and based on our condition the first customer is filtered out, because his score is less than 400, and the same for the last customer. After applying the WHERE clause we get only three rows, only three customers. Next, SQL executes the GROUP BY: it groups the data by the country, and now we have less data to combine. The values will not be summed up, because we have only one row left for each country. After the data is aggregated by the GROUP BY, SQL activates the second filter, the HAVING. So the next step is to execute the HAVING, and here SQL filters the new results based on the total scores, checking them one by one. The USA meets the condition. The UK is filtered out, because it is not higher than 800. And this time Germany is filtered out as well, because this time it does not fulfill the condition: in the previous example, without the WHERE, we had more scores for Germany, which is why it passed the test, but this time, since we filtered out a lot of customers using the WHERE, Germany does not have enough score to pass the second filter. So in the output we get only one row, and that's because we are filtering a lot of data. It is very simple: the WHERE is executed before the GROUP BY, before the aggregations; the HAVING is executed after the GROUP BY, after the aggregations. So now let's go back to SQL in order to practice.

Okay, so now we have a very interesting task: find the average score for each country, considering only customers with a score not equal to zero, which sounds like a condition, and return only those countries with an average score greater than 430, which is again another condition. I know there is a lot going on, so let's do it step by step. Usually I start with a very simple SELECT statement with the columns and data that I need. So let's start with a simple SELECT. What do we need over here? We need the score and the country; so all we need is those two columns. Additionally, I'm going to select the ID, just to see the customer ID. Then let's get the country and the score from our table customers, and query that. So, as you can see, I start with the basics: query the data, and then build on top of it in the second step. Now, what do we have in the task? We have to find the average score for each country, which means we have to do some aggregations. And we have two conditions. The first condition says we need only the customers with a score not equal to zero. The second one says we need only the countries with an average score greater than 430. Now you have to decide, for each condition, whether you're going to use the WHERE or the HAVING. For the first one, we want to filter based on the scores, so we want to filter before the aggregation: it's not saying the average score, it's saying the score itself.
So that means for this one we can use a WHERE condition. For the second one, it says countries with an average score greater than 430, which means we want to filter the data after aggregating the score, so for this condition we have to use the HAVING. Now, what I would like to do first is implement the first condition. It's very simple: we say WHERE, after the FROM, the score is not equal to zero. Let's execute it. And with that we don't have any customers left whose score is equal to zero, so this part is solved. Now, for the second condition, first we have to do the aggregation. We start with the average score: we go over here, say AVG, and call it avg_score. But we don't want to see only the overall average score, we want the average score for each country, which means we have to aggregate by the country, and for that we use the GROUP BY; the GROUP BY always comes after the WHERE clause. So: GROUP BY, and which column? It's going to be the country. Now, there is an issue here: you cannot execute it like this; we have to get rid of the ID, we don't need it at all. Let's execute it. With that we have the average score for each country, so the first and the second part are completed. Now let's talk about the last part: the average score must be higher than 430, and for that we use the HAVING; the HAVING comes after the GROUP BY. We need to specify the condition, and it must use the aggregated column, so we take the AVG of the score from up here, put it after the HAVING, and say it should be greater than 430. That's it; with that we have the last part as well. Let's execute it. And with that, my friends, we have filtered the data after the aggregation. So this is how I decide between the WHERE and the HAVING. It is very simple.
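For reference, the finished query from this task, as a sketch:

  SELECT country, AVG(score) AS avg_score
  FROM customers
  WHERE score != 0
  GROUP BY country
  HAVING AVG(score) > 430;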
All right, so with that you have learned how to filter the aggregated data using the HAVING. Next we're going to go back to the top of the query, where we can use the keyword DISTINCT, exactly after the SELECT. So let's go now and learn about the DISTINCT.

Okay, so what exactly is DISTINCT? If you use it in SQL, it removes duplicates in your data. Duplicates are repeated values in your data, and DISTINCT makes sure that each value appears only once in the results. It sounds very simple, and the syntax is easy as well. As usual, we always start with a SELECT, but directly after the SELECT we use the keyword DISTINCT; there is nothing between them. Then the normal stuff: we specify the columns, and then the FROM in order to get the data from a table. Let's say I would like to get a list of the unique values of the country. The first thing SQL does, of course, is get the data from the database using the FROM. The second step is the SELECT: SQL executes it and selects only one column, the country; all other columns are excluded and removed from the results. And now SQL goes to the third step: it applies the DISTINCT on the country values. It acts like a filter that makes sure each value appears only once. It starts with the first value, Germany, and looks at the results: do we have Germany? Well, we don't have anything yet, so it includes it in the results. The next value is the USA: the same thing, we don't have the USA in the results, so it gets included, and this happens for the UK as well. Now comes Germany again, and this time SQL says: wait, we have it already. So it will not add it again to the output, because each value must appear only once; we will not have Germany twice. The same for the last value, the USA: we already have it in the results, so it will not appear again. And with that we have removed the duplicates, the repetition inside our data, so each value is unique. Now let's go back to SQL.

Okay, the task is very simple. It says: return a unique list of all countries. Let's go and do that, it's going to be fun. So: SELECT, and now let's get the column country from our table customers, like this. You can see we have a list of all countries, but the task says we need a unique list, which means we cannot have repetitions inside it. And for that we use the very nice DISTINCT. If you do it like this and execute, you will see there are no duplicates in your results, and all the values in the result are unique. So with that we have solved the task; it's very simple.

Now, there is one thing about the DISTINCT: I see a lot of people using it in cases where it's not really necessary. For example, let's go and get the ID. If you execute it, you can see we have a list of all the IDs, and there are no duplicates. But if I remove the DISTINCT and execute, we get the same results, because the IDs are already unique. So it really makes no sense to say DISTINCT here, because, as you saw, the database has to go and make sure each value appears only once: that is extra work for SQL, and it is usually an expensive operation. So if your data is already unique, don't apply DISTINCT. Only if you see repetitions and duplicates, and you don't want to see them, only in this scenario, go and apply the DISTINCT. Don't go blindly applying DISTINCT to each query just in case there are duplicates; that is usually bad practice. Okay, so that's all for DISTINCT.
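The solved task as a short sketch:

  SELECT DISTINCT country
  FROM customers;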
Okay, my friends, so with that you have learned how to remove duplicates using the DISTINCT. In the next step we're going to talk about another keyword you can use together with the SELECT: you can use TOP in order to limit your data. So let's go and understand what this means.

Okay, so what exactly is TOP, or, as other databases call it, LIMIT? It is again some kind of filtering in SQL. If you use it, it restricts the number of rows returned in the results, so you have control over how many rows you want to see. The syntax is very simple as well: directly after the SELECT you use the keyword TOP, then you specify the number of rows you want to see in the results, for example three, and only after that you specify the columns you want and from which table. Now let's see how SQL executes it. As usual, the FROM is executed and we get our data; the second step is selecting the columns, and in this case all the columns stay; and after that SQL executes the TOP. How does it work? It's very simple: for each row in the result, the database has a row number. It has nothing to do with your data or with the IDs. For example, in the current result we have row numbers 1, 2, 3, 4, 5. Those numbers are not your actual data; they are something technical from the database, not equal to the IDs. The IDs are actually your content, your data. So here we are not filtering based on the data, but based on the row numbers. Since we have defined three, SQL counts: okay, row number one, two, three, and that's it. It makes a cut, and all the rows after number three are excluded from the results; you get only three rows. So as you can see, this type of filtering is not based on a condition or on your data: whatever results you have, SQL will make a cut at a specific row. So let's go to SQL and practice that.

Okay, so now we have a very simple task. It says: retrieve only three customers. Let's go and do that: we select star from our table customers and execute it. As you can see, in the output we have five customers, but the task says we want only three, and there is no specification of any condition at all, so I don't have to write a WHERE clause with a condition based on our data; we just want three customers. We can do that very simply by adding TOP exactly after the SELECT and then specifying the number of rows we want to see in the output: SELECT TOP 3, and then the star. Let's execute it. And with that we are getting three customers. That's it, very simple.

All right, moving on to another task. It says: retrieve the top three customers with the highest scores. Of course, this is a mix between ordering the data and filtering the data, right? We usually sort the data by the scores from the highest to the lowest, but now we are doing both together. Let's do it step by step again. I go back to the SELECT star FROM customers. What we can do is sort the data by the score from the highest to the lowest using the ORDER BY: ORDER BY score, and then descending. Let's execute it. Now you can see the first customer is the one with the highest score, then the second highest, and so on. I think you already got it: in order to get the top three customers with the highest scores, all you have to do is go over here and say TOP 3, and execute. And with that you have a really nice analysis of your data: it's like a report where we find the top customers with the highest score. This is really amazing and very easy. So as you can see, by mixing the TOP with sorting the data, you can do top-N or bottom-N analyses.

Let's take this task: retrieve the lowest two customers based on the score. So now we want the lowest scores in our table, and doing that is very simple: we flip it. We sort our data by the score ascending, from the lowest to the highest, and since we want only the lowest two customers, we replace the three with a two and execute it. And with that we get the lowest two customers: it is Peter and Maria, they have the lowest scores. Again, very easy.
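Both analyses side by side, as sketches:

  -- Top 3 customers by score
  SELECT TOP 3 *
  FROM customers
  ORDER BY score DESC;

  -- Lowest 2 customers by score
  SELECT TOP 2 *
  FROM customers
  ORDER BY score ASC;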
Okay, this is fun, let's go to the next one: get the two most recent orders. Well, this time we are speaking about another table. Let's select everything from the table orders, like this. As you can see, we have four orders here, and we want the two most recent ones. Most recent means we have to deal with the order dates, and we can do that by sorting the data by the order date: ORDER BY order_date. Since we want the most recent orders, we sort from the highest date to the lowest, which means descending, right? Let's execute it. Based on our data, we can now see that this one is the last order in our business, going by the order date, and this one is one of the earliest orders. So with that we have sorted the data, and since we want the two most recent orders, we go exactly after the SELECT, say TOP 2, and execute. And with that we have the last two orders in our business. So as you can see, combining the TOP with the ORDER BY, you can do amazing analyses.

All right, so this is how you limit your data using TOP, and with that you have learned all the clauses, all the sections that you can use in any SQL query. Next, we're going to put everything together in one query in order to learn how SQL deals with all those clauses and how SQL executes them. So let's go and do that.

Okay, so now I'm going to show you the coding order of a query compared to the execution order that happens in the database. The coding order of a query always starts with the SELECT, then exactly after that you can put a DISTINCT, and after the DISTINCT you can put a TOP; this is the order of those keywords. Then you specify a few columns, separated with commas, and after that you tell SQL which table your data comes from, using the FROM clause. Next, if you want to filter the data before the aggregation, you use the WHERE clause, and this always comes directly after the FROM. If you want to group the data, you do it after the WHERE clause using the GROUP BY, and after the GROUP BY comes the HAVING, if you want to filter the aggregated data. And the last thing you can specify in a query is always the ORDER BY. So this is the order of all those components of a query, and if you don't follow this order, you will get an error from the database.
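Putting all the clauses together in the coding order, the full template looks roughly like this (a sketch using the course's customers table; the exact sort column is my choice):

  SELECT DISTINCT TOP 2
         country,
         SUM(score) AS total_score
  FROM customers
  WHERE score > 400
  GROUP BY country
  HAVING SUM(score) > 800
  ORDER BY total_score DESC;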
Now, if you look at a query like this, there are a lot of things filtering your data, so let's check them one by one. The first thing you can do is filter the columns: if you don't want to see all the columns, only specific ones, you use the SELECT, and of course you must use it. The columns you specify will be shown in the results, so it's like filtering the columns. Then there is another type of filter, where you filter out the duplicates if you want to see unique results: that's the DISTINCT. Moving on, we can filter the results based on the row numbers, so we can limit the results using the TOP; this type of filter doesn't need any conditions, it's purely based on the row numbers in the results. Next, if you want to filter your data based on conditions, based on your data, you can filter the rows before the aggregation using the WHERE clause. And the last type of filtering: you can filter the rows after the aggregation using the HAVING. So as you can see, we have five different ways to filter the results in SQL.

So now let's see the execution order. As we learned, the first thing that happens is that SQL executes the FROM clause: SQL finds your data in the database, and all the next steps are based on this data. The next step is that SQL filters the data using the WHERE clause. This has to happen before anything else, before any aggregations and so on: we have to narrow the scope of the data first. Once SQL applies it, maybe some of the rows are removed. Once the data is filtered, the third step: SQL executes the GROUP BY, so it takes the results and starts combining the similar values into one row, aggregating the data based on the aggregate function you have specified. After the GROUP BY, after aggregating the data, SQL applies the second type of filter, the HAVING: based on the condition, SQL removes some of the aggregated data and keeps the rest. Moving on to step number five: finally, SQL executes the SELECT and the DISTINCT, so SQL selects the columns we need to see in the results and removes the other stuff. Once the columns are selected, SQL executes the ORDER BY: it sorts the data based on the column and the mechanism you have specified, so the data will be ordered differently. And, my friends, the very last step in your query is always the TOP: based on the final results, SQL executes the TOP. Here we are saying TOP 2, which means we want to keep only the first two rows, without any conditions, so SQL counts: okay, row number one, two, and after that it makes a cut and removes everything else. This is the last filter and the last step. So, if you sit back and look at this, the coding order is completely different from the execution order: in the coding order we have to specify the SELECT first, but the SELECT is actually executed almost at the end, at step number five. And once you understand how SQL executes your query, you can understand how to build correct queries.

Now, so far we have always had one query, something like SELECT star FROM customers: one query, and in the output one result. But did you know that in SQL we can have multiple queries and multiple results in one go? For example, let's say I'm also selecting the data from orders, so now we have two queries. If you execute, you get two result grids: the first result grid is for the first query, and the second one is for the second query. So you can run multiple queries in the same window, and the results are split into multiple grids, depending on how many queries you have. And usually in SQL you might find a semicolon at the end of each query, like this: at the end of the first query we have a semicolon, and the second query has one at the end as well. For SQL Server it is not a must, but for other databases, if you have multiple queries in one execution, you must separate them with a semicolon, and with that the database can understand: okay, this is the end of the first query, and this is the end of the second one. So you have separations between the queries.
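For example, both queries in one batch:

  SELECT * FROM customers;
  SELECT * FROM orders;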
Okay, now moving on to another cool thing in SQL. What if we don't want to query the data inside our tables, but would like to show a static value that comes from us, from the one writing the query? This is very practical if you are practicing and want to check something using a value from you, not from the tables. How can we do that? It is very simple: we write SELECT, and then, instead of a column name, we can put any value, like 123. It is just a number, and we do not specify any table after it; we don't need the FROM clause at all. So: SELECT 123. If you execute it, you get 123. This is a static value. And of course you can rename the column, for example static_number, and execute it again. So with that we have a static value. And you can use anything else as well, like a string: let's say 'Hello' as a static string, for example. Let's execute. Now we have two queries, and in the second one you can see our static value, Hello. So in queries we can add values from us, not only select data from the tables.

And of course you can mix things: in one query we can have data from the database and static data from us. Let me show you what I mean. Let's go over here and say SELECT, and get, for example, the ID and the first name from the table customers, like this. With that we are getting data from the database. But now I can add something from me, 'New Customer', and we can call it customer_type. So what is going on here? Two columns from the database and one column from us, the static one. If you execute it, you can see that the ID and the first name come from the database, but for each record we always get the same static value: New Customer, New Customer, and so on. This piece of information comes from the query, it is not stored inside the database, while the other two pieces of information come from the data stored inside the database. So this is a really cool thing: you can add some information from you, and you can get the data from the database. That's the idea of static values.
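A sketch of the mixed query (the static column name customer_type is just an example):

  SELECT id,
         first_name,
         'New Customer' AS customer_type  -- static value, same for every row
  FROM customers;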
Okay, one more cool thing I want to show you. Say you have a query like this: you are selecting from a table and filtering the data, and now you don't want to execute the whole thing, only a part of this query. Sometimes, as you are writing a query, you want to execute only a piece of it. For example, I would like to see all the customers again in this query, without this filter. Instead of removing the filter, running the query and then adding it back, you can highlight the part you want, without the filter, and execute. With that, the database executes exactly what you highlighted, and as you can see, I'm getting all the customers without the filter. And if you don't highlight anything and execute, what happens? SQL executes the whole thing inside the editor. This is really nice if you want to quickly query another table in the same editor: say we want to select everything from the orders, just quickly. You highlight only that query and execute, and with that SQL ignores everything else and executes only what I'm highlighting. It gives us speed and flexibility, and you're going to find me doing that a lot in the course.

Okay, my friends, so with that we have learned the basics of an SQL query, the basic components of the SELECT statement, and with that you can talk to our database in order to get data. In the next chapter we're going to learn how to define the structure of our database: the data definition language, DDL. So let's go.

Okay, so usually, if you have an empty database, what you want to do is define the structure of your data, and one of the first things we usually do is create new tables. For that we have a command called CREATE: if you use it, you create a new object inside the database, for example a table. Once you execute it, you get a brand new table, and usually this table is empty, without any data. It is very simple; this is what the CREATE command does. Now let's go to SQL in order to create a new table.

So, my friends, we have the following task: create a new table called persons with the columns ID, person name, birth date and phone. Okay, so this time we will not start with a SELECT; we start with the command CREATE TABLE. We are telling SQL to create a table, and after that we have to define the name of the table: in this task we call it persons. Then we open two parentheses, like this, and in between we define the columns. So what do we need? First we need an ID; this is the first column name. Next we define the data type for this column: it's going to be an INT, so a number that does not contain any characters. And then we can define a constraint: we cannot have a person without an ID, so it should not be null, meaning NOT NULL. This is the first column: we have defined the column name, the data type and the constraint. Let's go to the second column: we add a comma, and the next name is going to be person_name. This is the column name, and the data type is going to be a VARCHAR, because the person name contains characters. For a VARCHAR we also have to define the length; I'm going to go with 50 characters. And I would say this is a must, each person should have a name, so we say NOT NULL as well. With that we have the name, the type and the constraint. Now let's move to the third column: birth_date. Which type of information do we have inside a birth date? It's going to be a date, not a number, not characters, so we go with the data type DATE. About the constraint, well, it depends. I would say in our application it is optional, because this is very personal information, and maybe some persons will not provide their birth date. So this one is optional, and I will not say NOT NULL: nulls are allowed. Now the next one: the phone. What is the data type of a phone? Well, we can have numbers, characters, special characters, we could have anything, and that's why I'm going to go with a VARCHAR. Here you can specify whatever length you think is okay; I'm going with 15. And, of course, it depends on the system you are building, but I would say the phone is very important in order to validate whether this is a real person, so we say NOT NULL: we are not allowing nulls in this field. Perfect. So with that we have covered all the required columns, and we have defined the data types and the constraints as well.
Now, the last thing: each database table should have a primary key, in order to make sure the table has integrity and can also be connected to other tables. So what we're going to do is add the primary key constraint: a comma after the last column, and then we say CONSTRAINT. Now we have to give the primary key a name, which is only visible to the database, so I'm going to call it PK, for primary key, plus the table name: PK_persons. After that we say PRIMARY KEY, and between two parentheses we pick which column is the primary key, and of course it's going to be the ID. So we go over here and say ID. Again: we are saying there is a new constraint, this is its name (it's only internal to the database), and this constraint is a primary key on the field ID. That's it; with that we have defined a primary key for our table. Let's go and execute it. As you can see, it is successful. Let's check our database for the new table: if you don't see it yet, right-click on the database and refresh. Go to the tables, and now we have a brand new table called persons. So with that we have created our new table.

Now, of course, for DDL commands you will not get results or data. All you get is a message from the database, and the message here says the command completed successfully, together with the date when it completed. So a DDL command will never return data: it is changing the structure of your database, not retrieving any data. This command changed something in our database, in this scenario it created a new table, and that's why we call this the data definition language, DDL: we are defining the database. Of course, if you say SELECT star from our new table persons (let's highlight it and execute), you will see we get the columns, the ID, the person_name, the birth_date, the phone, but we don't have any rows, which means our table is empty.

Now, it is very important that you save this information in an SQL script, because maybe later you have to redefine this table. But let's say you have written different queries, you have lost the script, and now you would like to see the CREATE statement for this table again. Well, there is a trick for that: if you go to the left side, you see the persons table right there; right-click on it, and then you have "Script Table as". There are different options you can run on the table, and the first one says "CREATE To"; from there, choose the new query editor window. So what happened? The database read the metadata information about the persons table and created your DDL query, with many extra things we haven't written ourselves; this is just the template the database uses. We can see a lot of stuff, but what is interesting is the CREATE TABLE: we can see CREATE TABLE, the schema dbo (the default one), then persons, and then our columns, the data types, and the constraints as well. So with that you got your DDL statement back, along with a lot of other details about the table which are not interesting right now. This is how you can get your DDL command back. But of course, what I recommend is to always put your code inside a git repository and keep it up to date, so you can always check your work and extend it. Okay.
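For reference, the complete statement we built in this task, as a sketch (the exact identifier spellings are my assumption):

  CREATE TABLE persons (
      id          INT          NOT NULL,
      person_name VARCHAR(50)  NOT NULL,
      birth_date  DATE,
      phone       VARCHAR(15)  NOT NULL,
      CONSTRAINT PK_persons PRIMARY KEY (id)
  );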
So, what else can you do with the structure of your database? If you already have a table, you can edit and change its definition. For example, let's say I would like to add a new column. In order to do that, we can use the command ALTER. ALTER means you want to edit the definition of your table and change it, like adding a new column, or maybe changing a data type, anything in the definition of the table. So you can use the ALTER command in order to change the definition of your table. Now let's go back to SQL and try to change something.

All right, the task says: add a new column called email to the persons table. It is very simple: we use the ALTER TABLE command. We are not creating a new table; we want to edit an already existing one. Which table do we want to modify? It's going to be persons. So we are telling SQL we want to change something in the table persons, and of course we have to tell SQL what we want to change: are we removing a column, are we adding one? In this scenario we want to add a new column, so let's add the email information. This is the column name, and, just like when creating a table, you have to define the column name, the data type and the constraint. For emails we're going to have characters, numbers, special characters, so we go with a VARCHAR, and for the length let's say 50. And I'm going to say each person has to have an email, so it's going to be NOT NULL. With that we are adding a completely new column. That's it; let's execute it. Again, this is not a query, this is a DDL command, and in the output we will not get data, only a message about whether everything went correctly: it says the command completed successfully, with the time when it completed. Now we can do a simple query just to check the table, and you can see we have our columns, and at the end a new column called email.

This is very important: if you add a new column, it is always placed at the end of the table. Now you might say: you know what, I would like to have the email somewhere in the middle, maybe after the person name. Well, in order to do that, you have to completely delete and drop the table and create it from scratch using the CREATE command, which might be bad if you have data inside the table. So if you are fine with adding your new column at the end, you can use the ALTER TABLE; but if you want it in the middle, then sadly you have to drop everything and start from scratch.

Okay, so now let's take another task, and it says: remove the column phone from the persons table. Now we're going to do exactly the opposite: we remove the column completely, together with its data. We are still saying ALTER TABLE persons, meaning we want to edit the definition of the table persons, but instead of adding, we will be dropping a column, and after that we specify the column name: the phone. This time we don't have to mention the data type and the constraint again, because the database already knows that information; we only need it when we are creating something new. So we just need the column name, and the database does the rest. Let's go and do that. Now you can see: successful. And now let's go and check our table.
And now, as you can see, we have the ID, person_name, birth_date and email, and we don't have the column phone anymore. Be careful: if you delete a column, you also lose all the data inside this column. So, as you can see, this is very simple: this is how we edit the definition of our table by adding and removing columns.

Okay, moving on to the last command in this group. So far we have created something new in the database, and we have changed the definition of something inside our database. The last thing you can do is drop something from the database. Let's say we have a table and we don't need it anymore: we can use the DROP command in order to remove the table completely from the database, and this means removing everything, the table and the data inside it. So now let's go to SQL and drop something from our database.

Okay, our task says: delete the table persons from the database. This is the simplest form of command in SQL, but yet the riskiest one. So what do we need? We have to delete and drop the whole table persons, we don't need it anymore. We say DROP TABLE, and then all we have to do is give the name of the table: persons. Three words: you don't have to specify anything else, just destroy the table persons. Let's execute it. It is successful. So as you can see, it is very simple. Now, on the left side, go to your database, refresh, go to the tables, and you will not see the table persons anymore. The DROP command is very simple, but very risky: if you compare the CREATE TABLE with the DROP TABLE, you can see that destroying things is way easier than building them. So those are the commands: CREATE, ALTER, DROP. Those are the commands we use in order to define the structure of our database, the DDL commands. That was very simple. All right, so that's all about the data definition language, DDL, and with that you have learned how to define new things in your database.
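The three DDL changes from this chapter, collected as a sketch:

  -- Add a new column (always appended at the end of the table)
  ALTER TABLE persons ADD email VARCHAR(50) NOT NULL;

  -- Remove a column, together with its data
  ALTER TABLE persons DROP COLUMN phone;

  -- Remove the whole table, together with all its data
  DROP TABLE persons;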
Now, moving on to the next chapter, we're going to learn about the data manipulation language, and here we're going to learn how to manipulate our data inside the database. Let's go.

All right, so now we're going to modify and manipulate the data inside the database. Sometimes you have a table inside your database and the table is empty: no rows, no data inside it. In order to add your data to the table, you can use the command INSERT. INSERT adds new rows to your table. And of course the table doesn't always have to be empty: you can add new rows to already existing data, and SQL appends them at the end of the table. Now, my friends, in order to insert new data into a target table, there are two methods. The first and classical way is to use the INSERT command and manually specify the values that should be inserted: you write the values in the script, and they are inserted as new rows into the target table. So in this process you are manually inserting new values using an SQL script. We're going to focus on this scenario first.

All right, let's quickly check the syntax of the INSERT command. It starts with the keywords INSERT INTO, and after that we specify the table name, so where we want to insert. Then we make a list of the columns we are going to insert values into. After that we say VALUES, and finally we specify the data that should be inserted into the table, also as a list, like we did for the columns. Now, in the INSERT statement, specifying those columns is totally optional: if you don't specify the columns of the table, SQL expects you to insert a value into every column. Sometimes, of course, we don't want to insert a value for each column; you can skip a few columns. But if you do want to insert a value for every column, either you specify them all as a list or you skip the list entirely. There is one very important rule for INSERT statements: the number of columns and values must match. If you specify three columns here, you must insert exactly three values; this must match. And one last thing about the syntax: you can insert multiple rows in one go, so for each row you specify a list of values to be inserted. So that's all about the syntax; let's go back to SQL in order to practice the INSERT command.

Okay, so now let's go and insert new customers. It's very simple: it starts with INSERT INTO, so we are saying we want to insert data into, and we have to specify the table name, customers. After that we specify the list of columns we want to insert data into, and we can check which columns we have inside our table: we can see we have the ID, first name, country and score. So we make a list of them: ID, first_name, country, score. Now we need the values: which data should be inserted. We open two parentheses, and now we have to specify an ID. We know the last customer was number five, so we go with six. Next we give the name of the customer, let's go for Anna, and then a country, let's go for USA. And this customer has no score, so what can we do? We say NULL: we don't know the score of this customer. NULL means nothing, we don't know. With that you can insert one row. But now let's say I would like to insert a second row, one more customer: we separate it with a comma and then repeat the whole thing again. The ID is seven; let's call this customer Sam; we don't know the country of this customer, so we say NULL; but the score we already know, it is 100. So as you can see, we are adding a value for each of those columns, and if you don't know the answer, you make it NULL, if the database allows it to be NULL. Some columns are not allowed to be NULL, like the primary key. So if you go over here and say NULL for the ID, the database will not allow it. Well, actually, we can test it: execute, and you can see "cannot insert the value NULL into the column ID". So this is not allowed; it stays a seven. For the other columns it is allowed; you can always check the definition of the table. Now we execute. The output of a modification command always indicates what happened to the data: it says two rows affected. Affected might mean inserted, updated or deleted, so you get a general statement from the database, but you see how many records were affected, and we got two because we inserted two records.
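The statement from this task, as a sketch (the exact column spellings are my assumption):

  INSERT INTO customers (id, first_name, country, score)
  VALUES
      (6, 'Anna', 'USA', NULL),
      (7, 'Sam', NULL, 100);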
So now, as you can see, it's not like a query: we are not getting any data in the output, just a message. This is a big difference between querying the data using SELECT and modifying the data using INSERT: we are making direct modifications to the data inside our database. Of course, if you want to see the data in customers, you can query it, right? So let's do that: SELECT * FROM customers, I'd like to see the whole table. Mark it and execute. Now you can see we have seven customers, including Anna and Sam. This is how you insert data into the database. Now, there are a few rules to be careful about as you insert new data into your tables. You have to pay attention that the order of the columns you defined in the INSERT matches the values you are inserting. Let's have an example. I'm going to remove this, and let's say we insert a new one, number eight, and now in the first name, instead of the customer's name, we insert the country, USA, and in the country we insert the name. It's just a mistake; we are all human, right? Let's use the name Max. Now if you execute it, the database accepts it, because it is really hard for the database to understand that you made an error here: both columns are VARCHAR, and the database doesn't care about the content of the data as long as you follow the rules of the data type. So if you select the data from customers, you can see we now have a customer called USA from the country Max. SQL does it blindly: it inserts the data as long as you follow the data type rules and the constraints. For example, if you made the error the other way around and said the ID is Max and the first name is nine, and you execute it, the database is smart enough to say something is wrong: the ID should not be a string, so the database rejects your insert. Be careful with the order of your columns. Now let's query our table again. If in the INSERT command you are defining all the columns exactly like the table, a complete match, ID, first name, country, score, all the columns in the correct order, then there is a lazy way: you can remove the column list entirely, and the database understands you are inserting a value for every column, in order. So let's do that correctly: nine, and here let's say the customer comes from Germany. If you execute it, it works even though we didn't define the columns, because the values we are inserting match exactly the number of columns of the table, in order, and follow the rules.
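As a sketch of that lazy form (same column assumptions as above; the name and score here are only illustrative, since the demo doesn't spell out every value):

    -- Without a column list, values are matched to columns purely by position,
    -- so they must cover every column of the table, in table order.
    INSERT INTO customers
    VALUES (9, 'Max', 'Germany', NULL);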
Now moving on to the next one: you can list only two columns in the definition. If we always know only two pieces of information, the ID and the name, and the country and the score are unknown, you don't have to keep writing NULL, NULL, and so on; you can skip that. Let me show you what I mean. After the table name we define only two columns, the ID and the first name. That means we are telling SQL we want to insert only two columns, and now you have to be careful: if you define two columns here, then the values should also be two. So we remove the country and the score and add only two pieces of information: 10, and for the name, let's say Sara. If you execute it, it works. And what is SQL doing with the other two columns? They become NULL. Let's select from our table again: you can see Sara has NULL in the country and in the score, because we didn't provide that information. But be careful: you cannot skip a column that is not allowed to be NULL. Your list must always include all the columns that are NOT NULL. For example, I cannot insert only the first name; I would get an error, because the database would try to insert a NULL into the ID, and that is not allowed. You can skip only nullable columns.
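The partial insert from the demo, written out (the skipped columns fall back to NULL because they are nullable):

    -- Only id and first_name are listed; country and score end up as NULL.
    INSERT INTO customers (id, first_name)
    VALUES (10, 'Sara');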
All right, my friends, that was the first method of inserting data into a target table: manually typing the values inside an INSERT command using VALUES. Now let's move to another method: we're going to insert data, but this time not manually; we're going to insert data using another table. Imagine the following scenario. We have an already existing table with data, and this is going to be the source table, the source of your data, and we have another table that is empty, and we want to insert new data into this target table. What we can do is take the data from the source table and insert it into the target table without manually writing the values: we are moving data from one table to another. To do that, we need two steps. First, we write an SQL query using SELECT, FROM, and so on, to select the data we need from the source table; like a normal query, you write a SELECT and you get a result. Second, we take this result and use an INSERT command to insert it into the target table. With that, we have moved the data from the source table to the target table. So: first write the query on the source table, then use an INSERT to move the result into the target table. Let's go back to SQL and do that. We have the following task: insert data from the table customers into the table persons. That means the source table is customers and the target table is persons. The way I usually do it: I keep my eye on the target table to understand its structure, and I start writing the query from the source table. If you look on the left side, we can see persons has an ID, a person name, a birth date, and a phone, and only the birth date accepts NULLs; for the rest we always have to provide a value. With that, I now have an understanding of the table persons. Next, I start writing the query from the source: SELECT * FROM customers, just to have an overview of the table. The next step is to design a result from this query that exactly matches the target table. In the output we need an ID, and we have one in customers, so we select the ID. Next we need a person name, and in the source table we have something called first name, which is a perfect match, so we select this column as the second one. That covers the first two. The third one is the birth date. Well, my friends, we don't have birth dates, but the target accepts NULL there, so I write a NULL, because I don't have that information in the source table. The next one is the phone. We don't have phone information either, but we cannot use NULL, because that column says NOT NULL. So what we do is add a static value, a default value: two single quotes, and in between we write unknown. Since the column is VARCHAR, it accepts this word. Now let's run the query: we have the ID, we have the first name, the birth date is empty, and the phone is unknown. Now you might say: but the column names don't match the column names of persons. Well, the database does not care about that. As long as the data in the result matches the table, it can insert it; the database never compares the column names. If you like, you can add aliases exactly like the target table; it won't hurt, but it has no effect on the result. Okay, so now we have a SELECT query with a result, but this is not an insert yet. How do we insert this result into the table persons? For that we need the INSERT INTO command. So: INSERT INTO, and we specify the target table, persons. You can list all the column names; if you have an exact match you can skip it, but I always like to add it, just to make sure we don't have any issue: the ID, person name, birth date, and phone. That's it, let's execute. It is working: we can see 10 rows affected, which means 10 rows were inserted from the table customers into the target persons. Now we can query the table persons to check that everything worked: SELECT * FROM persons, and execute. And there you can see the 10 persons we added from customers. With that, we have moved the data from one table and inserted it into another, and as you can see it was very simple: first you write a query on the source table to collect the data you need, then you insert it into the target table. This is really nice and easy, and it is another way to insert data into your database.
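The full statement, roughly (assuming the persons columns are named id, person_name, birth_date, and phone):

    -- Move data from customers into persons; the column names in the SELECT
    -- don't have to match the target, only the positions and data types do.
    INSERT INTO persons (id, person_name, birth_date, phone)
    SELECT
        id,
        first_name,
        NULL,        -- no birth dates in the source; the column is nullable
        'Unknown'    -- phone is NOT NULL, so we use a static default value
    FROM customers;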
Okay, so with that we have learned how to insert data into our tables. Now let's say I don't have anything new, no rows to add, but I have an update: I'd like to change the content of already existing rows. For that we can use the command UPDATE. So again, my friends: INSERT adds completely new rows, but UPDATE changes the data of already existing rows. Let's have a quick look at the syntax of UPDATE. It starts with the keyword UPDATE, then we specify the table name, and after that we use SET to specify the new values for the columns. You write a new value for each column you want to update, and you separate the columns with commas. After that, we also have to specify a WHERE condition: like in queries, you say WHERE and then you write a condition. If you don't do that, if you don't use the WHERE clause, you will end up updating all the rows inside your table. That's why we always need the WHERE clause. All right, that's the syntax; let's go back to SQL and update our data. Okay, we have the following task: change the score of customer 6 to zero. That means we have to modify the data of the customer with ID equal to six. First, I'd like to look at our data: SELECT * FROM customers. The task targets this customer here, and we want to replace the NULL with a zero. Now, how do we update this information inside the table? We use the UPDATE command. We start writing UPDATE, and after that we specify the table name: we are updating customers. Then we tell the database to SET the value of the score to zero; we want to change the value from NULL to zero. And now comes something very risky: don't execute this query yet. If you do, the database will go to the table customers and replace the score of all customers with zero. It would update the whole table, which is of course very risky. That's why in the UPDATE command we have to give a WHERE condition, a filter, to target only the specific row or rows that we really want to modify. In this case we want to change only one row, so we specify the WHERE condition like we did in the SELECT query; nothing new, right? We say WHERE the customer ID is equal to six. With that, SQL will not update everything: first it filters the data, then it updates. And before I execute, just to make sure, I check which data is going to be affected. It's very simple: you write SELECT * FROM customers, take the exact same WHERE, put it into the query, select the whole thing, and execute. If this query gives me exactly the data that should be modified, then my UPDATE command is correct. In this case we are targeting only one customer, number six, and with that I feel really confident about my update. Since I'm going to use this check again later, I put the whole thing in a comment, so if I execute now, only the UPDATE runs. Let's do that. Now it's very important to check the message: you can see one row is affected, which is really good, because if I saw 10 rows affected here, that would mean everything was updated. Now let's check the data: I remove the WHERE here and look at the whole table. You can see we still have the old scores; only Anna now has score zero instead of NULL. This is how I usually update data: you have to do it very carefully.
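The check-then-update pattern from this demo, as a sketch (assuming the ID column is named id):

    -- First check which rows the filter targets...
    SELECT * FROM customers WHERE id = 6;

    -- ...then run the update with the exact same WHERE clause.
    UPDATE customers
    SET score = 0
    WHERE id = 6;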
Now let's move to another task: change the score of customer number 10 to zero and update the country to UK. This time we are targeting customer number 10; as you can see, she doesn't have a country or a score, and the task wants us to change the score to zero and the country to UK. How are we going to do it? We use the exact same command with a different condition: the ID this time is equal to 10, and the score is set to zero. But now we also have to change the country. If you want to do multiple updates, you put a comma after the score, go to a new line, and say country equals, and then we add UK. Select the whole thing and execute. Again it affects only one row, which is really good. And if you check the table and look for Sara, you can see that in one update we changed two columns, the country and the score. With that we have solved the task; it's very simple. Now moving on to the next task: update all customers with a NULL score by setting their score to zero. This time we are not talking about one specific customer; we are updating the data for a subset of customers. Imagine you have hundreds of customers and you write one UPDATE command for each of them: that would be a real waste of time. Instead, we can specify a condition that targets multiple customers and do the update for all of them in one go. Let's see how. We are only replacing the NULLs with a zero, so we don't need the country: SET score = 0. But this time we won't be specific about the IDs; we need a new condition: WHERE score IS NULL. Later in the course we have a whole chapter dedicated to NULLs; here, all we are doing is searching for scores that are NULL, and we cannot write an equals sign for that: we have to write IS NULL. Of course, before we update anything, we test it in a query: SELECT * FROM customers WHERE score IS NULL. Execute. You can see we have two customers where the score is NULL, so this condition targets a subset of customers, and we're going to update multiple rows in one go. So we can run the update. Execute. You can see two rows are affected, so multiple rows got updated. Now if you query our table customers, you can see we don't have any NULLs left inside the scores; we have replaced them all with a zero. And of course you can do the same thing for the country: write an UPDATE command to replace all the NULLs in the country with unknown, or any default value you want. This is how you update multiple rows in one go.
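Written out, the two updates from this demo look like this:

    -- Update two columns of a single row in one statement.
    UPDATE customers
    SET score = 0,
        country = 'UK'
    WHERE id = 10;

    -- Update every row that matches a condition in one go.
    UPDATE customers
    SET score = 0
    WHERE score IS NULL;   -- IS NULL, never "= NULL"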
All right my friends, so with that we have learned how to insert new rows into our tables and how to update the content of already existing rows. Now, the last thing we can do to the data inside a table is remove rows, and we do that using the command DELETE. If you use DELETE, SQL starts removing already existing rows from your table. The syntax of DELETE is very simple: we say DELETE FROM and then we write the table name. And here comes something very important: we have to add a WHERE condition, just like with UPDATE. If you don't include a WHERE condition, you will end up deleting all the rows inside the table. So the syntax is very simple; let's go back to SQL and delete some data. We have the following task: delete all customers with an ID greater than five. So we have to delete all the customers we recently added. How are we going to do it? It's very simple. We say DELETE FROM, meaning I want to delete something from a table, and we specify the table name: customers. The syntax is very simple, but my friends, this is even riskier than UPDATE, because if you execute it like this (don't do that yet!), all the data in customers will be deleted and you'll get an empty table. So we do exactly like with the UPDATE command: we specify the WHERE clause. The task says the ID should be greater than five, so: ID greater than 5. With that we define a subset of the data that should be deleted, not everything. And like with updates, we do a double check before deleting anything: we SELECT * FROM customers and copy the same WHERE condition into it to test what is going to be deleted. It shows all the customers with an ID higher than five, and with that I'm making sure my DELETE command is correct, which, from what I can see here, it is: those five customers should be deleted. So now let's delete them. And again, it's very important to read the message: it says five rows affected, so five customers got deleted, which is better than 10, of course. Let's check which customers are left: we have 1, 2, 3, 4, 5, the original customers, and everything else got deleted. With that we have solved the task, and this is how you delete data from tables. Be very careful, and always test before running the DELETE command. Okay, now we have the following task: delete all data from the table persons. That means we have to remove everything from the table persons, but we don't want to delete the table itself, only the data inside it. So we write DELETE FROM, and we specify the table persons. If you execute it, SQL will remove all the data in persons. But in SQL we have a more interesting command. If you want to delete everything from the table persons, we have TRUNCATE. TRUNCATE is exactly like DELETE FROM persons: it makes the whole table empty. But why do I like to use TRUNCATE? Because it is way faster than DELETE. If you have large tables, the DELETE command can be really slow, because with DELETE a lot of things happen behind the scenes: there is logging and protocols. With TRUNCATE, the database skips all that extra work, so it is very fast. So if you want to delete all the data from a small table, you can use DELETE like this, but what I usually do is write TRUNCATE and then TABLE: we get the same effect, and with that I'm saying reset everything, make the table empty. Let's execute it. Notice that this time you don't get the number of deleted rows, and that's exactly why TRUNCATE is faster: it doesn't log anything, it just deletes all the data without any extra steps. This is how you can delete all the data from a table while the table itself still exists.
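As a sketch, the two deletions from this demo:

    -- Delete only a subset of rows (always test the WHERE with a SELECT first).
    DELETE FROM customers WHERE id > 5;

    -- Empty the table persons completely but keep its structure.
    TRUNCATE TABLE persons;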
Okay my friends, with that you have learned the basics of how to manipulate the data inside the database, the data manipulation language, DML, and I can tell you we have now covered the basics of SQL and the beginner level. In the next chapters we will be at the intermediate level, and the first thing you're going to learn there is how to filter your data; we'll cover many operators that you can use inside the WHERE clause. So let's go. All right, let's have an overview of all the different groups of operators in SQL. The first group is the comparison operators. They are the easiest ones: all we have to do is compare two values, and we have six different variants of how to do that. Next we have the logical operators, which we use to combine multiple conditions. Moving on, we have the range operator; here we have only one, BETWEEN, which we use to check whether a value falls within a specific range. Next we have the membership operators, and here we have two: the IN operator and NOT IN. All you do here is check whether a value is in a list or not. And the last category is the search operator; here as well we have only one, which we use to search for a specific pattern in a text. My friends, we're going to go through all of these operators one by one. Okay, let's deep dive into the first category, the comparison operators. What exactly is a comparison operator? It's very simple: we want to compare two things, and there are a lot of things we can compare in SQL, but the formula is always the same. We have the first expression, then an operator, then another expression, and together this forms something called a condition. Here we have a lot of variants. We can compare one column to another column: for example, you can compare the first name with the last name, so both expressions are columns. In another scenario you compare a column with a static value: for example, the first name must be equal to a value like John; now we are comparing a column with a value, not two columns anymore. In another scenario we apply a function to a column and then compare the result to a value: for example, we apply the UPPER function to the first name, and the result must be equal to a value like JOHN with all the letters in uppercase. You can also write an expression on one of the sides: for example, the price multiplied by the quantity must be equal to 1,000. Here we have an expression with multiple columns on one side, and the output of this expression must be equal to 1,000. The last one is a little more advanced, and of course we'll cover it in another chapter: we can put a whole query on one of the sides, and we call this a subquery. On one side you write a complete query, SELECT, FROM, WHERE, whatever you want, and you compare the result of this query to, for example, a value or a column. So as you can see, in SQL we can compare a lot of things: columns with columns, a column with a value, a function, an expression, or even a whole query. This is how we build conditions in SQL. Okay my friends.
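As quick sketches, those condition shapes look like this inside a WHERE clause (illustrative column names; the subquery form is only a preview here and is covered properly later):

    WHERE first_name = last_name                        -- column vs. column
    WHERE first_name = 'John'                           -- column vs. static value
    WHERE UPPER(first_name) = 'JOHN'                    -- function applied to a column
    WHERE price * quantity = 1000                       -- expression vs. value
    WHERE score > (SELECT AVG(score) FROM customers)    -- subquery on one side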
So let's see how conditions work in SQL. We have our data, the name, the country, the score, and let's say we have built a condition that says the country must be equal to USA. This is a very simple comparison, and it is the condition we use inside the WHERE clause. Once you apply this filter to your data, SQL goes row by row, evaluating whether each row meets the condition. If a row is not fulfilling the condition, SQL removes it from the results; if it is, SQL keeps it. So we are comparing the values of the country column with a static value, USA. Let's see how SQL applies this filter, starting with the first customer, Maria. The value inside the country is Germany, so SQL compares Germany to USA; since they are not equal, SQL understands that Maria is not fulfilling the condition. It is false, and SQL removes this customer from the results. Moving on to John: SQL takes the value inside the country, USA, and it is equal to USA, so John is fulfilling the condition and SQL is happy about it. It is true, which means John stays in the final results. Moving on to George: the value is UK, not equal to USA; he is not fulfilling the condition, so SQL removes him from the final result. Same thing for Martin: Germany is not equal to USA, so SQL removes this customer as well. And for the last one, Peter, the value is USA, so USA equals USA: the condition is fulfilled, SQL is happy about it, and the customer stays in the output. So if you apply this condition to your data, only two customers are left in the output. This is exactly how conditions and comparison operators work in SQL. Okay, let's start with the first operator. Very simple: the equal operator checks whether two values are equal. Let's have an example. We have this task: retrieve all customers from Germany. This is very basic. We SELECT, and since we have no other specifications, we select all the columns from the table customers. If you execute it, you get all the customers, but we don't need that, only the customers who come from Germany. So we apply a condition using the WHERE clause: country equal to the value Germany. Make sure you write the value exactly like it is stored in the database, otherwise it won't work. Let's execute, and with that we get only the customers from Germany. Very simple, and this is why we use the equal operator.
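The query from this demo, written out:

    -- String comparison is exact, so the value must match the stored spelling.
    SELECT * FROM customers WHERE country = 'Germany';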
Okay, moving on to the next one, again very simple: if you want to check whether two values are not equal, we use the not-equal operator. Let's have the opposite task: retrieve all customers who are not from Germany. We are saying they are not equal to Germany, so we use the not-equal operator to get these customers. As you can see, after executing, we get all the customers whose country is not equal to Germany. And there is another way to write not-equal; written like that, it gives the same results. All right my friends, moving on to the next one: we can check whether a value is greater than another value, using the greater-than operator. Let's have an example. The next task says: retrieve all customers with a score greater than 500. We want to filter the data based on the score, so we say WHERE score, and since the task says greater than 500, we use the greater-than operator. Very simple. With that we get only the customers where the score is higher than 500. Maria, for example, is not fulfilling the condition; the same goes for Peter, and for Martin as well, because the score must be strictly greater than 500. If you execute it, you get only two customers, because their scores are greater than 500. Okay, moving on to the next one: this time we check whether a value is greater than or equal to another value. It's like a mix between greater-than and equal: if one of them is fulfilled, the value meets the condition. Let's have an example. If the task says retrieve all customers with a score of 500 or more, this time we also include the customers whose score is exactly equal to 500. So we have a similar condition based on the score and the value 500, but this time we say greater than or equal to 500. If you execute it now, you also see the customer Martin with the score of 500. In this scenario we use greater-or-equal. All right, let's keep moving. The next one is also very simple: we check whether a value is less than another value, using the less-than operator. Let's have another simple task: retrieve all customers with a score less than 500. This time we want all the customers with a lower score, and we use exactly the opposite: the score is less than 500. Again, equality is not included here, so if you execute, you get all the customers with low scores, but you will not get Martin, because Martin is equal to 500. With that we have solved the task: we have all the customers with a score less than 500. Okay my friends, moving on to the last one; I think you already got it. We check whether a value is less than or equal to another value: you combine the less-than operator with the equal, and if one of them is fulfilled, the value meets the condition. Let's have an example: this time we retrieve all customers with a score of 500 or less. The query is very similar, but we say less than or equal to 500, so we include the value in our condition. And with that, as you can see, we still have our two customers with a score below 500, but now we also have Martin with a score of exactly 500.
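Collected as sketches, the variants from this section:

    SELECT * FROM customers WHERE country <> 'Germany';  -- not equal (!= works in most databases too)
    SELECT * FROM customers WHERE score >  500;          -- strictly greater
    SELECT * FROM customers WHERE score >= 500;          -- now includes Martin's exact 500
    SELECT * FROM customers WHERE score <  500;          -- strictly less
    SELECT * FROM customers WHERE score <= 500;          -- 500 or less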
Okay my friends, with that we have covered the first group, the comparison operators. Now we move on to the next group, the logical operators, and here we have three: AND, OR, and NOT. Let's start with the first one: what exactly is the AND operator? The definition of AND says: all conditions must be true. So all the conditions you have in the WHERE clause must be true in order to keep a row in the results. Let's understand what that means. Things get more complicated when you have not just one condition but multiple conditions in your query. Here we add a second condition: not only must the country be equal to USA, but the score must also be higher than 500. So now you have two conditions, and you put both of them in the WHERE clause. You have to combine these conditions using a logical operator, and here we have two options: the AND operator and the OR operator. If you use AND, SQL is very restrictive: both conditions must be true in order to keep the row in the results. Let's see how this works. For the first row and the first condition, you can see the country is Germany, so the first condition is not fulfilled: it is false. And if you check the second condition for the first row, the score is 350, so this customer is not fulfilling the second condition either. Both conditions are false, so SQL removes this customer from the results. Next, John: John is fulfilling the first condition, because the country is equal to USA, and also the second condition, because his score is 900, higher than 500. SQL is very happy about it: both are true, and that is the only way to keep the row in the output when we use AND, so John stays. Moving on to George: he is not fulfilling the first condition, but the second condition is fulfilled, because his score is 750, higher than 500. So it's 50/50, right? One side is false, the other is true. But that is not enough for the AND operator: both must be true to keep the row, and that's why SQL removes it. Moving on to Martin: he is not fulfilling either condition, so SQL removes him from the results. And for the last one: Peter is fulfilling the first condition, the country is equal to USA, but the second condition is sadly not fulfilled, because his score is zero, not higher than 500. Again the same scenario, 50/50, and that is not enough for the AND operator, so SQL removes him. As you can see, if you use the AND operator, a lot of rows get removed whenever one of the conditions is not met: AND is very restrictive, and both conditions must be fulfilled to keep a row in the results. This is exactly how the AND operator works. Okay, so now we have the following task: retrieve all customers who are from USA and have a score greater than 500. Here we are combining multiple conditions, so let's do it step by step. First, we select the data from the correct table: SELECT * FROM customers, and with that we get all the customers. Now the first condition: we need the customers who come from USA, only those two customers, and as we learned, we use the WHERE clause with the condition country equal to USA. If you execute, you get those two customers. Nothing new here; we used the equal comparison operator. But we are not done yet.
We have another condition: from those two customers, we need only the ones whose score is higher than 500. Looking at those two customers, you can see that Peter here does not have a score higher than 500, and we don't want to see him in the results. So we have to write a condition for this second requirement, based this time on the score, not on the country: the score should be greater than 500. Now we have one condition for the first requirement and a second condition for the second one. The question is how to connect these two conditions. Here we have two options, AND or OR, and to be honest, this one is simple: the task says a customer should fulfill both conditions, should be from USA and at the same time have a score greater than 500, so we use AND. With that we have connected both conditions, and if you run the query, you get only one customer who fulfills them: out of all the customers, only one comes from USA and at the same time has a score higher than 500. This is how we use the AND operator to connect two conditions.
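The combined query from this demo:

    -- With AND, both conditions must hold for a row to survive.
    SELECT *
    FROM customers
    WHERE country = 'USA'
      AND score > 500;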
Okay my friends, that's all for the AND operator. Let's talk now about the OR operator. The OR operator says: at least one condition must be true. So it is less restrictive than AND: it is enough for one condition to be true in order to keep the row in the results. Let's understand exactly what that means. We have the same scenario: two conditions, and in SQL you have to connect them either with the AND operator or with the OR operator. This time we talk about OR, and as we said, at least one of the conditions must be fulfilled in order to leave the record in the results. Let's see what happens. The first customer, Maria, is fulfilling neither the first condition nor the second one. Both are false, and this is the only scenario where SQL removes the record from the results with OR, because it doesn't reach the minimum of at least one true. Next, John: John is from USA and has a score higher than 500. Both conditions are true, which is more than enough to keep the row in the output, so we will see John there. The third one, George: George is not fulfilling the first condition, because UK is not equal to USA, but he is fulfilling the second one. We have at least one true, and that is good enough to keep the record, so you will see George in the results. Moving on to Martin: he is fulfilling neither the first condition nor the second one. Both are false, not enough to keep the result, so SQL removes him. And the last one, Peter: he is fulfilling the first condition but not the second one, but everything is still fine, because he fulfills at least one condition. We have the minimum, so SQL leaves him in the output. As you can see, the OR operator is not restrictive like the AND operator: one true is enough to keep the data in the output. And this is exactly how the OR operator works. Now let's see the next task: retrieve all customers who are either from USA or have a score greater than 500. It is a very similar task with the same two conditions: the customers should either be from USA, so the condition country equal to USA, or have a score greater than 500. But this time we are very relaxed: either the first condition is fulfilled or the second one. So instead of AND we use the operator OR; it is enough to fulfill one of those conditions. If you execute it now, you get more results, because it is easier to fulfill the conditions: each of those three customers fulfills either the first condition or the second one.
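The same query with the relaxed connector:

    -- With OR, one fulfilled condition is enough.
    SELECT *
    FROM customers
    WHERE country = 'USA'
       OR score > 500;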
All right my friends, that's all for the OR operator, and we move to the last one in this group: NOT. What do we mean by the NOT operator? NOT is a reverse operator: it excludes the matching values. What does that mean exactly? Let's have a very simple example. The NOT operator is not like OR and AND: it does not combine two conditions; you can use it with a single condition. Let's say our current condition is: the country must be equal to USA. This is a simple comparison, and if you apply it to your data, as we learned, it leaves only two customers, John and Peter, because they fulfill the condition, and all the other customers are removed. Nothing crazy so far. But now, if you apply the NOT operator to the condition, you reverse the whole truth: you are saying if this condition is fulfilled, the row must be removed from the final results. It switches everything: we want to see the customers who are NOT fulfilling the condition. Let's see what happens when we apply NOT together with the condition. The first customer is not fulfilling the condition, which is exactly what we want, so SQL makes it true and leaves her in the output: Maria is not meeting the condition, so she stays. The next one is fulfilling the condition, and that is not what we want, so this time SQL removes John from the results. Moving on to George: George is not fulfilling the condition, which is great, so SQL keeps George in the output this time. The same for Martin: he is not fulfilling the condition, so SQL keeps this customer. And Peter is fulfilling the condition, so SQL removes him from the output. As you can see, we have reversed everything: the NOT operator turns true into false and false into true. Okay, that's how it works; let's go back to SQL and practice. The next task says: retrieve all customers with a score not less than 500. That sounds really funny. As usual, we SELECT * FROM customers, and now we have to filter the data based on this condition: the score is not less than 500. Well, you could simply say the score is greater than or equal to 500, right? With that it is not less than 500, so if you execute it, we have solved the task: we get all the customers whose score is not less than 500. Or you can use the NOT operator to make things more fun. You write NOT and then you flip the condition, like this: NOT, the score is less than 500. Because we used NOT, we have twisted everything: we are saying the score is NOT less than 500, and if you execute it, you get the exact same results. If you remove the NOT and execute, you get everything that is less than 500; put the NOT back, and you invert the whole logic, so you do not get the scores that are less than 500. This is really nice; this is how you use the NOT operator.
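Two equivalent ways to say "not less than 500":

    SELECT * FROM customers WHERE score >= 500;
    SELECT * FROM customers WHERE NOT (score < 500);  -- same result, logic inverted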
Okay my friends, with that we have covered everything about the logical operators. Now we move to the third group: the range operator, and here we have only one, BETWEEN. What exactly is the BETWEEN operator? It checks whether a value falls within a specific range: you have a range, and you are checking whether your value is inside that range or outside it. Let's understand exactly what that means. To build a range you need two things: a lower boundary and an upper boundary. Once you have the two boundaries, you have a range, and everything between those boundaries is true, everything outside them is false. For example, let's say the lower boundary is 100 and the upper boundary is 500. There is one thing you have to understand about BETWEEN: the boundaries are inclusive. That means if a value is exactly 100 or exactly 500, it is considered true: it counts as inside the range. Now, if you apply this filter to our data, saying the score must be between 100 and 500, SQL does the following. For the first customer, Maria, it checks whether her score is inside the boundaries: 300 is between 100 and 500, so she is in the green area, SQL is happy about it and leaves the customer in the output. Moving on to John: John has 900, and 900 is greater than 500, so this value falls outside the boundaries on the right side, which means John's score is not in the range. He is not fulfilling the condition, and SQL removes this customer from the results. Moving on to George with 750: the same thing, outside the range, so SQL will not accept it and removes this customer from the final results. Moving on to Martin: his score is 500, which is exactly at the boundary; if it were 501, it would be outside. Since BETWEEN is inclusive, SQL accepts it: Martin is considered inside the range and fulfills the condition, so SQL keeps him in the final result. And now, speaking of Peter: he has a score of zero, which is less than 100, so he falls outside on the left side, not in the range, not fulfilling the condition, and SQL removes him. This is exactly how BETWEEN works in SQL; it's very simple. Okay, so now we have the following task: retrieve all customers whose score falls in the range between 100 and 500. Let's start as usual by selecting all the data from customers and executing. The task asks for all customers within a range, so we have a lower value and an upper value. As usual we use WHERE, then we specify the column we want to filter on, the score, and since we have two boundaries we can use the BETWEEN operator: we start with the lower boundary, 100, then AND, then 500, the upper boundary. So: score BETWEEN 100 AND 500. Let's execute, and with that we get only those two customers, because they fall within this range. Now there is another way to solve this task without BETWEEN: we can use the comparison operators together with the logical operator AND. Let me show you. I copy the whole thing, and now we write two conditions: first, the score should be greater than or equal to 100, because the boundary is inclusive, and second, the score should be less than or equal to 500, the upper boundary. We connect the two conditions using the AND operator. It's very similar to BETWEEN: we have an AND between the lower and upper boundaries, but we are using the comparison operators. If you run this query, you get exactly the same results. Now, if you ask me which method is my favorite, I go with this one, and I skip BETWEEN, because, to be honest, each time I forget whether the boundaries of BETWEEN are inclusive or exclusive. But when I read this script, I can see immediately that the boundaries are inclusive, because the equals signs are right there. So I really prefer using the comparison operators together with AND over BETWEEN; it's up to you: if you have memorized how BETWEEN behaves, go with it, but for me, I'll take the comparison operators.
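Both solutions from this demo side by side:

    -- Inclusive range with BETWEEN...
    SELECT * FROM customers WHERE score BETWEEN 100 AND 500;

    -- ...and the equivalent form where the inclusive boundaries are explicit.
    SELECT * FROM customers WHERE score >= 100 AND score <= 500;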
Okay my friends, that's all about BETWEEN and the range operator. Now let's move to another group: the membership operators. Here we have two, IN and NOT IN. Let's understand what they mean. What is the IN operator? It checks whether a value exists in a list: you have a list of values, and you are checking whether your value is a member of that list. Let's have a very simple example to understand this. What you have to do is make a list of values. Let's say I have a list with two values, Germany and USA; those two are the members of the list. If you use the IN operator, SQL checks the value of the country against the list, whether it is in the list or not. Let's do it one by one. For the first customer, Maria, her country is Germany, and Germany is a member of the list, so SQL is happy and leaves Maria in the final results. Moving on to John: John comes from USA, USA is a member of the list, so he is fulfilling the condition as well, and you will see John in the final results. Now we come to George: George comes from UK, and UK is not a member of our list, so SQL removes this customer from the final results; he is not fulfilling the condition. And for the last two, Martin and Peter, their countries are members of the list, so SQL leaves those customers in the final results. As you can see, it's very simple: all you have to do is define the members of a list and use the IN operator; if the value is a member of this list, it's true, otherwise it's false. Now, of course, the other operator is exactly the opposite, where we say NOT IN the list: we are searching for values that are not in this list. Since we are using NOT, it completely reverses the truth, and if you apply it, you get only one customer in the result: George, because his country is UK, and UK is not a member of the list. So NOT together with the IN operator gives you exactly the opposite effect. This is how the IN and NOT IN operators work in SQL. Let's go back to SQL to practice. We have this task: retrieve all customers from either Germany or USA. Let's try to solve it; this one is a little bit tricky. SELECT * FROM customers as usual, and execute. We need in the results only the customers who come either from Germany or from USA, so this customer over here should be excluded, because he comes from UK. How do we write it? Maybe like this: the country is equal to Germany OR the country is equal to USA, something like that. If you execute it, you get in the output only the customers from either Germany or USA, and with that we have solved the task, right? Well, there is another way to solve this task that is clearer and shorter, using the IN operator. Let's copy the whole thing into another query, and now, instead of the equals and ORs and so on, we use the IN operator, then two parentheses, and inside them a list of values: Germany, and then the second value, USA. We are saying the country should be in this list, Germany or USA, and if it matches one of those values, the condition is fulfilled. If you execute this one, you get the exact same results. So my friends, if you notice that you are repeating yourself in the WHERE condition and only changing the value each time, all based on the same column and connected with OR, then something is wrong, and you should always think of using the IN operator, because the OR version can get really ugly once you have a lot of values. Imagine our database has a lot of countries, and your query keeps repeating country equals, country equals, and so on; instead of that, you have one nice list of countries in one go. As you can see, it is easier to extend, and it has better performance. So whenever you are repeating the same condition with different values on the same column and connecting them with OR, go and use the IN operator.
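The two versions from this demo:

    -- Chained ORs on the same column...
    SELECT * FROM customers WHERE country = 'Germany' OR country = 'USA';

    -- ...replaced by a membership test that is easier to extend.
    SELECT * FROM customers WHERE country IN ('Germany', 'USA');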
All right my friends, that's all for the membership operators. Now we come to the last one, the search operator, and here we have only one: LIKE. And each time I say like, I'm going to remind you to like this course. So let's go. Okay, so what is the LIKE operator? You can use it to search for a pattern in your text: you have a text, some characters, and you are searching for a specific pattern inside that text. Let's have an example to understand exactly what this means. If you don't have a coffee yet, go grab one, because you have to focus for this one. What we have to do is define a pattern, and in SQL we have two special characters to build a pattern. If you use a percent sign, you are saying "anything": it could be no characters at all, only one character, or many characters. If you use an underscore, you are expecting exactly one thing: one character or one digit, exactly one. I know this sounds complicated, but with an example you will understand it. And I can tell you the percent sign is way more famous than the underscore; I really rarely use the underscore. Now let's say I build a pattern like this: the first character must be M, and then a percent sign. Here I'm saying: in my text, the first character must be an M, and after the first character I really don't care; it could be any character, any number, whatever. That's the pattern, and now let's check a few values against it. Take the value Maria. You can see the first character is an M, which is perfect, exactly our pattern, and after the M we have four more characters, which is totally fine. So Maria is fulfilling our pattern, and that is exactly what we are searching for: this value fulfills the condition. Moving on to the next value, we have Ma. Again, the first character is an M, which is perfect, and after that we have only one character, an a. Well, we said percent, so it could be anything: one character, multiple characters, a number, whatever. That's why this value matches our pattern and we will see it in the output. Moving on to the next value, we have only a single M, which is also totally fine, because we are saying the first character must be an M, followed by anything, and anything includes nothing at all. Now moving on to the last scenario, we have Emma. This one is a problem, because the first character is an E, and our pattern says it must start with an M. We don't have that in this word; the first character is an E. That's why this value is not fulfilling our pattern, and SQL removes it from the final results. This is exactly what happens with this pattern and those values. Now let's have another scenario where you say: it can start with anything, but for me it is very important that the last two characters are an i and an n. So it can start with anything, but the last two must be "in". Take the value Martin: SQL immediately checks the last two characters, and you can see we have an i and an n, and the first part, Mart, is fine, it could be anything. So this value is fulfilling the condition, because the last two characters are an i and an n. Moving on to the next one, we have Vin. V, i, n: the last two characters are exactly what we are searching for, so it fulfills the condition, and before them we have only a V; the percent sign says anything, right? Now one more: we have In. It fulfills the condition as well, because before the "in" we don't have anything at all.
So In is fulfilling the condition too; the percent sign always allows anything, including nothing. Now moving on to the last scenario: we have Jasmine. Here the last two characters are not an i and an n; they are an n and an e, and that does not match our pattern, which is why this value is not fulfilling it, and you will not see it in the results. So with that, you can understand how we search for something in a text using the LIKE operator. Let's keep going. Now let's say I have a percent sign at the start, a percent sign at the end, and in between only one character, an r. If you define it like this, you are saying: if there is an r anywhere, whether at the beginning, at the end, or in between, the condition is fulfilled. If you have Maria, you can see we have an r in the middle: two characters on the left, two characters on the right, it doesn't matter; the main thing is that we have an r somewhere, so this fulfills the condition. Moving on to Peter: we have an r at the end, and that is totally fine, because on the right side it could be anything, including nothing. We have an r somewhere, so it fulfills the condition. Now another case: Ryan. We have an R at the start, nothing before it, and three characters after it, which is totally fine; we really don't care about the position of the r, it is acceptable anywhere. And if you have only an r by itself, that is good enough as well: nothing before, nothing after, and that's okay. But if you have a word like Alice, there is no r inside it anywhere, and this is the only case here where the value gets removed from the results. This way of searching is very famous: you don't care what comes before or after, so if you are searching for anything that contains a word, you put a percent sign before it and a percent sign after it. Now I know we want to practice with the underscore. Let's say I have two underscores, then the character b, then a percent sign. Here is what I'm saying: there should be something in the first position, there should also be something in the second position, the third position must be exactly the character b, and after that it could be anything; we really don't care. I know this is a little complicated, so let's have an example. Take the value Albert. In the first position we have something, the A; in the second position we also have something, the l; so far we are matching the pattern; and in the third position we have a b, a complete match, and the rest, "ert", is whatever. So with that, Albert matches our pattern. Moving on to the next one: Rob. You can see the first position has something, the R, which is good; in the second position we have an o, so it's not empty, we have something; and in the third position we have exactly a b, and after that we don't have anything, which is fine. So again, this value fulfills the condition. Moving on to the next one: it starts with an A, so we have something in the first position; in the second position we also have something, the b; but now the third character is a problem: it is not a b, we have an e. That's why Abe is not following our pattern, and SQL removes it. Now moving on to the last example: we have only an A and an n.
Now let's go back to SQL in order to have some examples. All right, let's start with this task: find all customers whose first name starts with a capital M. So let's go and start searching for this information. We're going to start as usual: SELECT * FROM customers. Now we have to build the filter logic, so we're going to say WHERE. We are searching inside the first name, so we're going to say first_name, and since we are searching for a pattern, we're going to use the LIKE operator. We open our single quotes and start with the M, and what comes after it doesn't matter; for us it is only important that the first character is an M. Let's go and execute it. And with that we get our two customers, Maria and Martin, and both of them start with an M. So with that we have solved the task; it is very simple. Now we have the following task: find all customers whose first name ends with an N. So let's first select all the customers, and we need all the customers that have an N at the end: we have John and Martin as well. How are we going to do it? The same thing: WHERE first_name LIKE, since we are searching, but here we're going to change the expression. It must end with an N as the last character, and what comes before it doesn't matter, even the first character. So it can be anything, but the last character of the word should be an N. That's it, let's go and execute. And with that we get John and Martin, because their last character is an N. It is very simple, right? It is all about where you place this percentage. Okay, so now we have the next task: find all customers whose first name contains an R. Here we don't have a specification whether it is at the start or at the end; somewhere there should be an R. If you first execute without any WHERE condition, you can see, for example, Maria has an R somewhere in the middle, George as well, Martin, and Peter at the end. So we have a lot of names with an R. How can we search for that? We're going to stick with WHERE first_name LIKE, and here our character is going to be an R, with a percentage before it and after it. So it doesn't matter what is before it or after it; somewhere there should be an R. Let's go and execute it. And with that we get all the customers that have an R somewhere. As you can see, it is very simple: if you put a percentage before and after, you are open to more results, and this is used a lot in order to search for a value inside your database. All right, now we're going to move to a funny one. It says: find all customers whose first name has an R in the third position, for some reason. I don't know why. So let's show our customers here without any filter. It is very important for us to find the customers where the third position is an R, like Maria, where the third character is an R, which is okay; but with Peter over here the R is not the third character, so it does not fulfill the condition.
So how are we going to write that? It's going to look like this: WHERE first_name LIKE, but we have to build the pattern from the start. The first position is going to be an underscore, the second position is going to be an underscore as well, and in the third position we put the R. With that we make sure the third position is an R, before it we have exactly two positions, and afterwards it doesn't matter what comes: nothing or more characters. If you execute it like this, we get Maria and Martin, and we don't get Peter, because his R is not in the third position. Now, if you don't do it correctly with the underscores, let's remove one of them and execute: you will get nothing, because we don't have any first name where the second position is an R. So you have to be very careful with this. All right, my friends, this is how you search inside your values. And with that we have covered all the different groups of operators that you can use inside a WHERE clause. So you have learned how to filter your data using multiple operators, and you can filter anything now in SQL. Now we will move to a very interesting topic: you will learn how to combine your data from multiple tables. Here we have two main methods: the first one is SQL joins, and the second is set operators. They are really big topics, so we're going to focus first on the SQL joins, where we have a lot of things to cover. We are now talking about the core of SQL, so let's go. All right, so now we have two tables, table A and table B, and the big question here is how to combine those two tables. What do we want exactly: do we want to combine the rows, or the columns? If you say "I would like to combine the columns", then we are talking about joining tables, so we're going to use joins in SQL. So let's say that we are joining table A with table B, and we start from table A. SQL is going to take the columns and the rows of table A, and SQL calls it the left table, because we started from there. Then we join it with table B, and SQL calls this second table the right table. And what happens here? SQL takes the columns and the rows from the right table and puts them side by side with the columns and rows of table A. So we are combining the columns; we are putting them side by side. Now, if you say "I don't want to do that, I would like to combine the rows; both tables have the same columns, I just want to stack them", then we are talking about another method, called the set operators. Here there is no left and right. Since we started with table A, SQL takes the columns and the rows of table A and puts them in the result, and then it goes to the second table, table B, takes only the rows, and puts them below the rows of table A. So we are putting the rows beneath each other; we are appending. That means with the set operators we are combining the rows, and our table gets longer, but with the joins we are combining the columns side by side, and we get a wider table. Now, for each method there are different types. For example, in order to do the joins we have four very famous types: inner join, full join, left join, and right join. Of course there are more than that, but those are the basics. And for the set operators we have types as well: UNION, UNION ALL, EXCEPT, and INTERSECT.
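As a rough sketch of the difference, with placeholder tables a and b and placeholder columns, the two methods look like this:

-- joining: match rows over a key and put the columns side by side
SELECT *
FROM a
INNER JOIN b
    ON a.id = b.id;

-- set operator: stack the rows of two queries that have the same columns
SELECT id, name FROM a
UNION
SELECT id, name FROM b;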
And for each method there are different rules. In order to join tables, we have to define the key columns between the two tables; don't worry, we're going to learn about that later. This is the requirement for joining tables. And the requirement for combining tables using the set operators is that the queries must have the exact same number of columns, but here you don't need any key in order to combine the tables. So guys, if you look at this: in order to combine two tables, first you have to decide, do I want to combine the columns or the rows? First you decide on the method, then you have different types for how exactly you're going to combine the data, and of course there are rules that you have to follow. Now, of course, we're going to cover everything in the course, but in this section we're going to learn how to combine tables using the SQL joins. So let's dive into this world. All right, so what exactly are SQL joins? Let's say that we have two tables. In the left table we have the customer names, so we have four customers, and in the right table we have the country information about the customers. Now we would like to query both pieces of information: the names and the countries. In order to query those two tables in one query, first we have to connect them, and in order to connect those two tables we need a key: a column that exists on the left and on the right side. And by looking at this, the common column here is the ID of the customer. Once we connect those IDs together, we will be able to query those tables together, and SQL starts matching the IDs. So for the ID number 1, we get the name Maria and the country Germany. The ID 2 connects John to the USA. And now you can see the ID 3 is not connectable; we cannot connect it to the right side. But the ID 4 we can use in order to connect Martin to Germany. So this is exactly what happens if you join two tables: you connect them using a common column, a key like the ID, and once we have a matching value, we can connect the two rows together. This is what we mean by SQL joins. Now you might ask: why do we actually need joins? Well, the first and very important reason is to recombine your data. Usually in databases, the data about something like the customers can be spread over multiple tables. We could have a table called customers, another one with the customer addresses, a third table where you can find the orders of the customers, and maybe another one with the reviews of the customers. So as you can see, the data about the customers is spread over four tables. Now, what if I would like to see all the data about the customers in one result? I would like to see the complete big picture about our customers. What we can do is connect those four tables using SQL joins, and once we do that, in one query I am able to combine all those tables into one big result. This is the most important reason why we use SQL joins: to combine all the data about a specific topic in order to see the big picture. Another reason why we use SQL joins is data enrichment, where I want to get extra data, extra information. So let's say that you are querying the table customers, and this is your main table, the master table.
You are able to see all the data that you need, but sometimes you would like to get one extra piece of information from another table, like for example the zip codes of the countries. So you take the help of another table, which we call a reference table or sometimes a lookup table, where there is one extra piece of information that you would like to add to your master table, the primary source of your data. What we can do is join those two tables in order to enhance our table: we are getting one extra, relevant piece of information for the customers, and this process is called data enrichment. I'm getting extra data for my main table. So this is another reason why we use joins. All right, so far we have used joins in order to get data from two tables. But there is another use case for SQL joins: we use them to check the existence of your data in another table, or the non-existence as well. So let's say that I have a table called customers, and I'm working with this table and running queries. But now I would like to check something: whether our customers ordered anything. In order to check that, I need the help of another table, for example the table orders. That means I'm using the table orders only for my check; I don't want to get any extra data from the orders into my final results. I'm just using the table orders, and we call this table a lookup. So what we can do is connect those two tables together, and based on the existence of a customer inside the second table, the orders, the customer is either going to stay in the final result or going to be removed. That means I'm filtering the data based on the join. And of course, I can check the non-existence as well: I would like to see in the final result all the customers that didn't order anything. It is the same scenario. So my friends, those are the three main reasons why you use SQL joins. First, you want to combine the data from multiple tables into one big picture, so you use a join to get data from different tables. In the second use case, you are working with one table, but you would like to get extra information from another table; this is called data enrichment. And in the third scenario, we don't want to combine the data; we just want to join with another table in order to do a check: to check the existence of your records in another table. This is why we need joins in SQL. Now, there are a lot of different possibilities for how to join tables, how to join the data. In order to make it easy to understand, we're going to visualize the tables as two circles: we have table A and table B. Table A is on the left side, and we call it the left table; table B is on the right side, and we call it the right table. The side of each table is very important. Now, if you combine those two circles, you get three different possibilities. The circles are going to overlap, and that is exactly where we have the matching data between the two tables: the data that is available on the left and on the right. Another possibility: you want all the data from one of the tables, so you take all the rows from one circle. And the third possibility: you want only the unmatching data from one table. If something exists in one table but not in the other, we call it unmatching data.
Those are the three scenarios you have to ask yourself about once you are combining tables, and they can generate a lot of join types. So we have the basic SQL joins, the classical ones, where, depending on the scenario, you want only the matching rows, or all the rows from either left or right; and we have the advanced SQL joins, where we focus on the unmatching data. We're going to cover all those types one by one. We're going to start with the basics, and the first option that you have is to get all the data without joining the tables at all. Let's see what this means. So what do we mean by "no join"? Well, we want to return the data from two tables without combining them. Actually, this is not a join type, because we are not combining anything; we just want to query the data from two tables. That means from table A we want to see everything, all the rows, and from table B we want to see everything as well. So we want two results, and there is no need to combine them. Let's see the syntax for that. All you have to do is very simple: SELECT * FROM A, then a semicolon, and then you start another query: SELECT * FROM B. That's it. And of course, since we are not combining the data, there is no JOIN in the syntax. Let's go to SQL in order to do that. Okay, so now we have the following task: retrieve all data from customers and orders in two different results. That sounds like we don't have to combine the tables, and all we do is the following: we select the data from the first table, and then we write another query for the second table, the orders, and we don't combine them in one big query. We just use two very simple SELECT statements in order to retrieve the data. If you execute it, since you have two separate queries, you get two results: in one result you get all the customers, and in the other result you get all the orders, and the data is not combined at all. So this is how you query two tables without combining them, and with that we are getting all the data without joining the tables.
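As a small sketch with our two tables, that's literally just two independent statements:

SELECT * FROM customers;  -- first result: all customers
SELECT * FROM orders;     -- second result: all orders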
Now we're going to start talking about the first type of join, the inner join, where we really start combining the data from two tables. So let's go. Okay, what exactly is an inner join? This type returns only the matching rows from both tables; that means we will see only matching rows in the output. So what do we need from the left table? Only the matching data: we will not get the whole circle of A, only the part where it overlaps with table B. We want to see the data from A only if it exists in table B. And what do we need from table B? Exactly the same thing: only the matching data. I don't want to see all the data from B, only the data in B that has a match in table A, on the left side. And with that you get only the matching data from both tables. Now let's see how we write that in SQL. It is a usual query, and as always we start with a SELECT. We select, for example, all the columns, then FROM, and here we specify the table name, so it's going to be A. So far nothing new. But now we want to add table B to the same query, and in order to do that we use the keyword JOIN, and then we say B, the name of the second table. And since we have different types of joins in SQL, you can specify the type of the join before the keyword JOIN. If you don't specify anything, the default type is the inner join. But my friends, the best practice is to always mention the type. I don't like to rely on the defaults, because in projects maybe not everyone is aware of them. So don't skip it; always specify the type. So what we're going to do is put the keyword INNER before the JOIN, and with that SQL knows how to deal with the rows between the two tables. But we are not done yet: we still have to tell SQL how to combine the tables, and for that we use the keyword ON, and after it you specify the join condition. As we learned, in order to join two tables we have to find a common column to match the data on, right? And usually in SQL these are the keys, the IDs. So the condition can be like this: the key from table A must be equal to the key from table B. This is the join condition, and using it SQL can start matching the data from the left table and the right table. And there is one thing that is very important to understand while you are joining tables: the order of the tables in your query. In the inner join, the order of the tables doesn't really matter: whether you start from A or from B, you get the same results. Both tables have the same priority, and whether we say FROM A JOIN B, or FROM B JOIN A, we get the exact same result. So in the inner join you don't have to worry about the order of the tables. That's all about the inner join; now let's go back to SQL in order to practice. Okay, so we have the following task, and it says: get all customers along with their orders, but only for customers who have placed an order. My friends, that means we need the data from the customers and from the orders, from two tables, and we have to put everything in one result: we have to join two tables. Let's do it step by step. We're going to say SELECT * FROM customers, and then we have to join it with the orders: JOIN orders. Now you have to specify the join type: is it inner, left, full, and so on? Well, that depends on the task. It says we want customers, but only customers who have placed an order. So there is a condition right here: we don't want to see everything from the customers, we want only the matching data, only the customers that have an order in the orders table. And for that we can use the inner join. Of course, if you leave it as just JOIN, you get the same effect, but I'm going to specify it explicitly, INNER JOIN, just to make it clear that we are talking about the inner join. After that, we have to specify the join condition, so we have to find a common column between the customers and the orders. How I usually do it: I go and explore both tables. So I select everything from the customers, and everything from the orders as well, and execute. Now we start searching for a common column between those two tables. From the first table we have first name, country, and score, and you don't find any of that information in the second table. The only one is the ID: the ID of the customer, which you can also find in the orders, the second column here.
So this is the common column between those two tables. And usually in databases we create IDs exactly for this purpose, in order to connect tables. It's really rare that we use something like a country, a score, or a first name in order to join tables; we usually use the IDs. So let's go back to our query and use those two columns: it's going to be the ID from the customers equal to the customer ID from the orders. That's it: we have the condition, we have decided on the type, and we can execute it. Now you can see we are getting only three customers, right? Without the inner join, we can see that we have five customers. That means we actually have two customers without any orders, without any matching data from the other table. And you can also see very nicely that we now have not only the columns from the customers, but all the columns from the orders as well, side by side. So with that we have combined the data, and we have solved the task. But we will not leave our query like this, because it is not really good practice. What we have to do is select only the columns that really make sense in our query, because in many cases your tables have a lot of columns that are not needed. For example, if you check here, we have the customer ID here and the customer ID over here again; it's a repetition, and it's enough to see it only once. So what you do is pick the few columns that we want. For example, I'm going to start with the ID, maybe the first name, and that's all from the first table. Then let's get the order ID, and I don't want the customer ID again, so from the second table I'll add the sales. Let's execute it, and with that you can see very nicely the customers' names and their orders with the sales. Now comes something very important. Sometimes, if you have two tables, you might have columns with the same name. Imagine the order ID in the table orders were called just "ID": then we would have the same column name in both tables, and this makes SQL very confused. You would get an error telling you: I really don't know what you mean with the ID; is it from the table customers or from the orders? So we have to tell SQL exactly which table the column comes from. In SQL, in order to do that, before the column name you write the table name, customers, and then a dot. Now we are telling SQL: this column, the ID, comes from the table customers, and SQL will not be confused about it anymore. And for the second ID, you go over here and write orders and a dot before it, so that SQL knows: okay, this ID comes from the orders, and the other one comes from the customers. And it is always good practice, especially if you are joining tables, to assign a table to each column, because after a while, if you open your query and you see "sales", does the sales come from the customers or the orders? If you have a long list of columns, it gets really confusing. That's why we consider it best practice to always put the table name before each column, especially if you are doing joins. So it's going to look like this. Of course, if you have only one table, it's clear that all the columns in the SELECT come from that table, but since here we are dealing with multiple tables, it is good to show it like this. And of course, in our tables we don't actually have this naming conflict with the ID.
We have the order ID instead. And we do the same thing for the join condition: the ID here comes from the customers, and the customer ID comes from the orders. Now it is clear for everyone which column comes from which table. But now you might say: each time I have to write "customers", and this is a very long name, and in real projects you will see tables with really long names, and it gets really annoying to write them before each column, right? So instead of that, we can assign aliases for the tables and use them for the columns. Usually we go over here and say AS, and maybe use only one character, the first character, C. And now, instead of saying customers, you can say C: the same thing for the second column, and over here as well; you can now use the C everywhere in your query. The same thing for the orders: you go over here and say AS O, and now instead of orders you say O. And now it is very easy to see: those two columns come from C, that means the customers, and those two columns come from O, the orders. Those are the best practices as you are joining tables in SQL. And of course, with that we have solved the task. And about the order of the tables: it doesn't matter where you start. For example, if you take the orders here and put them in the JOIN, and put the customers in the FROM, so I just switch the tables and execute, you get the exact same result. So if you are doing an inner join between two tables, don't worry about the order of the tables.
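Putting all those steps together, the final query from this walkthrough would look roughly like this (a sketch, with the column names as they are described in the walkthrough):

SELECT
    c.id,
    c.first_name,
    o.order_id,
    o.sales
FROM customers AS c
INNER JOIN orders AS o
    ON c.id = o.customer_id;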
Okay, so now let's understand exactly how SQL executes the inner join. Again, here we have our query, and we have the two tables, customers and orders, and here we have the ID on which we are joining the data: the ID from the table customers, and the customer ID that we have in the orders. Let's see how SQL executes this. We are saying: I would like to see the ID and the first name, so we get those from the table customers, and we would like the order ID and the sales from the table orders. So our result focuses on those four columns. Now the data has to be joined between those two tables using the inner join, and SQL starts from the left table, from the customers, because we say FROM customers. It starts matching the ID from the left table with the right table. It asks: is there a match with the first record, the first order? Well, yes, it is the same ID. SQL says: the condition is fulfilled, and we are allowed to show the data, so it will be presented in the output: we get the ID, Maria, the order ID of Maria's order, and the sales of that order. There is a match. Then SQL goes to the second record: no match. The third: no match. And the same for the last one. So we have only one match for this ID. Then SQL goes back to the customers, picks the second one, and starts matching again with the first order: do we have a match? No. Then it goes to the second: now we have a match, so SQL is happy, the condition is fulfilled, and we see the result: the first name and the order information for this customer in the output. It keeps searching, finds no further match, and that's it. Now for the third customer: no match with the first order, none with the second, none with the third, and here we have a match. So SQL shows this information, since there is a match: customer 3, George, with the order ID and the sales of his order in the output. Then it keeps searching: no further match. Then SQL goes to the fourth customer and starts matching: do we have a match here? No. The second, the third, the fourth: we don't have any order for this ID. There is no match at all, and since we are saying INNER JOIN, SQL will not allow the data of this customer into the results; SQL is going to totally ignore this customer. Then we go to the last one and match this ID with the orders: there is no match either, so SQL excludes this customer from the results as well. This is exactly how the inner join works: it starts from the left side and matches the data with the right side, and only if there is a match is the row presented in the output. This is exactly why we got these results, and how the inner join works. Now, if you look again at the reasons why we join tables: we can use the inner join in order to recombine multiple tables into one big picture, the first use case, and we can also use the inner join in order to filter the data. Since we are saying "only the matching data", we are filtering: we are checking the existence of the records in another table. So you can use the inner join either to combine data from multiple tables, or purely for filtering, only to check the existence of your rows. Those are usually the two use cases of the inner join. All right, that's all about the first type, the inner join. Next we're going to talk about the left join, where we focus on the left side. So let's go. Okay, what exactly is a left join? This type returns all the rows from the left table and only the matching rows from the right table. If you look again at our two circles, A and B: what do we need from the left table? We want to see everything, all the rows, all the data; we get the full circle. And from the right table we want only the matching data. That means we don't want to see everything from table B, only the records that have a match in table A. So my friends, the left table has more priority here: it is the primary source of your data, the main source, and we cannot miss anything from it; this is very important, we want to see all its data. Table B, on the other hand, is a secondary source of data, and we are joining it only to get additional data. I don't want everything from it, only the data that has a match with the left table. This is what we mean by a left join. Now, if you look at the syntax, it is very similar to the inner join: we start from the left table, A, then we say LEFT JOIN, the right table B, and then the same condition using the keys. We just switched the type: instead of INNER we now have LEFT. But here, with this syntax, we need to be very careful: the order of the tables is now very important. You have to start from the correct table: you have to put the left table in the FROM clause, and then join it with the right table, so in the JOIN you specify the right table.
If you don't do it like this, you will not get all the data from A, and you will not get the results that you are expecting. So this is what we mean by the left join; let's go back to SQL in order to practice. All right, we have the following task. It says: get all customers along with their orders, including those without orders. Again, we need the data from two tables, the customers and the orders, and we want everything in one result; that means we have to join the data. And the task says "including those without orders": that means I want to see everything from the table customers, the matching data and the unmatching data. By looking at our current query, this is not working, because we are not getting everything, right? We are getting only the customers that have a match in the table orders, and this of course does not fulfill the task. If you read the task, you can understand that the main table here is the customers. We are not saying we must see all the orders without missing any order; the orders here are only for additional information. So, in order not to lose any data from the customers, we make sure we start from the table customers; the customers are now on the left side. And then, instead of the inner join, which is not the right thing for this task, we're going to say LEFT JOIN, and with that we guarantee we get all the data from the customers. So we say LEFT JOIN orders, and of course the condition stays like this; this is how we are connecting the two tables. That's it, let's execute it. Now, looking at the result, you can see that we have all five customers, even the customers that didn't place any orders: Martin and Peter don't have any order ID, which means they didn't order anything. And as you can see, SQL shows us NULLs when there is no match. So with that we have solved the task. Now, my friends, one more thing. As I told you, the order of the tables is very important: the customers are the left table, because you start from them, and the second table, the orders, is the right table. If you switch them, so that we start from the orders and join them with the customers, and you execute, you will not get all the customers, and the task is not solved anymore. As you can see, you get a completely different result if you switch the tables. So be careful where you start and how you join the tables, in order to get the effect that you want. All right, now I'm going to put everything back like before.
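So the solved task, sketched out, is the same query as before with only the join type switched:

SELECT
    c.id,
    c.first_name,
    o.order_id,  -- NULL for customers without any order
    o.sales      -- NULL as well when there is no match
FROM customers AS c
LEFT JOIN orders AS o
    ON c.id = o.customer_id;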
Now let's understand how exactly SQL executes this query. Again we have the data from customers and orders, and this time we are doing the left join. Let's see how SQL is going to do it. It says: okay, we need the ID and the first name, and we will get those in the results, and from the right table we need only these two pieces of information, the order ID and the sales, in the output. Those are the columns that we need. Now, in the left join, SQL does it a little bit differently. It still starts from the left table, from the customers, but this time it immediately puts the row in the output, without trying to match anything and without checking whether the data exists or not, because it doesn't matter: SQL does no validation of whether the customer exists in the orders. Since it is a left join, it will show all the data from the left table anyway; there is no check. But as a next step, in order to get the order ID and the sales, SQL starts searching. SQL goes over here and searches: where do we have an order with this customer ID? It's the first order: we get the order ID and the sales information, and we see that in the output. That's it for the first one. Then it goes to the second row, and the same thing happens: SQL immediately puts the row in the output, without checking anything, and then, in order to get the order data, it searches for this ID. We have it here in the second row: we get the order ID and the sales, and SQL puts those results into the output. The same for the third one: it goes immediately into the output, and then SQL searches for orders with this ID; we have it over here, this order belongs to the customer ID number 3. So far we are getting the same result as with the inner join. But we are not done yet, and now comes exactly the difference: SQL takes Martin, puts him immediately in the output, and starts searching for an order with this ID. Do we have any order with the ID number 4? This time we don't have anything. SQL of course will not exclude the ID number 4; it leaves it in. But in SQL, if there is no match, we still have to put something in the output, so SQL says: the output is going to be NULL. We don't know it; it is unknown. And the same for the sales. So in the left join, if there is no match, you see NULLs. The same thing for the next customer, Peter: SQL puts the row immediately in the output, and then searches the orders. Do we have anything for the ID number 5? We don't have anything, and that's why SQL presents NULLs in the output. And this is why you saw NULLs in the output: those customers don't have any orders. So this is exactly the effect of the left join: you get everything from the left table and only the matching data from the right side, and if something has no match, you get NULLs. That is how SQL executes the left join. Okay, now back to the use cases of joins. If I think about the left join, I can use it in order to recombine data, to build the big picture, and also in the second use case, where we use it to get extra information from another table: we have a main table and a secondary table. So we use it for both use cases, and also for the third use case, but with a twist that we're going to learn later. So that's all about the left join. Now we have another type that is exactly the opposite of the left join: the right join. Let's understand what this means. Okay, what exactly is a right join? It is the total opposite of the left join: this type returns all the rows from the right table and only the matching rows from the left table. Here, the main table, the main focus, is the right table. SQL gets you all the rows, everything from table B, the right table, but from the left side we get only the matching data. That means on the left side you see only the data that has a match on the right side, and with that the right table becomes the primary source, the main source of your data. It is the very important table, and the left table is not that important; you are just joining it in order to get additional data.
Again, the syntax is not that crazy: all you have to do is change the join type, so instead of LEFT you say RIGHT JOIN. And again, the order of the tables is very important here, because the side makes a difference: we start from the left table, A, and then RIGHT JOIN it to the table B. It sounds very similar to the left join; we are just switching things. Now let's go back to SQL in order to practice. Okay, my friends, we have the following task, and it says: get all customers along with their orders, including orders without matching customers. Again we have the customers and the orders, and we are doing a join, but here the condition is different: we want to see all the orders, even if they don't have a matching customer. That means I would like to see everything from the table orders, and the customers table here is only supporting and helping. The main table that we are focusing on is the orders: we want everything from it, and from the customers only the matching data. If you look at the current result, you can see we are seeing only three orders, right? But in the original table, if you go back over here, you can see that we have four orders. So with the current query we are not seeing all the orders. How are we going to solve it? If you start from the table customers, you can say: instead of LEFT JOIN, we're going to say RIGHT JOIN, and with that you guarantee you get everything from the table orders. The left table, the customers, is now not that important, and you see the data of the customers only if there is a match. So doing the right join like this, we are guaranteed to see every order, whether there is a match or no match. If you execute it, you can see on the right side the order ID and the sales, and we now see all the orders; and on the left side, the ID and the first name, where we see the customers only if they ordered something. For the order without a known customer, we get NULLs. So with that, you have solved the task using the right join. And now, my friends, you have to go and solve this task to get the exact same result, but you are only allowed to use the left join; you are not allowed to use the right join. So pause the video, solve the task, and see you soon. Now my friends, in SQL there are always alternatives for how to solve a task. If you want to get all the data from B and only the matching data from A, you can do it as we did, using the right join. But you can also switch the sides: make table B the left table and table A the right table. You can of course do that in SQL, but then you have to switch the join type: instead of RIGHT, we now use LEFT, since the B table is on the left side, and you have to switch the order as well: you start from the B table, and then you say LEFT JOIN the A table, with the same join condition. If you do that, you get the exact same result as with the right join. So if you just switch the tables and also switch the join type, you get the same results. And to be honest, my friends, I don't like the right join. In the last ten years, I have always tended to start from a table and then use a left join, and from my point of view, the left join is way more famous than the right join. I think I have never written a query where I'm using a right join.
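Here is a sketch of the two equivalent versions side by side, using the same tables and columns as before:

-- right join version
SELECT c.id, c.first_name, o.order_id, o.sales
FROM customers AS c
RIGHT JOIN orders AS o
    ON c.id = o.customer_id;

-- equivalent left join version: start from orders instead
SELECT c.id, c.first_name, o.order_id, o.sales
FROM orders AS o
LEFT JOIN customers AS c
    ON c.id = o.customer_id;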
So my advice for you: always try to skip the right join and stick with the left join; just get the order of the tables in the query correct, and you will get the same results. So with that, you know an alternative for the right join. Now, switching the RIGHT to LEFT alone is not enough, because if I execute only that change, the result is wrong. What I also have to do is switch the tables, like this: we start from the table orders, because I want to see everything from the orders, and then LEFT JOIN it with the customers. And we don't have to change anything in the condition: the order there doesn't matter, because we have an equality operator. What is very important is where you start, from which table, and which table you are joining with. If you execute it, you get the exact same results: I see all the orders, I'm not missing anything, and only the matching customers. And I prefer this way of solving the task over using the right join. All right, that's all about the right join. Next we're going to combine everything: we're going to talk about the full join. Let's go. Okay, what exactly is a full join? If you use it, SQL returns everything: all the rows from both tables. If you check our circles again: from the left table we want everything, all the rows, so you get the whole circle, and from the right table we want everything as well, all the rows, the whole circle. So you get everything: the matching, the unmatching, all the data from left and right. Now let's check the syntax; it's very simple. The join type here is FULL JOIN. And the full join is similar to the inner join in one way, if you remember: the order of the tables is not important at all. There is no main table and secondary table here; both tables are important, and it doesn't matter where you start in your query. You can say FROM A FULL JOIN B, or FROM B FULL JOIN A; you get the exact same results. It sounds simple; let's go to SQL and practice the full join. All right, we have the following task, and it says: get all customers and all orders, even if there is no match. Again we need the data from customers and orders, but now, which type are we going to use? It says "even if there is no match", and it doesn't say no match from the orders or from the customers. You can understand from this task that we are not focusing only on the orders or only on the customers: both of them are equally important, and we need all the data. That means we need all the data from the left, all the data from the right, and we can use the full join. So we have our query over here, where we start from the customers and join to the orders; but now, instead of LEFT, we're going to say FULL JOIN. Let's just execute it. If you look at the left side, you can see we are getting all the customers, right? We have our five customers. And if you look at the right side, you can see all our orders. So with that we have everything from the left and everything from the right, the matching data sits side by side in the result, and where there is no match we get NULLs. With that we have solved the task. And again, it doesn't matter how you start: you can start from the orders and join them to the customers, and you get the exact same results, exactly the same data.
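Sketched out, the solved task looks like this:

SELECT c.id, c.first_name, o.order_id, o.sales
FROM customers AS c
FULL JOIN orders AS o
    ON c.id = o.customer_id;
-- unmatched rows from either side appear with NULLs on the other side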
Now let's understand exactly how SQL executes the full join. Okay, again we have the data of the customers and the orders, and our full join. SQL is again going to identify the columns that we want to see in the results: the ID and the first name, the order ID and the sales, into the output. It starts from the left table, since we started with the customers, and it simply takes everything from the left table and presents it in the output; since it is a full join, we want to see all the data from the left side. Then it starts searching for matches in the right table. Let's start with the first customer: as usual, we get the order of customer number 1. The same for the second customer: we have a match here as well, and we get it, just like with the left join. For the third one we have a match as well, and we get it like this. And since we don't have orders for the remaining two customers, we get NULLs in the output; SQL marks them with NULL, the same over here, and for the last customer as well. So we get NULLs for those two customers. And now, of course, SQL will not stop here, otherwise we would get a left join effect. SQL now starts looking at the right side to find any order that is not yet in the output. SQL sees: the first order is in the output, the second one as well, the third too, but the fourth one is not in the results. So SQL takes this row and puts it in the output: this order has no match at all from the left side. And with that, if you look at the right side, you see that SQL is happy, because we now have all the orders from the right table. And of course SQL will not leave the left side empty; instead, SQL shows NULLs there: there is no ID and there is no first name. So this is exactly why we got these results, and how SQL executed the full join. Okay, now if you look at the use cases: you can use the full join to recombine the data from multiple tables as well, if you don't want to miss anything from either table, all the data, the matching and the unmatching. But I don't usually use it for data enrichment, the second use case. And we can use the full join in the last use case as well, but with a little twist that we're going to learn later. So this is mainly where we can use the full join. All right, with that we have covered the basic types of joins: inner, left, right, and full join. Those are the classical joins for combining two tables. Now we're going to start talking about the advanced SQL joins, and we're going to cover the first one: the left anti-join. Let's see what this means. Okay, what exactly is a left anti-join? With this mechanism, we want to return the rows from the left side, the left table, that have no match in the right table. Looking at our two circles: from the left table we want to see only the unmatching rows, only the rows that exist in table A but don't exist in table B. If there is matching data, we don't want to see it. And from the right table we don't want anything, no data at all. That means the only source of your data is the left table; from the right table we don't need any data, we are just joining the tables to do a check, to filter the data. Now, the syntax is interesting: we don't have a special type called LEFT ANTI JOIN.
At least in SQL Server there is no such keyword, but we can still create this effect. Since we are saying "left", we can use the type LEFT JOIN, and then, as usual, the join condition with the keys. But if you leave it like this, you get the effect of the plain left join, and we don't want that, because with the left join you get the complete circle of the left table. In order to remove the matching data, this overlapping part in the middle, we can use a filter, and in order to filter the data we use the WHERE clause. To get rid of the matching data, we take the key from the right table and we say: this key must be NULL. If the key is NULL, that means there is no match on the right side. If you do it like this, you get the effect of the left anti-join: only the data on the left that has no match on the right. Now let's go to SQL and create this effect. Okay, we have the following task, and it says: get all customers who haven't placed any order. Clearly we are focusing on the table customers, but we want to see the customers that didn't order anything: they are in our database, but they are inactive customers. There are different ways to solve this task, but we're going to solve it using joins. Let's start by writing a very simple query where we select everything from the table customers. You can see our five customers. Now I want to check which of those customers hasn't ordered anything yet. Since we are talking about the orders, we join with the table orders: we say LEFT JOIN the table orders AS o, and then we connect the tables using the IDs, with the customer ID. If you execute it now, we still see all the customers, because we are using the left join, and we see the order information of each customer, and you can immediately see that two customers didn't order anything, because we see NULLs there, right? They are empty; there are no orders. Now we can use this information in order to filter the data: I just want to see Martin and Peter. So what we do: we say WHERE, and all you have to do is take the key that we are using to join the tables, this one over here, and say it must be NULL: IS NULL. Writing it like this means you want to see the rows where the customer ID is NULL. Let's execute it: perfect, now we are getting the customers who haven't ordered anything, and this is exactly the effect that we wanted, the left anti-join: we get the data from the left side where there is no match on the right side. You always do it in two steps: first you join the data as you normally do, using the classical left join, and then, in the second step, you apply a filter using the WHERE clause. If you do it like this, you can check for non-existence, and with that we get the effect of the left anti-join. So that's it. Okay, now if you look at this picture, I think you already know where we use the left anti-join: only in the last use case, where we are checking existence. If you use the left join together with the WHERE clause, you can check for the non-existence of your data in another table; it is exactly for this scenario. All right, that's all about the left anti-join.
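As a sketch, the whole pattern for this task is:

SELECT c.*
FROM customers AS c
LEFT JOIN orders AS o
    ON c.id = o.customer_id
WHERE o.customer_id IS NULL;  -- keep only customers with no order at all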
Now we're going to talk about the exact opposite of that: the right anti-join. It's going to be very similar; we are just switching sides. So let's go. Okay, what exactly is the right anti-join? Well, it is the opposite of the left anti-join: we want to return the rows from the right table that have no match in the left table. Again, looking at our two circles: now what matters is the right table. We want to see only the unmatching rows from the right table, only the rows that exist in B but not in A. And from the left table we don't need anything; no data is needed. That means the only source of data is the right table, and you are using the left table as a filter, as a lookup, just in order to check existence. The syntax is very similar to the left anti-join: we don't have a special type called RIGHT ANTI JOIN, so we use the classical RIGHT JOIN. But if you do only that, you get everything from the right table, so in order to get rid of the matching data in the middle, we use a filter, the WHERE clause, where we say we are interested only in the unmatching data: we take the key from the left table and we say the key from the left IS NULL. If you do that, you get rid of any matching data; IS NULL means there is no match. And again, the order of the tables is very important here, since we are talking about sides, and you have to get it right. Okay, so now the task says: get all orders without matching customers. It is exactly the opposite of before: we want to see all the orders that don't have a valid customer. This is a really bad scenario: having orders in your business without a valid customer. Let's see how we can discover that using SQL joins. As you can see, we are focusing completely on the orders, not on the customers anymore, and we want only the orders where there is no match with the customers. Again, we have two steps. In the first step, we do the normal join, using either the left or the right join. Looking at this query, where we start from the customers, if you want to fully focus on the orders you have to switch it from LEFT to RIGHT, and with that you get all the orders and only the matching customers. And let's comment out this WHERE clause from before: I'm just adding comment markers, and with that SQL is going to totally ignore this line of code. Let's execute it. You can see we are getting all the orders, right, and the data from the customers only where there is a match. But of course, this is not the task: we don't want to see all the orders, only the orders where we don't have a match from the customers. If you look at these three orders, they are okay, totally fine: we find customers for them, so they have valid customers. But this order here is really bad: there is no valid customer for it, and our task is to show only this type of order in the result. Now we have to use the WHERE clause in order to get exactly this effect. This time, we say the ID of the customer, from the table customers, must be NULL: we remove the old filter, take the join key from the customers, and say this ID must be NULL. Let's go and execute it. Perfect.
With that, we have solved the task: we are getting the effect of the right anti-join, and we now see the orders that don't have any customer. Now, my friends, you have to go and solve this task without using the right join, but you still have to get the same effect: you want exactly those orders without customers. So pause the video and go solve the task. Now, again, as you know, I don't like the right joins. We can create the same effect if we switch the sides of the tables: if the B table is now on the left side and the A table on the right side, we get the same effect, provided we switch the type of join from RIGHT to LEFT and switch the tables: you start from the B table, since it's on the left side, and then join it with A, and in our WHERE condition we still say the key from A IS NULL, so there is no match. If you do this, you get the exact same results as with the right join, by using the left join and just switching the tables. And with that, you know that in SQL we always have alternatives. I hope that you are done. So it's very simple what we're going to do: we're going to switch the join, and since the orders are the main table, we start from the table orders, putting them on the left side, and then the right table is the customers. The condition stays as it is: we want to see the orders where there is no customer, so we don't have to switch anything here or in the join key. Let's execute it: with that you are getting the exact same results. Since we are using the star here, the columns of the left table come first, followed by the columns of the right table, but the result is still valid: we are getting the orders without matching customers. And I prefer this way.
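So the preferred version of this task, as a sketch, is:

-- orders that have no matching customer, using a left join
SELECT o.*
FROM orders AS o
LEFT JOIN customers AS c
    ON c.id = o.customer_id
WHERE c.id IS NULL;  -- keep only orders without a valid customer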
All right. So now we have the left and the right — and what is next, of course? The full. So let's speak now about the full anti-join in SQL. Let's go.

Okay, so what exactly is a full anti-join? Well, this time we don't have sides. We want to return only the rows that don't match in either table. What does this mean? If you look at the left circle, we want only the unmatching rows — not the whole circle, only the data that exists in A but doesn't exist in B, the right table. That sounds like the left anti-join, but since we are saying full, you have to do the same thing on the right side as well. So from the right table we also want only the unmatching rows — the data that is in B but has no match in A. And if you look at this, it means we want to see only the unmatching data, which is exactly the opposite effect of the inner join. In the inner join we were interested only in the matching data, only where there is overlap. With the full anti-join it is exactly the opposite: we don't want the matching data, we want everything else — the unmatching data.

So how are we going to write this query? Again, we don't have a special type called full anti-join; we use the help of the classical FULL JOIN, the basic one. You start with A FULL JOIN B on the same key. What is interesting now is the WHERE condition, because this time we have two conditions. In order to get all the data from A that has no match in B, you add a filter saying the key from the B table must be NULL. And since we want the exact same thing from the right table — all the data in B that has no match in A — you also say the key from the A table must be NULL. So we have two conditions here, and in SQL, if you have two conditions in the WHERE clause, you have two options: the AND operator or the OR operator. The one we use here is OR: either the key from the right is empty, or the key from the left is empty. If you do it like this, you will get the effect of the full anti-join. And of course, since both sides are equal here, the order of the tables is not that important — you can say FROM A FULL JOIN B or FROM B FULL JOIN A; it doesn't matter.

So now let's go back to SQL in order to create this effect. Okay, we have the following task, and it says: find customers without orders and orders without customers. Looking at this, it means we want only the unmatching data from the customers and from the orders. There is no main table and secondary table — both of them are equally important. Since we are talking about unmatching data and anti-joins, we do it in two steps. In the first step we do the classical join, and then we focus on the WHERE clause — so let me comment out the WHERE clause for now. Since we want the data from left and right, we use the FULL JOIN. Let's go and execute it. You can see we are getting the effect of the full join: all the orders and all the customers.

But we are interested only in the strange cases: orders without customers, like this one here, and customers without orders. That means the first three rows are not really interesting for us — they are boring: matching data, which is totally fine, but not our focus now. We are focusing only on whether data is missing from the left or from the right. As you notice, I'm saying or, and this is very important, because we're going to use the OR operator. Let's focus first on this scenario over here: an order without a customer. That means the ID from the customers must be NULL, and we have that condition already: WHERE the ID of the customer IS NULL. If I execute it, I get only one record, this one over here. But I also want the opposite scenario, a customer without an order — there, the customer ID in the orders must be NULL. So we say OR the customer ID from the orders IS NULL. Side by side: either the right side is NULL or the left side is NULL. If you execute it, you get the effect of the full anti-join, and with that we are finding the customers without orders and the orders without customers. I think this is really fun, and really easy as well. So this is how we do the full anti-join.

All right, so if you look at the use cases, we use the full anti-join again exactly for the last use case: checking existence. If you combine the FULL JOIN with the WHERE clause, you can check the existence — or the non-existence — of your data in another table. So this is exactly the scenario for that.
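Here is a minimal sketch of the full anti-join, again with the assumed names customers.id and orders.customer_id:

-- Full anti-join: unmatched rows from both sides
SELECT *
FROM customers AS c
FULL JOIN orders AS o
    ON c.id = o.customer_id
WHERE c.id IS NULL           -- orders without a customer
   OR o.customer_id IS NULL; -- customers without an order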
So, it says, "Get all customers along with their orders, but only for customers who have placed an order, but without using an inner join." So, pause the video now and go and solve this [Music] task. Okay, so now let's see how we're going to solve this. We want the customers, the orders, blah blah blah. But we want only the customers who have placed an order. Previously, we have used the inner join in order to solve this task. But this time, we are not allowed to use it. So, let's go and solve it. This is how I'm going to do it. Select star from table customers. Can't give it the alias. So, now I'm getting all the customers, but I am interested only the customers who have placed an order. So, as we know before there's like two customers didn't order anything, and we don't want to see them in the final results. Now how we will get that? Well, we can use the help of the table orders in order to check the existence of our customers there. And of course, I'm not allowed to use the inner join. So I'm going to go and use a left join with a table orders and then combine them as usual. Nothing new with the customer ID. So now let's go and execute it. As you can see, we are doing it step by step. You don't have to rush everything in one go. So you start simple, check the results and decide on the next step. So now by looking at these results I want to get those three customers because they have ordered something and we are seeing data about their orders and I don't want to get in the result the last two. So again we still can use the customer ID from the right table in order to decide which data going to stay in the result and which data should be filtered. We're going to go and use the wear clause and then the key from the orders and this time we're going to say is not null. I know we didn't learn yet about the not and the logical operators but using the not null it means there should be data inside the column it must not be null if you do it like this and execute you will get the exact effect as the inner join. So as you can see as you are joining the tables using the left join you can control what you want to see using the wear clouds using the filter and this is how you can solve this task without using an inner join. Okay, so with that we have covered all those three scenarios in order to find the unmatching data. Left, right, full and joints. Now we can speak about one crazy join. We call it the cross join. This one is totally different from all other types that we have learned. So let's understand exactly what is the cross join. Let's go. So now what is exactly a cross join? Now in some scenarios we want to combine every row from the left, every row from the right. So that means I want to see all the possible combinations from both tables. So we are doing something called like cartesian join. So now if you look at our two circles, we want everything from A and as well everything from B. So that means I want to see everything from A combined with everything with B. So in this example, we have two rows in A and three rows in B. If you do a cross join, you will get six possible combinations by just multiplying the number of rows between A and B. So be careful using the cross join. If you use it, you will get like crazy number of rows in the results and you're going to make the database really busy finding out the result for you. So now about the syntax, it's going to be the easiest. So you start as usual from one of those tables, the A for example, and then you say cross join B. 
Okay, so with that we have covered all three scenarios for finding the unmatching data: the left, right, and full anti-joins. Now we can speak about one crazy join — we call it the cross join. This one is totally different from all the other types we have learned, so let's understand exactly what the cross join is. Let's go.

So what exactly is a cross join? In some scenarios we want to combine every row from the left table with every row from the right table — that means we want all the possible combinations from both tables. We are building something called a Cartesian product. If you look at our two circles, we want everything from A and everything from B: everything from A combined with everything from B. In this example we have two rows in A and three rows in B; if you do a cross join, you get six possible combinations — just multiply the number of rows in A by the number of rows in B. So be careful with the cross join: you can get a crazy number of rows in the result, and you're going to make the database really busy computing it for you.

Now about the syntax — it's the easiest of all. You start as usual from one of the tables, A for example, and then you say CROSS JOIN B. And my friends, if you look at this, you can see it's not like the previous joins we have done. Before, we always talked about matching rows, unmatching rows, and so on; here we don't care at all whether the data matches or not. We just want all the possible combinations — everything. And since we don't care about matching the two tables, we don't have to specify any condition, so there is no need for the keyword ON. That's it: you just say CROSS JOIN B and the magic happens. So this is the cross join. Let's go to SQL to try it.

Okay, we have the following task: generate all possible combinations of customers and orders. That means we want everything with everything, using the cross join, and this is going to be very simple. We start with SELECT star FROM whatever table — you can start from the customers — and then you say CROSS JOIN orders. That's it, very simple. Let's go and execute it. As you know, we have five customers and four orders, and if you multiply them you get 20 rows in the result. We are getting everything with everything, even where the data doesn't match at all. For example, look at the orders here: this is one order that belongs to only one customer, customer ID 1 — it is actually an order from Maria — but we still see this same order paired with all the other customers, since we are combining everything with everything. There are no rules. The same for the next one: the second order actually belongs to John, but we see this order with all the customers. So that's it — this is how the cross join works.

Now you might ask me: why do we even have this? It makes no sense, right? Well, my friends, I rarely use it. But sometimes I want to generate test data, or maybe you have, for example, a table called colors and a table called products, and you would like to see all the combinations of products and colors. In some scenarios it really makes sense to see all your products together with all the colors, without any matching condition. So there are a few scenarios for the cross join, for example if you are doing simulations or testing. So this is how we do the cross join — and that's all about the cross join.
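A minimal sketch, with the same assumed tables:

-- Cross join: every customer combined with every order (Cartesian product)
SELECT *
FROM customers
CROSS JOIN orders; -- no ON condition; 5 customers x 4 orders = 20 rows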
And with that, we have covered the four advanced types of joins. Now you might ask: okay, how am I going to choose between all those types? How do you do it? Well, I'm going to show you the decision tree that I usually follow in order to choose the correct type. If I'm combining two tables and I want to see in the result only the matching data between them, then I use the inner join — we don't have any other type for that, so that's simple. But if I want to see everything, all the data, and I don't want to miss anything after joining the two tables, then I take a different path, and here I ask myself: is one side more important than the other? Am I interested in all the data from one table, from one side — do I have a main table, a master table? Then I use the left join. But if I want all the data from all the tables in my query — everything, with no table more important than another — then I go with the full join. And now the third path: if I'm interested in seeing only the unmatching data, because I'm doing some kind of checkup and so on — here again the same question: do I want the unmatching data from only one side? Is there one table that is important? Then I use the left anti-join: I want the unmatching data from one table, and I'm using the other table only for the check. But if both tables in my query are important — no main table and secondary table — then I use the full anti-join. So that's it; this is the decision tree I usually follow as I'm writing a query. And you might ask: how about the right join? Well, as you know, I don't have it in my decision tree at all — I don't use it.

Now, looking at this, I can tell you that if I check most of the queries I write, I very often use the left join. This is my favorite way to join tables, so let me show you exactly why. Usually I write queries in order to do data analysis, and in data analytics you always have a starting point — a topic that you are analyzing, like the customer. So you always have a master table, and I always start with the main table of my analysis. In my query I start from this table, table A, the main table. And then what happens? The data in this table is not enough; I need some extra data that comes from another table, table B. Table B is only there as additional data for the master table, so I use a left join to connect table B. Then I find other interesting information in another table, table C — and the same thing happens: I join it using a left join, and so on. I keep connecting multiple tables to this main table in the middle, and my query ends up looking like this: always left joins with multiple tables.

Now, of course, you might say: yeah, but sometimes you'd like to see only the matching data and so on, so it makes sense to use the inner join. Well, in order to do that, I can control everything I want to see in the final result using the WHERE clause. In the WHERE clause I define exactly what I want in the final result, and with that I get more flexibility over whether I want the matching or unmatching data and so on — like we did with the left anti-join, right? So as I'm analyzing data, I very frequently end up with this setup: I start from the main table, I left join all the other tables, and with the WHERE conditions I control the final result. This is how I connect multiple tables together.

If I want to visualize this with circles, it looks like this: we have circle A — the master table, the starting point — and I want all the data from table A. I left join it with another table, B, and from table B I want only the matching data; so that's the left join. Then I add another table, another circle, circle C, and from circle C we again want only the matching data. You can keep adding circles to this, but it's always the same thing: each new circle contributes only its matching data.
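As a generic skeleton — the table names a, b, c and the key columns here are just placeholders — the pattern looks like this:

-- One main table; everything else is left-joined onto it
SELECT *
FROM a                          -- the master table of the analysis
LEFT JOIN b ON b.a_id = a.id    -- additional data, kept only where it matches
LEFT JOIN c ON c.a_id = a.id;   -- more additional data
-- add a WHERE clause afterwards to keep or drop matching rows as needed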
So now, as we learned, we can use joins to combine multiple tables and get a complete, big picture about a topic like the customers — we would like to see everything about the customers in the final result. So either you do it like me, where you start from the main table and left join all the other tables, or maybe you say: you know what, there is no main table for the customer data, all the tables are equally important — then you can join all those tables using the inner join, if you are interested only in the matching data. What happens then? If you have those circles again: from A you need only the matching data, from B only the matching data, and the same for the third circle. You are interested only in the overlap between all three tables, so you get only the section where all three tables overlap. This is, of course, another way to join multiple tables.

Okay, so now, my friends, let's go back to SQL in order to practice joining multiple tables. Let's have a task — this one is going to be a little bit challenging. We will be doing multi-table joins using the SalesDB: retrieve a list of all orders along with the related customer, product, and employee details, and for each order display the order ID, the customer name, the product name, the sales, the price, and the salesperson name. So there is a lot going on here. The first thing you'll notice is that we are now using a different database: not MyDatabase, but SalesDB. So this is the first thing we have to do — instead of using MyDatabase, we say USE SalesDB and execute it. Now we are connected to SalesDB.

Now, if you read this task, a lot of tables are involved: we need the orders, the customers, the products, and the employees. So four tables are needed, and we need different things from each one. How do I think about it? Well, the task is mainly focusing on the table orders, right? We need all the orders; we cannot miss any order here. That sounds to me like the main table. And then it says along with that we need other details — which means the other tables are not as important as the orders. This gives me a feeling for what the main table is, and that is going to be my starting point.

So let's start with the table orders: SELECT star FROM — and here you have to pay attention that this database always has a schema. If you look at the left side, it's called Sales, so we write Sales dot and then the table name: Sales.Orders. Let's go and execute it. I know this is the first time you are querying this table; we have a lot of information here, and a lot of IDs. Those IDs are going to help us join our data with the other tables. So what do we need from here? We need the order ID — we have it over here, so we take it. Notice that the naming convention is different this time: we don't have underscores like before, the columns follow a different naming style, so be careful with that. What else do we need? We need the sales. If you look at the right side over here, there is a column called sales, and we include it in the result. All the other information is actually not needed, but I keep those IDs in mind for joining with the other tables. Now I'm going to give the table an alias, O, and prefix each column with it: the order ID comes from O, and the same for the sales.
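So the starting point looks roughly like this — the exact column names are as I read them from the table, so treat them as approximate:

-- Step 1: start from the main table
USE SalesDB;

SELECT
    o.OrderID,
    o.Sales
FROM Sales.Orders AS o;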
So that's it for now, and if I execute it, I get the orders and the sales. All right, that's all for the first table. Let's see what we need next: the customer's name. Well, we don't have this piece of information in the orders, so we have to go and explore the other tables in order to find this column. How do I usually do it? I write a simple SELECT for each table: the customers — and I repeat this for each table inside the database. So we have the customers, the employees, the orders, the orders archive, and the products. Now I start exploring. If I go to the customers over here, we see five customers, and we can see their names — the first name and the last name — and this is exactly what I need for my query. Of course, we have to connect this table with the orders, so we need a common column; usually it's going to be the ID. Here we have the customer ID, and if you query the orders, you find the customer ID there as well.

Now, if you are working on big projects, you're going to have a lot of tables, and exploring each one of them is going to be really hard. If your project has hundreds of tables, you can't go through every table one by one. Instead, a good project — a good database — usually has an entity relationship model, an ER model, like the one we have for this course. There you can easily find the tables inside your database and the relationships between them, and this is very important, especially if you want to join tables. By just glancing at this diagram, I can understand: okay, there is an ID called customer ID inside the table orders, and it is a foreign key to the primary key, the customer ID of the customers. That means if I want to connect the orders with the customers, I have to use the customer ID. So as you can see, this is really nice documentation, and I can quickly understand how to join the tables.

So now, back to our query. I'm going to say LEFT JOIN — with that I guarantee that all the orders will be present in the output, and I will always see my 10 orders. Let's join it with the table customers — Sales.Customers — and give it an alias, C. Now we build the join condition: the customer ID from the table orders equal to the customer ID from the table customers, so that SQL understands how to match the two tables. And now that the two tables are connected, I can get the information from the customers: the first name and the last name. Let's go and execute it. As you can see, we have a customer for each order, which is really nice. With that, we got the customer name and the order ID.

Next we need the product name. So either you go here and start exploring — I think it is inside the table products. And here you can see we have the product column; this is the name of the product. If you check our ER diagram, you can see we can connect the table orders with the products using the product ID. We have the product ID on the left and on the right, and now we can build this join as well over here.
So again I go with a left join — I don't want to lose anything from the table orders — LEFT JOIN Sales.Products, and we give it an alias, P. Now, for the condition, you have to be very focused: you want to match the product from the orders, so you say O dot product ID equal to the product ID from the table products. As you can see, in all these joins we are always joining with the table orders — we are not trying to join, for example, the customers with the products. We always join with the main table. With that we have connected the third table and we can get the information we need: the product, which I'm going to rename to product name. Let's go and execute it. And with that, my friends, I'm getting the product information from the table products. We have the sales — and we also need the price. If you look at the products, you can see there is price information too; I forgot about it, so let's get it from the same table: price. Execute it, and now we have the prices as well.

Now the last column: the task says we want the salesperson name, so the name of the employee. If you explore further, we have an employees table here — execute it, and you can see we have the first name and the last name of the employees, and an ID. So we need this ID in the orders as well. We have the product ID and the customer ID — we already used those two — but there is one more ID here, called the salesperson ID. Of course, it is not called employee ID, so you might be a little bit skeptical about it. That's why we check our ER diagram again, and as you can see, the employee ID from the employees is connected to the salesperson ID. Now I have a better feeling about it and I understand: I can connect the orders with the employees using the salesperson ID. So let's do that. I say LEFT JOIN — as you can see, I'm just doing left joins — Sales.Employees AS E, and for the condition, again very important, the main table is always part of the join condition: the salesperson ID is equal to the employee ID. With that we have connected the employees too, and we get their first name and last name as well. Perfect, that's it — let's go and execute it. And as you can see, guys, now we are getting the name of the salesperson.

Now here comes an issue. As you join multiple tables and take columns from different tables, what can happen? You might run into the scenario where you have the same column names in multiple tables. As you can see, we have the first name and last name from the employees and the first name and last name from the customers, and it's going to be really hard to tell from the result what we are looking at — is it the customer? Is it the employee? That's why, in this scenario, if you have the same names, you have to start giving aliases. For the first one we say customer first name, and for the last name, customer last name. The same for the employee: employee first name — or we could call it salesperson, whatever — and employee last name. If you execute it now, it's much clearer: here we are talking about the name of the customer, and here about the name of the employee.
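Putting it all together, the finished query looks roughly like this — the exact table and column names are as I read them from SalesDB, so treat them as approximate:

-- All orders with customer, product, and employee details
SELECT
    o.OrderID,
    o.Sales,
    c.FirstName AS CustomerFirstName,
    c.LastName  AS CustomerLastName,
    p.Product   AS ProductName,
    p.Price,
    e.FirstName AS EmployeeFirstName,
    e.LastName  AS EmployeeLastName
FROM Sales.Orders AS o
LEFT JOIN Sales.Customers AS c ON o.CustomerID    = c.CustomerID
LEFT JOIN Sales.Products  AS p ON o.ProductID     = p.ProductID
LEFT JOIN Sales.Employees AS e ON o.SalesPersonID = e.EmployeeID;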
And again, one more thing: if you don't use the table name or alias before the column, it's going to be an issue. For example, if I go over here, remove the prefix, and execute, you will see I'm getting an error. SQL can't understand what you are talking about — is it the first name of the customers or of the employees? — because you are not being specific. You have to tell SQL which table this column belongs to. It's very important to use the table name or the alias before the column name, especially if the same column exists in multiple tables. With the prefix back, we don't get an error anymore — and with that, we have solved the task.

You really have to pay attention to the join keys. You have to get the conditions right, because, as you can see, we now have a lot of tables and a lot of columns, and sometimes an issue sneaks in where you specify the wrong columns in the joins, and the result makes no sense at all. So always double check: are you using the correct keys to join the tables? With that you have solved the task, and this is exactly how I join tables: I always have a starting point, an important main table, everything else gets left joined, and if I want to remove any scenario from my results, I use the WHERE clause. This is how I join multiple tables.

Okay, my friends — with that you have now learned everything about how to join tables in SQL, and this is very important to understand. Moving on to the second method of combining your data from multiple tables: the set operators. We're going to cover how to combine the rows of multiple tables. So let's go. All right, my friends — as we learned before, in order to combine two tables we have two methods. If we want to combine the columns, we use the joins, and we have learned all those different types of joins, so that section is covered. But if we want to combine the rows of two tables, we use the set operators, and here we have four different types: UNION, UNION ALL, EXCEPT, and INTERSECT. So now we're going to deep dive into this world of combining the rows of tables using the set operators — and of course, in this course we're going to cover everything. Let's go.

All right, so let's have a look at the syntax of the set operators. Let's say we have the following query: we are selecting data from the customers — this is our first query, our first SELECT statement — and we have another one, very similar, where we are selecting data from the employees — our second SELECT statement. What we can do is put a set operator between those two queries, for example UNION. We can of course use any other set operator — UNION ALL, INTERSECT, EXCEPT, and so on. So as you can see, the syntax is very simple: we have two different queries, and we just put the set operator between them. This is what the syntax of the set operators looks like.

All right, friends, now we're going to talk about the rules of the set operators, and we start with rule number one: the SQL clauses. In each individual SELECT statement, we can use almost all the SQL clauses — WHERE, JOIN, GROUP BY, HAVING. There is only one exception: ORDER BY. You can use ORDER BY only once, and only at the end of the entire query. That means we cannot use ORDER BY in each SELECT statement; we can use it only once, at the very end of the entire query. All right.
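A minimal sketch of the shape — the column list here is just an example:

-- Two SELECT statements glued together by a set operator
SELECT FirstName, LastName FROM Sales.Customers
UNION
SELECT FirstName, LastName FROM Sales.Employees
ORDER BY FirstName; -- ORDER BY is allowed only once, at the very end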
So, about the syntax: again, here we have our two SELECT statements, and between them the set operator. In each query we can use multiple things — JOIN, WHERE, GROUP BY, HAVING — so we can make each query as complex as we want. Everything is allowed, but not the ORDER BY: the ORDER BY must always be placed at the end of the entire query. So if you want to sort the result by the first name, you use ORDER BY exactly at the end; we are not allowed to use ORDER BY in each individual query.

Okay, moving on to rule number two: the number of columns. The number of columns in each query must be the same. To understand this rule, let's take this very simple example. We select the first name and the last name from the table Sales.Customers — this is our first query, our first SELECT statement — and we have another one where we select the first name and last name, but this time from another table, the employees. With that we have our two queries, and I would like to combine them into one result, so we use the set operator UNION. Let's go and execute it. As you can see, in the result we get the first names and last names from the two tables, customers and employees. And it is working because we are fulfilling the rule that says the number of columns must be the same in both queries: how many columns do we have in the first query? Two, right? And in the second query, two columns as well. That's why everything works.

Now let's break the rule by adding another column to the first query. Let's say I would like to have the customer ID as well in the first query. With that, the first query has three columns, but the second has only two. Let's go and execute it. As you can see, we get an error saying that if you are using UNION, INTERSECT, or any of those set operators, you must have an equal number of columns between the queries. So this is the rule: the same number of columns. In order to repair it, I just remove the customer ID. So here again we have two columns in the first query and two columns in the second one, and everything is going to work.
Okay, moving on to rule number three: the data types. The data types of the columns in each query must be compatible — they must match. In order to check that, we go to the Object Explorer on the left side. Let's browse the customers and their columns: as you can see, we have the first name and the last name with the same data type, VARCHAR. And if you go to the employees, you see first name and last name as VARCHAR as well. So the first column of the first query is VARCHAR, and so is the first column for the employees; and the last name from the customers has the same data type as the last name from the employees. The data types match.

Now let's break this rule. Instead of the first name, I'll use the customer ID. Let's check the customer ID on the left side: it is an INT, an integer — but the first name is VARCHAR. So we have a mismatch between the data types. Let's try to execute it. We get an error saying that SQL is trying to convert the value Frank to an integer. What this means is that the first query always controls everything — the names and the data types. Here the first column is an integer, so SQL tries to convert the first name values to integers as well, and of course that will not work, because we have characters inside, and it cannot convert characters to an integer. So we have a mismatch between the customer ID and the first name, and that's why we get an error. The second column is not an issue, because it is VARCHAR in the first table and in the second table as well. In order to repair it, we either select the first name again in the first query, or we go over here and say employee ID — and with that, if I execute, we get no errors, because the employee ID is also an integer and the data types match. So as you can see, it's not enough to have the same number of columns; you also have to have matching data types between the two queries.

Okay, let's move to the next rule. Rule number four: the order of columns. The order of the columns in each query must be the same as well. Let's understand what this means. We have here again the same example, where we select the ID and last name from the customers and combine it using UNION with the employee ID and last name from the employees. Everything works, because we have the same number of columns and matching data types. Now let's break it: I'm just going to swap those two columns, so first I select the last name and then the customer ID. I still have the same number of columns, and the ID is an integer matching the employee ID, while the last names have the same data type. Let's go and execute it. Again, SQL throws an error and says it is trying to convert a value — one of the last names — to an integer: character to integer, which will not work.

So what happened here? I have the same information on both sides: an ID and a last name, and an ID and a last name. Well, SQL doesn't work like that. SQL maps the first column of the first query to the first column of the second query — so it maps the last name to the employee ID — and since they have different data types, SQL throws an error. SQL doesn't know how to match, let's say, the ID with the ID based on the column names or the content; it simply maps the columns by position: the first column of the first query with the first column of the second query. So as you can see, with this rule you must have the same order of columns — first the ID and then the last name — and with that it's going to work again.
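Here is a sketch of that broken example — running this against the two tables should fail with a conversion error:

-- Same columns, wrong order: SQL maps position by position,
-- so LastName (VARCHAR) meets EmployeeID (INT) and the query fails
SELECT LastName, CustomerID FROM Sales.Customers
UNION
SELECT EmployeeID, LastName FROM Sales.Employees;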
All right, moving on to rule number five: the column aliases. The column names that we see in the output — in the result — are defined and determined by the column names of the first query, the first SELECT statement. That means the first query is responsible for naming the columns in the output. Let's understand what this rule means. Again we have the same example: the customer ID and last name from the customers, UNION, the employee ID and last name from the employees. If you check the output closely, you can see the column is called customer ID, not employee ID — even though some of the values in it are employee IDs. As you can see, the first query controls the naming of the output: since the first column is called customer ID, you see it in the output as customer ID. The naming in the following queries is totally ignored.

That's why, if you want to give aliases to the output, you do it only in the first query. For example, I go over here and say that instead of customer ID, I would like to call it ID. If I execute it, you can see we get ID in the output. I don't have to give this alias in each query — I don't have to go to the second query and say, yeah, you are also the ID — because it's enough to define it in the first query. There's no need to repeat the same names in the following queries.

Let's take another example, where we give an alias to the last name — say last_name, with an underscore — but we do it in the second query. Let's execute. As you can see, the output still shows the original last name without the underscore, because an alias in the second query is totally ignored by SQL: it is not the first query, and the first query says you are last name, without an underscore. So again, if you want to do that, we take the alias and put it in the first query. Let's execute. So, my friends, the first query is the one that matters for naming the output — if you want to use aliases and rename things, do it only in the first query. And the first query controls the data types as well. All right.
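A quick sketch of this rule:

-- The first query names the output columns; later aliases are ignored
SELECT CustomerID AS ID, LastName FROM Sales.Customers
UNION
SELECT EmployeeID, LastName FROM Sales.Employees;
-- output columns: ID, LastName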
Now to the last rule: matching the correct information. If your query fulfills all the other rules and you don't get an error from SQL, that doesn't mean your result is accurate and correct. You are the only one responsible for mapping the information between the queries correctly, because SQL doesn't understand the content of your tables and queries. If you don't match the information correctly between the queries, you will get inaccurate, wrong results in the output.

So, back to our example. Let's say I would like to get the first name and the last name from the customers, and the same information from the employees. Let's execute. As you can see, it's very nice: we are getting the first names and last names from both tables in one result, and we are fulfilling all the requirements in SQL — same number of columns, matching data types, and so on. Now let's produce an incorrect result. What I'm going to do is simply swap the first name and last name in the second query: first the last name and then the first name. Let's execute. We still get results, because we are fulfilling all the other rules — same number of columns and matching data types, since the first column is character and the last name is character as well — so SQL just presents the result as you defined it. But the result is completely wrong: if you check the first column, the first name, we can see last names inside it. For example, Brown and Baker are last names, but they appear under first name. And the same in the last name column: Mary, Carol — those are all first names. So as you can see, the result has really bad data quality; we are mixing things up, and it doesn't make any sense. But SQL will not know that, because SQL doesn't know the content of your data — it is just mapping the data types: first name is VARCHAR, last name is VARCHAR as well, everything is fine, and you get results. So, my friends, you are responsible for mapping the same information between the two queries, and not getting an error from SQL doesn't mean the results are correct. Pay attention to the information you are mapping between the queries.

All right, so those are the rules of the set operators. First, ORDER BY can be used only once, at the end of the entire query. All queries must have the same number of columns, matching data types, and the same order of columns, and the first query always controls the names — the aliases — of the result set, as well as the data types. And the last rule: make sure you are mapping the correct information between the queries. Those are the rules of the set operators.

Okay, so what is UNION? UNION returns all distinct — unique — rows from both queries. That means it combines everything, and all the rows are going to be present in the output; but since it says all distinct rows, UNION removes all duplicates from the combined result set. UNION makes sure that each row appears only once.

All right, let's take this very simple example. We have two sets of data: the customers, where we have five customers with their first names, and another set, the employees, where we also have five first names. And if you take a look at the first names, you can see that we have the same persons as customers and as employees — we have Kevin and Mary in both sets of data. So how is SQL going to execute the UNION? It's going to return everyone from the customers and everyone from the employees, but since we have Kevin and Mary twice, the output will contain them only once. This is how UNION works: it returns everyone from the two sets, but without duplicates.
So now we have the following task, and it says: combine the data from employees and customers into one table. That means we want all the information from the employees and the customers together in one table. So which information do we need? This is the first question I usually ask myself, and to answer it, we first have to explore the data. SELECT star FROM Sales.Customers, and then a semicolon; then another query, SELECT star FROM Sales.Employees, and a semicolon as well. Why am I using two semicolons? Because I'm telling SQL that these are two separate queries — they have nothing to do with each other. If you execute it like this, you get two result grids in the output: the first grid for the first query and the second one for the second query. I just want to explore those two tables in order to understand how to map the information.

If we check the two tables, we see that both of them have IDs, so we could map those. Both of them have a first name and a last name as well, so I can map the first names and last names together. Now, in the customers we have a country, but we don't have this information for the employees, so we have to ignore it. And we also have a score, where there is no score for the employees. So I could map three pieces of information between the customers and the employees. But do we really need the IDs? It doesn't really make sense to have the IDs in the combined table — they are not unique anymore, because we have customer ID 1 and employee ID 1. So I think we can ignore them. The only two pieces of information that are really useful to map are the first name and the last name.

So let's add those two columns: we need the first name and last name, and the same information from the employees. But now we want everything in one query, so I remove the semicolons, and we have to use a set operator between the two queries. In order to combine the data we have two options, UNION or UNION ALL. The task doesn't mention anything about duplicates, so I'd like to go with UNION, in order to remove duplicates if there are any. That's it — let's execute. As you can see, the output has only one result, because we have one big query, and we have the first names and last names from the customers and the employees. One more thing, about the order of the queries: it doesn't matter whether we start with the employees or with the customers — we get the exact same results. But pay attention to the naming of the columns: the first query always controls the names. Since both queries use the same naming here, it's not a problem. So if I switch the two tables and run it again, we get the exact same results.

Now let's understand how SQL combined the data using UNION. We have the results of the first query and the second query — employees and customers — and we are combining them with UNION. The first step: SQL takes the columns of the first query, the employees, so first name and last name become the column names of the result. Next, it starts combining the rows of the two tables. First it takes the rows from the employees, checking whether there are duplicates in the data. As you can see, there are none, so we get the five employees. Then it starts adding rows from the second query, the customers, very carefully, without generating any duplicates. The first customer is not in the output yet, so SQL appends them to the result. The next customer is Kevin Brown — as you can see, we already have him in the result, so SQL will not add him again; otherwise it would generate a duplicate, so this customer is skipped. The same for Mary: we already have Mary in the result, so she is skipped. Then we come to Mark: we don't have Mark in the result, so SQL takes this customer and puts him in the output. And the last one, Anna — we don't have Anna in the result, so SQL adds her as well. And with this, SQL has combined the rows of the two tables, and we have eight persons. So as you can see, SQL combines the data, but very carefully, without generating any duplicates. All right, that's it — this is how the UNION operator works.
Okay, so now: UNION ALL. UNION ALL returns all rows from both queries. It's very similar to UNION — it combines all the rows, and everything is going to be present in the combined result set — but the big difference is that UNION ALL will not remove any duplicates. It is the only set operator that doesn't remove duplicates; it shows all the rows as they are. So if a row appears 10 times in the queries, you will find it 10 times in the output as well.

Now you might ask me: when should I use UNION, and when UNION ALL? There is one big difference between them: UNION ALL has way better performance — it's faster than UNION — and that's because UNION ALL doesn't perform the additional step of removing duplicates. So, my friends, that means: if you already know there are no duplicates in your queries — I know my tables, I know my queries, there are no duplicates — don't use UNION; always use UNION ALL, because you will get better performance. Another scenario for UNION ALL is when I actually want to see the duplicates: I'm doing data quality checks and I would like to see whether there are duplicates after combining multiple queries. In that situation I use UNION ALL as well.

Now we have again the same example: the customers and the employees, and the same persons, Kevin and Mary, as customers and as employees. If you combine the data using UNION ALL, it returns all rows, including duplicates. SQL executes UNION ALL like this: it returns everything from the customers and everything from the employees, and Kevin and Mary are present twice in the output. UNION ALL returns all the rows as they are from the two result sets, and if there are duplicates in the sets, we get duplicates in the output: Kevin exists twice in the output, and Mary twice as well. So this is how UNION ALL works.

All right, now we have a very similar SQL task, and it says: combine the data from employees and customers into one table, including duplicates. It's exactly like the last task, but this time the task says include duplicates. So we cannot use UNION; we have to use UNION ALL. We keep the exact same query — we select the employees' first and last names and the customers' first and last names — and instead of UNION, we say UNION ALL. Now pay attention: with UNION we previously got eight records, eight persons, in the output. Let's execute and check the results. As you can see, we now get 10 persons instead of eight, and that's because we have five customers and five employees, and there are duplicates inside the data — two of them. If you check, we have Mary here and Mary over here, and the same goes for Kevin: we have Kevin over here and over here as well. So we have duplicates inside the data, and SQL just combined the two tables.
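A minimal sketch of the same task:

-- Keep duplicates: UNION ALL just appends the rows (faster, no dedup step)
SELECT FirstName, LastName FROM Sales.Employees
UNION ALL
SELECT FirstName, LastName FROM Sales.Customers;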
Okay, so now let's understand how SQL executes UNION ALL in order to combine the data. Again we have the two query results, the employees and the customers, and SQL does the same first step: it takes the column names from the first query and puts them in the output. Then it takes all the employees and puts them in the output, without checking anything — so if there are duplicates in the data, they will be present in the output as well. It's very simple. Then it goes to the second step and appends all the customers to the output in the same way. That's it — it's very fast: it just combines all the rows from the employees and all the rows from the customers, and with that we get the 10 persons. And as you can see, we have duplicates in the data — Mary twice and Kevin twice as well. That's why UNION ALL is the fastest: it has no extra steps or checks; it just takes all the rows from all the queries and puts them in the output. All right, so as you can see, it's very simple. That's all for UNION ALL.

Okay, so what is EXCEPT? Sometimes we call it MINUS in other databases, but in SQL Server we call it EXCEPT. It returns the distinct rows from the first query that are not found in the second query. From this definition we can understand that the order of the queries affects the final result: there is a first query and a second query. It is the only set operator where you have to pay attention to the order of the queries. And like the others, it removes duplicates from the result set.

All right, again we have this very simple example: two sets, five customers and five employees, and the same persons — Kevin and Mary — appear as customers and as employees. Now we combine those two sets using EXCEPT, or MINUS as it's sometimes called. It returns the unique rows of the first table that are not in the second table. So what's going to happen? What is the first table? Let's say the customers, on the left side. Here we have five persons: Joseph, Mark, Anna, Kevin, and Mary. The rule is: we need the customers that are not employees. That's fine for Joseph, Mark, and Anna, because they don't exist in the second set — so SQL returns those three values. But for the two customers Kevin and Mary there is an issue: Kevin and Mary are members of the second set, the employees. That's why SQL excludes them from the output — they don't fulfill the rule. So in the output we get only three customers; all the values from the employees, and the values common to customers and employees, are excluded from the output. This is how EXCEPT works.

All right, let's have a very simple SQL task, and it says: find the employees who are not customers at the same time. Let's see how we're going to solve that. We stay with the same queries as before — the employees and the customers — but instead of UNION ALL, we use the set operator EXCEPT. Now, since we are using EXCEPT, we have to make sure the order of the queries is correct. The first query is the employees, which is correct, because we have to find the employees who are not customers at the same time — we are focusing on the employees. So the first table is correct, and the second table is the customers. If the task said find the customers who are not employees at the same time, then we would have to switch it and query the customers first. So everything is correct; let's execute it. And in the output we see three employees who are not customers at the same time.
We have Carol, Frank, and Michael. But as we know, there are five employees — Kevin and Mary are not in the result, because they are customers as well. Now let me show you what happens if I just switch the queries: we start with the customers and then the employees. Let's execute. As you can see, we get completely different results. Now we are getting customer information, and in the output we have three customers who are not employees at the same time. This is not what we want from this task, so if you do it like this, it's incorrect. Always pay attention to the order of the queries. So let's correct it: first the employees, then the customers. Execute.

Now let's understand how SQL executes the EXCEPT operator. Again we have the results of the two queries — the two tables — and we are doing EXCEPT between them. Let's see how SQL executes it. As usual, it first takes the column names from the first query, the employees, and puts them in the output. And SQL is going to present data only from the first query in the output; it uses the customers only as a check. SQL will not put any rows from the customers in the output — it just uses the second query as a lookup in order to check the data. So it starts with the first employee, Frank. Do we have Frank in the customers? No, we don't — so SQL accepts him and puts him in the output. Then it goes to the next employee and checks: as you can see, we have him already in the customers, so SQL ignores him — he is not allowed in the output. The same for Mary: we have her in the customers as well, so she will not be in the output. Then Michael: there is no Michael in the customers, so he can be in the output. And the same for Carol: we don't have Carol as a customer, so we get her in the output. So as you can see, we get data only from the first table, and the second table is used only for the check. There are no customers in the output — only employees.

Now let's quickly check what happens if we switch the tables, with the customers as the first table. SQL takes the columns from the first table and starts presenting the customer information in the output, using the employees only as a lookup. Do we have Joseph in the employees? No. Kevin and Mary — we have them in the employees, so they are excluded. And Mark and Anna are not part of the employees, so SQL presents them in the output. So as you can see, SQL is now focusing on the table customers, and we get data from the customers, not from the employees — the employees are only the check. With that we understand that the order of the queries is very important for EXCEPT: with a different order, we get different results. All right, so that's all for the EXCEPT operator.
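A minimal sketch of the task:

-- Employees who are not customers (the order of the queries matters!)
SELECT FirstName, LastName FROM Sales.Employees
EXCEPT
SELECT FirstName, LastName FROM Sales.Customers;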
Okay, so what is INTERSECT? INTERSECT returns only the rows that are common to both queries — something very similar to the inner join — and here, too, duplicates are removed, so there will be no duplicates in the output. All right, again we have this very simple example with five customers and five employees, and now we combine them using INTERSECT. What INTERSECT does is return the rows common to the two tables. How does SQL execute it? It's very simple: SQL searches for the common values. What are the common values? Kevin and Mary. So SQL returns only those two values, Kevin and Mary, and everything else is excluded from the result. It's very simple, right? It returns only the common values, and this is how INTERSECT works in SQL.

Okay, let's have this simple task, and it says: find the employees who are also customers. We keep the same queries, employees and customers, but instead of EXCEPT we use INTERSECT, since we are finding the common information between the employees and the customers — very simple and straightforward. Let's execute. And with that we get Kevin and Mary, the two persons who are employees and customers at the same time. And of course, here we don't have to pay attention to the order of the queries: it's the same as saying find the customers who are also employees. If you swap, for example, the customers with the employees, you will see that we get the exact same results. So it doesn't matter which query comes first — again, just remember that the first query defines the names.

So now let's understand how SQL executes INTERSECT behind the scenes. Again, our two tables, and now we are doing INTERSECT. As usual, SQL takes the columns from the first query, and then it finds the common data between the two results, row by row. We have the employee Frank — do we have him as a customer? No, so he will not be in the output. Kevin Brown — we have him in the employees and as a customer over here, so we get him in the output. The same for Mary: we have Mary as an employee and as a customer, so she is in the output. Michael and Carol are not customers, they are only employees — that's why we will not get them in the output. And the same goes for the customers: Joseph, Mark, and Anna are not employees, so they don't appear either. With that, we get only the common information between the two tables — the two queries — and it doesn't matter whether we start with the customers or the employees; we get the same information in the end. All right, that's all — it's very simple, right? This is how INTERSECT works in SQL.
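A minimal sketch of the task:

-- Employees who are also customers (here the order of the queries doesn't matter)
SELECT FirstName, LastName FROM Sales.Employees
INTERSECT
SELECT FirstName, LastName FROM Sales.Customers;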
All right friends, so now we come to the part where I'm going to show you how I usually use the set operators in my projects, for data analysis or for data engineering. Here are the most important use cases for the set operators. The first use case is combining similar tables before doing data analysis. In some scenarios we want to generate a report, and we end up writing similar queries on top of similar tables, and at the end we join all the results from those queries in order to present the final report. Instead of doing that, we can first combine all the similar information into one table, and then run one query, one piece of data analysis, on top of it in order to generate the report. And we can do that using UNION or UNION ALL. Let's have a few examples. So let's say that we have four tables: employees, customers, suppliers and students. As you can see, all of them share the same kind of information: they hold data about persons. Now let's say that you are generating a report that requires all the individuals in the organization, in the database. What you end up doing is writing an SQL query for the employees, another one for the customers, and as well for the suppliers and the students, and then you merge all the results from those queries into the final report. The issue with this setup is that you have a lot of similar queries; you have the same logic four times. What might happen is that you change the logic of the first two queries and forget later to do it for the other two, and you will get really inconsistent data in the reports. So instead of that, we can use the set operators to first combine all those tables into one big table. We're going to use a UNION to combine those four tables into the table persons: all the rows from the employees go into persons, all the rows from the customers, from the suppliers, and as well from the students; everything goes into one big table that holds all the information about the individuals that we have inside our database. And the next step, after we combine the data, is to write one SQL query in order to analyze this new big table, and the result is presented in the report. The advantage here is that we have only one SQL query for the data analysis on top of this table instead of having it four times. And if you change the logic of the SQL query, it is applied automatically on all the data that we have in the database. We have already done this example where we combined the data between the employees and customers. Another scenario where we have to combine data before doing any reporting: sometimes database developers tend to divide one big table into multiple small tables in order to optimize the performance, for example splitting the orders by the year, so we have orders 2022, orders 2023. Again, if you want to generate a report in order to analyze the orders over the years, either you write a query for each of those tables, or you first combine all those tables into one table called orders. So we use a UNION between all those tables in order to generate one central table called orders: all the rows from the first table, all the rows from the next one, and the last one; everything goes into one big table. And once we have the orders, we write the analytical SQL query on top of the orders in order to generate the report. So as you can see, it's a very important step to prepare the data before doing data analysis.
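A sketch of that recombination, assuming hypothetical yearly tables sales.orders_2022 and sales.orders_2023 with matching columns:

    -- Recombine tables that were split by year into one central result.
    -- UNION also removes any duplicate rows across the tables.
    SELECT order_id, customer_id, order_date FROM sales.orders_2022
    UNION
    SELECT order_id, customer_id, order_date FROM sales.orders_2023;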
Okay. So now let's have the following SQL task, and it says: the orders are stored in separate tables, we have the orders and the orders archive; now combine all orders data into one report, without duplicates. By looking at the task, we have to combine two tables, orders and orders archive, so either UNION or UNION ALL. But since the task says without duplicates, that means we have to go with UNION. Before we combine any data, though, we first have to understand the content of the orders and the orders archive in order to map the columns correctly. So first we have to explore the two tables. Let's start with selecting everything from orders, semicolon, and as well from the second table, sales.orders_archive, and as well a semicolon. Let's go and execute it. Now in the output we get two results because we have two separate queries: the first result is for the orders and the second one is for the orders archive. Let me just make it a little bit bigger. And now, as you can see, we have almost identical tables: we have the order ID, product ID, customer ID; everything looks identical. And of course we can check that using the Object Explorer on the left side: here we have the orders and those are its columns, and if you go to the orders archive you can see that it has the exact same columns. That means we can map all columns from orders to all columns of orders archive. So let's go and do that. I'm just going to remove the semicolons, and then we're going to use UNION. Now we have everything in one query. Let's go and execute it. Now we get in the output one single result, one single table, with all information from orders and orders archive. We have all orders in one table, and everything currently is matching. So with that we have solved the task: we have one result with all orders, we don't have any duplicates since we are using UNION, and we have combined the data. But now we have one issue: this solution, this query, is quick and dirty, and actually it's not following the best practices. The best practice here is to clearly list all the columns in each query, without using the star. All right, let's go and do that. Now we need a list of all columns from the table orders and the table orders archive. And since we have a lot of columns, what we're going to do is go to the Object Explorer, right-click on the table name, and then select the top thousand rows. Let's click on that, and we get a very simple select statement where we have all the column names from the table orders. This is what I usually do if I need all the columns in my select statement. Let's copy it, go back to our query, and replace the first star with those columns. And we're going to do the same thing for the orders archive, since it has the same names. Let me just make this smaller in order to see the query. So now we have a select for the table orders with all columns, and as well a select with all columns for the table orders archive. Let's go and execute it. And of course, we get the same results.
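The shape of that best-practice version, with a shortened column list standing in for the full one shown in the video:

    -- List the columns explicitly instead of SELECT *,
    -- so any schema drift between the tables shows up immediately.
    SELECT order_id, product_id, customer_id, order_date
    FROM sales.orders
    UNION
    SELECT order_id, product_id, customer_id, order_date
    FROM sales.orders_archive;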
Now you might ask why we are doing this. Why didn't we stick with the star? It's quick, it's simple. Well, for the following reason. Currently the status is that everything is matching; we have 100% identical tables. But what happens over time is that we do development in our solution, and we might change the schema of the table orders: we might rename stuff, add new columns, or maybe switch the columns. This means that over time the table orders will not be identical with the archive anymore, and that is of course a problem if you are mapping the data blindly using the star. Let me show you what I mean. Let's say that we are developing the orders and, for some reason, we just switch those two columns in the schema, so now we have the product ID first and then the order ID. Let's go and execute it. Now, if you are using the star, you will not notice this. But if you are listing the columns in the script, you see immediately that here we have first the order ID and then the product ID, and there we have the opposite. So listing the columns is much clearer than using the star. And as you can see in the output, we have a problem: here we have order IDs, and then suddenly we have something like the product ID. So we get incorrect data, which leads to incorrect analysis. The best practice here is to not use the star and to clearly list all the columns. Now, one more technique that I usually use once I'm combining data: I add the source of the data inside the query. What I mean by that: you can see that we have here two orders with the order ID one. They are not duplicates, they are completely different records, and that's because they come from different tables. So what I usually do is add the source of each record; it's really nice information for the analytics, for the users, to understand where these records come from. How are we going to do that? We're going to add, as the first column, a static value, let's say 'orders', and we're going to call it source_table. And we do the same thing in the second query, but the source table there is not the orders, it's the orders archive. So I'm just adding a static column to my query in order to see the source of the data, and now we have two different values. Let's go and execute it. And now you see we have created a new column called source_table, and it has only two values: orders and orders archive. Let's sort the data by the order ID, so ORDER BY order ID, and execute it. Now you can see it very clearly: the first order, order ID one, comes from the table orders, and the second one comes from the orders archive. So this is really nice information that you can add to your data once you are combining multiple tables.
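A sketch of that source-column trick, again with an abbreviated, assumed column list:

    -- A static literal in each branch tags every row with its origin.
    SELECT 'orders' AS source_table, order_id, customer_id
    FROM sales.orders
    UNION
    SELECT 'orders_archive' AS source_table, order_id, customer_id
    FROM sales.orders_archive
    ORDER BY order_id;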
So that's all about this use case on how to combine data between different tables. All right, now we have another use case for the set operators; this one is more for data engineers. We can use EXCEPT in order to find the delta between two batches of data. For example, data engineers build data pipelines in order to load daily new data from the source systems into a data warehouse or a data lake. In those data pipelines, we have to build a logic in order to identify what new data was generated by the source system, so we can insert it into the data warehouse. One way to do it is to use the set operator EXCEPT in order to compare the current data with the previous load. Let's have a very simple example. On day number one we have two customers, one and two. What happens on this day is that we load those two customers into the data warehouse, so in the data warehouse we get as well one and two. This is the first day; nothing crazy, we just load the data as it is. Now on the second day we get the new data from the source system, and it looks like this: if you check the second day, you can see that we have again the customer number one, which we already loaded into the data warehouse on the previous day, but we also have a new customer, ID number three. Now, in order to load only the new data, we don't need to load the customer number one again. So what can we do? We can do an EXCEPT between day number two and the previous load, day number one. If we simply do an EXCEPT between those two sets, we identify the new data that exists in the source system, which is only the record number three. So if we do EXCEPT between day two and day one, we get one record, the new record, and that's what we insert into our data warehouse. As you can see, the set operator EXCEPT is very powerful for comparing two sets, and not only for data analysis: we can use it, as you can see, in data engineering in order to identify the new data generated by the sources, so that we can insert it into our data warehouse. Okay, one more use case for the set operators that I personally use a lot in my projects: if you are doing data migrations, you can use EXCEPT in order to check the data quality, and more specifically the data completeness. So we have the following scenario where we are doing a data migration between two databases. Let's say that we would like to move a table from database A to database B, so we load the table into the new database. Now, what is very important after you move the data is to check whether all the records really did move from database A to database B, that we are not missing anything, not even one record. So we want to do a data completeness test, and there are many methods for this test. One of them is to use the set operator EXCEPT. How are we going to do it? We do an EXCEPT between the table from database A and the table from database B, in order to find any record that is still in database A and was not migrated to database B. And of course the best result is that we get nothing: the result should be empty. If we get an empty result, that means all the rows from database A exist in database B. But we are not done yet. We want to do the comparison the other way around as well: we want to find any rows in database B that we don't find in database A, because those two tables must be identical. So now we do an EXCEPT where the first table is from database B, and we compare it with database A, with the same expectation: the output should be empty as well. After doing the EXCEPT twice, for both sides, and getting empty results, we know those two tables are identical and we are not missing anything. So this is another amazing use case for the set operators, in order to improve the quality of your data migrations and to do a data completeness test.
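Two sketches of those checks; the table names (source_customers, dwh_customers, and the two migration databases) are hypothetical:

    -- Delta detection: rows in today's batch that the warehouse doesn't have yet.
    SELECT customer_id FROM source_customers
    EXCEPT
    SELECT customer_id FROM dwh_customers;

    -- Completeness test after a migration: both directions must return zero rows.
    SELECT * FROM migration_a.dbo.customers
    EXCEPT
    SELECT * FROM migration_b.dbo.customers;

    SELECT * FROM migration_b.dbo.customers
    EXCEPT
    SELECT * FROM migration_a.dbo.customers;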
Okay. So now let's have a quick summary about the set operators. The set operators combine the rows of multiple queries, multiple tables, into one single result. And we have four different types of set operators. The first one is UNION, which combines all the rows but without including any duplicates. The second one is UNION ALL; it's very similar, but it keeps the duplicates. The third one is EXCEPT, which shows all the rows from the first query that cannot be found in the second query. And the fourth one is INTERSECT, which shows the common rows between the two queries. And of course we have SQL rules for using the set operators: both queries should have the same number of columns, the same data types, and the same order of columns. And the last rule: don't forget that the first query controls the aliases, the names of the columns, and the data types of the entire result. And we have found amazing use cases for the set operators: for example, using UNION and UNION ALL to combine similar information into one big table, or using the amazing EXCEPT operator to compare two different results and find the differences between them. I usually use it for data quality checks, to test the data completeness. And another use case: as a data engineer, you can implement EXCEPT in the logic of your data pipelines in order to identify the new data that must be inserted into your system. Okay my friends, so with that we have learned all the set operators that we have inside SQL, and you have learned how to combine your data from multiple tables. So we are done with this chapter. Now we're going to move to the right side of the road map, and we're going to start talking about the functions in SQL. Here we have two big families: the first one is the row-level, or single-value, functions, and the second one is the aggregate and analytical functions. Let's start with the first one, the row-level functions. Here we can group them into multiple categories, and we will start with the string functions. But first, let's understand what functions exactly are and why we need them in SQL. So let's go. Okay. So what exactly is a function and why do we need it? Again, we have our data inside the table, and there is a lot of stuff that you can do with your data. Sometimes you have to change the values of your data, doing data manipulation, or you want to do some aggregations and analysis; maybe you want to analyze your data, find insights, and build reports. Sometimes you might find bad data inside your tables and you want to clean that up, so you do data cleansing, and sometimes you have to do data transformations and manipulations on your data in order to solve SQL tasks. And in SQL, in order to solve those tasks, we have functions. So what exactly is a function? It is a built-in code block that accepts an input value; the function processes this value and returns a result, an output value. You give an input, it does some transformation, and it gives an output. And we can group the functions into two big categories. The first one we call single-row functions: you give the function only one value, and in return you get one value. So the input for the function is only one single value, like Maria, and the output of the function is as well a single value. One value in, one value out. The other category of functions we call multi-row functions. For example, the function SUM accepts multiple rows, multiple values, like 30, 10, 20, 40; the function then summarizes all those rows and returns only one value in the output, the sum of all those values, which is 100. So the input is multiple rows and the output is one single value. Those are the two main categories of functions in SQL.
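A small sketch of the two categories side by side, assuming a sales.customers table with a first_name column and a sales.orders table with a sales column:

    -- Single-row function: one value in, one value out, for every row.
    SELECT first_name, UPPER(first_name) AS upper_name
    FROM sales.customers;

    -- Multi-row function: many input rows collapse into one output value.
    SELECT SUM(sales) AS total_sales
    FROM sales.orders;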
Now my friends, you have to understand something about functions: you can nest functions together. You can use multiple functions together in order to manipulate one value, and this technique exists not only in SQL but in any programming language. Let's have this example. We have the function LEFT; it extracts a few characters, let's say two characters. The input for this function, let's say, is Maria. This value enters the function, the function extracts the first two characters, and in the output we get only two characters, MA. So this is one function: we have an input and an output. Now you might say, you know what, we have multiple steps to do on this value. The first step, we want to extract the first two characters using the LEFT function. But we have a second step: we want to transform this output into lowercase characters. So we have another function, LOWER, and the input for this second function will be the output of the first function. MA is at the same time an output and an input for another function. The LOWER function takes this value and converts it into lowercase characters. It's like a factory: the materials get processed at multiple stations, and the output of one station is the input for the next station. This is exactly what we can do with functions. So how are we going to build that? The first step is to start with the first function; this is one simple function. For the next step, on the left side you write LOWER and put the whole thing in parentheses. So now the whole first function is inside another function, and with that you have nested one function in another. And of course, if you need a third function, for example LEN, the length, you put the whole thing again between two parentheses. That means the output of LEFT goes to LOWER, and the output of LOWER goes to LEN. It is very simple, and the order of execution always starts with the inner function: the LEFT function is executed first, then the outside function, the LOWER, and the last function to be executed is the LEN. This is how nested functions work, in SQL or in any programming language. Now my friends, in SQL we have a lot of functions, and that's why we group them as well into subcategories. If we are talking about the single-row functions, we have functions for the string values, and as well for the numeric values, the date and time, and as well functions in order to handle the nulls. And if we are talking about the multi-row functions, here we have basically two groups. The first one is the simple aggregate functions; those are the basics in order to aggregate your data. And we have another, advanced one: we call them the window functions, or sometimes the analytical functions. And now my friends, it is very important to understand those functions, because using them you can do whatever you want with your data. If I look at those two groups, the single-row functions are there to manipulate and prepare the data for the second group. So if you are thinking about data engineers and data analysts: the data engineers are going to prepare the data in SQL using the single-row functions.
You're going to use them in order to clean up, transform, and manipulate your data, to prepare it for the analysis. And if you are a data analyst, you will be using the aggregate functions in almost every task. So I really see it like this: the single-row functions for data engineers and the multi-row functions for data analysts. And my friends, what we're going to do in this course is visit each of those subgroups one by one, exploring the functions, understanding how they work and when to use them. Let's start with the first group, the string functions; here we're going to learn how to manipulate string values. So let's go. Okay. Since we have a lot of string functions, I'm going to divide them into categories based on their purpose. For example, we have a group of functions that manipulate the string values: we have concatenation, upper, lower, replace, and so on. Another group, where we have only one function, is where we can do calculations on string values. And the last group is all about how to extract something from a string value; here we have three functions: left, right, substring. So let's start with the first group, about the data manipulation, and the first function we have here is CONCAT. All right, so what exactly is CONCAT, or concatenation? It combines multiple string values into one value. If you have multiple things, you can put everything into one value. Let's have a very simple example. Let's say that you have one value, Michael, the first name, and a totally separate value for the last name, another column with a value like Scott. And now you say, you know what, it makes no sense to have the first name separated from the last name; I would like to combine them into one value. So you can use CONCAT in order to combine those two values, or multiple values, into one single value like Michael Scott. I think that pretty much sums it up. It is nicer to see the full name in one value instead of having two columns for that. So that's it; this is why we need concatenation. Now let's go back to SQL in order to try that out. Okay, so we have the following task: show a list of customers' first names together with their country in one column. That means we have to make a list of customers and combine two columns into one. So let's start writing the query: SELECT, we need the first name, the country, from the table customers. First, let's execute this. As you can see, we have a list of customers, but the issue here is that the first name and the country are in different columns, while the task says they should be in one column. So in order to combine those two things, we have to use the concatenate function, CONCAT. I'm going to start with the first argument, the first name, and then the country, like this. And we're going to give it a name; let's call it name_country. Now let's go and execute it. In the output you can see we have a new column called name_country, and we have both pieces of information in one column: we have Maria Germany, John USA. But it doesn't really look good, because there is no spacing between them. We can make some separation by just adding one more thing in between, like for example a space.
So now we are concatenating the first name, then a space, and then the country. Let's go and execute it. As you can see, now we have a nice separation between the first name and the country. And of course you can add different separators, like maybe a minus or an underscore, and you will get the same effect. So with that we have a list of customers where we have the first name together with the country in one column. As you can see, it's very simple; this is how you combine two columns into one. It is a really nice and easy transformation. Okay, so that's all about concatenation in SQL. Next we're going to talk about two functions: the upper and the lower. So what is the UPPER function? It converts all the characters of a string to uppercase; it makes everything capitalized. And the LOWER function is exactly the opposite: it converts everything to lowercase. Let's have a very simple example for those two functions. Okay, so we have three values with different cases: the first one where only the first character is capitalized and the rest is lowercase, then the same value but everything in lowercase, and a third one where everything is in uppercase. Now, if you apply the function UPPER to those three values, what happens? The first value turns into uppercase, so everything is capitalized, not only the first character. The second value turns as well into completely capitalized, so all the characters change. And the last value is already capitalized, so in the output you get the same value; actually nothing happens to it. So this is simply the UPPER. Now let's see what happens if you use the LOWER. For the first value, only the first character changes, and then you have everything in lowercase. The second value is already a lowercase value, so if you apply LOWER nothing happens; you get the same value. But for the last one, everything is capitalized, and if you apply LOWER, all the characters convert to lowercase. So my friends, this is very simple. Let's go back to SQL in order to practice it. Okay, we have the following task, and it says: transform the customers' first names to lowercase. As you can see, in the first names the first character is a capital and the rest is lowercase, and in this task we have to convert the whole thing into lowercase. Let's go and do that. It's very simple: we're going to say LOWER, first name, and let's call it low_name. That's it; let's execute it. Now, if you compare the low_name with the first name, you can see all the characters are now in lowercase. So that's it for the task: we have transformed the first name to lowercase. All right, the next task is exactly the opposite: transform the customers' first names to uppercase. So let's have a new column: we say UPPER, then the first name, as up_name. That's it, it's very simple; let's execute. Now you can see in the output we have a new column called up_name, and inside it we have the first name, but now with all the characters in uppercase. So this is how you convert the case to lower or to upper in SQL. Okay, so that's all about the upper and the lower.
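The three transformations from this section in one sketch (table and alias names are assumptions):

    SELECT
        first_name,
        CONCAT(first_name, ' ', country) AS name_country, -- combine with a space in between
        LOWER(first_name)                AS low_name,     -- everything to lowercase
        UPPER(first_name)                AS up_name       -- everything to uppercase
    FROM sales.customers;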
Next, we're going to talk about a very interesting function: the TRIM. The trim function removes the leading and trailing spaces in your string values; it gets rid of the empty spaces at the start and at the end of a string value. Let's have a very simple example. We have different scenarios. In the first one, you have a value like John without any spaces; this is the normal case. But sometimes you might have it with a space at the start, a leading space, an empty space, or as we sometimes call it, a white space. In another scenario, the space might be at the end of the word; here we call it a trailing space. And in another scenario you might have both of them, which is really bad: a leading space at the start and a trailing space at the end. And of course, you might not have only one space; you might have multiple spaces, depending on how long the user pressed the space bar, right? So my friends, those spaces are really evil, and it makes no sense to have them in your data. What you have to do is data cleansing: we have to clean up this mess, and you have the best function in order to clean up the data: the TRIM. If you apply TRIM to the first value, nothing happens, because everything is clean and we don't have any spaces. If you apply it to the second case, where you have a leading space, SQL removes this space. The same thing for the trailing space: if you have a space at the end, the trim function finds it and cleans it up. And if you have spaces at the start and at the end, that's as well no problem; it cleans them up. The trim function can also clean multiple spaces: if you have five spaces, ten spaces, at the end or at the start, the trim function cleans them all up. So this is how the trim works. Now let's go back to SQL in order to find out whether we have any spaces. Okay, now we have a very tricky and interesting task: find the customers whose first name contains leading or trailing spaces. By looking at those values, we have to find any spaces inside the customers' names. Just by looking at the results you will not find any white spaces, because they are really hard to see, especially trailing spaces. We have to write a query in order to detect any spaces in the names. So how can we do that? Think about it a little bit; I can give you a hint: you can use the function TRIM in order to remove any white spaces, and you have to use it inside a WHERE clause. So what we're going to do is say WHERE, and now we have to build a condition to detect any spaces. We say: the first name is not equal to itself, the first name, after applying a trim. If the trimmed first name is not equal to the first name, that means there were spaces. So again, what is going on here? Let's take Maria. If Maria has no spaces, trimming this value changes nothing; the value stays exactly as before, because there are no white spaces. But if there is any space inside it, the trimmed value will not be equal to the first name. So if the column is not equal to the same column after trimming, that means there are spaces. Let's go and execute it. And now we can see in the output that we have one customer, John, where we have this situation. Now, if you don't believe me, or you don't follow me here, we can have another, easier check.
Let's comment this out and have a look at our first names. We can calculate the length of the first name, like we have done before: LEN, name, and let's execute it. Now, you can see here Maria has five characters; but John has four characters, yet the length is five, and that's because we have a space somewhere, and the space counts as a character. So here something is wrong, right? You can check the others as well: everything is matching, but only with John we have an issue. Now, in order to see this more clearly, we're going to use two functions, the TRIM and the LEN. First, let's trim the first name, and after trimming the values, I'm going to calculate the length. So we are nesting the TRIM and the LEN together, and I'm going to call it len_trim_name. Let's go and execute it. Now we can see the length before trimming any value, and we can see the length after trimming. You can see over here that John before trimming is five, and after trimming is four. So we have an issue here. We can make things even clearer by subtracting the length of the trimmed first name from the length of the first name; we can call it maybe a flag or something. Let's execute it. Now, by looking at the flag, it is really easy to see: if we have a zero, then everything is fine, we don't have any white spaces; but if we have something higher than zero, like here, one, then this is an indicator that we have a white space. So either you do it like this, where the first name is not equal to the first name after trimming, or you use the more complicated solution, where you say WHERE, and I'm going to remove this from here, the length of the first name is not equal to the length after trimming. If you execute it, you will get again exactly John. So this is how we detect any empty spaces inside our data using the trim function, or maybe as well using the length; but I really prefer the first solution, it is way easier, using one function. All right, that's all about how to remove the empty spaces using the trim. Next, we're going to talk about a very important function called REPLACE. The replace function replaces a specific character: we have something old and we want to replace it with something new. Let's have a very simple example to understand it. Imagine we have a phone number where the data is split by a dash. Let's say that I don't like having the dash in my data; I would like to have a slash, or any other special character. In order to replace the dash, we can use the function REPLACE. We have to specify two things for SQL: the old value, the dash, and the new value, the slash. If you do that, in the output SQL removes all those dashes between the numbers and puts the slash between them instead. It's very simple, right? All you are doing is replacing an old value with a new value, and that's why we call it replace. But we can use this function as well in order to remove something, not only to replace, and you can do that by not specifying anything in the new value: just the single quotes, and with that it's nothing, a blank. What happens is SQL still replaces the dash, but with a blank, which means I'm just removing the dashes from the output. So if you do it, you will remove the dashes and you will get only numbers. And if the replacement is a blank, that means this function removes whatever value you specify. This is exactly how it works, and this is why we use the replace function in SQL.
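Two sketches: the space check from above, and the replace calls we'll try next (table and column names are assumptions; the phone value is the one used in the example):

    -- Detect hidden leading/trailing spaces: trimming changes such values.
    SELECT first_name
    FROM sales.customers
    WHERE first_name != TRIM(first_name);

    -- REPLACE swaps an old value for a new one; an empty string removes it.
    SELECT
        REPLACE('123-456-7890', '-', '/') AS slashed_phone, -- 123/456/7890
        REPLACE('123-456-7890', '-', '')  AS clean_phone;   -- 1234567890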
Now let's go back in order to practice. Let's do the same example; this time we're going to select from a static value, 123-456-7890. If you execute it, you can see we are getting the phone number. Now let's remove the dashes from this value. Let's have a new line and start with REPLACE. The first thing you have to specify for SQL is the value itself, so let's go and get the value; this is the first argument. The second argument is the old value, which is the dash. And the third argument is the replacement, and since we want to remove it, we don't want to replace it with anything: we will have just single quotes with nothing between them, not even a space. Now we can rename stuff: this is the phone, and this is the clean phone. Let's go and execute it. As you can see, in the output of the function we don't have any dashes between the numbers. And you can test things: for example, I can add a slash and execute it, and you will see slashes between the numbers. So you can try multiple things. This is one nice use case for the replace function. There is another use case for the replace function: sometimes in my data, file names are stored like, for example, reports.txt, and now let's say that I would like to change the file format from .txt to .csv. How are we going to do that? We go to a new line, say REPLACE, and the first argument is the value, so let's take our value from here. Now, what is the old value? It's the txt, and I want to replace it with another format, another extension: the csv. We say this is the new file name and this is the old file name. Let's go and execute it. And as you can see in the output, SQL replaced the txt with csv. This is as well where I use the replace function in my projects. So my friends, the replace function is really fun, and those are two nice use cases for it. All right, that's all about the replace function in SQL, and with that we have covered the whole data manipulation group. In the next group, we're going to talk about the calculations, and here we have only one function: the length, LEN. The length function is very simple: it counts how many characters you have in one value; you are calculating the length of a value. Let's have a very simple example to understand it. Let's say that we have the value Maria. If you apply the length function to it, what happens? It starts counting how many characters we have inside this value: the M is 1, a is 2, 3, 4, 5; in the output you get the number five. Five is the length, the total number of characters in this value. Now let's say that you have a number like 350. If you apply the length function, it still counts how many digits we have: the 3 is 1, the 5 is 2, 3; so the total length is three. You can apply it even to numbers, and not only that: you can apply it to a date value. So let's say that you have the following date: 2026-01-23.
SQL counts each digit, each character, even the dashes, not only the numbers; a dash counts as a character too, right? So the total length of this date is 10. You can apply the length function to any data type, and in the output you will always get a number. That's it; this is how you count the number of characters in any value. Let's go back to SQL in order to practice. Okay, now we have the task: calculate the length of each customer's first name. It is very simple: we apply the function LEN to the column first name, and we call it len_name. Let's execute it. And with that, as you can see, we are getting numbers in the output, and these numbers are the number of characters of each of our customers' names. So this is how we calculate the length, and that's it for this group. Now, moving on to the next one; it's going to be very interesting. We're going to talk about how to extract something from a string value, and here we're going to cover two functions: the LEFT and the RIGHT. The LEFT function extracts a specific number of characters from the start of a string value; if you want to get a few characters at the beginning of a value, you can use LEFT. The RIGHT function is exactly the opposite: it extracts a specific number of characters from the end of a string value; if you want a few characters from the end of your value, you use RIGHT. In order to apply the left or the right function, you have to give SQL two things: the value you want to extract a part from, and the number of characters, how many characters you want to extract; and this is the same for the left and the right. Now, let's say that we have again this value, Maria. If the task says: I would like to extract the first two characters, since we are talking about the starting position, we're going to use the LEFT function, and since it says two characters, we go with the two. It starts counting: M is 1, a is 2, and after that it stops, makes a cut, and returns the two characters, MA. We are counting from the left side going to the right side. Now, if your task says: extract the last two characters, here we are talking about the end position of your value, and for that we use the RIGHT function, since we are approaching from the right side; and since we want only two characters, the number of characters is two. This time it starts counting from the right side moving to the left side: a is one, i is two, and that's it; then SQL stops and extracts only those two characters, ia. So if you want to extract data at the starting position, you use the left; but if you want to extract characters from the end position of your value, then you use the right function. Now let's go back to SQL in order to practice. Okay, we have the following task: retrieve the first two characters of each first name. We just need the first two characters, and since we are coming from the left side, we can use the function LEFT. It's very simple: first name, and we need only two characters, so two. We're going to call it first_2_characters. Let's go ahead and execute it. Now you can see in the output we have two characters, MA. But with John we have only the J, because we have a leading space. Well, you can leave it like this, or you can transform it.
And then for Georg we have GE, and so on. So with that we are getting the first two characters. Now, in order to fix it for John, what we're going to do is say TRIM first and then apply the LEFT. With that, we get rid of all the white spaces and then we apply the LEFT, and with that everything looks perfect: for John we have JO. So this is how we can get the first two characters of a column. Now let's move to the next one. The task says: retrieve the last two characters of each first name. This time we need the last two, so we are coming from the right side. We're going to do it like this: we say RIGHT, first name, and then as well two, as last_2_characters. Let's go and execute it. And now, as you can see in the output, we have a new column with the last two characters from the first name: we have here ia, er, and for John it is working as well, and that's because we don't have any trailing spaces; but if you have any trailing spaces, then go and use the trim function. All right, that's all for the left and right, and now we go to the last function: the SUBSTRING. The substring extracts a part of a string at a specified position. This time we don't want something from the beginning or the end; we want something like in the middle. So we want to specify the starting position and extract a few characters from there. Let's have a very simple example to understand it. In order to use the substring, you need three things: the first one is the value itself, the value you want to extract a specific part from; then you have to specify the starting position, where SQL starts extracting the characters that you want; and as well SQL needs the length, how many characters to extract. Now let's say that we have the following task: after the second character, extract two characters. From reading this, you can see we have the starting position, after the second character, and the length, two characters. Let's take Maria. We have to specify the starting position, and we are saying after the second character: the first character, M, is one, then a is two; after two, we get position number three, right? So we start from r. That means we have to give SQL three, because the starting position is number three, the one after the two. And we want only two characters: the r and the i. If you give this to SQL, Maria, starting position three and length two, SQL extracts the two characters, ri. And this is exactly what we want: two characters after the second character. So with that, we didn't extract something from the left or from the right; we extracted at a specific position, and this is exactly why we need the substring. Now let's make it a little bit more difficult, where we say: after the second character, extract everything, all the characters. So not only ri, I would like ria. Nothing changes about the starting position, it stays at three. But now, if you look at this value and you want to extract everything starting from r, that means you have to specify a length of three. And this is not really good, because let's have another value in the same column: Martin. The starting position gives as well the r, but now the length is different: here we have four characters.
So the length is not three anymore, it is four, but you have to specify something for SQL at the end. You can go for four, and that's fine for Maria as well, but if you have a lot of values, it's going to be really hard to specify exactly the correct length. That's why, instead of specifying a static number like three or four, we can use another function. Now my friends, if you use the length function, you get the total number of characters, right? For Maria you get five, for Martin you get six. And those numbers are okay to use as the length, because they are more than what we need, and that's totally fine. If you say: okay, for Maria, start from the third position and cut five characters for me, SQL will find only three, but you will not get an error. You are extracting more than you need, and you will always get all the characters after the starting position. This is a little trick that we use in order to make the length dynamic, when we cannot find one static value that works in all scenarios. And now let's go back to SQL in order to practice the substring. Okay, we have the following task, and it says: retrieve a list of customers' first names after removing the first character. Don't ask me why, but for some reason we don't want to see the first character of the first names; we want to remove it. How can we do that? We cannot use the left or the right; we have to go with the substring, because it is a little bit more complicated. So SUBSTRING, and the first argument is the value, so it comes from the first name. The second argument is the starting position, and since the task says I want all the characters after the first character, that means we start from position number two. For example, with Maria, the first character, M, is position number one, and we want to start our substring from position number two. So that was the easy part. The next question is: how many characters do we want to keep? Do we keep four characters, like in Maria, where we have four characters after the first one? But in John we have only three, then the next one is four, and so on. Let's say you go with four, and let's call it sub_name; we make it static. What happens? It works for some scenarios, like Maria, where we get aria, and for Peter we are getting it too. But for Martin it is not working: we are not getting the last n, because it has five characters after the first one. And just by looking at the result, you can see we have an issue with John, and that's because the first character is an empty space. This is really annoying, and that's why we use the trim first, just to get rid of all those white spaces. And now you can see it's working fine: we are not getting the J, we have everything after the first character. Now, instead of having this static length, we make it variable: we use the length of the first name. With that, we make sure we always have enough length to extract, and this works for any value inside the first name, even if the name is like 20 characters. Let's go and execute. Now you can see that for Martin it is working: we have here five characters after the M, here we have four characters after the M as well, and here we have three characters after the J. So it is working completely, and it is fully dynamic. This is the trick: using the length together with the substring.
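A sketch putting the three extraction functions together; note how TRIM and LEN are nested inside the calls, which is exactly the nesting idea from earlier (column and alias names are assumptions):

    SELECT
        first_name,
        LEFT(TRIM(first_name), 2)                       AS first_2_characters, -- from the start
        RIGHT(first_name, 2)                            AS last_2_characters,  -- from the end
        SUBSTRING(TRIM(first_name), 2, LEN(first_name)) AS sub_name            -- everything after the first character
    FROM sales.customers;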
And as you can see, now we are using three functions in one go: we have the LEN, we have the TRIM, and we have the SUBSTRING. And this is what happens in SQL: we use multiple functions together in order to solve complex tasks. So this is how you extract a substring from a string. All right, that's all about the substring, and with that we have covered a lot of very important string functions in SQL; now you have enough tools in order to manipulate the string values in your data. Okay my friends, so with that we have learned how to manipulate string values inside SQL using the string functions. Now we move to the second group: you will learn how to manipulate the numbers, the numeric values. So let's go. Okay, let's have this example: 3.516. Now let's say that you want to apply the function ROUND, and you are using two decimal places. What happens? It keeps only two digits after the decimal point, the five and the one, and the third digit after the decimal, the six, decides whether the number rounds up or stays as it is. Since six is higher than five, SQL rounds the number up: instead of 51 we get 52, and after that the third digit resets to zero. So in the output you get 3.52. Now let's say that you round to only one decimal place. It keeps only one decimal place, the five, and this time the second digit decides whether we round up or not. Since one is less than five, there is no need to round up, and the five stays as it is; it will not turn into six. There is no round up, and the digits after the five reset to zero, so we get 3.5. Now let's say that you use round with zero, meaning: I don't want to see any digits after the decimal point. SQL checks the first digit after the decimal point, the five; this one decides whether the three turns into four or not. And since we have a five, it is good enough to round the number up, because five or anything above five rounds the number up. That's why it rounds up, and SQL returns four at the end, with all the digits after the decimal point reset to zero. This is exactly how the round function works in SQL. So now let's see how we can do that in SQL. Okay, let's go and practice the number functions. We're going to write a SELECT, but this time we will not select any data from the database; we're going to practice using a static value, for example the value 3.516. Let's execute it; with that we have this decimal number. Now let's start practicing the round function. Let's round this number, 3.516, and this time we are rounding to two decimals; let's call it round_2 and execute it. As you can see in the output, we are rounding to two decimal places, and we have the two because, as we learned, the six rounds it up. Now let's do the same thing for one: round one, execute. And as you can see in the output, we are rounding to one decimal: we have the five and everything else is zero, and we don't have a six here, because the one is lower than five and it will not round up the number. And let's round by zero.
It is rounding to an integer, to the four, and all the decimal digits are zero; we have four because we have a five, and five rounds the number up. So as you can see, it is really nice, and this is how we round numbers in SQL. Now there is another number function which is really cool, called ABS, the absolute: it converts any negative number to a positive one. Let me show you what I mean. Let's say we have minus 10; this is a negative number. But if I say ABS, the absolute of the minus 10, what do I get? I get a positive number. It gives us the absolute value of any number, or in other words, it converts the negative to a positive. And if the number is already positive, nothing happens: if I take the absolute of 10, I get as well a 10. So this is a really nice and cool function that is important for transforming numbers in many scenarios; for example, if you have mistakes in your database, like negative sales: it makes no sense to have sales that are negative, so in order to correct the data, we can use ABS to convert all the negative numbers to positive. This is a really nice, cool and easy function to learn. All right my friends, that's all for the numeric functions; we have covered two very simple functions. Now, in the next topic, we have a lot of functions about how to manipulate the date and time in SQL. So let's go. So, what is a date? If you take a look at a calendar and you pick any date, for example August 20th, 2025, this date could represent an event like a birth date, or a project deadline at your work, and it has mainly three components. The first part is a four-digit number indicating the year. The next component is the month; normally we represent the month with a number between 1 and 12. And the last component is the day; this is a number between 1 and 31, depending on the month. In databases, we call this structure of those three components a date. This is what we mean by dates in SQL. All right, so now let's move to the next one: what is time? Time refers to a specific point within a day, like for example 18 hours, 55 minutes, and 45 seconds. This structure has as well three components. The first one we call the hours; it is a number between 0 and 23, indicating the hour of the day. The next one is the minutes; this is a number between 0 and 59. Moving on to the last component, we have the seconds; this is again the same thing, a number between 0 and 59. This structure with those three components we call, in databases and SQL, a time. So this is what we mean by time. Now to the last type: if you combine the date together with the time and put them side by side, you get a new structure with a new name in the databases, and we usually call it a timestamp. This name is used in many databases, like Oracle, Postgres and MySQL, but in SQL Server we have another name for it: we call it datetime. Again, it's very simple: the datetime, or timestamp, has the date information together with the time information. In this example we have six components, from left to right, and there is a hierarchy in this structure: we start with the highest, which is the year, then we have the month, the day, and then we continue to the hour, the minutes and the seconds. So those are the three different types of date and time information in SQL: the date alone, the time alone, or both together in the datetime.
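Before going deeper into dates, here is a compact sketch of the two numeric functions from the previous section; the commented results follow the rounding rules just described:

    SELECT
        ROUND(3.516, 2) AS round_2,      -- 3.520: the 6 rounds the 1 up to 2
        ROUND(3.516, 1) AS round_1,      -- 3.500: the 1 is below 5, no round up
        ROUND(3.516, 0) AS round_0,      -- 4.000: the 5 rounds the 3 up to 4
        ABS(-10)        AS abs_negative, -- 10: negative flipped to positive
        ABS(10)         AS abs_positive; -- 10: already positive, unchanged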
All right, let's explore now the data that we have inside our database, searching for date and time information. Let's go to the table orders: if you expand it, you will find two columns with the data type date. We have the order date with the type date, and as well the shipping date with the type date. And if you check the last column, the creation time, this one is DATETIME2. Now let's query this information in order to understand the structure. I'm just going to select the order ID, the order date, the ship date and the creation time from sales.orders. Let's go and execute it. Now, if you check both the order date and the ship date, you can see that here we have only the date information, nothing about the time: we have a year, a month and a day, and that's why they have the data type date. Now let's check the creation time: not only do we have the date information, but we also have the time information. It starts with the date information, year, month, day, and then we have hour, minute and seconds, and then fractions of a second, milliseconds and so on. So this is how the datetime, or timestamp, looks in databases, and this is how the date looks. All right my friends, in SQL I can say that we have three different sources for dates in our queries. The first one is dates that are stored inside our database, like we saw here in those columns: the order date, the shipping date, the creation time. All those are columns that hold this information, and they are stored inside our database. This is the first source of dates that we can get inside our queries. Let me just remove this stuff and stick with the creation time; let's execute it. So this is date and time information stored inside our database. The second type is a hard-coded date string that we can use inside our queries. Let me show you an example. If we go to a new line, I can define a date like this: '2025-08-20', so 2025, August 20th. With this string we have hard-coded a date that is static for all rows. Let me just call it hardcoded and execute it. Now we can see in the output that we get a static date for all rows; it is the same for every row inside our table. This value is not stored inside our database: I just added it to our query and hard-coded it. Sometimes in queries we define dates like this, to be used later in calculations and so on. Now, the third source of getting dates inside our query is using the function GETDATE. GETDATE is the first and most important date function that we use in SQL: it returns the current date and time at the moment of executing the query. Let's try it out. I'm going to get a new line: GETDATE, it's very simple, it doesn't accept any values inside the function, so the parentheses stay empty. Let's call it today. All right, let's execute it. And of course, you're going to get different results than me, because GETDATE returns the date and the time at which I'm recording this video: currently it is July 18, 2024, and I'm recording this around 20:00. As you can see, this value is as well repeated for each row; we always get the same value. So again, this depends on the moment of execution of the query.
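The three sources side by side in one sketch (the column names follow the orders table described in the video):

    SELECT
        order_id,
        creation_time,              -- 1) a date/time column stored in the database
        '2025-08-20' AS hardcoded,  -- 2) a hard-coded date string, static for all rows
        GETDATE()    AS today       -- 3) current date and time at query execution
    FROM sales.orders;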
So during the tutorial you're going to learn a lot about GETDATE, and we're going to use it in a lot of functions. So those are the three different sources of getting date information inside your query: either from a column inside our database, or hard-coded using a string, or — the third one — using GETDATE in order to get the current date and time at the moment of the query execution. Nice. Now we have a clear understanding of what date and time are in SQL. The next question is how to manipulate this information using SQL functions. Okay, we have our date, August 20th, 2025. One of the things that we can do with the date is extract different parts of it. For example, if we are interested only in the year, we can extract only the year part. Or if you are interested in the month, you can extract the month and you will get August. And of course, we can extract the day and we will get the 20. So this is the first thing that we can do: we can extract the parts of the date. Another thing we can do is change the date format. So instead of having a small minus between those date parts, we can split them using a slash. We can even start first with the month, August, then 20, the day, and then the year, but having only the short form of the year, 25. Or we can change the format where we say we don't need any special character, we just leave a space. So as you can see, we are changing and manipulating the format of the date. As another category of tasks, we can do date calculations. We can take our date and add, for example, 3 years to it, or we can find the difference between two dates — like we are doing a subtraction — and we get, for example, 30 days. So we can add things, subtract things, or find differences between two dates; it's like we are doing calculations on the date. Now, the last thing that we can do with a date is test it, or validate it: whether it is a real date that SQL understands. So we can put it to the test, and in the output we get true or false, zero or one. So as you can see, we have different ways — let's say categories — of how to manipulate our dates in SQL. Now we're going to group the different date and time functions under four categories. The first category, and the most important one, is part extraction; here we have around seven different functions that we can use for this task. Another category is format and casting, and here we have three different functions: underneath this category we have FORMAT, CONVERT, and CAST. Then the third category is date calculations; we have two functions, DATEADD and DATEDIFF. And the last category is validation; here we have only one function, called ISDATE. So as you can see, we have a lot of SQL functions — 13 date and time functions that we're going to cover in this tutorial on how to manipulate date and time information in SQL — and this is how we can group them into four different categories. Let's start now with the biggest category, the part extraction; we're going to cover all those seven functions in detail.
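Before we go function by function, here is a tiny preview sketch with one representative function from each of the four categories, just to show where we are heading (all of them are covered in detail below):

    SELECT
        DATEPART(year, GETDATE())              AS PartExtraction, -- category 1: extract a part
        FORMAT(GETDATE(), 'dd/MM/yyyy')        AS FormatAndCast,  -- category 2: format / cast
        DATEDIFF(day, '2025-01-01', GETDATE()) AS Calculation,    -- category 3: date calculations
        ISDATE('2025-08-20')                   AS Validation;     -- category 4: 1 = valid date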
All right, friends, now we're going to cover three very easy, quick functions in SQL to extract the parts of a date. They are very simple: the DAY function returns the day from a date, in the same way MONTH returns the month from a date, and — guess what — YEAR returns the year from a date. Okay, so in order to understand how they work, we have a date like this one: 2025, August 20th. Sometimes you are not interested in the whole date; you would like to get only a part of it. So you use the function DAY in order to extract the two digits 20. In another scenario you might be interested in the month information, so you would like to get the two digits 08; we can use the function MONTH in order to extract the month information, the August, so 08. And there is one more situation where you want only the year information, so you are interested in the four digits 2025; you can use the function YEAR to extract it, and in the output you will get 2025. So it's very simple — this is how those three functions work. All right, now let's check the syntax of those three functions. It's pretty easy. We always have it like this: a keyword called DAY — this is the function name — and then it accepts only one parameter, the date. The same goes for the others: we have a function called MONTH, and it also accepts only one parameter, the date, and the same for YEAR. So the syntax is very straightforward: it accepts only one value, the date, and the function name is the name of the part that we want to extract. All right, so now let's try out those functions. I will be working with the column creation time. Let's try, for example, extracting the year from the creation time using the YEAR function. It's going to be very simple: YEAR and then creation time, like this, and let's call it year. That's it — let's go and execute it. Now as you can see, it's very simple: we have only one year, 2025, from the creation time. So with that we got a new column where we have only the year information inside it, and this information comes from the creation time; we have only 2025. Now let's do the same for the month. We're going to have the same thing, MONTH of the creation time, and let's call it month. Let's execute it. As you can see in the output, we get the number of the month: here we have January, February, and March, and this information is extracted from the creation time as well. And the same goes for the DAY function, so let's use that: DAY of the creation time, and we call it day. Now in the output we have the day part from the creation time — here we have 1, 5, 10 and so on — and all this information comes from the creation time. So as you can see, those three functions are very simple and quick for extracting parts from a date or datetime.
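Consolidated, the three calls look like this — a minimal sketch against the same Sales.Orders table:

    SELECT
        CreationTime,
        YEAR(CreationTime)  AS [Year],   -- e.g. 2025
        MONTH(CreationTime) AS [Month],  -- e.g. 1, 2, 3
        DAY(CreationTime)   AS [Day]     -- e.g. 1, 5, 10
    FROM Sales.Orders;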
All right, so what is DATEPART? DATEPART returns a specific part of the date as a number. Now, back to our example: we have learned how to extract the day, month, and year, but of course a date holds more information that we could extract — not only those three. We could extract, for example, the week, or the quarter. All this information is stored in the date as well; we cannot see it as a value, but inside SQL you can extract the week and the quarter. We just don't have a dedicated function for those parts, because they are not as commonly used as the year, month, and day — but we can still extract this information using DATEPART. For example, we can say DATEPART and specify the part as week, and with that SQL returns, for this example, 34. And maybe in another situation you are interested in the quarter, right? So you can specify it like this: DATEPART quarter — we are interested in the quarter part — and in the output you get 3. So this is exactly the power of DATEPART: you can extract way more parts than are directly visible in the date. And one more thing to notice: DATEPART, as well as YEAR, MONTH, and DAY, always generates an integer — a number — as output. So we have 3 for the quarter, 34 for the week, 20 for the day, 2025 for the year, and so on. All of those values are integers; integer is the data type of the output of these functions. Okay, so let's have a look at the syntax of DATEPART. It starts with the function name, DATEPART, and it accepts two parameters. The first one is the part that we want to extract — so we define what we want: the month, the day, the year, and so on — and the second parameter is the date itself. Let's have an example: we can say DATEPART and we would like to extract the month from the order date. So the part is the month, and the order date is the date that we want to extract from. With that we are specifying the part as month. Now, in SQL there is another way to specify the parts: we can use an abbreviation. If, instead of writing the whole word month, you write mm, you will get the same result. It's like an abbreviation, a shortcut for writing scripts. But I rarely see that in implementations; I always tend to write it out completely, like month, because it's more standard if you are switching between different databases. So as you can see, it's very simple: you have to give SQL two things — which part you want to extract, and the date that you want to extract it from. Okay, so now we're going to extract different parts from the creation time using DATEPART. Let's start, for example, by extracting the year again: DATEPART, then we have to specify which part we need — we're going to write year, like this — and then the next one is the value, so it's going to be the creation time. Let's call it year date part, and let's go and execute it. Now in the output you can see we got the years again, extracted from the creation time. It's identical to the YEAR function — there is no difference between them; both of them are integers holding the year information. Now we can try different parts. For example, let's copy the whole thing and extract the month: you go over here, change it to month, rename it, and execute. In the output you see we get the months as well — identical to the function MONTH. And the same thing for the day: we are just changing the part, and in the output we are getting that part. So here we have the days as well; it is identical to the DAY function.
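As a sketch, those first three DATEPART calls side by side — they return exactly what YEAR, MONTH, and DAY return:

    SELECT
        CreationTime,
        DATEPART(year,  CreationTime) AS Year_DatePart,   -- identical to YEAR()
        DATEPART(month, CreationTime) AS Month_DatePart,  -- identical to MONTH()
        DATEPART(day,   CreationTime) AS Day_DatePart     -- identical to DAY()
    FROM Sales.Orders;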
So far we don't have anything new from DATEPART, because we got it already from the other functions. But now we're going to extract other parts that are not year, month, and day. For example, let's get the hours. So we have DATEPART, and here as the part you say hour, and let's call it hour as well. Let's go and execute it. Now you can see in the output we have a new dedicated column that shows only the hour information — we have here 12, 23, and so on — and this information comes from the time. In the same way you can ask for minutes and so on. But now let's get something interesting, like the quarter. Let's duplicate it, and instead of hour let's get quarter. This information is not displayed in the creation time, but SQL can extract it. So let's call it quarter and execute it. Now as you can see in the output we have one new field called quarter, and inside it we have a one everywhere, because all those dates are in the range of quarter one. As you can see, this is of course amazing for reporting and analysis. Let's try something else, like the weekday: we duplicate the quarter one, change the part to weekday, and rename the column to week day as well. Let's go and execute it. All right, now let's get something else, like the week. I just duplicated it over here; instead of quarter, let's write week — I would like to get the week number. Let's go and execute it. Now in the output, as you can see, we got a dedicated field that shows us the week number from the creation time. We can see these dates come from week number one, those two come from week number two, and so on. So that's it. As you can see, guys, all the values that you are getting from DATEPART are numbers, and now we can extract way more information than only the year, month, and day — even information that is not displayed directly in the field itself, like the quarter, the week, and so on.
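A sketch of those extra parts in one query:

    SELECT
        CreationTime,
        DATEPART(hour,    CreationTime) AS [Hour],     -- e.g. 12, 23, ...
        DATEPART(quarter, CreationTime) AS [Quarter],  -- 1 for Jan-Mar dates
        DATEPART(weekday, CreationTime) AS [Weekday],  -- a number, e.g. 1-7
        DATEPART(week,    CreationTime) AS [Week]      -- week number within the year
    FROM Sales.Orders;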
All right, so now we have a function that is very similar to DATEPART: we have DATENAME. The only difference here is that it returns the name of the date part. So now, back to our example. We have learned that we can extract different types of parts from one date, but we learned as well that all of them are numbers. What if we would like to extract the name of the month? So instead of eight, I would like to get the name of the month, like August. Or instead of the 20, I would like to get the day name — here in this example it's going to be Wednesday. In order to get the name of the parts, we have to use the function DATENAME. For example, if you use DATENAME with the part month, you will not get eight in the output; you will get the full name of the month, August. So as you can see, we are getting a string, a full name. And the same thing if you use DATENAME for the weekday: you will not get 20 like the DAY function; you will get the name of the day, Wednesday. Here too the output is a string. So it's very simple: we are using DATENAME in order to get the name of the parts, and the data type of the output here is a string, not an integer. So as you can see, here we have different functions that are all doing the same job — extracting parts from one date. Okay, now checking the DATENAME syntax: it's going to be identical to DATEPART — we are just switching the function name. It needs you to define the part as well as the date. The only difference is that we are getting a different data type in the output: here we are getting a string instead of an integer. All right, so now let's try DATENAME. It is very similar to DATEPART, and we're going to work with the creation time as well. We're going to say DATENAME, and then after that we have to define the part. Let's go, for example, with the month, and our field is as usual the creation time, and let's call it month date name, like this. That's it — let's go and execute it. Now, if you go to the output over here, you can see we have the month, but this time we don't have numbers; we have the full name of the month. We have January, February, March instead of 1, 2, 3. So this is the big difference between DATENAME and DATEPART: with DATEPART you get numbers; with DATENAME you get the name of the part. Let's do the same thing for the day — we would like to get the name of the day. I'm just duplicating it, but now, in order to get the full name of the day, we cannot go with day as the part; we have to go with weekday. So that's it — I will call it week day. Let's execute it. Now as you can see in the output we have a new column called week day, and inside it we have the name of the day instead of a number: here we have Wednesday, Sunday, Friday, and so on. Now, what happens if we do use day as the part? Let's try that out. This is the day of the month, and of course the day of the month has no name, so SQL is going to return the numbers again — you can see 1, 5, 10, 20, and so on. But there is still a difference between the day from DATENAME and the day from DATEPART. With DATEPART we are getting integers, so if you store this information in a new table, it's going to be stored as an integer. But the day that you are getting from DATENAME is a number that will be stored as a string value: the data type of those numbers is a string, while the data type of the day from DATEPART is an integer. And the same thing happens if you extract, for example, a year — there is no full text for a year. So let me just do it like this: if we say year, you will not get a name of the year; you still get the digits, but the data type here is a string. So that's it — this is the difference between DATENAME and DATEPART: for the month and weekday you get the full name; for the other parts you get numbers, but with the string data type. Now, the most important thing about DATENAME is presenting easy-to-read, human-readable information to the users. Imagine you are building a report called sales by month and you show the user the months as numbers, 1, 2, 3 up to 12. This is of course okay, but it is much nicer if you present this information as full text. So you go with DATENAME in order to show, instead of one, January, February, March — the full name of the month. This is going to look way nicer in reporting for the users. So this is the core use case of DATENAME.
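Here is the whole DATENAME demo in one sketch — note that every output column is a string:

    SELECT
        CreationTime,
        DATENAME(month,   CreationTime) AS Month_DateName,    -- 'January', 'February', ...
        DATENAME(weekday, CreationTime) AS Weekday_DateName,  -- 'Wednesday', 'Sunday', ...
        DATENAME(day,     CreationTime) AS Day_DateName,      -- digits like '20', but as a string
        DATENAME(year,    CreationTime) AS Year_DateName      -- '2025', also a string
    FROM Sales.Orders;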
So what is DATETRUNC? DATETRUNC is going to truncate the date to a specific part. Let's understand what this means. Okay, now let's check the syntax of DATETRUNC. It's going to be exactly the same as DATEPART and DATENAME: you have to define the part and the date that you want to work on; the only thing that is different is the function name. So as you can see, all those three functions have the same structure: you provide which part you want — month, day, week, hour, minutes, and so on — and the date or datetime value, and with DATETRUNC we get a date or datetime back in the output. Okay, so now let's understand exactly how DATETRUNC works. We have the following datetime, and as we learned, there is a hierarchy: we start with the highest, the year, then we move to the month, day, hours, minutes, and seconds. Looking at this value, it is very precise — we know the exact second of this information, right? The level of detail here is very high: we know the seconds of this event. Now, DATETRUNC allows us to change this level of detail by specifying a level. Take, for example, DATETRUNC minute: we are saying we are interested only down to the minute level; we are not interested in the seconds. So what happens? Everything between the year and the minutes is kept — that means all that information will not be changed — but the seconds are reset. We are not interested in the seconds anymore; that is too detailed for us, so SQL resets the seconds to 00. We are saying the minimum level is the minutes, and we don't care about anything below it, the seconds. Now let's say: you know what, the minutes are too detailed, I would like to be at the hour level. So we specify DATETRUNC hour. Here things change: we keep the information between the year and the hours, and anything after that is reset. So now minutes and seconds are in the reset range, and SQL resets the 55 to 00. The level of detail is a little bit lower now: we only know the information down to the hours, and we are not interested in the minutes and seconds. And I think you already get it: if you say DATETRUNC day, what happens? It keeps everything between year and day, and the whole time is reset — the hours, minutes, and seconds all go to 00. Looking at this result, we don't know anything about the time; we only know information about the date. Now we can go one step further and say: you know what, I'm not interested in the days, I'm doing analysis at the month level. What is kept here is only two pieces of information, year and month, and everything below that — the day and the time — is reset. But this time SQL will not reset the day to 00, because there is no day called 00; days always start with the first. So it resets to 01: the day part of the date resets to 01, and the parts of the time reset to 00. Now we are at the month level. And you can go to the last step and say: you know what, I'm interested only in the year; I'm doing analysis only at this level, the highest level. So you say DATETRUNC year, and what happens? It keeps only the year, and everything below it is reset — between the month and the seconds, everything resets. So here SQL resets the August to 01 as well. The only value that is kept is the year, and everything else is reset: this is the 1st of January, and the time is completely reset. So now we are at the lowest level of detail: we know only information about the year and we don't care about any other parts. So as you can see, DATETRUNC is not really extracting a part here; DATETRUNC is more like resetting things. We are navigating through the hierarchy of the date and time, and we are controlling at which level we are doing the analysis. So at the end it's not very complicated once you understand how it works, and it is very useful in analysis. This is how DATETRUNC works in SQL.
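The whole hierarchy walk, as one runnable sketch (note that DATETRUNC requires SQL Server 2022 or later; the expected outputs are in the comments):

    DECLARE @dt DATETIME2 = '2025-08-20 18:55:45';

    SELECT
        DATETRUNC(minute, @dt) AS ToMinute,  -- 2025-08-20 18:55:00
        DATETRUNC(hour,   @dt) AS ToHour,    -- 2025-08-20 18:00:00
        DATETRUNC(day,    @dt) AS ToDay,     -- 2025-08-20 00:00:00
        DATETRUNC(month,  @dt) AS ToMonth,   -- 2025-08-01 00:00:00
        DATETRUNC(year,   @dt) AS ToYear;    -- 2025-01-01 00:00:00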
Okay, let's have a few examples with DATETRUNC and the creation time. As you can see, the level of the creation time is the seconds — we have seconds information in the creation time. Now I would like to move it to the minutes. So let's do this: DATETRUNC, and we're going to truncate at the minute level for the creation time. Let's call it minute date trunc, and let's go and execute it. Now if you check the output over here and compare it to the creation time, you can see we have zeros in the seconds; the seconds are completely reset compared to the creation time. Now let's say that I'm not interested in the time information inside the creation time — I would like only the date. In order to do that, we can use DATETRUNC and reset down to the day level. Let's duplicate it, put it over here, and instead of minute say day, and check the output. Now if you look at the result, you can see all the time information is reset to zeros and we have only information about the date: year, month, and day, and everything else is reset to zero. And of course we can go to the maximum, where we say I just need the year and nothing else. Let's try that out: DATETRUNC year, and let's call it year, and execute it. Now if you check the output, you can see that everything is reset besides the year. We have only the year information; everything else is reset to the first of January, and the time is reset as well. So as you can see, the output of DATETRUNC is always a datetime, and it helps us navigate through the hierarchy of the datetime so we can truncate at the level that we want. All right, so now we're going to see why DATETRUNC is an amazing function for data analysis. Let's have this example. We are saying SELECT creation time, and we want to count the number of orders based on the creation time, from our table sales.orders, and we're going to use GROUP BY in order to group the data by the creation time. Let's go and execute it. Now as you can see, we get one everywhere, because the level of detail — the granularity of the creation time — is very high. That's because we have the seconds here, and since our data is small, we will not get two orders in the same second. Now, in data analytics you often want to quickly aggregate the data at a different granularity, for example at the month level. You can do that very quickly using DATETRUNC: you say month instead, call it creation, and use the same expression in the GROUP BY. Let's go and execute it. Now as you can see, in the output we have only three rows, not ten rows, and that's because we have three months. That means we just rolled up to the month level instead of the seconds. And we can see that in the month of January we have four orders, in February four as well, and in March we have only two. So now we are talking about a different level of detail and granularity in the output. And now you might say: let's aggregate the data at a different level, the year level. You can just change it over here to year and execute it. With that, we are now at the highest level of aggregation, the year level, and since in our data we have only 2025, we get the total number of orders inside the table, and that is 10. And this is really amazing in data analytics: you can quickly change the granularity and the level of aggregation or detail by simply defining the level inside DATETRUNC. So this is why DATETRUNC is amazing: it allows us to do analysis and aggregations by zooming in and zooming out.
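The zoom-in/zoom-out aggregation, sketched — swap month for year to roll up further:

    SELECT
        DATETRUNC(month, CreationTime) AS Creation,
        COUNT(*) AS NrOfOrders
    FROM Sales.Orders
    GROUP BY DATETRUNC(month, CreationTime);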
Okay, so now we're going to talk about the last function in the part extraction category: we have EOMONTH, the end of month. As the name says, it returns the last day of a month. Let's see how EOMONTH works — it's very simple. Take our date, 20th August 2025. If you apply this function to it, what happens? It changes only the day information: instead of 20, it goes to the last day of the month, so it changes the 20 to 31, the last day of August in 2025. Let's take another example: the 1st of February 2025. If you apply EOMONTH, it changes the day from the 1st to 28, the last day of February. So as you can see, it's very simple. Let's take one more example where it is already the last day of the month, the 31st of March. If you apply EOMONTH here, what happens? Nothing — you get the same value back in return. So this is how it works, and as you can see, the output of EOMONTH is always a date as well. This is how EOMONTH works; it is very simple. All right, now quickly about the syntax of EOMONTH. It has the exact same syntax as DAY, MONTH, and YEAR: it accepts only one parameter, the date. So we have to pass a date in order to find the end of its month. So let's go and find the end of month of our creation time: EOMONTH, like this, with our creation time, and let's call it end of month. Let's go and execute it. Now in the output you can see we have a new column, a date column, and inside it we have values for the end of the month. For example, here we have January, January, January, and so on — you will always see here the end of January, and the same thing for February and March. So that's it. This is a really nice function in case you need the end of the month for each date — maybe you're creating a report or analysis where you need this information.
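A quick sketch of EOMONTH on the same table:

    SELECT
        CreationTime,
        EOMONTH(CreationTime) AS EndOfMonth  -- e.g. 2025-01-31 for every January date
    FROM Sales.Orders;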
And now you might ask me: how about getting the first day of the month — is there a function for it? Well, no, but there is a trick to get the first day of the month using another function that we just learned. Think about it: how do we get the day as one everywhere? We have to get the 1st of January, the 1st of February, and the 1st of March. How can we do that? Well, using DATETRUNC. Let me show you how. So: DATETRUNC, and we're going to reset at the month level. We don't need the days — they are going to reset to the first. Our field is the creation time, and this is going to be the start of month. Let's go and execute it. Now as you can see in the output we have the start of month, and we have a one everywhere, since we reset at the month level, and this gives us the first day of the month. And now you might say: you know what, here we have a lot of zeros — how do we get it to look exactly like the end of month? That's because DATETRUNC always gives us date and time. That means we have to change the data type, and we're going to learn that later using the CAST function, but we can do it right now: we say CAST, and we change the whole thing to DATE. Now that we changed the data type from datetime to date, in the output, as you can see, we have only the date information. So now it's really amazing: you have two dates, the first one the start of the month and the second the end of the month, and this information might be helpful if you are generating reports and need the start and the end of the month. So now we come to the part where we ask the question: why do we need those parts? Why do we need to extract the date parts from a date? Let's have the following use cases. The first use case of extracting parts is data aggregation and reporting. Sometimes we are building reports based on our data, and we have to aggregate our data by a specific time unit. For example, we are building a report to show the sales by year: we have different years, and we are aggregating the data based on the year. Or you want to drill down to more detail and aggregate the data by the quarter, so in this report we are showing the sales by quarter — Q1, 2, 3, 4. Or you decide to go into more detail with a report that says sales by month, and then you start aggregating your data by the month: January, February, March, and so on. So as you can see, we can use those different parts in order to aggregate the data, and the different parts offer us different analyses at different levels of detail. So now we have the following task, and it says: how many orders were placed each year? That means we have to group our data by the year, and we have to count the number of orders. Let's go and solve it. Let's go with the SELECT. Now, what do we need? We need the order date — this indicates when the order was placed — and we have to count the star, so this is going to be the number of orders, from our table sales.orders, and we have to group by the order date. That's it, let's go and execute it. Now in the output we are getting the number of orders, but by the order date, so we are still not there: we need it by year. We don't need the whole date information, only the year information, and that means we have to extract the year part. In order to do that, we can do it like this: we use YEAR, and we need the same thing in the GROUP BY as well. That's it, let's go and execute it. And with that, as you can see, we got the number of orders for each year, and since in our data we have only 2025, we get only one row. So with that the task is solved: we are now aggregating the data at the year level.
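The solved task as one sketch:

    SELECT
        YEAR(OrderDate) AS OrderYear,
        COUNT(*)        AS NrOfOrders
    FROM Sales.Orders
    GROUP BY YEAR(OrderDate);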
Now let's have another task, which is the same but with a different part: how many orders were placed each month? We have to change it to a month — it's very simple. We're going to use the function MONTH, and the same in the GROUP BY. Let's go and execute it. Now as you can see, in the output we don't have one row anymore; now we have three rows, because we have three months inside our data, and for each month we get the total number of orders: for January we have four, for February four, and for March two orders. Now you might say: you know what, I don't want the months as numbers, I would like the full name of the month. In order to do that, we're going to use the function DATENAME. So let's use DATENAME, and then we have to specify the date part — it's going to be the month — and the value is the order date, and we need the same thing in the GROUP BY as well. Let's go and execute it. Now you can see in the output we are getting the full name of the month, which is easier to read. So this is one of the use cases for extracting parts from a date: aggregating the data at a specific level. Now let's have the following task, and it says: show all orders that were placed during the month of February. That means we don't need all the orders — we need only a subset of the orders, based on the order date. Let's check the data first: SELECT star from sales.orders, and let's go and execute it. With that we have our 10 orders. If you check the order date over here, you can see that we have orders in January, February, and March, and we are interested only in the orders that were placed in February — only that subset. So that means we now have to filter the data based on the month information. What we're going to do is add a WHERE clause, and we don't need the whole order date, only the month part. So we're going to go with MONTH of the order date, and this is going to be equal to two, since the output is a number. Let's go and execute it. Now as you can see, SQL filtered the data, and in the output we have only the orders that were placed in the month of February. This is a very common use case as well: we use the parts in order to filter the data on a specific piece of the date. So as you can see, it's very quick and easy. And here my recommendation: if you are filtering the data, always use the numbers — always use a date function that gives you a number — because it's always faster to search for integers than to search for characters or strings. So don't use the DATENAME function in order to search or filter the data; it's better to use DATEPART, or MONTH, YEAR, and DAY, since then you work with numbers, and numbers are always faster for retrieving and filtering your data.
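That filter, sketched — note the comparison against the number 2, not against a month name:

    SELECT *
    FROM Sales.Orders
    WHERE MONTH(OrderDate) = 2;  -- filter on the integer part, not on DATENAME's string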
Okay, so now we have a lot of functions, and I would like to do a quick recap about the data types of their results. As we learned, we have functions like DAY, MONTH, YEAR, and DATEPART, and the output of all those functions is an integer — a number. Then we have another function, DATENAME: if you use it, the output is a string, because here we are extracting the name of the date part. If you use DATETRUNC, you always get DATETIME2 in the output — you are getting both the date and the time. And for the last function that we learned, EOMONTH, the result has the data type DATE. So it is really important to understand the data type of the output, so that you don't get any unexpected results. All right. Now you might say: you know what, those are a lot of functions, and like I said, they are doing the same kind of thing — we are extracting the parts of dates. So you might ask me: how do you decide when to use which function? This is how I usually do it. First, I ask myself which part I want to extract. If I want to extract a day or a month, then I ask the question: do I need it as an integer, as a number? If yes, then I use the DAY function or the MONTH function, because they are quick and I get exactly what I need. But if I need the full name of the month or the day, then I go with the function DATENAME. Now, moving back: if I'm interested in the year part — and here there is no such thing as a year name — I go immediately with the function YEAR. But let's say I don't need the day, month, or year, and I'm interested in other parts like the week, the quarter, and so on: only for this scenario do I go with the function DATEPART. So this is my decision process — this is how I decide when to use which SQL function in order to extract the parts of dates. All right. Now, I have prepared for you a list of all parts that we can use inside those three functions: DATEPART, DATENAME, and DATETRUNC. And you can see in this table the different outputs of those three functions. For example, if you use the month with DATEPART you get eight, with DATENAME you get August, and with DATETRUNC you get a truncated datetime at the month level, where the days and times are reset. So this is a full list of all examples — you can go and check it. And one more thing that I have prepared for you, in order to practice with all those different parts: I have made one big query with all the different parts. If you go and download the queries of this chapter, you will find the files, and let's open 'all date parts'. Inside it we have a long query. So what we're going to do: select everything, copy it, go back to SQL, and paste it. Let me just zoom out, and then let's execute the whole thing. In my code I have simply done a union for each possible part. For example, for the year we have DATEPART, DATENAME, and DATETRUNC, and I'm using GETDATE as the input, so we are manipulating that value and the output is presented over here. You can see it like this: if you use the year part with DATEPART you get 2024, the same idea for DATENAME, and this is for DATETRUNC. And with that you have all possible parts that you can use in SQL in one query, so you can learn what the outputs are for the different parts. All right, so with that we have learned all those functions for extracting the parts of dates.
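A condensed sketch of what that downloadable query does — one row per part, with the three functions side by side (the real file covers every possible part):

    SELECT 'year'  AS Part,
           DATEPART(year, GETDATE())  AS Via_DatePart,
           DATENAME(year, GETDATE())  AS Via_DateName,
           DATETRUNC(year, GETDATE()) AS Via_DateTrunc
    UNION ALL
    SELECT 'month', DATEPART(month, GETDATE()), DATENAME(month, GETDATE()), DATETRUNC(month, GETDATE())
    UNION ALL
    SELECT 'week',  DATEPART(week, GETDATE()),  DATENAME(week, GETDATE()),  DATETRUNC(week, GETDATE());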
All right, moving to the second category: we're going to learn how to do formatting and casting of date information in SQL, using three functions. Now, before we deep dive into formatting and casting, I would like you to understand what a date format is. Back to our example: we have here the date and time information, and we understood there are components — year, month, day, and so on. Now, if you check the datetime value, it is a combination of numbers and characters. For example, 2025 is a number, but between the month and the year there is a small minus, and this is a character. So this is a very specific format, and in SQL we can have a code for this format. For example, let's start with the year: we have four digits here, and we can represent them with four y's, yyyy. We call these characters format specifiers. So this is how we represent the year. Then between the year and the month there is this small minus, and then the month is two digits, which we represent with two capital M's: MM. Then between the month and the day there is a minus, so we write a minus as well, and then the day is represented with two digits: dd. Then there is a space between the date and the time, and the time starts with the hour — two capital H's, HH, because here we have the 24-hour system. Then a colon, and two small m's: mm. So as you can see, the format specifiers are case sensitive — there is a big difference between a small m and a capital M: a small m indicates a minute, and a capital M indicates a month. So two small m's mean minutes, but two capital M's mean month. Then a colon and two small s's for the seconds: ss. Now, the whole code is called the date format; this is the date format representation of this value. Now, around the world there are different ways to represent a date. In SQL we have the international standard, ISO 8601, and its date format is as we just learned: it starts with the year — four digits for the year — then a minus, two digits for the month, a minus, and two digits for the day. So year, month, day. But in the USA there is a different standard: it starts with the month, MM, followed by the day, dd, and at the end we have the year. So this is the standard format used in the USA. And in Europe we have yet another representation: it starts with the day, then the month, and then the year — exactly the opposite of the international standard. So as you can see, we don't have one standard; we have different ways of representing dates. But SQL Server follows the international standard: SQL Server always starts with the year, then month, then day, so all dates used in our SQL database follow this format. Okay, so after we understood what a date format is, let's talk about formatting and casting. What is formatting? It is changing the format of a value from one to another — we are changing how the data looks. For example, we have our date following the international standard: year, month, day. Now we can change the format using the function FORMAT, where we define a different date format: it starts with the month, then we have a slash instead of a minus, then the day, slash, year — and in the output we get it like this, where even the year is only two digits, not four. Here we are telling SQL the format that we would like to see the data in. Or you can go with another format, where you have three capital M's and then four digits for the year, with just a space between them; in the output you get the abbreviated month name, a space, and the year. So this is one way to format data.
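Those two examples, sketched with FORMAT:

    SELECT
        OrderDate,
        FORMAT(OrderDate, 'MM/dd/yy') AS MonthDayShortYear,  -- e.g. 08/20/25
        FORMAT(OrderDate, 'MMM yyyy') AS MonthNameYear       -- e.g. Aug 2025
    FROM Sales.Orders;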
But in SQL there is another function that helps us format data, and that is CONVERT. Here we don't provide the format itself; we provide a style number. For example, with style number six it can show the value like this: the day, a space, the abbreviated name of the month, and then two digits of the year. Or, if you use another style, 112, you get year, month, day with no separators between them. And of course we can style not only dates and times — we can style numbers as well, and here we can use the function FORMAT to change the format of a number. If you use the numeric format, the digits are separated with commas; if you use C for currency, you get the dollar sign; and if you use P, you get a percentage, with the percent character at the end. So as you can see, we can change the format of numbers as well, not only dates. This is what we mean by formatting: we are just changing how the value looks. On the other hand, casting changes the data type from one to another. For example, if we have the value 123 as a string, we can convert it from the data type string to an integer; in the output we get 123 again, but as a number. Or we can change the data type from a date to a string, so in the output it is not a date anymore, it is a string value. Or the other way around: we can change the data type from a string to a date. So as you can see, we can change the data type from one to another, and we can do that using two functions: the first and most famous one is the CAST function, and in SQL Server we can use the CONVERT function as well to change the data type. So this is what we mean by casting: changing the data type from one to another. All right, let's start with the first function: FORMAT. What is FORMAT? As the name suggests, it formats a date or time value — it's like we are changing how the value looks. Okay, let's check the syntax of FORMAT: it accepts two parameters, and a third one that is optional. First we have to provide a value — it could be a date or a number. Second, we have to provide the format: here we are specifying the new look, the new format, for this value. The third one is optional: it is the culture. Culture means: show me this value — whether it's a date, time, or number — in the style of a specific country or region. Each country and region has a different format, so here we can switch to a specific regional format; but as I said, it is optional. Let's have an example. Here we are saying: format the order date using the following format — dd for the day, then a slash, then the month, then a slash, then the year. So it is going to format the value with this new format, and as you can see, we didn't specify any culture, since it's optional. In another option we can say: you know what, I would like the order date formatted with this format, but with the style of Japan — so we specify the culture code of Japan. And of course we can use FORMAT not only for dates but for formatting numbers as well: here we are specifying the value, the format is D, and we have activated the culture option as well — we are using the style of France. So this is the syntax of FORMAT. Using the culture option is not really common; I rarely see someone using it. The first example is the most used one in projects, where we leave the culture as default or don't use it at all. And of course, if you don't specify anything, SQL is going to use the default culture, which is en-US. So this is all about the syntax of FORMAT.
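A small sketch of FORMAT with and without the culture parameter (culture codes like 'ja-JP' and 'fr-FR' are standard culture names):

    SELECT
        FORMAT(GETDATE(), 'dd/MM/yyyy')  AS NoCulture,       -- default culture (en-US)
        FORMAT(GETDATE(), 'd', 'ja-JP')  AS JapanShortDate,  -- short date, Japanese style
        FORMAT(1234567.89, 'N', 'fr-FR') AS FrenchNumber,    -- French digit grouping
        FORMAT(1234567.89, 'C')          AS Currency,        -- $1,234,567.89
        FORMAT(0.25, 'P')                AS Percentage;      -- 25.00%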
All right, so now let's have a few examples using FORMAT. We're going to format the creation time, like this: FORMAT, and what are we formatting? We are formatting the creation time, and now you can define any specifier you want. For example, let's say dd, like this. Let's check the output — execute it. If you use dd, you get the day information: with this specifier we get two digits for the day, including the leading zero, so we get 01, 05, and so on, and all of those are the day information. Now let's try something else: adding one more d, so we have three d's, ddd, here as well. Let's execute it. Now if you check the output, we are getting the name of the day — not the full one, but the short, abbreviated name of the day. This is sometimes nice if you are creating a calendar or something similar. Let's add one more d, so we have four d's, dddd, and check the result for this one. Now in the output we are getting the full name of the day. So it's really nice: we get full flexibility in how we format our day. Okay, let's keep playing and try something else. I'm going to duplicate everything and go with the month now: MM, MMM, and MMMM. Let me do it like this, and execute it. Now as you can see, we are getting the same behavior, but for the month: with MM we get the two digits, with MMM the abbreviated name of the month, and with MMMM the full name of the month. So it's like we are extracting a date part via the format — but of course we don't usually use it like this; we write out the whole format that we need for a date. For example, let's change the format to the USA format. In order to do that, we go over here and say FORMAT again on the creation time, and now we write the USA format: it's going to be MM, then after the month a minus, then the day, and then the year, four times y. That's it — let's call it USA format, and let's go and execute it. Now you can see in the output we get a new column where we see the date information in the USA standard: it starts with the month, then the day, and then the year. And of course we can do the same thing to generate the European standard format. I'll just duplicate it, and the format now starts with the day, then the month, and then the year. Now if you check the output, you can see it starts with the day, minus, then the month, minus, and the year. So as you can see, we are changing the format of the date from the creation time to something new.
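All of those specifier variants at a glance, as one sketch:

    SELECT
        CreationTime,
        FORMAT(CreationTime, 'dd')   AS DayTwoDigits,   -- 01, 05, ...
        FORMAT(CreationTime, 'ddd')  AS DayShortName,   -- Wed, Sun, ...
        FORMAT(CreationTime, 'dddd') AS DayFullName,    -- Wednesday, Sunday, ...
        FORMAT(CreationTime, 'MM')   AS MonthTwoDigits, -- 01, 02, ...
        FORMAT(CreationTime, 'MMM')  AS MonthShortName, -- Jan, Feb, ...
        FORMAT(CreationTime, 'MMMM') AS MonthFullName   -- January, February, ...
    FROM Sales.Orders;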
All right, so now we have the following task, and it says: show the creation time using the following format. We have a very weird format: it starts with the word Day, then we have the abbreviation of the day and the abbreviation of the month, then the quarter information, then the year, and after that the time, and we're going to say whether it's PM or AM. So it's a somewhat strange format that you don't see everywhere, but we still want to practice constructing such a custom format. Let's do it step by step. I'm going to go over here to a new line. The first piece is the word Day: we don't have any format specifier for that, it is just characters, and it is going to be static for the whole format. So what are we going to do? We write it as a string. Let's go and execute it. With that we get a static value: everywhere we have the word Day. After that we have a space, so I'm going to include it after the word Day in the string. So we have Day, then a space, and after that we need the abbreviation of the day name. What we're going to do: first we use the plus operator in order to concatenate the strings, and then we need the FORMAT function on the creation time. And what do we need? The short name of the day, so it's going to be three d's: ddd. Let's execute it, and let me call this custom format. Now as you can see in the output, we have the word Day, then a space, and then the abbreviated name of the day. So far it looks good. Now, what do we need next? A space and then the abbreviation of the month. We can add all of that inside the same format here — we don't have to create two formats. So: a space, and the abbreviation of the month is MMM. Let's go and test it. Great — now we have the abbreviation of the month side by side as well. So far we have covered this part; now we have to move to the second part. We still need a space and then Q1. Well, the Q is going to be static, so we cannot extend this format; we have to start a new piece. What I'm going to do: add a plus here and a new line. What do we need? First a space between the month and the quarter, so let's add a space, and we need the Q as a static value, like this. Let me just move it like this. And after that we need the quarter information, right? There is no format specifier for that, which is why we have to use the part extraction functions — and since we are working with strings here, I will go with DATENAME: quarter, extracted from the creation time. Let's go and test it. Now in the output you can see we have Q1 everywhere, and that's because all of those dates are in Q1. All right, so now we are about halfway through our format. What do we need next? A space, then the year information, and then the time information. To get the space, we simply concatenate a space. Now, to a new line: in order to get the year, I will go with FORMAT as well. FORMAT of — what do we have? The creation time again. And how are we going to format it? We need the year, so it's going to be four y's, and after that we have a space and then the time information — we can still do that inside this same format, right? So we have the space here, and then what comes next? The hours. It's going to be hh, the small h, because here we are talking about PM and AM — it's not the 24-hour system. Then the colon, then the minutes, two small m's, and after that the seconds, ss. So far this is exactly this part over here.
And now what is missing? A space and the PM designator. In order to get that, we add a space as well, and then two small t's: tt. All right, we are almost there — let's go and execute it. Now you can see it is working: we have the year, then a space, the hours, minutes, and seconds, a space, and then we have the designator — this is PM and this is AM, which is correct. So that's it, we are done. This is how you can create those crazy formats in SQL with the help of FORMAT, maybe DATENAME, and maybe some static values like we just added here. I think it's really fun formatting dates in SQL. Now, one use case for FORMAT that I frequently use in my projects is formatting the date before doing aggregations. It's like part extraction, but here we have more customization over how the date is presented in the reports. We can show a report like sales by month, where we display the date as, for example, the abbreviated name of the month, Jan, plus two digits for the year, 25. Once we change the format like this and then do the data aggregation, we get a nice report about the sales by month. So let's do a quick aggregation using FORMAT. We're going to say SELECT, and now the order date, and count the number of orders, from our table sales.orders, and then GROUP BY. But before we start reformatting the order date, let's execute it as-is. As you can see, the level of detail is very high: we have ten rows, and for each day we have one order. Now, we learned that we can use DATEPART to extract one part and then aggregate on it. Instead of that, we're going to use the FORMAT function. So let's change the format of the order date, and our format is going to be like this: three capital M's and then two digits for the year. That's it. Let's call it order date, and we need the same expression for the order date over here in the GROUP BY as well, and here a comma. That's it, let's go and execute it. In the output, as you can see, we have three months, and we have the aggregation — the number of orders — for each month. So it's like DATEPART, but now we are customizing the format as we want: we can use FORMAT to change the granularity of the date in order to do the aggregation. Now I'm going to show you a real use case for formatting in real projects. Our data could be stored in different technologies: the data could sit in a CSV file, we could get our data using an API call, or — in a very common scenario — our data could be stored in a database. What we usually do is extract the data from these different sources into one central storage. It can happen that you are getting different formats for the dates, and of course this is a problem for analytics — you cannot present different formats for the dates. So what do we do? We go and clean up the formats into one standard format. That means we have to reformat the incoming data into a new format, and once we have one standard format, we can use it in analytics and reports. So this is a very common use case in data preparation and data cleanup: unifying different formats into one standard format.
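Both results from this part, sketched — first the full custom format from the task, then the FORMAT-based aggregation:

    -- The custom format: e.g. 'Day Wed Jan Q1 2025 12:34:56 PM'
    SELECT
        CreationTime,
        'Day ' + FORMAT(CreationTime, 'ddd MMM') +
        ' Q' + DATENAME(quarter, CreationTime) +
        ' ' + FORMAT(CreationTime, 'yyyy hh:mm:ss tt') AS CustomFormat
    FROM Sales.Orders;

    -- Aggregating on a formatted date: e.g. 'Jan 25'
    SELECT
        FORMAT(OrderDate, 'MMM yy') AS OrderDate,
        COUNT(*) AS NrOfOrders
    FROM Sales.Orders
    GROUP BY FORMAT(OrderDate, 'MMM yy');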
Now, in SQL we have many different date and time specifiers, and as I said, they are case sensitive, and each one of them has a different meaning. So I have prepared for you a list of all possible specifiers that we can use with FORMAT. Not only that — if you go back to the queries of this chapter, you can find here 'all date formats'. If you go inside it, you can copy the whole query, go back to SQL, and execute it. You will find here a live example, because I'm manipulating GETDATE, so you get a list of all possible date specifiers that you can use with FORMAT. I would say: go and practice with those different date formats in order to understand what is possible in SQL. And as we learned, not only can we change the format of a date — we can change the format of a number as well, using the function FORMAT, and those are the different possibilities that you can use as specifiers to change the format of numbers. I have prepared all those specifiers in one big query as well: if you go inside it, copy it, put it in SQL, and execute it, you will find all the different possibilities that we have as specifiers to change the format of numbers. All right. So what is CONVERT? It's very simple: it changes a value to a different data type, and at the same time it can help with formatting the value. Okay, let's check the syntax of CONVERT; it looks like this. It starts with the function name, CONVERT, and it accepts two parameters: first the data type, since we can use this function in order to cast data types — you can use string, integer, date, and so on — and then we have to specify the value, so which value should be cast. The last parameter is an optional one, where you define the style — the format of the value. Let's have a very simple example: we are saying convert to the data type integer, INT, and the value that should be converted is '123', a string. So it is going to convert it to an integer. In the next example we are saying convert to a VARCHAR, and the value that should be converted is the order date — the order date is a date, so we're going to convert it from date to VARCHAR using the format, or the style, 34. Here we are specifying a style, a format, for this value. And of course it is optional; if you don't use anything, the default that is going to be used is zero. So this is the syntax of CONVERT in SQL. All right, so now we're going to have a few examples of how to work with CONVERT. Let's convert, for example, a string to an integer. We're going to say CONVERT — what is the target data type? It's going to be the integer — and the value is going to be, for example, '123'. Let's call it like this: string to integer, and the function is convert. Now, in the column name, as you can see, I'm using brackets here, and that's because I'm using empty spaces and so on; with brackets I get more freedom in how I name things. So this is just the name — it is not a function or anything. Let's go and execute it. As you can see, it works: we are converting from a string value to an integer, and this 123 in the output is not a string; it has the data type integer. Now let's have another example, where we want to convert a string to a date. The target is going to be the date, and the value — let's use our usual date string — and we're going to call it string to date convert. Okay, let's go and execute it. In the output we get this string as a date, and with that we have converted the data type from string to date.
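Those two conversions, sketched:

    SELECT
        CONVERT(INT,  '123')        AS [String to Int CONVERT],
        CONVERT(DATE, '2025-08-20') AS [String to Date CONVERT];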
Now let's have another example where we want to convert a datetime to a date. As you remember, the creation time is a datetime, and we would like to have it as only a date. So let's go and convert, and we would like it to be a date, but this time the value is going to be a column called creation time, and let's give it a name: we are converting datetime to date. But of course here we have to go and select FROM Sales.Orders. That's it, let's go and execute it. Now, as you can see in the output, we got only a date. I'm going to go and select the creation time in the query as well. So now, as you can see, the creation time was a datetime before, so we had the time information as well. But if you go and cast it using CONVERT and make it only a date, SQL is going to convert it to a date and you're going to lose all the information about the time. So far, what we are doing here is just casting: we are changing the data type from one to another. But with CONVERT we can do both; we can do casting and formatting. So let's see how we can do that. I will just get rid of this information at the start. So, creation time. And now we're going to go and convert the datetime of the creation time to a VARCHAR, to a string, and as well give it the USA standard format. So let's see how we can do that. We're going to start with CONVERT; we are changing now to VARCHAR — this is the new data type — and the value is the creation time. Now, if I don't give it a style, it's going to stay with the standard format, but we would like to have the USA standard. In order to do that, we're going to go and add the style of the format: it's going to be 32. So that's it. Let's have a name like this — USA standard — and we are using the style 32. This is just a name again, so it's not a function. Let's go ahead and execute it. And now in the output we got a new field, and the data type of this field is a VARCHAR — so it's not a date or a datetime — and as you can see, the date now is formatted using this style, the 32, the US standard format: it starts with the month, then the day, and then the year. So now let's go and do the same thing in order to get the standard format in Europe. I will just go and copy the whole thing and change the style: instead of 32, we're going to go with 34, and I will change the name as well. So we are just changing the style. Let's go ahead and execute it. Now, as you can see, we got the same thing: we have a VARCHAR as well, and the format now is different — we have here the day, then the month, and then the year. So this is how you work with the CONVERT function. You can use it in order to do only casting, or you can do casting and formatting together. You have both things in one function. And now, if we're talking about which styles are available, we have many styles that you can use inside CONVERT. So I have prepared for you a list of all styles that you can use with CONVERT: we have styles only for the dates, other styles only for the time, and styles for the datetime. Now, in the download folders you can find one file called all culture formats, and inside it one query that I have prepared where you can find the different cultures and the examples. So let's go and copy it, go back to SQL, paste it, and let's see the results. Now, if you check the output, the first column is the culture that is used.
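If you don't have that download file at hand, the idea behind it is the optional culture argument of FORMAT. Here is a minimal sketch of it; the culture codes are standard ones, and the exact cultures in the prepared query may differ:

-- FORMAT(value, format [, culture]) — the third argument switches the culture.
SELECT
    FORMAT(GETDATE(), 'D', 'en-US') AS USDate,         -- long date pattern in US English
    FORMAT(GETDATE(), 'D', 'de-DE') AS GermanDate,     -- the same date, German conventions
    FORMAT(GETDATE(), 'D', 'ja-JP') AS JapaneseDate,   -- and Japanese conventions
    FORMAT(1234.56, 'C', 'fr-FR')   AS FrenchCurrency; -- numbers follow the culture too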
In that output we have a lot of cultures — around 17 of them — and you can see how the numbers are formatted, or the date is formatted, based on each culture. So it's really fun. You can check here, for example, the format in Japan or Korea or France, and the German one. If you scroll down, you can find the Arabic, the Russian, and so on. You can see the format of each date changes based on the culture. So I would say have fun: go and try those different culture formats in order to format your numbers or dates. So what is the CAST function? It's going to go and convert a value to a different data type. It turns one data type into another. All right, so now let's check the syntax of CAST. I really like this one; it is not a typical function syntax in SQL. CAST is the function, and inside it we need two things, but they are not separated with a comma, as we learned before with all the other functions — this time they are separated with the keyword AS. So it's like natural English: you are saying cast the value as a data type. You are casting the value to a new data type. Let's have this very simple example: we have here cast the value '123' as integer. Previously it is a string, and it's going to be converted to an integer. As you can see, it's very simple. Now, in this example we are saying cast this string value as a date, so it is converted from string to date. And as you can see, with CAST we don't have any option for formatting or styling the values; it's only dedicated to casting the value from one data type to another. So this is the syntax of CAST: it is very straightforward, and a really nice function. Okay, so now let's have a few examples with CAST. Let's go and convert a value from string to integer. It's very simple: we're going to say CAST — now we need the value, so let's go with '123', a string — and then we're going to say AS, and then we have to define the data type, which is going to be integer. That's it. Let's give it the name string to integer, and let's go and execute it. Now, as you can see, we got the value, but with the data type integer — from string to integer. Now let's go the other way around and cast from integer to string. We're going to say CAST 123 AS VARCHAR, and we're going to give it the name int to string. Let's go and execute it. In the output we have 123, but this time it has the data type VARCHAR. Now let's go and work with the date. We're going to convert a string value to a date. Our value is going to be the usual one, and we want it from string to date, so we're going to have the data type DATE. Let's give it the name string to date, and let's go and execute it. Now we have this value with the data type date. So that's it. Now let's say I would like to have this value as a datetime. I will just copy the whole thing, go to a new line, and say DATETIME instead. The name of this one is going to be string to datetime. Let's go and execute it. Now in the output, as you can see, we are getting not only the date but the time information as well. But since we didn't provide SQL with any time information, SQL is going to show it as zeros. Now let's do one more cast, where we change the data type from datetime to date. For this we need our creation time, but we have to get it from the table — so FROM Sales.Orders. Let's go and execute it. Now, in the output you can see the creation time is a datetime.
We have the time information, but we are not interested in the time. I would like to have this field as a date. So it's very simple: we're going to say CAST, the value is the creation time, then the keyword AS, and we need it as a date. We're going to give it the name datetime to date. Let's go and execute it. Now, as you can see in the output, we got the creation time, but only with the date information; we don't have anything about the time. So we get it as a date instead of a datetime. So that's it. This is an amazing function in SQL, and it's very simple, but we can use it only for casting — only to change the data type from one to another. We cannot use this function in order to change the format. So if you are casting, you will always get the standard format from SQL. So now let's go and compare our functions side by side. We have our three functions — CAST, CONVERT, and FORMAT — and we can do two things: either casting or formatting. For casting, with the first function, CAST, we can change any type to any other type; there is no restriction at all. The same goes for CONVERT: we can convert anything to anything. But with FORMAT we can change only to a string — so any data type, like a date or a number, to a string value — because the main purpose of FORMAT is not changing the data type. Now, if we're talking about changing the format of the values: you cannot use the CAST function to change the format; CAST is only for casting, which makes sense. As for CONVERT, we can use it in order to change the format of date and time, but we cannot use it to change number formats. And for that we have a dedicated function called FORMAT, which we can use to change the format of date and time as well as numbers. So those are the main differences between those three functions. All right. With those three functions, we have learned how to do formatting and casting on date information. Now, moving on to the third group, we have the date calculations, and here we have two functions for doing date calculations, or mathematical operations, on dates. Okay, so we're going to start with the first function, DATEADD. What is DATEADD? DATEADD allows us to add or subtract a specific time interval to or from a date. So let's understand how DATEADD works. Here again we have our date, August 20th, 2025. In some scenarios we would like to add years to our date. For example, let's say I would like to add three years. We can do that using DATEADD, and in the output you will get 2028, August 20th — only the year part is changed, where we have added three years. In other scenarios you would like to add months. For example, let's add two months to the August: in the output you will get 2025-10-20, and with that we have added two months. And of course we can add days to our date. For example, if we add five days, in the output we'll get the same year, 2025, the same month, August, but only the day will change, to 25. So we have added five days to the original date. And of course we can go and subtract as well, even though the function is called DATEADD. For example, we can subtract three years from our date: if you do that, you will get 2022, August 20th. Or if you go and subtract two months from our date, it's going to stay the same year, 2025.
But this time, instead of August, we will go back to June, with the same day, 20. And the same thing is going to happen for the days: if you go and subtract five days, you get the same year, 2025, the same month, August, but the day is going to be 15 instead of 20. So as you can see, with DATEADD you can manipulate the years, the months, and the days by subtracting or adding intervals. This is how DATEADD works. All right, so now let's check the syntax of DATEADD, and here things are a little bit more complicated: we have to provide three pieces of information. The first one is the part — what do you want to add: years, months, days, and so on? The second one is the interval — how many days, how many years, how many months? And the last one is the date — this is the date that we're going to manipulate by adding or subtracting intervals. Let's check the following example. We are saying here DATEADD, and the part is year; that means we want to manipulate only the year part. The interval is two, and it is positive: we want to add two years. So it's going to go to each order and add two years to each date value. Now let's check another example. Here we are saying DATEADD month, so we want to manipulate the month part. But here we are saying minus four, which means we want to subtract four months from each value in the order date. So, as you can see, with the sign of the interval — whether it's positive or negative — we control whether the function does addition or subtraction. So let's have a few examples with DATEADD using our field order date. For example, let's add two years to each date. We can do it like this: DATEADD — we are adding years, that's why we're going to go with the part year — and how many years are we adding? Two, so this is our interval, and our value is the order date. Now in the output, as you can see, we got a date, but this date is always two years later than the order date; everywhere you see 2027. Now let's add maybe three months to each date. I'm just going to copy it and say month, change the interval to three, and we're going to call it three months later. Now, if you check the output, we have a new date, and the difference between it and the order date is always three months. For example, here we have January, but in the new one we have April; for the next one we have February, and in the new field we have May. So as you can see, we are adding months to our original field, the order date. Now let's say I would like to subtract 10 days. Let's do the same: we're going to have DATEADD; since we are talking about days, the part is going to be day; and we're going to subtract 10 days, so minus 10, for the order date. Let's call it 10 days before, and let's go and execute it. Now we got a new date as well, and this date is always 10 days before the order date. For example, let's take order number seven: in the order date we have 15, but in the new column we have 5. We have subtracted 10 days from the original field, the order date. So as you can see, it's very simple to add or subtract days, years, and months using DATEADD. All right. So what is DATEDIFF? Diff stands for difference, and DATEDIFF allows us to find the difference between two dates.
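Before we move on to DATEDIFF, here are the three DATEADD examples we just built, in one sketch — Sales.Orders and OrderDate are the names I'm assuming for our table and column:

-- DATEADD(part, interval, date): a negative interval subtracts.
SELECT
    OrderDate,
    DATEADD(year,  2,  OrderDate) AS TwoYearsLater,
    DATEADD(month, 3,  OrderDate) AS ThreeMonthsLater,
    DATEADD(day,  -10, OrderDate) AS TenDaysBefore
FROM Sales.Orders;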
All right. So let's understand how DATEDIFF works in SQL. Imagine we have two dates: the order date, August 20th, 2025, and the shipping date, the 1st of February in the next year, 2026. Now we might ask: how many years have passed between the order date and the shipping date? In order to answer this question, we can use the function DATEDIFF, and we can define the part year. If you do it like this, it's going to subtract those two dates and return 1 — the difference between those two dates is exactly one year. But if the question is how many months are between the order date and the shipping date, here again we can use DATEDIFF between the order date and the shipping date, but with the part month; in the output you will get 6. And of course, if the question is how many days are between the order date and the shipping date, we can use DATEDIFF with the part day, and in the output you will get 165. So this is how DATEDIFF works: you subtract two dates, and in the output you get a number — how many years, how many months, how many days. So that's it. All right, now to the syntax of DATEDIFF. It accepts three parameters as well. The first one is the part, as usual: year, month, day. And then we need two dates, not only one: the start date and the end date. The start date is the earlier date, and the end date is the later one. So, for example, here we have DATEDIFF, and we are saying: find the difference in years between the order date — this is the start date — and the shipping date. Which date normally happens first? First we order something, so we have the order date, and once you order, what happens next is the shipping. That's why the shipping date is the end date. So we want to find the difference between them in years — and of course, if you want the difference in days, we just have to change the part from year to day. As you can see, the syntax is very simple and very logical, right? All right, let's have the following simple task, and it says: calculate the age of employees. So let's see how we can solve that. We're going to select first all the information from employees — so Sales.Employees. Okay, let's execute it. Now, in the employees we don't have any information about the age, but we have the birthdate, and we can transform this birthdate into an age. How do we calculate the age? We count how many years are between the current date and the birthdate. That means we have to use two functions: DATEDIFF, and GETDATE in order to get the current date. So let's go and do that. I'm going to start by selecting only a few pieces of information: the employee ID and the birthdate. Let's start with DATEDIFF. Since we are talking about the age, we are calculating how many years — that's why the part is going to be year. What is the start date? It is the birthdate of the person. And now we need the end date. We don't have anything in the table for the end date; the end date is going to be the current date. So in order to get the current date, we're going to go with the function GETDATE, and with that we are getting the current date information. And this is exactly what we want.
So let's close it, and let's go and call it age. It's very simple: we are counting how many years are between the birthdate and the current date. Let's go and execute it. So now we are getting the ages. As you can see, the first person is 33, the second one is 52, and so on. Now, you might be getting different values than I'm getting, and that's because maybe you are doing the course in 2025 or 2026, and the employees will be older than they are now. We are now in 2024, and I'm getting these ages. So this is how we calculate the age, with the help of two functions: DATEDIFF and GETDATE. Okay, so now we have another task for DATEDIFF, and it says: find the average shipping duration in days for each month. Here we have a lot of information; let's do it step by step. Let's first find the shipping duration in days. We're going to select a few pieces of information from our table: SELECT the order ID, the order date, the ship date — and I think that's it — FROM Sales.Orders. Let's go ahead and execute it. So now we have our 10 orders, the order date, and the shipping date. Next, we have to create a new field called shipping duration. What is the shipping duration? It is the number of days between the order date and the shipping date — how many days it took from the order placement until the day of shipping. So we have two dates, and we have to find the difference between them: we're going to go with the function DATEDIFF. Since the task says in days, we use the part day. What is the start date? The order date. And what is the end date? The shipping date. Like this. I'm going to call it days to ship. Let's go and execute it. Now, checking the result: for example, order one was ordered on the 1st of January and shipped on the 5th of January, so between those two dates we have around 4 days. Four is the shipping duration. And if you go to order number three, the difference between the order date and the shipping date is around 15 days. So with that we have solved the first part, the shipping duration in days. But the task says we have to find the average duration for each month. That means we have to go and select, for example, the month of January and find the average duration — so we have to do a simple aggregation. We're going to go to the DATEDIFF at the start and say AVG, and close it over here, and let's rename it average shipping. Now we have to aggregate by the month, so we don't need the whole order date; we need the month of the order date, like this. We don't need the order ID, of course, but now we need to group the data using this dimension, the month of the order date. So that's it, let's go and execute it. Now in the output you can see we have three months, and for each month we have the average shipping duration in days. For the first month it is around 7 days, for February it is 7 days as well, and for March we have a shorter duration, 5 days. So with that we have solved the task. As you can see, DATEDIFF is a very strong function for doing data analytics using date information. All right. So now we have the following task, and it says: find the number of days between each order and the previous order. There's a lot of stuff going on here; let's do it step by step. Let's start by selecting the basic stuff: SELECT the order ID and the order date FROM the table Sales.Orders.
Let's go and execute it. So we have our 10 orders, and we have the current order date. Now we have to find the difference between two dates: the current order date and the previous order date. In our data we have the current order date, but we don't have the previous order date for each order. In order to get the previous one — do you remember the window functions? — we can use LAG in order to access a value from a previous record. So let's go and do that. The order date I'm just going to call current order date, and let's go and find the previous order date. We're going to go with LAG of the order date, because we are interested in the value of the order date. Now, in the OVER, we have to sort the data, so we're going to sort it by the order date as well. This is going to help us always access the previous value of the order date. We're going to call it previous order date. Let's go and execute it, and let's check the result. For the first order we don't have anything previous, so that's why we are getting a NULL. For the second record, the current order date is the 5th of January, and the previous one is the 1st of January — and this value comes from the previous record, the previous order. Great, amazing. So with that we now have the two dates, the current one and the previous one, and we can very simply find the number of days between those two dates, using the amazing function DATEDIFF. We are interested in the days, so the part is going to be day. What is the start date? If you check those two dates, you can see that the previous order date is the start date. So we're going to take the whole window function and put it over here — I just moved my picture out of the way — so here is the previous order date. And what is the end date going to be? It's going to be the current order date, which is our order date, like this. So again, we are finding the number of days between the previous date and the current date. That's it; let's close it. I'm just going to call it number of days. Let's go and execute it. Of course, we have here a NULL, so we will get a NULL in the output as well. And now you can check over here how many days are between those two dates: we have exactly four days, and for the next ones we have around 5 days, 10 days, and so on. So we have solved the task: we now have the number of days between each order and the previous order. This type of analysis is very important in business — we call it time gap analysis — and we have done it with the help of a window function and the date function DATEDIFF. DATEDIFF really is an amazing function for data analysis. All right. So with those two functions we have learned how to do mathematical operations on date information — or, as we can call it, date calculations. Now, moving on to the easiest and last group, we have the date validation, and here we have only one function: ISDATE. Okay, so what is ISDATE? ISDATE is very simple: it's going to check whether a value is a date. It returns 1 if the string value is a valid date, or 0 if it is not. Okay, let's quickly check the syntax of ISDATE. It's very simple: the keyword ISDATE is the function name, and it accepts only one value. So, for example, you can pass a string like this and ask SQL: is it a date? ISDATE and the value — and of course, for this example, you will get true, or 1.
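A quick sketch of that syntax, together with the checks we're about to run — the sample values follow the examples in this lesson, and I'm assuming the database's standard format is yyyy-mm-dd:

-- ISDATE(value) returns 1 for a valid date, 0 otherwise.
SELECT
    ISDATE('2025-08-20') AS DateCheck,    -- 1: follows the standard format
    ISDATE('2025')       AS YearCheck,    -- 1: a year on its own is accepted
    ISDATE('123')        AS InvalidCheck; -- 0: not a date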
So as you can see, we are passing here a string value, and we are validating whether it is good enough to be a date. Or you can go and specify a number, like here 2025. Is this value a date? SQL is going to accept it and say: yes, this is a year — so you will get a 1 as well. So you can also pass a number, an integer. You are just checking the values, whether they are suitable enough to be a date. So that's all about the syntax of ISDATE. Okay, so now let's have a few examples. For example, let's go and SELECT, and we're going to say ISDATE and check a value. Let's say this value is the string '123'. Let's call it date check one, and let's go and execute it. Now, in the output it's going to say no, it is not a date: we are getting the value 0, which is correct, because 123 is not a date. Let's pick another value. The same thing, ISDATE, and now the value is going to be the following: 2025, August, 20. Let's call it date check two, and let's go and execute it. Now in the output we will get 1. That means the value that we have provided is a date, and that's why we have a 1 in the output: SQL is saying this is a date. Now let's have another example. We're going to take the whole thing — this is check three — but I would like to go and change the format. Let's say we start with the day, then the month, and then the year. Let's go and check. Now in the output you can see it is 0, because SQL does not understand this format. We are not following the standard format of the database, and SQL is going to say: no, this is not a date; this is just a string value. This means that only if the value follows the standard format will SQL understand it as a date. Now let's go and check another thing. For example, let's say ISDATE, and let's have only the year, 2025, and give it the name date check four. Let's go and execute it. Now in the output we will get 1. That means SQL is considering this value as a date. SQL is smart enough to understand: okay, we have provided a year, and it's going to accept it and say, okay, maybe this is the 1st of January of 2025. Now let's do the same thing for the month, and see whether SQL is going to accept it. So, check five, and we have the month of August. Let's go and check. Now SQL is going to say: no, I don't understand this value — this is 0. The provided value is not a date. So by checking those results, as you can see, SQL understands only the standard formats, and it allows you as well to check whether a year is a date. So this is how ISDATE works in SQL. And now you might ask: well, when am I going to use this? When am I going to check whether a value is a date or not? Let me give you the following scenario. Imagine that we have the following data: four values as strings. If you check the data, you can see that we are following the standard format, but one value has an issue — so we have here a data quality problem. Now, what we want to do is cast these string values to a date. We don't want them to stay as strings; we would like to have them in the final result as dates. So what we usually do is have a subquery on top of those values, like this. And now we're going to say: we would like to cast the order date as a date — we don't want it as a string — and we're going to call it order date, from these values.
So let me just arrange it like this, and let's go and execute it. Now SQL is going to give you an error and say: well, I cannot convert everything to a date, because you might have corrupt data — and this, of course, is because of this one row. SQL is not able to convert this string to a date. Now, this example is very simple, so we can spot it, but if you have a huge table, it's going to be really hard to identify those issues. Still, I would like to go and convert those values; I don't want to get an error. And if there is a corrupt value like this one, it can become a NULL. So how can we force SQL to convert the data type from string to date without giving us this error? For this, we can use the help of the function ISDATE. Let me show you how I usually do it. Let's go and check whether the order date is a date — so let's have it like this. And before we execute, I'm going to make the cast a comment, because if I execute it like this, we will get the error. And let's get the order date in our SELECT. Let's go and execute it. Now, as you can see in the output, we have our string values — they are not yet dates — and we have the result of our check. The first row gets a 0, which says this value is not a date; but all the other values get a 1, so they pass the check and they are dates. So now what we're going to do is build a logic where we say: go and cast the value from string to date only if the flag, the check, is equal to 1. That means we can use the help of the CASE WHEN statement. Let me show you how we can do that, step by step. We're going to say CASE WHEN — now we need the check, so ISDATE of the order date — and if the output of this check is equal to 1, then you are allowed to do the casting. So let's put the CAST as the result of this condition. And if it's not equal to 1, then it can stay as a NULL — so let's have NULL if it didn't pass the test. Then END, and we can call it new order date. Now let's go and execute it. As you can see, we are not getting an error from SQL anymore. If you check the output, for the invalid date we are getting a NULL instead of an SQL error, and only the string values that are valid dates are allowed to be casted. So with that, you can cast a string value to a date even though you have bad data quality, and this is a very important step in order to prepare the data before doing analysis. It helps us as well to find data quality issues. For example, we can go over here and say: you know what, let's search for all the issues. We're going to take the ISDATE — so let's get the check — and say: let me see all string values that are invalid, that are failing the test. Let me execute it, and with that we are getting this one record. Now imagine we have a lot of data: it's really easy to identify those issues by just using ISDATE. So this is an amazing way to identify data quality issues as well. Now, of course, you might say: you know what, I don't want to see a NULL here; maybe let's get a dummy value. Well, it's very easy: we can go over here and say ELSE, and get, for example, a very large value, something that is easy to identify. So with that, instead of getting NULLs inside your data, you can get such a dummy value.
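Here is the whole pattern in one sketch — the sample values and the '9999-01-01' dummy date are my placeholders, not the exact ones on screen:

-- Safely cast strings to dates: only values that pass ISDATE are casted.
SELECT
    OrderDate,
    ISDATE(OrderDate) AS DateCheck,                      -- 1 = valid, 0 = data quality issue
    CASE
        WHEN ISDATE(OrderDate) = 1 THEN CAST(OrderDate AS DATE)
        ELSE '9999-01-01'                                -- dummy value for bad rows (use NULL if you prefer)
    END AS NewOrderDate
FROM (VALUES ('2025-08-20'), ('2025-08-21'), ('20-08-2025'), ('2025-08-23')) AS t(OrderDate);
-- To list only the bad rows instead, filter with: WHERE ISDATE(OrderDate) = 0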
So now you understand the use case of ISDATE and why this function is amazing for data cleanup. All right. So with that we have covered 13 different date and time functions in SQL. We have learned how to extract the date parts using seven different functions, and we have learned as well when to use which one — they are amazing for doing data aggregations and filtering. Then we learned how to change the date format from one to another, and how to change the data types. Then we learned how to do mathematical operations on our dates: how we can add or subtract days, years, and months from a date, and the amazing function DATEDIFF, where we can find the difference in days or years between two dates. And with the last one we can validate whether the values that we have are dates or not. So, as we learned, date functions are amazing functions for data analysis and reporting. All right, my friends. With that we have learned a lot of very important SQL functions and how to manipulate the date and time values in your database using SQL. In the next section we're going to start talking about the NULL functions, in order to handle the NULLs inside your tables. So let's go. So what are NULLs? Imagine you are filling out a form. There are usually fields that are required and other fields that are optional. What usually happens? We leave those optional fields unanswered — we don't provide any values, and we leave them empty. Once we are done filling out the form and we click on register, the data will be inserted into database tables. So what happens? The fields where you provided answers and values get filled inside the table, while the unanswered fields have no value — and this is what we call in SQL a NULL. So in databases, a NULL means nothing, unknown. It is not equal to anything: it is not equal to zero, or an empty string, or a blank space. A NULL is simply nothing. It tells us there is no value; it is missing. It's like saying: I don't know what this value is. So this is what a NULL means in SQL. All right, friends, so now we're going to do a deep dive into special SQL functions for handling the NULLs inside our data. In some scenarios we have NULLs inside our tables, and we would like to go and remove them and replace them with a new value, like for example 40. In order to do that in SQL, we have two functions: the first one is called ISNULL, and the second one is called COALESCE. But now let's say we have another scenario, where we have a value inside our table, like the 40, and we want to go and make it a NULL. Now we are doing the exact opposite: we are replacing the value with a NULL, and for that we have the SQL function NULLIF. So as you can see, with those two scenarios we are replacing stuff — from NULL to value, or from value to NULL — and these functions are really helpful for manipulating the data inside our databases. Now, moving on to another scenario, where we don't want to manipulate anything; we want just to check. We don't want to replace or convert anything — we just want to check in our database whether we have a NULL value. And for that we have the keyword IS NULL — but between the IS and the NULL there is a space; it is different from the first function, ISNULL. If you apply IS NULL, you're going to get a boolean, true or false; for this scenario you will get true. Or, the second option, you can go and check whether the value is not NULL.
So we can use IS NOT NULL, and for this example you would get false. In the output we are getting a boolean, true or false. So those keywords are really amazing for checking whether we have NULLs inside our data. This is the big picture of all the functions we have in SQL for handling the NULLs. So now let's go and understand those functions one by one. Let's start with the first function, ISNULL. ISNULL is going to go and replace a NULL with a specific value. The syntax of ISNULL is very simple: we're going to use the keyword ISNULL, and it accepts two arguments — first the value, and then the replacement value. Let's have an example. We can use ISNULL on the column called shipping address. We are checking the NULLs inside it, and if SQL encounters any NULL, it's going to go and replace it with the value 'unknown'. This is going to be like a default value for the NULLs. So the first value is a column, and the second value is static: it's always going to be 'unknown' if we find any NULLs. Now, of course, in other scenarios we don't want to always have it as 'unknown'; we would like to use another column to help the first one. So let's have this scenario. With this syntax we are checking the values of the shipping address, and if we find any NULLs, the replacement is going to come from the billing address. Here in this example we have two columns; we don't have any static value. We will get the value of the billing address only if the shipping address is NULL. So here we are replacing the NULLs using the help of another column, where in the first scenario we are replacing the NULLs with a static value, a default value. So let's have a very simple example in order to learn how this works. What we are doing is checking whether the value is NULL. If yes, then we get the value from the replacement; and if the value is not NULL, then we show the value itself. We have the following example: we are going to check the values from the shipping address, and if there are NULLs, then we replace them with the default value 'N/A'. So let's see how SQL is going to execute this very simple example. We have two orders. For the first order, we are checking the shipping address: is the value of this address NULL? Well, no — we have a value, A. That's why SQL is going to return the same value, so in the output we will get A. If it's not NULL, it returns the value itself. Now it moves to the second order, and here we have the shipping address as a NULL. So what happens here? If the value is NULL, then we get the replacement value — and the replacement value is the 'N/A'. That's why in the output we will not get a NULL; we will get the 'N/A'. So if you check the result, what happens? We get the addresses from the shipping address, but wherever we have a NULL, we get the default value. It's very important to understand: if you are using a default value, you will never get a NULL in the output. All right, so let's have another example for the second scenario, where we are not using a default value but a column — a supporting column that's going to be checked. In this scenario we are saying ISNULL, shipping address, and billing address. So we have two columns, and of course the logic is going to be the same, right? We are checking only once. Let's see how SQL is going to execute this example.
This time we have three orders, and we have addresses from the shipping as well as the billing. Now, SQL is always focusing on the shipping address, since it is the first column; we are not checking the billing address on its own at all. It starts with the first order: is it NULL? Well, no, we have the value A, so we will get it in the output, and SQL will not take anything from the billing address. We get A. That's it for the first order. Now SQL is going to go to the second order, and this time we have a NULL. The rule says: if the shipping address is NULL, go get the value from the billing address. So this time we go to the replacement, right? We will get the value C in the output, because the shipping address is NULL. Now let's move to the third row. As you can see, here we have again a NULL, so SQL is going to go and get the value from the billing address. But in this scenario the billing address is NULL as well; that's why we will get the value NULL in the output. So as you can see, when you take the replacement values from a column, there is no guarantee that there will always be a value — like here, in the third order, it is a NULL, and that's why we get NULL in the output as well. So if you think you are replacing all the NULLs by giving ISNULL two columns, you might still end up with a NULL in the output if the replacement has NULLs. If you want to make sure you don't get any NULLs in the output, you have to go and use a static value. So this is how SQL executes ISNULL. All right. So what is COALESCE? COALESCE is going to go and return the first non-NULL value from a list. All right, so now, the syntax of COALESCE is way better than ISNULL: it accepts a list of many values. Here, for example, we have value one, two, three — and you can add four, five, as many as you want. We are creating here a list of values to be checked. So, for example, we can still use it like ISNULL, where we have the shipping address and we replace the NULL with a static value, the 'unknown'; or, as we learned, we can use two columns, shipping address and billing address. So far these are the same use cases as ISNULL. But of course COALESCE is not limited to two — we can go and use three. We are saying: go check the shipping address; if it's NULL, then go check the billing address; if that is NULL as well, then use the default value at the end, the static one, the 'unknown'. So as you can see, we can use more than two values with COALESCE. Okay, so now let's understand COALESCE and how it works. The workflow is very similar to ISNULL. In this example we have two columns, shipping address and billing address. COALESCE considers them a list and starts checking from left to right. So it's going to check the first value, from the shipping address, whether it's NULL. If no, it's not NULL, then we get value one — the value from the shipping address. And if yes, it is NULL, then it's going to go and get value two — the value from the billing address. Now we have similar data: three orders. Let's see how SQL is going to execute it. It starts with the first row and focuses on the shipping address. Here the value is not NULL — we have it as an A — so we will get value one. We get the value from the shipping address, and nothing else is going to be checked.
Now, moving on to the second row: this time the shipping address is NULL, so COALESCE is going to go and get the value from the second column, and it's going to be the C — so in the output we will get C. Now to the last row: we have the shipping address as a NULL, and it's going to go and get the value from the second column, and this time we get a NULL as well, just like with the ISNULL function. So in the results we are getting exactly the same output as ISNULL — for this scenario it doesn't matter whether you use ISNULL or COALESCE. Now, of course, we are still not happy with that, because I don't want to see any NULLs in the output, and I still want to use the billing address instead of only a static value. So I would like to have the values from the billing address, and as well, at the end, a default value, so that I don't have any NULLs in the output. How are we going to solve it? Now we can use the power of COALESCE, where we can include multiple values in one function. What we're going to do is have the shipping address first, then the billing address, and at the end the default value. So we now have a list of three values, and of course our workflow is going to be a little bit bigger. Again, it starts from the left and goes to the right: first it checks value one; if it is NULL, then it checks value two as well; and if value two is also NULL, we get the last value, value three. So now let's run the example again using the new COALESCE. First, we check the first value, the shipping address, for record number one. As you can see, the value is not NULL — we have here an A. So what happens? We get the value A in the output. That means this path is activated, and we will not check anything else. In the output the first value is returned, and everything else is ignored; SQL will not check anything further. So, as you can see, we are returning the first non-NULL value. Now let's move to the second order. We check the first value again: is it NULL? Well, yes, as you can see, we have here a NULL. That means we're going to activate this path over here, on the right side. Now, SQL will not go blindly putting anything from the billing address in the results; first SQL has to check it. So SQL checks whether it is NULL or not — it's not — and returns it in the output. We have activated this path, and SQL is returning value two, which is the value from the billing address. So now let's move to the third order. SQL first checks the shipping address: is it NULL? Well, yes, it is. That's why SQL is going to start checking the second value. This time SQL will not return the billing address value, since it's NULL as well; it's going to go and return the third value. And what is the third value? It is our static value, the 'N/A'. So in the output we're going to get the 'N/A', our default value. With that, as you can see, we will not get any NULLs in the output: we are using the default value and multiple columns as well. So if you check the output, the first priority is always to check the values from the first column, the shipping address; if it's NULL, then the second priority is the billing address.
If it's NULL, then the last priority is the default value. So as you can see, SQL is checking the values from left to right, and it stops immediately once it encounters the first non-NULL value and returns it in the results. This is how COALESCE works. All right, so now let's have a quick summary of the differences between COALESCE and ISNULL. As we learned, ISNULL is limited to only two values, where COALESCE lets you have a list of multiple values — which is a great advantage compared to ISNULL. Now, if we're talking about performance, ISNULL is faster than COALESCE; so if you want to optimize the performance of your query, go with ISNULL. There is another problem with ISNULL: we have different keywords in different databases. In Microsoft SQL Server we use ISNULL, as we learned, but Oracle has a different implementation — they use NVL — and in other databases, like MySQL, you have IFNULL. All three functions do the same thing, but we have different implementations in different databases. On the other hand, COALESCE is available in all the different databases; here we have an agreement, a standard, between the databases. So again, this is a great advantage for COALESCE, because if you are writing scripts and someday you want to migrate from one database to another, with COALESCE you don't have to change anything, but with ISNULL you have to go and adjust your queries and scripts with the correct functions. That's why I always tend to use COALESCE and avoid ISNULL; only if it's really necessary, if I have really bad performance, do I go and try ISNULL. But I usually stick with COALESCE. So that is my advice for you: go with COALESCE and stick with the standard. Now, the use cases of COALESCE and ISNULL are very similar, and we mainly use them in order to handle the NULLs before doing an SQL task. For example, we can use them in order to handle the NULLs before doing data aggregations. Let's understand what this means. Imagine that we have three sales: 15, 25, and a NULL. If you go and use an aggregate function like AVG, what happens? SQL is going to calculate it like this: 15 + 25, divided by two, and the average is going to be 20. So SQL is including only the two values, 15 and 25, and totally ignoring the NULL value. The NULL will not be included in the calculation, because if SQL did include it, the output would be NULL as well; so the NULLs are totally ignored. The same thing happens with the other aggregate functions, like SUM, COUNT (if you are counting the sales), MIN, and MAX. There is only one exception, with the aggregate function COUNT: if you are using it with the star, SQL is not considering the values — it considers the rows. That's why SQL is going to include all the rows, and the output is going to be three. Now, in some scenarios, if your business understands a NULL as a zero, then you're going to have a problem with the results of your analysis if you don't handle the NULLs. So what do we have to do? We have to handle the NULLs before doing the aggregations: we go and replace the NULL with zero, using either ISNULL or COALESCE. Once you do that, the calculation changes for the average: it's going to be 15 + 25 + 0, divided by 3, and the output this time is going to be 13.3.
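Here is that exact behavior as a tiny sketch you can run — the three sales values come straight from the example above:

-- NULLs are ignored by aggregates; COALESCE turns them into zeros first.
SELECT
    AVG(Sales)              AS AvgIgnoringNulls, -- (15 + 25) / 2      = 20
    AVG(COALESCE(Sales, 0)) AS AvgNullsAsZero,   -- (15 + 25 + 0) / 3  = 13.3
    COUNT(Sales)            AS CountOfValues,    -- 2: the NULL is ignored
    COUNT(*)                AS CountOfRows       -- 3: COUNT(*) counts rows, not values
FROM (VALUES (15.0), (25.0), (NULL)) AS t(Sales);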
So with that, you're going to get more accurate results for the business, if they understand NULLs as zero. All right. So now we have the following example, and it says: find the average score of the customers. Let's go and solve it. We're going to select the customer ID and the score from the table Sales.Customers. Let's go and execute it. As you can see, we have four customers with a score, and the last one doesn't have any score — we have it as a NULL. Let's go and calculate the average of the score, and I would like to use the window function in order to see the details as well. So this is average scores. Let's go and execute it. Now, what is going on here? The four values are added together and divided by four, and the NULL is totally ignored. Of course, the question is what the business understands by the NULL: if it means zero, then we have inaccurate results. So let's go and fix it. This time we're going to have the average, but instead of the score, we're going to handle the NULLs first: we have to replace any NULLs with zero. We can use COALESCE or ISNULL — I will go with COALESCE, like this: score, and if you find any NULL, make it zero. That's it, and I will go with the window function as well. So, average scores — let's call it two. Now let's go and execute it. As you can see, in the output we got 500, and this is different from the previous average; that's because we have replaced the NULL with zero. Let's just go and display it in order to understand it: I will copy it and put it here, call it score two, and execute it. So now SQL is going to sum all those values and divide by five, and that's why we are getting the 500. So if our business understands the NULL as a zero, this average is going to be more accurate after we handle the NULL. As you can see, in some scenarios we have to handle the NULLs before doing any data aggregations. All right, moving on to the next use case for COALESCE and ISNULL: we can use them in order to handle the NULLs before doing any mathematical operations. Let's understand what this means, using the plus operator. If you use the plus operator between two numbers, like 1 + 5, you are summing the values, and you will get six. And if you use the plus operator between string values, like 'a' + 'b', we are doing data concatenation, and the output is going to be 'ab'. Now, if you replace the one with a value like zero — 0 + 5 — we get five; nothing fancy about that. And for the strings, if you replace a value with an empty string — so there are zero characters between the two quotes — plus the 'b', in the output you will get only 'b'. So that's fine, and nothing is critical. But now we come to the problem: if you use a NULL — if you replace the one with a NULL — in the output you will get a NULL. Because you are saying: five plus something that I don't know. SQL says: okay, you are summing a value with a no-value; it is unknown, so I don't know what the answer is going to be either — that's why the answer is going to be NULL. And the same thing happens with anything else, like the strings: if you're saying NULL plus 'b', SQL says the same thing — the NULL is unknown, so the answer is unknown as well. So, my friends, this is very critical in analysis and working with data. This means we have to handle the NULLs before doing any mathematical operations.
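You can see the whole effect in one small sketch — every expression that touches a raw NULL collapses to NULL, while the handled versions don't:

-- NULL propagates through + for both numbers and strings.
SELECT
    1 + 5                    AS NumberSum,     -- 6
    NULL + 5                 AS NullSum,       -- NULL
    ISNULL(NULL, 0) + 5      AS HandledSum,    -- 5
    'a' + 'b'                AS StringConcat,  -- 'ab'
    NULL + 'b'               AS NullConcat,    -- NULL
    COALESCE(NULL, '') + 'b' AS HandledConcat; -- 'b'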
And this is not only for the plus operator; it's the same for the other operators, like minus and so on. All right. So now let's have the following task, and it says: display the full name of the customers in a single field by merging their first and last names, and add 10 bonus points to each customer's score. Let's go and solve it. We're going to select the basic information first. Let's get the customer ID — what do we need? — the first name, the last name, and the score. That's it, FROM Sales.Customers. Let's go and execute it. Now, the first part of the task is that we have to generate a new field called full name, where we have to merge, or concatenate, the first and last names. So let's go and do that: we need the first name, plus — and then let's have a space between the first and the last name — and then plus the last name, as full name. Let's go and execute it. Now, if you check the result for the first customer, it is working: we have Joseph Goldenberg. The same for the second customer. But for the third customer we have a problem: this customer doesn't have any last name, but she has a first name — we have here a Mary. The full name here is completely NULL, which is not correct; for this example we should at least show the first name, Mary, even though the last name is missing. So the result is not really accurate, and that's because we are applying the plus operator between a NULL and Mary. That means we have to handle the NULLs before doing any plus operator. Again, we can go with COALESCE or ISNULL. Let's create a new field using COALESCE: it's going to be the last name, and now we have to define a new value in case it's NULL. We could have something like 'unknown', or we could have an empty string, and we can do that using two quotes with nothing between them. So we are using an empty string. Let's go and check the results: last name two. Let's go and execute it. Now we can see that the last name over here for Mary has an empty string, and it is not a NULL anymore. SQL now knows: okay, this is a string, and there are no characters inside it. With that, SQL has enough information, and we can go and concatenate. So let's go and do that: we're going to take the whole thing and replace the last name with the COALESCE. Let me just remove this last name over here and execute it. Now, as you can see, things look better: in the full name for Mary we have only the first name. And of course, if you don't like it like this and you would like another default value, you can go over here and say something like 'N/A', not available. Let's go and execute it — and with that you can see immediately that there is a missing last name here. But it doesn't really look good, so I will just remove it and go with the empty string. Let's go and execute it. So with that we have solved the first part of the task: we have the full names, and we are not missing any information from the first name and the last name. Now let's go to the second part of the task, where we have to add 10 bonus points to each customer's score. We have to add a 10 to each score, so let's go and do it. I'm going to put it at the end: score + 10, and let's give it the name score with bonus. That's it, let's go and execute it. Now, in the output, you can see it's very easy: we have added a 10 to each score, so we have increased the score points for each customer.
But now, for the last customer, Anna, you can see over here that she doesn't have a value in the score, and that's why SQL didn't go and add 10 — we get a NULL as well. And of course, this might not be fair: the last customer is not getting any points, even though we have increased the score for all the others. That means we have to handle the NULL by replacing it with zero, and only after that do we add the 10. So let's go and do that. I'm going to add a COALESCE: if it is NULL, then make it zero, and afterward add the 10 points. Let's go and execute it. Now, as you can see in the results, everything is fair: we have 10 bonus points for each customer, even if the customer doesn't have any value in the score — like here, Anna: she has a NULL, but she is still getting the 10 points. So here again, as you can see, if you don't handle the NULLs correctly before doing mathematical operations, you might get unexpected results. Be careful with the NULLs, and handle them correctly before adding anything. Okay, moving on to the next use case for COALESCE and ISNULL: we can use them in order to handle the NULLs before doing joins. This is a little bit of an advanced use case, but it's very important to understand. So let's understand why this is important. Let's have, for example, two tables: table A and table B. In some scenarios we have to go and combine those two tables using joins, and in order to join two tables, we have to specify the keys between table A and table B to join on. In this example, we have two keys in order to join the tables. Now here comes the special case: if those keys don't have any NULLs inside them, and all the data is filled, then your join is going to work perfectly and you will get the expected results. But you might have a special case where there are NULLs inside the keys — there are missing values — and this is a big problem, because in the output you will get unexpected results, and some records will be totally missing. In this scenario we have to handle the NULLs inside the keys before doing the joins. Let's have a very simple example in order to understand this behavior. All right. So let's have this very simple example where we have two tables and we want to combine them. In the first table we have a year, a type, and orders, and in the second table we have as well a year and a type, and we have sales. Now we would like to combine those two tables in order to have all the information in one result. We can of course use an INNER JOIN between table one and table two, and for the keys of the join, as you can see, we have the year in both of the tables, and the type as well. So we're going to use both of those columns as keys for the join. Let's go step by step through how SQL is going to execute this. We need the year and the type in the results, so SQL takes those two columns into the result, and we need the orders and the sales, so it takes the orders and the sales from the second table as well. Now let's do it row by row. The first key is going to be those two columns: we have 2024 and the type A. SQL starts searching for this combination in the second table — and as you can see, we have here a match, right? The first row matches, and since it's an INNER JOIN, it's going to present in the output only the matching rows from left and right.
So in the output we get the whole row from table 1, plus the sales from table 2. All right, that's the first row. Now let's move to the second row. What are the key values here? We have 2024 and NULL. If you check the right side, you would say we have a match: it's also 2024 and NULL, everything looks identical, so logically we should get it in the result, right? But SQL cannot compare NULLs using the equal operator. Even though it logically makes sense to have this row in the output, SQL will not find any match for this combination, so we get no information for 2024 and NULL. For the business, of course, this means missing information and inaccurate results. We lose this row, and SQL jumps to the third one. What are the key values here? 2025 and B. SQL searches the second table and finds a match, so in the output the orders are going to be 50 and the sales 300. Now to the last row, and here we have the same problem again: 2025 and NULL. If you check the data you would say yes, there is a match, but SQL ignores it. Exactly the same situation, and we will not find it in the result. So in the output we get only two rows, even though the two tables are practically identical if you compare the keys. We are losing data in the result and providing inaccurate results. So my friends, if you have NULLs inside your keys, this is what happens: you lose records in the output. That's why it is very important to handle the NULLs inside the keys before doing joins.

All right, so now, in order to fix it, we use either COALESCE or ISNULL inside the join. As you can see, we are not using the type column directly; we handle it by replacing the NULL with an empty string. It doesn't matter which value you use — the main thing is that there is a value SQL can map on. You could use an empty string, a blank, or any default value, but I usually go with the empty string, since it's a little bit faster than having any other characters. So what happens now is that everywhere in the keys the NULLs are replaced with an empty string, so there are no NULLs inside our keys anymore. Let's see what happens. We start with the first row again: there is a match from the right table and we get the whole record in the output, with the sales as 100. Now to the second row: this time we don't have a NULL, we have 2024 and an empty string. SQL searches for a match and finds it, because on the right side there is also a 2024 and an empty string. So in the output we get the 2024 — but the type is going to be a NULL, not an empty string. That's because we are handling the NULL only in the join: the ISNULL on the type is in the ON clause, but not in the SELECT. In the SELECT, the type stays as the original data, and the original data was a NULL.
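To make the walkthrough easier to follow, here is a sketch of the null-safe join (the names table1, table2, year, type, orders and sales are assumptions taken from the example):

SELECT
    t1.[year],
    t1.[type],   -- original value from the table: still NULL if the source was NULL
    t1.orders,
    t2.sales
FROM table1 AS t1
INNER JOIN table2 AS t2
    ON  t1.[year] = t2.[year]
    -- handle the NULL only inside the key, not in the SELECT
    AND ISNULL(t1.[type], '') = ISNULL(t2.[type], '');

Note that the replacement value exists only for the matching step; the SELECT still returns the original data.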
We are handling the NULL only in the join, just to let SQL understand how to map and match the data. In this example I'm not changing the values in the SELECT, so that's why we get the original value, while the orders are going to be 40 and the sales 20. Moving on to the third row — I think you've already got it: SQL finds the match and the sales are going to be 300. All right, now to the last one, and it's the same scenario: we have 2025 and an empty string, so it's not NULL anymore. SQL searches for that combination and finds it. It takes the fields over here, with the type as NULL, not an empty string, because we didn't handle it in the SELECT. The orders are going to be 60 and the sales 200. So as you can see, the result is now complete: we successfully combined both tables into one big result using a join, with the help of the ISNULL function, so that the result is complete and we don't miss any values. So my friends, be very careful: always check whether the keys contain NULLs, and if you find NULLs, handle them immediately so you don't lose any records in the result and you get accurate analyses.

All right, moving on to the next use case for ISNULL: we can use it to handle the NULLs before sorting the data. Imagine we have the following sales: 15, 25 and NULL. Now if you sort the data by the sales ascending, from the lowest to the highest, what happens? SQL shows the NULLs at the start. That's not because NULL is the lowest value — NULL has no value — but that's simply how SQL presents it: it places the NULL at the start, then below it the lowest value, the 15, and at the end the 25. Now if you do the exact opposite and sort the data from the highest to the lowest using descending, SQL sorts it like this: 25, then 15, and the last thing to appear in the list is the NULL. So here SQL shows the NULLs at the end — again, not because NULLs are the lowest value, it has no value, but this is how SQL deals with NULLs when sorting the data.

In order to understand this use case, let's have the following task. It says: sort the customers from the lowest to the highest score, with NULLs appearing last. All right, let's solve it — this is going to be a very interesting one. We need the customer information, so let's select the customer ID and the score from sales.customers and execute it. We have a simple list of all customers and their scores. Now we have to sort the data from the lowest to the highest, so we use the ORDER BY clause on the score field. Since it's lowest to highest, we need ascending, and that is the default in SQL, so we don't have to mention it. Execute it, and as you can see in the results, it goes from the lowest to the highest, and the first part of our task is solved. But of course we have an issue, because we have a NULL, and as we learned, SQL puts it in first place on the list. The task says the NULLs should appear last, so we really don't want to see them at the start.
So we would like to have the NULL at the end of the list, which means we have to handle the NULLs before sorting the data. Here we have two ways to do it: one that is lazy, and one that is more professional. Let me show you the lazy way first: we replace the NULL with a very big number. For example, we use COALESCE on the score and put in a number with a lot of digits, so we get a really big score. I'll select it first just to see the result — as you can see, it's a very big number. Now take that expression and replace the score in the ORDER BY with this new score. That's it; execute it. If you check the results, we seem to have solved the task: the customers are listed from the lowest to the highest and the NULLs are at the end. So why do we call this lazy, or not professional? Because we are defining a static value. For this example it works, but we don't know what happens later. Maybe things change, a score appears that is higher than our static value, and then the sorting makes no sense, since the NULL would land in between real values. Who knows — your magic number might one day be a real value inside the data.

Now let me show you the more professional way to solve this task, where we don't play with luck at all. Let me just move this a little bit. I'm going to create a new logic: CASE WHEN the score IS NULL THEN 1 ELSE 0 END. So we are just creating a flag of zero and one: if the score is NULL we get the flag 1, and if we have a value for the score we get 0. Let's have it like this; I'll get rid of the COALESCE and execute it. Now if you check our nice new flag, you can see zeros everywhere we have a value in the score, and only where we have a NULL do we get the flag of 1. Now what we do is sort our data based on this flag and then the score. The task doesn't mention anything about a flag, but we are using it in order to force the NULLs to the end of the result. Let me show you how. Let me remove all this. First we sort the data by our new flag, to make sure the NULLs go to the end; then afterwards we sort by the score. So again, what we are doing: first sort by the flag to push the NULLs to the end, and then, where the flag values are equal to each other, SQL sorts the data by the score. Both of them ascending. Execute it, and we get exactly the same result: the values from the lowest to the highest and the NULLs at the end — and as you can see, this ORDER BY uses no static values or big numbers. And of course we don't need the flag in the SELECT, so we can remove it. Execute, and with that we have solved the task. So as you can see, we can use these nice functions like COALESCE or ISNULL — or a small CASE logic — in order to handle the NULLs before sorting your data.
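Putting the professional solution together, a minimal sketch of the final query could look like this (assuming the sales.customers table with customer_id and score columns):

SELECT
    customer_id,
    score
FROM sales.customers
ORDER BY
    -- flag: 1 for NULL scores, 0 otherwise, so the NULLs sink to the end
    CASE WHEN score IS NULL THEN 1 ELSE 0 END,
    score;   -- ascending is the default

Because the flag is computed from the data itself, no magic number can ever collide with a real score.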
So what is the function NULLIF? NULLIF compares two values and returns a NULL if they are equal; otherwise, if they are not equal, it returns the first value. Okay. Now, the syntax: NULLIF accepts exactly two values, value one and value two. Here again you can use a column together with a static value like 'unknown' — comparing the values of a column against a constant — or you can compare two columns, like the shipping address and the billing address. But again, it accepts only two values; it's not like COALESCE, where we can have a list of multiple values.

All right, so now let's understand exactly what NULLIF does. The workflow goes like this: SQL checks the two values, value one and value two. If they are equal, SQL returns a NULL. But if the two values are not equal, it returns the first value — the one on the left side. Looking at the possible outcomes, there is no scenario where we get the second value back; the second value is always used only as a check. We are checking against that value, so we either get value one or a NULL.

Let's have this very simple example: NULLIF(price, -1). We are saying: if the price is equal to -1, go and replace it with a NULL, because it is a data quality issue to have a negative price — it makes no sense for our business. If it is -1, it means for us a NULL: we don't know the price of this product. So we correct it using NULLIF. Let's check this very simple example with two orders. SQL starts with the first order and checks the first value — the price, which here is 90. Is 90 equal to -1? Well, no. That means SQL takes the "not equal" path, and in the output we get the first value, the 90. Now the second order: here we have a -1. Is this -1 equal to the -1 in the NULLIF? Well, yes. So SQL takes the "equal" path, and we get a NULL in the output. Now if you compare the NULLIF result with the price, you can see we no longer have the -1. And notice that we are doing exactly the opposite of COALESCE and ISNULL: we are replacing a real value with a NULL.

Moving on to the second example, and this is a very interesting one for analytics: we can use two columns inside the NULLIF. In this example we say NULLIF(original_price, discount_price). SQL has to compare the prices between those two columns, and if they are equal, it returns a NULL. Now you might ask: why would we do this? Well, we can use it in order to highlight or flag special cases inside our data. The special case here is the original price being equal to the discount price — if those two prices are equal, that means we have an issue in our program, or something went wrong while inserting the data. So let's see what happens: for the first row we compare the 150 from the original price with the discount price, and they are not equal.
So NULLIF returns the original price, the 150, in the output. Now to the second order: here the original price is 250, and the discount price is also 250. They are equal, and if they are equal we get a NULL in the output. So as you can see, again, we are never getting any values from the discount column — we are using it only as a check. With that we have a quick flag: we are using the NULLs as a flag in order to identify where we have equal values. This is how NULLIF works.

All right friends, here we have a very nice use case for NULLIF, and that is preventing the divide-by-zero error. Let's see what this means. Okay, we have the following task, and it says: find the sales price for each order by dividing the sales by the quantity. Let's solve it — this should be very easy. We need the order ID, the sales and the quantity, from sales.orders. Execute it. Now we have 10 orders with their sales and quantities, so it's very easy to calculate the price: the sales divided by the quantity, and we call it price. Execute it... and as you can see, we got an error: "divide by zero error encountered". That means somewhere we have a zero quantity, and this is a problem. Let's check the data again: I'll comment out the calculation and execute. By checking the result — yes, for order ID 10 we have a quantity of zero, and of course it will not work if you divide by zero. So how can we solve it? We can use the magic of NULLIF, where we replace the zero with a NULL — because getting a NULL is way better than getting an error, right? Let's do it: I'll remove the comments, and here we say NULLIF(quantity, 0). That's it; execute it. Now, as you can see, it is working, and with that we are making sure that we never divide by zero, because we replaced it with a NULL — and if you divide anything by NULL, you get a NULL. If you check the result, order 10 got a price of NULL, which is correct, and all the other orders work, because they have real quantities that were not replaced. This is a very common use case for NULLIF: preventing division by zero.

All right, so what is IS NULL? It returns true if the value is NULL; otherwise it returns false. And the exact opposite is IS NOT NULL: it returns true if the value is not NULL; otherwise, if it is NULL, it returns false. Okay, the syntax is very simple. It starts with a value or an expression, and then we have the keywords IS and NULL. And IS NOT NULL follows the same pattern: a value, then IS, then the NOT operator, then NULL. It's very simple. Let's have an example: we are checking whether the shipping address is NULL, so we write shipping_address IS NULL — or we can check the opposite, whether it's not NULL: shipping_address IS NOT NULL. It's very easy.
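Here is a sketch of the divide-by-zero guard from the task above (assuming sales.orders with order_id, sales and quantity columns):

SELECT
    order_id,
    sales,
    quantity,
    -- NULLIF returns NULL when quantity = 0, and dividing by NULL yields NULL
    sales / NULLIF(quantity, 0) AS price
FROM sales.orders;

Instead of an error for order 10, the price simply comes back as NULL.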
So now let's understand how this works. We are checking the value: if the value is NULL, return true; if it is not NULL, return false. As you can see, it never returns the value itself, or any NULLs — we get a boolean of true and false. So we are creating a kind of boolean flag in order to assist us with checks. Take this very simple example, price IS NULL, with two rows. We are checking whether the price is NULL: in the first order it is not NULL, so we get false in the output; in the second order the value is NULL, so that is correct and we get true. And of course, if we use IS NOT NULL, it's the exact opposite: is the price not NULL? For the first row, yes, it's not NULL, so we get true; for the second row it is NULL, so the output is false — exactly the opposite. So that's it; it's very simple how IS NULL and IS NOT NULL work.

All right. One very obvious use case for IS NULL and IS NOT NULL is searching for missing information, searching for NULLs — and maybe afterwards cleaning up our data by removing the NULLs from our data set. Let's have the following task, and it says: identify the customers who have no scores. All right, let's solve it; this is very simple. Start with SELECT * FROM sales.customers — we need everything — and execute it. As you can see, we have our five customers. But the task says we want only the customers who have no score, which means the result should return only the last record, since Anna's score is NULL. So let's add a WHERE clause: WHERE, and what do we need? The score — and we don't use the equal operator, we use IS NULL. That's it; execute it. And with that, as you can see, it's very simple: we have filtered our data and now we see all the customers where the score is NULL. This is a very basic check to understand whether our data contains NULLs. All right, moving on to the next task, and it says: show a list of all customers who have scores. Back to our example — this time we do exactly the opposite: we want a list of all customers where there is a value in the score. So we say WHERE score IS NOT NULL. If you execute it, you can see we get a clean list where all the customers have a score. And with that, we got rid of all the NULLs inside the score field, which may be helpful for further analysis.

All right friends, now we come to a very interesting use case for IS NULL: introducing a new type of join between tables that is going to help us find the unmatching rows between two tables. Let's have a quick recap of the joins in SQL in order to understand the new types. Basically we have two sets, or let's say two tables: the left and the right. If you use an inner join, we are finding only the matching rows between the left table and the right table, so in the result we get only the matches. Then we have another type of join called the left outer join: if you use this type, you get all the rows from the left table, plus only the matching rows from the right table. And then we have another type that is exactly the opposite, the right join: here we get all the rows from the right table, and only the matching information from the left table.
And now to the last type that we learned: the full join, where we get all the rows from the left and all the rows from the right, so we are not missing anything. Those are the four basic joins that we have learned in SQL. But in SQL we also have other types that are more advanced, and there are no dedicated keywords for them. The first one is called the left anti-join. What we are saying here is: we need all the rows from the left table, but this time without the matching rows — any information that matches the right table should not appear in the results. As I said, there is no extra keyword for this type of join, but to get this effect we combine the left join with IS NULL. With that we get all the data from the left side, but without anything that matches the right side — and this we call a left anti-join. There is another advanced join type called the right anti-join, which is exactly the opposite: all the rows from the right table without any matching rows from the left table — everything on the right side that does not match the left side. Again, there is no keyword for that; we work with a right join plus IS NULL. So with that, as you can see, we have two new types of joins added to our four basic joins.

Now, this might be confusing, so let's have the following task in order to understand it: show a list of all details for customers who have not placed any orders. All right, let's see how we can create the effect of the left anti-join, step by step. We need two tables here: the customers and the orders. Since we are focusing on the customers, the left table is going to be the customers. So we say SELECT * FROM sales.customers, using the alias c — this is our first table. Execute it, and as you can see, we get the list of all customers, with all the details for each one. Now we have to join it with the orders. To do that, take a new line: LEFT JOIN sales.orders, with the alias o, and then we define the key for the join: ON the customer ID in customers equals the customer ID in the orders table. Now, just to see whether we have a match or not, let's also show the order ID from the orders table in the SELECT, and execute it. Let's check the results. Those first four columns come from the customers table, and only the last column comes from the orders. What's interesting is to check the order ID for NULLs. As you can see, for customer 1 everything is matching; for customer 2 we have orders as well, and the same for 3 — only for the last one, customer ID 5, do we have a NULL. That means SQL was not able to find any order for this customer. So what this means: we have only one customer, Anna, who doesn't have any order, while all the other customers did place orders — wherever we have values from the right table, we have a match, and since here we have a NULL, we don't have any match.
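As a preview of the filter we are about to add, the complete left anti-join pattern looks like this sketch (again assuming the sales.customers and sales.orders tables with a customer_id key):

SELECT
    c.*,
    o.order_id
FROM sales.customers AS c
LEFT JOIN sales.orders AS o
    ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL;   -- keep only customers without a matching order

The WHERE clause on the right table's key is the step we build next.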
So now, since the left anti-join says we want all the data from the left table without any matches from the right table, for this example we want to get only this customer, Anna. And this is exactly what fulfills our task: the task says list all details for customers who have not placed any order — all data from customers where there is no match in the orders. Now I think you've already got how to create this effect: we filter the data as follows. We add the WHERE clause, and we need the column from the right table, from the orders. So we say WHERE o.customer_id IS NULL. And of course you could go with the order ID as well — you'd get the same effect — but I always like to use the key that we are joining on. Execute it, and now, as you can see, we got the effect of the left anti-join, and with that we got the customer we were aiming for. Here we have the data from the left side that does not match the right side — the customers who have not placed an order — and with that we have solved the task. So as you can see, we have implemented the left anti-join by combining the left join with IS NULL. This is the power of playing with the NULLs in SQL.

Now my friends, there is something that really confuses a lot of developers, and anyone working with data in databases and SQL: the differences between NULLs, empty strings and blank spaces. With a NULL, as we learned, we are saying: I don't know what the value is; it is unknown. With an empty string, on the other hand, you are saying: I know the value — it is nothing. The empty string is a string value that has zero characters. This is totally different from the NULL, where we don't know anything at all. And then sometimes — maybe it has happened to you — you are filling in a form, you come to one field, you hit the space bar by mistake, and with that you enter a space into the field and just jump to the next field without entering any other value. Now there is a space character inside the field. This is really evil in databases, because once the user enters a blank space, it gets stored as a value inside the database and it takes storage. It could be one space or many spaces, depending on how long you pressed the space bar. So the blank space is a string, but its size is not zero like the empty string — its size is however many spaces were entered. Here, unlike the NULL, we know the value: it is a string, and its characters are spaces.

Okay, so let's see those three scenarios inside SQL. I have some dummy data here defined using a CTE statement — don't worry about it, I'm going to teach you all of this in the next tutorials. We have four rows: the first with the value 'a', the next with a NULL, the third with an empty string — as you can see, there is nothing between those two quotes — and the last one with a space between the two quotes. Now let's query this temporary table: SELECT * FROM orders, and execute. By looking at the values of the category, you can find all the scenarios. The first one is the easiest: a normal value, an 'a'. But in the other three rows we don't have normal values — we have, let's say, empty stuff.
The first one is the NULL: we don't have a value, and SQL shows its special marker that says NULL — there is no value. The other two are really confusing: just by looking at the data, or at the results, it's really hard to tell whether it is an empty string or a blank space. This confuses a lot of developers, and anyone working with data who sees these results — it's really hard to detect data quality issues just by looking at them. So in this scenario, what I do is calculate the length of each value inside my column. Let's do that. In SQL Server we use the function DATALENGTH, our field is the category, and we call the result category_length. Execute it, and let's check the result. The first row has only one character, so its length is one, which is correct. The next row has a NULL category: we don't know the value, and so we also don't know the length of the value, right? That's why we get a NULL. Moving on to the next one — as you can see, those two look exactly the same. But with the help of the DATALENGTH function we can see that the third category value has a length of zero. That means it is an empty string, with no hidden characters inside it. So with that we are sure: this is an empty string. But now to the last one — this is the very tricky, evil one. We have a hidden space inside this value, and we can tell by the length of the field: we have a one, which means there is one hidden space inside this value, so it is not an empty string. And if I give it another space and calculate the length again, as you can see, we now have two spaces, and that's why the length is two. So don't count on your eyes to spot the spaces; calculate the length in order to be precise.

So now let's compare the three scenarios side by side. First, the representation in the table: the NULL we see as NULL inside the table; the empty string is two quotes with nothing between them; and the blank space is also two quotes, with one or many spaces between them. Next, the meaning: the NULL means unknown — we don't know the value; the empty string is known, but it is nothing, an empty value; and the blank spaces are also known — the spaces are the value. Now, the data types: since the NULL is no value, it has no data type; it is like a special marker in SQL. The empty string has a data type: it is a string, and its size is zero, since there are zero characters inside it. The blank space is a string as well, since a space is a character, and its size is one or many. Now, about storage: the NULL is the best — NULLs don't consume or occupy much storage, while the empty string and the blank spaces do occupy storage and memory, and they waste space. So if you are worried about storage, the best option is the NULL. And about performance: you get the best performance using NULLs. The empty string is also fast, but not as fast as the NULLs.
Now, the worst option here is the blank spaces — they are slow. So again, if speed is important to you, have these scenarios as NULLs. And finally, the comparison: if you are searching for these values, then to search for the NULL you have to use IS NULL, while to search for the empty string and the blank spaces you use the equal operator. So that's it — those are the main differences between NULL, empty string and blank spaces.

Now you might ask: why do I have to understand the differences between all of this — the NULLs, empty strings and blanks? Everything is somehow empty, so why should I care? Well, in new projects, I promise you that you will be working with sources and data that have bad data quality, and you may encounter all three scenarios in your data. If you don't do any data preparation — cleaning up the data, handling those three scenarios, and bringing standards to your data — and you jump straight into the analysis, you will end up providing inaccurate results in your reports and analyses, which leads to wrong decisions. So preparing your data before any analysis, by cleaning up the data, handling these three scenarios and bringing standards, is a very important step. And this is how we do it: together with the stakeholders and the users of your reports and analyses, you define clear data policies — they are like rules — and you commit yourself during the implementation to following those rules. Here we have three different options.

The first option: you can define the data policy like this — only use NULLs and empty strings, but avoid blank spaces. In my projects I cannot imagine a scenario where we need blank spaces; they are just evil, so get rid of them. All right. With this policy we have to remove all blank spaces from our data, and for that we have a wonderful function in SQL called TRIM. The TRIM function removes the spaces from a string on the left side and on the right side — all the leading and the trailing spaces are removed. So if we apply the TRIM function to the category, what happens? All the blank spaces are removed, and they turn into empty strings. Let's do it; it's very simple. We use the TRIM function, apply it to the category and call it policy1. Execute it. Now, just comparing policy1 with the category, it looks identical — but it's not. To get a better feeling for this, let's test it using DATALENGTH. We apply the DATALENGTH function to the trimmed result, and also to the raw category, just to compare — without the TRIM, like this. Execute it, and if you check the result: here we have a length of two, because there are two spaces, but with policy1 we have zero. Those two values, after applying the TRIM function, have a length of zero, and with that we have no blank spaces. So that means we are now sure that after applying the TRIM we have either a NULL or an empty string. Let me just get rid of the length columns.
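A minimal sketch of the policy 1 check (TRIM is available in SQL Server 2017 and later; the orders table and category column come from the dummy data above):

SELECT
    category,
    DATALENGTH(category)        AS category_length,
    TRIM(category)              AS policy1,
    DATALENGTH(TRIM(category))  AS policy1_length
FROM orders;

Comparing the two length columns makes it obvious where the hidden blank spaces were.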
Now I am sure both of them are empty strings. So as you can see, it's very simple: with only one SQL function you are cleaning up the data and bringing standards.

All right, moving on to option two. You can define your data policy like this: only use NULLs, and avoid both empty strings and blank spaces. That means in our business there is nothing meaningful about an empty string or a blank space; we use only NULLs. Okay, let's implement this rule. We have to convert a value to a NULL — the empty string to a NULL — and as we learned, we can use the NULLIF function in order to get NULLs instead of values. So let's apply this policy. But here we have two values to deal with: the empty strings and the spaces. Instead of writing two rules for that, I first convert the blank spaces into empty strings, like we did before: I take the result of the TRIM function as the first step, and afterwards apply NULLIF to it. So we say NULLIF on the result of the TRIM: if you find any empty string, convert it to a NULL. That's it — policy2. As you can see in the result, we have converted those empty strings and blanks to a NULL, so we get three NULLs, and of course the value 'a'. And if you compare the three columns side by side, you can see that policy2 is really easier to understand than the previous ones, right? Compared to policy1, it's easier to understand and easier to handle. So again, it's very easy to do this data cleanup with only two functions, and we now have standards inside our data.

And now, moving on to the last option, we can define our data policy like this: use only a default value, 'unknown', and avoid using anything else — NULLs, empty strings and blank spaces. That means in the analyses and reports we want to see the value 'unknown', so we have to handle all three scenarios and convert them to 'unknown'. Now, to implement policy three we have to convert a NULL into a default value, and here we have two options: either use ISNULL or use COALESCE. I'll go with COALESCE, directly on the category: if you find any NULL, replace it with the default value 'unknown', and call it policy3. Execute it. Now if you check the result, you see that we got it right only once: we replaced the NULL with 'unknown', but we still have the empty strings and blanks. That's because we rushed into the COALESCE and skipped the other steps. So as you can see, you have to prepare the data slowly, step by step. First we have to convert everything to a NULL, like policy 2, and only after that, as the last step, do we apply the default value. That means instead of using the category directly, we have to take the result of policy 2 — so let's copy it, replace the category with those two steps, and execute. Now, as you can see, we have the default value for all three scenarios. First we trim the data in order to remove all the blank spaces; in the second step, we replace all the empty strings with a NULL, and with that we get a NULL for all three scenarios.
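Side by side, the three policies might be sketched like this (same assumed orders table and category column):

SELECT
    category,
    TRIM(category)                                   AS policy1,  -- no blank spaces
    NULLIF(TRIM(category), '')                       AS policy2,  -- only NULLs
    COALESCE(NULLIF(TRIM(category), ''), 'unknown')  AS policy3   -- default value only
FROM orders;

Each policy builds on the previous one: trim first, then turn empty strings into NULLs, then replace the NULLs with a default.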
And finally, we replace the NULLs with the default value, the 'unknown'. So that's it for the three policies, and these are the different ways to clean up the data and bring standards before doing analysis. Now you might ask me: okay, which one should I use in my project? If I want to suggest something to the users, which one should it be? Well, it really depends on the business, but I always try to avoid policy one, because it's always confusing and I constantly have to explain it to the users. So we are left with two and three. Well, I use both of them, in different scenarios. I normally go with policy 2, because it takes less storage and the performance of your queries afterwards is really good. So if I'm doing data preparation in my ETL, before inserting the data into a table, I go with policy 2. On the other hand, if I'm doing a preparation step right before showing the data in a report, like in Tableau or Power BI — one of the last steps before presenting the data to the users — I go with policy 3, because if you present a NULL inside a report, it's really hard to read, while having a word like 'unknown' is easier to understand: okay, we have missing data here. So again: if the data preparation happens right before I present the data to the users, I go with policy 3 and use default values; but if the data preparation happens before inserting the data into the database, I go with policy 2, because it optimizes the storage. It would be really bad to go with policy 3 there — storing the whole word 'unknown' every time there is no value takes a lot of space, and you also get worse performance as you build your queries. That's why I tend to store the data using NULLs and, when presenting it to the users, show a default value. So as you can see, it's very important to understand the differences between NULLs, empty strings and blanks, and how to prepare the data by cleaning it up and bringing standards and policies before doing any analysis. With this we have cleared up the confusion between those scenarios, and if you encounter them in your projects, you know how to deal with them.

All right, so now let's have a quick summary about the NULLs. NULLs are special markers in SQL that say: there is no value; it is missing; it is unknown. NULLs are not equal to zero, or an empty string, or any blank spaces. Using NULLs inside your database saves storage and provides strong performance in your queries. In SQL we have different functions in order to handle the NULLs: if you want to replace a NULL with a value, you can go with the function COALESCE or ISNULL; if you want to do the opposite, replacing a value with a NULL, you can use the function NULLIF; and in other cases, where we only want to check whether something is NULL or not, we can use IS NULL or IS NOT NULL. We have also learned that we have to treat the NULLs before certain tasks. That means we have to handle the NULLs before doing data aggregations like AVG, SUM, MAX, MIN and so on, and before doing any mathematical operations, like using the plus operator to concatenate two strings. And in some scenarios, as we learned, we have to handle the NULLs before doing joins.
And in other cases we also have to handle the NULLs before sorting the data. We learned as well that by combining joins with IS NULL we introduce new types of joins, like the left anti-join and the right anti-join, where we exclude the matching rows using IS NULL. And we can use the NULL functions in order to establish standards and data policies in our data, like using NULLs, or using default values like 'unknown'. All right my friends, with that you have learned how to handle the NULLs inside your data, and now we move to a very special topic called the CASE statement. This is a very important tool for data transformations. So let's go: CASE statements. A CASE statement allows you to build conditional logic in your SQL query by evaluating a list of conditions one by one and returning a value when the first condition is met. So now let's understand the syntax of the CASE statement and what this means.

Okay, let's see the syntax step by step. It starts with the keyword CASE. The CASE indicates that we are now starting a conditional logic in SQL — it's like in programming languages, where you start with an if-else; the if is the keyword of the logic. And the whole logic ends with another keyword called END: once SQL sees the END, that is the end of the conditional logic. So the CASE is the start and the END is the end, and what we have in between is the conditional logic, right? A condition starts with the keyword WHEN — now we are telling SQL we have a condition to be evaluated — and then we specify that condition. Next we have to tell SQL what happens if this condition is fulfilled, and for that we use another keyword called THEN: we are telling SQL to show this result if the condition is true. As you can see, it's very simple; it's like natural language: when condition one is met, then show the result. Very logical, right? And of course we can add a second condition inside our CASE statement, with the same setup: WHEN condition two — if this is true — THEN show result number two. We specify the keyword WHEN, then the second condition, and if it is true, we tell SQL to show another result. And it's very important to understand that SQL processes the conditions from the top to the bottom, so the first, most important condition should be at the start. SQL checks the first condition; if it fails and is not true, it jumps to the second condition. So the order of the conditions is very important in your logic. And of course we can add multiple conditions, depending on the logic, using the keyword WHEN. Now, once we are done defining all the conditions, we can specify an ELSE keyword. The ELSE introduces the default value, and it is optional — you can skip it. The value of the ELSE, the default, is used only if all the conditions failed: if none of our conditions is true and nothing is fulfilled, then SQL uses the value from the ELSE. It is the default value, used if all conditions are false. So these are the keywords you must use inside each CASE statement: CASE, WHEN, THEN and END — only the ELSE is optional; you can use it or skip it. This is the main structure and the syntax of each CASE statement.
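As a template, the skeleton we just described looks like this (condition1, result1 and so on are placeholders, not real SQL):

CASE
    WHEN condition1 THEN result1
    WHEN condition2 THEN result2   -- checked only if condition1 is false
    ELSE default_result            -- optional; without it you get NULL
END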
Now let's have a very simple example in order to understand how SQL executes the CASE statement behind the scenes. All right, here we have only one condition. As you can see in the syntax, it starts with CASE and ends with END, and we are evaluating the sales: the condition says if the sales are higher than 50, then show the value 'High' as the result. Very simple — only one condition — and on the right side we have a flowchart in order to understand how the logic is executed. Now what we're going to do is run four sales values through this logic and see what the output of the CASE statement is. Let's do it one by one, starting with the first sales value: it is 60. We check: is 60 higher than 50? Well, yes. That means this value meets the condition, we get true, and in the output we get the value 'High'. So the first sales value fulfills the condition, and SQL gives us the value from this condition. All right. Now SQL goes to the next value and evaluates the 30. We ask the same question, the same condition: is 30 higher than 50? Well, no. So for this condition we get false, and we take the false path. And on the false path we get no value, right? That means the output is a NULL. So the output for the 30 is NULL, and that's because we didn't define a default in our logic — there is no ELSE. This is exactly what happens if you don't use an ELSE: you get a NULL in the output of the CASE statement. Now to the next one — it's the same thing: 15 is smaller than 50, it doesn't fulfill the condition, and we also get a NULL. And for the last one, since it's a NULL, we will also get a NULL, since it will not fulfill the condition. So after evaluating all those sales, only the first one fulfills the condition, and that's why we have only one value, the 'High'.

All right, let's keep going and add things to our CASE statement. Now we are adding a second condition. It says: after checking whether the sales are higher than 50, and that fails, check whether the sales are higher than 20; if yes, then show the value 'Medium'. So in our workflow we are adding a second condition that is checked only if the first one is false. Let's evaluate our sales again and check the output. The first one, the 60: as you can see, the 60 is higher than 50, we fulfill the first condition, and that's why we get the value 'High' — same as before. And here it's very important to understand one thing: SQL did not evaluate the second condition in this scenario. SQL didn't waste any time checking the other condition; it skipped everything once it got a true from one condition. This is exactly how SQL processes the CASE WHEN: it checks the conditions from top to bottom, and once it finds a true, it stops immediately, returns the value from that condition, and does not evaluate any other conditions. So now it jumps to the next value, the 30, and evaluates the conditions.
Is 30 higher than 50? Well, it's not. So it's false, and what happens now is SQL jumps to the next condition and starts evaluating the second one: is 30 higher than 20? Well, yes. So it is fulfilled and we get the value 'Medium' — SQL stops everything and shows 'Medium' in the output for this value. So in this scenario, we evaluated both of the conditions that we have in the CASE statement. Now to the third one, the 15. Is 15 higher than 50? Well, no, so the first condition is false. Then SQL jumps to the second condition and checks it: is 15 higher than 20? Also no. So now the false path is taken at the end, we get no value as a return, and the output is a NULL. And for the last one we have a NULL: we also get a NULL, because it fulfills neither of those conditions, and we didn't define an ELSE in the CASE statement. So with the conditions defined like this, we get the category 'Medium' for the 30 — and this is how SQL evaluates multiple conditions in a CASE statement.

All right. Now we come to the final form of our CASE statement: we are going to add an ELSE, a default value. We are saying: if the sales are not higher than 50 and not higher than 20, then show the default value 'Low'. That means any sales value that is equal to or smaller than 20 gets the value 'Low'. And now, very interestingly, if you check the workflow, you can see that there is now a value for every path: for the first condition we get 'High', for the second one 'Medium', and if nothing is fulfilled we always get the value 'Low'. There is no way in this chart to get any NULLs, right? So let's evaluate our values again — I think you've already got it. The 60 fulfills the first condition, so SQL stops everything immediately and just shows the value 'High'; nothing on the right side gets evaluated, because the first condition is true. So in the output we get the value 'High' — nothing changed compared to the two previous examples. Now to the next value, the 30: we evaluate the first condition, it's false; the next one — is it higher than 20? — is true, and that's why we still get the value 'Medium', just like in the previous example. Now SQL moves to the next one, and here things get interesting: the value 15. We evaluate the first condition — is it higher than 50? Well, no. Is it higher than 20? Well, no. So now we are in the scenario where none of the conditions is true, and that's why SQL executes the ELSE. If you check our chart, following the false path, we get the value 'Low'. So in the output we will not get a NULL this time — because we have an ELSE, we get the value 'Low'. The same goes for the NULL: it fulfills neither the first condition nor the second, and that's why it also gets the value from the ELSE. So here in the output we also get the value 'Low'. So now, as you can see, if you use an ELSE inside the CASE statement, you make sure there will be no NULLs in the output.
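Here is the final form from the walkthrough as a runnable sketch, evaluated against the four sample values (the inline VALUES list just stands in for the sales column):

SELECT
    sales,
    CASE
        WHEN sales > 50 THEN 'High'
        WHEN sales > 20 THEN 'Medium'
        ELSE 'Low'
    END AS category
FROM (VALUES (60), (30), (15), (NULL)) AS t(sales);
-- expected output: 60 -> High, 30 -> Medium, 15 -> Low, NULL -> Low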
So with that you have learned the different options we have inside the CASE statement, and how SQL executes the CASE behind the scenes. All right friends, now we come to the part where I show you the most useful use cases of the CASE statement — the ones I usually use in my projects. So let's start. The main purpose of the CASE statement is to do data transformations, and data transformation is a very important process in every data project. One very important task in data transformation is that we can generate new information: we can create new columns based on the existing data that we have in the database using the CASE statement, and this of course helps us derive new information for our analysis without modifying the source database — it's only for analytics. So my friends, the main purpose of the CASE statement is data transformation by creating and generating new columns.

Now let's start with the first use case, and the most important and famous one: we use the CASE statement in order to categorize data. This means we are going to group the data into different categories based on certain conditions. And now you might ask why this use case is important. Well, classifying and grouping data is fundamental in data analysis and reporting, because it makes the data easier to understand and to track — but what's more important, it helps us aggregate the data based on the categories.

All right. So now let's have the following task, and it says: generate a report showing total sales for each of the following categories — 'High' if the sales are over 50, 'Medium' if the sales are between 20 and 50, and 'Low' if the sales are 20 or less — and sort the categories from the highest sales to the lowest. Okay, let's do it step by step. Before we do any data aggregation, we have to create a new column called category, because we don't have it in the database. Let's start with a very simple SELECT statement. What do we need? Let's take the order ID and the sales, and that's it for now, from sales.orders. Execute it. Now we have our 10 orders, and we have to create the new category column, and we're going to do that using the CASE statement. So take a new line, start with CASE, and then a new line in order to define the first condition using WHEN. The first condition is the 'High', where the sales are over 50. It's very simple: WHEN the sales are higher than 50 — and what happens if this is true? We want to show the value 'High'. That's the first condition. Now to the second one: if the sales are higher than 20 — meaning less than 50 but higher than 20 — then we want to see the value 'Medium'. And for the last category, the 'Low', we don't have to write a condition, because if those two fail, the sales must be equal to 20 or less. So we just add a simple ELSE and show the value 'Low', like this. Let me make this a little bit smaller. What's missing from our CASE, of course, is the END — without it, you're going to get an error. So: END, and let's give it the name category. We are ready; let's execute it and spot-check a few rows.
So as you can see, this order has sales of 20 or less and got the category 'Low', which is correct; then we have here a 60, above 50, with the category 'High'; and if you check order number six, it has sales of 50 and it's 'Medium', because 50 is not higher than 50 — it is between 20 and 50. So as you can see, we have now classified our orders using the category. The next step is to aggregate the data. How are we going to do that? We will use a subquery. So let's do it like this: we write a new SELECT, and of course we are going to group the data by the category. So we select the category, and we need the total sales, which means using the SUM function on the sales and calling it total_sales. Now we have to nest the queries together: FROM, then our existing query inside parentheses, and then GROUP BY the category. With that we are now aggregating the sales by category. It's very simple — execute it. Now the result has only three categories; we don't have the 10 orders anymore, because we are doing data aggregation, so the granularity is now on the level of the category. We can see the total sales for 'High' is 2010, for 'Low' we have 65, and for 'Medium' we have 105. And of course we are not done yet, because the task says to sort the categories from the highest sales to the lowest. That means we have to add an ORDER BY at the end, sorting the data by the total sales from the highest to the lowest — descending. That's it; execute it. And with that we have our report: we are showing the total sales by category, and the data is sorted from the highest to the lowest. The top category is 'High', then 'Medium', and the last one is 'Low'. So my friends, as you can see, with the help of the CASE WHEN we created new information from our data — the category — and then we built insights, a report, based on this new information by aggregating our data with it. The use case of categorizing data using CASE statements is fundamental and very important in every data project.

Okay. One more thing before we jump to the next use case: there is one rule to follow when using CASE statements, and that is that the data types of the results must match. What does this mean? If we check our example again, we can see that the result of each condition is a string: we have 'High', 'Medium' and 'Low', and all of those values follow the same data type, so it is correct. Now, if I break this rule — for example, after this THEN I put the number 2 — then we have a number mixed with characters. Execute it, and of course we get an error, because SQL is now trying to convert the value 'Low' to an integer, which is incorrect. So the data types of the output must match, and that includes not only the values after the THEN but also the value after the ELSE, because this value is part of the output as well. So let's put 'Medium' back, and now change the ELSE to, say, 1, and execute again: SQL throws an error, because this is an integer, a number, and the others are strings. So this is the rule of using the CASE statement: the data types after THEN and after ELSE must match.
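The whole report query from this use case, as a sketch (assuming sales.orders with order_id and sales columns):

SELECT
    category,
    SUM(sales) AS total_sales
FROM (
    SELECT
        order_id,
        sales,
        CASE
            WHEN sales > 50 THEN 'High'
            WHEN sales > 20 THEN 'Medium'
            ELSE 'Low'
        END AS category
    FROM sales.orders
) AS t
GROUP BY category
ORDER BY total_sales DESC;

The subquery derives the new column; the outer query aggregates and sorts on it.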
And if you ask me whether there are restrictions about where you can use the CASE statement, in which clauses: you can use it everywhere — in SELECT, in joins, FROM, WHERE, GROUP BY, ORDER BY, everywhere. So there are no restrictions, and we have only this one rule.

Okay friends, another use case for the CASE statement: we can use it in order to map values. So we can use the CASE statement to transform the data from one form to another, to make it more readable and more usable for analytics. One scenario of mapping values is that sometimes the database developers store the values inside the database as codes and flags. For example, the status of the order could be stored as 1 and 0 instead of 'inactive' and 'active'. This is one technique to optimize the performance of the database for the application, because storing 1 and 0 is way faster than storing the whole string. But in data analysis, we usually generate a report to be read by humans, by persons. And instead of showing the data as 0 and 1, it's going to be much nicer and more readable if you show the data as 'active' and 'inactive'. So for these scenarios, we're going to use the CASE statement in order to translate those cryptic, technical values into readable terms. Otherwise, everyone who consumes your report is going to ask you: what do you mean with the 0 and the 1?

Let's have the following task, and it says: retrieve employee details with gender displayed as full text. Okay, so now let's go and solve it. First we're going to explore a few pieces of information. Let's show the employee ID, and let's take the first name, last name, and we need the gender information. So gender, FROM sales employees. That's it, let's go and execute it. Now as you can see in the result, we got our five employees, and the gender information is stored as only one character, F and M. Of course it's easy to understand that F is female and M is male, but we would like to show it in the report as full text — 'Female' and 'Male' instead of those abbreviations.

So in order to do that, we're going to use the CASE statement to do the mapping between the old value and the new value. Let's go and create a new column using the CASE. We're going to have two conditions here, because we have two values. Let's start with the first one. We're going to have a new line and WHEN: when the gender equals 'F' — ladies first — then 'Female'. And for the second value it's going to be exactly the same: WHEN gender equals 'M' THEN 'Male'. Be careful with the case sensitivity of the values. And of course we will not end this without an ELSE. So ELSE, and then we can have the default value; we're going to have 'Not Available' — it's better than having NULLs. So what we are missing is the END; we're going to have an END over here, and we're going to call it gender full text. That's it, let's go and execute it.

Now if you check the results, we have done the mapping between the old format of the value and the new format: instead of M and F we have 'Male' and 'Female'. And we don't have any NULLs here; that's why we don't see 'Not Available' in the data. But if you have huge data, of course, you can have a NULL somewhere, and then you will get this default value. So this is how you can do mapping between values very easily using the CASE statement.
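A sketch of that mapping query — again assuming the course's table and column names (Sales.Employees, Gender, and so on):

    SELECT
        EmployeeID,
        FirstName,
        LastName,
        Gender,
        CASE
            WHEN Gender = 'F' THEN 'Female'
            WHEN Gender = 'M' THEN 'Male'
            ELSE 'Not Available'
        END AS GenderFullText
    FROM Sales.Employees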
Okay, let's have another task for the mapping use case, and the task says: retrieve customer details with an abbreviated country code. Sometimes, as we are generating reports — maybe using Power BI or Tableau — we don't have enough space to use the full names of values. So what do we need? We need abbreviations, a short form of the values, and we can use the CASE statement in SQL to map the full value to an abbreviated value. So it's like the previous example, but the other way around.

All right, so now let's go and solve it. We're going to select a few details, like the customer ID; let's take the first name, last name, and what do we need? We need the country information, FROM sales customers. That's it, let's go and execute it. Now as you can see, we get our five customers, and we have the country information as a full name. Of course, for the report we need abbreviated values for this, so we're going to map those full country names to a short form.

But in a real project you might get big tables where you have thousands and millions of records, so you cannot just check it like this. So how do I usually do it? I go and retrieve a distinct list of all values from the column, and I usually have a separate query for that. So we're going to have SELECT DISTINCT country FROM the table sales customers. It's just for me to see all the possible values inside the database. Now you see the second result over here: we have only two values, Germany and USA, and then I can go and map the data correctly. So always, if you are mapping data using the CASE WHEN, you have to understand all the possible values that you have inside the table.

So let's go and generate this new information. Let's start with CASE, and then a new line: WHEN country equals the first value, 'Germany' — make sure you write it exactly like in the database; the first character is capital and the rest are small. So what should happen? We're going to have the abbreviation of Germany, which is 'DE'. All right, this is the first value. Then let's move to the second one: country equals 'USA'. It's already abbreviated, but maybe we can keep only two characters, so 'US', like this. And now let's add an ELSE. It's optional, but in case we have NULLs in the data or we get a new value: ELSE 'n/a', for not available. So that's it — and never forget about the END. So END, and the name is going to be country abbreviation. That's it; let me just get rid of the other query. The mapping is correct, let's go and execute it.

And now if you check the results, we got a new column called country abbreviation, and as you can see, the mapping is working: here we have Germany and we have DE, and for the USA we have US. So with that we have solved the task, and we have done the mapping correctly between the old value and the new value.

All right friends, now there is a special case for the syntax of the CASE statement if you are using it for mapping values. So let's go and check it. Let's say that we have a lot of different distinct values inside the country — not only two values, a lot of values — and if you are mapping the values using the CASE WHEN, you're going to end up always writing the same thing: country equals Germany, country equals India, country equals United States, and so on. So we are always using the column country; the conditions over here always use one column, and the operator is always the equal.
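Here is a sketch of both queries from this task — the exploration query first, then the full-form mapping (hypothetical names as before):

    SELECT DISTINCT Country FROM Sales.Customers

    SELECT
        CustomerID,
        FirstName,
        LastName,
        Country,
        CASE
            WHEN Country = 'Germany' THEN 'DE'
            WHEN Country = 'USA' THEN 'US'
            ELSE 'n/a'
        END AS CountryAbbreviation
    FROM Sales.Customers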
So now, only for this scenario, we have another syntax for the CASE statement, and it looks like this: we start with the keyword CASE, but right after it we put the column that we want to evaluate — and here you can use only one column; you cannot use multiple columns. So we are telling SQL: we are now evaluating one column, the country. And then for each condition we have the following: we say WHEN 'Germany' — that means when country is equal to Germany — THEN 'DE'. So as you can see, we don't have the whole condition here; we have only a possible value that we can see inside the country. We are asking: is the value of country 'Germany'? If it's true, then show 'DE'. The next one: is it 'India'? Then 'IN'. 'United States'? Then 'US'. And so on.

We call this syntax the quick form of the CASE statement, and on the left side we have what we call the full form of the CASE statement. Of course, the restriction and limitation of the quick form is that you can use only one column, and it works only with the equal operator. So only for these scenarios can you use the quick form. If things get a little bit complicated, where you have to mix and build complex logic, you cannot use the quick form. So I would say: if you are sure that the logic will not get complicated and you can always stay with the same column, you can go with the quick form. But I would recommend always going with the full form, for one simple reason: if you add one small piece of logic, you have to rewrite the whole CASE statement back to the full form in order to add it. But of course, there is nothing wrong with using the quick form if the logic stays static, you are sure you are using only one column, and you are just doing mapping, with no extra logic.

Okay, so now let's try this quick form of the CASE statement on the previous example. I will just copy everything to a new column, and I'm just going to add a 2 to its name. And now, how are we going to do it? It's going to be CASE, but this time we write country right after it, and then inside the WHEN we will have only the values — no need for the full condition. So it's going to be like this. Let me scroll up. That's it; as you can see, it's smaller and quicker than writing the whole condition each time. So now let's go and execute this, and as you can see, in the result we get identical values. So now you know one more trick with the CASE statement.

All right, moving on to the next use case for the CASE statement: we can use it in order to handle NULLs. Handling NULLs means replacing a NULL with a value. And as we learned before with the window aggregate functions, sometimes NULLs lead to incorrect calculations and results, which leads to wrong decision-making. We're going to have a dedicated chapter later on how to handle NULLs in SQL, but for now we're going to learn how to handle NULLs using CASE statements.

So now let's have the following task, and it says: find the average scores of customers and treat NULLs as zero; additionally, provide details such as customer ID and the last name. Okay, so let's solve it step by step. Again, we have details here, and as well we have to do aggregations; that means we have to use the window functions. And we must not forget that we have to treat the NULLs — we have to handle them. So let's start with a very simple SELECT: SELECT customer ID, we need the last name, and as well we need the scores. So FROM sales customers; let's go and execute it.
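Before we continue with the NULL task, here is a sketch of that quick-form mapping for reference (same hypothetical names as above):

    SELECT
        CustomerID,
        Country,
        CASE Country
            WHEN 'Germany' THEN 'DE'
            WHEN 'USA' THEN 'US'
            ELSE 'n/a'
        END AS CountryAbbreviation2
    FROM Sales.Customers

Now, back to the scores.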
So as usual, we have our five customers and the scores, and here we have a NULL. Now we're going to write the window function, but without handling the NULLs, just in order to see the difference. So we need the average function — for what? For the scores. Do we have to partition the data? Well, no, so we're going to leave it empty; we need the average score of all customers. That's it: let's give it a name and then execute it. I think I have a mistake here — it is score, not scores. And now, as you can see, we have the average of 625. As you learned before, SQL is going to summarize those four values and divide them by four. But our business understands the NULLs as zero, not as missing information, so we have to handle the NULL.

Let's go and create a new column for the scores, but this time we're going to use the CASE statement. It's going to be very simple: we're going to say WHEN the score IS NULL — in SQL we don't write equals NULL, we say IS NULL — THEN 0. So with that, we are replacing the NULLs with zero, right? Now, otherwise, what should happen? If it's not NULL, we need the score as it is; we should not manipulate anything. So the default value, the ELSE, is the score itself. Now let's go and END it, and let's call it score clean. Let's go and execute it.

If you check the result over here, it's almost identical to the score: we don't have any new values for the scores; only the NULLs are now zero, and all other values are not affected — we didn't touch them, we didn't transform them at all. So this is what we mean by handling NULLs: replacing NULLs with another value. Now, in order to finish the task, we have to take the average of the score clean, not of the original score. How are we going to do it? Let's copy the whole CASE statement; I'm just going to do it in another column. So let's have an AVG, and inside it we have the CASE statement, like this. Let me just arrange it like this. And now what is missing is the OVER, and it's going to be empty. So average customer — let's call it clean. This is the logic; let me just make everything smaller. As you can see, it's exactly like the previous one, but instead of using the original score, now we are using the logic that we have created. Of course, we don't need the alias over here, so we have to remove it; it starts with CASE and ends with END. So let's go and execute it.

And now you can see in the output we got a new value for the average, and it is more accurate for the business: now we have 500; previously we had 625. So as you can see, you have to understand what the NULLs mean in your business and handle them correctly; otherwise, you will get wrong results. So that's it: we use CASE statements in order to handle the NULLs inside our data.

Conditional aggregation means we're going to apply an aggregate function in SQL — like SUM, AVG, COUNT — but this time only on a subset of the data that meets specific conditions. This technique is amazing for doing deep-dive analyses, or targeted analyses, on a specific subset of the data. So now let's have the following SQL task in order to understand this use case. The task says: count how many times each customer has made an order with sales greater than 30. All right, so as usual, we do it step by step. What do we need? We need the orders. So let's get the order ID, and as well let's get the customer ID, like this, and the sales, FROM sales orders. Let's go and execute it.
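One look back before we continue: here is a sketch of the full NULL-handling query we just built (again assuming the course's Sales.Customers table with a Score column):

    SELECT
        CustomerID,
        LastName,
        Score,
        AVG(Score) OVER () AS AvgCustomer,
        CASE WHEN Score IS NULL THEN 0 ELSE Score END AS ScoreClean,
        AVG(CASE WHEN Score IS NULL THEN 0 ELSE Score END) OVER () AS AvgCustomerClean
    FROM Sales.Customers

The first average ignores the NULL and gives 625; the second treats it as zero and gives 500. Now, back to the conditional aggregation task.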
So now, what else am I going to do? I'm going to order the data by customer ID. Let's execute it again. Okay, so now, the task sounds easy, but it's a little bit tricky: we have to count the number of orders for each customer where the sales are higher than 30. Let's have an example: take customer number one. The total number of orders is three, right? But we have to count only the orders where the sales are higher than 30, and in this example we have only one such order — order number four. So the count for customer ID number one should be one. Now let's check another customer, for example number two: as you can see, we have three orders, but none of them has sales higher than 30, so the count should be zero here.

So how are we going to do that? We have to flag each row, whether it's higher than 30 or not: if it's higher than 30, it gets the flag of one; if it's less than or equal to 30, it gets zero. And then we're going to summarize all those flags in order to get the count. Let's do it step by step. First, let's create the flag. We're going to use CASE, and then our condition is very easy: WHEN — so what is the condition? Sales greater than 30. THEN, what should happen? We're going to flag it with a one, because later we're going to summarize the ones. And now, ELSE: if it's not higher than 30 — equal to 30 or less — it gets zero. All right, so now let's END it, and let's call it sales flag. Let's go and execute it and check the results.

All right, so if you check the results, we now have a very nice flag to see which orders have sales higher than 30. For example, let's take customer ID number one: as you can see, only order number four has sales higher than 30, and it's flagged with one, and all the others are zero. Now let's take customer ID number three: we have two orders where the sales are higher than 30, so we have the one twice. And now we can use this flag to do the aggregation. If you summarize the flag for customer ID number three, we will get two, and this is the count of orders where the sales are higher than 30, right? And let's take another example, customer ID number two: we have zero everywhere, and if we summarize those values we will get zero, which is the count of orders with sales higher than 30 — which is correct.

So as you can see, first we have built an extra column to help us do the aggregation, and in the next step we're going to aggregate this column. Let's do that. We don't need all that information — the order ID — but we do need the customer ID, because it is the granularity for the aggregation, and let's remove the ORDER BY. Now let's group the data by customer ID; but of course we need the aggregate function. How are we going to do it? We're going to summarize the whole flag with SUM. And of course we're going to rename this, since now it is an aggregated column; we're going to call it total orders. Now let's go and execute it and check the result. As you can see, we have our four customers: for customer ID number one we got only one order higher than 30, the second one has no orders higher than 30, the third has two, and the last one has one. And with that, we have solved the task.
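A sketch of the conditional aggregation we just built (same hypothetical names):

    SELECT
        CustomerID,
        SUM(CASE WHEN Sales > 30 THEN 1 ELSE 0 END) AS TotalOrders
    FROM Sales.Orders
    GROUP BY CustomerID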
Now I would like to add one more thing to our query, in order to see the normal aggregation next to the conditional aggregation. Usually we count the star — COUNT(*) — in order to get the total orders; and let's rename the previous one to high sales. Let's go and execute it. So now we are doing an aggregation without any conditions, and we can see how many orders each customer made. We can see that customer ID number one ordered three times, but only one order was higher than 30. So this is a normal aggregation, and the other is a conditional aggregation using the CASE statement.

All right friends, so now let's do a recap of the CASE statement. The CASE statement evaluates a list of conditions one by one and returns a value once the first condition is met. And if we are talking about the rules of using CASE statements, we have only one: the data types of each result, after the THEN and the ELSE, must match. Now, if we talk about the use cases of the CASE statement, the main use case is to do data transformations, especially by creating new columns and deriving new information. As we saw, there are amazing use cases for the CASE statement. For example, we can use it to categorize our data: as we learned, we can create new groups of data, which can then be aggregated for our reports. Then we saw another use case, mapping values: we can use the CASE statement to help us map the cryptic, technical values stored in databases to new values which are more readable and friendlier to use. The next use case we learned is handling NULLs: we can use the CASE statement to replace the NULLs with a value, to make our aggregations more accurate. And the last use case we learned — and I think the most used one in my projects — is conditional aggregation, where we aggregate a subset of data that meets specific conditions in order to do focused and targeted analyses.

Okay my friends, so with that we have covered all the topics and all the functions for transforming a single value in SQL — the row-level functions. That was very important, especially for data engineers. So we are done with this chapter. Now we are moving to a very interesting chapter: finally, we're going to talk about data analytics in SQL, and we will be covering the aggregate and the analytical functions that we have in SQL. First we're going to start with the basics, so we will learn simple functions for aggregating your data. So let's go.

Hey my friends, so now we're going to talk about the aggregate functions in SQL. They are amazing if you are a data analyst or data scientist, where we usually use them to uncover insights about our data. The aggregate functions accept multiple rows as an input, and the output of an aggregate function is usually one single value. So now we're going to cover the basic aggregate functions in SQL first. Let's go.

So now, in our database we have four orders, and we have the sales information for each one of them. One question that comes to mind: what is the total number of orders in our business? How many orders do we have? To answer that, we use the function COUNT, because what it does is count the number of rows inside our table. So if you apply the COUNT function on this data, SQL is going to start counting how many rows we have. The total number is four, and in the output we will get four.
So as you can see, we don't really care about the content of the table; SQL is just counting how many rows there are. The count is not based on the sales information or the orders. This is how the COUNT function works. Now we have another question: I would like to find the total sales in our data, in our business. That means we have to summarize all those sales that we have in the orders, and for that we have the SUM function. If you apply the SUM function, it's going to summarize all the sales and return the total sales at the end — in this example, 80. So as you can see, the aggregate function accepts multiple rows, multiple values, and the output is one single value, the aggregated value.

Now, moving on, I would like to understand what the average sales in our business is. It sounds simple: for that, we're going to use the AVG function. If you apply it on the sales, it's going to summarize all those values and divide them by the number of values, so you will get the average of 20. Now comes an interesting question, where you want to find the highest sales in your data. For that, we can use the function MAX. Once you apply it, it starts searching for the highest value inside your table. So this time we are not really aggregating the data into something new; it's more like searching for the highest value among multiple values. In this example, we will get 35 as the highest sales. And of course, if you want to see the lowest sales inside your business, you can use the MIN function. If you apply it, the same thing happens: it starts searching for the lowest value in the sales, and in this example it's going to be 10. So as you can see, guys, the aggregate functions are very simple, yet very powerful. They are really useful for insights, in order to understand how well your business is performing. So now let's go to SQL to try those functions.

Okay, so now we're going to analyze the orders table inside our database by doing very simple aggregations. Let's start with the first task; it says: find the total number of orders. This time we are targeting the table orders. Let's just start with the SELECT. Now we can see we have four orders, and we would like to have one number. What can we do? We can say COUNT(*) AS total number of orders. Let's go and execute it, and with that we got one number: it is four, the total number of orders.

Now let's move to the second task; it says: find the total sales of all orders. This time we have to summarize all the sales values into one big value. How do we do it? We're going to use the function SUM, and this time we are targeting the sales, and we're going to call it total sales. Let's go and execute it, and with that we have 80 as the total sales: all the sales values are summarized into one big value. So as you can see, we are now exploring the business, right? We are understanding how many sales, how many orders. This is really the basics of analytics in SQL.

Now let's go to the next task: find the average sales of all orders. So we're going to have AVG, this time of the sales, AS average sales. Again, very simple; let's go and execute it. Now, the total sales is 80, but the average sales is 20: all the values of the sales are summed up and then divided by the number of orders. So, 80 divided by four.
And with that, SQL finds 20 as the average. Now let's get to the interesting stuff: find the highest sales of all orders. What is the highest sale that happened in our business? For that, we can use the function MAX: MAX of sales AS highest sales. Very nice, let's go and execute. The highest sales in the database is 35. And now I think you already know the next task: find the lowest sales of all orders. This is exactly the opposite, so we're going to use MIN of sales AS lowest sales. Let's go and execute. The lowest sales in our business was 10.

So my friends, as you can see, the aggregate functions are really amazing, and if you use them like this, you will get the big numbers about your business. But don't forget: if you combine the aggregate functions with a GROUP BY, you will be breaking those big numbers down — you are aggregating by the customer ID, maybe by a date, by a country. Anything you specify in the GROUP BY is going to break those big numbers into smaller numbers based on the column that you are using. For example, let's go with the customer ID over here, and let's put it at the start as well. Now if you execute it, you can see in the output that those numbers are not big numbers anymore; we drilled down to more details based on the column that we specified. So now we have, for each customer, the total number of orders, the total sales, the average sales, the highest sales and the lowest sales (you'll see these queries sketched together in a moment). Of course, the data is very small, and those numbers are more interesting if you have bigger data. So if you combine the aggregate functions together with the GROUP BY, you will break those big numbers into more details based on the column that you are grouping by. Now what you can do is apply those functions to the customers as well: there we have a score, and you can find the average score, the highest score, the lowest score, and then you can group the data by the country, for example. So pause the video and do some aggregations on the table customers.

All right my friends, so with that you have learned the basics of how to aggregate your data using SQL. Now we're going to move to a more advanced way to aggregate your data: we will start talking about the window functions, the analytical functions. First, we're going to talk about what window functions exactly are, and we're going to cover the basics of this topic. So let's go.

Window functions — or, as we sometimes call them, analytical functions — are very important functions in SQL. Everyone must know them, especially if you are doing data analysis. Each time I write a SQL script to do data analytics, I end up using them. So as usual, we're going to first understand the concept behind them, and then we're going to start practicing. Let's go.

Okay guys, let's start with the first question: what are SQL window functions? They are functions that allow you to do calculations, like aggregations, on top of a subset of the data, without losing the level of detail of the rows. So it is something very similar to the GROUP BY, but here we have a special case: you don't lose the level of detail. Now, in order to understand the definition, let's have a very simple example. Let's understand how SQL works with the GROUP BY clause. Let's say that we have this very simple example: we have four orders — two orders for the caps and two orders for the gloves.
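Before we walk through that example, here is a quick recap sketch of the aggregate queries from this section, including the grouped version (table and column names follow the course database and may differ in your setup):

    SELECT
        CustomerID,
        COUNT(*)   AS TotalNrOrders,
        SUM(Sales) AS TotalSales,
        AVG(Sales) AS AvgSales,
        MAX(Sales) AS HighestSales,
        MIN(Sales) AS LowestSales
    FROM Sales.Orders
    GROUP BY CustomerID

Remove the CustomerID and the GROUP BY, and you get the big numbers for the whole business in one row. Now, back to our caps-and-gloves example.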
And let's say that I would like to see the total sales for each product. Now, if we decide to use the GROUP BY, what is SQL going to do? It's going to take the first two orders, for the caps, and put them into one row; so in the output we have only one row for the caps, with the total sales of 40. And the same thing happens for the gloves: we take the two rows of the gloves from the input, and in the output we have only one row for the gloves. That means the number of rows depends on the number of products we have in our data: we have two products, we get two rows. So SQL is really smashing, or squeezing, the results in the output, and this is exactly what the GROUP BY does to our data: it aggregates the rows, it aggregates the data, into a different level of detail. On the left side we see four rows, on the right side we have two rows, and with that we are losing some details in the results. But still, we have solved the task.

Now let's see what happens if you use a window function in SQL. We have the same data and the same task: find the total sales for each product. If you use a window function, SQL is going to execute each row individually from the others. It starts with the first row, order ID one. In the output we get the same row, order ID one, but we also get the total sales for the caps; here the total sales is going to be 10 + 30, so we get 40. Then it jumps to the second row and processes it as well: in the output we get order ID two, the product caps, and the same aggregation, since we are talking about the same product — again 40. Then it goes to the third order, and here we have the gloves: in the output we have order ID three, the product gloves, and the total sales this time is 5 + 20, so we get 25. Then it goes to the last row, order ID number four: in the output we get four, gloves, and again 25.

So we can notice that if you use the window function, you will not lose the level of detail of your data; we are doing something called row-level calculations. If in the input we have four orders, in the output we get four orders, and we also get our aggregations, correctly. Now, if you compare both methods side by side, we can see that we are solving the same task — finding the total sales for each product — but with the GROUP BY we are smashing, squeezing, the results from four orders into two rows, one row for each product. That means with the GROUP BY the granularity is changing, right? In the input, the order ID controls the level of detail, but in the output of the GROUP BY, the product controls the level of detail, so we have a different granularity. On the other hand, with the window functions we are still able to do aggregations, but we are not losing the level of detail: the granularity of the input is the same as the granularity of the output.

So this is exactly the main difference between the GROUP BY and the window function. If you just want to do simple aggregations, go with the GROUP BY. But if you care about the level of detail and you need to add more details to your results, go with the window function, where you can do aggregations plus keep more details.
And now, if you compare the functions between the window functions and the GROUP BY, we find that both of them have exactly the same functions for the aggregations: the COUNT, SUM, AVG, MIN, MAX. And here comes another difference: the GROUP BY has only the aggregate functions — that's it. But with the window functions, we have way more functions to use for analytics. For example, we have the ranking functions, and we have another group of functions, the value functions, also called analytical functions. That means with SQL window functions we can cover a lot of analytical use cases and advanced, complex stuff, but with the GROUP BY we have only the aggregate functions, only for simple use cases. So this is another difference between the GROUP BY and the window functions: use GROUP BY if you have simple analyses, simple aggregations; use window functions for more advanced data analysis, where we're going to cover a lot of use cases.

All right guys, so now we're going to have a few tasks in order to understand one thing: why do we need SQL window functions, and why, in some scenarios, is GROUP BY not enough so that we have to use SQL window functions? Let's go. Let's start with a very simple task; it says: find the total sales across all orders. So we need one value with the total sales. Let's see how we can do that. First, make sure that you are using the right database — so USE the sales database, in case you have closed the client, so that we don't get any errors. Now we start with the first thing: we're going to select the sales; you'll find it in the table sales orders. Let's just query the data, and as you can see, we have 10 orders with 10 sales. We didn't aggregate anything yet, so we have the raw data now. To solve the task, we're going to use the function SUM: SUM of sales, and we're going to give it a new name, total sales. We don't have to use any GROUP BY, because we don't have to group anything. That's it; let's go and execute it. And as you can see, SQL returns one value: 380. This is the total sales that we have inside our data, and this is the highest level of aggregation. So with that we have solved the task: we have the total sales across all orders, without grouping anything.

Let's move to the next example. In the next task, this time we want to find the total sales, but for each product — not for all orders; for each product we want to find the total sales. This time we don't need only one value; we need one value for each product. To do that, we're going to use the GROUP BY clause, and we're going to group by the product ID — and the GROUP BY needs the dimension in the SELECT as well, so we can do it like this. That's it, let's go and execute the query. Now as you can see in the results, we don't have one value, we don't have the highest level of aggregation; this time we are drilling down to the next level of detail. The level of detail here is the product ID: we have one row for each product. For the first product we have 140, the next one 105, and so on. So as you can see, we are now splitting the data at the level of the product ID, and we went from 10 orders to four rows in the results, because we have four products.
So the number of rows in the output is defined by the dimension, the product ID, and with that we have solved the task: we have the total sales for each product.

All right guys, let's keep progressing with our examples. The next one is going to be a little bit more advanced, where we have the same aggregation: find the total sales for each product; additionally, provide details such as order ID and the order date. As you can see, we have already solved the first part — finding the total sales for each product — and now we just have to add some additional information, like the order ID and the order date. So let's go over here and just add them to our SELECT: order ID, and let's have the order date. Let's go and execute that; I'm just going to make it a little bit bigger. But now, as you can see, SQL will not be happy: it's going to throw an error saying that the columns you are adding to your SELECT are not included in the GROUP BY. As you can see, in the GROUP BY we have only one dimension, one field, the product ID, but in our SELECT we have three dimensions: the order ID, the order date and the product ID. So there is no match between the SELECT and the GROUP BY, and SQL will not allow it.

Now you might say: you know what, let's add everything to the GROUP BY; with that we get our aggregation, and as well we get our details. Let's try that. I'm just going to zoom out a little bit, and instead of having only the product ID, let's add everything: the order ID, order date and the product ID. Now we have a match, and SQL should not throw any error. Let's go and execute it. Now let's check whether we have solved the task. The task has two parts, right? We have to do the aggregation and provide the details. As you can see, we have solved the second part: we have the details, order ID and order date. But the first part — finding the total sales for each product — is destroyed, because if you check the results, we have the product ID 101 with a total sales of 10, but in the third order we have it as 20, for the same product. So actually, the data is not aggregated, and that's because we are aggregating at a different level: we have included far more than we need for the aggregation, and we are aggregating at the order ID level.

So as you can see, we are now hitting the limits of the GROUP BY: we cannot provide aggregations and at the same time provide additional information from our data — you have to pick one. That's why we have to go to the second option, where we use the window functions. Let's do that. I'm just going to get rid of the GROUP BY part, and as well all the extra fields; let's go back to the start. So now we have the SUM of sales, and if I execute this, I'm going to get one value — we are at the highest level of aggregation. Now we need to turn it into a window function. I'm just going to remove the name, and now we're going to tell SQL this is a window function, using OVER: putting OVER after the aggregate function tells SQL we are talking about a window function. Let's just execute it like this. And with that, we got 10 rows — because we have 10 orders — and each row has exactly the same value: the total sales of all orders on every row. So as you can see, SQL understands this is a window function, and it does not group all the data into one row; it keeps exactly the same rows, the same number of rows, as the input.
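A sketch of what we just ran — the empty OVER () is what turns the aggregate into a window function (same hypothetical names):

    SELECT
        SUM(Sales) OVER () AS TotalSales
    FROM Sales.Orders

Every one of the 10 rows gets the same value, 380, because the whole table is treated as one window.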
So with that we have the window function, but we still have to split the data by the products. Now we're going to use the keyword PARTITION BY — it's like the GROUP BY, just with different wording — followed by product ID, the same dimension. So with that we have the total sales by products; let's use that as the name. Let's go and execute this. Now, as you can see in the output, we still have the same number of rows: 10 orders, 10 rows, but the result did change, because now we are aggregating the data at the level of the product ID. In order to understand the results, we have to add more information to our SELECT. So let's add the same dimension, the product ID; I'm just going to add it at the front over here. Let's select, and as you can see, now it makes more sense: we have those products, and each product always has the exact same total sales, and the same for the next product, and so on.

And now here comes the magic of the window function: we can add more information to our SELECT statement without getting any errors. So we need additional information, like the order ID. We can go over here and say order ID, order date — any column can be added to the SELECT — and let's go and execute. As you can see, we got the result, even though those three dimensions in the SELECT are not part of the window aggregation. So with that we have solved the task: we have the additional information — the order ID, the order date — and as well the first part of the task, the total sales for each product; each of those values is the total sales for its product (you can see the full query sketched below). And this is exactly why we need window functions. In real projects things get really complicated: you are doing different tasks in one query — aggregations and other work — so just focusing on the aggregations is not enough; you always have to add additional information to your query. So as you can see, we use GROUP BY for simple analyses, but as things get complicated in the analytics, we use the window functions in order to show the aggregations and also add additional information.

All right everyone, so now we're going to deep dive into the syntax of the SQL window functions. We're going to cover each part of the syntax, for you to understand how to use them. Let's go. Let's start first by understanding the basic components, the basic parts, of each window syntax. Mainly we have two parts. The first part is the window function itself — SUM, AVG and so on. The second main part is the OVER clause, and inside the OVER clause we have three different parts: the first one is the partition clause, the second the order clause, and the last one the frame clause. Those are all the components that you can use inside a window function. So, two main parts: the window function and the OVER clause; and inside the OVER we have partition, order and frame.

Let's go more into the details. For example, we have the following window function. As you can see, there is a lot going on here; we're going to understand it step by step, component by component. Let's start from the left, from the first one. So what do we have over here? We have a function, a window function.
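For reference, here is the full query that solved the task above (a sketch with the same hypothetical names) — then we continue with the syntax:

    SELECT
        OrderID,
        OrderDate,
        ProductID,
        SUM(Sales) OVER (PARTITION BY ProductID) AS TotalSalesByProducts
    FROM Sales.Orders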
So what is a window function? Here, for example, we have the AVG. It's like any other function in SQL: you can use it to do calculations on top of the window. The first thing to define in a window is the function of the window. And as we learned before, we have a long list of window functions available in SQL, and we group them into three groups. The first one is the aggregate functions: the COUNT, SUM, AVG, MAX — all functions we also have for the GROUP BY; those are used for the aggregations. The second group is the ranking functions: the ROW_NUMBER, RANK, NTILE and so on; we can use this group to give a rank to our data. The last group we call value functions, or sometimes analytics functions: here we have very important functions like the LEAD, LAG, FIRST_VALUE and LAST_VALUE, in order to access a specific value. And of course we're going to learn all of them, one by one: understanding the concepts, some examples, and as well when to use them for your analyses.

All right, so now let's keep moving and understand the other parts of the window syntax. Inside the function AVG we have a field name, a column name, called sales. This is called the function expression: it's like a value, a parameter, an argument that we pass to the function. And here we can use different things, depending, of course, on the function. It could be empty, like with the ranking functions — they don't allow an expression, so it should always be empty. Or we can use a column, like in the example, where we use the sales: we use the column name as an argument, an expression, and for the AVG we are finding the average of sales. Or we could use a number: in the NTILE we are allowed to use only numbers. Or we could have multiple things: for example, in the LEAD we can have sales, then numbers, and so on — things get complicated there, but don't worry about it, I'm going to explain that. Or we can have a whole conditional logic: for example, here we have a CASE WHEN inside the SUM; the whole thing over here is the expression for the SUM. So as you can see, we can build complex logic here, and the output of this logic is passed to the function SUM. That means, as an expression for the function, we can use different things — depending, of course, on whether the function allows it.

All right, so now let's have a quick overview to understand which data types are allowed in the expressions of those functions. Let's see the aggregate functions: as you can see, the COUNT function accepts any data type, but the others, like the SUM, AVG, MIN, MAX, allow only numerical data types. Now let's move to the ranking functions; for the expressions it's pretty easy: they should be empty — they don't allow any argument inside. As you can see, all of them are empty; only one accepts numerical values, which is the NTILE — there you have to define a numeric value. And moving on to the last type, the value functions: they accept any data type inside the expressions. So as you can see, each function has its own specifications, and you have to be careful which data type you are using in the expressions. Okay, so now let's keep moving to the next part, a very important part in the window syntax. So far, what do we have? We have a function; we have an expression.
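A small sketch showing the different kinds of expressions side by side (hypothetical column names again; note that in SQL Server, ROW_NUMBER and NTILE also require an ORDER BY inside the OVER, which we cover next):

    SELECT
        OrderID,
        Sales,
        AVG(Sales)   OVER () AS AvgSales,              -- column expression
        ROW_NUMBER() OVER (ORDER BY Sales) AS RowNum,  -- no expression allowed
        NTILE(2)     OVER (ORDER BY Sales) AS Bucket,  -- numeric expression only
        SUM(CASE WHEN Sales > 30 THEN 1 ELSE 0 END) OVER () AS HighSalesCount  -- conditional logic
    FROM Sales.Orders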
That's the usual stuff; we have done that before using the GROUP BY. Now we have to tell SQL that we are dealing with a window function, not a normal one, and to do that, we have to specify the keyword OVER. So the second main part of the syntax is the OVER clause, and we use it in order to define a window. Inside it we can define multiple things — the PARTITION BY, the ORDER BY, the frame — but all of those are optional; we can skip them and leave it empty. So the main job of the OVER is, first, to tell SQL we are dealing with a window function, and second, to let you define a window over your data. Now we're going to cover everything inside the OVER clause, starting with the first part, the PARTITION BY.

All right, so now we're going to learn how to define a window inside the OVER clause. The first part that we can define is the PARTITION BY. For example, here we have PARTITION BY category: we have to define the dimension. It's very similar to the GROUP BY, just with different wording. So the first part is the partition clause; what does it do? It divides the entire data set into groups — you can call them windows, or partitions. Here we tell SQL how to divide our data, and we have two options. Let me just show you. If we don't use anything — we leave it empty; you see OVER, and PARTITION BY is not used — what's going to happen? SQL is going to use the entire data set for the calculation; the whole data set counts as one window. We are telling SQL: don't divide anything, leave it as it is. The second option is to divide the data with PARTITION BY: we define the window like this — PARTITION BY product, for example. SQL is going to divide the entire data set into different windows — for example, here, two windows — and this time the calculation, the SUM of sales, is not applied to the entire data set; it's applied to each window individually. So we find the SUM of sales for window one separately from the total sales of window two.

All right, so now we have this very simple example. We have three fields: the month, product, sales — really simple information — and we have the following SQL window function: SUM of sales, and inside the OVER clause we are not using anything; no PARTITION BY. So how is SQL going to define the window? SQL says: okay, I don't have to divide anything; the entire data set is one window. So SQL goes over here and says the whole thing is one window — there are no partitions, there is nothing, we have only one window — and the entire data set is aggregated. This is what happens if you don't use PARTITION BY and you leave the OVER clause empty: the entire data set is one window.

All right, so now let's move to the next example. We don't want only one window; we would like multiple windows, so we have to divide the data by something. In the OVER clause we define the window like this: PARTITION BY month. So it's not empty; we are now dividing the data by the field month, and the values inside this column divide the data set. Here we have two months, January and February. So what is SQL going to do? It's going to divide the data into two sets: the first window is this one, for January — let me make it smaller — and the second window is the February.
So there are going to be two windows inside our data, and the calculation happens on each window separately. As you can see, we are using the month to divide our data set into two windows: one window for January and another for February.

So now let's have a quick overview of the options that we have with the PARTITION BY. The first option, as we learned, is to just skip it: without PARTITION BY — for example, here, total sales across all rows — we don't put anything inside the OVER. The second option: we can use one field, one column — for example PARTITION BY product; we are using one dimension. But we can also mix things: we can use multiple columns, multiple dimensions, in the PARTITION BY — for example, PARTITION BY product and order status. So with the PARTITION BY we can define a list of dimensions to be used to divide our data; in this example we are saying: find the total sales for each combination of product and order status. Those are the different options for working with the PARTITION BY. And looking at this overview again for all the functions: the PARTITION BY is optional for all of them, so if you don't use it with any of those functions, you will not get any errors. So now let's go back to SQL to start practicing with this clause.

Okay, so now we have the following task: find the total sales across all orders, and we have to provide additional information, like the order ID and the order date. Let's solve it step by step. First I would like to provide the details, so I'm going to select the order ID and the order date from the table sales orders. Next, we work on the aggregation: we need to find the total sales across all orders. Again, since we have both details and aggregations here, we cannot use GROUP BY; we have to use the window function. So we're going to use the function SUM for the sales, and now we have to tell SQL we are working with window functions; that's why we use the OVER clause. The next step is to think about defining the window. Let's check the task: it says total sales across all orders. That means we don't have to partition, or divide, the data set into chunks or partitions; we leave it as it is — the whole data set is one window. That's why we don't use PARTITION BY inside the definition; we leave it empty. Let's now give it a name: the total sales. Let's go and execute this. And now in the results, as you can see, we have all the orders, all the details, and as well we have the total sales across all orders. So with that we have solved the task: we have the total sales and also some details about the orders.

All right, so now let's move to the next task; it's going to be very similar. It says: find the total sales for each product, and we have to provide additional information, like the order ID and the order date. It's a very similar task, but this time we have to divide the entire data set into windows, and that's going to be by the product, since we are saying total sales for each product. So this time we have to divide the data: we define the window like this — PARTITION BY, and we can use the dimension product ID. Let's go and execute this.
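At this point, the query looks roughly like this (a sketch with the same hypothetical names):

    SELECT
        OrderID,
        OrderDate,
        SUM(Sales) OVER (PARTITION BY ProductID) AS TotalSales
    FROM Sales.Orders

Swap the OVER (PARTITION BY ProductID) for an empty OVER () and you are back at the first task, the total across all orders.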
So now you can see that in the total sales we no longer have the total of the whole data set; the values are divided. But in order to understand the results, let's include the product ID in the output: product ID, and execute. Now, looking at the results, you can see that the data is divided into four windows; let's see them. It's by the product ID: this dimension controls the partitions. The first window is the product ID 101, and we have the total sales for this product, 140. The next window is 102, the third one 104, and the last window — it's going to be only one row — the 105, with total sales of 60. So with that we have solved the task: we have the total sales for each product, and as well we have some details.

Now I would like to show you the flexibility of the window function: we can add multiple aggregations at multiple levels. Let me show you what I mean. Let's stay with the same example, but we're going to find the total sales across all orders and, as well, the total sales for each product. What we can do is use window functions at different levels — for example, here, by removing the whole window definition: this gives us the total sales of the entire data set, for the first task, and the next one is the total sales divided by the product ID; let's rename it here to by products. Let's go and execute this. And now, you know what, I'm going to add the sales as well, just to show the flexibility of the window function. Let's add the sales and execute it again.

Now, looking at the results, you can see we have the sales information three times, but with different granularities. The first one is the sales as it is, without any aggregation; it is the highest level of detail, the sales for each order. The next one is the total sales with the window function; here we have the highest level of aggregation, the total sales of all orders. And the last one, the total sales by product, is something in the middle: we are aggregating over a window, and the window is the product ID. So as you can see, we have different granularities of aggregation, and this is exactly the flexibility that we have with window functions: we can do all of that in one query.

Okay, so now let's keep moving and add things to our task. It says: find the total sales for each combination of the products and the order status. This time we have to divide the data not only by the product, but also by another dimension, the order status. Let's see how we can do that. I'm going to show the dimension order status in the results, and we're going to add the following: SUM sales OVER, since it's a window function, and let's now define the window: PARTITION BY — we have again the product ID, but not only this dimension; the order status as well — and let's call it sales by products and status. Let me just rename those. Okay, let's go and execute.

All right, so now let's check the results; it is the last aggregation over here. As you can see, this aggregation has a different granularity than the previous ones, with more details: this time we are splitting the data by two dimensions. The first window is the product ID 101 with the order status delivered — only those two rows — so the total sales of this window is 10 + 20, and we have 30.
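For orientation, here is the full multi-level query we are walking through (a sketch; same hypothetical names):

    SELECT
        OrderID,
        OrderDate,
        ProductID,
        OrderStatus,
        Sales,
        SUM(Sales) OVER () AS TotalSales,
        SUM(Sales) OVER (PARTITION BY ProductID) AS TotalSalesByProducts,
        SUM(Sales) OVER (PARTITION BY ProductID, OrderStatus) AS SalesByProductsAndStatus
    FROM Sales.Orders

Three aggregations, three granularities, one query. Now let's continue checking the windows.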
The next window is the same product but with a different status: the 101 shipped. We summarize those two values, and we get 110. The next product and order status combination is the 102 delivered, and we have it only once, so it stays the same value. The next partition, or window, is two rows: the 102 with the shipped — those two values, 60 + 15, give us 75. So as you can see, the product ID and the order status together control how many windows we get: here we get around six windows; with the product ID alone we got only four windows; and without anything inside the OVER clause, we get only one window. So this is how the PARTITION BY works.

All right, so that was the first part of the window definition within the OVER clause. Let's move to the next part: the ORDER BY. For example, we can use ORDER BY order date — it's just a field. The order clause is very important in order to sort your data within a window, and it matters for many functions: checking the overview over here, for the aggregate functions it is optional, so you can leave it out or add it; but for the ranking functions, and as well for the value functions, it is a must. If you want to use those functions, you must use the order clause — it makes no sense, for example, to rank the data without sorting it first.

Okay guys, so now back to our very simple example, and we have the following query. The function this time is RANK — we have to rank the data — and the definition of the window is PARTITION BY month, so we divide the data by the months; and the second part is ORDER BY sales descending: we sort each window in descending order, which means we start with the highest value and end with the lowest. Let's see how SQL executes this. First, PARTITION BY month: it divides the data into two partitions, because we have two values in the month. Let's see how this looks: one window for January and another window for February.

Now SQL goes to the second part and executes ORDER BY sales descending. What happens? SQL goes to each window separately and sorts the data from the highest to the lowest, without checking the other window. So among these three values, the highest one is this one, so it goes on top — let me just sort it — this one is the lowest, and this one goes in the middle. SQL sorts this window separately from the next one, and once it's done, it goes to the second window: the highest value is this one, this one is the lowest. Let me just do it like this. So SQL sorts it like this: the highest one is 70, the next one is 40, and the last one is 5. With that, SQL is done with the definition of the window: it's split by the month, and each window is sorted by the sales. The next step is to rank those values, and it's really simple: in the output, the first one gets rank one, the next one two, and the third one three. So as you can see, SQL is sorting and ranking only within this window, and then it repeats the same steps for the second window; each rank is separate from the others.
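A sketch of that conceptual example — here MonthlySales is a hypothetical table with Month and Sales columns, just to illustrate the syntax:

    SELECT
        Month,
        Sales,
        RANK() OVER (PARTITION BY Month ORDER BY Sales DESC) AS SalesRank
    FROM MonthlySales

Each month is ranked on its own: the highest sale in January gets rank 1, and so does the highest sale in February.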
So as you can see, it's very simple — this is how SQL executes PARTITION BY together with ORDER BY for the rank function. All right, so now let's have a quick task for the ORDER BY. It says: rank each order based on its sales, from the highest to the lowest, and provide additional information like the order ID and order date. So let's see how we write the query. We have the basic stuff: order ID, order date and sales. Now we're going to rank the data using a window function. We use the function RANK, then we tell SQL this is a window function with OVER, and inside it we provide the definition of the window. By checking the task, you can see that we don't have to divide the data, so we don't need PARTITION BY; we just need RANK, and with RANK the ORDER BY is a must. So we use ORDER BY, the field is the sales, from the highest to the lowest. Let's call it "rank sales" and execute. As you can see, our results are sorted from the highest to the lowest: the sales of 90 at the top, and the lowest is 10. And we have a rank as well: the top rank is one, and the lowest rank is ten. So as you can see, we just quickly created a rank in SQL — it's very simple. The whole table is one window, since we are not using PARTITION BY. And of course, if you want ascending order, from the lowest to the highest, you can just remove the DESC, because the default is ascending. Execute the query, and now the orders are sorted the other way around: we start with the lowest and end with the highest. And of course you get the same result if you go over here and add ASC explicitly — if you execute, you see we get exactly the same results. So this is how you use the ORDER BY inside the window definition.
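For reference, the query we just built looks roughly like this — again assuming the course's Sales.Orders table:

SELECT
    OrderID,
    OrderDate,
    Sales,
    -- the whole table is one window, sorted from highest to lowest sales
    RANK() OVER (ORDER BY Sales DESC) AS RankSales
FROM Sales.Orders;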
Okay guys, so with that we have covered the second part of the window definition. Now we move to the last and most advanced part of the window. We have expressions like ROWS UNBOUNDED PRECEDING — we call this the frame clause, or window frame. What we are doing here is defining a subset of rows within each window that is relevant for the calculation. It's totally understandable if this is confusing or complex at the start — it was for me as well. So we're going to deep dive into the concept step by step in order to understand how it works. Don't worry about it. All right, so now let's understand what is going on with the frame clause, from the basics. If you do aggregations and you don't use a window function, SQL considers the entire data, all rows in the table. What we can do is divide the data into windows using PARTITION BY — for example, here we have window one and window two. Now if you aggregate, all the rows in window one are aggregated, then SQL goes to window two and aggregates all its rows. But what we can also do in SQL is say: you know what, I don't want all rows inside the window; I want a subset of rows inside the window. So we still have those two windows, but we specify a scope, a subset of data from each window, to be involved in the aggregations. And of course not only aggregations — we can do ranking and other stuff too; I mean calculations in general. So it's like we have a window inside a window: we are defining a scope of rows, where not all rows are involved in the calculation, only a specific subset of data. And we can do that using the frame clause. So again: the PARTITION BY divides the entire data set into multiple windows. And now for the frame clause: if you don't want to consider all the rows within each window in the calculation, and you want to focus on only a subset of data within each window, then you use the frame clause. All right, so now let's understand the syntax of the frame clause. Take the following example: the window function is the average of sales, and then we define the window — first PARTITION BY category, then ORDER BY order date, and then the frame clause, for example ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING. The first keyword is the frame type, and we have two types: ROWS and RANGE. Then we have the BETWEEN ... AND. The first part is the frame boundary with the lower value, and it accepts three kinds of keywords: the CURRENT ROW, a number of PRECEDING rows, or UNBOUNDED PRECEDING. Then we have the second frame boundary, the higher value, and it accepts the following: the CURRENT ROW, a number of FOLLOWING rows, or UNBOUNDED FOLLOWING. So as you can see, we are defining a boundary, a range from a low value to a higher value. Now we have some rules. First, we cannot use the frame clause without ORDER BY — the ORDER BY must exist in the window definition in order to use a frame clause. The second rule says the lower boundary must come before the higher boundary: we always start with the lower boundary and end with the higher boundary; you cannot switch that. Okay, so now we have a very simple example. We have the month and the sales, and the following query: SUM of sales — this is the window function — and the definition of the window is ORDER BY month; we are not using PARTITION BY, just to make our life easier. The frame clause is defined like this: ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING. Now let's see how SQL executes this. The first definition is ORDER BY month — as you can see, the months are sorted already. Now SQL works with the frame definition: current row and the two following. SQL processes this row by row. It starts with the first row, which is our current row, as written in the SQL, and we say the range extends up to two following rows — that's February and March. So the pointer for the two following is over here, and with this we have the frame boundaries: SQL has the following scope for the first row — three rows — and the sum of those three rows is 70. So we get 70 for the first row, because the scope is not all rows but only this subset of data. Okay, with that SQL is done with the first row, and it jumps to the second row. The current-row pointer is now at February, and the two following is at April. So as you can see, we are sliding down through the window, and with that we have a new scope, a new subset, and the sum of all those values is 45. So that's it — I think you get it already. It goes to the next one.
The pointer moves to March, and the two following lands on June — it slides down like this. We have those three rows in scope, and their sum is 105. Now things get interesting for the next row: the current-row pointer is at April, but the two following would fall past the end of the table. So as we slide down, the scope — the subset of the frame — is only two rows, and the output is 75. And finally, for the last row, it is the current row and we have only one row in the subset, because the two following is completely outside the table, so we get the same value as the sum. So as you can see, that's it — very simple, right? We use the frame in order to scope which rows are involved in the calculations. All you have to do is define the boundaries of the frame: the lower and the upper boundary. Let's see what other options we have with frames. Okay, so here we have the same example, but we redefine the boundaries of the frame like this: ROWS BETWEEN CURRENT ROW — this is the first boundary — AND UNBOUNDED FOLLOWING. This means we are always targeting the last record in the window, or in the table. The unbounded following is always static, and in this example it is pointing to June. SQL still goes row by row, and the current row starts at January, then February. Let's take the example where the pointer is on February: the subset, the frame, is those four rows — February, March, April, June — and the total aggregation of that is 115. Previously it was more flexible — it was two following — but this time we have unbounded following, which means the upper boundary is always the last row. So as we move down through the records, the frame gets smaller and smaller, and for the last row both boundaries sit on the same record: the current row is also the unbounded following. Okay, let's see the next one. The definition of the window is ROWS BETWEEN 1 PRECEDING AND CURRENT ROW. So here it's the other way around: the 1 preceding is lower than the current row. Let's see how SQL executes this. Say we are currently at March — this is the current row — and we say between 1 preceding, which means one row before the current row. So the frame looks like this, we have only two rows, and the value is the sum of those two rows: 40. That means we are always targeting the rows before the current row. Okay, so now let's keep going with the other options in order to understand everything about the frame. We redefine it like this: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. The unbounded preceding is the first row in the table, or in the window, so it is static: it's the first one, January. And let's say the current row is at March. Then the window, the subset, looks like this — those three rows — and their total is 60. As SQL proceeds to the next row, it keeps the first boundary fixed, always pointing to January, and the subset gets a little bit bigger each time, until we reach the last row.
And with that, the subset becomes the whole set of rows. So we get really great flexibility in how to define the subset and how the subset shifts through the window. Okay, so now we are just having fun, playing around with the boundaries. We don't always have to use the current row. For example, we can use this definition: ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING. So we don't include the current row in the boundaries at all. Let's say again our current row is March: the 1 preceding is February and the 1 following is April, so our frame is those three rows — let me mark it like this — and the aggregation of this is 45. So as you can see, the boundaries are 1 preceding and 1 following; it doesn't always have to be the current row. All right, now I think you already get it. What is the last option? We take everything: the definition of the frame is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. What do we have here? The unbounded preceding is January and the unbounded following is June, so the frame is everything, all the rows. And it doesn't matter where the current row is — it is always a fixed subset, always everything. Whether we are over here, or at February, or at March, we consider all rows, and the total sales is 135. So we get exactly the same result everywhere, for all rows. So with that, I think it's not that complicated, right? We just have to provide the boundaries, and then the calculation depends on the frame, on the subset of data. Okay guys, so now let's go back to SQL and start practicing in order to understand how frames work. Let's define a window like this: SUM of sales, and the window definition as follows — we divide the data by order status, and let's say we sort it by order date. And let's define a frame like this: ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING. Let's give it the name "total sales" and execute. Now let's look at the data. You see that SQL divides our results into two sections, two windows: delivered and shipped. And you can see the data is sorted by the order date — for example, in the status delivered we can see the 1st of January with 10, and so on. And then the third part: we defined a frame within each window. For example, let's take the first row: this is the current row, and we said the frame is between the current row and the two following orders, so the scope is 10 + 20 + 25 = 55. And what is interesting to check here as well is the last record of each window. Let's take this window over here; the last record is number seven. Say this order is the current record: we set the frame between the current record and the two following, but since it is the last record of this window, SQL will not go and consider the next two orders, because those two orders are outside of the window. That's why we have 30 here: SQL does not summarize across the boundary — there is nothing after it within the window — so we get 30. So as you can see, the frame is calculated within one window; it will not consider anything outside of the window.
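For reference, here's roughly the query we just ran — a sketch assuming the course's Sales.Orders table:

SELECT
    OrderID,
    OrderDate,
    OrderStatus,
    Sales,
    SUM(Sales) OVER (
        PARTITION BY OrderStatus                   -- one window per status
        ORDER BY OrderDate                         -- required before using a frame
        ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING   -- current order plus the next two
    ) AS TotalSales
FROM Sales.Orders;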
So this is how the frame works within partitions. Now I would like to show you a few more things about frames. We can use shortcuts, but only with PRECEDING. For example, let's change the definition to 2 PRECEDING AND CURRENT ROW. Execute it, and we get these results. If you want to check them quickly, take for example this order over here: we are always summarizing the values of the two previous orders together with the current one, so those three orders are involved in the frame and the output is 55. Now, there is a shortcut in SQL, but only for PRECEDING, where we can remove part of the range. We can remove the BETWEEN ... AND CURRENT ROW and leave it like this: ROWS 2 PRECEDING. If you execute it, we get the exact same results. So this is a quick way — a shortcut — to define a frame, but it only works with PRECEDING. For example, if I go over here and write UNBOUNDED, it works: we get the results between the unbounded preceding and the current row. But if you go over here and say, you know what, let's have UNBOUNDED FOLLOWING, SQL says there's an error. And the same thing if you remove the UNBOUNDED and write, say, 1 FOLLOWING — SQL will not like it. So you can use the shortcut only with PRECEDING. And one last thing about frames: there is a default frame. If you don't define any frame but you do use ORDER BY, what happens? SQL uses a default frame. If you check the result, you will notice that for this window over here, those values are not simply the totals of all the sales — there is a hidden frame. The default frame in SQL is this: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. So if you go and write that explicitly and execute, you will see we get the exact same results. So be careful: once you use ORDER BY with the aggregate functions, there will be a hidden, default frame between the unbounded preceding and the current row. That means there are three ways to express this scenario, the frame between unbounded preceding and current row: either write it out in full, or use the shortcut — let me just execute it, we get the same result — or just remove the frame completely; we get the same results as well. Now again, the hidden, default frame only kicks in with ORDER BY. If you go over here and remove the ORDER BY, let's see the results: the whole window is aggregated. Let me just select it so you can see — SQL considers all the rows in the aggregation, and we get the total sales for the whole window. So no frame is applied; it is only present once you use ORDER BY.
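So, as a sketch, these three queries should return identical results here (again assuming Sales.Orders):

-- 1) the frame written out in full
SELECT OrderStatus, OrderDate, Sales,
       SUM(Sales) OVER (PARTITION BY OrderStatus ORDER BY OrderDate
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS TotalSales
FROM Sales.Orders;

-- 2) the shortcut (works only with PRECEDING)
SELECT OrderStatus, OrderDate, Sales,
       SUM(Sales) OVER (PARTITION BY OrderStatus ORDER BY OrderDate
                        ROWS UNBOUNDED PRECEDING) AS TotalSales
FROM Sales.Orders;

-- 3) no frame at all: ORDER BY triggers the default frame
SELECT OrderStatus, OrderDate, Sales,
       SUM(Sales) OVER (PARTITION BY OrderStatus ORDER BY OrderDate) AS TotalSales
FROM Sales.Orders;

One small caveat worth knowing: strictly speaking, the standard default frame with ORDER BY is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which behaves like the ROWS version only as long as there are no ties in the ORDER BY values.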
All right friends, so with the frame clause we have now covered all the components of how to define a window inside an OVER clause, and with that we have covered everything about the syntax of window functions. Okay guys, so now we're going to understand the rules — let's say the limitations — of window functions. Let's learn what you are not allowed to do while using them. The first rule: you are allowed to use a window function only in the SELECT clause and in the ORDER BY clause. Here we have again the same example, where we find the total sales by order status. As you can see, we used the window function in the SELECT clause and we didn't get any error, right? Now we can use it in the ORDER BY as well. So let's write ORDER BY and copy everything except the alias into the ORDER BY. If I execute this, there are no errors — SQL allows it — and as you can see, the result didn't change. So let's sort it descending, for example: I write DESC here and execute. Now we have the total sales from the highest values to the lowest values. Having this rule — that we can use it only in SELECT and ORDER BY — means we cannot use window functions in order to filter data. Let me show you: instead of the ORDER BY, let's add a WHERE clause, where the total sales is, say, bigger than 100. Execute this, and as you can see, SQL says no, you are not allowed to do that; you can do it only in SELECT and ORDER BY. We are not allowed to filter data with a window function in the WHERE clause, and you are not allowed to use it in the GROUP BY either. If I do a GROUP BY — and also remove the condition over here — and execute, you get the same kind of error: you are not allowed to use the window function in the GROUP BY. Only in the SELECT or the ORDER BY clause. Okay, now to the second rule: you cannot use window functions inside another window function. That means you cannot nest window functions. Let me show you what I mean. Let's remove the GROUP BY — now everything should be working — and let's copy the whole window function over here and nest it: instead of sales, we now have a window function inside another window function. As you can see, this is the inner window function, and the rest, the outside, is the outer window function. If I execute this, you will see that SQL tells us: you cannot use a window function in the context of another window function. So we cannot do nesting with window functions — another limitation of these functions. All right, moving to the third rule — or let's call it an info: the window function is executed after filtering the data with the WHERE clause. Let's have an example. Let's say I would like the same information, the total sales for each status, but only for two products, 101 and 102. Let's do that: we use the WHERE clause, and then product ID IN, and we specify 101 and 102. Execute. Now you can see we still have two partitions — one for delivered and one for shipped — but the total sales is reduced, because we are only focusing on two products and we filtered the whole data set. So how does SQL work? First the WHERE clause is executed, and then the window functions are calculated. That means: first filtering, then aggregation.
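As a sketch, the rule-three query looks something like this (assuming Sales.Orders):

SELECT
    OrderID,
    OrderStatus,
    SUM(Sales) OVER (PARTITION BY OrderStatus) AS TotalSales
FROM Sales.Orders
WHERE ProductID IN (101, 102);  -- the WHERE filter runs first, the window runs afterwards
-- note: writing WHERE TotalSales > 100 here would fail; to filter on a
-- window function you need a subquery, as we'll see later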
Okay guys, now we move to the last rule, the most interesting one, and it says the following: you are allowed to use a window function together with the GROUP BY clause only if you use the same columns. Let me explain what I mean — but first, some coffee. We have the following task, and it says: rank the customers based on their total sales. Now, it sounds really easy, but if you check it, you have here two calculations: the first one, you have to rank the customers, and the second calculation is an aggregation — you have to find the total sales for each customer. Okay, so now I'm going to show you step by step how I usually solve these tasks. First, let's look at the total sales. It is an aggregation, right? So we can use the SUM function, and this function is available both with the GROUP BY and as a window function. For now I'm going to go with the GROUP BY, because the task is very simple — we don't have to show any other details, it's all about aggregations — so why not use the GROUP BY. Now to the first part, where we have to rank the customers: we cannot use the RANK function with the GROUP BY, right? GROUP BY works only with aggregate functions. So here we are forced to use a window function. That means: for the rank I'm going to use a window function, and for the total sales I'm going to use a GROUP BY. Now let's do it step by step. First we find the total sales for each customer using GROUP BY — very simple. I'm just going to remove all that stuff from our SELECT statement; we need the customer ID, and we don't need a window function over here, and after the FROM we have GROUP BY customer ID. So now I'm just grouping the customers and finding the sum of all sales. Execute. As you can see in the results, we have four customers — that's why we have four rows — and we have the total sales as well. So let's say half of the task is already solved, right? What's missing is the rank, so let's build that. In the second step we use the RANK function, and we can define a window for it: OVER, and inside it we will not partition the data at all, because it's already grouped up. What we do put inside is the ORDER BY — the RANK function always needs an ORDER BY; don't worry about it, we can talk about it later. Now, we are ranking the data based on the total sales, that means the SUM of sales. So let's just copy this and put it after the ORDER BY. And now we have to decide: ascending or descending? It's going to be descending — the highest sales first, then the lowest. So now, as you can see, we have a "rank customers" column, and we have a window function together with the GROUP BY. Let's execute and see whether SQL allows it. Run it — and as you can see, SQL runs it, and we get a rank for each customer. Customer 3 has the highest total sales, then customer number 1, and the last one is customer number 2 with the lowest total sales. All right, so we solved the task: we have ranked the customers based on their total sales. As you can see, SQL allows you to use a window function together with the GROUP BY, but with one rule: anything that you use inside the window function must be part of the GROUP BY query. For example, we fulfilled the rule because we are using the SUM of sales, and the SUM of sales is part of the GROUP BY query, right? Now, if I break the rule by not using the SUM — just using the plain sales — SQL will not allow it, because the plain sales is not part of the GROUP BY. So as you can see, SQL is very strict with this: if you want to do everything in one query, without subqueries and so on, you have to use the exact same columns. For example, if I go over here and use the customer ID instead of the sales, then since the customer ID is part of the GROUP BY, SQL allows it.
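For reference, the working query combining GROUP BY and a window function would look roughly like this (assuming Sales.Orders):

SELECT
    CustomerID,
    SUM(Sales) AS TotalSales,
    -- allowed: SUM(Sales) inside the window is an expression of the GROUP BY query;
    -- using the plain Sales column here instead would raise an error
    RANK() OVER (ORDER BY SUM(Sales) DESC) AS RankCustomers
FROM Sales.Orders
GROUP BY CustomerID;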
So be careful when using window functions together with GROUP BY: as long as you use the same columns, nothing will go wrong, and SQL will allow it. Okay, so now I'm just going to fix this and run it. As you can see, it's really easy if you follow these steps: first build the query using the GROUP BY — don't think about the window function yet, just build the GROUP BY — and then, in the last step, define and build the window function. With that you can solve really nice analytical use cases with one simple query, without having to build subqueries and so on: you can use GROUP BY together with window functions. All right guys, so those are the four rules for SQL window functions. All right friends, so now let's have a quick recap of SQL window functions. Let's start with the definition: a window function performs calculations — like aggregations — on top of a subset of data, without losing the level of detail. That means we can do aggregations and at the same time keep the details. Now, of course, there is a lot of similarity between window functions and the GROUP BY, but the main difference is that window functions are far more powerful and dynamic compared to the GROUP BY; we have way more functions. So if you are doing data analysis and you have an advanced use case, go and use window functions — they are more suitable for complex, advanced data analysis. On the other hand, if you have a simple question, a simple analysis, then you can use the aggregate functions with the GROUP BY. And of course you can use them in the same query, in the same SELECT: you can mix the GROUP BY together with window functions, with only one rule — you have to use the same columns. And the order is: first do the GROUP BY, and then, later, the window function in the same query. Now to the next point, the window components. We have two main components: the first one is the window function, and the second part is the window definition, using the OVER clause. Inside the OVER clause we can define three things: if you want to divide the data to create windows, you use the PARTITION BY; in the second section we have the ORDER BY, in order to sort your data; and in the last part you can specify a subset of data — a frame — within each window. Now let's move to the last part: the rules for SQL window functions. The first thing: if you have two or multiple window functions, you cannot nest them together; you have to use multiple subqueries. The next point: you can use a window function only in the SELECT and ORDER BY clauses — so, for example, you cannot use the window function together with the WHERE clause in order to filter the data. And talking about filtering data: how does SQL execute the window function? Always after SQL filters the data. All right, so those are the basics of SQL window functions. So with that we have learned the basics about window functions in SQL, and next we're going to start talking about the functions themselves. The first group is the window aggregate functions, and here we're going to learn how to summarize our data for a specific group of rows. So let's go. Okay guys, let's say that in our data we have the following information: we have the months and the sales.
Now, if you apply any aggregate function in SQL, what happens? SQL goes through all the rows of the window — or the entire data — and starts aggregating, so in the result, in the output, SQL gives you one single aggregated value. SQL summarizes all those values, and in the output you find, for example, the total sales of 175 — or you can use the average, or count the data, and so on. So the aggregate functions deliver, at the end, one aggregated value per window, or for the entire data. Okay, so now let's have a quick overview of the syntax of all the aggregate functions; most of them follow the same rules. First, as usual, we have to define the function name — in this example we have the average. Then, to the next part: inside it we have to define an expression; we cannot leave it empty. Here we are using the sales. And the second rule, for all functions besides the COUNT: the data type of this field should be a number. This of course makes sense, right? We cannot find the average of the first names of customers or something like that; we have to provide a number. Then, next, we define the window: we have the PARTITION BY, and it is optional — you can use it or leave it, it depends. Then we have the ORDER BY; it is optional as well, not a must or required. That means the whole window definition can be empty for the aggregate functions. Let's have a look at all the functions: we have COUNT, SUM, AVG, MIN, MAX. And as you can see, only the COUNT accepts all data types as an expression, or argument; all the others require a number as the data type. And for all these functions the PARTITION BY is optional, the same for the ORDER BY and the frame — everything is optional here. So what we're going to do now: we'll deep dive into each of those functions in order to understand how they work and what the use cases are, and of course we're going to practice in SQL. We start with the first one, the function COUNT. Okay, so what is the COUNT function? It's really simple: it returns the number of rows within each window. So it helps you understand how many rows you have within each subset of data. Now let's go and understand how SQL works with this function. All right guys, so we have again this very simple example for the orders, with the following information: we have the products and the sales. And now we want to solve a very simple task: how many orders do we have for each product? In order to solve it, we can use the function COUNT, like the following: COUNT, and then we pass it an argument, or expression — the star. With that we are telling SQL: go and count how many rows we have in our table. But we have a window definition like this: OVER, PARTITION BY product. So what does SQL do? It divides the data set into two partitions: one partition for the caps and another one for the gloves. With that we have prepared our data into windows and we are ready to do aggregations. So how many rows do we have within each window? It's three: for this window it's three rows, and for the next window we also have three rows. So we get three, three and three. It's very simple, right guys? We are just finding the number of rows within each window.
But now, with the aggregate functions, we have to be very careful with the null values. For the COUNT star, as you can see over here, we are not specifying anything about the sales; we are just saying: find me the number of rows. That means SQL will simply count a null as one row. So if we use the star as the argument of the COUNT function, the nulls don't affect anything — whether we have nulls or not, we are just counting how many rows we have in our data. But in some scenarios, we should be ignoring the nulls in our count. For example, let's say I would like to count how many sales we have within each product — meaning that if we have nulls, they should not be counted. In order to achieve this task, instead of a star over here we pass the field sales. With this we are telling SQL: don't just blindly count how many rows there are within each window; be careful with the values, and find how many sales we have within each window. Now let's see what happens. For the first window we have three sales, three values, so the count matches the number of rows. But for the next one — how many sales do we have? We have two: this sale and the 70. The last one is null, so it will not be counted; it will be ignored. That's why we get the value two in the output: we have two sales. So as you can see, the result did change, and we are now sensitive to the null values. So be careful what you specify inside the COUNT: if you use a column name like this, it ignores the nulls; but if you use a star, it just counts how many rows there are within each partition. Okay, so now if you compare the results side by side, you can see: if you specify a column within the COUNT function, it is sensitive to the nulls — it ignores them and does not use them in the aggregation; that's why we have here only two. But if you use the star within the COUNT function, what happens? SQL just counts, and we find the number of rows we have inside our table. And there is one more way to do the same thing as here on the left side: instead of the star, you can use one. You might find somewhere that people use COUNT(1) with the same window definition, and we get exactly the same result: the nulls are counted and not ignored. So now you might ask me: which one should I use, the one or the star? Well, I would say it doesn't matter — we are getting the same results — and if you are thinking about performance, I hardly find any difference between them, so you can try both and stick with the one that gives you better performance. Now, we have a special case for the COUNT function compared to all other aggregate functions: it allows any data type. That means we can use numbers, characters, dates and so on. So we can specify something like the product for the COUNT instead of the sales. If we go over here and say product, it counts how many rows we have for the product — it's three over here — and since we don't have any nulls there, it counts them all. So we have three rows. And be careful here: we are not counting the unique values; we are just counting the rows that we have inside our data.
So these will not be collapsed into one distinct value — we have the caps three times, and that's why we have three here. Okay. So now we have this very simple task: find the total number of orders. It's a very simple task — find how many rows, how many records, we have inside the table orders. Let's go and solve it. Start by selecting just star from the table orders, without anything else, like this. As you can see, we have 10 orders — very simple, very easy as well. But now let's say you have thousands or millions of rows; you cannot do it like this, just by eyeballing the rows. What do you do? We use the function COUNT. So we go over here and say COUNT star, then give it the name "total orders", and execute. Now, as you can see, we got only one record, one value — we don't see any other details — and it says 10 orders. So this is the total number of orders. This is very helpful in order to understand the content of your data. We call this overall analysis — or let's say, getting the big numbers about your business. For example: how many orders do we have, how many customers, products, employees, and so on. Having those big numbers helps us track our business and understand how well we are doing with the orders, the customers, and so on. This is the basics of reporting. Now let's extend our task by saying: provide details such as the order ID and the order date. Let's do that: SELECT order ID, order date. And now, of course, we cannot leave the plain COUNT like this — let me just execute it; we get an error, because we have different levels of detail in our SELECT. In order to solve this, we use the OVER clause, and with that we are telling SQL: this is a window function. Now execute — and with that, you can see, we have solved the task: we have details, the order ID and the order date — this is the highest level of detail, since we have the order ID — and at the same time we have the highest level of aggregation, the total number of orders in the entire table. So now let's keep going and add more to our task. Let's say we want to find the total number of orders, but for each customer. That means this time we have to divide our data by the customers. Let's do that — again using a window function: COUNT star, OVER, then we divide the data using PARTITION BY, and the field is the customer ID. Let's call it "orders by customers", and I would like to see the customer information in the query as well, so I'm going to add it. All right, that's all. Execute. Now, as we learned before, SQL first divides the data: we have four customers, so we get four windows. The first window is for customer ID number 1, and as you can see we have three rows — that's why we have three orders here. The same for customer 2: three orders. Customer 3: three orders. Only the last customer, customer ID number 4, has just one row and one order.
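For reference, the query at this point would look roughly like this — a sketch assuming the course's Sales.Orders table and its CustomerID column:

SELECT
    OrderID,
    OrderDate,
    CustomerID,
    COUNT(*) OVER () AS TotalOrders,                               -- one big number for the whole table
    COUNT(*) OVER (PARTITION BY CustomerID) AS OrdersByCustomers   -- one count per customer
FROM Sales.Orders;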
So now, if you look at the total orders and the orders by customers, you can see we are no longer doing the overall analysis — we are doing a comparison between different categories, and in this example the category is the customers. With that we can also understand the behavior of our customers: you can see that three customers have exactly the same number of orders, so they are very similar, but we have one extreme, the customer ID number 4. This customer has only one order — the only customer whose behavior differs from all the others. So you see, with a very simple query we can now analyze our business and understand the behavior of our customers. If you divide the data with PARTITION BY and use COUNT, you can go and compare things. All right, so now let's keep moving. Next we're going to look at the special cases we have with the function COUNT. We have this very simple task: find the total number of customers, and additionally provide all the customer details. I think it's very easy to solve. What we're going to do is select star, since we need all the details, from Sales.Customers. Let's just have a look: we have five customers. And the function is COUNT star OVER — we don't have to divide the data, since we need the total number of customers for the entire table — and it's called "total customers". Nothing new; that's it, we have five customers. And as we learned before, if you pass the star to the COUNT function, you are telling SQL: just go and count how many rows there are inside the table customers. So SQL just starts counting and says: we have five customers, five rows. It doesn't matter whether we have nulls inside our data, like in the last name or the score — it just counts the number of rows. So now let's say we have the following task: find the total number of scores for the customers. What we need with this task is to find out how many scores there are inside our data. As you can see, we have four scores, but the last customer doesn't have any score — we have it as a null. So the result should be four. We cannot use the star for this, because we would get five; we have to go and count the scores. Let's see how we do that. We use COUNT again, but this time with the score, and the definition of the window stays empty — "total scores" — and execute. Now we can see in the results that we got four scores, which is correct, because SQL ignored the null: SQL is now focusing only on one column, and looking at those values, the nulls are not counted. This is really great in order to check the quality of your data. Let's say you are not expecting any nulls inside your data. Instead of going manually through all the records, what you can do is find the total number of customers like this, then count the total number of scores, and you can see there is a difference. So just by checking those two numbers I can say: you know what, we have one null — without checking every record in our data. With that we can check the quality of our data and understand very quickly how many nulls we have in the field score.
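A minimal sketch of this data quality check, assuming the course's Sales.Customers table with a Score column:

SELECT
    *,
    COUNT(*) OVER ()     AS TotalCustomers,  -- counts every row, nulls included
    COUNT(Score) OVER () AS TotalScores      -- ignores rows where Score is null
FROM Sales.Customers;
-- if TotalScores is lower than TotalCustomers, the Score column contains nulls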
And you can do the same check for other columns; let me show you. I'm just going to copy this and use, let's say, the first name — or actually, let's go with the country. So, country, "total countries" — and execute. Now, if you check the result, you can see we have five rows with countries. SQL focuses on the countries and does not find any nulls, so here we have complete data — no nulls — because the total number of customers is equal to the total number of values within the country column. And I can immediately say: okay, the data quality of the country column is very good. All right, so now one more thing about the COUNT function that we learned before: we can use either star or one in order to count how many rows we have. Let's just try it. I'm going to duplicate the expression, and instead of a star we use a one; I'll name this one "one" and the other "star". Execute. Now, if you check the output, we get exactly identical results; there is no difference between those two queries. It's up to you — you can try it and check the performance. I usually go with the star instead of the one. Okay. So now we're going to talk about a very important use case for the SQL window function COUNT, one that I frequently use in my real projects. The data we use for data analysis usually has bad data quality, and if we don't find those data quality issues and clean them before doing the analysis, what happens? We deliver bad results, bad analyses, which lead to bad decisions. And one very common data quality issue that you might encounter in your projects, or in your data, is having duplicates. Duplicates are really bad for doing data analysis. So now, in order to discover — let's say identify — the duplicates in our data, we can use the SQL window function COUNT. Let's have some examples. Okay, so now the task says: check whether the table orders contains any duplicate rows. How are we going to do that? By checking the table orders over here, we can see there are many orders — but how do we find the duplicates? Well, the first step is to understand what the primary key of the table orders is. What we usually do is check the data model, if there is one. For example, for this course we have the following data model, and it defines the order ID as the primary key for the orders, and the product ID as the primary key for the products. So for our table, the orders, we have the order ID as the primary key, and it should be unique — it should not contain any duplicates. Now let's go to our data and check the order ID. Just by looking at the data, you can see we don't have any duplicates, right? All of them are unique: we have 1, 2, 3, 4 and so on. But of course, in real projects you cannot do it like this; you have to build a query in order to find out whether the primary key is unique. Now you might say: the primary keys are usually unique, because we can define that in the DDL, in the rules for building the table. Well, that's true — if you have it enforced like that, you don't have to look for duplicates. But usually in data analytics we import a lot of files and a lot of data into an extra database, and we don't build such rules. So in order to check the quality of the primary keys that you get from the source, we can use the COUNT function. Let's go and build it. I'm just going to select the order ID first, as a detail.
And now we do the following: COUNT, then star, and we define the window: PARTITION BY, and the field is the primary key, the order ID — I'm checking the quality of this field; it should not contain any duplicates. And we give it the name "check primary key". Now, my expectation is that the result should be at most one — that means we have one row for each primary key value, and that means it is unique. If we get anything more than one, it means we have duplicates. Let's run the query. And as you can see in the results, we get one for each primary key. That's great: it means we don't have any duplicates inside our data, and the primary key is unique. So the table orders is clean, with no duplicates inside it. Now let's check our database: we have here another table called orders archive. Let's go and check it. First I'm just going to select the data: SELECT star FROM Sales.OrdersArchive. Let's check the results — and here we can see that it has exactly the same structure as the table orders. Now let's check whether its data quality is clean as well. What we're going to do is use exactly the same query as before, but instead of the table orders, we take the orders archive. That's it — execute. Now, by checking the data, you can see that we don't have one everywhere. Sometimes we have two rows for the same primary key, which is really bad: for the order ID 4 we have two orders with the same order ID, and for the order ID 6 we have three orders. That means those rows are duplicates, and they go against our data model. So what else can we do? We can generate a list specifically for the data quality issue, only where we have duplicates — anything with a one we are not interested in. In order to do that, we use a subquery. So: SELECT star FROM, then we use the first query as a subquery, and in our filter we say WHERE the check primary key is higher than one. That means I want only the order IDs where we have duplicates. Execute — and now I have a list with the primary keys that have duplicates: the order ID 4 and the order ID 6. So guys, as you can see, the window function COUNT is wonderful for finding data quality issues like duplicates. All right guys, so those are the four most important use cases of the SQL window function COUNT: the first one, we can use it for overall analysis; or we can use it for category analysis, like the analysis we did on customer behavior; another use case is checking the nulls inside our data; and the last use case is identifying — discovering — the data quality issue of duplicates in our data.
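A sketch of the duplicate check, assuming the course's Sales.OrdersArchive table:

SELECT *
FROM (
    SELECT
        OrderID,
        COUNT(*) OVER (PARTITION BY OrderID) AS CheckPK  -- rows per primary key value
    FROM Sales.OrdersArchive
) t
WHERE CheckPK > 1;  -- anything above 1 means the primary key is duplicated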
So now let's check the next function: we have the SUM. All right, so let's understand what the SUM function is. It's very simple: it returns the sum of all values within each window. Now let's understand how SQL works with this function. This one is very easy, and we are using the same simple example: now we would like to find the total sales for each product. We can define it like this: SUM of sales, since we are finding the total sales, and then we define the window like this: OVER, PARTITION BY product. As we learned, SQL first goes and divides our data into two windows — one window for the caps and another window for the gloves, right? Now, after SQL defines the windows, it starts aggregating the data: the SUM of sales. For the first window we have the three sales, and SQL simply adds up all those values: 20 + 10 + 5, and we get the result 35. So in the output we get 35 everywhere in that window. That's it for the first window, and as you can see, SQL aggregates the data within each window separately: as we are aggregating the data for the caps, nothing from the gloves is checked — they are completely separated. Now it goes to the next window, and here we have two values and a null. Again, the null is simply ignored, so what do we have? 30 + 70, and the total sales is 100. So as you can see, it is very simple, right? 100 and 100. So guys, that's it — really simple. We don't have a lot of special cases here like with the COUNT function; only that it ignores the nulls in the calculation, and the requirement that it allows only numbers. We cannot go and say SUM of the products, since the products are not numbers, they are characters. You can only use numbers with the SUM function. Let's go now and do some tasks and use cases in order to practice in SQL. Find the total sales across all orders, as well as the total sales for each product, and additionally provide some details like the order ID and the order date. Let's do that: SELECT order ID, order date, and let's get the sales as well. Now we have to find the total sales across all orders. That means we use the window function SUM of sales, and the definition of the window stays empty, since we don't have to divide the data. That's it — "total sales" — and we select from the table Sales.Orders. Execute. With that, as you can see, we got all the details we need, and as well the total sales, the sum of all those sales, in one field. So with that we have our overall analysis: one big number for our reporting. We know how much in sales we made in the entire business. Now for the next task: the total sales for each product. I think you already know what we're going to do: SUM of sales, and we do it like this — PARTITION BY product ID. That's it; we call it "sales by product", and with that we are dividing the data by the product. Execute. As you can see, we don't have the product information, so let's add the product ID to the query, just to analyze the results. Now we can see from the data that the winner is product ID 101: we have here the highest sales compared with the other products, and the lowest one is product ID 105. So as you can see, we can use the window function SUM together with PARTITION BY in order to compare things — a comparison between the products — to understand, for example, the performance of the products. It's really great analysis for performance. All right.
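For reference, the combined query would look roughly like this (again assuming Sales.Orders):

SELECT
    OrderID,
    OrderDate,
    ProductID,
    Sales,
    SUM(Sales) OVER () AS TotalSales,                            -- overall total, one big number
    SUM(Sales) OVER (PARTITION BY ProductID) AS SalesByProduct   -- one total per product
FROM Sales.Orders;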
Now we move to a very interesting use case for the aggregate functions — not only for the SUM, but for the others as well: the comparison analysis. Okay, so let's quickly understand the comparison use cases. We compare the current value — for example, say we are currently at the month of March and the sales is 30 — with an aggregated value. For example, with the total sales, using the SUM function. What happens if you compare the current value with the total sales? You are doing an analysis called part-to-whole analysis, which helps us understand how important the sales of this month were compared to the total sales. Or we can compare it to the best month, the highest value — for example, the highest value is June, and we can compare this month with the best month of the year — or to the lowest month of the year. Or we can compare the sales of the current month with the average, in order to understand whether we are above the typical sales or below the average. And this is a very important analysis in order to study and understand the performance of the current data. All right, let's have an example in order to understand the use case: find the percentage contribution of each product's sales to the total sales. Let's solve it step by step. We select the order ID, and let's take the product ID and the sales as well, just like this, from Sales.Orders. Execute. Okay, so now, as you can see in the results, we got the first part of the equation — we have the sales; nothing crazy over here. Now we need the total sales over all the data. So what we do is take the SUM of sales, with an empty window definition — this is the total sales. Execute. Now we have everything for the equation: we have the sales and the total sales as well, and that is enough in order to find the percentage of contribution. The calculation for that is very simple: we divide the sales by the total sales. Let's do that: it's the sales divided by the total sales — so we copy the whole window function over here — and then we multiply by 100. That's it; execute. Now you notice that in the output we got zeros. This is because of the data type. If we go to our table over here on the left side, you can see the sales column has the data type integer, and if you divide integers, you will not get a float or a decimal number — you have to change the data type. So what we do is change the data type for one of them; it's enough to do it for the sales over here. We use the following expression: CAST sales AS FLOAT. That's it — I'm just converting the integer to a float. Let me give it a name: "percentage of total". Execute. Now in the output you can see we got the percentage of the total — or let's say the percentage of contribution. So what we do next is round those numbers, because we have a lot of decimals. In order to do that, we use the ROUND function like this, with two decimals, and execute.
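Putting the pieces together, the part-to-whole query would look roughly like this (a sketch, assuming Sales.Orders with an integer Sales column):

SELECT
    OrderID,
    ProductID,
    Sales,
    SUM(Sales) OVER () AS TotalSales,
    -- cast one side to FLOAT so the division is not integer division
    ROUND(CAST(Sales AS FLOAT) / SUM(Sales) OVER () * 100, 2) AS PercentageOfTotal
FROM Sales.Orders;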
So now, as you can see, it is much easier to read, because we have only two decimals, and we can immediately find the order that is the highest contributor to the total. This is what we call part-to-whole analysis, where we find the percentage of the total. It is a very common analysis in order to understand the performance of each order compared to the total. So this is an example of how the window function helps us compare the current value with an aggregated value. All right everyone, that's all for the window function SUM. Next we're going to talk about the average function. All right, so let's understand what the average function is. As the name says, it finds the average of the values within each window. Now let's understand how SQL works with the average. All right, so back to our very simple example, and the task says: find the average sales for each product. It's really easy: we use AVG, pass it the column sales, and define the window like this: PARTITION BY product. The first thing SQL does is define the windows: it divides our data into two partitions, one for the caps and one for the gloves. And I hope everyone knows how to calculate an average: you summarize all the values and divide by the number of rows. So SQL adds up 20 + 10 + 5 and divides by three rows, and the output is around 11.7 — and we get it for each row. As you can see, SQL just ignored everything in the next window; we are focusing only on the caps. Now it goes to the second window and starts the same aggregation. But here we have the special case of the null: the null is ignored in the calculation, so we have it like this — 30 + 70, we are including only two rows, so it is divided by two, and the average is 50. We get the result 50 for each row, and we are completely ignoring the nulls. But now, we might be in a scenario where your users understand the business like this: if there is a null in the sales, it means zero — there were no sales, so it is actually a zero, but we store it in the database as a null. That means the average we have provided is not really correct: we should have divided by three. So first we have to handle the nulls, before doing the aggregations, before finding the average. Now, we're going to have a whole chapter on how to handle nulls in SQL and what the different functions are, but for now we're going to use the function COALESCE. Okay, so what we're going to do: we will not use the sales as it is; first we handle the nulls. That means we use COALESCE on the sales and replace the nulls with zeros. As you can see, we are not using the sales directly — we handle it first, and then we find the average. So SQL goes over the data, and if it finds any null, it replaces it with a zero, which then has an effect on our average over here: it becomes 30 + 70 + 0, and now we have three rows. So instead of dividing by two, it divides by three, and the result is roughly 33. That means we get 33 in the output for each row, and with that we are now fulfilling the expectation of the business: if there is a null, it is handled as a zero, and the result is more accurate.
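As a sketch, the two variants side by side would look like this — ProductSales here is just a stand-in name for the illustrative table from the slides, not part of the course database:

SELECT
    Product,
    Sales,
    AVG(Sales) OVER (PARTITION BY Product) AS AvgSales,                  -- nulls ignored by AVG
    AVG(COALESCE(Sales, 0)) OVER (PARTITION BY Product) AS AvgSalesZero  -- nulls counted as zero
FROM ProductSales;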
You see, it's very tricky. If you are doing data analysis and aggregations, be very careful with the NULLs: understand them, understand what they mean for the business, and handle them correctly in order to get correct results in your analysis. So now let's go back and practice SQL with some tasks and use cases.

Okay, let's start with the basics. We have the following task: find the average sales across all orders, and as well find the average sales for each product, and don't forget the details. Let's solve it step by step. SELECT the order ID, the order date, and the sales as well, and then let's find the average sales. It's going to be a window function with the sales inside it, the usual stuff, and the window definition stays empty. We call it average sales, and the table is Sales.Orders. That's it, let's execute it. Oh, we have to select the whole query first, of course. What SQL did in the output: it summed all those values and divided by 10, and with that we have the average sales of 38. Very easy. This is again what we call an overall analysis. Let's move to the next one: find the average sales for each product. Again we build the window function, AVG(Sales) OVER, and we divide the data by the product ID; we call it average sales by products, and we add the product ID to the query. Let's execute it. And we missed something here: the PARTITION BY. Let's execute again. With that we have the following data. SQL divides the data, so for example for this product we have those four orders, and it sums the four values and divides by four; that's why we have 35 here. The same thing for the next product: it divides by three. And the last one is divided by one, which is why we have 60. So as you can see, the aggregation is done separately for each window, and this is a very nice way to compare the averages between the different products.
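Both steps together look roughly like this as one query (a sketch; names assumed from the walkthrough):

    SELECT
        OrderID,
        OrderDate,
        ProductID,
        Sales,
        AVG(Sales) OVER ()                       AS AvgSales,          -- one overall average (38 here)
        AVG(Sales) OVER (PARTITION BY ProductID) AS AvgSalesByProducts -- one average per product window
    FROM Sales.Orders;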
Okay, so now let's have an example in order to learn how to deal with the NULLs. Say we have the following task: find the average score of customers, and show additional information as well, like the customer ID and the last name. Let's solve this. We are now targeting the table customers, so let's just select it first, then include the customer ID and the last name, and the score as well. This time we want the average score, and since we don't partition the data, we leave the window definition empty. That's it, let's execute it. Now, as you can see, we have an average score of 625: SQL sums the four values and divides by four. But here we have a NULL, so we have to understand the business, or ask, what the NULL means in the customer scores. Is it a zero, or is it simply missing? If it's a zero, then the average we have is wrong, because it should be divided by five, not four. So let's say it means zero; that means we have to handle the NULLs, and for that we use the function COALESCE. So COALESCE on the score, replacing NULL with zero, and we call it customer score. Let's execute this. As you can see, where there is a value it stays exactly the same value; only where we have a NULL is it replaced with zero. Now let's correct the average. I'm just going to copy the whole thing, but instead of using the score directly we use the NULL-handled score; I just replace it like this, so here without NULLs. Let's execute it. Now, as you can see, we are getting a more valid result in the output compared to the previous one, and this is only for the case where the NULL means zero. So guys, as you see, be very careful with NULLs, especially if you are doing aggregations, and handle them correctly before doing any aggregation like the average.

All right, moving on to the last use case: the comparison analysis. The task says: find all orders where the sales are higher than the average sales across all orders. That means we have to compare the current sales with an aggregated value, this time the average of sales. Let's do it step by step. We select of course the order ID, let's take the product ID, and we need the current sales, the sales as it is, from Sales.Orders. That's it for now, let's execute it. By checking the result, you can see we have the first part of the equation, right? The sales for each order. Now we need the second part, the average sales across all orders. For that we use the window function AVG(Sales), and since it's across all orders, the OVER clause stays empty. Let's give it a name, average sales, and execute. Now in the output we have the average sales: 38. Next we need all the orders that are higher than the average. For example, order 1 is not higher, but order 4 is higher than the average. Now, in order to filter the data, we cannot use a window function in the WHERE clause, right? So, sadly, we have to use a subquery. It goes like this: SELECT * FROM, and then we define the condition outside the subquery: WHERE the sales are higher than the average sales. That's it, let's execute. And now, as you can see, it's very simple: we got all the orders that are higher than the average. It would be nice if we could do all of this in the first query, but since we can't, we use subqueries in order to filter the data afterwards. With that you can understand the importance of comparison analysis: here we are evaluating whether the data points are above or below the average, and this is very important in business analysis.
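The full pattern from this walkthrough, as a sketch (window functions are not allowed in WHERE, so the filter sits in an outer query; names are assumed from the narration):

    SELECT *
    FROM (
        SELECT
            OrderID,
            ProductID,
            Sales,
            AVG(Sales) OVER () AS AvgSales  -- one average across all orders
        FROM Sales.Orders
    ) AS t
    WHERE Sales > AvgSales;                 -- keep only the above-average orders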
All right everyone, so that's all for the window function AVG. Next, we're going to talk about two very interesting functions: MIN and MAX.

All right guys, so what are the MIN and MAX functions? They are very simple but very powerful functions for analytics. MIN simply returns the minimum, let's say the lowest value, within a window, and MAX is exactly the opposite: it finds the maximum value, the highest value, within a window. So now let's understand how SQL works with these functions. We have the same data and two tasks. First, find the lowest sales for each product, and second, side by side, find the highest sales for each product. We're going to use MIN and MAX, and as you can see the syntax is very simple: MIN of the sales, and then the partition is by the products; and on the other side the same thing, but with MAX. Okay, so let's see how SQL executes the first query. As usual, it first prepares the data: it splits the data into two windows, one for the caps and another one for the gloves. After that, it searches for the lowest sales within each window separately. For the first window we have the values 20, 10, and 5, and of course the lowest value is the 5, so SQL finds it over here, and everywhere in this window the output is the value 5. We have it as the lowest sales for the product caps. Then it jumps to the next window, the gloves, and starts searching the values. We have 30, 70, and NULL. The NULL is ignored, so it will not be considered as the lowest value, and SQL finds the lowest sales to be the 30, which is actually the first row within this window; the output is 30 for each row. So that's it, it's very simple, right? Now let's move to the next one, the same thing but using MAX. The data is partitioned, and for the first partition, what is the highest value? It's the first row, right? The 20. SQL finds it, and in the output we get the highest sales, 20, for this window. Then it goes to the second window and searches for the highest value: we have two values, 30 and 70, and it's the 70, right? In the output we get 70 everywhere. So guys, it's really simple. Now let's go back to our scenario from the average, where our business understands NULLs as zero in the sales. That means first we replace the NULLs with zero, and then we search for the values. So what happens? For the MAX, nothing changes: the highest value is still 70 and we get the same output. But for the MIN, we now have a new lowest value: it's not the 30 anymore, it's actually the zero. So SQL points to the 0 instead of the 30, and zero becomes the lowest sales for the product gloves. So again guys, the NULLs are very tricky, and these functions are really sensitive to NULLs. Understand what the NULLs mean and handle them correctly so that you get correct results in the output. So that's it. Let's go back to SQL for some tasks and use cases in order to practice.

All right everyone, let's start with the basic stuff: find the highest and lowest sales of all orders, and as well find the highest and lowest sales for each product, and we have to provide additional information. Let's solve it. SELECT the order ID, and let's take the product ID as well. Now, for the highest sales of all orders: it's the MAX function on the sales, and the window definition stays empty since it's over all orders, and we name it the highest sales.
Now for the lowest sales of all orders: it's exactly the opposite, the MIN function on sales with an empty OVER, and we name it the lowest sales. I'm just going to make it uppercase. Let's select from the table Sales.Orders; and let's include the sales as well, actually. All right, let's execute it. This is very simple, right? These are all the sales. What is the highest? We have the 90 of order 8. So, as you can see, the highest sales is the 90, and the lowest sales is the 10; the first order is the lowest. Very easy. Now we repeat the same thing per product, so we have to partition the data by the product ID. What I'm going to do is just copy and paste things around: the first one gets PARTITION BY the product ID, named highest sales by product, and the next one is the same thing, copy-paste, by the product. That's it, let's execute it. Now again the data is partitioned, divided by the product. For the first window, what is the highest sales? It's the 90, and the lowest sales is the 10, so it's exactly like the overall. Now let's go to the second window over here: the highest sales is the 60, the first one, and the lowest this time is 15. This is great in order to see that SQL executes each of those functions for each window separately. Let's go to the last window, a funny one: the sales is 60 and we have only one row, so it is both the highest and the lowest sales. So with that, as you can see, we can define a range for each product, and the ranges differ from one product to another. For example, for product 101 the range goes from 10 to 90, but for the second product it's between 15 and 60.

Okay guys, let's move to the next one, which is one of my favorites for the window functions: filtering the data using the MIN and MAX functions. We have the following task: show the employees who have the highest salaries. This sounds very simple, but we can use the help of window functions in order to solve it. We are working with the table employees, so let's just select the data: SELECT * FROM Sales.Employees. That's it, let's execute it. Now we have five employees with those different salaries. Let's find the highest salary: MAX(Salary), using the window function OVER, but we don't partition the data at all, named highest salary. Let's execute it. And now, by checking the results, we got a new column called highest salary, and inside it we have the 90K. If you check those five salaries, you can see that the highest is from the employee Michael. But the task is still not solved: we have to show only the employees who have the highest salary, so we have to filter the data somehow and show only this employee. In order to do that, we have to use a subquery, since we cannot use the window function in the WHERE clause. So: SELECT * FROM, and then our first query becomes the inner query, with the following condition: the salary should be equal to the highest salary. It's very simple: we are comparing the salaries with the highest salary, and if there is a match, the row is returned. So let's go and execute that. And that's it.
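That query, put together, looks roughly like this (a sketch; table and column names assumed from the walkthrough):

    -- Employees whose salary equals the overall maximum.
    SELECT *
    FROM (
        SELECT
            *,
            MAX(Salary) OVER () AS HighestSalary
        FROM Sales.Employees
    ) AS t
    WHERE Salary = HighestSalary;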
As you can see, we got the employee with the highest salary. And if multiple employees shared the same top salary of 90K, of course we would get all of them in the results. So this is another use case for the window functions MIN and MAX.

All right. Now we come to the use case of comparison analysis, where we want to compare the current sales with the highest and the lowest values. We have the following task: find the deviation of each sale from the minimum and maximum sales amounts. So now, as you can see, this is our sales, this is the highest, and this is the lowest. We just have to subtract them from each other in order to get the deviation. It's very simple. Let's get the first deviation, where we subtract the lowest value from the sales. So what we are doing over here is subtracting the lowest sales of all records from the current sales, and we call it deviation from min. Let's execute it. Now we can see from those values how far the current value is from the extreme, and the extreme here is the lowest value. This is a really great way to analyze the extremes in your data. The nearer we are to the extreme, the lower the value: as you can see, here we have a zero, because this order is exactly the extreme; this is our value, the 10. The next one is a little bit away from the extreme, the 15, so we have it here as a 5; this is not far from our extreme value. And if you check this value over here, we have 80, so the distance is very far from our extreme value, the lowest sales. This is a really nice analysis in order to evaluate the sales in your data. Now, of course, we can also evaluate our data against another extreme, the highest sales. To do that, we take the highest sales and subtract the current sales from it, and we call it the deviation from max. Let's execute it. Now in the output we get exactly the opposite distances: order number 1 is the farthest from the extreme, as you can see with the value of 80, and order 8 is the identical one, which is why it has a distance of zero. So we can also see very quickly which data points are the nearest to the extreme, the highest sales. As you can see guys, using the window functions MIN and MAX is very powerful in order to understand and evaluate your data points against the extremes.
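In one query, the deviation use case looks roughly like this (a sketch; names assumed from the walkthrough):

    SELECT
        OrderID,
        ProductID,
        Sales,
        Sales - MIN(Sales) OVER () AS DeviationFromMin,  -- 0 marks the lowest order
        MAX(Sales) OVER () - Sales AS DeviationFromMax   -- 0 marks the highest order
    FROM Sales.Orders;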
All right everyone, so now we're going to focus on a very important use case, one of the must-know use cases for data aggregations: the running total and the rolling total. These two concepts are very important for data analysis and reporting, and you must know them. The key use case for these two concepts is tracking: for example, we can track the current total sales against the target sales of our business. They are also great for doing historical analysis of trends. Okay, so now the question is: what are running and rolling totals? They are basically very similar. They aggregate a sequence of members, and the aggregation gets updated each time we add a new member to the sequence. A sequence could be, for example, a time sequence; that's why we call this type analysis over time.

So we still have the question: what is the difference between the running and the rolling totals? The running total aggregates everything from the beginning up to the current data point, without dropping off any old data. The rolling total, on the other hand, focuses on a specific time window, like the last 30 days or the last two months, and each time we add a new member, a new data point, to the window, we drop the oldest data point in the window. With this we get the effect of a rolling, or let's say shifting, window. Okay, I totally understand if this sounds complicated right now; let's have a very simple example in order to understand this concept, and as well how we can solve it using SQL. All right guys, so we have a very simple example: we have months and sales, and we have it twice, because I want to show you side by side how SQL works with the running total and the rolling total. So what is the task? On the left side we want to find the running total of sales for each month, and on the right side we want to find the three-month rolling total of the sales for each month. They sound very similar, but on the right side we have a fixed window. So how do we solve this using SQL? On the left side we can use SUM(Sales): we aggregate all the sales using the SUM function, and the window definition is ORDER BY month. And of course you can do the same with anything: if you use AVG with ORDER BY, you get a running average, or a running max, a running count, and so on. That means whenever you mix an aggregate function together with an ORDER BY, you generate the effect of a running total. Now, on the right side we have the same thing: an aggregate function together with ORDER BY, so SUM(Sales), ORDER BY month. So far we have everything like the left side, right? But now you might ask why this generates the running-total effect; we didn't specify anything crazy here. It's all about the definition of the frame clause. Do you remember? If you use an ORDER BY and you don't specify a frame clause, you get a hidden, let's say default, frame clause, and it looks like this: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. And what was the definition of the running total? It aggregates all the data from the very beginning, the unbounded preceding, until the current position, the current row, without dropping off any old members. So the definition of the running total is exactly the definition of the default frame clause; that's why it generates the running-total effect. Now let's go to the right side, the rolling total. Here again we have the same thing, right? We aggregate the data using the SUM function and sort the data with ORDER BY month, so with that we are also generating the running-total effect, because that's what happens each time you use ORDER BY with an aggregate function. But for the rolling total we always want to specify a frame, here in this example three months. That means when we get a new month, we don't want the window to keep growing over the old months; we always want a fixed window.
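In SQL, the two patterns side by side look roughly like this (a sketch; MonthlySales and its columns are assumptions standing in for the toy table in the example):

    SELECT
        SalesMonth,
        Sales,
        -- default frame: everything from the start up to the current row
        SUM(Sales) OVER (ORDER BY SalesMonth) AS RunningTotal,
        -- fixed three-month frame that slides and drops the oldest month
        SUM(Sales) OVER (ORDER BY SalesMonth
                         ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS RollingTotal3Months
    FROM MonthlySales;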
Now, in order to get this fixed-window effect, we have to redefine the frame clause, because if you leave it at the default, like the running total, the frame keeps extending; you will see this effect in the example. So we define it like this: ROWS BETWEEN 2 PRECEDING AND CURRENT ROW. With that, the total number of rows included in each window is a maximum of three months. Now, I know you might be saying: what are you talking about? I didn't get anything. That's totally normal; you will understand it only with an example. Let's start with the left side. First, SQL sorts the data, so everything is sorted from the smallest month to the highest one, from January until July; everything is good. Now it starts working with the frame. The frame says UNBOUNDED PRECEDING, and this is static: it always points to January, the first row in the data set. And since we are going from top to bottom, the current row also points to January at the start. So the frame looks like this: it's only one row, and the total sales of this row is 20; that's why we have 20 in the output. Now let's move to the right side. The current row is also January, and where is the two-preceding? We don't have it yet; it points somewhere before the table. So again, the frame is one row, and in the output we get exactly the same result, 20. So far there is no difference between the running total and the rolling total. But let's keep going. Now we go to the next row over here. What happens to our frame? It extends, right? We now have two months in this frame, and the total sales here is 30. We added a new member, and you can calculate it either way: sum all the sales within the frame, or take the previous aggregated value plus the new member; the previous one was 20, the new member is 10, and we get 30. Both are correct. Now to the right side: we are also at February, the two-preceding still points somewhere outside, the window extends to two months, and the same aggregation happens, so we have 30. So far nothing crazy, right? Let's go to the next month, March. The frame extends, so we now have three months, and the aggregation gives us either the sum of the frame, 60, or 30 + 30; we get the running total of 60. And now on the right side, what happens? We point to March as well, and this time the two-preceding points to January. This is the first time we get the whole fixed frame, right? We have three months in this frame, and its total is 60. Okay, so now you say: we're still getting the same results, there's no difference. I'm going to say: wait for it, it's the next one. As we go to April, the effect on the left side is that the frame extends to four months, because we always start from the first month until the current month without dropping any member, and the total of this is 65. So now, on the right side, what happens?
We add a new member, April, but we are already at the maximum size of the window; we are allowed only three. And that's because the two-preceding shifts down as well, so the boundary now runs from February until April, and with that we are dropping off January. And now you see the effect: it is sliding, it is rolling, shifting from top to bottom, because the boundaries are shifting too. So you can see the effect of the rolling total: the newest member is added and the oldest member is dropped, because we are only allowed to have three months. And what is the total of this? It's 45. This time we are not aggregating the previous 60 together with the 5; we are aggregating only the values within the window. Let's keep going. Now we are at June. What happens on the left side? The frame just gets bigger, and with that we get the result of 135; the frame is getting really big. But the right side has a fixed frame, so we are just sliding, shifting, and rolling: a new member is added, the oldest one leaves, and the total over here is 105. And now we go to the last row. On the left side we now have everything for the running total: the whole data set is aggregated, and this is the maximum we're going to get, around 175. But on the right side it just keeps shifting until we reach the last record; the frame shifts along like this, and its total is 105. Okay guys, so you see, it's very simple: the running total always considers everything from the starting position until the current row, without dropping any member, while the rolling total always drops the oldest member in order to add a new one, and the window keeps shifting. The running total is great for tracking, for example budget tracking, or tracking the current total sales against a target or something like that, because we are always considering the whole data set. With the rolling total we do focused analysis: we are always interested in a window of, say, three months. So they might sound very similar, but they have completely different scopes for analysis; both of them, though, aggregate over time, and they help us do analysis over time, like checking whether our business is growing or declining. So guys, as you can see, using very simple SQL with window functions we can do really great analysis on our data. These things are really fundamental for data analysis and reporting in our business; window functions are really powerful for data analytics.

Okay, so now we have the following task, and it says: calculate the moving average of sales for each product over time. We have here something called a moving average. It is very similar to the running total: there we used COUNT and SUM and so on, but here we use the function AVG, and instead of calling it a running average, we call it a moving average. Let's solve the task. We start, as always, by selecting the usual stuff: the order ID, the product ID, and, since it's over time, I'll take the order date as well, and the last one, the sales, from our table Sales.Orders. That's it, let's execute it. Now we have our 10 orders with the product, order date, and sales.
Let's start building our window function step by step. Which function do we need? We need AVG; this is the easiest part, since it says moving average, and we need the sales, so it's the average of sales. Now let's define the window. Do we have to divide the data, partition it? Well, yes: it says for each product, which means we use the PARTITION BY clause with the product ID. I'd say that's it for the first step; call it average by product, and let's execute it. Now, if you check the result, you can see we got our windows: the first one for product 101, where the total average of the sales is 35. So we have one aggregated value for each window, the same for the next product, and the next, and so on. But we don't have any progression over time, nothing like a moving average, right? We don't have this effect; we have just one average per window. So, in order to get the effect of the moving average, it's like the running total: we have to use the aggregate function together with ORDER BY. I'm just going to make it a new column and copy everything over. And what do we order by? Since it's over time, we use the order date, ascending, because over time we always start with the earliest date and end with the latest, from the lowest to the highest; we leave it at the default. Let's call it moving average. Now let's execute it. We got an extra comma here because of the copy-paste, so let's execute again. All right, let's check the results. Take the first window over here, and you can see a progression in the moving average: it starts with 10, 15, 40, 35. So there is a moving average; we don't have one solid number for the average, we have different values. How does SQL solve this? It's really simple: it goes row by row. For the first row, what is the average of 10? It's 10. Moving on to the next one: 10 + 20 divided by 2 gives you 15. Moving to the third one: all three values are summed and divided by three, and you get 40. And for the last row in the window, all four values are summed and divided by four, and you get 35. And this is exactly the same value as in the previous column, the average by product: without ORDER BY you also got 35, exactly like this last row, because it's the same calculation, summing those four values and dividing by four. But now the next value is interesting: as you can see, it comes from another window. We have 15 here for product 102, and the moving average is also 15, so SQL is not carrying over the old values from the other window. SQL calculates each window separately: this is the first value of this window, 15, so the average is 15, and then the same story, right? Summing the values, dividing by two, and so on. In data analysis we call this last column a moving average, and you can implement it very simply using the AVG function together with ORDER BY.
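Both columns together look roughly like this (a sketch; names assumed from the walkthrough):

    SELECT
        OrderID,
        ProductID,
        OrderDate,
        Sales,
        -- one flat average per product window
        AVG(Sales) OVER (PARTITION BY ProductID) AS AvgByProduct,
        -- adding ORDER BY turns it into a moving average within each product
        AVG(Sales) OVER (PARTITION BY ProductID ORDER BY OrderDate) AS MovingAverage
    FROM Sales.Orders;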
All right, let's move to the next task, and it says: calculate the moving average of sales for each product over time, including only the next order. As you can see, the first part we have already done, right? We have the moving average, partitioned by the products. But here we have an extra specification: including only the next order. That means we are talking about the current order and the next order, so we have a fixed frame, a fixed window; we don't need the whole average of the window, only a maximum of two orders in each calculation. So how are we going to do that? We can define a custom frame clause inside our window function; that means we cannot leave it at the default, we have to specify it. Let's do that. I'll just copy the old definition of the window, because it's exactly the same: AVG(Sales) OVER (PARTITION BY product ID ORDER BY order date). That's the first part. Now we want the fixed window, so we define our frame clause; I'm just going to zoom out a little bit. It's ROWS BETWEEN, and now we have the boundaries of the frame. It says including the next order, so the first boundary is CURRENT ROW, and since it's the next order, the second one is 1 FOLLOWING. That is our frame, including only the next order. Let's call it rolling average. That's it, let's execute. Now let's check the result: you can see the moving average has completely different values from the rolling average. Let's understand why, row by row. Take the first row over here: the sales is 10 and the rolling average is 15. Why is that? Because the calculation considers the next value: 10 + 20 divided by 2 gives you 15. That means SQL defined the frame as those two rows for the calculation of the first row. Now, moving on to the second row: SQL includes the next one, the third, right? But since the window is only two orders, it drops the first row. So the next frame looks like this, and as you can see it's 20 + 90 divided by 2, which gives you 55. Now you can see the effect of the rolling average, right? The next one is exactly the same: we are at the third row, it includes the next one, and we get the same value, because 90 + 20 divided by 2 gives you 55. Now, interestingly, for the last row in the window it will not consider the next value, because it is outside of the window. So it's 20, and it stays 20. That's it. All right guys, so with that we have learned about the moving average, the rolling average, and these amazing concepts using the window functions.
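Side by side, the two averages look roughly like this (a sketch; names assumed from the walkthrough):

    SELECT
        OrderID,
        ProductID,
        OrderDate,
        Sales,
        AVG(Sales) OVER (PARTITION BY ProductID ORDER BY OrderDate) AS MovingAverage,
        -- fixed two-row frame: the current order plus only the next one
        AVG(Sales) OVER (PARTITION BY ProductID ORDER BY OrderDate
                         ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) AS RollingAverage
    FROM Sales.Orders;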
All right, now we're going to have a quick overview of the different use cases of the aggregate functions, and how the definition of the window changes the whole use case. The first use case is finding the overall total. If you don't define anything in the window, if you leave it empty, you are doing overall analysis: you aggregate the whole data set and then provide this aggregation for each row. Moving to the next step, we can do an analysis called totals per groups: we add PARTITION BY to the definition of the window. By adding, for example, PARTITION BY products, what happens? The data is split into two categories, two groups, and the aggregation is done for each window separately. This is of course a great analysis in order to compare different products, like the caps and gloves here; it's helpful for comparing categories. So you can do this analysis, totals per groups, if you use PARTITION BY. Now, if you use ORDER BY, you land in the third use case: as we learned, we get a running total. As you can see in the output, we are building a cumulative value for the sales, and this helps us do progress-over-time analysis in order to understand the performance of our business. And now, moving on to the last use case, the final form of the window function with aggregation: here you have the aggregate function together with ORDER BY and a customized fixed window, and we can use it to analyze progress over time within a specific fixed window. And of course, you get the same effects with the other functions, not only SUM: you can use AVG, COUNT, MAX, all the aggregate functions. So guys, as you can see, the window function in SQL is very important for data analytics: just by changing part of the window definition, you generate a whole new use case.

All right friends, so now let's do a quick recap of the window aggregate functions. What do they do? They aggregate a set of values and return a single aggregated value for each row. So it's very similar to GROUP BY, but here we don't lose the details. Now to the next point, the rules of the syntax: for the expressions, they all expect a number, so you have to pass something numeric like sales or any integer, but COUNT alone can take any data type. And things are very simple for the aggregate functions: everything is optional inside the definition of the OVER clause, the definition of the window. You can use PARTITION BY, ORDER BY, and frames, or just leave everything empty; it's all optional. So now, as we learned, we have a lot of use cases for the aggregate functions, and they are really amazing for analytics. The first and simplest one: you can do overall analysis if you just leave the window empty, and you get one big number about your business. The next use case: we can do totals-per-groups analysis. As you've learned, we can use PARTITION BY in order to compare categories with each other, like comparing the products or customers and so on. Moving on to the next one, we can do part-to-whole analysis: we can compare the performance of each data point with the overall, so you can for example compare the sales to the total sales in the window or across the whole data set. And we have many comparison analyses: we can compare the current value with the average, or with the extremes, the highest and lowest sales, and so on. Another use case: we can identify data quality issues in our data, for example finding duplicates using the COUNT function. Moving on to the next use case, we have outlier detection: we can find out which data points are above or below the average, and so on. Then the next one: we have the running total.
As we learned, it is a great tool for tracking progress and monitoring the performance of our business over time. Or, if you want to be more specific, you can use the rolling total in order to have a specific window and track only that window, like three months or something like that. And the last use case: we can calculate the moving average of our data. So it's really amazing how ORDER BY together with aggregate functions opens the door to amazing, advanced analyses. So guys, as you can see, we have a lot of use cases for the window aggregate functions in the world of data analytics. All right, so with that we have covered the aggregate window functions, and the next step is very important: we will learn how to rank our data using window functions. So let's go.

All right, so now let's say we have the following data: products and their sales. If you want to rank your products, first you have to sort the data based on something, for example ranking the products based on their sales. That means SQL first starts sorting your data, say from the highest to the lowest; sorting the data is always the first thing SQL has to do before ranking anything. Now, in order to rank our data, we have two methods. The first method we call integer-based ranking: SQL assigns each row an integer, a whole number, based on the position of the row. Looking at the example: the first row, product E with sales of 70, gets rank number 1; the next row, product B with 30 sales, gets rank number 2; then the next ones get 3 and 4, and the last one gets 5. So SQL here is assigning an integer to each row based on its position in the sorted list; this method we call integer-based ranking. Now let's go to the second method, percentage-based ranking. In this method, SQL first calculates the relative position of the row compared to all the others, and then assigns a percentage to each row. So in the output it starts assigning percentages instead of integers, and we have a scale from 0 to 1. Now, if you compare the two methods: on the left side, the integer-based ranking, we have discrete, distinct values. It starts from 1, then 2, 3, and ends in this example at 5; it really depends on how many rows the result has, so it could be 5, it could be 500, 5 million, and so on. But on the right side we always have the same scale, from 0 to 1. Between 0 and 1 there are infinitely many data points, and we call this a normalized scale, or a continuous scale with continuous values. So now the question is: when do you use which method? The percentage-based ranking is great for answering questions like: find the top 20% of products based on their sales. This method is a great way to understand the contribution of data values to the overall total, and we call this kind of analysis distribution analysis. The integer-based ranking, on the other hand, answers questions like: find the top 3 products. With this question we are not interested in the contribution of each product to the overall total; we are just interested in the position of the value within a list. This is also very commonly used in analysis and reporting; we call it top/bottom-N analysis.
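As a small taste of the difference, the two questions could be sketched like this (ProductSales and its columns are assumptions; the percentage-based functions are covered in detail later, so treat this as a preview):

    -- Integer-based: the top 3 products by position in the sorted list.
    SELECT *
    FROM (
        SELECT Product, Sales,
               ROW_NUMBER() OVER (ORDER BY Sales DESC) AS PositionRank
        FROM ProductSales
    ) AS t
    WHERE PositionRank <= 3;

    -- Percentage-based: roughly the top 20% of products by relative position.
    SELECT *
    FROM (
        SELECT Product, Sales,
               PERCENT_RANK() OVER (ORDER BY Sales DESC) AS PctRank
        FROM ProductSales
    ) AS t
    WHERE PctRank <= 0.2;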
So now let's group our ranking functions by those two methods. In the first group, the integer-based ranking, we have four functions: ROW_NUMBER, RANK, DENSE_RANK, and NTILE. In the other group we have only two functions that generate percentage-based ranking: CUME_DIST and PERCENT_RANK. So that was an introduction, an overview of those methods and how we group the ranking functions. Next, we're going to learn the syntax of the ranking functions; most of them follow the same rules. For example, we always start with the function name, so we have here RANK, but as you can see we don't use any expression: these functions don't allow you to use any argument inside; it must be empty. This is the first rule for using rank functions. Then, about the definition of the window: as usual, the PARTITION BY is an optional thing; you can use it or leave it. The ORDER BY, on the other hand, is required: you must order the data, sort your data, in order to do ranking, so you cannot leave it empty. That means the window definition needs at least an ORDER BY, for example here by sales; we cannot leave it empty. All right, so those are the two requirements: you cannot use any expression in those functions, and you have to sort your data using ORDER BY. Okay, now let's have an overview of all the functions. As you can see, all of them are ranking functions, and almost all of them don't allow an expression inside them. The exception is NTILE: it accepts a number inside it, so you cannot use it empty; you must pass a number. All the others must be empty. As for the PARTITION BY, it is optional for all of them; the ORDER BY is required for all of them; and the frame clause is not allowed in the ranking functions, so you cannot change the definition of the frame inside the window function. So now, as usual, we're going to deep dive into each of these functions in order to understand when to use them and what the use cases are, and as well practice in SQL. We start with the first one, ROW_NUMBER.

All right, so what is ROW_NUMBER in SQL? The ROW_NUMBER function assigns each row a unique number as a rank, and it doesn't care at all about ties. That means if you have two rows sharing the same value, they will not share the same rank. Okay, so now we have a very simple example: we have a list of sales, and we have the following query. It starts with the ranking function ROW_NUMBER, which doesn't accept any argument inside it, and the definition of the window looks like this: ORDER BY sales DESC. That means we sort the data descending, from the highest to the lowest. So SQL does the following: the highest is the 100, the lowest is the 20, and here we have the 80 twice. Now, once SQL is done sorting the data, what happens? It starts assigning ranks, and ROW_NUMBER assigns a unique number to each row. It starts with the first one: the 100 gets rank number 1, the next one gets rank number 2, the second 80 gets rank number 3, the 50 gets rank 4, and then the last one gets 5. Now, if you check the output, you can see that all those numbers are unique; we don't have any repetitions.
So 1, 2, 3, 4, 5: no repetitions, unique, distinct values, and as well there is no skipping of ranks. We have 1, 2, 3, with no jumping to 6 or 7 or anything; a clear sequence of distinct values, and there are no gaps. But there is still something special in our data: in the sales we have the same value twice, so two rows with the same sales. As you can see, in the row number they get distinct values; they do not share the same ranking. That means ROW_NUMBER does not handle ties: if you have multiple rows sharing the same values, they will not share the same rank; they get distinct, different ranks. So this is how ROW_NUMBER works in SQL: it generates unique ranks for each row, it does not handle ties, and as well it doesn't leave any gaps; there is no skipping of ranks. So now let's go to SQL for a few examples and use cases.

All right, so now we have the following task; it's very simple: rank the orders based on their sales, from the highest to the lowest. This is very easy. We first select the data: order ID, product ID, let's take the sales as well, and select from the table Sales.Orders. Let's execute it. With that we got all our orders. What we do now is assign each row a rank, so we need a column here that contains the rank for each row. For that we use the window function ROW_NUMBER. It doesn't accept any argument inside it, so it stays empty, and then we have to define the window. As we learned, with ranking functions we cannot leave it empty: we have to sort the data using ORDER BY; the ORDER BY is a must. We don't have to use any PARTITION BY, since we are ranking all the data we have in the table. And how do we sort the data? It says based on their sales, from highest to lowest, which means we ORDER BY sales, and since it's from highest to lowest, we use DESC. Now let's give it a name: sales rank row, since we are using ROW_NUMBER. That's it, it's very simple; let's execute it. Now let's have a look at the results. Before, SQL sorted the data by the order ID, since we didn't define anything; but now that we order by sales descending, SQL sorted the data by the sales from the highest to the lowest and started assigning a rank, let's say a unique integer, to each row. So now the highest order is order number 8, with sales of 90; this is the highest one. And as you can see, we have 1, 2, 3, 4, 5, up to 10. Checking the results, you can see the ranking here is unique, so there are no duplicates, and as well there is no skipping and no gaps; we have everything between 1 and 10, even though our data contains a couple of sales sharing the same value. For example, we have these two orders where both have 60 in the sales, but they don't share the same ranking, right? And here as well, orders 9 and 3 share the same value, 20, but they don't share the same ranking. With that we have solved the task; it's very simple: we now have a rank based on the sales, from highest to lowest.

All right, so what is the RANK function in SQL? The RANK function assigns each row a number, a rank, and this time it handles the ties.
That means if your data has two rows with the same values, they are going to share the same ranking. One thing about the RANK function, though, is that it leaves gaps in the ranking: there is the possibility of skipping ranks. In order to understand how the RANK function works in SQL, we're going to have a very simple example. All right, so again the same data, but with a different function. Our window looks like this: it starts with the function RANK, which doesn't accept any argument inside it, then we have the window: ORDER BY sales descending, from the highest to the lowest. And our data is already sorted like that. So now, how is SQL going to assign the ranks? The first row gets the highest rank, so the value 100 gets 1, then the second one gets 2. But now for the third one, as you can see, we have two values that are the same: a tie, and this time SQL lets them share the same rank, so both of them get rank 2. It's not like the row number, where the third one got 3; this time we have 2, because we have a tie. Having the same values means sharing the same rank. And now, moving to the next value, it's the tricky one, because if you check over here you'd think the next rank should be 3, right? We have 1, 2, and the next value generated in the rank should be 3. But SQL says: you know what, this value's position is number 4 (as you can see: 1, 2, 3, 4, the position number here is 4), so it gets the rank of 4. With that, SQL leaves a gap in the ranking: you can see we are skipping rank number 3, and this always happens once you have a tie sharing the same ranking. For the next one it's easy: it's row number 5, so rank 5. So now, looking at the output of the RANK function, you can see that we don't have unique ranking; we have shared ranking in the case of ties. So it handles the ties, but we have gaps in the ranks; we are skipping ranks. When I think about the RANK function, I think about the Olympics: if two athletes tie for the gold medal, the first place, there will be no silver medal for second place; the next medal given is the bronze, the third place.

All right, so now let's go to SQL in order to practice the RANK function. We're going to solve the same task, but using the RANK function. We stay with the same example over here, and we rank the orders based on their sales from highest to lowest, but this time using RANK. So we use RANK, everything inside it stays empty, and then our window is exactly the same as before: OVER ORDER BY sales DESC. Let's give it a name: sales rank. That's it; as you can see, the syntax is very simple and very similar to the row number; we just changed the function. So now let's execute this in order to check the results. Looking at the new rank and comparing it with the old one, we can see that some rankings are shared, right? We have the 2 twice: rank number 2 appears twice because we have the same value over here, 60 and 60, so we get 2 and 2.
But if you compare it to the row number, you can see that there the ranking is not shared; that's one difference. And the same thing here: they have the same value, the sales is 20, so we have rank number 7 twice, while in the row number they get different values. And for the next value, as you can see, we are skipping a rank: there is a gap, there is no rank 8. You can see that this is row number 9, and that's why it gets the 9. The same thing over here, I believe: if you check these two ranks, the next one should be 3, but since it is at row number 4, it gets the rank 4. So, by checking the results, we can see that RANK shares rankings and as well leaves gaps. This is how RANK works.

All right, so what is DENSE_RANK? It is very similar to the RANK function: it assigns each row a number, a rank, and it as well handles the ties, so the same values share the same ranking. But this time it doesn't leave any gaps like the RANK function: DENSE_RANK will not leave any gaps, it will not skip any ranking. In order to understand this, we're going to have a very simple example. So let's go. All right, so again the same data, but with a different function: this time we have the ranking function DENSE_RANK, and the window is the same: ORDER BY sales descending, from the highest to the lowest. The data is already sorted as well; let's see how SQL assigns the ranks. As usual, the first row gets rank number 1, the second as well gets 2, but again here we have the same values: a tie, and like RANK, it lets them share the same rank, so both of them get rank number 2. And now you might say: well, this is very similar to the RANK function, so why do we have DENSE_RANK? I'm going to say: wait for it, we're going to see the difference in the next value. It comes over here, the value exactly after the tie. With RANK, SQL took the position number, the row number, which was 4, right? So 1, 2, 3, 4. But this time, with DENSE_RANK, SQL will not leave gaps in the ranking: there will be no skipping, and the next rank in the sequence is 3. That's why this value gets the rank 3. As you can see, there is no gap: we have 1, we have 2, and 3; we are not skipping, we are not leaving any gaps. And the last one gets 4. So this is exactly the difference between DENSE_RANK and RANK. Now, looking at the output of DENSE_RANK, you can see that we don't have unique ranks; we have shared ranks, as you can see with the repetition here. So it handles the ties, and as well it doesn't leave any gaps; it doesn't skip anything in the ranking. Okay, so that's it. Now let's go back to SQL to practice DENSE_RANK.

All right, so now we have the same task: rank the orders based on their sales from highest to lowest. We're going to do the same thing, but this time using the function DENSE_RANK. So, DENSE_RANK stays empty inside, and then we define it like all the others: OVER ORDER BY sales DESC, and we give it the name sales rank dense. And that's it; as you can see, all of these functions have the exact same syntax, right? So let's execute it. Okay, now let's check the results. We got our newest rank using the dense rank, and just by checking the results you can see that it handles the ties: we have the 2 twice, right?
So let's check the example over here: we have the sales 60 twice, and that's why they share the same ranking in the dense rank as well as in the normal rank. But now what's interesting is the value after the tie. As you can see, over here with the dense rank we have 3: we didn't skip any ranking, we don't have any gap, 1, 2, and then 3. But the RANK just focuses on the position number: it is row number 4, and that's why it's 4; with that we have a gap. So, as you can see, there are no gaps in the dense rank: we have 3, 4, 5. And now we have the same two values again over here: we have sales of 20 twice, and they share the rank 6 twice. So now you can see a difference between the dense rank and the rank: in RANK we have 7 and 7, but in DENSE_RANK we are at rank 6 and 6, and that's because we skipped rank number 3 earlier in the RANK. For the rest, you can see we have 7 and 8. So now, if you compare the three rankings, you can see that they all start at rank number 1, but they don't all end at the same rank. The ROW_NUMBER and the RANK really focus on the position number, the row number, of the orders: you can see over here it is row number 10, which is why we have 10 and 10 there, so the scale goes from 1 to 10, and that is exactly the same for the row number, from 1 to 10. But with the dense rank we have it from 1 to 8, and that's because we shared rankings, and with that we, let's say, wasted a few ranks; the scale is different from the two others. And that's because we have two ties: this is one tie, and over here we have another one, which is why we are missing two ranks. So this is how the dense rank works, and you can now compare all three together in order to understand how those rankings behave.

All right, so now let's quickly compare the three functions side by side. Let's start with the first point, the uniqueness of the rank: if you compare the three, you can see that only ROW_NUMBER generates unique, distinct ranks; with the two others we have duplicates, or let's say shared ranks. Okay, now the second point: whether the function handles the ties. The only one that doesn't handle the ties is ROW_NUMBER; the two others handle the ties, since they offer shared ranks. And now we have the last point, about leaving gaps, skipping ranks: if you check ROW_NUMBER and DENSE_RANK, you can see there will be no skipping, no gaps; only with the RANK function, the middle one, do we skip ranks and leave gaps. So that's it guys, these are the differences between the three functions. I tend to work with ROW_NUMBER more often than the other two. All right guys, so I had a look at these three functions and checked my real projects, and I found that there are many more use cases for the function ROW_NUMBER compared to the other functions, DENSE_RANK and RANK. So what we're going to do now: I'm going to show you a few use cases for ROW_NUMBER that I actually use in my real projects, so you understand how important the ROW_NUMBER function is. Let's go to SQL.
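For reference, here are the three functions side by side in one query, as a sketch (names assumed from the walkthrough):

    SELECT
        OrderID,
        Sales,
        ROW_NUMBER() OVER (ORDER BY Sales DESC) AS SalesRankRow,   -- unique ranks, ties not shared, no gaps
        RANK()       OVER (ORDER BY Sales DESC) AS SalesRank,      -- ties share a rank, gaps appear after ties
        DENSE_RANK() OVER (ORDER BY Sales DESC) AS SalesRankDense  -- ties share a rank, no gaps
    FROM Sales.Orders;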
We call this top-N analysis. Here the managers or decision makers would like to see the best performers in our data: for example, the top five customers, or the top five products or categories, and so on. This is a very important analysis in order to focus on the best products or the most important customers, and as I said, it is very classic and very important for making decisions in the business. So now let's see how we can solve it. We start with the usual stuff; let's first select the data: SELECT order ID, let's take as well the product ID, and the sales, from sales orders. So let's go and execute this. Now, as we know, each product has multiple orders and multiple sales, but we are interested only in the highest sale for each product. So we have to go and create a rank, and for that we're going to use the function ROW_NUMBER, and we have to define the window. Do we need PARTITION BY? Check the task: it says for each product, and that means we have to divide the data by the product ID. So let's use PARTITION BY product ID. And now we must use the ORDER BY. How do we sort the data? By the sales, right, and from the highest to the lowest, so ORDER BY sales descending. Let's give it a name: rank by product. Let's go and execute this. Now, looking at the result, you can see that SQL did divide the data by the product ID; we have around four windows here. In the first one you can see that the rank starts at one and ends at four: the highest-ranked is the order number eight with sales of 90, and then it goes down to four. And as you can see, in the second window we have a new ranking; it resets. The first is going to be the order number 10 and the last one is order number two. So each window has its own ranking, and the last window has only one row. Now of course, the task says we have to return the highest, so we are not interested in the others. We have to return this row, this row, this one, and this one: everything that has the rank one. We are not interested in the ranks 2, 3, 4 and so on; we would like to have the highest. So in order to filter the data, we're going to use a subquery: SELECT star FROM, and then we have the following condition: WHERE rank by product equals one. So we are interested only in the rank number one. Let's go and execute it. And with that, since we have four products in our data, we get only four rows, and those sales are the highest for each product. As you can see, we have only the number one over here. With that we have solved the task of top-N analysis. Okay, moving on to the next use case. The task says: find the lowest two customers based on their total sales. So now we have the exact opposite use case; we call it bottom-N analysis. In this scenario the decision makers want to optimize and cut costs, and for that they have to analyze the lowest performers among the products, or the lowest performers among the employees, in order to cut costs.
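Before we solve it, here is the top-N query we just wrote, put together as one sketch (same assumptions: SQL Server and a Sales.Orders demo table with a ProductID column):

-- Highest sale per product: rank inside each product's window, then keep rank one
SELECT *
FROM (
    SELECT
        OrderID,
        ProductID,
        Sales,
        ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY Sales DESC) AS RankByProduct
    FROM Sales.Orders
) AS RankedOrders
WHERE RankByProduct = 1;  -- change 1 to N for a general top-N

Now, the bottom-N task.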
With this analysis, the decision makers are not focusing on the most successful items; we are focusing on the lowest performers. So let's solve the task. If you check the question, we have multiple things in it: we need the total sales, and we have to find the lowest two customers. So we have ranking as well as aggregation, and remember, we can combine the two with GROUP BY. Let's do it step by step. First, select the data. What do we need? Order ID, customer ID, and the sales, from sales orders. Let's execute this. Now if you check the customers over here, we have around four customers and they have multiple sales each. We would like the total sales for each customer in order to find the lowest two. So let's start with the aggregation: we aggregate the sales, the SUM of sales, and let's call it total sales. And for the GROUP BY we have to keep only the customer: GROUP BY customer ID. It is a very simple GROUP BY statement; let's execute it. By checking the result we can see that SQL did aggregate the data: we have four rows, because we have four customers, together with their total sales. So we have solved the first part of the task, the total sales for each customer. Now the second part: it says the lowest two customers. That means we have to use a ranking function in order to rank those customers, because we are not interested in all customers, only in the lowest two. For that we're going to use the window function ROW_NUMBER, then OVER. Do we have to partition the data? No, we don't. We do have to sort the data: ORDER BY, and this time we use the aggregation inside the ORDER BY, the SUM of sales, sorted from the lowest to the highest, so I'll just use the default, ascending. Let's call it rank customers. And remember the rule here: if you are using a window function together with GROUP BY, you may only use columns and expressions that appear in the GROUP BY or in aggregations. So this should work; let's execute it. Now, as you can see in the results, we got an extra column for the rank. The lowest customer is going to be customer number two, the second one customer four with 90 total sales, and the customer with the highest total sales is the last one, customer number three with 125. So we have almost everything, but the list should contain only those first two. To filter the data, we use a subquery: SELECT star FROM, and then we define the condition WHERE rank customers is smaller than or equal to two. With that we get the first two. Let's execute it, and we got the lowest two customers based on their total sales: customers number two and four. That's it, we have solved the task, and with it we have done bottom-N analysis. Okay, let's keep moving to the next use case, with the following task: assign unique IDs to the rows of the table orders archive. So guys, you might be in a situation where you have a table without any primary key, and you would like to create an ID for each row.
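For reference again, the whole bottom-two query as one sketch, under the same assumptions (note the window function ranking the aggregate):

-- Lowest two customers by total sales: aggregate first, then rank the aggregates
SELECT *
FROM (
    SELECT
        CustomerID,
        SUM(Sales) AS TotalSales,
        ROW_NUMBER() OVER (ORDER BY SUM(Sales) ASC) AS RankCustomers
    FROM Sales.Orders
    GROUP BY CustomerID
) AS RankedCustomers
WHERE RankCustomers <= 2;

Now, back to the unique-ID idea.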
To create such an ID, we can use the function ROW_NUMBER to generate unique identifiers for the rows of our table if we don't have one. Having such an ID for each row is very important for things like importing and exporting data, joining tables on the ID, or, let's say, optimizing query performance using the ID. So let's see how we can generate it using ROW_NUMBER. Okay, let's first select the table orders archive in order to understand its content: SELECT star FROM sales orders archive, and execute. By checking the result you can see that we have 10 orders and that there are repetitions in the order ID over here, so it is not really a primary key: we have the ID 4 twice, and here we have the ID 6 three times. So what we're going to do is generate a unique identifier for each row. For that we go over here and say ROW_NUMBER, and then we define the window. We don't partition the data at all, but we have to sort it, so ORDER BY order ID — or you could use something else, like the order date; it doesn't matter. Let's add the order date to it as well, and call it unique ID. Let's execute this. Now, checking the data, you can see that we have a new ID over here coming from ROW_NUMBER, and it is a unique identifier: we have 10 rows and, with that, 10 distinct unique IDs. So as you can see, we have solved the task and we now have a unique identifier, an ID, for the table orders archive. Having this ID, we can do many things, like joining tables, or something special and important called pagination. Imagine we have a huge table and we would like to retrieve the data. In order not to pull all the data in one go, we can divide it by the primary key or by the unique identifier: for example, one page from 1 to 100,000, and then the second page from 100,001 to 200,000. By dividing the data like this, we can improve exporting and importing, or give users faster retrieval; we don't want the whole data in one go on one page. So pagination has a lot of benefits, and we can only do it if we have a nice clean ID like this. All right. Now I'm going to show you the last use case for the function ROW_NUMBER that I use in my real projects. Sometimes when you are doing data analysis you will find data quality issues, especially duplicates. What I usually do is use ROW_NUMBER in order to identify the duplicates, and not only that: I can use it to delete the duplicates. So we can use it for data cleansing, and this is an essential task for every data engineer, not only data analysts, in order to prepare and clean up the data before doing any analysis. So let's take the following task: identify duplicate rows in the table orders archive and return a clean result without any duplicates. So not only do we have to identify the duplicates, we also have to return a result with no duplicates. Let's see how we can do this. First, select the data: SELECT star FROM sales orders archive, and execute. Now, looking at the data, you can see that we have duplicates, we have an issue: the order ID number 4 appears twice in our database. That doesn't make sense, right?
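As a quick aside, here is the ID-generation and pagination pattern from a moment ago, sketched in one query (the table name Sales.OrdersArchive and the page boundaries are assumptions for illustration):

-- Generate a surrogate ID, then page through the data in fixed ranges
SELECT *
FROM (
    SELECT
        ROW_NUMBER() OVER (ORDER BY o.OrderID, o.OrderDate) AS UniqueID,
        o.*
    FROM Sales.OrdersArchive AS o
) AS NumberedRows
WHERE UniqueID BETWEEN 1 AND 100000;  -- page one; the next page would be 100001 to 200000

Back to those duplicates.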
There should be only one, so which one is the correct one? If you check the data over here, you can see that this order was shipped and then delivered, so it looks like the last one is the correct one. How can we determine that? If you scroll to the right, you can see that we have a creation time, and we usually use such a timestamp to identify the last valid version of an order. Here we can see immediately that this order's time is higher than the previous one's, which means it is the more up-to-date, the more current one. So what we're going to do is rank our data for each order ID and sort it by the creation time, in order to find the last inserted row for each order. Let's see how. We go over here and say ROW_NUMBER, then OVER, and we partition by the primary key: PARTITION BY order ID. And as we said, we have to order the data by that timestamp: ORDER BY creation time descending, so the highest first, then the lowest. That's it; let's call it rn and execute the query. Now, checking the data: if everything were clean with no duplicates, every value should be one, because each primary key should have at most one row. But you can see over here a two, and over here even a three. That is an indicator that we have duplicates inside our data. Checking one by one: this order ID exists only once, so it has rank one; the second one as well. But here we have the issue: we now have two ranks for the order ID 4. Which one is correct? In our logic, it is the last row inserted into our data, and that is rank number one. If you scroll to the right side, you can see that the creation time here is higher than the other one. With that we have identified what we want: the last inserted row for each ID. Now let's check this one over here, which appears three times. It says the first one has the highest creation date; if you go to the right side and compare those timestamps, you can see that this record, the first one, is the latest one inserted into our data. So this is the one that we need; the other two we don't need, because they are old information. So everything that doesn't have the rank number one is not valid; it is something old, and it is actually bad data quality. We want to remove it, or simply not select it. So in order to get clean data, we select the following as a sub-select: SELECT star FROM the subquery, and we are interested only in the rank number one, nothing else. Let's execute. Now if you check the results, look at the order ID: it is unique, there are no duplicates. Right? 1, 2, 3, 4, 5, 6, 7: no duplicates at all, and we now have only the latest inserted data for the orders, without any duplicates or data quality issues. Of course, we can now go on with this result to do further analysis, and this is exactly what data engineers usually do: clean up and prepare the data before any data analysis happens. And of course, you may want to communicate those data quality issues to the source of the data, let's say if you are not the owner of that information.
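Here is the whole cleansing pattern as one sketch, under the same assumptions; note that flipping the filter hands you the bad rows instead of the clean ones:

-- Keep only the latest version of each order; rows with rn > 1 are the duplicates
SELECT *
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY OrderID ORDER BY CreationTime DESC) AS rn
    FROM Sales.OrdersArchive
) AS Flagged
WHERE rn = 1;  -- use WHERE rn > 1 instead to list the bad-quality rows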
You can generate a list of all the bad-quality rows and send it to the source system, telling them to clean it up at the source. So in order to select the bad data, we can just change the condition here and say: if it is higher than one, then it is bad data. Let's execute this. Now the results contain all the records that shouldn't exist in the data in the first place. We can export that and communicate it to the source: check here, you have something wrong in your system, and this information should not have been inserted into the data. So everyone, it is very strong, right? Very powerful. I use it a lot in my projects. There are many use cases for the ROW_NUMBER function in SQL: we can use it for top-N analysis, bottom-N analysis, finding the best and worst performers; we can assign unique IDs in order to do pagination; and we can use it to discover data quality issues and clean up our data. It is an amazing function in SQL and you're going to use it a lot. So that's it for the three functions ROW_NUMBER, RANK and DENSE_RANK. Now we're going to talk about NTILE. Okay, so what is NTILE? NTILE in SQL is very simple: it divides your rows, your data, into a specific number of roughly equal groups, which we sometimes call buckets. In order to understand this, and how SQL works with this function, we're going to have a very simple example. So let's go. Okay, we have the following setup: four rows of sales, and we would like to divide them into two groups, into two buckets. For that we can use the NTILE function. It has a different syntax than the other ranking functions: it starts with NTILE, and then we must define a number; we cannot leave it empty like the other ranking functions. Here we have two buckets, then OVER, and here again we have to sort the data: the ORDER BY is a must, sales descending, from the highest to the lowest. As usual, SQL is going to sort the data; we have it already sorted in this example. Then it starts assigning each of those rows to buckets, but first SQL has to calculate the bucket size: how many rows fit inside each bucket. The calculation is very simple. It says: the bucket size equals the number of rows divided by the number of buckets. What is the number of rows here? We have four rows. And the number of buckets we define in the syntax of the query; here we defined two buckets, we need two groups. That means we are dividing four by two, and the size of the bucket is going to be two. With this, SQL is ready and starts assigning each row to a bucket. It starts at the top: the first row goes into bucket number one. Then it goes to the next one and says, okay, we still have enough space in the bucket, right? So it assigns it to one as well. But with this we reach the maximum number of rows in the bucket, so the next row is assigned to another bucket: it gets two, and the last one gets two as well. So as you can see, it's very simple: we have assigned our sales, based on the sorting of course, into two buckets. These two sales belong to bucket number one and the other two belong to bucket number two. Very easy. That was very straightforward, because we divided an even number of rows and got perfectly sized buckets.
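As a minimal sketch of the syntax (assuming the Sales.Orders demo table; the bucket count is the one required argument):

-- NTILE(n) assigns each row to one of n buckets; bucket size = row count / bucket count
SELECT
    OrderID,
    Sales,
    NTILE(2) OVER (ORDER BY Sales DESC) AS TwoBuckets
FROM Sales.Orders;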
But now, what happens if we have an odd number? Say we have five rows instead of four. The bucket size is going to be five divided by two, and we get 2.5. Of course SQL will not put two and a half rows into each bucket and split a row across two buckets; that would not work. We should now have one bucket with three rows and another bucket with two. The rule in SQL makes this very clear: larger groups come first, then smaller ones. That means with an odd number like this, the larger group is going to be the first group. So it's going to reset everything; let's see what happens. The first row goes into bucket one, the second as well, and the third one too, so the first group is larger than the second. Then the rest go into bucket two. As you can see, the larger group comes first, then the smaller. And this is how SQL works if you have odd numbers: you don't get perfectly sized buckets, you get approximately, roughly equally sized buckets. So this is how NTILE works. Now let's go back to SQL in order to practice this function. Okay, so let's have some fun working with it. We'll just select something like order ID, sales, from sales orders. Let's execute it, and with that we got our 10 rows. Now let's say I would like to create only one bucket from the data: NTILE with one bucket, then OVER — no PARTITION BY, let's take ORDER BY sales descending. That's it; I'm going to call it one bucket. Let's execute. As usual, SQL sorts the data and then calculates the bucket size: 10 rows divided by one, so the size of the bucket is going to be 10. That's why you see a one everywhere: all those rows fit into one bucket. Very simple, we have only one bucket. Now let's have two buckets. I'll just copy and paste, and instead of one we have two; let's call it two buckets. Let's execute this. Again, what is the size of the buckets? It is 10 divided by two, so we get perfectly sized buckets: the first bucket is the first five rows and the second one is the next five rows. Perfect. Let's go to the next one: three buckets. Execute. Now SQL divides 10 by three to get the bucket size, and it's going to be about 3.3. It is a decimal, so we will not get perfectly sized buckets. Again, the larger group comes first, then the smaller: we have to fit four into the first group in order to get the others with three. That's why the first bucket is going to be the biggest one: four rows in the first bucket, then the next three rows go into bucket two, and the last three into bucket three. As you can see, the largest group is the first bucket. Let's keep playing with the data and take four now; we would like to have four buckets, and this is where it gets interesting. SQL is going to divide 10 by four and get something like 2.5, so again we will not get perfectly sized groups. SQL has to fit 10 rows into four groups.
So the first three rows fit into bucket number one, and the next three rows, like this, go into bucket number two. Then, as you can see over here, we have two buckets with a size of two, and with that we can fit 10 rows into four groups. And again, you can see the larger groups come first, like this one and the second, and the smaller ones come later. Okay, so this is how NTILE works in SQL. Now you might say: you know what, why do I need buckets in the first place? What is the use case? There are two use cases for the NTILE function in my projects. On one hand, if I am a data analyst, I use the NTILE function in order to segment my data. On the other hand, if I am a data engineer, I use the NTILE function for ETL processing, and specifically to do load balancing. So let's start with the first use case, as a data analyst, where you want to do segmentation with the NTILE function. Segmentation is a very nice way to understand your data: you can segment your data into different buckets or groups, for example segmenting the customers. You can group your customers depending on their behavior, like their total sales or total number of orders, and with that you can build, for example, a VIP segment, then a medium one, then a low one. In order to understand the segmentation use case, let's take the following task. Okay, the task says: segment all orders into three categories, high, medium and low sales. To solve this, let's start with the basics: SELECT order ID, let's take the sales, from our table sales orders, and execute. As usual we got our 10 sales. Now, if you check the task, it says we need three categories; that means we need three buckets, right? And it says high, medium and low sales, so we are dividing by the sales. Let's do it step by step. We're going to use NTILE, since we need to segment the data; three categories means three buckets. Then we define the window: OVER — we don't have to divide the data with PARTITION BY, we just need to sort it by the sales, ORDER BY sales, and let's take DESC since we want it from the highest to the lowest. That's it; let's call it buckets and execute. Now if you check the data, you can see it is segmented into three buckets: the first bucket contains all the orders with high sales, the second one all the sales in the middle, and the last one all the orders with low sales. So we have already categorized our data into three groups. But as you can see we have numbers, and the user is probably expecting those texts: high, medium, low. So what we're going to do now is translate those numbers into text, into words. Of course we cannot do that inside the window function; we're going to use a data transformation, the CASE WHEN statement. Don't worry about it, we will have a complete dedicated section explaining CASE WHEN; for now just follow me to see how this works. We're going to use a subquery: SELECT, let's take the star for everything, and then the following logic: CASE WHEN buckets equals one, THEN the sales are 'High'. So we are just mapping the numbers to text.
Then the next branch: WHEN buckets equals two, THEN we are targeting the 'Medium', and for the last group, WHEN buckets equals three, THEN those sales are 'Low'. We close it with END and call it sales segmentation. That's it; let me just make it a little bit smaller so you can see it. And then FROM, followed by our subquery, like this. As you can see, we just mapped the numbers to text; we are simply doing a translation. Let's execute it. Checking the results, we got our three categories for the users: the first category is the high sales, the second one the medium sales, and the third one the low sales. So guys, you can see NTILE is very powerful for segmenting our data. You can go and segment things like customers by their total sales, products by prices, employees by their salaries, and so on. All right, so that is the first use case for the NTILE function, as a data analyst, where you segment your data in order to understand behavior. On the other hand, if you are a data engineer, you can use the NTILE function to do load balancing in your ETL. I'll explain it with a very simple sketch. We have the following scenario: two databases, and we would like to move one big table from database A to database B. In this case I am doing something called a full load; that means I am loading all the rows from one database to the other. If you do it in one go, it could take a long time, hours or sometimes even days, and at the end you might get network errors because you have stressed the network between those two databases; everything breaks, you lose the data, and you have to start again. So instead of loading this table in one go, we can split it into fractions, let's say into packets. We can split this table, for example, into four small tables using the function NTILE. After splitting the big table into small tables, we start moving those small tables one after another; with that we are not stressing the network and it is going to succeed. After loading everything, at the end, in the target database we have those small tables, and of course we can use a UNION to merge them back into the big table that we had in the original database. This is a very common use case for NTILE: splitting the load and balancing the processing of extracting data. All right. So now we have the following SQL task. It says: in order to export the data, divide the orders into two groups. Let's go and do that. First we select everything from the table, just to see the data: sales orders, execute. We got our 10 orders, and what we have to do is split them into two groups. For that we can use the NTILE function: two groups means two buckets. Let's define the window: here we don't have to partition the data with PARTITION BY, but we do have to specify the ORDER BY. Now, which column do we use to sort the data? There is no rule here: you could split the data by sales, by the order status, by date, by anything you want. But we usually use the primary key.
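Before we finish the export example, here is the whole segmentation query from the previous task, assembled as one sketch (assumptions as before):

-- Segment orders into three buckets by sales, then map the bucket numbers to labels
SELECT
    *,
    CASE WHEN Buckets = 1 THEN 'High'
         WHEN Buckets = 2 THEN 'Medium'
         WHEN Buckets = 3 THEN 'Low'
    END AS SalesSegmentation
FROM (
    SELECT
        OrderID,
        Sales,
        NTILE(3) OVER (ORDER BY Sales DESC) AS Buckets
    FROM Sales.Orders
) AS SegmentedOrders;

Now, back to why we sort the export buckets by the primary key.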
It is simply systematic, better and cleaner, especially if you have a sequence of numbers in the order ID: you can export the first range of orders, then move on to the next group, and so on. So let's go with the order ID and give it the name buckets. That's it; let's hit execute. As you can see, it's very simple: we got our two groups. This is the first batch of the data and this is the second batch. Now we can select the first batch and export it, import it into the next system, and after that go on with the second batch. And of course, if you still suffer from the size of those packets, you can split into smaller pieces: go over here and make it four, and with that we get smaller buckets, which might make exporting the data easier. So this is a really great use case for the NTILE function. All right everyone, with this you have learned the two use cases for the NTILE function that I usually follow in my projects: as a data analyst you can use it for segmentation, and as a data engineer you can use it to load-balance your ETL. Okay everyone, with that we have covered everything about the integer-based ranking functions. Now we're going to talk about the second method: the percentage-based ranking functions, and here we have two functions, CUME_DIST as well as PERCENT_RANK. So let's have a quick recap. With percentage-based ranking, SQL calculates a relative position as a percentage and assigns it to each row, so the output is a continuous, normalized scale from 0 to 1. This is really amazing for doing distribution analysis: these functions consider in their calculation the overall total, the whole size of the data set, which helps us find the contribution of each value to the overall total. In SQL we have two different formulas to generate the percentage: on one hand we have the function CUME_DIST, and on the other hand the PERCENT_RANK. So two different functions, with two different formulas, to generate and calculate the percentage. Let's start with the first function, CUME_DIST. All right everyone. CUME_DIST stands for cumulative distribution: it focuses on, or calculates, the distribution of your data points within a window. To understand what that means, we'll go through a very simple example of how SQL works with this function. So let's go. All right, again we have our very simple example of the sales, and we have the following query: CUME_DIST, with no argument inside it, so it stays empty, and the window is, as usual, ORDER BY sales descending, from the highest to the lowest; the ORDER BY is a must. The first step: SQL sorts the data, and we have it already sorted from the highest to the lowest. The next step: SQL starts calculating the percentage for each row, and we have a very simple formula. It says: CUME_DIST equals the position number of the value divided by the number of rows. It's very simple. Let's do it step by step.
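Written out as a query, the example looks like this — a sketch; the walkthrough below steps through the same formula by hand:

-- CUME_DIST() takes no arguments; the ORDER BY is required
SELECT
    Sales,
    CUME_DIST() OVER (ORDER BY Sales DESC) AS DistValue  -- position of last row with this value / total rows
FROM Sales.Orders;  -- any table works; the walkthrough uses five toy sales rows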
SQL starts with the first value in our list. What is the position number of the first value? It is one, right? This is the first value in our list. And what is the total number of rows? We have five rows: 1, 2, 3, 4, 5. So we divide one by five, and the result is 0.2. This is the value for the first row. Okay, now SQL goes to the next row, and this time we get a special case: as you can see, we have the 80 twice, so we have a tie here. First we need the position number; we are at position number two, right? But since we have the 80 multiple times, SQL takes the last position where we see the value 80, and that last position is record number three. That's why SQL says for this record it is the position number three, not two; then it divides it by five, and we get the value 0.6. This is the most confusing thing about this function: if SQL finds a tie, it completely ignores the current position number. We don't get a two; it takes the last position number of the same value, and the last one in our list is record number three. That's why we have three over here. Okay, let's keep moving to the third row. As you can see, we are again in the tie, but this time it is the last time we see 80; after this there is no more 80. So what happens? We get the exact same result, 3 divided by 5. So if we have a tie, the rows share the same percentage: with CUME_DIST, equal values share the same rank. Let's keep moving to the fourth one. What is the position number of the 50? We are at record four, so position number four divided by five gives us 0.8. Okay, now to the last one, and it is the easiest one. Which position do we have over here? It is position number five, the last one, and the number of rows is five; that's why we get one. So guys, that's it, this is how the cumulative distribution works. Once you understand the formula, the output becomes very easy to read. As you can see, calculating the percentage always depends on the total size of our data set — you can see the number of rows in the formula — and the output helps us understand the distribution of our data points within the data set. All right everyone. Now we move to the second function that generates a percentage as a rank: the PERCENT_RANK. The PERCENT_RANK focuses on generating the relative position of each row within a window. To understand what that means, let's again go through a very simple example of how SQL works with this function. So let's go.
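Again as a query first — a sketch; the hand calculation follows:

-- PERCENT_RANK() takes no arguments; the ORDER BY is required
SELECT
    Sales,
    PERCENT_RANK() OVER (ORDER BY Sales DESC) AS PctRank  -- (position - 1) / (total rows - 1)
FROM Sales.Orders;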
Okay, again we have those sales, a very simple example, and the syntax is like this: PERCENT_RANK, with no arguments inside it, and the window like this: ORDER BY — it is a must — sales descending, from the highest to the lowest. The first step SQL does is sort the data from the highest to the lowest, and we already have it like this. Next, SQL starts calculating the percentage, which is very similar to the cumulative distribution, but this time the formula is: the position number minus one, divided by the number of rows minus one. So it is almost the exact same formula; we are only subtracting one from both numbers. Okay, let's go through all the rows step by step and see the output. SQL starts with the first row, right? What is the position number of the first row? It is one; we subtract one and get zero. Now, what is the total number of rows? We have five rows here; subtract one, and we get four. Zero divided by any value gives zero, so for the first value we get a zero. All right, now to the second row over here, and here is our special case, the tie: two sales sharing the same value, 80. For PERCENT_RANK, SQL behaves differently than with CUME_DIST. Remember, with CUME_DIST, SQL searched for the last position of the shared value: it was position number three, since that is the last time we see 80. But PERCENT_RANK sticks with the first occurrence of the shared value. So checking those two 80s, what is the first occurrence? It is record number two. That's why we take position number two, subtract one, and get one. And the same for the totals: we have five, subtract one, we have four. Now if you divide one by four, we get the result 0.25, and this is the percentage of this value. Now to the third row: here we have the tie again, so SQL sticks with position number two, the first occurrence. It's the same: two minus one gives one, and the total number of rows, five minus one, gives four. That's why we get the exact same result. So here, as you can see, PERCENT_RANK is like CUME_DIST: shared values share the same percentage rank. Now let's move to the fourth one, the value 50. What is its position? It is record number four; subtract one and we get three. Divide three by four and you get 0.75. And now, moving to the last value over here, it's going to be easy. What is the position number of the 30? It is five; five minus one is four, and the denominator is also four, the total of five rows minus one. Four divided by four gives one. So that's it guys, this is how the PERCENT_RANK works. It always has a scale from 0 to 1, no matter which values we have inside, and it is a continuous scale. And again, if you have a tie, the rows share the same percentage rank. Okay guys. Now if you compare those two functions, you will see that they are really similar to each other.
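Side by side, with a 1-based position p over N rows:

CUME_DIST    = p_last / N               -- p_last: the last position where the current value occurs
PERCENT_RANK = (p_first - 1) / (N - 1)  -- p_first: the first position where the current value occurs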
The output of both functions is a percentage-based ranking, and both of them handle ties perfectly: ties share the same percentage rank. If you check the syntax, they are very similar. And checking the formulas of both, we are always considering the overall size of the data set: the size goes into the calculation to help us find the relative position of each value to the overall, and this is very important in analysis for measuring the contribution of each value to the total. About the use cases: if you want to focus on the distribution of your data points, go with the cumulative distribution; but if you want to focus on the relative position of each row, go with the PERCENT_RANK. All right. Now, there is one more difference between CUME_DIST and PERCENT_RANK, and you can see it in the formulas: CUME_DIST is more inclusive, because we always count the position of the current row itself. With PERCENT_RANK we don't count the current row; we skip it, we make it exclusive. So we say PERCENT_RANK is more exclusive and the cumulative distribution is more inclusive. Now, if you ask me the hard question, which one to use: if you want to be more inclusive, go with the cumulative distribution; if you want to be more exclusive of the current row, go with the PERCENT_RANK. They are very similar to each other. So: to calculate the distribution of your data, go with CUME_DIST; to find the relative position of each row, go with PERCENT_RANK. All right. So now we have the following task: find the products that fall within the highest 40% of the prices. Let's go and solve it. We are targeting the table products, and I will just select two columns: product, price, from sales products. That's it; let's execute. As you can see, we got five products and their prices. The task says: find the highest 40%, so we have to generate a percentage rank. For that we have the two functions, CUME_DIST and PERCENT_RANK; this time I will go with CUME_DIST. So, CUME_DIST, and then we define the window like this: ORDER BY — we are targeting the prices now, right? — ORDER BY price from the highest to the lowest, and let's give it the name dist rank. Execute. With that, SQL generates for us a percentage ranking using the formula we just learned. Now, in the output we get all the products, but the task says we should get only the products that are in the highest 40%: that means the first row, the second row, and that's it. Those rows are in the highest 40%; the rest are below that. To filter the data, we use a subquery: SELECT star FROM our subquery, and then the filter: dist rank smaller than or equal to 0.4. This is our threshold. Execute, and as you can see we got the top products, the top 40%. Now of course you can format the percentage. We can do that like this: let's take the dist rank and multiply it by 100. Execute, and as you can see we got 20 and 40. We can also add the percent character to it, right?
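Putting the filter together — a sketch assuming a Sales.Products demo table with Product and Price columns; the string formatting with CONCAT comes right after:

-- Products within the highest 40% of prices
SELECT
    *,
    DistRank * 100 AS DistRankPct
FROM (
    SELECT
        Product,
        Price,
        CUME_DIST() OVER (ORDER BY Price DESC) AS DistRank
    FROM Sales.Products
) AS RankedProducts
WHERE DistRank <= 0.4;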
So we say CONCAT, and we add the '%' character after the number, like this, and call it dist rank percentage. That's it; let's execute. With that we have solved the task: we have the products that fall within the highest 40%. Now, of course, you can try the PERCENT_RANK as well. It's very simple: we just switch the cumulative distribution for the function PERCENT_RANK. Let's execute it, and as you can see we get the exact same result: we are still getting the gloves and caps as the highest products within the 40% of the prices. So guys, that's it, very simple, right? All right friends, let's have a quick recap of the window ranking functions. What they do is assign a rank to each row within a window, and we have two types of ranking, right? The first one is integer-based ranking, which assigns a number, an integer, to each row, and here we have four functions: ROW_NUMBER, RANK, DENSE_RANK, and NTILE. The second type is percentage-based ranking, where SQL calculates a rank as a percentage and then assigns it to each row, and here we have two formulas, or functions: CUME_DIST, the cumulative distribution, and the second one, PERCENT_RANK. Now to the next point, the rules of the syntax. The expression should be empty: we should not pass any argument to these functions (apart from the bucket count that NTILE requires). We must use ORDER BY in order to sort our data, so it is required, and the frame clause is not allowed, so you cannot go and customize a frame within the window function. And as we learned, there are many use cases for the ranking functions. For example, we have top-N and bottom-N analysis, to identify the top performers or the worst performers in our business. Another use case: with ROW_NUMBER we can identify and remove duplicates in our data, so we can use it to find data quality issues and improve the quality. Another one: if our table doesn't have a clean primary key, we can generate unique IDs using ROW_NUMBER, which also lets us do pagination. One more use case was data segmentation: you can use NTILE to segment your customers, products, employees, and so on. Another use case is data distribution analysis: as we learned, we can use CUME_DIST to understand the distribution of our data points compared to the overall. And the last use case is more for data engineering: we can use the NTILE function to balance the loading process of our ETLs. So as you can see, there are many use cases for the ranking functions. Okay, so that's all about how to rank your data using the window functions, and now we're going to cover the last group: the value window functions, how to access other records. So let's go. All right everyone. We have this very simple example with months and sales. We can use the value functions to access a value from another row. In order to understand it, let's say SQL is now processing the months and we are currently at the month of March. Now, for example, I would like to access the value from the previous month, from February. To do that, we can use the LAG function, and we get the value 10.
So with that we have, in the same row, the current sales of the month March and the sales of the previous month, February. And maybe in other cases I would like to get the sales of the next month, from April: for that we can use the function LEAD, and we get, in the same row, the value five. So now I can very quickly compare the current month with the previous month, and with the next month as well. And in other cases you might be interested in the first month of your list, here January: in order to get the sales of the first month you can use the function FIRST_VALUE, and we get 20 in the same row. And for the last option, I think you already get it: we can fetch the sales of the last month, here July. For that we use the function LAST_VALUE, and we get the value 40. So this is exactly the purpose of the value functions, or analytical functions: we can access a value from other rows. And it is really important to understand that the value functions are like the ranking functions: we have to use the ORDER BY to sort the data, so SQL understands what the first row and the last row are. In this example, the data is sorted by the month. So guys, the value functions are really important for analytics: you can use them to access values from other rows in order to do comparisons. All right. Now let's have a quick overview of the syntax and the rules for the value functions. We have four functions: LEAD, LAG, FIRST_VALUE and LAST_VALUE, and as you can see, we can group them into two pairs. LEAD and LAG are very similar to each other, especially in the syntax: we can pass three things, three arguments, to both of them — an expression, an offset, and a default. For FIRST_VALUE and LAST_VALUE we can use only an expression. That means we have to pass a value to these functions; you cannot leave them empty. Now, about the expression's data type: you can use any field with any data type. There are no restrictions, it is not, for example, only numbers; any data type is allowed. About the definition of the window: the PARTITION BY is, as usual, optional, like in any other group. The ORDER BY here is a must; you must define an ORDER BY, just like with the ranking functions, so you cannot leave it empty. And now we come to the last one, the frame clause, and here things are really different. For the first two functions, LEAD and LAG, you are not allowed to define any frame, any subset of the data — very similar to the ranking functions: you must use ORDER BY but you cannot define the frame of the window. But for the other two functions, FIRST_VALUE and LAST_VALUE, the frame is optional; you can go and use it, and for LAST_VALUE it is actually recommended to define a frame clause. Don't worry about it; we're going to have enough examples to understand this. So as you can see, these functions have different requirements; there is no generic rule for all of them, but one thing they all agree on is that you must use ORDER BY. Now, as usual, we're going to deep dive into these functions. We'll address the two functions LEAD and LAG first, because they are very similar to each other; we'll understand the use cases, when to use them, and of course we'll practice in SQL. So let's go. Lead and lag functions.
The LEAD function allows you to access a value from a following row within a window, while the LAG function is exactly the opposite: it allows you to access a value from a previous row within a window. Sounds very easy, right? So let's understand how SQL executes these functions. Okay, let's have a quick overview of the syntax for both functions, LEAD and LAG. We have here a very simple example for the LEAD function. As usual, we start with the function name, LEAD, and after that we pass the arguments, and as you can see there are multiple things here, so let's go step by step. The first thing: we specify an expression, and the data type can be anything — a number like the sales here, a character like names, dates, anything. This is required: we have to specify an expression, we cannot leave it empty, and we can use any data type. Moving on to the next one, we have a number here. What is this? This is the offset, and the offset is optional; you can skip it. So what does the offset mean? What are we doing here? We are telling SQL the number of rows to move forward or backward from the current row. In this example we specify the offset as two using LEAD, and with that we are telling SQL: jump ahead two rows and get me the value. If you are using LAG, it means you are telling SQL: go back two rows and get me the value. So the offset tells SQL how many rows it needs to jump, and if you don't specify anything, if you leave it empty, SQL uses a one. The default for the offset is one if you don't specify anything. All right, moving on to the third and last argument. This one is optional as well; you can leave it empty. It is the default value. What can happen with these functions is that SQL jumps to the next two rows, or something like that, and doesn't find anything: there are no more rows available to access, and in that case SQL returns a NULL. So if SQL goes to the next rows, or to the previous rows, and finds nothing, the default behavior is to return a NULL. If you don't specify anything over here, in those scenarios you will get NULL values as the return of the whole function. But in some scenarios you don't want a NULL, you would like a value, and that is where you define the default value. So it should not be a NULL, it should be a 10: you're telling SQL, if you don't find anything, return 10, don't return NULL. So again guys, the default value and the offset are all optional for you to configure, but you should know the defaults: if you don't use anything, the offset is going to be one and the default value is going to be NULL. And you must specify an expression; that one cannot be left empty. All right. So that's all about the arguments you can pass to the LEAD or LAG functions. The rest is the standard stuff: we have the OVER clause, then the PARTITION BY, which is as usual optional, and then the ORDER BY. These functions are like the ranking functions: they require you to sort the data, it is a must; otherwise SQL will not know what the next row or the previous rows are. So we have to sort the data, it is required, you cannot skip it; it is not optional. All right. So the syntax is not crazy, right?
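As a compact sketch of the full argument list (the table MonthlySales with a numeric MonthNo column is a made-up example, not from the course data):

-- LEAD(expression, offset, default) / LAG(expression, offset, default)
-- offset defaults to 1, default defaults to NULL; the expression is required
SELECT
    MonthNo,
    Sales,
    LEAD(Sales)       OVER (ORDER BY MonthNo) AS NextMonthSales,  -- one row ahead
    LAG(Sales)        OVER (ORDER BY MonthNo) AS PrevMonthSales,  -- one row back
    LEAD(Sales, 2, 0) OVER (ORDER BY MonthNo) AS SalesTwoAhead,   -- two rows ahead, 0 instead of NULL
    LAG(Sales, 2, 0)  OVER (ORDER BY MonthNo) AS SalesTwoAgo      -- two rows back, 0 instead of NULL
FROM MonthlySales;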
It's the usual stuff; the only new parts are the default value and the offset that we can configure. Okay guys, now we have a very simple example with months and sales, and we're going to see how SQL works for both functions, LEAD and LAG, side by side. In the first example we are interested in the sales of the next month, so we use the LEAD function: LEAD, then we specify the argument, the sales, since we want the value of sales, and then we define the window like this: ORDER BY month, so it's ascending. And on the right side we are interested in the sales of the previous month, so we use the LAG function. It is very similar to the LEAD: LAG, then the sales, since we are interested in the sales, and we sort the data by the month. Now let's see how SQL processes this information, step by step, side by side and row by row. It starts with the first row over here. What is the next month after January? It is February, and we are interested in the sales of that row. So SQL takes the value from the next row, and we get the value 10. Now, looking at January, we can see the sales of the next month, February, in the same row. So let's check the right side over here. Now we are interested in the previous month. What is the previous month of the first row? It is nothing, right? We cannot point at anything. That's why SQL says this is NULL: there is no previous month for the current row, and we get a NULL. Okay. Now it goes to the next row. We are at February; what is the next month? It is March, and SQL points to it, so we get the 30 as the sales of the next month, March. And on the right side, what is the previous month of February? It is January, right? So SQL gets the value, the sales of the previous month, and here we get 20. As you can see, it's very simple: with LEAD we are always checking the next values, with LAG we are always checking the previous value. So let's keep going. We are currently at March. What is the next month? It is April, so SQL points to it like this, and we get the sales of the next month, April. For March on the right side, what is the previous month? It is February, right? So SQL points to February, and we get the sales of 10. And now, interestingly, to the last row over here. You can see that we are at April. What is the next month after April? There is nothing, because we are at the end of our table, right? Since there is no month after it, we get a NULL in the output. But for LAG we still have a previous month for April. What is the previous month? It is March, and we get the sales of March, so it's going to be 30. So that's it guys, it's really simple, right? They just do opposite things. Now, checking those values side by side: with LEAD we will always get a value for the first row, but the last row can always be empty, because there is no next value; we are at the end of the table.
But if you check the LAG, it is mirrored: for the first value we will always get a NULL, because there is no previous value, no previous record, before the first row. And for the last record, as you can see, we will always get a value, because there will always be a previous one. Okay, let's move on to understand how SQL works this time with the offsets and the default value. We have the same data, but a different task. On the left side we would like to get the sales of two months ahead — so not the next month, but two months — and we would like to tell SQL: if you don't find any value, don't return NULL, return a zero for us. That is going to be our default. Now, if you check the syntax, it is exactly like before, but we are adding an offset of two, because we are interested in two months ahead, and we are specifying a default value of zero: if you don't find anything, put zero, don't put NULL. On the right side we have the exact opposite: we are interested in the sales of two months ago. We are not interested in the directly previous month; we need the sales of two months back. And here the same thing: if you don't find anything, don't return NULL, give us a zero. So as you can see, we have the same syntax, but using the function LAG. Now let's see how SQL executes this, step by step and side by side. It starts with the first month, January. SQL asks: what are the sales two months ahead? We are at January, so it will not be February, it is going to be the month of March. SQL points to it like this and we get the value 30; 30 is the sales two months ahead. And now on the right side, we are also at January, and the question is: what are the sales of two months ago? We don't have any previous data, right? So we will not get anything. SQL would return NULL, but it checks: do we have a default value? Well, yes. So this time SQL will not return NULL; it returns the default value, and this time that is zero. All right, so now let's go to the next value. We are currently at February. What are the sales two months ahead? Not March, but April, so SQL points to it like this and we get the value five. Now on the right side, we are currently at February, and the question is: what are the sales of two months ago? We have history, we have the previous month, but we don't have two months of history. That's why we still get zero as the output, from the default value. Okay, so let's keep going to the next value. We are currently at March. SQL asks: what are the sales two months ahead? We have only one month after it, not two, so SQL will not find anything; instead of returning NULL, it uses the default, and here we get the value zero. There is simply no more data available in the table. But now on the right side, we are currently at March and we are asking: what are the sales of two months ago? Now we have enough history in the past, and we get the value 20. All right, so now let's go to the last month in our table, April. What are the sales two months ahead? We don't have any data, so it is going to be zero as well. But now on the right side, we are currently at April. What are the sales of two months ago? We have enough history, so SQL goes and points to it like this.
Let's go back to SQL in order to practice those two functions. Okay, so now we have the following task, and it says: analyze the month-over-month performance by finding the percentage change in sales between the current and the previous month. That means we have to compare the current month with the previous month. The main use case for LEAD and LAG is to do comparison analysis, and there is a very common one called time series analysis. It is the method of analyzing our business data in order to understand patterns and trends over time. One of the most important and classical questions you're going to get from decision makers or the business is to do year-over-year or month-over-month analysis. Year-over-year analysis helps us understand the overall growth or decline in the performance of our business over the years. Month-over-month analysis, on the other hand, is for short-term trend analysis and for discovering seasonality patterns. So the main focus is to understand the performance of our business over time. Now let's go back to the task in order to solve it. Okay guys, let's do it step by step. What is the first step? Before we go and compare things, we have to collect the data and do the calculations first: we have to find the total sales for the current month, and then the total sales for the previous month, and after that we can compare them. Let's start with the easy part, finding the sales for the current month, using a very simple SELECT. So what do we need? Let's take the order ID, the order date (because inside it we have the month), and the sales, from Sales.Orders. Let's execute this. In the result we get the usual stuff: 10 orders, sales, and order dates. But the order date is on the level of days, and we are not interested in the whole date; we only want the month, in order to calculate the total sales per month. So we're going to use a function to extract the month from a date. Don't worry about it; we'll have a dedicated chapter showing how to deal with date formats in SQL. We use a very simple function called MONTH on the order date, and let's call it order month. Let's execute it. As you can see, we get a new field with only the month information: January, February, and March. Now the next step is to find the total sales for each month, so we're going to use GROUP BY. We say we want the SUM of sales, I'll just call it current month sales, we get rid of the other columns, and we group by the month. So, GROUP BY the month. That's it, let's execute it. Very simple, right? We now have the three months and the total sales of each month. So with that we have the first piece of information we need for the comparison: for each row, the total sales of the current month.
Now the next thing we're going to do is find the total sales of the previous month, side by side in the same row. And we have learned we can use the LAG function for that. So we're going to integrate the LAG window function into the same GROUP BY query. It goes like this: LAG, and since we are interested in the previous month's total, we pass the SUM of sales as the expression inside it. After that we define the window: OVER, and an ORDER BY is a must, so we sort the data by the month. With that we have defined the previous month sales, so let's give it that alias and execute it to see the results. All right, let's check them. The first row: what is the previous month? There is no previous month, we are at the first record and the first month, and that's why we have NULL. Now let's go to February: what are the sales of the previous month, January? It is 105, so this is correct. And now the last value, March: what are the sales of February, the previous month? It is 195. So with that we have both pieces of information: the current month and the previous month. Guys, as you can see, it's like magic, and it's very simple: we can use the LEAD and LAG functions to access values from other rows without doing any complicated joins. Okay, so what is the next step? We're going to subtract the previous month's total from the current month's total. In order to do that, we use a subquery: SELECT * FROM, and the whole thing becomes the subquery. The calculation itself is very simple; let me just move this down a little. It is the current month minus the previous month, and let's call it month-over-month change. That's it, let's execute and check the results. For the first month there is no value, and that is correct, because the previous month is empty, so there is no change. Moving on to February: we get plus 90, which means an improvement in our sales performance. And the last one is really bad: we have a decline, minus 115, which means the current month is doing much worse than the previous one. So March is a really bad month. Okay, so in the output we have the absolute numbers, but the task says find the percentage change, so we have to convert this to a percentage, and we can do it like this. It's very simple; let's do it in a new column: the change, the difference, divided by the previous month's sales, and then multiplied by 100 in order to get the percentage. But now, as you can see, we get zeros, and that's because those numbers are integers, so we have to cast one of the values. I'll just do it for the first one: CAST it AS FLOAT. That's it, let's execute again. Now the result looks better: we have the percentages, but with a lot of decimals. So let's round the number to, say, one decimal place, and give it a name: month-over-month percentage. Let's execute. Now things look much better, and with that we have calculated the percentage change in sales between the current and the previous months. This is how we do month-over-month analysis.
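Putting all those steps together, the finished month-over-month query being built here might look like the following sketch (assuming the course's Sales.Orders table with OrderDate and Sales columns; the exact names may differ from your setup):

    -- Month-over-month change and percentage change in total sales
    SELECT
        order_month,
        current_month_sales,
        previous_month_sales,
        current_month_sales - previous_month_sales AS mom_change,
        ROUND(
            CAST(current_month_sales - previous_month_sales AS FLOAT)
            / previous_month_sales * 100, 1) AS mom_percentage
    FROM (
        SELECT
            MONTH(OrderDate) AS order_month,
            SUM(Sales) AS current_month_sales,
            LAG(SUM(Sales)) OVER (ORDER BY MONTH(OrderDate)) AS previous_month_sales
        FROM Sales.Orders
        GROUP BY MONTH(OrderDate)
    ) t;

Note that the window function runs after the GROUP BY, which is why LAG can take SUM(Sales) as its expression.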
All right, so now we have another use case for the LEAD and LAG functions: we can use them to do customer retention analysis. It's all about measuring customer behavior and loyalty, so we are helping the business and the decision makers build strong relationships with loyal customers and focus on their needs. So now let's see how we can use LEAD and LAG for customer retention analysis. Let's go. All right, so now we have the following task, and it says: in order to analyze customer loyalty, rank customers based on the average days between their orders. There is a lot going on here, so let's do it step by step, and I always like to start with a very simple SELECT. Let's select information like the order ID, the customer ID, and, since we want the days, the order date, from the table Sales.Orders, and let's sort the data: ORDER BY customer ID and order date. That's it, let's execute. As usual, we get our 10 orders, the customers, and when they ordered. Now let's check the task: days between their orders. We have to find how many days lie between two orders. For example, if we check customer number one, he ordered around the 10th of January, and the second order is about ten days later, on the 20th of January. So we have to subtract those two dates. Now, in order to subtract this information and do calculations, we need everything in the same row. For example, at the first row I would also like to have a column with the date of the next order, so we have to access a value from another row. Of course we could do joins, but we have the LEAD and LAG functions, and for this scenario we're going to use the LEAD window function. So let's do that. I'll call the order date here the current order, and let's calculate the LEAD: I would like to get the next order date in the same row, so this time the expression is the order date. Now let's define the window. We have to partition the data, because we are analyzing each customer separately, so we partition by the customer ID. And of course, in order to use LEAD, we must have an ORDER BY, so let's define that as well: ORDER BY the order date. Now we give it a name: the order date here is the current order, and this one is the next order. Let me zoom out a little, make this smaller, and execute it. In the output we get a new column called next order, and with that we have the current order in the current row as well as the value from the next row. What is the next row? It's the 20th of January. The same thing, of course, for the next row: we have the current order date and the next order date.
So this value is exactly the same as the next row's date, the 15th of February. And since we are working with windows, this whole window here belongs to one customer: his last order is on the 15th of February, and there is no next order, so this is going to be NULL. The same thing if you check the other customers: the last order never has a next order. So everything looks fine, and the last customer has only one order. With this we have all the information for our calculation: we have the current order and the next order in the same row. Now we can subtract them in order to get the days between those two orders, and in order to subtract dates we have to use the function DATEDIFF. Don't worry about these functions; we're going to explain all of this in the next chapters, so for now just follow along. What we're going to do is subtract the order date from the whole expression over here, which is the next order. Let's do it on a new line; it's very simple. DATEDIFF finds the difference between two dates, and the syntax goes like this: first we have to define the unit we are talking about, days, months, years, and so on, so we tell SQL to find the difference in days. Then we specify the two dates: the first one is the order date, the current date, and the second one is the whole LEAD expression from before, placed side by side. This calculation gives us a number of days, so we'll call it days until next order. All right, let's execute the whole thing and check the result. As you can see, here we get 10, so there are 10 days between those two dates; for the next one we have about 26 days. Here we have a NULL, because there is no next date, and for the next one we have 31 days, a whole month. So everything is working perfectly, and with that we have solved just this part: days between their orders. You see, guys, this is the magic of the LEAD and LAG functions: we can very easily access any information we need in the same row in order to do such important analysis, with a very simple query. We are not doing anything crazy like joins; we are just using the LEAD function. So now we have all the information we need; next we're going to calculate the average of those days. In order to do that, we have to use a subquery, so let me zoom out, SELECT * to prepare it, and the whole thing becomes a subquery. I'll get rid of the ORDER BY; it's not necessary now. So what do we need? We need the average of the days, the average of this value. We're going to use a GROUP BY: customer ID, since we have to find the average for each customer, and we take the average of days until the next order and call it average days. And we have to group by the customer ID. That's it: now we are just doing a very simple AVG with a GROUP BY statement. Let's execute it. As you can see, it aggregates the data: we now have only four customers, and for each customer we have the average days between their orders.
So now, what is missing in our task? If you check it again, it says: rank the customers based on this average. So we have to use the RANK function, another window function, and we're going to use it together with the GROUP BY. Let me make this a little smaller and do it over here: RANK, then we define the window, OVER, ORDER BY, and we sort the data by the average days. That means we take this AVG calculation and put it in the ORDER BY, ascending, because we are focusing on the lowest average days. That's it; let's call it rank average and execute. Now, checking the result, we have a ranking for the averages, and SQL says that the number one loyal customer is customer number four, which is not really correct: for number four we don't have much information, because he or she ordered only once. So either you filter the data and remove this customer, saying if the average is NULL then don't include it in the ranking, or we replace the NULL with a very large value in order to push it to the end of our list. For example, we can wrap the average in COALESCE and say: if the average is NULL, then give me a crazy, very large number. That's it, let's execute. And now this customer goes to the end of our list, and we can see that the most loyal customer is number one, and the other two customers share rank two, since they have the same average. So guys, with that we have solved the task: we have ranked the customers based on the average days between their orders. We now have a really nice ranking and can understand the behavior of the customers; maybe we should focus on customer number one and understand his or her needs. And of course the function that made this customer retention analysis possible is the LEAD function, which found the next order so we could calculate the days. So this is how you use the LEAD function for such a use case.
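For reference, the whole retention query assembled in this walkthrough could look roughly like this (a sketch assuming the course's Sales.Orders table with OrderID, CustomerID, and OrderDate columns):

    -- Rank customers by the average number of days between their orders
    SELECT
        CustomerID,
        AVG(days_until_next_order) AS avg_days,
        RANK() OVER (
            ORDER BY COALESCE(AVG(days_until_next_order), 999999)) AS rank_avg
    FROM (
        SELECT
            OrderID,
            CustomerID,
            OrderDate AS current_order,
            LEAD(OrderDate) OVER (
                PARTITION BY CustomerID ORDER BY OrderDate) AS next_order,
            DATEDIFF(day,
                     OrderDate,
                     LEAD(OrderDate) OVER (
                         PARTITION BY CustomerID ORDER BY OrderDate)) AS days_until_next_order
        FROM Sales.Orders
    ) t
    GROUP BY CustomerID;

The COALESCE pushes customers with a single order (whose average is NULL) to the bottom of the ranking, exactly as described above.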
Now to the FIRST_VALUE and LAST_VALUE functions. I think the names say everything, right? FIRST_VALUE allows you to access a value from the first row within a window, and LAST_VALUE is exactly the opposite: it allows you to access a value from the last row within a window. Easy, right? So now let's understand how SQL executes those functions. Okay, as usual, we have a very simple example with months and sales, and we have it twice, because we would like to compare the two functions FIRST_VALUE and LAST_VALUE side by side. For the left side we would like to get the sales of the first month, and on the right side the sales of the last month. Now, for the first task we can use FIRST_VALUE; it's very simple: the FIRST_VALUE function, then the argument is sales, since we want the sales, and the window is defined like this: ORDER BY month, because we want to get the first month. As usual, we must use ORDER BY. On the right side, in order to get the sales of the last month, we can use LAST_VALUE: same thing, LAST_VALUE of sales OVER ORDER BY month. So as you can see, on the left and on the right we don't use any frame definition, which means the default is going to be used. All right, now let's see how SQL is going to process both of those queries side by side. The first step: SQL sorts the data; here it is already sorted from lowest to highest. Then it goes row by row, finding the first value on the left side. What is the unbounded preceding? It is static and always pointing to January; this is always going to be the unbounded preceding boundary, and we have it on both sides. And what is the current row? At the start, it is the first row, and the same on the right side. So the window contains only one row. What is the first value in this window? It is 20. The same on the right side: what is the last value in this window? It is also 20, so we get exactly the same result. Now let's move to the second row. The current row points to February and the frame gets extended. What is the first value in this frame? Still 20, so in the output we get 20. On the right side the current row also points to February and the window gets extended; what is the last value of this frame? It's going to be 10. Let's keep going. We move to March and the window gets extended. The first value is always going to be the same, 20. On the right side the window gets extended, and the last value is 30. So as you can see, the default frame definition always has a static start, the same beginning of the subset, and as we move with the current row, the frame keeps getting extended. Moving to the last row, the frame now covers the whole data set: the first value is still 20, and on the right side, after the frame extends the same way, the last value this time is April's 5. Now, if you compare them side by side, on the left side the task is solved and everything works correctly: for each row we always get the sales of the first row, which is January, so we have 20 everywhere, which is correct. But if you check the right side, you can see something is wrong: we are not getting the last value. We should always get April, a 5 everywhere, but instead we get exactly the same values as the sales column. So it's really useless to use it like this, and that's of course because SQL is using the default definition of the window frame. LAST_VALUE is the only window function where you cannot rely on the default frame definition; you have to customize the frame in order to get the effect of the last value. For FIRST_VALUE everything works with the default frame, if you don't specify anything, but for LAST_VALUE you will not get the correct effect without customizing the window frame. So my friends, you can use the FIRST_VALUE function like all other window functions, without defining a frame, and you will get the effect of the first value, but for LAST_VALUE you have to define a frame. So let's see how we can solve that.
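In query form, the naive side-by-side comparison just described would look like this sketch (using the same illustrative MonthlySales table as before):

    -- Default frame: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    SELECT
        order_month,
        sales,
        FIRST_VALUE(sales) OVER (ORDER BY order_month) AS first_sales,  -- works: 20 on every row
        LAST_VALUE(sales)  OVER (ORDER BY order_month) AS last_sales    -- broken: just mirrors sales
    FROM MonthlySales;

Because the default frame ends at the current row, the "last" value of each frame is simply the current row itself, which is why last_sales just repeats the sales column.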
All right, so in order to solve this, we define the frame explicitly: ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING. We just switch things around. Now let's see how this is going to work. Of course, SQL is still going to sort the data and so on, but now it keeps a pointer to the unbounded following boundary, so it always points to the last row in our data set, and then it proceeds step by step. For the first row, the frame is going to be the whole thing, from the current row until the unbounded following. What is the last value, the last row? It is 5, April, so we get 5 in the output. Now let's proceed to the next value: the frame gets shorter and smaller, and what is the last value? It's still 5. We jump to the next one, the frame shrinks again, and the last value is, again, 5. And then for the final row, the current row is equal to the unbounded following: we have only one row, and it is 5 as well. So as you can see, it's very simple: just fix the frame clause and LAST_VALUE works as expected. This is how SQL is going to do it. Now let's go back to SQL and start practicing.
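The corrected version of the earlier sketch, with the frame flipped so it runs from the current row to the end of the window:

    -- Custom frame: from the current row to the end of the data set
    SELECT
        order_month,
        sales,
        LAST_VALUE(sales) OVER (
            ORDER BY order_month
            ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS last_sales  -- 5 on every row
    FROM MonthlySales;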
All right, so now we have the following task: find the lowest and the highest sales for each product. Let's see how we can do this. As usual, we start with a very simple SELECT statement: SELECT the order ID, the product ID, and the sales, from the table Sales.Orders. That's it, let's execute. In the output we get our orders, products, and sales. Now let's start with the first part of the task: find the lowest sales for each product. In order to do that, we can use the FIRST_VALUE function. So, FIRST_VALUE, then we have to give an expression: we need the lowest and highest sales, so we put the sales inside it. Now we have to define the window: OVER, and since the task says for each product, we have to build windows, so we divide the data using PARTITION BY product ID. Then we must use an ORDER BY, so we sort the data by the sales. Since the first value should be the lowest value, we sort ascending, from the lowest sales to the highest, which is the default, so we just leave it like this, and we call it lowest sales. Let's execute and check our results. First, SQL partitions the data by the product ID, so as you can see, we get four windows. Then it sorts the data by the sales, from the lowest to the highest, from 10 to 90. Now, what is the first value of the sales? It is the first row, so it's going to be 10, and that's why we have 10 everywhere in this window. Let's check another one: this window has two rows, it is sorted, and the lowest sales, the first value, is 25. So with that we have solved the first part of the task: finding the lowest sales for each product. Let's go to the next one: we have to find the highest sales for each product. Let's use LAST_VALUE for this. On a new line, LAST_VALUE, again with the sales, and then we define the window; it's going to be exactly the same window: we partition the data by the product ID and order the data by sales. So let's just copy the previous one and call it, for now, highest sales. Let's execute it. Now, if you check the results, you will see our issue again: we are not getting the highest sales for this window. The highest sales value is 90, but as you can see, we are getting exactly the same values as the sales column. We explained this in the previous example, so in order to fix it, we add the frame: ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING. Let's execute and check the result. As you can see, we now get the highest sales correctly: for this window the highest is 90, for this other window it's 60, and so on. With that we have solved both parts of the task, the lowest and the highest sales. But now I'd like to give you my honest opinion about this task: I would not use LAST_VALUE to find the highest sales. Let me show you how I usually do it: I use FIRST_VALUE in order to find the last value. Let me show you what I mean. Let's add a new column; I'll just copy the whole thing from the lowest sales, but change the order. That means we will not sort the data ascending from the lowest sales to the highest; we switch it and sort from the highest sales to the lowest. With that, the first value is going to be the highest sales. Let me rename it highest sales 2 and execute. And now you can see we get exactly the same results, because we sorted the data differently and took the first value. This gives you the exact same effect as LAST_VALUE, and as you can see, I don't have to define any frame; I can stick with the default frame and just twist the ORDER BY. So this is how you can do it using only FIRST_VALUE. Now, just for the sake of this task, there is one more way to solve it: you can use the MIN and MAX functions. Let me copy the same and add a new one for the lowest sales: we simply say find me the MIN of sales, and we don't have to sort anything; we just partition the data. Let's give it another alias and execute. As you can see, we get exactly the same results as the other two. So we can solve this task using three different functions: either use LAST_VALUE, where you have to define the frame; or use FIRST_VALUE, where you flip the ORDER BY; or simply use the MAX function in order to get the highest sales. So guys, as you can see, we can use FIRST_VALUE and LAST_VALUE in order to find the extremes, like the lowest and the highest sales in this example, and there is a clear similarity between those two functions and MIN and MAX.
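Side by side, the three equivalent approaches for the highest sales discussed here might look like this (a sketch assuming Sales.Orders with ProductID and Sales columns):

    SELECT
        OrderID,
        ProductID,
        Sales,
        -- 1) LAST_VALUE with an explicit frame
        LAST_VALUE(Sales) OVER (
            PARTITION BY ProductID ORDER BY Sales
            ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS highest_sales,
        -- 2) FIRST_VALUE with the ORDER BY flipped
        FIRST_VALUE(Sales) OVER (
            PARTITION BY ProductID ORDER BY Sales DESC) AS highest_sales_2,
        -- 3) plain MAX as a window aggregate, no ORDER BY needed
        MAX(Sales) OVER (PARTITION BY ProductID) AS highest_sales_3
    FROM Sales.Orders;

All three columns return the same value per row; the MAX variant is usually the simplest to read.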
And of course, what can we do with these values? We can compare them with the current sales. For example, we can extend our task and say: find the difference in sales between the current and the lowest sales. In order to do that, let me clean all this up and keep only the first value, the lowest sales, and the highest value. We have to compare the current sales, the original sales column, with the lowest sales, the whole expression from before. So let's add a new line and simply subtract the lowest sales from the sales, and give it a name: sales difference. That's it, let's execute it. As you can see in the result, in one row I am comparing the current sales, which is 90, with the lowest sales for this product, which is 10, and with that we get the distance, let's say, between those two values: 80. For the next row, the distance between this value and the lowest value is shorter, so we are near the lowest value. So as you can see, we can now compare the current sales with one extreme in order to find the distance between two values. This is again a very important type of comparison analysis.
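A minimal sketch of that comparison, continuing with the same assumed table and columns:

    -- Distance between each order's sales and the lowest sales of its product
    SELECT
        OrderID,
        ProductID,
        Sales,
        FIRST_VALUE(Sales) OVER (
            PARTITION BY ProductID ORDER BY Sales) AS lowest_sales,
        Sales - FIRST_VALUE(Sales) OVER (
            PARTITION BY ProductID ORDER BY Sales) AS sales_difference
    FROM Sales.Orders;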
All right friends, now let's do a quick recap about the value functions, which we sometimes call analytical functions. What do they do? They allow you to access a specific value from another row. This helps you do complex calculations with very simple SQL, without joining tables together or doing self joins. For the value functions we have four functions: the first one allows you to access the previous value, like the previous month, using the LAG function. The next one allows you to access the next value, like the next month, using the LEAD function. Then we have one that allows you to access the first value in a subset, using the FIRST_VALUE function, and finally we can access the last value in a subset using the LAST_VALUE function. Moving on to the next point, we have the rules of the syntax. First, the expression: we can use any data type; it could be a number, a string, a date, anything. Next, in order to perform these functions we have to sort the data, so the ORDER BY is required; it is a must. The frame is allowed but optional; I would say always leave the frame empty, except for LAST_VALUE, where you have to customize it, otherwise it will not work. Now to the next point: the use cases. We have several very important use cases for the value functions in data analytics. We can do time series analysis: as we learned, month-over-month and year-over-year analysis. These analyses are classics, and they answer the first question in any analysis: are we growing with the business, or are we declining? How is the performance of the current year compared to the previous one? As you can see, we are always doing comparisons with these window functions. The next use case is also about time: time gap analysis, like when we analyzed customer behavior and retention by calculating the average days between two orders. And the last use case is also about comparison: comparison analysis, where we use the value functions to compare the current value with an extreme, like comparing the current sales with the highest or the lowest sales. So my friends, these analyses are essential in data analytics; you will encounter them in every company and every business, you will have to answer these questions, and you can do that very easily using the SQL window functions. All right my friends, that's all about the window value functions, and with that we have covered everything about how to aggregate your data using SQL. Those are very important tools for doing data analytics in SQL, especially if you are a data scientist or a data analyst. With that we are done with this chapter, and I can tell you we have now covered the intermediate level: we have learned how to filter the data, how to combine the data, and the most important functions in SQL. Now we move to the third and last level, the advanced level, and the first chapter is about advanced SQL techniques. Inside it, there are different techniques in SQL for organizing our complex projects. First I'm going to explain what exactly I mean by complex queries and why we have them, and then we'll start with the first topic, the subqueries. So let's go. Normally in projects we have a database, and we have a person responsible for it, the database administrator, who takes care of the database structure. Now, in a very simple scenario, we have one user writing queries in order to retrieve data from the database. He or she writes an SQL query, the query is sent to the database, the database executes it and returns the results, and at the end our user sees the result of the query they wrote. So this is a very simplified scenario of how we use a database. But my friends, in the real world things are totally different; in real projects things get very complicated. For example, you have a financial analyst writing a huge, very complex block of SQL, and another user with a different role, like a risk manager, also writing a very complex query, and across different departments, projects, and tasks you will have a lot of analysts writing many complex queries. All those analysts and managers have direct access to your database, and they are executing complex analytical queries, maybe to generate a report. And it's not only those guys doing analysis on your database: you will also have our friend the data engineer, who says, you know what, I'm building a data warehouse and I would like to extract your data. So the data engineer writes an extract query to pull the data from the database, then has a different script for the transformations, in order to manipulate, filter, clean up, and aggregate your data, and then a third script in order to collect the result of the transformations and load it into another database called a data warehouse. A data warehouse is a special database that collects data from different sources and integrates it in one place in order to do analytics and reporting. And at the end of this chain you have a data analyst, who also writes queries in order to analyze the data in the data warehouse, or a different query in order to prepare the data before feeding it into a tool like Power BI to generate visualizations and reports.
We call this a data warehouse system, or a business intelligence system, that extracts your data and manipulates and transforms it for analysis. And it's not only the data engineer and the data analyst accessing your database and running queries; we also have our friend the data scientist. The data scientist also has direct access to your database and might write different queries in order to extract and manipulate the data needed to develop a model and do machine learning and AI. And there is one more scenario I see in many projects, where the result of the data analyst is used in another query in order to prepare results for data visualization in Power BI, or to export an Excel list. So as you can see, a lot of people with different roles want to access your database and do analysis on top of it, because everyone wants to answer questions based on the data. And if I look at this picture, I still think it is a simplified version of how things work in data projects; I can tell you that in real projects things are way more complicated than this. Now, if you sit back and look at this, you will find many challenges and problems. For example, all those people are not talking to each other, and each of them is creating their own query. But if you take all those queries and compare them side by side, you will find logic in the scripts that keeps repeating: the queries from the analysts, the data scientists, and the data engineers may contain redundant logic. The issue, of course, is that the same effort is repeated over and over, and maybe not everyone implements the logic correctly, because not all of them have the right SQL skills. So this is a big issue in this setup. Then we have another challenge with this scenario: if you don't optimize it, you will have performance issues everywhere. The data engineer's data warehouse scripts might take five hours, a query from an analyst might take forty minutes, before inserting the data into reports we might wait thirty minutes here and an hour there, and everyone else suffers from bad query performance too. If everyone is writing big, complex queries, don't expect them to perform well. Now to the third challenge I've observed in many projects: complexity. Behind the original database you might have a data model that was designed and optimized for only one application. The data model will contain a lot of tables with different relationships between them, and of course only the developers and experts of this database understand the physical data model behind it. Now, if you give access to all those analysts, they will have a lot of questions, because first they have to understand the data model before writing any query. That means a lot of data workers keep asking the database experts questions: how do I connect table A with table B, where do I find my columns, what does this table mean, why am I getting bad results, is your data corrupt? So the developers of the database will get a lot of questions from the analysts, and they have to explain their data model over and over so that the users are able to write those complex queries.
So that means all those users are stressing the database team with many questions, while at the same time writing very complex queries, so complexity is a really big challenge. Now, also by looking at this picture, you will see a lot of queries hitting the database, and this can cause a lot of database stress: repeatedly executing big, complex queries puts real load on the database and can even bring it down. And the last challenge in this picture is data security. If you leave it like this, giving users direct access to your database tables, you might have a problem: it might be okay for some data engineers and so on, but you don't want to give every data analyst full access to the database tables. You have to protect your tables, your columns, your rows, everything. So you cannot leave a setup where everyone has direct access to the physical database tables. Now, enough talking about challenges, problems, and issues; let's be solution-oriented. So what are the solutions to those issues? Of course there are many, but we're going to focus on five techniques: we can use subqueries, or CTEs (common table expressions), we can introduce views to our database, or temporary tables, or we can use the CTAS technique, CREATE TABLE AS SELECT. This is exactly why we have to understand these five techniques: in order to solve all the issues we might face in our data projects. All right friends, now that we have understood the importance of these five techniques, let's take a quick and simplified look at the database architecture, because I want you to understand what happens behind the scenes and how the database executes the queries from these five techniques. By understanding this architecture, you will understand how things work. So let's go. For each story there are two sides: the server side and the client side. On the client side, you are, for example, writing an SQL query for a specific purpose. On the server side we have many things. The server is where the database lives, and it has many components, like the database engine. The database engine is the brain of the database; it handles different operations like storing, retrieving, and managing data. Each time you execute a query, the database engine takes care of it. Now, in the database we have another very important component, the storage, and the two main types of storage in a database are disk storage and cache. The disk storage is like long-term memory, where the data is stored permanently, like the disk in your PC: it keeps the data even if you turn off the system. One important feature of the disk is that it can store a lot of data, but the disadvantage of disk storage is that it is slow, both to write and to read. On the other hand, the cache is a short-term memory, where data is stored temporarily, like the RAM in your PC. It holds the most frequently used data, so the database can access it quickly in order to retrieve data. The big advantage of the cache is that it is fast: it is much faster for the database to retrieve data from the cache than from the disk. But the disadvantage of the cache is that the data stays there only for a short period. So it's a trade-off between speed and how much data you can store, and for how long. Now let's talk about the disk storage.
This is very important in databases. There are typically three types of storage areas: the user data, the system catalog, and the temporary data, and each storage type has a different purpose. So what is user data storage? It is the main content of the database: it stores the actual data, all the information that is relevant for the users. Everything important that the users care about is stored there, and this is the storage the users interact with all the time. So where do we find the user data? If you go to our database SalesDB and then to the tables, we find all the tables we have already used: the customers, employees, orders, and so on. Those tables are the user data. And if I go and say SELECT FROM Sales.Orders, all the information we are seeing now is the user data. This is what we as users actually care about; all the other stuff we see inside the database, we as users don't care about. We care only about our data. But the database doesn't hold only the user data; it holds a lot of other information too. So that's what we mean by the user data storage. Now, what is the system catalog? This is the database's internal storage for its own information; it's like a blueprint that keeps track of everything about the database itself. That means the main purpose of the system catalog is to hold the metadata of the database. So what is metadata? Metadata is data about data. Now let's understand what that means. What we have done so far is create a table called customers, define multiple columns inside it, like the customer ID, first name, and last name, and then insert our data into this table: five customers. That information is my data; I created it and stored it inside the database, and that's why we call it the user data. Nothing new so far. Now, what happens behind the scenes is that the database server will not only store the user data you provided, but also a different type of data, and that is the metadata. The database server stores the metadata of the customers table, and it looks like this: there is a table name, there are the column names, the ones you defined while creating the customers table, and additional information such as the data type of each column, like INT for the customer ID and VARCHAR for the last name, plus many other details, like the length of the column and whether the column is nullable or not. So as you can see, the metadata is a description, data about the structure of the customers table, and in the metadata we can find a lot of information, not only about the tables and columns, but also about the schemas and the database itself. You can find a full catalog of the structure of your database. The basic customers table contains the actual data: it stores data about the customers. But the metadata of the customers table contains data about data. So in databases, each table that you use to store your data has a twin table that describes the structure of your data. This is what we mean by the system catalog, or metadata. And now you might ask: where can I find all this system catalog and metadata inside our client?
Well, you cannot navigate through this information in the object explorer like we do for the user data, but you can find it in a special schema called the information schema. The information schema in SQL Server is a system-defined schema that contains a set of built-in views that help us find information about our database, like tables, columns, and other objects. So let's go and explore it. We say SELECT * FROM INFORMATION_SCHEMA, then a dot, and SQL shows us a list of all the views that are available for browsing the metadata of our database. For example, you can see TABLES here, you can see information about the views, and also about the columns. Let's select COLUMNS and execute it. In the output we find information about the schema and the table names, for example the customers table. Let me filter on this table, and then we find all the columns inside it and how they are ordered: we have the ordinal position of each column, as well as the data type and the size of each column, and much more. So we get all the metadata of each table and of each column inside the table. With that you can check which tables exist in your database. For example, I find something here called test two; maybe I was testing something, so I can now go and clean things up. And this is exactly why the database maintains such a catalog: it helps the database quickly find the structure of each table and each column, and it helps me as a user to browse the catalog of the database. For example, I can go over here and say: give me the DISTINCT table names, and with that I get a list of everything I have inside the database: the customers, employees, and some tests that I did. So, metadata is awesome. Now we come to the third storage type, the temporary data storage. It is a temporary space used by the database for short-term tasks, like processing a query or sorting data, and once these tasks are done, the database cleans up the storage. And now of course the question is: where can we find the temporary tables that use this temporary storage on the disk? Well, if you go to the object explorer, you will not find them inside our database SalesDB, but inside the system databases. Since we are working locally, we have full access to everything inside SQL Server, but in real projects, if you are just a user or, let's say, a developer, you will not have access to the system databases; that is only for the database administrators. But we are working on a local copy, so let's go to the system databases, and here you have a special SQL Server database called tempdb. If you go inside it, you will find tables, temporary tables. This is exactly where you can find all the temporary tables you generate. Currently we haven't created any temporary tables, so it's empty, but once you start creating them, you will find those tables underneath this folder. We will learn about temporary tables in the next sections. So these are the main components of the database architecture.
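As a quick sketch, the information schema queries used in this walkthrough look like this (the table name filter is just an example value):

    -- Browse the metadata views (there are also TABLES, VIEWS, and more)
    SELECT * FROM INFORMATION_SCHEMA.TABLES;

    -- All columns of one table: ordinal position, data type, length, nullability
    SELECT *
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = 'customers';

    -- A quick inventory of every table that has columns in the database
    SELECT DISTINCT TABLE_NAME
    FROM INFORMATION_SCHEMA.COLUMNS;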
Now let's have an example. We have a table called orders that is stored inside the user storage, and the metadata of this table is stored in the catalog. Let's say you are on the client side and you write a simple SELECT query in order to get the data of the orders. The query is sent to the server in order to be executed, and the database engine takes the query to process it. First, the database engine checks whether we have the data in the cache, because if the data is stored in the cache, things are really fast and the engine can solve the task quickly. But in this scenario we don't have the orders information in the cache, so the database engine says: okay, it's not in the cache, let's check the disk. It finds the orders information on the disk, the query is executed, and the result of the query is sent back to the client side, where at the end you see the result of the orders table in the output. So this is how an SQL database executes a very simple SELECT query. Now, to the subqueries: a subquery is a query inside another query. What does that mean? Let's have a sketch to understand it. So far, what we have learned is that we have different database tables, like the orders, customers, and so on, and we write simple SQL queries, like SELECT, FROM, WHERE. SQL retrieves data from the database tables, and in the output we get some kind of result. This is what we have done so far: very simple queries. Now, in our query things can be a little different: we could have another query inside our query, doing the same things, SELECT, FROM, WHERE. So we have a query inside our query, and we call this embedded query a subquery, while the original query, the first one with the outer SELECT and FROM, we call the main query. Now, if you execute the whole thing, what happens? SQL first takes the subquery and executes it, retrieving data from our database tables, and the result of the subquery is not sent to us, the users; we cannot see it. What happens instead is that the result stays inside the query as an intermediate result, and then our main query starts interacting with this intermediate result from the subquery. The main query performs some operations on top of the intermediate result, using it for filtering, joining, or any other purpose, and the main query can still also query the original database tables. So the main query now has two sources of data: the original database tables and the result of another query. Looking at this, you can see the subquery is a query inside the main query that plays the role of a supporter: it supports the main query with data, while the main job of the main query is, of course, to take all that data and show us the final result. Now, there are two things to know about the intermediate result we get from the subquery. First, once the execution of the query is completely done, the database destroys the intermediate result; it is totally dropped, we will not find it anywhere, it is completely lost. Second, imagine you write another query, completely outside the first one, selecting a few tables from our database, and you ask: is it possible to access the intermediate result from the first query? For a completely external query, you cannot do that.
The intermediate result of the subquery is only locally known to the main query itself; it is not globally available to any other query. So the subquery can be used only by its main query. With that we have understood what subqueries are, and now you might ask me: why do we need them in the first place? Why are subqueries important? Let's look at the following sketch. In a complex task we might have to do several things in our query. For example, in the first step we have to join tables in order to prepare the data; then the outcome of the joins should be filtered, which is step two; on top of that, in step three, we do transformations, like handling NULLs or creating new columns and many other things; and in the last step we do data aggregations, like summarizing the data or finding an average. Now, if you immediately start writing the SQL query without having a plan, what can happen? You end up with a long, complex SQL query that is really hard to write, and just as hard to read and understand. Instead, we can divide our task based on those steps and write one query section for each step: one query for joining tables, another for filtering, another for the transformations, and the last one for the aggregation. Now, since each step is a preparation for the next step, we can say each of those queries is a subquery. So for steps one, two, and three we have subqueries, all doing calculations and preparations for the last step, the aggregation, and we call that last step the main query. And of course, the whole thing can exist in one single query. If you want to visualize it: you have a subquery in a circle, and that circle belongs to a bigger circle called the main query. By the way, sometimes we call the main query the outer query, and the subquery the inner query. And of course, we can have many subqueries, many small circles inside each other, forming something called nested queries. So this is the main purpose of using subqueries in our scripts and queries: they help us reduce complexity, make things easier to read, and give us a logical flow inside our queries. Now, for the subqueries there are many different types and categories. What we're going to do is this: I'll show you an overview of all those types and categories, and then later we'll deep dive into each of them. First of all, if you think about the dependency between the subquery and the main query, there are mainly two types of subqueries. We have the non-correlated subquery, which means the subquery is independent from the main query, and the second type is the correlated subquery, which is exactly the opposite: the subquery depends on the main query. Of course, we'll explain all of this in detail, so don't worry about it. This is the first group. Now, there is another way to group the subqueries, depending on the result type, by which I mean the kind of output the subquery returns. We have the scalar subquery, which returns only one single value; another type called the row subquery, which returns multiple rows; and the final type, called the table subquery, which is a subquery that returns multiple rows and multiple columns.
Now we come to the third and last way to categorize the subqueries, this time based on the location and the clauses: we are describing where within the main query the subquery is used. We can use it in different locations and clauses, like the SELECT clause; or in the FROM clause, which is the most common place for subqueries; or in a join; and we can use it in order to filter the data in the WHERE clause. In the WHERE clause, as we learned, there are two different sets of operators: we can use the subquery together with the comparison operators (less than, greater than, equal, and so on), or with the logical operators like IN, ANY, ALL, and EXISTS. So those are the different types and categories of subqueries, and now we're going to deep dive into all of them. Let's start with the easiest category: the result types of the subqueries. We have different types of subqueries based on the result, meaning the amount of data the subquery returns. The first type is the scalar subquery: a subquery that returns only one single value, like, for example, the value three. Let's have an example of the scalar subquery. In a query where you say SELECT *, you get all columns and all rows of a table, but for a scalar subquery we need only one value, and the usual way to get one is by doing an aggregation. For example, say: get the average of sales, and execute it. With that, in the output we have only one value, 38. We call such a query a scalar query: it has only one row and only one column. All right, now to the second type: the row subquery, a subquery that returns multiple rows but a single column, like the values 1, 2, 3 in one column. Let's see an example of the row query. If we say SELECT * FROM the orders table, we get multiple rows and multiple columns, but for a row query we need only one column, so we can go over here and select, for example, only the customer ID, and execute it. Now if you check the output, we have a single column and multiple rows, a list of values, and this is what we call a row query. All right, now to the last type: the table subquery, which returns multiple rows and multiple columns, like any regular table, so this subquery returns a lot of values. Okay, let's see an example of the table query. If we check our query, SELECT * FROM orders, we get multiple rows and multiple columns, and of course we can also select just a few columns, for example the order ID and the order date. If we execute it, the output has multiple columns, two in this case, as well as multiple rows, and that's why this kind of query is also a table query. All right, so with that we have learned the different types of subqueries based on the result type.
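The three result types can be summarized in a few one-liners (a sketch against the course's Sales.Orders table):

    -- Scalar: one row, one column
    SELECT AVG(Sales) FROM Sales.Orders;

    -- Row: many rows, one column
    SELECT CustomerID FROM Sales.Orders;

    -- Table: many rows, many columns
    SELECT OrderID, OrderDate FROM Sales.Orders;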
Now we're going to go and learn how to use the subqueries in different locations in our query, and we're going to start with how to use a subquery in the FROM clause. We typically use subqueries in the FROM clause in order to create temporary result sets that act as a table for the main query. It's like in some scenarios we cannot use the tables directly from the database; we have to prepare them somehow before we do our actual query. Okay, so let's check the syntax of the subquery inside the FROM clause. We start with the usual stuff, where we go and say SELECT and a few columns that we want to retrieve, and then we say FROM. Usually after the FROM comes the table name from our database that we want to query. But this time, instead of writing the table name, we're going to have another SQL query. That means we don't define the table name, we define another SELECT statement, where we have again a SELECT, a column, from a specific table, and then maybe a filter. And in order to tell SQL this is a subquery, we have to use parentheses. So we're going to have a parenthesis at the start and at the end: this is the subquery, this is not the main query. And after the closing parenthesis, we can go and define an alias for the result that we're going to get from this subquery. In many databases this alias is optional, but for SQL Server we have to go and specify an alias; it is a must in SQL Server. So again, we call this the subquery, and the outer query we call the main query. This is the syntax of the subquery in the FROM clause. Okay, so now we have the following task, and it says: find the products that have a price higher than the average price of all products. We're going to do it step by step, and here we have two steps. The first one is that we have to go and calculate the average price of all products, and in the second step we're going to use this value in order to filter the table products, in order to find the prices that are higher than this average price. So let's start with the first step, where we're going to find the average price. I'm going to select the following information: product ID and price from the table sales.products. Let's go and execute it. Now we have the products and as well the prices, and we need this price here in order to compare it with the average price. That means we need this price and, side by side, the average price. So we need aggregations and details at the same time, and that's why we're going to go with the window function AVG. Let's go and do this; it is very simple. It's going to be the average of the price, we don't want to partition the data, so it's going to be an empty OVER (), and this is going to be the average price. Let's go and execute it, and with that we have calculated the average price. So now we have all the information in the first step: we have the average price, we have the price, and as well the products. The next step is that we have to go and filter the data to find all the products where the price is higher than the average. That means we will do this step based on the information that we have now, and that means we have to go and use the logic of subquery and main query. Since this first step prepares the data, we're going to use it as a subquery. So we're going to call this the subquery, and we have to go and use it in the main query. How are we going to do that? We have to go and write the main query. I'm going to start over here: SELECT, and then I will take all the columns, FROM. So this is the main query; let me just make this a little bit smaller. And now the main query is going to get the data from the subquery, so the whole thing is going to be used inside the FROM clause.
So now, in order to put the subquery inside the main query, we have to go and use the parentheses. We're going to have one at the start and as well one at the end, and what we usually do is add a tab of indentation, in order to understand: okay, this is the subquery, and this is the main query. Now, one more thing that we have to add for the whole subquery in SQL Server: we have to give it an alias. You can go and give it any name that you would like; I usually go with only one character, the T, which stands for table. You can use anything that you want, but in SQL Server we have to give an alias for the subquery. So now, what are we saying? We are saying: select everything from the subquery. If you go over here and execute it, you will get the exact same results, because the main query is doing nothing; it's saying just select everything from the subquery. But now, in order to solve the task, we are not interested in all the products. We are interested only in the products where the price is higher than the average. That's why we have to go and use the WHERE clause. So we're going to say: WHERE the price is higher than the average price. This filtering is done in the main query, not inside the subquery. So now the main query is doing something. Let's go and execute it, and with that we solved the task: we are getting now two products where the price is higher than the average price. As you can see, it's very simple. If the task has multiple steps, then we can do that using multiple subqueries until we reach the main query, and we can learn from this that the subquery is here only to support the main query. We are preparing here all the data that we need in order to have the final result in the main query. For this kind of task we cannot immediately put everything in one SELECT query; we have first to prepare the data in one subquery and then pass the values to the main query. And this is what we mean with the table subquery. And here is one quick tip for you: if you would like to see the intermediate results that we are getting from the subquery, you can go and highlight the subquery itself, without the parentheses. So we are just highlighting the subquery, and you can go now and execute it. With that, SQL will not go and execute everything; SQL is going to execute only what you are highlighting. This is a really nice way to see the results of the subquery as you are debugging or searching for errors: you can go and see the intermediate results that are used by the main query. And of course, if you deselect, so you don't highlight anything, and execute, SQL is going to go and execute everything, the whole query. So this is how we use the table subquery inside the FROM clause.
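Putting the pieces together, the finished query looks roughly like this. It's a sketch assuming the video's Sales.Products table with ProductID and Price columns.

SELECT *
FROM (
    -- Subquery: put each product's price next to the overall average
    SELECT
        ProductID,
        Price,
        AVG(Price) OVER () AS AvgPrice
    FROM Sales.Products
) AS t  -- SQL Server requires an alias for a subquery in FROM
WHERE Price > AvgPrice;  -- main query: keep only the above-average products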
All right. So let's have another task, and it says: rank the customers based on their total amount of sales. Again, if you check here, we have two steps. First we have to find the total amount of sales, and then after that we have to go and rank the customers. So again we have two steps, and we can use subqueries in order to solve it. Let's start with the first step, where we're going to find the total amount of sales. Let's go and select the customer ID and as well the sales from the table sales.orders, and execute it. Now in the output we have multiple customers and their sales, and we have to go and find the total amount of sales for each customer. That means we have to go and use the GROUP BY. So we're going to go and summarize the sales, the total sales, and then group up the data by the customer ID, like this. Let's go and execute it. Now, as you can see in the output, we have four customers and we have the total sales for each customer. With that we have solved the first step: we have the total amount of sales for each customer, and we have now prepared the data for the next query, in order to rank the customers. So now I think you are already getting how important the subqueries are in order to do step-by-step analysis. This is our subquery; now we need the main query, so I will start preparing it. So, main query, like this. Let's go first and select everything: SELECT * FROM. Let me just make this a little bit bigger, like this. Now we have to go and convert this query to a subquery, so we need the parentheses, the starting one and the ending one, and for SQL Server I'm going to give it an alias, and I would like to push everything to the right side. Let's go and execute it. Perfect, it is working. With that, the subquery is passing the data in the FROM clause to the main query. Of course, the main query right now is useless; it's just selecting the data. We have to go and calculate the rank, and for that we have a very nice window function. So we're going to go and use RANK. It doesn't need any parameters; in the OVER clause we have to sort the data with ORDER BY. We have to go and sort the data by the total sales, descending, from the highest to the lowest. So we're going to go with the total sales and DESC. Now, as you can see, we are using the total sales that we have already prepared in the subquery. Without preparing the data first, we would not be able to rank the customers in the main query. So that's it; let's go and execute it. And with that, SQL sorted our data and we have a nice ranking based on the data that we had from the subquery. So this is the customer with the highest sales, then customer number one, and so on. Again, in this task we have multiple steps, and we used the power of the subqueries in order to do it step by step. So that's all on how to use the subquery inside the FROM clause. Okay, so now let's see quickly how SQL executed our query. We have here our query, and we are querying the table orders. The first step is that SQL is going to go and identify the subquery, and then it is going to go and execute it. So SQL is going to execute the subquery part, where we are aggregating the data based on the customer ID. Once the subquery is executed, the next step is that the result is going to be introduced as an intermediate result. We will not see these results in the output; they are going to be temporarily saved in memory. The next step is that SQL is going to go to the main query, and it's going to execute it based on the intermediate results. That means the main query will not go back to the original table; it's going to go and query the intermediate results. So here, what SQL is going to do is rank the intermediate results by introducing a new column, where we see the ranks 1, 2, 3, 4, and the output of the main query is going to be the final result. As you can see, it's very simple: first SQL is executing the subquery, the result of the subquery is going to be used in the main query, and once the main query is executed, we will get the final results. So the subquery here is only supporting the main query. Those are the steps that SQL uses in order to execute the subqueries.
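For reference, the complete ranking query we just built looks roughly like this (again a sketch assuming the video's Sales.Orders table; the alias names are mine):

SELECT
    *,
    RANK() OVER (ORDER BY TotalSales DESC) AS CustomerRank  -- rank customers by their totals
FROM (
    -- Subquery: total sales per customer
    SELECT
        CustomerID,
        SUM(Sales) AS TotalSales
    FROM Sales.Orders
    GROUP BY CustomerID
) AS t;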
So now let's understand how the database server executes the subqueries behind the scenes. Let's go. Let's say that you are a data analyst and you are writing a query at the client side, where you have a subquery inside the main query. Once you go and execute it, what's going to happen? The database engine is going to go and identify the subquery, and in this situation the database is going to execute the subquery first. Here the subquery is selecting and retrieving data from the table orders; that means the database has to retrieve the data from the disk storage, from the user data. Now, once the subquery is executed, the result, the intermediate result, is going to be stored in the cache. This means the result of the subquery is temporary and as well very fast to retrieve. And now, once the database engine is done with the subquery, it is going to go and start executing the main query. In this scenario it is completely depending on the result of the subquery, so the main query is going to go and interact with the cache storage. This means the data is going to be retrieved very fast from the result of the subquery. Once it's done, it's going to forward the result to the database engine, and the database engine is going to forward the results to the client side. At your side, you will find the final result. And of course, once everything is executed, the database engine is going to go and clean up the cache. The subquery results are going to be destroyed and removed completely from the cache, in order to have free space for other queries. So this is how the database server executes the subqueries behind the scenes. All right. So now we're going to talk about how to use the subquery in the SELECT clause. We typically use subqueries in the SELECT clause to aggregate data side by side with the columns of the main query. Okay, so let's check the syntax of the subquery in the SELECT clause. We start with the simple stuff, where we say: okay, let's go and select a column that we want to retrieve from a specific table. Nothing new, we are just querying a table. And now, what we can do in this query is that not only can we go and select the columns from a specific table, we can go and insert here, inside the SELECT, another query, a full query, like SELECT, FROM and WHERE. So again, it's a query inside another query, and we call this of course a subquery. In order to tell SQL this is a subquery, we go and add the parentheses. With that, SQL is going to understand: huh, this is a subquery, and the result of this query is going to be used in the SELECT. We can handle it like any other column; we can go and give it an alias. Here the alias is optional, it is not a must. So this inner query we call the subquery, and the outer query is going to be the main query. This is how you put a subquery in the SELECT clause. But there is one rule for this query: the result of this subquery must be a scalar query. That means the result must be a single value, because otherwise it will not work; SQL here is expecting only one value. So this is how we use the subquery inside the SELECT clause. All right, let's have the following task, and it says: show the product IDs, product names, prices and the total number of orders. Now, if we check the task, there are two parts.
The first part is that we are showing the details about the products, and in the second part we have to go and calculate the total number of orders. So let's see what we're going to do. First, let's go and solve the simple part here, where we have the product ID, product names and prices. We're going to go and select the product ID, the product and then the price from the table sales.products. Let's go and execute it. With that, we have solved the first part of the task; we have the details about the products. Now we go and solve the second part: we have to go and calculate the total number of orders. This information comes from a different table than the products; we cannot calculate it from products, we have to go and query the orders. So what am I going to do? I'm going to go and calculate this part in a separate query, instead of having it here inside the products. Let's have a semicolon in order to have a second query. We're going to go and select the total number of orders; that means we can simply do a COUNT(*) from the table sales.orders. Let me just make it a little bit bigger. We're going to call it total orders, and add a semicolon as well. So now, if you just execute the whole thing, you will get two parts in the results. First you have the details of the products, and in the second part we have the total number of orders; we have 10 orders. But with that we have two different queries, separated from each other, and we have two different results. In the task, however, we have to show all this information in one result. So what we can do is put one query inside another query. Now, if you check the second query, the total orders, you can see we have only a single value. So we have a scalar query, a scalar subquery, and that's why we can go and use it as a subquery, like this. I'm going to go and put everything in one line in order to see it. Let's remove the semicolons; we don't need them. And now what we're going to do, we're going to go and take the whole thing and put it inside the main query. So this is the main query, and now think about it as a new column. I will put the query here, so it is just one new column in our SELECT. But in order to have it as a subquery, we have to use the parentheses at the start and at the end. And of course, we have to go and give it a name, so I'm going to go and use the same name over here: it's going to be AS total orders. With that, the setup for the subquery is ready, and it is inside the SELECT clause in the main query. Let's go and execute it. Now, as you can see, we have everything together. We have the three pieces of information, the product details, side by side with the total orders, and since it is always the same value, it is going to be repeated for each row. So this is what we call a scalar subquery inside the SELECT clause. And here, again, it is very important to understand: if you are using a subquery inside the SELECT clause, only the scalar subquery is allowed. For example, instead of having one value from the aggregation, we can go and use the order ID. Let's see what is going to happen: we will get an error. It is going to say the subquery is returning more than one value, and this is not allowed because we are using the subquery in the SELECT clause. That's why we have to have only one value, and by using the aggregation you will get one value. So let's repair it. And it's working.
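The finished query looks roughly like this (a sketch assuming the video's Sales.Products and Sales.Orders tables):

SELECT
    ProductID,
    Product,
    Price,
    (SELECT COUNT(*) FROM Sales.Orders) AS TotalOrders  -- scalar subquery: must return exactly one value
FROM Sales.Products;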
And now, again, if you would like to see only the results from the subquery, what you can do is go and highlight the subquery, like this, without the parentheses of course, and execute it. With that you can see in the output the 10; this is the intermediate result that is going to be passed to the main query. And if you want the whole thing to be executed, just unmark it and execute, and with that everything is executed, the subquery and the main query. So this is the scalar subquery in the SELECT clause. Okay, so now let's see quickly how SQL executed this query step by step. This is our original query, and we need two tables from our database for it. The first step is that SQL is going to go and identify the subquery, and it's going to go and execute it. So this is the first step. The query is targeting the orders table, and we are simply doing a count, so in the output we will get an intermediate result where we are counting the number of rows of the orders. The next step is that SQL is going to go and pass this value to the main query. This is the second step, and if you pass this value to the main query, it's going to look like this: you are saying product ID, product and the 10. After SQL has prepared the main query, SQL is going to go and execute it. This time we are targeting the products, and in the output we will get all the information from the products, without any filter, because here we don't have any WHERE clause, and the final result is going to look like this: we will have the product ID, the product and the total that we got from the subquery. As you can see, the subquery here is a scalar subquery, where we have only one single value. So again, it's very simple: SQL always starts with the subquery, then it's going to go and pass the values to the main query, and at the end the main query is going to be executed, and we will get the final result from it. So this is how SQL executed our query. All right, next we're going to talk about how to use the subquery in the JOIN clause. As we are joining tables in SQL, sometimes we have to go and prepare the data before doing the join, to dynamically create a result set for joining with another table. So again, here we cannot join the tables directly; we have to do a preparation step before doing the joins. Okay, let's have the following task, and it says: show all customer details and find the total orders of each customer. Now, of course, in SQL you don't have only one solution, you have multiple solutions, but I would like to solve this task using the subquery. If you check the task, we have two parts. In the first part we have to show all the customer details, and in the second part we have an aggregation: find the total orders of each customer. So let's solve those different parts using two different queries. Let's start with the easiest one: show all customer details. I think this is very simple: SELECT * FROM sales.customers. Let's go and execute it. In the output we have all the details about the customers, and we have solved the first part, very simple. Now let's go and solve the second part: we have to find the total number of orders of each customer. That means, let me just add a semicolon over here, we have to go to the table orders. Let's go and select first the order ID and the customer ID from the table sales.orders, like this. I will just highlight the second query and execute it.
Now, in the output, we have 10 orders and we have the different customers. In order to find the total orders for each customer, we have to go and use the GROUP BY. To do that, it's very simple: we're going to go over here and say COUNT, let's go with the star, and then we're going to go and group up the data by the customer ID. I will go and call this total orders. Let's go and execute only this part, and with that we have four customers and we have the total number of orders. So with that we have solved the second part of the task. Now what I'm going to do, I'm going to go and execute both of those queries, separated by the semicolon, like this. I will just make this a little bit bigger. Let's go and execute it. Now in the output we have the two results: all the details about the customers, and the total number of orders for each customer. What we want to do now is go and combine those two results into one, and in order to do that we can use the joins. So now we have to think about what is the first query and what is the second query. Since the first query returns all the customers that we have in the database, I would like to have this as the left table, and since in the second query we have only four customers, I would like to have it as the right table, and I will go with the LEFT JOIN so that I don't miss any customer, because if I do the INNER JOIN, I will lose the customer number five. Let's go and do that. So this is the first query, in the main query; I'm going to call this the main query, and I'm going to give it an alias as well, like the C. Now we're going to go and join this table from the database together with the result, the output, of the second query. That means we're going to do it like this: LEFT JOIN, and now we're going to join with a subquery. So we will have our parentheses; I will just put here a few spaces so that it's clear it is a subquery, and we need an alias for it. Let's go and say, for example, the O. So with that we are joining a table with the result of a subquery. And now, of course, what is missing is joining the tables using a key. If you check the two results, you can see that in both queries we have the customer ID; that's why we're going to join on the customer ID. So: ON, then the customer ID with the customer ID from the subquery, like this. We have everything; let's go and execute it. Now, as you can see in the output, we have all the details about the customers together with the total number of orders for each customer, and as you can see, we didn't miss any customer. We have all the customers from the database, and we can see that Anna doesn't have any orders. Now you might say: you know what, we have here the customer ID twice. So what I'm going to do, I will select all the columns from the customers, but from the subquery I'm interested only in the total orders. Like this; let's go and execute it. Let's make this a little bit smaller. Now the results are really clean: we have all the details from the customers and as well the total orders of each customer. And of course, as we learned, if you would like to check the results of only the subquery, you go and highlight it and execute it. So as you can see, you can put the subqueries almost everywhere, and this is how we use subqueries inside joins.
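Putting both parts together, the join with a subquery looks roughly like this (a sketch assuming the video's Sales.Customers and Sales.Orders tables):

SELECT
    c.*,            -- all customer details
    o.TotalOrders   -- total orders from the subquery
FROM Sales.Customers AS c
LEFT JOIN (
    -- Subquery: total orders per customer
    SELECT
        CustomerID,
        COUNT(*) AS TotalOrders
    FROM Sales.Orders
    GROUP BY CustomerID
) AS o
    ON c.CustomerID = o.CustomerID;  -- LEFT JOIN keeps customers without any orders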
Okay, so now we're going to focus on how to use the subquery in the WHERE clause. In SQL, as we learned, we can go and filter tables using the WHERE clause with static values. But in real data projects we're going to go and filter the data based on more complex logic. In order to prepare this complex logic, we go and use subqueries, in order to have dynamic filtering for our main tables. And in order to filter data using the WHERE clause, we have to go and use operators, and we can split them into two groups: we have the comparison operators, and another set we can call the logical operators, or sometimes we call them subquery operators. First we're going to talk about the comparison operators. These are operators that we can use in order to compare two values, to help us filter the data based on a specific condition. In the SQL basics we have learned that we have different comparison operators, and they are very simple. In order to compare two values, we have operators like the equal, we have as well the not equal, the opposite, we have greater than, less than, and as well we have greater than or equal to, and the last one, less than or equal to. They are very simple. Now, instead of comparing two values, we're going to go and compare a value with the result of a subquery, using the comparison operators. All right, let's check the syntax of the subquery inside the WHERE clause using the comparison operators. We start with the standard stuff, where we say SELECT, a few columns that we want to retrieve, and we want to get the data directly from a specific table in our database. Now we come to the WHERE condition, where we want to filter the table. We say WHERE, and then we select a specific column from table one. Since we are talking about the comparison operators, we can go with an operator, for example the equal, and usually we go and specify here a static value, like a number or a string. But instead of having a static value, what we can do is get the value from another SELECT statement, another query, like here, where we are saying: select a column from table two, with a filter. Now, whatever comes from this subquery is going to be used in order to filter table number one. And of course, we are telling SQL this is a subquery by adding the parentheses at the start and at the end, and the outer query is going to be the main query. So as you can see, we are using the subquery in order to filter the main query. And here in SQL, if you're using a subquery with the comparison operators, we have a rule: the subquery must be a scalar subquery, so only one single value. That's all about the syntax of the subquery in the WHERE clause using the comparison operators. All right. So now we have again the same task, and it says: find the products that have a price higher than the average price of all products. We have solved this task already, using the subquery inside the FROM clause, but now we're going to go and solve it again using the subquery, this time inside the WHERE clause. Let's do it step by step. Let's go and get the information that we need: we need the product ID and the price from the table sales.products. Let's go and execute it. Now we got the list of all products, but we have to go and filter this information using the column price. So with that, in the result we got all the products, but we don't need all the products; we need only the products where the price is higher than the average. That means we have to go and filter the table based on the values of the price.
So now, in order to do that, we're going to use the WHERE clause, and we have to go and filter the data based on the price, and since we need higher than, we're going to go and use the comparison operator greater than. Next we need the value of the average price. How are we going to get it? We don't have the average price out of the box in the table products; we have to go and calculate it. That's why we're going to go and write another query, where we're going to find the average price from the table sales.products, like this. Let's go and highlight it and then execute it. With that, we now have the average price of our products, and as you can see in the output, we have only one single value. So this is a scalar query. Now, what do we need? We need this value in order to filter the first query. That's why the first query is the main query, the bigger one, and the second one is the subquery that is going to support the main query in order to filter the data. So what we're going to do is take the subquery and use it in the WHERE clause, and of course we have to tell SQL this is a subquery; that's why we have to put it inside the two parentheses. With that, we have the subquery inside the WHERE clause in order to filter the main query. Let's go and execute it. Now, as you can see in the output, we have only two products where the price is higher than the average price. So with that, we have solved the task, but this time using the subquery in the WHERE clause in order to filter the main query. And, of course, in order to see this value in our SELECT, since it is a scalar subquery, we can as well go over here and put it in our SELECT, just in order to see the value, the average price. Let's go and execute it, and with that we can see the average price in our results as well. So this is how we use the subquery in the WHERE clause using the comparison operators. Okay, so let's see quickly how SQL is going to execute our query step by step. As usual, SQL is first going to go and identify the subquery; it's going to be our SELECT average price and so on. In the next step, SQL is going to go and execute our subquery. It is based on the products, and since we are doing an aggregation without a GROUP BY, in the output we will get only one value: the average is going to be 20. This value is stored temporarily in memory, so we will not see it in the output. SQL is going to go and pass this value to the main query, so the main query is going to look like this: we are selecting a few columns from the table, and we are filtering the data based on the price that is higher than the value 20 that we got from the subquery. Now, once SQL has everything for the main query, SQL is going to go and execute it. SQL is going to go to the products and select only the products where the price is higher than 20; it's only those two rows, and in the output we will get the final result, the two products, with product ID, product and price. So that's it; it's very simple. This is how SQL executed our query: as usual, first starting with the subquery, passing the value to the main query, and at the end the main query is going to be executed with the information from the subquery, and we will get the final result.
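The whole query from this task looks roughly like this (a sketch, same assumed Sales.Products table; the extra column in the SELECT is only there to display the average):

SELECT
    ProductID,
    Price,
    (SELECT AVG(Price) FROM Sales.Products) AS AvgPrice  -- shown just for reference
FROM Sales.Products
WHERE Price > (SELECT AVG(Price) FROM Sales.Products);   -- scalar subquery as the comparison value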
All right. So now we're going to talk about the second group of operators, and we're going to start with the IN operator. What is the IN operator? As we learned before, with the comparison operators we can go and filter the data based on only one single value. But in some scenarios we have to go and filter the data based on multiple values, not only one. In this case, we can go and use the IN operator. If you go and use the IN operator, it's going to go and check whether the value matches any value from a list, a list of multiple values. If it matches any of them, we will get a true. Okay, so now let's have a quick look at the syntax of the subquery using the IN operator. We start with the classic stuff, where we say: okay, we would like to retrieve column one and column two from table one, and we want to filter the data based on a column from table one. After specifying the column, we're going to use the IN operator, and after that we can go and specify static values. But since we are talking about the subqueries, the values are going to come from another query. So here we have another SELECT statement, from table two, and we filter the data in this query. The result of this subquery is going to be used in order to filter the data using the IN operator. And now, the big difference between the IN operator and the comparison operators is that the subquery is allowed to have multiple rows. There is no rule about having one single value, a scalar subquery; we can have in the result a list of multiple values. So this is the syntax of the subquery using the IN operator. All right, let's practice using this task. It says: show the details of orders made by customers in Germany. Let's see how we can solve this task. First, it needs the details of orders, and as we know, we have the table sales.orders. Let's go and execute it. In the output we have all the orders, with all the details. But for the task we don't need all the orders; we need only the orders made by customers from Germany. Now, if you check the table orders, you don't find any information about the countries, right? So we have to go and get it from another table, and as we know, we can find this information in the table customers. Let's build another query: SELECT * FROM sales.customers, like this, and let's go and execute only the second query. Now, as you can see, in the customers we have the country column, and this is exactly what we need. So now let's make a list of all the customers from Germany. We don't need all the customers, we need only the ones that come from Germany. That's why we're going to go and use the WHERE clause, and we say country equal to the value Germany, like this. Let's go and execute it again and check the results. Now, in the output, we have our German customers, number one and number four. So now we're going to go and use this information in order to filter the table orders. Let's go back to the table orders over here, and here we have the customer ID information. As we can see, we need the orders where the customer is either one or four. In order to filter that, we're going to go to the first query and use the WHERE clause, like this, and say the customer ID. Now, since we have two values, one and four, we can go and use the operator IN. Let's go and use the IN, and let's go and build the list: the one and the four. Let's go and execute it. Now we can see the results: we have the orders, but only from the customers one and four. So with that, we have solved the task; we have the details of orders made by customers in Germany. Right?
And now, of course, this is a really bad solution, because what about if we get a new customer in the future? You don't want to go and keep adding values here each time you have a new customer. We want to make the values in this list dynamic. So we don't need static values, we need dynamic values, and we can use the subqueries in order to retrieve this information. And we have it already in the second query. Let's go back to the second query over here. We need only those two values, one and four; that's why we're going to go to the query and say: okay, let's retrieve the customer ID. Let's go and execute it again, and with that we have the one and the four, exactly like we have it here in the first query. And of course, if in the future there is another customer that comes from Germany, this list is going to be a little bit longer. So this query is always going to retrieve all the customer IDs that have the country equal to Germany. Now what we're going to do, we're going to take this as a subquery. Let's go and get everything from it, and put it in the place of those static values. Of course, we're going to go now and put a few spaces to the right side, in order to understand this is a subquery, and note that here we don't use any aliases. So what are we doing? The results from this subquery are going to be used in order to filter our main query. Let me just call it main query, like this, and make this smaller. Let's go and execute it. Now we are getting the same results: we are getting all the orders from only the customers one and four, where they come from Germany. And this information comes dynamically from the subquery, so we don't have to worry about new customers from Germany; they are going to be added here automatically, and this query is always going to return all the orders from Germany. So this is the power of the subquery together with the IN operator, if you are having multiple values, multiple rows. So we have solved the task. All right, now one more thing. Let's say that the task is exactly the opposite; it says: show the details of orders made by customers who don't come from Germany. Here there are two ways to do it. Either you go to the subquery and you say: you know what, the country should not be equal to Germany. If you go and execute it, you will get all the customer IDs that are not from Germany, and if you execute the whole thing, you will get all the orders where the customers are not from Germany. So either you do that, or you stay with the equal to Germany, but you go and flip the whole logic by using the operator NOT. Now we are saying the customer ID should not be equal to one of those values, so it should not be equal to one or four, and for that we are using the NOT IN operator. Let's go and execute it. With that, we are getting all the orders where the customers don't come from Germany, by simply using the NOT IN operator. So that's all about the IN and NOT IN operators.
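Here is roughly what the finished query looks like, together with the flipped variant (a sketch, same assumed tables):

-- Orders made by customers in Germany
SELECT *
FROM Sales.Orders
WHERE CustomerID IN (
    SELECT CustomerID
    FROM Sales.Customers
    WHERE Country = 'Germany'  -- the list stays up to date automatically
);

-- Orders made by customers who don't come from Germany
SELECT *
FROM Sales.Orders
WHERE CustomerID NOT IN (
    SELECT CustomerID
    FROM Sales.Customers
    WHERE Country = 'Germany'
);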
All right. So now let's see step by step how SQL executes our query. We are targeting two tables, the customers and the orders. The first step is that SQL is going to go and identify the subquery, and it's going to go and execute it. The subquery here is filtering the data based on the country. The query is going to be executed, and in the output we will get only two rows. So it is one column with multiple rows; this is the row subquery, and this is our intermediate result that is going to be passed to the main query. So our main query is going to look like this: we are selecting a few pieces of information from the orders, and we are filtering the table orders based on the customer ID, where we are saying the customer ID must be one of those values, one or four. So the subquery here is supporting the main query with the information for the filter. Now, once SQL has everything, it is going to go and execute our main query, and this is going to work like the following. We will start with the first row, and here the customer ID is equal to two. The value two is not equal to 1 or 4; that's why this row will be excluded from the final results. Now let's move to the second row. We have here the value three, and the value three is not equal to one of those values either; that's why this row is going to fail as well, and we will not have it in the output. Then SQL is going to go to the next one. This time the customer ID is one, and it is equal to one of those values, it's equal to one. So we have a match; that's why this row will be included in the results. And the same thing happens for the next row, because we have the customer ID one again, and so on. After SQL has checked all those customer IDs, whether they are in the list of one or four, we will get the final results, where we have all the orders where the customer ID is either one or four. So this is how SQL executed the IN operator using the subqueries. Okay, so now moving on to the ANY operator. We can go and use the ANY operator in order to compare a value and see whether it matches any value from a list. That means we can go and use it in order to check whether a condition is true for at least one of the values in a list. Okay, so now let's check quickly the syntax of the subquery using the ANY and ALL operators. As we learned before, we can go and use a subquery inside the WHERE clause in order to filter the main query using the comparison operators, like here the less than. Now, the syntax of the ANY operator is that you're going to go and use the comparison operator, and immediately after that you use the keyword ANY. And for the ALL operator it's going to be exactly the same, where you're going to go and put the keyword ALL after the comparison operator. So the syntax is very simple; we just add those keywords. Let's practice using the following task: find female employees whose salaries are greater than the salaries of any male employee. That means we want to go and compare the salaries between male and female, and specifically we are searching for female employees whose salary is greater than at least one male employee's. So let's solve it step by step. Let's go and start selecting a few pieces of information, like for example the employee ID, first name, gender and salary from the table sales.employees. Let's go and execute it. Now we have five employees; three of them are male and two are female. Since we want to compare the data between male and female, let's go and create two queries, each filtering the data based on the gender. The first one is for the female employees, and we can go and remove this information over here; let me just make this a little bit smaller and zoom out. And the second query is going to be the exact opposite: let's go and get the employee information for the male employees. Let's go and execute it. So the first result is for the female employees, and the second one is for the male employees. Now, what do we need in the output? We need the female employees.
That means this is going to be our main query. We are focusing on the female employees, and we are using the male employees only as a filter, and what we need from them is only the salary information. That's why we can prepare it like this; I will just put everything in one line to make it clear. So this is going to be our subquery. Now we're going to go and work with the main query, where we're going to add one more filter, where we're going to filter the data based on the salary. So we're going to say: if the salary is greater than, and now we need the values from the subquery. So this is our subquery; we're going to put it like this, and don't forget about the parentheses at the start and at the end, and I would still like to keep those two queries visible. Let's go and execute it. Now we will get an error, and that's because our subquery is returning multiple rows, and this is not acceptable: we are using a comparison operator, and SQL expects the subquery to be a scalar subquery, so only one single value. But in order to solve this issue, we can go and use the logical operators, either ALL or ANY. Now, since we are saying it's enough for the salary of the female employee to be higher than at least one male employee's, we will go with the operator ANY. Let's go after the comparison operator and add the keyword ANY, and let's go and execute it again. Now, as you can see in the output, we got only one female employee, where her salary is higher than at least one of those male employees'. Let me just go and get the first name as well from the second query, just to have it, like this. Now, if you go and compare the salary of Mary, it is not higher than Michael's, but it is higher than Frank's and Kevin's. And since we are using the ANY operator, it's enough for Mary to have a salary higher than one of those values; in this case it's higher than both Frank's and Kevin's, and the condition is fulfilled. That's why we are getting Mary. And the other female employee, let me just check: we have Carol, whose salary is less than all the salaries of the male employees, and it must be higher than at least one male employee's. So with that, we have solved the task, right? All right. So now we have another operator that is similar; we call it the ALL operator. We can go and use it in order to compare a value and see whether it matches all the values in a list. That means we can go and use it if we need to check whether a condition is true for every value in a list. I know that might sound a little bit complicated, but don't worry about it; we have examples. Now let's say that our task says: find female employees whose salaries are greater than the salaries of all male employees. That means the condition is now more restrictive: Mary should now have a salary higher than every male employee's. It should be higher than all those values that we have from the male employees. And of course, in this scenario it's not, because we have Michael; Mary has a lower salary than Michael, and this is a problem, because Mary's salary should be higher than everyone's. So let's go and try it. If I go and write here ALL, and let's go and execute it, you will see that we will not find any results that fulfill this requirement. We don't have any female employee whose salary is higher than all the male employees', and that's because we have a very small data set. So this is how we use the ALL and ANY operators in our subqueries in SQL. All right.
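Both variants side by side look roughly like this (a sketch assuming the video's Sales.Employees table; the 'F'/'M' gender values are my assumption about how the column is encoded):

-- Female employees earning more than AT LEAST ONE male employee
SELECT EmployeeID, FirstName, Gender, Salary
FROM Sales.Employees
WHERE Gender = 'F'
  AND Salary > ANY (SELECT Salary FROM Sales.Employees WHERE Gender = 'M');

-- Female employees earning more than EVERY male employee
SELECT EmployeeID, FirstName, Gender, Salary
FROM Sales.Employees
WHERE Gender = 'F'
  AND Salary > ALL (SELECT Salary FROM Sales.Employees WHERE Gender = 'M');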
So with that, we have covered almost everything about how to use the subqueries in different locations and clauses. But we didn't talk about the EXISTS operator, and that's because I would like you first to understand a very important concept in the subqueries, where we have two different types of subqueries based on the dependency: the non-correlated and the correlated subqueries. And after that, we're going to go back to the EXISTS operator. All right, friends. So now we come to the part that is a little bit complicated about the subqueries. We're going to talk about the dependency between the subquery and the main query. So far, all the examples and the subqueries that we have learned were non-correlated subqueries. A non-correlated subquery means a subquery that can run independently from the main query; that means the subquery is like a standalone query. But on the other hand, we have the exact opposite type of subquery: we have the correlated subquery. A correlated subquery is a subquery that relies on values from the main query for each row it processes. That means the subquery here is completely depending on the main query. I know this might be a little bit confusing; that's why we can have the following very simple sketch in order to understand exactly how this works. As usual, we have a database with tables, and this time SQL is going to start executing the main query first. This is the first thing that happens. The main query is going to go and query the database in order to get results, and SQL is going to process the results row by row. So what is going to happen? The main query is going to go and pass the first row's information to the subquery. So now the subquery gets the data from the main query, and SQL is going to execute the subquery. Here the subquery is going to return a value, like for example one. And here it's very important to understand that now SQL, or the main query, is going to check: is there a result from the subquery? In this example, yes, we have a result. So here SQL is checking the output of the subquery for the first row, and if there is a result, SQL is going to go and return the row in the final result. So this is the whole iteration that happened, only for the first row. We're going to process the whole thing again from the start for the second row. The main query is going to get the second row from the database, and it is going to pass it to the subquery. Once the subquery gets this new information, SQL is going to go and execute the subquery once again. Now let's say that after executing the subquery, there were no results; the subquery is not returning anything after the execution. So what happens? SQL and the main query are going to check: okay, there is no result from the subquery, and this means this row should be excluded and not presented in the output. We will not see this row in the output. So as you can see, SQL is executing the subquery once again for the second row, and this will keep happening as long as we have rows. For example, we have another row: the main query is going to pass it to the subquery, the subquery is going to be executed for the third time, and the result of the subquery is going to be one. The same thing is going to happen: SQL is going to check it, okay, we have a value, so this row is allowed to be in the final results, and so on. The cycle is going to keep repeating for each row that is retrieved by the main query, and once we have processed all the rows, the final result is going to be presented in the output.
So what have we understood so far? The correlated subquery is always depending on the main query, and the subquery is going to be executed for each row that we get from the main query. In this example we have four rows, and the subquery is executed four times. So this is how the correlated subquery works; it's a little bit more complicated than the non-correlated subquery. The non-correlated subqueries are really straightforward: first the subquery is going to go and query the database only once, and the output of the subquery is going to be an intermediate result that is going to be used by the main query. The main query is going to go and query the intermediate result, and in the output we're going to get the final result. So as you can see, the execution of the non-correlated subquery is straightforward: there are no iterations, everything is executed only once. Now, if you compare them side by side, you can see that the non-correlated subquery is completely independent from the main query. That means the subquery is going to be executed only once, and after that SQL is going to go and execute the main query, also only once, using the result from the subquery. But on the left side, the subquery is going to be executed multiple times, and it is completely depending on the main query, and there is an iteration for each row that is retrieved by the main query. The process is going to keep cycling until all the rows are processed, and this is exactly how the correlated and the non-correlated subqueries work in SQL. All right. So now let's have the following task, and it says: show all customer details and find the total orders of each customer. We have already solved this task, and you know, in SQL we don't have only one way to solve something; we have multiple ways to do it. We solved this task before using subqueries and joins. Now we're going to go and solve this task using a subquery in the SELECT clause, and as well using the correlated subqueries. So again, let's do it step by step; it's very simple. First we need all the customer details. As we learned: SELECT * FROM sales.customers. If you execute it, you will get all the details of all the customers. Now we need to find the total number of orders of each customer. Before, we solved this using a simple query where we used the COUNT function together with a GROUP BY, but this time we're going to do it a little bit differently. Let's go and write a query saying SELECT COUNT(*) FROM the table sales.orders. Let's go and execute it. With that, we have the total number of orders. So let's go and take this subquery and use it in the SELECT; we are using it as a scalar subquery. Let's just put it over here, and this is the main query. In order to make this a subquery, we're going to add the parentheses, and we're going to call it the total orders. So now let's go and execute it. As you can see, we have here all the details about the customers, and we have the total orders. But we have one issue: we don't need just the total orders overall, we need the total orders for each customer. Each customer has different total orders. And we cannot have the following setup: we cannot say GROUP BY customer ID inside the subquery and then have the customer ID here as well and so on. If you go and execute it, you will get a problem.
And that's because if you go and execute this subquery over here, you will get multiple rows and multiple columns. So you have a table query, and this type of subquery is not allowed to be used in the SELECT clause, right? We have to have only a scalar subquery; that's why we cannot do that. So we have to go and remove all of that. But we can go and solve it using the correlated subqueries. Right now the subquery is completely independent from the main query. In order to correlate it, what we're going to do, we're going to go and connect it. I'm going to give aliases to the tables, and I'm going to say: WHERE the customer ID is equal to the customer ID from the main query, from the customers. So again, we are connecting the customer ID from the orders, in the subquery, with the customer ID from the table customers, which comes from the main query. Now we are saying: okay, execute this only for a specific customer, not for the whole table. Let's go and execute it. Now, in the output, we have the total orders for each customer, and not the total of the whole table orders, and that's because of what is happening: for each row, the subquery is going to be executed. So for the customer number one, this query is going to be executed like this: count the total number of orders where the customer ID is equal to one. Let me just show you what this means. If I go and remove this from here and just put the number one, and execute, you will see the customer ID one has three orders. And let's just put it back and execute. The same thing is going to happen for each customer. So for each customer, for each row, this subquery is going to be executed, and it is going to be filtered with the customer ID that comes from the main query. So this is another way to solve this task, using the correlated subqueries.
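The correlated version looks roughly like this (a sketch, same assumed tables):

SELECT
    c.*,
    (SELECT COUNT(*)
     FROM Sales.Orders AS o
     WHERE o.CustomerID = c.CustomerID) AS TotalOrders  -- re-executed for each customer row
FROM Sales.Customers AS c;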
So now let's summarize and understand quickly what the differences are between the non-correlated and the correlated subqueries. If we are talking about the definition: non-correlated subqueries are subqueries that are independent of the main query, but on the other hand, correlated subqueries are dependent on the main query. If we're talking about the execution: the non-correlated subquery is going to be executed only once, and then the results are going to be used by the main query, but with the correlated subqueries, the subquery is going to be executed for each row that we have from the main query. And as we learned, the non-correlated subquery we can execute on its own; we can go and highlight it and execute it. But the correlated subquery we cannot execute on its own; we always have to execute the whole thing. And if we are talking about which one is easier, I think it's clear that the non-correlated subqueries are easier to write and to read, and on the other hand, the correlated subqueries are harder to read and more complex. Now, if we're talking about the performance of the database: since the non-correlated subquery is executed only once, this of course is going to lead to better performance, because things are really straightforward and not complicated. But on the other hand, with the correlated subqueries there is more effort, because SQL has to check a lot of things, and the subquery is going to be executed many times. So the non-correlated subqueries are faster. We use the non-correlated subqueries in order to do static comparisons: the subquery is executed only once, and we get only one static value in order to use it for filtering and so on. But on the other hand, we use correlated subqueries in order to do row-by-row comparisons, and since we don't have a static value here, each time the subquery runs we're going to get different results. This adds more dynamic to the filters, and we don't have a static value. So those are the big differences between the non-correlated and the correlated subqueries. All right. So now, after we have understood the concept of the two types, correlated and non-correlated subqueries, we're going to go and cover the last operator for the subqueries: we have the EXISTS. So what is the EXISTS operator? We're going to talk about a very interesting operator in SQL, the EXISTS. In some scenarios, as you are querying the data from one table, you need to go and check whether the rows of this table exist in another table. That means you are checking the existence of your rows in a different table, and exactly in this scenario we go and use subqueries together with the operator EXISTS. The EXISTS operator is very simple: it simply checks whether the subquery returns any results, any rows. All right. So now let's understand the syntax of the correlated subqueries using the EXISTS operator. This can be a little bit complicated, but we're going to do it step by step; don't worry about it. Let's start with the easy stuff. In the main query we're going to go and write a simple SELECT: we are selecting a few columns from table two. And now, we don't need all the data from table two; we want to filter the table using the WHERE clause. What we're going to do after the WHERE clause is write immediately another keyword, called EXISTS. We don't specify any column before the EXISTS, like we have done with the comparison operators or the IN operator. We don't need that, because we are not filtering based on a value; we are filtering based on logic. That's why we have the word EXISTS immediately. And directly after the EXISTS, we're going to go and define the subquery, like this. We're going to start saying SELECT 1 FROM table number one. Well, it is not a must or anything, but it is very commonly used to specify here a one. We are not using the subquery in order to retrieve information from table one; we are just testing whether the subquery is going to return a value or not, and we don't care about the returned value. It could be a one, it could be a column, it could be anything. We don't care about the data that is retrieved; we only care whether the subquery returns anything. That's why we go and write any value, like here a one. Now, we are not done yet: this subquery is not yet connected to the main query. We have to somehow go and connect them together, and we can do that using the WHERE clause, where we go and connect the ID from table one, from the subquery, with the ID from the outer query, from the main query. With that, we are building a relationship between the subquery and the main query. So with that, the subquery is now depending on the values from the main query, because here we have the table 2 ID; the IDs from the main query are going to filter the subquery. So this is the syntax of correlated subqueries using the EXISTS, where we are making the subquery depend totally on the main query.
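As a sketch, the general shape just described looks like this (table1, table2 and id are placeholders, not real objects):

SELECT column1, column2
FROM table2
WHERE EXISTS (
    SELECT 1                      -- the returned value is irrelevant; only existence matters
    FROM table1
    WHERE table1.id = table2.id   -- correlation: links the subquery to the outer row
);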
So now let's understand how EXISTS works. For each row coming from the main query, SQL triggers an execution of the subquery. The subquery helps us evaluate — test — that row. Now, if the subquery doesn't return anything, so there are no results, what happens? The row we are evaluating from the main query is excluded from the final results. On the other hand, if the subquery returns a value — some kind of result — then the row we are evaluating is included in the final results. So the subquery is used to do a test: do we have a result or not, and based on that SQL either includes or excludes the row. This is the logic behind EXISTS in SQL. All right, so now we're going to solve the same task using EXISTS. The task says: show the details of orders made by customers in Germany. We already solved this task with the IN operator and a subquery; now we'll solve it using EXISTS. Again we have the same logical steps as before. First, we select all the details from the table Sales.Orders. Let's execute it, and with that we have all the orders with all their details. But of course we don't need all that information — we need only the orders made by customers from Germany. So that is the first query. Let's construct the second query: SELECT * FROM Sales.Customers, but we don't need all the customers, only the customers WHERE country equals the value 'Germany'. Let's execute it — now we have all the customers that come from Germany. Now we have to put these two queries together to get the final result. As we learned before, the second query is going to be our subquery, supporting the first query in filtering the data, so the first query is going to be our main query. Let me just make this smaller, and the text as well. Now, we don't need all the orders, right? We need only the orders where the customer comes from Germany, so we need the WHERE clause. The filter logic goes like this: show the order details only if the customer exists in the subquery. And now we have to put in our subquery — it's this one over here — so let's just move it to the right side, and in order to have it as a subquery, we close the parentheses. And since EXISTS is a correlated subquery, we cannot leave it like this; we have to connect the subquery to the main query. Right now the subquery is independent of the main query, but we want to check, for each order in the orders table, whether the customer exists in the subquery. So we add a condition like the following — and it works like a join: we have to connect the customer IDs. We give an alias over here, and one for the subquery as well, and say the customer ID from the orders should equal the customer ID from the subquery, the customers table, like this. So again: this customer ID comes from the subquery, and this customer ID comes from the main query. And since we are using the subquery only to test the existence of the customer — whether the subquery returns anything or not — it doesn't matter what you select in the subquery.
So you could go with a star, a column, or any static value. But by convention SQL developers go with the static value 1. Of course you could add a column like the customer ID, but that's an unnecessary step — it makes SQL retrieve the customer ID data. It is simply faster for SQL if you just say SELECT 1. So let's stick with the best practice: use the value 1 when working with EXISTS. So this is our subquery, and I think we have everything. Let's execute it. Now, as you can see in the output, we got all the orders where the customers come from Germany. And of course, if you try another value instead of 1 and execute, you get exactly the same results — it doesn't matter which value you use. So with that, we have solved the task, this time using EXISTS. Now, if the task says show the details of orders made by customers that don't come from Germany, it's going to be very simple: we put the operator NOT before EXISTS — WHERE NOT EXISTS. Now we are flipping the whole logic and saying there should be no match with the subquery. If you execute it, you get all the orders where the customers don't come from Germany, simply by using the NOT logic. And there is one more thing that is annoying about correlated subqueries compared to the non-correlated ones we learned before. Let me go back to the IN operator. This is a non-correlated subquery, and if I select only the subquery, I can execute it independently — I can check the intermediate results and validate my query. But the problem with the correlated subquery is that I cannot highlight the subquery and then execute it. That's because, inside the subquery, we reference a column that lives outside it — that comes from the main query. That piece of information is currently unknown to SQL, and that's why we get an error: SQL says, I don't know where this column comes from. So this is a little annoying about correlated subqueries — you cannot test the intermediate results. But here is how I usually do it: I test an intermediate result for only one row. For example, I pick a customer here — say, two — and say the customer ID should equal two. Let me just remove the outer reference from here; I got this value from the main query. If I execute it now, I can see the subquery is not returning anything, because there is no such value. With that, I'm just testing one row. And of course, to make the full query work again, I have to add the column from the main query back. This is why correlated subqueries are a little harder to understand compared to non-correlated ones: we cannot test the intermediate results the way we can there. So this is another way to solve this task, using a correlated subquery with the EXISTS operator.
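Putting the walkthrough together, the finished query would look roughly like this, assuming the course's Sales schema; swap EXISTS for NOT EXISTS to get the orders from customers who don't come from Germany:

    SELECT o.*
    FROM Sales.Orders AS o
    WHERE EXISTS (
        SELECT 1
        FROM Sales.Customers AS c
        WHERE c.Country = 'Germany'
          AND c.CustomerID = o.CustomerID  -- correlate the subquery with the main query
    );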
Okay, so now let's see step by step how SQL executes correlated subqueries with the EXISTS operator. This time SQL will not start with the subquery — SQL starts immediately with the main query. SQL first identifies the main query and executes it, but row by row. The first row is the first customer, so SQL puts the first customer under the test. The next step: SQL passes the value of the customer ID from the main query to the subquery — so we are doing exactly the opposite direction now. SQL prepares the subquery with this information, saying the customer ID equals one, and then executes it. Once SQL executes this query, we get a result of 1, because there are multiple rows in the orders table where the customer ID equals one. So what happens? The row from the main query passes the test, and this customer is included in the final results. Next, SQL starts testing the second customer: we put this customer under the test, pass the value to the subquery — here the value two — and SQL executes the query. Of course we get a result, because the customer ID two appears multiple times, and that's why the output of this subquery is 1 again. SQL says: great, we have a value from the subquery, so it is safe to show this customer in the output. Then SQL moves to the next row, and so on — for the next two customers the same thing happens. All of those customers get a value from the subquery, so they all pass the test, and we have them in the output. Now SQL goes to the last row of the customers table: we have Anna, and we put Anna to the test. SQL passes the value five to the subquery and executes the query against the orders table. This time nothing is returned, because there is no customer ID equal to five in the orders table. SQL says: well, we are not getting any results from the subquery, so this customer fails, and SQL will not show it in the output — it is completely removed. The customer Anna is excluded because the subquery returns nothing: customer ID number five, Anna, does not exist in the orders table, so she fails the test, and we have only four customers in the final results. This is exactly the purpose of EXISTS: we are checking and testing the existence of our rows in another table, in another query. So this is how SQL executes correlated subqueries using the EXISTS operator. All right friends, with that you have covered everything about subqueries — all the different categories and types — and now we're going to do a quick recap. As we learned, a subquery is simply a query inside another query. We use subqueries to break complex queries down into smaller, simpler, easy-to-manage pieces, which makes everything easier to develop as well as to read. And as we learned, there are many different use cases for subqueries: we use subqueries to create temporary result sets to be used later by another query; we learned we can use them to prepare the data before joining tables; and another very important use case is filtering our data using dynamic and complex filter logic.
And as we learned, we can use correlated subqueries with the EXISTS operator to check the existence of data — of rows — in other tables, and correlated subqueries also help us do row-by-row comparisons. All right my friends, with that we have covered an important technique for nesting your queries in SQL. In the next step we're going to talk about one of the most famous techniques for doing multi-step logic in SQL: the CTE, the common table expression. So let's go. A CTE, common table expression, is a temporary named result set — like a virtual table — that can be used multiple times within your query to simplify and organize complex queries. So let's understand what this means using the following sketch. We have our database tables, like orders, customers and so on. In a very simple scenario, we write a simple SQL statement to query and retrieve data from the database, and in the output we get the result of the query. This is the simplest version of querying data. Now things get complicated in our project, and we could use the following technique in our query. We still have this section where we say SELECT FROM. But now, inside our query, we can write another query — for example SELECT FROM WHERE — which has nothing to do with the first one, and we can give this new query inside our query a name. We call this query a CTE query, a common table expression, and the first query, outside the CTE, we call the main query. Now if you check this, we have a query inside another query. So let's see what SQL does with it. The first thing it does is execute the CTE query: the CTE runs and retrieves some information from our database tables. Now, the output is available only within the query, and it has the shape of a table — for example, the sales. So the sales table and the orders table are both tables, but one is stored in the database and the other is an intermediate, virtual table. Now, what can happen in the main query? We can start querying the sales table — the result of the CTE — like any other normal table, just as we do with the database tables. The main query retrieves some information, and maybe does some manipulations on top of the sales table, the CTE result. And of course the main query can also say: you know what, let's query a few tables from the database as well. So the main query has two sources of tables: either directly from the database, or from the table created inside the query. Once everything is done, the final result of the main query is presented to the user. So as you can see, the CTE query has one task: it generates a table that lives inside our query, and we can use it however we want. Now, this intermediate table created by the CTE has two features. First, this table will not live long: once the query ends, SQL destroys this table. It will not be available afterwards, and we are not able to query it anymore — SQL does a cleanup here. And for the second characteristic, let's imagine we have another, separate query that retrieves tables directly from the database tables.
Now, if that other query says, let's join those tables with the sales from the first query — well, it will not work, because SQL is going to say: I don't know what you are talking about. That's because the sales table is only locally available to the main query, within the same query. It is not globally available, like the database tables, for any query; it is dedicated only to the main query within the same statement. And now you might tell me: wait, I have heard this story before, right? This is an identical story to the one you told us about subqueries. So what exactly is the difference between a subquery and a CTE? Well, you are totally right, the story is very similar, but there are still differences between them. So let me show you a few. Let's put them side by side: on the left side the subqueries, on the right side the CTE. Now, if you look at how we write them, you can see that we write the subquery from bottom to top: first we have the inner query, the subquery, and then on top of it we have the main query. The CTE, on the other hand, we write from top to bottom: first we write the inner query, the CTE query, and beneath it we write the main query. So this is the first difference — the way we write the query. If I'm thinking in subqueries, I think from bottom to top; if I'm thinking in CTEs, I think from top to bottom. But still you might say: you know what, I don't care how we write it — they do the same thing. The subquery introduces an intermediate result that is used later by the main query, and the same goes for the CTE: it presents an intermediate table that is also used by the main query. Now let me tell you the big difference: with a subquery, the result can be used only once. You cannot have another place in your main query where you reuse the result of the subquery — you can use it in at most one position, and only once. With the CTE technique, on the other hand, you can think of the sales table as a virtual table: not only can you use it in one place in the main query, you can use it in many other places. You can join it again — that means using the output of the CTE query in two, maybe three, different places in the main query. You can have another place where you also query the sales table that is only available inside our query. This is the main and most important difference between the subquery and the CTE. It's in the name: common table expression. We think of the result of the CTE as a table — we can select from it, we can join it with any other table. It's like a hidden virtual table living inside our query. The subquery is totally different: its result serves only one position in the main query, and it's used only once. That means if you want the subquery's result in three different places, you have to write the subquery three different times. So now you understand why we have CTEs and why we have subqueries. All right, so with that you have understood what a CTE is. Now the question is: why do we need CTEs in the first place? What is the main purpose of the CTE? Let's go back to the sketch. Now let's say that in our complex SQL task we have to do the following steps.
Step one: we have to join the tables together in order to prepare all the data we need for the next step. And in the second step, we have to aggregate the data — maybe we are doing summarizations. Now, in our task we also have to do different types of aggregations based on different data. And what might happen is that we have to join the same tables again to prepare the data and perform a different type of aggregation — for example the average — which will come in the last step. Now, we learned before that we can use subqueries to build this logical flow: steps one, two and three become subqueries, and the final step lives in the main query. But if we keep doing this, we're going to have a problem: we are repeating the same step more than once. We are joining the tables twice, in steps one and three, for different purposes, which leaves us with two subqueries that look exactly the same. And this is exactly the weak point of subqueries: they can introduce redundancy. That means subqueries alone will not help you eliminate all the duplicates in your code. But we still have another technique to solve this issue. So what are we going to do? We'll have only one step that joins the tables. That data is then used in step two to aggregate the data. And then we don't need step three — joining the data again — because we reuse step one, and we use the same data for step four, which aggregates the data using the average. And we can do this with the help of the amazing CTE. Now, if you compare the steps with subqueries to the steps with the CTE, you can see that with the CTE we reduce the number of steps, which can reduce the size of the query. So again: with subqueries we think about the steps from bottom to top, but with the CTE it's the other way around — we think from top to bottom. That means the first step, at the top, is joining the tables, and below it come steps two and three. And since we are reusing the join, we put it in a CTE and then use it twice, in different places, in the main query — as shown in the sketch below. So as you can see, there are a lot of benefits to the CTE. Like subqueries, we are breaking complex queries down into smaller pieces that are easier to write, manage and understand, with a logical flow from step one to step three — but with one more benefit: we reduce the redundancy in our code. We don't have to join the tables twice.
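Here is a minimal sketch of that redundancy-reduction idea: the join is written once in a CTE, and the main query then aggregates it twice — where the subquery approach would have duplicated the join (column names are illustrative, assuming the course's Sales schema):

    WITH JoinedData AS (
        -- step one: join the tables only once
        SELECT o.Sales
        FROM Sales.Orders o
        JOIN Sales.Customers c ON c.CustomerID = o.CustomerID
    )
    -- steps two and four: two different aggregations over the same CTE result
    SELECT
        (SELECT SUM(Sales) FROM JoinedData) AS TotalSales,
        (SELECT AVG(Sales) FROM JoinedData) AS AverageSales;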
Now I'm going to show you a simple example of how the CTE makes our life easier in our query. We might have to do different things — for example, we have to find the top customers. We can put this in one CTE. We might also need to calculate the top products, and we can put that in another CTE — you don't have to put everything in one big CTE, otherwise you get the same problem of one complex query. And let's say we also have to find and calculate the daily revenue; for this as well, we put it in its own CTE. Now, once we have all those parts, we can put everything together in the main query. If you look at this structure, you can see it's really easy to understand this code — it's easy to read. So CTEs improve the readability of our queries: your code is divided into clear sections, making it easier to understand what each part does. And if you keep looking at this, we have another advantage: the CTE introduces modularity. That means it breaks your code into smaller, manageable parts. So instead of writing one huge, complex query, you break it down into smaller chunks using CTEs. Each CTE is self-contained and handles a specific part of the problem, and then you can combine them all in the final query — it's like putting together a puzzle, piece by piece. And now one very important advantage of the CTE is reusability. That means we can have a result set that is used multiple times inside our query: you write the logic, the code, only once, and then use it in different places inside your query. This is very important: not only do you avoid wasting time writing the same thing over and over, it also reduces the errors and mistakes you might make when repeating the same code. Especially if you later want to change the logic — then you'd have to visit every place where you implemented it and make the change, and you might forget some places. That's why the CTE is amazing: you write the logic once and then reuse it in different places. So these are the advantages of using this technique, the CTE, inside your queries. So again, you are at the client side and you are a data analyst. You are writing a query where you define a CTE called details, with some logic inside it, and in the main query you are selecting the data from the orders and joining it with the details — with the CTE — multiple times, using multiple conditions. Now, once you execute this query, the database engine reads the query and says: aha, we have a CTE here, and it has priority. That means it executes the CTE first. Let's say that in the CTE you retrieve data from the orders table, and the orders table of course lives in the disk storage, inside the user data. Now, once the CTE is completely executed, the database engine places the results in the cache, and it names this result details — like a table name. The database engine is now done with the CTE; it grabs the main query and starts executing it step by step. The first step is to get the data from the orders. Since the orders exist in the disk storage, it retrieves them from there. Then the database engine checks the details: okay, we have it in the cache, which means we don't have to search for it in the disk storage, and it starts retrieving the data from details at high speed. Then it goes to the second step, again joining the data with the details — so again the database engine goes to the cache, finds the details table, and retrieves the data, maybe based on different conditions. And the third time we join to the details, we again get the data from the cache. So as you can see, the main query uses the result of the CTE multiple times, in different places, and retrieving all that information happens at high speed. This is one big benefit of using the CTE: utilizing the high-speed memory of the cache. Retrieving the data from the cache — from the details — is way faster than retrieving the data from the disk storage — from the orders.
Now, once the main query is completely executed, the result is returned to the database engine, which sends it back to the client side, and we see the results in the output. So that's it — it's amazing, right? This is how the database server executes the amazing CTE technique behind the scenes. All right. Now, we don't have only one kind of CTE; there are different types. Mainly there are two types: the non-recursive CTE and the recursive CTE. And for the non-recursive CTE, we have two subtypes: the first is the standalone CTE and the second is the nested CTE. What we're going to do now is deep dive into each type, and we'll start with the easiest form of the CTE, the standalone CTE. It is the simplest form. So what is a standalone CTE? It is a CTE query that is defined and used independently within the query. That means it is self-contained: it doesn't depend on any other CTE or query. So we can run the standalone CTE independently of anything else inside our query. Let's understand what this means. We have our CTE; it queries the database tables, and in the output we get an intermediate result, which can then be used by the main query. The main query queries the intermediate result and presents the final result in the output. Now, if you check our CTE, it is completely independent of everything else: it simply queries the database and has one output. Since this CTE is independent of everything else, we call it a standalone CTE. Now, if you compare this CTE with the main query, you can see that the main query cannot be executed alone, because it needs the result of the CTE. So we cannot say the main query is independent; it always depends on the CTE query. That means the CTE must be executed first, and then the main query can be executed. This is what we mean by a standalone CTE: it doesn't depend on anything else. So now we can look at the syntax of the CTE. We have a very simple query — SELECT, FROM, WHERE — a very simple SELECT statement. In order to put it inside a CTE, we use the WITH clause. It starts with the keyword WITH, then the CTE name — it's like a table name — and then we have the keyword AS, to say this CTE is defined as follows. This is the definition of the CTE, and it sits between two parentheses, an opening and a closing one. With this you are telling SQL: okay, now we are talking about a CTE, and it has a name. If a query lives inside the WITH clause, we call it the CTE query — it is where you define the CTE. Now, of course, we don't only want to define a CTE, we want to use it. So outside of this definition, we can use it like this: we say SELECT FROM the CTE name. That means we want to select the data from the result of the CTE. And it is very important here to use exactly the same name as you defined in the WITH clause. If you leave it like this, we can call this part the main query — it is the place where we use the CTE. So this is the syntax of a very simple CTE in SQL.
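In sketch form, with placeholder names, the standalone CTE syntax looks like this:

    WITH CTE_Name AS (
        SELECT Col1, Col2
        FROM Table1
        WHERE Col1 > 0          -- any ordinary SELECT logic can live here
    )
    SELECT *
    FROM CTE_Name;              -- the main query; must use exactly the defined name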
Okay, so now what we're going to do is take a task that keeps progressing through this section: we'll start with the first step and keep adding steps as we progress with the CTE. The first step in the task says: find the total sales per customer. Now, of course, since we have only one step, it makes little sense to use a CTE — but we will, since we know there will be more steps later. So let's start. Before I use any CTE, I'd like to just write our query first. We need the total sales for each customer — it's very simple. So we select: what do we need? Let's get the customer ID, and we need an aggregation on the sales: summarize the sales, and call it total sales. From which table? Since this is our first query, we have to get the data from our database — we don't have any other option. Our data lives in Sales.Orders, so let's get it from there. And don't forget the GROUP BY for the aggregation: we are grouping by the customer ID. That's it, let's execute. And as you can see in the output, nothing fancy: we are just aggregating the sales by customer. So with that, we have solved the step. But now I'd like to put my query in a CTE, because later we're going to add more steps. In order to do that, we start with the WITH keyword. Then we have to define the name of the CTE — I'm going to call it CTE_Total_Sales, like this. Afterward we say AS, and then we add the parenthesis at the start and at the end. With that we are telling SQL: this query is a CTE query. That means SQL should store the result of this query in cache, in memory, to be used later by the main query. That's our CTE, and of course what's missing is the main query, which has to come directly after the definition of the CTE. I'll just add a small comment here marking the main query, and let me make this a bit smaller, like this. Now we write a very simple SELECT FROM. And I'd like to get more details from the customers table, so I'll query the customers. Now, we are not querying the CTE yet, right? We are just querying the database table that we have. From the customers I'd like to get the customer ID and the first name, and let's also get the last name. So if we run this, what happens in the output? We are getting the data entirely from the database table, the customers, and of course we are not using the CTE at all in our main query. We could do that, but it would just waste space in memory, because SQL executed the CTE and stored the result in memory. And of course we would like to use the CTE in our main query, so let's do that. We'll do a join, but this time we join the data from the CTE. We use its name, and I'll just alias it CTS. So what we are doing now is joining the physical table, the customers, with the virtual table we created with the CTE — the one that exists only inside our query. And of course we don't only want to join the tables; we want to get the information from the CTE. So: CTS, and we need only the total sales. That means those three columns come from our database table customers, and only this column, the total sales, comes from our CTE. So let's execute the whole thing. Now, as you can see in the output, everything is working: we have the three columns from the customers table, and we have the total sales for each customer, and this total sales comes from our CTE.
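So the full query at this point would look something like the following sketch, assuming the course's Sales.Orders and Sales.Customers tables:

    WITH CTE_Total_Sales AS (
        SELECT CustomerID, SUM(Sales) AS TotalSales
        FROM Sales.Orders
        GROUP BY CustomerID
    )
    -- main query
    SELECT c.CustomerID, c.FirstName, c.LastName, cts.TotalSales
    FROM Sales.Customers AS c
    LEFT JOIN CTE_Total_Sales AS cts
        ON cts.CustomerID = c.CustomerID;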
Now, as you can see, the last customer has a NULL over here, and that's because the orders table has no orders for customer five. And now you might say: you know what, I'd like to see the intermediate result of the CTE, because what we see now in the output is the final result of the main query. So what can we do to see the result of the CTE? We mark only the query inside the CTE — of course without any parentheses or the WITH — just the query, and execute it. And with that you can see in the output the intermediate result that we pass to the main query. As you can see, customer number five is not there; that's why in the final results we get a NULL — and that's of course because we are using a LEFT JOIN. So if I execute the whole thing, you can see we get customer five over here with the NULL. As you can see, it's very simple: we just treat it as any normal database table. But this table is created from the query we defined in the CTE over here. Now, of course, inside the CTE you can use any kind of clause — SELECT, FROM, JOIN, GROUP BY, HAVING, everything you want, window functions, all the aggregate functions — but there is one restriction: you cannot use the ORDER BY clause. You cannot sort the data inside the CTE. So let's try it out: let's say ORDER BY, and let's say I want to sort by the order ID, for example. If we execute it, you can see SQL saying: okay, I cannot do that for you, because ORDER BY is not allowed in many places — you cannot use it in views, in subqueries, in common table expressions, our CTE over here. So it is not allowed: you cannot use ORDER BY in the CTE. But of course you can sort the data in the main query. If you go over here and say ORDER BY customer ID and execute, it works. So in the main query you can use ORDER BY, but in the CTE this is the one thing you cannot use. So that's it — this is our first CTE in this section. All right, so this is the simplest form of the CTE, the standalone. Now, we can have not only one CTE — we can have multiple CTEs. It looks like this: we have our database, and this time we don't have only one CTE — we have multiple CTEs in our query, and each CTE goes directly to the database and queries it in order to prepare an intermediate result. In this example, four CTEs go to the database and prepare four different intermediate results, and of course SQL executes them from top to bottom: first CTE 1, then 2, 3, 4 — but they have nothing to do with each other. Now, once we have all four intermediate results, the main query retrieves all that information and does some magic in order to prepare the final result for the end user. By looking at this sketch, you can understand that all those CTEs are independent of each other — there is no nesting or anything. Each CTE is self-contained and could be executed on its own, without depending on any result from any other CTE or any other query; it goes directly to the database and gets the data. That's why all of them are standalone CTEs. And since we have multiple of them, we have multiple standalone CTEs. That's it — it's simple. So now let's check the syntax of multiple standalone CTEs. We start by writing our first CTE: it starts with the WITH clause, then we have the CTE name, and then the logic of our CTE.
So nothing new — this is how we define the CTE. And then, in order to use it, we have our main query, where we select from our new CTE, making sure we are using the exact name of our CTE. Nothing new. Now, in order to add another CTE to our query, we go below the definition of the first CTE and start defining CTE number two. But this time, as you can see, we are not using the WITH clause — we are using a comma. That means only the first CTE uses the WITH clause, in order to tell SQL we are talking about CTEs; all the other CTEs are separated with commas. So the syntax is: a comma instead of WITH, then the name of the CTE, and then we say AS, followed by the definition — here we write the query of the second CTE. Now, of course, if you want to add more CTEs, you use a comma below it and define the third CTE as well. You can have as many CTEs as you want, always separated with commas, but only the first CTE starts with WITH. And of course, in the main query we can use the results from CTE 2 as well — for example, here we are joining the data between CTE 1 and CTE 2. So as you can see, in the main query we are collecting the data from these different CTEs in order to do the final step. It starts with WITH, so SQL understands: okay, now we are talking about CTEs. Once SQL sees a comma after a closing parenthesis, SQL understands: okay, now we are talking about another CTE. And if there is no comma after the parenthesis, SQL understands: we don't have any more CTEs — the next query is the main query. So this is how you create multiple standalone CTEs. All right, so now back to our task, where we are creating a report step by step. The second step says: find the last order date for each customer. So we have to add one more piece of information about our customer: when the last time was that the customer ordered. How are we going to do it? We have to add this to our query, and I'd like to use a CTE for this logic as well. As we learned from the first task, this is the first step, finding the total sales for each customer, and here we have the main query. Now I'd like to put another CTE in between. As we learned from the syntax, we have to add a comma — we cannot use WITH again — and give it a name. So let's call it CTE_Last_Order, and we have to define it: AS, and then the pair of parentheses. And in between we add our logic. Now we can focus only on this logic — forget about the other CTE and the main query. We have to find the last order date for each customer, so we're going to query the orders table again. What do we need? We need the customer ID, and we need the order date, from our table Sales.Orders. That's it for now — let's just select it and execute. With that you can see all the customers as well as all the orders. But we would like the latest order for each customer, and for that we can use our aggregate function, the MAX function. So, like here at the top, we use the function MAX and group by the customer ID. Let me just shift this like this, and let's give it the name last order, like this.
And as you can see, I'm selecting only my new query — not everything — and I keep executing it just to check the results before we integrate it into the main query. So now, as you can see, we have one row per customer, with the latest order date for each. With that, we have solved this subtask. As you can see, it's really easy to extend: I just make another box and put inside it the business logic I want, and that solves one piece of the whole task. So you can feel the power of the CTE now: we are building complex logic, but it's still easy to extend. Imagine not doing this and instead always extending one big query — it would be really hard to extend, and that's why a lot of SQL developers really love using CTEs and use them in nearly every query or task they have. So we have solved this step, and now we have to integrate it into the main query. It's going to be very simple: we come over here and just add another join. We join with the CTE — and as you can see, SQL offers it as a table, even though it's not a physical table that exists in our database. It lives only inside our query, but SQL still treats it as a table. And this is exactly what we are doing: we treat this information as a table. So, CTE_Last_Order, and I'll call it CLO. And then, of course, we do the same condition as here: the CLO customer ID should equal the customer ID from the first table, the customers. And of course we have to add this new information to the main query: CLO, the last order. So now let's execute the whole thing — we now have two CTEs plus our main query. Let's execute. Now, again, let's check the data: the first three columns come from the physical table customers; the fourth one, the total sales, comes from our first CTE over here; and the last order comes from our new CTE that we just defined, CTE number two. So as you can see, guys, everything feels organized and structured, and we have a flow. And of course those CTEs are standalone CTEs, so we can always select a CTE and execute it separately — it doesn't need anything from outside the query, just the tables inside your database. So again, guys, pay attention: if you want to add more CTEs, use the comma. You cannot use another WITH here, for example — if I execute that, I get an error. You have to separate them with this comma. And another mistake that I make frequently: I forget and add a comma after the last CTE, which happens to me when I'm using a lot of CTEs. If I do it like this, I also get an error, because the main query must not be preceded by a comma — the last CTE should not have a comma after its closing parenthesis. So I just remove it and execute. So guys, with that we now have multiple CTEs inside our query.
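To recap, the query with both standalone CTEs now looks roughly like this:

    WITH CTE_Total_Sales AS (
        SELECT CustomerID, SUM(Sales) AS TotalSales
        FROM Sales.Orders
        GROUP BY CustomerID
    ),
    CTE_Last_Order AS (                  -- comma, not WITH, for every CTE after the first
        SELECT CustomerID, MAX(OrderDate) AS LastOrder
        FROM Sales.Orders
        GROUP BY CustomerID
    )
    SELECT c.CustomerID, c.FirstName, c.LastName, cts.TotalSales, clo.LastOrder
    FROM Sales.Customers AS c
    LEFT JOIN CTE_Total_Sales AS cts ON cts.CustomerID = c.CustomerID
    LEFT JOIN CTE_Last_Order AS clo ON clo.CustomerID = c.CustomerID;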
All right, so now, what is a nested CTE? It is a CTE inside another CTE — kind of like subqueries, a query inside another query. So not only can the main query use the result of a CTE; another CTE can use the result of a CTE as well. And of course, a nested CTE, like a main query, depends on another query: you cannot select it and run it independently of the rest of the query. You always have to run the CTE inside it first before seeing the result of the nested CTE. Okay, so now let's understand what this means. Again we have our database, and we have a CTE query that goes directly to the database, queries the data from there, and outputs an intermediate result. Now, in this scenario, we will not have only one intermediate result, because we have many different steps: we need another intermediate result before everything is prepared for the main query. That means we have another step that is built on top of the first intermediate result — another CTE that queries the result of the first CTE and builds another intermediate result on top of it. So as you can see, here we have CTE 1 and CTE 2, which means we now have two intermediate results. And of course we could add CTE 3, 4, and so on. But let's say that CTE 2 prepares the final intermediate result for the main query. The main query then queries the second intermediate result and does the final step, where the final result is presented to the user. And of course, if needed, the main query can access not only the second intermediate result from CTE 2, but also the first intermediate result from CTE 1. Now, we call the first CTE a standalone CTE, because it doesn't depend on any intermediate result — it goes directly to the database and gets the data. But since the second CTE depends completely on CTE 1, this time we call it a nested CTE: we cannot execute it on its own; it always depends on CTE 1. And of course the main query depends on everything. So as you can see, using CTEs we are building a chain. This is what we mean by standalone CTEs and nested CTEs. Okay, so now let's understand the syntax of the nested CTE. We start as usual with the definition of the first CTE: the WITH clause, then the name of the CTE, then the definition of the CTE. Nothing new here. Now we define the second CTE as we learned: a comma, then the name of the CTE, and the definition. This is our CTE number two. Now, the second CTE should depend on the result of the first CTE. How are we going to do it? It's very simple: for CTE number two, we select the data from CTE number one. And with that, we make the second CTE depend on the first one. This means the second CTE gets its data from the first one and queries it in order to do the second step. With that we are nesting one CTE inside another, and CTE 2 depends completely on the first one. So again, we call the first CTE a standalone CTE because it doesn't depend on anything — we can execute it on its own, and it just needs the data directly from the database. But the second CTE, since it depends completely on CTE number one, we call a nested CTE. They look very similar; we are just selecting the data from CTE number one. And now comes our main query. Of course, it uses the data from the second step, so it selects the data from CTE number two. But it's not a rule: it can also access and select the data from CTE number one. So this is how we can create a nested CTE in SQL.
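In sketch form, the nested CTE syntax looks like this — CTE2 selects from CTE1 instead of from a database table:

    WITH CTE1 AS (
        SELECT CustomerID, SUM(Sales) AS TotalSales
        FROM Sales.Orders            -- standalone: reads directly from the database
        GROUP BY CustomerID
    ),
    CTE2 AS (
        SELECT CustomerID, TotalSales
        FROM CTE1                    -- nested: depends entirely on CTE1
    )
    SELECT *
    FROM CTE2;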
All right guys, back to our project, where we are creating a report about the customers, and we would like to add one more step. The task is: rank the customers based on the total sales per customer. This is one more step inside our project, and we would like to use CTEs to implement this step as well. So what do we need? We need to rank the customers based on the total sales for each customer. There are really two steps here: first we have to calculate the total sales per customer, and then we have to rank based on this information — and of course the sales are stored inside the orders. So now let's start implementing the CTE. We add a comma, and we'll call it CTE_Customer_Rank, then AS, then the parentheses, and inside them we now develop the logic. First, we have to aggregate the data by the total sales: SELECT customer ID, then SUM the sales, FROM the table Sales.Orders, and then of course GROUP BY the customer ID. And now I can hear you telling me: we have already done this — we already have this logic, so why are we repeating it? If you go to the first CTE, you can see we have already done that. And you are totally right: we already have the logic, so it makes no sense to repeat it. And if we did, we wouldn't have understood the power of the CTE. We don't have to repeat the same logic — we can reuse a CTE inside another CTE. So we don't need all of that; we can focus immediately on ranking the customers. First, let me just select the data from the first CTE. So what do we have? We have the customer ID, and we have the total sales. And this time we select them not from any physical table — we select from our CTE, like this. And now let's select the whole thing and execute it. Well, this is the issue of nesting CTEs: sadly, this CTE depends completely on the first CTE, so we cannot execute it on its own. And this is of course very annoying, because each time I execute the query, at the end of it SQL destroys all the CTEs — so in memory we won't find the CTE, and that's why, once execution finishes, SQL doesn't know anything about this CTE anymore. In order to see the result of this, we always have to execute it together with the CTE it uses. So what I usually do: I go over here and comment out everything in the main query, and now I can execute the whole thing and see in the output the result of this nested CTE. This is the big difference between standalone CTEs, like the first one here, and nested ones. So now let's go back to our task: we have to rank based on the total sales. We can use the RANK function from the window functions: RANK() OVER — and we don't have to partition the data; we just want to sort the data by the total sales descending, like this, so the highest sales gets rank number one. Let's give it the name customer rank. Now, as you can see, we have a really nice rank beside this information: customer three has the highest sales, and customer two has the lowest total sales. So with that, as you can see, we didn't repeat ourselves — we just reused another CTE in our current CTE. And this is exactly why this technique is amazing for reducing redundancy and reducing the complexity of the whole query. So nested CTEs are annoying to execute, but they reduce the redundancy in our code. Now we are done with our logic — we tested everything.
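Sketched on its own, together with the first CTE it depends on, the ranking step looks like this:

    WITH CTE_Total_Sales AS (
        SELECT CustomerID, SUM(Sales) AS TotalSales
        FROM Sales.Orders
        GROUP BY CustomerID
    ),
    CTE_Customer_Rank AS (
        SELECT CustomerID, TotalSales,
               RANK() OVER (ORDER BY TotalSales DESC) AS CustomerRank
        FROM CTE_Total_Sales         -- reuses the first CTE instead of repeating the aggregation
    )
    SELECT *
    FROM CTE_Customer_Rank;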
So what we're going to do now is integrate it into our main query. Let me just remove the comments from here, and let's add it to the main query. We'll do the same thing: a LEFT JOIN with the last CTE we just created — let me just alias it CCR — and the same condition; we are always joining on the customer ID. But don't forget to rename the alias: it is CCR customer ID equal to the customer ID from the first table. And of course we have to select the new information: CCR dot customer rank. And now let's execute the whole thing. As you can see in the results, those three columns come from the customers table, the total sales comes from the first CTE, the last order from the second CTE, and the customer rank comes from the nested CTE we just created. So guys, creating such a report is not a simple task — it involves different aggregations and different functions — but our work is organized. As you can see, it's very simple: we have step one, step two, step three, and the main query, and it's really easy to add more components to our query. Now, I'd really like to keep practicing with these nested CTEs, so we have the following task: we would like to add one more step to our report — segment the customers based on their total sales. I'd like to implement this using a CTE as well. So let's solve it. We want to add a new CTE: it's going to be CTE_Customer_Segments, then AS, and then we have to define our logic. Now, if you check our task, it has two parts: we have to find the total sales, and then we have to segment the customers based on this information. So it is very similar to what we did in step three. That means we don't have to calculate the total sales again — we can reuse our amazing first CTE as well. So let's do it. What do we need? We need the customer ID, like this, and let's do a basic segmentation using CASE WHEN. So let's say: CASE WHEN the total sales is higher than 100, then the customer belongs to the group High. And let's add another category: if it's not higher than 100, but it is higher than 50, then the customer belongs to Medium. And if the total sales is less than or equal to 50, what happens? We say ELSE: the customer belongs to the Low category. That's it — we close with END, and let's call it customer segments. All right, but of course we have to select from a table, and it's going to be our CTE: the total sales, from our first CTE — and let's put all of this in our new CTE. Now, I'd like to test it before putting it inside our main query. That's why I'll put everything in the main query in comments — since it is, sadly, a nested CTE — and we'll just select our new nested CTE, like we did before. So let's execute it. Now, as you can see in the output, we have two customers in the category High and two customers in Medium. But in order to make sure everything is working perfectly, I'd like to add the total sales column just to see the numbers. Let's execute. Well, you can see everything is correct: those customers have more than 100 in total sales, and those two have more than 50. But let's change things around: I'd like to set the Medium threshold to 80, just so we get a Low. With that, customer number two has sales lower than 80, and that's why we are getting the segment Low.
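Sketched in isolation with the CTE it depends on, and using the adjusted thresholds from the walkthrough (above 100 for High, above 80 for Medium), the segmentation step looks like this:

    WITH CTE_Total_Sales AS (
        SELECT CustomerID, SUM(Sales) AS TotalSales
        FROM Sales.Orders
        GROUP BY CustomerID
    ),
    CTE_Customer_Segments AS (
        SELECT CustomerID, TotalSales,
               CASE WHEN TotalSales > 100 THEN 'High'
                    WHEN TotalSales > 80  THEN 'Medium'
                    ELSE 'Low'
               END AS CustomerSegment
        FROM CTE_Total_Sales         -- again reusing the first CTE
    )
    SELECT *
    FROM CTE_Customer_Segments;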
Everything is done, and we have segmented the customers into different categories, so I don't need to test anymore. Let's integrate it into our main query. We'll do the same thing over here: we say LEFT JOIN, and we get our new CTE — CCS — and we do the join condition. Don't forget to change the alias. And we have to select our nice new information: it's going to be the customer segment. And now we can execute the whole thing. So we now have four different CTEs and one main query. And in the output we get all three pieces of information from the customers table, then the first CTE, the second, the third, and this new column we just created. So again, we did this using a nested CTE, like this — let me just add it — and it was really easy to extend and to add to our report. All right guys, so with that we have done a mini project where we analyzed the customer information from different aspects of our data, and we did it step by step. Now you have a feeling for how to write complex SQL queries with the help of the CTE. As you can see, if you go through the script, you can understand: okay, it is divided into multiple steps, and each block is responsible for one specific piece of the whole report. This is exactly the power of the CTE: it introduces modularity. Each CTE is self-contained and addresses one issue, and this is an amazing way to organize your project using SQL and to structure your work. All right, my friends. So now let's have a little break, in order to have a real talk about the CTE. But first, some coffee. Now, I can say that I have been working with SQL for a really long time — over 15 years — and I have met a lot of SQL developers on different projects. And if there is one thing all those SQL developers love, it is the CTE — they love using it everywhere. Every time they write a query, they write a CTE. And of course that's fine, it's not a bad thing, but the problem is that they overuse it — of course not all of them, but a lot of SQL developers overuse the CTE. The CTE is very powerful, but with power comes responsibility — remember: with great power comes great responsibility. So my advice for you, especially if you are new to CTEs: try not to add a new CTE every time you do something new. I've seen it a lot: for each new calculation, for each new column, developers immediately jump in and create a new CTE. What happens in the end? We can have a massive number of CTEs inside one query, and the developer thinks everything is now organized and easy to read — but believe me, it's exactly the opposite. If you open any code and there are a lot of CTEs, especially if they are nested CTEs, it is very hard to understand what is going on. Even if the developer describes each CTE and its task, it's going to be really hard to understand as well as to read, if everything is nested and you have, I don't know, 20 CTEs in one query. It becomes nearly impossible to read and to understand, you'll be using a lot of memory, and you might get bad performance. So my advice for you: as you create new CTEs, always think about whether you can merge two CTEs into one. It is always important to rethink and refactor your CTEs, in order to consolidate them and reduce the number of CTEs.
But now, if you ask me how many CTEs are okay in one query — well, I don't have a magic number for that, but normally I tend to say between three and five CTEs is fine: it's still easy to understand and to read, and so on. But once you get more than five CTEs, you have to rethink your code — maybe you create another, separate query, so you don't have to put everything in one query. So this is my advice for you: try not to overuse CTEs in your projects. Don't add one for every single step; always refactor your CTEs, consolidate them, and try not to have more than five CTEs in one query. That's my advice for you: be responsible using the CTE. And now let's get back to our course. So with that, we have learned the standalone CTE and the nested CTE, and both of them belong to a type called the non-recursive CTE. So what is a non-recursive CTE? It means it is a CTE that is executed only once — there is no repetition, no looping, nothing. SQL executes it in one go, and that's it. The recursive CTE, on the other hand, is exactly the opposite. A recursive CTE is a self-referencing query that repeatedly processes data until a certain condition is met, and we usually use the recursive CTE when we have a hierarchical structure and we want to navigate and travel through the hierarchy. I know this might be confusing, but don't worry about it — we're going to work through very simple examples. Now, again we have our tables in the database, and we have a CTE. The query of the CTE is executed a first time, and in the results we have the initial data from the CTE — but that is not everything yet. This intermediate result is not yet ready for the main query; instead, control goes back to the CTE, and the CTE checks whether the current result meets a specific condition. Now, if the check says no, it's not meeting the condition, what happens? The CTE query is executed a second time. So as you can see, we are looping through the CTE. The result of the second iteration — the second execution — is added to the intermediate result, so the intermediate result now has more data. And again, before the main query can use it, the CTE checks: does the result fulfill the condition? If it's still no, the CTE executes again — a third iteration — and new data is added to the intermediate result. Then it is checked again by the CTE: did we fulfill the condition? If the answer is yes, the loop breaks — there will be no fourth iteration of the CTE, and the CTE will not be executed again. With that, the CTE says: okay, I'm done — this is the final intermediate result, and it is ready to be used by the main query. From here, nothing new happens: the main query retrieves the data from the intermediate result and does some magic in order to prepare the final results. So there is no iteration or looping inside the main query — the looping happens only in the CTE, and that's why we call it a recursive CTE.
So now, if you compare it with the other types: all the other types go in one direction and the CTE is executed only once, but the recursive CTE keeps looping until the condition is met, and only then does it forward the data to the main query. We normally use the recursive CTE when navigating a hierarchical structure: if your data contains hierarchies, you can use the recursive CTE in order to navigate through them. So this is the recursive CTE. Okay, so now let's check the syntax of the recursive CTE. It is a little bit complicated, but we're going to do it step by step. So what do we have? We have a query and we would like to put it in a CTE. We start with the usual stuff: the WITH clause, the name of the CTE, then AS, and then the query. This is the definition of our CTE. But if you leave it like this, SQL is going to execute it only once, and we would like to make a loop, an iteration. In order to do that, we have to define a second SELECT statement inside our CTE. In this second query, we define a breaking condition, a condition that breaks the loop; otherwise it would loop forever, or the system would break. You can put this condition in the WHERE clause, or even in an INNER JOIN, because both of them filter the data, and you can use either one to break the loop. All right, but there is still something missing: how are we going to make things loop? Well, we have to make the CTE reference itself. So the second query is going to select the data from the same CTE; that means we now have a query that is querying itself, and this is exactly what we want: we want iterations, a loop, and that's why we reference the CTE to itself. Now, in SQL you cannot have two SELECT statements in one query like this; you have to connect them somehow, so we use UNION ALL, or UNION, depending on whether you want duplicates or not. We call the first query the anchor query. The anchor query is the first query that interacts with the database and provides the initial intermediate result; it is the starting point of the iteration, the first step in the process, and it is executed only once. We call the second part the recursive query, and we call it that because this query is executed multiple times: it keeps repeating and adding data to the intermediate result until the condition is met, or, let's say, until there is no more data available to process. For the main query, nothing changes: we use the CTE name in the main query as usual. So think about it like this: SQL executes the anchor query only once, then it goes through the recursive query, looping and iterating until a certain condition is met, and then SQL exits the CTE. That is what we actually mean by the anchor and recursive queries.
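To summarize the shape of the syntax, here is a tiny sketch. This example is not from the video; countdown is a made-up name, and it just counts down from 3 to 1 to show the pattern:

-- Minimal recursive CTE, just to show the shape of the syntax.
WITH countdown AS (
    -- Anchor query: executed only once, provides the starting row.
    SELECT 3 AS n
    UNION ALL
    -- Recursive query: references the CTE itself, so it repeats.
    SELECT n - 1
    FROM countdown
    WHERE n > 1  -- breaking condition, otherwise the loop never ends
)
-- Main query: consumes the final intermediate result of the CTE.
SELECT n FROM countdown;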
All right, so now let's have a simple task in order to understand the recursive CTE. The task says: generate a sequence of numbers from 1 to 20. Let's do it step by step. That means we have to create a loop from 1 to 20, and after 20 the loop should stop. So let's go and do it. The first step of the recursive CTE is to build the anchor query. The anchor query is responsible for the first iteration, that is, the first row of the output. So what is the first value between 1 and 20? It is one. Let's write a query that generates the value one: SELECT 1, and I'm going to give it the name MyNumber. That's it; let's execute it. Now you can see in the output the first member of our sequence, and this is exactly the task of the anchor query: it provides the first step of the iteration. So let's call it the anchor query. Next, we have to build the iteration, so we need a CTE. We say WITH, we call it Series, then we put everything in parentheses, and then comes the main query, where we select everything from the Series CTE. Let's execute it, just to make sure everything works. We haven't created any loop yet; we have just created a CTE on top of the anchor query, and we call it from the main query. Now we come to the second step of building the recursive CTE: we have to build the recursive query. So let's do it; I will just make this a little bit smaller. Before we start writing the query, we use UNION ALL in order to connect the anchor query with the recursive query, and let me mark this as the recursive query. So how are we going to build it? Let's start with the SELECT. The first thing I usually do is make sure we are actually making the CTE recursive, so I select FROM the name of the current CTE, so that the CTE references itself; that makes it recursive and enables the looping. Now comes the tricky part: we need to create the sequence. What is the current value? It is one. And what do we need? The second value in the sequence, which is two. We could do that with 1 + 1, and if you write it like that, you get the output two, but what we are really doing is taking the current value and adding one in order to generate the next value. So instead of writing the literal one, we take MyNumber, the current value, and add one to it in order to generate the next value in the sequence. That means MyNumber always holds the current value, and the operation + 1 generates the next one. With that, we are generating the sequence of numbers. Now, if you execute it like this, let me just run it, it is going to break, because SQL will not allow it: SQL Server limits recursion to 100 iterations by default, and beyond 100 it stops the query so that we don't loop infinitely. This is bad because we didn't define the breaking mechanism for the loop. So we also have to define in the recursive query how the loop is going to end, and we usually use a condition. For example, we can use the WHERE clause and say: keep looping and keep generating, but always check whether the value of MyNumber is less than 20. And you might ask: shouldn't it be less than or equal to 20?
Well, no, because if you make it less than or equal to 20, then once MyNumber is equal to 20, you allow one more iteration, and you get 21 in the output. That's why we compare with less than 20. So now let's execute it and check the sequence: it starts with 1, 2, 3, 4, 5, and goes on until we reach 20. With that, we have solved the task. Again, it's not that hard, right? We just provide the initial step, and then we provide the loop, where we define inside it how the loop is going to end. Now, there is one more thing you can do with the recursive CTE: you can define the limit of iterations. For example, you can tell SQL: if this iterates more than 10 times, then break and stop. So you can define for SQL the maximum number of recursions. How can we do that? We do it in the main query: you go over here and write OPTION, then parentheses, then MAXRECURSION, and after that you define the limit; for example, let's go with 10. Of course, our code now iterates more than 10 times, but we are making the rule that it should not iterate more than 10. So let's execute it. Now we can see that SQL breaks and says that the maximum recursion of 10 has been exhausted. As you can see in the output, we are getting an error for having more than 10 iterations, which is not allowed. With that, you can control how many recursions are allowed. Let's say you would like to have a thousand iterations: if you go over here and say you want a sequence up to 1,000, and, let me just comment this out, you execute it, you will get an error, because the default is 100. But of course you can increase the maximum recursion; for example, let's go with 5,000, and now it will work and you will get a sequence of 1,000. So with this, you can control how many iterations are allowed in your query, so that you have control over it. Putting all the pieces together, the whole script looks like the sketch below.
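This is a sketch of what was typed on screen; Series and MyNumber are the names used in the video, and the OPTION hint is optional:

-- Task: generate a sequence of numbers from 1 to 20.
WITH Series AS (
    -- Anchor query: the first value of the sequence.
    SELECT 1 AS MyNumber
    UNION ALL
    -- Recursive query: take the current value and add one.
    SELECT MyNumber + 1
    FROM Series
    WHERE MyNumber < 20  -- breaking condition: stop before generating 21
)
-- Main query
SELECT MyNumber
FROM Series
OPTION (MAXRECURSION 100);  -- 100 is the default; raise it (e.g. 5000) for longer sequences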
Okay, so now let's understand step by step how SQL executed this recursive query; here we have a flow diagram of the process, the steps of executing the recursive query. At the start, the first step is to run the anchor query. Our anchor query is just a SELECT of the value one, so in the output we get the value one in MyNumber, and as you can see, the anchor query is executed only once; there is no iteration there. SQL executes it once and then goes to the next step: executing the recursive query. So what happens there? We take the current value of MyNumber, which is one, and add one to it; 1 + 1 gives us two from the recursive query, which is added to our result. Then SQL checks the condition: is MyNumber smaller than 20? Yes, it is, and since the condition is true, SQL re-executes the recursive query. Now we are doing the second iteration: it goes back to the recursive query and asks, what is the current value of MyNumber? It is two, so 2 + 1 gives us three in the second iteration. As you can see, each time the recursive query is executed, it adds more values to our result. The same question is asked again: is MyNumber smaller than 20? Yes, it still is, so SQL re-executes the recursive query again. SQL keeps looping, iterating, and adding values to the output until we reach the value 20. Then SQL asks: is MyNumber, now 20, smaller than 20? No, the condition is false, so the chain breaks and we don't loop anymore. That is the end of the CTE, and this is the final result that the main query is going to use. So this is how SQL executed this recursive CTE. Okay, so now let's have another task for the recursive CTE, and this time it's going to be a little bit more advanced. The task says: show the employee hierarchy by displaying each employee's level within the organization. That means for each employee, for each row, we have to show a level that tells us where the employee sits in the hierarchy. First, let's explore the table employees: SELECT everything FROM sales.employees, and execute it. Looking at the results, we have some information about each employee: the department, the gender, the salary, and here we have the key: the manager ID. This column is a self-reference to the same table. For example, for the first employee the value is NULL; that means this employee has no manager, which makes this employee the big boss, the CEO. The next two employees have a manager ID of one. Who is their manager? It's the first row, employee ID number one. So employee number one is the boss of those two. For the fourth one, we can see manager ID number two, so the manager of Michael is actually Kevin, the second row. And for Carol, the manager ID is three; that means Mary is the manager of Carol. This is exactly the kind of information we can use with a recursive CTE in order to create a loop. So let's do it step by step. First, we start with the anchor query, as usual. Here, the first record is going to be the highest manager, the CEO, right? In order to select only that first record, we can say WHERE manager ID IS NULL. Let's execute it, and with that we have the first row, which we can use as the first step of our iteration. Now let's pick a few columns in the SELECT: the employee ID, the first name, and let's also get the manager ID. And now we have to start creating the levels, right? This is the first level, so I'm going to select the value one AS level. Our CEO gets level number one. Let's execute it: as you can see, Frank is the CEO and he is at level number one. So this is our anchor query. Now we have to do the iteration, so we start creating the CTE. Let's call it CTE_Employee_Hierarchy, then AS, and this is the definition of our CTE. And of course, what do we need? The main query, where we select everything from our new CTE. Let's test it. All right, so now we have prepared the CTE and the main query, and of course the next step is to build the recursive query. First we need the UNION ALL in order to connect the two queries, and now we can start building the logic.
So now we want to find all the employees whose manager is employee ID number one, right? Because they are going to have the second level in the hierarchy. So what we do: we SELECT the same stuff, the employee ID, the first name, and the manager ID, and we need the level; this is going to be level number two. It's not correct yet; I just want to show what it means, because we still need to get the employee ID, the first name, and so on. We cannot get them from the CTE yet, because the CTE contains only one employee so far; we still have to go to the database and grab the next employees. So I will give the table an alias, E, and select from the employees as well. So far we are not doing anything recursive; in the recursive query we are still just querying the database. But we don't need all the employees from this table, only the ones whose manager ID equals one. We could do that with the WHERE clause, for example, and say manager ID equals one; if I select just this part and run it, we get the two employees whose manager is the CEO, the top manager. But of course we cannot hardcode it like this. What we are going to do instead is join this table with our current CTE in order to make the loop. Let me show you what I mean. We remove the WHERE, we use an INNER JOIN, and we reference the CTE itself, giving it the alias CH, and we connect them like this: ON the manager ID of the employee equals the employee ID from the CTE. The employee ID at the start is number one. So we are connecting the manager ID with the employee ID, and at the same time we are reusing the CTE inside itself in order to make the iterations. And here we don't need a WHERE clause, because the INNER JOIN filters the data automatically: as we learned, the INNER JOIN shows only the matching rows from the left and the right tables, so the filtering happens there. We are almost there, but of course we don't want to hardcode the level as two. Instead, we write it as level + 1: the current level is one, so the second iteration gives two, and the third iteration gives three. So I think we have everything for our iteration; let me just check and make this smaller. Again: here we have our anchor query, which is only for the top-level manager; here we are connecting the managers with the employees, reusing the CTE to create the effect of the loop; and we are using the INNER JOIN in order to break the loop once there are no more rows to process. So let's go and execute it. The full script we just assembled is sketched below.
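This is a sketch of the complete query; the table sales.employees comes from the course database, while the exact column names EmployeeID, FirstName, and ManagerID are assumptions based on the walkthrough:

-- Task: show each employee's level within the organization.
WITH CTE_Employee_Hierarchy AS (
    -- Anchor query: the top manager, the only employee without a manager.
    SELECT EmployeeID, FirstName, ManagerID, 1 AS Level
    FROM Sales.Employees
    WHERE ManagerID IS NULL
    UNION ALL
    -- Recursive query: employees reporting to someone already in the result.
    SELECT E.EmployeeID, E.FirstName, E.ManagerID, CH.Level + 1
    FROM Sales.Employees AS E
    INNER JOIN CTE_Employee_Hierarchy AS CH
        ON E.ManagerID = CH.EmployeeID  -- the join also ends the loop once nothing matches
)
-- Main query
SELECT *
FROM CTE_Employee_Hierarchy;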
Now let's check the output. This is our top manager, at level one; this information comes from the anchor query. Then the second iteration finds the employees whose manager ID equals one, so those two employees; in our hierarchy, they are the second level of the organization. And in the third iteration, we search for all employees whose manager ID equals either two or three, which results in those two employees, Carol and Michael, because their manager ID is two or three, and they get level three. After that, SQL tries to search for employees whose manager ID equals four or five, finds nothing, and that's where the loop breaks. So with that, we have solved the task. All right, I totally understand if this feels complicated, but now we're going to go through it step by step in order to understand how SQL executed this and why we built it this way. Again, we have our flow diagram: we start by running the anchor query, then the recursive query, and then there is a check; if the check fails, we iterate, otherwise we end. So let's do it step by step. Here we have the table employees, and beneath it the result of the CTE. The first step says: run the anchor query, and run it only once. So SQL goes to the anchor query and starts executing it. We are selecting from the table employees with a filter on the manager ID: the manager ID must be NULL. That means we get the record of Frank; Frank goes to the output, and we say the level of this employee is one. So this is the output of the anchor query, and it will never be executed again. Now to the next step: we run the recursive query. In the recursive query we are saying: I would like to select data from the employees table and join it with the CTE result, and the join must be an INNER JOIN, so only the matching data between the CTE and the employees. Now comes the join condition, and this is the key for the iteration: the manager ID of the employee must match the employee ID from the CTE. So SQL joins the table with the CTE. At this point, the CTE contains only employee ID number one, and SQL goes row by row searching for matches. For the first row there is no match, because the manager ID is not equal to one, so it is not included in the result. For the second row, the manager ID equals one, which matches the employee ID, so SQL takes it and puts it in the output. Not only that, SQL increases the level: the current value is one, so level + 1 gives us two. We are still in the same iteration; we are not iterating again yet. This is the first iteration of the recursive query, and it continues until the whole join is done. For the next row we have a match as well, because the manager ID equals one, and the same thing happens: the level is also two, because the value of the level didn't change; the current value is still one. This keeps going; the remaining rows produce no more matches, and with that, SQL is done executing the recursive query. All right, so now SQL asks: did we process everything? Well, no, we still have missing employees in the output; the condition is not fulfilled, so we run the recursive query again. In the second iteration, SQL again joins the CTE result with the employees by matching the manager ID and the employee ID, but this time it focuses only on those two IDs, the two and the three.
So SQL goes and finds any matches where the manager ID equals two or three, and it does it step by step. The first row is not a match, the second one is not either, and the third one is not, because its manager ID is one. But for employee number four we have a match, so SQL takes it and puts it in the output. And what is the current level in this iteration? It is two, and we add one to it, so we get three in the output. Then SQL keeps going: employee number five has a manager ID of three, so SQL takes it as well and puts it in the output of the CTE, and again the current level is two + 1, so also three. With that, SQL is done joining the tables, and it asks again: did we process all the employees? Yes, this time it's true, which means we don't need any more iterations, because another iteration would find nothing. For example, if we were joining on the IDs four and five, SQL would search the manager IDs for four and five and find nothing, so nothing would be added to the CTE. That's why SQL stops. We have a complete result, with all the data from the employees in the output, and this result is passed to the main query. So this is why we built it like this, and this is how SQL executed this recursive query. Now I would like to visualize for you what this level, this structure of the organization, means. The hierarchy looks like this: at level one, the top manager, we have Frank. Then we go to level number two, where we have two employees, Kevin and Mary; they work together and their boss is Frank. Then we have Michael, who reports directly to Kevin, because here the manager ID two matches the employee ID two, and Carol, who reports to Mary; both Michael and Carol are at level three. So this is what we mean by the level: it can help us identify which employee sits at which level of the organization. If you have a hierarchy in your data, and you can see that within one table things reference each other, like here, where the manager ID is actually an employee ID, that means there is a hierarchy, a structure, in this table, and you can use the recursive CTE in order to build those levels and to navigate through the hierarchy. All right, so that's all for the recursive CTE, and with that we have covered all the different types of CTEs that we have in SQL. So now let's have a quick recap. We have learned that the CTE, the common table expression, is a temporary named result, like a virtual table, that can be used from different places in the query, and that the CTE has a lot of advantages. The main one is that it breaks the complexity of a query into multiple small pieces, which makes our query much easier to read and to understand; it improves readability. Another advantage of the CTE is that those small pieces are really easy to manage and to develop; they are self-contained, which makes our queries more modular.
So it introduces modularity inside our queries. We also learned that the CTE helps us reduce redundancy inside our queries, because it makes the result of one query usable in multiple places within the same query; it makes our code smaller and reduces redundancy. One more advantage of the CTE is that it enables looping and iterating in SQL, using the recursive CTE. We have also understood that we can treat the CTE result like any other physical table inside our database, with one exception: this table lives only within one query, so we cannot query the CTE from an external query. We have learned that the result of the CTE can be used from the main query, which is the classic case, but it can also be used in another CTE, which leads to nested CTEs. And of course, we have learned that a CTE can use its own result within itself, which makes the CTE recursive and allows looping and iterating. And I can only keep recommending: do not use more than five CTEs in one query; otherwise you get the exact opposite of the benefits of CTEs, where your code becomes really hard to understand, to read, and even to extend. Okay, my friends, with that we have covered this amazing and very important technique in SQL, the common table expression, the CTE. Now, in the next step, we're going to talk about a new type of object that you can use in databases. We don't only have tables; we also have views, and views are amazing for giving you dynamics and flexibility in your projects. So let's talk about views. Now, a view is not just a query that we run in SQL; it is an object that we can find in the database. But before we jump straight to the view, I would like to give you the big picture, the whole structure of a database. So let's go. We have a hierarchical structure, and the highest level of this hierarchy is the SQL Server. The SQL Server manages multiple databases; it's like the control center that keeps everything running and accessible. Inside the SQL Server, we have multiple databases. A database is a collection of information stored in a structured way; it's where all your data is kept, organized in different tables and objects. Each database is separated from the others and has its own data. Inside each database we can find multiple schemas. A schema is a logical way of grouping related objects, like tables and views, together within a database. For example, if you have a database called sales, we can group different tables about the orders underneath a schema called orders, and maybe we have multiple views and tables about the customers that we put in a schema called customers. So if you have multiple tables and views describing the same object, the same topic, we put them all together underneath one schema. Again: a database could be the sales database or the HR database, completely different types of data, and underneath the sales database we can have different sections, one about the orders and one about the customers. Now, moving on: what can we find inside a schema? We can find tables. A table is where your data is actually stored; it contains multiple columns and rows, so it is where the data physically lives. And inside the schemas we have another type of object, which we call a view.
And of course, in this section we are focusing on the views. A view is like a virtual table: it has a structure and everything, but there is no data inside it. The view does not store any data; in order to see the data, we have to execute the query behind the view, and only then do we see results. Unlike a table, it does not store the data permanently. Now, inside tables we can define multiple things, like columns and keys, and the same goes for views: inside views we define multiple columns, and at the last level, each column has a name and a data type. So as you can see, databases are really organized: we have a hierarchy where the top node is the SQL Server and the lowest nodes are the columns and rows. This is what we call the database structure. Now, in order for you to build and manage this structure, we have a set of commands called DDL, short for Data Definition Language. DDL is a set of commands that allows us to define and manage the structure of the database. We have commands like CREATE, which helps us create databases, schemas, tables, and views; another command called ALTER, because after you create something, you may later want to make changes and updates; and of course we have DROP, in order to remove any database object, like dropping a schema, a database, tables, or views. So as you can see, the DDL commands help us manage the database structure. From this picture, we have understood that we can create views inside schemas in the database. Now, if you check the client and the Object Explorer, you can find the exact same hierarchy. It starts with the SQL Server; this is our local server that runs on our machine. Inside it, we can find multiple databases, and one of them is our SalesDB that you installed, together with other databases like AdventureWorks. Now, if you go to the SalesDB over here, you can drill down to the next level, and there we find a lot of objects; among them, as you know, we have tables and views. And now you might say: okay, but between the database and the tables we have schemas, so where are the schemas? Well, if you go inside the tables, you will find our tables, customers, employees, and so on, but in front of each name there is a prefix: sales.customers, and you can find it everywhere, sales.customers, sales.employees, and so on. Sales is the schema that brings all those tables together under one logical group. So we have a database called SalesDB, a schema called sales, and a table called customers. If you would like to see all the schemas inside this database, you can go to the Security folder over here, where we have a folder called Schemas; if you open it, you will find the list of all schemas in this database. You might say: but we didn't create all of those, we only know the sales schema. Well, when you create a database in SQL Server, you get a lot of other default system schemas that the server creates. One of them is the information schema, which holds a lot of views about the catalog and the metadata, where you can find the list of columns, tables, views, and so on. Here, only one schema was created by us, the user: it is the sales schema. In fact, you can query those catalog views directly, as in the sketch below.
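A small example of using those metadata views; INFORMATION_SCHEMA.TABLES is a standard catalog view in SQL Server, so no course-specific names are needed here:

-- List all tables and views in the current database, with their schemas.
SELECT
    TABLE_SCHEMA,  -- e.g. sales or dbo
    TABLE_NAME,
    TABLE_TYPE     -- 'BASE TABLE' or 'VIEW'
FROM INFORMATION_SCHEMA.TABLES
ORDER BY TABLE_SCHEMA, TABLE_NAME;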
So let's go back. Now, if you go inside one of those tables, you will find multiple things: we have columns, keys, constraints, and so on. And if you go to the columns, you end up at the lowest level of the hierarchy. Here we have the columns, like the customer ID, with some extra information like the data type, the length, and so on. So this is the structure and the hierarchy of databases. Now I would like you to understand a fundamental concept about databases that will help you understand views: the three-level architecture of the database. This architecture describes the different levels of data abstraction in a database. So let's see what this means. The architecture is divided into three levels: the first level is the physical level, then we have the logical level, and the third one is the view level. Now let's understand what each level means. The physical level is the lowest level of the database, where the actual data is stored on physical storage. Usually, the ones who have access to this layer are the database administrators, because they are the experts who have to manage the access and the security of this layer; they handle a lot of things, like optimizing the performance, making sure everything is secure, managing backup and recovery, doing all the configurations, and many other tasks. At the physical layer, we deal with things like data files, partitions, logs, catalogs, blocks, and caches, and many other components that each database needs in order to store your data. So as you can see, this layer is very complicated, and you need to be a real database expert in order to manage all of it. We call this layer the physical layer, or sometimes the internal layer. Now let's move to the next level: the logical level. The logical layer is less complicated than the physical layer. At this level, you deal with how to organize your data, and normally we have application developers or data engineers accessing the logical level in order to define the structure of the data. Those developers can focus on how to structure the data rather than on how the data is physically stored; they don't have to deal with all those details, they leave them to the database administrators and concentrate only on structuring the data. That's why this kind of role needs an abstraction level of its own, which is the logical level. So what are the developers actually doing at this level? Well, they are creating tables and defining the relationships between those tables, or defining views; they create indexes on the tables in order to optimize performance, or maybe they create stored procedures and functions and other code in order to manage those tables. As you can see, they are building the data model, structuring your data, but they don't care at all where the data is physically stored in the database. Things are less complicated here than at the physical layer, and it is the perfect abstraction for developers to build projects. We call this the logical layer, or sometimes the conceptual layer. Okay, so now moving on to another level of abstraction: the view level. The view level is the highest level of abstraction in the database, and it is what the end users and applications can access and see. So, for example, you could have one set of views for business analysts.
You prepare and customize views that are suitable only for the business analysts. Then you might say: you know what, let's prepare another set of views suitable for data visualization and reporting; for example, you can connect Power BI in order to create dashboards, so those are fully customized and prepared views meant to be connected to Power BI reports. And you can keep doing that, creating multiple sets of views, each suitable for a specific purpose and use case. So as you can see, at this level we are exposing our data to multiple users and multiple applications. Now, the question is: what do we have to deal with at the view level? Well, there you have only views that hold only the relevant information for the use case or the users. The users at this level have only views; they don't have to deal with tables, indexes, stored procedures, or any files, logs, partitions, or anything else. This is the highest level of abstraction, because the focus of this layer is to make things friendly for the end users and easy to consume. We call this layer the view layer, or sometimes the external layer. So this is the three-level architecture of databases, or, as we also call it, the three abstraction levels of the database. The physical layer has the highest complexity and the lowest abstraction, and the view layer has the highest abstraction. So this is one more reason why views are a very important concept in SQL databases. Okay, so with that we have enough fundamentals in order to start talking about views. The question is: what are views? A view is a virtual table in SQL that is based on the result of a query, without actually storing the data in the database. In short, this means views are stored, persisted SQL queries in the database. So let's understand what this exactly means. So far, what you have learned is: we have a database table, and all we have done is write a SELECT query in order to retrieve the data from this table; once we execute our query, we get the result back. Now, if we are talking about views, they also have the structure of a table, but without any data inside it, and each view has a query attached to it. So there is no data, but there is a query for getting the data. We call the normal table a physical table, and the view a virtual table. So how exactly do we get the data? If you write a query selecting data from the view, not from the table, what happens is that SQL triggers the query that is attached to the view; this query is responsible for querying the physical table, then the result fills the structure of the view, and of course we get the results back. So we are directly querying a view, but indirectly we are querying a physical table. The view sits between us and the data. That means my real data is stored inside the database tables, and the views are an abstraction layer between me and my real data. And of course, the data is not stored inside the view: each time I query the view, the SQL query behind the view is executed again, retrieves the data, brings it back to the view, and then I see it in the output. So this is what we mean by a SQL view. So now let's have a quick comparison between tables and views. Tables store the actual data physically in the database.
So tables are where the data is persisted, while views, on the other hand, are virtual tables: they do not store any data inside the database, but they present the data from the underlying tables. That means views don't persist any data physically. Now, tables are hard to maintain and hard to change: any change, like adding or removing columns, requires a lot of effort for the migration, especially if you have large tables. Views, on the other hand, are much easier to maintain and very flexible to change: all you have to do is change the query of the view. That means you can change things in views very quickly compared to tables. But if we are talking about performance, tables are faster than views. For example, if you do a simple SELECT on a table, you get the data back as soon as the database fetches it. But if you are selecting from a view, there are actually two queries: the query that comes from the user and, as the second one, the view's own query, and the query of the view could be very complicated in order to extract the data from the underlying tables. So selecting from a view is always slower than selecting from a table. Also, with a table you can both read from it and write to it, but views are read-only; as the name says, it is only a view, you cannot write something to the database through it. Okay, so those are the big differences between views and tables. All right, so with that we have a clear understanding of what views are. But now you might ask me: why do we need views? That's why we're now going to deep dive into multiple scenarios and use cases that you might encounter in your SQL projects. So let's start with the first use case. The first use case, and the core reason why we use views in data projects, is to store the central logic of a complex query in the database, so that everyone can access it; with that we improve reusability across multiple queries, and we reduce the complexity of the overall project. Let's understand what this means. Now, in our project we have two tables in the database, orders and customers, and we have learned previously that if we have a complex query, we can use a CTE. For example, in our CTE we join the tables and do some aggregations using SUM, the CTE stores the data in an intermediate result, and then we have the main query; for example, step two, where we rank the data. The whole thing is one query, and let's say a financial analyst was doing this type of analysis. Now, what could happen is that another user, for example a budget analyst, is doing exactly the same first step: he also has a CTE where the data is first joined and then aggregated using SUM, but in the last step, in the main query, he's not doing ranking, he's just doing MAX and MIN. And not only that: we have a third user, the risk analyst, who also does the same initial step in a CTE, joining the tables and doing the summarization, but in this scenario the risk analyst is just comparing the data in the last step of the main query. So now, if you sit back and look at this, you can see that all three data workers are doing the same first step; all of them are writing the same CTE.
They join the data and then do the summarization, and of course it is a complete waste of time that each one of them has to build that CTE from scratch in order to do some analysis. It is pure redundancy and makes no sense. So this is exactly the disadvantage of using only CTEs in projects. What we can do instead: those three data workers decide, you know what, let's put the first step as a view in the database. So instead of writing the CTE each time, we take this script and put it in the database. Now we have central logic stored in the database that everyone can use; we have this query, this logic, only once, and everyone can benefit from it. The financial analyst, instead of going directly to the physical tables, goes to the view, which means she only needs to write one script, the ranking script. The same goes for the budget analyst: he only has to write the query for the MAX and MIN. And the same for the risk analyst: he just needs to compare the data. As you can see, all those queries shrink, and everyone can focus purely on the analysis. This is exactly the magic of views in data analytics: this logic, this knowledge, can be centralized in the database, which is far faster and better than rewriting the logic each time someone wants to do an analysis. So this is why we need views in data projects. Now, if you compare views with CTEs: CTEs are used in order to reduce redundancy within one single query, so they improve reusability within one query; views, on the other hand, reduce redundancy across multiple queries, so they reduce the complexity of the whole project and improve reusability across many queries. Now, think about it like this: we use views in order to persist logic in the database. The logic is so important that we want it persisted: just as tables persist data, views persist logic. In a CTE, on the other hand, the logic is not persisted; it is temporary and calculated on the fly, only within the scope of one query. That logic matters only in that one scenario and is not important for any other queries, so it makes no sense to persist it as a view. So you have to decide: if this logic is very important, take it out of the CTE and put it in a view; but if you think this logic only matters for this one query, stay with the CTE, because creating views always requires some extra maintenance steps. You have to create the view, and you have to drop the view when you don't need it anymore. With a CTE, there is almost no maintenance: the database does the cleanup automatically once the query is done; there is no extra activity to drop a CTE or anything. That's why a CTE is easier to use than a view. So those are the big differences between views and CTEs. Okay, so now let's quickly check the syntax of a view. We have a query, a simple SELECT with FROM and WHERE. Now, in order to create a view, an object in the database, we have to use the DDL command CREATE. So we say CREATE VIEW, because we want to create a view, then the name of the view, and then, just like with a CTE, we say AS and put the query in parentheses.
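For example, a small hypothetical view; the name V_German_Customers and the filter are made up for illustration, while sales.customers and its columns follow the course database:

-- DDL: create a view whose logic is the query inside the parentheses.
CREATE VIEW V_German_Customers AS
(
    SELECT FirstName, LastName, Country
    FROM Sales.Customers
    WHERE Country = 'Germany'
);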
So as you can see, it's very simple. We call this a DDL command: we are telling the database, go and create a view, and the logic of the view comes from this query. This is how you can create views in a database. Okay, so now let's have the following task; it says: find the running total of sales for each month. I'm going to start this task by solving it with a CTE. First, I'll do a few aggregations at the level of the month. So let's SELECT. What do we need? We need the order date, but at the granularity of a month, so I'll use the DATETRUNC function and say I would like the date truncated to the month; let's call it order month. After that, we do a few aggregations; for example, let's get the SUM of sales and call it total sales. That's it for the start. Now let's select it from the table sales.orders, then GROUP BY, grouping by the month. Something like this; let's execute it. And with this, we get the total sales for each month. Now, the next step is to calculate the running total of the sales; this, of course, is not yet a running total. That means this is our first step and we need a second step, so we can use either subqueries or CTEs; I will go with the CTE over here. So I say WITH, call the CTE monthly summary, and define it like this. Then we define the main query. The main query is simple: SELECT, and let's get the order month, and now we have to build the running total. We use a window function: SUM of total sales, then OVER; we don't have to partition the data, we just sort it by the order month, and we can leave it ascending. This is the running total, and of course we have to select from our CTE here. Let's execute it, and with that we are getting the running total. Of course, we can also add the total sales to the output in order to understand the results; here in the output we are just building cumulative sales. For this scope, everything is fine, and we are using a CTE. But now imagine that this logic is important for multiple queries. It's really nice to have a report where we aggregate the data at the level of the month, and this could be used by different users and different queries. So we say: how about putting this logic in a view, so that everyone can access it and we don't have to repeat the same aggregations over and over? And before we put it in a view, maybe another user says: you know what, let's add one more aggregation, not only the total sales; let's make the scope a little bit bigger, so that everyone can benefit from it. For example, we can go over here and add the total number of orders: COUNT of the order ID, calling it total orders. Or someone else says: let's get the quantities as well, so we SUM the quantity like this and call it total quantities. With that, we are doing a lot of aggregations at the month level. Let's go and execute only the CTE.
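For reference, the whole CTE script at this point might look like this sketch; column names like OrderDate, Sales, OrderID, and Quantity are assumptions from the walkthrough, and DATETRUNC requires SQL Server 2022 or newer:

-- Task: find the running total of sales for each month.
WITH Monthly_Summary AS (
    -- Step 1: aggregate the orders to the granularity of a month.
    SELECT
        DATETRUNC(month, OrderDate) AS OrderMonth,
        SUM(Sales)     AS TotalSales,
        COUNT(OrderID) AS TotalOrders,
        SUM(Quantity)  AS TotalQuantities
    FROM Sales.Orders
    GROUP BY DATETRUNC(month, OrderDate)
)
-- Step 2: the window function builds the cumulative (running) total.
SELECT
    OrderMonth,
    TotalSales,
    SUM(TotalSales) OVER (ORDER BY OrderMonth) AS RunningTotal
FROM Monthly_Summary;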
So now we have a really nice report that is based on the months and can be used by many different queries. What we're going to do now is take this and put it in a view. Let's select only this logic and create a new query. We put our query there, and now we have to write the DDL in order to create the view. It's going to be like this: CREATE VIEW, and let's give it a name, maybe starting with V underscore, so V_Monthly_Summary. This is the name of the view; then AS, and we put everything in parentheses, just like building a CTE. So here we have our logic, and this is our DDL query for creating the view. Now let's execute it. As you can see, the output only says that the command completed, because this is not a SELECT query; it is a DDL command, so SQL only tells you whether it was created successfully or not. So now the question is: where do I find my view? Well, if you go to the Object Explorer, underneath our database SalesDB we have the Tables folder, where we usually query those tables, and beneath it we have our Views as well. If you expand the Views, you won't see our view yet, because we just created it here; go over there and refresh, and once you do that, you will see the newly created view. So this is the one we just created. Now we can open a new query and simply query the view: SELECT star FROM V_Monthly_Summary. Let's execute it. As you can see, we are getting the result of the view, and I am accessing this logic from a completely external query. So now I can think of the view as any other table we have in the database. And again, the big difference between views and tables: the table has the actual data, everything there is persisted, but the view is just an abstraction for me, and behind it there is a query that goes to the tables in order to present the results. For me, though, I don't care about all those details; I can go straight to the query over here and start querying. So now, in order to create the running total of sales, I don't have to create the CTE or subqueries. Let's go back and take, for example, our main query: instead of using the CTE, I can access the view directly. As you can see, my query is now very simple; I'm doing step two immediately, without having to prepare the data first. If I execute it, I get the exact same results. And if you compare the query on top of the view with the CTE query, you can see that the CTE version has more steps and is a little bit more complicated than the query on top of the view, and this is exactly the benefit of the view: we reduce the complexity, and it is very easy to consume from the users' point of view. So this is how you can put your logic in a central place using views, and with that we have learned how to create a view. Both steps are sketched below.
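Both steps, creating the view and then consuming it, might look like this; the aggregation body is the same sketch as the CTE above, and GO is the batch separator used in SSMS:

-- Step 1: persist the monthly aggregation logic as a view.
CREATE VIEW V_Monthly_Summary AS
(
    SELECT
        DATETRUNC(month, OrderDate) AS OrderMonth,
        SUM(Sales)     AS TotalSales,
        COUNT(OrderID) AS TotalOrders,
        SUM(Quantity)  AS TotalQuantities
    FROM Sales.Orders
    GROUP BY DATETRUNC(month, OrderDate)
);
GO  -- CREATE VIEW must be the only statement in its batch

-- Step 2: any consumer can now build on top of the view with a tiny query.
SELECT
    OrderMonth,
    SUM(TotalSales) OVER (ORDER BY OrderMonth) AS RunningTotal
FROM V_Monthly_Summary;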
Now, one more thing about the schemas. If you check our tables over here, they all have one schema: sales.customers, sales.employees, sales.orders, and so on. Our new view, however, has the schema dbo. If you create any object, whether a table or a view, and you don't specify a schema, it lands in a default schema called dbo. Now let's go back to our DDL script. As you can see over here, we didn't specify any schema; we just wrote the view name. In order to put our view in the correct schema, because we don't want it in the default one, you have to specify the schema name in the DDL. To do that, we go to the name of the view and write the schema name first, separated by a dot: the first part is the schema name and the second part is the view name. So now let's execute it. If you check over here, you don't see anything new yet, but if you refresh, you will find another view, in the correct schema: sales.V_Monthly_Summary, and this is exactly what we want. So this is how you can assign a view, or even a table, to the correct schema, if you don't want to use the default one. All right, so now the next step: say you want to clean up and you don't need those two views in your database anymore. How do you delete a view? We can use the command DROP. It is very simple: you create a new query, write DROP, then say what you want to drop, a view, and then specify the schema and name of the view. Since ours is in the default schema dbo, I don't have to write the schema, so we start straight with the view name: V_Monthly_Summary. That's it, very simple. We execute it; it says completed, but, as you can see, nothing seems to change until we refresh, and then we can see that the database removed the view with the schema dbo. So it's very simple; this is how you drop a view in SQL. Okay, so now to the next step. Let's go back to our DDL that created the view sales.V_Monthly_Summary. Now say you would like to change the logic inside the view: how can we update this content, update the query? If you, for example, delete one column, saying I need only three columns, and execute it, the database says: I cannot do that for you, because such a view already exists. SQL will not simply replace things; it says no, we already have the same name, and I cannot do anything about it. So how do we update the view? Well, in other databases, like Postgres, it's very simple: you can write CREATE OR REPLACE VIEW, which tells the database, create this view, or, if it already exists, replace it, and you won't get an error in Postgres. But in SQL Server it is a little bit more complicated; we don't have this command. Here you have two ways. Either you first drop the view: you use the same name over here, run the DROP VIEW, so the view is dropped, and then you recreate the view with the new logic; so we destroy the view and then recreate it. Or you say: you know what, I would like to have everything in one go, not in two steps, everything in one script, and for that you have to use T-SQL, Transact-SQL, in SQL Server. It is like an extension of SQL, only in SQL Server; it's like programming, where you can add variables or add checks. We will not do a deep dive into this language, but I would like to show you how to use it for views, so just follow me. I'm going to replace the whole thing, and then we say IF, and now we check the system catalog: IF the OBJECT_ID, and now we specify the view name.
So let's copy the whole name, with the schema as well, and then we tell SQL that this is a view. If this object exists, meaning IS NOT NULL, so it exists in the catalog, what should SQL do? It should drop this view, so we say DROP VIEW, just as we did before, then a semicolon, and then we write GO, which tells SQL that this T-SQL logic is done; after that comes the DDL for our view. So again, what are we doing? Before creating the view, we check whether the view exists. If it exists, we tell SQL to drop it, and if it doesn't exist, meaning it is a completely brand-new view, this step is simply skipped, because there is nothing to drop. Now, if you execute the whole thing, it works, and of course, if you refresh over here, you still see the view: SQL destroyed the view first and then recreated it, and you can execute it again and again. So this is how you replace the logic of a view in SQL Server. And with that, we have covered all the scenarios: how to create a view, how to drop a view, and how to update the logic of a view. The whole pattern is sketched below.
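Here is the drop-and-recreate pattern in one place, as a sketch; OBJECT_ID with the type 'V' checks the system catalog for a view, and GO separates the batches in SSMS:

-- Dropping a view is a simple DDL command:
DROP VIEW V_Monthly_Summary;
GO

-- SQL Server has no CREATE OR REPLACE VIEW, so to update the logic we
-- first drop the view if it already exists, then recreate it:
IF OBJECT_ID('Sales.V_Monthly_Summary', 'V') IS NOT NULL
    DROP VIEW Sales.V_Monthly_Summary;
GO
CREATE VIEW Sales.V_Monthly_Summary AS
(
    SELECT
        DATETRUNC(month, OrderDate) AS OrderMonth,
        SUM(Sales) AS TotalSales
    FROM Sales.Orders
    GROUP BY DATETRUNC(month, OrderDate)
);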
Now back to our database architecture; let's understand how the database executes views. Say the data engineer creates a new view. The query is sent to the database engine, and the engine understands: this is a view, not a table. So the database engine goes to the disk storage, to the catalog, and it stores not only the metadata about the view but also the SQL that defines it: it takes the SQL statement you wrote in CREATE VIEW and places it in the catalog as well. Compare that with tables: for tables the catalog holds only metadata, but for views it holds both the metadata and the query of the view. And notice that the database engine does not create anything in the user data: no data is stored on disk or in the cache. The actual, physical data is not stored anywhere; we store only metadata and the query, inside the system catalog. Now we tell our data analyst: okay, we have a new view, and the data analyst writes a query to retrieve data from it; he writes SELECT ... FROM the view and executes it. The database engine takes it and understands that we are talking about a view, so it first retrieves, not the data, but the query from the catalog, to understand what has to be executed. Then the database executes the query of the view first, and the data for that query comes from a physical table called Orders. The database engine queries Orders to retrieve the data, the user's query runs on top of that, and the result is sent back to the data analyst. So as you can see, there are really two queries: the SQL engine first has to execute the query of the view, and only after that can it execute the query that comes from the user. The data always comes from a physical table, but we are not giving the data analyst access to the table, only to the view. And this happens every single time an end user selects data from the view: the database engine always grabs the query from the catalog, executes it first to get the data, and then executes what the end user wants.

Now, if the data engineer says, no, let's drop the view, she writes a query to drop it, and the database engine goes to the system catalog and deletes both the metadata and the query. So as you can see, when you drop a view you are not losing the actual data; no user data is lost at all, so don't worry about it. What you lose is only the query and the metadata of your view. Only if you drop a physical table, like Orders, do you lose data. Dropping views is not as bad as dropping a database table. This is how the database works with views behind the scenes.

Now, moving on to the second scenario, the next use case of views in projects: we use views to hide complexity and improve abstraction. In many scenarios we work with very large and complex databases, and we can use views to reduce that complexity and make things easier for the users. Let's understand what this means. I'm going to describe a scenario that happens in almost every project. If you get access to a database where you want to do analysis, you will very often find a large database where the tables are very hard to understand: they have a lot of columns, with technical and cryptic names, and how the tables connect to each other, the relationships between them, is almost impossible to figure out. You have to get deeply involved with the data models, the documentation, and the experts before you understand how to query this database. If you are not a developer, then from an end-user perspective it can be a nightmare: you are trying to write multiple joins just to do a simple analysis. And from the database perspective, this data model may be good enough for one application, but if you open your database up to multiple analytics projects, it becomes a nightmare, because you have to explain to every user how to query the data. So what we usually do, instead of giving direct access to such a technical, hard-to-understand data model, is this: we as developers, the experts on the data model, create multiple views, and these views become an abstraction of the complexity in the database. We make sure those views are friendly objects: they have full English names that make sense, the columns are friendly, and we try not to offer too many views, so the users don't have to do all the joins. We provide a few friendly views that carry all the information the users need for their analysis. With that, the users get access to something friendly and easy to consume, and they can write simple queries on top of these views. You could say we are providing a data product on top of my complex physical database. So here again you can see how important views are for providing abstraction and easy-to-consume objects for the users: I hide all my complexity, the script of the view is developed by the experts, and only once, so the users never have to understand or write those complex SQL joins. This can make your data projects way easier than before.
So this is another important use case for views: providing abstraction, plus easy and friendly objects for the end users. Okay. Now let's take the following task, which says: provide a view that combines details from orders, products, customers, and employees. So instead of users dealing with all those tables from our database, we provide one combined view that has everything, well, almost everything. Let's see how to create such a view. We start with the table orders: SELECT * FROM Sales.Orders, and let's execute it. This is the central table that connects everything; you can see here we have the order ID, product ID, sales, customers, and so on, so it's a great starting point. Now we're going to be picky about the columns. I won't show all of them. Let's show, for example, the order ID; it's essential, it's nice to have a unique identifier. The product ID I will not show, but I'll list it here as a comment, and the same for the customer ID and the salesperson ID; I want to replace those later, so I keep them as comments in order not to forget them. It makes no sense to show raw product IDs and customer IDs and so on; we'd rather show the details of each object: instead of the product ID, I'd like to show, for example, the product name itself and some other information from the table products. With that we are reducing the complexity. Now, what else can we get from the table orders? The order date, I'll put it here, and maybe things like sales and quantity. Of course, we could include all the columns, but for now I'll go with these. Now, since we're going to have a lot of tables, it's important to use aliases, so we put the alias O on each of those columns. All right, fine. So now we have our details from the table orders. What's next? We have the product ID, so let's get the information from the products. We're going to use a LEFT JOIN, just to make sure we don't miss any orders; with an INNER JOIN you might miss some, so I won't do that. We join with Sales.Products, like this, and now we connect the tables using the keys: the product ID equals the product ID from orders. All right. Now the question is which information we want to show the users. Let's look at the table products: we have the product, the category, and the price. I'd say let's take the product and the category; that's enough. So now, instead of the ID, we have the product and the category. Let's test it; I'll execute. As you can see, we no longer have a product ID; we have the product name, which is more friendly. So we have these two columns from the orders, those two from the products, and the last two again from the orders. It looks really nice and friendly, and with that the user doesn't need the extra table products; we have everything in one. Now let's do the same for the customers: join Sales.Customers C, again using the key, customer ID equal to customer ID. Now we have to grab a few columns from the customers; let's check: we have a first name, a last name, a country, and a score.
I'd go with the names and the country, but instead of having a first name and a last name, I'm going to put everything in one column. So we have to concatenate the information: the first name, then a space between the first name and the last name, then the last name, like this. And we won't just call it "name"; we'll call it customer name, because later we're going to have an employee name as well. All right. Next we want the country, and we have to say this is the country from the customers, so we call it customer country, and that's it. Let's execute. Now we can see our orders and products again, and now we have the information from the customer. But here we have an issue: there are some NULLs, because some customers have no last name. So we handle the NULLs for the last name, and for the first name as well, using COALESCE: if the last name is NULL, make it an empty string, and the same thing for the first name. All right, let's execute again. With that, we still get the first name if the last name is missing, and if the first name is missing we still get the last name. Looks good. With that we have the customer's details. The last thing is to get the employees. The employee here is called salesperson ID, which we can connect directly to the table employees. If you go to the employees over here, which columns do we need? We have the first name, last name, department, and so on. I'd say let's get the names and the department. First the join: LEFT JOIN Sales.Employees, and we join its employee ID with the salesperson ID that comes from the orders table. Now, instead of the person ID, we do the same thing as before: I'll copy and paste the customer-name expression, change the alias to E in both places, and call it sales name. And we'll also take the department. That's it; let's execute. So now we have a lot of information in our view: the first columns from the orders, then from the products, here from the customers, those two from the employees, and the last two again from the orders. We have now combined all the relevant information from multiple tables of our database into one single view. The result is relatively big, but we have all the information in one place, and it's far more friendly for the users to consume our data this way than to go and join all four of those tables themselves. The next step is to put the result of this query into a view in our database, so that our end users can start consuming it. How do we do that? This is our combined query, and now we write the DDL for it: CREATE VIEW, then we give it the name, order details, then AS, and we put the whole query between two parentheses, one at the start and one at the end. And of course don't forget the schema: Sales dot, then the view name, just so it lands in the correct schema and not in dbo. Everything is ready; let's execute it. Now let's check our database: if you refresh, you will find our second view, the order details.
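Put together, the finished DDL from this walkthrough would look roughly like this; the column names follow the demo database shown on screen, so treat them as assumptions if your copy differs:

    CREATE VIEW Sales.V_Order_Details AS
    SELECT
        O.OrderID,
        O.OrderDate,
        P.Product,
        P.Category,
        COALESCE(C.FirstName, '') + ' ' + COALESCE(C.LastName, '') AS CustomerName,
        C.Country AS CustomerCountry,
        COALESCE(E.FirstName, '') + ' ' + COALESCE(E.LastName, '') AS SalesName,
        E.Department,
        O.Sales,
        O.Quantity
    FROM Sales.Orders AS O
    LEFT JOIN Sales.Products  AS P ON P.ProductID  = O.ProductID
    LEFT JOIN Sales.Customers AS C ON C.CustomerID = O.CustomerID
    LEFT JOIN Sales.Employees AS E ON E.EmployeeID = O.SalesPersonID;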
So now let's test it: SELECT * FROM Sales.V_Order_Details, and execute. And with that we are getting a combined view that shows all the important information from the database. This is what the users see, and with that the users don't care how many tables we have in the database or how to join them all; there is only one view, and they can start working with it. This is a very common use case for views.

Okay, moving on to the next scenario, the next use case: we use SQL views to implement security and to protect the data in our database. In many scenarios we have sensitive information in our data, and we cannot share it with everyone. So one of the best practices is to create views in order to protect your data before sharing it with the users. Let's understand what this means, starting with the scenario without views, tables only. Say you have the table orders: four columns and three rows. A manager has direct access to the database and starts writing queries to retrieve data. But in your project, multiple people have access to your database: for example a data analyst, who is also writing scripts to retrieve data from the orders, and maybe a student, who has access to your database and queries the data like any other role, like the manager and the data analyst. So as you can see, you now have different roles in your project, and all of them have the same rights, because they access your table directly: a manager, a data analyst, or a student all see the whole table, all rows and all columns. And of course, in real projects this is a big problem: sometimes the data is sensitive, and you cannot give access to everyone. If you are using only tables, this is going to be a nightmare, because you could create multiple tables, but it's going to be really hard to keep all those tables in sync. Instead of that, we have views. What you can do is remove all access to the physical table, and instead create a dedicated view for each role. For example, you create a view called orders_managers, and maybe you include all the data and all the columns, because the managers are allowed to see, let's say, sensitive data; still, it's nice to have a view, because maybe you change your mind later and remove something. Now say that for the data analysts you want to offer all the rows, but there is one column that is very sensitive. So you create another view, called orders_analysts, in which only three columns, A, B, and C, are available, and you give access to it to all data analysts; with that you have protected the sensitive information. We call this column-level security. And now we come to our poor students. Here we create another view where we protect not only the column D but also a few rows, for example row number three, because we want to offer only some of the information to the students. So we are protecting the columns as well as the rows, and for that we create another dedicated view, called for example orders_students, and offer it to the students; with that we are doing column-level security as well as row-level security. So we can offer multiple views very easily, without having to worry about how to load data from one table to another.
So creating those views is really easy, and it gives us a perfect tool to manage the security of our data. This is one very common use case of views in data projects. All right, so now let's take the following task, which says: provide a view for the EU sales team that combines details from all tables and excludes data related to the USA. The first part of the task is similar to what we have already done, but we cannot offer all the data to the users; this time we are providing a view created specifically for one team, the EU sales team. The first part we have already done, where we combined all the details in one view. But the problem with the view we created is that it shows all the data, and now the requirement has changed: we cannot show all the data, we have to exclude the USA data from our details. So let's see how to do that. It's very simple: we grab the same query, so we have the same joins and everything prepared, but instead of showing all the data, we filter it based on the customer country. At the end we'll have a WHERE clause saying the customer's country is not equal to 'USA'. So we now have a filter; let's execute it. And with that, as you can see in the output, we are getting only the orders that are not from the USA. With that we are protecting the data of the USA, and the EU sales team can access only their data. It looks nice and protected. And what we are doing here is row-level security: we are hiding all the orders, all the rows, that this group of users is not allowed to see and consume. So now, what's the next step? Very simple: we put everything into one view. The query is ready, so we can create the new view: CREATE VIEW, then we need the schema, and the name is going to be almost the same, order details, but with EU at the end. And then we need the parentheses, like before. Everything is ready; let's execute it. Now we can refresh in order to see our new view; if you still don't see it, go to the Views folder over here and refresh that as well. With that, I can see we have our new view. Now, of course, the next step is to test it. Let's create a new query: SELECT * FROM Sales.V_Order_Details_EU. Let's test it, and with that, as you can see, we are getting the combined view with only the data that is relevant for the EU sales team; I'm not seeing any USA records here. So with that we are providing a view that protects a few rows, the orders from the USA. As you can see, views are really great for adding security to our data, whether we are protecting columns or rows. For example, in our view we could say: not only do I want to remove the USA orders, but let's say the department information is sensitive as well, and I'd like to hide it from the view. You can simply remove it from the SELECT, and with that you are doing column-level security.
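As a sketch, the finished EU view would look roughly like this: the same joins as before, plus the row filter (again assuming the demo database's column names):

    CREATE VIEW Sales.V_Order_Details_EU AS
    SELECT
        O.OrderID,
        O.OrderDate,
        P.Product,
        P.Category,
        COALESCE(C.FirstName, '') + ' ' + COALESCE(C.LastName, '') AS CustomerName,
        C.Country AS CustomerCountry,
        COALESCE(E.FirstName, '') + ' ' + COALESCE(E.LastName, '') AS SalesName,
        E.Department,
        O.Sales,
        O.Quantity
    FROM Sales.Orders AS O
    LEFT JOIN Sales.Products  AS P ON P.ProductID  = O.ProductID
    LEFT JOIN Sales.Customers AS C ON C.CustomerID = O.CustomerID
    LEFT JOIN Sales.Employees AS E ON E.EmployeeID = O.SalesPersonID
    WHERE C.Country <> 'USA';  -- row-level security: the EU team never sees USA rows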
So now I have two options that I can offer to the users. The first option doesn't have any row-level security: it is the first view, the order details; there are no filters, so it shows all the orders, and we give access to it only to people who are allowed to see all the data. And we have the other option, the order details for the EU: it doesn't show all the data, only the subset that is relevant for the EU team. So now it's really easy to control the security of my data using views, and this is a very important use case for them.

Okay, moving on to the next use case for views: we can use them to get more dynamism and flexibility in our projects. Let's understand what this means. If you have a table, and multiple users accessing this table, what can happen? You might change your mind about the design and the data model of your database: you say, you know what, instead of having one table, I'm going to split it into two tables. Or maybe another decision: you decide to rename a table, or another day you decide to rename a few columns, or add a column, or remove one. So you are making changes to your physical data model, you are changing things in the tables, and you know what's going to happen? All those users accessing the tables are going to scream, because all of them have complex SQL queries, and your small changes at the tables break everything in their queries. And that means escalations, and you no longer have the freedom to change anything in your database without talking to a hundred people before every change. So we don't do that; instead, we use views. What happens? You create a view and tell the users: okay, take this view, consume it, and leave me alone. And now you have your freedom back to make any changes you want. You go to your tables and do the splitting, renaming, and change everything you want, as long as you update the query between the table and the view, to make sure the users don't notice any change. For example, if you split the table into two tables, you put a join or a union in the view, in order to reconstruct the same structure the users are used to. And if you rename something in your database, say instead of ID you are now calling it a key, all you have to do is go to the query of the view and rename it back, from key to ID. So no one will notice that you are making changes to the physical tables. Using views and offering them to the users is a game-changer, because giving the users views gives you more freedom, dynamism, and flexibility to change anything in your data model and tables without getting any headaches. This is an amazing use case for views.
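Here's a toy sketch of that renaming trick; the object names are hypothetical, and the view is assumed to exist already:

    -- Hypothetical physical change: the column OrderID is renamed to OrderKey
    EXEC sp_rename 'Sales.Orders.OrderID', 'OrderKey', 'COLUMN';

    -- Update the view so it maps the new name back to the old one;
    -- everyone querying the view keeps seeing "OrderID" and notices nothing
    ALTER VIEW Sales.V_Orders AS
    SELECT
        OrderKey AS OrderID,
        OrderDate,
        Sales,
        Quantity
    FROM Sales.Orders;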
Okay, moving on; we have a lot of use cases for views, they are just amazing. The next one: we can use views to introduce a second version of our data model in another language, so we can offer multiple languages to the users. Let's understand what this means. We have the following scenario: again our table orders, where the data is persisted and everything is in English. And what sometimes happens is that you have international teams accessing your data: you have a team in the USA, and maybe a team from Germany who are also end users that want to access the data. Of course, it depends on the number of users working with your database, but if you have a lot of users from Germany, and from India as well, it might make sense to translate your data and the table structure into another language. So for example, instead of giving access to the table orders, we can create a view called Bestellung; that's "order" in German. And not only do you give the object a new name, you could also rename all the columns inside the view. Then the German users access the German view, and it's going to be easier for them to understand the content of your database. The same for the Indian team: for the Indian users, you can provide a view in Hindi. I'm not sure whether I'm pronouncing the word correctly, but this is the first word I have ever said in Hindi; I don't promise I'm going to learn the Hindi language, because it's enough that I'm learning German. I'm also trying to write this word, Aadesh; I hope it is correct, and to be honest, it is really interesting how this word is written in Hindi. So now, back to the topic. As you can see, we are using the views to provide a translation of our database, by simply giving new names to the views and to their columns. This is another nice use case that I also use in my projects, to provide multiple languages for my data model, and I can do that with the power of views.
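A small sketch of such a translation layer, using hypothetical German names over the same orders table:

    -- German-language view over the English base table
    CREATE VIEW Sales.Bestellungen AS
    SELECT
        OrderID   AS BestellungID,   -- "order ID"
        OrderDate AS Bestelldatum,   -- "order date"
        Sales     AS Umsatz,         -- "revenue"
        Quantity  AS Menge           -- "quantity"
    FROM Sales.Orders;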
Now we come to my favorite use case for views, and one that I personally recommend in every project like this: we can use views as virtual data marts in a data warehouse. Why is this my favorite? Because I specialize in data warehouses and data lakes, and this is a very important decision in each such project. So let's understand what this means. A classical data warehouse architecture, based on the Inmon approach, looks like this: we have multiple source systems where our data is spread around, and we want to extract all our data from these multiple sources and put it into one big database called the data warehouse. There will be a lot of operations on this central database: the data is first cleaned, then maybe integrated, and maybe we build some historical data there. So we do multiple steps in order to prepare the data for complex reporting and analysis, and in the data warehouse we usually store all of this information in physical tables. Now, once we have built the data warehouse, what happens? We will have multiple use cases that want to access the data warehouse, maybe for different kinds of reporting. It would be very complex to connect a reporting engine like Power BI directly to the data warehouse. Instead of that, we try to split the data warehouse into multiple subsets, by topic, domain, or department, and we call those subsets data marts. A data mart is always specific to a use case; it focuses on one topic. For example, we could have a dedicated mart for sales, and another data mart dedicated only to finance topics, but both of them come from our data warehouse. The last layer is then the reporting and dashboarding: maybe you have something like Power BI, where you create a dashboard on one data mart, like sales, perhaps together with a few things from other marts. But now the big question at the data mart layer is: how should I store the data? Using tables, or using views? And the best practice says: if you are building data marts, use views. We call these virtual data marts, and there are many reasons why using views at the data mart layer is better than using tables. For example, they are more dynamic and quicker to change, because at the data mart layer you usually build a lot of business logic and you want flexibility and speed; and the maintenance effort is very much simplified, since there is no need to build any ETL or data loads from the data warehouse to the data marts. This keeps the data warehouse a real single point of truth for your data: once you start copying data from one layer to another, it becomes really hard to maintain, even chaotic, and you have to have really strict monitoring and data quality. Using views, you always reflect the current status of the data warehouse, and this of course helps with data consistency, which is a critical point in every data warehouse project. So there are many reasons why we build virtual data marts and go with views in this layer, and you can see what an important role views play in building a data warehouse. This is another amazing and very important use case of views in your data projects.

All right, friends, now let's have a quick recap about views. We have learned that a view is a virtual table based on the result of a query, without actually storing any data in the database; we use views in order to persist complex SQL logic and queries in the database. We have learned that in some scenarios views are better than CTEs, because they improve reusability and reduce complexity across multiple queries, which reduces the complexity of the whole project, while a CTE only improves reusability within one query. We have also learned that in some scenarios views are better than tables: they are very flexible and easier to maintain, since they don't store any data, and it's really fast and easy to change things in a view compared to a table; but we have learned as well that tables are faster than views. Now, there are endless use cases for views, but from my experience in projects I have chosen the best ones for you. The first use case: if we find a common, repeated logic in SQL queries, we can store this logic in a view in the database, so that the users don't have to keep repeating the logic over and over; we use views to centralize the business logic. Another use case is to hide the complexity of your physical data model and offer the users a highly abstracted layer: you provide the users something very friendly and hide the complex, technical data model you have in the database, because not everyone is an expert in your data model. One more use case: we can use views to implement security and protect our sensitive data in the database, by offering multiple views that protect the columns or rows of a table. Another use case: we can use views to have more dynamism and flexibility in our database, where we offer the users a view instead of the table, and then we have the freedom to change things in the physical data model without affecting all the users. Another nice use case: we can offer multiple languages on top of our data model. And in the last use case, we learned how views play an important role in a data warehouse system. So views are amazing. All right, my friends, with that we have learned everything about this new database object, the view. It is amazing for the flexibility and dynamism of your projects.
In the next one, we're going to learn how to create tables based on a query, and we will learn about temporary tables. So let's go. Okay, so first let's have a look again at the database structure. We have learned that in each SQL Server there are multiple databases, and in each database there are multiple schemas; and inside each schema we can define multiple objects, like tables and views. Now we will focus on the object "table". We have also learned that we can use the language DDL, the Data Definition Language, a set of SQL commands for defining this database structure: we can use the command CREATE in order to define a new table, ALTER in order to update the structure, or DROP in order to drop the whole table. So a table is an object in the database structure. We have also learned that there are three levels in the database architecture, and we understood that at the logical level, the middle one, the conceptual level, we as application developers or data engineers work with tables: we define tables and the relationships between them. If you are an end user or a business analyst, it's going to be a little harder to work with tables; you have to be a developer or a data engineer. But working with tables is way easier than dealing with the complexity of the database at the physical level, so you don't have to be a database expert or administrator to work with tables. The difficulty here is in the middle: the abstraction is not that low, but also not that high. So now let's answer the question: what are tables? A database table is a structured collection of data, like a simple grid or a spreadsheet that you might find in Excel. It has different columns, where each column represents a field, like the ID, name, or country; and the table has multiple rows, where each row represents a record, an entry of the data. For example, if the table is about employees, then each record, each row, is one employee. The intersection of a row and a column is called a cell, and a cell is a single piece of data. Now, the whole table is stored physically in the database as database files: there are multiple files in the database holding the information of the table, and those files are stored physically in the disk storage of the database. That means your data inside the tables is not stored like a spreadsheet in Excel; it is stored in special database files, and developers and end users usually don't have access to those files. So a table, again, is an abstraction, a representation of the actual data in the files: each time you query a database table, the database has to go to those files and fetch the data for you. All right, so this is what we mean by database tables. Okay. Now, we have different types of tables in SQL. We have tables that stay forever; we call them permanent tables, and they stay as long as you don't drop them. And there is another type of table, called temporary tables, and those tables get deleted and dropped once the session ends. Now, we're going to focus first on the first type, the permanent tables, and there are two ways to create them. The first way is the classical one, where you create the table from scratch and then insert your data; we call it CREATE-INSERT. And the other way is called CTAS, Create Table As Select: it also creates a table, but based on an SQL query.
So let's understand the differences between them. The CREATE-INSERT method is the classical way of defining and creating tables in SQL, where first we create the table and define its structure, and after that we insert our data into the database table. The other method, CTAS, Create Table As Select, also creates a new table, but this time based on the result of an SQL query. Let's understand what this means. Okay, so now to the first method, CREATE-INSERT. Here we have two steps. In the first step we have a DDL statement where we use the command CREATE. Once we execute this first step, the database engine creates an empty table for us: a brand-new table where we can hold our data. With that we have defined the structure of our table, but it's still empty. So in the next step, we insert our data into this new table. Our data can come from multiple sources: from a CSV file, maybe completely from another database if we are doing a migration, maybe you are inserting your data manually, or maybe it comes from an application. At the end, once you execute the INSERT, your data is inserted into the new table. So in this method we have two steps: first we define the structure of the table, and in the second step we take care of inserting our data into it. And now this new table and your data are persisted permanently. Now let's check the other method, the CTAS. Here it's only one step: you define a query, and once you execute it, the database has to retrieve the data from another table; it might, for example, retrieve data from the new table we just created using CREATE-INSERT. Once the query is executed, we get a result, and now the database creates a brand-new table, but this time the definition and the data of this new table don't come from any specification that we wrote; they come from the result of the query. Whatever structure we have in the result is reflected in our new table. So again, the definition and the data that we see in this new table come one-to-one from the result of our query. With this type we don't have to define anything or insert any data: we just write a query, and the output of this query defines the table. But as you can see, this method always needs an existing database table to run the query against, while with the CREATE-INSERT method we are creating something from scratch. So these are the two different ways of creating tables in SQL, and the differences between them. Okay, now you might say: you know what, CTAS is very similar to views; we have a query, and the output of this query becomes an object in the database. So what are the differences between them? Let's check. Say that in our database we have a table that has three columns: A, B, C. Now we can go and create a view based on a query: you write the DDL statement in order to create the view in the database, and the database stores the query; the view stays empty. There is no data, because views do not store any data, and the query of the view is not executed yet. But on the other hand, you can go and create a table using CTAS.
Here, again, we have a query attached to the object, to the table. But in this case the database has to execute the query, in order to determine the structure as well as the data that should be inserted into the table. So our SQL query is executed, and the result of the query is inserted into the table; this new table already stores the result of the query. So this is the first difference between the table and the view: when you create a view, the query is not executed, and we have nothing of its result anywhere; with CTAS, the result of the query is already stored inside the table, and everything is prepared. Now let's see what happens once a user selects something from the view. The database goes, for the first time, and executes the query of the view, in order to fetch the data from the original table and present it as a result to the user. But on the other hand, if the user queries the table created by CTAS, what happens? SQL will not execute the CTAS query again, because the database has already done that and prepared everything. That means we are not querying anything from the original table; the data can be fetched directly from the new table, and the user immediately gets the result from the table created by CTAS. Here comes the second difference between tables and views: views are slower than CTAS tables, because the database has an extra task; it must execute the query of the view in order to get the data. With CTAS, the query is faster than with the view, because we have already executed everything and prepared it for the user. That's why tables from CTAS are way faster than views. And there is another difference, another perspective, which from my point of view is more important than the performance. Say that the next day we make data updates on the original table: we update the column C, and the column B as well. Now let's see what this means for a user who uses the view. The user executes the same query again the next day, and again the database has to execute the query of the view in order to fetch the data from the original table. That means that today the view gives us different data than yesterday, because we have new data and new updates, and the user sees the new updates and the fresh data in the result; the user sees exactly the current status of the data in the original tables. But now let's see what happens if the user queries the table from the CTAS. In the CTAS table, we still have the data from yesterday. All those new updates on the original data will not be reflected in this new table, because once the user selects something from this table, the database does not go and fetch the new changes from the original table; we already prepared the data yesterday. That means our user is now getting old data from the CTAS table, and the only way to get fresh data is to re-execute the CTAS query. Of course, this is an extra step, and it makes the table from CTAS harder to maintain; this is a big difference for the users between the views and the tables from CTAS. Now, think of a view as ordering a pizza at a restaurant.
Every time you query the view, you are placing an order: the chef makes a pizza from scratch using the freshest ingredients, so you are always getting a fresh, hot pizza. And think of the CTAS table as a frozen pizza from a grocery store: the pizza was prepared earlier and stored in the freezer, and if you want to eat it, you have to heat it up in the oven; it's still not like a fresh pizza made on the spot, from scratch. Now I've made myself hungry, because I love pizza, so I think I'm going to go for a quick break. Okay, so now let's quickly check the syntax of these two methods. The first one is CREATE-INSERT. In the first step, we create a table using a DDL statement: we use the command CREATE, and then we have to tell SQL whether we are creating a table or a view; in this scenario, a table. Then we specify the name of the table. After that we have two parentheses, and inside them we list all the columns that we need in this table. Say we have two columns, the ID and the name; after each of them we define its data type, and maybe the length as well. There are a lot of options that we could add to this syntax, but right now we are looking at the simplest form of creating a table. The next step is the INSERT statement: we say INSERT INTO our new table the following VALUES; we insert the ID number 1, and the value for the name is 'Frank'. So this is the classical way of creating a new table and inserting data into it. Now let's move on to the second method, the CTAS. This time we have an SQL query, SELECT, FROM, WHERE, plus some extra logic. This is our query, and then we put it inside a DDL statement, exactly like we did with the views, but this time, instead of saying VIEW, we say TABLE. So again we have the CREATE command, we are creating a TABLE, then the name of the table, then we say AS, then two parentheses, and inside them our query. And this is where the name comes from: Create Table As Select, CTAS. It is very simple: in one statement you have everything; you are creating a new table and inserting the data that comes from this query. This syntax is used in databases like MySQL, Postgres, and Oracle. But SQL Server has its own way of doing it. Again we have our query, SELECT, FROM, WHERE, but in SQL Server we insert a command between the SELECT and the FROM, like this: we say SELECT the following columns INTO a new table. So we have the keyword INTO, then the table name, and then the query continues as usual with FROM, WHERE, aggregations, and so on. Here, the DDL sits inside the query itself, while in the other databases the query is separated from the DDL statement. Personally, I prefer the separated CTAS syntax to this INTO, because in a big, complex query the INTO and the column selection can be really easy to miss. So this is the syntax for creating a new table from a query, the CTAS, in different databases.
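Side by side, the two methods might look like this (the table and column names are just illustrative):

    -- Method 1: CREATE-INSERT, two separate steps
    CREATE TABLE Persons
    (
        ID   INT,
        Name VARCHAR(50)
    );

    INSERT INTO Persons (ID, Name)
    VALUES (1, 'Frank');

    -- Method 2, CTAS syntax as in MySQL, Postgres, or Oracle
    -- (not valid in SQL Server, hence commented out):
    -- CREATE TABLE PersonsCopy AS
    -- (
    --     SELECT ID, Name FROM Persons
    -- );

    -- Method 2 in SQL Server: SELECT ... INTO creates and fills the table in one go
    SELECT ID, Name
    INTO PersonsCopy
    FROM Persons;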
Okay, so now let's check the scenarios and use cases where the CTAS makes sense. Let's start with the first one. We have learned before that it makes sense to store complex logic inside the database, so that our end users don't have to keep repeating the same logic over and over, logic which may also be too complicated for some users. That's why we used views: the result of the view gets used by our users, and everything stays easy and friendly to consume. But what might happen is that the logic of the view is very complicated and needs a lot of time to be executed by the database, so it takes a really long time to get the intermediate result. If it takes 30 minutes, then every user has to wait 30 minutes until the query is executed, and none of your users is going to be happy with this situation. In this scenario, you first have to try to optimize the query; but if you cannot do anything more, you have to switch from the view to a CTAS table. So what you do is take the same logic and put it into a CTAS, so that the intermediate results are stored in a table. Of course, at the moment of creating the table, it will take 30 minutes; it takes a long time, because it is the same query, and the database needs that time to create the intermediate result. But the big advantage is that once everything is prepared, maybe during the night, then in the morning, once your users are online and start querying the data, they have everything prepared. The users go and start selecting and analyzing the intermediate result, but this time using the table that you created with the CTAS, and the response time is again normal and fast for all users. So if you have a scenario where your views are very slow, go and prepare the data at night using the CTAS, and serve the prepared tables to the end users for their analysis. This is the most common use case for the CTAS, and this scenario happens a lot in projects: you decide to go with the CTAS instead of views, in order to have persisted data, and you gain performance. Okay, so finally, back to SQL; let's go and create a table using CTAS. We're going to create a table that shows the total number of orders for each month. Let's do it. First, what do we need? We need a query, so let's write it: SELECT, and I'm going to use DATENAME in order to get the name of the month from our order dates, and we'll call it order month. Then we aggregate the data by counting the order ID as total orders, from our table Sales.Orders. Don't forget to GROUP BY our month. Something like this; let's execute it. The result is very simple: we have the order month and the total orders, two columns and three rows. So we have our query, and of course we haven't created anything yet. Now, in SQL Server, in order to create a table from the query, we write INTO exactly before the FROM, and then we have to specify the schema and the table name: I'll stay with the schema Sales, and I'll call it monthly orders. That means we have our query, and the DDL sits exactly between the SELECT and the FROM. Now, if I execute this, what happens? We will not see the result of the query; we get "3 rows affected", because this is now a DDL statement, not a query anymore, and the database is telling us: I have created a table with three rows.
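The finished statement from this demo would look roughly like this:

    SELECT
        DATENAME(month, OrderDate) AS OrderMonth,
        COUNT(OrderID)             AS TotalOrders
    INTO Sales.MonthlyOrders   -- the DDL part: creates and fills the new table
    FROM Sales.Orders
    GROUP BY DATENAME(month, OrderDate);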
So now, if you check our tables, you don't see it yet. Let's refresh and check the tables again: now we can see our table, Sales.MonthlyOrders. Of course, we have to check whether everything is fine, so let's select the rows from our new table Sales.MonthlyOrders: select it and execute. And now we see the result of our query again, but we are not writing the query here; we are just selecting from the table, so our data is stored in our table. We can also check the structure of this table: if you go to the columns, you can see we have the order month and the total orders, and this information comes from our query. SQL says here that the order month is a varchar, which is correct, because we have the names of the months; so SQL is able to derive the data type of the table from our query. And the second column, the total orders, is an integer, because we have numbers there. So as you can see, SQL defines the structure of the table based on the result of our query over here, and of course the data inside the table comes from the query as well. And the content of this table stays like this as long as you don't change anything: if you close this and open it after one year, it will show the exact same results. It lives in the database as long as you don't drop this table. But if things change in the table orders, this table will not be updated automatically, like we learned happens with views. Now, if you want to drop this table, it's very simple: just write DROP TABLE and the table name over here, make sure you select it, and execute it. And now, if you go over here and refresh and check the tables, you can see the table is dropped. Now suppose you say: you know what, let's refresh the table that comes from the CTAS every day, so that we always have fresh data inside it. Let's execute our CTAS again, and with that, if we refresh, we find our table again. But if you execute it one more time, in order to refresh the data of the table, what do you get? You get an error. The database tells you: we already have this table, so I cannot recreate it. So the question is: how can we update the content of this table? Well, we have to drop it first and then recreate it, and if we want to put everything into one statement, we have to use T-SQL, Transact-SQL, the extension that lets you do some programming inside SQL. In order to do that, we go to the start, over here, and build an IF logic: we search for the object, so we say IF OBJECT_ID, and now we have to specify the name of this object together with the schema, so make sure to select the whole thing, Sales.MonthlyOrders, and put it inside. Then we have to define the type of this object, and here we go with 'U'; it stands for a user-defined table. So we are saying: if the object Sales.MonthlyOrders is not null, that means it exists, what do we want to do? We have to drop it. I'll take the DROP TABLE statement from before and put it right after the IF, over here.
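Assembled, the refresh script from this walkthrough looks roughly like this:

    -- Drop the table only if it already exists, then rebuild it from the query
    IF OBJECT_ID('Sales.MonthlyOrders', 'U') IS NOT NULL
        DROP TABLE Sales.MonthlyOrders;
    GO  -- ends the T-SQL batch

    SELECT
        DATENAME(month, OrderDate) AS OrderMonth,
        COUNT(OrderID)             AS TotalOrders
    INTO Sales.MonthlyOrders
    FROM Sales.Orders
    GROUP BY DATENAME(month, OrderDate);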
So we are saying: if this table exists, then drop the table; otherwise, do nothing, because there is no old table, and the query will just work. At the end of the T-SQL we have GO, in order to say the batch is done, and after all that comes our usual query. Let's execute the whole thing, and as you can see, it works. What happened? The database found this table, dropped it, and then executed our query. So if you keep executing this, you are simply refreshing the content of the table. This is how we work with the CTAS in SQL. All right, moving on to another common use case for the CTAS that I also use in my projects: we use CTAS to create a persistent snapshot of the data at a specific point in time, in order to analyze data quality issues. Let's understand what this means. In some scenarios, you have a table and you are analyzing an issue: there is a data quality issue in your data, and you are analyzing the scenario in order to understand why it happened. But the problem is that, at the same time, there are updates on the table and your data is changing: there will be updates on some fields, or you get new records, everything gets mixed up, and you will not be able to analyze the scenario where the data quality issue happened. It becomes almost impossible to find the root cause of your issue. Instead of that, what we do if we have a data issue is to create a fixed, persisted snapshot of the data in a separate table, using CTAS, so that we make sure nothing changes and everything stays fixed. With that, I can keep doing my analysis on the same data, without worrying that the data is changing underneath me. So this is another reason why we use CTAS in projects: to have a snapshot of the data, to ensure that our analysis is done on the exact scenario that caused the bug, and to use it as a foundation for finding the problem and fixing it. All right, moving on to another use case of the CTAS: we can use it to build our data marts as physical data marts, instead of virtual data marts using views. Let's understand what this means. As we learned before, in a data warehouse system, the data warehouse layer stores the data in tables, and for the second layer, the data marts, we can use views, in order to have the dynamism and flexibility to generate multiple data marts; we called this the virtual layer. But in some scenarios, if things get complicated, your data marts and reports become slow, because for each action you are generating a query: the Power BI reports and dashboards create queries against your data marts, and your data marts always have to go to the data warehouse in order to retrieve the data for the reports, and the whole thing can take minutes, or sometimes hours. In these scenarios we cannot stay with views, because they are slowing everything down. Instead, we have to convert our data mart to a physical layer: that means, instead of using views, we use tables. And one very common way to generate the tables of the data marts on a daily basis is to use CTAS queries between the data warehouse layer and the data mart layer. It may still take maybe 30 minutes; that's why you can prepare the data during the night.
But at the reporting layer, where the performance really matters, things get better, because the response time of tables is way faster than that of views, and the reports don't always have to waste time waiting for the data marts to get data from the warehouse. So this is another use case for CTAS: when the views at the data marts are slow, we go and replace them with tables built by CTAS, to speed things up. Still, my recommendation here is: start first with the views. Create a virtual data mart using views, because the implementation is very dynamic and fast, and you always get fresh data from the warehouse; and maybe later, if you notice that some data marts and models are complex, go and replace a few marts, from views to tables, using CTAS. So this is another use case for the CTAS, and it is a nice workaround for your data warehouse system. All right, friends, so with that we have covered the first type of tables that we have in databases: the permanent tables, where you create a table and it lives forever, until you go and drop it. Now we're going to talk about another type of tables in databases: the temporary tables. So let's understand what temporary tables are. Temporary tables, or as a shortcut, temp tables, store intermediate results in a temporary storage in the database during a session, and the database automatically drops these tables once the session ends. Let's understand what this means. We learned that with CTAS we can use a query to retrieve data from one table, and it puts the intermediate results into a brand-new table in the database; with that we are creating another table based on a query. It's the same with temporary tables: we also have a query that retrieves the data from a table, and the database also creates a brand-new table that has the structure and the data of the query's result. So it is exactly like the CTAS. What is the difference, then? Well, it is the lifetime of the table. The database tables that you have created using CREATE-INSERT or CTAS stay permanent: they live in the database as long as you don't drop them, and even if the system goes completely offline, the data will still be there once it is online again. But temporary tables get deleted and dropped from the database automatically once the session ends. What does "session" mean? Once you open the client and connect to the database and start running queries, the time between connecting to the database and disconnecting from it is what we call a session. That means, once you close the client, disconnect from the database, maybe shut down your PC and do something else, what happens? The database goes and destroys and deletes all the temporary tables that you created during the session. So the table lives only as long as you have a session, and during this time you can access the table just like any other permanent table. This is what we mean by temporary tables, or as a shortcut, temp tables.
Okay, so now let's check the easiest syntax ever. For a temporary table, the syntax looks like this: you have a query, SELECT, FROM, WHERE, and as we learned with the CTAS, if you say INTO and then the table name, it creates a new physical table. But if you want it as a temporary table, you just put a hash (#) before the name of the table; then SQL understands that we are now talking about a temporary table, and the database stores it in the temporary storage. So it is very simple; this is the syntax of temporary tables. So far, we have learned that we have a database called SalesDB, and inside it we find the tables that we created: the customers, employees, orders, and so on. Those are our tables, and they are always there: if you close everything and start it again, or come back the next day, you will always find those tables with the same data. They will exist as long as we don't drop them. Now the question is: where do we find the temporary tables? Well, as we learned, if you go over here to the system databases, you will find multiple databases belonging to SQL Server itself; normally only the database administrator has access to these. One of those databases is called TempDB, the temporary database. Let's go inside it. We can find multiple objects, and among them we find the temporary tables. Right now, of course, there is nothing inside, because we haven't created anything, so let's go and create one. We already have an open, active session with the SQL Server; as you can see here, we are connected to the database, and we can start creating temporary tables. So now, what is the plan? I would like to make a few modifications to the table orders, but I will not do them directly on the table orders; I'd like to take a copy of it from SalesDB and create a temporary table from it. Let's do that. What do we need first? We need a query. I'd like to select everything, all the columns, all the rows, from the table orders: SELECT * FROM Sales.Orders. This is my query. So far nothing is created; we have only a SELECT statement. Now, in order to create a temporary table, we put a statement between the SELECT and the FROM: exactly before the FROM, go over here and say INTO; then, in order to make sure it is a temporary table, we use the hash, and then the table name; we'll call it #orders. That's it: we have our query, and in between we have the INTO; just make sure you use the hash, so it becomes a temporary table. Let's execute it. Now we can see that 10 rows are affected, and we don't have any error. Of course, we cannot see the table yet, because we have to refresh the Object Explorer, so let's do that and expand it. And now we can see our temporary table. As you can see, it is in the schema dbo, because we haven't defined any schema, and that is the default one from the database. Nice; now we have the table, so let's check a few things. Let's select from the table itself: SELECT * FROM, and make sure to say #orders. Execute it, and now we are getting the data from the temporary table, not from the original table orders in the database SalesDB: all this information comes from the temporary table. Now, of course, you can do whatever you want to this temporary table, because it's not that important, and it's going to get deleted anyway. So let's say that I'd like to delete all the orders where the order status equals 'Delivered'.
So, let's go and do that. What do we do? DELETE FROM #orders, so make sure we are targeting the temporary table, and then WHERE the order status equals 'Delivered'. Let's go and execute it. Okay, it says five rows are affected. Let's select it again: SELECT * FROM #orders, and let's check. As you can see, we don't have all the orders any more; we have only the orders where the status equals 'Shipped'. All delivered orders are removed. And now we can do whatever we want with this copy: we can analyze it, we can modify it, we can insert new data, any manipulation we want on this copy. And now, if you say, you know what, I like this result and I would like to have it beyond the session, maybe I'm going to need it tomorrow, then we do the exact opposite: we store the result of the temporary table back in our database, so that we don't lose this intermediate result. In order to do that, we say INTO, and then make sure to specify sales. because we want the correct schema, and then let's call it orders_test. Let's go and execute it. It says five rows are affected. Now we should see this information in the SalesDB. We still don't see the table over here, so right-click on the DB and refresh it, then go again to the tables. Now you can see we have our new table, orders_test. It is amazing, right? What we have done is: we took a copy of the original orders table into a temporary space, did some modifications, played with the data, did some analyses, and then loaded the end result of our temporary table back into another new table called orders_test, so that maybe tomorrow we can keep working on it. It is a really nice way to make changes in a place where you can say: it is temporary, and whatever mistakes you make, it's okay, it is like a playground. Now, we still have an active session with the database, so our temporary table is still here. Let's see what happens if we end our session. To do that, let's just close everything without saving anything. With that we have ended the session. Let's start it again and see whether we still have the temporary table. We connect to the SQL Server again, and now we have another session, which means the old session is already gone. Go to the databases, to the system databases, to tempdb, and then to the temporary tables. As you can see, the database has already cleaned everything up, and this space is empty again for any new temporary table that I create. So once you close the session, everything gets lost. Now let's go back to our SalesDB, to the tables: the table that we created, orders_test, is still living here, and it still has the data that we created. So this is how things work with temporary tables in SQL.
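As a recap, a minimal sketch of this whole round trip might look like the following. The table and schema names follow the course's SalesDB examples, and the exact column name order_status is an assumption based on the walkthrough:

-- 1) Copy the permanent table into a temp table (the # prefix makes it temporary)
SELECT *
INTO #orders
FROM sales.orders;

-- 2) Play with the copy: remove all delivered orders
DELETE FROM #orders
WHERE order_status = 'Delivered';

-- 3) Persist the result back as a new permanent table for the next session
SELECT *
INTO sales.orders_test
FROM #orders;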
Now let's see how the database server executes such a temporary-table statement. Let's say that you, as a data analyst, have created a query and then said INTO a temporary table. The database engine identifies the query and first executes it, maybe getting the data from the orders table. After the query is executed, the database engine has the results, and two things happen. First, the database engine stores the metadata information in the system catalog. Second, the database engine creates a table, but this time not in the user storage but in the temporary storage on disk. So the table lives there for a short time. Now you can write multiple SQL queries that do multiple analyses on top of this table; each time you select something, the database engine goes to the temporary storage and fetches the data from there. And once you are finished, let's say you close your client, the session between you and the database ends, and the database understands that there is no more connection from this user, so it cleans up the temporary storage from any tables created in this session. That means the database is automatically cleaning up the storage, maybe for other sessions. So this is how the database engine works with temporary tables. Now the question is: why do we need temporary tables? Let's see the following scenario. Say that in our source database we have a table called orders, and we would like to load this table into our data warehouse. We have to do several transformations in order to prepare the data for analysis in the data warehouse: maybe one query to remove duplicates, another one to handle the NULLs, maybe some filtering and cleaning up, and as the last step we aggregate the data. Of course, those queries and transformations would change the content of the orders table, and there is no scenario where you can do that directly on the source database; it is simply not allowed. That's why in data warehousing we get our own copy of the data, and then on top of this copy we do our transformations. One way to do this is using temporary tables. You have one script to extract the data from the orders table and put it in a temporary table as an intermediate result; then come the transformations, all those queries that manipulate and change the data of this extra copy in the temporary table; and in the last step you have the load, where you load the final version of the intermediate result into the database. This is if you would like to do the whole ETL before inserting the data into the database. Now, the orders table and the final table in the data warehouse are both permanent tables: they will stay there as long as we don't drop them, because they are very important tables. But the intermediate result is not that important. It is just an intermediate step we did in order to have our extra copy of the data, to manipulate it and prepare it to be inserted into the data warehouse. After we have loaded it into the warehouse, this copy of the data is no longer important; it shouldn't stay around for a long time. That's why in this scenario, maybe we can use temporary tables instead of normal tables for the intermediate results.
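A rough sketch of that ETL pattern, where the staging temp table and the warehouse table name (dwh_orders) are hypothetical placeholders, not names from the course:

-- Extract: take our own copy of the source data into a temp table
SELECT *
INTO #stg_orders
FROM orders;

-- Transform: clean the copy in place (duplicates, NULLs, filtering ...)
UPDATE #stg_orders SET sales = 0 WHERE sales IS NULL;
DELETE FROM #stg_orders WHERE order_id IS NULL;

-- Load: insert the final version of the intermediate result into the warehouse
INSERT INTO dwh_orders
SELECT * FROM #stg_orders;

-- No DROP needed: the database cleans up #stg_orders once the session ends.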
The one advantage here is that the database does an automatic clean-up after the host session ends. It comes out of the box, automatically, from the database, which means I don't have to deal with the dropping mechanism of this table before the next load. That said, if something goes wrong in the data warehouse, you usually want to check the copy where the transformations were done in order to debug and find issues. So I normally don't use temporary tables in these scenarios; I just use normal tables. But for other, smaller projects, maybe this makes sense. So this is one use case for temporary tables in your projects: we use them to store intermediate results temporarily until we are done with the session, and once we are done, the database can go and drop that temporary table. All right guys, now a quick honest talk about temporary tables. I never use them in my projects. If I need an intermediate result in one query, I use CTEs. And if my intermediate result is very important, then I put it in either a view or a CTAS table. But it is a nice technique to learn; maybe you can utilize it in one of your projects. All right guys, so now let's have a quick summary about tables. Tables in a database are like a spreadsheet or grid that contains columns and rows, and your actual data is stored in these tables. We have learned there are two types of tables: permanent tables and temporary tables. Permanent tables live in the database forever, as long as you don't drop them. Temporary tables, on the other hand, have a short lifetime: they will be dropped from the database once you end the session. We have also learned there are two methods of creating tables in databases. The first method is CREATE + INSERT. This method involves two steps: the first is defining and creating the table, and the second is inserting the data into this new table. So you are creating something from scratch. The second method is called CTAS. It also creates a brand new table, but based on the result of a query. This type is done in only one step, but it always needs another existing table. We have learned as well the difference between tables and views, where the main advantage of tables created with CTAS is ensuring the performance is fast enough for the end users or your reporting system. So we use CTAS instead of views if the logic of the view is very complex and takes a lot of time to execute in the database. One more nice use case for CTAS is that we can persist a snapshot of the data in order to analyze a bug or a data quality issue, so we are sure we have the exact data needed to find a solution for it. And we have learned that we can use temporary tables to store intermediate results in temporary storage, where the main advantage is that the database automatically drops all the temporary tables when the session ends, because for you the intermediate results are not important enough to live a long time. Hey my friends. So we have learned that in real data projects, if you have a database, there will be a lot of analytical use cases that want to access your data and do analytics. And what happens? They write complex queries, because in many scenarios they are doing complex analyses.
And if you don't do anything about it in your projects, you are going to face a lot of challenges: complexity, a lot of redundancy of the same complex logic across multiple users, and maybe performance and security issues. We have learned five amazing techniques to solve those problems: subqueries and CTEs, as well as how to create objects like views, CTAS tables and temporary tables. So now we're going to compare them side by side in order to get a big picture of the advantages and disadvantages of each method. Let's go and compare them. Okay. We have our five methods, and the first criterion I would like to compare is the storage type. We have learned that if you are using subqueries and CTEs, the database puts the results of those two techniques in memory, in the cache, so that the main query later has fast access to those intermediate results. On the other hand, if you are using temporary tables or tables from CTAS, the newly created table is stored on disk. And for views, as we understood, there is no data storage at all, which means we are not using any storage in the database. Now let's talk about the lifetime, meaning how long the object lives or persists in the database. Three techniques, subqueries, CTEs and temporary tables, all live only a short time in the database; all of them are temporary. But objects created using CTAS and views are permanent: they live in the database as long as you don't drop them. Something similar is the question of when the database drops or deletes those objects. We have learned that subqueries and CTEs have a very short lifetime: they live only during the execution of the query. Once the query ends, the database goes to the cache and deletes everything. Temporary tables live a little longer, as long as you are in the session; but once you end the session, the database drops and deletes your table as well. Objects that come from CTAS and views are persistent and permanent, and the database only deletes them if you ask it to, using the DDL command DROP. So the database will not delete anything by itself for these two. Next is the query scope, meaning how we can access those objects. For subqueries and CTEs the scope is very small: they can be accessed only from one single query, the query itself where you write the CTE or subquery. You cannot access them from external queries. But we have learned that temporary tables, CTAS tables and views can all be accessed from multiple queries, which means you can access those objects from multiple external queries. Next, reusability. If you look at subqueries, they are very limited: a subquery can be used only in one query and only in one place. If you need it in multiple places, you have to repeat the same logic. So subqueries are the worst for reusability. The CTE is a little better: you can still access it only from one single query, but within that query you can access it from multiple places.
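For instance, here is a minimal sketch of one CTE referenced twice inside a single query. The sales.customers table with country and score comes from the course's examples; the customer_id column is an assumption:

-- One CTE, referenced from two different joins in the same query
WITH country_stats AS (
    SELECT country, AVG(score) AS avg_score
    FROM sales.customers
    GROUP BY country
)
SELECT
    c.customer_id,
    own.avg_score AS own_country_avg,   -- first reference
    usa.avg_score AS usa_avg            -- second reference
FROM sales.customers c
JOIN country_stats own ON own.country = c.country
JOIN country_stats usa ON usa.country = 'USA';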
So you can reference it multiple times from different joins, and you don't have to repeat the same logic over and over. But it is still limited, because only one query is using the logic. Now, with temporary tables, I would say the reusability is medium, and that's because the data can be accessed by multiple queries, but only during the session. Once the session has ended, you cannot access it any more, which means you have to recreate it in order to reuse it again. So it is more reusable than CTEs and subqueries, but not as good as CTAS tables and views. Those two techniques offer the highest reusability: they are always there, for multiple users and multiple queries, so they can eliminate a lot of redundancy and you have to do the job only once. Now, moving on to the next criterion: the freshness of the intermediate results. The question is, how fresh is the data? Is the data from these objects always up to date? For subqueries and CTEs they are always up to date, because SQL executes the logic on the fly, stores the data in memory, and immediately after that the main query comes and gets the data. So the intermediate results in memory are always up to date. But for temporary tables and CTAS, the query is executed only once, and if there are any updates and changes on the original table, you will not find those changes in these objects, because SQL executed the query once and that's all. So if you query those tables, there is no guarantee that the data is up to date; if you want fresh data, you always have to drop the table and create it again from the query. Views, on the other hand, are amazing: they are always up to date, because views do not store any data. Each time you ask a view for data, the database goes to the original table and fetches the data for the view, so your data is always fresh and up to date. So this is the big picture of the behavior of these advanced techniques that you can use in SQL projects. And if you ask my opinion, my favorite is the view, in first place. Second on my list is the CTE. CTEs are amazing, but don't use more than five CTEs in one query; otherwise it's going to be really annoying and hard to read. In third place come the subqueries, and then CTAS: I use CTAS if the views are slow. In that scenario I jump to CTAS and create a permanent physical table from my query. And the last one, which I rarely use, is the temporary table. So this is how I rank those techniques in my SQL projects. Now I would like to show you a big picture of how things work in my projects, so you can see all those different techniques and possibilities in action. It's like a big picture and recap. So, story time. You have a database, and things start with a database administrator, or let's say a data engineer, creating a new table from scratch. He writes a DDL statement to create one physical table in our database. Now our database table is empty; that's why in the second step he writes an INSERT statement to fill our new table with data. Once we have a table, we give access to, say, a data scientist or data analyst so that they can start writing SQL queries.
Now, the first thing that could happen is that the logic is complex and she has to do it in two steps: the first step is a query that prepares the data for the second step. So she uses a subquery, and the main query retrieves the data from the intermediate result in order to prepare the final result for the analyst. The next thing that could happen is that there is a piece of SQL logic in the query that keeps repeating. Instead of writing another subquery for it, she puts this logic in a CTE, and then in the main query she uses the result of the CTE in multiple places within the same query. All of this, the subqueries, the CTEs and the main queries, happens in one single query. Now, what could happen next is that she has written an amazing piece of code. So instead of using it only in her query, she persists this logic in the database: she puts it in a view, so that all other users and analysts can benefit from this logic and don't have to write it again. Instead, they query the view, which makes life easier. And of course, our data analyst can use this view in her main query as well. One more thing: she also has another piece of logic that is really complex and from which everyone could benefit, but the issue is that this query is very slow. So she has to decide: do I put it in a view, or do I create a new table based on the query using CTAS? Because of the performance, the view takes around 30 minutes to execute, she decides to run the query using CTAS, generating a physical table so that all the other analysts can access this new table and reuse the results; and of course, she can use it in her main query too. With that, you now have a feel for how things work in real projects. It is not a simple SELECT from a table; it is like this: people create subqueries, CTEs, views, temporary tables and CTAS tables for different purposes. All right my friends. So that's all about CTAS and the temporary tables, and with that we have learned all the techniques for organizing our complex projects. Next, we're going to start talking about something completely different: stored procedures, and how to put our code inside the database. This is all about programmability and how to add things like parameters, variables and error handling. So it's like programming. Let's uncover this world of stored procedures, and let's go. Now, think about stored procedures like this. Every time you go to a coffee shop, you say, "I would like a large coffee with coconut milk, no sugar, and extra whipped cream." And you repeat this over and over, each time you go to this coffee shop. Now, if you are working with stored procedures, it's going to be like this: whenever you go to the coffee shop, you just say, "Give me my usual."
And the barista knows exactly what you mean by that, and you get exactly your order without specifying and repeating everything word by word. This is exactly what happens when you work with stored procedures. So let's have some coffee, right? All right, now we can continue. Let's start again from scratch. We always have these two sides: the client side and the server side of the database. We have learned that we have a database, and you as a user can create different SQL statements. For example, you can create a SELECT statement to retrieve data from the database, another SQL statement where you insert data into the database, another one where, let's say, you update the content of your tables, and so on. So you have different statements to interact with the database. Now, let's say that what you are doing is not a one-time job; you keep repeating those steps over and over. You always do an INSERT, then an UPDATE, and then a SELECT, and you keep repeating that day after day. Now imagine you do something crazy: you go on vacation, but the job still has to be done. So you hand over all those SQL statements to your colleagues, and they have to run them every day while you are gone. You give them all those SQL scripts and you tell them: okay, you have to execute the first query, then the second query, and then the third query. This is of course not a good way to do things, because there will be human errors, like the execution order of the scripts not being correct, first updating then inserting, and things can go wrong. And that's exactly why we have stored procedures in SQL. What we can do is put all those SQL statements together in one frame, in one program, and we call it a stored procedure. Once you do that, all your SQL statements no longer stay on the client side; they are now stored on the server side of the database. That means with stored procedures we are storing our SQL statements inside the database. So you don't have to hand over your SQL statements to your colleagues. And now, all you have to do in order to interact with your SQL statements is execute the stored procedure. You write a very simple command, EXECUTE followed by the procedure name. With that you are calling your stored procedure that is stored inside the server. Once you execute this, the database goes to the stored procedure and starts executing all the SQL statements that you have inside it, and it does so exactly in the order you defined, from top to bottom. Once the database has gone through all your SQL statements, it returns to the user the data from the SELECTs. With that, things are really easy: you can tell your colleagues, okay, just execute this stored procedure, and the rest is done by the database. So you minimize human errors, you make sure everything is executed the way you want, and when you are back from vacation, things are easier as well: you just execute the stored procedure. So this is what we mean by a stored procedure: you can store multiple SQL statements inside it in a specific order, save it inside the database, and each time you need your SQL statements you can simply execute them.
Now let's have a quick comparison between a normal query, normal SQL statements, and a stored procedure. A normal SQL query, with SELECT, FROM, WHERE and so on, is like a one-time transaction: you are asking the database for one thing and the database answers. It is a one-time request. On the other hand, in a stored procedure you have multiple SQL statements, and once you execute the stored procedure there are many interactions with the database in one go. That means you have multiple transactions happening inside your stored procedure. So an SQL query is like a simple request: you need one thing and you get it. But a stored procedure is like a program. Just like writing code in any programming language, it is more than one request, and it has a lot of extra features: for example, you can build looping logic, where you iterate through something, or you can build a control flow, where you have logic like IF-ELSE statements, so there are different paths in your code. As in programming, we also have parameters and variables in order to make our code dynamic and flexible, and we can build error handling into our code in order to customize what happens if there is an issue. So a stored procedure is like having code in, for example, Python: you can do more complicated stuff compared to a simple query, where you have only one request. In stored procedures you are doing programming and coding, and it is more advanced than just having a query. That means if you are working with stored procedures, things are going to get more complicated and advanced, but of course you get a lot of flexibility and reusability compared to a simple query. Now, there is another alternative to stored procedures: you can put all your SQL statements in Python code, and things can work as well. So either you put your SQL statements inside a stored procedure or in Python code. The big question is: what are the differences between them? Well, there is a disadvantage if you have Python on a different server, because you have to build a connection between your server and the database server, and a connection always means networking, so you might get slightly worse performance. That is one advantage for the stored procedure. Another advantage for the stored procedure is that all the scripts you store inside it in the database are precompiled. Precompiled means the SQL database server already knows about your SQL statements; there was already a check that all the syntax is correct, and the database prepares everything to execute the stored procedure, like preparing the execution plans and a lot of other things. So if you store your SQL statements inside a stored procedure in the database, it is very close to the database: the database knows everything about your scripts and is ready to execute them. But if you put all your SQL statements outside the database, the database has no chance to understand what is coming; it cannot compile anything until Python sends the code to the database. So this is another advantage for the stored procedure. But if you build your SQL statements in Python, you get a lot of advantages too.
For example, you can build very flexible Python code where you use Python features together with SQL, and with that you open the door to many possibilities and a lot of flexibility. Another thing with Python: you can do great version control, since everything integrates with Python tooling. And one more advantage: if you have a complex requirement in your project, it's going to be really hard to implement it in stored procedures; it will cost you a lot of lines of code and things will not be comfortable. But if you implement complex logic in Python, things are way easier. So with Python you can implement complex logic very easily compared to a stored procedure. Those are the big differences between stored procedures and Python. Now, I have to be honest with you about having your code in a stored procedure or in Python. If you are working together on a data project, I would never recommend using stored procedures if you have the possibility of having your code in Python. And that's because I have seen a lot of projects using stored procedures, and most of them end in chaos. It is really hard to debug, really hard to test, almost catastrophic. So really, don't use stored procedures in your projects, especially if you have a big project with a lot of data and tables and so on. You can manage everything perfectly using Python, especially if you have a platform like Databricks or Snowflake; then, of course, the best way to control your data project is using Python. But if you don't have that possibility, and you only have a database server to work with, then you don't have any other option: you have to work with stored procedures. If you do have the possibility to put your project in Python and run your scripts from there, it is way better than having stored procedures. Well, this is my opinion, and I'm talking about working on big projects. If you have a small project with a few tables and so on, then it's fine to stay with stored procedures. But never build a big project using stored procedures, because I tell you, it will never work. So always try to think about having the right platform to run your projects. And now that I think about it, maybe I should have put this tip at the end of the video, not in the middle. So, whatever. If you still want to learn stored procedures, we're going to continue, and I'm going to have a really nice example of how to build stored procedures step by step, like a mini project. So why not learn both of them? Let's go. Okay, so now let's have a quick look at the syntax of a stored procedure. It is very simple and always has two parts. First we have to define the stored procedure. We do it like this: CREATE PROCEDURE, then the procedure name, then AS, and then BEGIN and END. It's very important for SQL to understand where the definition starts and where it ends. Between the BEGIN and END we put a set of SQL statements; you can insert whatever you want here: inserts, updates, queries, anything. Once you have defined the stored procedure, the next step is to execute it. The syntax is again very simple: EXECUTE and then the procedure name. That's it; with that, SQL goes to the stored procedure and starts executing all the SQL statements that you have in the definition.
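In other words, the two parts might look like this as a minimal sketch, where MyProcedure is just a placeholder name (in SSMS the definition and the call are run as separate batches, hence the GO):

-- Part 1: define the procedure once
CREATE PROCEDURE MyProcedure AS
BEGIN
    -- any set of SQL statements, executed top to bottom
    SELECT 1 AS dummy;
END
GO

-- Part 2: call it whenever you need it
EXECUTE MyProcedure;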
So this is the syntax of a stored procedure. As I said, it is very simple. All right guys, now let's do it step by step. The first step: we write a query. Let's say we have a very simple task, and it says: for US customers, find the total number of customers and the average score. Let's go and do it, it's very simple: SELECT COUNT(*) AS total_customers, and then the average of the scores, AVG(score) AS avg_score, from our table sales.customers; and since it says US customers, we have to filter the data on the column country equal to 'USA'. That's it, this is our query. Let's go and execute it, and we have a very quick, nice report on the total number of customers and the average score. Now, let's say I have a weekly meeting and I have to present this report over and over. That means I have to execute this query frequently, on a weekly basis, in order to get the data for the report. What does this mean? I have to save this query somewhere in order to use it later, so that I don't rewrite it each time. So what I usually do: let's copy the whole query, then create a new text file, let's say it's going to be my weekly query, with the extension .sql. I edit it, save my query there, and each time I need this query I copy it, go back to SQL, paste it and execute it. So either I write it out each time or I copy and paste it. Well, we don't have to do that: we have stored procedures. That brings us to step two, where we turn this query into a stored procedure. Let's do that, it's very simple. We say CREATE PROCEDURE, and now we have to give it a name: it's going to be GetCustomerSummary. After that we say AS, then we need the BEGIN and END, and in between we put our query. So let's copy our query and put it in between. That's it; let's execute it, and with that we have created our stored procedure. In order to see it, we can go to the object explorer, to our database SalesDB, and here we have a folder called Programmability. Let's go inside it: here we have a lot of stuff, like functions and triggers, and we have Stored Procedures. Let's go inside, and we can see over here our newly created stored procedure. So we are almost there. The next step is to call our stored procedure, and this is the easiest part: execute the stored procedure. The syntax is very simple: EXECUTE and then the name of the stored procedure, so GetCustomerSummary. Let's execute it, and as you can see, we get the result of our query. It is very simple: in just a few steps we created a stored procedure, and in the future you don't need the whole query; you just execute the stored procedure. I don't have to store the query locally on my PC or copy and paste anything. If I want this report now, I just execute the stored procedure like this and I get the results.
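Put together, this step would look roughly like the following. The procedure name and the aggregation follow the walkthrough; the exact sales.customers schema is an assumption:

-- Define the procedure once
CREATE PROCEDURE GetCustomerSummary AS
BEGIN
    SELECT
        COUNT(*)   AS total_customers,
        AVG(score) AS avg_score
    FROM sales.customers
    WHERE country = 'USA';
END
GO

-- Then, whenever the weekly report is due:
EXECUTE GetCustomerSummary;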
Okay, so now let's keep moving. We're going to talk about parameters inside stored procedures. So what is a parameter? It is like a placeholder through which you pass information into the stored procedure while running it, and using parameters in a stored procedure makes it flexible, reusable and dynamic. Let's understand what this means. Say you get a new task: for German customers, find the total number of customers and the average score. That means we now have to generate two reports, one for the USA and one for Germany, and in both of them you do the same aggregation. Again we start writing the query; it's going to be very similar to the one from the previous example. We do the same aggregations, and the only change is that we use another value to filter the data: instead of 'USA' we say 'Germany'. Let's execute this one over here, and with that we can see the total number of customers is two. This is the report we have to provide on a weekly basis, and again, in order not to copy and paste stuff, we create a stored procedure for it, with an END at the end. But of course we cannot use the same name twice, so we add Germany to the name. Let's execute it, and as the next step we execute the stored procedure, like this. The whole logic is now stored inside the database; let's refresh the object explorer over here, and you can see we now have two stored procedures. But now you should feel that something is wrong. In programming and coding, if you find yourself repeating the same task over and over, there is always a smarter way to optimize it; repeating stuff in code is always a bad thing. Clearly we are repeating the same query in two different stored procedures, and if you compare them, you see it is because of the value: we have the value for the filter, once 'Germany' and once 'USA'. Those are static values; they will always stay inside the stored procedure, as 'USA'. Instead, we can replace those static values with a parameter, and then you decide, as you execute the stored procedure, for which country you want to run it. So let's go and do that. I'll just remove everything from here and focus only on the first stored procedure. Now, after giving the name of our stored procedure, we have to define our parameter. It starts with @; with that, SQL understands: aha, now we are talking about a parameter. Next we need the name of the parameter: it's going to be @country. It could be any name that you want. After that we have to define the data type for SQL. It's like when you create a table and define columns, where you assign a data type to each column; the same thing here, you have to assign a data type to each parameter. We're going to use the data type NVARCHAR, and for countries a length of 50 is enough. So with that we are telling SQL: for this stored procedure we can pass information into the procedure, and this value will be used inside this parameter. Now, after we have defined this parameter over here, we can use it anywhere inside our query. And of course, we want to use it instead of this static value. So we remove the static value, and in its place we put the parameter.
Now we are saying: filter the table based on the value that comes from the user, no longer statically on 'USA'. And as I said, you can use this parameter everywhere, even here in the SELECT statement; it is a value that can be used anywhere in your query. So that's it: we have defined our new parameter and used it in our query. Now we have to update the stored procedure. We cannot leave it as CREATE; instead, we say ALTER. So we are saying ALTER PROCEDURE with the new information. Let's go and execute it. And now we have to run it. We say EXECUTE GetCustomerSummary, but our stored procedure now expects a value from you as input. We do it exactly like we did with the name over here: we say the parameter @country is equal to 'Germany'. That means the value of this parameter comes from me, from the input, and this information is passed to my query, to the stored procedure. Let's execute it, and with that, as you can see, we get the customer report for Germany. And if you now say, okay, let's generate the report for the USA, all you have to do is replace the parameter value: instead of 'Germany' we say 'USA'. Let's execute it. Great, now we get the report for US customers as well. So you see, my friends, for those two reports I need just one stored procedure, and with the help of the parameter my stored procedure is now more flexible and professional. This is exactly the power of parameters: they make everything reusable and dynamic. And of course, we don't need the stored procedure for Germany any more, so we can drop it: we say DROP PROCEDURE, and it was the one with Germany in the name. We don't need that stored procedure, and we stay with only one dynamic stored procedure. So this is how to use parameters in stored procedures, and why they are important. Okay, now to the next step: we can add default values for the parameters. Let's say that I execute this report very frequently with the country equal to 'USA', and I don't want to define the parameter value as 'USA' each time. If you use a value very frequently, you can add it as a default inside the definition of the stored procedure, and it is very simple: go to the definition again, and after the parameter say = 'USA'. Now, it's very important to understand that the country will not always be 'USA'; you are just saying: if I don't get any value from the user, then as a default use 'USA'. So let's change the definition of our stored procedure again using ALTER. Now we can execute our stored procedure and skip the whole parameter part, and as a default I get the report for the USA without passing any information to the stored procedure, because I know the default is 'USA'. But if you need it for Germany, of course you have to define it: you say execute the stored procedure with the country equal to 'Germany'. If you execute it like this, SQL still uses your value; the value that comes as input from the user has higher priority than the default, of course. And with that we get the Germany report. So as you can see, it's really nice, right, using parameters in stored procedures?
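So the parameterized version, including the default, might look roughly like this (same assumed schema as before):

-- Parameterized version with a default value
ALTER PROCEDURE GetCustomerSummary @country NVARCHAR(50) = 'USA' AS
BEGIN
    SELECT
        COUNT(*)   AS total_customers,
        AVG(score) AS avg_score
    FROM sales.customers
    WHERE country = @country;
END
GO

EXECUTE GetCustomerSummary @country = 'Germany';  -- explicit value wins
EXECUTE GetCustomerSummary;                       -- falls back to the default 'USA'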
All right, moving on to the next step. Now we can work with multiple queries inside one stored procedure; this is what we learned at the start, that we can have multiple SQL statements in one stored procedure. We have a new report and query to generate: find the total number of orders and the total sales. Let's do it quickly. We can write it like this: SELECT COUNT(order_id), this is the total orders, and then the sum of sales, SUM(sales) AS total_sales, from our table sales.orders. And of course, we always create the report for a specific country, so we have to join it with the customers table in order to filter the data: ON the customer ID equal to the customer ID. Then we filter the data: country equal to 'USA'. Something like this; let's go and execute it. With that, for the US customers we have six orders and total sales of 180. And of course, we do the same for Germany. But now, of course, we will not create an extra stored procedure for this; we are going to put everything in one stored procedure. So let's copy the whole thing and put it inside: after the first report we have the second report. And now, a best practice: if you have multiple queries in a stored procedure, add a semicolon at the end of each query. It simply makes it easier to see where each query ends, especially if you have big, complex queries with CTEs, UNIONs and so on, where it can be really hard to tell that a completely new query is starting. The database doesn't require it, but it's just easier to read. So add semicolons at the end of each query. Now let's execute the whole thing in order to change the definition of our procedure. One more thing, of course: don't forget, we don't want static values over here; we add our nice parameter, @country. I think with that everything is ready to be executed, so let's change the definition of our stored procedure. Now let's start with the default, where the country equals 'USA'. Let's execute it, and in the output, as you can see, we have two results. That's because we have two queries: the first result is for the first query and the second one for the new query we just created. And the same thing if you execute the stored procedure for Germany: we also get two results, and here we can see we have four orders and 200 of total sales for Germany. So as you can see, it's very simple: you can now add multiple SQL statements, not only queries; you can also add an UPDATE, an INSERT, a DELETE, any kind of SQL statement, inside your program. And as usual, SQL executes them from top to bottom: since this is the first SQL statement, it is executed first, and then SQL moves on to the next one. So this is how you can add multiple SQL statements to your stored procedure.
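The two-report procedure described here would look roughly like this; the join columns follow the walkthrough, but the exact column names are assumptions:

ALTER PROCEDURE GetCustomerSummary @country NVARCHAR(50) = 'USA' AS
BEGIN
    -- Report 1: customer summary
    SELECT
        COUNT(*)   AS total_customers,
        AVG(score) AS avg_score
    FROM sales.customers
    WHERE country = @country;

    -- Report 2: order summary for the same country
    SELECT
        COUNT(o.order_id) AS total_orders,
        SUM(o.sales)      AS total_sales
    FROM sales.orders o
    JOIN sales.customers c ON o.customer_id = c.customer_id
    WHERE c.country = @country;
END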
All right everyone, now we're going to talk about variables. So what is a variable? It is like a placeholder in which you store a value in order to use it later inside your stored procedure. A variable holds a value in memory, and you can reuse it anywhere you want inside your stored procedure. But it's not like a parameter. A parameter is something outside the stored procedure: it's an input from whoever executes the procedure, and the stored procedure has to adapt to the parameter. A variable is something that lives inside the stored procedure, and we as developers use it to make our code dynamic and to move a value from one place to another. Let's have a very simple example now. Say we don't want our report about the total customers as a query; I don't want it as a result set in the output. Let's say I always generate a report like this: the total customers from Germany equals two, and the average score from Germany equals 425. So I need it as text, not as a table like here. In order to do that, we can use the T-SQL PRINT to produce a message after executing the stored procedure. The syntax of PRINT is very simple: we go over here and say PRINT, then we have single quotes, and let's take the whole message from here, without the comments, and then the semicolon; and we repeat that for the second message, for the average score, also with a semicolon. Now, if you do it like this, the message is always going to be static: we will always have two for the total customers, and the average score will always look like this, even though the data is changing. We cannot have it static like this; we have to make it dynamic, especially if we are calling this procedure for the USA: we cannot have 'Germany' hard-coded in it. So let's see how we can make this dynamic. Let's start with the easy part: instead of the 'Germany' over here, we can put our parameter, right? So instead of it we say @country. But now the problem is that it is part of the whole string, and we cannot do that, so we close the text, and you can see the coloring changes, and then we add a plus for concatenation. This text comes first, then the value from the country, then the colon as static text, again a concatenation, and then the two, which we will deal with later. Let's do the same thing over here for the second message: plus @country; the coloring isn't changing because of this quote, so let me just remove it, then afterwards plus, make it static again, plus, and remove the final quotes. With that, the message is now dynamic: we get the value of the country from the parameter. And now we come to the interesting part. We still have an issue: those two values come from this query, and of course we cannot use a parameter for that. We have to use variables. To work with variables we have three steps. The first step is to tell SQL about our new variable, so that SQL can prepare a placeholder for it in memory. We usually do all the declarations of our variables at the start of the stored procedure, immediately after BEGIN. So we go over here and say DECLARE, and after that it's like the parameters, very simple: @total_customers, this is the name of the variable, and after that we have to define the data type. Of course, you have to work out the data type from the query: since we are saying COUNT(*), the output is going to be an integer, and that's why we write INT. And we need another one for the average, so we add a comma: now we are declaring another variable.
So @avg_score, and the data type of this one is going to be FLOAT, because we have an average. That's it for the first step: we are telling SQL we have two variables, and SQL creates empty placeholders for them. In the second step, we have to give our variables values. Where are we going to get the values? From the query. So let's do that, starting with the first column. As you can see we have the COUNT(*) here, and as we learned, anything you write on the right side acts like an alias for the column. But in SQL, if you write something before it, it becomes the variable. So we can do it like this: @total_customers and then equals. Now we are saying: whatever value this query returns should be stored inside my new variable; I am assigning values to my variable. But here there is one thing: we cannot have any more aliases, because our query will not return any result set. Our query now has only one task: to assign values to my variables. That's why we cannot keep it like this; we have to remove the alias. And we do the same for the average: @avg_score equals the average of the score, and we remove the alias. So that's it; our query now has a different purpose. It is not there to return results; it is there to assign values to our variables. Now that we have values, the next step is to use them, and we can use our variables anywhere inside our stored procedure: in the PRINT, in the next query, in any SELECT statement, in any place. Sometimes we use variables in order to pass information from one query to another. But in this example, we want to use our variables inside the PRINTs. It is very simple: we replace the static number, and it's like with the parameter: we say @total_customers, and the same for the average, @avg_score. So that's it, very simple. Again: step one, we declare them, defining them for SQL, and with that we get empty variables; step two, we assign values to those variables; and the last step, we use those variables. It makes sense, right? Now, if you check our message over here, you can see that everything is dynamic and we don't have any static values. But there is one more thing: in PRINT, everything has to be a string. We cannot have dates, numbers, floats and so on. That's why you have to check: if you add any parameters or variables, all of them should be strings. The country is okay, because it has the data type NVARCHAR, but the total number and the average score are not really good, because they have different data types, and we now have to cast those data types. So we say CAST, and we say here AS NVARCHAR, so that we don't get any errors from SQL, and the same CAST AS NVARCHAR here as well. All right, I think we are ready. Let's go and change the definition of our stored procedure and test it.
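The variables part of the procedure now looks roughly like this; it is a sketch following the walkthrough, with the same assumed schema and column names:

ALTER PROCEDURE GetCustomerSummary @country NVARCHAR(50) = 'USA' AS
BEGIN
    -- Step 1: declare the variables (empty placeholders in memory)
    DECLARE @total_customers INT, @avg_score FLOAT;

    -- Step 2: assign values from the query (no aliases, no result set)
    SELECT
        @total_customers = COUNT(*),
        @avg_score       = AVG(score)
    FROM sales.customers
    WHERE country = @country;

    -- Step 3: use the variables; everything inside PRINT must be a string
    PRINT 'Total customers from ' + @country + ': ' + CAST(@total_customers AS NVARCHAR);
    PRINT 'Average score from '   + @country + ': ' + CAST(@avg_score AS NVARCHAR);

    -- The second report still returns a normal result set
    SELECT
        COUNT(o.order_id) AS total_orders,
        SUM(o.sales)      AS total_sales
    FROM sales.orders o
    JOIN sales.customers c ON o.customer_id = c.customer_id
    WHERE c.country = @country;
END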
Let's execute it. Perfect. And now let's test, starting with the default, where the parameter is 'USA'. As you can see, we now get one result, and it is from the second query; the first query doesn't return anything in the output any more. But if you go to the messages over here, you can see we have a new message. It says: total customers from USA is equal to three, and the average score from USA is equal to 825. And this is exactly what we wanted for our report. Now let's execute it with the parameter equal to 'Germany'. Again, we have only one result, and in the messages we get: total customers from Germany is equal to two, and the average score from Germany is equal to 425. So this is exactly how we work with variables: we use them to hold a piece of information in one place in order to reuse it later in a different place. So that's it for variables. All right everyone. Now we're going to talk about how to control the flow in your stored procedure, and we're going to learn how to do that using IF-ELSE statements. Let's take the following scenario. If you check our query over here, we are taking the average of the score, and if you check the data, you can see that in the scores we have NULLs, and NULLs are really bad for aggregations. We usually have to clean up our data before doing any aggregation, and in this scenario we can treat NULL as a zero. How are we going to clean up and handle the data? We are going to make an UPDATE on our table, where we say: if there is a NULL, then make it a zero. And we will do this as a pre-step inside our stored procedure. That means first we clean up the data, and then afterwards we generate the reports. This is what we usually do in SQL projects. The logic is going to be very simple. First we check: do we have NULLs inside the score? If the answer is yes, then we update the NULL values to zero. But if the answer is no, there are no NULL values, and we can skip everything. So now we're going to build this logic inside our stored procedure in order to clean up and prepare the data. Let's go. Okay, so this part we're going to call "generating reports", and we're going to add another part called "prepare and clean up data". Now let's first prepare the structure of the IF statement. The syntax looks like this: IF, and then BEGIN and END. This is the block of the IF, and we do the same thing for the ELSE: we have ELSE, and BEGIN and END. Let me just separate them. So how does this work? We have to create a condition. If the condition is met, the IF block is executed; but if the condition is not fulfilled and we get false, the ELSE block is executed. So what is the condition? We have to check whether there are NULLs inside the scores. Let's write a very simple query: SELECT 1 FROM sales.customers WHERE score IS NULL, and, as always, we check the country, equal to, let's say, 'USA'. Let's execute this one over here. We are getting results in the output; if we get results, that means there are NULLs somewhere. But if you say 'Germany' here, for example, and execute the same query, you see in the output that we don't get any results: for the German customers there are no NULLs in their scores. So if this query returns something, we have NULLs; if it doesn't return anything, there are no NULLs. And we're going to use exactly this query as the condition. We take our check and say IF EXISTS, then two parentheses, and we put our query inside.
So what we are saying is: IF EXISTS, if this query returns anything, then execute the first block; and if it does not exist, meaning the query returns nothing, then execute the second block. It's simple logic, right? Now, of course, instead of having a static value over here, we use our parameter, @country. Next we have to tell SQL what to do if it exists. In between we put an UPDATE statement: UPDATE sales.customers, and we set the score equal to zero. But, very important, we have to use a WHERE condition, otherwise it will update everything: WHERE the score IS NULL AND the country equals our parameter @country. With that we are updating exactly the NULLs for the specific country. And let's put a semicolon at the end. At the start, just to have a nice message in the output, I'm going to add a PRINT with the message "Updating NULL scores to 0", also with a semicolon at the end. So: if there are any NULLs, execute the whole block, print the message and update the table. The next step is to tell SQL what happens if the condition is not fulfilled, meaning we don't have any NULLs. Well, we don't have to update the table at all, because there is nothing to clean up. But I'm going to add a PRINT over here with the message "No NULL scores found", and after the last END I'll put a semicolon. So that's it; this is our logic. We check our condition; if the condition is met, we update the table with zero instead of NULL, and if the condition is not met, we don't do anything, we just print a message. Now you might say: you know what, why are you doing this? We could just use the UPDATE statement and skip the whole IF-ELSE; why check in the first place? Each time I run this stored procedure, I could simply update all the NULLs, if they exist, to zero. Well, this is not really professional, because you are wasting resources. Each time you run an UPDATE statement like this, and imagine you have a big table, SQL has to go and check whether there are any NULLs and so on, and this of course consumes resources. It's much better to check first whether the update is really needed; that's why we build this logic. Now, as you can see, our stored procedure is getting bigger and bigger. We have two parts: the first part prepares and cleans up the data, and the second part generates the reports. Let's update the whole thing and execute it. Now let's check it step by step. Look at our query over here: you can see we have a NULL for a USA customer. So let's first execute the procedure for the USA, as the default, and check the messages. It says "Updating NULL scores to 0"; that means the first block was executed, because SQL did find a customer with a NULL. And with that, the average of the scores will be different than before, so we now have a more accurate average in our report. If you check our query again, you can see we now have a zero instead of NULL. Let's execute it for Germany, like this, and check the messages: it says "No NULL scores found", and that is correct, because for Germany we don't have any NULLs. So with that, we have created a control flow using IF-ELSE statements.
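The clean-up block described here would look roughly like this inside the procedure body; it assumes the same sales.customers columns as before:

-- Step 1: prepare & clean up data. Only update if there is actually something to fix
IF EXISTS (SELECT 1 FROM sales.customers WHERE score IS NULL AND country = @country)
BEGIN
    PRINT 'Updating NULL scores to 0';
    UPDATE sales.customers
    SET score = 0
    WHERE score IS NULL AND country = @country;
END
ELSE
BEGIN
    PRINT 'No NULL scores found';
END;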
Now, as you can see, our stored procedure is getting bigger and bigger. We have two parts: the first part prepares and cleans up the data, and the second generates the reports. Let's update the whole thing, execute it, and test it step by step. If you check our query, you can see there is a NULL for a USA customer. So let's first execute the procedure for USA, the default, and check the messages: it says 'Updating NULL scores to 0'. That means the first block was executed, because SQL did find a customer with a NULL — and with that, the average score will differ from before, so we now have a more accurate average in our report. Check the query again and you can see a zero instead of the NULL. Now execute it for Germany and check the messages: it says 'No NULL scores found', which is correct, because for Germany there are no NULLs. So with that we have created a control flow using the IF ELSE statements — and as you can see, we are no longer writing simple queries. We are creating a mini program. It's like an ETL, where first we prepare the data and then we generate reports. And you can imagine how big these stored procedures can get in a real project, with many tables and many things to do. Okay, now we're going to talk about error handling in stored procedures. Error handling is an essential part of programming, because it gives you control over what happens once you have an error. There are many things you can do: maybe deleting data, printing a nicely structured message, doing some logging, and so on. You get full control over what happens if there is an error, and of course we can do that inside a stored procedure. Let's quickly check the syntax. It usually has two parts. The first part is the TRY: BEGIN TRY ... END TRY. You define the boundaries of the TRY, and in between go all your SQL statements — your code. The second part is the CATCH: BEGIN CATCH ... END CATCH. Again you define the boundaries, and in between you tell SQL what to do if there is an error. So what is TRY and CATCH? As the word says, "try" means attempting something that might fail. You are telling SQL: try to execute this code. SQL will go and try to execute it, and if any error happens during execution, SQL jumps to the second block and does whatever you defined in the CATCH. If there are no errors at all, the CATCH is not executed. So the CATCH is your backup plan: if something goes wrong in the TRY, go to plan B and do something. The workflow is simple: SQL executes the TRY, then checks whether there was an error. If not, everything ends there; but if SQL faces an error during execution, it runs the CATCH. So let's go back to SQL and build an example. All right, back to our stored procedure. Let's introduce an error into our code: somewhere in our query we'll divide by zero, which is of course a problem. Update the stored procedure, execute it, and we get an error saying you cannot divide by zero. But I would like something different: a customized message when an error happens, so that I control which information is displayed when there is an issue. And for that we use TRY and CATCH. It's very simple. This is my whole code — everything from preparing the data to generating the reports — and we put the whole thing in a TRY. How? Right after the first BEGIN we add another BEGIN, but for the TRY; then we go to the last END and add an END TRY. With that the whole code sits inside the TRY. After that we introduce the CATCH: BEGIN CATCH and END CATCH. And in between we have to tell SQL what should happen if we encounter an error. The bare skeleton so far looks like this:
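A bare sketch of the structure (the procedure's own BEGIN/END wrapping stays in place; the bodies here are placeholders):

    BEGIN TRY
        -- the whole procedure body: prepare the data, generate the reports
    END TRY
    BEGIN CATCH
        -- error handling: define here what happens when the TRY fails
    END CATCH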
Here we can do many things, but for now I want to focus on customizing the error message. Let's start with the first one: PRINT 'An error occurred.'. Then on the next line I'll print more information, starting with the error message — so the text 'Error Message: ', a colon and a space — and then we can use some predefined functions from SQL, for example ERROR_MESSAGE(). This function returns the description of the error, like the one we saw: divide by zero error encountered. We can keep adding whatever we need, for example the error number: for that there is also a function, ERROR_NUMBER(), and we have to cast this one, because it is a number and the message can only contain character data. So we cast it, for example AS NVARCHAR, otherwise we'll get an error. We can keep adding things to the message — for example the error line, using the function ERROR_LINE(), which we also cast because it is a number too. And one really important piece: the name of the stored procedure. For that we have the function ERROR_PROCEDURE(); it returns a string, so no cast is needed. With that we have defined for SQL what to do if there is an error in our code. Let's execute the whole thing, and then execute our stored procedure. Now, as you can see, we get no results in the output, and SQL does not throw an error — but if you go to the messages, you'll see a very nice message: an error occurred, the error message is divide by zero, and we have the error number, the line, and the stored procedure name. As you can see, it's great: this is how we use TRY and CATCH to control what happens when there is an error. A sketch of the finished CATCH block follows.
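A minimal sketch of the CATCH body using SQL Server's built-in error functions (the exact message wording is up to you):

    BEGIN CATCH
        PRINT 'An error occurred.';
        PRINT 'Error Message: '   + ERROR_MESSAGE();
        PRINT 'Error Number: '    + CAST(ERROR_NUMBER() AS NVARCHAR);
        PRINT 'Error Line: '      + CAST(ERROR_LINE()   AS NVARCHAR);
        PRINT 'Error Procedure: ' + ERROR_PROCEDURE();
    END CATCH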
Now the next step is to organize our stored procedure, because everything is getting bigger. What we usually do is use tabs to indent each section. The first level is everything between the first BEGIN and the last END: mark all of it and hit Tab once — now it's easier to read that the whole thing is our code. The next level is the block of the TRY: mark everything inside it and hit Tab. Now we can see it better, right? The same for the CATCH — I think that one is already indented. Then the next level: between each inner BEGIN and END, everything gets pushed in as well. So all our BEGIN/END blocks are now indented correctly. The next step is to improve the comments a little and split our code into multiple sections. We go to the first part and label it step one — and what I like to do is add a separator line using equals signs, or any special character you like, above and below it. With that we have the first step: preparing the data. Then we copy the separator, go to the second part, and label it step two: generating summary reports. Below that we can describe what each report is about — calculate total customers and average score for a specific country — and add another comment: calculate total number of orders and total sales for a specific country. Of course we have to remove the divide-by-zero we introduced, otherwise we'll keep getting an error, and we can add a few comments for the CATCH as well, saying: error handling. Let's execute it again to make sure we have the newest version. And with that we are done: we have a really nice stored procedure with multiple steps, built professionally with error handling inside, well organized and easy to read. This is how we build stored procedures. All right, my friends — that's all about stored procedures, an amazing feature that adds programmability to SQL. In the next step we're going to quickly cover the topic of triggers. So let's go. Previously we understood that we can put all our SQL statements into one stored procedure — but you have to execute that stored procedure manually. That means in order to trigger it, you have to run it yourself, and that can be a problem. How about doing it automatically? Triggers in SQL are special stored procedures that run automatically — that are fired, as we say — in response to a specific event happening on a table. What does that mean exactly? Say we have a table in our database, and things can happen to this table: inserting data, deleting, updating. All of these happenings we call events. What we can do is attach a trigger on top of this table, and each time an event happens — an insert, update, or delete — something else is triggered: maybe inserting data into another table, or checking whether we are allowed to delete the data in the first place, or sending a warning message. So based on any change to the table, we can trigger another action, and we do that using SQL triggers. There are multiple types of triggers. DML triggers respond to INSERT, UPDATE, and DELETE statements. DDL triggers respond to schema changes, like creating, altering, or dropping a table — or even a view, by the way, not only tables. And the third type is the logon trigger, which responds to login events. In this tutorial we're going to focus on the DML triggers: insert, update, delete. For DML triggers there are two kinds: AFTER triggers and INSTEAD OF triggers. As the name suggests, an AFTER trigger executes after the event, while an INSTEAD OF trigger is for things that cannot wait until everything has happened — it executes in place of the triggering statement, not after it. Now, to understand all of this, we're going to use a really nice use case: maintaining an audit log. What do we mean by that? Take, for example, the employees table.
Employee data is usually very sensitive information: there you can see which employees were added, salary updates, employee terminations. That makes the table very important, and we would like to track all the changes happening to it. Each time we insert, update, or delete, we want to maintain a log of the change so we can analyze it later. Such logs are of course very important for compliance and for auditors, and if there is ever a problem, we can go to the logs to understand when it happened, who made the change, and what exactly changed. To maintain these logs, we're going to use the power of triggers. What we'll do is attach a trigger to the employees table, and each time we insert new data into employees, we trigger another event: the new employee is inserted into the audit log, so we have a record of that activity. That means each time you insert data into the employees table, you are automatically inserting data into the logs — a really great use case for triggers. So let's implement it. Okay, first let's quickly check the syntax of triggers. We start with the usual CREATE TRIGGER, then the trigger name, and then we have to specify which table this trigger is built on — we are attaching the trigger on top of one table. After that we define for SQL when the trigger fires — what actually triggers the trigger. First you define AFTER or INSTEAD OF, and then the operation: INSERT, UPDATE, DELETE, or one of them. With that you're telling SQL when exactly this should happen. Then we tell SQL what happens when the trigger fires: we have BEGIN and END, and in between several SQL statements describing what should happen. That's it — as you can see, the syntax is very simple. Now let's do it step by step. First I'd like to create a table where we will store the log information. It's a very simple table: CREATE TABLE, and we'll call it Sales.EmployeeLogs, with the following columns. Let's start with the primary key: LogID, data type INT, with a sequence — so IDENTITY — and this is the primary key. Next is EmployeeID, data type INT. Next is LogMessage; let's make it VARCHAR(255). And then LogDate, which can be a DATE or a DATETIME. That's it — execute it, and with that we have a new table in our database. The next step is to create our trigger. We say CREATE TRIGGER, and I'll name it with the prefix trg — just to indicate this is a trigger — and call it trg_AfterInsertEmployee. Now we have to define the table: ON Sales.Employees. With that we are saying we now have a trigger on the employees table, and next we have to define the logic.
We're going to use AFTER INSERT — meaning after any record is inserted into the employees table, the following should happen. We say AS, then BEGIN and END, and in between goes our logic. So what should happen after a new record is inserted into employees? We insert a new record into the employee logs: INSERT INTO Sales.EmployeeLogs, with the three columns EmployeeID, LogMessage, and LogDate. And which values get inserted? They come from a query. We SELECT the EmployeeID; for the log message we build a customized one, say 'New Employee Added' plus the employee ID; and for the log date we use GETDATE(). Now you might ask: where is this EmployeeID coming from? It comes FROM inserted. What is "inserted"? It is a special virtual table that holds all the newly inserted data for our employees table — anything we insert into employees is available inside it. And it is only available during the execution of the trigger: you cannot go outside this trigger and start querying the table "inserted", because you won't find anything. It is a virtual table containing whatever you are inserting into employees, with all the information — the salary, the age, and so on. So that's it for "inserted". One more thing: we have to make sure everything in our message is a string, and since the employee ID is an integer, we have to cast it — CAST(... AS VARCHAR) — otherwise we'll get an error. I think our trigger is ready: we have a new trigger on the employees table. First question: when does this trigger fire? After inserting data into employees. Second question: what happens then? Once we have that event, this whole block executes, inserting into the logs the employee ID, the message, and the date when it happened — and we get all that information from the virtual table "inserted". So I think we are ready — let's execute it. Now, if you go to the object explorer, to our database, to the employees table and then to Triggers, and refresh, you can see the new trigger we just created. With that, our trigger is in place and we are ready. The next step is to actually trigger our trigger. Let's open a new query — but first, have a look at our logs: query Sales.EmployeeLogs, and as you can see the log is empty, because we haven't inserted anything into employees yet. So let's do that and fire the trigger: INSERT INTO Sales.Employees with the following values. We're at counter six, I think, so ID six; first name Maria; a last name; the position, say HR; a birth date — let's pick something; gender female; a salary; and the level in the hierarchy, for example three. Put together, the whole setup looks something like the sketch below — so let's execute it.
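A sketch of the full setup, assuming the table and column names used in this walkthrough (the sample employee values are just placeholders):

    CREATE TABLE Sales.EmployeeLogs (
        LogID      INT IDENTITY(1,1) PRIMARY KEY,
        EmployeeID INT,
        LogMessage VARCHAR(255),
        LogDate    DATE
    );
    GO

    CREATE TRIGGER trg_AfterInsertEmployee ON Sales.Employees
    AFTER INSERT
    AS
    BEGIN
        INSERT INTO Sales.EmployeeLogs (EmployeeID, LogMessage, LogDate)
        SELECT
            EmployeeID,
            'New Employee Added = ' + CAST(EmployeeID AS VARCHAR),
            GETDATE()
        FROM inserted;   -- virtual table holding the newly inserted rows
    END
    GO

    -- fire the trigger (the employee columns here are assumptions)
    INSERT INTO Sales.Employees
    VALUES (6, 'Maria', 'Doe', 'HR', '1988-01-12', 'F', 80000, 3);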
And with that, as you can see, we have inserted new data into the employees table. Now let's check the logs — query the table, and there it is: a nice log entry about employee number six, with a nice message and the time when it happened. Of course, you can insert another employee, say number seven, with the same data; do that and check the logs, and you'll see another entry for the new employee. This is a really great use case for maintaining a log of your data, and you can later run analyses, for example on how many inserts happened — and of course it's not only for inserts; you can do the same for UPDATE and DELETE. So as you can see, it is very simple: this is how we create triggers in SQL. All right, my friends, that's all about triggers — and with that, we have now covered all the concepts and topics you have to learn about SQL. The next chapter is about performance. As you start writing queries, you will notice that some of them are really slow, so in this chapter we're going to learn different techniques for optimizing performance — and the first, most famous one is building indexes in the database. Let's understand what that means. What is an index? An index is a data structure that provides quick access to rows in order to improve the speed of your queries. An index is like a guide for your database that speeds up the process of searching for data, especially in big tables. To picture it, imagine you have a huge book and you want to find a specific topic or chapter. Instead of flipping through every single page to find the topic you're searching for, you would use the index at the back of the book to jump straight to the right page — and that's exactly what an index does, but for your data. Another analogy I use to explain indexes is a big hotel. Say the hotel has no guide at all and you want to find room number 5001: you would have to search floor by floor, checking each room until you found yours. Thankfully, hotels have a numbering system, and you can get a map from the reception showing in which building and on which floor your room is. By following the map and a few signs, you can locate your room very quickly in such a big hotel. And that's exactly what every database needs: an index that helps it find and locate the right data without scanning everything. Now, say you come to me and say: I have this big table and I'd like to speed up my queries using indexes. My first question would be: what exactly are you doing with this table? Are you using it to search for text, or are you doing complex analysis on it? The reason I ask is that databases have different indexes for different purposes. So let's take a quick look at the types of indexes we have. I divide the indexes in databases into three categories. The first is by structure — how the database organizes and references the data — and here we have two types: the clustered index and the non-clustered index. Those are very important to understand.
The next category is by storage — how the data is physically stored in the database — and here we have two types: the row store index and the column store index. The third category is by function, and here we have the unique index and the filtered index. Each index type has its own strengths, but there is always a tradeoff: some improve read performance, others improve insert and update operations. It's all about choosing the right type of index for the job. So what we're going to do is deep dive into each of these types to understand how they work and how to create them, starting with the first category, the structure: the clustered index and the non-clustered index. But before we dive into how indexes work, let's first understand what happens to a database table if you don't use any index at all. When you create a new table in your database — for example a customers table with, say, 20 customers — what you see on the client side looks like a spreadsheet, a table with rows and columns. But behind the scenes the database stores it a bit differently: it stores the data in a data file on disk, and inside this file the data is stored in blocks called pages. So it's not rows and columns that live in the data file — inside the data files we have pages. What is a page? A page is the unit of data storage in a database, with a fixed size of 8 kilobytes, and SQL can store anything inside it: the rows of your tables, columns, metadata, indexes. Every time you interact with your data, SQL is reading from and writing to these pages. So SQL does not store the data as loose rows and columns, and when you run a query, SQL does not pick out a single column — it always fetches a data page in order to read the rows inside it. The two main page types we're going to learn about are the data page and the index page. What does a data page look like? It is divided into multiple sections. The first section is the page header, where the database stores key metadata such as the page ID, which has the following format: it starts with the file ID, for example 1, followed by a unique number for each page, for example 150 — so 1:150. The page header has a fixed size of 96 bytes. The next section has a variable size: this is where your data rows — your actual data — are stored, and SQL tries to fit as many rows as it can into a single page. How many depends on the size of each row: if you have a wide table with really big rows, SQL can fit only a few rows into one page. The last section of the data page is the offset array. This is like a quick index for the rows stored in this page: it keeps track of where each row begins, so SQL can locate a specific row without scanning the entire page. So that's the structure of the data page, and this is exactly how SQL stores data inside the database. Now back to our example with the customers table and its 20 rows — let's see how SQL creates those pages.
If you are not using any index on this table, SQL simply inserts the data into the pages in the order you insert it into the customers table. Maybe you insert customers 12, 5, 6, 7 first — then SQL writes them to the data pages exactly like that. Let's say each data page fits only five rows: after we insert five customers, SQL creates another data page for the next rows, inserts the next five customers there, and once that page is full, it creates another one, and so on, until we have, say, four pages for our 20 customers. Now if you look at the customers across those four pages, you'll see they are not sorted at all — because in this scenario we are not using any index. We call this structure a heap. A heap table is a table without a clustered index, meaning the rows are stored randomly, in no particular order. That is not entirely bad, because inserting data into such a table is very quick — but finding something in it is very slow. This is the first tradeoff: very fast writes, but very bad reads. Think of it like throwing all your papers into a drawer without organizing them: you can toss things in very quickly, but searching for a specific paper later is a long process, because nothing is in order. Now let's see how SQL handles a read from this table. Say you are searching for the customer with ID 14. SQL has absolutely no idea where to find this customer, so it starts fetching each data page and scanning each row. It starts with the first data page and scans it — no 14 there. It moves to the next page and scans again, searching for ID 14 — nothing. Same for the third page. So SQL goes to the last data page, and there, after scanning four rows, it finally finds customer number 14 and returns it to the client. As you can see, to find one customer, SQL read four different pages and scanned around 19 rows — a process we call a full table scan. A full table scan means SQL scans the entire table, page by page and row by row, in order to find a specific row. For this small table it's maybe not a big deal, but if you have a big table with hundreds of thousands or millions of rows, searching through a heap structure becomes very painful and slow just to locate one row. And this is exactly why we need indexes in SQL databases. So let's understand the first type of index: the clustered index. What happens if you create a clustered index on your table — say, on the ID column of the customers? The first thing that happens is that SQL physically sorts all the data based on the ID column: the rows are rearranged across the data pages from the lowest value to the highest.
So in the first page we have the first customer, ID number 1, then 2, 3, 4, 5, until in the last page we reach the last customer, number 20. The first page holds the lowest values and the last page the highest. But that's not all: the next step is that SQL starts structuring and building the B-tree. What is a B-tree? A B-tree — short for balanced tree — is a hierarchical structure that stores the data as a tree turned upside down. It starts with the root node and keeps branching out until we eventually reach the leaves. The section between the root and the leaf nodes we call the intermediate nodes — there can be one level or multiple levels between the root and the leaves. Once SQL has constructed the B-tree, it becomes very easy for SQL to navigate through it to find a specific piece of information. So let's see how SQL builds the B-tree for the clustered index. It is very important to understand that in the B-tree of a clustered index, the leaf nodes contain the actual data — the data pages. All your nicely sorted data pages, your data itself, are stored at the leaf level. After that, SQL starts building the intermediate nodes, and here the database uses a different kind of page: the index page. In an index page we don't find the actual data, the entire rows; instead, the index page stores a key value together with a pointer to another index page or to a data page. For example, we have the key value 1, and the value is the ID of the data page — we don't have the whole row here, only a pointer. We are telling SQL: if you are searching for IDs between 1 and 5, you can find them in the data page with ID 1:100. Then we store another pointer in this index page telling SQL: if you are searching for IDs between 6 and 10, you can find them in the second data page, 1:101. That is the structure of the index page — it contains only pointers to other pages. The same happens for the next two data pages: SQL creates another index page that says, if you are searching for IDs between 11 and 15, go to the third page, 1:102, and for the last group, between 16 and 20, there is another pointer to the last page, 1:103. So inside these index pages we have a pointer for each group of IDs — for each cluster. For the group of customers between 1 and 5 we have one pointer, and for the second group between 6 and 10 another pointer. That means we don't have a pointer per row; we have a pointer per group, per cluster — and that's why we call it a clustered index. Once SQL is done building the intermediate nodes, it builds the last node, the root node, which says: if you are searching for customers between 1 and 10, go to the index page with ID 1:200. So the root node points to another index page, not directly to a data page. And likewise we need another pointer for the second index page: for the customers between 11 and 20, go to the index page with ID 1:201. This is exactly what happens when you create a clustered index in SQL: first it physically sorts all your data in the data pages — if the data was initially stored randomly, SQL has to rearrange everything and sort the data from scratch.
Then it builds this structure where the root node is an index page and the intermediate nodes are index pages, but at the leaf level — the leaves — we have the actual data, the data pages. Now let's see what happens when you query the table for ID number 14. SQL checks which pointer to use: since 14 falls in the group between 11 and 20, it follows the second pointer to the index page with ID 1:201. There, SQL opens the index page and checks the pointers: since 14 is between 11 and 15, it follows the pointer to the data page 1:102. With that, SQL has located the correct data page — the third one — opens it, and finds customer ID number 14. As you can see, it was very fast for SQL to locate the correct data page: with only three jumps, from the root node through the intermediate node, SQL found the right data page — and it had to read only one data page, instead of the four different data pages we saw with the heap structure. You might say: but we are still reading three pages here. Well, reading an index page is very cheap compared to a data page — reading a data page is always slower than reading an index page. So this B-tree structure, the clustered index, helped SQL and the database locate the right data in the right data pages. And this is exactly how the clustered index works in a SQL database. Now let's move to the second type and understand exactly how SQL builds and creates the non-clustered index. We are back to the heap structure, where our table has no index at all and the data is stored randomly inside the data pages. Now, if you create a non-clustered index on the customer ID, what happens? Here is the big difference: SQL will not touch or change anything about the physical, actual data in the database. The data pages stay exactly as they are, nothing changes, and SQL immediately starts building the B-tree structure. It begins building an index page, and this index page is a little different from the one we learned about before. Since it's an index page, it stores pointers — but this time the key is the customer ID itself. So 1 is the customer ID, and the value, the pointer, is no longer a data page ID; it is more specific: an address describing exactly where the row is stored. It starts with the file ID and the page number — because customer ID 1 happens to be stored, say, in page 1:102 — and SQL adds as well the offset of the row, where exactly within the page this ID can be found. The whole thing we call a RID, the row identifier. Let's quickly see how the index page points to the exact row inside the data page: the first part of the row identifier maps to the data page ID, and then the offset — for example 96, right after the 96-byte page header — takes us to where row number one begins, exactly the place where we can read the information of row ID number 1. This is how the index page locates the exact position of the rows.
SQL then continues and assigns each customer ID a pointer to its exact location. So in this index page we don't have a pointer per group of customers, like we learned with the clustered index — we have a pointer for each individual ID, and this kind of pointer is called the row locator. SQL keeps going and maps a pointer for every customer ID in our table, so we end up with multiple index pages pointing into our data pages. As you can see, there are a lot of pointers; the data inside the index pages is of course sorted, but the data pages themselves are left as they are. These index pages holding the row identifiers are stored at the leaf level of the B-tree. So at the leaf level we don't have the actual data, the data pages — we have index pages containing pointers that then lead to the actual data. Then SQL starts building the intermediate nodes, exactly like with the clustered index, where each node points to another index page: for the customers between 1 and 5, go to the index page 1:200, and so on. The intermediate nodes are built exactly like in the clustered index — nothing changes, it's the same structure: an index page pointing to another index page, this time for a group of customers — and on top we have the root node. Again we call this structure a B-tree, but here the leaves point to the data pages while the data pages themselves are not part of the B-tree. Now, say we are searching for customer ID number 14 — what happens? SQL starts again at the root node, finds the pointer to the intermediate node, jumps there, finds the pointer to the leaf index page covering IDs 11 to 15, scans that index page, and finds: for customer ID 14, here is the address. It then jumps straight to the exact data page and the exact position of the row, without scanning anything else. So this time, with the non-clustered index, SQL read three different index pages and finally one data page in order to find the data. Compared to the clustered index, there is one extra layer — one extra index page to be scanned — in order to find the right place of the row. And this is how SQL creates the B-tree for the non-clustered index and how it traverses it to find the information. All right. When I think about the clustered index and the non-clustered index, I think about a book. Think of the clustered index like the table of contents at the front of the book: it tells you where to find each chapter, and the chapters are physically arranged exactly in the order of the table of contents — and that is exactly what the clustered index does. On the other hand, think of the non-clustered index as the index at the end of the book: a very detailed list of topics, terms, and keywords, each pointing to the exact location in the book where you can find it — and the content of the book is not sorted like that index. That is exactly what the non-clustered index does: it coexists with the data as an extra list that points exactly to where the data can be found inside your table. All right.
So now let's put those two indexes side by side to understand the differences between them. The structure of the clustered index is a B-tree that starts at the root node with an index page; this index page points to the intermediate nodes, which are also index pages, and those index pages point to the actual data — the data pages. So at the leaf level of the clustered index we have the data pages, the actual data. What's special about the clustered index is that it physically sorts the data inside those pages: everything is physically rearranged and ordered. With the non-clustered index we also have a B-tree: again an index page at the root pointing to intermediate index pages, but this time the intermediate nodes point to yet more index pages — not, like the clustered index, to data pages. If you compare the two structures: at the leaf level of the clustered index we have the actual data, the data pages; at the leaf level of the non-clustered index we don't have the actual data — we have index pages, and those index pages point to the actual data, the data pages. The big difference is that those data pages are not part of the B-tree: the B-tree of the non-clustered index is a separate structure that contains no data at all, only index pages, and it merely points to the data pages without changing anything physically about your data. In practice, you can have both types of indexes — the clustered and the non-clustered — on one table. What happens then is that the leaf level of the non-clustered index points into the data pages of the clustered index, because those leaf index pages don't care whether the target pages are sorted or not; they simply point to the correct page and the correct row. So we end up with two different B-tree structures pointing to the same data. And here is one rule you have to understand: you can create only one clustered index on a table. This rule makes complete sense, because the data can be physically sorted in only one way — and that's why SQL databases allow exactly one clustered index per table. On the other hand, you can create as many non-clustered indexes as you need — three, four, more — all pointing to the same data pages, because the B-tree of a non-clustered index stores no data pages, only pointers to the data, and you can have multiple sets of pointers. This is the most important and main difference between the two indexes. Side by side: the clustered index physically sorts and stores the rows in its B-tree, while the non-clustered index creates a separate B-tree structure with pointers to the actual data. And by the way, we call the clustered index the main index — the most important one, the one you would use on practically every table in your database.
Now, as we learned, regarding the number of indexes: you can create at most one clustered index per table, but for non-clustered indexes there is no limitation — you can create multiple indexes per table. Comparing read performance — how fast we can get data — the clustered index is faster than the non-clustered index. That's because the non-clustered index has that extra layer at the leaf level of the B-tree, and this extra layer means SQL has to do extra work to reach the data. On the other hand, if we talk about write performance — how fast we can insert data — writing to a table with a clustered index is slower than writing to one with a non-clustered index. That's because as you insert data, SQL always has to check the data pages: is everything still sorted correctly? If not, SQL has to physically re-sort the data to restore the correct order, so there is a lot of overhead in keeping the data sorted with a clustered index. The non-clustered index doesn't have this problem: the physical data stays as it is, and we just create new pointers. So writing to a table with a clustered index is slower than writing to a table with a non-clustered index — and of course the fastest way to write is to have no indexes at all, a heap structure, where SQL just inserts data into the pages without maintaining any extra structures. As you can see, it's always a tradeoff: you can read fast, but you'll write more slowly — you can't have everything. On storage efficiency, the clustered index is better than the non-clustered index, and for the same reason: the non-clustered index carries that extra layer of index pages, index pages need storage, and so it consumes more space than the clustered index. Now, about use cases: when do you use a clustered index? A column should meet a few criteria to be a good candidate. First, its values should ideally be unique. Second — and even more important — the values of this column should not change often, because if the column receives many update operations and the data keeps changing, SQL has to re-sort the data again and again. A frequently changing column is a bad fit for a clustered index. That's why the primary keys of tables are a perfect candidate: first, they are unique, and second, we never update a primary key value — we only append new ones. One more situation where I use a clustered index is to optimize the performance of a range query: if you are querying data between one value and another, the clustered index works really well — for example, something like the query sketched below.
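A hedged example of the kind of range query that benefits from a clustered index (the table and column names here are assumptions; the point is that the matching rows sit in a contiguous run of physically sorted data pages):

    SELECT *
    FROM Sales.Orders
    WHERE OrderID BETWEEN 1 AND 100;  -- one seek, then a sequential read of adjacent pages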
On the other hand, the non-clustered index is useful on columns that appear in your search conditions; or, if you are joining tables on something other than the primary keys, you can apply a non-clustered index to get faster joins; or you can use it to optimize searches for an exact value — an exact match. Those are the main and most important differences between the clustered and non-clustered indexes. All right — before we go to SQL and start practicing, let me show you the syntax of the index. It's very simple: it starts with CREATE, then you can specify whether it is CLUSTERED or NONCLUSTERED, then the keyword INDEX. That middle part is optional: if you don't specify anything, the default is NONCLUSTERED, so a plain CREATE INDEX makes SQL Server create a non-clustered index. After that comes the name of the index, then ON and the table you want the index on, and then one or more columns for the index — we call an index with multiple columns a composite index. For example, you can create a clustered index with: CREATE CLUSTERED INDEX index_name ON customers (id) — we are saying: create a clustered index based on the ID column of the customers table. To create a non-clustered index, you say CREATE NONCLUSTERED INDEX, and it's the same pattern. So far we've used one column per index, but we can create a composite index with multiple columns, like an example with two columns, last name and first name — and note that we can also tell SQL how to sort the data: last name sorted ascending inside the data pages, lowest to highest, but first name the other way around, highest to lowest. So you can control how the data is physically sorted in the data pages. As you can see, the syntax for creating an index in SQL is very simple. All right — back to SQL, and the first question is: where do we find the indexes in the database? You can explore them. Go to the object explorer, open any table from our SalesDB — for example the customers — and there is a folder called Indexes. If you expand it, you'll find an index there already. I didn't create any of these indexes myself, but in SQL Server, if you define a column as a primary key, SQL Server automatically creates a clustered index for it — it always makes sense to have a clustered index on the primary key. So this one was created by default, and as you can see its name starts with PK, for primary key, and it is clustered. Now, I'd like to start from scratch, so I'm going to create a new table without any indexes. What we'll do is load the customers table into a new table. How? We say SELECT * FROM Sales.Customers, and before the FROM we add INTO a new table: Sales.DBCustomers. Like this — let's go ahead and execute it.
Now if you go to the left side and refresh the tables, you'll find the new table, DBCustomers. Let's check whether it has any indexes: the Indexes folder is empty — no clustered index, nothing else. This table has a heap structure: the data was inserted randomly and is not sorted. And if I run, for example, a SELECT from this new table WHERE CustomerID = 1 and execute it, SQL Server does a full scan of the table to find that customer ID. So our new table DBCustomers is a heap — but let's change that and create a new clustered index. We say CREATE CLUSTERED INDEX, then give the index a name. We usually follow this naming convention: idx as a prefix, then the table name — DBCustomers — then the key column we are indexing on. It's important to stick to one naming convention, because later, as you monitor your indexes, it's really easy to see: okay, this index is on the table DBCustomers, and it uses CustomerID as its key. After that we specify which table we are indexing: ON Sales.DBCustomers, then the column name. So we are saying: build me a clustered index based on CustomerID. Let's execute it — as you can see it's very fast, because we have only a handful of rows, so the database rearranged the data pages quickly. Refresh the indexes and look inside: there is our new clustered index based on CustomerID. Now, as we learned, we cannot create multiple clustered indexes — but let's test that. I'll take the same statement and try to create a clustered index based on the first name as well. Execute it, and as you can see, SQL says you cannot create more than one clustered index on this table. So indeed, only one clustered index is allowed. And say that after creating the index you realize you chose the wrong column and want to switch to the first name: you have to drop the index. We say DROP INDEX, then the index name — it was this one — and then you have to specify the table: ON Sales.DBCustomers. Run it, refresh again, and you can see there are no indexes anymore — the table is back to a heap structure — and now you could go and create the correct clustered index for this table. But to be honest, I'm going to stick with CustomerID: I won't create a clustered index on the first name, because the first name is of course not unique — multiple customers can share the same name — and updates can happen on a first name, which would be very expensive. So I'll keep my index on CustomerID. Let's execute it, and now I have my index on the table again. The whole lifecycle, as a sketch, looks like this:
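A sketch of what we just did, using the names from this demo (the idx_<table>_<column> naming convention is the one introduced above):

    -- copy the customers into a new table without any indexes (a heap)
    SELECT *
    INTO Sales.DBCustomers
    FROM Sales.Customers;

    -- build the one allowed clustered index on the primary-key-like column
    CREATE CLUSTERED INDEX idx_DBCustomers_CustomerID
    ON Sales.DBCustomers (CustomerID);

    -- wrong column? drop it and create a different clustered index instead
    -- DROP INDEX idx_DBCustomers_CustomerID ON Sales.DBCustomers;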
Now, say I have the following SELECT statement on our table: from customers, searching by last name — let's search for Brown — and execute it. And say we keep getting more and more customers, the table keeps growing, and I use this query frequently, searching for specific customers by their last name. What can we do? We can create a non-clustered index on the last name to improve the performance of this query. So: CREATE NONCLUSTERED INDEX, named using our convention — idx, DBCustomers, LastName — then ON Sales.DBCustomers, using the column LastName. Execute it, go to our indexes and refresh, and there is the new index. As you can see, it says non-clustered, and also non-unique — we'll talk about uniqueness later. So we've just created a non-clustered index on the last name, very easily. And as we learned, we can create multiple non-clustered indexes on the same table. Say our query now looks like this: we are searching the first name for the value Anna, this query happens a lot, and maybe it's slow — so we create another non-clustered index. And here's a shortcut: you don't always have to spell out NONCLUSTERED, because it is the default, so we can skip it. Let's call this one FirstName and use the FirstName column. Create the index, refresh the indexes, and as you can see, SQL created a non-clustered index on the first name. So if you don't specify the type of the index, it defaults to a non-clustered index. Both of them together, as a sketch:
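A sketch of the two single-column non-clustered indexes from this demo:

    CREATE NONCLUSTERED INDEX idx_DBCustomers_LastName
    ON Sales.DBCustomers (LastName);

    -- NONCLUSTERED is the default, so this creates a non-clustered index too
    CREATE INDEX idx_DBCustomers_FirstName
    ON Sales.DBCustomers (FirstName);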
Now let's talk about the composite index: an index with multiple columns inside the same index. So far we've used only one column per index, but we can specify several — and that's because sometimes our WHERE conditions are more complicated and based on multiple columns. For example, say we are searching for Country = 'USA' and, at the same time, Score higher than 500. This condition uses two columns, and we'd like to speed up this query. How? We create an index — name it idx, DBCustomers, CountryScore — ON Sales.DBCustomers, and now comes the important part: we define the list of columns to include in the index, and the order of these columns is crucial. Put the column you always filter on first — here the country — and then the score. Create the index, check the Indexes folder, and there it is. Once you create such an index — which your table now has to keep maintaining — you have to be committed and responsible: in your queries, make sure the leading column of the index, the country, is part of your filter, so the optimizer can actually use the index. If your query filters on country and score, or on country alone, the index works; but if your query skips the country and filters only on the score, SQL cannot seek on your index. So either you write your queries to match, or you recreate the index with the columns switched. Be very careful with composite indexes — the column order really matters. And you might ask: we now have a nice index on those two columns — what happens if my query uses only one of them, for example just the country? Is SQL still using the index even though I don't have the score? Yes — because it follows the leftmost prefix rule. SQL can use the index as long as your filters include the leading, leftmost columns. In our index, country is on the left, which is why it works here. But if you skip the leftmost column, it won't work: if you select only on the score — say, higher than 500 — you have skipped the country, and the index won't be used. As long as you include the left columns, it works, even if it's only one column — so in this scenario, the first query uses the index and the second one does not. Let me give you a very simple example of how this works. Say we have an index over four columns: A, B, C, D. If your query targets column A, the index is used. The same happens if you use A and B — with those two columns you are still using the index. Now the scenarios where the index won't help: if you jump straight to column B, you are not using the leftmost column A, and the index won't be used. If you use A but skip B — so A and then C — the index can only seek on the A part; the C filter can't use the index structure. If you use A, B, C, the index is used; but if you use A and B and then jump and skip to D, the seek stops at A and B, and D can't benefit. So the index is used along the unbroken prefix from the left — and this is what we mean by the leftmost prefix rule for composite indexes. If you use multiple columns inside one index, be careful with the order of the columns you define. A sketch of the composite index and the rule:
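A sketch of the composite index and how the leftmost prefix rule plays out (names as in this demo; whether the optimizer actually seeks also depends on statistics, so take the comments as the rule of thumb being taught here):

    CREATE INDEX idx_DBCustomers_CountryScore
    ON Sales.DBCustomers (Country, Score);

    -- uses the index: the filter includes the leftmost column (Country)
    SELECT * FROM Sales.DBCustomers WHERE Country = 'USA' AND Score > 500;
    SELECT * FROM Sales.DBCustomers WHERE Country = 'USA';

    -- cannot seek on the index: the leftmost column (Country) is skipped
    SELECT * FROM Sales.DBCustomers WHERE Score > 500;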
So if it's the first name, you will see only first-name information in this data page; you will not see the last-name information. So if you compare them: the row store index stores the data row by row, and the column store index stores the data column by column. This is a very high-level representation of how the column store index is stored, and as you know me, we go into the details in order to understand exactly how SQL works with the column store index. So let's go. All right. Now let's say we have a table for the customers with three columns, ID, name and status, and around 2 million rows, 2 million customers. As we learned, by default the table is going to be built as a heap structure, where the rows are stored row by row inside data pages. But now we go and create a column store index on top of this table. Once you do that, SQL goes through a process in order to build the column store. The first step: SQL divides the data, the rows, into row groups. In SQL Server, each row group can contain around 1 million rows, so in this example our table is going to be split into two row groups: the first million rows in one group and the second million in another. Now you might ask me: we are talking about columns, why are we splitting the rows? Well, this is just a pre-step in order to optimize the performance and to allow parallel processing, and of course the data will not be stored like this, because we have the second step. In the next step, SQL segments the columns. SQL goes through each row group and starts splitting the data by columns, and that's why we call it a column store: we are separating the columns from each other. That means we have one segment for the ID, another one for the name, and a third one for the status, and this happens for each row group. Now it moves to the third step in this process: data compression. This is the most important step, because it is the reason why the column store is very fast compared to the row store. There are different techniques for data compression, and the most famous one is to create a dictionary. Take for example the column status, the status of the customer, whether it is active or inactive. The words "active" and "inactive" are repeated around 2 million times, because we have 2 million customers, and since they are strings, they take a lot of space and storage. Instead of that, we compress the data. First, SQL creates a dictionary by replacing the values active and inactive with smaller values like 1 and 2, so we have a mapping between the long value and a small value. After that, SQL stores a data stream where we have only those two values, one, two, one, two, a big stream over the 2 million rows. SQL does this for each column, and with that the size of each column changes, depending of course on how many distinct values you have in each column. So this step is very important in order to reduce the size of the data and to increase the performance. Now, once everything is organized and compressed, SQL starts storing the results. But SQL will not use the standard data pages that we have learned about previously.
Instead, it uses a special page type called a LOB page, a large object page. So now let's quickly compare the structure of the normal data page that we learned about in the row store with the new one, the column store LOB data page. As usual, each page has a header; this is the same as any data page. But the next section is the segment header. It has metadata about the column segment that is stored in this page, like the segment ID, the row group ID, the column ID, and as well a very important piece of information: the ID of the dictionary page. The dictionary page is also a type of page in SQL. It has a header as well, but inside it we have the mapping: it maps the original value, the long one, like "inactive", to the smaller version of that value, for example 2. And that's all for the dictionary page; it holds the mapping between the original values and the smaller values. Beneath the segment header we have the important place where our data is stored: the data stream. It is a sequence of IDs from the dictionary that represents the values of the column side by side. And of course, we cannot fit a whole million rows inside one data stream; we're going to have multiple LOB data pages. So this is exactly how SQL stores your data if you decide to go with the column store. So let's go back to the process. As you can see, SQL stores the data in LOB data pages. This is the last step, and with that SQL has converted your table into a column store. Now, we cannot just create a column store without deciding whether it is a clustered or a nonclustered index. Let's start with the first one, the clustered column store index. If you create such an index, SQL of course will not be building a B-tree structure; SQL is going to use exactly this structure, the column store structure. As we learned, a clustered index is a complete makeover of your table: when you apply it, SQL formats everything column-wise, fully replacing the old row-based table structure that we had at the start. So once you apply the clustered column store index, it does not leave anything behind; your table is going to be completely restructured as a column store. And one more thing, which of course makes sense: all the columns of the original table are converted to the column store; it leaves nothing out. On the other hand, if you are using a nonclustered column store index, as we learned, it is like a companion to your existing table. It coexists with the table and does not replace anything. The column store index is an additional structure stored beside your table, which means the original table will not be deleted at all, unlike with the clustered column store index: your data stays in the old row-based storage, the regular table, and is additionally stored in a separate structure, the column store index. And of course, since the nonclustered column store index is an extra index outside of your original table, you can define which columns should be included; it doesn't have to be all of them. You can go, for example, with only the status, which means you build a column store index for just one column, the status of the customers. So this is what we mean by the clustered column store index and the nonclustered column store index.
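By the way, once you have created a column store index on a table, you can peek at these internals yourself. As a minimal sketch, SQL Server exposes the row groups through the catalog view sys.column_store_row_groups (the table name here is just our demo table):

-- One row per row group of the column store index:
SELECT
    OBJECT_NAME(object_id) AS table_name,
    row_group_id,
    state_description,        -- e.g. COMPRESSED
    total_rows,
    size_in_bytes
FROM sys.column_store_row_groups
WHERE object_id = OBJECT_ID('Sales.DBCustomers');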
All right friends, so now you might ask me why we are doing all this. Why would I split my data by columns? Well, it's all because of analytics. In analytics we have big, complex queries with a lot of data aggregations on big tables, and the column store index is perfectly designed to improve the performance of such big queries. That's why SQL databases like SQL Server, and BI tools like Tableau and Power BI, adopted this method in order to offer a fast platform for data analysis. So now let's understand exactly why the column store index is way faster for data analysis than the row store index. So let's go. Again we have the customers table, and let's say we have five customers with ID, name and status. As we learned before, if we are using a row store index, the data is stored in multiple data pages, and in each data page we have whole records, the whole information about each customer; for this example we have three data pages. But if you are using the column store index, it's stored a little bit differently. The first column, the ID, is stored in one data page, and here SQL will not build a dictionary, because the IDs are already short; we just have one data stream with all the IDs. The next column, the name, is stored in a separate data page with an extra dictionary page, where each name is mapped to a small value, so the data is compressed and we save storage. Then the database creates one more data page for the third column, the status, and the dictionary here is going to be very small: for active we have 1 and for inactive we have 2, and in the data stream we store only the IDs from the dictionary. So now let's understand why the column store is faster. Take the following query: we want to find the total number of customers that are active. So we have SELECT COUNT(*) FROM Customers, and we filter the data by the status, where it is equal to active. Now, if we query the table with the row store, what happens? SQL first has to collect the data: it goes to the first data page and collects the first two customers, then to the second, then the third, and so on. As you can see, SQL here is reading everything, the whole row, the ID, the name, the status, even though for this query we don't actually need all that information; we just need to count how many customers have the status active. But SQL cannot selectively read only the status; it has to read the whole record. After SQL has all the data, it filters it, removing the inactive rows, and then SQL does the aggregation, and with that we get three rows; that's why the total count of active customers is three. But now let's see how SQL queries the column store. SQL first analyzes: okay, which columns do I actually need for this query? Well, we need only the status. So SQL will not open and read all three data pages; SQL targets only one data page, the one with the column status. It takes this very simple data stream, consults the dictionary, and removes all the values where the value equals 2.
So in the output we have only three values, and SQL does a very quick count of those values, so we get the same result: three active customers in total. Now if you compare the intermediate result sets of the row store and the column store, you can see that in the row store we fetched and retrieved a lot of unnecessary information for this query, which of course makes the query very slow. But the column store reads exactly what it needs for this aggregation: we didn't read any extra information about the names or the IDs of the customers, it didn't open any extra data pages; it gets exactly the data it needs for the aggregation. And that's exactly why the performance of queries with aggregations and data analysis is going to be very fast if you are using the column store compared to the row store. So that's why we use the column store for big data and data analytics. All right. So now let's summarize the differences between the row store and the column store indexes side by side. Let's start with the definition. The row store organizes and stores the data row by row; it is a really nice method if you need a lot of columns of one row. On the other hand, the column store index stores and organizes the data column by column, which is really great if you are focusing on specific columns. Now, if we are talking about storage efficiency, the row store index takes more space than the column store index, and that's because, as we learned, the column store compresses the data, which saves a lot of storage if you have large tables. Now to the next point, which is more important: the performance, the read and write optimizations. For the row store, things are more balanced: you get decent speed for both write and read operations. In the column store, things are different: it is fast for reading, especially if you are doing data analytics, but writing data, inserting and updating, is slower, because as we learned there are multiple steps until the data is written to the pages. So on one hand you are optimizing the speed of your analytical queries, but on the other hand changing data is slower than with the row store index. Now let's talk about the next point: input and output efficiency. The row store index is not really good here, because you retrieve a lot of columns, so a lot of data has to be read from the disk storage in order to answer your queries. For the column store, the I/O is lower, and that's because it targets exactly the data and columns needed for the query; generally less data is read from the disk storage, and of course that's why we get fast read performance. Now, if you are wondering which systems the row store index is best for: it is very suitable for OLTP systems, online transactional processing systems like banking and e-commerce systems, where full records are accessed very frequently. On the other hand, the column store index is great for OLAP, online analytical processing, where you have data warehouses, data lakes, business intelligence; you are building reports and analyses, you have large data sets and very complicated aggregation queries. If you have such a project, the column store index is the way to go.
So that means the use case for the row store index is high-frequency transactions, where the system has to quickly access records, and the use case for the column store is big data analytics, where SQL has to scan large data sets. Those are the main differences between the row store index and the column store index. All right. So now let's check the syntax of the column store index. Well, it is really easy: we just put the COLUMNSTORE keyword between CLUSTERED or NONCLUSTERED and INDEX. Once you specify that, you are telling SQL you want to create a column store index, and the rest stays as it is. Now, if you want to create a row store index, you don't have to specify anything; there is no keyword for the row store. So as we learned before, CREATE NONCLUSTERED INDEX and CREATE CLUSTERED INDEX both tell SQL we are creating a row store index, but if you use the COLUMNSTORE keyword, you are telling SQL that you want either a clustered or a nonclustered column store index. And here there is a syntax rule: if you are creating a clustered column store index, you must not specify any columns. You cannot specify an ID or country or any columns over here, because it makes no sense: once you say clustered column store, all the columns are going to be included in the new structure. So this is the syntax of the column store index. All right. So back to SQL; let's check how we can create a column store index. Now, if you check our table DBCustomers that we created previously and go to the indexes, you can see that we have created a few indexes, and one of them is the clustered index. This one is a row store index, so our table is split by rows. Now let's change that: let's make our table split by columns using the column store. We're going to say CREATE CLUSTERED COLUMNSTORE INDEX, give it the name idx_DBCustomers, on the table Sales.DBCustomers. And here, if you go and specify a column, it's going to be a mistake, so let's check that. If you execute it, it says it fails, because a key list, the columns, is not allowed. So we cannot have this; let's remove it. And now we have the correct syntax; let's execute it again. We get another error, because it says that one table cannot have more than one clustered index, and we already have one. You have to decide: do you want to split your table by columns or by rows? That's why we have to drop the previous index. So we do it like this: DROP INDEX, then the name of the index, and then we have to specify the table name. So that's it; let's drop the index. Now if you refresh, we no longer see our old clustered index, and our query should now work. So let's do that. Now let's check the indexes again, and as you can see, we got a new clustered index, but this time it is a column store. You can also see the icon at the start: it looks like a bar chart, like analytics and reports, and that's because the main purpose of creating a column store is analytics and reporting. So now, of course, we cannot create multiple clustered column store indexes; we can have at most one.
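As a sketch, this demo looks like the following (the name of the old clustered rowstore index is assumed from the earlier steps; yours may differ):

-- A table can have only one clustered index, so the old clustered
-- rowstore index must be dropped first:
DROP INDEX idx_DBCustomers_CustomerID ON Sales.DBCustomers;  -- assumed name

-- A clustered columnstore index takes no column list:
CREATE CLUSTERED COLUMNSTORE INDEX idx_DBCustomers
ON Sales.DBCustomers;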
So now if you say: you know what, let's create another index for the first name, but this time a column store. If I copy the whole thing over here, make it a nonclustered column store index, call it for example FirstName, and define the column FirstName over here, then execute it, you will see that we get an error where SQL tells us you cannot create multiple column store indexes. That means you can create only one column store index per table, and you have to decide whether it is clustered or nonclustered; you cannot create multiple nonclustered column store indexes the way you can with the row store. So you are allowed only one column store index, but this limitation is specific to SQL Server; in other databases, multiple column store indexes are allowed; in Azure SQL, for example, I know you can do that. So now, in order to practice, if you would like to create a nonclustered column store index, you can drop the first one and then create the one you need as a nonclustered index. So actually, let's do that: drop the first one, so DROP INDEX with our index on this table. Once you execute that, the nonclustered column store index is going to work, and if you refresh over here, you will see that we have a nonclustered column store index on the first name.
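Put together, the drop-and-recreate from this step looks roughly like this (index names assumed):

-- Only one column store index is allowed per table in SQL Server,
-- so the clustered column store has to be dropped first:
DROP INDEX idx_DBCustomers ON Sales.DBCustomers;

-- A nonclustered column store index may take a column list:
CREATE NONCLUSTERED COLUMNSTORE INDEX idx_DBCustomers_FirstName
ON Sales.DBCustomers (FirstName);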
Okay. So now, as we learned, the column store compresses the data, so the storage needed for the entire table should be less than with the row store. Let's see whether that is really true. Now, in order to check this, I will not use the database SalesDB, because everything there is already small; we're going to use another database, AdventureWorksDW2022 (if you have a newer version, that's okay). So what is the plan? We're going to create three identical copies of one table with different structures: the first one is going to be a heap structure, the second one a row store structure, and the third one a column store structure, and then we're going to compare the storage of those three. Now we have to pick one big table, for example FactInternetSales. So let's see how we can do that. Let's start with the heap structure. We say SELECT * INTO a new table, FactInternetSales with the suffix _HP for heap, from the table FactInternetSales. And here it's very important: if you are switching databases, you have to use the USE statement, so USE AdventureWorksDW2022; execute this first to make sure you switch to the new database. And now let's execute our heap copy. With that we have created a heap table, as you can see, around 60,000 rows, and since we didn't define any clustered index, this table is a heap structure. Now let's create another table with a clustered row store index. We copy the whole thing over here, call this one _RS for row store, still targeting the same source table, and execute it. But now, in order to make it a clustered row store, we have to create an index: CREATE CLUSTERED INDEX (we don't have to specify row store, it is the default), let's call it idx_FactInternetSales_RS_PK, then the table FactInternetSales_RS, and now we need the columns of the primary key. Well, actually, I don't know what the primary key is, so let's check: it is a composite primary key, SalesOrderNumber and SalesOrderLineNumber. Let's execute this, and with that we have a clustered row store index. Let's refresh and check what we have over here: we now have two tables, the heap and the row store. Let's expand the row store and check the indexes: as you can see, we have the clustered index. Now we need the third table; it's going to be the column store. I'm just going to copy the whole thing over here, so this is the column store, _CS, and of course we don't need any columns for the column store, and don't forget to add the COLUMNSTORE keyword: CREATE CLUSTERED COLUMNSTORE INDEX, and we have to rename it over here as well. So we first create the table and then convert it to a column store. Let's do that, refresh, and check our tables: this is our third table, and under the indexes we have a clustered column store. All right. So now we are done; we have our three different tables. Now let's check the storage of those three tables. Let's start with the heap table: right-click on it and go to Properties. We can see a lot of information about our table, but we are interested in the storage, so click on the Storage page. Here we see a few pieces of information about the storage, and one of them is the data space: it is around 9 MB, and the index space is almost nothing, since we don't have any indexes. So this is the storage of the heap structure. Now to the row store: go to the _RS table, Properties, then Storage. As you can see, the data space is exactly the same, and that's because whether it is a heap or a row store index, the data itself is stored in data pages as rows; the size of the data will not change, it will only be sorted differently. What changed is the size of the index: we are now consuming extra storage for the index. That means the overall storage of the table with a clustered row store index is more than the heap structure. Now let's check our column store table: go to _CS, Properties, and now it is interesting to see whether our table got smaller. Go to Storage, and as you can see, the data space is around 1 MB compared to the 9 MB. I know those are small numbers, but still, it is massively reduced space, because everything is compressed, and we are not using any index space, because there is no B-tree structure in the column store. So if you compare it to the others, it is the winner: the table using the column store consumes way less storage than the others. So if you want to rank them by storage, the best one is the column store index table, then comes the heap table, and the worst one is the table with the clustered row store index. So it's true: the column store index consumes less space than the other types of indexes.
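Here is a sketch of the whole experiment in one script (table and index names follow the demo; the primary key columns are those of FactInternetSales):

USE AdventureWorksDW2022;

-- 1) Heap: SELECT INTO with no index leaves a heap structure
SELECT * INTO dbo.FactInternetSales_HP FROM dbo.FactInternetSales;

-- 2) Row store: the same copy plus a clustered rowstore index on the primary key
SELECT * INTO dbo.FactInternetSales_RS FROM dbo.FactInternetSales;
CREATE CLUSTERED INDEX idx_FactInternetSales_RS_PK
ON dbo.FactInternetSales_RS (SalesOrderNumber, SalesOrderLineNumber);

-- 3) Column store: the same copy converted to a clustered column store (no column list)
SELECT * INTO dbo.FactInternetSales_CS FROM dbo.FactInternetSales;
CREATE CLUSTERED COLUMNSTORE INDEX idx_FactInternetSales_CS
ON dbo.FactInternetSales_CS;

Besides the table properties dialog, you can also compare the sizes of the three copies with EXEC sp_spaceused 'dbo.FactInternetSales_CS'; and so on for each table.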
All right. So now, what is a unique index? A unique index is a special type of index that makes sure there are no duplicates in your data. And there are a couple of reasons why it is important to have a unique index. The first and most obvious reason is data integrity. The unique index enforces uniqueness in your data, and that is very helpful if you have a column like an email address or a product ID: having duplicates in such columns can mess up your data very badly. So having a unique index on a column like the email makes sure there are no sneaky duplicates inside your data. The second important reason is to improve the performance. For example, if you are searching for a specific email, SQL starts searching for the email value, and once it finds the value, it stops searching, because we are sure there are no duplicates in the data. With that you are improving the performance of your queries. So if you are creating an index and you know the column is unique, make sure to make it a unique index. Now, if you look again at our clustered index with the B-tree structure: if you make this index unique, you are giving SQL an extra task, because it has to make sure that all those customer IDs are unique. SQL has to guarantee that there are no duplicates at all inside your data pages. And since we are giving SQL this extra task of proving the uniqueness of the data, building the clustered index is going to be a little bit slower. That means inserting new data, writing data, is slower than with a normal clustered index. But if we are talking about the read performance, the performance of our queries, it's going to be a little bit faster than a normal clustered index. So again, we have this trade-off: we make writing data slower, but we gain more speed on the query performance. This is what we mean by a unique index. Okay, so let's keep extending the syntax of the index. In order to say whether it is unique or not, we specify it right at the start: we say CREATE UNIQUE just before CLUSTERED or NONCLUSTERED, and nothing changes for the rest. If you don't write anything before the index type, it's going to be non-unique. So for example, plain CREATE INDEX, with nothing specified, means duplicates are allowed in the index; but if you specify a unique index, duplicates are not allowed. So it is very simple. Okay, so now let's go and create a unique index. Let's target the table Products, and first select the data from the table, so Sales.Products, and execute it. Now let's say I'm going to create a unique index on the column Category. Let's try it: CREATE UNIQUE NONCLUSTERED INDEX, give it the name idx_Products_Category, on the table Sales.Products, targeting the column Category. Let's execute it. Now we get an error, because the category has duplicates. If you query our table again, you can see we have duplicate values here, and SQL cannot create a unique index for this table; it's too late. But you could still create this index if the table were empty, and then SQL would not allow you to insert any duplicate categories. And of course it makes no sense to have a unique index on the categories, because we are naturally going to get duplicates there.
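As a sketch, the attempt from this step (table and column names as in the demo):

-- Fails here: Category already contains duplicate values
CREATE UNIQUE NONCLUSTERED INDEX idx_Products_Category
ON Sales.Products (Category);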
But maybe you say: you know what, my products are unique. The product name should be unique, and we are not allowed to have two products with the same name in this table. If you have such a rule in your business, you can define a unique index on the product. So let's do that: we replace the category with the product, and the same thing over here, so we are targeting the column Product. Let's execute it. As you can see, now it works, because we don't have any duplicates inside the column Product. And if you check the indexes over here, we can see our new index, and at the start it says it is a unique nonclustered index. Now let's test the data integrity: are we really prevented from adding duplicates to this table? Let's try it out with an INSERT statement: INSERT INTO Sales.Products, and I would like to insert only the product ID and the product name. We're going to insert two values: a new ID, 106, but a duplicate product name, 'Caps'. We already have a product called Caps over here, so we are now inserting a duplicate. Let's try it. You will get an error saying you cannot insert duplicates into this table, because we have a unique index. So as you can see, this index is now helping us and improving the quality of my table. This is how we work with the unique index in SQL.
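And the sketch of the working version plus the integrity test (column names assumed from the demo):

-- Works: product names are unique in the table
CREATE UNIQUE NONCLUSTERED INDEX idx_Products_Product
ON Sales.Products (Product);

-- The index now enforces integrity: this insert fails with a duplicate key error
INSERT INTO Sales.Products (ProductID, Product)
VALUES (106, 'Caps');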
Okay. So now, what is a filtered index? A filtered index is a regular index, but with a twist: it only includes rows that meet a specific condition. So let's understand what this means. Again we have our nonclustered index and the B-tree structure. Now, at the leaf nodes we get only the IDs of the data that fulfills a specific condition. For example, if we say we want only the active customers, this is the condition; so on the leaf nodes we will have only the customer IDs that are active, and any inactive customer will not be included at all, neither in the data pages nor in the nodes. That means our B-tree structure is going to be a bit smaller than usual, because we have less data included in the structure, so our index is going to be smaller than a regular nonclustered index. So now the question is: why is it important to have a filtered index? Well, the biggest benefit is that we get targeted optimization. For example, if our analysis always focuses on the active users, and the inactive users are totally irrelevant, then having only the relevant subset of data in the index makes the whole index much smaller, which leads to faster performance: it is faster to query this filtered B-tree structure. So we are doing targeted optimization and improving the query performance. The second benefit concerns the storage: since the size of the B-tree structure is smaller, we need less storage space to store the index, which is a great thing if you have large tables in your database. So the filter makes the structure of the index smaller, which improves the speed and the performance, and as well reduces the storage needed for your index. Okay, so now let's check the syntax of the filtered index. It's very simple: like in any query, you can add the WHERE clause at the end of the index definition, with a condition just like in any SELECT statement. But SQL Server is quite restrictive with this type of index. You cannot use a filtered index on a clustered index; it is only allowed for nonclustered indexes, because otherwise it makes no sense: if you create a clustered index, the entire table has to be reorganized and ordered, so it cannot work for only a subset of the data. And as well, you cannot create a filtered index on a column store; it is only allowed if you are using the row store. But you can combine the unique index together with the filtered index; there is no restriction there. So it would look like this: CREATE UNIQUE NONCLUSTERED INDEX on the table, and then you specify the WHERE condition. So this is the syntax of the filtered index, and we have these restrictions. All right. So now let's say we have the following query, where we select data from customers, but in our program or in our report we always select only the customers from USA. So we have the condition WHERE Country = 'USA', and execute. This is the basis of many queries that we have in our project, and we are always filtering the customers by country: in one query we find maybe the top customers, in another query the average of scores, and so on, but we are always filtering the data like this, WHERE Country = 'USA'. Now, since we use this column a lot and our table may grow to millions of records, we can create a nonclustered index on this column. The usual way: we say CREATE NONCLUSTERED INDEX, call it idx_Customers_Country, on the table Sales.Customers, and select the column Country. If you do it like this, SQL creates a nonclustered index for all customers, not only those from USA, but everything, even the customers from Germany, which is not really necessary, because in our project we only focus on the customers from USA. So instead of that, we can include the WHERE condition inside our index definition. It's very simple: we say WHERE Country = 'USA', exactly like in our query. Now the index that gets built will be focused and targeted on only the subset of data that fulfills this condition. So let's create our filtered index, and it works. Let's check our indexes on the customers: go to the indexes over here and refresh. Now we can see our index. It says it is not unique, because we didn't specify anything at the start, so duplicates are of course allowed, which is what we defined here. And as well, it is filtered: it doesn't contain all the rows of your table, only the rows that fulfill our condition. That means if I execute this query now, the index is going to be used, because the rows of this query are included in the index. But if I go over here, say Germany, and execute the query, it's going to be slower, because all those rows in the query are not part of our index, so this index cannot be used at all to improve the query. This is how we work with the filtered index in SQL.
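A minimal sketch of this demo:

-- Only rows that fulfill the WHERE condition are included in the index:
CREATE NONCLUSTERED INDEX idx_Customers_Country
ON Sales.Customers (Country)
WHERE Country = 'USA';

-- Can be answered through the filtered index:
SELECT * FROM Sales.Customers WHERE Country = 'USA';

-- Cannot use it: these rows are not in the index at all
SELECT * FROM Sales.Customers WHERE Country = 'Germany';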
All right. So now let's summarize and talk quickly about how to pick the right index: when to use which type. Let's start with the first one, the heap structure. As we learned, it is a table without any index. So in which scenario do we skip indexes entirely? In case you want to have fast inserts. If you want fast write performance, don't take any index: stay with the default, the heap structure of your table. We usually use this for less important tables like staging tables or temporary tables, where we want to insert the data fast and get rid of it later; there is no need for any index there. Now, if we are talking about the clustered index: we usually use the clustered index for primary keys. It is even the default behavior of the database: if you create any primary key, SQL creates a clustered index. So this is the main usage of the clustered index, the primary keys. And if there is no primary key in your table, you can pick another column where sorting the data is important, for example a date column; that could be a good candidate for your clustered index. Now moving on to another type, we have the column store index (when I said clustered index here, I meant the clustered row store index, of course). So when do we use the column store index? If you have big, complex analytical queries where you are aggregating a lot of data, go for the column store index, because it's going to give you amazing performance. And as well, if you are struggling with the size of your tables: if you have a super large table, you can use the column store index, because it compresses the data and reduces the size of the whole table. For those scenarios we use the column store index. So again: the row store clustered index is usually for OLTP systems, where you have a lot of transactions and so on, and the column store is usually for OLAP systems, where you have a data warehouse, reporting systems, business intelligence and so on. Now moving on to another type, we have the nonclustered index. We usually use this index for non-primary-key columns; the rest of the columns of your table could be candidates for a nonclustered index. There are a lot of reasons to do that: for example for the foreign keys, for the columns that are used to join two tables, and another place where you can use the nonclustered index is the columns that are used in the WHERE clause. So there are many scenarios where we can use the nonclustered index, just not for the primary keys. Now moving on to another type, we have the filtered index. We use it to target a subset of the data. So if our queries and analyses focus on a subset of the data all the time, it makes no sense to have one big index over all the data; we can use the filtered index to have a focused index. And of course, if the size of the index is a problem, then you can use a filtered index to reduce the overall storage of the index. And then to the last type, the unique index: you use the unique index to ensure the data integrity of your table, and it might also slightly improve the performance of your queries, because SQL has less work to do if the index is unique; once SQL finds a match, it stops searching. So this is a quick summary and guide on when to use which index type; it usually helps me find the right index. All right, friends. So now let's say you have created your indexes in your database, your queries are optimized and you have fast performance. But the job is not done yet.
No, God, no. God, please, no. No, no, no. Because over time the indexes get fragmented, outdated and unused, and this leads to poor performance in your queries; it also increases the storage costs, and the overall performance of your database drops. Indexes are like having a car: they need maintenance. You have to change the oil and the tires of the car, and the same goes for indexes: you have to maintain them, they need attention to keep everything running smoothly. So now I'm going to show you how I manage, maintain and monitor the indexes in my SQL projects. So let's go. The first and most important task is to monitor the usage of your indexes. Of course, the first question we have to ask ourselves over time: are we really using the indexes that we created? Are they really helping to improve the speed of my queries, or was it just a good idea at the start of the project, and later nobody used those indexes? This is very crucial, because if you have an unused index, you are consuming unnecessary storage space, and as well the write performance on the table is slowed down, which is completely unnecessary if you are not using the index. So now our task is to find out the usage of each index that you have in the project. Let's see how we can do that. Previously we created multiple indexes on the table DBCustomers. If you go to DBCustomers and to the indexes, you can see that we have four indexes. Now we can show this information by using a special stored procedure in SQL Server called sp_helpindex. Let's do that. sp_helpindex is a system stored procedure that comes with the database, and it needs only one value: the table name, so we have it over here, Sales.DBCustomers. Let's run it. So we have our four indexes, and then a nice description of each index: it says whether it is nonclustered and whether it is column store. And it says where it is located: "located on PRIMARY". PRIMARY is the name of the file group where the data is stored, and by default it is stored on PRIMARY. The next information is the index keys. It is nice information to understand which keys, or which columns, are used for the index. In the first one you can see we have two columns, which means it is a composite index; of course, for the column store we don't have any key columns; and then we have the first name and the last name. So this is a really nice, quick stored procedure to see information about our indexes. Okay, so now let's focus on our task: how to monitor the usage of the indexes. In databases we have a lot of schemas and tables that record the metadata of our database, and in SQL Server we have a special schema called sys, where you can find a lot of metadata about the SQL Server: metadata like the descriptions of the tables, views, columns and as well the indexes. So now let's check what we can find inside the view sys.indexes. So let's do it: SELECT * FROM sys.indexes, and execute. We get a huge list of all the indexes we have, with a lot of information for each index. We don't have to understand each column now; I'm going to pick out only the most important information from this view.
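As a quick sketch, the two starting points from this step:

-- Quick overview of the indexes on one table:
EXEC sp_helpindex 'Sales.DBCustomers';

-- Raw metadata about every index in the database:
SELECT * FROM sys.indexes;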
So what do we need? The object_id, which is the table ID, and the name, which is the index name. Then we have a nice piece of information on whether it is clustered or nonclustered, so let's select type_desc and call it index_type. We can also check whether it is a primary key or not, so let's get is_primary_key as well; and whether it is unique is also nice information to have, so is_unique. Of course, you can grab a lot of stuff; it really depends on what you are monitoring. For example, I'm also going to check whether the index is disabled or not, so is_disabled, and I'll just rename it. With that I have a focused monitoring query; I don't need all the other information. So let's execute. But now I would like to change a few things: for example, I don't want the object ID, I would like the full name of the table, and there are also a lot of indexes that are irrelevant for my database. In order to do that, we have to get the information from another metadata view. So let's call our view idx and join it with another metadata view called sys.tables, aliased tbl, joining on idx.object_id equal to tbl.object_id. And if you would like to see the content of this view, you can query it separately: SELECT * FROM sys.tables. You can see it has the name, which is the table name, and I think that's all we need; there is a lot of other information about the table, but I just need the table name. So let's add it at the start, tbl.name AS table_name, and I don't need the object ID anymore. Of course, we have to use the alias for each of the index columns, to make clear that this information comes from the index view. So let's do that. All right, my query is ready; let's execute it again. Now, as you can see, we are getting the table name, and the list is much shorter, because it focuses only on the tables that you have in the database; this filtering happens because of the inner join. One more thing: I would like to sort the data, so I'm going to say ORDER BY the table name and then the index name. All right, now let's check for example the table Customers. You can see that we have two nonclustered indexes, and one of them is a column store index; those two we created in the previous tutorial. And we have an index on the primary key, as you can see here with is_primary_key equal to 1, and this one is unique as well. With that we have a really nice list of all the indexes in our database. But we are not there yet, because our task is to monitor the usage of the indexes. Now, in order to get the usage of each of those indexes, we have to go to a special view, a dynamic management view, where SQL Server provides a lot of statistics about the usage of each index. We can find it in the same schema, so let's query it: SELECT * FROM sys.dm_db_index_usage_stats. Let's explore this view and execute it. In these statistics we can find the usage of two indexes, index number three and index number one, and we can see there are three usage metrics for index number one: user_seeks, user_scans and user_lookups.
So this is how many times the index was used for seeks, scans or lookups; we will understand these terms when we learn about the execution plan. And here we have a very nice piece of information about how many times our index got updated. As you can see, it is zero here, because I didn't add any new data after creating the index. Of course, all those numbers might be different on your side, depending on how many queries you have been running while practicing. You can also find here when exactly the last usage of those indexes was, and many, many other nice pieces of information. So now let's integrate this view with our query. What I'm going to do is a LEFT JOIN, because if I do an INNER JOIN, I will only find the used indexes, and I don't want that: I want to see the full picture of all my indexes in the database. So LEFT JOIN, and we get our view and call it s, and then we have to join it on the keys: s.object_id equal to the index object_id, and of course we have to join on the index ID as well, so s.index_id equal to idx.index_id, like this. Now let's select a few columns from this view: all those usage counts, so s.user_seeks, s.user_scans, s.user_lookups, and maybe as well s.user_updates. And it is really nice information to understand when it was last used, so last_user_seek and last_user_scan; let me just correct it over here. Actually, I can put those two dates into one column, because when we have a value in one of them, the other is NULL, and vice versa. We can do that using the NULL function COALESCE, and we can call the whole thing, say, last_used. And maybe I'll rename all those columns nicely. All right, so now we are done; let's execute it. Okay, let's check our new report over here. Let's start with the first table, for example the Customers, and go to the right side. We can see that we have three indexes, and among them there is one index that is not used at all: the nonclustered index on the country is not being used, and that's because we have another index covering the country that comes from the column store. It can happen like this: you are querying the table using the country, but SQL says it would rather use that other index instead of the first one. So we can say: okay, this one is not really useful, maybe we can drop it, right? And for the rest you can see: this column store index is used twice, and the next one once (again, the numbers on your side might be different). And if we look at all the other tables, we see a lot of NULLs; that means of all those indexes that we created on DBCustomers, let me check, only one is used. But now you might say: you know what, I have used the index, so why am I not seeing any numbers about it? Well, that's because those numbers do not live forever, and we are now using the Express edition locally on our PC: each time you shut down your PC and close the client, the database shuts down as well, and those statistics are lost, because they live in memory.
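Putting all of that together, here is a sketch of the full monitoring query as built in this step (the column aliases are my own naming):

SELECT
    tbl.name      AS table_name,
    idx.name      AS index_name,
    idx.type_desc AS index_type,
    idx.is_primary_key,
    idx.is_unique,
    idx.is_disabled,
    s.user_seeks,
    s.user_scans,
    s.user_lookups,
    s.user_updates,
    COALESCE(s.last_user_seek, s.last_user_scan) AS last_used
FROM sys.indexes idx
INNER JOIN sys.tables tbl
    ON idx.object_id = tbl.object_id
LEFT JOIN sys.dm_db_index_usage_stats s
    ON  s.object_id  = idx.object_id
    AND s.index_id   = idx.index_id
    AND s.database_id = DB_ID()   -- scope the stats to the current database
ORDER BY tbl.name, idx.name;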
In real projects, though, the numbers are going to be totally different from here, and of course you're going to get realistic numbers. Now let's try to target one of those unused indexes. For example, let's go with this one: the nonclustered index on the product. Currently it is completely unused. So if I go and select from it: SELECT * FROM Sales.Products WHERE Product = 'Caps'. With that we should have used the index, I think. Let's go back, run our monitoring query again, and check whether the index got used. It is correct: our query did use this index, and we can see here it was used once. And now you can go and analyze, in your project, all the indexes you have on your tables, and see whether your queries really use them or not. If an index is not used, of course you have to make a decision about it. If you are working in a team, ask around: who created it, and why? Maybe there is one task in the database that is not frequently used, something that runs like once a month, so the index is needed, just not that frequently. But now we have insights about what is going on with those indexes and whether we need them or not. And if you don't need them, go and drop them. All right, my friends. So here is the secret that 90% of SQL developers don't know, and that is going to make you the hero of the project within one minute. Once I join a project, right after saying hello to everyone, I open the database of the project and run one query: I check the usage of the indexes of the project. And I can tell you, after working 15 years with SQL, that 90% of the indexes created in projects are totally untouched and unused. So I collect all the unused indexes and discuss them with the team, and if we don't find a real usage for those indexes, we drop them. After dropping all those unused indexes, you have done two great things for the project. First, you have saved a lot of storage in the database. And second, which is way more important, you have improved and optimized the write performance of the database. So on your first day, with one query, you have optimized the performance of the database, you have saved storage, and you're going to shine like an expert in your project. So if you haven't done that, do it now. All right. Now moving on to the next one. As we learned, identifying an unused index is an important task, but on the other hand, identifying a missing index is just as important to improve the performance of your queries. In SQL Server, you can get recommendations from the database itself about missing indexes for your queries. So let's see where we can find those recommendations. All right. Now let's say you are running multiple queries, doing analysis and so on. For example, I have this query over here: it is a query on the database AdventureWorksDW, and I'm just joining two tables, the fact with the dimension, and then filtering the data based on the color as well as on the date key, where I have a range over here. Once I execute it, I get the results; it could be any query that you are running while practicing and analyzing. Now, if you have a slow query, you can check the recommendations from the database about missing indexes. In order to do that, we can again check the metadata from the database system to see the recommendations about missing indexes. So let's go and do that.
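A sketch of the lookup (the view and its columns are the SQL Server ones; the selection is my own):

-- Missing index suggestions collected by the engine (lost on a server restart):
SELECT
    statement AS table_name,   -- the table the suggestion is for
    equality_columns,
    inequality_columns,
    included_columns
FROM sys.dm_db_missing_index_details;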
So we target the dynamic management view sys.dm_db_missing_index_details and explore its content over here. And don't forget: this information lives in the cache of the server, and if there is a restart or something like that on the server, you will lose it all. So now, from my query there are a few suggestions and recommendations from the database. Let's go and check them. We can see four recommendations about missing indexes from the database. Let's check the first recommendation over here. You can get the table name from the object ID, or you can find it here in the statement column. Here the database is suggesting an index for the table DimProduct, and it is recommending us to create an index on the column Color. And that's because, if you check our query, we have a filter here, a WHERE condition, saying the color equals black, and since we don't have an index on the color, SQL is simply suggesting an index for it; and of course, in this situation we could use a nonclustered index. After that, we have three recommendations for the same table, FactInternetSales. For example, here it is suggesting an index on OrderDateKey, because we are using it in the filter over here, and as well an index on ProductKey, since we are using it for the join. So this is a really nice report about missing indexes in the database, and it can help you find things you hadn't thought about. But here is my recommendation: evaluate this information very carefully. Don't go and create an index for every suggestion from the database; you still have to think about it. Is it really necessary? Do we really run this query very frequently? And so on. So don't blindly create indexes for each recommendation from the database. It is a really nice tool and assistant for building a good indexing strategy. This is how you find the recommendations for missing indexes in a SQL database. Okay. So now to the next step: we have to monitor the duplicates in our indexing. If you are working in a team with multiple developers, and you are working in parallel to optimize the performance of the queries, what might happen is that different developers create different indexes for the same column on the same table. Of course, this shouldn't happen if you have a clean and solid review process in the project, but we are human, and these things can happen. That's why you have to monitor whether there are duplicates. So the mission is to find out whether there is a column that is involved in multiple indexes. Let's see how we can monitor that in SQL. Okay, so now, it's very simple to find duplicate indexes inside your database. We learned before that we can find the list of all indexes in the view sys.indexes in the system schema, and we join it with sys.tables in order to get the table name. Then we have another view to find the columns that are involved in each index: that information is in sys.index_columns. And in order to get the full name of the columns, we join it with the sys.columns view. So it's very simple and makes sense. Let's go and execute the whole query.
Now, as you can see, the result is sorted by the table name and the column name, which makes it easier to find the duplicates. So let's check the first table: the country is part of this index, the nonclustered column store, and the country is also involved in another index, the customers country, which is a row store nonclustered index. This is of course a bad thing; we have to decide now: do we want it as a column store or as a row store? And if we check this table as well, we find the first name in two different indexes, the same story. That's because we were practicing and creating those indexes, and that's it. But now, if you have a large schema and a lot of indexes, I would add a flag in order to understand whether we have a duplicate or not, by calculating the number of rows per unique table name and column name. We can do that very easily using the window functions. So let's add a new row, and we're going to use the function COUNT, since we want to find the number of rows, then OVER, then PARTITION BY the table name and as well the column name. Our expectation for this column is one; if we have more than one, then there is an issue, and that means the column is inside two different indexes. And now let's sort by this flag, descending, and execute. Now we have a nice flag where we can see how many index entries we have for a specific column in a table. If it's one, like for these columns, they are fine: those columns are involved in only one index. But for the first four rows we have an issue, because we count two here; that means we have two indexes on the same column. So as you can see, the query is very simple, and with that we have a nice report about the duplicate indexes inside our database.
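Here is a sketch of the duplicate report as described (the joins follow the narration; the aliases are my own):

SELECT
    tbl.name      AS table_name,
    col.name      AS column_name,
    idx.name      AS index_name,
    idx.type_desc AS index_type,
    COUNT(*) OVER (PARTITION BY tbl.name, col.name) AS index_count  -- expect 1
FROM sys.indexes idx
INNER JOIN sys.tables tbl
    ON idx.object_id = tbl.object_id
INNER JOIN sys.index_columns ic
    ON  ic.object_id = idx.object_id
    AND ic.index_id  = idx.index_id
INNER JOIN sys.columns col
    ON  col.object_id = ic.object_id
    AND col.column_id = ic.column_id
ORDER BY index_count DESC, table_name, column_name;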
How many rows do we have? Are the values unique? What does the data distribution look like? And based on those statistics and numbers, the database can make a good decision about which method to use to load the data; for example, here the index scan is the best way to read our table. This is exactly why the database needs statistics: to make the correct decision and use the correct index. Now you might ask: okay, this is internal to the database, why do we have to care about it? Well, there is an issue. Say our table has 50 rows, and the next day you insert around 1 million rows into it. What can happen is that the statistics for this table don't get updated and still say we have only 50 rows; the statistics are now outdated. The big issue is that once you query this table, the SQL engine knows nothing about the million rows you inserted, because it asks the statistics, gets the answer "only 50 rows", and concludes this is a very small table, so maybe it skips an index. The database makes wrong decisions because the statistics are outdated. Your task is to monitor those statistics and keep updating them. Let's see how we can do that. Okay, the first thing to do is find out whether our statistics are up to date or outdated. For that we again access the metadata of our database: there are views and dynamic management functions in the system schema with a lot of detail about statistics. To monitor the statistics, I have prepared a query like this. I'm using the view sys.stats, which lists all statistics in the database with their names; I join it with sys.tables to get the table name; and, very importantly, I use the dynamic management function sys.dm_db_stats_properties, which gives us the last update time, the number of rows, and the number of modifications. Let's go and query it. Here we can see the table name, the statistics name, and, very importantly, when the statistics were last updated. Let's check our table DBCustomers. We can see the statistics name, and what matters most is the last update; that tells us how old the statistics are, for me around 4 days. Then we have the total number of rows in the table, and, very importantly, the number of modifications that have been made on the table since then. So after the statistics were updated on the 19th of October, around 15 rows got modified. A modification can be an insert, an update, or a delete; any such operation on the table counts. So there were quite a few modifications, which means these statistics should be updated. For the Customers table, on the other hand, the statistics are up to date: we have zero modifications, so there is no need to update them. This is how you check the statistics information inside your database in order to decide: should I update the statistics or not?
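A sketch of the monitoring query described above; the column aliases are mine, and the commented UPDATE STATISTICS commands preview what we do next:

SELECT
    t.name                  AS TableName,
    s.name                  AS StatisticsName,
    sp.last_updated         AS LastUpdate,
    sp.rows                 AS TotalRows,
    sp.modification_counter AS Modifications
FROM sys.stats AS s
JOIN sys.tables AS t
    ON s.object_id = t.object_id
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
ORDER BY sp.modification_counter DESC;

-- then, to refresh (demonstrated next):
-- UPDATE STATISTICS Sales.DBCustomers;   -- one table (optionally one statistics name)
-- EXEC sp_updatestats;                   -- the whole database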
Now let's say I want to update the statistics of our table DBCustomers. As you can see, we have multiple statistics here: one on the table itself and one for each index on the table. So let's say I want to update only one of them, not everything on this table. It's very simple: UPDATE STATISTICS, then the table name, Sales.DBCustomers, and then the name of the statistics. Let's grab it from over here and execute. That was very fast. Let's re-execute our monitoring query and check the data. It was exactly this one, and as you can see it just got updated: the number of rows is five and the number of modifications is zero. We now have up-to-date statistics for this table. But say I want to update the rest without doing it one by one. We can copy the same command and simply not specify any statistics name: UPDATE STATISTICS with only the table name. Execute it, and it updates all the statistics that belong to that table. Check the query again: everything disappeared, and DBCustomers is completely up to date with no modification problem. So this is how you update one table, and you can do the same for the rest. There is one more option, where you update the statistics of the whole database, but beware: this can take a really long time. We do it by executing a special stored procedure: EXEC sp_updatestats, this one here. Let's do that. And now it is done, with a pretty long log. It was fast because we don't have a big database; it is a very small one, not comparable to any real database. You can see that SQL Server goes through everything in the database and tries to update the statistics. In many cases it won't be necessary, because there is nothing to update, no modifications, and so on; the database is smart enough to say it's not required and skips it. How I usually do it in my projects: I have a job on the weekend that updates the statistics of the whole database, so I make sure all my tables and indexes have up-to-date statistics. If you have a small database you can run this every day, but if it takes a long time, schedule it on the weekend. And if I know that on some day a lot of new data will arrive, for example during a data migration, I update the statistics right after the migration is done, just to make sure we have up-to-date statistics. So this is how we monitor and update the statistics of the database. Okay, now moving on to the final task that I usually do to monitor and manage indexes: monitoring index fragmentation. Over time, as data is inserted, updated, and deleted in your tables, indexes can become fragmented. So what is fragmentation?
It means there are unused spaces in your data pages that the database is not filling, or your data is no longer sorted correctly in the index, and this of course leads to inefficient use of storage and slows down your queries. In SQL Server we have two methods to get everything organized again. The first method is REORGANIZE: it defragments the leaf level of the index so it is organized and sorted in logical order again. It is a very lightweight operation and will not block users from using your table. The second method is REBUILD, and this is a heavyweight operation: it drops the whole index and recreates it from scratch. That means not only does the data get sorted again, but the fragmentation inside the data pages of the index is eliminated as well. So let's see how we can do that in SQL. Okay, back to our database, and the first question you have to ask: do we have a fragmentation issue in our indexes at all? We have to check the health of our indexes, and to do that, we again go to the system metadata and use the dynamic management functions; there is a special function in SQL Server for exactly this. Let's do it: SELECT * FROM the function, which is sys.dm_db_index_physical_stats. This function takes a few parameters; we won't go into the details, just follow along: we give it DB_ID(), then NULL, another NULL, a third NULL, and the last one is 'LIMITED'. Let's query it. What do we find? The object ID, the index ID, and a few other columns, but the most important one is avg_fragmentation_in_percent. This column gives us the degree of fragmentation of each index. If it is zero, perfect: we have no fragmentation and our index is very healthy. But if it is something like 100, the index is completely out of order and we have to do something about it. Now you might say: I don't know which object and which index these IDs refer to. Well, you have to join a few tables, sys.tables and sys.indexes, to get those names, like we did in the first query. So, offline, I have done that: I joined with the tables and the indexes, and I'm sorting by avg_fragmentation_in_percent descending so the problems show up first, because we are interested in the high percentages. Let's execute it. Since this is a practice database and I didn't insert much data, the numbers are low, but in real projects you will see different values. And here are my recommendations about the percentage: if the fragmentation is between 0 and 10, everything is okay and you don't have to do anything. If it is between 10 and 30, you should act: here I recommend the REORGANIZE method to sort the data correctly again. And if it is more than 30%, my recommendation is to REBUILD the whole index, because not only is the data in the wrong order, there are also unused spaces in the data pages of the index. So you have to do something about it.
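A sketch of the fragmentation report just described, plus the two repair commands that follow; the index and table names in the ALTER INDEX lines are placeholders:

SELECT
    t.name AS TableName,
    i.name AS IndexName,
    ps.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ps
JOIN sys.tables  AS t ON ps.object_id = t.object_id
JOIN sys.indexes AS i ON ps.object_id = i.object_id AND ps.index_id = i.index_id
ORDER BY ps.avg_fragmentation_in_percent DESC;

-- < 10%  : leave the index alone
-- 10-30% : ALTER INDEX idx_customers_name ON Sales.Customers REORGANIZE;  -- lightweight, non-blocking
-- > 30%  : ALTER INDEX idx_customers_name ON Sales.Customers REBUILD;     -- drops and recreates the index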
Now let's imagine one of those indexes, for example this one, has fragmentation of 15%. What we have to do is reorganize it. Let's see how. We write: ALTER INDEX, then the index name, so let's grab it from here, then ON and the table name where the index exists, Sales.Customers. Now we are editing the index and we have to tell SQL Server what to do: we just want to reorganize it, so we use the keyword REORGANIZE, and that's it. Very simple. Let's run it, and as you can see it completed very fast, because we have a small database; it can take a little more time for a big index on a big table. After reorganizing you can check the report again, and the fragmentation should be down to zero. Now say we have another index where the fragmentation is around 50%. Let's copy the statement, and this time, instead of REORGANIZE, we say REBUILD, on the same table. Execute it, and with that SQL Server dropped the whole index and created it from scratch. This usually takes more time than a reorganize, of course. And the next step is again to check the fragmentation and so on. So that's all about keeping your indexes healthy and removing the fragmentation from them. All right, my friends. As you can see, improving the performance of your queries doesn't end once you create the indexes; it's all about staying proactive. Monitor the usage of your indexes, check whether there are any missing indexes, always make sure the statistics of the database are up to date, and keep your eyes on fragmentation to keep your indexes healthy. With that, you have learned how I manage and monitor indexes once I create them, and I really recommend you follow those steps. All right, friends. Now let's say you have a large, complex analytical SQL query involving a lot of joins and aggregations, but it is slow, and of course you want to optimize its performance, maybe using indexes. The big question is: where exactly do I build this index, on which table, on which columns? That means you have to understand where exactly the problem is. Is it in joining tables, in sorting data, or in the aggregations? To answer all those questions, we have something called the execution plan. So what is that? The execution plan shows you exactly how the database processes your query, step by step, and this is what we need: it shows us where exactly we have a performance issue. In other words, the execution plan is your window into how the SQL database thinks, and once you understand that, you can make the right decision about building an index. So let's understand exactly what this means. Okay, now imagine you run a query that selects from one table and joins the data with another table. Once you execute this query, the database engine will not immediately start fetching data from the disk; first, SQL Server has to make a plan.
It's like planning a trip, where you check Google Maps to find the best route to your destination; the execution plan is exactly the same idea. The database first has to plan how to execute your query, and it builds this plan step by step based on your query and on the statistics. The first step, for example, is how to get the data from the tables, and there are multiple options, like an index scan or a full table scan. After that it needs to decide which type of join to perform, a hash join or a loop join, and at the end of the plan comes the select statement. Once the execution plan is ready, the database engine starts implementing the steps: it reads your tables from the disk, then joins them, then selects the columns, and finally sends the results to the end user. And once everything is done, the database engine does one more thing: it takes this execution plan and stores it in the cache, because it can reuse the plan if a similar query comes in. For example, if you execute the same query again, the engine recognizes it: ah, this is the same query, we have already built an execution plan for it. It checks the cache, and it is way faster to fetch the plan from there than to build it again. In this scenario the engine doesn't have to make any decisions; it just gets the plan from the cache and immediately starts executing it. And of course, the database engine does not hide the execution plan from the users. You can inspect it: you can see how the database loaded the data, how the tables were joined, and so on, and then make an informed decision about how to optimize your query, maybe by adding indexes. So let's go back to SQL Server and see how we can do that. Okay, we're going to work with the database AdventureWorksDW2022, and we'll focus on the fact table FactResellerSales. Let's check the type of this table: if you expand it and go to Indexes, you can see we have an index on the primary key, a clustered rowstore index, which means the data is structured in a B-tree. Now what we're going to do is create a mirror of this table, but without any indexes. Very simple: SELECT * FROM FactResellerSales, and we insert it INTO a new table, which I'll call FactResellerSales_HP, HP for heap. Execute it, and you can see we inserted around 60,000 rows into the new table. Refresh the tables to find the new one, FactResellerSales_HP, and if you check its indexes you won't find any; that means it is a heap table. Now let's run a very simple query on top of our new table: SELECT * FROM FactResellerSales_HP. Execute it, and we get the results. Now I would like to see the execution plan of this query. To see it, we go to the toolbar, where we have three options.
The first one says Display Estimated Execution Plan, the second says Include Actual Execution Plan, and the third says Include Live Query Statistics. So what are the differences between them? Let's start with the first one, the estimated execution plan. Here, SQL Server guesses the execution plan without executing the query; it is only a guess, an estimation. The second one is the actual plan: it shows you the execution plan that was really used to process your query, so after executing the query, SQL Server shows you which plan it used. That means the estimated plan is something you get before executing your query, and the actual plan is something you get after. The third one works while the query is executing: you get a real-time view of your query and can watch how the execution plan is working. Now let's try them. Activate the estimated execution plan, and we get a new output with a few boxes: this is the estimated plan, produced without executing the query. If you now switch to the actual execution plan, nothing happens yet, because first you have to execute the query. Once we execute, we get the results, the messages, and a new tab called Execution Plan: here you find the real plan that was used to process your query. Let's also try the third one and execute: it was pretty fast because the query is very simple, but we can see how the data and the plan flow during execution. This is the live execution plan, and at the end we are left with the actual plan. So those are the differences. Now you might ask: why do we have both estimated and actual execution plans? Well, it is a really nice tool to check whether everything is healthy in your database: if the guess differs from the actual execution plan, that is an indicator that something is wrong with the statistics or the indexes. If the estimated and actual plans match, everything looks good. From now on we'll focus on only one of these: the actual execution plan. Next, we open two queries side by side, one against the clustered index and one against the heap structure, so it's a one-to-one comparison. Query both of them and let's try to read the execution plan, but make sure you have activated the actual execution plan. We now have two plans. We start with the heap table, which has no indexes. How do we read this execution plan? The plan here is very simple, because the query is very simple, but we always read it from right to left. The first operation is the Table Scan, and then a very small arrow leads to the next one, the Select. So: from right to left. The first operator tells us how the data is read from the table, and there are different types of scans; one of them is the table scan. A Table Scan scans the entire table: it reads all the rows in your table in order to execute this query.
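As a side note not shown in the video, the SSMS toolbar buttons have T-SQL equivalents, in case you prefer scripting; a small sketch, assuming the heap copy created above:

-- estimated plan: the query is compiled but NOT executed
SET SHOWPLAN_XML ON;
GO
SELECT * FROM dbo.FactResellerSales_HP;
GO
SET SHOWPLAN_XML OFF;
GO

-- actual plan: the query runs, and the plan comes back with the results
SET STATISTICS XML ON;
GO
SELECT * FROM dbo.FactResellerSales_HP;
GO
SET STATISTICS XML OFF;
GO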
Now, if you hover the mouse over the Table Scan, you'll find a lot of details about what happens while the table is scanned. But the tooltip is a little bit annoying; better than that, right-click on the operator and go to Properties, and you get the same details on the right side, where they are easier to read. The first thing to check is the number of rows that have been read: we can see that we read all the rows in the table, which is not really good. We also have other important information about the resources and the cost: the CPU cost and the I/O cost. And what is interesting is the logical operator, Table Scan, plus some nice information about the storage: it says it is row store. Now let's check the execution plan of the other table, the one with a clustered index. Go to its execution plan, and you can see something else on the right side: not a Table Scan, but something called a Clustered Index Scan. This operator scans either the entire index or only a range, a part of the index, and in the details we can see which it was. If you check the number of rows, again the whole index was read to get these results; we have the total number of rows of our table, and the logical operation is Clustered Index Scan, not Table Scan. Of course, we have to check the CPU and I/O costs to see whether we are consuming the same effort. Here we have something like 0.07, and comparing the two, we didn't gain a lot from having an index on this table. That is of course logical: this query is not using any index, it is just selecting everything from the whole table. So now let's extend the queries and sort the data by the primary key, SalesOrderNumber, on both the heap table and the clustered table. Execute and check both execution plans. First the heap structure: as you can see, there are two steps. First it scans the whole table, and then there is a Sort operator that sorts all the data before presenting the output. At the end we have the Select, which is not really important. So the heap plan has two operators. But if you look at the clustered index plan, there is no Sort step, and that's because the clustered index is already sorted: SQL Server doesn't have to sort the data again. The data is already in order. This is the first win you get from an index: everything is already sorted, and if you have an ORDER BY on that column, SQL Server doesn't have to sort during the query. Now, if you compare the costs, the clustered index still has the same CPU and I/O cost as before, but in the heap structure without any index we have roughly double: the table scan costs the exact same amount of CPU and I/O as the clustered index scan, but on top of it there is a high cost for sorting the data. We are consuming more CPU and more I/O, and if you add those costs up, this query is of course going to be slower than the one using the clustered index.
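For reference, the two queries being compared could look like this as a sketch (table names follow the demo):

-- heap: expect Table Scan plus a separate Sort operator in the actual plan
SELECT * FROM dbo.FactResellerSales_HP ORDER BY SalesOrderNumber;

-- clustered index: expect Clustered Index Scan and no Sort operator,
-- because the B-tree already stores the rows ordered by the key
SELECT * FROM dbo.FactResellerSales ORDER BY SalesOrderNumber;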
So, in the execution plan you can see exactly the benefit of your index. One more thing about this plan: if you go to Object and expand it, you can see the name of the index that was used for your query. It says the index is the PK index for the primary key, with the full name. And if you go to the table on the left side and check its indexes, it is exactly this index. So in the execution plan you can also find which index was used for your query, and this is very important to check: if you create a new index, run your query and check whether the database is actually using your newly created index. If not, you are making the wrong decisions about your index. So every time you create a new index, make sure to check in the execution plan that the database is using it. Okay, let's keep going. Now, instead of using the primary key, I'm going to filter the data on one of the other columns of this table. Let me check the results and take, for example, CarrierTrackingNumber, and pick a value, the first one here. Do the same for the heap table and execute. In the heap's execution plan we still have a Table Scan, and on the other table we see the plan with the clustered index. Now let's say I want to create a nonclustered index for this column. So: CREATE NONCLUSTERED INDEX, and I'll call it idx_FactResellerSales_CarrierTrackingNumber, ON our table FactResellerSales, and the column is CarrierTrackingNumber; I'll take it from here and create it. Now let's see whether our query uses this index. Execute and go to the execution plan. Things look completely different than before. What's going on? We have something new: not a clustered index scan, but something called an Index Seek. An Index Seek is an amazing sign in your execution plan, because it tells us SQL Server found a way to use the index to locate exactly the data we need, without scanning a lot of rows. So now we know three ways of reading data: the Table Scan, where SQL Server scans the whole table, which happens with heap structures; the Index Scan, where we don't immediately know whether it scans the whole index or only part of it; and the Index Seek, where the database finds the data directly without scanning much at all. The worst type is the table scan, then comes the index scan, and the best is the index seek. If you check the details here, the number of rows read is only 12; this is amazing. Let's check the heap scan: go to its execution plan, and you can see we are reading around 60,000 rows in order to return 12. With the index we read only 12 rows to return 12, which is of course very fast, and the cost is very, very small: check the CPU and I/O numbers, they are close to nothing. And if you go to Object, you can see which index was used, and it is exactly the index we just created.
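A sketch of the index and query from this demo; the index name is mine, and the tracking number is just a sample value, so use any value present in your copy of the table:

CREATE NONCLUSTERED INDEX idx_FactResellerSales_CarrierTrackingNumber
ON dbo.FactResellerSales (CarrierTrackingNumber);

SELECT *
FROM dbo.FactResellerSales
WHERE CarrierTrackingNumber = '4911-403C-98';  -- should now produce an Index Seek

-- one common extension, not shown in the video: a covering index with INCLUDE
-- columns avoids the Key Lookup discussed next, at the price of a bigger index
-- CREATE NONCLUSTERED INDEX idx_FRS_Carrier_Covering
-- ON dbo.FactResellerSales (CarrierTrackingNumber) INCLUDE (SalesAmount, OrderDateKey);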
So it was a really good decision to create this index; SQL Server was very happy about it and used it to find our data quickly. Now let's check the rest of the plan. You can see we also have a Key Lookup. The Key Lookup is an operation needed to fetch the rest of the columns, because from this index we only get the data of one column, CarrierTrackingNumber. But since our query says SELECT *, we need a lot of columns, and those columns are not part of the index; the index knows nothing about the rest. That's why SQL Server has to go and search for the other columns, and of course it is called a lookup, not a scan, and that's also why we have only 12 rows here; from this step we get the remaining columns. The next step is that SQL Server joins these two pieces of information: from the first we have the CarrierTrackingNumber, and from the second the rest of the columns, so SQL Server has to merge them into one result. This operation is called Nested Loops. Behind the scenes there are different types of joins, not the ones we know like INNER and LEFT, but physical join types: the nested loop join, the merge join, and the hash join. The nested loop is very good for small inputs; if you have large tables, the merge and hash joins are way better. So if your index seek and lookups were returning a lot of data and SQL Server were using a nested loop, that would not be good, but for now it is okay, because we are getting only 12 rows and the operation is fast enough. One more thing we can see in the execution plan is the cost as a percentage. Checking this plan, the Select costs almost nothing, the cost of the nested loop is also around 0%, and then we have about 6% for the Index Seek, because it is pretty fast; the most expensive operation in our query is the Key Lookup, of course, because it fetches all the other columns. And if you compare with the heap structure: even though the heap's execution plan looks very small, that doesn't mean it is faster than our indexed plan. If you add up all the numbers, the indexed query is way, way faster than the heap. Now I'd like to show you one more thing: you can get rid of this Key Lookup if your query selects only CarrierTrackingNumber. Execute it and go to the execution plan: as you can see, there is no need for the lookup, because we have only one column and that data comes entirely from our index. So as you can see, it is interesting to understand how SQL Server works with your table and your index, and this is how you validate whether you are making correct decisions about your indexes. Okay, now let's add more to the query: aggregations, joins, and so on. Let's extend it. I'm going to join it with a dimension, for example DimProduct, and the join is going to be on the ProductKey: ProductKey equals ProductKey. After that we aggregate: we group by the product name, so I take EnglishProductName and call it ProductName.
And let's aggregate the sales: SUM of SalesAmount from the fact table, aliased as TotalSales. And of course we have to GROUP BY, not the French name, the English name, so let's group by the product name. That's it, execute it. Now we have a nice list of products and total sales. But let's check the execution plan, and oh my god, there is a lot of stuff. Let's start from the right side and read quickly from right to left. The first thing is that it gets the data from the fact table, using the clustered index. After that it does a Hash Match for the aggregation, and then it sorts the data, because it performs a merge join later. All those steps are preparing the fact table. Then we have another clustered index scan for the dimension, so it also reads the information from the dimension, which is a very small table, about 600 rows. The output of that clustered scan is also sorted, because, as we learned, the clustered index keeps the data sorted. So we end up with two sorted outputs, two data sets that are sorted, and SQL Server decided to go with a Merge Join, which is a good join for combining two sorted data sets; it is way faster than joining with a nested loop. Everything is fine, and at the end the data is sorted and presented in the output. Now, checking this plan, the most expensive thing happens at the fact table: 71% of the total cost is in that step. Let's say the query is slow and I want to optimize it. We learned that if you are doing aggregations on big tables, a columnstore index is a good idea; let's find out whether that is true. I'll go to our other sales table, the one with the heap structure, and convert this heap into a columnstore. So: CREATE CLUSTERED COLUMNSTORE INDEX, I'll call it idx_FactResellerSales_HP, ON our table, and we don't have to specify any columns. That's it, execute. Now our table is no longer a heap structure; it should be a columnstore, and if you check its information you can see it has a clustered columnstore index on it. Now let's run the same query and check whether we get better performance. Execute, and of course you have to activate the actual execution plan. Let's read from the right again. This is our fact table, and as you can see it already costs only 6%. Interesting. Let's compare what happened to our fact table. First of all, the physical operation is a Columnstore Index Scan, and if you go to Object, you can see that SQL Server did use the columnstore. That of course had to happen, because the whole data is stored only in that index; there is no way around it. But what is interesting is to compare the CPU costs: here it is a tiny value, and almost the same for the I/O cost.
Let's go to the previous plan, the one without a columnstore, and check our fact table: as you can see, reading the fact table there is way more expensive than with the columnstore, and we have reduced the I/O cost as well. We went from 71% of the total cost for the fact table down to only 6%, and the resources used to execute the query are far less than with a normal clustered rowstore. This is exactly the power of this index, the columnstore index: use it on big tables like fact tables, as we are doing in this query, and you will get amazing performance for this scenario. Of course, you can compare the execution plans by moving back and forth: if I click over here and switch to the other tab, I can quickly compare the numbers. But there is another way to compare execution plans: go to the execution plan, right-click it, choose Save Execution Plan As, and give it a name, for example query-columnstore. Save it, then go to the second query, the rowstore one, right-click its execution plan, and choose Compare Showplan. Once you click that, you select the saved plan you want to compare with, and now you have your query's plan on top and the saved plan at the bottom, along with a lot of information comparing both plans, so you can go into more detail to understand which plan is better. All right friends, as you can see, having the execution plan is amazing. We can see how SQL Server works behind the scenes and understand how it processes my query step by step: how many resources it consumes, whether my indexes are useful or useless. And I can experiment: I can add an index, then test and check whether I gained some performance or not, and I can compare multiple execution plans, before and after, until I get the right index on the right table and the right column. So execution plans are amazing for helping us understand whether our indexing strategy is correct or not. All right friends, so far we have learned that SQL Server makes its own decisions about how to execute your queries, and it makes those plans based on the statistics. But sometimes the plan you get from the database might not be the best one for your query, and there can be many reasons for that: maybe the statistics are not up to date, or you have a lot of indexes and the database engine gets confused. This is exactly where we need SQL hints. You can use SQL hints to command, to force, the SQL database to execute your query in a specific way; you can intervene and change the steps of the execution plan. So let's see how we can do that. All right, let's take a very simple query: we are just joining the table Orders with Customers and showing a few columns. If you execute it and check the execution plan, you can see that it uses the clustered indexes to read the data from Orders and Customers, and then a Nested Loops join to do the join.
Now let's say our tables are really big, but SQL Server still uses nested loops, which of course is not good for large tables; maybe it was confused by the indexes and statistics and decided on nested loops anyway. To force SQL Server to use another join type, we can give a hint in our query. Let's do that: at the end of the query we write OPTION, and inside it we say HASH JOIN. That's it: this is our query, and at the end we are giving the database a hint for the execution plan. Try it out and check the execution plan: as you can see, it is now using a different type of join. We are intervening in the execution plan and making the choices; we have changed the technicality of how SQL Server joins those two tables. All right, now let's change something else: for example, instead of an index scan, I would like an index seek. If you have the right index on your table, you can tell SQL Server how to read your data. Currently we have an Index Scan on the Customers table, so we go next to that table in the query and write WITH, and inside it FORCESEEK: we are forcing SQL Server to use an index seek. We can use these keywords next to a table to specify how SQL Server should load its data. If you don't specify anything, like here with Orders, there is no hint, which means we are relying on the execution plan that SQL Server generates; but if you don't want its recommendation, you can specify which method should be used. Now let's execute. We get an error, because SQL Server cannot process what we are asking for; I think the problem is that we are using the FORCESEEK hint together with the hash join hint. Let me comment that out and give it another try, and now it works. Check the execution plan: we got the Nested Loops again, and if you look at the Customers table, it is now using an Index Seek, not an Index Scan anymore. So again, we are intervening and forcing SQL Server to use the method that might be better for our query. Now, if you have a lot of indexes on one table and SQL Server is still not targeting the right one: if you check Object, you can see it targets a specific index, but if you have a better index than that, you can give SQL Server a hint to use a specific index. We do it like this: remove FORCESEEK and instead write INDEX with the index name inside the WITH clause, so let's take the primary key index again. Now I'm telling SQL Server: you must use this index to read the Customers table. Try it out, and in the execution plan you can see it targets exactly that index. So not only can you force a specific way of loading or joining; you can also force SQL Server to use a specific index that you created. All right friends, SQL hints are very powerful, but we have to be very careful with them, because I have really had bad experiences using them in my projects. I'll share my recommendations right after this quick recap of the syntax we just used.
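A consolidated sketch of the three hints just demonstrated; the table and index names (Sales.Orders, Sales.Customers, PK_Customers) are assumptions for illustration:

-- 1) force the physical join algorithm for the whole query
SELECT o.OrderID, c.FirstName
FROM Sales.Orders AS o
JOIN Sales.Customers AS c ON o.CustomerID = c.CustomerID
OPTION (HASH JOIN);

-- 2) force a seek on one table (in the demo, combining this with the
--    hash-join hint above produced an error, so use it on its own)
SELECT o.OrderID, c.FirstName
FROM Sales.Orders AS o
JOIN Sales.Customers AS c WITH (FORCESEEK)
  ON o.CustomerID = c.CustomerID;

-- 3) force a specific index on one table
SELECT o.OrderID, c.FirstName
FROM Sales.Orders AS o
JOIN Sales.Customers AS c WITH (INDEX(PK_Customers))
  ON o.CustomerID = c.CustomerID;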
So here is what can happen: you optimize performance in the development database, start using hints, and the speed is really good; but once you roll it out to another database, the production database, the hint no longer works correctly. The same hint that helped you might not improve performance there, and one reason is that the production database often has much larger data than the development database. So you really have to test the hint in each database you have: if your hint works in one environment, that doesn't mean it will work in the other. Always make sure to test. The second recommendation: don't use hints as a permanent fix for your queries. What does this mean? Let's say you are working on the project and one of your queries is very slow. If it's not clear why the execution plan is so bad, you can use a hint as a workaround to speed up your query again, but it remains a temporary workaround. You still have to invest time to analyze the root cause: maybe it is old statistics, or wrong indexing, and so on. So use hints only to work around and speed up your queries; don't use them as a permanent fix. So friends, SQL hints are really great for controlling the execution plan, but use them very carefully and only if there is an emergency. All right friends, now, for each SQL data project, we have to make sure we create clear guidance about the index strategy, and everyone in the team has to commit to it and follow it, so that each index created in the project fulfills a purpose. Because without a clear indexing strategy, I promise you, there will be a lot of redundancy, unused indexes, wasted storage, and the whole system of your project will be slow and bad. So now I'm going to show you the indexing strategy I usually follow in my projects, but I'll tell you right away: there is no single strategy that fits every project and every scenario. That's why the team of each project should brainstorm and build their own strategy. So let's have a look at my indexing strategy. And if I had to pick only one recommendation from this whole indexing tutorial, it would be this advice: avoid over-indexing. Over-indexing is the biggest mistake and trap many developers fall into: they think that adding more indexes sounds like speeding things up and making queries fast, but I have to tell you, it leads to exactly the opposite, and here's why. As we learned, each time you add new data to your table, your indexes have to get updated, sorted, rearranged. So with too many indexes, what happens? Your insert, update, and delete operations get slow, which means your database becomes slower, not faster. And one more very important reason why over-indexing is bad: you confuse the database while it creates the execution plan. As we learned, SQL Server has to create the best execution plan for your query, and if you have a lot of indexes in your database, the process of building that plan becomes complicated, which makes it harder for the database to choose the best path and index, and opens the door for bad execution plans.
And this slows down the query, because the database first has to create the execution plan before executing your query, so again a bad effect on performance. There is one more bad side: too many indexes make it harder for the database to decide which execution plan is best, and that can push SQL Server into choosing a really bad one. So over-indexing confuses plan creation and also makes queries slower. That's why I call this the golden rule, and you have to commit to it: just avoid over-indexing, because it is a double-edged sword, and you need the mindset of less is more. A few effective indexes are way better than a lot of indexes. Keep it in mind, and write it into your team's development guideline with a big statement: avoid over-indexing. This is the first statement in your indexing strategy. Now let's check the rest. All right, we can split the indexing strategy into four phases, and each phase has multiple steps. The first phase is creating an initial indexing strategy. Once you start a new SQL project, you have to define the objectives of the project very clearly: what are we focusing on, what do we want to achieve? And to define the goal of your indexing strategy, you have to understand your system. There are mainly two types of databases. On one hand we have OLAP databases, which stands for online analytical processing. The purpose of this type of database is data analytics, and an example is the data warehouse. In data warehousing we extract data from multiple sources, prepare and transform it, and put it into one big storage; we call this process an ETL process. On the front end there are reports and dashboards, where the data is summarized, aggregated, and presented to the end user, and users work with these reports to analyze the data and gain insights. Now, to generate those reports there is heavy reading on the data warehouse database: huge queries access the database to aggregate and prepare the data for visualization. On the other hand we have OLTP systems, online transactional processing, like e-commerce, finance, or banking, where at the back end there is a database storing the data, and at the front end there are applications for the end users. As users interact with the app, this causes write operations on the database, inserting new data or changing data, as well as read operations to show the data in the app. So we have both writes and reads. Now of course we have to ask ourselves: what is the goal, what do we want to achieve? There are mainly two strategies: either you want to improve the read performance or the write performance. If you look at an OLAP system, you really have to understand where the project struggles. Sometimes the ETL process itself is slow, and the ETL mainly writes data from the sources into the data warehouse; maybe you have a scenario where it takes 10 hours every day, and 10 hours is of course a problem, because you cannot wait that long for fresh data to reach the reports each day.
In that case you can make the goal of the project optimizing the write performance: you want to speed up the ETL. But actually, most of these projects have a different issue: the read operations on the database, because data warehouses normally hold really big data sets, and at the front end the reports generate large, complex queries against them. That means the read workload is going to be the pain point in most OLAP systems, so normally the big goal in an OLAP system is how to optimize the read performance. On the other hand, with OLTP we have a different nature of database and scenario. You will not get big queries from the apps; you will get many queries, many transactions, between the application and the database, a massive amount of read and write transactions: the whole time we are reading, writing, reading, writing, and so on. With OLAP we have something bigger and slower, because we usually run the ETL only once, meaning we write new data to the database only once, typically at night; but on transactional systems you have a lot of reads and writes all the time. Again, it depends on the project, but usually the main pain point in OLTP is the write workload. So it could be like this: if you are building an OLTP system, the main goal is to optimize the write performance. Now of course the question is: how are we going to optimize that? Again, we have to understand the nature of the database. In OLAP systems we usually have a data model with very big fact tables, and around each fact multiple dimensions connected to it. Those fact tables are really big tables in the database, and they are used every time a report is built: the reports constantly use the facts to prepare the data for the visualizations, and a lot of aggregation queries run against them. So now you have to answer the question: which type of index should we use in this scenario? Well, we have a perfect one, the columnstore index. So the best practice here, and you can make it a strategy for the whole project, is to make all fact tables columnstore indexes, because this is what we do in OLAP: we aggregate large data sets. But the data model and the scenario are completely different on the OLTP side. There we have a lot of tables of different sizes, with a lot of relationships between them: everything is connected through many primary key and foreign key relationships, and normally those tables are fully normalized, small pieces, whereas on the OLAP side we have denormalized tables as facts. So one strategy we can follow for indexing OLTP is to create a clustered index for each primary key of our tables. This of course improves a lot of things, like searching, sorting, and joining tables. But since we are focusing on optimizing write performance in OLTP, you have to be more sensitive about adding new indexes than in OLAP, because each index you add can be a reason the data is written very slowly. In OLTP you have to be way more careful adding indexes. So as you can see, you have to understand the nature of your project.
You have to understand what the main issue is. Once you understand your project, you can define a goal for optimizing the system, either reads or writes or maybe both, and with that you are making the initial strategy for indexing your system. All right. So with that we have an initial indexing strategy and a rough plan. In the next phase we have usage-pattern indexing, where we do a deep dive into our project. The first thing to do is identify the frequently used tables and columns. That means going through the queries used in your project to understand which tables appear in many of them. For example, here we have FactInternetSales: it is used in many, many queries in our scripts, so you develop a feeling for the most important, most frequently used tables. And not only that: you can also check how the data is filtered in those queries. For example, here we filter by OrderDateKey, and this kind of filter appears in multiple queries; we have a couple of queries that always do the same thing, filtering the data by dates. With that we recognize a pattern inside our project: this column is used mainly for filtering, and also for aggregating. So you do a deep dive to understand the most frequently used tables and columns inside your scripts. What I usually do is get help from AI tools like ChatGPT: I give it my code and ask questions about it, for example this prompt. It says: "Analyze the following SQL queries and generate a report on table and column usage statistics. For each table, provide the total number of times the table is used across all queries, and a breakdown for each column in the table showing the number of times each column appears. I would also like to see the primary usage of each column: filtering, joining, grouping, and so on." And in the output, as you can see, we get nice statistics about my scripts: the most used fact table is FactInternetSales, used 13 times in the project, and then we see statistics about each column inside this fact. Most of the time the sales amount is used for aggregating, and as we saw, the OrderDateKey is used five times for filtering, while the other keys are used for joining tables. So it's amazing: now we can identify which tables and which columns are important, and based on this information we can derive the indexing for our database. With that we have identified our frequently used tables and columns, and the next step is to choose the right index type. As we learned before, there are multiple types of indexes, and it really depends on the usage and the scenario. For example, if your columns are primary keys, go with the clustered index. If you are joining and filtering on columns that are not primary keys, think about the nonclustered index. Of course, if the table is very big, as we said, you can use the columnstore index. And if you always target a subset of the data, say only one year of information, you can think about the filtered index.
And the last one: if you have a unique column with no duplicates, you can apply a unique index. So it depends on the scenario and the usage; you have to choose the right index. And of course the last step in this phase is to test your index and check that everything works fine. That's all for phase two. Then we come to phase three, scenario-based indexing. Here we tackle and focus on specific issues, specific pain points. That means we first have to identify the slow queries: they could be reported by users, or the team analyzes the logs to understand which queries are causing performance issues. Once you have a list of slow queries, you analyze them one by one, and it is time to dig into the execution plans. As we learned, we can check how SQL Server implements our queries and start looking for trouble spots, for example where SQL Server does a full scan of a table, or where it uses expensive operations like nested loop joins. Once you understand where exactly the pain point is, the next step is to choose the right index: which type of index are we going to use to optimize the query? And once you create the index, the last step is to test it: run the execution plan again to make sure your query is using the index you just created. That means comparing the execution plans before and after, and if you see no benefit, something is wrong: you have to investigate more, analyze the execution plan again, and maybe choose a better indexing approach. You do this process for each slow query until all your queries are fast. But of course, don't forget: indexing is not the only method for optimizing query speed. So as you can see, through these three phases we went from very generic methods of indexing our system to something very specific and scenario-based; as we move through the phases, we dive deeper and deeper into our project. All right, moving on to the last phase: the monitoring and maintenance of our indexes. As we learned, the job doesn't stop at creating and implementing indexes; we have to stay responsible by keeping an eye on the health of our indexes, and here the databases offer a lot of statistics and metadata about your data that you can use. The first step is to monitor the usage of the indexes: as we learned, we can use the dynamic management views and functions in the system schema to see how often each index is used and when our queries last used it. With that we can find all the indexes that we created and that have never been used in our project. The next step is to monitor the missing indexes: we check the recommendations that the database derives from the execution plans, again using those dynamic management views and functions for the details. And we can also monitor whether we have duplicates in the indexing. This happens a lot if you have many developers in your team: they may be working in parallel to optimize the performance of slow queries, and end up creating multiple indexes for the same column.
All right. So now, moving to the last phase, we have the monitoring and maintenance of our indexes. As we learned, the job doesn't stop with creating and implementing indexes; we are responsible for keeping an eye on their health, and databases offer a lot of statistics and metadata that we can use in this phase. The first step is to monitor the usage of the indexes. As we learned, we can use the dynamic management views and functions in the system schema, where we can see how often each index is used and when our queries last used it. With that, we can find all the indexes we created that have never been used in our project. The next step is to monitor the missing indexes: we check the recommendations the database derives from the execution plans, again using those dynamic management views and functions for the details. And we can also monitor whether we have duplicates in our indexing. This happens a lot if you have many developers on a team: they work in parallel to optimize slow queries and end up creating multiple indexes for the same column. So this is something we should check, and if we find duplicates, we have to figure out how to consolidate them.

Then the next step: we have to update the statistics. As we learned, statistics are very important for the execution plan, because the database engine uses this information to decide the best execution plan for your query. If the statistics are stale, the database will make wrong decisions about how to execute your query, which can lead to bad performance. Here again there are special functions to monitor the statistics, but my recommendation is to schedule a weekend job that refreshes all the statistics of your database.

And for the last step, we must not forget to monitor fragmentation. As we learned, over time, as you modify the tables, the physical order of the data can degrade, or unused free space accumulates; that is, the index becomes fragmented. So we have to monitor the fragmentation of each table. If the percentage is between 0 and 10, there is no issue; if the fragmentation is between 10 and 30, we should reorganize the index; and if it is more than 30, that is alarming and you have to rebuild the whole index. For the monitoring, I usually build an automated dashboard in Power BI or Tableau, where I extract all this metadata and create nice dashboards to watch the health of the database, or you can buy specialized tools that do this for you. The queries below sketch what this monitoring looks like.
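A minimal sketch of these maintenance checks, using the system views mentioned above (the index and table names are hypothetical):

    -- index usage: find indexes that are rarely or never used
    SELECT OBJECT_NAME(s.object_id) AS table_name,
           i.name AS index_name,
           s.user_seeks, s.user_scans, s.last_user_seek
    FROM sys.dm_db_index_usage_stats s
    JOIN sys.indexes i
      ON i.object_id = s.object_id AND i.index_id = s.index_id
    WHERE s.database_id = DB_ID();

    -- fragmentation per index, to decide between reorganize and rebuild
    SELECT OBJECT_NAME(ps.object_id) AS table_name,
           i.name AS index_name,
           ps.avg_fragmentation_in_percent
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ps
    JOIN sys.indexes i
      ON i.object_id = ps.object_id AND i.index_id = ps.index_id;

    ALTER INDEX idx_orders_order_status ON sales.orders REORGANIZE; -- 10-30 percent
    ALTER INDEX idx_orders_order_status ON sales.orders REBUILD;    -- above 30 percent

    UPDATE STATISTICS sales.orders; -- e.g. from a scheduled weekend job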
All right. So this is my indexing strategy, the one I usually follow in my projects. As you can see, each phase builds upon the previous one, moving from a general strategy to a more targeted, refined, specific one: we first define the goal of the indexing strategy for the project, and as we move through the phases we target more and more specific scenarios. And this cycle keeps repeating; it is not a one-time exercise. You have to keep discussing whether the goal still suits the project, keep analyzing the frequently used tables and columns, keep searching for slow queries, always keep an eye on the indexes, and, I can only keep repeating this, avoid over-indexing. All right my friends, that's everything about indexes. That was a lot of information and a lot of techniques, and now you know everything about indexing in SQL.

Next comes another important performance technique: we're going to talk about partitions, and how to divide our data in order to optimize performance. So let's go.

All right. So what is SQL partitioning? It's a technique for dividing a large table into small pieces, and each piece is called a partition. This may sound like we are splitting one big table into several smaller tables, but it's not like that: we are dividing one table into smaller partitions. In the database we still see one solid table, but behind the scenes it is split into multiple partitions. So let's understand what this means.

Say you have a table in your database, and over time it keeps getting bigger until you have hundreds of millions of rows. Once you have such a big table, everything becomes slow. For example, if you read the table and the execution plan does a full scan, it can take SQL a long time to fetch all the rows. And if you decide to create an index for this table, SQL will build a very big B-tree index with a lot of branches and leaves. A big index is not always a good thing, because operations like deleting, updating, or inserting rows then take a long time to process. So a big index does not guarantee good performance for a big table; a big table is simply problematic, because everything becomes slow.

So what can we do to optimize the performance of this big table? We can use SQL partitioning, and to do that, we have to understand the behavior and the transactions happening on our table. What usually happens as the table grows over time is this: you have a subset of data that belongs to 2023, another subset created and updated in 2024, and something more current in 2025. So our table holds old data as well as new data, and we usually interact with the new data far more often than the old. Maybe for 2023 there is only one read transaction, and for 2024 two reads and one write, a little more than 2023; but on the new data, the current year, there are heavy transactions: a lot of reads, a lot of writes, updating, inserting, reading. So we are frequently accessing this big table only to interact with the new data, and we rarely need the old data.

What we can do is divide this big table, usually by a date. That means we can split the table by year and put each year in its own partition, so at the end we have three partitions. And it's really important to understand that those are three partitions, not three tables: on the client side, users still see only one table, but behind the scenes we have three partitions. Now say you have a query that reads the data of 2025. What happens? SQL will not scan all the data in the table; it will target only one partition, the 2025 one. So SQL scans only the relevant information, the relevant partition, and not the entire table.

And there is another benefit of partitions. Say you are using a modern database; these normally support parallel processing. If you have the infrastructure for it, the database engine can process each partition independently and in parallel, whether you are reading or writing data, which of course can reduce the overall execution time. So if you have a modern infrastructure, for example Azure Synapse, go with partitions: the partitions can be stored on different servers, and this helps the SQL engine use all the resources at once. Partitions enable scalability as well as parallel processing. And partitions make the indexing more efficient, too.
Instead of having one very big index for the whole table, if you index a partitioned table, each partition gets its own index, which means the individual indexes are smaller. And this helps a lot, both when searching the data and when extending the index itself. For example, if you insert data into the 2025 partition, SQL will not change anything in the indexes of the other partitions; it only updates the index of the 2025 partition. So you can see the power of partitioning: it significantly improves the performance of reading from and writing to a big table. That is what we mean by partitioning and why we need it.

All right, friends. Now we go through the process of creating partitions in SQL. At the start it might sound a little complicated, but we will do it step by step, and I have a sketch for it. There are four steps, because the database has multiple layers. So let's see how we can do that. Let's go.

The first step is to define the partition function. What is that? In the function we define the logic for how to divide the table into partitions, based on the partition key. That means we need a column to drive the logic. We usually use date columns, for example the order date; in other scenarios you might use the region or country, but the most common choice is a date, because our tables grow over time. There are multiple types of partition functions; we will focus on the range function. How does it work? We have a range of dates, and we have to define boundary values. Say I would like one partition per year: then the partition boundary is a value, and the boundary of a year could be either the first day or the last day of that year. In this example we take the last day of the year as the boundary: the last day of 2023, of 2024, and of 2025. We call those values the boundaries of our function.

Between the boundaries we get our partitions. All rows for 2023 and earlier years form partition one: everything from the first boundary backwards. Between the first two boundaries we have partition two, which holds all rows of 2024. Then we have partition three with all rows of 2025, and everything after the last boundary becomes partition four, holding all rows from 2026 onward. So now we have a logic: we are telling SQL how to divide our data into multiple partitions.

And here there are two methods: LEFT and RIGHT. What are they? Consider a boundary value. The big question is: to which partition does the boundary itself belong, partition one or partition two? That is exactly why these two methods exist. If you say LEFT, the boundary belongs to partition number one; if you say RIGHT, the boundary belongs to partition number two. So you have to decide whether the boundaries belong to the left partition or to the right partition.
And with LEFT, partition one contains all the rows of 2023 including the last day of 2023, because partition two focuses only on 2024; the boundary simply belongs to the left partition. It's very simple. Now let's implement this in SQL.

The syntax is straightforward. We say CREATE PARTITION FUNCTION and give it a name; it will be partition_by_year, since we are dividing the data by year. After that we define the data type: we are splitting by a date, so it is DATE. Then we define the partition function type, in our example RANGE, and whether it is LEFT or RIGHT; we stick with LEFT. Now comes the very important step: defining the boundaries. We say FOR VALUES and enter three boundaries, one date per year: the last day of 2023, the same for 2024, and the last one for 2025. With that we have defined the logic, the range, and the boundaries, and we have told SQL that the boundaries are dates. Let's execute the function. Okay, that's it. As you can see, it's very simple: we just created a function that splits the data by date using RANGE LEFT. Of course, this function is not yet attached to any table; it is just a piece of logic stored in the database.

Since our partition function is stored inside the database, we also get metadata about it in the system schema. There is a dedicated view called partition functions, where we find information about all the partition functions inside our database. Let's execute it, and as you can see, we find our newly created partition function: partition_by_year, a range function, with an ID, and so on. I really recommend checking this view before creating a new partition function; maybe the project already has one.

Okay, now to the next step in our process: building the file groups. What is a file group? It is a logical container for one or more data files. It's very simple: file groups are like folders, and we create several of them so we can later place files inside. This is really nice, because it gives us freedom and flexibility to decide how the data files are organized for each partition. What we usually do is create one file group per partition, so we will have four folders, or file groups, for 2023, 2024, and so on. Let's go back to SQL and do that.

All right, let's create those file groups. The syntax is simple. We say ALTER DATABASE and tell SQL in which database these file groups should live (I stay with SalesDB), then ADD FILEGROUP followed by the name of the file group. The first one is for 2023. Let's do the same for the other years: 2024, 2025, and 2026. That's all; we select everything and execute. As you can see, we have just created four file groups, and they are empty: there is nothing inside those containers yet. The statements so far are sketched below.
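Putting this walkthrough together, the statements look roughly like this (the exact identifiers in the video may be spelled slightly differently):

    -- step 1: the partition function (three boundaries = four partitions)
    CREATE PARTITION FUNCTION partition_by_year (DATE)
    AS RANGE LEFT FOR VALUES ('2023-12-31', '2024-12-31', '2025-12-31');

    -- check existing partition functions before creating new ones
    SELECT * FROM sys.partition_functions;

    -- step 2: one file group per partition
    ALTER DATABASE SalesDB ADD FILEGROUP FG_2023;
    ALTER DATABASE SalesDB ADD FILEGROUP FG_2024;
    ALTER DATABASE SalesDB ADD FILEGROUP FG_2025;
    ALTER DATABASE SalesDB ADD FILEGROUP FG_2026;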
Now let's say you made a mistake with the naming and would like to drop one of them. The syntax is just as easy: ALTER DATABASE SalesDB, and instead of ADD you say REMOVE FILEGROUP. Once you execute that, the file group is dropped; but we need it, so let's recreate it. As usual after creating things, let's check whether everything was created correctly and whether we have any duplicates or anything wrong. For that there is also a file groups view inside the system schema; I am filtering on the type FG, for file group. Let's execute it. Now we can see the file groups in our database: the four we just created, 2023, 2024, and so on, plus something called the PRIMARY file group. This is the default file group created for every database; it is the container for all the data files in your database, and you can see a flag marking it as the default (1 for PRIMARY, 0 for the rest). It's really nice to see all the file groups in your database and confirm that you don't have duplicates.

Okay, now we move on to the third step, where things get more physical. So far the function and the file groups are all logical constructs; we don't have data yet. To hold data, we have to create data files. As we learned before, data files contain our actual data and are stored physically on disk. You can assign one or multiple data files to each file group. The file format here is NDF, which is used for secondary data files: we have primary data files (MDF) and secondary data files (NDF), and for partitions we usually go with NDF. So again: the file groups are logical containers, and the data files are the physical files where our actual data is stored. Now let's go back to SQL and create some data files.

Okay, now we come to the slightly annoying part, creating the files, but the syntax is again very simple. We say the same thing, ALTER DATABASE, our database is SalesDB, and this time ADD FILE. Now we have to give SQL not only a name but the physical location of the file. Let's do it step by step. We open parentheses, and first we define the logical name for SQL; this is not the file name, it is the logical name of the file. Let's call it, for example, P_2023, then a comma. Next we give the physical name of the file together with its path: we say FILENAME equals, and define the complete path. In SQL Server there is a default path where data is stored, and I am going to use it; the path really depends on the version and the edition of SQL Server you are using. For the version I'm using in this tutorial, it is under C:, then Program Files, Microsoft SQL Server, the MSSQL16.SQLEXPRESS instance folder (16 is my version, on SQL Express), then MSSQL, then DATA. If we go inside this folder, we can see all the database files.
There we can see, for example, the SalesDB data file, the SalesDB log, AdventureWorks, and so on: all the files of your databases. We will put our partition files in this default folder as well, but for a real project you have to ask the database administrators for the exact location where your partitions should go. So let's go back to SQL: I paste this path, then specify the file name, P_2023, then the dot and the extension, NDF. With that we have a complete path including the file name. We are almost there, but not done yet: we have to tell SQL which container, which file group, this file belongs to. So we add TO FILEGROUP and, making sure to select the correct one, FG_2023. All right, that's all; let's execute it. With that we have created a file inside a file group. I will not create multiple files per file group here; it's going to be one to one.

Now we create the other files, one per file group, per year. We just copy, paste, and change the names: the same for 2024, for 2025, and the last one for 2026. Then we select everything and execute. That's it: we have created four files and mapped each file to the correct file group. I usually don't create a lot of files, just one per year, or one per bunch of years; you don't have to create a partition per day or anything like that.

As usual, after creating things we check the metadata. I have prepared a query that lists the file groups together with their files: all the file information can be found in the master files view, and we join the two views and select our database. Let's run it. Now we get a list of all files in the database: the PRIMARY entry for the database itself, with the path and the size of the file, and our four files, each with its assigned file group and complete path. You can monitor here how the size of each file grows over time; if one of them gets really big, you can think about splitting it into multiple files. So that's how to create data files; the statements are sketched below.
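Roughly, the data-file statements and the metadata check look like this (the path shown is the default for the Express instance used in the video; yours may differ, so check with your DBA):

    -- step 3: one physical data file (.ndf) per file group
    ALTER DATABASE SalesDB ADD FILE
        (NAME = P_2023,
         FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL16.SQLEXPRESS\MSSQL\DATA\P_2023.ndf')
    TO FILEGROUP FG_2023;
    -- repeat for P_2024 -> FG_2024, P_2025 -> FG_2025, P_2026 -> FG_2026

    -- verify: list file groups with their files and sizes
    SELECT fg.name AS filegroup_name,
           mf.name AS logical_name,
           mf.physical_name,
           mf.size
    FROM sys.filegroups fg
    JOIN sys.master_files mf
      ON fg.data_space_id = mf.data_space_id
    WHERE mf.database_id = DB_ID('SalesDB');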
All right. Now we move to the last step, where we define the partition scheme. If you look at this picture, you can see something is missing: on one side we have defined how to divide our data into multiple partitions, and on the other side we have prepared all the files and file groups. What is missing is the connection, how to connect those partitions to the file groups, and we do that with the partition scheme. All we are doing now is defining which partition belongs to which file group. For example, we map partition one to the file group 2023, and with that, all the data of 2023 and earlier goes to the 2023 file group. Of course, we have to map each partition to a file group; if you don't, you will get an error from SQL. And once we build the partition scheme, we have all the components ready for a partitioned table.

So now, a quick summary. The partition function decides how to split your data into multiple partitions. The partition scheme maps the partitions to the file groups. The file groups are like folders that organize your files, and each file group has one or more data files where your actual data is physically stored. At the start all these layers might be confusing, but once you understand each one, building partitions becomes much easier. Now let's go back to SQL and build the partition scheme.

Okay, this is the easiest part, where we connect everything together. The syntax is again very simple: CREATE PARTITION SCHEME, followed by a name; let's go with scheme_partition_by_year. Now we map the partition function to the file groups. First we say AS PARTITION followed by the partition function we created, partition_by_year, and after that we map it to the file groups. Here it is very important to map them in the correct order: the first one is the file group 2023, the second 2024, then 2025, and the last one 2026.

Again, the order matters, and there is also a little trap: when you create the function, you can miscount how many partitions it generates. In our example we have three boundaries, and SQL creates four partitions; it happens that you think "three boundaries, so three partitions", which is not correct. For example, let me remove one of these, so I list only three file groups, and execute. Now we get an error: the partition function generates more partitions than file groups. And that is correct, because our logic splits the data into four partitions and we gave SQL only three file groups, so we have to add the plus one. One more thing: SQL will not check whether you map things to the right file groups, because it does not care about their names. For example, if you move this one to the end, it's going to be a big problem: all the 2023 rows get stored inside the 2024 file group, 2024 inside 2025, and everything gets mixed up. SQL just does what you tell it, so make sure the order is correct. That's it; let's create our scheme. It works. Very simple: we have just mapped the partitions to the file groups, as sketched below.
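A sketch of the scheme statement described above; note that the three boundary values in the function require four file groups here, one per generated partition:

    -- step 4: connect partitions to file groups (order matters!)
    CREATE PARTITION SCHEME scheme_partition_by_year
    AS PARTITION partition_by_year
    TO (FG_2023, FG_2024, FG_2025, FG_2026);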
As usual, we check things after creating them, and I have prepared a really nice metadata query that shows the whole picture: the functions, the file groups, and the schemes (you could add the data files to it as well, but I'll stick with this). In SQL Server there is a dedicated view for the partition schemes; I join it with the functions and then with the destination data spaces in order to get the partition number and the file groups. Let's execute it. Now we can see, very nicely, the scheme we created and the name of its partition function, then the partition number and the file group name, so we can see how everything is mapped together. If you get it like this, everything is good so far.

All right. So far we have prepared all the layers: the functions, the files, the file groups, and the scheme. The setup is ready to be used by any table, but we are still not using it; the logic just exists, and the files are empty. So now we will create a table, but not a normal one: a partitioned table. It is very simple as well. CREATE TABLE, and we give it a name in the sales schema; I'll call it orders_partitioned. We define a few columns: an order_id with data type INT, an order_date with data type DATE, and maybe one more called sales, also INT. This is a very normal table so far, still not partitioned. To use everything we have defined, we do the following: we say ON, and here we only have to tell SQL the name of the partition scheme, because everything else is already connected; the scheme maps the function to the file groups, and the file groups are mapped to the data files. So we give the name scheme_partition_by_year, and now it is very important to pass a column. Since the whole logic of the function is based on a date, we cannot specify the order_id or the sales here, because that makes no sense; we pick order_date. And with that, we have created a partitioned table.

Now let's start inserting data into our table. We say INSERT INTO sales.orders_partitioned and pick some values: order ID 1, any date in 2023, say mid-May, and the sales could be anything, let's say 100. Let's execute this and then query our table. All right, we have one record inside our partitioned table. And now the big question: in which partition, in which data file, did SQL store this record? We have to test whether everything works, and for that I have prepared another query: we ask the partitions view together with the destination data spaces, getting the number of rows in each partition plus the file group, focused on our orders_partitioned table. Let's execute it. We can see the four partitions very clearly, and our new record was inserted in the correct place: the 2023 file group, in the correct partition. With that we know that the function and the whole logic we built works correctly. These statements are sketched below.
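A sketch of the table, the first insert, and the placement check. The exact date is an assumption, and the metadata query is simplified on the assumption that this is the only partition scheme in the database:

    -- the partitioned table: note the ON clause with the partition column
    CREATE TABLE sales.orders_partitioned (
        order_id   INT,
        order_date DATE,
        sales      INT
    ) ON scheme_partition_by_year (order_date);

    INSERT INTO sales.orders_partitioned VALUES (1, '2023-05-15', 100);

    -- which partition and file group did each row land in?
    SELECT p.partition_number,
           fg.name AS filegroup_name,
           p.rows
    FROM sys.partitions p
    JOIN sys.destination_data_spaces dds
      ON dds.destination_id = p.partition_number
    JOIN sys.filegroups fg
      ON fg.data_space_id = dds.data_space_id
    WHERE p.object_id = OBJECT_ID('sales.orders_partitioned');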
Now let's add more records. I'll just duplicate the insert: record number two, with a date in 2024, and I'll change the sales value to 50. Let's execute it. Now we have a second row in our table, and again the big question is whether it works. Let's run the check again: our record was inserted into partition two, in the 2024 file group, which is correct.

Next, let's check whether the boundaries work correctly. For the third row I'll use the last day of 2025: month 12, day 31. Let's insert it and check our table; we have a new record. My expectation is that this row lands in the 2025 file group. Let's execute the check, and that is correct: the record was inserted into the right partition. It is really important to test the boundaries, because they are a little tricky, with this RANGE LEFT or RIGHT and the boundary values; this is how you verify that the logic matches your expectations. And the last one, very quickly: 2026, and I'll pick the first day of that year. Let's insert it. What is the expectation? Pretty simple, I think. Let's query: the first day of 2026 was inserted into partition number four. So I can say everything is working correctly. If you get results like these, you have successfully created a partitioned table and prepared all of its layers correctly.

I know this is a lot of work, but to be honest it is fun, because for the first time in a database you feel like you are in control. Usually everything in a database happens behind the scenes, and you don't know exactly where the files of your tables are stored; there is a lot of abstraction. But here we go deep into the database and manage all those files ourselves, and it is sometimes nice to have this freedom and flexibility. All right, one quick thing I'd like to show you: if you go to the database in the object explorer and expand Storage, you can easily find information about the partitions, including the partition scheme and the partition function we created. It is just quick access, instead of querying the metadata.

Now, a quick summary of how everything is connected. We have a table, and we tell SQL it is connected to a partition scheme. In the partition scheme everything is wired together: it is linked to a specific partition function, which defines the partitions, and at the same time it is connected to the file groups, and the file groups are connected to the data files. So all those layers and elements are connected together. Now let's see how this works in action. We insert the last day of 2025, and the first thing that happens is that the partition function decides which partition it belongs to.
As you can see, it is a boundary value, and since we defined the function as LEFT, it targets the left partition, partition three. Then the partition scheme connects it to the right file group, in this scenario the 2025 file group, and since we have only one file there, it also goes to the correct data file, where SQL stores the row. So it is pretty easy.

Now we come to the very important part, where we can understand how partitions really improve the performance of a query, and of course we do that by checking the execution plan. In order to compare the behavior with and without partitioning, we first create a mirror table without partitions. We take our partitioned table's query and simply add INTO, calling the new table sales.orders_no_partition: we take the data and the structure from orders_partitioned, and of course the copy will not be partitioned. Let's execute it. If you look in the explorer, we now have two tables: the non-partitioned one and the partitioned one.

Now we write a query against both tables and compare the execution plans. First the non-partitioned table: we select from it, and in order to see the effect of the partition we add WHERE order_date equals a specific value, say the 1st of January 2026. Let's run it, and then the same query against the partitioned table. To see the execution plan, make sure to enable it: in the toolbar, click "Include Actual Execution Plan", then execute; do the same for both queries so that we get a plan for each.

Let's check what we have. Right-click the scan operator in the plan and open the properties, and we see a lot of details; what is interesting is the number of rows. In the non-partitioned plan we are reading four rows, meaning the whole table, along with the CPU and the other costs. Now the partitioned plan: the total number of rows is one. SQL did not read all four rows, only one, because the targeted partition contains only one row, and the number of partitions used is also one. So with partitioning we have reduced the number of rows retrieved from the files.

Now let's retrieve data from two different partitions and check the execution plan again: let's also target the last day of 2025. Execute both queries. Without partitioning we are still reading four rows; but in the partitioned plan's table scan we read only two rows, and this time the number of partitions involved is two, because we hit the partitions for 2025 and 2026. So you can see it is worth the effort: we have optimized our queries, and on big tables this has a great impact, since the resources and the number of reads are reduced massively. The comparison is sketched below.
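A sketch of that comparison. The second filter is written here as a range, as one way to touch the 2025 and 2026 partitions together; the video does not spell out the exact predicate:

    -- mirror table: same data and structure, but no partitioning
    SELECT *
    INTO sales.orders_no_partition
    FROM sales.orders_partitioned;

    -- enable "Include Actual Execution Plan" in SSMS, then compare:
    SELECT * FROM sales.orders_no_partition WHERE order_date = '2026-01-01'; -- reads all rows
    SELECT * FROM sales.orders_partitioned  WHERE order_date = '2026-01-01'; -- reads 1 partition

    -- touching two partitions (2025 and 2026):
    SELECT * FROM sales.orders_partitioned
    WHERE order_date BETWEEN '2025-12-31' AND '2026-01-01';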
All right my friends, that's everything about partitions in SQL. They are amazing, and you can use them not only in databases but in many other data platforms and tools, where you can always divide your data in order to optimize performance.

For the next part, here is what I have prepared for you: after 15 years of working with SQL in real projects, I have a lot of best practices and tips. I have collected everything I know, and now I'm going to show you the best practices, tips, and tricks for optimizing performance in SQL. So let's go.

Before we dive into the 30 best practices, I'm going to give you the golden rule. The SQL optimizer responds differently to different table sizes. If you have small and medium tables, hundreds of thousands of rows, you might not notice any performance difference from following the best practices, simply because the data is small. But with millions or hundreds of millions of records in a table, you will immediately notice how much faster things get when you follow them. And here is my golden rule: whenever you get a best practice from me, or read something on the internet, always test it using the execution plan. For example, if two queries return the same result, check both execution plans; if you notice no difference between them, pick the one you find easier to read and understand, because a query tuned for performance can be a little more complicated. Always write the query to be understandable, and only optimize it once you notice it is slow. So: if the new query improves performance, take it; if there is no gain, focus on making your queries readable. The golden rule is always test, test, test with the execution plan. Now let's dive into the best practices, starting with optimizing the performance of our queries.

All right, let's start with the easy stuff. Tip number one: select only what you need. What I see in many queries is that developers simply select all columns from a table, and I cannot think of a single scenario where one query needs every column of a table. The result will contain unnecessary columns, and of course reading unnecessary information makes your query slower. So this is a bad practice: don't use SELECT *; instead, list the columns your query actually needs. That way you don't risk reading unnecessary information from the database. A tiny example follows.
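A minimal illustration, with hypothetical column names:

    -- bad practice: fetches every column, needed or not
    SELECT * FROM sales.orders;

    -- good practice: list exactly what the query needs
    SELECT order_id, order_date, sales FROM sales.orders;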
Okay, tip number two: avoid unnecessary DISTINCT and ORDER BY. I have noticed that many developers, as they write lots of queries, habitually add DISTINCT and ORDER BY to every query by default. When we review the code together with the developer, we often find there are no duplicates to remove at all; the DISTINCT was just a habit. And the same goes for ORDER BY: in many situations there is no need to sort the data at all. These operations, removing duplicates and sorting, are very expensive in your execution plan; they take a lot of resources and slow your query down. So it is a bad practice to always use DISTINCT even though it isn't needed, or to sort the data when it isn't necessary. The best practice is to avoid them: use DISTINCT or ORDER BY only when really necessary.

Okay, the next one: for exploration purposes, limit the rows. Sometimes, especially when working with a new database, you want to explore the tables, just a quick peek at their content. If your database has big tables with millions of rows, you will consume a lot of resources if you just select the data as is. Imagine the orders table has 100 million rows: run that query, and the database has to fetch all 100 million for you, while for exploration it is usually enough to see ten rows. That is why exploring tables without a limit or TOP is considered a bad practice. A good practice is to say SELECT TOP 10 and then write the same query: you get only ten rows, and the database does not fetch 100 million. And if you explore a lot of tables this way, you won't drain the database's resources. So when exploring, always limit the number of rows you retrieve.

All right, now let's talk about optimizing filtering in SQL. The tip here is to create a non-clustered index on columns that are frequently used in the WHERE clause. Of course, you have to check your queries first, and if you see that you frequently filter the data by the order status, it makes sense to create a non-clustered index on that column to improve the performance of your queries. So in this situation I create a non-clustered index on the sales orders table for the order status, and with it in place the query gets faster. Both tips are sketched below.
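Sketches for those two tips, again with hypothetical names:

    -- exploration: limit the rows instead of fetching the whole table
    SELECT TOP 10 * FROM sales.orders;

    -- frequently filtered column: give it a non-clustered index
    CREATE NONCLUSTERED INDEX idx_orders_order_status
    ON sales.orders (order_status);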
Okay, the next one: avoid applying functions to columns in the WHERE clause. In many cases we transform a column before filtering. For example, here I apply the function LOWER to the order status, because I'm searching for the value "delivered" and I'm not sure how the values are stored in the table, camel case, uppercase, or whatever. To make sure I find the value, I say LOWER on order_status and compare it with a lowercase literal, and of course it works: we find the status Delivered, whose stored value differs from my literal only by the capital first character. But here we have a problem: there is an index on order_status, and if you wrap the column in a function like LOWER, SQL will not use the index. The whole index becomes useless, and SQL ignores it. That is why using functions on columns in the WHERE clause is a bad practice; the good practice is to skip the function and write the value exactly as it is stored in your data. Then SQL is happy and uses the index you created.

Let's take another example of this rule. Here we select all customers whose first name starts with A. We could use the SUBSTRING function to take the first character of the first name, match it against 'A', and get the result, here Anna. But again, this is bad if there is an index on first_name, because we are applying a function to the column. Instead, we can use the help of LIKE: we search for the pattern that starts with A followed by a wildcard, since we don't care about the rest. Execute it and you get the same result. So avoid functions in the WHERE clause as much as you can, in order to hit the index; in many scenarios there is a workaround that removes the transformation.

All right, one more example you see a lot in queries: filtering by year. We are searching for the orders that happened in 2025, and we typically write YEAR on the order date. If there is an index on order_date, this again will not work, because YEAR is a function; a bad practice. Instead of the YEAR function, we can use BETWEEN: we apply no function to order_date and say the order date is between the boundaries of the year. Sure, the query does not look as cool and easy as the first one, but with the second one we hit the index. So while filtering, try not to use functions on indexed columns; it is a real waste to have an index and not use it, and in most cases there is a workaround for your function. Those are the three examples I wanted to show you for this tip; here they are side by side.
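The three rewrites, sketched with hypothetical tables and values:

    -- bad: function on the column prevents index usage
    SELECT * FROM sales.orders    WHERE LOWER(order_status) = 'delivered';
    -- good: match the value exactly as stored
    SELECT * FROM sales.orders    WHERE order_status = 'Delivered';

    -- bad
    SELECT * FROM sales.customers WHERE SUBSTRING(first_name, 1, 1) = 'A';
    -- good
    SELECT * FROM sales.customers WHERE first_name LIKE 'A%';

    -- bad
    SELECT * FROM sales.orders    WHERE YEAR(order_date) = 2025;
    -- good
    SELECT * FROM sales.orders    WHERE order_date BETWEEN '2025-01-01' AND '2025-12-31';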
All right, moving on to a similar one: avoid leading wildcards, as they prevent index usage. Say I'm searching for the word "gold" inside the last name. Here we have to be careful about what we are searching for: should "gold" appear anywhere in the last name, or are we only looking for last names that start with "gold"? If it's the latter, then a pattern with a leading wildcard is doing it wrong, and with a leading wildcard SQL will not use the index. A trailing wildcard at the end, on the other hand, is fine and does not prevent index usage. So the leading wildcard is considered a bad practice, because you will not hit the index; if a trailing wildcard is enough for your search, use it, and you will hit and use the index.

Okay, moving on to the next one: use IN instead of multiple ORs. The OR operator is very evil for performance; try to avoid it, because it really kills your performance whether it sits in filters or in joins. Say we want to show the orders where the customer equals one, or two, or three. Chaining ORs like that is considered a bad practice and is hard to read; please don't do it. Instead we have the IN operator: we say that if the customer is one of those values, show the orders. Run it and you get the exact same result; it not only looks nicer than the first query, it also performs better. So if you find yourself writing a lot of ORs, think about the IN operator. Those are the best practices for filtering data to improve performance; the last two are sketched right below.
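Sketches for the wildcard and IN tips (hypothetical names and values):

    -- bad: leading wildcard, the index cannot be used
    SELECT * FROM sales.customers WHERE last_name LIKE '%gold%';
    -- good (if a starts-with search is enough): trailing wildcard only
    SELECT * FROM sales.customers WHERE last_name LIKE 'gold%';

    -- bad: chained ORs
    SELECT * FROM sales.orders
    WHERE customer_id = 1 OR customer_id = 2 OR customer_id = 3;
    -- good: the IN operator
    SELECT * FROM sales.orders
    WHERE customer_id IN (1, 2, 3);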
Okay, so now let's focus on optimizing joins in SQL. The first tip is to understand the speed of the different joins and to use the inner join whenever possible. As we learned before, we have different types of joins: inner, left, right, and full outer. In terms of performance, the best comes from the inner join, because SQL works only on the matching rows; the effort and the processing time are lower than for the other joins. Next in the ranking are the left and right joins. They are slightly slower than the inner join, because they usually process more rows: SQL works not only with the matching rows but with the unmatched ones as well, so it has more to do. And the worst-performing type is the full outer join, because it works with the biggest number of rows of all: it presents unmatched rows from both the left and the right tables, so SQL has a lot to do. My advice: always try the inner join if working with the matching rows is enough, and if it isn't, then maybe go with the left join; but always try an inner join before reaching for a left join, and don't forget that an inner join filters the data.

Okay, the next one: use explicit, ANSI joins instead of implicit joins. It is considered a bad practice to join tables implicitly, the non-ANSI way, listing tables in the FROM clause and matching them in the WHERE clause; it is better to use the normal, modern join syntax, for example INNER JOIN. Regarding performance, there is no difference between them, and for a simple scenario it hardly matters; but in a complex query, implicit joins become very confusing, really hard to read, and complex to optimize. That is why the best practice says: go with the ANSI join instead of the non-ANSI join.

Okay, the next tip: make sure to index the columns used in the ON clause. We have to make sure both join columns have an index, because indexes speed up the lookup process. Without an index on those columns, the database might scan the entire tables in order to find a match, and that is really slow if you have big tables. Now, if you go to the customers table and open its indexes, you can see there is a clustered index on customer_id; but if you check customer_id in the orders table, there is no index for it. To fix that, we create a non-clustered index on the orders table for the customer_id column, since it is a foreign key. Once we do that, both join columns have an index, and with that our join is faster.

So now we come to a tip where I have to say: it depends; there is no single clear way to do it. But if you have big tables, it is better to filter data before joining. Here we have three different scenarios that deliver the same result, and of course the question is which one performs best. What we are doing is joining two tables and filtering the result on the order status, which comes from the orders table. In the first query we join the tables and use a WHERE clause at the end; looking at it, we are filtering the data after joining. But there is another way: you can join the tables and add the condition order_status equals 'Delivered' to the join itself, so we match the data on customer_id and filter on the order status at the same time; since we use an inner join, the filtering happens during the join. Or you can do it with a bit more code: instead of joining customers directly with orders, we first prepare the orders table before joining it, where the preparation selects only the columns we need and already filters the data, using a subquery. Run all of these queries and you get the exact same results. And of course there is yet another variant: prepare the data not in a subquery but in a CTE, and then join the result of the CTE with the customers table.

Now, about performance: if your query is small and not that complex, and your tables don't hold big data, all three queries deliver the same performance. I know it might sound weird, because we filter after the join in one version and during the join in another, but modern SQL optimizers are very smart: they understand there is a filter and decide on the best execution plan for you. So wherever you put your filter, after, during, or before, SQL is smart enough to handle it correctly, and if you don't have a complex query or big tables, go with whichever suits you; I really recommend the first one, because it is logical and easy to understand. But with big tables and complex queries, the best practice says: always try to prepare the data before joining it; isolate and abstract the preparation step in a subquery or a CTE before joining it with any other tables. In many scenarios in my projects with big tables this helped, and the execution plan was better when I isolated and prepared the data before the join. So: for small or medium tables, go the normal way with the WHERE clause; for complex queries on big tables, prepare the data in a subquery or CTE and then join it with the tables, as in the sketch below.
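A sketch of the two main variants (hypothetical schema):

    -- variant 1: filter after the join (fine for small/medium tables)
    SELECT c.first_name, o.order_id
    FROM sales.customers c
    JOIN sales.orders o ON c.customer_id = o.customer_id
    WHERE o.order_status = 'Delivered';

    -- variant 2: prepare (project + filter) the big table first, then join
    WITH filtered_orders AS (
        SELECT order_id, customer_id
        FROM sales.orders
        WHERE order_status = 'Delivered'
    )
    SELECT c.first_name, fo.order_id
    FROM sales.customers c
    JOIN filtered_orders fo ON c.customer_id = fo.customer_id;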
Okay, and now moving on to tip number 12. It is similar to the previous one, but this time it says: aggregate data before joining tables. Again, this is a special case for improving the performance of big tables. We have the following scenario: we join the orders and the customers, we aggregate the data by customer_id, and we join the customers table only because we need the first name. As a result we get the customer ID, the first name, and the order count. The standard way is to join the tables and then do a GROUP BY to summarize the data. But if you look at this query, we don't actually need the join in order to do the aggregation: we can do the aggregation first, preparing the orders with the aggregated data, and then join the result with the customers just to get the first name. Again: prepare first, then join, using either a subquery or a CTE. In this version, we first do the GROUP BY to aggregate the data, and the result is joined with the customers table for the first name. And of course there are many other ways to do it, for example correlated subqueries, where the subquery sits in the SELECT list and a WHERE condition ties it to the outer row.

All three deliver the same results, but the question again is which one has the best performance. I can tell you immediately that correlated subqueries are the worst. Always avoid using them; they have really bad performance, because SQL performs the aggregation for each customer individually, going row by row, one aggregation after another, which takes a long time. So that is a bad practice; don't use it. That leaves us with the first and second options, and here my tip is like the previous one: with small to medium tables, go with the straightforward join-then-group version, because it is easier to read and understand, and you will get exactly the same performance as the subquery version. But if your tables are big, the best practice is to prepare the data first, to group it, filter it, and isolate it in a subquery or a CTE, before joining it with the final table in the final query. But again, this is only for big tables, and always test and check the execution plan to see whether you really gain anything. So, if you have big tables, try to prepare the data first in a CTE or subquery and then join, as sketched below.
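A sketch of the aggregate-before-join pattern (hypothetical schema):

    -- prepare the aggregation first, then join only to fetch the name
    WITH order_counts AS (
        SELECT customer_id, COUNT(*) AS order_count
        FROM sales.orders
        GROUP BY customer_id
    )
    SELECT c.customer_id, c.first_name, oc.order_count
    FROM sales.customers c
    JOIN order_counts oc ON c.customer_id = oc.customer_id;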
Okay, moving on to the next tip: use UNION instead of the OR operator in joins. What does that mean? Sometimes you are joining two tables, the customers and the orders, and the join condition says: the customer ID equals the customer ID from the orders, or the customer ID equals the salesperson ID; if either condition is fulfilled, we have a match. And I can tell you, the OR operator here is a performance killer. It has really bad performance, so try to avoid it: don't use OR in joins. It causes a lot of problems; it prevents index usage, it produces nested loop joins, and so on. That is why we consider it a bad practice. Now, to get the same result, we can split the join into two queries: the first joins the data on the customer ID and the second on the salesperson ID, and then we merge the two results using UNION. It sounds bigger and like more work for SQL, but you will get better performance than with that simple OR operator. So again: on big tables, avoid OR in joins and use UNION instead.

Okay, the next tip says: check for nested loops and use SQL hints. Imagine we have big tables and we are joining them. When you check the execution plan, always look at the join type. Here, for example, it uses nested loops, which is fine because we have small tables; but if you have big tables and SQL for some reason still uses nested loops, that is alarming. To change this, we can use a SQL hint to force SQL to use the hash join. A hash join is really good when a big table, for example the orders, is joined with a small table like the customers. At the end of the query we write OPTION (HASH JOIN); let's execute it and check the execution plan, and with that we have forced SQL to use a hash join, or hash match. Again, you really have to evaluate your tables here: with small tables, don't bother, but with big tables, nested loops are usually very slow because of all the iterations, whereas with the hash join the small table is kept in memory and the matching between the two tables is really quick. Those are all the best practices and tips for optimizing joins in SQL; the last two are sketched below.
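Sketches for the OR-in-join split and the join hint (hypothetical schema; OPTION (HASH JOIN) is a T-SQL query hint):

    -- instead of:
    --   ... JOIN ... ON c.customer_id = o.customer_id
    --                OR c.customer_id = o.salesperson_id
    SELECT c.customer_id, o.order_id
    FROM sales.customers c
    JOIN sales.orders o ON c.customer_id = o.customer_id
    UNION
    SELECT c.customer_id, o.order_id
    FROM sales.customers c
    JOIN sales.orders o ON c.customer_id = o.salesperson_id;

    -- force a hash join when nested loops show up on big tables
    SELECT c.first_name, o.order_id
    FROM sales.orders o
    JOIN sales.customers c ON c.customer_id = o.customer_id
    OPTION (HASH JOIN);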
Okay, the next tip says: check for nested loops and use SQL hints. Imagine that we have big tables and we are joining them. When you check the execution plan, always look at the join type. For example, here it is using nested loops, which is of course okay because we have small tables. But if you have big tables and SQL is still, for some reason, using nested loops, then this is alarming. In order to change this, we can use SQL hints to force SQL to use a hash join. A hash join is really good if you have a big table, for example the orders, joined with a small table like the customers. So what we can do is write OPTION (HASH JOIN) at the end. Let's execute it and check the execution plan, and with that we have forced SQL to use a hash join, or hash match. Again, you really have to evaluate your tables here. If you have small tables, don't bother with this. But if you have big tables and SQL is still doing nested loops: nested loops are usually very slow because of the many iterations, while with a hash join the small table is stored in memory and you get really quick matching between the two tables. So those are all the best practices and tips on how to optimize joining tables in SQL. All right, so now we're going to talk about UNION, and here is the best practice. It says: use UNION ALL instead of UNION if duplicates are acceptable. It's very simple: if duplicates are acceptable, or let's say there are no duplicates, then don't go with UNION, because it takes more time to execute. SQL has to check row by row whether we have duplicates or not, and this usually takes longer than using UNION ALL. So if duplicates are acceptable, or you don't have any duplicates in your data, go with UNION ALL; it just merges all the data without checking anything, and the performance will be faster. All right, the next one is a little bit tricky. It says: use UNION ALL together with DISTINCT instead of UNION if duplicates are not acceptable. So you want to remove the duplicates. We have learned that in order to do that we use UNION: it merges the data and removes the duplicates, which is really fine if you have small or medium data. But if you have huge tables, hundreds of millions of rows, the best practice says: go with UNION ALL and afterwards use DISTINCT. So in the subquery we use UNION ALL, and to remove the duplicates we use DISTINCT on top. But again, you have to test it and check the execution plan. If you are getting a benefit, go with this version. But if your data is not really big, say hundreds of thousands of rows, just go with the normal UNION: the code is smaller and you will get the same effect. Only for large tables should you go with this best practice. So that's all I have for you about UNION. Okay. So now let's talk about aggregations, and here the tip says: use a columnstore index for aggregations on large tables, for example fact tables. That's because a columnstore index compresses the data, so the size of the data is smaller, and the aggregation is super fast because we are reading only the relevant columns. It makes a perfect setup for aggregating large tables. Now let's say that we have hundreds of millions of orders and we have this query over here. The best practice says: convert this table to a clustered columnstore index. If you create this clustered index, the whole table is going to have amazing performance for aggregations like this. All right, to the next one. It says: pre-aggregate data and store it in a new table for reporting. Let's say we have a big query where we are aggregating the data, and this query takes a really long time, maybe 5 minutes or so. The problem with that: I would like to show the results as a report, maybe to my manager or during a meeting, and it's going to be really bad if everyone has to wait until the query is done. So the best practice here: if you have a query that runs very slowly, you can store its results in a table. If I go over here and say INTO sales_summary, it is going to store the result inside this table. Let's execute it. And with that we have a nice table where everything is prepared, so all you have to do is query this table, and of course it's going to be very fast because it's only a SELECT statement. With that you have prepared and pre-aggregated the data for fast reports. So don't forget about this: if you have a big query, you can insert its result into a new table in order to use it later for reporting. But one thing you have to make sure of is that you always update this table. If we get new orders, they will not be present inside the sales summary; you have to run this query again to get the new data into it. So those are the tips on how to improve the performance of your aggregations in SQL.
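Two quick sketches of those aggregation tips, with assumed table names:

-- 1) Columnstore for a big fact table: compressed storage, fast scans
CREATE CLUSTERED COLUMNSTORE INDEX cci_orders ON dbo.orders;

-- 2) Pre-aggregate once into a reporting table...
SELECT customer_id, SUM(sales) AS total_sales
INTO dbo.sales_summary
FROM dbo.orders
GROUP BY customer_id;

-- ...then reports just read the small, prepared table
SELECT * FROM dbo.sales_summary;
-- Note: rerun (or schedule) the load above, or new orders won't show up.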
So now, what is happening here? I would like to show the orders, but only from customers in the USA. If you check this query, we are joining the tables orders and customers, but we are mainly showing only the order information; that means we are using the customers only to filter the table orders. And there are multiple ways to do this task. It's not only joins: you can use EXISTS with a subquery, and you can also use the IN operator with a subquery. And now comes the old but gold question: which one is better? Should we join, or use EXISTS, or IN? And oh my god, if you go to the forums, you will see people fighting about which one is the best. Now, about the best practices: everyone agrees not to use the IN operator. This is the bad practice; avoid it, don't use it. And of course I'm always speaking about big tables here, not small tables. So we don't use IN to filter one table based on the result of another table. Now here comes the conflict: we have JOIN and EXISTS. Well, the performance of those two is very similar for medium tables, and I'm speaking about hundreds or thousands of rows. But you still have to test: compare the execution plans, and if you are getting identical results and both of them have the same speed, then I prefer to go with the join, because, to be honest, it is easier to write than EXISTS. So from my point of view, the join is the best practice if its performance is equal to EXISTS. But what sometimes happens to me is that I get better performance using EXISTS, and then, from my point of view, EXISTS is the best practice. Now you might ask why we get better performance with EXISTS than with the inner join. That's because SQL only has to check the existence of data from the subquery, while with the inner join SQL has to actually match the two tables: it evaluates all matching records instead of just checking whether a match exists. And sometimes SQL also has to deal with more rows, because you might introduce duplicates as you join tables, and this will not happen using EXISTS. So in some scenarios EXISTS might give you better performance than a join, but everyone agrees not to use the IN operator. Okay, the next tip is to avoid redundant logic in your query. This happens a lot if you have many subqueries, and if you analyze them you might sometimes find redundancy. For example, in this query I would like a tag for each employee saying whether the salary is above or below the average. We might do it like this: we get the employees where the salary is higher than the average, calculating the average in a subquery, and if it's higher we write 'Above Average'. Then we say, okay, let's go for the below average: we do a UNION ALL and the condition is salary less than the average. And by checking this, you see that there's a problem. First of all, we are querying the employees table four times: one, two, three, four, so we are scanning the employees table four times. And we also have the same logic twice: we are calculating the average salary in two places. This is, I can say, a bad practice, and there are many ways to do it better. For example, you can put the subquery in a CTE and then use it multiple times. But there is a better solution using a window function. If you check this, it is very simple. Let me execute it. We are reading the table employees only once, and then we are using a CASE statement: if the salary is higher than the window function (we are calculating the average over the whole employees table), we write 'Above Average'; if it's lower, then 'Below Average'. As you can see, it is easier to read, it is smaller, and the performance is way better than reading the employees table four times and repeating the same logic. So always look at your queries, and if you see that you are repeating the same things over and over, then you are writing a bad query. Think about alternatives like CTEs and window functions, and I'm sure you will find a better way than reading the table several times or repeating the same logic several times.
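A minimal sketch of that rewrite, assuming an employees table with employee_id and salary columns:

-- One scan of the table, one CASE, the average computed once as a window
SELECT
    employee_id,
    salary,
    CASE
        WHEN salary > AVG(salary) OVER () THEN 'Above Average'
        ELSE 'Below Average'
    END AS salary_tag
FROM employees;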
So as you can see, optimizing queries is not always about using indexes and partitions; it's also about following best practices. All right guys, with that we have covered a lot of best practices on how to optimize the performance of your queries. And as you can see, it's not always about creating indexes, right? In many scenarios it's about how you write the query. Now in the next section I'm going to show you the best practices for creating tables: the best practices of DDL, the data definition language. If you have a poor definition of your tables, it has a big impact on the performance of your queries. All right. So now we have here a DDL to create a table customer_info, and it is not really following best practices, so let's go through it one by one. The first tip is: try to avoid the data types VARCHAR and TEXT if possible. VARCHAR and TEXT are among the worst data types for performance, because they consume a lot of resources whatever you do. For example, if you are sorting the data by a column that is VARCHAR or TEXT, it is a very expensive operation; the same if you create an index on top of such a column, that's expensive as well, and they cause a lot of problems with data fragmentation and many other issues. So try as much as you can to skip those data types if possible. Now let's review all the columns to see whether we can change something, because this table has a lot of VARCHARs. The first one is a VARCHAR because it is the first name; well, that is okay. Moving on to the next one, we have the last name as TEXT, which is not really good, because TEXT is even worse than VARCHAR. So it's better to use VARCHAR than TEXT; here we have to fix it, so VARCHAR, and I'm going to go with length 50. Moving on to the country: the country stays a VARCHAR, we cannot change that, it contains characters. The next one is the score of the customer. Ah, here we can do something, because scores are only numbers. So let's remove the VARCHAR and say: you are an integer, and with that we have avoided using VARCHAR. The same goes for the birthday. A birthday is a date, and here we have it as a VARCHAR. Well, that is not really good, and we can avoid it by making this column a DATE; a DATE is way better than a VARCHAR. All right, and the next one is already an integer. So with that we have fixed a few things: the score and the birthday, and we have saved some storage. If we have an index on the score, it's going to be way better than an index on a VARCHAR, and if you are filtering the data based on the birthday, it's going to be faster. So again, try your best to avoid VARCHAR and TEXT. I have seen in many projects that a lot of developers tend to use VARCHAR, and I understand: it is easier to make everything a VARCHAR than to decide whether it is an integer, a date, a float and so on, because you can fit everything into VARCHAR and TEXT. But this is lazy. Take the time to understand the content of each column and assign it the correct data type, because this really has an impact on the performance.
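A small before/after sketch of those data-type fixes; the column names are illustrative, and the two versions are shown side by side (in practice you would ALTER the existing table):

-- Before: everything squeezed into VARCHAR/TEXT
CREATE TABLE customer_info (
    first_name VARCHAR(MAX),
    last_name  TEXT,
    country    VARCHAR(255),
    score      VARCHAR(MAX),
    birthday   VARCHAR(255)
);

-- After: real types for real content
CREATE TABLE customer_info (
    first_name VARCHAR(50),
    last_name  VARCHAR(50),
    country    VARCHAR(255),   -- the length gets fixed in the next tip
    score      INT,
    birthday   DATE
);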
Okay, to the next one. It says: avoid using MAX or overly large lengths. Now we have to keep our eyes on the length of each data type, especially the VARCHARs. Not only is an oversized length going to waste a lot of storage, it is also going to mislead SQL into creating large indexes, which is totally unnecessary, because the data itself is small; but because you have defined a large length, SQL is going to use that information and decide to build a big index, and large indexes are always problematic, because they slow everything down: sorting the data, retrieving data, updating the index. So it is really bad practice to blindly define MAX or 255 everywhere. Again, give each column a chance: think about it and predict a length for it. For example, if you check over here, we are saying first_name VARCHAR(MAX). Well, most first names are short, so we don't need the maximum size of a VARCHAR to fit a first name; here we can easily go with 50 instead of MAX. The same goes for the column country: we don't need 255 characters for a country name, we can go with something more realistic like around 50. I think you could even go smaller, but 50 is fine. So the best practice here is to analyze your data and predict the size of each column, and don't be lazy by just defining MAX everywhere. I know it's faster, but it's bad for performance. Okay, what else do we have? Use the constraint NOT NULL as much as possible. NOT NULL is amazing; it has a lot of advantages. Of course the biggest advantage is the data integrity of your table: with it, you make sure no NULLs are inserted into a specific column. But it is also good practice for improving performance: if you are creating an index, you're going to get better index performance, since SQL knows there are no NULLs inside the B-tree of the index. And on the other side, when writing queries, we tend to add a filter saying a specific column should not be NULL; but if you make sure in the DDL that it is NOT NULL, you can skip this filter, and with that you are reducing the size of your query. So what we're going to do is go through all the columns and decide whether each one is NOT NULL or nullable. For example, the first name and the last name should not be NULL, so I'm going to say NOT NULL for both. For the customer ID, we're going to talk about it soon, because we're going to convert it to a primary key, and primary keys are always NOT NULL. Now for the country: maybe the business says it should not be NULL, so we make a constraint for it. About the total purchases and the score: if it is a new customer, we might have a NULL in our data, so we leave those nullable. And I think the birthday is usually optional, so we leave it as well; and whether the customer is an employee or not, this could also be NULL. So with that, we have found three columns where we can add a NOT NULL constraint, and if we create an index on the country, it's going to be a better index.
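Here is the same table sketched with realistic lengths and the NOT NULL decisions applied (column names and the employee flag are still illustrative assumptions):

CREATE TABLE customer_info (
    customer_id     INT NOT NULL,          -- becomes the primary key in the next tip
    first_name      VARCHAR(50)  NOT NULL,
    last_name       VARCHAR(50)  NOT NULL,
    country         VARCHAR(50)  NOT NULL, -- business rule: always known
    total_purchases FLOAT,                 -- nullable: new customers have none yet
    score           INT,                   -- nullable for new customers
    birthday        DATE,                  -- optional
    is_employee     BIT                    -- optional flag
);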
Okay, moving on to the next one. It says: make sure that all your tables inside the database have a clustered primary key. A primary key helps you build the relationships between tables, with primary keys and foreign keys, so you can join tables very easily. A primary key also matters for performance: in SQL Server, the default is a clustered index, and it is really good to have an index on the primary key, because when you are doing update or delete operations, it helps with the lookups when joining tables. So there are a lot of performance benefits to having a primary key; make sure that all your tables have one. As you can see, the issue with our table is that we don't have a primary key, and our primary key is going to be the customer ID. So let's do that: PRIMARY KEY, and as I said, by default it will be clustered, but I'm going to write it down explicitly in case you are working with a different database; make sure it is clustered. Okay, moving on to the next one. It's not only about the primary key; we have to take care of our foreign keys too. The best practice says: create a nonclustered index for the foreign keys if they are frequently used. Foreign keys are usually important for connecting and joining two tables, we use them frequently, and not only that, we sometimes use them to filter the data, and if you create a nonclustered index for them it can improve the speed. What we can do is very simple: we create a nonclustered index on our table customer_info for the foreign key employee_id. So we say CREATE NONCLUSTERED INDEX on the table customer_info, on our foreign key employee_id. But again, make sure this is an important foreign key that is used frequently in your queries. All right friends, so as you can see there are a lot of best practices for improving and optimizing the DDL; having a healthy DDL can improve the performance of your queries. Now in the next section I'm going to show you the best practices, tips and tricks about indexing. So let's go. All right, the fifth best practice, and the most important one, is: avoid over-indexing, because too many indexes are going to slow down the insert, update and delete operations, they also confuse the execution plan about choosing the right index, and the performance of the whole system goes down. Another tip is to monitor the usage of your indexes, and I can tell you, 90% of the indexes that get created are usually not used at all. They take a lot of space and slow everything down, so go and drop those unused indexes in your system. The next best practice is to have a regular job, maybe a weekly job. First, you have to update the statistics regularly: as you insert new data and modify data inside your database, the statistics and metadata of your tables can get outdated, and this is really bad, because you will not get an optimal execution plan for your queries, which of course can slow them down. So regularly make sure that all the statistics are updated in order to get an optimal execution plan. And what else can we do in this weekly job? We can rebuild and reorganize our indexes, to prevent data fragmentation in them. Fragmentation in your indexes is really bad, because there will be a lot of unused space and the order of your clustered index will no longer be correct. So make sure that at least weekly you are rebuilding and reorganizing all your indexes. So those are the best practices for improving the performance and optimizing your indexing.
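A sketch of those commands, assuming the customer_info table and an employee_id foreign key from above:

-- Index a frequently used foreign key
CREATE NONCLUSTERED INDEX idx_customer_info_employee_id
ON customer_info (employee_id);

-- Weekly maintenance: refresh statistics and defragment indexes
UPDATE STATISTICS customer_info;
ALTER INDEX ALL ON customer_info REORGANIZE;   -- light fragmentation
-- ALTER INDEX ALL ON customer_info REBUILD;   -- heavy fragmentation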
If you are struggling with very large tables in your projects, like fact tables, then use SQL partitioning to divide these tables into smaller pieces, which can improve the performance whether you are reading data from the table or writing data. And of course you can mix things: if you apply a columnstore index on such a partitioned table, you will get the best performance for large tables. All right friends, so that's all: those are the best practices, tips and tricks that I've collected over many years of working with SQL. My final thought on this is: always try to focus on writing clear queries. Make them easy to read and easy to understand, and optimize the performance only if it's needed. If you have a small database, don't worry a lot about performance, because the SQL optimizer is going to pick the best plan for you; focus on having simple queries. And if there is a performance problem, always test using the execution plan. It should be your judge: if you are applying an index or rewriting your queries, always compare before and after using the execution plan, and if you are gaining performance, then adopt the new query or the new index. All right my friends. So those are all the tips, tricks and best practices that I have for you to optimize the performance, and with that we have now covered everything in this chapter, performance optimization. Now in the next chapter I'm going to show you how I use AI to assist me while I'm working with SQL. So let's go. All right. So now I would like to share something important with you, especially as a future developer who will be working with AI. One of the best ways to truly build skill and grow as a developer is by working on complex tasks and issues on your own. When you are stuck on a complex task and you are pushing yourself to find a solution, writing the code yourself, that is where the magic happens and the real learning takes place. If you jump too quickly to the AI for a solution, you are skipping an essential step on the way to becoming an expert, and more important than that, you won't develop the skills to recognize when and where the AI is wrong. So my recommendation here is to have discipline: always try to solve the task on your own, and only turn to the AI once you have no more ideas on how to solve it. That's my opinion and my advice for you. So, quickly: what is ChatGPT? It is an AI program developed by OpenAI that is trained to understand questions and provide human-like answers. And what does GPT stand for? The G stands for generative: the model can generate new content, new text. The P stands for pre-trained: the model is already trained on a huge amount of data. And the T stands for transformer: it is a type of neural network architecture that processes the sentences in your prompts to understand the context behind them, fast and accurately. And on the other hand we have GitHub Copilot. It is developed by GitHub and it also uses the language models from OpenAI. So that means both ChatGPT and Copilot are using the same language models developed by OpenAI.
GitHub Copilot was trained on the tons of code that is available on GitHub. How does it work? As you are writing code in a code editor, for example Visual Studio Code, it provides real-time suggestions as you type. Now if we compare the two: ChatGPT is a standalone application, where you interact with it through a website or an app and start a conversation with the AI, whereas Copilot is directly integrated into your code editor, for example Visual Studio Code. This is a great advantage for Copilot, because everything is in one place: with Copilot you get a real-time assistant during your coding. The main purpose of ChatGPT is to have a conversation with the AI about any topic you like, not limited to software development, while Copilot focuses only on assisting software development: as you write your code, you get auto-completion or maybe a whole block of code as a suggestion. So those are the key differences between ChatGPT and Copilot. Now, if you are doing software development or working on data projects, and of course depending on your role in the project, there will be many different types of tasks and activities to be done: a lot of brainstorming about new ideas, coding solutions, debugging, generating documentation, discussing different types of architecture, doing root cause analyses. The spectrum of activities and tasks in each project is usually very wide, and of course we can use different AI tools to assist us with those tasks; there is not one AI tool that covers all of them. I tend to jump between Copilot and something like ChatGPT. Okay, so now I'm going to map those different tasks to either ChatGPT or Copilot. Let's focus first on ChatGPT. The first one is brainstorming and ideas. If we have a big task in our project, or let's say a big issue we want to find a solution for, I tend to use tools like ChatGPT to have a discussion about the topic, to explore and discuss multiple ideas and then start evaluating them. The next place I find myself using ChatGPT is project planning. It is also something high-level: you can discuss the design of your project with ChatGPT, as well as the milestones and the roadmap of the project. The next thing I use ChatGPT for is learning, knowledge and research. If you are working on big data projects, you will be overwhelmed by the number of cloud services and AI and analytics tools, and of course you can learn new things and gather information and knowledge using ChatGPT. Okay, moving on to the next task: generating documentation. Writing documentation is always a painful process that consumes a lot of time, and I tend to use tools like ChatGPT to generate it. But of course, I always review the documentation and keep it short. Okay, moving on to another topic where I use ChatGPT: discussing architecture. If you are starting a new project, there will be different types of architecture you could use to implement it.
And of course, you can discuss the different architecture options with ChatGPT, and if you give it the specifications of your project, you can discuss which architecture is suitable for it. Another thing I always find myself researching is best practices, tips and tricks. You can have a discussion with ChatGPT about recommendations: what are the best practices, what are the common pitfalls, in order to make sure your code and your solution are always up to date with the best practices. And one more thing: if there is a very complex task in the project, I tend to have a discussion with a tool like ChatGPT to break this complex task into small pieces and start finding a solution for each piece. Now, on the other hand, I use Copilot to solve a different type of task. This is where I get my hands dirty in the code. While I'm coding, I use Copilot all the time to assist me, because it provides inline suggestions directly, helping me code faster and reducing the human errors I might make. So while I'm writing code or debugging, I tend to use Copilot, and I don't find myself going to ChatGPT to ask about code or syntax; we can do that directly in Copilot. One task that is very common in any software development is refactoring: if you have code that is slow or badly designed and you want to refactor the whole thing, you can do it directly in your editor together with Copilot, in order to find optimizations. I also use Copilot to add inline comments; I don't go to ChatGPT to ask it to add comments to my code, you can do that directly in your editor using Copilot. And of course, even if everything is working perfectly, you have the best practices, good performance and comments, you still have to maintain a nice style and format in your code, and we can now do that directly using Copilot; we don't have to jump to ChatGPT to style and format the code. So as you can see, I currently use both of them for different types of tasks: if I have the feeling that I need to discuss something, I go to ChatGPT, but once the idea is clear and I know the solution, I start using Copilot to write the code, and with its help I can deliver clean, professional code. So this is how I currently use both ChatGPT and Copilot. Okay friends, so now I'm going to show you a quick guide to GitHub Copilot in Visual Studio Code. Once you create a profile and connect it to your Visual Studio Code, you will get a new icon for Copilot. If you go there, you can quickly see the status, and you can also disable Copilot there. If it looks like this, it means your Copilot is active. So now, once you have everything up and running, what you have to do is very simple: just start writing your code. Start typing any SELECT statement, and you can see that we get gray text. This gray text is called ghost text; it is an auto-completion from Copilot, and it says SELECT * FROM a table. As I hover over it, you can see that I can switch between different suggestions; here we have three suggestions: one, two, three, and I'm going to go with the third one. And as it says here, if you want to accept the suggestion, all you have to do is press Tab.
So let's do it: you are accepting the whole thing. But now, say you want to accept only part of the code. Let's write SELECT again, and this time we're going to be selective. To do that, hold Ctrl and press the right arrow, and with that we accept only part of the ghost text, not everything. But of course, if you want to accept the whole thing, just press Tab. Now there is another way to trigger the ghost text, and that is by writing a comment first. For example, we want to select the top three customers based on the score. Once you start writing the query, Copilot is going to write a query that is relevant to the comment. As you can see, we are getting TOP 3 FROM customers, because we want the top three customers, and here we have two suggestions: one with the ORDER BY and one without it. I will go with ORDER BY and hit Tab. And here is another suggestion, which is correct, to sort the data from the highest to the lowest. All right, moving on to the next one. As we learned, in SQL there can be multiple solutions and multiple query variants that solve the same task. Let's say we have this task: rank customers based on their total order sales. What you can do: as you start writing the query, we get the ghost text, but now we can hit Ctrl+Enter. On the right side you will get different suggestions, and here we have nine suggestions for how to solve this task in SQL. What you do is go through all those suggestions and pick one. For example, I can go with suggestion number three, click 'Accept suggestion', and you will get it in your code editor. So this is what we mean by Copilot auto-completion: integrating the AI directly as you are developing and writing code. Now, in Copilot, we can not only use the ghost text and the auto-completion; we can also interact with the AI using inline chat, which is something like ChatGPT. To trigger the chat, hit Ctrl+I, and you get a place to ask Copilot any question, for example: join the query with the sales orders table. Let's hit it. As you can see, we got a full query where the customers are joined with the orders, and the way the tables are joined is totally correct. That means Copilot already knows all the tables I have in the database, as well as the columns and how to join them. This is amazing. If you like it, you accept it of course, and this is way faster than ChatGPT, because with ChatGPT you have to introduce your database and your columns before even asking anything. This is exactly the power of Copilot. What else can we do? We can highlight part of our code and then start the inline chat again, and here we can say: replace this column with an aggregation of the sales. Let's hit OK. As you can see, it replaced it with an aggregate function. And one thing that is very important: the code is not changed yet. The change is highlighted and shown as a suggestion, and now you have to accept it or discard it. If you discard, nothing changes in your code; but once you accept, it replaces your original code. So if you do that, your code is now replaced with the AI suggestion. Okay.
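To make the comment-driven completion concrete, this is the kind of ghost text you might see for that prompt, assuming a customers table with a score column (the exact suggestion varies from session to session):

-- select the top 3 customers based on the score
SELECT TOP (3) *
FROM customers
ORDER BY score DESC;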
Another thing about Copilot: it tries to fix issues that you have in your code. For example, we have an error here. If you hover over it, you can see a menu from Copilot to view the error or fix it. Another way to do that: right-click on it and go to Copilot, and here you can see we can explain or fix. If you choose explain, you get another window with an explanation of the issue in your code, and once you understand it, you can ask Copilot to fix it. So let's go over here and choose fix, and with that Copilot fixed the issue. It was all about the order of the clauses in the statement: first you have to do the GROUP BY, then the ORDER BY. So it helps you find issues and fix them as well. And now, as you might have already noticed, as we write code and interact with Visual Studio Code, you will often get a sparkle, this little yellow sparkle on the left side. You will see this icon each time Copilot thinks it can help. If you click on it, you get a menu of the different things Copilot can do for you, like fixing, explaining, modifying, and so on. Well, my friends, that's it. This is Copilot, and it is very simple, but yet very powerful for developers. And of course, not only for SQL; for anything, like Python and so on. Everything is integrated in one place; I don't have to jump to ChatGPT and ask things, it is live and I can do it directly as I'm writing my code. So that's all for Copilot. All right friends, so now let's switch to ChatGPT. Let's start by understanding the structure and the basic components of ChatGPT prompts. The first component, and the most important one, is the task. You have to be very clear in defining what the AI should do; without a clear task, the AI will not understand what to do, so this is mandatory in each prompt. Then after that, you have to provide some context: you give some background information, for example you say 'I am a student' or 'I am a data engineer' and so on. As another component, we have to add specifications. In the task you state the main thing the AI should do, but with the specifications you go into detail: for example, which topics should be included or maybe excluded, or the word count. Here you are specifying a lot of wishes, small details and specifications, in order to get an answer that meets your expectations. Both the context and the specifications are important. And then after that we have some nice-to-have components, for example specifying a role. Here you give the AI a role: for example, you tell it to act as an expert, a teacher, an interviewer; you are setting the AI up to play a role. And the last component that you can add is the tone. Here you define the voice of the answer, just to make it more friendly, easy to read and engaging. So the role and the tone are nice to have, and if you use all those components, you will get better results from the AI. Let's take for example the following prompt: 'Explain SQL window functions.' This is very simple and very short, and here we have only one component, the task. You are not giving any context, whether it is for data analytics or for data engineering; you leave it up to the AI, and maybe the answer you get will not meet your expectations.
Now, if you want to shape the answer the way you want, you have to add more components. For example, in this prompt you say: 'You are a senior SQL expert.' Here we are defining the role for the AI: it should now act as an SQL expert. In the next section we add context to the prompt: we say, 'I am a data analyst working on SQL projects using SQL Server.' Now the answer from the AI is going to use SQL Server syntax and focus on analytics topics; that's why the context is very important. Then we specify the main task in the prompt: 'Explain the concept of SQL window functions and do the following.' And now we give finer details about what the AI should provide: explain each window function and show the syntax; describe why they are important and when to use them; and list the top three use cases. So you are specifying exactly what you expect from the AI. After that, as a nice-to-have, we specify the tone of the explanation: the tone should be conversational and direct, as if you are speaking to me one-to-one, so that it doesn't feel like you are reading a document; you are reading something engaging. I know this prompt is really big, but you will still get way better results than just saying 'explain the concept'. So those are the main components I usually use when I'm starting a conversation and a discussion with ChatGPT. Okay. Next, I'm going to show you the prompts I frequently use in my projects. First, a little bit of awareness about using ChatGPT in companies: if you are starting at a new company, make sure to ask about the rules for using ChatGPT, because some companies offer their own chatbots for security reasons. So always check the rules before jumping straight to ChatGPT. All right, let's start with the first prompt. We can use ChatGPT to solve an SQL task that we have in the project. Let's look at this prompt. It starts with the context: I'm saying that I have an SQL Server database with two tables. So I have to describe my database for ChatGPT: we have a table called orders with the following columns, and we have another table called customers, and here are its columns. With that, I gave ChatGPT context about the tables in my database, and I was also precise about the database: it is SQL Server. Now that we have the context, the next step is to tell the AI what to do. I'm telling it: do the following. Write a query to rank customers based on their sales. Then I detail what I expect in the output: the result should include customer ID, full name, country, total sales and so on. And I'm adding more tasks here: it's not enough to have a query, I would like to have comments as well. So I'm saying: include comments, but avoid commenting on obvious parts, because if you just say 'include comments', you will get a lot of unnecessary comments. Now, of course, in SQL there is not just one solution for a task; there are always different variants that achieve the same thing. So usually I would like to understand what my options are.
That's why I'm telling ChatGPT: write three different versions of the query to achieve this task. And then I would like to evaluate each of those versions, so I give the AI the task of evaluating them, focusing on two things: is it easy to read, and does it have good performance. Okay. So let's see what results ChatGPT gives us. We can see the first solution over here, where ChatGPT is using a CTE. In the CTE, the tables are first joined and then there is a GROUP BY to aggregate the sales, and in step two we have the RANK window function to rank the sales. So of course you can do it that way. Let's check version number two: here the AI used a subquery, and it is a nice solution as well, where ChatGPT first prepared the data, doing the aggregation before joining. And the last solution: here we have a single query using a window function, which, as you can see, is the smallest one; we have no CTE and no subqueries. It joins the tables and does the GROUP BY together with the window function. After that, we get an evaluation from the AI, where, as you can see, it focuses on two things: readability and performance. It says that with the CTE, the readability is really high compared to the subquery and to the last version with the GROUP BY and the window function. I totally agree with ChatGPT: the first version was the best one for readability. Now checking the performance: the first one is moderate, the second one, the subquery, is good, and the last one is the best for performance. But of course, always test with the execution plan. So as you can see, there is a trade-off between readability and performance: if the priority is readability, go with version one, but if the priority is performance, go with version three. We got three solutions for our one task, and you can now evaluate which one you want to use. This is really amazing, right?
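For a feel of it, that third, single-query variant might look roughly like this; I'm assuming orders(customer_id, sales) and customers(customer_id, first_name), so the column names are illustrative:

-- Single pass: aggregate per customer and rank in the same query
SELECT
    c.customer_id,
    c.first_name,
    SUM(o.sales) AS total_sales,
    RANK() OVER (ORDER BY SUM(o.sales) DESC) AS sales_rank
FROM customers AS c
JOIN orders AS o
    ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.first_name;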
All right, moving on to the next prompt that I frequently use: a prompt for readability. As you are creating an SQL query for a complex task, you might end up writing a lot of CTEs and subqueries; you might end up with a lot of joins, hundreds of lines, and lose the big picture. So what I always do: I give the query to ChatGPT and ask it to optimize it, to make it more readable, and to find any redundancy in my query in order to consolidate it. Let's check the prompt. It says: the following SQL Server query is long and hard to understand. Then we give the AI the tasks. The first task is to improve its readability; the next one is to detect any redundancy in the code, remove it and consolidate the query, to make it compact and small; and of course to include some comments, without commenting on the obvious parts. And, as always, if there are optimizations, there should be a learning process, so I ask the AI to explain each improvement, so that I understand the reasoning behind it and can avoid those mistakes the next time I write a query. And of course, you have to give your query to the AI. All right, so now let's check ChatGPT's answer to my prompt. As you can see, we had a really long query, and here we have the improved query from the result. We can see that we have only one CTE. Well, that is crazy: before, we had five or six CTEs, and ChatGPT managed to put everything into one CTE, then do all the aggregations and the window function, and then we have the final SELECT. This is a huge improvement over the previous query. Let's check the explanation: it says it consolidated the CTEs, combining all of them into one, and many other things; there were a lot of unnecessary joins and so on. And here is a small improvement, where it uses CONCAT instead of the plus operator, because CONCAT is standard across multiple databases. And at the end we have the final benefits: a shorter query, one CTE instead of five, and by combining the logic you reduce the number of table scans, which is correct. So as you can see, this is the magic of the AI: it found the issues in my code, improved the readability and removed all the redundancy and unnecessary joins from the SQL script. Okay, moving on to the next prompt. It is about optimizing the performance of my query. If you are working on big projects, where you have millions of rows in your tables, it can be an issue if you write queries that don't follow the best practices for performance. That's why I double-check with the AI whether my script follows them. As usual, in the prompt we give the context: the following SQL Server query is slow. Then we give the AI some tasks: propose optimizations to improve its performance, and then provide the improved SQL query. And I always want to understand why it's better to write it another way, so that next time I do better while writing the query: so, explain each improvement so I understand the reasoning behind it. And at the end, we give our query. Okay, so now let's run the prompt on the following query. In this query we have a lot of bad practices, for example: doing aggregations using a correlated subquery; using functions inside the WHERE clause, which is not good for indexing; using a lot of OR operators; and here we have another subquery. Let's see whether ChatGPT finds all those bad practices. Let's check the results. As you can see, we now have an optimized query. It is a little bit longer, but I think we have better practices here. There are a lot of changes; let's see what it did. First, it removed the LOWER in the WHERE clause: it says it's not good to use functions in the WHERE clause, so that the index can be used, and it replaced LOWER(order_status) with a plain comparison on the order status. Next, it is avoiding the correlated subquery: instead, it is using a LEFT JOIN, joining the table normally without any correlated queries. It is also avoiding the YEAR function in the WHERE clause, and instead it is using a date range with BETWEEN. And the next one: it is using EXISTS instead of IN, which is of course better for performance. So as you can see, you can use the AI to optimize the performance of your query and convert it into a script that follows the best practices. Of course, my recommendation: don't go blindly with all the changes ChatGPT suggests. Always take each recommendation one by one, test it, and evaluate it using your own knowledge.
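A condensed before/after of two of those fixes; the table and column names are assumptions, and note that the plain status comparison relies on a case-insensitive collation, which is the SQL Server default:

-- Before: functions wrapped around the filtered columns block index seeks
SELECT order_id
FROM orders
WHERE LOWER(order_status) = 'shipped'
  AND YEAR(order_date) = 2024;

-- After: bare columns with a simple value and a date range
SELECT order_id
FROM orders
WHERE order_status = 'Shipped'
  AND order_date >= '2024-01-01'
  AND order_date <  '2025-01-01';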
Okay, to the next one; it is an interesting one. We can use a prompt for the execution plan. Execution plans are usually advanced: you need a lot of know-how and experience to read and understand them, and if you have a big query, it's going to be a real nightmare to follow the flow and find where exactly the issue is. But now we are not alone; we have the AI as an assistant to help us understand this complex stuff. What we can do is take a screenshot of the execution plan and upload it to ChatGPT, saying: the image is the execution plan of an SQL Server query. Then we give the following tasks: describe the execution plan step by step; after that, identify the performance bottlenecks and where exactly the issue is, what makes my query slow (this is of course the hardest part of reading an execution plan); and once it has identified the performance issues, I ask it to suggest ways to improve the performance and optimize the execution plan. So: first understand the execution plan, then identify the issues, then how to optimize it. Okay. So after uploading the screenshot and asking the AI, we have the following results. We can see a detailed explanation of the execution plan, with a lot of details; I will not go through everything. It starts with the table scans, then the clustered index scan and the nested loops (we have several nested loops), then the aggregation and the final step. So now we have a nice explanation of what SQL is doing behind the scenes for my query, and you don't have to be an expert at reading execution plans; you can ask the AI about it. Now, what is very important is to understand where the bottlenecks are, what the problems are. Let's see what we have here. The first one: we have a table scan, which is really bad; it means the table orders_archive does not have any index. It says the table scan indicates a lack of a useful index on the table, which forces the engine to scan all the rows. And what is very important is the nested loops in the joins: this is really bad if you have big tables. Here it's saying it's fine for small data sets, but it's going to be really problematic with many rows. So as you can see, we are getting more insight into the issues in our execution plan. And the last step: the suggestions. The first and most obvious one is to add an index to the orders archive, a nonclustered index. Well, if there's no index at all, I would go first with a clustered index, not immediately with a nonclustered one. Then some other best practices, but I think this one is very relevant: to change the join type, using hints to get a merge join or a hash join. So now we understand how it works, where the issues are, and what the suggestions are to fix them.
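Those two fixes might look like this, with assumed table and column names (and remember to re-check the execution plan after each change):

-- Give the heap a clustered index first
CREATE CLUSTERED INDEX idx_orders_archive_order_id
ON orders_archive (order_id);

-- Then, only if nested loops still show up on big tables, test a join hint
SELECT o.order_id, c.first_name
FROM orders_archive AS o
JOIN customers AS c
    ON c.customer_id = o.customer_id
OPTION (HASH JOIN);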
All right, the next prompt is about debugging. As you are writing a complex SQL query, you might get an error from the database when you execute it, and sometimes it is challenging to find the root cause of the issue. So we have the following prompt. First the context: the following SQL Server query is causing this error; then we paste the error message we are getting; and then we ask the AI to do the following. First, explain the error message, because I would like a better understanding of the error. Then we ask the AI to find the root cause of the issue in my script, and after finding the problem, we ask the AI to suggest how to fix it. And of course, we have to include our SQL query in the prompt as well. All right. So now I have the following query, and if I execute it, I'm getting the following error. It says the column 'Sales.Orders.Sales' is invalid in the select list because it is not contained in an aggregate function or the GROUP BY clause, and so on. So I don't really understand what's going on; let's ask the AI about it. Let's check ChatGPT's answer. It says: when you are using GROUP BY, every column in the SELECT must appear in the GROUP BY as well, or be aggregated. And it says: in your query you are selecting a few columns; this one is valid, the other two are valid as well, but we have one inside the RANK function that is invalid. Okay. Now we can see more details about the root cause. It is saying that a window function like RANK, in a GROUP BY query, cannot reference a raw column that is neither grouped nor aggregated; so it clearly indicates that the sales column inside the RANK function is the issue. Let's see the fix: since the raw sales column is not available after the GROUP BY, you cannot use it in the window function; that's why the fix is to use SUM(sales), which we already have in the SELECT. And here we have a nice explanation of the fix as well. So you can see: we got an explanation of the error message, the root cause pointing to exactly where the issue is, a suggested fix, and an explanation of the fix, and these are exactly the steps you have to follow when debugging code. All right, moving on to the next prompt: we can use AI to explain the results I'm getting from SQL. Sometimes you might have an SQL query in the project and not understand why you are getting specific results. As usual, we start with the context: we tell the AI, I didn't understand the result of the following SQL Server query. Then we ask the AI to do the following: first, break down how SQL processes the query step by step, and I would like an explanation of each stage and how the result is formed. As you can see, here I don't need any optimizations, and I don't need a query in the output; I just need an explanation. Then at the end you paste your query. Okay. So now we have the following query: a recursive CTE that generates the numbers between 1 and 20. I can tell you, recursive CTEs are usually complicated to understand, so maybe we are having a hard time understanding the result of this query. After asking the AI about it, we first got an explanation of the query structure: it says you are using a CTE with a main query. Well, okay. But what is very interesting is to understand, step by step, how SQL executed this query. It says: step one, it executes the anchor query, and that's why we first get the 1. In the next step, the recursive query is executed for the first time: it adds one to the current value, so 1 + 1 gives us 2; then in iteration two we get 2 + 1 = 3, and it keeps repeating this process until we get all the results from 1 to 20. Then we also have an explanation of the termination of the recursive query: it says the filter is the way out of the loop, so once we reach 20 it stops. Then a little information about the main query, and with that you get a deep understanding of how the query works and why you are seeing those results.
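That generator, for reference, is only a few lines; a minimal version looks like this:

-- Recursive CTE: the anchor starts at 1, the recursive step adds 1,
-- and the WHERE filter terminates the recursion at 20
WITH numbers AS (
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1
    FROM numbers
    WHERE n < 20
)
SELECT n FROM numbers;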
This is a really amazing use case for ChatGPT. All right friends, so now we're going to talk about my favorite prompt: using the AI to style and format my code. Once you are done writing a complex query to solve a task, and everything is correct and optimized for performance as well, it's time to review your code in order to style and format your script. We have the following prompt. It says: the following SQL Server query is hard to understand. Then we ask the AI to do the following: restyle the code to make it easier to read; the next task is to align all the column aliases. Sometimes, if you use a tool to style and format your code, you will find that it introduces a lot of new lines, so I tell the AI: keep it compact, do not introduce unnecessary new lines. And the last task for the AI is to make sure it follows the best practices. And of course, what do we need at the end? Our query. Okay, so now we have the following query, and as you can see, it is a very annoying query that is really hard to read, because the formatting and the styling are really bad. I don't even want to talk about the alignment and so on. As you can see, we have lowercase here, and sometimes uppercase for the keywords. Of course, if you are developing and writing code and you deliver something like this, it is really not nice. So let's see how ChatGPT fixes it. Okay, after executing the prompt, as you can see, my query now looks way nicer. First of all, all the keywords are uppercase, and our CTEs are really nice to read; we have enough spacing, the alignment of everything looks really nice, the casing is very clear, and the main query over here is easy to read as well. So it did a wonderful job styling and formatting my code, and here you have an explanation of what changed: it says all the keywords are capitalized, the aliases and the columns are aligned, and so on. With that, we got a really nicely styled and formatted query that we can share with others. Okay, moving on to the next one: we can use AI to generate documentation and to add comments to my code. Creating documentation and adding comments to code is usually something very annoying for developers, and sadly I see a lot of developers who tend not to add any comments to their code. Of course, this is really bad, because you are not thinking about the other developers who will read your code. And since this process is annoying and takes time, we can use the help of AI to speed it up. Let's check the following prompt. It says: the following SQL Server query lacks comments and documentation. We say: first, insert a leading comment at the start of the query describing its overall purpose; this is what we usually do, we add a short description of the following code at the start. Then it should add comments only where clarification is necessary, and, very important, it should avoid obvious statements; just like with indexing, don't over-comment your code. And usually, if you are creating a query for data analytics, it's really good to explain the business rules and transformations that you apply inside your query, and maybe add another piece of documentation describing how the query works. So for now we are asking it to add comments and documentation, and of course you have to add your query. Okay.
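The kind of output this aims for looks roughly like the sketch below: a leading purpose comment plus sparing inline comments. The segmentation query itself and its thresholds are invented here for illustration:

/* Purpose: segment customers by their total sales and return each
   customer with their total and the assigned segment. */
WITH customer_sales AS (
    -- Total sales per customer
    SELECT customer_id, SUM(sales) AS total_sales
    FROM orders
    GROUP BY customer_id
)
SELECT
    customer_id,
    total_sales,
    -- Business rule: segment thresholds (assumed values)
    CASE
        WHEN total_sales > 100 THEN 'High Value'
        WHEN total_sales >= 50 THEN 'Medium Value'
        ELSE 'Low Value'
    END AS customer_segment
FROM customer_sales;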
So now I just used this prompt on one of my queries; let's check the results. The first comment is the most important one, because it gives the overall purpose of the whole query. Let's see what it says. It says: this query segments customers based on their total sales and provides a list of customers with their total sales and their assigned segments. So we have customer segmentation here: high value, medium value and low value. With this comment we have the overall purpose of the query, and then we have the inline comments, like here: it says the first CTE calculates the total sales for each customer, and for the second CTE we have a full description of how the segment is built, which comes of course from the business rule for the customer segments. It says high value is for total sales above, say, 100, and so on for the ranges in between. Well, this CASE WHEN is really easy, so you can actually read it from the CASE WHEN itself; but if you have complex queries, it's really nice to have the full text of the CASE WHEN spelled out. Then, for the main query, you can see the final output and the inline comments. So as you can see, those are really nice comments inside our code. Next, we have a document about the business rule, and I totally agree with the AI that the business rule here is the customer segmentation: we have a very nice, short piece of documentation about our business rules, and then another document about how the query works. Well, I think this is too much for a small query; we can ask ChatGPT to make the documentation shorter. So as you can see, we have full documentation about our query and our business rules, and really nice comments in our code. All right, now moving on to the next prompt. It is very important for improving the whole project, the whole database: we take our DDL scripts, give them to the AI, and ask it to optimize our database DDL. There are a lot of things you can optimize in a database, so let's check this prompt. It says: the following SQL Server DDL script has to be optimized, and we ask the AI for the following tasks. The first one is to check the naming: if you have a database with a lot of tables and columns, you should always work with a consistent naming convention, so this is just to make sure the naming you are using is correct. Then, what is very important in DDL is the data types: data types play a crucial role in optimizing your queries, so we tell the AI to check the data types and whether they are optimized. The next point is data integrity: if you are building a relational database, you will have a lot of primary keys and foreign keys, and you can tell the AI to check the integrity of all those keys. The next point is about indexes: here you tell the AI to check the overall indexing you are using in the DDL script, just to make sure you are not missing anything and also to check whether there are duplicates; it is a really great check. And the last check is the normalization of the tables: to check the data model and whether there are any suggestions about splitting and normalizing tables, or whether there is some weird redundancy.
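To give a feel for it, this is the sort of hardened DDL such a review tends to come back with; the table, the constraints and the values here are assumptions for illustration:

CREATE TABLE customers (
    customer_id INT IDENTITY(1,1) PRIMARY KEY,         -- auto-incrementing key
    first_name  VARCHAR(50) NOT NULL,
    gender      CHAR(1) CHECK (gender IN ('M', 'F')),  -- restrict valid values
    score       INT CHECK (score >= 0),                -- no negative scores
    birthday    DATE CHECK (birthday <= GETDATE())     -- nothing in the future
);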
So now we have the DDL of the customers, employees, orders and so on, and after running the prompt we get the following result: the same DDL, but optimized, with the AI adding a comment about each change. Here it added auto-increment for the primary key; here, for example, a CHECK constraint so that a score cannot be negative; and for the employees, another CHECK to make sure the birthdate is not in the future. All those constraints make sure the quality of the table is good. For the gender column it restricts the valid values that can be used inside this column, and there is a lot of other stuff. At the end we get the key changes. About the naming, it says we have to stick to one naming convention: it understood that we are using Pascal case, and it flagged two columns, for example this Product column, which should be called ProductName. For the data types, I don't want to go into all the details, but it says, for example: don't use INT for the price and sales, use DECIMAL. For the integrity it says: go and add foreign keys. I think the orders table had no foreign keys in the DDL at all, so ChatGPT went and added all of them, which was good. And about the indexing, it says: since we have primary keys, we automatically get clustered indexes, and the foreign keys should get an index as well to speed up the queries, and so on. So as you can see, there are a lot of optimizations that can be done in a DDL. If you are working on a project and you have a DDL, go ask the AI what could be optimized; I am sure you will find something, and this is very critical, because having a solid, optimized DDL of course improves the speed of your queries. All right, so now we come to a very useful use case of AI for your SQL projects: using AI to generate test data sets. It is always really nice to have a small data set to test the logic of your query. Sometimes you are building logic for data that does not exist yet in your database, and if you cannot test the scenario you are developing, that can be really bad. Generating a data set for your code is always a painful process, but now it is easier, because we have the help of AI. So let's check the following prompt. It says: I need data sets for testing the following SQL Server DDL. Next we specify different tasks for the AI. The first one: we have to define the shape of the data set. How do you want the output, as INSERT statements, as an Excel file, and so on? The next specification: I always want a data set that is realistic and relevant, not just dummy placeholder data. Again, these are just configurations for the data set. The next configuration is that I would like a small data set; of course, you can specify the exact size, you could say you want 100,000 rows or millions of rows, so you can define the size you want. For me, a small data set is enough. And now, very important: if you have multiple tables in your DDL and those tables have primary keys and foreign keys, the data set has to be consistent, so the AI should generate keys that are joinable.
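As a sketch of what joinable test data can look like, assuming hypothetical Employees and Orders tables where Orders.SalesPersonID references Employees.EmployeeID:

    -- Parent rows first, so the keys exist
    INSERT INTO Employees (EmployeeID, FirstName, LastName) VALUES
        (1, 'Anna',  'Schmidt'),
        (2, 'David', 'Miller');

    -- Child rows reuse only EmployeeIDs that exist above,
    -- so joining the two tables returns clean results
    INSERT INTO Orders (OrderID, SalesPersonID, Sales) VALUES
        (101, 1, 250),
        (102, 2, 480);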
So if you go and join this data together, you will not get weird results. And of course, you can keep adding specifications, for example whether you want NULLs inside your data set or not; here, for example, I am saying: do not introduce any NULL values. At the end you have to give the DDL to the AI. It could be one table or the whole database, so you could generate a data set for one table or for hundreds of tables. Okay, so now I am asking ChatGPT to create test data sets for two tables, the employees and the orders. Let's check the results. We can see very small, nice INSERT statements for the table employees: five employees with different information. And for the table orders we have a lot of columns, and as you can see, four orders. What is very important is that the salesperson ID comes from the employees table: we have the values 2 and 1, which already exist in employees, and the rest of the information is fake addresses and such. So with that we have a very nice test data set to insert into our database and test our queries. Of course, we could ask to extend it, maybe to 20 orders instead of only four, so we can change the size, and we also get some notes about the data itself. It is really amazing: we are now generating test data straight from our DDLs. All right. Now we have the following query, written for SQL Server, and let's say you are migrating from SQL Server to MySQL. Let's ask ChatGPT to convert my code to MySQL. After running it, as you can see, we have the same query, but in MySQL: instead of ISNULL we are using COALESCE, the string concatenation uses CONCAT instead of the plus operator, instead of GETDATE() MySQL uses the NOW() function, and finally, instead of TOP 10 in SQL Server, MySQL uses LIMIT 10. We also get a really nice explanation of the transition; I'll put a small sketch of this conversion right after this section. If you are working in a company, it might happen that there is a decision to migrate from one database to another, and then your project gets the big task of migrating the data, the DDLs, the queries, everything. I really recommend using ChatGPT to help with the migration; otherwise this big task might take a really long time. So as you can see, this is really amazing for improving the speed of your projects. Okay. In the next section I am going to show you the prompts you can use as a student, or whenever you are learning a new programming language. The first thing you can do with ChatGPT is ask it to generate an SQL course: it can guide you step by step on your journey of learning any programming language, completely one-to-one with the AI. When creating a course, it is very important to give enough context. In this example it is very short; I am saying: create an SQL course with a detailed road map and agenda. But of course you can give more specifications: you can describe your current knowledge, or specify which database type you would like to work with, MySQL or SQL Server. The more context and details you give the AI, the better results you are going to get. And then you go and configure your course.
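Here is that SQL Server to MySQL conversion in miniature, a sketch on a hypothetical customers table rather than the exact query from the video:

    -- SQL Server version
    SELECT TOP 10
        ISNULL(first_name, 'N/A') + ' ' + last_name AS full_name,
        GETDATE() AS report_date
    FROM customers;

    -- Equivalent MySQL version
    SELECT
        CONCAT(COALESCE(first_name, 'N/A'), ' ', last_name) AS full_name,
        NOW() AS report_date
    FROM customers
    LIMIT 10;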
Back to configuring the course: you can say, for example, start with SQL fundamentals and advance to complex topics, and you can also say make it beginner friendly, which is important if this is the first time you are learning the topic. Next we shape the focus of the course; I am saying here: include topics that are relevant for data analytics, because SQL is widely used in different fields, data engineering, data analytics, and it is really important for a course to focus on use cases. So we say: focus on real-world data analytics use cases and scenarios, and of course you can add more details about your course. Okay, I just asked ChatGPT to create this course, so let's look at the road map and the structure. Phase one covers the SQL fundamentals: it starts with the basic SELECT, WHERE and so on, then ORDER BY, GROUP BY, and INSERT, UPDATE, DELETE, the basic stuff. Phase two in the road map is intermediate SQL: inner joins, a few functions for text and dates, CASE statements and views. Phase three is advanced SQL for analytics: the window functions, CTEs, and data cleaning using the NULL functions and a few transformations. And phase four of the road map is about real-world use cases, with multiple projects. So as you can see, this is a really solid road map for learning SQL. As a next step, you can start deep diving into each of those chapters: you can ask ChatGPT to start with phase one, week one, and give more details. All right, the next one: once you have the agenda and the road map for learning SQL, you can focus on a specific chapter, a specific SQL concept. In this prompt the context comes first: I want a detailed explanation of SQL window functions. After that we specify for the AI the exact structure of the explanation. First it should explain what window functions are, and maybe also give an analogy to understand exactly what a window function is; after that it should explain why we need them and when to use them. Once you understand the basics, you can start learning the syntax of the window functions, and it should provide a few simple examples as well; at the end the AI should show you the best or most frequent use cases of SQL window functions. This is the pattern I like for learning something new. All right, so now let's see how the AI explains the SQL window functions. It starts with the big title "Understanding SQL Window Functions": we get a quick definition, and then an analogy about a teacher grading students, which is nice because it matches the RANK function. Then we understand why we need window functions, and I totally agree: to have row-level details together with aggregations, so you can aggregate while maintaining the row-level details, and to do complex calculations, because you cannot do everything with a GROUP BY; there are functions that only work with windows. Then there is some explanation of when to use them, and then we see the syntax of a window function.
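Roughly, that syntax looks like this; a minimal sketch on a hypothetical sales table:

    SELECT
        employee,
        region,
        amount,
        SUM(amount) OVER (PARTITION BY region ORDER BY sales_date) AS running_total
    FROM sales;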
So it breaks down into a function, a PARTITION BY, an ORDER BY and the OVER clause, with a few explanations of each part. Then we get a few simple examples with queries, explaining the different functions, though not all of them; of course, you can ask ChatGPT to extend the examples to all the functions. And then we see the top three use cases of window functions: ranking the data, building running totals, and the moving average. At the end we get a summary. So we have a wonderful explanation of the concept of SQL window functions. Okay, moving on to the next one, and this is one I use very frequently in my projects. In programming there are always different concepts that are very close to each other, and sometimes it is confusing and not really clear what the big differences between them are. So here I have a prompt for you for comparing different SQL concepts. The prompt says: I want to understand the differences between SQL window functions and GROUP BY. Both of them are usually used to aggregate data in SQL, and I would like to understand better what the differences between them are. We define the following tasks for the AI: explain the key differences between the two concepts; then, because it is really important to understand when to use what, describe when to use each concept, with examples; it is also really nice to understand the advantages and disadvantages of each concept; and at the end we would like a quick side-by-side summary of the differences in one table. Okay, so now let's see how ChatGPT compares those two concepts. First we get a really nice table showing the differences. For example, on the output granularity it says the window function provides calculations at row-level detail, while GROUP BY provides aggregated results at group-level detail. On the functions, window functions allow ranking, running totals and moving averages, while GROUP BY allows only the basic aggregations like SUM, AVG and COUNT. So this is a really nice overview of the differences. Then we have when to use which concept: it says the window function is for when you want row-level details together with the aggregations, and here you have a nice example; GROUP BY you can use, for example, when summarizing data into categories, like grouping the data by region. After that we get pros and cons for each concept: the advantage of the window function is that we keep all the rows, while GROUP BY is easier to understand and use; the disadvantage of the window function is that it is more complex, while the disadvantage of GROUP BY is that it removes the row-level details. And at the end we have the side-by-side comparison of those two concepts. So as you can see, we have a really nice, fully detailed comparison of those two SQL concepts.
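The contrast is easiest to see side by side; a small sketch on the same hypothetical sales table:

    -- GROUP BY: collapses the rows, one result row per region
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region;

    -- Window function: keeps every row and puts the aggregate next to it
    SELECT employee, region, amount,
           SUM(amount) OVER (PARTITION BY region) AS region_total
    FROM sales;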
Practicing SQL with the AI. Well, it is really not enough to just read about something, or to follow and watch a course; you always have to practice. And of course, it is really hard to find material for practicing a new programming language, so we can do it like this. We give a role, act as an SQL trainer, and then the context: help me practice SQL window functions. Then we configure this practice with the following points. We tell it to make the practice interactive: the AI provides a task and you give a solution. It is also important that it provides you a simple data set, and of course you can specify which kind of data set you want, industrial, healthcare, anything you like. Then we tell the AI to give SQL tasks that gradually increase in difficulty, so we start with the basics and work up to advanced tasks. You can also tell the AI to act as an SQL server and show the results of your query, so you get not only the correct solution and feedback but also the output of the query you submitted. And finally, the AI should review your queries, provide feedback and suggest improvements. Okay, so now let's start practicing. I gave the prompt to ChatGPT, and now we have a simple data set: sales ID, employee, region, sales date and amount. Then we get the first task: write a query to rank employees by their total sales. We get an example output, and then it says: your turn. ChatGPT is waiting for your answer. Okay, I just prepared a query; let's see what happens once I post it. Oh no, I got some errors in the query! It says: error in the aggregation, you should use amount instead of sales; and: unnecessary PARTITION BY in the RANK, and so on. So let's check the correct query: we have the GROUP BY, and then the window function without a PARTITION BY. That was my mistake. It shows the result of this query, and I get really nice feedback on the first task. Then it asks me about the next task, and I say yes. Task number two is about the running total: we get a task and the data, and now we have to write a query to solve it. So my friends, it is nice, right? Interactive, and not only for SQL; you can go and practice any programming language this way. Now moving on to the last prompt: you can use AI to prepare yourself for an SQL interview. Let's say you are invited to an interview and you would like to prepare for it; you can do a quick preparation together with the AI. You can say the following: act as an interviewer and prepare me for an SQL interview. Then you configure the interview: ask common SQL interview questions and make it interactive, so it provides a question and then waits for you to answer; gradually progress to advanced topics, so from basics to advanced; and, very important, it should evaluate your answers and give you feedback. It is a really great way to prepare for interviews, and I really recommend it; and you can prepare yourself not only for an SQL interview but also for an SQL exam. Okay, so now let's prepare for an SQL interview. Here we have the first question; ChatGPT asks: what is the difference between WHERE and HAVING? It is waiting for an answer, so we can say: WHERE filters data before aggregation, and HAVING filters data after aggregation. Let's check the evaluation: it gives me an example of a very solid answer, but in general I answered correctly. The feedback says the interviewer might want more details, not just one sentence about the differences; so it is encouraging me to speak more and give a more detailed answer, but the answer is still correct. So now let's go to the next question.
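For completeness, here is that difference in code, again as a sketch on the hypothetical sales table:

    -- WHERE filters individual rows BEFORE the aggregation
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    WHERE amount > 0
    GROUP BY region;

    -- HAVING filters whole groups AFTER the aggregation
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
    HAVING SUM(amount) > 1000;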
The next question: can you explain the difference between an INNER JOIN and a LEFT JOIN? I hope you know the answer, but as you can see it is very interactive and nice, and I think these questions are really relevant; if I were interviewing someone, I would ask exactly these questions, the difference between WHERE and HAVING and the differences between the join types. So this is amazing, right? If you have an interview coming up, I really recommend you go and prepare yourself with ChatGPT before walking in. All right. So with that you have learned how I use AI to assist me while coding in SQL. And now, my friends, we come to the most important chapter of the whole course. You have learned a lot about SQL: a lot of advanced techniques, a lot of functions, how to transform data, how to aggregate data. But now what you have to do is take everything and apply it in SQL projects. And these are not just easy toy projects; I have prepared projects for you that are very similar to the real projects I do in the industry. So you will learn not only how to do a project in SQL, but also what the main steps are and how we implement projects in the real world. And here I have three projects for you: data warehousing, data exploration, and advanced data analytics. We are going to start with the first one, the data warehousing project. This one is going to be amazing, so let's deep dive into it. All right, my friends. If you want to do data analytics projects using SQL, there are three different types. The first type of project is data warehousing: it is all about how to organize, structure and prepare your data for data analysis, and it is the foundation of any data analytics project. In the next step you can do exploratory data analysis, EDA, where all you have to do is understand and uncover insights about our data sets; in this kind of project you learn how to ask the right questions and how to find the answers using just basic SQL skills. Then we move on to the last stage, where you can do advanced analytics projects, using advanced SQL techniques to answer business questions: finding trends over time, comparing performance, segmenting your data into different sections, and generating reports for your stakeholders. So here you will be solving real business questions using advanced SQL techniques. Now we are going to start with the first type of project, SQL data warehousing, where you will gain the following skills: first, how to do ETL and ELT processing with SQL in order to prepare the data; then how to build a data architecture; how to do data integration, where we merge multiple sources together; and how to do the data load and data modeling. So if I got you interested, grab your coffee and let's jump into the project. All right, my friends. Before we deep dive into the tools and the cool stuff, we first need a good understanding of what a data warehouse actually is, and why companies build such a data management system. So now the question is: what is a data warehouse? I will just use the definition of the father of the data warehouse, Bill Inmon: a data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data designed to support the management's decision-making process. Okay, I know that might be confusing.
Subject-oriented means that the warehouse always focuses on a business area, like sales, customers or finance. Integrated, because it integrates multiple source systems; usually you build a warehouse not just for one source but for many. Time-variant means you can keep historical data inside the data warehouse. And non-volatile means that once data enters the data warehouse, it is not deleted or modified. So this is how Bill Inmon defined the data warehouse. Okay, now I am going to show you a scenario where a company doesn't have real data management. Let's say you have one system, and one data analyst has to go to this system, start collecting and extracting the data, and then spend days, sometimes weeks, transforming the raw data into something meaningful. Once he has the reports, he shares them, and this data analyst is sharing the report as an Excel file. Then you have another source of data, and another data analyst doing maybe the same steps: collecting the data, spending a lot of time transforming it, and at the end sharing a report, this time as a PowerPoint. And a third system, same story, but this time the report is shared using maybe Power BI. Now, if the company works like this, there are a lot of issues. First, this process takes way too long; I have seen scenarios where it takes weeks, even months, for the employees to manually generate those reports. And of course, what happens for the users? They are consuming multiple reports with multiple states of the data: one report is 40 days old, another 10 days, a third maybe 5 days. It is going to be really hard to make real decisions based on this structure. A manual process is always slow and stressful, and the more employees you involve in the process, the more you open the door to human error, and errors in reports of course lead to bad decisions. Another issue, of course, is handling big data: if one of your sources generates massive amounts of data, the data analyst will struggle to collect it, and in some scenarios it is simply no longer possible to get the data; the whole process breaks, and you cannot generate fresh data for specific reports anymore. And one last, very big issue: if one of your stakeholders asks for an integrated report from multiple sources, well, good luck with that, because merging all that data manually is very chaotic, time-consuming and full of risk. So this is the picture of a company working without proper data management, without a data lake, data warehouse or data lakehouse. In order to make real, good decisions, you need data management. So now let's talk about the scenario with a data warehouse. The first thing that happens is that your data team no longer collects the data manually; you have a very important component called ETL. ETL stands for extract, transform and load. It is the process of extracting the data from the sources, applying multiple transformations to those sources, and at the end loading the data into the data warehouse, which becomes the single point of truth for analysis and reporting. So now all your reports are going to consume this single point of truth.
So with that, you create your multiple reports, and you can also create integrated reports from multiple sources, not just from a single one. Looking at the right side of the sketch, it already looks organized, right? And the whole process is completely automated: there are no more manual steps, which of course reduces human error, and it is pretty fast. Usually you can move the data from the sources all the way to the reports in a matter of hours, sometimes minutes, so there is no need to wait weeks or months to refresh anything. And of course the big advantage is that the data warehouse itself is completely integrated: it brings all those sources together in one place, which makes reporting much easier. And not only integrated; you can also build history in the data warehouse, so we now have the possibility to access historical data. What is also amazing is that all those reports share the same data status, maybe one day old or so. And if you have a modern data warehouse on a cloud platform, you can really easily handle any big data source, so there is no need to panic if one of your sources delivers massive amounts of data. Of course, in order to build the data warehouse you need different types of developers. Usually the one who builds the ETL component and the data warehouse is the data engineer; they are the ones accessing the sources, scripting the ETLs and building the database for the data warehouse. The other part is the responsibility of the data analysts: they consume the data warehouse, build different data models and reports, and share them with the stakeholders. They are usually the ones talking to the stakeholders, understanding the requirements and building the reports on top of the data warehouse. So if you look at those two scenarios, this is exactly why we need data management: your data team is not wasting time fighting with the data; they are organized and focused around a data warehouse, and you are delivering professional, fresh reports that your company can count on to make good, fast decisions. Think of a data warehouse as a busy restaurant. Every day different suppliers bring in fresh ingredients: vegetables, spices, meat, you name it. The kitchen doesn't just use everything immediately and throw it all in one pot, right? They clean it, chop it, organize everything and store each ingredient in the right place, fridge or freezer. This is the preparation phase. And when an order comes in, they quickly grab the prepared ingredients, create a perfect dish and serve it to the customers of the restaurant. This is exactly like the data warehouse process: it is like the kitchen, where the raw ingredients, your data, are cleaned, sorted and stored, and when you need a report or an analysis, it is ready to be served exactly the way you need it. Okay, so now we are going to zoom in and focus on the ETL component. If you are building such a project, you are going to spend almost 90% of the time just building this component, so it is the core element of the data warehouse, and I want you to have a clear understanding of what an ETL really is. Our data lives in a source system, and what we want to do is get our data from the source and move it to the target.
Source and target could be, for example, database tables. The first step is to specify which data we have to load from the source. Of course we could say we want to load everything, but let's say we are doing incremental loads: we specify a subset of the data from the source in order to prepare it and load it later into the target. This step of the ETL process is called the extract. We are just identifying the data that we need and pulling it out, without changing anything; it stays one-to-one like the source system. So the extract has only one task: identify the data we have to pull out of the source, and do not change anything. We will not manipulate the data at all; it stays as it is. This is the first step of the ETL process, the extract. Now, moving on to step number two: we take this extracted data and apply manipulations, transformations; we change the shape of the data. This is the heavy-lifting part: we can do a lot of things here, like data cleansing, data integration, a lot of formatting and data normalization. So this is the second step of the ETL process, the transformation: we take the original data and reshape it, transform it into exactly the format and shape we need for analysis and reporting. Finally, we get to the last step of the ETL process, the load. In this step it is very simple: we take the prepared data from the transformation step and insert it into the target, its final destination, for example the data warehouse. So that's ETL in a nutshell: first extract the raw data, then transform it into something meaningful, and finally load it into a target where it is going to make a difference. Now, in real projects we don't have only a source and a target. Our data architecture will have multiple layers, depending on your design, whether you are building a data warehouse, a data lake or a lakehouse, and there are different ways to load the data between all those layers; between any two layers you might use only parts of the ETL process. For example, when loading from the source into layer one, you might only extract the data and load it directly into layer one, without any transformations, because you want to see the data in the first layer exactly as it is. Between layer one and layer two you might use the full ETL: extract from layer one, transform, and load into layer two. Between layer two and layer three you might do only transform and load: you don't have to deal with extracting the data, because it lives in the same technology, and you take all of layer two, transform it and load it into layer three. And between layer three and layer four you might use only the load and then the transform: something like duplicating or replicating the data first and doing the transformation afterwards, so you load into the new layer and then transform.
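Before we move on, here is the whole extract-transform-load idea in miniature; the schema and table names (source, staging, dwh) are hypothetical, just to show the three steps in SQL terms:

    -- Extract: identify and pull the needed subset, one to one, no changes
    INSERT INTO staging.orders
    SELECT *
    FROM source.orders
    WHERE order_date >= '2024-01-01';   -- e.g. an incremental extract

    -- Transform: reshape and clean the extracted data
    UPDATE staging.orders
    SET country = UPPER(TRIM(country));

    -- Load: move the prepared data into its final destination
    TRUNCATE TABLE dwh.orders;
    INSERT INTO dwh.orders
    SELECT * FROM staging.orders;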
Of course, that layer example is not a real scenario; I am just showing you that in order to move from a source to a target, you don't always have to use the complete ETL. Depending on the design of your data architecture, you might use only a few components of it. So this is how ETL looks in real projects. Okay, now I would like to give you an overview of the different techniques and methods in ETL. There is a wide range of possibilities, and you have to decide which ones to apply in your project. Let's start with the extraction. The first thing I want to show you is that we have different extraction methods: either you go to the source system and pull the data, or the source system pushes the data to the data warehouse. Those are the two main methods of extracting data. Then, in the extraction, we have two types. We have a full extraction, where every day we take all the records from the tables and load everything into the data warehouse, or we do something smarter, an incremental extraction, where every day we identify only the new and changed data, so we don't load the whole thing, only the new data. And in data extraction we have different techniques. The first one is manual, where someone accesses a source system and extracts the data by hand; or we connect to a database and use a query to extract the data; or we have a file that we parse into the data warehouse; or we connect to an API and make API calls to extract the data; or, if the data is available as a stream, for example in Kafka, we can do event-based streaming to extract it. Another way is change data capture, CDC, which is something very similar to streaming; and another is web scraping, where code runs and extracts all the information from the web. Those are the different techniques and types we have in extraction. Now, if we are talking about the transformation, there is a wide range of transformations we can apply to our data. For example, data enrichment, where we add value to our data sets; data integration, where we have multiple sources and bring everything into one data model; or deriving new columns based on already existing ones. Another type of transformation is data normalization: the source has values that are like codes, and you map them to friendlier values that are easier for the analysts to understand and use. Then there are the business rules and logic: depending on the business, you can define different criteria in order to build new columns. Data aggregation also belongs to the transformations: here we aggregate the data to a different granularity. And then there is a type of transformation called data cleansing, and there are many ways to clean our data: removing duplicates, data filtering, handling missing data, handling invalid values, removing unwanted spaces, casting data types, detecting outliers and many more. Data cleansing is a very important transformation in a data warehouse. So as you can see, there are many different types of transformations we can do.
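A few of those transformation types written out in SQL, as a sketch with hypothetical columns:

    SELECT
        first_name,
        last_name,
        first_name + ' ' + last_name AS full_name,  -- deriving a new column
        CASE marital_status                         -- data normalization: codes to friendly values
            WHEN 'M' THEN 'Married'
            WHEN 'S' THEN 'Single'
            ELSE 'Unknown'                          -- handling missing or invalid values
        END AS marital_status_clean,
        TRIM(country) AS country_clean              -- removing unwanted spaces
    FROM customers;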
Now, moving on to the load: what do we have here? We have different processing types: either batch processing or stream processing. Batch processing means we load the data warehouse in one big batch of data: a one-time job that runs and refreshes the content of the data warehouse and the reports, so we schedule the data warehouse to load once or twice a day. The other type is stream processing: if there is a change in the source system, we process that change as soon as possible, pushing it through all the layers of the data warehouse the moment something changes in the source. So we are streaming the data in order to have a real-time data warehouse, which is a very challenging thing to do in data warehousing. And if we are talking about the load itself, we have two methods: either a full load or an incremental load, the same idea as in extraction, right? For the full load in databases there are different ways to do it. For example, truncate and insert: we make the table completely empty and then insert everything from scratch. Another one is update plus insert, which we call upsert: we update the existing records and then insert the new ones. And another way is drop, create and insert: we drop the whole table, create it from scratch and then insert the data; it is very similar to truncate, but here we also remove the table itself. Those are the different methods of full loads. For the incremental load we can also use the upsert, so update and insert statements against our tables; or, if the source is something like a log, we can do insert only, always appending the data to the table without having to update anything. Another way to do an incremental load is the merge, which is very similar to the upsert but also includes a delete: update, insert, delete. So those are the different methods of loading data into your tables. And one more thing in data warehousing: we have something called slowly changing dimensions, SCD. This is all about the historization of your tables, and there are many different ways to handle it. The first type is SCD 0: there is no historization and nothing should change at all, so you are not going to update anything. The second one, which is more famous, is SCD 1: you do an overwrite, updating the records with the new information from the source system and overwriting the old values; something like the upsert, update and insert, but of course you lose the history. Then we have SCD 2, where you want to add historization to your table: for each change coming from the source system, we insert a new record; we do not overwrite or delete the old data, we just mark it as inactive, and the new record becomes the active one. So there are different methods of historization as well while you are loading the data into the data warehouse.
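Two of those load methods as quick sketches, using a hypothetical dwh.customers table in SQL Server syntax:

    -- Full load: truncate and insert
    TRUNCATE TABLE dwh.customers;
    INSERT INTO dwh.customers
    SELECT * FROM staging.customers;

    -- Incremental load as an upsert (SCD 1 style overwrite) with MERGE
    MERGE dwh.customers AS tgt
    USING staging.customers AS src
        ON tgt.customer_id = src.customer_id
    WHEN MATCHED THEN
        UPDATE SET tgt.customer_name = src.customer_name  -- overwrite: history is lost
    WHEN NOT MATCHED THEN
        INSERT (customer_id, customer_name)
        VALUES (src.customer_id, src.customer_name);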
All right, those are the different types and techniques you might encounter in data management projects. Now let me show you quickly which of them we will be using in our project. For the extraction, we will do a pull extraction, and regarding full versus incremental, it will be a full extraction; the technique will be parsing files into the data warehouse. Now, about the data transformations: this one we will cover completely; all the transformation types I just showed you will be part of the project, because I believe you will face them in every data project. Looking at the load, our project will use batch processing, and for the load method we will do a full load, since we have a full extraction, using truncate and insert. And about the historization, we will do SCD 1, which means we will be overwriting the content of the data warehouse. So those are the different techniques and types we will use in our ETL process for this project. All right. With that we have a clear understanding of what a data warehouse is, and we are done with the theory part. The next step is to start with the project, and the first thing we have to do is prepare our environment to develop it. So let's start. We go to the link in the description, and from there to the downloads, where you can find all the materials for all courses and projects; the one we need now is the SQL data warehouse project. There we have a bunch of links for the project, but the most important one, to get all the data and files, is "download all project files". Let's go and do that. After downloading you get a zip file with a lot of stuff inside, so go and extract it. Inside you will find the repository structure from Git, and the most important part is the data sets: you have two sources, the CRM and the ERP, and each of them contains three CSV files. Those are the data sets for the project. Don't worry about the other stuff; we will explain it during the project. So get the data and put it somewhere on your PC where you won't lose it. Okay, what else do we have? We have a link to the Git repository, the one I created for this project; you can go and access it, but don't worry, we will explain the whole structure during the project, and you will create your own repository. We also have the link to Notion, where we do the project management; there you will find the main steps and phases of the SQL project and all the tasks we will work through together. And then we have links to the project tools. If you don't have it already, download SQL Server Express; it is the server that will run locally on your PC, where your database will live. You also need SQL Server Management Studio; it is just a client for interacting with the database, and that is where we will run all our queries. Then there is a link to GitHub, and a link to draw.io; if you don't have it already, go and download it. It is a free and amazing tool for drawing diagrams, and throughout the project we will use it to draw data models, the data architecture and the data lineage. So go and download it.
And the last thing, which is nice to have: there is a link to Notion, where you can create a free account if you want to build the project plan and follow along by creating the project steps and tasks. Okay, those are all the links for the project, so go and download everything, create the accounts, and once you are ready, we continue with the project. All right. I hope you have downloaded all the tools and created the accounts. Now it is time for a very important step that almost everyone skips while doing projects: creating the project plan, and for that we will use the tool Notion. Notion is of course a free tool, and it helps you organize your ideas, plans and resources all in one place. I use it very intensively for my private projects, for example for creating this course, and I can tell you: creating a project plan is the key to success. Building a data warehouse is usually very complex, and according to Gartner reports, over 50% of data warehouse projects fail. In my opinion, for any complex project the key to success is a clear project plan. At this phase of the project we are going to create a rough plan, because at the moment we don't yet have a clear understanding of the data architecture. So let's go. Okay, let's create a new page and call it "data warehouse project". The first thing is to create the main phases and stages of the project, and for that we need a table. Hit slash, type "database inline", and let's call it something like "data warehouse epics"; we are going to hide the title because I don't like it, and on the table we can rename the view to something like "project epics". Now we list all the big tasks of the project. An epic is usually a large task that needs a lot of effort to complete; you can call them epics, stages, phases of the project, whatever you want. So let's list our project steps: first the requirements analysis, then designing the data architecture, and another one, the project initialization. Those are the first three big tasks of the project. Next we need another table for the small chunks of work, the subtasks, and we do the same thing: hit slash, search for the inline database, call it "data warehouse tasks", hide the title, and rename the view to "project tasks". Now go to the plus icon, search for "relation", the one with the arrow, and search for the name of the first table, "data warehouse epics". Click it, choose a two-way relation, and add the relation. With that we get a field in the new table called "data warehouse epics", which comes from the first table, and the first table gets a "data warehouse tasks" field that comes from the table below; as you can see, they are now linked together. Now I'll drag this field to the left side, and then we select one of those epics, for example "design the data architecture".
And now we break down this epic into multiple tasks, for example "choose data management approach". For the next task we select the same epic again: maybe "brainstorm and design the layers". Then let's go to another epic, for example the project initialization, and add something like "create Git repo and prepare the structure", and another one in the same epic: "create the database and the schemas". So as you can see, I am just defining the subtasks of those epics. Next we add a checkbox to track whether a task is done: go to the plus, search for "checkbox", and make the column really small. Each time we finish a task, we click it to mark it as done. Now, there is one more thing that doesn't look nice yet: in the epics table we will end up with a long list of linked tasks, which is really annoying. So go to the plus icon, search for "rollup" and select it. Now we have to choose the relation, which is "data warehouse tasks", and for the property we pick the checkbox. With that, the first table shows how many tasks are closed, but I don't want to show it like this: go to the calculation, choose "percent", then "percent checked", and now we can see the progress of our project. And instead of the numbers we can show a really nice bar. Great. We can also give it a name, like "progress". That's it, and we can hide the "data warehouse tasks" column. With that we have a really nice progress bar for each epic, and if we close all the tasks of an epic, we see it reach 100%. So this is the main structure. Now we can add some cosmetics and rename things to make it look nicer. For example, I can rename the tasks view to "tasks" and change its icon to something like this. And if you would like an icon for each epic, open the epic, for example "design data architecture", hover over the title, click "add icon" and pick any icon you like, for example this one; it now shows at the top and also in the table below. Okay, one more thing we can do for the project tasks is group them by epic: go to the three dots, then "group", and group by the epics. As you can see, we now have a section for each epic, and you can sort the epics if you want: go to "sort", choose manual, and order the epics as you like. You can also expand and collapse each group if you don't want to see all the tasks at once. So this is a really nice way to build lightweight project management for your own projects. Of course, in companies we use professional tools for this, for example Jira.
But for my private, personal projects I always do it like this, and I really recommend it, not only for this project but for any project you do. Because if you see the whole project in one view, you see the big picture, and closing tasks one by one like this keeps you satisfied, motivated to finish the whole project, and proud of it. Okay, friends, I just went and added a few icons, renamed some things and added more tasks to each epic, and this will be our starting point in the project; once we have more information, we will add more details on how exactly we will build the data warehouse. At the start we will analyze and understand the requirements, and only after that will we start designing the data architecture, where we have three tasks: first choose the data management approach, then brainstorm and design the layers of the data warehouse, and at the end draw the data architecture, so that we have a clear picture of what it looks like. After that we move to the next epic, where we start preparing our project. Once we have a clear understanding of the data architecture, the first task here is to create detailed project tasks, so we will add more epics and more tasks. Then we create the naming conventions for the project, to make sure we have rules and standards across the whole project. Next we create a repository in Git and prepare its structure, so that we always commit our work there. And then we start with the first script, where we create the database and the schemas. So, my friends, this is the initial plan for the project. Now let's start with the first epic, the requirements analysis. Analyzing the requirements is very important for understanding which type of data warehouse you are going to build, because there is not just one standard way to build it, and if you go in blindly, you might do a lot of totally unnecessary work and burn a lot of time. That's why you have to sit down with the stakeholders, with the departments, and understand exactly what has to be built; depending on the requirements, you design the shape of the data warehouse. So now let's analyze the requirements of this project. The whole project is split into two main sections. In the first section we have to build the data warehouse; this is the data engineering part, where we develop the ETLs and the warehouse itself. Once that is done, we build analytics and reporting, business intelligence, where we do the data analysis. But first we focus on part one, building the data warehouse. The statement is very simple; it says: develop a modern data warehouse using SQL Server to consolidate sales data, enabling analytical reporting and informed decision-making. So this is the main statement, and then we have the specifications. The first one is about the data sources: import data from two source systems, ERP and CRM, provided as CSV files. And the second one is about the data quality.
We have to clean and fix data quality issues before we do any data analysis, because, let's be real, no raw data is perfect; it is always messy, and we have to clean it up. The next specification is about integration: we have to combine both sources into one single, user-friendly data model designed for analytics and reporting, which means merging those two sources into one data model. Another specification says: focus on the latest data set, so there is no need for historization; that means we don't have to build history in the database. And the final requirement is about documentation: provide clear documentation of the data model, the final product of the data warehouse, to support both the business users and the analytics teams; so we have to produce a manual that makes life easier for the consumers of our data. As you can see, these requirements may be generic, but they already carry a lot of information for you: we have to use the SQL Server platform; there are two source systems delivering CSV files; it sounds like the data quality in the sources is really bad; they want us to build a completely new data model designed for reporting; no historization is needed; and we are expected to deliver documentation of the system. These are the requirements for the data engineering part, where we will build a data warehouse that fulfills them. All right. With that we have analyzed the requirements and closed the first, easiest epic; let's mark it as done. Now let's open the next one: here we have to design the data architecture, and the first task is to choose the data management approach, so let's go. Designing the data architecture is exactly like building a house. Before construction starts, an architect designs a plan, a blueprint of the house: how the rooms connect, how to make the house functional, safe and beautiful. Without this blueprint from the architect, the builders might create something unstable, inefficient, maybe unlivable. The same goes for data projects: a data architect is like a house architect; they design how your data will flow, integrate and be accessed. As data architects, we make sure the data warehouse is not only functioning but also scalable and easy to maintain, and this is exactly what we will do now: we will play the role of the data architect and start brainstorming and designing the architecture of the data warehouse. So now I am going to show you a sketch of the different approaches to designing a data architecture. This phase of a project is usually very exciting for me, because this is my main role in data projects: I am a data architect, and I spend a lot of time in discussions trying to find the best design for a project. All right, let's go. The first step of building a data architecture is a very important decision: choosing between four major types. The first approach is to build a data warehouse. It is very suitable if you have only structured data and your business wants a solid foundation for reporting and business intelligence.
The second approach is to build a data lake. This one is way more flexible than a data warehouse: you can store not only structured data but also semi-structured and unstructured data. We usually use this approach if you have mixed types of data, like database tables, logs, images and videos, and your business wants to focus not only on reporting but also on advanced analytics or machine learning. But it is not as organized as a data warehouse, and if a data lake gets too unorganized, it turns into a data swamp, and this is where the next approach comes in. The third option is to build a data lakehouse, a mix between a data warehouse and a data lake: you get the flexibility of having different types of data from the data lake, but you still structure and organize your data like we do in the data warehouse. You mix those two worlds into one; it is a very modern way of building a data architecture, and it is currently my favorite way of building a data management system. The last and most recent approach is to build a data mesh, and this one is a little different: instead of a centralized data management system, the idea of the data mesh is to make it decentralized, because centralized always means a bottleneck. Instead, you have multiple departments and domains, each building a data product and sharing it with the others. So now you have to pick one of those approaches, and in this project we will focus on the data warehouse. Now the question is how to build the data warehouse, and here again there are four different approaches. The first one is the Inmon approach. You have your sources, and the first layer is the staging, where the raw data lands. In the next layer you organize your data in something called the enterprise data warehouse, where you model the data in third normal form (3NF), which is about how to structure and normalize your tables; you are building a new, integrated data model from the multiple sources. Then we get to the third layer, the data marts, where you take a small subset of the data warehouse and design it so that it is ready to be consumed by reporting; each mart focuses on only one topic, for example customers, sales or products. After that you connect your BI tool, like Power BI or Tableau, to the data marts. So you have three layers to prepare the data before reporting. Next we have the Kimball approach. Kimball says: you know what, building this enterprise data warehouse wastes a lot of time, so we can jump straight from the stage layer to the data marts, because building the enterprise data warehouse is a big struggle. He wants you to focus on building the data marts as quickly as possible. It is a faster approach than Inmon's, but over time you might get chaos in the data marts, because you are not always focusing on the big picture, and you might repeat the same transformations and integrations in different data marts; so there is a trade-off between speed and a consistent data warehouse. Now, moving on to the third approach, we have the data vault. We still have the stage and the data marts, and it says we still need the central data warehouse in the middle, but this middle layer gets more standards and rules.
It tells you to split this middle layer into two layers: the raw vault and the business vault. In the raw vault you have the original data, but in the business vault you have all the business rules and transformations that prepare the data for the data marts. So the data vault is very similar to Inmon, but it brings more standards and rules to the middle layer. Now I'm going to go and add a fourth one that I'm going to call the medallion architecture, and this one is my favorite, because it is very easy to understand and to build. It says you're going to go and build three layers: bronze, silver and gold. The bronze layer is very similar to the stage, but we have understood over time that the stage layer is very important, because having the original data exactly as it is helps a lot with traceability and finding issues. Then in the next layer we have the silver layer. It is where we do transformations and data cleansing, but we don't apply any business rules yet. Now, moving on to the last layer, the gold layer: it is also very similar to the data marts, but there we can build different types of objects, not only for reporting but also for machine learning, for AI and for many different purposes. They are business-ready objects that you want to share as data products. So those are the four approaches that you can use in order to build a data warehouse. Again, if you are building a data architecture, you have to specify which approach you want to follow. At the start we said we want to build a data warehouse, then we had to decide between those four approaches on how to build it, and in this project we will be using the medallion architecture. So this is a very important question that you have to answer as the first step of building a data architecture. All right. With that we have decided on the approach, so we can go and mark it as done. In the next step we're going to go and design the layers of the data warehouse. Now, there is no 100% standard way with fixed rules for each layer. What you have to do as a data architect is define exactly what the purpose of each layer is. We start with the bronze layer. We say it's going to store raw, unprocessed data as it is from the sources. And why are we doing that? It is for traceability and debugging. If you have a layer where you are keeping the raw data exactly as it comes from the sources, we can always go back to the bronze layer and investigate the data of a specific source if something goes wrong. So the main objective is to have raw, untouched data that is going to help you as a data engineer in analyzing the root cause of issues. Now, moving on to the silver layer: it is the layer where we're going to store clean and standardized data, and this is the place where we're going to do basic transformations in order to prepare the data for the final layer. The gold layer is going to contain business-ready data. The main goal here is to provide data that can be consumed by business users and analysts in order to build reporting and analytics. So with that we have defined the main goal of each layer. Next, what I would like to do is define the object types, and since we are talking about a data warehouse in a database, we generally have two types here: either a table or a view. For the bronze layer and the silver layer we are going with tables, but for the gold layer we are going with views.
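To make that concrete, here is a minimal sketch of what a gold-layer object could look like; all the table and column names are hypothetical, since we haven't designed the silver tables yet:

-- A gold object as a view: no load process needed; any refresh of the
-- silver table is immediately visible to the consumers of the view.
CREATE VIEW gold.dim_customers AS
SELECT
    ci.cst_id        AS customer_id,   -- hypothetical silver column names
    ci.cst_firstname AS first_name,
    ci.cst_lastname  AS last_name
FROM silver.crm_cust_info AS ci;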
So the best practice says: for the last layer in your data warehouse, make it virtual using views. That gives you a lot of flexibility and of course speed in building it, since we don't have to create a load process for it. Now, in the next step we're going to go and define the load method. In this project I have decided to go with the full load, using the method of truncating and inserting; it is just faster and way easier. So for the bronze layer we're going to go with the full load, and you have to specify it for the silver layer as well: there we're also going with the full load. And of course for the views we don't need any load process. So each time you decide to go with tables, you have to define the load method, whether full load, incremental load and so on. Now we come to the very interesting part: the data transformations. For the bronze layer this topic is the easiest one, because we don't have any transformations. We have to commit ourselves to not touch the data, not manipulate it, not change anything. It's going to stay as it is: if it comes in bad, it stays bad in the bronze layer. And now we come to the silver layer, where we have the heavy lifting. As we committed in the objective, we have to deliver clean and standardized data, and for that we have different types of transformations: we have to do data cleansing, data standardization, data normalization, we have to derive new columns and do data enrichment. So there is a whole bunch of transformations that we have to do in order to prepare the data. Our focus here is to transform the data so that it is clean and follows standards, and to try to push all business transformations to the next layer. That means in the gold layer we will be focusing on the business transformations that are needed by the consumers for the use cases. What we do there: we do data integration between the source systems, we do data aggregations, we apply a lot of business logic and rules, and we build a data model that is ready for, for example, business intelligence. So in the gold layer we do a lot of business transformations, and in the silver layer we do basic data transformations. It is really very important here to make clear decisions about what type of transformations are done in each layer, and to make sure that you commit to those rules. Now, the next aspect is about the data modeling. In the bronze layer and the silver layer we will not break the data model that comes from the source system. So if the source system delivers five tables, we're going to have five tables here, and the same in the silver layer. We will not go and denormalize or normalize or make something new; we're going to leave it exactly like it comes from the source system, because we're going to build the data model in the gold layer. And here you have to define which data model you want to follow. Are you following the star schema, the snowflake, or are you just making aggregated objects? You have to go and make a list of all the data model types that you're going to follow in the gold layer. And at the end, what you can specify for each layer is the target audience, and this is of course a very important decision. In the bronze layer, you don't want to give access to any end user; it is really important to make sure that only data engineers access the bronze layer. It makes no sense for data analysts or data scientists to go to the bad data, because you have a better version of it for them in the silver layer.
So in the silver layer, of course, the data engineers have to have access, and also the data analysts and the data scientists and so on, but you still don't give it to any business user who can't deal with the raw data model from the sources, because for the business users you're going to have a better layer, and that is the gold layer. The gold layer is suitable for the data analysts and also the business users, because usually the business users don't have deep knowledge of the technicalities of the silver layer. So if you are designing multiple layers, you have to discuss all those topics and make a clear decision for each layer. All right, my friends. Now, before we proceed with the design, I want to tell you a secret principle that each data architect must know, and that is the separation of concerns. So what is that? As you are designing an architecture, you have to make sure to break down the complex system into smaller, independent parts, where each part is responsible for a specific task. And here comes the magic: the components of your architecture must not be duplicated. You cannot have two parts doing the same thing. The idea here is to not mix everything, and mixing everything is one of the biggest mistakes in any big project; I have seen it almost everywhere. Good data architects follow this principle. For example, if you look at our data architecture, we have already done that: we have defined a unique set of tasks for each layer. We said that in the silver layer we do data cleansing, but in the gold layer we do business transformations, and with that you are not allowed to do any business transformations in the silver layer. The same thing goes for the gold layer: you don't do any data cleansing in the gold layer. So each layer has its own unique tasks, and the same thing goes for the bronze layer and the silver layer. You are not allowed to load data from the source systems directly into the silver layer, because we have decided that the landing layer, the first layer, is the bronze layer. Otherwise you would have one set of source systems that is loaded first into the bronze layer and another set that skips that layer and goes to the silver, and with that we have overlapping: you are doing data ingestion in two different layers. So my friends, if you have this mindset of separation of concerns, I promise you, you're going to be a top data architect. So think about it. All right, my friends. With that, we have designed the layers of the data warehouse, and we can go ahead and close it. In the next step, we're going to go to draw.io and start drawing the data architecture. There is no single standard on how to draw a data architecture; you can add your own style and do it the way you want. Now, the first thing that we have to show in the architecture is the different layers that we have. The first layer is the source system layer. So let's go and take a box like this and make it a little bit bigger, and I'm just going to go and make the design: I'm going to remove the fill and make the line a dotted one, and after that I'm going to go and change the color, maybe to something like this gray. So now we have a container for the first layer. Then we have to go and add a text on top of it. So what I'm going to do, I'm going to take another box, type "Sources" inside it, and now I'm going to go and style it: I'm going to go to the text and make it maybe 24, and then remove the lines, like this.
Make it a little bit smaller and put it on top. So this is the first layer; this is where the data comes from. Then the data is going to go into a data warehouse, so I'm just going to go and duplicate this one; this one is the data warehouse. All right. Now, what is the third layer going to be? It's going to be the consumers, who will be consuming this data warehouse. So I'm going to put another box and say this is the consume layer. Okay, so those are the three containers. Now, inside the data warehouse, we have decided to build it using the medallion architecture, so we're going to have three layers inside the warehouse. I'm going to take another box and call this one the bronze layer, and now we have to go and give it a design. I'm going to go with this color over here, then the text, maybe something like 20, then make it a little bit smaller and just put it here. And beneath that we're going to have the components, so this is just the title of a container. I'm going to have it like this, remove the text from inside it and remove the filling. So this container is for the bronze layer. Let's go and duplicate it for the next one: this one is going to be the silver layer, and of course we can go and change the coloring to gray, because it is silver, and the lines as well, and remove the filling. Great. And now maybe I'm going to make the font bold. All right. The third layer is going to be the gold layer, and we have to go and pick a color for that. So, style, and here we have something like yellow; the same thing for the container, and I remove the filling. So with that we are now showing the different layers inside our data warehouse. Those containers are still empty; what we're going to do is go inside each one of them and start adding content. Now, in the sources it is very important to make clear what the different types of source systems are that you are connecting to the data warehouse, because in a real project there are multiple types: you might have a database, an API, files, Kafka, and it's important to show those different types here. In our project we have folders, and inside those folders we have CSV files. So what we have to do is make it clear in this layer that the input for our project is CSV files. It really depends how you want to show that. I'm going to go over here and search for, maybe, "folder", then take the folder and put it here inside, and then maybe search for "file", more results, and go pick one of those icons; for example, I'm going to go with this one over here. I'm going to make it smaller and add it on top of the folder. So with that we make it clear for everyone seeing the architecture that the source is not a database, not an API; it is a file inside a folder. Now, it is very important here to show the source systems: what are the sources that are involved in the project? So what we're going to do, we're going to go and give it a name. For example, we have one source called CRM, like this, and maybe make the icon, and we have another source called ERP. So we're going to go and duplicate it, put it over here and rename it ERP. Now it is clear for everyone: we have two sources for this project, and the technology used is simply files. What we can also do is go and add some descriptions inside this box to make it even clearer.
So what I'm going to do, I'm going to take a line, because I want to split the description from the icons, something like this, and make it gray. Then below it we're going to go and add some text, and we're going to say it is CSV files, and as the next point we can say the interface is simply files in a folder. And of course you can go and add any specifications and explanations about the sources; if it is a database, you can say the type of the database and so on. So with that we have made it clear in the data architecture what the sources of our data warehouse are. Now, in the next step, we're going to go and design the content of the bronze, silver and gold. I'm going to start by adding an icon to each container to show that we are talking about a database. So we're going to go and search for "database", then more results, and I'm going to go with this icon over here. Let's go and make it bigger, something like this, and maybe change the color of the dots. So we're going to have one for the bronze, and here for the silver and the gold as well. Now what we can do, we're going to go and add some arrows between those layers. We can go and search for "arrow" and maybe pick one of those. Let's go and put it here, pick a color for it, maybe something like this, and adjust it. So now we have this nice arrow between all the layers, just to explain the direction of our architecture, right? We can read it from left to right, and we add one between the gold layer and the consume layer as well. Okay. So what I'm going to do next, we're going to go and add one statement about each layer: the main objective. Let's go and grab a text and put it beneath the database icon, and for the bronze layer it's going to say, for example, raw data. Maybe make the text bigger, so here we have the raw data. Then the next one, in the silver, we have clean, standardized data, and the last one, for the gold, we can say business-ready data. So with that we make the objective clear for each layer. Now, below all those icons, we're going to have a separator again, like this, make it colored, and beneath it we're going to add the most important specifications of each layer. So let's go and add those separators in each layer. Okay. Now we need a text below it; let's take this one here. So what is the object type of the bronze layer? That's going to be a table. And we can go and add the load method: we say this is batch processing, since we are not doing streaming; we can say it is a full load, we are not doing incremental load, so we can say here truncate and insert. Then we add one more section, maybe about the transformations, and we can say no transformations. And one more about the data model: we're going to say none, as-is. And now what I'm going to do, I'm going to go and add those specifications for the silver and the gold as well. So here we have what we discussed: the object type, the load process, the transformations, and whether we are breaking the data model or not; the same thing for the gold layer.
So I can say that with that we have a really nice layering of the data warehouse, and what we are left with is the consumers over here. You can go and add the different use cases and tools that can access your data warehouse. For example, I'm adding here business intelligence and reporting, maybe using Power BI or Tableau; or you can say you can access my data warehouse in order to do ad-hoc analysis using SQL queries, and this is what we're going to focus on in this project after we build the data warehouse; and you can offer it for machine learning purposes as well. And of course, it's really nice to add some icons to your architecture; I usually use this nice website called Flaticon. It has really amazing icons that you can go and use in your architecture. Now, of course, we can go and keep adding icons and other things to explain the data architecture and the system. For example, it is very important here to say which tools you are using in order to build this data warehouse. Is it in the cloud? Are you using Azure, Databricks, or maybe Snowflake? So we're going to go and add for our project the icon of SQL Server, since we are building this data warehouse completely in SQL Server. So for now I'm really happy with it. As you can see, we have now a plan, right? All right guys, so with that we have designed the data architecture using draw.io, and with that we have done the last step in this epic. Now we have a design for the data architecture, and we can say we have closed this epic. Now let's go to the next one. We will start doing the first steps to prepare our project, and the first task here is to create a detailed project plan. All right, my friends. So now it's clear for us that we have three layers and we have to go and build them. That means our big epics are going to follow the layers. So here I have added three more epics: build bronze layer, build silver layer and build gold layer. After that I went and started defining all the different tasks that we have to follow in the project. At the start we will be analyzing, then coding, after that we're going to go and do testing, and once everything is ready we're going to go and document things, and at the end we have to commit our work to the Git repo. All those epics are following the same pattern in the tasks. So as you can see, we now have a very detailed project structure, and things are clearer for us on how we're going to build the data warehouse. With that we are done with this task, and in the next task we have to go and define the naming conventions of the project. All right. At this phase of a project we usually define the naming conventions. So what is that? It is the set of rules that you define for naming everything in the project, whether it is a database, schema, tables, stored procedures, folders, anything. And if you don't do that at an early phase of the project, I promise you chaos can happen. Because what is going to happen? You will have different developers in your project, and each of those developers has their own style, of course. So one developer might name a table dimension_customers, where everything is lowercase and there is an underscore between the words, and you have another developer creating another table called DimensionProducts using the camel case, so there is no separation between the words and the first character of each word is capitalized, and maybe another one is using some prefixes, like dim_categories, where we have a shortcut for the dimension.
So as you can see, there are different designs and styles, and if you leave the door open, what can happen is that in the middle of the project you notice: okay, everything looks inconsistent, and you have to define a big task to go and rename everything following a specific rule. So instead of wasting all that time, at this phase you go and define the naming conventions. Let's go and do that. We usually start with a very important decision, and that is which naming convention we are going to follow in the whole project. You have different cases, like the camel case, the Pascal case, the kebab case and the snake case. For this project we're going to go with the snake case, where all the letters of a word are lowercase and the separator between words is an underscore. For example, a table named customer_info: customer is lowercase, info is lowercase as well, and between them there is an underscore. So this is always the first thing that you have to decide for your data projects. The second thing is to decide the language. For example, I work in Germany, and there is always a decision that we have to make, whether we use German or English. So we have to decide which language we're going to use for our project. And a very important general rule: avoid reserved words. Don't use a SQL reserved word as an object name; for example, don't name a table "table". So those are the general rules that you have to follow in the whole project. This applies to everything: tables, columns, stored procedures, any names that you are giving in your scripts. Now, moving on, we have specifications for the table names, and here we have a different set of rules for each layer. For the bronze layer the rule says: source system, underscore, entity. We are saying all the tables in the bronze layer should start with the source system name, like for example CRM or ERP; after that we have an underscore, and at the end we have the entity name, the table name. So for example we have this table name crm_cust_info: that means this table comes from the source system CRM, and then we have the table name, the entity name, customer info. This is the rule that we're going to follow in naming all tables in the bronze layer. Then, moving on to the silver layer: it is exactly like the bronze, because we are not going to rename anything and we are not going to build any new data model. The naming is going to be one-to-one like the bronze, so it is exactly the same rule. But if we go to the gold layer, since we are building a new data model there, we have to go and rename things. And since we are also integrating multiple sources together, we will not be using the source system name in the tables, because inside one table you could have multiple sources. So the rule says all the names must be meaningful, business-aligned names for the tables, starting with a category prefix. The rule is: it starts with the category, then an underscore, and then the entity. Now, what is a category? We have in the gold layer different types of tables. We could build a fact table, another one could be a dimension, and a third type could be an aggregation or a report. So we have different types of tables, and we can specify those types as a prefix at the start. For example, we say here fact_sales: the category is fact, and the table name is sales. And here I have just made a table with the different type patterns. So we could have a dimension.
We say it starts with dim underscore, for example dim_customers or dim_products. Then we have another type, the fact table, so it starts with fact underscore; or an aggregated table, where we have the first three characters agg, like aggregating the customers or the monthly sales. So as you can see, as you are creating a naming convention, you first have to make the rule clear, describe each part of the rule, and start giving examples. With that we make it clear for the whole team which names they should follow. So we talked here about the table naming convention. You can also go and make a naming convention for the columns. For example, in the gold layer we're going to have surrogate keys, and we can define it like this: the surrogate key should consist of the table name, then underscore, then key. For example, we can call it customer_key; it is the surrogate key in the dimension customers. The same thing for technical columns. As data engineers, we might add our own columns to the tables, columns that don't come from the source system. Those columns are the technical columns, or sometimes we call them metadata columns. Now, in order to separate them from the original columns that come from the source system, we can have a prefix for them. The rule says: if you are adding any technical or metadata columns, the column should start with dwh underscore and then the column name. For example, if you want a metadata load date, we can have dwh_load_date. So with that, if anyone sees a column starting with dwh, we understand this data comes from a data engineer. And we can keep adding rules, for example for the stored procedures over here. If you are making an ETL script, it should start with the prefix load underscore and then the layer. For example, the stored procedure that is responsible for loading the bronze is going to be called load_bronze, and for the silver, load_silver. So those are currently the rules for the stored procedures, and this is how I usually do it in my projects. All right, my friends. So with that we have solid naming conventions for our project. This is done, and in the next step we're going to go to Git, where we will create a brand new repository and prepare its structure. So let's go. All right. Now we come to an equally important step in any project, and that is creating the Git repository. If you are new to Git, don't worry about it; it is simpler than it sounds. It's all about having a safe place where you can put the code that you are developing: you have the possibility to track everything that happens to the code, you can use it to collaborate with your team, and if something goes wrong, you can always roll back. And the best part: once you are done with the project, you can share your repository as part of your portfolio, and it is a really amazing thing, if you are applying for a job, to showcase your skills by showing that you have built a data warehouse with a well-documented Git repository. So now let's go and create the repository for the project. We are at the overview of our account. The first thing that we have to do is go to the repositories over here, then go to this green button and click on new. The first thing we have to do is give the repository a name, so let's call it sql-data-warehouse-project, and then here we can go and give it a description.
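Before we fill in the rest of the repository details, here is a compact recap of the naming rules we just defined, the way I would pin them later in the project documentation; all the concrete names are illustrative examples:

-- Naming convention recap (examples are illustrative)
-- general:            snake_case, English, no reserved words
-- bronze / silver:    <sourcesystem>_<entity>   e.g. bronze.crm_cust_info
-- gold:               <category>_<entity>       e.g. gold.dim_customers, gold.fact_sales
-- surrogate keys:     <table>_key               e.g. customer_key
-- technical columns:  dwh_<column>              e.g. dwh_load_date
-- stored procedures:  load_<layer>              e.g. bronze.load_bronze

Okay, back to the repository: we have the name, and next comes the description.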
For example, I'm saying: building a modern data warehouse with SQL Server. The next option is whether you want to make it public or private; I'm going to leave it public. Then let's go and add a readme file here. And for the license, we can go over here and select the MIT license; the MIT license gives everyone the freedom to use and modify your code. Okay, I think I'm happy with the setup. Let's go and create the repository, and with that we have our brand new repository. The next step that I usually do is create the structure of the repository, and I always follow the same pattern in any project. We need a few folders to put our files in, right? So what I usually do: I go over here to add file, create new file, and I start creating the structure. The first thing is that we need datasets, then a slash; with that the repository understands this is a folder, not a file. Then you can go and add anything, like here "placeholder", just an empty file; it is only there to help me create the folders. So let's go and commit, commit the changes, and now, if you go back to the main project, you can see we have a folder called datasets. I'm going to go and keep creating things: I will go and create the docs placeholder, commit the changes, then I'm going to go and create the scripts placeholder, and the final one that I usually add is the tests, something like this. So, as you can see, now we have the main folders of our repository. The next thing that I usually do is go and edit the main readme; you can see it over here as well. So what we're going to do, we're going to go inside the readme, then go to the edit button here, and we're going to start writing the main information about our project. This really depends on your style, so you can go and add whatever you want; this is the main page of your repository. And as you can see, the file extension here is .md. It stands for markdown; it is just an easy and friendly format for writing text. So if you have documentation, if you are writing text, it is a really nice format to organize and structure it, and it is very friendly. So what I'm going to do at the start, I'm going to give a short description of the project: we have the main title, then a welcome message and what this repository is about. In the next section maybe we can start with the project requirements, and at the end you can say a few words about the licensing and a few words about you. So as you can see, it's like the homepage of the project and the repository. Once you are done, we're going to go and commit the changes. And now, if you go to the main page of the repository, you always see the folders and files at the top, and below them we see the information from the readme. So again, here we have the welcome statement, then the project requirements, and at the end we have the licensing and the about me. So, my friends, that's it. We now have a repository and the main structure of the project, and throughout the project, as we are building the data warehouse, we're going to go and commit all our work to this repository. Nice, right? All right. So with that we have your repository ready, and as we go through the project we will be adding things to it.
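So the resulting structure of the repository looks roughly like this (the folder names are the ones we just created; the placeholder files exist only so Git can track the otherwise empty folders):

sql-data-warehouse-project/
├── datasets/     -- raw CSV files from the CRM and ERP source systems
├── docs/         -- documentation, diagrams, data catalog
├── scripts/      -- SQL scripts for bronze, silver and gold
├── tests/        -- quality checks and validation scripts
├── README.md     -- project overview, requirements, license, about me
└── LICENSE       -- MIT license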
So this step is done, and now, for the last step, we're finally going to go to SQL Server and write our first script, where we're going to create the database and the schemas. All right. The first step is to go and create a brand new database. In order to do that, first we have to switch to the master database. You can do it like this: use master, and a semicolon. If you go and execute it, we are now switched to the master database. It is a system database in SQL Server where you can go and create other databases, and you can see here in the toolbar that we are now logged into the master database. Next, we have to go and create our new database. We're going to say create database, and you can call it whatever you want; I'm going to go with DataWarehouse, and a semicolon. Let's go and execute it, and with that we have created our database. Let's go and check it in the object explorer: refresh, and you can see our new DataWarehouse. This is our new database. Awesome, right? Now, in the next step, we're going to go and switch to the new database. We're going to say use DataWarehouse, and a semicolon. Let's go and switch to it, and you can see we are now logged into the DataWarehouse database, and we can go and start building things inside it. The first thing that I usually do is go and create the schemas. So what is a schema? Think of it like a folder or a container that helps you keep things organized. As we decided in the architecture, we have three layers, bronze, silver, gold, and now we're going to go and create a schema for each layer. Let's go and do that. We're going to start with the first one: create schema, and the first one is bronze, so let's do it like this, and a semicolon. Let's go and create the first schema. Nice, so we have a new schema. Let's go to our database, and in order to check the schemas, we go to security and then to the schemas over here, and as you can see, we have the bronze. If you don't find it, you have to go and refresh the schemas, and then you will find the new one. Great. So now we have the first schema. Next, we're going to go and create the other two. I'm just going to go and duplicate it: the next one is going to be the silver, and the third one is going to be the gold. Let's go and execute those two together. We will get an error, and that's because we don't have the GO in between. So after each command, let's have a GO. And now, if I highlight the silver and the gold and then execute, it works. GO in SQL Server is a batch separator: it tells SQL to completely execute the first command before going to the next one. It is just a separator. Now let's go to our schemas, refresh, and we can see that we have the gold and the silver as well. So with that we now have a database, we have the three layers, and we can start developing each layer individually. Okay. So now let's go and commit our work to Git. Since it is a script, code, we're going to go to the scripts folder over here, then we're going to go and add a new file; let's call it init_database.sql, and we're going to go and paste our code there. Now, I have made a few modifications, like for example: before we create the database, we have to check whether the database already exists.
This is an important step if you are recreating the database; otherwise, if you don't do it, you will get an error saying the database already exists. So first the script checks whether the database exists, then it drops it. I have also added a few comments, like here, where we are saying we are creating the data warehouse, and creating the schemas. And now we have a very important step: we have to go and add a header comment at the start of each script. To be honest, three months from now you will not remember all the details of this script, and adding a comment like this is like a sticky note for you later, once you visit this script again. It is also very important for the other developers in the team, because each time you or anyone else opens the script, the first question is going to be: what is the purpose of this script, why are we doing this? So as you can see, here we have a comment saying: this script creates a new data warehouse after checking whether it already exists; if the database exists, it drops it and recreates it; additionally, it creates three schemas: bronze, silver and gold. That gives clarity about what this script does, and it makes everyone's life easier. The second reason why this is very important is that you can add warnings, and especially for this script it is very important to add these notes, because if you run this script, what is going to happen? It is going to go and destroy the whole database. Imagine someone opens this script and runs it; imagine an admin opens this script and runs it on your database: everything is going to be destroyed and all the data will be lost, and this can be a disaster if you don't have a backup. So with that we have a nice header comment and a few comments in our code, and now we are ready to commit it. Let's go and commit it, and now we have our script in Git as well. And of course, if you make any modifications, make sure to update the changes in Git. Okay, my friends. So with that we have an empty database with schemas, we are done with this task, and we are also done with the whole epic: we have completed the project initialization. Now we're going to go to the interesting stuff: we will go and build the bronze layer, and the first task is to analyze the source systems. So let's go. All right. So now the big question is how to build the bronze layer. First things first: we do the analysis. When you are developing anything, you don't immediately start writing code. Before we start coding the bronze layer, we have to understand the source system. What I usually do is hold an interview with the source system experts and ask them many, many questions in order to understand the nature of the source system that I'm connecting to the data warehouse. Once you know the source systems, we can start coding, and the main focus here is the data ingestion. That means we have to find a way to load the data from the source into the data warehouse; it's like we are building a bridge between the source and our target system, the data warehouse. And once we have the code ready, the next step is data validation. Here comes the quality control: it is very important in the bronze layer to check the data completeness.
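For reference, here is a minimal sketch of what this init_database.sql could look like; the SET SINGLE_USER step is an assumption on my part, a safeguard often added so the drop succeeds even while other connections are open:

/*
=============================================================
Create Database and Schemas
=============================================================
Purpose: creates the 'DataWarehouse' database after checking
whether it already exists; if it exists, it is dropped and
recreated. Additionally creates three schemas: bronze,
silver and gold.

WARNING: running this script drops the entire 'DataWarehouse'
database; all data in it will be permanently deleted. Make
sure you have a backup before running it.
=============================================================
*/
USE master;
GO

-- Drop and recreate the database if it already exists
IF EXISTS (SELECT 1 FROM sys.databases WHERE name = 'DataWarehouse')
BEGIN
    ALTER DATABASE DataWarehouse SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    DROP DATABASE DataWarehouse;
END;
GO

CREATE DATABASE DataWarehouse;
GO

USE DataWarehouse;
GO

-- One schema per layer
CREATE SCHEMA bronze;
GO
CREATE SCHEMA silver;
GO
CREATE SCHEMA gold;
GO

Okay, back to the bronze layer process: we were talking about checking the data completeness.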
That means we have to compare the number of records between the source system and the bronze layer, just to make sure we are not losing any data in between. Another check that we will be doing is the schema check, to make sure that the data lands in the right position. And finally, we must not forget about the documentation and committing our work to Git. So this is the process that we're going to follow to build the bronze layer. All right, my friends. Now, before connecting any source system to our data warehouse, we have to take a very important step: understanding the sources. How I usually do it: I set up a meeting with the source system experts in order to interview them, to ask them a lot of questions about the source. Gaining this knowledge is very important, because asking the right questions will help you design the correct scripts to extract the data and avoid a lot of mistakes and challenges. And now I'm going to show you the most common questions that I usually ask before connecting anything. Okay. We start first by understanding the business context and the ownership. I would like to understand the story behind the data; I would like to understand who is responsible for the data, which IT department, and so on. Then it's nice to understand what business process it supports: does it support the customer transactions, the supply chain and logistics, or maybe finance reporting? With that you can understand the importance of your data. Then I ask about the system and data documentation. Having documentation from the source is your learning material about your data, and it is going to save you a lot of time later when you are working on and designing new data models. I would also always like to understand the data model of the source system, and if they have descriptions of the columns and the tables, it's nice to have the data catalog. This can help me a lot in the data warehouse, for example with how I'm going to join the tables together. So with that you get a solid foundation about the business context, the processes and the ownership of the data. In the next step we start talking about the technicalities. I would like to understand the architecture and the technology stack. The first question that I usually ask is how the source system is storing the data: do we have the data on-prem, like in SQL Server or Oracle, or is it in the cloud, like Azure, AWS and so on? Once we understand that, we can discuss the integration capabilities: how am I going to get the data? Does the source system offer APIs, maybe Kafka, or do they have only file extractions, or will they give you a direct connection to the database? Once you understand the technology that you're going to use in order to extract the data, we deep dive into more technical questions, and here we try to understand how to extract the data from the source system and load it into the data warehouse.
The first thing that we have to discuss with the experts is whether we do an incremental load or a full load. After that we discuss the data scope and the historization: do we need all the data, or maybe only ten years of it? Are there histories already in the source system, or should we build them in the data warehouse, and so on. Then we go and discuss the expected size of the extracts: are we talking here about megabytes, gigabytes, terabytes? This is very important in order to understand whether we have the right tools and platform to connect that source system. Then I try to understand whether there are any data volume limitations: if you have an old source system, it might struggle a lot with performance, and if you have an ETL that is extracting a large amount of data, you might bring the performance of the source system down. That's why you have to try to understand whether there are any limitations for your extracts, as well as other aspects that might impact the performance of the source system. This is very important: if they give you access to the database, you are responsible for not bringing the performance of the database down. And of course, a very important question is to ask about the authentication and the authorization: how are you going to access the data in the source system? Do you need any tokens, keys, passwords and so on? So those are the questions that you have to ask if you are connecting a new source system to the data warehouse, and once you have the answers to those questions, you can proceed with the next steps to connect the sources to the data warehouse. All right, my friends. So with that, you have learned how to analyze a new source system that you want to connect to your data warehouse. This step is done, and now we're going to go back to coding, where we're going to write scripts for the data ingestion from the CSV files into the bronze layer. Let's have a quick look again at our bronze layer specifications. We just have to load the data from the sources into the data warehouse; we're going to build tables in the bronze layer; we are doing a full load, which means we are truncating and then inserting the data; there will be no data transformations at all in the bronze layer; and we will not be creating any data model. So these are the specifications of the bronze layer. All right. Now, in order to create the DDL script for the bronze layer, creating the tables of the bronze, we have to understand the metadata, the structure, the schema of the incoming data, and here either you ask the technical experts from the source system about this information, or you go and explore the incoming data and try to derive the structure of your tables from it. So now what we're going to do, we're going to start with the first source system, the CRM. Let's go inside it, and we're going to start with the first table, the customer info. If you open the file and check the data inside it, you see we have header information, and that is very good, because now we have the names of the columns that are coming from the source, and from the content you can of course derive the data types. So let's go and do that. First we're going to say create table, and then we have to define the layer: it's going to be the bronze. And now, very importantly, we have to follow the naming convention, so we start with the name of the source system.
It is crm, underscore, and after that the table name from the source system: it's going to be cust_info. So the full name of our first table in the bronze layer is bronze.crm_cust_info. In the next step we have to go and define the columns, and here again the column names in the bronze layer are one-to-one, exactly like the source system. The first one is going to be the ID, and I will go with the data type integer. The next one is going to be the key, NVARCHAR, and for the length I will go with 50. And the last one is going to be the create date; it's going to be a date. So with that we have covered all the columns available from the source system. Let's go and check, and yes, the last one is the create date. So that's it for the first table; a semicolon at the end, of course. Let's go and execute it, and now we go to the object explorer over here, refresh, and we can see the first table inside our data warehouse. Amazing, right? Now, what you have to do next is go and create a DDL statement for each file from those two systems. For the CRM we need three DDLs, and for the other system, the ERP, we also have to create three DDLs for the three files. So at the end we're going to have six tables, six DDLs, in the bronze layer. Now pause the video and go create those DDLs; I will be doing the same, and we will see each other soon. All right. So now I hope you have created all those DDLs. I'm going to show you what I have just created. The second table in the source CRM holds the product information, and the third one is the sales details. Then we go to the second system, and here we make sure that we are following the naming convention: first the source system, ERP, and then the table name. The second system was really easy: you can see one table has only two columns, the customers only three, and the categories only four columns. All right. After defining all those, of course, we have to go and execute them. Let's go and do that, then we go to the object explorer over here, refresh the tables, and with that you can see we have six empty tables in the bronze layer. With that we have all the tables from the two source systems inside our database, but we still don't have any data. And you can see our naming convention is really nice: the first three tables come from the CRM source system, and the other three come from the ERP. So in the bronze layer things are split really nicely, and you can quickly identify which table belongs to which source system. Now, there is something else that I usually add to the DDL script: a check whether the table exists before creating it. For example, let's say that you are renaming a column or you would like to change the data type of a specific field. If you just go and run this query, you will get an error, because the database is going to say: we already have this table. In other databases you can say create or replace table, but in SQL Server you have to go and build T-SQL logic for it. It is very simple. First we have to go and check whether the object exists in the database. We say if object_id, and then we have to go and specify the table name. Let's go and copy the whole thing over here, and make sure you use exactly the same name as the table name; there is a space there, so I'm just going to go and remove it. Then we're going to go and define the object type: it's going to be U, which stands for user.
It means the user-defined tables. So, if this object is not null, that means the database did find the object, and what happens then? We say: go and drop the table. So the whole name again, and a semicolon. So again: if the table exists in the database, if the object is not null, go and drop the table, and after that go and create it. Now, if you go and highlight the whole thing and then execute it, it works: first it drops the table if it exists, then it creates the table from scratch. What you have to do now is go and add this check before creating any table inside our database. It's going to be the same thing for the next table and so on. I went and added all those checks for each table, and now, if I go and execute the whole thing, it works: with that I'm recreating all the tables in the bronze layer from scratch. Now, the method that we're going to use in order to load the data from the source into the data warehouse is the bulk insert. Bulk insert is a method of loading a massive amount of data very quickly from files, like CSV files or maybe a text file, directly into a database. It is not like the classic normal insert, which goes and inserts the data row by row; instead, the bulk insert is one operation that loads all the data in one go into the database, and that's what makes it very fast. So let's go and use this method. Okay, so now let's start writing the script to load the first table in the source CRM. We're going to go and load the table customer info from the CSV file into the database table. The syntax is very simple. We start by saying bulk insert; with that SQL understands we are not doing a normal insert, we are doing a bulk insert. Then we have to go and specify the table name: it is bronze dot crm_cust_info. Now we have to specify the full location of the file that we are trying to load into this table, so we have to go and get the path where the file is stored. I'm going to go and copy the whole path and add it to the bulk insert, exactly where the data lives; for me it is in the sql data warehouse project folder, under the datasets, in the source CRM folder. Then I have to specify the file name, so it's going to be cust_info.csv. You have to get the path exactly right for your files, otherwise it will not work. After the path, we come to the WITH clause: we have to tell SQL Server how to handle our file, and here come the specifications. There is a lot we can define, so let's start with a very important one: the first row. If you check the content of our files, you can see the first row always contains the header information of the file. That information is actually not data, it's just the column names; the actual data starts from the second row, and we have to tell the database about it. We're going to say the first row is actually the second row; with that we are telling SQL to skip the first row in the file. We don't need to load that information, because we have already defined the structure of our table. So this is the first specification. The next one, which is just as important when loading any CSV file, is the separator between fields, the delimiter between fields. It really depends on the file structure that you are getting from the source.
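Before we settle the delimiter, here is a quick recap, as a minimal sketch, of the DDL pattern we just built for the first table; the name columns in the middle are illustrative guesses at the source header, since we only spelled out the id, the key and the create date:

IF OBJECT_ID('bronze.crm_cust_info', 'U') IS NOT NULL
    DROP TABLE bronze.crm_cust_info;
GO

CREATE TABLE bronze.crm_cust_info (
    cst_id          INT,            -- columns are one-to-one with the source file
    cst_key         NVARCHAR(50),
    cst_firstname   NVARCHAR(50),   -- illustrative
    cst_lastname    NVARCHAR(50),   -- illustrative
    cst_create_date DATE
);
GO

Now, back to our file and the question of the separator.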
As you can see, all those values are split by a comma, and we call this comma the field separator or delimiter. I have seen a lot of different CSVs: sometimes they use a semicolon, or a pipe, or a special character like a hash, and so on. So you have to understand how the values are split, and in this file they are split by the comma, and we have to tell SQL about this; it's very important. We're going to say field terminator, and then we're going to say it is the comma. Basically, those two pieces of information are what SQL needs in order to be able to read your CSV file. Now, there are many other options that you can go and add. For example, TABLOCK: it is an option to improve the performance by locking the entire table during the load. So as SQL is loading the data into this table, it is going to go and lock the whole table. That's it for now. I'm just going to go and add the semicolon, and let's go and insert the data from the file into our bronze table. Let's execute it. And now we can see SQL inserted around 18,000 rows into our table. So it is working; we just loaded the file into our database. But it is not enough to just write this script: you have to test the quality of your bronze table, especially if you are working with files. So let's go and do a simple select from our new table, and let's run it. The first thing that I check is: do we have data in each column? Well yes, as you can see, we have data. The second thing is: do we have the data in the correct columns? This is very critical when you are loading data from a file into a database. For example, here we have the first name, which of course makes sense, and here we have the last name. But what could happen, and this mistake happens a lot, is that you find the first name information inside the key, the last name inside the first name, and the status inside the last name. So there is a shifting of the data, and this data engineering mistake is very common if you are working with CSV files. There are different reasons why it happens: maybe the definition of your table is wrong, or the field separator is wrong, maybe it's not a comma, it's something else; or the separator is a bad separator, because sometimes in the keys or in the first names there is a comma, and then SQL is not able to split the data correctly. So the quality of the CSV file is not really good, and there are many different reasons why you might not get the data in the correct column. But for now everything looks fine for us. The next step is that I go and count the rows inside this table. Let's go and select that: we can see we have 18,493. Now we can go to our CSV file and check how many rows we have inside this file, and as you can see, we have 18,494. We are almost there; there is one extra row inside the file, and that's because of the header: the header information in the first row is not loaded into our table, and that's why our tables will always have one row less than the original files. So everything looks nice, and we have done this step correctly. Now, if I go and run it again, what is going to happen? We will get duplicates inside the bronze layer. We have now loaded the file twice into the same table, which is not really correct. The method that we have discussed is to first make the table empty and then load: truncate and then insert.
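Written out, the load statement so far looks roughly like this; the file path is just an example, use the path on your machine:

BULK INSERT bronze.crm_cust_info
FROM 'C:\sql-data-warehouse-project\datasets\source_crm\cust_info.csv'
WITH (
    FIRSTROW = 2,           -- the first row is the header; real data starts at row 2
    FIELDTERMINATOR = ',',  -- values in this file are separated by commas
    TABLOCK                 -- lock the whole table during the load for speed
);

But as we just saw, on its own this appends the data on every run, so we still need to empty the table first.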
In order to do that, before the bulk insert we're going to say truncate table, then our table name, and that's it, with a semicolon. So what we are doing now is first making the table empty and then loading from scratch: we are loading the whole content of the file into the table, and this is what we call a full load. Now let's go and highlight everything together and execute. And again, if you go and check the content of the table, you can see we have only the 18,000 rows. Let's go and run it again; the count of the bronze table, you can see, is still the 18,000. So each time you run this script, we are refreshing the table customer info from the file into the database table; we are refreshing the bronze layer table. That means if there are any changes in the file now, they will be loaded into the table. So this is how we do a full load in the bronze layer: by truncating the table and then doing the insert. And now, of course, what you have to do is pause the video and go write the same script for all six files. So let's go and do that. Okay, I'm back. I hope that you have written all those scripts as well. So I have three sections in order to load the first source system, and then three sections in order to load the second source system. As you are writing those scripts, make sure to have the correct path; for the second source system, you have to go and change the path to the other folder. And don't forget: the table name in the bronze layer is different from the file name, because we always start with the source system name, and the files don't have that. So now I think everything is ready. Let's go and execute the whole thing. Perfect. Awesome. Everything is working. Let me check the messages: we can see from the messages how many rows were inserted into each table. And now, of course, the task is to go through each table and check the content. So that means we now have a really nice script to load the bronze layer, and we will use this script on a daily basis: every day we have to run it in order to get new content into the data warehouse. And as we learned before, if you have a SQL script that is frequently used, what we can do is go and create a stored procedure from it. So let's go and do that; it's going to be very simple. We're going to go over here and say create or alter procedure, and now we have to define the name of the stored procedure. I'm going to go and put it in the schema bronze, because it belongs to the bronze layer, and then we're going to go and follow the naming convention: the stored procedure starts with load, underscore, and then the layer, bronze. So that's it about the name, and then, very importantly, we have to define the begin and the end of our SQL statements. So here is the begin, and let's go to the end and say this is the end. Then let's go and highlight everything in between and give it one push with tab, so that it is easier to read. Next, we're going to go and execute it; let's go and create this stored procedure. And now, if you want to go and check your stored procedure, you go to the database, and we have here a folder called programmability; inside it we have the stored procedures. If you go and refresh, you will see our new stored procedure. Let's go and test it. I'm going to go and open a new query, and what we're going to do, we're going to say execute bronze.load_bronze.
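At this stage, a minimal skeleton of the stored procedure looks like this; the path is an example, and the remaining five tables follow the same truncate-plus-bulk-insert pattern:

CREATE OR ALTER PROCEDURE bronze.load_bronze AS
BEGIN
    -- Full load: empty the table, then reload it from the file
    TRUNCATE TABLE bronze.crm_cust_info;
    BULK INSERT bronze.crm_cust_info
    FROM 'C:\sql-data-warehouse-project\datasets\source_crm\cust_info.csv'
    WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', TABLOCK);

    -- ... same pattern for the other five tables
END;
GO

EXEC bronze.load_bronze;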
So let's go and execute it. And with that, we have just loaded the bronze layer completely. As you can see, SQL went and inserted all the data from the files into the bronze layer. It is way easier than running those scripts each time, of course. All right. Now, the next step: as you can see, the output message really doesn't contain a lot of information. The messaging of your ETL stored procedure will not be really clear. That's why, if you are writing an ETL script, always take care of the messaging of your code. So let me show you a nice design. Let's go back to our stored procedure. What we can do is divide the messages based on our code. We can start with the main message; for example, over here let's say PRINT, and we say what we are doing with this stored procedure: we are loading the bronze layer. This is the main message, the most important one, and we can play with the separators. So we say PRINT and add some nice separators, for example equals signs at the start and at the end, just to have a section. So this is just a nice message at the start. Now, looking at our code, we can see that it is split into two sections: in the first section we are loading all the tables from the source system CRM, and the second section is loading the tables from the ERP. So we can split the prints by source system. Let's do that. We're going to say PRINT, and we're going to say loading CRM tables. This is for the first section. Then we can add some nice separators; let's take the minus. And of course, don't forget to add semicolons like me; we're going to have a semicolon for each print. Same thing over here: I will copy the whole thing, because we're going to have it at the start as well as at the end. Let's copy the whole thing for the second section. For the ERP, it starts over here, and we're going to have it like this, and we're going to call it loading ERP tables. So with that, in the output, we can see a nice separation between loading each source system. Now we go to the next step, where we add a print for each action. So for example, here we are truncating the table. So we say PRINT, and we add two arrows and say what we are doing: we are truncating the table, and then we can add the table name in the message as well. So this is the first action that we are doing, and we can add another print for inserting the data: we say inserting data into, and then we have the table name. With that, in the output, we can understand what SQL is doing. So let's repeat this for all the other tables. Okay, I just added all those prints, and don't forget the semicolon at the end. So I would say let's execute it and check the output. Let's do that, and then, just to have a quick output, execute our stored procedure at the start like this. So let's see: if you check the output now, you can see things are more organized than before. At the start we read: okay, we are loading the bronze layer. First we are loading the source system CRM, and the second section is for the ERP, and we can see the actions: truncating, inserting, truncating, inserting, for each table, and the same for the second source. So as you can see, it is cosmetic, but it's very important when you are debugging any errors.
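A sketch of the messaging pattern described here (the exact wording of the messages is illustrative):

    PRINT '================================================';
    PRINT 'Loading Bronze Layer';
    PRINT '================================================';

    PRINT '------------------------------------------------';
    PRINT 'Loading CRM Tables';
    PRINT '------------------------------------------------';

    PRINT '>> Truncating Table: bronze.crm_cust_info';
    TRUNCATE TABLE bronze.crm_cust_info;
    PRINT '>> Inserting Data Into: bronze.crm_cust_info';
    -- the BULK INSERT for this table goes here, then the same pair of prints per table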
And speaking of errors, we have to handle the errors in our stored procedure. So let's do that. It's the first thing we do: we say BEGIN TRY, then we go to the end of our script and, before the last END, we say END TRY. And then the next thing we add is the catch: we say BEGIN CATCH and END CATCH. Now, first, let's organize our code. I'm going to take the whole code and give it one more push, and the BEGIN TRY as well, so it is more organized. As you know, with TRY and CATCH, SQL is going to execute the TRY, and if there is any error while executing this script, the second section is going to be executed. So the CATCH will be executed only if SQL failed to run the TRY. Now what we have to do is define for SQL what to do if there is an error in your code. And here we can do multiple things, like maybe creating a logging table and adding the messages to that table, or we can add some nice messaging to the output. For example, we can add a section again over here, so again some equals signs, and we can repeat them over here and add some content in between. We can start with something like: error occurred during loading bronze layer. Then we can add many things; for example, we can add the error message, and here we can call the function ERROR_MESSAGE, and we can also add, for example, the error number with ERROR_NUMBER. Of course, the output of this is going to be a number, but the error message here is text, so we have to change the data type: we do a cast, CAST AS NVARCHAR, like this. And there are many more functions that you can add to the output, like for example ERROR_STATE and so on. So you can design what happens if there is an error in the ETL.
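A minimal sketch of this TRY/CATCH block, assuming the errors are printed rather than logged to a table:

    BEGIN TRY
        -- the whole loading logic of the procedure goes here
        PRINT 'Loading completed';
    END TRY
    BEGIN CATCH
        PRINT '==========================================';
        PRINT 'ERROR OCCURRED DURING LOADING BRONZE LAYER';
        PRINT 'Error Message: ' + ERROR_MESSAGE();
        PRINT 'Error Number : ' + CAST(ERROR_NUMBER() AS NVARCHAR);
        PRINT 'Error State  : ' + CAST(ERROR_STATE() AS NVARCHAR);
        PRINT '==========================================';
    END CATCH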
Now, what else is very important in each ETL process is to add the duration of each step. For example, I would like to understand how long it takes to load this table over here. But looking at the output, I don't have any information about how long it takes to load my tables. And this is very important, because as you are building a big data warehouse, the ETL process is going to take a long time, and you would like to understand where the issue is, where the bottleneck is, which table is consuming a lot of time to be loaded. That's why we have to add this information to the output as well, or maybe even log it in a table. So let's add this step. We go to the start, and in order to calculate the duration, you need the start time and the end time; we have to know when we started loading the table and when we finished. So the first thing is to declare the variables. We say DECLARE, and then let's make one called start_time, and its data type is going to be DATETIME; I need exactly the second when it started. Then another one for the end time: another variable end_time, again DATETIME. With that we have declared the variables, and the next step is to use them. Let's go to the first table, the customer info, and at the start we say SET start_time equal to GETDATE(), so we get the exact time when we start loading this table. Then let's copy the whole thing, go to the end of the load over here, and say SET end_time equal to GETDATE() as well. So with that, we now have the values of when we started loading this table and when we completed it. The next step is to print the duration. Over here we say PRINT, and we can use the same design again: two arrows, then very simply load duration, then a colon and a space. Now we have to calculate the duration, and we can do that using the date and time function DATEDIFF, which finds the interval between two dates. So we say plus over here and then use DATEDIFF. Here we have to define three arguments. The first one is the unit; you can define second, minute, hour, and so on, and we're going to go with second. Then we define the start of the interval, which is going to be the start time, and the last argument is the end of the interval, which is the end time. And of course, the output of this is going to be a number, so we have to cast it: we say CAST AS NVARCHAR, close it like this, and maybe at the end we say plus a space and the word seconds, in order to have a nice message. So again, what we have done: we declared two variables; at the start we get the current date and time, at the end of loading the table we get the current date and time again, and then we find the difference between them in order to get the load duration, and in this case we are just printing this information. Now we can add a nice separator between each table, so I'm going to do it like this, just a few minus signs, nothing much. So what we have to do now is add this mechanism for each table, in order to measure the speed of the ETL for each of them. Okay. Now I have added all those configurations for each table, so let's run the whole thing. Let's alter the stored procedure and run it. Let's execute. As you can see, we have one more piece of info here about the load durations, and everywhere I can see zero seconds. That's because loading this information is super fast: we are doing everything locally on one PC, so loading the data from files into the database is going to be very fast. But of course, in real projects you have different servers with networking between them, and you have millions of rows in the tables, so the duration is not going to be zero seconds; things are going to be slower. And now you can easily see how long it takes to load each of your tables. Of course, what is also very interesting is to understand how long it takes to load the whole bronze layer. So now your task is to also print, at the end, information about the whole batch: how long it took to load the bronze layer. Okay, I hope we are done. I have done it like this: we define two new variables, the start time of the batch and the end time of the batch. The first step in the stored procedure is to get the date and time for the first variable, and at the very end, the last thing we do in the stored procedure is get the date and time for the end time. So we say again SET GETDATE() for the batch end time. And then all we have to do is print a message.
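Putting the timing pieces together, a sketch of the pattern for one table plus the whole batch (the variable names are assumptions):

    DECLARE @start_time DATETIME, @end_time DATETIME,
            @batch_start_time DATETIME, @batch_end_time DATETIME;

    SET @batch_start_time = GETDATE();

    SET @start_time = GETDATE();
    -- truncate + bulk insert for one table goes here
    SET @end_time = GETDATE();
    PRINT '>> Load Duration: '
        + CAST(DATEDIFF(second, @start_time, @end_time) AS NVARCHAR) + ' seconds';
    -- ...repeat the start/end pair for each table...

    SET @batch_end_time = GETDATE();
    PRINT 'Total Load Duration: '
        + CAST(DATEDIFF(second, @batch_start_time, @batch_end_time) AS NVARCHAR) + ' seconds';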
So we are saying loading bronze layer is completed, and then we are printing the total load duration, again with a DATEDIFF between the batch start time and the batch end time, calculating the seconds and so on. Now what we have to do is execute the whole thing. Let's refresh the definition of the stored procedure and then execute it. In the output we go to the last message, and we can see: loading bronze layer is completed, and the total load duration is also 0 seconds, because the execution time is less than one second. So with that, you are getting a feeling for how to build an ETL process. As you can see, data engineering is not only about how to load the data; it's about how to engineer the whole pipeline: how to measure the speed of loading the data, what happens if there is an error, and how to print each step in your ETL process and make everything organized and clear in the output, and maybe in the logging, just to make debugging and optimizing the performance way easier. And there are a lot of things that we can still add; we can add quality measures and so on. We can add many things to our ETL script to make our data warehouse professional. All right, my friends. So with that, we have developed the code to load the bronze layer, and we have tested it as well. Now, in the next step, we're going to go back to draw.io, because we want to draw a diagram of the data flow. So let's go. Now, what is a data flow diagram? We're going to draw a simple visual to map the flow of your data: where it comes from and where it ends up. We just want to make clear how the data flows through the different layers of your project, and that helps us create something called data lineage. This is really nice, especially if you are analyzing an issue. If you have multiple layers and you don't have a real data lineage or flow, it's going to be really hard to analyze the scripts in order to understand the origin of the data, and having this diagram is going to improve the process of finding issues. So now let's go and create one. Okay, back to draw.io, and we're going to build the flow diagram. We start with the source systems. Let's build the layer: I'm going to remove the fill, and then we add a box saying sources and put it over here, increase the size to 24, and leave it without any lines. Now, what do we have inside the sources? We have folders and files. So let's search for a folder icon; I'm going to take this one over here and label it CRM. We can also increase the size. And we have another source: the ERP. Okay, so this is the first layer. Now let's do the bronze layer. We grab another box and make the coloring like this, and instead of auto, maybe take the hatch style, something like this, whatever you like, and rounded. Then we can put a title on top of it, saying bronze layer, and increase the font size as well. So now we're going to add boxes for each table that we have in the bronze layer. For example, we have the sales details; we can make it a little bit smaller, maybe 16, and not bold. And we have two other tables from the CRM: the customer info and the product info.
So those are the three tables that come from the CRM. Now we're going to connect the source CRM with those three tables: we go to the folder and start drawing arrows from the folder to the bronze layer, like this. And then we do the same thing for the ERP source. So as you can see, the data flow diagram shows us, in one picture, the data lineage between the two layers. Here we can easily see that those three tables come from the CRM, and those three tables in the bronze layer come from the ERP. I understand that if you have a lot of tables, it's going to be a huge mess, but if you have a small or medium data warehouse, building these diagrams makes it really easy to understand how everything flows from the sources into the different layers of your data warehouse. All right. So with that, we have the first version of the data flow. This step is done, and the final step is to commit our code to the Git repo. Okay, so let's go and commit our work. Since these are scripts, we go to the folder scripts, and here we're going to have scripts for the bronze, silver, and gold layers; that's why it makes sense to create a folder for each layer. Let's start by creating the bronze folder. I'm going to create a new file, type bronze slash, and then we can have the DDL script of the bronze layer as a .sql file. Now I'm going to paste the DDL code that we created, those six tables, and as usual, at the start we have a comment explaining the purpose of this script. We are saying: this script creates tables in the bronze schema, and by running this script you are redefining the DDL structure of the bronze tables. Let's have it like that, and I'm going to commit the changes. All right. Now, as you can see, inside scripts we have a folder called bronze, and inside it we have the DDL script for the bronze layer. In the bronze folder we're also going to put our stored procedure. So let's create a new file, call it proc_load_bronze.sql, and paste our script. As usual, I have put an explanation about the stored procedure at the start. We are saying: this stored procedure loads the data from the CSV files into the bronze schema; it truncates the tables first and then does a bulk insert. And about the parameters: this stored procedure does not accept any parameters or return any values. And here is a quick example of how to execute it.
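As a sketch of the kind of header comment described here (the wording is illustrative, not the exact text from the project):

    /*
    ===============================================================================
    Stored Procedure: Load Bronze Layer (Source -> Bronze)
    ===============================================================================
    Purpose:
        Loads data from external CSV files into the 'bronze' schema.
        Truncates the bronze tables first, then performs a BULK INSERT.
    Parameters:
        None. This stored procedure does not accept parameters or return values.
    Usage:
        EXEC bronze.load_bronze;
    ===============================================================================
    */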
All right, I think I'm happy with that, so let's commit it. All right, my friends, with that we have committed our code into Git, and we are done building the bronze layer; the whole epic is done. Now we're going to the next layer, and this one is going to be more advanced than the bronze layer, because there will be a lot of struggle with cleaning the data and so on. We're going to start with the first task, where we analyze and explore the data in the source systems. So let's go. Okay, so now we start with the big question: how do we build the silver layer? What is the process? As usual, first things first: we have to analyze. And now, the task: before building anything in the silver layer, we have to explore the data in order to understand the content of our sources. Once we have that, we will start coding, and the transformation we're going to do here is data cleansing. This is usually a process that takes a really long time, and I usually do it in three steps. The first step is to check the data quality issues that we have in the bronze layer; before writing any data transformations, we first have to understand what the issues are. Only then do I start writing the data transformations in order to fix all those quality issues that we have in the bronze. And the last step, once I have clean results, is to insert them into the silver layer. Those are the three phases that we will go through as we write the code for the silver layer. Then, once we have all the data in the silver layer, we have to make sure that the data is now correct and we don't have any quality issues anymore. If you find any issues, of course, you go back to coding, do the data cleansing, and validate again. So it is a cycle between validating and coding. Once the quality of the silver layer is good, we can move to the last phase, where we document and commit our work in Git. Here we're going to have two new pieces of documentation: we're going to build the data flow diagram as well as the data integration diagram, after we have understood the relationships between the sources in the first step. So this is the process, and this is how we're going to build the silver layer. All right. So now: exploring the data in the bronze layer. Why is it very important? Because understanding the data is the key to making smart decisions in the silver layer. In the bronze layer, the focus was not on understanding the content of the data at all; we focused only on how to get the data into the data warehouse. That's why we now have to take a moment to explore and understand the tables, as well as how to connect them: what the relationships between these tables are. And it is very important, as you are learning about a new source system, to create some kind of documentation. So now let's go and explore the sources, one by one. We can start with the first one from the CRM: we have the customer info. Right-click on it and say select top 1000 rows. This is of course important if you have a lot of data: don't go and explore millions of rows; always limit your query. For example, here we are using the top 1000, just to make sure that you are not impacting the system with your queries. So now let's have a look at the content of this table. We can see that we have customer information here: we have an ID, we have a key for the customer, we have first name, last name, marital status, gender, and the creation date of the customer. So simply put, this is a table for customer information, with a lot of details about the customers. And here we have two identifiers.
One is a technical ID, and the other is the customer number, so maybe we can use either the ID or the key in order to join it with other tables. Now, what I usually do is draw a data model, or let's say an integration model, just to document and visualize what I am understanding, because if you don't do that, you're going to forget it after a while. So we go and search for a shape; let's search for a table, and I'm going to pick this one over here. Here we can change the style, for example make it rounded, or make it sketch, and so on. And we can change the color; I'm going to make it blue. Then go to the text, make sure to select the whole thing, and let's make it bigger: 26. And then, for those items, I'm just going to select them, go to Arrange, and maybe make it 40, something like this. So now we just put in the table name. This is the one that we are now learning about. And I'm just going to put the primary key here; I will not list all the columns. So the primary key was the ID, and I will remove all the other entries; I don't need them. Now, as you can see, the table name is not really friendly, so I can bring in a text element, put it on top, and say: this is the customer information. Just to make it friendly and to not forget about it. And I'll increase the size as well, to maybe 20, something like this. Okay, with that we have our first table, and we're going to keep exploring. Let's move to the second one. We take the product info, right-click on it, and select the top 1000 rows. I will just put it below the previous query and run it. Now, by looking at this table, we can see we have product information. We have a primary key for the product, then we have a key, or let's say a product number, and after that we have the full name of the product, the product cost, then the product line, and then we have a start and an end. Well, this is interesting; let's understand why we have a start and an end. Have a look, for example, at these three rows: all three have the same key, but they have different IDs. So it is the same product, but with different costs. For 2011 we have a cost of 12, then for 2012 we have 14, and for the last year, 2013, we have 13. So we have a history of the changes. This table holds not only the current information of the product but also the historical information, and that's why we have those two dates, start and end. Now let's go back and draw this information over here. I'm just going to duplicate it. The name of this table is going to be the product info, and let's give it a short description: current and historical product information, something like this, just so we don't forget that we have history in this table. And here we have the product ID as well. There is nothing that we can use to join those two tables: we don't have a customer ID here, and in the other table we don't have any product ID. Okay, so that's it for this table. Let's jump to the third and last table in the CRM. So let's go and select; I just shortened the other queries as well. Let's execute. So what do we have over here? We have a lot of information about the orders, the sales, and a lot of measures: the order number, and we have the product key.
So this is something that we can use to join it with the product table. We have the customer ID; we don't have the customer key. So here we have an ID, and there we have a key, and there are two different ways to join the tables. Then we have dates: the order date, the shipping date, the due date, and then we have the sales amount, the quantity, and the price. So this is like an event table; it is a transactional table about the orders and sales, and it is a great table for connecting the customers with the products and with the orders. So let's document this new information. The table name is the sales details, and we can describe it like this: transactional records about sales and orders. Now we have to describe how we can connect this table to the other two. We are not using the product ID; we are using the product key. And we need a new column over here: you can hold Ctrl and Enter, or you can go over here and add a new row. The other row is going to be the customer ID. Now, for the customer ID, it is easy: we can grab an arrow and connect those two tables. But for the product, we are not using the ID; that's why I'm just going to remove this one and say product key. Let's check again: this is the product key, not the product ID. And if we check the other table, the product info, you can see we are using this key and not the primary key. So what we're going to do now is just link it like this, and maybe switch those two tables; I will put the customers below. Perfect, it looks nice. Okay, let's keep moving. Let's go now to the other source system. We have the ERP, and the first one is the customer table, with this cryptic name. Let's select the data. Now, this is a small table, and we have only three pieces of information: we have something called CID, then we have what I think is the birth date, and the gender information; we have male, female, and so on. So it looks again like customer information, but here we have extra data about the birth date. Now, compare it to the customer table that we have from the other source system; let's query it. You can see the new table from the ERP doesn't have IDs; it actually has the customer number, or the key. So we can join those two tables using the customer key. Let's document this information. I will just copy, paste, and put it here on the right side, and change the color, since we are now talking about a different source system. And here the table name is going to be this one, and the key is called CID. Now, in order to join this table with the customer info, we cannot join it on the customer ID; we need the customer key. That's why we add a new row here: Ctrl Enter, and we say customer key. Then we make a nice arrow between those two keys, and we give it a description: customer information. And here we have the birth date. Okay.
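To summarize the join paths identified so far, here is a sketch (all table and column names are assumptions based on the walkthrough, not confirmed definitions):

    SELECT *
    FROM bronze.crm_sales_details s
    JOIN bronze.crm_cust_info  c ON s.sls_cust_id = c.cst_id   -- sales -> customers via the ID
    JOIN bronze.crm_prd_info   p ON s.sls_prd_key = p.prd_key  -- sales -> products via the key
    JOIN bronze.erp_cust_az12  e ON c.cst_key = e.cid;         -- CRM customers -> ERP customers via the key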
So let's keep going. We go to the next one: we have the ERP location. Let's query this table. So, what do we have over here? We have the CID again, and as you can see, we have country information. This is, of course, again the customer number, and we have only this one piece of information, the country. So let's document this: this is the customer location. The table name is going to be like this, and we still have the same ID: we still have the customer ID here, and we can join it using the customer key. We give it the description location of customers, and we note the country here. Okay. So now let's go to the last table and explore it: the ERP category table, the PX catalog. Let's query it. What do we have here? We have an ID, a category, a subcategory, and the maintenance flag, where we have either yes or no. So looking at this table, we have all the categories and subcategories of the products, and we have a special identifier for this information. Now the question is how to join it. I would like to join it with the product information, so let's check those two tables together. Okay, so in the product table we don't have any ID for the categories, but we actually have this information inside the product key: the first five characters of the product key are actually the category ID. So we can use this to join it with the categories. We can describe this information like this, and then we give it a name. Here we have the ID, and the ID can be joined using the product key. So that means, for the product information, we don't need the product ID, the primary key, at all; all we need is the product key, or the product number. And what I would like to do is group this information in a box. So let's grab a box here on the left side, make it bigger, and make the edges a little bit smaller. Let's remove the fill and the line; I will make a dotted line. Then let's grab another box over here and say: this is the CRM. We can increase the size, maybe something like 40, or smaller, 35, bold, change the color to blue, and just place it here on top of this box. With that, we can see that all those tables belong to the source system CRM, and we can do the same for the right side as well. Now, of course, we have to add the description here: it's going to be the product categories. All right. So with that, we now have a clear understanding of how the tables are connected to each other. We understand the content of each table, and of course this will help us clean up the data in the silver layer in order to prepare it. As you can see, it is very important to take the time to understand the structure of the tables and the relationships between them before starting to write any code. All right. So with that, we now have a clear understanding of the sources, and we have also created a data integration diagram in draw.io, so we have a better understanding of how to connect the sources. Now, in the next two tasks, we will go back to SQL, where we start checking the quality and doing a lot of data transformations. So let's go. Okay, so now let's have a quick look at the specifications of the silver layer. The main objective is to have clean and standardized data; we have to prepare the data before going to the gold layer. We will be building tables inside the silver layer, and the way of loading the data from the bronze to the silver is a full load, which means we're going to truncate and then insert. And here we're going to have a lot of data transformations: we're going to clean the data, apply normalization and standardization, derive new columns, and do data enrichment as well.
So there is a lot to be done in the data transformations, but we will not be building any new data model. Those are the specifications, and we have to commit ourselves to this scope. Okay. Now, building the DDL script for the silver layer is going to be way easier than for the bronze, because the definition and structure of each table in the silver layer is going to be identical to the bronze layer; we are not doing anything new. So all you have to do is take the DDL script from the bronze layer and just search and replace the schema. I'm using Notepad++ for the scripts, so I'm going to go over here, replace 'bronze.' with 'silver.', and replace all. With that, all the DDL is now targeting the silver schema, which is exactly what we need. All right. Now, before we execute our new DDL script for the silver layer, we have to talk about something called metadata columns. They are additional columns, or fields, that data engineers add to each table and that don't come directly from the source systems; the data engineers use them to provide extra information for each record. For example, we can add a column called create date, which is when the record was loaded, or an update date, for when the record got updated, or we can add the source system in order to understand the origin of the data, or sometimes the file location, in order to understand the lineage, i.e. which file the data came from. These are a great tool if you have a data issue in your data warehouse, if there is corrupt data and so on: they can help you track exactly where the issue happened and when. They are also great for understanding whether you have gaps in your data, especially if you are doing incremental loads. It is like putting labels on everything, and you will thank yourself later when you start using them in hard times, when you have an issue in your data warehouse. So now, back to our DDL script, and all you have to do is the following. For example, for the first table, I will add one more extra column at the end. It starts with the prefix dwh, as we have defined in the naming convention, then an underscore; let's call it the create date, and the data type is going to be DATETIME2. And now we can add a default value for it: I want the database to generate this information automatically; we don't have to specify it in any script. So which value? It's going to be GETDATE(). Each record inserted into this table will automatically get a value from the current date and time. So now, as you can see, the naming convention is very important: all the other columns come from the source system, and only this one column comes from the data engineer of the data warehouse. Okay, so that's it. Let's repeat the same thing for all the other tables; I will just add this piece of information to each DDL. All right, I think that's it. All you have to do now is execute the whole DDL script for the silver layer. Let's do that. All right, perfect, there are no errors. Let's refresh the tables in the object explorer. And with that, as you can see, we have six tables for the silver layer. They are identical to the bronze layer, but we have one extra column for the metadata. All right.
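A sketch of one silver DDL with the metadata column; the column names and lengths here are illustrative assumptions:

    CREATE TABLE silver.crm_cust_info (
        cst_id             INT,
        cst_key            NVARCHAR(50),
        cst_firstname      NVARCHAR(50),
        cst_lastname       NVARCHAR(50),
        cst_marital_status NVARCHAR(50),
        cst_gndr           NVARCHAR(50),
        cst_create_date    DATE,
        dwh_create_date    DATETIME2 DEFAULT GETDATE()  -- metadata: set by the database, not the source
    );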
So now, in the silver layer, before we start writing any data transformations and cleansing, we first have to detect the quality issues in the bronze layer. Without knowing the issues, we cannot find solutions, right? We will explore the quality issues first, and only then start writing the transformation scripts. So let's go. Okay. Now, what we're going to do is go through all the tables in the bronze layer, clean up the data, and then insert it into the silver layer. Let's start with the first table, the first bronze table from the source CRM: the bronze CRM customer info. Let's query the data over here. Now, of course, before writing any data transformations, we have to detect and identify the quality issues of this table. I usually start with the first check, where we check the primary key: we have to check whether there are nulls inside the primary key and whether there are duplicates. In order to detect duplicates in the primary key, what we have to do is aggregate on the primary key: if we find any value in the primary key that exists more than once, that means it is not unique and we have duplicates in the table. So let's write a query for that. We take the customer ID, then count, and then we have to group the data: GROUP BY on the primary key. And of course we don't need all the results, only the ones where we have an issue, so we say HAVING count higher than one; we are interested in the values where the count is higher than one. Let's execute it. Now, as you can see, we have an issue in this table: we have duplicates, because all those IDs exist more than once in the table, which is completely wrong; the primary key should be unique. And you can see as well that we have three records where the primary key is empty, which is also a bad thing. Now, there is one catch here: if we had only a single null, it would not show up in this result. So what I'm going to do is go over here and say OR the primary key IS NULL, just in case we have only one null; I'm still interested in seeing it in the results. If I run it again, we get the same results. So this is a quality check that you can do on the table, and as you can see, it is not meeting the expectation, which means we have to do something about it.
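A sketch of this primary key check (column and table names are assumptions based on the walkthrough); the expectation is that it returns no rows, so anything it returns is a problem:

    -- Find primary key values that are duplicated or null.
    SELECT cst_id, COUNT(*) AS cnt
    FROM bronze.crm_cust_info
    GROUP BY cst_id
    HAVING COUNT(*) > 1 OR cst_id IS NULL;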
So let's create a new query. Here we can start writing the query that does the data transformation and the data cleansing. Let's start again by selecting the data and executing it. Now, what I usually do is focus on the issue. For example, let's take one of those values and focus on it before starting to write the transformation: we say WHERE customer ID equals this value. All right. Now, as you can see, we have the issue here where this ID exists three times, but we are actually only interested in one of them. So the question is how to pick one. Usually we search for a timestamp or date value to help us. If you check the creation date over here, we can see that this record, this one over here, is the newest one, and the previous two are older. That means, if I have to pick one of those values, I would like to get the latest one, because it holds the freshest information. So what we have to do is rank all those values based on the create date and only pick the highest one. That means we need a ranking function, and for that, in SQL, we have the amazing window functions. So let's do that. We will use the function ROW_NUMBER, then OVER, then PARTITION BY, and here we have to divide the table by the customer ID. So we partition it by the customer ID, and in order to rank those rows, we have to sort the data by something: ORDER BY, and as we discussed, we want to sort the data by the creation date, descending, so the highest first, then the lowest. Let's do that, and we give it a name: flag_last. So now let's execute it. Now the data is sorted by the creation date, and you can see over here that this record is number one, then the older one is two, and the oldest one is three. Of course, we are interested in rank number one. Now let's remove the filter and check everything. If you have a look at the table, you can see that in the flag we have ones almost everywhere, and that's because those primary keys exist only once; but sometimes we will not have a one, we'll have a two, a three, and so on, if there are duplicates. We can of course do a double check: let's go over here and say SELECT star FROM this query, WHERE flag_last is not equal to one. Let's query it. And now we can see all the data that we don't need, because these rows are causing duplicates in the primary key and they hold old information. So what we're going to do is say equal to one, and with that we guarantee that our primary key is unique and each value exists only once. If I query it like this, you will see that we won't find any duplicates in our table. And we can check that, of course: let's check this primary key and say AND customer ID equals this value. You can see it now exists only once, and we are getting the freshest data for this primary key. So with that, we have defined a transformation to remove any duplicates. Okay, moving on to the next one. As you can see, in our table we have a lot of string values. For these string values, we have to check for unwanted spaces. So let's write a query that detects those unwanted spaces. We say SELECT this column, the first name, FROM our table, the bronze customer info. Let's query it. Now, just by looking at the data, it's going to be really hard to find those unwanted spaces, especially if they are at the end of the word. But there is a very easy way to detect those issues. We're going to use a filter: we say the first name is not equal to the first name after trimming the values. If you use the function TRIM, what does it do? It removes all the leading and trailing spaces from the first name. So if this value is not equal to the first name after trimming, then we have an issue. It is very simple; let's execute it. In the result, we get a list of all first names where we have spaces either at the start or at the end. So again, the expectation here is no results. And we can check something else the same way, for example the last name. Let's do that over here, and execute. We see in the results that we have 17 customers with spaces in their last name, which is not really good.
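Two sketches of the fix and the check from this step (names are assumptions based on the walkthrough):

    -- Keep only the freshest record per customer; rank 1 = newest create date.
    SELECT *
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY cst_id ORDER BY cst_create_date DESC) AS flag_last
        FROM bronze.crm_cust_info
    ) t
    WHERE flag_last = 1;

    -- Detect unwanted spaces (expectation: no results).
    SELECT cst_firstname
    FROM bronze.crm_cust_info
    WHERE cst_firstname != TRIM(cst_firstname);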
And we can keep checking all the string values that we have inside the table, for example the gender. Let's check that and execute. As you can see, we don't have any results; that means the quality of the gender column is better, and we don't have any unwanted spaces. So now we have to write a transformation to clean up those two columns. What I'm going to do is list all the columns in the query instead of the star. All right, so now I have a list of all the columns that I need, and what we have to do is go to those two columns and start removing the unwanted spaces. We will just use TRIM; it's very simple. Give it a name, of course, the same name, and we will trim the last name as well. Let's query this. And with that, we have cleaned those two columns of any unwanted spaces. Okay, moving on: we have these two columns, the marital status and the gender. If you check the values inside those two columns, you can see we have low cardinality here: there is a limited number of possible values used inside them. So what we usually do is check the data consistency inside these two columns. It's very simple: we say DISTINCT and check the values. Let's do that. As you can see, we have only three possible values: null, F, or M, which is okay; we could leave it like this, of course. But we can make a rule in our project that we will not work with abbreviations; we will use only friendly full names. So instead of an F, we're going to have the full word Female, and instead of an M we're going to have Male, and we make this a rule for the whole project: each time we find gender information, we try to give it its full name. So let's map those two values to friendly ones. We go to the gender over here and say CASE WHEN: when the gender is equal to F, then make it Female, and when it is equal to M, then map it to Male. And now we have to make a decision about the nulls. As you can see over here, we have nulls. Do we want to leave them as null, or do we want to always use a default value? With a default value, we are replacing the missing values with a standard value; or you can leave it as null. Let's say that in our project we are replacing all missing values with a default value. So we say ELSE, and I'm going to go with 'n/a', not available, or you can go with 'unknown', of course. So that's it for the gender, like this, and we can remove the old one. Now, there is one thing that I usually do in this case: currently we are getting a capital F and a capital M, but maybe over time something changes and you start getting a lowercase m or a lowercase f. So just to make sure that in those cases we are still able to map the values correctly, we use the function UPPER, so that if we get any lowercase values, we can still catch them. Same thing over here. And one more thing you can add: of course, if you don't trust the data, because we saw unwanted spaces in the first name and the last name, you might not trust that in the future you won't get unwanted spaces here as well. You can trim everything, just to make sure that you are catching all those cases. So that's it for now; let's execute. Now, as you can see, we don't have an M and an F; we have the full words, Male and Female, and if we don't have a value, we don't have a null: we get 'n/a' here.
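A sketch of this gender mapping (column names are assumptions based on the walkthrough):

    SELECT
        CASE WHEN UPPER(TRIM(cst_gndr)) = 'F' THEN 'Female'
             WHEN UPPER(TRIM(cst_gndr)) = 'M' THEN 'Male'
             ELSE 'n/a'  -- standard default for missing values
        END AS cst_gndr
    FROM bronze.crm_cust_info;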
Now we can do the same thing for the marital status. You can see we again have only three possibilities: an S, a null, and an M. We do the same thing: I will copy everything from here, use the marital status, and just remove this one from here. And now, what are the possible values? We have the S, which is going to be Single; we have an M for Married; and we also have a null, for which we get the 'n/a'. So with that, we are applying data standardization to this column as well. Let's execute it. As you can see, we no longer have those short values; we have full, friendly values for the status as well as for the gender, and at the same time we are handling the nulls inside those two columns. So with that, we are done with those two columns, and we can go to the last one, the create date. For this type of information, we make sure that this column is a real date and not a string or varchar, and as we defined it in the data type, it is a date, which is completely correct. So there's nothing to do with this column. Now, the next step is to write the INSERT statement. How are we going to do it? We go to the start, over here, and say INSERT INTO the silver customer info table. Then we have to specify all the columns that should be inserted, so we type them out, something like this, and then we have the query over here. Let's execute it. So with that, we have inserted clean data into the silver table. Now what we're going to do is take all the queries that we used to check the quality of the bronze layer, move them to another query, and instead of bronze, we say silver. This first one is about the primary key: let's execute it. Perfect, we don't have any results, so we don't have any duplicates. The same for the next one, on the silver table, for the first name: let's check the first name and run it. As you can see, there are no results; perfect, we don't have any issues. You can of course check the last name as well and run it again: no results over here either. And now we can check those low-cardinality columns, like the gender: let's execute. As you can see, we have the 'n/a', Male, and Female. Perfect. And you can have a final look at the table, the silver customer info. Let's check that. Now we can look at all those columns, and as you can see, everything looks perfect, and you can see that the metadata column we added to the table definition is working: it now tells us when we inserted all those records into the table, which is really useful information for tracking and auditing. Okay.
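Putting it all together, a sketch of the full cleansing insert for this table (all names are assumptions based on the walkthrough):

    INSERT INTO silver.crm_cust_info (
        cst_id, cst_key, cst_firstname, cst_lastname,
        cst_marital_status, cst_gndr, cst_create_date
    )
    SELECT
        cst_id,
        cst_key,
        TRIM(cst_firstname) AS cst_firstname,  -- remove unwanted spaces
        TRIM(cst_lastname)  AS cst_lastname,
        CASE WHEN UPPER(TRIM(cst_marital_status)) = 'S' THEN 'Single'
             WHEN UPPER(TRIM(cst_marital_status)) = 'M' THEN 'Married'
             ELSE 'n/a'
        END AS cst_marital_status,             -- map codes to friendly values
        CASE WHEN UPPER(TRIM(cst_gndr)) = 'F' THEN 'Female'
             WHEN UPPER(TRIM(cst_gndr)) = 'M' THEN 'Male'
             ELSE 'n/a'
        END AS cst_gndr,
        cst_create_date
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY cst_id ORDER BY cst_create_date DESC) AS flag_last
        FROM bronze.crm_cust_info
    ) t
    WHERE flag_last = 1;                       -- remove duplicates, keep the newest record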
So now, looking at this script, we have done different types of data transformations. The first one is with the first name and the last name: here we have done trimming, removing unwanted spaces. This is one of the types of data cleansing: we remove unnecessary spaces or unwanted characters to ensure data consistency. Moving on to the next transformation: we have this CASE WHEN. What we have done here is data normalization, or as we sometimes call it, data standardization. This transformation is a type of data cleansing where we map coded values to meaningful, user-friendly descriptions, and we applied the same transformation to the gender. Another type of transformation that we did in the same CASE WHEN is handling the missing values: instead of nulls, we now have 'n/a'. Handling missing data is also a type of data cleansing, where we fill in the blanks, for example by adding a default value; so instead of an empty string or a null, we have a default value like 'n/a' or 'unknown'. Another type of data transformation in this script is that we have removed the duplicates. Removing duplicates is also a type of data cleansing, where we ensure only one record per primary key by identifying and retaining only the most relevant row. And as we are removing the duplicates, we are of course also doing data filtering. So those are the different types of data transformations that we have done in this script. All right, moving on to the second table in the bronze layer from the CRM: we have the product info. And of course, as usual, before we start writing any transformations, we have to search for data quality issues. We start with the first one: we have to check the primary key, i.e. whether we have duplicates or nulls inside this key. So we group the data by the primary key and check whether we have nulls. Let's execute it. As you can see, everything is fine: we don't have duplicates or nulls in the primary key. Moving on to the next one: the product key. This column contains a lot of information, and what we have to do is split this string into two pieces; we are deriving two new columns. Let's start with the first one, the category ID. The first five characters are actually the category ID, and we can use the SUBSTRING function to extract part of a string. It needs three arguments: the first one is the column that we want to extract from; then we have to define the position where to start extracting, and since the first part is on the left side, we start from the first position; and then we have to specify the length, i.e. how many characters we want to extract. We need five characters: 1, 2, 3, 4, 5. So that's it for the category ID. Let's execute it. Now, as you can see, we have a new column called the category ID, and it contains the first part of the string. In our database, from the other source system, we also have the category ID, so we can double check, just to make sure that we can join the data together. We're going to check the ID from the bronze ERP category table. In this table we have the category IDs, and you can see over here that those are the IDs of the categories; in the gold layer, we will have to join those two tables. But here we still have an issue: in that table we have an underscore between the category and the subcategory, but in our derived column we actually have a minus. So we have to replace the minus with an underscore in order to have matching information between those two tables; otherwise we will not be able to join them. So we're going to use the function REPLACE.
And what are we replacing? We are replacing the minus with an underscore, something like this. If you execute it now, we get an underscore, exactly like the other table. And of course, we can check whether everything matches with a very simple query, where we say: this new column NOT IN, and then we have a nice subquery. We are trying to find any category ID that is not available in the second table. Let's execute it. As you can see, we have only one category that doesn't match; we are not finding it in this table, which may well be correct. If you go over here, you will not find this category; I'll just make it a little bit bigger. So we are not finding this one category in that table, which is fine; our check is okay. Okay, so now we have the first part. Next we have to extract the second part, and we're going to do the same thing: we use SUBSTRING with the three arguments and the product key, but this time we will not start cutting from the first position; we have to start in the middle: 1, 2, 3, 4, 5, 6, 7, so we start from position number seven. And now we have to define the length, how many characters to extract. But if you look over here, you can see that the product keys have different lengths; it is not fixed like the category ID. So we cannot specify a fixed number here; we have to make it dynamic, and there is a trick to do that: we use the LEN of the whole column. With that, we make sure that we always get enough characters and we will not lose any information. So we make it dynamic like this, not a fixed length, and with that we have the product key. Let's execute it. As you can see, we are now extracting the second part of this string. Now, why do we need the product key? We need it in order to join with another table, the sales details. So let's check the sales details. Let me just check the column name: it is the sales product key. So, from bronze CRM sales, let's check the data over here. It looks wonderful, so we should be able to join this information together. But of course, we're going to verify that: we say WHERE, take our new column, and say NOT IN the subquery, just to make sure that we are not missing anything. Let's execute. So it looks like we have a lot of products that don't have any orders. Well, I don't have a good feeling about it. Let's try something: take this one here, and say WHERE the sales product key is LIKE this value over here; I'll just cut the last few characters, to search inside this table. We really don't have such keys. Let me cut the second part as well and search for it: we don't have it either. So for anything that starts with this prefix, we don't have any orders for those products. Let's remove it. But we are still able to join the tables, right? If I say IN instead of NOT IN, you can see that we are able to match all those products. So that means everything is fine; these are simply products that don't have any orders. With that, I'm happy with this transformation.
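A sketch of the two derived columns from this step (the positions follow the walkthrough; the names are assumptions):

    SELECT
        prd_key,
        REPLACE(SUBSTRING(prd_key, 1, 5), '-', '_') AS cat_id,      -- first five characters, '-' mapped to '_'
        SUBSTRING(prd_key, 7, LEN(prd_key))         AS prd_key_part -- rest of the string, dynamic length
    FROM bronze.crm_prd_info;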
Now, moving on to the next one: we have the name of the product. We can check whether there are unwanted spaces, so let's go to our quality checks. Make sure to use the same table, and we're going to use the product name and check whether we find any mismatch after trimming. Let's do it. Well, it looks really fine: we don't have to trim anything; this column is safe. Moving on to the next one: the cost. Here we have numbers, and we have to check the quality of the numbers. What can we do? We can check whether we have nulls or negative numbers. Negative costs or negative prices are not realistic, depending on the business, of course; let's say that in our business we don't have any negative costs. So it's going to be like this: let's check whether there is anything less than zero, or whether we have costs that are null. Let's check. Well, as you can see, we don't have any negative values, but we do have nulls. We can handle that by replacing the null with a zero, if the business allows it, of course. In SQL Server, in order to replace a null with a zero, we have a very nice function called ISNULL: we are saying, if it is null, then replace this value with a zero. It is very simple, like this, and we give it a name, of course. Let's execute it. As you can see, we don't have any more nulls; we have zeros, which is better for the calculations if you later use any aggregate functions, like the average. Moving on to the next one: we have the product line. This is again an abbreviation of something, and the cardinality is low, so let's check all the possible values inside this column. We just use DISTINCT on the product line. Let's execute. As you can see, the possible values are null, M, R, S, and T. Again, those are abbreviations, but in our data warehouse we have decided to give full, nice names, so we have to replace those codes, those abbreviations, with friendly values. And of course, in order to get this information, I usually go and ask an expert from the source system or an expert from the business process. So let's start building our CASE WHEN, and let's use UPPER as well as TRIM, just to make sure that we cover all the cases. So: when the product line is equal to... let's start with the first value, the M; then we get the friendly value, which is going to be Mountain. Then to the next one; I will just copy and paste here. If it is an R, then it is Road. And another one; let me check what we have: we have M, R, and then S. The S stands for other sales. And we have the T, which stands for Touring. At the end we have an ELSE for 'n/a'; we don't want any nulls. So that's it, and we're going to name it as before, product line, remove the old one, and execute. As you can see, we no longer have those shortcuts and abbreviations; we now have full, friendly values. I will make the O in Other a capital; it looks nicer. So with that, we have nice friendly values.
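A sketch of the two transformations from this step (column names are assumptions based on the walkthrough):

    SELECT
        ISNULL(prd_cost, 0) AS prd_cost,  -- replace null costs with zero
        CASE WHEN UPPER(TRIM(prd_line)) = 'M' THEN 'Mountain'
             WHEN UPPER(TRIM(prd_line)) = 'R' THEN 'Road'
             WHEN UPPER(TRIM(prd_line)) = 'S' THEN 'Other Sales'
             WHEN UPPER(TRIM(prd_line)) = 'T' THEN 'Touring'
             ELSE 'n/a'
        END AS prd_line                   -- map the abbreviations to friendly values
    FROM bronze.crm_prd_info;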
And the same for the next values, and so on. With that we write the function only once and do not have to keep repeating it over and over. This works only if you are mapping values; if you have complex conditions, you cannot use this form. For now I am going to stay with the quick form of the CASE WHEN, since it looks nicer and shorter. Let's execute it; we get the same results.

Okay, back to our table. Let's go to the last two columns: the start and end date. They define an interval, a start and an end, so let's check their quality. We say SELECT * from our bronze table, and we search like this: we are looking for records where the end date is smaller than the start date. Let's query it. As you can see, the start is always after the end, which makes no sense at all, so we have a data issue with these two dates.

For this kind of data transformation, what I usually do is grab a few examples, put them in Excel, and think about how to fix them. Here I took two products, this one and this one, with three rows each, and we have this situation. So how are we going to fix it? One candidate solution is very simple: switch the start date with the end date. If I grab the end date and put it at the start, things look much nicer, right? The start is now always before the end. But my friends, the data now makes no sense, because we are saying the price was 12 from 2007 to 2011, while at the same time it was 14 from 2008 to 2012. That is really bad, because if you take, for example, the year 2010, the price was 12 and at the same time 14. Having an overlap between those two intervals is really bad. It should run from 2007 to 2011, then start in 2012 and end with something else; there should be no overlap between the periods. So it is not enough to say the start must always be before the end; the end of the current record must also be earlier than the start of the next record. That is the second rule for avoiding overlaps. Also, this record here has no start but already has an end, which is not okay: each new record in a historization has to have a start, so this record is wrong as well. It is fine, however, to have a start without an end: that simply indicates the current information about the cost. So this first solution does not work at all.

For solution two, we completely ignore the end date and take only the start date. Let's paste it over here. Then we rebuild the end date entirely from the start date, following the rules we have defined. The rule says: the end date of the current record comes from the start date of the next record. So this end date here comes from this value in the next record; we take the next start date and use it as the end date of the previous record. And with that, as you can see, it works: the end date is after the start date, and we make sure this period does not overlap with the next record.
To make it even nicer, we can subtract one, taking the previous day, so the end date is strictly before the next start. For the next record, this one over here, the end date again comes from the following start date; we take it, put it as the end date, and subtract one to get the previous day. Comparing these two, the end is still after the start, and compared with the next record it is still earlier, so there is no overlap. For the last record we have no next information, so it will be NULL, which is totally fine. As you can see, I am really happy with this scenario. Of course, you should validate this with an expert from the source system, but let's say I have done that and they approved it, and now I can clean up the data using this new logic. This is how I usually brainstorm about fixing an issue: if it is something complex, I use Excel and then discuss it with the expert using the example. It is much better than showing database queries and so on; it simply makes things easier to explain and to discuss. And how I usually work: while building the logic I focus only on the columns I need and on one or two scenarios, and once everything is ready, I integrate it into the query. So right now I am focusing only on these columns and only on these products.

Now let's build the logic. In SQL, if you are at a specific record and want to access information from another record, there are two amazing window functions for that: LEAD and LAG. In this scenario we want to access the next record, so we go with LEAD. Let's build it. What do we need? The LEAD of the start date, so the start date of the next record. Then we say OVER, and we have to partition the data: the window should focus on one product at a time, by the product key, not the product ID. So we divide the data by product key, and of course we have to sort it: ORDER BY the start date, ascending, from the lowest to the highest. Let's give it a name, say 'test', just to test the data. Let's execute. I missed something here; it is PARTITION BY. Execute again, and let's check the results for the first partition: the start is 2011 and the end is 2012, and this information came from the next record, moved to the previous one; the same goes for this record, where the end date also comes from the next one. So our logic is working. The last record is NULL, because we are at the end of the window and there is no next record, and that is perfect, of course. It looks really awesome. What is still missing is the previous day, and we can get it very simply with minus one: we just subtract one day, so there is no overlap between those two dates, and the same for the next pair. As you can see, we have just built a perfect end date, much better than the original data we got from the source system. Now let's take this and put it inside our query.
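A sketch of the piece just built, assuming the columns are named prd_key, prd_start_dt, and prd_end_dt:

-- End date = day before the next record's start date,
-- computed per product (PARTITION BY) in chronological order.
SELECT
    prd_key,
    prd_start_dt,
    LEAD(prd_start_dt) OVER (
        PARTITION BY prd_key
        ORDER BY prd_start_dt
    ) - 1 AS prd_end_dt   -- NULL on the last record = current version
FROM bronze.crm_prd_info;

-- Note: "- 1" works here because prd_start_dt is DATETIME in bronze;
-- on a DATE column you would write DATEADD(DAY, -1, ...) instead.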
So we do not need the old end date; we need our new end date. Let's remove that 'test' alias and execute. Now it looks perfect. But we are not quite done with these two dates. They are actually datetimes, yet we have no time information at all: the time is always zero. It makes no sense to keep it inside our data, so we can do a very simple CAST and make this column a DATE instead of a DATETIME; this for the first one, and the same AS DATE for the next one. Let's try it out. As you can see, it is nicer; we no longer carry the time information. Of course, we could report all these issues to the source system, but since they do not provide a time, there is no point in storing date and time.

Okay, it was a long run, but we now have cleaned product information, much nicer than the original product data we got from the source CRM. If you grab the DDL of the silver table, you can see it has no category ID yet; we have the product ID and the product key, and for those two date columns we just changed the data type from DATETIME to DATE. That means we have to make a few modifications to the DDL: we add the category ID over here, and for the start and end dates we use DATE, not DATETIME. Let's execute it to repair the DDL. This is something that happens in the silver layer: sometimes we have to adjust the metadata, when the data types are not good or when we are building new derived information for the later integration. So the silver tables stay very close to the bronze layer, but with a few modifications. Make sure to update your DDL scripts.

The next step is to insert the result of this query, which cleans up the bronze table, into the silver table. As we have done before: INSERT INTO the silver product info, and then we list all the columns (I have prepared them already). With that we can run our query to insert the data. As you can see, SQL inserted the data, and the very important next step is to check the quality of the silver table. We go back to our data quality checks and switch them to silver. Let's check the primary key: no issues. The trims: no issue either. Now the costs: not negative and not NULL, which is perfect. The data standardization: the values are friendly and there are no NULLs. And now the interesting one, the order of the dates: no issues there. Finally, I always take a last look at the silver table, and as we can see, everything landed correctly in the correct columns. All of those columns come from the source system, and the last one is generated automatically by the DDL and indicates when we loaded this table. Before we move on, here is the whole step in one place.
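A consolidated sketch of the repaired DDL and the full load; every name and data type here is an assumption in line with the project's conventions, and 'n/a' stands in for 'not available':

IF OBJECT_ID('silver.crm_prd_info', 'U') IS NOT NULL
    DROP TABLE silver.crm_prd_info;
CREATE TABLE silver.crm_prd_info (
    prd_id          INT,
    cat_id          NVARCHAR(50),    -- new derived column
    prd_key         NVARCHAR(50),
    prd_nm          NVARCHAR(50),
    prd_cost        INT,
    prd_line        NVARCHAR(50),
    prd_start_dt    DATE,            -- was DATETIME in bronze
    prd_end_dt      DATE,            -- was DATETIME in bronze
    dwh_create_date DATETIME2 DEFAULT GETDATE()  -- metadata column
);

INSERT INTO silver.crm_prd_info (prd_id, cat_id, prd_key, prd_nm,
                                 prd_cost, prd_line, prd_start_dt, prd_end_dt)
SELECT
    prd_id,
    REPLACE(SUBSTRING(prd_key, 1, 5), '-', '_'),          -- derived category ID
    SUBSTRING(prd_key, 7, LEN(prd_key)),                  -- derived product key
    prd_nm,
    ISNULL(prd_cost, 0),                                  -- missing -> 0
    CASE UPPER(TRIM(prd_line))                            -- code -> friendly value
        WHEN 'M' THEN 'Mountain'
        WHEN 'R' THEN 'Road'
        WHEN 'S' THEN 'Other Sales'
        WHEN 'T' THEN 'Touring'
        ELSE 'n/a'
    END,
    CAST(prd_start_dt AS DATE),
    CAST(LEAD(prd_start_dt) OVER (PARTITION BY prd_key
                                  ORDER BY prd_start_dt) - 1 AS DATE)
FROM bronze.crm_prd_info;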
Now let's sit back and look at our script: what are the different types of data transformations we have done? For the category ID and the product key, we have derived new columns: we create a new column based on calculations or transformations of an existing one. Sometimes we need columns only for analytics, and we cannot go to the source system each time and ask them to create them; instead we derive the columns we need ourselves. Another transformation is the ISNULL over here: we are handling missing information, a zero instead of a NULL. One more is the product line, where we have done data normalization: instead of a code we have a friendly value, and at the same time we handled missing data, for example 'not available' instead of a NULL. Moving on, we have done data type casting, converting from one data type to another, which also counts as a data transformation. And for the last one, besides the casting, what is more important is data enrichment: this type of transformation is all about adding value to your data, adding new, relevant data to the data set. So those are the different types of data transformations we have done for this table.

Okay, let's keep going. We have the sales details, the last table in the CRM. What do we have here? The order number is a string, so we can check whether there is an issue with unwanted spaces: we search with TRIM, like this, and execute. We can see that there are no unwanted spaces, so there is nothing to transform for this column; we can leave it as it is.

The next two columns are keys and IDs for connecting this table with the others. As we learned before, we use the product key to connect to the product information, and the customer ID to connect to the customer info. That means we have to check whether everything works: we check the integrity of those columns, WHERE the product key NOT IN a subquery, and this time we can work with the silver layer, so the product key from the silver product info. Let's query it: we are not getting any issues, so all the product keys in the sales details can be connected with the product info. The same for the integrity of the customer ID: we go to the customer info, and the column was cst_id. Let's query: again no issues, so we can connect the sales with the customers through the customer ID, and there is nothing to transform. Things look really nice for those three columns.

Now we come to the challenging ones: the dates. These dates are not actual dates; they are integers, plain numbers, and we do not want to keep them like this. We would like to clean that up and change the data type from integer to date. If you want to convert an integer to a date, you have to be careful with the values inside each of those columns. So let's check the quality of the order date, for example: WHERE the order date is less than zero, something negative. We do not have any negative values, which is good. Let's check whether we have any zeros. Well, this is bad.
We have a lot of zeros. What can we do? We can replace them with NULL, using of course the NULLIF function: if it is zero, make it NULL. Let's execute; as you can see, all those values are now NULL. Now let's check the data again. This integer carries the year at the start, then the month, then the day, so it has to have exactly eight digits; if the length is less than eight or more than eight, we have an issue. Let's check: OR LEN of the order date is not equal to eight, meaning shorter or longer. Execute, and look at the results: these two values do not look like dates at all, so we cannot turn them into real dates; they are simply bad data quality. You can also check the boundaries of the date: for example, it should not be higher than, say, 2050 followed by any month and day. Let's execute, and if we remove the other conditions just to be sure, we can confirm there is no date outside the boundaries that apply to your business. Or you check the other direction: the date should not be earlier than whenever your business started, maybe something like this. We of course still get the zeros, because they fall below that boundary, but if there were values around those dates, this query would catch them as well. So we can add the rest of the conditions. All these checks validate a column that holds date information but has an integer data type.

So, once again, what are the issues here? We have zeros, and sometimes strange numbers that cannot be converted to a date. Let's fix that in our query. We say CASE WHEN the order date is equal to zero, OR the length of the order date is not equal to eight, THEN NULL. We do not want to deal with those values; they are simply wrong, they are not real dates. Otherwise, ELSE, we take the order date. And now we convert this to a date, because we do not want an integer. How do we do that? We cast it first to a VARCHAR, because in SQL Server we cannot cast directly from integer to date: first integer to VARCHAR, and then from VARCHAR to date. That is how it works in SQL Server. Then we close with END and keep the same column name. This is how you transform an integer into a date. Let's query it: the order date is now a real date, not a number, and we can get rid of the old column.

Now we do the same for the shipping date: replace everything with the shipping date and query. Well, the shipping date is perfect; there is no issue with this column. But still, I do not like that we found so many issues with the order date, so just in case the same thing happens to the shipping date in the future, I will apply the same rules to it as well. And if you decide not to apply them now, you should always have quality checks that run every day to detect such issues; once you detect them, you apply the transformations. But for now, I am going to apply the rules right away. Here is the conversion in one place.
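A sketch of the integer-to-date conversion, assuming the columns are named sls_order_dt, sls_ship_dt, and sls_due_dt; the same pattern repeats for all three:

-- Zeros and values that are not exactly 8 digits long become NULL;
-- everything else is cast INT -> VARCHAR -> DATE (yyyymmdd).
SELECT
    CASE
        WHEN sls_order_dt = 0 OR LEN(sls_order_dt) != 8 THEN NULL
        ELSE CAST(CAST(sls_order_dt AS VARCHAR) AS DATE)
    END AS sls_order_dt,
    CASE
        WHEN sls_ship_dt = 0 OR LEN(sls_ship_dt) != 8 THEN NULL
        ELSE CAST(CAST(sls_ship_dt AS VARCHAR) AS DATE)
    END AS sls_ship_dt,
    CASE
        WHEN sls_due_dt = 0 OR LEN(sls_due_dt) != 8 THEN NULL
        ELSE CAST(CAST(sls_due_dt AS VARCHAR) AS DATE)
    END AS sls_due_dt
FROM bronze.crm_sales_details;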
So that takes care of the shipping date. Now we go to the due date and run the same test. Let's execute: it is perfect as well, but I will still apply the same rules, so let's put the due date everywhere in the query; just make sure you do not miss anything. Execute. Perfect: we have the order date, shipping date, and due date, all of them real dates without any bad values inside.

There is still one more check we can do: the order date should always be smaller than the shipping date and the due date, because it makes no sense to deliver an item before it is ordered, right? First the order happens, then we ship the items; there is an order to these dates, and we can check it. So we check for invalid date sequences: WHERE the order date is greater than the shipping date, OR the order date is greater than the due date. Let's run the check. That is really good: we do not have such mistakes in the data, and the quality looks good. The order date is always before the shipping date and the due date, so no transformation or cleanup is needed there.

Okay friends, now to the last three columns: the sales, the quantity, and the price. These values are connected to each other; there is a business rule, a calculation: the sales must equal the quantity multiplied by the price, and all three of sales, quantity, and price must be positive numbers; negative, zero, or NULL is not allowed. Those are the business rules, and we have to check the data consistency in our table: do all three values follow them? We start with the rule itself: WHERE the sales is not equal to the quantity multiplied by the price, so we are searching where the result does not match our expectation. We also check for the NULLs: OR sales IS NULL, OR quantity IS NULL, and the last one for the price. And we check for negatives and zeros: less than or equal to zero, applied to all three columns. With that we are checking the calculation as well as NULLs, zeros, and negative numbers. I will add a DISTINCT here, and let's query it. Of course we have bad data here, so let's sort the output by sales, quantity, and price. Looking at the data: in the sales we have NULLs, negative numbers, and zeros, so all the bad combinations, and we also have bad calculations. As you can see, the price here is 50 and the quantity is one, but the sales is two, which is not correct. Here the calculation is wrong as well: it should be 10, and here nine, or maybe the price is wrong. Looking at the quantity, there are no NULLs, no zeros, no negatives, so the quantity looks better than the sales. And the prices have NULLs and negatives, though no zeros. That means the quality of the sales and the price is bad.
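As a recap, sketches of the two checks just built (all column names assumed):

-- Invalid date order: an order must happen before shipping and before the due date.
SELECT *
FROM bronze.crm_sales_details
WHERE sls_order_dt > sls_ship_dt
   OR sls_order_dt > sls_due_dt;

-- Consistency of the business rule: sales = quantity * price,
-- and none of the three may be NULL, zero, or negative.
SELECT DISTINCT sls_sales, sls_quantity, sls_price
FROM bronze.crm_sales_details
WHERE sls_sales != sls_quantity * sls_price
   OR sls_sales IS NULL OR sls_quantity IS NULL OR sls_price IS NULL
   OR sls_sales <= 0 OR sls_quantity <= 0 OR sls_price <= 0
ORDER BY sls_sales, sls_quantity, sls_price;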
The calculation does not hold, and we have all these bad scenarios. Now, how do I handle this? I do not try to transform everything on my own. I usually talk to an expert, someone from the business or from the source system, show them these scenarios, and discuss. There are usually two kinds of answers. Either they tell me: we will fix it in the source; then I have to live with it, there is incoming bad data, and the bad data will show up in the warehouse until the source system cleans up those issues. Or the answer is: we do not have the budget, this data is really old, and we are not going to do anything about it. Then you have to decide: either you leave it as it is, or you say, let's improve the quality of the data ourselves. But for that you have to ask the experts to support you in solving these issues, because it really depends on the rules; different rules lead to different transformations.

So let's say we got the following rules. If the sales value is NULL, negative, or zero, derive it from the formula: the quantity multiplied by the price. If the price is wrong, for example NULL or zero, calculate it from the sales and the quantity. And if the price is negative, like minus 21, convert it to positive 21, without any further calculation. Those are the rules, and now we build the transformations based on them, step by step.

I start over here with the new sales. What does the rule say? CASE WHEN, as usual: the sales IS NULL, OR the sales is a negative number or equal to zero, OR, one more scenario, we have a sales value but it does not follow the calculation, so we have wrong information in the sales. For that we say the sales is not equal to the quantity multiplied by the price. But of course we will not use the price as it is: we wrap it in the function ABS; the absolute value converts everything from negative to positive. In all those cases we use the calculation, the quantity multiplied by the price, which means we are not using the value from the source system; we are recalculating it. And if the sales is correct and none of those scenarios applies, we say ELSE and keep the sales as it comes from the source, because it is correct. Really nice. We finish with END and give it the same name; I rename the old column as the old value, and the same for the price. The quantity we will not touch, because it is correct.

Now let's transform the price. Again, as usual, CASE WHEN. What are the scenarios? The price IS NULL, OR the price is less than or equal to zero. Then we do the calculation: the sales divided by the quantity. But here we have to make sure we are not dividing by zero. Currently there are no zeros in the quantity, but you never know; in the future a zero might arrive and the whole code would break. So what you do is say: if there is ever a zero, replace it with NULL, using NULLIF; if it is zero, make it NULL. So that's it.
Now, if the price is not NULL and not negative or zero, everything is fine, and that is why we add the ELSE: the price as it comes from the source system. That's it; END AS price. I am totally happy with that, so let's execute and check, of course. These are the old values, and these are the new, transformed, cleaned ones. Here we previously had a NULL, and now we have two: two multiplied by one gives two, so the sales here is correct. The next one had 40 in the sales, but the price is two: two multiplied by one should give two, so the new sales is correct, two and not 40. Next, the old sales is zero, but multiplying the price of four by the quantity gives four, so the original sales was wrong, and the new sales correctly shows four. Now let's find a negative: here the price is negative, which is not correct, so we multiply the absolute price by the quantity and get the right sales. Now a scenario where the price is NULL, like this one: we have no price, but we calculate it from the sales and the quantity, ten divided by two gives five, so the new price is better. And the same for the negatives: minus 21 becomes 21 in the output, which is correct. For now I do not see any scenario where the data is still wrong; everything looks better than before. With that we have applied the business rules from the experts and cleaned up the data in the data warehouse, and this is much better than before because we are now presenting better data for analysis and reporting. But it is challenging, and you really have to understand the business.

Next we copy this logic and integrate it into our query: instead of the sales we take our new calculation, and instead of the price our corrected calculation; and here I am missing the END. Let's run the whole thing again. With that we now also have clean sales, quantity, and price values following our business rules, and we are done cleaning the sales details.

The next step is to insert everything into the silver sales details, but first we have to check the DDL again. All you have to do is compare the query results with the DDL. The first one is the order number: fine. The product key, the customer ID, but here we have an issue: the three date columns are now DATE and not integer, so we have to change the data type, and with that we have a better data type than before. Then the sales, quantity, and price: correct. Let's drop the table and create it again from scratch, and do not forget to update your DDL script. Now we insert the results into our silver table, the sales details, and we have to list all the columns; I have already prepared the list, so make sure the order of the columns is correct. Let's insert the data. With that we can see that SQL inserted the data into our sales details. Before checking the health of the silver table, here is the full transformation for the sales and the price in one sketch.
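This is a sketch with assumed column names, matching the checks above:

-- Business rules:
-- 1) If sales is missing, non-positive, or fails the formula, recalculate it.
-- 2) If price is missing or non-positive, derive it from sales / quantity
--    (NULLIF guards against a future division by zero).
-- ABS() turns negative prices into positive ones without any calculation.
SELECT
    CASE
        WHEN sls_sales IS NULL OR sls_sales <= 0
             OR sls_sales != sls_quantity * ABS(sls_price)
            THEN sls_quantity * ABS(sls_price)
        ELSE sls_sales
    END AS sls_sales,
    sls_quantity,
    CASE
        WHEN sls_price IS NULL OR sls_price <= 0
            THEN sls_sales / NULLIF(sls_quantity, 0)
        ELSE sls_price
    END AS sls_price
FROM bronze.crm_sales_details;

Note the ABS around the price inside the sales rule: without it, a negative price would distort both the comparison and the recalculation.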
Now to the health check of the silver table: in our quality checks we switch from bronze to silver. The order date is always smaller than the shipping date and the due date, which is really nice. And now I am very interested in the calculations, so here too we switch from bronze to silver, and I remove all the intermediate calculations, because we no longer need them. Let's see whether there is any issue. Well, perfect: our data follows the business rules, with no NULLs, negative values, or zeros. As usual, the last step is a final look at the table: we have the order number, the product key, the customer ID, the three dates, the sales, quantity, and price, and of course our metadata column. Everything is perfect.

Now, looking at our code, what are the different types of data transformations we are doing here? For the three date columns we are handling invalid data, which is also a type of transformation, and at the same time doing data type casting, changing to a more correct data type. For the sales we are handling missing data as well as invalid data by deriving the column from an existing one, and it is very similar for the price: we handle the invalid data by deriving it from a specific calculation. So those are the transformation types in this script.

All right, let's move on to the next source system. We have the ERP customer table, cust_az12, with only three columns, and we start with the ID. Here again we have customer information, and if we check our integration model, we can connect this table with the CRM table customer info using the customer key. That means we have to make sure those two tables can actually be connected, so let's check the other table, in the silver layer of course, and query both tables. We can see that there are extra characters here that are not part of the customer key in the CRM. Let's search, for example, for this customer, WHERE the cid LIKE a similar value, so we are searching for a customer with a similar ID. We do find this customer, but the issue is those three extra characters, 'NAS'. There is no specification or explanation of why we have the 'NAS', so what we have to do is remove it; we do not need it. Checking the data again, it looks like the old records have 'NAS' at the start, while newer data comes without those three characters. We have to clean up these IDs so that they can be joined with the other tables.

We will do it like this: we start with a CASE WHEN, since we have two scenarios in the data. If the cid is LIKE the three characters 'NAS', meaning the ID starts with them, we apply a transformation function; otherwise it stays as it is. Now we build the transformation: we use SUBSTRING, the string is the cid, and we define the position where the cutting, or extracting, starts: 1, 2, 3, and then four, so position number four. Then we have to define how many characters to extract. I will make it dynamic.
So I go with the length of the cid; I am not going to count the characters by hand. It looks good: if it starts with 'NAS', extract from the cid starting at position four, to the end. Let's execute it. I am missing a comma here; fixed. And the records without 'NAS' at the start, if you scroll down, are not affected. With that we have a nice ID that can be joined with the other table. Of course we test it: WHERE the whole transformation (without the alias name, we do not need it there) NOT IN a very simple subquery, SELECT DISTINCT cst_key, the customer key, from the silver table, the silver CRM customer info. Let's check: it works fine; we cannot find any unmatched data between the customer table from the ERP and the CRM. And if you remove the transformation, like this, you will find a lot of unmatched data. So our transformation is working perfectly, and we can remove the original column. That's it for the first column.

Okay, moving on to the next field: the birthdate of the customers. The first thing to check is the data type. It is a date, so it is fine; it is not an integer or a string, so there is nothing to convert. But there is still something to check with a birthdate: whether anything is out of range. For example, we can check whether there are really old birthdates, say before the first of January 1924. Let's check. It looks like we have customers older than 100 years. Well, maybe that is correct, but it of course sounds strange, and it is something to confirm with the business. Then we check the other boundary, where it is simply impossible: a customer whose birthdate is in the future. So: the birthdate is greater than the current date, like this. Let's query; it will not work until we put an OR between the two conditions. And now, checking the list, we have invalid birthdates: all of these lie in the future, which is totally unacceptable and an indicator of bad data quality. Of course, you can report it to the source system so they correct it. Here it is up to you what to do with those dates: leave them as bad data, or clean them up by replacing them with NULL, or replace only the extreme ones that are 100% incorrect. Let's write the transformation: as usual, CASE WHEN the birthdate is greater than the current date and time, THEN NULL; otherwise, ELSE, the birthdate as it is; and then END AS the birthdate. Let's execute. With that, we should not get any customer with a birthday in the future. So that's it for the birthdate; a sketch of both fixes so far follows.
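A minimal sketch of the ID cleanup and the birthdate rule; bronze.erp_cust_az12 and the column names cid and bdate are assumptions:

SELECT
    -- Drop the legacy 'NAS' prefix so the ID joins with the CRM customer key.
    CASE
        WHEN cid LIKE 'NAS%' THEN SUBSTRING(cid, 4, LEN(cid))
        ELSE cid
    END AS cid,
    -- Birthdates in the future are impossible; NULL them out.
    CASE
        WHEN bdate > GETDATE() THEN NULL
        ELSE bdate
    END AS bdate
FROM bronze.erp_cust_az12;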
Now let's move on to the last column: the gender. The gender information again has low cardinality, so we check all the possible values in the column: SELECT DISTINCT gen from our table. Execute. The data does not look good at all: we have a NULL, an 'F', an empty string, 'Male', 'Female', and an 'M'. That is not good, so we are going to clean all of this up so that only three values remain: Male, Female, and not available. We do it like this: CASE WHEN, and we TRIM the values to make sure there are no stray spaces, and I also use the UPPER function so that if we ever receive lowercase values in the future, we cover those scenarios too. So: if it is 'F' or 'FEMALE', make it 'Female'; and the same for the male: if it is 'M' or 'MALE' (in capital letters, since we are applying UPPER), then it is 'Male'; otherwise, in all other scenarios, whether an empty string or a NULL, it should be not available. And of course we need an END, AS gen. Now let's test whether we have covered everything: the 'M' is now 'Male', the NULL is not available, the 'F' is 'Female', the empty string (or spaces) is not available, 'Female' stays as it is, and the same for 'Male'. With that we cover all the scenarios, and we follow our standards in the project. I cut this and put it into our original query over here, and execute the whole thing; with that we have cleaned all three columns.

The question now: did we change anything in the DDL? Well, we did not change anything; we did not introduce any new column or change any data type. So the next step is to insert into the silver layer. As usual: INSERT INTO the silver ERP customer table, and then we list the column names, the cid, the birthdate, and the gender. All right, execute: all the data is inserted. And of course, the very important next step is to check the data quality. We go back to our query over here and change bronze to silver. We still get those very old customers, but we did not change that; we only fixed the birthdates lying in the future, and indeed they no longer appear in the results, so everything is clean. Next, the distinct genders: as you can see, only our three values. And of course a final look at the table: the cid, the birthdate, the gender, and then our metadata column. Everything looks amazing.

So that's it. What are the different types of data transformations we have done? For the ID, we handled invalid values by removing the part that is not needed. The same goes for the birthdate: we handled invalid values there as well. And for the gender we did data normalization, mapping the code to a friendlier value, and at the same time we handled the missing values. Putting the whole load for this table together:
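A sketch of the full load for this table, with assumed names (silver.erp_cust_az12 with columns cid, bdate, gen; 'n/a' short for 'not available'):

INSERT INTO silver.erp_cust_az12 (cid, bdate, gen)
SELECT
    CASE WHEN cid LIKE 'NAS%' THEN SUBSTRING(cid, 4, LEN(cid)) ELSE cid END,
    CASE WHEN bdate > GETDATE() THEN NULL ELSE bdate END,
    -- Normalize the gender codes to full friendly values.
    CASE
        WHEN UPPER(TRIM(gen)) IN ('F', 'FEMALE') THEN 'Female'
        WHEN UPPER(TRIM(gen)) IN ('M', 'MALE')   THEN 'Male'
        ELSE 'n/a'   -- covers NULLs and empty strings
    END
FROM bronze.erp_cust_az12;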
Okay, moving on to the second table: the location information, ERP location A101. Here the task is easier, because we have only two columns. If you check the integration model, you can find our table over here: we can connect it to the customer info from the other system, matching the cid with the customer key. Those two values must match in order to join the tables, so we have to check the data. SELECT the cst_key from the silver customer info and compare. Now, checking the results, you can see we have an issue with the cid: there is a minus between the characters and the numbers, but in the customer key there is nothing splitting the characters from the numbers. If you join these two as they are, it will not work, so we have to get rid of the minus; it is totally unnecessary. The fix is very simple: we take the cid, search for the minus, and REPLACE it with nothing. Let's query again: now the values look very similar to each other. And we can verify it as well: WHERE our transformation NOT IN, using the other side as a subquery, like this. Execute: we cannot find any unmatched data now, so the transformation works and we can connect those two tables together. If I take the transformation away, you can see a lot of unmatched data, so the transformation is okay; we are going to stay with it.

Now let's talk about the countries. We have multiple values here, and the cardinality is low, so again we check all the possible values inside this column, meaning we are checking whether the data is consistent: SELECT DISTINCT the country from our table; I will just copy it like this, and I will also sort the data by the country. Let's check the information. You can see we have a NULL, and an empty string, which is really bad; then we have full country names, and also abbreviations of countries. It is a mix, and that is not good: sometimes we have 'DE' and sometimes 'Germany'; then we have the United Kingdom; and for the United States we have three versions of the same information, which is not good either. So the quality of the country column is poor; let's work on the transformation. As usual we start with the CASE WHEN: if TRIM of the country equals 'DE', we transform it to 'Germany'. The next one is about the USA: if TRIM of the country is IN those two values, 'US' and 'USA', then it becomes 'United States', and with that we cover those cases as well. Now we have to deal with the NULL and the empty string: WHEN TRIM of the country equals the empty string OR the country IS NULL, THEN not available. Otherwise I would like the country as it is, but trimmed, to make sure there are no leading or trailing spaces. That's it; we name it the country. It works, and the country information is transformed.
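Here is the whole transformation as a sketch, assuming the table is bronze.erp_loc_a101 with columns cid and cntry:

-- Strip the '-' from the ID so it joins with the CRM customer key,
-- and normalize the country values.
SELECT
    REPLACE(cid, '-', '') AS cid,
    CASE
        WHEN TRIM(cntry) = 'DE' THEN 'Germany'
        WHEN TRIM(cntry) IN ('US', 'USA') THEN 'United States'
        WHEN TRIM(cntry) = '' OR cntry IS NULL THEN 'n/a'   -- 'not available'
        ELSE TRIM(cntry)
    END AS cntry
FROM bronze.erp_loc_a101;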
Now I take the whole new transformation and compare it with the old column; let me call that one old country, and query. Checking the values: the 'DE' is now 'Germany'; the empty string is not available; the NULL likewise; the United Kingdom stayed as it was; and we have one single value for all the United States variants. It looks perfect, and with that we have cleaned the second column as well, so we now have clean results. And now the question: did we change anything in the DDL? We have not changed anything; both columns are still VARCHARs, so we can insert immediately: INSERT INTO the silver location table, specifying the columns, which is very simple, the ID and the country. Execute: all the values are inserted. As the next step, of course, we double-check: I remove all the extra pieces here, and instead of bronze we go with silver. As you can see, all the country values look good. And a final look at the table: the IDs without the separator, the countries, and our metadata column. With that, we have cleaned up the data for the location.

Okay, so what are the different types of data transformations we have done here? First, we handled invalid values by replacing the minus with an empty string. For the country we did data normalization, replacing codes with friendly values, and at the same time we handled missing values by replacing the empty string and the NULL with not available. And one more thing, of course: we removed the unwanted spaces. Those are the transformation types for this table.

Okay guys, keep the energy up, keep the spirit up: we have to clean up the last table in the bronze layer, and of course we cannot skip anything; we have to check the quality and detect all the errors. This table is about the categories of the products, and it has four columns. Let's start with the first one, the ID. As you can see in our integration model, this table connects to the product info from the CRM through the product key, and as you remember, in the silver layer we created an extra column for exactly that in the product info. If you select that data, you see a column called category ID, and it matches the ID in this table exactly; we have already done the testing, so this ID is ready to be used with the other table, and there is nothing to do here.

The next columns are strings, and of course we check for unwanted spaces: SELECT * FROM the same table, and first we check the category, WHERE the category is not equal to the category after trimming the unwanted spaces. Execute: no results, so there are no unwanted spaces. Let's check the next column, the subcategory, and run the query as well: nothing there either.
So there are no unwanted spaces in the subcategory. Now the last column: I copy and paste, take the maintenance, and execute. No results as well, perfect: no unwanted spaces anywhere in this table. The next step is to check the data standardization, because all these columns have low cardinality. SELECT DISTINCT the category from our table: we have accessories, bikes, clothing, and components; everything looks perfect, nothing to change in this column. The subcategory: scroll down, all values are friendly and nice, nothing to change here either. And the last column, the maintenance: perfect, only two values, yes and no, and no NULLs. My friends, that means this table has really good data quality and there is nothing to clean up. But we still have to follow our process and load it from bronze to silver, even without transformations. Our job here is really easy: INSERT INTO the silver category table, silver.erp_px and so on, and we define the columns: the ID, the category, the subcategory, and the maintenance. That's it; let's insert the data. And as usual, we check it afterward: SELECT from the silver table. The IDs are there, the categories, the subcategories, the maintenance, and our metadata column, so everything is inserted correctly.

All right. Now I have the queries and the insert statements for all six tables, and one important thing before inserting any data: we have to truncate, emptying the table first, because if you run this script twice, what happens? You insert duplicates. So first truncate the data, then do a full-load insert of all the data, with one step before the insert, just like in the bronze layer. We say TRUNCATE TABLE, truncating the silver customer info, and only after that do we insert the data. And of course we can print a nice message at the start: first we are truncating the table, then inserting. If I run the whole thing now, it works, and if I run it again, we get no duplicates. We have to add this step before each insert, so let's do that for every table. All right, I am done with all the tables, so let's run everything. Execute, and in the messages you can see everything working perfectly: all the tables were emptied and then the data was inserted. Perfect: we have a nice script that loads the silver layer.

But of course, like the bronze layer, we are going to put everything into one stored procedure. We go to the beginning over here and say CREATE OR ALTER PROCEDURE, put it in the schema silver, and following the naming convention call it load_silver. Then we say BEGIN, take the whole code (it is a long one), give it one push of indentation with a tab, and at the end we say END. Perfect; we have our stored procedure, but we forgot the AS here; with that added, we will not get any error. Let's execute it.
So the stored procedure is created: if you go to Programmability, you will find two procedures, load_bronze and load_silver. Now let's try it out. All you have to do now is execute silver.load_silver. Execute the stored procedure, and we get the same results: this stored procedure is now responsible for loading the whole silver layer. Of course, the messaging here is not great yet. In the bronze layer we learned that we can add many things: error handling, nice messages, capturing the duration. So now your task: pause the video, take this stored procedure, and transform it to be very similar to the bronze one, with the same messaging and all the add-ons we added there. Pause the video now; I will do the same offline and see you soon.

Okay, I hope you are done, and I can show you the result. It is just like the bronze layer. At the start we define a few variables to capture the durations: the start time, the end time, the batch start time, and the batch end time. Then we print a lot of messages to get nice output: at the start we say we are loading the silver layer, and then we split by source system, for example loading the CRM tables; I will show only one table for now. We set the timer, assigning the current date and time to the start time. Then we do the usual: truncating the table and inserting the new information after cleaning it up. And we print this nice message with the load duration, taking the difference between the start time and the end time with the function DATEDIFF and showing the result in seconds, so we are simply printing how long it took to load this table. We repeat this pattern for all the tables. And of course we put everything into TRY and CATCH: SQL will try to execute the TRY part, and if there are any issues, it executes the CATCH, where we print a few pieces of information, such as the error message, the error number, and the error state. We are following exactly the same standard as the bronze layer. Let's execute the whole thing; with that we have updated the definition of the stored procedure. Now let's run it: execute silver.load_silver. It went very fast, less than a second, again because we are working on a local machine. Loading the silver layer, loading the CRM tables, and we see this nice messaging: truncating the table, inserting the data, and the load duration per table. You will see that everything is below one second here; in real projects you will of course see more than that. At the end we get the load duration of the whole silver layer.

And now I have one more thing for you. Say you change the design of this stored procedure for the silver layer, adding different kinds of messaging or maybe creating logs and so on. All those new ideas and redesigns you make for the silver layer, you should always think about bringing into the other stored procedure, the one for the bronze layer, as well. Always keep your code following the same standards; do not have a new idea in one stored procedure and an old one in another. Here is a skeleton of the pattern.
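This is only a sketch: the column lists and the cleansing SELECTs are elided, and the table names are assumptions following the project's convention:

CREATE OR ALTER PROCEDURE silver.load_silver AS
BEGIN
    DECLARE @start_time DATETIME, @end_time DATETIME,
            @batch_start_time DATETIME, @batch_end_time DATETIME;
    BEGIN TRY
        SET @batch_start_time = GETDATE();
        PRINT 'Loading the silver layer';

        -- One block like this per table: truncate, then full-load insert.
        SET @start_time = GETDATE();
        PRINT '>> Truncating table: silver.crm_cust_info';
        TRUNCATE TABLE silver.crm_cust_info;
        PRINT '>> Inserting data into: silver.crm_cust_info';
        -- INSERT INTO silver.crm_cust_info (...)   <- the cleansing query goes here
        -- SELECT ... FROM bronze.crm_cust_info;
        SET @end_time = GETDATE();
        PRINT '>> Load duration: '
            + CAST(DATEDIFF(SECOND, @start_time, @end_time) AS NVARCHAR) + ' seconds';

        -- ...repeat the block above for the remaining five tables...

        SET @batch_end_time = GETDATE();
        PRINT 'Silver layer load duration: '
            + CAST(DATEDIFF(SECOND, @batch_start_time, @batch_end_time) AS NVARCHAR) + ' seconds';
    END TRY
    BEGIN CATCH
        PRINT 'ERROR OCCURRED DURING LOADING THE SILVER LAYER';
        PRINT 'Error message: ' + ERROR_MESSAGE();
        PRINT 'Error number : ' + CAST(ERROR_NUMBER() AS NVARCHAR);
        PRINT 'Error state  : ' + CAST(ERROR_STATE() AS NVARCHAR);
    END CATCH
END;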
Always maintain those scripts and keep them all up to date, following the same standards; otherwise it can be really hard for other developers to understand the code. I know that takes a lot of work and commitment, but it is your job to keep everything following the best practices, the naming conventions, and the standards you set for your project.

So guys, we now have two very nice ETL scripts: one that loads the bronze layer and another one for the silver layer. Running our data warehouse is now very simple. First you run the bronze layer: that takes all the data from the CSV files of the sources and puts it into the bronze layer of our data warehouse, refreshing the whole layer. Once that is done, the next step is to run the stored procedure of the silver layer: once you execute it, it takes all the data from the bronze layer, transforms it, cleans it up, and loads it into the silver layer. As you can see, the concept is very simple: we move the data from one layer to the next, with different tasks at each step.

All right guys, as you can see, in the silver layer we have done a lot of data transformations, and we have covered all the types involved in data cleansing: removing duplicates, data filtering, handling missing data, handling invalid data, removing unwanted spaces, casting data types, and so on. We have also derived new columns, done data enrichment, and normalized a lot of data. What we have not done yet: business rules and logic, data aggregations, and data integration. That is for the next layer. All right my friends, so finally we are done cleaning the data and checking its quality, and we can close those two steps. The next step is to extend the data flow diagram, so let's go.

Okay, let's extend our data flow for the silver layer. What I am going to do is simply copy the whole thing and put it side by side with the bronze layer, and call it the silver layer. The table names stay as before, because it is one-to-one with the bronze layer, but we change the coloring: I mark everything and make it gray, like silver. And what is very important is the lineage: I draw an arrow from each bronze table to its silver table. With that we have a lineage across the layers: if you are looking at the customer info table, you can understand, aha, this comes from the bronze layer customer info, which in turn comes from the source system CRM. So we can see the lineage between the different layers, and without reading any scripts you can understand the whole project in one picture: how the data flows from the sources to the bronze layer, the silver layer, and later, of course, the gold layer. It looks really nice and clean. All right, with that we have updated the data flow. Next we commit our work to the Git repo, so let's go.

Okay, let's commit our scripts. We go to the scripts folder, and there we have a silver folder; if you do not have it, you can of course create it.
First we put in the DDL script for the silver layer. I paste the code over here, and as usual we have a comment at the top as a header explaining the purpose of the script. Let's commit our work. We do the same for the stored procedure that loads the silver layer: I already have a file for that, so let's paste it. Here is our stored procedure, and at the top, as usual, a header: this script performs the ETL process loading the data from bronze into silver; the action is to truncate each table first and then insert the transformed, cleansed data from bronze; there are no parameters at all; and this is how you call the stored procedure. Okay, let's commit our work.

And one more thing to commit to our project: all those queries you have built to check the quality of the silver layer. This time we do not put them under scripts; we go to the tests folder and create a new file called quality checks silver, and inside it we paste all the queries we have built; I have reorganized them by table. Here you can see all the checks we have done during the course, and at the top a nice header comment: this script checks the quality of the silver layer, looking for NULLs, duplicates, unwanted spaces, invalid date ranges, and so on. Each time you come up with a new quality check, I recommend sharing it with the project and with the other teams, so it becomes part of the checks you run after each ETL execution. So that's it: I put these checks into our repo, and whenever I come up with a new one, I will update the file. Perfect, our code is now in our repository.

All right, with that our code is saved and we are done with the whole epic: we have built the silver layer. Let's minimize it. And now we come to my favorite layer, the gold layer; we are going to build it, and the first step, as usual, is to analyze. This time we will explore the business objects, so let's go.

All right, now we come to the big question: how are we going to build the gold layer? As usual, we start with analyzing. What we do here is explore and understand the main business objects that are hidden inside our source systems. As you can see, we have two sources and six files, and we have to identify the business objects. Once we have that understanding, we can start coding, and the main transformation we do here is data integration. I usually split it into three steps. First, we build the business objects we have identified. Then, once we have a business object, we look at it and decide what type of table it is: a dimension, a fact, or maybe a flat table. And the last step, of course, is to rename all the columns into something friendly and easy to understand, so that our consumers do not struggle with technical names. Once we have all those steps, it is time to validate what we have created.
So what we have to do: the new data model that we have created should be connectable, and we have to check that the data integration is done correctly. And once everything is fine, we cannot skip the last step: we have to document and as well commit our work in the Git repo. And here we will be introducing a new type of documentation. We're going to have a diagram of the data model, we're going to build a data catalog where we describe the data model, and of course we're going to extend the data flow diagram. So this is our process; those are the main steps that we will follow in order to build the gold layer. Okay. So what exactly is data modeling? Usually the source system delivers raw data: unorganized, messy, and not very useful in its current state. Data modeling is the process of taking this raw data and then organizing and structuring it in a meaningful way. What we are doing is putting the data into new, friendly, easy-to-understand objects like customers, orders, products. Each one of them is focused on specific information, and, very importantly, we describe the relationships between those objects by connecting them with lines. What you have built on the right side we call a logical data model. If you compare it to the left side, you can see the data model makes it really easy to understand our data, the relationships, and the processes behind them. Now, in data modeling we have three different stages, or let's say three different ways of drawing a data model. The first stage is the conceptual data model. Here the focus is only on the entities: we have customers, orders, products, and we don't go into details at all. We don't specify any columns or attributes inside those boxes; we just want to show which entities we have and the relationships between them. So the conceptual data model doesn't focus on details at all; it just gives the big picture. The second data model we can build is the logical data model. Here we start specifying the different columns that we can find in each entity, like the customer ID, the first name, the last name and so on. We still draw the relationships between the entities, and we also make clear which columns are the primary keys. So as you can see, we have more details here, but we don't describe every detail of each column, and we don't worry about exactly how we're going to store those tables in the database. The third and last stage is the physical data model. This is where everything gets ready before creating it in the database. Here you have to add all the technical details, like the data type and length of each column and many other database specifics. So again: the conceptual data model gives us the big picture, the logical data model dives into the details of what data we need, and the physical data model prepares everything for the implementation in the database. And to be honest, in my projects I only draw the conceptual and logical data models, because drawing and maintaining the physical data model needs a lot of effort and time, and many tools, like Databricks, can generate those models automatically. So in this project, we're going to draw the logical data model for the gold layer. All right.
Now, for analytics, and especially for data warehousing and business intelligence, we need a special data model that is optimized for reporting and analytics, and it should be flexible, scalable, and easy to understand. For that we have two special data models. The first type is the star schema. It has a central fact table in the middle, surrounded by dimensions. The fact table contains transactions and events, and the dimensions contain descriptive information. The relationship between the fact table in the middle and the dimensions around it forms a star shape, and that's why we call it a star schema. And we have another data model called the snowflake schema. It looks very similar to the star schema: we have again the fact in the middle, surrounded by dimensions. But the big difference is that we break the dimensions down into smaller subdimensions, and as you extend the dimensions, the shape of this data model starts to look like a snowflake. Now, if you compare them side by side, you can see that the star schema looks easier, right? It is usually easy to understand and easy to query; it is really perfect for analysis. But it has one issue: the dimensions might contain redundancies, and your dimensions get bigger over time. If you compare that to the snowflake, you can see the schema is more complex; you need more knowledge and effort in order to query something from the snowflake. But the main advantage comes with the normalization: as you break those redundancies out into small tables, you can optimize the storage. But to be honest, who cares about the storage these days? So for this project, I have chosen to use the star schema, because it is very commonly used and perfect for reporting, for example if you're using Power BI, and we don't have to worry about the storage. That's why we're going to adopt this model to build our gold layer. Okay. So now, one more thing about those data models: they contain two types of tables, facts and dimensions. So what do I mean when I say this is a fact table or a dimension table? Well, a dimension contains descriptive information, or categories, that gives context to your data. For example, product info: you have the product name, category, subcategory and so on. This is a table that describes the products, and this we call a dimension. On the other hand we have facts. They are events, like transactions, and they contain three important kinds of information. First, you have multiple IDs from multiple dimensions. Then we have date information, like when the transaction or event happened. And the third type of information is measures and numbers. So if you see those three types of data in one table, then it is a fact. In short: if you have a table that answers how much or how many, it is a fact; but if you have a table that answers who, what, or where, it is a dimension table. So that is what dimension and fact tables are. All right my friends. So far, in the bronze layer and in the silver layer, we didn't discuss anything about the business. The bronze and silver layers were very technical: we were focusing on data ingestion, on cleaning up the data, on the quality of the data, but the tables were still very oriented to the source systems. Now comes the fun part, in the gold layer, where we're going to go and break up the whole data model of the sources. We're going to create something completely new for our business that is easy to consume for business reporting and analysis.
And here it is very important to have a clear understanding of the business and its processes. If you don't have it already at this phase, you really have to invest time by meeting the process experts and the domain experts, in order to have a clear understanding of what the data is talking about. So now what we're going to do is try to detect the business objects that are hidden in the source systems. So let's go and explore that. All right. Now, in order to build a new data model, I first have to understand the original data model: what are the main business objects that we have, and how are things related to each other? This is a very important step in building a new model. What I usually do is start giving labels to all those tables. So if you go to the shapes over here, let's go and search for a label, and if we go to more icons, I'm going to take this label over here. So, drag and drop it, and then I'm going to increase the font size; let's go with 20 and bold, just to make it a little bit bigger. Now, by looking at this data model, we can see that we have product information in the CRM as well as in the ERP, and then we have customer information and a transactional table. So, now let's focus on the products. The product information is over here: we have the current and the historical product information, and here we have the categories that belong to the products. So in our data model we have something called products. Let's go and create this label; it's going to be product, and let's give its style a color. Let's pick, for example, the red one. Now let's move this label and put it beneath this table over here, and with that I have a label saying this table belongs to the object called products. Now I'm going to do the same thing for the other table over here; I'm going to tag this table with product as well, so that I can easily see which tables from the sources have information about the product business object. All right. Now, moving on, we have here a table called customer information, so we have a lot of information about the customers. We have as well, in the ERP, customer information where we have the birthdate and the country. So those three tables have to do with the object customer. That means we're going to label them like that: let's call it customer, and I'm going to pick a different color for that, let's go with green. So I will tag this table like this, and the same for the other tables: copy, tag the second table, and the third table. Now it is very easy for me to see which table belongs to which business object. And now we have the final table over here; there is only one table about the sales and orders, and in the ERP we don't have any information about that. So this one is going to be easy: let's call it sales, move it over here, and maybe change its color as well, for example to this color over here. Now, this step is very important when building any data model in the gold layer: it gives you a big picture of the things that you are going to model. So the next step is that we're going to go and build those objects step by step. Let's start with the first object, our customers. Here we have three tables, and we're going to start with the CRM. So let's start with this table over here. All right.
So with that we know what our business objects are, and this task is done. In the next step, we're going to go back to SQL and start doing the data integration and building a completely new data model. So let's go and do that. Now let's have a quick look at the gold layer specifications. This is the final stage: we're going to provide data to be consumed by reporting and analytics. And this time we will not be building tables; we will be using views. That means we will not have a stored procedure or any load process for the gold layer. All we are doing is data transformation, and the focus of the transformation is going to be data integration, aggregation, business logic and so on. And this time we're going to introduce a new data model: we will be building a star schema. So those are the specifications for the gold layer, and this is our scope. This time we make sure that we are selecting data from the silver layer, not from the bronze, because the bronze has bad data quality, while in the silver everything is prepared and cleaned up. So in order to build the gold layer, we're going to be targeting the silver layer. Let's start with SELECT * FROM, and we're going to go to the silver CRM customer info. So let's go and hit execute. Now we're going to select the columns that we need to be presented in the gold layer. So let's start selecting the columns that we want: we have the ID, the key, the first name. I will not go and get the metadata columns; those belong only to the silver layer. Perfect. The next step is that I'm going to give this table an alias; let's go and call it CI. And I'm going to make sure that we are selecting from this alias, because later we're going to join this table with other tables. So, something like this. We're going to go with those columns. Now let's move to the second table; let's go and get the birthdate information. So now we're going to jump to the other system, and we have to join the data by the CID together with the customer key. So now we have to join the data with another table. And here I try to avoid using an inner join, because if the other table doesn't have all the information about the customers, I might lose customers. So always start with the master table, and if you join it with any other table in order to get information, always try to avoid an inner join, because the other source might not have all the customers, and if you do an inner join you might lose customers. So I tend to start from the master table, and then everything else is a left join. So I'm going to say LEFT JOIN silver ERP customer AZ12, and let's give it the alias CA. Now we have to join the tables: it's going to be by CI from the first table, the customer key, equal to CA and the CID. Now, of course, we're going to get matching data, because we checked this in the silver layer. If we hadn't prepared the data in the silver layer, we would have to do a preparation step here in order to join the tables. But we don't have to, because that was a pre-step in the silver layer. So now you can see the systematic approach that we have in bronze, silver, gold. After joining the tables, we have to go and pick the information that we need from the second table, which is the birthdate, so bdate. And from this table there is another nice piece of information: the gender. So that's all we need from the second table. Let's go and check the third table.
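At this point the query looks roughly like the sketch below; the exact table and column names are assumptions based on the naming conventions used earlier in the course:

SELECT
    ci.cst_id,
    ci.cst_key,
    ci.cst_firstname,
    ci.cst_lastname,
    ca.bdate,  -- birthdate from the ERP source
    ca.gen     -- gender from the ERP source
FROM silver.crm_cust_info AS ci        -- master table: we never want to lose a customer
LEFT JOIN silver.erp_cust_az12 AS ca   -- secondary source, so LEFT JOIN instead of INNER JOIN
    ON ci.cst_key = ca.cid;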
So the third table is about the location information, the countries, and again we connect the tables by the CID with the key. So let's go and do that. We're going to say as well LEFT JOIN silver ERP location, and I'm going to give it the alias LA, and then we have to join by the keys. Same thing: it's going to be CI customer key equal to LA CID. Again, we have prepared those IDs and keys in the silver layer, so the join should be working. Now we have to go and pick the data from the second table. So what do we have over here? We have the ID, the country, and the metadata columns. So let's go and just get the country. Perfect. So now with that we have joined all three tables and we have picked all the columns that we want in this object. So again, looking over here, we have joined this table with this one and this one, and with that we have collected all the customer information that we have from the two source systems. Okay. So now let's go and run the query in order to make sure that everything is correct. In order to understand whether your joins are correct, you have to keep your eye on those last three columns. If you are seeing data, that means you are doing the joins correctly; but if you are seeing a lot of nulls, or no data at all, that means your joins are incorrect. Right now it looks to me like it is working. Another check that I do: even if your first table has no duplicates, what can happen is that after doing multiple joins you start getting duplicates, because the relationship between those tables is not a clean one-to-one; you might have a one-to-many or a many-to-many relationship. So the check that I usually do at this stage is to make sure that I don't have duplicates in the result, so that we don't have multiple rows for the same customer. In order to do that, we do a quick GROUP BY. We're going to group the data by the customer ID and then do a count over this subquery. So this is the whole subquery, and after that we're going to say GROUP BY the customer ID, and then HAVING COUNT(*) greater than 1. This query tries to find out whether we have any duplicates in the primary key. So let's go and execute it. We don't have any duplicates, and that means that joining all those tables with the customer info didn't cause any issues and didn't duplicate my data. This is a very important check to make sure that you are on the right track. All right. So that means everything is fine with the duplicates; we don't have to worry about them.
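Here is that duplicate check as a compact sketch (names assumed, as before); an empty result means the joins did not multiply any rows:

SELECT cst_id, COUNT(*) AS cnt
FROM (
    SELECT ci.cst_id
    FROM silver.crm_cust_info AS ci
    LEFT JOIN silver.erp_cust_az12 AS ca ON ci.cst_key = ca.cid
    LEFT JOIN silver.erp_loc_a101  AS la ON ci.cst_key = la.cid
) AS t
GROUP BY cst_id
HAVING COUNT(*) > 1;  -- any row returned here is a duplicated primary key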
Now we do have an integration issue here, though. Let's go and execute the query again. If you look at the data, we have two sources for the gender information: one comes from the CRM and another one comes from the ERP. So now the question is: what are we going to do with this? Well, we have to do data integration. So let me show you how I do it. First I open a new query, then I remove all the other stuff and leave only those two columns, using DISTINCT, just to focus on the integration. Let's go and execute it, and maybe add an ORDER BY as well; let's order by one and two, and execute again. Now here we have all the scenarios, and we can see that sometimes there is a match: from the first table we have female, and from the other table we have female as well. But sometimes we have an issue, like here, where the two tables are giving different information, and the same thing over here; this is a conflict as well. Another scenario is where we have data from the first table, like here we have female, but in the other table it is not available. Well, this is not a problem: we can take it from the first table. But we have the exact opposite scenario as well, where the data from the first table is not available but it is available from the second table. And now here you might wonder why I'm getting a null over here. We handled all the missing data in the silver layer and replaced everything with 'not available', so why are we still getting a null? This null doesn't come directly from the tables; it comes from joining the tables. That means there are customers in the CRM table that are not available in the ERP table, and if there is no match, SQL gives us a null. So this null means there was no match; it is not coming from the content of the tables, and it is of course an issue. But the big question is what to do with the two scenarios where we have data from both sides but the values are different. Here again we have to ask the experts: what is the master here, the CRM system or the ERP? And let's say their answer is that the master for the customer information is the CRM. That means the CRM information is more accurate than the ERP information, and this is only for the customers, of course. So for the scenario where we have female and male, the correct value is the female from the first source system. The same goes over here, and where we have male and female, the correct one is the male, because that source system is the master. Okay. So now let's go and build this business rule. We're going to start, as usual, with a CASE WHEN. The first, very important rule is: if we have data in the gender column from the CRM system, from the master, then use it. So we're going to check the gender from the CRM table: customer gender is not equal to 'not available', which means we have a value, male or female. Let me just put a comma here, like this. Then what happens? Go and use it; we use the value from the master, because the CRM is the master for the gender info. Otherwise, it is not available from the CRM table, so we go and grab the information from the second table: we say CA gender. But now we have to be careful with that null from the join; we have to convert it to 'not available' as well. For that we're going to use COALESCE: if this is a null, then use 'not available' instead, like this. So that's it; let's add the END, and let me just push this over here. Let's call it new_gen for now.
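As a sketch, the business rule could be written like this, assuming 'n/a' is the placeholder used for missing values in the silver layer:

CASE
    WHEN ci.cst_gndr != 'n/a' THEN ci.cst_gndr  -- the CRM is the master for gender info
    ELSE COALESCE(ca.gen, 'n/a')                -- fall back to the ERP; a join miss produces NULL
END AS new_gen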
Now let's go and execute it and check the different scenarios. For all those values over here, we have data from the CRM system, and that is what shows up in the new column. For the next part we don't have data from the first system, so we try to get it from the second system: for the first one it is not available from the CRM, so the ELSE is activated; the value from the join is null, COALESCE kicks in, and we replace the null with 'not available'. For the second scenario, again, the first source system doesn't have the gender information, so we grab it from the second, and with that we have a female. The third one is the same: we don't have the information, but we get it from the second source system, so we have the male. And for the last one, it is not available in both source systems, and that's why we are getting 'not available'. So with that, as you can see, we have a perfect new column where we are integrating two different source systems into one. And this is exactly what we call data integration. This piece of information is way better than what the CRM source or the ERP source has on its own; it is richer and more complete. And this is exactly why we bring data from different source systems together: to get richer information into the data warehouse. So with that we have a nice logic, and as you can see, it is way easier to build the logic in a separate query first and then take it back to the original query. So what I'm going to do is copy everything from here, go back to our query, delete the two gender columns, and put our new logic over here. So, a comma, and let's go and execute. With that we have our nice new column. Now we have a very nice object: we don't have duplicates, and we have integrated the data together. We took three tables and put them into one object. The next step is that we're going to give the columns nice, friendly names. The rule in the gold layer is to use friendly names, not to follow the names that we get from the source systems, and to make sure that we follow our naming conventions; so we are using snake_case. Let's go and do it step by step. The first one, let's go and call it the customer ID. For the next one I will get rid of the word 'key': I'm going to call it customer number, because those are customer numbers. Then the next one we're going to call first name, without using any prefixes, and the next one last name, and here we have the marital status; I will use the exact name but without the prefix. And this one we're just going to call gender, this one the create date, this one the birth date, and the last one is going to be the country. So let's go and execute it. As you can see, the names are really friendly: we have customer ID, customer number, first name, last name, marital status, gender. The names are really nice and easy to understand. Now, as the next step, I'm going to think about the order of those columns. The first two make sense together, then the first name and last name; and I think the country is a very important piece of information, so I'm going to take it from here and put it right after the last name. It's just nicer. Let's execute it again: first name, last name, country. It's always nice to group related columns together, right? So we have here the marital status, the gender and so on, and then we have the create date and the birth date. I think I'm going to switch the birth date with the create date, since it is more important than the create date. Like this, and don't forget the comma. Execute again. It looks wonderful. Now comes a very important decision about this object: is it a fact table or a dimension? Well, as we learned, dimensions hold descriptive information about an object, and as you can see, we have here descriptions of the customers. All those columns are describing customer information, and we don't have transactions or events here.
And we don't have measures and so on, so we cannot say this object is a fact; it is clearly a dimension. That's why we're going to call this object the customer dimension. Now, there is one more thing: if you are creating a new dimension, you always need a primary key for it. Of course, we could rely on the primary key that we get from the source system, but sometimes you have dimensions where there is no primary key you can count on. So what we do is generate a new primary key in the data warehouse, and those primary keys are called surrogate keys. Surrogate keys are system-generated unique identifiers assigned to each record to make it unique. A surrogate key is not a business key: it has no meaning, and no one in the business knows about it. We only use it in order to connect our data model, and this way we have more control over how to connect the data model and we don't always have to depend on the source system. There are different ways to generate surrogate keys, like defining them in the DDL or using the window function ROW_NUMBER. In this data warehouse, I'm going to go with the simple solution where we use the window function. So, in order to generate a surrogate key for this dimension, it is very simple: we're going to say ROW_NUMBER() OVER, and here we have to order by something. You can order by the create date, or the customer ID, or the customer number, whatever you want, but in this example I'm going to order by the customer ID. And we have to follow the naming convention that all surrogate keys end with 'key' as a suffix. So now let's go and query this. As you can see, at the start we have a customer key, and it is a sequence; of course we don't have any duplicates here. This surrogate key is generated in the data warehouse, and we're going to use it in order to connect the data model. So now our query is ready, and the last step is to create the object. As we decided, all the objects in the gold layer are going to be virtual ones; that means we're going to create a view. So we're going to say CREATE VIEW gold dot dim, following the naming convention where dim stands for dimension, then customers, and after that we have the AS. With that, everything is ready. Let's go and execute it. It was successful. Let's go to the views now, and you can see our first object: we have the customer dimension in the gold layer. Now, as you know me, the next step is to check the quality of this new object. Let's open a new query: SELECT * FROM our view dim customers. And now we have to make sure that everything is in the right position, like this. We can do different checks, like uniqueness and so on, but I'm worried about the gender information, so let's go and get the DISTINCT of all its values. As you can see, it is working perfectly: we have only female, male, and not available. So that's it; with that, we have our first new dimension.
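Putting all the pieces together, the finished view could look roughly like this; every table and column name below is an assumption based on the course's naming conventions:

CREATE VIEW gold.dim_customers AS
SELECT
    ROW_NUMBER() OVER (ORDER BY ci.cst_id) AS customer_key,  -- surrogate key generated in the warehouse
    ci.cst_id             AS customer_id,
    ci.cst_key            AS customer_number,
    ci.cst_firstname      AS first_name,
    ci.cst_lastname       AS last_name,
    la.cntry              AS country,
    ci.cst_marital_status AS marital_status,
    CASE
        WHEN ci.cst_gndr != 'n/a' THEN ci.cst_gndr  -- CRM is the master for gender
        ELSE COALESCE(ca.gen, 'n/a')
    END                   AS gender,
    ca.bdate              AS birth_date,
    ci.cst_create_date    AS create_date
FROM silver.crm_cust_info AS ci
LEFT JOIN silver.erp_cust_az12 AS ca ON ci.cst_key = ca.cid
LEFT JOIN silver.erp_loc_a101  AS la ON ci.cst_key = la.cid;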
Okay friends. So now let's go and build the second object: the products. As you can see, product information is available in both source systems. As usual, we're going to start with the CRM information and then join it with the other table in order to get the category information. So those are the columns that we want from this table. Now we come to a big decision about this object. This table contains historical information as well as the current information. Of course, it depends on the requirements whether you have to do analysis on the historical information; but if you don't have such a requirement, we can stay with only the current information of the products. So we don't have to include all the history in the object, and anyway, as we learned from the model over here, we are not using the primary key, we are using the product key. So what we have to do is filter out the historical data and keep only the current data. We're going to add a WHERE condition here. In order to select the current data, we're going to target the end dates: if the end date is null, that means it is current data. Let's take this example over here. You can see we have three records for the same product key; for the first two records we have a value in the end date, because they are historical records, but for the last record the end date is null, and that's because this is the current information: it is open and not closed yet. So in order to select only the current information, it is very simple: we can say product end date IS NULL. If you go now and execute it, you will get only the current products; you will not have any history. And of course we can add a comment: filter out all historical data. This also means we don't need the end date in our selection, because it is always null. So with that, we have only the current data. Now the next step is to join it with the product categories from the ERP, and we're going to use the ID here. As usual, the master information is the CRM, and everything else is secondary; that's why I use the left join, just to make sure I'm not losing, not filtering out, any data, because if there is no match, we would lose data. So LEFT JOIN silver ERP and the category table; let's call it PC. Now we're going to join it using the key: from the CRM we have the category ID, equal to PC ID. And now we have to pick columns from the second table: from PC we take the category, very important, the subcategory, and we can go and get the maintenance as well. So, something like this; let's go and query. With that, all those columns come from the first table, and those three come from the second. So with that we have collected all the product information from the two source systems. Now the next step is to check the quality of this result, and what is very important is to check the uniqueness. So we're going to write the following query: I want to make sure that the product key is unique, because we're going to use it later in order to join the table with the sales. So FROM the subquery, then GROUP BY product key, and HAVING COUNT(*) greater than 1. Let's go and check. Perfect, we don't have any duplicates: the second table didn't introduce any duplicates in our join. And this also confirms we don't have historical data; each product is exactly one record, without any duplicates. I'm really happy about that. So let's go and query again. Now, of course, the next question: do we have anything to integrate together? Do we have the same information twice? Well, we don't have that.
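At this stage the product query looks roughly like this (names assumed, as before):

SELECT
    pn.prd_id,
    pn.prd_key,
    pn.prd_nm,
    pc.cat,          -- category from the ERP source
    pc.subcat,       -- subcategory from the ERP source
    pc.maintenance
FROM silver.crm_prd_info AS pn
LEFT JOIN silver.erp_px_cat_g1v2 AS pc
    ON pn.cat_id = pc.id
WHERE pn.prd_end_dt IS NULL;  -- filter out historical data, keep only current products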
The next step is that we're going to group the related information together. So I'm going to say the product ID, then the product key, and the product name go together; all those three belong together. After that, we can put all the category information together: the category ID, the category itself, the subcategory. Let me just query and see the result. So we have the product ID, key, and name, then we have the category ID, the category name, and the subcategory, and then maybe we also put the maintenance after the subcategory, like this. And I think the product cost, the line, and the start date can stay at the end. So let me just check: those four pieces of information about the category, and then we have the cost, the line, and the start date. I'm really happy with that. The next step: we're going to give those columns nice, friendly names. Let's start with the first one: this is the product ID. The next one is going to be the product number; we need the name 'key' for the surrogate key later. Then we have the product name. After that we have the category ID and the category, and this is the subcategory. The next one stays as it is; I don't have to rename it. The next one is going to be the cost, then the product line, and the last one is going to be the start date. So let's go and execute it. Now we can see all those friendly column names very nicely in the output, and it looks way nicer than before. I don't even have to describe those columns; the names describe them. Perfect. Now the next big decision: what do we have here? Do we have a fact or a dimension? What do you think? Well, as you can see, here again we have a lot of descriptions of the products. All this information is describing the business object products. We don't have transactions, events, or a lot of different keys and IDs here. So we don't really have facts here; we have a dimension. Each row is describing exactly one object, one product. That's why this is a dimension. Okay. So now, since this is a dimension, we have to create a primary key for it, well, actually a surrogate key, and as we did for the customers, we're going to use the window function ROW_NUMBER in order to generate it: OVER, and then we have to sort the data. I will go with the start date, so let's order by the start date as well as the product key, and we're going to give it the name product key, like this. So let's go and execute it. With that, we have now generated a primary key for each product, and we're going to use it in order to connect our data model. All right. Now, the next step: with that, we're going to build the view. We're going to say CREATE VIEW, gold, and dimension products, and then AS. So, let's go and create our object. And now, if you go and refresh the views, you will see our second object, the second dimension: we have the product dimension in the gold layer. And as usual, we're going to have a look at this view just to make sure that everything is fine. So, dim products; let's execute it. And looking at the data, everything looks nice. So with that we have now two dimensions.
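For reference, the finished product dimension could look roughly like this (again, all names are assumptions):

CREATE VIEW gold.dim_products AS
SELECT
    ROW_NUMBER() OVER (ORDER BY pn.prd_start_dt, pn.prd_key) AS product_key,  -- surrogate key
    pn.prd_id       AS product_id,
    pn.prd_key      AS product_number,
    pn.prd_nm       AS product_name,
    pn.cat_id       AS category_id,
    pc.cat          AS category,
    pc.subcat       AS subcategory,
    pc.maintenance,
    pn.prd_cost     AS cost,
    pn.prd_line     AS product_line,
    pn.prd_start_dt AS start_date
FROM silver.crm_prd_info AS pn
LEFT JOIN silver.erp_px_cat_g1v2 AS pc ON pn.cat_id = pc.id
WHERE pn.prd_end_dt IS NULL;  -- current products only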
All right friends. So with that we have covered a lot: we have covered the customers and the products, and we are left with only one table, the one with the transactions, the sales. For the sales information we only have data from the CRM; we don't have anything from the ERP. So let's go and build it. Okay. So now I have all this information, and since we have only one table, we don't have to do any integration. Now we have to answer the big question: do we have here a dimension or a fact? Well, looking at these details, we can see transactions, we can see events. We have a lot of date information, we have a lot of measures and metrics, and we have a lot of IDs, so it is connecting multiple dimensions. This is exactly the perfect setup for a fact. So we're going to use this information as a fact. And, as we learned, a fact connects multiple dimensions, so we have to present in this fact the surrogate keys that come from the dimensions. Those two columns, the product key and the customer ID, come from the source system, and as we learned, we want to connect our data model using the surrogate keys. So what we're going to do is replace those two columns with the surrogate keys that we have generated, and in order to do that we have to join the two dimensions to get the surrogate keys. We call this process a data lookup: we are joining the tables only in order to get one piece of information. So let's go and do that. We will go with a left join, of course, so as not to lose any transactions. First, we're going to join on the product key. Now, in the silver layer we don't have any surrogate keys; we have them in the gold layer. That means for the fact table we're going to be joining the silver layer together with the gold layer. So, gold dot, then the product dimension, and I'm going to call it PR. And we're going to join SD using the product key together with the product number from the dimension. The only information that we need from the dimension is the key, the surrogate key. So we're going to go over here and say product key, and I'm going to remove the original column from here, because we don't need it: we don't need the original product key from the source system, we need the surrogate key that we have generated in this data warehouse. The same thing happens for the customer: gold customer dimension; again we are doing a lookup here to enrich SD. We are joining using this ID over here, equal to the customer ID, because this is a customer ID. And we do the same thing: we take the surrogate key, the customer key, and we delete the source ID, because we don't need it; now we have the surrogate key. So let's go and execute it. With that, we have in our fact table the two keys from the dimensions, and this is what lets us connect the data model, connect the fact with the dimensions. This is a very necessary step in building the fact table: you have to put the surrogate keys from the dimensions into the fact. That was actually the hardest part of building the fact. Now, the next step: all you have to do is give friendly names. So we're going to go over here and say order number. The surrogate keys are already friendly. Then we're going to say this is the order date, the next one is going to be the shipping date, then the due date, and for the sales I'm going to say sales amount, then the quantity, and the final one is the price. So now let's go and execute it and look at the results.
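The full fact query at this stage looks roughly like the sketch below; note how the two lookups swap the source keys for the surrogate keys (names assumed):

SELECT
    sd.sls_ord_num  AS order_number,
    pr.product_key,                  -- surrogate key looked up from the product dimension
    cu.customer_key,                 -- surrogate key looked up from the customer dimension
    sd.sls_order_dt AS order_date,
    sd.sls_ship_dt  AS shipping_date,
    sd.sls_due_dt   AS due_date,
    sd.sls_sales    AS sales_amount,
    sd.sls_quantity AS quantity,
    sd.sls_price    AS price
FROM silver.crm_sales_details AS sd
LEFT JOIN gold.dim_products  AS pr ON sd.sls_prd_key = pr.product_number
LEFT JOIN gold.dim_customers AS cu ON sd.sls_cust_id = cu.customer_id;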
So now, as you can see, the columns look very friendly. About the order of the columns, we use the following scheme: first, in the fact table, we have all the surrogate keys from the dimensions; second, we have all the dates; and at the end we group all the measures and metrics. So that's it for the fact query; now we can go and build it. We're going to say CREATE VIEW gold, in the gold layer, and this time we're going to use the prefix fact with an underscore, and we're going to call it sales, and then don't forget the AS. That's it; let's go and create it. Perfect, now we can see the fact. So with that we have three objects in the gold layer: two dimensions and one fact. And of course, the next step is to check the quality of the view. So let's do a simple SELECT from fact sales and execute it. Checking the result, you can see it is exactly like the result from the query, and everything looks nice. Okay. So now, one more trick that I usually do after building a fact is to try to connect the whole data model, in order to find any issues. So let's go and do that. We will do a simple left join with the dimensions: gold customer dimension, alias C, and we will use the keys, and then we're going to say WHERE customer key IS NULL, meaning there is no match. Let's go and execute it. As you can see, we are not getting anything in the results; that means everything is matching perfectly. And we can do the same thing with the products: LEFT JOIN gold dim products P on the product key, connecting it with the fact's product key, and then we check whether the product key from the dimension is null, like this. So we are checking whether we can connect the fact with the product dimension. Let's go and check, and as you can see, again we are not getting anything, and this is all right. So with that, we now have SQL code that is tested and that creates the gold layer.
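Here is that connectivity check as a compact sketch; an empty result means every fact row finds its dimension rows:

SELECT f.*
FROM gold.fact_sales AS f
LEFT JOIN gold.dim_customers AS c ON f.customer_key = c.customer_key
LEFT JOIN gold.dim_products  AS p ON f.product_key  = p.product_key
WHERE c.customer_key IS NULL
   OR p.product_key  IS NULL;  -- any row here points to a broken link in the data model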
Now, for the next step: as you know, in our requirements we have to create clear documentation for the end users so that they can use our data model. So let's go and draw the data model of the star schema. Let's go and search for a table shape. I'm going to take this one, where I can mark what is the primary key and what is the foreign key. And I'm going to change the design a little bit: it's going to be rounded, and let's say I'm going to change to this color, and maybe set the font size to 16, then select all the columns and make them 16 as well, just to increase the size, and then go to Arrange and increase the width a little. So now let's zoom in on the first table. Let's call it gold dimension customers and make it a little bit bigger, like this. Now we're going to define the primary key here: it is the customer key. And what else are we going to do? We're going to list all the columns of the dimension. It is a little bit tedious, but the result is going to be awesome. So what do we have? The customer ID, the customer number, and then the first name. In case you need new rows, you can hold Ctrl and press Enter, and then you can add the other columns. So now pause the video, go and create the two dimensions, the customers and the products, and add all the columns that you have built in the views. Welcome back. So now I have those two dimensions. The third one is going to be the fact table. For the fact table I'm going to go with a different color, for example blue, and I'm going to put it in the middle, something like this. So we're going to say gold fact sales, and here we don't have a primary key, so we're going to delete that marker, and I have to add all the columns of the fact: order number, product key, customer key. Okay. All right. Perfect. Now we can add the foreign key information: the product key is a foreign key to the products, so we're going to say FK1, and the customer key is going to be the foreign key to the customers, so FK2, and of course you can go and increase the spacing for that. Okay. So now, after we have the tables, the next step in data modeling is to describe the relationships between these tables. This is of course very important for reporting and analytics, in order to understand how to use the data model. We have different types of relationships: one-to-one, one-to-many. In a star schema data model, the relationship between a dimension and the fact is one-to-many. That's because in the customers table we have, for a specific customer, only one record describing that customer, but in the fact table the customer might exist in multiple records, because customers can order multiple times. That's why on the fact side it is many, and on the dimension side it is one. Now, in order to see all those relationship types, we go to the menu on the left side, and as you can see, we have entity relations here, with different types of arrows: for example zero-to-many, one-to-many, one-to-one, and many other types of relations. So which one are we going to take? We're going to pick this one: it says one, mandatory, so the customer must exist in the dimension table, to many, but optional. So here we have three scenarios: the customer didn't order anything, the customer ordered only once, or the customer ordered many things. That's why on the fact side it is optional. So we're going to take this one and place it over here: we connect the one end to the customer dimension and the many end to the fact. Well, actually, we have to do it on the customers. So with that, we are describing the relationship between the dimension and the fact as one-to-many: one, mandatory, on the customer dimension side, and many, optional, on the fact side. We have the same story for the products: the many end goes to the fact and the one end goes to the products. So it's going to look like this. Each time you connect a new dimension to the fact table, it is usually a one-to-many relationship. And you can add anything you want to this model, for example a text note explaining something. If you have some complicated calculations, you can write that information over here. For example, we can add a note: sales calculation. We can make it a little bit smaller, let's go with 18, and write the formula in it: sales equals quantity multiplied by price, and make this a little bit bigger. So it is a really nice piece of info that we can add to the data model, and we can even link it to the column.
So we can take this arrow, for example, put it like this, and link it to the column, and with that you have a nice explanation of the business rule or the calculation as well. You can add any descriptions that you want to the data model, just to make it clear for anyone who is using it. So with that, you don't only have three tables in the database; you also have a kind of documentation and explanation. In one picture we can see how the data model is built and how you can connect the tables together. It is really amazing for all the users of your data model. All right. So now, with that, we have a really nice data model, and in the next step we're going to quickly create a data catalog. All right, great. So with that we have a data model, and we can say we have something called a data product, and we will be sharing this data product with different types of users. And there is something that every data product absolutely needs, and that is the data catalog. It is a document that describes everything about your data model: the columns, the tables, maybe the relationships between the tables as well. With that, you make your data product clear for everyone, and it's going to be way easier for them to derive insights and reports from your data product. And what is the most important point? It is time-saving, because if you don't do it, what's going to happen? Each consumer, each user of your data product, will keep asking you the same questions: what do you mean with this column? What is this table? How do I connect table A with table B? And you will keep repeating yourself and explaining stuff. So instead of that, you prepare a data catalog and a data model, you deliver everything together to the users, and with that you save a lot of time and stress. I know it is annoying to create a data catalog, but it is an investment and a best practice. So now let's go and create one. Okay. So in order to do that, I have created a new file called data catalog in the docs folder. What we're going to do here is very straightforward: we're going to make a section for each table in the gold layer. For example, we have here the table dimension customers. What you have to do first is describe the table; so we are saying it stores customer details enriched with demographic and geographic data. You give a short description of the table, and after that you list all the columns of the table, maybe with the data type, but what is far more important is the description of each column. You give a very short description, like for example here: the gender of the customer. And one of the best practices when describing a column is to give examples, because you can understand the purpose of a column quickly just by seeing an example, right? So here we are saying you can find inside it: male, female, and not available. With that, the consumer of your table can immediately understand that it will not be an 'M' or an 'F'; it's going to be a full, friendly value, and they don't have to go and query the content of the table to understand the purpose of that column. So with that, we have a full description of all the columns of our dimension. We're going to do the same for the products: again, a description of the table and a description of each column, and the same for the fact. So that's it: with that, you have a data catalog for your data product, the gold layer.
With that, the business users or the data analysts have a better and clearer understanding of the content of your gold layer. All right my friends, that's all for the data catalog. In the next step, we're going to go back to draw.io, where we're going to finalize the data flow diagram. So let's go. Okay. So now we're going to extend our data flow diagram, this time for the gold layer. Let's go and copy the whole thing from the silver layer and put it over here, side by side. And of course, we're going to change the coloring to gold, and then rename things: so this is the gold layer. But of course we cannot leave those tables as they are; we have a completely new data model. So what do we have over here? We have the fact sales, we have the dimension customers, and we have the dimension products. So what I'm going to do is remove all the old tables; we have only three tables now. Let's put those three tables somewhere here in the center. Now what you have to do is start connecting things. I'm going to go with this arrow over here, a direct connection, and start connecting: the sales details go to the fact table; maybe put the fact table over here. Then we have the customer dimension: this comes from the CRM customer info, and from two tables in the ERP, this table as well as the location table. The same goes for the products: it comes from the product info and from the categories in the ERP. Now, as you can see, we have crossing arrows here. What we can do is select everything and turn on line jumps with a gap; this makes the arrows look a little bit better. So now, for example, if someone asks you where the data for the product dimension comes from, you can open this diagram and tell them: okay, this comes from the silver layer, from two tables, the product info from the CRM and the categories from the ERP; and those silver tables come from the bronze layer, where you can see the product info comes from the CRM and the categories come from the ERP. So it is very simple: we have just created a full data lineage for our data warehouse, from the sources into the different layers. And data lineage is really amazing documentation that helps not only your users but the developers as well. All right. So with that, we have a very nice data flow diagram and data lineage. So we have completed the data flow. It really feels like progress, like achievement, as we are ticking off all those tasks. And now we come to the last task in building the data warehouse, where we're going to commit our work to the Git repo. Okay. So now let's put our scripts into the project. We're going to go to the scripts folder over here; we have bronze and silver, but we don't have gold. So let's go and create a new file: we're going to have gold/ and then ddl_gold.sql. Now we're going to paste our views; we have here our three views. And as usual, at the start we describe the purpose of the script. So we are saying: create gold views. This script creates the views for the gold layer, and the gold layer represents the final dimension and fact tables in a star schema. Each view performs transformations and combines data from the silver layer to produce clean, business-ready datasets, and those views can be queried directly for analytics and reporting. So that's it.
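As an illustration, the header of such a script could look something like this (the exact wording here is my own):

/*
===============================================================================
DDL Script: Create Gold Views
===============================================================================
Purpose:
    Creates the views for the gold layer (the star schema: dimensions and facts).
    Each view transforms and combines data from the silver layer to produce
    clean, business-ready datasets for analytics and reporting.
===============================================================================
*/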
Let's go and commit it. Okay. So with that, as you can see, we have the bronze and the silver; we have all our ETL scripts in the repository. And now, for the gold layer as well, we're going to add all those quality checks that we have used in order to validate the dimensions and the fact. We're going to go to the tests folder over here and create a new file: it's going to be quality checks gold, and the file type is SQL. So now let's go and paste our quality checks. We have the check for the fact, the two dimensions, and as well an explanation of the script: we are validating the integrity and accuracy of the gold layer, checking the uniqueness of the surrogate keys and whether we are able to connect the data model. So let's put that in our Git as well and commit the changes. And in case we come up with new quality checks, we're going to add them to this script. Those checks are really important if you are modifying the ETLs, or if you want to make sure after each load that everything is fine; they are like a quality gate for the gold layer. Perfect. So now we have our code in our repository. Okay friends. So now what you have to do is finalize the Git repo. For example, all the documentation that we have created during the project we can upload in the docs folder; you can see here the data architecture, the data flow, the data integration, the data model and so on. So each time you edit those pages, you can commit your work and you have a version of it. Another thing you can do is go to the readme: for example, over here I have added the project overview, some important links, the data architecture with a short description of it, and of course don't forget to add a few words about yourself and your profiles on the different social media platforms. All right my friends. So with that we have committed our work and closed the last epic, building the gold layer, and with that we have completed all the phases of building a data warehouse. Everything is at 100%, and this feels really nice. All right my friends. So with that we have covered the first type of SQL project: the data warehousing project. This is usually a very complex project that you can get involved in at a company, and it is a really amazing project if you are planning to be a data engineer. But of course, if you are a data analyst, you might end up building warehouses as well. So now we have everything prepared for the second type of SQL project: we will now deep dive into exploratory data analysis. So let's go. In this second type of project, we're going to use our basic SQL skills in order to do something called data profiling, where we try to understand all the aspects of our datasets using simple aggregations like sum, average, and count, and we will also be using techniques like subqueries. All right my friends. So the first step in any data project is that we need datasets. If you have done the previous project, where we built the SQL data warehouse, then you have everything: the data and the database, so you don't have to worry about it. But if you skipped it, which I don't recommend, I have still prepared the files and the database for you. So let's get the data and create our database. All right.
So now, if you go to the link in the description, we're going to go to the downloads, and of course you can subscribe to my newsletter. Then here we have the SQL course materials, and here we have a link for the data analytics project; let's go to that link. Here you have some important links, like downloading SQL Server and the Management Studio, where we're going to write our SQL; there is as well a link to the Git repository; and, very important, a link to download all the project files. So click on that and download all the files. Now extract the archive and put it somewhere safe on your PC; inside it you can find all the scripts and the datasets. Now, there are three ways to create the database in SQL Server. The first one is by executing a script. If you go to the scripts over here, the first one is a file called init database. Just go inside it and copy the whole thing, then let's go to SQL Server. Open a new query, make sure you switch to the master database, and then paste the whole code. What we are doing here is creating a new database, creating a schema, and then three very important tables that we're going to use in our data analysis. There is only one thing that you have to change in this script, and that is the path to the files. Once you have done that, just go and execute the whole script. And as you can see, everything is done and the data is inserted. Now, if you go to the left side, to the databases, and refresh, you will find a new database called data warehouse analytics, and if you go inside its tables you will find our three tables: customers, products, and sales. So this is one way to create the database.
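A trimmed sketch of what such an init script does; the database name follows the course, the CSV path is hypothetical (adjust it to your PC), and the real script creates all three tables with their full column lists:

USE master;
GO
CREATE DATABASE DataWarehouseAnalytics;
GO
USE DataWarehouseAnalytics;
GO
CREATE SCHEMA gold;
GO
-- One table shown here; the pattern repeats for dim_products and fact_sales
CREATE TABLE gold.dim_customers (
    customer_key INT,
    customer_id  INT,
    first_name   NVARCHAR(50),
    last_name    NVARCHAR(50),
    country      NVARCHAR(50)
);
GO
BULK INSERT gold.dim_customers
FROM 'C:\sql-course\datasets\csv-files\gold.dim_customers.csv'  -- hypothetical path
WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', TABLOCK);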
The third way is to restore the database itself. How are we going to do it? We go again to the data sets, and as you can see we have here a database backup, a BAK file. What you have to do is copy that and then go to the database location. It really depends on where you installed SQL Server; currently I have it under Program Files, Microsoft SQL Server, then the Express instance, MSSQL, Backup, and you place the file over here. So I have it here, the data warehouse analytics backup. And now all you have to do is right-click on the databases, say restore database, go to device, click the three dots and say add. Now you can see our database, data warehouse analytics. Once we say OK, and then OK again, the whole database is restored without running any scripts (since I already have it, I will get an error, but on your machine it will go through). So those are the three ways to create the database for the project, and if you have built the data warehouse project with me before, you don't have to do any of this, because we built it together. So pause the video and get the data for the project. All right my friends. We're going to start with a secret, a little trick that I always use when analyzing any data set. So let's start with a little coffee first. Ah, this is really hot. Okay. So the secret is: whenever I look at any data set in any project, I see the data as always divided between dimensions and measures. What truth? You take the blue pill, you take the red pill; all I'm offering is the truth, nothing more. If you see your data like me, as dimensions and measures, you can generate an endless amount of insights from any project, from any data set, and you will find me throughout the projects always speaking about measures and dimensions. So I'm going to show you how I usually do it. Looking at any data set in any project, you have multiple columns and rows, and I see the data always split into two categories: either a dimension or a measure. And now of course the question is: is my column a dimension or a measure? Well, in order to assign it to one of those categories, you ask a first question: is it a numeric value? If not, so you have a string or a date or any other data type, then it is a dimension. And if it is numeric, you ask a second question: does it make sense to aggregate it? If the answer to both questions is yes, it is numeric and it makes sense to aggregate it, then it is a measure; otherwise it is a dimension. Now let's practice with some examples. Looking at the values of the column category, you can see all the values are characters, so it is not numeric, which means this column is a dimension. Very simple. Let's take another column, the sales amount. The values are numeric, and it makes sense to aggregate them; we can get the total sales or the average sales and so on. So it fulfills both conditions, it is numeric and it makes sense to aggregate it, and that's why we say sales is a measure. Now if you check the values of the product name, you can see all of them are characters and names, so it is not numeric; that means the product is a dimension. Moving on to the next one, we have the quantity. The values are numeric, and it makes sense to aggregate them; we can sum all those values to get the total quantity. So quantity is a measure. Now if you look at the values of the birth dates, you can see this is date information, it is not numeric, so it is a dimension, right? But if you calculate the age from the birth date, the age of the customer is going to be numeric, and it makes sense to aggregate it, for example finding the average age of customers. So if we derive a numeric value from a dimension, we can use it as a measure: age is a measure. And now we come to something really tricky. This is the ID.
So for example, if you check the customer ID, you can see all those values are numeric, so the first condition is fulfilled. Now the very important question: does it make sense to aggregate the IDs? Well, those IDs are unique identifiers for customers, and finding the average of them is not helpful, right? I cannot think of one use case for aggregating the customer ID, like taking the average of all those IDs or summing them up. So it makes no sense to aggregate it, and that's why we consider the ID of a customer a dimension, not a measure. So as you can see, it is very simple: if it is numeric and it makes sense to aggregate it, then it is a measure; otherwise it is a dimension. And this is the foundation of any data analytics. If you see your data as dimensions and measures, you can generate a lot of use cases and insights from your data sets. Now, I totally understand if you are still confused about dimensions and measures, and you might be asking why you need them. Well, if you are doing any type of data analysis, or exploring any data set, you will always end up grouping the data by something, like grouping the data by countries, or by products, or by categories. So we need dimensions to group our data. And on the other side, you will be asking questions like how much, how many, what is the total of something. You always need to aggregate or calculate something, and for that you need the measure. So we need measures to answer the questions how many and how much, and we need dimensions to group the data by something. That's why in almost any type of data analysis you need dimensions and measures, and this will become clearer as we progress through the project. All right. So now I'm going to walk you through the project road map, and I have split it into six steps. We're going to do different types of exploration, like the database, the dimensions, the measures and the dates, and we're going to do some basic analyses like magnitude and ranking. So let's start with the first step in our project: database exploration. Let's say you have joined a team and you got access to a database. The first thing that I usually do is explore the structure of the database, just to get a basic understanding of the tables, the views, the columns. Are we talking about ten tables, or hundreds of tables? It is just a few queries in order to say hello to the database. So now let's go to SQL and explore the database of our project. How are we going to do it? Either you go to the left side over here and start clicking through the objects of your database, exploring the tables, views, columns and so on, or, the better way that I usually use, you explore the database with a query. We can select data from the system tables, because the database stores metadata information about our tables and objects. We're going to target the information schema, an internal schema in the database with multiple tables and views for exploring the metadata and the structure of our database. For example, we can start with the tables. So let's go and query it, and with that you have a list of tables, and you can see multiple pieces of information: the catalog, the schema and the table names, and over here the object type, whether it is a table or a view.
If you have done the data warehouse project with me, you will find a lot of tables; but if you are only doing the data analytics project, you will see just those three tables: customers, products and sales. So with that we can see whether our database has around 15 tables or just three. In the output you can see the database name, the schema and a list of all tables, and of course don't forget that you should be using the database that we created. So with that we have a nice quick list of all tables inside our database. As the next step, we can drill down and check which columns we have inside our database, and for that we can target the same schema. So select star from the information schema again, and this time, very simply, we go to the columns view. Let's execute it. Now we see a lot of information over here: in our database we have around 101 columns, and we can see all the columns available in the database. What I usually do with that is select the columns for one specific table only, so we say where the table name equals, for example, the dimension customers. Let's query the whole thing, and with that we can see we have 10 columns inside this dimension, how the columns are ordered inside the table or view, and all the metadata information about each column. So as you can see, we are exploring the structure of our database, and this is really helpful to get an overview of the database and the project. Are we talking about 20 tables or hundreds of tables? We can quickly see the naming of the columns and the tables. This is really important to get a feeling for the project, and it sets the foundation for exploring the data inside those tables. All right friends, so with that we have done the first step: we have explored the database structure, and now we can start diving into the actual data.
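For reference, the two exploration queries from this step look roughly like this; the table name in the filter is my assumption of the project's naming, so adjust it to yours:

    -- list all tables and views in the current database
    SELECT * FROM INFORMATION_SCHEMA.TABLES;

    -- list all columns, or only the columns of one specific table
    SELECT *
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = 'dim_customers';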
The first thing that we can explore is the dimensions. Okay, so what are we going to do with the dimension exploration? All we have to do is identify the unique values of each dimension that we have inside our database. This helps us understand which categories, which countries, which product types we have in the data. And we have a very simple formula for that: all you need is the SQL keyword distinct together with any dimension in your data set, like distinct country, distinct category. If you check any column that is a dimension, you see a lot of values and a lot of repetition, but once you say distinct column, you get a list of all unique values, and with that you understand quickly: I have three different types, A, B and C. This also helps you understand the granularity of your dimension: does it have three values or a hundred? It is very simple, so let's go and analyze our dimensions. Okay, let's explore the dimension values inside our database, starting with the first table, the customers. If you check those columns, we have to find an interesting dimension, like for example the country. So we can go and explore all the countries our customers come from. It is very simple: select distinct, then our column, the dimension country, from our table customers. Let's execute it, and in the result we can see we have six countries. This is really nice for understanding the geographical spread: our business has customers from six different countries, Germany, United States, France, Canada and so on. So with that we have our first little insight about the business. Now let's jump to another table, the products. What we have to do is explore all the categories inside our business, the major divisions. We say select distinct category from our table products. Let's execute it. In the output you can see we have four categories: accessories, bikes, clothing and components. This gives us an overview of the product range, the major divisions inside our business. For the next one I'm digging deeper into this information: not only do I want to see the categories, I would also like to see the subcategories. I'm not starting a new query, because there is of course a relationship between the category and the subcategory. Let's execute it. Now you can see in the output that our categories are split into more specific groups; for the bikes, for example, we have mountain bikes, road bikes and so on. So the subcategories carry more detail about the products than the category. And in order to get the full picture, we now bring in the product name as well, so we get the big picture in one shot: the whole hierarchy of our products. Of course it is more interesting if you sort the data by those three columns, so let me execute it again. If you explore the data now, we have for example the category accessories with a subcategory called lights, and in this subcategory there are three different products. And if you scroll to the end of the table, you can see we have around 295 products. So the granularity of the product name is different from the category and the subcategory, and all three pieces of information are related to each other. After exploring those dimensions, we now have a better understanding of how the data is organized, and this helps with the analysis: if you aggregate by the category, you will get only four rows; if you aggregate by the products, you will get hundreds of rows. So this is how we explore the dimensions of our database. Okay, so with that we have a clear picture of the dimensions inside our data sets.
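On paper, the dimension exploration queries from this step look like this; I'm assuming the gold-layer names dim_customers and dim_products here:

    -- all unique countries our customers come from
    SELECT DISTINCT country
    FROM gold.dim_customers;

    -- the whole product hierarchy in one shot, sorted
    SELECT DISTINCT category, subcategory, product_name
    FROM gold.dim_products
    ORDER BY category, subcategory, product_name;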
Now in the next step we're going to deep dive into one special type of dimension: the dates. So what are we going to do with the date exploration? We're going to explore the boundaries of the dates that we have in the data sets. What is the earliest and the latest date in my data? We're going to understand the time span: does our business cover two years, or ten? And this is of course very important for the different types of time analysis we will do later. The formula is very simple: all we need is the min and max functions to get the earliest and the latest dates, and we apply them to date columns, the date dimensions. For example, min order date, max create date, min birth date, any date in your data set. If you look at any date column in your data, you will find multiple values; what is interesting is the earliest date, for example 2018, and the latest date, for example 2028, and with that we can understand, aha, we have a time span of ten years, using the date diff function. So now let's apply our new formula to our date columns. All right, let's search for date information inside our database, and usually you find a lot of it in the facts. So let's go to the fact sales, and here we have multiple dates: the order date, the shipping date and the due date. Let's explore the boundaries of the order date. We have the following task: find the date of the first and the last order. How are we going to do that? We say select, targeting the order date, from our table sales. If we execute it, we can see a lot of values. Now, in order to find the first date, we use the function min to get the minimum order date, and we call it first order date. Executing it, we can see the date of the first order: it is in December 2010. Now let's find the date of the last order: this time we take the max order date, and we call it last order date. Exploring the other boundary, we can see that January 2014 is the date of the last order in our system. So with that we have explored the boundaries of the order dates, the first and the last, and we can already see that we have about four years of sales in our business, but let's calculate it. The task says: how many years of sales are available? In order to find the years between those two dates, we have another SQL function called DATEDIFF, which subtracts two dates. This function needs three arguments: the first one specifies the date part, whether it is a year, month or day; then we start with the smallest date, the min order date; and the last argument is the latest, highest date, the max order date. We call it order range in years. Executing it, you can see in the output we have four years. Of course, if you want to check the months, you go over here, say month, and execute: between those two dates we have 37 months, and then we just rename the column accordingly. So with that we have explored the order dates. But what is even more interesting is to check the customers, where we have the birth date. What we can do is find the youngest and the oldest customer. We say select min of the birth date, which gives us the oldest birth date, and then the max birth date, which gives us the youngest, from our table customers. Let's explore that. We can see the birth date of the oldest customer (I hope he or she is still alive): it is more than 100 years back, and the youngest customer is around 40. So we don't have really young customers in our business. And if you don't want to see the birth dates but the age instead, it is very simple: you use DATEDIFF again, with the year as the date part, the min birth date, and the current date and time. For that we have a function called GETDATE, and we call the result oldest age. If you execute this one over here, you can see the age of the oldest customer: it is 109. You can do the same thing for the youngest: just replace the min with max, and here we have the youngest age; executing it, it is 39. So my friends, this is how we explore the boundaries of a date: by finding the first date, the last date, and the years between them, we get a much better understanding of the time span of our business, and that's going to help us later when making different types of complex analyses. So this is how we explore the dates.
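Put together, the date exploration of this step fits into two small queries; table and column names are again my assumption of the project's naming:

    -- boundaries and time span of the order dates
    SELECT
        MIN(order_date) AS first_order_date,
        MAX(order_date) AS last_order_date,
        DATEDIFF(YEAR,  MIN(order_date), MAX(order_date)) AS order_range_years,
        DATEDIFF(MONTH, MIN(order_date), MAX(order_date)) AS order_range_months
    FROM gold.fact_sales;

    -- oldest and youngest customer, as birthdates and as ages
    SELECT
        MIN(birthdate) AS oldest_birthdate,
        DATEDIFF(YEAR, MIN(birthdate), GETDATE()) AS oldest_age,
        MAX(birthdate) AS youngest_birthdate,
        DATEDIFF(YEAR, MAX(birthdate), GETDATE()) AS youngest_age
    FROM gold.dim_customers;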
All right. So with that we have a clear picture of the scope of our project and the date range inside our data sets. In the next step, we're going to explore the second type of data: the measures. So what exactly is exploring the measures? We're going to calculate and find the key metrics of our business, the big numbers, the highest level of aggregation of our data. The formula is very simple: we use the aggregate functions in SQL, like sum, average and count, on any measure inside our data sets. For example, we find the total sales by summing the sales value, we find the average price, we find the sum of the quantity in order to have one big number for all sold items. So it is always an aggregate function together with a measure. For example, if you have a column with a lot of values and you sum all of them, you get, say, 240. This is a key metric, the highest level of aggregation, and the value is not split at all; we say, for example, this is the total revenue of our business. And this is exactly what we mean by exploring the measures: we get those big numbers. So now let's apply those aggregate functions to the measures in our data set. Okay, now we're going to put the spotlight on the big numbers that matter the most to our business. Based on those three tables, I have collected the following questions; let's solve them one by one. The first one: find the total sales. We summarize, using the sum function, the sales amount as total sales from our table fact sales. Let's execute it. This is the total amount of sales in our business, around 29 million; this is the business's total revenue. Now to the second one: show how many items are sold. This time we need another column, but from the same fact sales table. The question is how many items, which means we want the quantity, and we stay with the same function, summing all the values of the quantity and calling it total quantity. Let's explore that: our business sold around 60,000 items, and those 60,000 items generated the roughly 29 million in revenue. Let's keep going. The next question: find the average selling price. We are targeting the same table, and here we have the price information, so we take the price, and this time the aggregate function is the average, which we call average price. Executing it, the average price in our business is 486, which means our business is selling rather expensive items.
Now let's go to the next question: find the total number of orders. For that we use the function count, and we can count the order numbers: count order number as total orders. Executing it, it says we have 60,000 orders. Now, whenever I work with the count function, I try to count the same thing again using a distinct, so distinct order number. What I'm trying to do here is first eliminate any duplicates in the order number and then count, because I don't want to count the same order twice inside our sales. Executing that, as you can see, we have only 27,000 orders out of 60,000, which means the same order repeats in our database. Let's actually have a look: select star from our table. As you can see with the first order over here, the same order number is repeated three times, because this customer ordered three things in the same order. So what is the definition of an order? Usually the whole thing counts as one order, and that's why, in order to get an accurate number of orders, you have to use a distinct to first eliminate all duplicates and then count how many orders we have. In this scenario, I'm going to say our business has around 27,000 orders. This is why the count function can be a little tricky: always compare the numbers before and after using distinct. Let's keep going to the next one: find the total number of products. Very simple: select count of the product key as total products from the table gold products. Executing it, we have 295, and if you make it distinct, just to check, you get the same number, so there are no duplicates. Of course you can also count the product name instead; the names of the products are unique, so we get the same number again. Let's continue: find the total number of customers. Same thing: select count, and you can go with the customer key, from the dimension customers, calling it total customers. Executing it, we can see we have 18,000 registered customers in our system. The next one says: find the total number of customers that have placed an order. Having a customer inside our database doesn't mean that this customer has already placed an order; maybe we have customers that just registered and never ordered anything. So we take the same query, but instead of targeting the customers table, we target our fact, the sales. Executing it, we are getting around 60,000, which makes no sense, because one customer might order multiple things. So we say distinct and query again. Now it is correct: we are getting around 18,000 customers. If we compare the two numbers, we see they match, which means all our registered customers have already placed an order. So it is very simple: we are just using aggregate functions, and with that we are getting those key values.
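As a sketch, the count comparison we just walked through looks like this, with the same assumed table names:

    -- raw count vs. distinct count: always compare the two
    SELECT
        COUNT(order_number)          AS total_rows,
        COUNT(DISTINCT order_number) AS total_orders
    FROM gold.fact_sales;

    -- customers that actually placed an order
    SELECT COUNT(DISTINCT customer_key) AS customers_with_orders
    FROM gold.fact_sales;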
But what I usually do is collect all those measures in one query, in order to have an overview of all the key numbers of our business; instead of querying each one of them individually, I combine them in one go. So now we're going to generate a report that shows all the key metrics of our business. How I usually do it: I take the first query, for the total sales, and put it over here, and I build only two columns, the first one the name of the measure and the second one the value of the measure. Let me show you what I mean. This one over here I will not call total sales; I'm going to make it generic and say measure value. And before it we make another column from a static string value, 'Total Sales', and we call it measure name. Let's execute just this one: the measure name is Total Sales, so it is not a column name anymore, it is a value in the output, and the measure value is around 29 million. Now I'm going to add another measure as a second row, and in order to do that we use UNION ALL: copy the whole thing over here, say 'Total Quantity', and change the measure to the quantity. Now let's select both of them and query, and as you can see we have the two big numbers in one query, the total sales and the total quantity. So we can go and collect all those big numbers and measures and put them in one query: the average price, the total number of orders, products, customers. And you can target different tables, because SQL only cares here that the number of columns and the data types of the columns match. Let's query this, and now, in a single query, we can see the big numbers, the key metrics of our business: the total sales, total quantity, average price and so on. This is a super report that you can generate for any business, where you get the full big picture in one go. This is what I generally do when I'm exploring a new database: I put all those big numbers and measures in one query to get a better understanding of the business.
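So the finished key-metrics report follows this UNION ALL pattern; here is a sketch with the same assumed names:

    SELECT 'Total Sales' AS measure_name, SUM(sales_amount) AS measure_value FROM gold.fact_sales
    UNION ALL
    SELECT 'Total Quantity',  SUM(quantity)                FROM gold.fact_sales
    UNION ALL
    SELECT 'Average Price',   AVG(price)                   FROM gold.fact_sales
    UNION ALL
    SELECT 'Total Orders',    COUNT(DISTINCT order_number) FROM gold.fact_sales
    UNION ALL
    SELECT 'Total Products',  COUNT(product_key)           FROM gold.dim_products
    UNION ALL
    SELECT 'Total Customers', COUNT(customer_key)          FROM gold.dim_customers;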
All right my friends, so with that we now have a clear understanding of the dimensions as well as the measures of our data sets. In the next step we're going to start combining things together in order to generate insights, and we'll focus on a very basic analysis: the magnitude analysis. So what exactly is a magnitude analysis? It's all about comparing the measure values across different categories and dimensions, and this helps us understand the importance of the different categories. The formula is interesting, because this time we are mixing things together: first we aggregate a specific measure, and then we say 'by dimension'. We need the dimension here in order to split the measure. It sounds complicated, but it is very simple and basic. For example, we can say the total sales by country, the total quantity by category, the average price by product, the total orders by customer. If you follow this formula, you can generate an endless amount of insights by just combining any measure with any dimension; each combination is a new insight. It looks like this: if you have one measure, for example 600, and you put this measure together with a dimension, the 600 gets split by the dimension values: A gets 200, B gets 300 and C gets 100. And with that we can compare those categories: we see that category B has the highest value and C the lowest, so we can tell what the best and the worst categories are. This is very basic analysis, so let's apply this formula to our data sets. Okay, let's break all our measures down by dimensions. Here I have prepared a few interesting examples. First we're going to break the total number of customers, which as we learned is 18,000, down by the countries. So the measure is total customers and the dimension is the country. Let's write the query: we select, first, the dimension, the country, and then the measure, the count of the customer key, which gives us the total customers. We select from our table, the dimension customers, and of course we have to group the data by the country: group by country. Executing it, you see the list of countries again, our six countries, and the total customers for each country, so we can see the distribution of customers by country. What we usually do is sort the data by the measure, the total customers, descending, so we get the countries with the most customers first. Executing it, we can see in the results that the highest number of customers comes from the United States, then Australia and the United Kingdom, and there are 337 customers where the country information is not available. So that's it, very simple: we have split the total number of customers by a dimension, the country. Of course we can split the data by a different dimension: the next one says find the total customers by gender. It is the same measure, the total customers, but we split the data by a different dimension. Just copy and paste, and instead of country we switch to gender, here and in the group by, and that's it. Executing it, you can see the granularity of the gender is different from the countries: we have only three values here, and the customers are split almost evenly between male and female. This helps us understand the demography of our customers, and as you can see it was very simple, we just switched the dimension; you could also split by the marital status, and so on. Now let's split the total products by the category. The query is very simple as well: select, the same aggregate function, the count of the product key as total products, from our table gold dim products, then group by the dimension, the category, and order by the same measure, total products, descending, from the highest to the lowest. Executing it, we can see how many products we have in each category: the biggest category is the components, and after that the bikes. And interestingly, we have seven products with nulls, which don't belong to any category. This is really nice.
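Here is a sketch of the single-table magnitude pattern we just used three times; names are assumptions of the project's naming:

    -- total customers by country, biggest first
    SELECT
        country,
        COUNT(customer_key) AS total_customers
    FROM gold.dim_customers
    GROUP BY country
    ORDER BY total_customers DESC;
    -- swap country for gender, marital status, etc. to get the other reports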
Let's go to the next one. What do we have over here? What is the average cost in each category? This is a different style of question, but in the end it is the same thing: the average cost is the measure and the category is our dimension. It's like saying: find the average cost by category. So we copy the same query; the dimension stays the same, the category, but the measure is different. Instead of the total products we say average of the column cost, and rename it average costs; and for the order by we use the new measure as well. Executing it, we can see the most expensive category is the bikes; it costs a lot compared to the accessories, of course. The accessories are at only 13 and the bikes at 900. This gives us insight into how expensive each category is, and as you can see it is always the same template: we split a specific measure by a dimension. Let's keep going to the next one: what is the total revenue generated by each category? Again, the question is really: find the total revenue by category. The total revenue is the measure and the category is the dimension. But now the total revenue comes from the fact, and the category comes from the dimension table, which means we have to join tables, right? How are we going to do it? We start with select star from, and I always like to start from the fact table, so fact sales f, and then we join it with the dimension. I usually go with a left join in order not to lose anything, because with an inner join you might lose a few orders and sales from the fact, and I don't want that. So left join with the dimension, in this case the products, and the join key is very simple: the product key on both sides. With that we have joined the fact table with the dimension. Now we pick what we need: from the fact we need the sales, the sales amount, and from the products we need the category, and we group the data by the category. So this part is done; what is missing is the aggregation. We are aggregating the sales, so sum of sales, and we call it total revenue, and we order the data by the total revenue, our measure, descending, from the highest to the lowest. As you can see, it is exactly like the previous one, but here the data doesn't come from only one table, it comes from two: the measure comes from the fact and the dimension comes from the dimension products. And this is classic, right? The dimension has all those descriptions and details about the products, like the categories, and the fact table has all those measures and dates that we use to calculate our metrics. So that's it; let's execute it. As you can see in the output, the category bikes brings in most of the revenue, around 28 million of sales, while the accessories and the clothing are not really bringing in a lot, both of them below 1 million. So with that you can understand that our business is making a lot of money selling bikes, right?
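And when the measure and the dimension live in different tables, the pattern gains a join; a sketch with the same assumed names:

    -- total revenue by category: measure from the fact, dimension from the products
    SELECT
        p.category,
        SUM(f.sales_amount) AS total_revenue
    FROM gold.fact_sales f
    LEFT JOIN gold.dim_products p
        ON f.product_key = p.product_key
    GROUP BY p.category
    ORDER BY total_revenue DESC;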
So my friends, as we are exploring the data, we are understanding more and more about our business, right? Let's keep going to the next one. We have here the question: what is the total revenue generated by each customer? So now we want to find the top spenders. Select star, and again we start from the fact table, and this time we left join it with the customers, the dimension customers, using the customer key for the join. And what do we need? Let's get the customer key, and as well the first name and the last name, a few details about the customer. Those are the columns that we want from the customers. Then we need the aggregation, which is the same thing: sum of the sales amount as total revenue, and we have to group the data by all three of those columns, so copy and paste them into the group by. At the end, as usual, we order by the measure, total revenue, descending. That's it; it is exactly like the previous one, but with different dimensions. Let's query it. Now we get a full list of all our customers, all 18,000 of them, and we can see the total revenue for each customer. We can see Nicole and Caitlyn are our top spenders, the most loyal customers that generated sales and revenue for our business. This is really cool, right? Now let's go to the next one: what is the distribution of sold items across countries? That is like finding the total quantity by country. Very simple: I take the same query, because the countries come from the dimension customers and the sold items, the quantity, come from the sales. We are doing the same join, but with a different dimension and measure. All we need from the customers is the country, and the measure is the sum of the quantity, which we call total sold items; we change the group by to the country and sort the data by the new measure. That's it: we are generating new reports by just changing the dimensions and measures, and again this is very interesting for understanding which country is generating good business for us. Now, my friends, as you might have already noticed: if a dimension has a small number of unique values, like the countries, where we have only seven values, or the gender, where we have only three, we call those low-cardinality dimensions, because there is a low number of distinct values inside them, and in the result we get only, for example, seven rows. But if our dimension is high-cardinality, like the customers, where we have 18,000 unique customers, then our measure gets split across those 18,000, and in the result we get exactly that number of rows. So the number of rows in the result really depends on the cardinality of the dimension. As you can see, we can generate a lot of different reports by just following this formula, dividing a measure by a dimension. We just generated eight different insights and reports with only a few measures and dimensions. So now you can pause the video and try different dimensions and measures in order to get more insights about our business. Okay, so as you can see, this is the basic analysis that we can do in any data set, in any domain: aggregating a measure by a dimension. Now, in the next and last step in our project, we will be doing ranking analysis. Okay.
So what is ranking analysis? It is very basic: we order the values of a dimension based on a measure, in order to identify the top performers and the bottom performers. The formula is the following: this time we rank the dimensions by an aggregated measure. For example, we rank the countries by the total sales, or we find the top five products by the sold items, the quantity, or the bottom three customers by total orders. It's like the magnitude analysis: we get an ordered list of dimension values, for example from the highest to the lowest, in order to quickly identify the top performers. And of course we can filter the data by saying: I would like to have only the top two categories; with that you remove all the other dimension values that are not in the top two. In SQL we can use the keyword TOP for that, or we can use the ranking window functions like RANK, DENSE_RANK, ROW_NUMBER and so on. So let's apply our formula and rank our data set. Okay, let's check our data. We start with the first question: which five products generate the highest revenue? We are searching for the best performing products in our business. Of course, the first question is: what are the dimension and the measure in this question? Well, the revenue means we need the sales from the fact, and the products means we need the dimension products. Writing this query is very simple, and we can use the group by again; I will not write it from scratch, I'll just take the query over here where we aggregated the total sales by the category. All I have to do is change the dimension: instead of the category we take the product name, and we aggregate the data by the product name, because we need the top five products, right? The revenue is the sales amount, so with that almost everything is ready. Executing it, we can see a list of all products in our business together with the total revenue. But the task says we need the top five; we don't need all the products from our database, we have to select only that subset. In SQL Server this is very simple: we go over here and say TOP 5, and SQL will return only the first five rows of the result. Executing it, we now have only the five products with the highest sales in the results. And that's it, we have solved the task, and we can see the top five products, and all of them are bikes. Now let's check the other side: we want to find the five worst performing products by the same measure, the sales. This is very simple: we take the same query, and now we sort the data from the lowest to the highest. So instead of descending, we remove the descending keyword, and SQL will use ascending. Executing it, we are getting the five worst performing products, just by sorting the data differently. So it is very simple, right? And with that we can see our five best sellers and the five worst sellers.
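The TOP version of that ranking, as a sketch with assumed names:

    -- five best performing products by revenue
    SELECT TOP 5
        p.product_name,
        SUM(f.sales_amount) AS total_revenue
    FROM gold.fact_sales f
    LEFT JOIN gold.dim_products p
        ON f.product_key = p.product_key
    GROUP BY p.product_name
    ORDER BY total_revenue DESC;
    -- for the five worst performers, order by total_revenue ASC instead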
And now we can just change the dimension and generate different reports: instead of the product name, let's check the subcategories, what the best subcategories in our data are. I just change the dimension and query, and with that we see the best subcategories in our business, and the same thing works if you want to check the worst performing subcategories. So generating reports is very simple. Now, my friends, in SQL there are two ways to create rankings. We have the simple one, where we use the GROUP BY clause together with the keyword TOP. But if you are generating reports where things are more complex and you need more flexibility, you should use the window functions. Let me show you how to solve this task using a window function. I take almost the same query, put it over here, and get rid of the TOP 5; we are still speaking about the product name, as well as the group by. But now we're going to generate a rank, and for that we can use, for example, ROW_NUMBER. In SQL there are different types of window functions for ranking, among them ROW_NUMBER and RANK. So we say ROW_NUMBER, then OVER, and then we sort the data, like we did in the previous version: we sort by the total revenue, which is the sum of sales, descending, and we call this column rank products. Executing it, you can see we have created a new column with a rank: each product gets one rank, down to the last product at 130. Now what we are interested in is selecting the top five, and for that we need a second step. That's why we use a subquery: we say select star from, and then we put the whole thing in a subquery, something like that. And all you have to do is use the new column we created in order to filter the data: where rank products is smaller than or equal to five. With that we should get only the top five products. Executing it, as you can see, we are getting the same results. Now of course, the window function is more complicated than the first way, but it gives us more flexibility for selecting more columns, adding different types of aggregations and details to the query, and we can use different ranking functions that handle ties differently. So if the task is very simple, like this one, I go with the simple GROUP BY; but if I'm generating complex reports, I go with the window function.
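The window-function version we just built looks roughly like this; same assumed names:

    -- the same top five, but with a ranking window function and a subquery
    SELECT *
    FROM (
        SELECT
            p.product_name,
            SUM(f.sales_amount) AS total_revenue,
            ROW_NUMBER() OVER (ORDER BY SUM(f.sales_amount) DESC) AS rank_products
        FROM gold.fact_sales f
        LEFT JOIN gold.dim_products p
            ON f.product_key = p.product_key
        GROUP BY p.product_name
    ) t
    WHERE rank_products <= 5;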
So now what you can do is rank the data by different dimensions and measures. For example: find the top 10 customers who have generated the highest revenue, and as well, find the three customers with the fewest orders placed. Again, we can reuse the previous queries we generated. One query gives the customers and their total sales, and all you have to do is say TOP 10 and rerun the query, and with that we are getting the top 10 customers. For the lowest three customers, all we have to do is replace the measure: we count the unique number of orders, so we say count distinct order number as total orders, change the order by from descending to ascending, and take the top three. Executing it, we can see the three customers that ordered only once; they are the three customers with the fewest orders. So as you can see, by just switching the dimensions and measures, we are generating completely new, important insights, and as we explore the data we are learning what the best products are and who the top customers are, which is usually very important for reporting. All right my friends, so with that we have covered the last step in our project, how to rank our data, and with that we have covered all the steps of the project road map. We have done a lot of exploration of the database, the dimensions, the measures, and we have combined dimensions and measures to do magnitude and ranking analysis. Okay, my friends, so that's all about the EDA project. In the next one, we will do the last type of project: advanced data analytics. So let's go. The type of project we're going to cover now is the advanced analytics project using SQL, where we write complex SQL queries to answer real business questions. We're going to use the advanced window functions, CTEs and subqueries, and we're going to script two big queries in order to generate two reports. With this type of project, you will learn how to solve real business questions using advanced techniques. All right, for this project we have a road map as well, where we progress through different steps and analyses: change over time, cumulative analysis, performance analysis, data segmentation and, at the end, reporting, all using SQL. So let's start with the first step in the road map: analyzing the change over time. Let's go. Okay, so what is change over time? It is a technique for analyzing how a measure evolves over time, and it is very important for tracking trends and identifying seasonality in your data. The formula is very simple: we aggregate a measure, but this time by a date dimension. For example, the total sales by year, the average cost by month. If you combine any aggregated measure with a date column or dimension, what you are doing is analyzing the change over time. For example, we break our measure down by the years, and with that we can immediately track how our business is doing over time. We can see here that the best year was 2024, then we have a really hard decline in our business in 2025, and then it goes up slightly in 2026. So with that we can quickly analyze the trends of our business. Now let's check the trends and changes over time in our own data. In order to do this kind of analysis, we usually target the fact table, because that is where we have our measures as well as the dates: the order date, shipping date and due date. What we can do is analyze the sales performance over time. As we learned, all we need is a metric and a date. Let's select, for example, the order date, and as well one of those measures, the sales amount, from our fact table. Let's query it, and we can order the data by the order date, ascending.
Executing it, as you can see, we have nulls in our data. What can we do? We can filter that data out, we don't need it: where order date is not null. Executing it again, all right, now we don't have those orders. So we now have sales over time: a date and a measure, and this looks really good. But next we aggregate the data on the sales amount: we say sum, call it total sales, and then we group the data by the order date. Executing it, we now have the total sales for each day. So the granularity of our data is the day, and we are of course already analyzing the sales over time, but usually we don't aggregate the data at the day level, we want higher aggregations, for example the years. In order to change the date here from a day to a year, we have to use date functions, and there are a lot of them for extracting a date part. To get just the year we have a quick function called YEAR, which converts our date to a year; let's call it order year. Of course we also have to group the data by the year, and sort by the year as well. Executing it, we are now at the year level, and we have only five years, which means we changed the aggregation from the day to the year, and it is now very easy to analyze the performance of our business over the years: the first year was the lowest, 2013 is the best year in our business, and then it declines massively in 2014. And of course we can add more measures, not only the total sales. For example, let's calculate the total number of customers: count distinct customer key as total customers. Executing it, we can check whether we are gaining customers over time and whether there are any trends we can see. And we can keep extending it, adding, for example, the total quantity: sum of quantity as total quantity. Executing it, we now have a really nice picture to understand whether the revenue is increasing or decreasing over time, what the best and the worst years are, and whether we are gaining customers over time. Looking at the result, this gives us a high-level, long-term view of the data, which of course helps with strategic decisions. Now we can drill down to the months: we aggregate the data by the month, regardless of the year, in order to get an idea of how each month performs on average. All we have to do is switch the function from YEAR to MONTH, like this, and the same in the group by and the order by. Executing it, we get all the months in the output, and guess which month is the best for sales? It is of course December, because of Christmas and everything, and the worst month, as you can see, is February. With that we are understanding the seasonality of our business and its trend patterns. And since we are not including the year in this analysis, we are aggregating the data across all years. Now we can make it more specific for each year, by adding the year information to our query.
So we can have both the year and the month; let me just change this one to month, and of course add it to the group by and the order by. Executing it, we are now aggregating the data by month of a specific year. We have all the months of all years, and if you want to focus on only one year, you can filter the data by the order year, and with that you can see how the data evolves over time. Now, in SQL we can also format the date differently. Instead of having the year and the month in separate columns, we can use the DATETRUNC function: so over here we say DATETRUNC, and if you want the granularity of your date at the month level, we say month, and then the date, and with that you get both the year and the month together; let's call it order date again. Executing it, the output is exactly the same result as before, but instead of two columns for the year and the month, we have everything in one. And because we said month, it removes the days: as you can see, it always starts with one, the first day of the month, and you get one row for each month of each year. If you want to change that quickly to a year, you just change the date part to year, and you get the granularity of the year. Now, if you don't like this format and would like your own specific format, you can use the FORMAT function: the first argument is the date, and then you define the format you want. For example, it starts with the year, and let's say I would like the abbreviation of the month name, something like this, and of course group by and order by accordingly. Executing it, we get our format: the year, a dash, and then the month abbreviation. But you have to be careful about which function you use, because with FORMAT you get a string in the output, and as you can see it cannot be sorted correctly: the data here is sorted by the year, but not by the month. If you use DATETRUNC, the data is sorted correctly; if we switch it to month, everything is still sorted correctly, because the output is a date, and SQL sorts a date correctly, unlike a string. And if you use YEAR and MONTH, the output is an integer, and sorting integers is not a problem either. So you can pick the one that you like. That's it; and now you can keep analyzing by picking another date from our data set and another measure. As you can see, it is very simple. Okay, so that's all about how to analyze the trends and the change over time.
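A sketch of the change-over-time query we ended up with, using the assumed gold-layer names; note that DATETRUNC requires SQL Server 2022 or later, so on older versions fall back to YEAR() and MONTH() or FORMAT() as discussed above:

    -- sales performance aggregated to the month level
    SELECT
        DATETRUNC(MONTH, order_date) AS order_date,
        SUM(sales_amount)            AS total_sales,
        COUNT(DISTINCT customer_key) AS total_customers,
        SUM(quantity)                AS total_quantity
    FROM gold.fact_sales
    WHERE order_date IS NOT NULL
    GROUP BY DATETRUNC(MONTH, order_date)
    ORDER BY DATETRUNC(MONTH, order_date);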
Now in the next step, we're going to do a more advanced kind of aggregation: the cumulative analysis. Okay, so what is cumulative analysis? It is aggregating the data progressively over time, and it is a very important technique for understanding how our business is progressing over time, whether it is growing or declining. The formula is very similar to the change over time, but instead of a simple aggregation on the measure, we aggregate our measure cumulatively. We are adding things on top of each other, and the data is again split by the date dimension, because we want to track the progress over time. For example, we can find the running total of sales, or the moving average of sales, by month. Let's take again our simple example, where the sales are split by the years. That is the classic change over time. Now, to make it cumulative, what happens? We take the measure and add to it: for 2024 we have 300, and for 2025 we add the 300 together with the 100, so cumulatively 2025 gets 400. The same for 2026: we add the 400 together with the 200, and we get 600. As you can see, we keep adding the values in order to generate a cumulative value. For this type of analysis we use the aggregate window functions in SQL to calculate the cumulative values. So now let's apply our formula and find out whether our business is growing or declining. Let's go. Okay, we have to analyze the following: calculate the total sales for each month, and as well the running total of sales over time, in order to analyze the trend. Let's see how we're going to do that, starting with the easy part, calculating the total sales for each month. This is a change over time, and we have already done that: all we need is a date and a measure. Our date is the order date and the measure is the sales amount, from our fact table. Let's query this. We want the total sales for each month, which means we change the granularity of the order date from a day to a month, and I like using DATETRUNC for this kind of task, with the granularity month; this is our order date. For the sales we use the aggregate function: sum of sales as total sales, and of course we group the data by the date. Executing it, we have the total sales for each month. Don't forget to get rid of the nulls: where order date is not null. Now it looks better, no nulls, and we can order the data by our date. Now our measure is aggregated for each month individually, right? But we don't want that, we want a running total, a cumulative metric. For that we have to use a window function, and we'll use a subquery, just to keep it simple. What do we need? We need the order date and the total sales, and here we have our window function; the rest goes into a subquery. We can also get rid of the order by, because our data will be sorted by the window function anyway. Let's start writing the window function: we take the sum of total sales, because we want to sum those new values, and we build the window like this: OVER, and we don't have to partition anything, so we go straight to order by our new order date, ascending. That's actually it: as running total sales. Let's try it out. If you look at the result, you can see that all those values are cumulative, and it works like this.
The first running total is equal to the first total sales, because there is nothing before it. For the next row, SQL adds this value to the previous one, and with that we get the running total. Moving on to the third row, it adds all three values together, which gives us the running total for this month, and so on. So as SQL moves through the window, it is always adding the current value to all the previous values, and this is because of the default frame of the window: the frame goes from UNBOUNDED PRECEDING to the CURRENT ROW. That means, for example, if we are at this row over here, the current total sales is this month's value, and the unbounded preceding part is all the values before this month. So we are taking all the previous values together with the current value, and with that we get the effect of a running total. Now, as you can see, it runs through all the years. We can also limit the running total to only one year, so that for each new year it resets and starts from scratch. That means we are partitioning the data: for each year we would like to have a partition. For the first year, 2010, it is one row, and for 2011 we get the whole partition over here. Partitioning our window is very simple: we say PARTITION BY the order year. That's it; let's execute. Now let's check the first partition, for 2010. You can see the running total is the same as the first month, and since we have only one month, that's it for this year. Now, as we go to the next year, you can see it resets: the running total of January 2011 is exactly January's sales. It is not adding the current value to the previous one, because the previous one is outside of the window. So we are getting a running total for the whole year, and once we hit a new year, it resets. It is working, and this is how you create cumulative values in SQL. And of course, if you would like to change the granularity of the data, it's very simple: all you have to do is go over here and change the month to a year, and don't forget to change the GROUP BY as well. So let's execute, and with that we are creating cumulative values for each year. But of course it now makes no sense to partition by the year; let's remove that and execute again, and with that you are creating the running total of sales, the cumulative metric, over the years. So as you can see, it is very simple. Now we can add another measure and another aggregation: for example, instead of the running total we can find a moving average. So let's get the moving average of the price. First we calculate the average of the price, as average price, and then we add another window function over here where we take the average of that average price, and we call it moving average. That's it; let's execute. And with that you are getting the moving average price of our sales.
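As a sketch, the finished queries look roughly like this; the table and column names are the same assumptions as before, and the price column on the fact table is also an assumption:

-- running total of monthly sales, plus a cumulative (moving) average of the monthly price
SELECT
    order_date,
    total_sales,
    SUM(total_sales) OVER (ORDER BY order_date) AS running_total_sales,
    AVG(avg_price)   OVER (ORDER BY order_date) AS moving_average_price
FROM (
    SELECT
        DATETRUNC(month, order_date) AS order_date,
        SUM(sales_amount)            AS total_sales,
        AVG(price)                   AS avg_price
    FROM gold.fact_sales
    WHERE order_date IS NOT NULL
    GROUP BY DATETRUNC(month, order_date)
) t;

-- the same running total, but reset at the start of every year
SELECT
    order_date,
    total_sales,
    SUM(total_sales) OVER (PARTITION BY YEAR(order_date) ORDER BY order_date) AS running_total_sales
FROM (
    SELECT
        DATETRUNC(month, order_date) AS order_date,
        SUM(sales_amount)            AS total_sales
    FROM gold.fact_sales
    WHERE order_date IS NOT NULL
    GROUP BY DATETRUNC(month, order_date)
) t;

Since the subquery already collapses the data to one row per month, the window functions run over the monthly totals rather than the individual order lines.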
All right, so now you might still be asking: what is really the difference between a normal aggregation and a cumulative aggregation? Well, we usually use normal aggregations in order to check the performance of each individual row. If I want to see how each year is performing, I do a normal aggregation. But if you want to see a progression, if you want to understand how your business is growing, you use cumulative aggregations, because there you can easily see the progress of your business over the years. So there is a real difference between a cumulative value and a normal aggregation. All right, so with that you are done with cumulative analysis, and you have learned the different types of aggregations. The next step in our road map is performance analysis. Okay, so what is performance analysis? It is the process of comparing the current value with a target value in order to evaluate the performance of a specific category, and this helps us measure success and compare performance. The formula is very simple: we find the difference between the current measure and the target measure by subtracting them. For example, we can compare the current sales with the average sales, or the current year's sales with the previous year's sales, or the current sales with the lowest sales, or maybe the highest sales. So as you can see, we are always comparing the current measure with a target, with something else. For example, here we again have a measure that is split by three categories; those values are the current values. Now say we have a target, for example the average, and as you can see, for each row we have the 200. Once we have those two things in one row, we can simply subtract them. For category A, the current value is exactly equal to the average, both are 200, and the difference between them is zero: this product is performing at the average. For the next one we have 300 and the target is 200, so the difference is 100; that means this category is performing very well, it is a good performer. For the last one we get minus 100, so it is below the average and not performing well. For this type of analysis we usually use window functions: the aggregate window functions SUM, AVG, MAX, and MIN, or the value window functions LEAD and LAG. So now let's go back to SQL and apply this formula in order to measure the performance of our business. Let's go. All right my friends, so now we have the following task: analyze the yearly performance of products by comparing each product's sales to both the average sales performance of the product and the previous year's sales. Okay, this sounds a little bit complicated and serious; let's have some coffee before we start. Okay, so what do we have over here? It is talking about the yearly performance of products, so we need the order date as a dimension, as well as the product, and the measure used here is the sales. Let's do it step by step. We need things from our fact table, fact sales, and we need the product, which I'm going to get from the products dimension in order to have a nice name. So we have to join the data on the product key, and I'm going to change the alias to P. Okay, so with that we have our two tables. Now let's select our columns: we need the order date, we need the product name, and we need our measure, the sales amount. All right, so now let's query that information. Now, we have to analyze the yearly performance; that means we don't need the day.
The granularity is the year, so let's convert it using the YEAR function, and we call it order year. And of course we then have to aggregate the sales; I'm going to call it current sales. And we have to group the data by the year and as well by the product name. So that's it; let's execute. And of course I'm going to get rid of all those NULLs: WHERE order date IS NOT NULL. All right, so with that we have solved the first part: we have the yearly performance of the products. Now, per the task, we have to compare this value, the current sales, to the average sales performance of the products. So that means we need the average, and as well the previous year's sales, which means we have to compare each value to the previous year for the same product, of course. So things are getting a little bit more complicated, and we need the help of the window functions. Let's do it one by one, and focus first on the average sales. So, based on these results, we will do new calculations and aggregations, and in order to do that, we either use a subquery or a CTE. I'm going to go with a CTE because it looks nicer. So: WITH yearly product sales; this is the new name that we are giving to this result, and now we build queries on top of it. First of all, I will just select everything from yearly product sales, just to test. It is working; I'm now selecting data from our CTE. So now, as the next step, I'm going to list all the columns that I want in my result: the order year, the product name, the current sales. This is just nicer in order to have control over which columns you present in the final result. As the next step, I'm going to order the data first by the product name and then by the order year, and with that we can understand the results better. So we can see this product has three years of sales, and these are the current sales for each year. Now we have to calculate the average of those three values. In order to do that, we use the AVG of current sales, OVER, and we have to decide now how to partition the data. Since we are focusing on the products, we partition the results by the product name: PARTITION BY product name. And we don't have to sort the data, because we are using the average, so it doesn't matter how the data is sorted. Let's call it average sales, and let's execute. Now if you look at the results for this product, the average of all those three values is 13,000, and as you can see, for each row we have the current sales side by side with the average sales, and the same for the next product as well. So now, since we have both pieces of information on the same row, the current sales and the average, we can calculate the change, the difference, between the current value and the average value. All we have to do is subtract, right? So we say the current sales minus the average sales, and we call it the difference from average. Let's execute. And now, as you can see, we are getting the comparison:
we have the difference between the current and the average. And of course, what I like to do is add a flag, an indicator, showing whether we are above the average, below the average, or at the average. So in order to do that, we use the CASE WHEN statement: if the difference is higher than zero, then we are above the average, so 'Above Avg' (let's use an abbreviation for that); if we are below zero, that means we are below the average, so 'Below Avg'; and if it is exactly zero, ELSE, then it is 'Avg'. So that's it; let's END it, and I'm going to call it average change. So let's execute. Now if you focus again on one of the products, you can see the current sales of this product in 2012 are below the average; they are really low. And for the next year, 2013, it is above the average; it was a really nice year for this product. And the last year, 2014, it was again below the average. So with that we have a really nice flag in order to quickly see whether we are above or below the average, and it is interesting to see whether we have zeros. And yes, sometimes it is exactly the average: here we have a zero, not below or above. So with that we are comparing the sales performance of each product with the average, and as you can see, it is really simple using the window functions. So let's go and check our task again. We have compared the current sales to the average sales performance; now we have to compare them as well with the previous year's sales. So let's go back to our example over here. This time we have to compare the current sales not with the average but with the previous year. We don't have to write another CTE or query; we can continue with the same result. All we have to do now is access the previous year, and in order to do that, we have an amazing window function called LAG. So let's do it step by step. We're going to create a new column using LAG. What do I want to access the previous value of? The current sales, right? So: LAG of current sales, and OVER; we still have to partition the data by the product name, because we focus on the products: PARTITION BY product name. But now, in order to access the previous value, we have to sort the data, and we sort it by the year, because we need the previous year: ORDER BY order year, ascending, from the lowest to the highest. So we leave it like this, and with that, this window function gives us the previous year's sales of the product. I'm just going to call it previous year sales. And I think we have something wrong here; okay, fixed. So let's execute, and let's focus on one of those products. Now, for the first year of this product, the previous year is NULL, right? We don't have any data from the previous year. But for 2013, we have a previous year of 2012, and that's why we now get the previous sales value based on the year. And the same thing for the last year over here: you can see we are getting the previous sales. So it is working. And in the next window, same thing: for the first year we get NULL, and after that the previous sales come from the previous year. So with that we now have the previous year's sales, and if you check this row over here, we now have in the same row the current sales of the current year and as well the sales of the previous year.
Now we have to do the same thing: subtract those two values in order to compare them, right? So we take the current sales minus the whole window function, and we call it the difference from the previous year, and with that we are calculating the difference between them. So for this product in this year, as you can see, the difference between the current sales and the previous year is really big. Now of course we can again add a flag, an indicator. I'm going to copy the whole thing from the previous average flag, but we have to swap in the right function here, and the same over there, and now it is not above or below the average: it is increasing or decreasing, so 'Increase' or 'Decrease', and we call it previous year change, and instead of average we can say 'No Change'. So let's execute; ah, I have an extra comma here; let's execute again. So again, let's focus on one of those products. For the first year of this product, there is no change, because there is no previous year. For the next year of this product, we have an increase, right? Because the current sales are way higher than the previous year. And going to the last year of this product, we have a decrease, because the current sales are less than the previous year. So my friends, we call this type of analysis year-over-year analysis. And if you want to calculate a month-over-month analysis, it's very simple: all you have to do is change the function from year to month, and with that you are extracting the month part. And the difference between analyzing months and years is of course the scope: year-over-year is good for long-term trend analysis, while month-over-month is short-term trend analysis, where you are focusing on the seasonality of your data. So this is how we analyze the performance of our business, by comparing the current measure with a target measure, and you can use different dimensions and measures: instead of the sales you can check the quantity, instead of products you can check the customers, and you can compare the current value not only with the average or the previous year but also with the lowest sales or the highest sales, and that can open the door to many different insights. But we are always using the same method, the window functions: we compare the current value with another value in our data set. So this is how we do performance comparison.
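Putting both comparisons together, a sketch of the full year-over-year query might look like this, assuming the course's gold.fact_sales and gold.dim_products tables joined on product_key:

WITH yearly_product_sales AS (
    SELECT
        YEAR(f.order_date)  AS order_year,
        p.product_name,
        SUM(f.sales_amount) AS current_sales
    FROM gold.fact_sales f
    LEFT JOIN gold.dim_products p
        ON f.product_key = p.product_key
    WHERE f.order_date IS NOT NULL
    GROUP BY YEAR(f.order_date), p.product_name
)
SELECT
    order_year,
    product_name,
    current_sales,
    -- compare to the product's average across all years
    AVG(current_sales) OVER (PARTITION BY product_name) AS avg_sales,
    current_sales - AVG(current_sales) OVER (PARTITION BY product_name) AS diff_avg,
    CASE WHEN current_sales - AVG(current_sales) OVER (PARTITION BY product_name) > 0 THEN 'Above Avg'
         WHEN current_sales - AVG(current_sales) OVER (PARTITION BY product_name) < 0 THEN 'Below Avg'
         ELSE 'Avg'
    END AS avg_change,
    -- compare to the same product's previous year using LAG
    LAG(current_sales) OVER (PARTITION BY product_name ORDER BY order_year) AS previous_year_sales,
    current_sales - LAG(current_sales) OVER (PARTITION BY product_name ORDER BY order_year) AS diff_previous_year,
    CASE WHEN current_sales - LAG(current_sales) OVER (PARTITION BY product_name ORDER BY order_year) > 0 THEN 'Increase'
         WHEN current_sales - LAG(current_sales) OVER (PARTITION BY product_name ORDER BY order_year) < 0 THEN 'Decrease'
         ELSE 'No Change'
    END AS previous_year_change
FROM yearly_product_sales
ORDER BY product_name, order_year;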
All right, so with that you have learned how to analyze the performance of our business. In the next step we're going to do part-to-whole analysis. So let's go. Okay, so what exactly is part-to-whole analysis? Well, we use it in order to find the proportion of a part relative to the whole: we analyze how an individual category contributes to the overall total, in order to understand which category has the biggest impact on the business. Now for the formula, it is very simple: you pick one of your measures, divide it by the total of that measure, and then multiply by 100 in order to get the percentage, by a specific dimension. For example, if you take the sales, you divide the sales by the total sales and multiply by 100, by the category; or you take the quantity divided by the total quantity and find the percentage by country. So for example, again we have our measure split by categories, but now, instead of showing this number, we calculate the percentage. For the first one, we take the 200, divide it by 600, and multiply by 100, so we get the percentage: 33. Once we do that for all the categories, it becomes very easy to see that category P contributes 50% of the overall number, which of course makes it the top performer. You can visualize it in your head as a pie chart: you can see how each part contributes to the whole pie, and with that it helps us understand the importance of each category for our business. So now let's apply this formula to our measures in order to understand the importance of our categories. Let's go. Okay, so now let's do part-to-whole analysis; all we need is one dimension and one measure. So we have the following task, and it is very simple: which categories contribute the most to the overall sales? Now let's do it step by step. First we collect the information: we need the category and the sales amount, and this information comes as usual from the fact sales and from our dimension, the products, right? So we quickly connect them using the product key. Okay, that's all we need for our query. So let's select: we have here the category and the sales amount. Now, the first thing is to calculate the total sales for each category. Let's do that; it is very simple: SUM as total sales, and we group the data by the category. This is basics, right? Now we have the total sales for each of those categories. Now, in order to calculate the percentage, we need two measures: the total sales for each category, which we already have here, and, side by side, the total sales across all categories, the big number without any dimension. But as you look at the result, you can see the granularity here is the category, and we need the total sales again at a different granularity, and in order to mix those together we use the window functions. So how are we going to do it? Either you go over here and start writing your window function directly, and of course you can do that together with the GROUP BY, or you do it as a second step in your query, using either a CTE or a subquery. I'm going to go with the CTE, just to make it clear: category sales, like this. So now let's start selecting the same information again: category, total sales, from our CTE, category sales. Let's execute; we have the same results. And now we build our window function, like this: we want to aggregate all those values, right, to get the total sales over the whole data set, so we say SUM of total sales. And now, in order to get the big number, we say OVER, and inside it we define nothing, because we don't want to partition the data, we don't want to introduce any dimension; we just want the big number. And with that we get the overall sales. So let's execute. Now, as you can see, this is the total sales by the category, so the total sales split by the categories, and this is the overall sales of all orders, of everything, the highest number. Now, since we have them side by side, we can very easily calculate the part-to-whole, the percentage. So let's start doing that.
We need the total sales, and we want to divide it by the overall sales, so we take our window function and put it over here. Then let's multiply it by 100; I'm going to call it percentage of total. So let's execute. Now, as you can see, we are getting zeros, and that's because the total sales is not a float. So what we have to do is CAST it to something like a float. So let's re-execute. And now, as you can see, we are getting the percentages, but with a lot of digits after the decimal point. So we're going to round the numbers: ROUND at the start, then at the end a comma, and let's have, say, two decimals. Let's execute again; now it looks perfect. Now, what we can do is add a percentage sign, and with that we are converting the whole thing to a string. So we do a concatenation: CONCAT at the start, and at the end we add the percent character. And as well we can order the data by the total sales, descending. So let's execute. Now, by looking at the result, you can see the bikes category is dominating; it is overwhelmingly the top-performing category, making 69% of the total sales of our business. So this means, my friends, most of the business revenue comes from the bikes. And as you can see, accessories and clothing are really minor contributors to our business, which is not really good; this is actually a dangerous thing. If you have one category dominating your whole business, you are over-relying on only one category, and if this category fails, then the whole business is going to fail. So looking at this, the business either has to decide to drop the products in those two categories, or to focus on bringing in more revenue from the products inside them. So as you can see, guys, these insights are really valuable for the business: they help the managers and the decision makers understand quickly what is going on and make very critical decisions. And now you can see from the results exactly why part-to-whole analysis is so important: by just looking at the raw numbers, it is really hard to understand the importance of the categories, but seeing the data as a percentage, how each category contributes to the whole sales of the business, makes it easy to understand which category is underperforming or top-performing. And now you have a very simple formula where you can change the metrics: for example, instead of total sales, you can change the aggregation to the total number of orders or the total number of customers. So you can bring any type of measure into this analysis, and you will generate a completely new view for the decision makers in order to develop a new strategy for the business. It was very interesting.
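A compact sketch of the whole part-to-whole query, under the same assumed table and column names:

WITH category_sales AS (
    SELECT
        p.category,
        SUM(f.sales_amount) AS total_sales
    FROM gold.fact_sales f
    LEFT JOIN gold.dim_products p
        ON f.product_key = p.product_key
    GROUP BY p.category
)
SELECT
    category,
    total_sales,
    -- empty OVER() = no partition, so the sum runs over the whole result set
    SUM(total_sales) OVER () AS overall_sales,
    -- CAST to FLOAT avoids integer division; ROUND trims the decimals; CONCAT adds the % sign
    CONCAT(ROUND(CAST(total_sales AS FLOAT) / SUM(total_sales) OVER () * 100, 2), '%') AS percentage_of_total
FROM category_sales
ORDER BY total_sales DESC;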
Now, in the next step, we're going to do my favorite topic: data segmentation using SQL. So let's go. Okay, so what is data segmentation? What we're going to do here is group the data based on a specific range. That means we create new categories and then aggregate the data based on the new category. And the formula for this is very interesting: this time we have a measure by a measure, not by a dimension. So you pick two different measures, convert one of them into a range, a group, and then aggregate the other measure by it. For example, we can calculate the total number of products by sales range, or the total number of customers by age group. So as you can see, we have two measures, and we are trying to combine them in order to create new insights. Let's take the following example. Here we have two measures, and the first step is to take one of those measures and convert it into a dimension, into a category. For example, we say: if the value is equal to or below 100, it is converted to a category called low; between 100 and 200, it is assigned to a new category called medium; and everything above 200 is large. So, as you can see, what we are doing is taking one measure and, based on the range of this measure, building new categories, a new dimension. And now the final step is the easiest one: we aggregate another measure based on the new category. So we get seven for low, six for medium, and 15 for large. So with that, as you can see, we are creating new categories, segments, based on one measure, and then aggregating another measure based on these new segments. And in SQL, in order to create those new categories and segments, we use the amazing CASE WHEN statement, because it lets us define the rules, and based on the range it creates the new category labels. So now let's apply this formula to our data set in order to segment our data. Let's go. Okay, so now let's segment our data, and all we need is two measures. We have the following task, and it says: segment products into cost ranges and count how many products fall into each segment. Now, looking at this task, we have two measures: first the cost, and the second one is the total number of products. And of course we have to segment one of those two measures; in this task we are segmenting the cost. So we focus now on taking this measure and converting it into a dimension. All this information is available in the products table. So let's select a few columns: we get the product key, the product name, and the cost. That's all we need. So let's execute. Now, as you can see, this is our measure, the cost. Now we have to convert this measure into a dimension, and in order to do that, we use the CASE WHEN statement; we always use CASE WHEN in order to create new categories. So let's do it. CASE WHEN: let's start with the first range, say below 100. All the costs that are below 100 get a new label, 'Below 100'. Now let's go to the next range: WHEN the cost is between 100 and 500, all costs in this range get the label '100-500'. This is very simple. Let's add another range, for example between 500 and 1,000; then it gets the label '500-1000'. And now it depends on how many categories and segments you want to create; each condition in this CASE WHEN creates a new value for your dimension. I'm going to stop with that and say, at the end, ELSE: if the cost doesn't fulfill any of those conditions, it is going to be 'Above 1000'. Right?
So that's it. Let's give it a name: it's going to be cost range. Now let's execute and check the result. For example, the cost here is zero, which is below 100: correct. This value is above 1,000; this one is between 500 and 1,000; and this one is between 100 and 500. So everything looks correct. Nice. So with that we are done with the first step, where we have converted one measure into a dimension, and we now have our segments. The next step is to aggregate the data based on this new dimension. Either you do it in one go, or, as I usually do, you put everything in a CTE or a subquery. I'm going to call it product segments, as based on this result I'm going to aggregate the data. So this is my intermediate result, and now we just aggregate the data like this: first our dimension, cost range, and then our measure, COUNT of product key as total products, from our CTE, the product segments, and then GROUP BY our new dimension. That's it; it's very simple. So let's execute. Now you can see in the output we have our segmented measure, and we can see the total number in each of those segments and ranges, and of course we can order the data by our aggregation, the total products, maybe descending. Let's execute. So now, as you can see, we have a lot of products that don't cost much, below 100; after that, between 100 and 500; and the lowest number of products is in the range above 1,000. So we don't have a lot of expensive products, and that is maybe because we have a lot of accessories in the business. So my friends, this is very powerful: if the dimensions in your data set are not enough to create insights, you can take one of your measures, convert it into a dimension using CASE WHEN, and then aggregate your other measures based on this new dimension. So we are deriving new information, and as I told you, just by following this concept of measures and dimensions you can generate an endless number of reports, even if your business or your data set is small.
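Here is a minimal sketch of the complete segmentation, assuming a gold.dim_products table with product_key, product_name, and cost columns:

WITH product_segments AS (
    SELECT
        product_key,
        product_name,
        cost,
        -- WHEN branches are checked top-down, so a cost of exactly 500 lands in '100-500'
        CASE WHEN cost < 100                THEN 'Below 100'
             WHEN cost BETWEEN 100 AND 500  THEN '100-500'
             WHEN cost BETWEEN 500 AND 1000 THEN '500-1000'
             ELSE 'Above 1000'
        END AS cost_range
    FROM gold.dim_products
)
SELECT
    cost_range,
    COUNT(product_key) AS total_products
FROM product_segments
GROUP BY cost_range
ORDER BY total_products DESC;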
Okay my friends, so now let's segment something else, and this time it's going to be a little bit more complicated. We have the following task, and it says: group customers into three segments based on their spending behavior. So we have the VIP customers: they are the customers with at least 12 months of history who are spending more than 5,000. The second category is the regular customers: they also have at least 12 months of history, but they spend 5,000 or less. And the last category is the new customers: their lifespan is less than 12 months. And we have to find the total number of customers in each group. So here we have a lot of measures and pieces. The first one is the total number of customers; this is going to be the final aggregation that we do. But what is interesting is that this time we build the segments based on different columns: first a measure, the total number of months for each customer, and as well the total spending, the total sales. So we have the sales, the total number of months, and the total number of customers. We're going to do it step by step, don't you worry. Now, what I usually do is start by collecting all the data that I need. So what do we need? We need the customer key, in order to do the aggregation for the total number of customers. We need as well the sales amount, right, for the spending. And now, in order to calculate those numbers of months, we need a date, because for that we have to calculate the lifespan of a customer, and we usually derive it from the order date. I'm going to show you how we do it. So we need the order date, and of course we have to select our tables. Let's start with the fact table, fact sales, and we join it with the customers, our customers dimension, and the key for that is the customer key. And here we have to specify which column comes from which table: the customer key from the customers, the sales from the fact, and the order date from the fact as well. So now let's execute. Now we can see our customers, the sales, and the order dates. The sales are going to help us specify the range of spending, but what is interesting is that we have to calculate the lifespan. Now, in order to get the lifespan, we have to find the first order and the last order of each customer: how many months lie between the first order and the last order. In order to do that, we need the MIN function on the order date, this is the first order, and MAX in order to get the last order, right. And since we are using MIN and MAX, we have to group the data, which we need to do anyway in order to get the total spending: for the sales amount, we take the SUM, as total spending. We don't need the raw order date anymore, and the dimension we group by is the customer key. So let's execute. Now in the results we have a list of all our customers, as well as the total spending for each customer, and we have the first order date and the last order date. Now, in order to calculate how many months lie between the first order and the last order, we can use the function DATEDIFF to get a new measure. So let's do that: DATEDIFF, and since we need the number of months, we use month; then the first argument is the first order, MIN order date, and the second one is the latest, MAX order date, and we call this the lifespan. So let's query, and let's have a look at the results. You can see that for this customer, 712, there are 11 months between the first order and the last order, and for this customer over here we have zero, because the first order and the last order fall in the same month, and maybe there is only one order. So with that we have the lifespan, and as you can see, guys, we have derived a new measure from the dimension order date, in order to later derive from this new measure a new dimension, the segments. So we are converting a dimension into a measure and then a measure into a new dimension, and this is what we usually do in analysis and in SQL. So now, do we have all the information for the logic? We have the lifespan, so we have the total number of months, and we have the total spending; I think we are ready to start building our segments. So now what we're going to do is create the segments based on this result that we have prepared; this result is the intermediate result before the final one. Now, either you put it in a CTE or a subquery; well, I usually go and use the CTE. It is nicer.
So: WITH customer spending, and I'm going to put the whole thing in a CTE, and then we can start writing a new query from scratch based on the intermediate result. So let's select again the customer key, the total spending, and the lifespan; we don't actually need the first and the last order anymore, and we get all this information from our new CTE. So let's execute. And now let's start building the segments, and as usual we use the CASE WHEN statement; it is just an amazing statement for deriving and building new columns. So now, what do we have for the first category? These are the customers with over 12 months of history who spend more than 5,000. So we say: if the lifespan is higher than 12 and the total spending is higher than 5,000, then we have our VIP customers. So this is the first label. Let's go to the second one: the lifespan, I think, also more than 12. Well, let's check: it says at least 12, so I have a mistake here; it has to be larger than or equal. Now it is correct. So the customers that have at least 12 months of history but spend 5,000 or less: the condition on the lifespan stays the same, but the total spending will be less than or equal to 5,000, and these are the regular customers; they get this label. Now, if a customer fulfills neither of those two conditions, what does that mean? It means this is a new customer, right? So they get this label. Let's add the END, and let's call it customer segment. So let's execute. Now let's have a look at customer 712: the total spending is less than 5,000, so this customer is not a VIP, and the lifespan is also less than 12, so for us this is a new customer. The next one is a VIP: this customer has a history of at least 12 months, we have 16 months here, and the total spending is more than 5,000; that's why this customer is a VIP. But now let's go and search for a regular customer: 2349. This customer spent less than 5,000, so we are fulfilling this condition over here, and this customer also has at least 12 months of history; that's why we have a regular. So now, as you can see, we have derived a new dimension from two measures, the lifespan and the total spending. Now, of course, the last step is to find the total number of customers for each of those categories. So what we're going to do is remove all this stuff and start with our new dimension, and then comes the aggregation: COUNT of customer key, as total customers, and then we have to group the data by our new dimension. But this is going to be really annoying if I take the whole CASE WHEN and put it in the GROUP BY, because it means that each time I change the logic I have to maintain it twice: once in the SELECT statement and once in the GROUP BY. So instead of that, I changed my mind: I'm going to keep the aggregation as a second step. So we need the customer key, we have the definition of our customer segments, and now I'm going to use a subquery, where I put the aggregation as the second step. So my friends, that means this is again a second intermediate result; you can of course put it in a second CTE.
So that means this is the first intermediate result, where we have created the lifespan and the total spending; the second intermediate result creates the customer segments; and the third and last step does the final aggregation. So we're going to do it like this: SELECT our dimension, customer segment, then we COUNT the customer key, from our subquery, and don't forget to GROUP BY our dimension, the customer segment. I think I have it wrong; all right. So this is the subquery, and this is the final step where we aggregate everything. I'm going to order the data by the total customers, like this. So now let's execute the whole thing; well, descending, not ascending. Okay. So now we can see from our results that the highest number of our customers belongs to the category new: we have 14,000 customers that are new to our business. Then, in the second category, we have the regular customers, around 2,000 customers, and finally we have 1,655 VIP customers in our business. So with that, my friends, we have done data segmentation. It is amazing: we have segmented our customers based on their spending behavior, and as you can see, all this information is entirely derived from our data. This helps us gain a deep understanding of the behavior of our customers, and of course it can also help in making smart decisions.
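As a sketch, the full three-step segmentation could read like this, assuming gold.fact_sales and gold.dim_customers joined on customer_key:

WITH customer_spending AS (
    -- step 1: derive total spending and lifespan per customer
    SELECT
        c.customer_key,
        SUM(f.sales_amount) AS total_spending,
        MIN(f.order_date)   AS first_order,
        MAX(f.order_date)   AS last_order,
        DATEDIFF(month, MIN(f.order_date), MAX(f.order_date)) AS lifespan
    FROM gold.fact_sales f
    LEFT JOIN gold.dim_customers c
        ON f.customer_key = c.customer_key
    GROUP BY c.customer_key
)
SELECT
    customer_segment,
    COUNT(customer_key) AS total_customers
FROM (
    -- step 2: turn the two measures into a dimension
    SELECT
        customer_key,
        CASE WHEN lifespan >= 12 AND total_spending > 5000  THEN 'VIP'
             WHEN lifespan >= 12 AND total_spending <= 5000 THEN 'Regular'
             ELSE 'New'
        END AS customer_segment
    FROM customer_spending
) t
-- step 3: final aggregation by the new dimension
GROUP BY customer_segment
ORDER BY total_customers DESC;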
All right my friends, so with that we have covered the five different types of data analytics that we can do using SQL. Now, what I usually do as a last step in my projects is to collect all the different explorations and analyses that I have done on the data set and put everything into, for example, one view or table, and then offer it to other users. This helps the other users and stakeholders do quick analysis for decision-making. So what we're going to do now is take some requirements and bring a lot of different analyses together in one big script, in order to deliver insights about one object, for example the customers. I'm going to show you the requirements of this report, we're going to analyze them, and then we start writing the script. So let's go. Okay friends, so now let's create a customer report, and here are the requirements. We have a general statement: this report should consolidate key customer metrics and behaviors. First, we have to gather all the details about the customers, like names, age, and transaction details. Then we have to segment the customers into the categories VIP, regular, and new, and as well by the age groups. We have to provide aggregations like the total orders, total sales, quantity, products, and so on. And we have to generate important KPIs, like the recency, the average order value, and the average monthly spend. So we have a lot of things, and we're going to do it step by step. All right, now I'm going to take you step by step through the process of building the kind of complex query that I usually use to build a report. The first thing I usually do is start selecting the data from the database, and I usually start with the fact table. So this is my starting point, and then I usually join it with the dimensions, and here I use a LEFT JOIN. After that, I think about how to filter the data, because usually we don't need all the data that is available in the database, and of course I will not be selecting all the columns in the result, only the relevant columns that I need for my report. So since we have a complex query, we will be dividing the process into multiple steps, and I usually call this first step the base query: this is going to be the foundation, the scope, for the next steps. And since we have multiple steps, I'm going to put this in a CTE, so we have it as an intermediate result. And in this step we will also do a few transformations, like calculating and deriving new columns, or maybe formatting the date, so some basic transformations. So now let's build this result for our report. The first step is retrieving the core columns from the tables, so let's do it together. We need of course our fact table, fact sales, and we need our dimension, the gold customers, and as usual we connect them. All right. Okay, so this is the base, and now we retrieve all the columns that we need for our report. So let's start picking: the order number, the product key, the order date, the sales amount, the quantity, and I think that's all from the fact. Now let's get a few pieces of information from the customers: the customer key, the customer number, the first name, and as well the last name. And what else? We can get the birth date, because we have to create the age groups. So: birth date; let's query. I think those are all the columns that we need for the next steps. And now, before we proceed with the aggregations, let's think about filtering the data. As I recall, we have some orders where the order date is NULL, so I'm going to remove those: order date IS NOT NULL. So that means that in the first query, the base query, I'm not only selecting the columns that I need for the report, I'm also defining the scope of the data set by filtering the data; you could as well limit the scope here to only one year or something. Now, what else can we do? Let's think about all those columns and whether we can apply any transformations in order to prepare them for the aggregations. For example, I'm going to say: you know what, instead of the first and the last name, I'm going to put them together in one column, the customer name; it's better than having two columns. So let's do it: we say CONCAT, then the first name, and we have a separator between them, you can have a dash or a white space like this, and after that the last name. So let's call it customer name, and we can get rid of those two columns. Let's execute; and with that, you have everything in one column. Now, another thing we can prepare: we don't actually need the birth date; what we need for our report is the age groups. That means we have to calculate the age, so let's transform it: DATEDIFF, we want it in years, between the birth date and the current date from the system, and we call it age. So let's execute again. Perfect. So with that we have all the data that we need for our report.
So let's put everything in one CTE: WITH base query AS, and we put everything in this CTE, and I'm going to put this comment over here inside the CTE. Perfect. And now we write a query from scratch based on our intermediate result: SELECT from base query; let's execute. All right, so looking at our report, with that we have the important columns. Now, in the next step, we're going to do aggregations on top of this intermediate result. Here we do all the aggregations that are needed for the report, and we put everything again in a CTE, as an intermediate result, which keeps everything modular and easy to read. So now let's do the necessary aggregations on the result that we have previously prepared; that's why this is very important as a second step in your report: always tend to make a separate CTE only for the aggregations. So let's do that. I'm going to select again all the customer information, like the customer key, number, name, and age; I'll just copy and paste and put it over here, and we just need the column names. Now, after that, we start doing aggregations. So what do we want to aggregate? First, for example, the total number of orders: COUNT DISTINCT order number, as total orders. So this is one aggregation. We can sum all those sales amounts as total sales, and the quantities as well: SUM quantity as total quantity. And we can also count how many products our customer ordered: the product key, as total products. What I'm doing now is just looking at our intermediate result and trying to figure out what we can aggregate; it makes no sense to aggregate the age, for example. So from the order number we have the total orders, then the total products, the sales amount, and the quantity; from the right side, the customer details, we cannot aggregate anything, but from the fact table columns we can do a lot of aggregations. Now, what can we do with the order date over here? We can, for example, find the last order date of our customer, which is a really nice piece of information: MAX order date, as last order. And of course we can calculate the lifespan, which we will need, as you remember, in order to categorize our customers; I will just copy and paste it from the previous query: the DATEDIFF in months between the first order and the last order of the customer, and we call this the lifespan. Okay, so we derived two measures, two aggregations, from the order date. Now I think we have done everything possible, and what is missing, of course, is the GROUP BY, because we are doing aggregations and we are grouping by the customer details: the customer key, customer number, name, and age. So I think we have everything for our aggregations; let's execute. We get a list of all customers, a few details about each customer, and now a lot of measures: the total orders, total sales, total quantity, total products, the last order, and the lifespan. And with that we have covered this part over here, where we provide aggregations at the customer level. So we have the details and we have the aggregations.
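A condensed sketch of these first two steps, the base query CTE and the aggregation on top of it; gold.dim_customers and the birthdate column are assumed names:

WITH base_query AS (
    -- step 1: join, filter, and light transformations
    SELECT
        f.order_number,
        f.product_key,
        f.order_date,
        f.sales_amount,
        f.quantity,
        c.customer_key,
        c.customer_number,
        CONCAT(c.first_name, ' ', c.last_name) AS customer_name,
        DATEDIFF(year, c.birthdate, GETDATE()) AS age
    FROM gold.fact_sales f
    LEFT JOIN gold.dim_customers c
        ON f.customer_key = c.customer_key
    WHERE f.order_date IS NOT NULL
)
-- step 2: aggregate to one row per customer
SELECT
    customer_key,
    customer_number,
    customer_name,
    age,
    COUNT(DISTINCT order_number) AS total_orders,
    SUM(sales_amount)            AS total_sales,
    SUM(quantity)                AS total_quantity,
    COUNT(DISTINCT product_key)  AS total_products,
    MAX(order_date)              AS last_order_date,
    DATEDIFF(month, MIN(order_date), MAX(order_date)) AS lifespan
FROM base_query
GROUP BY customer_key, customer_number, customer_name, age;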
All right, so with that we have all the preparations required to build the final result. It really depends on the scenario: if possible, we can take all the data from one CTE, or, if needed, we can get it from multiple CTEs. In our scenario, we're going to take it from the second CTE, the aggregations, and prepare the final result. Here we bring everything together, and we might introduce final transformations that are needed for the report. So let's write the query for the final result; now we can start segmenting our customers and as well creating the KPIs. So let's go to the third step. I'm going to put this in a CTE and call it customer aggregation, and based on this result we will write the final query. I always like to put a comment on each step: the first CTE is the base query, where we just joined and prepared the data; the second one is for the aggregations; and the final one is for the final result. So let's start writing our final query. We start with SELECT, and I'm going to list again all the customer information: the customer key, customer number, name, age, and so on. And after that, we need to create the age categories. Then I'm going to get all those measures as well from our CTE, but of course without the calculations; I just need their names. So with that we have everything from our previous CTE, the customer aggregation. Okay, so let's just test it; everything is working. So now, what do we have to do? We have to create a few categories: the age category, and as well the segments of the customers. For segmenting the customers we have already written the query, so I will just copy and paste it from the previous analysis; it looks like this: if the lifespan is at least 12 months and the sales are above 5,000, then VIP; else, if 5,000 or less, then regular; otherwise it is a new customer. So this is our first segment. The second segment, the age groups, we're going to build now, and again: CASE WHEN. If the age is, for example, less than 20, then the customer is 'Under 20'. Let's make another range, where we say the customer's age is between 20 and, let's say, 29; then we have the second range. And we can keep repeating the same thing: 30 and 39, I belong to this group. The next one, let's have the 40s as well, so 40 and 49, same thing over here. And now ELSE: let's say '50 and above'. So let's END it, as age group; I just want to line it up a little, like this. Okay, now it looks nice; with that, again, we have turned a measure into a dimension. Let's execute. Now, checking the results, we have the details of the customers, and we have a new category, and as you can see it is working: 54 is above 50, this one is in the range between 40 and 49, and here we have 67, above 50. I believe we don't have any customer below 20, or even between 20 and 30. Okay, so with that we have created our two categories, and looking at the report you can see we can now segment the customers into categories: VIP, regular, new, and the age groups. And with that we have covered those three requirements, and we come now to the last requirements: we have to calculate the following KPIs. The first one is an easy one: the recency, how many months since the last order? We have calculated the last order for the customer over here; it is this one. And now, in order to find the recency, it is very simple. All we have to do is take this over here.
I will put it maybe after the segmentation, and all we have to do is use DATEDIFF as usual: in months, between the last order date and GETDATE. So as you can see, we use this setup in many analyses: we always take the difference between a date from our data set and the current date and time, and with that we get the recency. So let's execute. Now you can see how many months have passed since the last order of each customer, and of course you can verify it using the last order date. This is really important in order to understand whether the customer is still active or inactive. Okay, so that was the first, easy KPI. Now let's go to the second one. It says: calculate the average order value. So how are we going to do this? Let's go back over here. Now, in order to compute the average order value, we have to divide the total sales by the total orders: how much revenue did the customer generate, divided by the total number of orders. So it is very simple; let's write it. We go to the end of our query, where we put our KPI, and I'll add the comment here: compute average order value, AOV for short. So we say total sales divided by total orders, and let's call it the average order value. So let's execute, and if you scroll to the end over here, you can see the average order value of our customers. But now, whenever you are dividing numbers, you have to be careful that you are not dividing by zero, otherwise you will get an error. Imagine that a customer has zero orders, didn't order anything: you might get an error. In our scenario we don't have that, because we are starting from the order table, the fact table, but I still like to make sure this never happens, and for that I usually use the CASE WHEN statement. A very simple one: if the total orders equal zero, then make it zero; otherwise, do the calculation that we talked about, like this, and at the end we add an END. So that's it, and with that I make sure we will never divide by zero. It was simple, right? Let's go to the last KPI, the average monthly spend; I'll add the comment: compute average monthly spend. So now, since we are talking about the spending, we need the total sales, right? How much sales did the customer generate in total, and then we divide it by the number of months, and with that we get the average monthly spend. That means we can divide the total sales by the lifespan, as we calculated it: the period in which the customer has been active, from the start until the end. Okay, so let's do it step by step. First we have to be careful that we are not dividing by zero, and I believe the lifespan does contain zeros. So, as usual: CASE WHEN lifespan equals zero, THEN, and this time we will not make it zero: the customer exists for only one month, so we can take the customer's total sales, and we don't have to divide by the months in order to find the average, because the average is equal to the total sales itself. So with that we make sure we are not dividing by zero; otherwise we have our calculation: total sales divided by the lifespan, so the total sales divided by the months, and with that we get the average monthly spend. Then END, AS, and we call it the average monthly spend. Perfect.
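The two guarded KPI expressions would sit in the final SELECT list roughly like this (these are fragments, not a standalone query; the column names come from the aggregation CTE sketched earlier):

-- average order value: guard against customers with zero orders
CASE WHEN total_orders = 0 THEN 0
     ELSE total_sales / total_orders
END AS avg_order_value,

-- average monthly spend: for a one-month lifespan the average is just the total sales
CASE WHEN lifespan = 0 THEN total_sales
     ELSE total_sales / lifespan
END AS avg_monthly_spend

An equivalent shorthand, if you prefer NULL over a fallback value for the edge case, is total_sales / NULLIF(total_orders, 0).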
So let's try that out and scroll to the right side: with that we have our third KPI, the average monthly spend, and with that, guys, we now have a full report about the customers and we have covered all the requirements. All right, so we have the final result and we have fulfilled the requirements. So what we're going to do now is take the whole query and put it in the database as a view, and once we have the view, the report, in the database, we can share it with others. Now another data analyst on the team can go and maybe create a dashboard in order to visualize the data, using a BI tool like Tableau or Power BI. In this scenario, the user can connect your view, the final prepared data, to the dashboard, and with that the user can quickly generate insights without going through a lot of steps to prepare the data for the visualizations. And of course the data analyst could also connect the dimensions and facts directly, but having this one solid view is way easier to consume. The data analyst can as well write a query on top of your view in order to generate quick insights. So as you can see, using only SQL you are covering a lot of complex steps in order to make the data ready for reporting and analysis, and this is what usually happens in real projects: we put the query in the database so that others can use it. So what we do is very simple: CREATE VIEW, we put it in the gold layer, we call it report customers, and then AS, like this; let's execute. It is successful. Now, if you go to our database and check the views, you will find a new view called gold report customers, and all you have to do is a simple SELECT from gold report customers, and you will get an amazing report about the customers. This kind of report is very important, because you are giving a full picture, a 360° view, of all your customers: you have details, categories, measures, everything in one go, and it makes life easier for any user of this view to quickly understand the data and generate insights based on this one view, which of course helps your consumers. I just want to show you now what this means. If a user uses your report, either in SQL or maybe by connecting it to Power BI or Tableau, they can generate insights immediately. For example, they can say COUNT customer number, as total customers, and then take any dimension, for example the age group, something like this, and then GROUP BY the age group; let's just put it here first. Then they can add any other measure, for example the total sales, or any other measure that you have in this view, and execute, and they can quickly do analysis on top of your view without having to go back to the facts and dimensions. So this is like one extra prepared layer on top of the data model that you have built. And if they don't want to group by the ages, they can use the customer segments instead, and it will work just the same. So they can quickly analyze the new derived information that you have prepared in your report.
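Once the view exists, such a consumer query is a one-liner; a sketch, assuming the view and column names built above:

-- anyone with access to the gold layer can aggregate the prepared report directly
SELECT
    age_group,
    COUNT(customer_number) AS total_customers,
    SUM(total_sales)       AS total_sales
FROM gold.report_customers
GROUP BY age_group;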
And now what you're going to do is prepare the second report, where you have to build complete insights about the products of the business. It is very similar to the customers: we want to generate a report for the products. You have to provide details like the product name, category, subcategory, and the cost. You have to segment the products by revenue, so you can have categories like high, medium, and low. Then you have to provide the basic aggregations at the level of the products, and then calculate a few KPIs. So as you can see, it is very similar to the customers. Now what you have to do is pause the video and follow the same steps as for the customers: join the tables, create the aggregations, put everything in CTEs, and at the end, once you are done, create the view where you have the report about the products. I'm going to go now and do it offline and I will see you soon.

Okay my friends, I hope you are done with the report. I'm going to show you quickly how I've done it. I've created a new view called report_products, and then we start with the base query, where we have joined the fact table with the product dimension and collected all the columns that we need for the report, and we put everything in the first CTE. This is the first step, and from my side there was no need for any transformations here. Now we go to the second step, and here we put all the different types of aggregations in one go. So we calculate the lifespan, the last sale date, total orders, total customers, sales, quantity, and as well I have created the average selling price of the products. It is very simple: we divide the sales amount by the quantity. So these are the basic aggregations about the products.

Finally, we have the final query. We start by selecting the basic information about the products, so we have the key, name, category, and then we have the recency, and we have our new segments. This one is very easy for the products: we are saying if the total sales is higher than 50,000 then this is a high performer, if it's between 10k and 50k then it is mid-range, and otherwise it is a low performer. So the segmentation of the products is very simple. After that we have all the measures that we aggregated in the CTE, and now we come to the two KPIs. It is very similar to the customers. The first one, the average order revenue, is simply the total sales divided by the total orders, and you have to take care of the zeros, of course. For the average monthly revenue, we divide the total sales by the lifespan of the product, and of course if the lifespan is zero, so it is only one month, then it is simply the total sales, and with that you generate the average monthly revenue. So as you can see, it is very similar to the customers, but the focus here is the products.

Now of course we put this query in a view, so we have report_products side by side with report_customers, and now we have a really amazing report about the products where we have everything. We have a lot of details about the products, we have as well a dimension in order to segment our products, and we have a lot of measures that are really important for each product: the total number of orders, the sales, how many customers ordered the product, the average price, the average order revenue, and the average monthly revenue. This gives you really deep insights about each product of your business, and of course this is very helpful in order to compare the products, right? This is core analysis that you're going to need a lot in your business; that's why we offer it as a view. So I think we have now two amazing reports about our data. All right, my friends.
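Here is a condensed sketch of how that product view could be structured, assuming table names like gold.fact_sales and gold.dim_products (all identifiers are illustrative, following the walkthrough rather than reproducing it exactly):

-- Product report: base query -> product-level aggregation -> final output.
CREATE VIEW gold.report_products AS
WITH base_query AS (
    -- Step 1: join the fact table with the product dimension.
    SELECT
        f.order_number,
        f.order_date,
        f.customer_key,
        f.sales_amount,
        f.quantity,
        p.product_key,
        p.product_name,
        p.category,
        p.subcategory,
        p.cost
    FROM gold.fact_sales f
    LEFT JOIN gold.dim_products p ON p.product_key = f.product_key
),
product_aggregation AS (
    -- Step 2: aggregate everything to the product level.
    SELECT
        product_key,
        product_name,
        category,
        subcategory,
        cost,
        DATEDIFF(month, MIN(order_date), MAX(order_date)) AS lifespan,
        MAX(order_date)              AS last_sale_date,
        COUNT(DISTINCT order_number) AS total_orders,
        COUNT(DISTINCT customer_key) AS total_customers,
        SUM(sales_amount)            AS total_sales,
        SUM(quantity)                AS total_quantity,
        AVG(CAST(sales_amount AS FLOAT) / NULLIF(quantity, 0)) AS avg_selling_price
    FROM base_query
    GROUP BY product_key, product_name, category, subcategory, cost
)
-- Step 3: final output with recency, segmentation, and the two KPIs.
SELECT
    product_key,
    product_name,
    category,
    subcategory,
    cost,
    DATEDIFF(month, last_sale_date, GETDATE()) AS recency,
    CASE
        WHEN total_sales > 50000  THEN 'High-Performer'
        WHEN total_sales >= 10000 THEN 'Mid-Range'
        ELSE 'Low-Performer'
    END AS product_segment,
    lifespan,
    total_orders,
    total_customers,
    total_sales,
    total_quantity,
    avg_selling_price,
    -- Average order revenue, guarded against zero orders.
    CASE WHEN total_orders = 0 THEN 0
         ELSE total_sales / total_orders END AS avg_order_revenue,
    -- Average monthly revenue; a lifespan of 0 means a single month.
    CASE WHEN lifespan = 0 THEN total_sales
         ELSE total_sales / lifespan END AS avg_monthly_revenue
FROM product_aggregation;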
So now, don't forget to put all your work in a Git repository in order to share it with others as a successful project. As usual, we have the data sets, the documentation, and as well the scripts that you have written throughout this project, and here I'm putting everything together. So we have all the activities from the exploration as well as the advanced analyses that we have done: the change-over-time analysis, the cumulative analysis, the performance analysis, the data segmentation, the part-to-whole analysis, and as well our two new reports. So I recommend, if you haven't done it yet, go and create a repository now and put all your work there, to make sure that everyone can access and see your work. And my friends, don't forget to add nice comments to your code; the formatting and styling of your code should be perfect. So if you haven't done that yet, go and do it now.

All right my friends, with that we have done the last step in our road map. We have created two solid reports for our users, and with that we have completed all the steps of our advanced analytics project. With this project and the previous projects, you can now see the full picture of how to do data analytics on any data set using SQL: starting with the first step, where we explored the database, and ending up with very solid reports where we have consolidated everything in one view, and with that we now have a really great understanding of the business and of our data. Now what you can do is grab any data set on the internet and go through all these phases again, and I promise you, at the end you will have a full picture and understanding of the business. This is exactly what I do in each project when I want to understand any type of data set.

All right my friends. With that we have covered the last type of SQL project: the advanced data analytics. And with that we now have three solid projects using SQL, and they are very similar to real-world projects in the industry, especially if you want to be a data engineer or a data analyst. And my friends, we have covered the last chapter in our course, the advanced level in SQL. Those are all the chapters that I have designed for you, to take you from the basics to intermediate and then to the advanced topics. My friend, you made it. Congrats. You should be really proud of yourself. And now, with that, I can say that I have shared everything that I know about SQL, and you can now solve any complex task using SQL, like I do in my real projects. I hope that you have enjoyed the journey, and if you did, and you want me to create more free courses like this, make sure to support the channel by subscribing, liking, and commenting. This of course helps the channel grow and reach others, and as well motivates me to make more content like this. So nothing is left to say. Thank you so much for watching, and I will see you in the next course.

