
Webinar recording

PostgreSQL Superpowers in Practice

PostgreSQL is more than a database — it can partition data, stream events, power maps, and even act like a message queue. Let’s build a real fleet management system that proves it.

🔗Why attend?

PostgreSQL is one of the most widely used databases in the world, yet it is also one of the most underestimated. Many developers use it because it is free, without realizing how much power it puts at their fingertips.

In this webinar, you will see PostgreSQL as a prime example of a modern relational database and how its built-in features and rich extension ecosystem can save you time, cut operational complexity, and open doors to new architectural possibilities.

Rather than a simple “hello world” sample, we will work through real scenarios and live code, combining PostgreSQL capabilities to solve practical problems. By the end, you will have seen a fully working fleet-management reporting and alerting system come to life.

🔗In this webinar you’ll learn how to:

  • Handle massive datasets effortlessly using native partitioning and time-series optimizations
  • Automate reporting with continuous aggregations without having to poll the database
  • Stream database changes in real time by using logical replication as a push-notification system
  • Detect issues as they happen, for example spotting unusual fleet activity, without constant polling
  • Enrich data models with geospatial tracking, generated columns, and advanced indexing
  • Extend PostgreSQL’s powers with TimescaleDB, PostGIS, and other plug-ins — all while staying in SQL

🔗Transcription

00:00:01 Daniel Marbach
Hello again, everyone, and thanks for joining us for another Particular live webinar. My name is Daniel Marbach. Today I'm joined by Oskar Dudycz, who is going to talk about the superpowers of PostgreSQL. Just before we begin, please use the Q&A for your questions. We'll cover them at the end or follow up later. Today's webinar is recorded and we'll send the link afterwards. Okay, let's talk about the superpowers of Postgres. I'm really excited. So Oskar, a warm welcome from my side.
00:00:34 Oskar Dudycz
Hey, Daniel, welcome all. Thank you for the invitation. So let's get going. Let's start with the basic question: why are we even talking about PostgreSQL? Why are we even having this webinar? The obvious answer, and also what I saw in the survey that you just filled in, is that PostgreSQL is great because it's cheap, or actually even better than cheap, because it's free. So if you are considering something that you don't need to pay a license for, then that's an obvious choice. However, there are many relational databases which are free and quite good. So why Postgres, actually? And before we get to that, let me also tell you why I will be talking to you about Postgres. As Daniel said, my name is Oskar Dudycz. I can call myself a mature engineer because this year I completed 18 years of commercial experience, and in my projects, not all, but most of them in recent years, I've been using PostgreSQL. And not only in my commercial business applications: I was also involved in building, and am still involved in building, tooling based on PostgreSQL.
00:02:14 Oskar Dudycz
As we are in a Particular, so an NServiceBus, webinar, at least some of you may know my work and the rest of the team. For a few years I was involved in building Marten, and Marten is a .NET tool that allows you to use PostgreSQL as a document database, using the native JSON capabilities. However, today I'm not going to be talking about Marten, and I won't be talking much about JSONB. I want to show you a slightly different use case, and to show you why PostgreSQL actually stands out right now among other relational databases, why it is becoming a kind of lingua franca and the default choice, or the safe default choice, for a relational database, so you can at least get a feeling for how you can benefit from using it.
00:03:29 Oskar Dudycz
I always like to start with a business example. So today we will be building a simple, but still, I think, not that far from production-ready system, based on PostgreSQL, some of the features it brings, and also some extensions. And it's good that you have me here because it's great to have an expert. Okay, I'm joking, because I'm not an expert in car fleets. Actually, I'm the weird type of person that doesn't have a driving license. But I know someone who's an expert, or actually who was an expert, because my wife for a few years was managing a quite big car fleet of a few thousand cars in her company.
00:04:29 Oskar Dudycz
So we will be building such a reporting system to manage all the trips, et cetera, and by that I'll show you how you can benefit from PostgreSQL. And as this is a webinar, not a real conference talk or a meetup where I can see your faces and your reactions, I would be really grateful if you could, as I code, type something in the chat or write a question in the Q&A. I will try to answer those questions at the end of the webinar, and some of them even during. It'll cheer me up and give me more energy if you type as I go. Okay, so let's have a look at the code, or actually, right now, at the blank screen of my IDE. Before I go on, a warning: if you are allergic to SQL, there will be a lot of SQL today. So yeah, beware.
00:05:35 Oskar Dudycz
So let's say that we will be building the reporting and alerting system based on business trips, so all the trips that we are making with the cars belonging to our company, and the basic example of a table in which we could store such trips can look like this. We have when the trip was started, or the date when it was made, the vehicle ID, the driver name, the start and end location, the distance we covered, and the fuel used, because mostly what the company is most interested in is how much fuel we used, since that's something we need to pay for.
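For reference, a minimal sketch of what the table shown on screen might look like; column names and types here are assumptions, not copied from the recording:

```sql
-- Assumed layout of the trips table (names and types are illustrative).
CREATE TABLE trips (
    trip_date      TIMESTAMPTZ   NOT NULL,  -- when the trip was made
    vehicle_id     INT           NOT NULL,
    driver_name    TEXT          NOT NULL,
    start_location TEXT          NOT NULL,
    end_location   TEXT          NOT NULL,
    distance_km    NUMERIC(10,2) NOT NULL,  -- distance covered
    fuel_used      NUMERIC(10,2) NOT NULL   -- fuel burned
);
```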
00:06:30 Oskar Dudycz
And you can say that, yeah, this table is actually super simple, because it is, and maybe it won't show a lot of data. But let me show you a bit of math. Let's say that we have 10 cars and we will be recording a GPS location every second, not that crazy. So 60 seconds multiplied by 60 minutes and eight hours, so let's say that we are just measuring the classical nine-to-five working day, through 365 days of work. For 10 cars that will be around 105 million recordings. Not that scary for a relational database. However, what if we have a company with 3,000 cars, or if this evolves into the structure natural for a relational database, a star of tables? So we will have some dictionary tables: instead of keeping the driver name inline, we'd keep a driver ID, and next to the driver name we would add more information, et cetera.
00:07:58 Oskar Dudycz
Then this table and this whole amount of data could actually run into some real throughput issues and quite a big storage size, especially if we count all the other data. So let's think about what we could do with that. The first thing that Postgres gives us, and that's a native feature, is partitioning, and with partitioning, as you can see, the definition of the table looks exactly the same as when I showed you the default layout. However, we are saying that we'll be partitioning our table by time. Let me show you a bit of magic drawing. Let's say that this is our trips table. As you see, I'm terrible at drawing, probably equally bad at coding, so sorry for that. And let's say that we have a time column; this is our trips table. If we were storing all this information, then, as you saw, we could easily have hundreds of millions or even billions of records in such a table.
00:09:35 Oskar Dudycz
So that's quite a lot of data to filter, especially since mostly we are interested in some relatively short time window. For instance, most of the time we are just interested in the trips that happened in the last month, maybe the last quarter. So wouldn't it be nice if we could automatically split the storage of the data into a set of tables, where we could say, "Hey, those are our trips for January 2025, those are our trips for February 2025," and so on to December 2025? Of course, we could do it manually, we could name those tables trips plus the date of the month, but then it would be great if we could still insert into and select from the regular table. And that's precisely what PostgreSQL can give us. So if we say that we are partitioning the table by range, then Postgres will automatically take care of finding out which partition a row should belong to.
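A sketch of the same table declared as a range-partitioned table, with one example monthly partition (names follow the assumed schema above):

```sql
-- The same definition, but partitioned by the trip date.
CREATE TABLE trips (
    trip_date      TIMESTAMPTZ   NOT NULL,
    vehicle_id     INT           NOT NULL,
    driver_name    TEXT          NOT NULL,
    start_location TEXT          NOT NULL,
    end_location   TEXT          NOT NULL,
    distance_km    NUMERIC(10,2) NOT NULL,
    fuel_used      NUMERIC(10,2) NOT NULL
) PARTITION BY RANGE (trip_date);

-- One partition per month; Postgres routes inserts automatically.
CREATE TABLE trips_2025_01 PARTITION OF trips
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
```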
00:11:20 Oskar Dudycz
However, if we have a look at the tables, then, as you see, we have the trips table already here, let me refresh it, and we don't have any data there yet. Let me go back here: we don't see any partitions down there, and unfortunately, or maybe fortunately, Postgres is not smart enough to predict how we will be partitioning. So we need to tell Postgres explicitly whether we will be partitioning by, for instance, a day or a month or so. We can do it by saying that, based on the partitioning scheme, we would like to create partitions of this table, for instance for the current year. And I'm wondering how many of you saw a WHILE loop in SQL recently? I know it looks scary, but it does the job, because, as you can see, we are going month by month, adding one month to our date, and based on that we are creating the partitions.
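A hedged reconstruction of the loop from the demo, creating one partition of trips per month of the current year:

```sql
-- Create monthly partitions for the current year in a WHILE loop.
DO $$
DECLARE
    from_date DATE := date_trunc('year', now());
    to_date   DATE;
BEGIN
    WHILE from_date < date_trunc('year', now()) + INTERVAL '1 year' LOOP
        to_date := from_date + INTERVAL '1 month';
        EXECUTE format(
            'CREATE TABLE IF NOT EXISTS %I PARTITION OF trips
                FOR VALUES FROM (%L) TO (%L)',
            'trips_' || to_char(from_date, 'YYYY_MM'),
            from_date, to_date);
        from_date := to_date;
    END LOOP;
END $$;
```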
00:12:56 Oskar Dudycz
So let's try to do it and see what happens next. When I refresh, as you see, we already have those 12 partitions, and all of those partitions, as you can see, have exactly the same structure. Cool. So as not to scare you with that: there are options to make it automated without doing this WHILE loop, et cetera. In the past, when Postgres didn't have this feature, people were using triggers or some cron job. But even with this WHILE loop, we can create the partitions upfront if we have such an explicit definition of how we want to partition our data. So that shouldn't be that scary. Of course, I'm not saying that you should write WHILE loops in SQL every day, probably that's not worth it, but sometimes it's fine. So let's try to append some data. Here I'm adding some records for July, some for September.
00:14:13 Oskar Dudycz
I'm not sure how many of them there will be. So yeah, it seems that it succeeded fine, eight rows affected. See, I added the incredible amount of eight records. So let's have a look here and see if we can still query our partitioned table. Cool, we can still query our partitioned table. Fine. What's nice is that we can also query a partition explicitly. See, we don't have any for January; let's try July. Yeah, we have four records for July, and see, we have four records for September, so we can explicitly query those partitions.
00:15:11 Oskar Dudycz
So then let's have a look and see what's happening behind the scenes, using our magic trick: EXPLAIN. I'm wondering how many of you are using EXPLAIN, let's say, at least once per week. Usually when I ask, the answer is much more rarely. EXPLAIN can tell us precisely what our database is doing behind the scenes, or actually what the database predicts it will be doing. So if we are doing SELECT everything FROM trips without any WHERE statement, then, not surprisingly, it tells us that it will be doing a scan over all the partitions. And that's not always great, because one of the benefits of partitioning is that, once it's set up, each partition can even live on a different disk, which is a really cool feature, because the old data from those old months we could move to some smaller or slower storage, and the partitions representing the current month we can keep on something that is much faster.
00:16:42 Oskar Dudycz
So what to do with that? What PostgreSQL can do is that if we query by a specific date, it doesn't have to be a month, it can be just a date, then, as you see, when I query for January, EXPLAIN tells me that it will go just to the specific partitions. And that's nice. Also, if we query the specific partition table, see, we go directly to that partition table. And I see that there are some questions around partitioning. One downside of partitioning is that we need to set it up upfront. If we didn't set up partitioning for our regular table where we already have data, then it won't magically distribute the data. We would need to do some migration and move the records stored in the regular, let's say default, table into the partitioned table; that won't happen automatically.
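The EXPLAIN experiments from the demo look roughly like this (partition names follow the assumed naming above):

```sql
-- No WHERE on the partition key: the plan scans every partition.
EXPLAIN SELECT * FROM trips;

-- A date predicate lets the planner prune to the matching partition.
EXPLAIN SELECT * FROM trips
WHERE trip_date >= '2025-01-01' AND trip_date < '2025-02-01';

-- Querying a partition table directly also touches only that table.
EXPLAIN SELECT * FROM trips_2025_01;
```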
00:18:12 Oskar Dudycz
And what about the number of partitions? Of course, it doesn't stretch to the extreme of millions of partitions. I'm not sure if there is a hard limit; I think thousands of partitions are fine. However, one of the biggest benefits of this partitioning is that in PostgreSQL you can attach or detach a partition. Detaching a partition is just a single command, so we can say, "Okay, I don't need this data anymore, detach it." So even if, in total, through the whole lifespan of our application, we could have hundreds of thousands of partitions, we can detach them, and then the active set is much, much smaller. Then there was a question about, let's say, what if I query this way?
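For example, attach and detach are each a single statement:

```sql
-- Detach: the data stays in trips_2025_01, now a standalone table.
ALTER TABLE trips DETACH PARTITION trips_2025_01;

-- And it can be attached back later.
ALTER TABLE trips ATTACH PARTITION trips_2025_01
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
```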
00:19:18 Oskar Dudycz
So if we have a range, let's have a look at, let's say, up to September. It's also smart enough, because it picks January, February, March, April, then May, June, July, August, and September. However, my recommendation is to always try to access a specific partition, and to treat querying multiple partitions as a last resort, because that will be much, much slower. Let me leave some other questions about partitioning for later; I will try to answer them at the end. So great, thank you for sending them, as you see I'm watching. Let's have a look once again at this. It's a bit scary, at least for me; when I wrote it I thought that, yeah, it's quite scary to write a WHILE loop. So there is another option for how we can support partitioning, or an example of how we can deal with time-series data. And one of them is, let me show you, that Postgres has a rich ecosystem of plugins, which is one of the biggest benefits of PostgreSQL. Let me drop those partitions.
00:21:17 Oskar Dudycz
So one of the biggest benefits of Postgres is that it is extensible, it's a pluggable database. And these plugins are natively built into the engine, so they are not second-class citizens; they are built inside the engine, and that's one of the biggest benefits of using those extensions. So let's have a look at what is needed to install an extension. The first thing, let's have a look: see, we have extensions, and some extensions are already pre-installed, like PL/pgSQL; even the language, the SQL language, is an extension. So there are some default extensions and some extensions that we can install additionally. One of those extensions is called TimescaleDB. If we call CREATE EXTENSION, then, see, it completed; let's try to refresh the extensions. And as you can see, we have the TimescaleDB extension. Let me show you how many there are and how rich this ecosystem of extensions is.
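Enabling an extension, assuming its packages are already installed on the server, is a one-liner; the pg_available_extensions view shows what the server offers:

```sql
-- Enable an already-installed extension for the current database.
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- List what the server offers and what is already enabled.
SELECT name, default_version, installed_version
FROM pg_available_extensions
ORDER BY name;
```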
00:22:36 Oskar Dudycz
So I took a snapshot of the plugins available for databases, and as you can see, I took it some time ago, it's Postgres 11. Currently there is Postgres 17, and soon 18 will be released. And this is a screen from Azure. As you can see, with the default plugins that are pre-installed on the server, I only managed to get to the letter I. For AWS, I managed to get to D. And with Google Cloud, I managed to get to P. So it seems that Google Cloud has somewhat fewer extensions available by default, but it's still quite a lot. And if you are not using cloud infrastructure, that's fine: you can install your PostgreSQL either in Docker or as a service, and you can install those extensions additionally. And I cheated a bit, sorry for that: it's not enough to just call CREATE EXTENSION. The name is a bit misleading.
00:23:55 Oskar Dudycz
CREATE EXTENSION actually enables an extension, but we need to install it first. And let me show you something that can be, again, a bit scary, but something that you don't need to do if you're using the cloud, because then, as I said, you can just click during the installation, select the plugins that you want to use, and enable them. Internally, what the cloud is doing when you click is calling this CREATE EXTENSION. So we will be using two external extensions; this is a sneak peek of what's to come. PostGIS, as you see, is already enabled. We'll talk about PostGIS later, but I intentionally wanted to show what the installation could look like, because plugins add additional capabilities to our PostgreSQL server. It might be something that PostgreSQL doesn't even contain natively: a different type of storage, a different type of indexing, a different file layout, almost whatever you can imagine.
00:25:11 Oskar Dudycz
So here, if we are adding TimescaleDB, we need, for instance, to install some Linux packages that add those additional TimescaleDB capabilities. And what capabilities? You'll see in a moment. It's not even necessary to do it this way, because Timescale, for instance, provides a Docker image that contains TimescaleDB and also a range of pre-installed plugins. You can find other images, or you can set it up on your own. And yes, you can see you have the PostgreSQL image with PostGIS, et cetera. So it's not that scary. However, I just wanted to show you that what you are actually doing is customizing not only the database, but also the database server, to get additional capabilities.
00:26:13 Oskar Dudycz
So the question is whether plugins and extensions are the same thing: yes, actually, they are the same thing. And to show you one more, Supabase provided an extension for installing extensions. There is a dbdev extension that allows you to install extensions located in their extensions registry. It's not extremely popular yet, but you can have a package manager for those extensions; I'm sure that this will come soon. Going back to our example and how you could actually benefit from using those extensions: Timescale provides something that is hyper, or super, it's even called a hypertable, so we could call it that. And that's one example of what extensions give us. They can provide custom functions, custom data types, custom operators and aggregations, custom indexing methods, triggers, versioning, et cetera, so multiple things. What I'm doing here is calling the function that I got through TimescaleDB, and I'm saying that I want to make my trips table hyper. And what does it mean to make it hyper?
00:27:52 Oskar Dudycz
So hypertables: TimescaleDB is a plugin built for time-series data, so think of IoT, et cetera, so more or less like that. Hypertables bring you a really different storage, with fast ingestion and automatic chunking. It's not partitioning per se, because Timescale was created before there was native partitioning and they didn't migrate to it, but it's something like that: chunking, or partitioning, the data based on time. And as you know, in the IoT space this data can need to be ingested extremely quickly. So let's have a look at how this can help in our trips example. We can still use the hypertable as a regular table; I'm running the regular insert as I was doing in the other case, see, nothing fancy. And what I can do then is obviously call a basic report, which can tell me, hey, this is the average distance those drivers covered.
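A sketch of the two steps just described: making trips a hypertable via TimescaleDB's create_hypertable, and a basic per-driver report. The chunk interval and column names are assumptions:

```sql
-- Turn trips into a hypertable, chunked by trip_date.
SELECT create_hypertable('trips', 'trip_date',
                         chunk_time_interval => INTERVAL '1 month');

-- A basic report: average distance and fuel efficiency per driver.
SELECT driver_name,
       avg(distance_km)             AS avg_distance_km,
       avg(distance_km / fuel_used) AS fuel_efficiency
FROM trips
GROUP BY driver_name;
```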
00:29:18 Oskar Dudycz
And the magical fuel efficiency factor: the fuel efficiency factor is just an average of the distance in kilometers per fuel used. Of course, in the real world this would be much, much fancier; you would have all those fancy reports and fancy gauges for your CEO, et cetera. But you can see, and you know already, that such selects can get really, really non-performant, because they will need to query a lot of data. As I mentioned, the hypertable gives us quick ingestion capabilities: the data goes first to disk, and then, within some specific time, it is continuously chunked and indexed so the time data can be accessed efficiently. Still, even if our table is hyper, if we are doing some crazy reporting, crazier than what I'm doing with those averages, then obviously one way to speed up processing is to create a materialized view.
00:30:44 Oskar Dudycz
A materialized view is like a table that is read-only, and one of the benefits of using materialized views is that they don't do the calculations as you query the view, which is much better. They are fast because we store the result directly on disk and query already-aggregated data. However, what's not great about materialized views is that they are not refreshed automatically, and there's a reason for that: when we refresh a materialized view, we recompute the whole view, so we decide not to overload the database by refreshing it every millisecond or so. We need to know how often we would like the view refreshed, and with a traditional materialized view we would need some cron job to refresh it, or some manual refresh by a button or whatever.
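A minimal example of such a plain materialized view over the assumed schema; note the explicit REFRESH it requires:

```sql
-- A classic materialized view: precomputed, read-only, fast to query...
CREATE MATERIALIZED VIEW fuel_efficiency_report AS
SELECT vehicle_id,
       avg(distance_km / fuel_used) AS fuel_efficiency
FROM trips
GROUP BY vehicle_id;

-- ...but only as fresh as its last refresh, which recomputes everything.
REFRESH MATERIALIZED VIEW fuel_efficiency_report;
```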
00:32:02 Oskar Dudycz
And TimescaleDB, as I mentioned, in general, extensions, can give us additional capabilities, like, for instance, the so-called continuous aggregate. A continuous aggregate is continuously updated by TimescaleDB on its own; I will show you in a moment how that works. The other capability is time_bucket, and time bucketing is really important in reporting and in analyzing time-series data, because it groups the data by time. For instance, we can say, "Hey, I want to bucket all the data," so put all the data in the same bucket for one day, and within this one day make those aggregations, for instance, calculate those averages, et cetera, so we have those calculations per day. So if we created this view, I don't remember, yeah, I already did, then if we try to query such a materialized view, it will already have some data, see, aggregated based on the specific vehicle and the fuel efficiency.
00:33:37 Oskar Dudycz
However, if we insert new data, it wouldn't magically pop in; normally we would need to call refresh manually. But TimescaleDB gives us something like an add-continuous-aggregate policy. We can say that we would like to refresh this view continuously, starting from three days ago up until one second ago, and, let's say, refresh it every day. If I run it, see, this view will now be continuously refreshed every day. We could say every minute, every 27 seconds, whatever fits us better; that's always based on our business criteria. And of course we can still refresh explicitly, so click the refresh button explicitly. So that's quite cool, and that's also the reason I showed you this Dockerfile, because internally, I'm sure, TimescaleDB extends the server and uses a cron-like job that it sets up on its own based on the continuous aggregate policy we defined.
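A sketch of the continuous aggregate and the refresh policy just described, using TimescaleDB's time_bucket and add_continuous_aggregate_policy; view and column names are assumptions:

```sql
-- A continuous aggregate: time_bucket() groups trips into daily buckets,
-- and TimescaleDB keeps the view up to date on its own.
CREATE MATERIALIZED VIEW daily_fuel_efficiency
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', trip_date) AS bucket,
       vehicle_id,
       avg(distance_km / fuel_used)    AS fuel_efficiency
FROM trips
GROUP BY bucket, vehicle_id;

-- Refresh the window from 3 days ago until 1 second ago, once a day.
SELECT add_continuous_aggregate_policy('daily_fuel_efficiency',
    start_offset      => INTERVAL '3 days',
    end_offset        => INTERVAL '1 second',
    schedule_interval => INTERVAL '1 day');
```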
00:35:01 Oskar Dudycz
And for that you don't even need TimescaleDB: there is a specialized plugin, installed by default with most of the cloud providers, called pg_cron, which allows you to schedule some function to run continuously. So there are many options. Do not treat this webinar as a love letter to TimescaleDB; I'm just showing you how you can solve business cases and specific business scenarios with a much easier setup, without a whole extra infrastructure. I saw people setting up AWS Lambdas just to refresh a materialized view, really, I saw that multiple times. People are doing crazy stuff which, if done wisely, can be done much simpler. So reports are fine, but what we actually need is some alerting. We would like to know whether someone just went below the defined fuel efficiency, so maybe we would like to get a notification, or at least information on whether someone was cheating.
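With pg_cron, the same kind of scheduling could look like this (the schedule and job name are assumptions):

```sql
-- Enable pg_cron and refresh the report every 5 minutes.
CREATE EXTENSION IF NOT EXISTS pg_cron;

SELECT cron.schedule('refresh-fuel-report', '*/5 * * * *',
                     $$REFRESH MATERIALIZED VIEW fuel_efficiency_report$$);
```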
00:36:24 Oskar Dudycz
For instance, they went on a short trip but used a lot of fuel. So how can we do it? Let's define a regular table for those alerts. It will have the vehicle, when the alert was generated, and for which trip, so which trip was inefficient, to be able to capture that, plus the fuel efficiency that went below our magic factor. And the magic factor in our case equals five. Of course, that's dummy data, do not use five. I mean, maybe it makes sense, but probably not; usually there is some threshold that we know we don't want to go below. We are using our materialized view here: we are taking the fuel efficiency that we calculated in the materialized view, and we are getting those vehicles and those trips that are below our magic factor, and we are only calculating those that happened within the last 90 days.
00:37:40 Oskar Dudycz
So what we are doing here is inserting new alerts, or, if the alert already existed, updating the fuel efficiency, because maybe someone forgot to enter information, maybe someone amended the trip to enter the proper length or the proper fuel they used. So we want to update, and we also want to delete those alerts that are not applicable anymore, so those that are actually fine, above our magic factor. It's a relatively simple function, not much to see here besides what I told you. Now let's try to use another function that we got from TimescaleDB, though we could get it from other plugins too. This job says, "Hey, I would like to call this function to update my alerts every second." It probably doesn't make sense to do it every second, but we will need it for later.
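A hedged sketch of the alerting flow just described: the alerts table, the upsert-and-cleanup function, and a TimescaleDB job calling it. All names, and the threshold of five, are assumptions; TimescaleDB jobs expect a (job_id, config) signature, hence the small wrapper procedure:

```sql
-- Alerts table: which vehicle fell below the threshold, and when.
CREATE TABLE fuel_efficiency_alerts (
    vehicle_id      INT           PRIMARY KEY,
    generated_at    TIMESTAMPTZ   NOT NULL DEFAULT now(),
    fuel_efficiency NUMERIC(10,2) NOT NULL
);

-- Upsert alerts for the last 90 days; drop ones no longer applicable.
CREATE OR REPLACE FUNCTION update_fuel_efficiency_alerts()
RETURNS void LANGUAGE sql AS $$
    INSERT INTO fuel_efficiency_alerts (vehicle_id, fuel_efficiency)
    SELECT vehicle_id, avg(fuel_efficiency)
    FROM daily_fuel_efficiency
    WHERE bucket > now() - INTERVAL '90 days'
    GROUP BY vehicle_id
    HAVING avg(fuel_efficiency) < 5  -- the "magic factor"
    ON CONFLICT (vehicle_id)
    DO UPDATE SET fuel_efficiency = EXCLUDED.fuel_efficiency,
                  generated_at    = now();

    DELETE FROM fuel_efficiency_alerts a
    WHERE a.vehicle_id NOT IN (
        SELECT vehicle_id
        FROM daily_fuel_efficiency
        WHERE bucket > now() - INTERVAL '90 days'
        GROUP BY vehicle_id
        HAVING avg(fuel_efficiency) < 5);
$$;

-- TimescaleDB jobs call a (job_id, config) procedure, hence the wrapper.
CREATE OR REPLACE PROCEDURE run_fuel_alerts(job_id INT, config JSONB)
LANGUAGE plpgsql AS $$
BEGIN
    PERFORM update_fuel_efficiency_alerts();
END $$;

SELECT add_job('run_fuel_alerts', '1 second');
```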
00:39:00 Oskar Dudycz
So now, if we try to get the data from our materialized view and from alerts, see, those two trips are below our magic factor, and, hey, they are also visible here. So it worked: the view was automatically refreshed, and this function was automatically called. And of course, I'm not saying that everything should happen on the database server; we should be careful with that. But if we are running such data transformation, data manipulation, and data interpretation, then why would we pull this data outside? We can spare some network traffic, and in regular code it would be much more complex. There is a question about how to debug that, and that's actually the tricky part.
00:40:03 Oskar Dudycz
There are some tools to debug it, but we should definitely be careful, because we've been there: years ago, I remember, I'm old enough, I told you that I'm mature enough, all the business logic was in the database. Do not go that way. Put only the logic that you need for data transformation and data manipulation here, not business logic. Of course, there is a really thin line there, but please be careful, because those database tools are obviously not great for debugging. Sorry. So now let's add an additional extension to show you the full richness of this ecosystem. PostGIS allows us to track spatial data, or, talking in plain language, GPS tracking, so geolocation.
00:41:09 Oskar Dudycz
For example, here we are using a different column type, which is geometry. Into a geometry we can put a trip just as you would draw it in your GPS, like Google Maps, and 4326 is another magic number, which is the type of the coordinate system. For regular meters and regular geolocation you need to use 4326, trust me. Yeah, it's weird, but it's like that. So it means that the column holds a line representing our trip. I just added this, and if we select from trips now, we'll see that at the end we have a route, and the route is null for now. But if we have a look at our data, hey, we already have the start and end location.
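The column addition described here, with SRID 4326 (WGS 84, the standard GPS latitude/longitude coordinate system):

```sql
CREATE EXTENSION IF NOT EXISTS postgis;

-- The trip's route as a line in SRID 4326 (WGS 84 GPS coordinates).
ALTER TABLE trips ADD COLUMN route geometry(LineString, 4326);
```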
00:42:16 Oskar Dudycz
What's more, we already have the distance in kilometers here, and we already have the fuel used, so maybe we can benefit from that and fill in this route information based on the data we have. Of course it won't be perfect, because we probably didn't travel in a straight line, but that's what we are doing here. I want to show you that we also get some additional functions that can help us. So what we are saying is that we are creating a transformation: think of the line for our trip, and we are making a line between our start location and end location, and both of them will be points.
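A sketch of that backfill as a straight line between the two points. The demo's exact source columns differ; the start_lon/start_lat/end_lon/end_lat columns here are hypothetical stand-ins:

```sql
-- Backfill each route as a straight line between two points.
-- start_lon/start_lat/end_lon/end_lat are hypothetical stand-in columns.
UPDATE trips
SET route = ST_MakeLine(
        ST_SetSRID(ST_MakePoint(start_lon, start_lat), 4326),
        ST_SetSRID(ST_MakePoint(end_lon, end_lat), 4326))
WHERE route IS NULL;
```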
00:43:08 Oskar Dudycz
We are using the functions that we got from the PostGIS extension we just installed, and after we run it, all of our data will be filled in, and, if everything went smoothly, we won't have a null route column here anymore. So let's have a look at whether that was applied, and see, it is already visible. And someone mentioned that I'm using DataGrip, and yes, I'm using DataGrip. It is nice because it's capable of showing me what the PostGIS data looks like; in older versions or in other IDEs you might not see that, you might just see a blob. So cool.
00:44:13 Oskar Dudycz
Using good tooling is always nice. So now, see, we can query it, we can get this information. Let's now go further: what we are going to do is first drop the materialized view, because we will be changing the trip structure. We will drop the kilometers column and make it a calculated column, like computed properties in .NET, et cetera. See, we are saying that this column will always be generated based on our route column: the distance will be the length of our route, this is a function from PostGIS, and we will display it in kilometers. Then our start location will be the first point of our route, and the end location will be the last point of our route. Cool. So let's try it, and if we have a look at our trips again, see, we replaced it, and now the start location is a point, the route is a route, the end location is also a point, and the distance is calculated based on our route length.
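A sketch of those generated columns under the assumed schema (ST_Length over geography returns meters, hence the division):

```sql
-- Replace the stored distance with a column derived from the route.
ALTER TABLE trips DROP COLUMN distance_km;
ALTER TABLE trips
    ADD COLUMN distance_km NUMERIC GENERATED ALWAYS AS
        (ST_Length(route::geography) / 1000.0) STORED,  -- meters -> km
    ADD COLUMN start_point geometry(Point, 4326) GENERATED ALWAYS AS
        (ST_StartPoint(route)) STORED,
    ADD COLUMN end_point geometry(Point, 4326) GENERATED ALWAYS AS
        (ST_EndPoint(route)) STORED;
```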
00:45:40 Oskar Dudycz
Thanks, Lucas; in the chat we have a full explanation of 4326. So now we need to refresh our materialized view, and refresh it just in case. Cool, and let's have a look again at whether we have data in our efficiency alerts. Yeah, we have some. Cool, so that works. Again, what I wanted to show you here is that if our application, or part of it, is based on data manipulation, data interpretation, data transformation, then maybe we don't need to write everything in regular C#, Java, Node.js, whatever. However, if we really want to, that's also perfectly fine, because nowadays most of the database providers support everything that I did. I will show you later, and send you in the materials, the link to the example .NET application that I have. See, I'm just using the default Postgres provider, and I'm calling all of the SQL that I showed during the recording, and everything works the same way, so you can do precisely the same in your regular application.
00:47:25 Oskar Dudycz
So, just in case, let me kill the containers, because we are heading to the grand finale of our presentation. Let me clean up all the data. Let me start my server again. Cool. Postgres is running, it's created, see, it's pre-installing extensions, and yeah, regular logs. So let's try to run what I just added. Let's do dotnet run. Fingers crossed. dotnet restore. Yeah, see, everything happened the same way. The fancy CLI formatting, as you see, wasn't perfect, but all that you saw previously is visible now. Cool. So let's go to the grand finale, and the grand finale will also be related to NServiceBus a bit. In the grand finale, let's start with a question for you. Let me use my terrible drawing skills again. What is that?
00:49:10 Oskar Dudycz
So I demand answers. What do you think? What is that? A DB. Yes, close enough. Biscuits, staircase, table, tuples, ladder, log, queue. Yeah, let's go with the log and with the queue, because that's what will actually be the leitmotif of our grand finale. All databases have such a log; in Postgres this log is called the WAL, and it's not WAL like Wally, it's the write-ahead log; in MSSQL it's called the transaction log. In a database, all the inserts, all the updates, all the deletes go first to the transaction log, and each of these entries has its own transaction identifier.
00:50:21 Oskar Dudycz
These entries are, let's say, transactions two, three, four. And then, when one of the transactions, for instance transaction one, is committed, then in regular databases, not PostgreSQL, it means that we are telling the database, "Hey, my transaction was committed, so I'm letting you delete those entries from the transaction log." This write-ahead, or transaction, log is built to enable full recovery: before those changes, the inserts, deletes, updates, are applied to the tables, they go to this physical location on disk which is called the transaction or write-ahead log. And once the transaction is committed, so the cursor marking the location in the log is moved, the database knows: hey, I just committed those records, I can free them up. However, the cool thing about Postgres is that it allows us to say: hey, for a specific set of my tables, let's say our alerts table, I want to keep those entries forever, or until I tell you explicitly that you can delete them.
00:51:54 Oskar Dudycz
And based on that, Postgres has something called logical replication, logical in the sense that it can tell us: hey, I just did an insert, I just did an update, I just did a delete. Compare that to traditional replication: traditional replication just moves bytes from one place to another, while logical replication tells us, "Hey, I just did those inserts to this table, and I'm notifying you about that." In Postgres we can set up a publication, and a publication means that we are selecting the set of tables for which we would like to publish notifications, and each of those publications can have something called a replication slot. You can think about the publication as a broker, and the slot as the consumer side, the entry point where consumers, so our applications, connect. So we could have a regular cloud service running somewhere that would be notified by push notifications from PostgreSQL. And that's precisely what we will be doing in the grand finale.
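On the SQL side, the publication-and-slot setup just described boils down to two statements (names are assumptions; the server must run with wal_level = logical):

```sql
-- A publication selects the tables whose changes are published...
CREATE PUBLICATION alerts_pub FOR TABLE fuel_efficiency_alerts;

-- ...and a slot tracks one consumer's position in the WAL.
SELECT pg_create_logical_replication_slot('alerts_slot', 'pgoutput');

-- With the test_decoding output plugin instead of pgoutput, changes
-- can even be peeked at from plain SQL:
-- SELECT * FROM pg_logical_slot_get_changes('alerts_slot', NULL, NULL);
```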
00:53:36 Oskar Dudycz
So let's have a look at the code. I won't go into extreme detail, but I've built an API which is using SignalR. SignalR, for people outside of the .NET space, is like a WebSocket. So I'm saying, hey, let me find this SignalR, yeah, I'm saying, "Hey, I would like to publish a notification through SignalR to the web application." I also built a dummy React application. Sorry, I won't be showing you that code, I generated it with ChatGPT, so there's nothing to see, but you will see the view. To prove to you that there is some web application, let's try to run npm run start. Fingers crossed that it works. Okay, starting development server. Cool, sounds like it's working. Let me also start the web API, and I will show you later what's happening there. dotnet run. Now, I would like to see, because, fingers crossed.
00:54:58 Oskar Dudycz
Let's see. So let's start our API, and as it's starting, let me show you what I'm doing besides that. I'm adding a hosted service, which is a background worker, which is subscribing, as you see, to PostgreSQL through the connection string. And subscribe, what does it do? It says, "Hey, I would like to subscribe to Postgres with those subscription options." So I would like to subscribe to this publication through the slot that I'm passing here. I'm also saying, hey, I would like to create the publication for this table if it doesn't exist, and I would like to always recreate it. Do not use that in production, but for a demo, always recreate is good. Essentially, I'm just using the default .NET PostgreSQL provider; I'm opening a connection, and this has to be stateful: this connection needs to stay open to get notifications.
00:56:14 Oskar Dudycz
Then I'm creating the publication if it doesn't exist, and I'm setting up the replication slot if it doesn't exist, fine. After that, I get an AsyncEnumerable; if you're from Java, think of a Java stream, in Node.js, a Node.js stream. And I'm doing some mapping from the messages here, it doesn't matter how, I can show you in the Q&A if you'd like. I'm just calling, yeah, start replication, so: give me the information. And what I'm doing is: hey, I'm getting this message, and I'm forwarding this information to my SignalR subscribers, so I'm publishing to the web. So let's have a look at whether the web API is working. Cool, it's working. So now, fingers crossed, this is my fancy fleet management app.
00:57:26 Oskar Dudycz
Let me make it a bit smaller. Where is my IDE? Yeah, luckily it's not closed, so let me do it this way. Let's make the final preparations; I would like you to see this notification. So, the grand finale. What happens is: we will insert a record into the trips table, then it will be aggregated in the materialized view, then, through the function automatically run as a cron job, it will be passed to the alerts table, and then the alert will be pushed to the web API through logical replication and forwarded through the WebSocket to the web application. Ready? Fingers crossed, this is the scariest moment of this presentation. Thanks, Daniel, for the drum roll. And let me find my magic insert. Believe me, I'm adding a wrong trip.
00:58:52 Oskar Dudycz
Woo, it's working. So you were not only drum-rolled, and yeah, it works. Again, that's one of the nicest things that you can get from PostgreSQL, and one of the most unique. However, as always, it's not a silver bullet, and as we are here at Particular, let me try to cover that a bit. Daniel can tell me if I'm wrong or right, but why is an NServiceBus transport not using this logical replication, at least yet? Push notifications are nice if we need real-time tracking information, like this alert, and if each alert or each notification is separate and we would like to process it separately. But let's say that this is a table like our trips table, and maybe we would like to publish a notification for, or process, each trip.
01:00:08 Oskar Dudycz
So if each trip needed to be pushed somewhere else, replicated to some other storage, et cetera, then we would need to process each record one by one on the other side. Here we have our queue, and we need to process it record by record. Each trip is pushed through the queue to some web service on the other side. If each operation then needs to do, for instance, a database call, or call some external service through a webhook, then, as you know, each call to an external system, let's say a database, adds additional lag, for instance at least 50 milliseconds to open a connection, et cetera. If you sum up those 50 milliseconds, and usually it's even more, across 10 operations, then just from this additional lag you get half a second of delay.
01:01:22 Oskar Dudycz
So it's a common misconception that push notifications are always faster. They are faster at delivering information, but they are not always faster to process, because if you batch those records, you can cut that additional delay: if you process all of that information in one batch, in one call to the database, then instead of paying the latency repeatedly, you have just one 50-millisecond call. Plus, logical replication is also great because it remembers, it keeps the last processed offset. So if your service dies, the replication slot remembers what the last processed offset was. However, with push notifications, they can overwhelm the subscriber, and it's also a bit harder to control consistency, resiliency, et cetera. And that's all the stuff that I'm sure you get from NServiceBus or similar tooling. So what I'm trying to say is that we should not just run immediately to use this logical replication.
01:02:42 Oskar Dudycz
But it's a really great feature to have built in, and to know that it exists. And there is already a question about how logical replication relates to change data capture. Precisely: this is how change data capture is implemented internally in Postgres, and in all tools like Debezium, so the plugin that forwards messages to Kafka, et cetera. That's precisely how it works. So again, I wanted to clarify that not using logical replication is not a mistake. It depends on your specific scenario: whether you need this information immediately, or whether each operation will actually be processed separately. If you can batch, so much the better; and if you have a constant load, so you're getting the same number of records all the time, then batching is much, much more efficient. That's something to keep in mind. So let me just say a few final words of summary.
01:04:02 Oskar Dudycz
Besides what I showed you, there is stuff like the Citus plugin. Citus allows you to run a multi-node setup of your Postgres, and actually Azure Cosmos DB for PostgreSQL has nothing in common with Cosmos DB besides the name, because it's really PostgreSQL with the Citus plugin. There is also tooling like Amazon's Aurora PostgreSQL, where they took the core engine and switched the storage provider to allow moving the data to, for instance, S3, or to have different scaling capabilities. We also have tools like Neon and Supabase that offer serverless Postgres. So there are multiple things you can do with PostgreSQL, and that's precisely what I wanted to show you during this webinar: PostgreSQL is, of course, a free database.
01:05:11 Oskar Dudycz
You don't need to pay license costs. If you're using a cloud provider, you need to pay the cloud provider for hosting it; you can also host it on your own, and you saw how to configure it. But PostgreSQL is much more: a rich plugin ecosystem, rich tooling, and rich built-in native capabilities make PostgreSQL an exceptional database. So it's not only rock solid, but in my opinion also exceptional. That's all from me, thank you. I took a bit longer, but I think it was planned for an hour and a half, so we still have some space for Q&A, right?
01:05:55 Daniel Marbach
Yes, we have a few more minutes for questions.
01:06:00 Oskar Dudycz
Yeah, so let me start with the questions that I got from the people who sent them upfront, before the webinar. Here you have links to everything that I showed you, my blog article, and all the code that you saw. But yeah, let's go to the questions. One of the questions I got is from Radek. Radek asked: what do I do when the execution planner suddenly changes the plan with a worse outcome, which often happens in production? Data volumes and statistics change over time; we often realize it based on metrics, but at the same time we can't do much about it unless we rewrite the query and redeploy, and the usual ANALYZE to refresh statistics doesn't work. So I hope that you remember the EXPLAIN that I used to show you where the data goes when you are calling SELECT * FROM trips: based on the data, it goes here.
01:07:21 Oskar Dudycz
So EXPLAIN ANALYZE is the common tool to find out whether our indexes will be applied, how our query will be run, and so on and so forth. The most common case where PostgreSQL goes astray is when a lot of data is added at once and this data is much different from the data we were querying before. And we need to remember that the PostgreSQL query planner is a statistical tool. It takes into consideration the queries that were run, but mostly it's just based on statistical analysis of the data. So if we are adding, for instance, a new set of data for the new month, and then we start querying it, this new data will obviously be much different from what we had throughout the year. And if we have a single table, then, when we run EXPLAIN, the planner will, with some sampling algorithm, try to find patterns in the data to decide how to query it.
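For reference, the pairing just described: EXPLAIN ANALYZE to compare the planner's estimates with reality, and ANALYZE to refresh the statistics those estimates are based on:

```sql
-- Compare the planner's row estimates with what actually happened.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM trips WHERE vehicle_id = 42;

-- Recompute the statistics the planner bases those estimates on.
ANALYZE trips;
```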
01:08:36 Oskar Dudycz
The other question from Radek was what common tools could facilitate that. Of course, there are tools, plugins like pganalyze, that can record the queries we are running and use some machine learning to suggest which indexes we should add, or which queries were the slowest. But there are no, at least as far as I know, popular and production-ready tools that would do it for us automatically. So my recommendation is always to analyze the data and try to group and partition our database based on its characteristics and usage patterns, because if, for instance, one partition has much different data, and it is the biggest chunk of the data even within the specific table, then PostgreSQL will more likely optimize for this partition.
01:09:41 Oskar Dudycz
But if we are not querying the data for this specific subset anymore, then obviously our query planner will be off. We can do some tricks, like explicitly refreshing plans, re-running ANALYZE, et cetera, but then it'll soon be broken again. So one of the solutions may be to use the partitioning that I showed you, because then, as you saw, it will query the specific partition and do the analysis based on that specific partition instead of the whole data set. So partitioning can also help with that. That's obviously not the only solution we have, but it's one of the possible ones.
01:10:23 Daniel Marbach
I think we have time for one very quick question, Oskar, the last one.
01:10:28 Oskar Dudycz
Yeah, so there is a question from Muhammad: what happens when there are multiple subscribers to the notifications? You can have multiple slots for a publication, as I showed you here, and each of them will track its own offset, or position in the log, separately. However, if you have multiple subscribers to the same slot, then it's like having multiple subscribers to the same RabbitMQ queue: you don't have any ordering guarantee, and you can run into the other nasty issues that all the people who are or were using RabbitMQ know about.
01:11:20 Daniel Marbach
Thank you.
01:11:20 Oskar Dudycz
And probably that's why you want to use NServiceBus, right, Daniel?
01:11:28 Daniel Marbach
Well, that sounds like another great webinar topic, but I think we are out of time for today. So yeah, thank you very much. I wanted to give a quick shout-out: some of our colleagues will be speaking next month at the Global Architecture Summit in Barcelona, and you'll also see some of the Particular folks at Techorama in the Netherlands in October. If you're interested in seeing some of us, you can go to particular.net/events and find us at conferences near you. And yeah, that's all we have time for today. So on behalf of Particular and also Oskar, I'm saying goodbye for now. See you at the next Particular live webinar, and have a good evening, morning, or whatever time of day it is for you.
01:12:19 Oskar Dudycz
Yeah, thanks. Thanks, bye.

About Oskar Dudycz

Developer, technical team leader, and architect, Oskar started his career before StackOverflow existed. For nearly 20 years, he has been creating systems close to the business process and believes that Event-Driven Architecture and Event Sourcing are great ways to achieve that. Oskar is a passionate open-source developer and is especially active in the Event-Driven space (he has worked on tools like Marten, EventStoreDB, and now Emmett).

Additional resources