The 10 Question Framework for High Performance, Secure IBM i APIs
High Performance, Resilient APIs for Your IBM i Using Kafka
Efficient SQL depends on a good indexing strategy. Paul covers the basics of indexing, the different types of indexes, the index advisor and creating indexes that are not advised.
High Performance, Resilient APIs for Your IBM i Using Kafka
If you are looking to API enable your IBM i to share data and functions across your enterprise or with your customers and partners, you need to look into the performance and high reliability you can achieve by using Kafka. Apache Kafka is a high-performance, resilient, open source-based, event-driven, pub-sub messaging application for providing loosely coupled connections between a variety of message producers and consumers. If that sentence seems like a whole bunch of confusing jargon, come to this Summit Lunch & Learn session! In the presentation, I’ll introduce the basics of Kafka, its supporting technology, why it can be so valuable, and how to get started with Kafka-based APIs on the IBM i.
Okay, everyone I think it’s time for us to get started. So, welcome to this, the third of our lunch and learn sessions in this series. I’m delighted to be joined today by Dan Magid from Eradani, who will be telling us today all about really high performance resilient APIs for your IBM i using Kafka. And I have to go catch my breath. But before we do that, I will be talking to a little bit about indexing on IBM i. So, I will be doing that for about 15 minutes at the start. And then I’m going to hand over to Dan and Aaron. And then I will do another sort of 15 minutes on indexing towards the end of the session. A couple of little things just on practicalities here. If you have any questions, will you please put them in the Q&A, as opposed to the chat. So, we will be keeping an eye on the Q&A. So, the chat is just there for chat. And by the way, everybody who’s been saying hello, hello back to you. So, one of the nice things you can hear me. Okay, so please put them into the Q&A. We will be keeping an eye on them. And your question may well be answered directly in the Q&A. Or if I spot it, might just answer it in passing. Okay, so without further ado, let me move on to talking about indexing on IBM i. Now, indexing on IBM i can be a very complex topic. So, since I have about 30 minutes to talk about it, this is just very much an introduction to it. And hopefully to give you a highlight of things that you may need to go and check at a later stage. So, to begin with, let’s just talk about what indexing is primarily for. And the main reason that we have indexing is for use by the SQE optimizer. So, any SQL that runs on the system, and here we’re talking, I think, primarily about select statements, though it does apply to just about all SQL. The SQE optimizer is what handles that, or the SQE engine, if you want, is what handles all of that on the system. So, regardless of where a request comes from, whether it be in within run SQL scripts, embedded SQL, through PHP, from Java running on some other platform, or in a portal, or whatever, it makes absolutely no difference. It all goes to the same place. So, the main reason that we create indexes is for use by the optimizer, is to give the optimizer choice. So, what the optimizer uses is access paths. And so, access paths come into existence in a number of ways. If you define key constraints, whether they be primary key, unique key, or foreign key constraints on tables, they result in an access path. If you have keyed physical files or keyed logical files, they result in access paths. And, of course, then we have indexes, which is where you can really, really create access paths. In other words, we can do things in access paths with indexes that go beyond what we can do in key constraints or physical or logical files. So, indexes on IBM i. There are two types of index. The first of these is binary radix. Now, a binary radix is what is the traditional indexing system that has always been on the system. So, it has always been a form of binary radix that has been in our database since the system 38 in 1979. Okay. And this is what is known as a hybrid tree structure, the way that it works. Now, binary radix, and the hint here, by the way, is in the word binary, means that it is a very, very fast indexing system that’s going on. The way most of us visualize an index is that we think that the index is, well, it’s a key, you know, so it would be like an 80-character key, say, and some sort of a pointer to where that row is in the table. And the index is kept in order by that key. Now, this is a good way to visualize it, but it bears very little resemblance to what’s actually happening underneath the covers. Okay. So, for example, let’s say I had a table with a million rows and an 80-character key. And let’s say for 500,000 of those rows, the first 80 characters of the key is the exact same. And for the other 500,000 rows in the table, the first 50 characters of the key, or sorry, first 40 characters of the key is the exact same. Okay. Now, this isn’t sort of a million times 40 that’s kept in the index for those. It’s actually a bit test. So, the system sort of, you know, what’s in there is logic in the index that says, you know, if the fourth bit of the third byte is on, it’s one of these 500,000. If it’s off, it’s one of these 500,000. So, this is why it’s pretty much impossible to calculate what the size of a binary radix index is going to be, or a binary radix access path, because it depends on what the actual values are as to how complex the bit comparisons are going to be. The second type that we have is an EVI, an encoded vector index. And this is a variation of bitmap indexing. And this method, this implementation of bitmap indexing is unique to IBM i. You get it nowhere else. And the main place that we use these EVIs is when we have low cardinality in a table. Now, low cardinality means that you have a column, and in that column, that regardless of how many rows you have, there are very few values that you will have in there. So, an excellent example of this would be something like a state code in the U.S. So, think of this, if I’m Amazon, and all the millions of customers that I have in the U.S., with all of their addresses, there are only 50 possible values that I can have for state. Lots of rows, but low cardinality. Low sequence of values.
Okay, and we’ll be talking about where we use all of these in a second, where we use EVIs on that. Okay, so when we talk about indexes, the sort of first type of index that we use is just simply the one that we’re familiar with, which is column value slash sequence. So, you know, it’s employee ID. So, it’s really the order by. It’s the sequence that we’re specifying in the index. But we also have derived key indexes, and the derived key index is where the value that’s in the index is derived in some way. So, it could be an expression. So, I could say, well, the key is the value of the salary plus the commission plus the bonus. That’s what the index value is. Or it could be the result of a scalar function, and we will see this a little bit later on. Okay. So, Leisa, I will be dropping the link for the handouts and that into the chat and that a little bit later on. Okay. Okay, we also then have what are known as sparse indexes. So, a sparse index is one that reduces the number of rows in the index with a where clause. And again, we will see an example of this later on. And then we can have indexes that are for index only access. So, this would be where if I have a select statement, the optimizer can actually get all of the information it requires directly from the index without ever going near the rows in the table. Okay. Okay, so a question there from Shahad. How does index differ from logical files and what advantage does it bring compared to logical files or other access pass creation methods? Okay, we’re going to see all of these, Shahad, hopefully when we go through. But I’ve just mentioned some of them here. So, for example, a drive key index, it’s impossible to do that in a logical file. Okay. Where clauses, yeah, you can do that sort of logic with select logic and logical files. But with an index, I’ve got a complete where clause, including all of the scalar functions in there. So, you know, I mean, a logical file compared to an index, I would describe a logical file as a child’s three-wheeled bicycle, and I would describe an index as being an 18-wheel truck. Okay, in terms of what you can define and do with them. Okay. Now, an important point, though, to note about indexes is they’re not just for a sequence and an order by. Okay. And they are not unique keys. This is something we were very used to in logicals, that nearly everything we defined was a unique key. This is not the case with indexes, and you shouldn’t think of them like that. Okay. So, they are not just for sequence. And again, we’ll talk about that a little bit later on. So, index uses, then. We do use them for native I.O. So, if I define an index, I can put that index in a declare file in an RPG program and use it, you know, exactly as if it was a, you know, a logical file per se. So, I can use it to do direct native processing in RPG. Okay. But the main use of indexes is used to identify and process rows in a table. So, this is when the optimizer goes to optimize a select statement. It uses it for selection. So, it will look for indexes to help it with the where clause. It will use indexes for joining, if you’ve got join logic in your select statement. It will use them for grouping, and then it will use them for ordAarong. So, not just for ordAarong, not just for the sequence, but also for the where clause, the group, you know, the having clauses, joining and grouping. They’re also used for statistical information. And we’re going to see this a little bit later on when we’re going to see an index that’s actually advised, but it’s advised only for statistical information for analysis. Okay. We’ve got two main strategies that we follow then when it comes to indexing, proactive and reactive. And usually it’s a combination of the two. So, proactive is when you design your database, you decide I’m going to put as many indexes in place so that we can hit the ground running when people start to hit the database with SQL. Now, a very good place to look for this and, you know, what index I should be building is to have a look at the views that you’re defining. Okay, because you should have lots and lots and lots of views. And in there in the views, when you look at the sort of where logic and everything like that, it gives you a good idea of what to do. Reactive is when the system is up and running. And it’s not, okay, so the extreme of this is when somebody runs an SQL statement and the lights in the building dim. So, you go and you say, maybe I should create indexes if we’re going to run that SQL statement again. That would be the extreme of it. But it would be that as new views and that are created, you should be looking at creating those indexes, okay? So, you will always be doing reactive indexing in there. Just how much you do depends on how proactive you are at the start. Now, the nice thing about this is that as you do this, like as you run SQL on the system, the optimizer does recommend indexes to you. It does say, by the way, if I’d had this index here, I would have been able to do this quicker. And you can now have a decision, you know, whether or not you create those indexes, okay? All of that information, by the way, is in one of the system catalogs. But the way most of us access it, the index advisor, is through ACS. Now, you can do this at a system level, at a schema slash library level, or if you want, for an individual table, okay? If you just right-click on any of those, you will see the option for index advisor. And in you go for advised indexes. Now, by the way, this is not the default layout for it. This is my customized layout for it. And these are all of the columns that you have. So, for a table, for this table called for index, which we’ll be looking at in the second half, it’s saying, you know, here’s what was advised and here’s why it was advised. The main sort of columns you’re interested in are the advise key, when was the first and last time that the index was advised, how many times was it advised, okay? The longest and average execution times there were when people were running queries for when that index was advised, how long it thinks it would take to create the index, okay? How many rows were in the table when it last recommended it? You know, so you’ve got to, I mean, this is the thing you would bear in mind if, you know, it recommended this index, but you would actually just purge the table and to need it 90% of the rows, yeah, maybe you don’t need it anymore, okay? And then why was it advised? Was it for row selection or was it for ordAarong or grouping that it recommended? Now, there’s this lovely option that they added to this, which is called condensed advised indexes, which are where they will take multiple of these advised indexes and sort of just push them into one. So, if one index is a subset of the other, it just shows you one advised index as opposed to two, okay, which is kind of nice. This just shows you all the different selections you can do within the index advisor, okay, for what columns you want to see, the sequence you want them in, and then these two sets of option here, the general and the date and time ones, if you want to subset the rows. So, I only want to see what’s been advised in the last week, for example. You can also get to this from the plan cache. So, if the SQL statement is still actually in the plan cache, you can see advised index information in there as well for something that is sort of still semi-active. There are services available to help you with indexes. So, these ones here, act on index advice, remove indexes, and reset index statistics, okay. So, that was just a general overview of what indexing is, the type of things. What we’re going to see in the second section when we come to it in just a few, in sort of in about 40 minutes or so, is some of creating some indexes that are not advised and looking at things like derive key indexes and indexes only for that and seeing the impact that they have. Okay. So, I am going to stop sharing my screen now, and Dan, I will hand over to you. Great.
Thanks, Paul. I appreciate it. Can you hear me okay? Yep. I can hear you perfectly, sir. Great. And I’m going to go ahead and share my screen here, and I assume you can see the screen. Yes, I can see it absolutely fine. Yep. Okay, terrific. Great. Okay. So, I love the combination of these two things, you know, because Paul is talking a lot about sort of the traditional things you do on the IVMI and for managing data, and really what I’m going to be talking about is just another option, another way to handle data using some of the modern technology available in this thing called Kafka. So, and I’m going to start out with the basics. So, we’re going to look at, you know, why would you want, why do you care? You know, what do you care about Kafka? Why is it important? And then I’ll talk about the architecture of Kafka and some of the terminology, so you’ll be familiar with that. I’ll go through some specific IVMI use cases, so some real-world use cases of Kafka with the IVMI, and then Aaron is going to do a quick demo of a sample Kafka application for the IVMI. And if you’ve ever sat in on my presentations, you know I’m going to cover a whole lot of territory very, very fast. So, we will be, you’ll get links to the recording so you can review later, and we will also provide you with the handouts. So, basically, we’re talking to many, many IVMI customers today, and it’s more and more every day, who want to be able to tie their IVMI into the wider IT ecosystem, whether it’s with other platforms in their own enterprises, or if they’re trying to connect with their partners and their customers, they want to be able to talk to those things either from their IVMI or give those other outside users access to their IVMI resources. So, they’re trying to create these connections and create some kind of middleware that allows them to talk in and out, and there are different options for how to do this. The most popular ones we see out there are REST and then this new growing environment of Kafka. So, just a real quick kind of differences between REST and Kafka is REST uses things you’re probably familiar with. So, it’s using all the HTTP methods, get, put, patch, delete. So, you’re creating APIs that allow you to access things using those standard REST or those standard HTTP methods. It’s relatively easy to set up and use. It has very low latency for individual small messages that you might be sending in, and you can connect it in a variety of ways to the IVMI. The problem with REST is when you start getting up into higher volumes of API calls or large message payloads, it starts to break down, it starts to slow down, and it sometimes can’t keep up with the performance requirements you might have. And that’s where Kafka comes in. Kafka is really designed for very, very high-speed processing for what are called message streams. That is just data that is just constantly coming in. So, it’s not that you’ve got a transaction batch of things you need to process. It’s the data just keeps coming in. And so, Kafka is designed to deal with those kinds of message streams, and it’s designed to give you real-time processing. So, as the messages come in, you can do things right away. So, you get very, very real-time data, and customers are replacing a lot of their batch processes and their extract, transform, and load processes for analytics. They’re replacing with Kafka messaging, so they get that really real-time information. Kafka can handle very large data loads, so you can have big binary files in a Kafka message. And we’re going to talk a little bit about this, but it has built-in resilience so that it stays up and running all the time, and it uses a Pub-Sub event-driven architecture, and I’ll talk about that as well. And it’s more complex to set up and use than a REST environment. This is from the Apache Kafka Foundation. It’s the definition. Apache Kafka is a high-performance, resilient, open-source-based, event-driven streaming Pub-Sub messaging application for providing loosely coupled connections between a variety of message producers and consumers. And there’s a huge amount of jargon in there, and by the end of this presentation, you’ll know what all of that means if you don’t already. Just a little bit about Kafka use. It’s being downloaded all the time. There are over a billion downloads from the open-source found from the Kafka Apache Association. It’s used by more than 80% of the Fortune 500. The companies that are using it are expanding their use, so there’s been 300% growth in the companies that are making large-scale use of Kafka clusters. LinkedIn is processing 7 trillion messages a day using their Kafka environment, and it is facilitating this move to real-time stream processing rather than batch processing operations. So let’s talk about some of the terminology. So first thing is Kafka is an event-driven architecture. Now, event-driven architecture has been around for a very long time. When I was a new IBM i employee long, long, long time ago in Las Vegas, Nevada, I had an opportunity to sell a system to one of the big casinos who wanted something to automate their race and sports book, the big boards in their casino where you could bet on games. And the problem is they were adjusting the point spreads and the odds as people were betting because their whole objective is simply to balance how much is being bet on each side. They want it to be exactly equal because they don’t care who wins. They make a fee on every transaction. So every time somebody would bet, they were calculating what should the odds be or what should the point spread be. So they had to have this event-driven architecture. So I had the opportunity to sell them either a System 38 or this thing called a Series 1 that IBM used to have. And it turned out that the Series 1 was the right option because it was an event-driven system. It had a language called event-driven language, and its command language was called EDX, the event-driven executive. It was an event-driven system. So you can imagine dating myself a little bit. This has been around for a long time, this whole idea of event-driven architecture. If you want to think about it kind of in a 911 situation, you wouldn’t want to do procedurally oriented programming for 911 where you’re going out and polling a database every few minutes to say, gee, are there any emergencies out there? You want to know the moment the emergency arises. So you have this event-driven architecture that says the moment that call comes in, somebody can actually take an action. So the minute it comes in, we’re not waiting for somebody to go check to see are there any emergencies out there. We’re handling it as soon as it comes in. So the idea of event-driven, you use it when data delivery is continuous, meaning the data is coming in. You’re not getting batches of things that you process together. You’re just getting this continuous stream of data that is irregular. You don’t know when it’s going to come in, and it can be unpredictable as to what you’re going to get, what kind of data you’re going to get. And then it’s important when you need to respond as that data comes in. So you need to be able to respond quickly as the data arrives, and people are expecting very, very fast response times. Hey, Dan, if I can just interrupt you just for one second. Sure. Go ahead. I just realized your camera’s off. Do you want to turn it on? Ah, OK. There we go. Great. All right. Thanks, Eric.
That’s an example of an event coming in that you need to respond to. Exactly. I had to respond to you right away. OK. So just to take that to a real-world example, and we see this all the time, where customers have these kinds of events coming in. So events are happening, and they’re storing them in a table somewhere, and then they do a polling approach, where they all go out and they’ll poll to see, is there any new data? Is there any new data? They go out and they do that every couple of minutes or once an hour or whatever, but they have a timeframe where they’re going out and they’re checking to see. So it’s a procedural approach using polling to say, is there data for me? And there are obviously a lot of concerns with that approach if you need this very fast response time, because one is you’ve got the delay of the polling. You’ve got to wait until that polling happens to find out. You also have to look at that data and say, OK, well, which is the new data? So you have to have some algorithm to say, I can recognize what is new since the last time I checked. Is there a new file? Are there new records in the database? Do I need to have some kind of logic for identifying the updates? And then I need a custom connector for talking to whatever it is I’m going to need to call based on that data. Do I have another IBM i program I have to call? Do I want to send a text message out? Do I want to execute a process on a Windows platform or Linux platform? And I have to create custom connectors for each one of those things. So I do the poll, find out if there’s new data. If there is, then I have these processes that are going to execute based on finding that. So the other more appropriate thing or the IBM i approach to event-driven is you put in a database trigger. So you have to say, if the database updates, then I want to perform an action. And that is the typical approach to event-driven programming on the IBM i, is I put that into the database so that I can then take an action based on that. You still have the issue that once you run that trigger, you’re going to have to create the connector to whatever it is that you’re going to have to run. You don’t want to really put the processing logic into the database trigger. So you’re going to have some external processing logic that you’re going to have to go find and run. And it really was triggers were really not designed for the idea that you were going to create this completely event-driven architecture around your IBM i. So that may be part of your strategy. In fact, it’s a great way of identifying things that happen on the IBM i. But there are things you can do to actually make that more powerful. So it’s where what is the alternative to doing those things? So you’re going to need that integration. So no matter what you’re doing, doesn’t eliminate this idea that if I have transactions coming into my e-commerce system and I need to access my warehouse system and update inventory or I need to go grab some pricing information from my pricing algorithm, or I need to check the customer’s credit, I still have to do those integrations. So typically what we do is we write these custom integrations to do all those things where we have to do the authentication. We have to translate the data if it’s coming in as JSON data or XML data or common delimited files. We’ve got to translate it. We’ve got to route it to the right place. We’ve got to do all the error handling and the logging. So we’re doing that for each of these connections. So there’s a lot of work to creating these kind of direct integrations. What the idea of Kafka is is we don’t do it this way. We don’t create a bunch of direct integrations. What we do is we create this loosely coupled environment where the event, what’s called the event producer or the publisher, says, oh, I’ve got an event. The new order has just come in, or a truck has just left its warehouse, or I just got information from my patient on their medical condition. So these issues are coming in. They get posted to what’s called a broker, and then the users of that information subscribe to it. So I, as the producer, simply post the message, and the users who have subscribed to it and are authenticated and authorized get those messages. That’s the idea behind Kafka. And you can have lots and lots of different, and it’s practically unlimited, the kinds of producers that you can have. These are some of the ones we’ve run into. Trucking companies that are tracking arrivals and departures and truck status. Finance companies that are dealing with finance transactions. Healthcare companies that are dealing with information coming in from their patients. Inventory inquiries, and we’re all familiar with using Amazon and other sites for doing order entry. This is a very common thing that people are using event-oriented programming for. Using mobiles, mobile devices, so every time somebody clicks on something, we’re going to get a message. Those are all different examples of people who are producing messages that might get posted to a Kafka broker. Kafka is basically using this, although with some enhancements, it’s using this idea of what’s called PubSubber or publish-subscribe architecture. So let’s look at that and what that is. So you have basically these three elements. You have the event producer or the publisher, so somebody who is creating an event. That could be a sensor out in the field. It could be a blood pressure monitor. It could be a sensor in a truck. It could be an ATM machine or whatever. There are lots and lots of things that might be producing those messages. When that event occurs, they post the message to the event broker in what are called topics. A topic is simply a way of relating messages. So it might be all of my order inquiries or all of my patient reports or whatever. You’ve got a list of related messages, and they’re gathered together in a topic. And then you have the event consumers, the people who are actually then getting that data and making use of that data. They can subscribe to those topics, so they get that message. They either get that message pushed to them or they get notified that there’s a message available to them as those messages arrive into that topic on the event broker. So what you end up with is this decoupled interface, right? So you’ve got the producers and the consumers, but they don’t really care about each other from the perspective of having to worry about, gee, are the consumers making changes to their application? You don’t really care. Is the consumer system down? We don’t care. Are they performing slowly? We don’t care. We are simply, as a producer, we’re simply posting those messages. The consumers are simply consuming those messages, and the two things are not dependent on each other from the perspective of what’s happening on the other side of that connection. The producers are simply posting those messages to the Kafka broker, and then the other side is consuming them. By the way, this could be going either way. You can just as easily switch these topics or switch these headings and say that the IDMI is the producer of the message sending them to Kafka, and all of these other devices are consuming the message. It can go both ways. So if you look at the PubSub architecture with Kafka, I’ve got the e-commerce system in this example that’s publishing order inquiries. So the order inquiries are going into this order inquiry topic. The consumers are subscribing to that order inquiry topic, so they get the information every time that a message gets posted to that topic. So those are events that are hitting the Kafka server, and the topic here is an ordered list, and every new message that comes in is added to the end of that list. So let’s talk a little bit more about those topics. So topics, again, are streams of related messages. It’s a logical representation. So basically, it’s a way of categorizing messages, that they’re a particular kind of message. But it is a conceptual idea. It’s not a physical thing. It’s a conceptual idea that says these messages are related. There’s another concept called a partition, which is actually the physical location where the topic is stored. You can have multiple copies of a topic, and I’ll talk about that in a second. But the topic is really just the concept that this is a set of related messages. And you can have a many-to-many relationship between producers and topics. You can have many producers who are posting messages to many, many different topics, and you can have as many topics as you want. And these might be some examples of Kafka topics and messages. It’s got blood pressure readings. It’s got the patient identifier, the date and time of the reading, and what the reading was. So I’ve got that information coming in. Again, it’s an immutable list. It’s a list. You don’t change it. It’s like a log. And the latest messages are going to the bottom of the log. So here we’ve got truck arrivals and departures. And again, just a whole bunch of messages that are getting posted in arrival sequence. Now, what Kafka does for you also is that it replicates those brokers. So if I’ve got a Kafka broker and it’s got a series of topics, so I’ve got all these different topics on my Kafka broker, I can set it up so that it will replicate itself. And it does all the infrastructure for doing this so that I can ensure that if one of these brokers goes down, I’m not down. The other brokers are still up and running. So it’s really for two purposes. So you can handle the very, very high performance. You can handle tons and tons of messages that are coming in because you can have as many of these brokers as you want. So you can have lots and lots and lots of them. And it’s resilient in that if some of the brokers go down, you have other ones that are still up and running. Same thing if you want to do upgrades, you can upgrade your applications without bringing the Kafka brokers down. They can still continue to run. You can continue to post messages and consume messages on the Kafka brokers that are up and running. So it’s designed to be very, very high performance and very, very high speed. So that’s how companies like LinkedIn get 7 trillion messages a day. It’s because they’ve got literally thousands of these Kafka brokers out there that are handling those incoming messages. And as a subscriber, I can subscribe to multiple topics. And, again, this is based on authentication and authorization as to what I’m allowed to see.
But if I am allowed to, I can choose which topics I subscribe to, and I can get the messages from those topics. So different subscribers can have different combinations of topics that they’re subscribing to and getting messages from. And, again, the producer, from the perspective of what it’s published in, does not care. So there’s no sort of any custom logic you have to build in there. In the Kafka cluster, you are identifying what people are authorized to see, what they can get. But I don’t really care about who’s subscribing it from the code that I have to write in order to make this communication work. I’m simply posting the messages, and the consumers are consuming the messages, and Kafka’s taking care of who gets what. And then there’s a management system around this to manage all this for you. So you don’t have to manage all of the replication. You don’t have to manage where the message is coming from. How do I make sure that I’m getting only one copy of the message and not multiple copies of the message? And that’s using what’s been current is ZooKeeper. So you’ll hear about ZooKeeper. It’s a management system for Kafka, but that is being replaced by a thing called Kafka Raft. And we don’t really have time to go into a whole lot of detail about that, but this is the new management system that’s replacing ZooKeeper. But the key here is that Kafka is providing the management functions inside of it, so you don’t have to build that yourself. It’s managing this whole sort of process of bringing in these messages from these streams and managing the replication. It’s handling all of that for you. So some real-world Kafka use cases. People are actually using this technology. One of the first ones we ran into was a customer of ours. They were a trucking company, and they had been dealing with about 100 requests a day. And then they signed up to partner with Amazon and a couple other online providers. And suddenly, their rate requests, and to some, I guess, 3PL companies, and their rate requests went up to a million requests a day. And so suddenly, they had this massive number of requests coming in that they wanted to be able to deal with. And they had to be able to respond really, really quickly. If they didn’t provide very, very fast response, they didn’t get the opportunity to play in those particular transactions. We’re working with a financial services company that’s tracking payment data, so information coming in from the people they’re loaning to. And they’re getting 12,000 transactions a second, so a very, very high volume of transactions coming in from the payment systems. We had a distributor that they had to process an FTP file. And they thought, well, it’s one file. How long could it take? Until they found out that they were getting over a million transactions in that FTP file. And the way they were processing it was taking them actually a couple of days. And so by moving it into Kafka, they can process those transactions much faster. We had another customer. They were a retail chain. And on Black Friday, they said their traffic went from a few thousand transactions a second to over 80,000 transactions a second. So they had to be able to rapidly scale up so that the Kafka replication thing was a powerful option for them to be able to very quickly scale up and then scale back down when they didn’t need that kind of power anymore. Again, a key e-commerce customer we had who had 80 different warehouses. And each warehouse actually was running different warehouse management systems because they had been acquired. And so they needed to be able to talk to all of those warehouses simultaneously and yet provide very, very fast response. Again, they used that sort of publish-subscribe way of doing things in order to talk to all those different systems. We had worked with an insurance company that they were getting requests for insurance quotes from some of their business partners. And they wanted to be able to respond to those with sub-second response time. So they were able to take the sudden increase in requests for quotes through this kind of architecture. And then we were working with a healthcare company that does real-time patient monitoring. They told us that with some of the devices they had out in the field, that they were liable if the device actually identified a problem with the patient and they didn’t respond, they were actually liable because they were monitoring it. So they needed to know immediately if some kind of a problem was discovered, they had to be notified right away. So just some examples, again, we could go on and on and on with real-world examples in the IBM i world of people who are using this kind of technology. This is just a sample of one where it’s an e-commerce store where you can order things through the e-commerce site. So when a user goes in and does an inquiry and asks for a price on a basket of goods, the thing it posts to the Kafka cluster, the Kafka cluster checks the inventory. So it posts that message. The inventory system actually then updates the information about what it has available. The pricing algorithm sends back a price for it. The credit system actually figures out, do they have the credit to buy this thing? And they check with the shipping partner, identifies when they’ll be able to deliver that. So they’re all subscribing to that Kafka cluster in order to get those messages. And then we have customers who are using this for high-speed text messaging, both inbound and outbound. So they can send text messages out to customers or partners as things are happening, or they can process incoming text messages. This was an interesting use case. This was a customer that actually needed to integrate form data. They had all these government forms, and they needed to take IBM i data, generate the documents, and then distribute those documents to thousands and thousands of potential end consumers. And so they posted the documents to Kafka. So again, with Kafka, you can include binary files. You can include these large files in the messages or pointers to those files in the messages. So they were able to provide that kind of instantaneous response and distribute those documents out. And then this is a real common one where we have customers who are doing real-time vehicle tracking.
And then the other one is the code. The code is one where we have low cardinality. So in other words, there are only four values in that column across all million rows, which will be one of those four values there. Now, if anybody wants to play with this on any stage, by the way, if you look in the handout, the code for the stored procedure to populate the table and all of that is there. So feel free to go and play with this or make changes to it so you can get better ideas of how to play with it yourselves if you wish. Okay. The first place we’re going to start with this, by the way, is with Visual Explain. Okay. And because this is where you can see the effect of having an index. So to get to Visual Explain, the easiest place is from within run SQL scripts. So if we take line 48 there, that select statement, if I put the cursor anywhere on that statement and then select the explain option up here and not run an explain or that, just pick the explain one. And the explain is basically saying, tell me what it is you would do. Yeah. If you went to run this statement and give me an idea of how long it would take. So for this example here, now remember at this stage, the only thing I have on the table is that primary key constraint. Okay. So for the sequence here that I’m looking at, some value in descending sequence, there is no index out there. There are no indexes built yet on this table. So what Visual Explain tells me is that it’s going to do a table scan. In other words, it’s going to read all 100 million, all 1 million rows into memory. And then it’s going to do a temporary sort of list. And from that sort of list, it will do a scan, which will be giving me everything. And then I’m going to get my results set. Okay. Now in here, in Visual Explain, once I’ve done that explain, it will advise indexes to me. So if I go up here to actions and they’re under actions, I’ve got the option index and statistics advisor. And in here, it’s telling me, yes, this is, I recommend you build an index, which is just keyed on some value. Okay. Now, if I hit the create button there, it would bring up the wizard for creating an index that’s indeed in ACS. Or if I do show SQL, it would show me the SQL that I need. Yeah. Change the name of the index as you require. Okay. So let’s look at the important bits in this. So this was my select statement. As we saw there in Visual Explain, it was telling me that it estimates that this would take 530 milliseconds to run. Okay. The advise index that it had, if I create that index, okay, and then if I do Visual Explain on the statement again, it now looks something like this. So I’m going to say, so I’ve created the index. And now it starts doing an index scan for the rows. And from that does a table probe to actually get the rows I want. Does a little bit of logic in there and gives me the final select. The important thing is that the estimate has now gone from 530 milliseconds to 1.7 milliseconds. So in other words, the index means that this is going to be 500 times odd faster, or 450 times faster, whichever. Okay. So that’s great. This is the point of having an index. But the thing is, I can actually make this faster. Okay. And this is an index that was not advised to me. So this is where we are. This was the bottom of the last one. Okay. We’re doing absolutely fine. But here I’m now going to create a new index. And this index is going to be keyed on the value in descending sequence and the key ID. Now, if you look at what I’m doing in here, is that since this select statement was only selecting two columns, and by the way, it was three or four would be fine as well. But I’m putting all of those into the key of the index, which means now, if I look at the advice on that select statement, it’s picking everything up from the index scan. Okay. So if you look at what it’s done here, the comparison, it’s dropped those two middle bits of having to do a table probe to retrieve the actual rows and then do the logic that picks out the columns that I need from those rows. Okay. So here, okay, now not as big as we saw before, but we’ve jumped from 1.7 down to 0.4. Okay. So this is now four times faster. So just so you get the idea of this is what index strategy is really about, is we’ve gone from an original thing with no indexes of 530 milliseconds to less than half a millisecond.
Okay. So that’s example one. So here’s another one, quite a common one this, where I’m selecting everything. I’m going to select all columns. But look at my criteria here, where the lower of some text is like hit. Now, if you remember I said earlier, I only have 100 rows in there that satisfy this. Okay. So when we look at this, it’s going to say it’s going to take about 84 milliseconds to do this. And the table scan means it’s reading every row. And if you think about it, that’s what it has to do. Because it has to say, well, I’ve got to take a row. I’ve got to take the text, some text, convert it to lowercase, and see if it begins with hit. No, it doesn’t. Okay. So this process, it’s basically going to process 900,900 rows. 900,900. You know, a million less than 100 rows. Okay. Just to come up with the 100th that I want. The important thing when I did this in Visual Explain, it didn’t advise, it didn’t recommend an index. So derived key indexes are not advised. Okay. What I mean by that is that the optimizer does not advise us, even though it would be a good thing, right, and it would use it. So if I go ahead and create this index. Okay. So I’m creating it here. And the index is the lower of some text. Okay. And now if we look at what we have, okay, it’s now doing an index probe. And that index probe is saying, yeah, here are the 100 rows that you need. So it then goes and gets those 100 rows and puts them in my result set. And the timing on this has jumped down to 1.2. Okay. So from 83 milliseconds to just over 1 millisecond. Okay. So this is a derived key index. Now remember, it is one that is not automatically advised, but will be of benefit to you. A sparse index, again, usually comes in from where clauses. So again, I’m selecting some columns here. And I have some where some value is less than 10,000. Okay. And again, 86 milliseconds being the recommended size. It did actually, in this case, advise two indexes. One based on some value. Okay. And the sum value is so it can do this bit. Okay. Remember for the where clause. And the second one then is for the sequence that it wants to use. And if we looked at that, we can see much more complicated. And I’m not going through what all of these bits are. But by just creating those indexes, okay, I’ve now gone from 86 milliseconds down to 6 milliseconds. Okay. Again, a vast improvement with the recommended indexes. But let’s look at this and sort of say, well, could I have made it better? And I said, yeah, well, if I created a sparse index, I could. So if I had, so if I get rid of my two index, index four and five, okay, and in here I’m creating a new one with that key where the value is less than 10,000. Okay. Okay. So where the logic that I’m using is within the index. Otherwise on the where clause. And now we’re going straight from an index scan on that sparse index. Okay. And then retrieving the rows and that all goes into it. So we’re down now to 1.2 milliseconds from our original 86. Okay. Now one of the cool things then is with EVIs, encoded vector indexes. And again, this is something that we have to watch out with with advices and that. So we select these. Okay. Group by some code. If you look at here, I’m doing a grouping. So I’m grouping by some code. Now remember, some code has low cardinality. It’s going to end up with four rows in the results set here. And for every row, for the values A, A, B, B, C, C, D, D, I will have how many of them are there and then what the sum of the value is. Okay. So when we look at this, again, it’s going from a table scan, et cetera, and that great big enormous 394 milliseconds to process a million rows. Now in this case, it advises an index. It says you should create this index. So I create the index. And I say, you know, show me the index advice again. And lo and behold, the index advice is the exact same. And it says, yeah, you’ve got 394 milliseconds. The index actually appears to have made no difference. Okay. Now the reason for this is to understand why did it advise that index. It didn’t advise the index so that I could be so it would make the retrieval faster. Okay. It advised the index so it could figure out different ways of approaching how to get the rows quickly. So what it created here or what it recommended I create was this index, which was there for statistical information. So if I was to look at this in the index advisor, this is what the difference is between these two columns here of the last query use and the last query statistics use. So it created this just for doing statistical analysis. And really what it was looking at here was cardinality information. Okay. So how would I use an EVI to make this faster? Right. Because I know that I can make it faster. So I’m going to create an encoded vector index. So this is, you know, instead of saying create index, it’s create encoded vector index. This is what makes it an EVI. And I’m going to create it based on the code. But here comes the cool bit. I’m going to say I want to include the count and the sum. And what that means is that within the index, what is going to happen is that the count and the sum will be maintained in the index itself every time I insert or update or delete a row. So it’s constantly maintaining that count and sum. So when I now run that statement, instead of having to go out and to all of the rows and, you know, get the sequence right, and then go to every row and count how many there are, and then add together all of the values, it’s just going to grab that information directly from what’s in the index. So this is what we’d see. It says it goes to this EVI symbol table scan, uses some logic to give me the final result set. So we’re now down to 0.426, less than half a millisecond from something that was 394 milliseconds. And here’s the cool thing, by the way. If I now add in another million rows to this, but still with just those four values, for now 2 million rows, the runtime will be 0.426. For 5 million rows, as long as there are only four values, it will still be 0.426. Okay. A few last bits just to watch out for a general thing when it comes to indexes and this that we have sort of recommended voicing things like numeric conversions in WHERE clauses or ORDER BY clauses, arithmetic expressions, character string padding, or the use of LIKE where you begin with a percent or an underscore. There are things that really you cannot allow, that indexing can’t allow for. So avoid them if you can. You will sort of take a hit on it. Here are some references for you if you want to read up more on indexing and on indexing strategies and a little bit on the DB2 for I services that are available out there as well. Okay. So let me see a couple of questions. Where can we find documentation about the different visual explain icons and the interactions amongst them? The indexing strategy guide that I mentioned there has some of them in there, but if you want to get an in-depth one, the only place I really know of doing that is there is a course run by lab services, and in there they cover all of it. I have yet to come across a sort of a detailed description of everything that is in visual explain. And part of the reason for that is that it does change quite rapidly with every, you know, there’s always new stuff that comes into play in there on every TR on the system. Is there any way we can find out if any indexes are not being used? I mean, clearly they’re not being used and keep only that are frequently used. So, sorry, Anton, I’m sure you have answered that in there. Yep. Yes, you have indeed. Okay. Is it possible that driving is over mapped subfields in an XML doc? No, it’s not. Nope, afraid not. Okay. If DeFi uses a scalar function, will the performance be the same? Okay. So, Tommy, that depends on where you mean the performance. Do you mean the performance on like if you’re using a select statement or that? With a select statement, the scalar function would make it quicker. Oh, okay, Kevin, that may be my fault. Sorry, folks. Kevin just says that he doesn’t see the fill index procedure in the PDF. That may be my fault. It was a hidden one. I will check that. Okay. So, if you give me about an hour after we finish here and check again and just hit the download again, it will be there. Okay. So, Charles says EVI automatically includes count. Though I seem to recall a PTF to fix where the EVI wasn’t being used to return the count. I’m not quite sure what that means, Charles. Sorry about that. Ah, John has replied to it. Okay.
Couldn’t that be a performance issue for a large table having many inserts and updates? Sorry. So, okay, Debra, you raise a good point here about performance on this thing of maintenance of indexes. Okay. I promise you that if you have one index on a table or 50 indexes on a table, you will find it impossible to see the difference when you insert, update, and delete rows in terms of performance on the system.
Technically, is it a performance hit? Technically, yes. Can you spot the difference? Probably not. The only place that you will ever see performance hits is when you are doing mass changes to a table. And by that, I mean if you do something like a purge where, like at an end of day process, you did something like delete 80% of the rows in the table, then you can see a performance hit on the maintenance of indexes. And, again, how big a performance hit depends on how many disks you have physically on your system. So, it becomes more of a hardware issue. And sometimes the solution to that is, so the only time you really will see these hits is in a batch type environment. And if you are sort of seeing that, sort of the recommendation sometime is, okay, well, what you do is you delete the index, run the batch process, recreate the index. Okay? That would be the solution. Okay. So, hopefully, Kevin, the performance that I’ve taken care of that. These are not big deals. Okay? Okay. Oh, sorry. Yes, Charles, you are correct. The count was unneeded in your create encoded vector index. But, having said that, if I was creating an encoded vector index specifically for the use of count, even though it’s not needed, I would consider good documentation to put it in there. Because it means when somebody looks at it, they sort of go, oh, I think this was being created so we could use it. Specifically with count was part of the reason for creating it, as opposed to just performance gain that I would expect with the EVI. Okay. Okay. So, by the way, yeah, just to come back on Nanda’s question there, Anton has sort of answered in the automatic index advice processor. But you can, if you want, yourself look in the SIS indexes in there at the statistical information that’s in there. So, you know, and whatever the criteria is that you want as to whether or not you should keep them. But the key thing, and this is why I put it in the presentation, to look at the last query use and at the last statistic use as well. So, it can be a thing that, you know, they can say, yeah, you know, this index hasn’t been used to retrieve information ever. But it’s constantly being used by the optimizer to do statistical checks on things, you know, like number of values for columns, et cetera, et cetera, et cetera. Okay. So, yeah, no. Sorry, Tommy. The maintenance of those, you’re not going to see the difference by specifying it in there. Is there anywhere that backs the statement about performance for adding index? My manager frowned on indexes for performance reasons. Okay. Well, then all I would say is that your managers don’t know what they’re talking about, Rob. The index strategy guide, by the way, that’s in there, gives a lot of good information about this, about performance. For this thing about, you know, lots of indexes having an overhead, it comes from some of the System 38 and from the AS400 in 1989. Okay. Any of these sort of problems disappeared with, at the latest, sort of 1994 in terms of the stuff with, you know, the number of indexes. There’s a weird thing that people have in their head that you shouldn’t have more than 32 indexes built on a table. Okay. I think that came from people thinking, you know, from the thing of you can’t have more than 32 formats and logical. So, I’ve no idea. Okay. Okay. Tables with 2.4 billion rows are still a performance issue. But, sorry, yes, they are. Okay. But they’re going to be an issue with that. I mean, if you’ve got a very large tables, you know, your best way of maybe hitting those is looking at things like derived key indexes. So, if you can sort of hone them down, so within indexes you are actually putting them into sort of areas where things can be selected very, very quickly by the optimizer, it will use them. Okay. So, yes, you know, they can still have a performance issue. Whether there are a million rows or 2.4 billion rows in a table in terms of the maintenance of indexes, that makes no difference. Okay. It’s in the actual data retrieval, you know, and that. So, actually, the bigger that your tables are in terms of number of rows, the more important it is to have all of the right indexes in place so you can have the quickest retrieval possible. Okay. So, Richard, sorry, yes. Okay. So, if you remember what I, if you were there when I said at the start about the optimizer, any SQL request goes to the one place, to the one optimizer. So, it doesn’t care whether it’s a Python making the request or Java or anything else. Okay. So, is single record SQL now on par or faster than record level access? So, Larry, the answer to that is it depends. Okay. It can be. Yeah. But still, and again, by the way, to measure the difference between these, it is, you know, you’re talking fractions of a millisecond in terms of the difference. Really, the benefit in embedded SQL is that if you compare something, like on this table, for example, a multi-row fetch of a million rows compared to an RPG read loop over the index, the multi-row fetch is five times faster. Okay. And that’s what’s really with multiples. Would you talk about the hierarchy of creating an index when determining if the where or order by is more important? So, it’s not a thing, it will look at multiples. The optimizer will look at all indexes and decide which one is best. Okay. So, but what it looks at first, by the way, is that the optimizer, the first thing it looks at in a statement is the where clause. Okay. The last thing it looks at is the order by. Okay. Okay. So, you see this from creating an SQL view does not automatically create a view, correct? And you cannot create an index over a view, correct? Well, okay. You cannot create an index over a view. Okay. What a view is, is just an SQL select statement. So, when you create the view, that statement is stored as an object. So, when you say select from the view, what the optimizer does is it takes the select statement that’s in the view and your select statement, and it merges them together into one select statement, which is the statement that gets run. Okay. So, which, by the way, is the statement that you would have had to write if the view wasn’t there. Okay. So, that’s kind of what it is. And, sorry, yes. So, creating a view does not create an index. The one thing that you cannot have when you create a view is an order by. No order by means no index, means no access path. Okay. Means no access path, so no index. So, tips for partition tables, sort of outside the scope of what I’m talking here, Erwan, really, and sort of since I’m running over here by a good 10 minutes. Okay. How about join fields? Should the join fields be higher than the select fields?
Again, it sort of makes no difference on that one, Lita. So, the complicated SQL, by the way, what I will be doing is looking at an individual explain, looking at what indexes are advised, create those indexes, and then see. Okay. That is sort of the best. That is always your best approach. Usually what you will find, by the way, is about 80% of any of your issues will be taken care of by advised indexes. It’s the last bits where you look at things in the examples of what I was doing there. So, things, for example, oh, well, in the where clause, they’re using something like the lower of. So, you know, a drive key index would be of benefit there. Okay. What part does the access plan play? So, Ramon, I gather you’re speaking here about the access plan that is stored in the RPG program itself when you’re using embedded SQL. Okay. The part that’s played for it, you would need to know an awful lot about what’s going on in the SQL optimizer, but the original purpose that that had had, which was that it made static SQL faster, that is no longer true. So, if you have a dynamic statement or a static statement in embedded SQL and RPG, the only speed difference between the two is the length of time it takes you to, you know, concatenate your SQL statement together and to do the prepare, which is going to be about half a millisecond. So, and then the actual performance will be the exact same. Okay. There are a couple of little things in there that are, where it does make use of some of the information that’s in the access plan, just to make the process faster, but really of very little use. I was kind of hoping that someday they may actually just get rid of it, of keeping that access, because it does cause other problems. Because you may notice that if you switch to things like have static SQL and you’re switching your library lists a bit, you’ll notice your program size starting to get bigger. So, watch out for that one. Okay. I think I’m done there, guys. I will get a new version of the handout uploaded very, very quickly, like within the hour. So, sorry about that. I think when I generated, I forgot to put in the hidden pages. So, if anybody does have any other questions with that, please feel free to drop me an email. You will be hearing from Dan and Aaron in the next day or so, and they will give you information about handouts and all of that as well. And you have Dan’s email on the front of my slides as well, if you do need to contact Dan. Okay. So, that’s it, everyone. Thanks a million for tuning in. Dan, Aaron, thank you for an excellent presentation, as always. And unless you guys have anything to say. No, just thanks, Paul. Appreciate it. And thanks, everybody, for sitting in. Thank you all. Okay. So, without further ado, folks, I am going to end there. And thanks for tuning in, everybody. And hopefully catch you all at the next one. Bye. Bye-bye, everybody. Bye.