Amazon Redshift's architecture (shared-disk, vectorized processing) and performance are analyzed, comparing it to competitors like Snowflake. Benchmark results show comparable speeds, but success depends on factors beyond raw performance. The lecture also covers Redshift's evolution, its design and optimization choices, and its surprising market dominance despite a relatively low initial investment.

So maybe some of these came a bit later, but there was this first wave of special-purpose OLAP systems, like Greenplum, Vertica, ParAccel, Aster Data, and DATAllegro, and all of them had gotten bought up by this point except for ParAccel; Greenplum, for example, got bought by EMC. And so you'll see that over time Redshift evolved into a shared-disk system. They didn't start off the way Snowflake did, with completely disaggregated storage; they eventually had to add those pieces back in, and they tried to do it in a couple of different ways.

So they added disaggregated storage on S3 in 2017. The original version of Redshift, I think, came out in 2012, not 2014, and again it was always a shared-nothing system, storing things on the compute nodes themselves. Then in 2016 they took Presto out of Facebook and rebranded it as Athena. I think it's based on the Trino line now, the one that was PrestoSQL, not the PrestoDB one that Facebook still controls. This is more akin to a lakehouse query engine, where you just have a bunch of files on S3 and you can read from them, and again that's what Presto was originally designed for. And then the Spectrum extension to the original version of Redshift came in 2017, and this basically allows you to query that external data on S3 through the Redshift interface. So for today I'm going to mostly talk about this one and this one, but these other things will come up as we go along.

So this is the diagram from the actual paper itself, and you can see all the different bits and pieces that make up the overarching Redshift system. At the bottom you have the storage layer, and what's sort of confusing is that you have S3 here and then this Redshift Managed Storage next to it. As far as I can tell, those are actually EC2 instances themselves with locally attached storage, which can also spill over and read and write data from S3 as well. There's this AQUA thing, that's the hardware accelerator, and we'll talk about that in a second, but that can be something that sits in between the compute nodes and the Redshift Managed Storage. Then they have these other Spectrum nodes here. Again, I understood these to just be software that runs in the actual compute nodes, the worker nodes for running queries, that knows how to go read data down from S3.

Yes, two questions. Is Spectrum just an isolated thing? Is it actually talking to anyone? The question is whether Spectrum is an isolated thing, and yes, there's no arrow to it in the diagram. So this is what I'm saying; maybe I'm misunderstanding what was in the paper. One of the authors gave a talk last year.
I don't think he mentioned this. I thought this was just software, because it's just: okay, you want to read some data on S3, my worker node knows how to go do that.

From what I understood from his talk this year, the RMS nodes are just S3 nodes with some more software on them, and Spectrum basically moves stuff from S3 to RMS. So your question is, for the RMS stuff, these are just S3 nodes with the extra software? Yes, yes. And does Spectrum then move data from S3 into that? No, I thought Spectrum was just compute, because you can join the RMS data with the S3 data, and we'll see slides on this in a second, so if you're just doing query processing, that's going to be done up in the compute nodes anyway. So I don't know what the separate Spectrum nodes actually mean.

And then the compute isolation clusters versus the regular compute clusters: this is just how you provision things. Do I want some organization in my company to have access to a compute cluster that only they can use, so that it doesn't get slowed down when other people are hitting the general pool for the organization? They also have the ability to scale up automatically if you tell it, hey, go bring in some additional nodes, because these things are meant to be stateless, although I think they also have a local SSD cache as well. And then the compilation service, we'll talk about that in a second, but that is basically the thing that's going to run GCC to do the transpilation, the compilation of the queries as they go along.

What's also confusing about Redshift Managed Storage, and how it relates to the compute nodes, is that when you provision Redshift, you specify that you want the instance type that comes with managed storage. And sometimes the literature makes it sound like the node itself just knows how to go talk to it; sorry, like you get a single compute instance, but there's also another instance that spins up that has this Redshift Managed Storage. Or, as you were saying, is it just something sitting right above S3? Yeah, it's S3 nodes with additional code. Yes. Okay.

So when they execute a query, the actual engine itself is going to be push-based. The example they show in the paper looks like it would be pull-based, and they even call this out: they talk about how that would be too much state to maintain in a pull-based model, so they switched to push-based, but they are careful about where they do certain operations to avoid blowing out the CPU registers. So a lot of the same things we talked about before with the HyPer paper, when they were doing compilation, are things they worry about here too. But one thing they do differently, to help reduce the compilation cost and compilation time of queries, is that they still rely on some pre-compiled primitives to do vectorized scans, filtering, and other things.
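To make the pre-compiled-primitive idea concrete, here is a minimal sketch of what one such primitive could look like: a hand-vectorized filter that scans a batch of values and emits a selection vector. This is my own illustration under the assumption of an AVX2 target; the function name, the selection-vector convention, and the batch interface are not from the paper.

```cpp
// Pre-compiled primitive (sketch): write the positions of values greater than
// `bound` into a selection vector, returning how many positions were selected.
// The generated query code would call this once per batch instead of emitting
// the filter loop itself.
#include <immintrin.h>   // AVX2 intrinsics
#include <cstddef>
#include <cstdint>

size_t select_i32_gt(const int32_t* col, size_t n, int32_t bound, uint32_t* sel_out) {
    size_t out = 0;
    size_t i = 0;
    const __m256i vbound = _mm256_set1_epi32(bound);
    for (; i + 8 <= n; i += 8) {
        __m256i v  = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(col + i));
        __m256i gt = _mm256_cmpgt_epi32(v, vbound);               // all-ones lanes where col > bound
        int mask   = _mm256_movemask_ps(_mm256_castsi256_ps(gt)); // one bit per 32-bit lane
        while (mask) {
            int lane = __builtin_ctz(mask);                       // lowest set bit = matching lane
            sel_out[out++] = static_cast<uint32_t>(i + lane);
            mask &= mask - 1;                                     // clear that bit
        }
    }
    for (; i < n; ++i)                                            // scalar tail
        if (col[i] > bound) sel_out[out++] = static_cast<uint32_t>(i);
    return out;
}
```

The generated C++ for a query would then just stitch calls to primitives like this together and feed the resulting selection vector to the next step of the pipeline.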
So they're not going to do fully pre-compiled primitives the way Vectorwise did for the entire query plan; it's just for the lower portions and the leaves that they have pre-compiled pieces, which then get inlined into the compiled program. And as I said, the code they're generating is not going to rely on auto-vectorization for any of these primitives; everything is written by hand using intrinsics.

Another technique they rely on in these scan loops to avoid stalls: they recognize that you don't want to do some operation on the tuple you're on right now, loop back around in the scan kernel, and then hit a stall because the next record you want to retrieve is not in the CPU cache or the CPU registers. So they use software prefetching, which I think we talked about with vectorization: you basically instruct the CPU to go fetch these next memory addresses and bring them in, I think it lands in L3, or maybe L2, I forget which one, but basically go fetch the next thing I know I'm going to read from memory into my CPU caches. And it's timed in such a way, because again they control the nodes they're running on and they control the code they're generating, that they have heuristics to figure out at what point to issue that software prefetch command so that when you come back around and actually need that tuple, it's ready for you. And they basically use a circular buffer to place these things.

Yes, what does source-to-source mean over here? Source-to-source translation: the query plan is the source, and then you emit source code for it, versus the HyPer style, which takes the query plan as the source and spits out the IR directly. Yeah, right.

So I think we talked about this before with the relaxed operator fusion approach, where we were introducing buffers in between stages in our pipeline. It wasn't exactly a pipeline breaker; it was a soft pipeline breaker: I'm just going to fill up a vector, and once that's full, move on to the next stage within my pipeline. And the idea was that within that soft pipeline breaker you would inject the prefetch commands, so that you do a little bit of other work while the prefetch goes and gets the thing you need, and then when you come back around, the data is available for you. Because if you think about it, if you do too much work, the data might get evicted by the time you actually use it; if you do too little work between the prefetch and the use, then when you actually need the data it's not there yet. And again, because they're doing all the compilation themselves and generating the code, they have ways to figure out: I'm this far along, and therefore this is where I should inject my prefetch commands.
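Here is a minimal sketch of the kind of software prefetching just described, for a loop that follows a selection vector of row ids into a large column. The paper describes pacing this with a circular buffer; this simplified version, which is my own illustration, just prefetches a fixed distance ahead, and the distance constant is an assumption rather than Redshift's actual tuning.

```cpp
// Prefetch the payload a fixed number of iterations ahead so that by the time
// the loop comes back around, the cache line has already arrived.
#include <cstddef>
#include <cstdint>

constexpr size_t kPrefetchDist = 16;  // illustrative distance; the real value comes from heuristics

int64_t sum_selected(const uint32_t* rows, size_t n, const int64_t* payload) {
    int64_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + kPrefetchDist < n) {
            // Hint to the CPU: read-only access, low temporal locality.
            __builtin_prefetch(&payload[rows[i + kPrefetchDist]], 0, 1);
        }
        acc += payload[rows[i]];  // this line should already be in cache by now
    }
    return acc;
}
```

The tuning knob here is exactly the trade-off described above: prefetch too far ahead and the line gets evicted before you use it, too close and it has not arrived yet.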
So one thing that also comes out of the paper, given all the other systems we've talked about, is that they appear to be less aggressive about being adaptive compared to BigQuery and Snowflake and the others. Snowflake was trying to do aggregate pushdowns; I think Photon and BigQuery were trying to figure out on the fly what the right data types they should be using were.

Really the only thing they talk about is that they have the ability to choose vectorized implementations of different string functions, like upper, lower, comparisons, and things like that, for when the data is ASCII, and then if that turns out to be incorrect they fall back to a slower version that operates on Unicode. But that's sort of the same trick the others were doing; they're not really reorganizing the query plan itself. And then the other one they talk about is for the join filter, doing sideways information passing: when you're building the Bloom filter on the build side of a hash join, they can recognize that if the hash table is getting too large and spilling to disk, you can size the Bloom filter to be a bit larger than you normally would, because that reduces the likelihood of false positives, and so you don't end up fetching things from disk. Those are really the only two adaptivity pieces they talk about, other than scaling up, and the scaling-up thing is more on a per-query basis: do I need more compute nodes because my current compute nodes are busy running other queries?
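To make that join-filter sizing idea concrete, here is a rough sketch using the standard Bloom-filter sizing formula, m = -n ln(p) / (ln 2)^2 bits for n keys at a target false-positive rate p. The specific false-positive targets and the function names are my own illustration, not values from the paper.

```cpp
// If the build side of the hash join spills to disk, a Bloom-filter false
// positive becomes much more expensive (it triggers a disk fetch), so aim for
// a lower false-positive rate and accept a bigger filter.
#include <cmath>
#include <cstddef>

size_t bloom_bits_for(double n_build_keys, double target_fpp) {
    const double ln2 = std::log(2.0);
    return static_cast<size_t>(std::ceil(-n_build_keys * std::log(target_fpp) / (ln2 * ln2)));
}

size_t choose_bloom_size(double n_build_keys, bool build_side_spilled) {
    // Illustrative targets: ~1% false positives when everything fits in memory,
    // ~0.1% once probes may have to touch disk.
    const double fpp = build_side_spilled ? 0.001 : 0.01;
    return bloom_bits_for(n_build_keys, fpp);
}
```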
All right, so the compilation piece of this is very fascinating as well. It's similar to what Yellowbrick was doing, where instead of having the worker nodes be responsible for compiling the queries themselves, Amazon has a separate service running on the side with dedicated nodes that basically call GCC and compile things. And the idea is that you'll have different layers of caching within the system. You have a local cache of pre-compiled query plans or fragments, and if you identify that the thing you're trying to compile right now has already been compiled before, you just reuse that. But then they also have the ability, which I think is ingenious, to maintain a cache across the entire fleet of machines, across all of Redshift.

So the idea is that if you come across a query that your cluster has never seen before and it doesn't have the compiled query plan in its cache, it can go look in this global cache and see whether somebody else had something very similar, and reuse that shared object, that compiled code. And again, from a security standpoint there isn't any issue, because it's not running arbitrary user code; it's literally "scan this table of this type with this kind of filter," and so it doesn't matter whether your table contains banking information and my table contains blog information. At the end of the day a column of data is a column of data, and they can reuse that.

So they talk about how the cache hit rate is something like 99.95% across all queries and across the entire fleet, and then in the cases where you don't have the pre-compiled segment in your local cache, 87% of the time when you check the global cache it's going to be in there. So this basically negates the cost of compilation, the thing we were worried about before when we talked about how to use this technique. Again, when we talked about HIQUE and MemSQL and other systems that fork and exec GCC, we said it was going to take seconds to actually compile things, and even HyPer in some cases was going to take hundreds of milliseconds to compile things. All of that goes away because everything is cached.

Yes, how big is that cache? The question is how big that cache is. I mean, it's not going to be petabytes, but it's probably a couple hundred gigs. Sure, but who cares, it's Amazon, right? How big is the total amount of S3 storage? I don't know. A lot. So who cares if you have this cache of compiled query plans? And from their perspective it's a win-win situation, because the customers are happy since the queries run faster, and it's less computationally expensive to fetch something from the cache than to recompile it. So for them this is a no-brainer. And again, this is something you can do when it's in the cloud. If you're running HyPer locally on your own box, you can't phone home to ask, do you have this compiled query plan, because it's just not designed for that. But when it's a service running in the cloud, when you control everything, you can do this kind of trick.

I saw a similar idea from a commercial JVM company called Azul; they now have compilation as a service that's almost like this. If you're running your Java program, you can have the local JVM say, I want a compiled version of this JAR file to run on this hardware, and you can call their service and get cached binaries from them.
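Pulling the caching story together, here is a hypothetical sketch of the lookup order described above: local fragment cache first, then the fleet-wide cache, then falling back to the compilation service. The class name, the method names, and the cache-key scheme are my own assumptions, not Redshift's actual interfaces.

```cpp
// Two-tier lookup for compiled query fragments (sketch).
#include <optional>
#include <string>
#include <unordered_map>

struct CompiledFragment {
    std::string shared_object_path;  // where the cached .so for this fragment lives
};

class FragmentCache {
public:
    // `key` would be a hash of the fragment's plan shape and column types, not of
    // table names or user data, which is why fragments can be shared across customers.
    std::optional<CompiledFragment> lookup(const std::string& key) const {
        // 1. Local, per-cluster cache (the ~99.95% hit case).
        if (auto it = local_.find(key); it != local_.end()) {
            return it->second;
        }
        // 2. Fleet-wide cache (catches ~87% of the remaining misses).
        if (auto hit = fetch_from_global_cache(key)) {
            return hit;
        }
        // 3. Miss everywhere: the caller hands the fragment to the compilation
        //    service, which runs GCC and publishes the result back to both caches.
        return std::nullopt;
    }

private:
    // Stub standing in for an RPC to the fleet-wide cache service.
    std::optional<CompiledFragment> fetch_from_global_cache(const std::string&) const {
        return std::nullopt;
    }

    std::unordered_map<std::string, CompiledFragment> local_;
};
```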
Yes. What I'm most surprised about is that the level of indirection of going to the cache is less costly than actually compiling. So your statement is that you're more impressed that the cost of going to the cache is cheaper than just compiling locally? Yeah, because isn't compiling locally just, like, 10 milliseconds, or 100 milliseconds? Seconds. Seconds? Okay. And actually, how large are these pre-compiled fragments, say in lines of source code? Are they more than, like, 200 lines? If it's per operator, one operator is not going to be more than a thousand lines, right?

So the question is how big these generated programs can actually be, because if you're using pre-compiled primitives for certain parts of the lower leaves, what about everything else, how much can that be? It can be in the thousands of lines for really big queries with a bunch of operators. Sorry, just specifically for the pre-compiled primitives: the pre-compiled primitives are usually pretty tight, right, those are the small ones. Everything else is the scaffolding around them that then calls into the pre-compiled primitives, and that scaffolding is what the caching is trying to avoid recompiling.

So what's the point of having the pre-compiled primitives if you're going to be able to compile and cache everything anyway? The question is why have the pre-compiled primitives at all. I think the paper keeps coming back to saying it's to reduce compilation costs. And I think it makes sense, because you can imagine: here's this one piece of code that I would otherwise keep recompiling over and over again, and at their scale, if you can compile it once and reuse it, it makes a huge difference. And they talk about how, as we saw before with pre-compiled primitives, invoking a function is not free at runtime, obviously, because it's a jump on the CPU, but if you do it on a batch of data, a vector, then it gets amortized and becomes negligible.

And, we'll talk about this at the end, but Amazon is not stupid. They have all the metrics and telemetry across their entire system; they can identify, here's a part where it's super inefficient for us, and then they introduce caching and other things to speed it up. So it's not like they designed this because they thought, oh, that's fun, let's go do it; they did it for a reason. So to your point, yeah, who cares about compiling the same function over and over again, but at Amazon scale, where you're doing this a billion times a day, it starts to add up, and this probably saves them millions of dollars a year. And as I said, the customers are happy too, because now queries run super fast, since you're just stitching a bunch of pre-compiled pieces together.
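To illustrate that amortization point, here is a small sketch contrasting a per-tuple call with a per-batch call into a pre-compiled primitive (for example, the select_i32_gt sketch from earlier). The function names and the batch size are illustrative assumptions, not Redshift's code.

```cpp
// Call overhead: paid once per row in the per-tuple version, once per ~1K-row
// batch in the batched version, where it becomes negligible.
#include <cstddef>
#include <cstdint>

// Signature of a pre-compiled batch primitive, e.g. select_i32_gt above.
using BatchFilterFn = size_t (*)(const int32_t*, size_t, int32_t, uint32_t*);

// Per-tuple style: one indirect call per row.
size_t filter_per_tuple(bool (*pred)(int32_t, int32_t), const int32_t* col, size_t n,
                        int32_t bound, uint32_t* sel) {
    size_t out = 0;
    for (size_t i = 0; i < n; ++i) {
        if (pred(col[i], bound)) sel[out++] = static_cast<uint32_t>(i);
    }
    return out;
}

// Batch style: one indirect call per 1024-row vector.
size_t filter_batched(BatchFilterFn prim, const int32_t* col, size_t n,
                      int32_t bound, uint32_t* sel) {
    constexpr size_t kBatch = 1024;
    size_t out = 0;
    for (size_t i = 0; i < n; i += kBatch) {
        const size_t len = (n - i < kBatch) ? (n - i) : kBatch;
        const size_t cnt = prim(col + i, len, bound, sel + out);
        for (size_t k = 0; k < cnt; ++k) {
            sel[out + k] += static_cast<uint32_t>(i);  // rebase batch-local offsets to global row ids
        }
        out += cnt;
    }
    return out;
}
```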