Scott Haines (00:33)
Hey, hello. Hello, everybody, and welcome to another episode. Today we're joined by Youssef Mrini, Solutions Architect at Databricks, and altogether just an awesome guy. Today we're going to be talking about Delta Lake tips, tricks, and best practices, and a lot of wins from the day-to-day. So I'm going to kick it over to Youssef if you want to say hi to everybody and give a good warm welcome.
Youssef Mrini (00:58)
Hi, I'm Youssef, I'm based in Paris and I'm a big, big fan of Delta Lake. I've been using it for, I think, more than four years, so I've been following the evolution inside Delta for quite some time.
Scott Haines (01:14)
Awesome. Thanks for that, dude. So yeah, well, let's kick it off. We're going to go through some slides today, and then we'll take some questions from the audience. All right, perfect. So Youssef, I don't know if you want to take control. Yeah, we'll go through.
Awesome. So yeah, for the whole session today, we're essentially doing something that's almost like a back-to-basics. We're talking about tips and tricks, but a lot of this is based on the things that people would need or want to do within the first couple of months of working with Delta Lake. And one of the big things that people usually need to understand first is how Delta Lake is actually laid out. So on the slide right now we just have, essentially, table layout 101. This is just a typical-looking partitioned Delta Lake table. We've got partitions by event date. These are from way back in the day, from 2019. And all of our actual data files for Delta tables are just Parquet. Then we have our transaction log, not our transition log, our transaction log, which is actually keeping track of what happens every time there's a commit to the table. And so this is what gives us our ACID guarantees.
This is what allows us to have multiple readers, multiple writers on the same exact table. It's
essentially kind of coordinating things for us. Within the log, we have all of our metadata. We've
got information about the table itself, what schema it has, what partitions it has, what other kind
of table properties it has as well. And then we have essentially checkpoints. So as we're writing transactions, periodically we're creating snapshots so that it's easier for us to go back to a certain point in time and/or to load the most current version of an individual table. So you put it all together, you've
got a Delta table. And now, on to the next one.
Youssef Mrini (03:12)
Yeah. And just to go back to what Scott was mentioning, my advice would be to go and open the JSON files and see what you have inside, and then also open the checkpoints, which are Parquet files, just to see how they look. It will give you an idea of how Delta tracks every single modification you make on your table. By modifying or adding data, you're going to see the structure and some interesting information. I think this is the basics, but make sure to have a look at it.
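If you'd rather not browse the storage layer by hand, a minimal SQL sketch like the one below gives you the same orientation; the table name `events` is just a placeholder for one of your own Delta tables.

```sql
-- Hypothetical table name: swap in one of your own Delta tables.
DESCRIBE DETAIL events;   -- location, format, number of files, partition columns
DESCRIBE HISTORY events;  -- one row per commit recorded in the transaction log
-- The JSON commit files and Parquet checkpoints live under <table location>/_delta_log/
```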
And I guess for this one, I wanted to touch on Delta Lake clones, because those are very awesome features. The first one is the shallow clone. I think it's an underestimated feature. This one is super, super important because it allows you to clone a table, or rather clone just the metadata, without having to copy the entire table. So let's suppose you have, I don't know, a one terabyte table. You want to do some modifications on it, but you don't want to copy the entire data set. So what you can do is just shallow clone it, and then you can do any modification, and you are sure that you won't be modifying the source table. So what will happen? If you run, I don't know, an update where the value equals, say, Paris, what will happen? In the shallow clone table, only the modification will be added to the new table. The existing one will remain the same.
And this is awesome because you can use it for short-term experiments without having to pay large fees for copying the data or moving the data. And you can also see, in the shallow clone table, the new table that has been created, only the modification. Which means if you update, for example, one row, the only new row that will exist in the shallow clone table will be the row that you have modified. The existing data will stay the same.
But just keep in mind that if you modify the source table, the shallow clone table will remain the same, because when you do a shallow clone, it's on the latest version. For example, it can be version seven, which means if you modify or update the source table, it's going to be version eight, while on your side you're still pointing at version seven. Just keep in mind that if you vacuum the source table and, for example, get rid of the version you cloned, then the shallow clone table will no longer exist, or will be corrupted.
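As a rough SQL sketch of that pattern (the table names here are made up, and the exact clone syntax depends on your Delta and Spark versions), a shallow clone only copies metadata and keeps pointing at the source table's data files:

```sql
-- Metadata-only copy of the source table, for a short-term experiment.
CREATE TABLE sales_experiment SHALLOW CLONE sales;

-- Updates against the clone write new files under the clone only;
-- the source table 'sales' is never touched.
UPDATE sales_experiment SET city = 'Paris' WHERE city = 'PARIS';
```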
Scott Haines (05:47)
Which is a really good segue, I think, into thinking more about how long you want to retain a table for in general. I think your call-out is also good that this is for short-term experiments. It's also good if you're playing around with new table properties, or you want to test different ways of organizing the table's metadata, or maybe I care a little bit more about some specific table properties and I want to try something new. So it's kind of nice to have that playground to go test in that isn't affecting the actual table that could be responsible for production data, for example. So yeah, shallow clones are awesome. But the next one I think is also really cool too.
Youssef Mrini (06:32)
Yeah, the next one is deep clones. And deep clones are a bit different from shallow clones. Over here, we're going to copy the metadata and the data as well. So this one is pretty straightforward. You can specify the version you're going to deep clone. But the good thing is that it's going to create a new Delta log, which is good. And when should I use, for example, a deep clone? Suppose, for example, you want to do some data archiving. Say you're doing a lot of modifications on a specific table, but regulation specifies that you need to keep a snapshot every month. So what you can do, instead of having to keep all those different versions around, is just deep clone the latest version at, I don't know, the end of each month. Or if you want to, for example, copy the table to put it somewhere else, in another catalog or schema, you can do clones. And they are super, super efficient. I think it's really a great way to leverage those two features, shallow clones and deep clones. And of course, as the APIs in the screenshot show, you can use them with SQL and Python as well.
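A minimal sketch of the archiving idea in SQL, assuming a hypothetical source table called `sales` and an `archive` schema that already exists; the `VERSION AS OF` clause for clones may depend on your Delta/Spark version:

```sql
-- Full copy of data and metadata, e.g. a monthly archive snapshot.
CREATE TABLE archive.sales_2019_06 DEEP CLONE sales;

-- You can also clone a specific version of the source table.
CREATE TABLE archive.sales_v7 DEEP CLONE sales VERSION AS OF 7;
```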
Scott Haines (07:48)
Yeah, I think the other thing that's interesting, outside of data archiving, comes down to data migration itself. For people who have to do anything with a legal hold, that's another nice use for deep cloning, because you can take a snapshot of exactly what the table looks like at that point in time, go move it over into your long-term storage for, say, the seven years of the legal hold, and then continue to move on with your life, removing data from the original table as if you didn't have to keep track of anything else for the legal hold.
We also have a question from Alex, coming back to the shallow clone table. The question is whether the queries run against the original data set but show the affected rows in the shallow clone table. And that is correct. It's essentially kind of a pass-through. So you have your original table, let's call it table A. Table A has all of its Parquet for that specific version that you created the shallow clone from. The shallow clone will still be reading all those files, unless they disappear. And that's that whole problem: because the clone is metadata-only changes reflecting off of what the table was, if that underlying data changes, then you're kind of stuck. And so, yeah, it is supposed to be used for short-term experiments as well.
Youssef Mrini (09:02)
And something I want to mention for the clones: you can deep clone the same table multiple times. For example, you clone the latest version on, say, the 1st of April, then the same table on the 1st of May, the 1st of June, and so on. And then you can, for example, run describe history, and you're going to see all the different versions you've deep cloned, for example for legal reasons, or for the cases Scott mentioned. Clones can be a very, very good option.
Scott Haines (09:32)
For sure. Yeah, given that you do have that transaction log. All right, yeah, moving on. Take it back over.
Youssef Mrini (09:39)
Yeah, and I think this one, the log retention duration, is also a very important feature, because people tend to confuse log retention with another table property that we will be talking about in a few minutes. So log retention, by default, is fixed to an interval of 30 days, which means that for every single modification you're going to be doing on this Delta table, you're going to keep the logs there. For example, you updated the table, you deleted a row, you inserted new columns; you will be able to track those modifications, to track the logs, for 30 days. Which means if you want to go back in time for 60 days, you need to make sure to modify this log retention duration. If you want to keep the logs for two months, then you need to set it to interval 60 days. Very, very important. And it's what enables you to perform these time travel queries. But that's not the only requirement. We still have another requirement that we're going to be checking a little bit later.
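A minimal sketch of that property change in SQL, with a placeholder table name:

```sql
-- Keep transaction log entries (and the time travel window they enable) for ~60 days.
ALTER TABLE events
SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 60 days');
```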
Scott Haines (10:51)
I think it is a kind of buried lead right there for that. Yeah. Do you want to move on? Well, we can
circle back to this afterwards.
Youssef Mrini (11:01)
And also this one, the deleted file retention duration, is very, very important. Why? Because you're going to run into it. And this is something you also need to keep in mind: you need to run the vacuum, because the system does not run the vacuum on your behalf. So all the old snapshots you created over time, for, I don't know, two or three years, are not dropped. You need to make sure that you run this vacuum to get rid of all those files.
But to do so, what you can do is specify this deleted file retention. Here you're saying that the minimal duration to keep those old snapshots will be seven days. It doesn't mean that data created one week ago will be dropped. No, no, I'm talking just about the snapshots. By default, it's set to an interval of one week, but you can, for example, put two weeks, and whenever you run the vacuum, it's automatically going to keep just the last two weeks of snapshots. So this is something you need to keep in mind. This applies in the case where you don't specify the option when running the vacuum. And since people don't do this very often, make sure to specify this interval and make it suitable for your project. And keep in mind that the vacuum is very important, because at the end, if you're storing your data on a data lake, whether it's ADLS or an S3 bucket or GCS, you're going to end up paying some money for nothing. So that's why I think you should leverage this deleted file retention.
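Here's a short sketch of both pieces in SQL, again with a placeholder table name; the retention interval and the `RETAIN` override are just example values:

```sql
-- Keep files that are no longer referenced by the current table for two weeks
-- before VACUUM is allowed to physically delete them.
ALTER TABLE events
SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 14 days');

-- Remove unreferenced files older than the retention threshold.
VACUUM events;

-- Or override the retention for a single run (expressed in hours: 14 days = 336 hours).
VACUUM events RETAIN 336 HOURS;
```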
Scott Haines (12:44)
I think, yeah, with that, also going back to the log retention, it's an interesting interaction. If you want to go back to a specific point in time, and you have a deleted file interval of one week and a log retention period of 30 days, and you want to go back 30 days but you've deleted files, there are certain versions that you're not going to be able to get back to. And a lot of times this happens because you don't have deletion vectors on, so you're removing specific files from older Parquet.
Every time you have an operation that is changing or mutating data at a specific point in time, it's going to create another Parquet file, and essentially it's going to orphan the older Parquet file; it's not going to be linked to the new current snapshot of what the table looks like. The table no longer references that file. And so in a lot of cases, if people don't clean up the older files, you'll have this entire earlier life of the table that no one knows about anymore. So in order to balance what you have in your metadata against what you have in the transaction log, and then against what you have in terms of orphan files that are no longer connected to the table, you have to play with both of these settings together. It's another thing that can also be tested with deep cloning and shallow cloning, just testing the different properties, if you want to be able to go back and say, I want to try different settings for time travel. Because in that case, you can continuously just create new tables to test with in a safe way, which is fun.
Youssef Mrini (14:18)
And the next one is more about the optimization side. So over here we have what we call dataSkippingNumIndexedCols. This feature is, let's say, very important, because that's how Delta captures the statistics that help the engine improve data skipping when running a query. And by default, it's fixed to the first 32 columns. But suppose you have, I don't know, a table with 50 columns, and the most important ones are, say, the first three. You can just modify it and set dataSkippingNumIndexedCols equal to three, to make sure that you don't run extra compute just to capture statistics for columns that you won't actually leverage or benefit from. This property is very, very important. It comes into play when you run optimize, and you're going to see that there are a bunch of other properties linked to this one that improve the efficiency of your queries.
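A minimal sketch of that setting in SQL, with a placeholder table name and an example value of three:

```sql
-- Collect file-level statistics (min/max, null counts) only for the first three columns.
ALTER TABLE events
SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '3');
```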
Scott Haines (15:30)
Yeah, I'm glad you brought up the whole notion of: if you only need three, use three. You can also turn off data skipping if you're just doing things like append-only tables and you don't need to do anything special with them, if you're just using a snapshot mechanism saying this is the table as is, we're never going to do anything with it, it's kind of a sealed data set. But I think it is really important to think about the fact that there is a cost associated with 32 columns' worth of statistics, especially if that is being calculated on every single micro-batch, if you're doing things in streaming, and/or on every single gigantic batch, if you're doing large batch-style operations. If you're never going to use it, there is that compute cost. There's also a storage cost to keep track of the statistics over time, which is important to know. So let's actually come up with: what's the cost of collecting statistics on too many columns? It's just time, right?
Youssef Mrini (16:29)
A larger optimize. So optimize will take more time, because you need to capture, I don't know, the min, the max, and so many other metrics. The more columns, or the longer the columns, the longer the optimize will take. And if you're paying for the compute, then it's going to cost a little bit more.
Scott Haines (16:49)
And I think one last thing before we move on from this one, the first 32 columns, if you have a
nested table, will essentially do a depth first traversal. And so you'll be going down versus
across within the table itself, which is something that's sort of a gotcha, depending on how
people are creating their tables. So if people are enveloping specific data across a table with
kind of deeply sub-nested columns, you might be surprised at which columns you're actually indexing versus which ones you think you're indexing, which is just an important thing to keep in mind.
Youssef Mrini (17:27)
And I think we have another question: does it affect querying as well? Of course, if you're not going to be using those columns, it doesn't make any sense to capture statistics on them, because you want to be benefiting from those statistics. But if you know that you're going to be, I don't know, running queries on the last column of your table, which is, I don't know, column number 30, then yeah, including it can be a good option. We have other optimization techniques that we will show you in a bit that can help you reduce the number of columns that you'll be capturing statistics on.
Scott Haines (18:08)
Yeah, and I think as well, for the question of whether it affects querying: it definitely affects querying too. There are certain cases where, if the entire query can just be pushed down to the statistics themselves, you don't have to open any files before knowing which files can be returned for an individual query. So think about partition pushdown, predicate pushdown: anything that can be pushed down to the actual statistics themselves is beneficial for speeding up your queries. And so for very large queries, if you don't have the stats columns, that will be a lot slower of an operation, because you have to scan a lot more data. So it is very much something that will help speed up those queries as well.
Youssef Mrini (19:03)
And yeah, each nested field also counts as an additional column, and you need to pay attention to this.
Scott Haines (19:12)
Yep, 100%.
Youssef Mrini (19:16)
And the next one, yeah, I wanted to talk about ordering, like updating the column order. It's important because you can link it to the previous one. So let's suppose all the interesting columns that you're going to be using are, I don't know, located in the last columns of the specific table. What you can do is alter the order of those columns, if you want, to make sure you push them to the beginning of the table. Then you can, for example, say, hey, capture the statistics only for the first eight columns. But that's not the only use. I've seen some customers and users who change the order of columns just for the sake of readability, because they put the date at the beginning, because this is how they query the table in their head. Like, I want to filter by country, for example France, and then by city, then by date. So they order the columns in a way that helps them read them. But of course, you can also change the order of columns to make sure you put the interesting ones at the beginning, and then when you set the num indexed cols, you specify two, three, four, depending of course on the number of columns that are important to your queries.
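A minimal sketch of reordering in SQL, assuming made-up table and column names; note that `ALTER COLUMN ... FIRST/AFTER` generally requires column mapping to be enabled (covered a bit later), and on older Delta versions that may also mean bumping the reader/writer protocol versions:

```sql
-- Column mapping makes renames and reorders metadata-only operations.
ALTER TABLE events
SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name');

-- Move the columns you care about to the front of the schema.
ALTER TABLE events ALTER COLUMN event_date FIRST;
ALTER TABLE events ALTER COLUMN country AFTER event_date;
```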
Scott Haines (20:38)
Yeah, and I think for people who haven't used Delta before and are coming just from standard Parquet, this is an operation that was unavailable on plain Parquet itself, because you had ordinal offsets for columns. And so if you were querying specific columns and they got reordered, then your older queries would fail because the columns would no longer be in the same spot, which is why that's not a thing Parquet allows you to do. In this case, with Delta, you have an overlay over that, which is just saying: this human-readable column name is associated with this index. And then over time, this is where column mapping comes into play. I think we have another slide on column mapping as well. That's where this comes in.
Youssef Mrini (21:26)
And the next one, yeah, this one is also linked to what we just mentioned. It's dataSkippingStatsColumns. This functionality will override the previous one about the number of columns for which you capture statistics. It can help you if you don't want to reorder the columns on your table: what you can do is just specify the columns where you want to capture statistics. So you specify, for example, I want to capture column one, column three, column ten. You target those columns and make sure that you are not paying additional cost to compute statistics for non-important columns. And you don't have to change the order of columns, which might be used by another tool that could break if you change the order. So this one can simplify your life. It's the Delta dataSkippingStatsColumns property, and you just name the columns in it. It will help you optimize query performance in a very simple and efficient way. It's mainly used for wide tables, because for small ones you can use the number of columns, but this one can be very, very interesting.
Scott Haines (22:38)
Yeah, and I think the other interesting thing here is that, given that the columns also support nested-column dot notation, if you want to specifically index on sub-columns of a nested struct, you can do that as well. It really just depends on the style you want to move forward with; just make sure that you don't break it when you're changing the schema. In this case, if those columns go away, it's just going to stop keeping statistics on them, but that's something worth noting as well. You can think about data skipping stats columns as being essentially your required stats columns, which is awesome. Okay, let's see. We'll also take a break for just a sec. We're going to answer a couple of different questions as well. All right, so let's see.
So yeah, the slides will be available at the end. We can post them after this is complete as well, just in the comments. Let's see.
OK, so: does updating column order require rewriting the entire table in Delta? And no, this is just a mapping.
Youssef Mrini (23:55)
Yeah, just the metadata behind the scenes. You're not rewriting the data, just updating the logs.
Scott Haines (24:03)
Yep, metadata-only is a great way to reduce the overhead of a lot of operations. Awesome, good question. Thank you.
Youssef Mrini (24:19)
I think we have another question. Could you elaborate a little bit more on how to move such columns to positions beyond the index specified by this property? So if we go back to the previous slide: let's suppose your table has 50 columns. If you run optimize by default, you're going to capture statistics for the first 32 columns. But what you can do is move the important columns to the left. And then, if you move five columns and those are the important ones, you can just say, okay, capture statistics for the first five columns. And that's how you can reorder them. You can use, as I mentioned, alter table, alter column, and put this one first, then this one second. Of course it's going to take some time, because you need to do it for every single column, or you can move multiple columns at the same time. So that's what we meant by reordering the columns.
Scott Haines (25:30)
And I think the other interesting thing there would be, if you've already captured columnar statistics and you want to go back again in the future, they do have optimize full as an option. So if there have been large changes made to the column mapping and you want to go grab stats for the whole table now, you basically just re-optimize. There's also the analyze functionality as well. I know we don't have a slide for that, but take a look at Delta Lake's analyze; ChatGPT will show you exactly what to do.
Youssef Mrini (26:03)
Yeah, and I've tested analyze before; it's super interesting, because you can run describe extended or describe detail on a specific column. You will see that you don't have any statistics, but once you run the analyze, you're going to see, for example, for a specific date column, the min, the max, the number of nulls, and so on. And this analyze helps the Spark execution plan to be more efficient. So data skipping helps avoid having to read all those files, and analyze helps the Spark execution plan. That's why analyze can also be very important to use.
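A small sketch of what that could look like in standard Spark SQL, with placeholder names; Databricks also has Delta-specific variants of this command, so check the documentation for your platform:

```sql
-- Recompute column-level statistics used by the optimizer.
ANALYZE TABLE events COMPUTE STATISTICS FOR COLUMNS event_date, country;

-- Inspect what was collected for a single column (min, max, null count, distinct count).
DESCRIBE EXTENDED events event_date;
```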
And the next one, I think this feature is very important. It was introduced, I think, a year ago: deletion vectors. Basically, it introduces merge-on-read behavior. As you know, a Parquet file is immutable, which means whenever you modify anything, it's going to create another file. And if you are doing a lot of merges, insert and update, insert and update, you're going to end up spending a lot of time just rewriting the same file multiple times over. So deletion vectors work in, let's say, a smarter way, because they introduce a bitmap marking the rows within a file that need to be deleted. It doesn't rewrite immediately; it waits until a bit later, when it's required. Which means rather than recreating the files four or five times, the rewrite can be done only once, because the rows are marked and you don't need to rewrite those files every time. This feature is very, very important in case you're doing, as I mentioned, a lot of frequent deletes, updates, and merges, because you significantly reduce the execution time of those specific workloads. So you need to make sure to enable deletion vectors.
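A one-line sketch of enabling it on an existing table (placeholder name again); newer Delta releases may already enable this by default, so treat it as illustrative:

```sql
-- Turn on merge-on-read style handling of deletes and updates for this table.
ALTER TABLE events
SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true');
```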
Scott Haines (28:14)
Yeah, this also saves a lot. So aside from the other benefits, if you're cost conscious, you could test both of these. Without deletion vectors, with that whole number of files, there's just a lot more that needs to be vacuumed and cleaned up after the fact, because you're going to be orphaning a lot of individual files, especially if there's no Z-ordering or any other kind of clustering involved with the table itself. A lot of the point deletes end up being extremely expensive. If I'm removing two or three rows per operation, it can significantly increase the cost of just operating a table in general.
Hold on. So there is another question. It wasn't really a question, more just: optimize does not trigger analyze. And that's correct. The statistics are captured as part of writing: whenever we're doing an add files or any kind of operation like delete files within each individual transaction, the end of the transaction is where the statistics are captured, before the commit occurs. Optimize is what we want to run when we're going to remove orphan data or compact older snapshots of the table over time, and analyze only if we've missed capturing statistics somewhere. So for example, if we weren't capturing any statistics and then we wanted to turn on statistics capture, we'd have to go back and either get statistics for the entire table, so go back to the earliest point in time and analyze the table, or say, okay, it's fine, from this point forward I want to start capturing stats. I can change my table property and then start capturing stats from that point in time onward, anytime a new batch is appended to the table or any other operation is applied to it. So there's a couple of different ways of doing that. Thanks for that call-out as well.
And let's see, we have another one. So: let's say we have a streaming job which continuously keeps adding data to the Delta table in append-only mode. If the optimize is happening automatically, then which compute will it be using? If it's a job cluster, then the optimize is going to run on the job cluster. I'm guessing this is about the Databricks auto-optimize functionality. And if auto-optimize is running, it's going to be doing compaction roughly every 10 jobs if it's running in streaming mode. So you'll see an individual job that runs. Thinking about this from a workflows perspective, there'll essentially just be a new job inserted into that flow that will run that compaction as well. It really depends on whether you have optimized writes on, or just auto-compact on as well.
Youssef Mrini (31:22)
I think the default one is 256 megabytes per file, if I'm not wrong.
Scott Haines (31:28)
And that can be modified with different settings as well. Yeah, and the second part of the question is: if it is running on the job cluster, does it mean it will also increase the overall latency of the streaming job, since it'll be running optimize after each micro-batch execution? It will reduce your ability to go as fast as possible, but you're paying it forward, I guess, is the best way to say it. For anybody who's downstream of that table, if they would otherwise have to suffer under the small file problem or anything else like that, depending on how quickly the streaming job is actually appending new batches, it's better to solve the small file problem once and then have everybody downstream of your streaming job benefit from that as well. So while it does increase the overall latency, it's only at one point within the lineage.
Youssef Mrini (32:33)
I think we have an interesting question about whether the vacuum also removes the Delta logs. Basically, no. The vacuum will only remove the Parquet files; it does not remove the logs. The logs are deleted automatically, and that's done asynchronously (I always struggle with this word) by the Delta Lake engine. And keep in mind that for the logs, you specify how long you want to retain them. Keep in mind that, as Scott mentioned at the beginning, whenever you write 10 commits, it creates a checkpoint. Think about it: every time you run a command, a modification, it's a new JSON file. And what if you run 10,000 of them? That's going to be a lot of JSON files. So it's handled automatically: after every 10 commits it creates a checkpoint, and the older JSON files are eventually cleaned up automatically based on the log retention. But the Parquet files are not dropped; you need to run the vacuum.
Scott Haines (33:44)
Awesome. I think we're good. We're good with questions right now, so we can move forward.
Youssef Mrini (33:53)
And yeah, the next one, I think, is also an underestimated feature: the change data feed. It's truly a simple way to run CDC on Delta. Just keep in mind that you should enable this feature only when it's needed. Why? Because it will increase the size of the table, because of all the data it retains behind the scenes. But it's a very cool feature, because if you don't enable it, you can see the different versions of the table, but you cannot find out what changed in the meantime. With this one, you can just run a select and specify the versions, for example starting version one and ending version three. And you're going to see two different rows: the first row, for example, Youssef, Mrini, V1, and then Youssef, Haines, V2. So you know that, okay, what changed is the last name, and you know that this is V2. You can also see the type of operation, whether it's an update, insert, or delete. So you can filter by a specific version, or filter by a specific operation, to propagate those modifications to another table. It's really straightforward, but just keep in mind: do not enable it everywhere; make sure to enable it only when it's needed.
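A short sketch of both sides, assuming a hypothetical `users` table; the `table_changes` function shown here is Databricks SQL syntax, while open-source Spark readers use the `readChangeFeed` option on a DataFrame read instead:

```sql
-- Start recording row-level changes for this table.
ALTER TABLE users
SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');

-- Read the changes committed between versions 1 and 3,
-- including _change_type (insert / update_preimage / update_postimage / delete).
SELECT * FROM table_changes('users', 1, 3);
```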
Scott Haines (35:23)
100%. I think with a lot of these things, you have the ability to opt into these specific features when they're necessary. But for anybody who's had to do CDC any other way, the change data feed makes that a lot easier transactionally, given that the upstream table only has to enable it for the downstream to be able to consume any of those changes. There are a lot of other ways of tracking changes that become a lot more problematic. Given that this is built directly into the protocol, and it's also built directly into the Spark ecosystem, for example, I can go and read the change feed, and that makes it a lot easier than having to just read the table and hope that somebody has enabled some sort of homespun tracking themselves. Anytime something is at the protocol level, it's a lot easier for everybody because it's standard. And so it's very nice in general. Okay.
Youssef Mrini (36:29)
So we have a question about liquid clustering. I think we're going to keep it until the end, since we have a topic about it. Yeah. And this one, column mapping: when it was released, I think, I don't know, maybe two years ago, it was a very interesting feature. Why? Because it helps you avoid rewriting the tables. I can give you a simple example. In the past, if you wanted to rename a column or drop a column, you had to rewrite the whole table. And what if the table is, I don't know, one or two terabytes big? It's going to cost you a lot of resources to do so. But once you enable column mapping, you can do this, and those modifications are applied at the metadata level, which makes them very easy to make. And this feature is also important because it's needed to enable some other table properties; I think deletion vectors require column mapping, if I'm not wrong. It's a very important feature. And that's not the only thing. With column mapping, again, it's not recommended, but you can also have spaces or special characters in the names of columns. But keep in mind, be cautious, because if you have a space in the middle of, I don't know, two words, you know that you have a space, but what if the space is at the end of the column name? So be cautious when you're using special characters. Spark does not normally allow them, but with column mapping you have this option.
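A minimal sketch of enabling column mapping and then doing metadata-only renames and drops; the table and column names are made up, and the explicit protocol version bump may not be needed on newer Delta releases:

```sql
-- Enable column mapping so renames and drops become metadata-only operations.
ALTER TABLE events SET TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.minReaderVersion'   = '2',
  'delta.minWriterVersion'   = '5'
);

-- These no longer rewrite the underlying data files.
ALTER TABLE events RENAME COLUMN user_country TO country;
ALTER TABLE events DROP COLUMN legacy_flag;
```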
Scott Haines (38:18)
Yeah, and for this one, it was writer protocol version five where this came out, so I think this was around Delta 2.3 or so. It's been around for a while, and I know we covered it earlier in the talk as well. But back in the day, with just native Parquet, it was a lot harder to do these things, because you didn't have a map over where the columns were. There was no interface between the offset of the column, zero through N, and the actual human-readable name that you would want to apply to it. So it just made our lives a lot easier. Awesome.
Youssef Mrini (39:04)
And this one was also, I think, super, super interesting: allowing column default values. Basically, you can create a Delta table and specify a default value for each column, which means whenever, for example, someone is doing an insert and is not filling in that specific column or those columns, you're still good to go. You don't need to fill them in. And you can also think about it this way: suppose you have a table, and every time someone does an insert, it should capture, for example, the current date. What you can do is just specify the current date as the default value for that specific column. Then whenever someone does an insert, it automatically captures this current date or timestamp. It also avoids having those empty columns that can maybe cause issues for some systems, depending of course on whether you're reading the data from Spark or any other engine. So for example, you can specify the age, I don't know, the median, or a default value for some car maker. Just for the sake of having clean data in your tables, I really advise you to use it, because it can also help you clean your data sets.
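A rough SQL sketch of that pattern, assuming a hypothetical `orders` table; the table feature that allows defaults has to be enabled first, and the exact property name and supported default expressions can vary by Delta version:

```sql
-- Allow column defaults on this table (a Delta table feature).
ALTER TABLE orders
SET TBLPROPERTIES ('delta.feature.allowColumnDefaults' = 'supported');

-- Any insert that omits ingestion_date will now get the current date automatically.
ALTER TABLE orders
ALTER COLUMN ingestion_date SET DEFAULT current_date();

INSERT INTO orders (order_id, amount) VALUES (1, 42.0);
```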
And this generated column feature, I think, was super, super cool. I used it first when I was partitioning my tables. Basically, I remember I had a birth date, and it was a timestamp. So what I did was compute additional columns: the month of birth and the year of birth. Those columns were computed from the birth date, and then I was using those two columns to partition my tables. And you have two options with those generated columns. You have generated always, which means the value is always going to be generated, or you can generate it by default only if the column is not filled with data. So this one can be very interesting: you can compute columns without having to add additional logic behind the scenes. I use it a lot, and I think we have two features here: generated columns, and the next one is identity columns. But keep in mind that this feature is very efficient, and you can also use it to partition your tables.
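Here's a minimal sketch of that birth-date example in SQL; the table and column names are placeholders, and the generation expressions are limited to the deterministic functions Delta allows:

```sql
-- birth_month and birth_year are always derived from birth_date,
-- and can also be used as partition columns.
CREATE TABLE people (
  id          BIGINT,
  birth_date  TIMESTAMP,
  birth_month INT GENERATED ALWAYS AS (MONTH(birth_date)),
  birth_year  INT GENERATED ALWAYS AS (YEAR(birth_date))
)
USING DELTA
PARTITIONED BY (birth_year, birth_month);
```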
Scott Haines (41:53)
I think the other thing that's really interesting too from the last two slides is that in a lot of cases,
there's a lot of things that the protocol itself can do, like writing into a Delta table can do
automatically for you, that removes a lot of the work that occurs in a lot of ETL pipelines. It's like, I'm getting specific values, and I want to expand on those values to create these subset columns. If you think about that being a typical day in the life of a data engineer: if you can automate that, then there's no need to ever be
like, what's our data quality metrics for this? How often are we missing month of birth, year of
birth, or anything else like that? It's like, well, if you have that timestamp, who cares? You can
take that timestamp and you can go and you can automatically always have those values you
need. And so it really reduces a lot of that level of effort for utilizing Delta in general.
Youssef Mrini (42:51)
Exactly. And the next one is tightly linked to the previous one: identity columns. I find this one pretty interesting because you can generate unique values. You can think about it, as I showed in the example, as creating, let's say, primary keys. You can specify, for example, that it starts with zero and increments by one, by two, or whatever you're thinking of. Just keep in mind that when you're using identity columns to create unique keys, the generation cannot be parallelized; it's handled on a single node, because for this specific case it needs to keep track of the values. But trust me, I've used it with 30,000 values and it created the unique values. It can be a little bit slow because it cannot be parallelized; that's because it needs to make sure that those keys are unique. But you can use it as a primary key, and you're sure that you have generated unique values.
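A small sketch of an identity column used as a surrogate key, with made-up table and column names; availability of this syntax depends on your Delta/Databricks version, and the generated values are unique but not necessarily consecutive:

```sql
CREATE TABLE customers (
  -- Surrogate key filled in by Delta on insert.
  customer_sk BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 0 INCREMENT BY 1),
  name        STRING,
  country     STRING
)
USING DELTA;

-- Inserts never supply customer_sk; Delta generates it.
INSERT INTO customers (name, country) VALUES ('Youssef', 'France');
```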
Scott Haines (43:56)
Well, I think this is something that's really interesting too, because for a long time this wasn't a feature, and a lot of people asked about it, right? It's like, how do I treat my Delta Lake table like it would be treated in a traditional transactional database? In this case, you now have that ability. And the nice thing too is that being able to call out specifically, within the metadata, what my primary keys are, or what these identity keys, like surrogate keys, etc., are, also helps people attribute that to something that will always be there. It's always generated, and it's something I can always refer to. So it's a kind of stronger key, right? I will always have this; it's always required; it's generated for me. And so in the future, you can join easily off of those rows themselves. And the nice thing too, going back to the stats columns, is that being able to keep track of our min and max values across our primary keys gives us the ability to basically skip those offsets as well. So that really helps with retrieval of specific points where we have these identity columns.
Youssef Mrini (45:10)
And last but not least, we have liquid clustering. It's a modern way to optimize your data layout, and I find it interesting because it can replace traditional partitioning. In case you're not sure which column you're going to be using, maybe for example you partitioned by date now, but later you want to add, say, country, you would need to rewrite the underlying data. While with liquid clustering, you can just run alter table, cluster by, and add two, three, four columns. I think liquid clustering is also very important because when you run optimize, the optimize runs only on the incremental data, and this is super, super efficient. You won't need to spend hours waiting for the cluster to compute the optimize, specifically on super, super large tables. You can, of course, create those liquid tables easily. You can alter an existing table and convert it to liquid clustering, but keep in mind that if it's a partitioned table, then you would need to use CTAS to create a new one. If your table is not partitioned, you don't need to rewrite the underlying data. It's very efficient, and the order of the clustering keys does not matter. But keep in mind that you shouldn't be using five columns as the clustering keys. Think of it like using partitioning plus Z-ordering: you wouldn't partition by, I don't know, five columns. You're going to be using maybe three columns, and the last one can be the Z-order. Those are the kinds of columns that are good candidates when it comes to choosing the clustering keys.
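A minimal sketch of the liquid clustering lifecycle in SQL, with placeholder table and column names; `OPTIMIZE ... FULL` comes up later in the discussion and may not be available on every Delta version:

```sql
-- New table clustered by keys instead of Hive-style partitions.
CREATE TABLE events_clustered (
  event_date DATE,
  country    STRING,
  user_id    BIGINT,
  payload    STRING
)
USING DELTA
CLUSTER BY (event_date, country);

-- Clustering keys can be changed later without committing to a layout up front.
ALTER TABLE events_clustered CLUSTER BY (event_date, country, user_id);

-- Incrementally clusters only new or not-yet-clustered data.
OPTIMIZE events_clustered;

-- Re-cluster everything after changing the keys (can be expensive on huge tables).
OPTIMIZE events_clustered FULL;
```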
Scott Haines (47:13)
And then we also had two other sessions that we did before, if people want to deep dive further into liquid clustering: we had the Beyond Partitioning session, and then we had the hybrid clustering session also, where we went into a lot more detail on the how's, the why's, the do's, and the don'ts of using liquid clustering itself. So with that said, we also have a ton more questions that are coming through, and this is a perfect spot for those too. All right, so let's go back down. Let's see, where do you want to pop into?
Youssef Mrini (47:51)
Hold on one sec.
Okay, so the first one is very interesting. It's the ingestion time clustering.
So for ingestion time clustering, basically what happens is that whenever you're writing some specific data, that data is written at a specific time. So it's as if the table is clustered or optimized by default, because it's ingested at a specific time, for example, I don't know, at 2 p.m., 3 p.m., 4 p.m., 5 p.m. It's like being, let's say, partitioned. In the past, Spark, because of some properties, didn't respect the order of insertion, but now they make sure to keep the same order as much as possible, including, I think, when you have merge operations as well. But that's how it works. Since the data is ingested at a specific time, it's effectively ordered. It's like having natural data skipping, because if you filter the data where, I don't know, the hour is equal to four, you know that you don't need to scan the whole table; you just go to the specific files. So that's what we mean when we talk about ingestion time clustering.
Scott Haines (49:18)
All right. And there was also the question about whether the data skipping stats are leveraged for Spark joins and Delta table merges. We didn't talk about it for Spark joins in general, but anytime you have the indexed columns, because you can easily prune which files need to be read, it does help with those joins. The same thing with merges: as long as you have that data in the Delta table stats, it speeds up that whole process. So think of it as a lighter-weight index, but it still helps in the same way that an index column you're joining on would speed up a join in a traditional database. All right. Let's see.
There's the question: for the identity columns and generated columns features, where exactly can we find the functionality being applied on the table? So for this one, for Rama: if you go and create a brand new table and you specify a column as generated always as, you can apply pretty much any supported expression you want in there. Once you have the generated always as, anytime you add any row of data, and you can do this all within a notebook, a Jupyter notebook, it doesn't really matter where you're doing it, just add a row of data and only fill in the columns that you need to fill in. Everything else will magically just show up. So for example, a lot of times if you're reading data from Kafka, you have the timestamp from the Kafka broker from when new data was written. If you're using that to create a partition key, like with the old Hive-style partitioning, you can always generate a date column to project that timestamp into a date value as well. And then that column will always show up with your table, essentially utilizing another known column that will always exist. So you don't have to explicitly do anything. You can ignore it, pretend that that column doesn't even exist, and it'll always just show up for you.
All right. Let's see.
Youssef Mrini (51:47)
I think another one is: is there any way to prevent optimize from causing concurrency errors? So basically, if your table is partitioned, you can make sure to optimize only some specific partitions. And if you are using liquid clustering, you have row-level concurrency, which was introduced with liquid clustering and which also reduces the chance of facing those concurrency errors.
Scott Haines (52:17)
Yep, but in certain cases with that too, and we covered a little bit of this when we were talking about hybrid clustering before, there are kind of edge and corner cases for a lot of things. If the process is running and it's optimizing, the way that Delta works is multi-version concurrency control. So you'll essentially be doing an optimize flow on a specific version of the table, and if new additional changes have been made in the next version, then that next version being applied to the table won't be applied into the current optimization flow, but it will be the next time the optimize process is run, because it's incrementally applied. So it's not going to stop new changes from occurring while the current optimize flow is going, if that makes any sense. So yeah, there's no distributed lock, for example.
I don't know what some of the other questions are. Let's see. So there is a longer one, and we've got a couple minutes left. It says: having a complex job pipeline in Databricks where tasks are dependent on upstream tasks, a task only runs after its preceding tasks complete successfully. So we've got workflow one with a task called task A. We also have workflow two with task B, and workflow two depends on the successful completion of task A in workflow one. So for this second workflow two, no or auto. You're talking about, I mean, this is kind of like transitioning from an Airflow mindset to a Workflows mindset. If you had workflow one, which has task A in it, and then you have workflow two with task B, and task A from workflow one is the dependency for task B, the question I would propose is: can task B exist within that same workflow? Because if I have a workflow, call it workflow A, and task A is doing a lot of preparation for task B, then I can also keep task B in memory within the same executor within that workflow. And then the read is essentially free, because I'd have two tasks being processed within that workflow, as long as we don't run into memory constraints or anything else within that workflow. So you might actually speed up the pipeline by having task B just be dependent on task A finishing within a single workflow, without having to create any sort of inter-workflow dependencies, which always add more complexity to the jobs themselves. But yeah, there are different triggers in Workflows as well. So if you're waiting for a specific file to exist, there are different ways of triggering different jobs to run. But unless there's a need for the additional complexity, it'd always be easier to tightly couple that back into one workflow. That's my two cents.
Youssef Mrini (55:54)
I think we have a question, but I believe you might have answered it during the last session. In which scenarios should we use auto liquid clustering and hybrid clustering to get optimal runs? And with hybrid clustering, are deletion vectors auto-enabled, or should we enable them manually?
Scott Haines (56:17)
Yeah, so for this one, the interesting thing is that hybrid clustering was more of a name for the thought experiment of how we connect Hive-style partitioning, which is a type of clustering, and which can lead to a type of clustering like Z-order clustering or traditional bin-packing optimize, versus liquid clustering, which is more efficient than, say, a Z-order-like index on a table. For the hybrid workflows, that's essentially up to the user, right? We talked about it more as this kind of end-to-end lineage where we had specific tables that had essentially a close-of-books time period. So if we're waiting for changes for a certain amount of time, and we want to have write isolation within specific partitions, then that makes a lot more sense for classic partitioning, the Hive-style partitioning, because the optimizations we can make are very specific to specific partitions. When we're ready to have a long-lived table that can benefit from liquid clustering, then at that point in time the new table becomes liquid clustered. So we're essentially just changing the time to live for these different tables. And we have that up on YouTube as well, the hybrid clustering session, which goes into a lot more detail. But you would only add deletion vectors if you need them for those use cases. So in the case where you're overwriting a lot, you can have deletion vectors on the classically partitioned data. But if you're essentially moving into append-only for your liquid clustered table, you might not actually need deletion vectors if you're not actually deleting anything. And that's just another thing to keep in mind. If you're just deleting data that's older than, say, 120 days or 90 days, whatever, then deletion vectors aren't going to help as much, because you're not modifying individual Parquet files, which is what causes a lot of the trouble. You're essentially just truncating a table to a specific point in time. So for a lot of it, it's just contextual which solution to pick and go with.
Youssef Mrini (58:28)
And I think we can answer one more question and then we can close. I think it's related to liquid clustering again. Is traditional clustering more effective for batch workloads, and is liquid also a good option for streaming, since it has a dynamic approach?
Scott Haines (58:57)
I think for this one, do you want me to answer it?
Youssef, do you want me to pick this one up?
Youssef Mrini (59:05)
Yes, yes, please.
Scott Haines (59:07)
Cool.
No, I was like, all right, I could do this. So I think with this, when we talk about a dynamic approach for anything, right, it's like: do we need to make a decision up front? Is there a reason for us to think things through beforehand? In a lot of cases with the old style, which is not bad necessarily, but with Hive-style partitioning, you'd really have to think through: what's the layout of my table going to look like? What does it look like now? What does it look like in six months? What does it look like in a year? Because if it's a gigantic table, any kind of changes that need to re-lay out the entire table were costly. With liquid clustering, you don't have to cluster until your Delta table gets to, say, 20 gigabytes, 100 gigabytes. Then you can start thinking about the use cases in which you'd want to cluster differently so that your queries are faster. And that's something you can then incrementally apply to the table, which can help. And that's where the optimize full comes in. So if you haven't made a decision, then you want to go and change the cluster-by columns, and then you want to go back and reorganize the files for those clusters, that can be done in one large run, the caveat being that if the table is gigantic, that job might take a long time to run. That's it. I guess it's dynamic in that you have a choice that you can make later.
Youssef Mrini (1:00:43)
I think that's it.
It was awesome to be here with you. For those who posted questions on LinkedIn, I'll make sure to share the documentation and useful links with you. And make sure, of course, to download this awesome book written by Denny, Scott, Tristen, and also Prashanth, Delta Lake: The Definitive Guide. You can download your free copy, just scan the QR code or look it up on Google. I think you're going to find it easily, or go to the Delta Lake documentation as well; you'll find it there.
Scott Haines (1:01:19)
Awesome. Thank you so much for coming on today, Youssef, and for sharing the tips and tricks.
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
Procurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptxProcurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
Ad

Transcript - Delta Lake Tips, Tricks & Best Practices (1).pdf

  • 1. cott Haines (00:33) Hey, hello. Hello, everybody, and welcome to another episode. Today we're joined by Yousef Rani, Solutions Architect at Databricks, and all together just an awesome guy. Today we're going to be talking about Delta Lake tips, tricks, and best practices, and a lot of just wins from the day and the life of. So I'm going to kick it over to Yousef if you want to say hi to everybody and just give a good warm welcome. Youssef Mrini (00:58) Hi, I'm Youssef, I'm based in Paris and I'm a big, big fan of Delta Lake. I've been using it for, I think, more than four years. So I've been noticing all the revolution inside Delta for quite some time. Scott Haines (01:14) Awesome. Thanks for that, dude. So yeah, well, let's kick it off. We're going to go through some slides today, and then we'll take some questions from the audience. All right, perfect. So Yusef, I don't know if you want to take control. Yeah. We'll go through. Awesome, so yeah, so for the whole entire session today, we're essentially talking about kind of it's like, almost like a vast back to basics. So we're talking about tips and tricks, but a lot of this is kind of based on what are the things that people would need to do or want to do. You know, within the first couple months of working with Delta Lake and one of the big things that people usually need to first understand is like how is Delta Lake actually laid out? And so on the slide right now we just have, you know, essentially kind of table layout 101. So this is just a typical looking partitioned Delta Lake table. We've got partitions by event date. These are from way back in the day, from 2019. And all of our actual data files for Delta tables are just parquet. Then we have our transaction log, not our transition log, our transaction log, which is actually keeping track of what happens every time there's a commit to the table. And so this is what gives us our asset guarantees. This is what allows us to have multiple readers, multiple writers on the same exact table. It's essentially kind of coordinating things for us. Within the log, we have all of our metadata. We've got information about the table itself, what schema it has, what partitions it has, what other kind of table properties it has as well. And then we have essentially checkpoints. So as we're writing transactions periodically, We're creating snapshots so that it's easier for us to be able to go back to a certain point in time and or to load the most current version of an individual table. So you put it all together, you've got a Delta table. And now, on to the next one. Youssef Mrini (03:12) Yeah. And just to go back to what Scott was mentioning, my advice would be go and open the JSON files and see what you have inside. And then also open the checkpoints, which is the
  • 2. Parquet file, just to see how it looks. It will give you an idea how Delta tracks every single modification you're doing on your table, like by modifying, adding, you're to see the structures. interesting information. I think this is the basics, but make sure to have a look at it. And I guess for this one, I wanted to touch down on the Delta Lake clones because those are very awesome features. The first one is the shallow clone. I think it's an underestimated feature. This one is super, super important because it allows you to clone a table or just clone in just the metadata without having to copy the entire table. So let's suppose you have, I don't know, one terabyte table. You want to do some modifications on it. But you don't want to copy the entire data. So what you can do, can just shallow clone it. And then you can do any modification, and you are sure that you won't be modifying the system, the source table. So what will happen? If you modify, I don't know, do updates where value equal to, I don't know, to Paris, what will happen? The shallow clone table, only the modification will be added to the new table. The existing one will remain the same. And this is a know-some-what because you can use it for short-term experiments without having to pay large fees for copying the data or moving the data. And you can also see in the shallow clone table, like the new table that has been created, only the modification, which means if you update, for example, one row, the unique row that will exist in the shallow clone table will be the row that you have modified. And the existing one will be the same. But just keep in mind that if you modify the source table, the shallow clone table will remain the same because when you do shallow clone, it's on the latest version. For example, it can be version seven, which means if you modify or update the source table, it's going to be version eight, which means on your side, it's going to be the version. Just keep in mind that if you vacuum the table and remove, for example, get rid of the version you're talking, then the shallow clone table will no longer... exists or will be corrupted. Scott Haines (05:47) Which is a really good I think a kind of segue into thinking more about how long you want to retain You know a table for in general. I think it's also a good, you know your call out for like having this is like short-term experiments It's also good if you're playing around with like new table properties or you want to test, know different ways of like, know organizing the tables metadata or you know, like maybe I'm thinking about different You know, I care a little bit more about you know, some specific table properties or I want to try, know something new So it's kind of nice to have like that, almost like that playground to go test in. That's not affecting the actual table that could be produced, it could be responsible for production data, for example. So yeah, shallow clones are awesome. But the next one I think is also really cool too.
  • 3. Youssef Mrini (06:32) Yeah, the next one are deep clones. And deep clones are a bit different from shallow clones. So over here, we're going to copy the metadata and the data as well. So this one is pretty straightforward. You can specify the version you're going to deep clone. But the good thing is going to create a new delta log, which is good. And when should I use, for example, deep clone? Let's see, for example, suppose you want to do some data archiving. For example, you're doing a lot of modifications on a specific table, but the regulation specifies that you need to keep a snapshot every month. So what you can do instead of having to keep this copy, like all those different versions, can just, for example, deep clone the latest version on, I don't know, at the end of each month. Or if you want to, for example, clone the table, like copy the table to put it somewhere else in another catalog or schema. you can do clones. And they are super, super efficient. And I think it's really a great way to leverage those two features, shallow clones and delta-like clones. And of course, the APIs in the screenshot, you can use it with SQL and Python as well. Scott Haines (07:48) Yeah, I think the other thing too that's interesting, like outside of data archiving and like, I just come to data migration itself. For people who have to do anything with a legal hold, legal hold is also another nice thing for deep cloning. Cause you can take a snapshot of exactly what the table looks like at that point in time. Go move it over into your long-term storage for like the seven years for legal hold, and then continue to move on with your life, removing data from that table as if you didn't actually have to, you know, keep track of anything else for legal hold. We also do have a question from Alex. This is coming back like the shallow clone table. And so the question is like for a shallow clone table, like the queries on the original data set, but it shows the affected rows in the shallow clone table. And it is correct. like the, it's essentially kind of like a pass through. So you have your original table. So let's call it table A. Table A has all of its parquet for that specific version that you created the shallow clone from. The shallow clone will still be reading all those files. unless they disappear. And that's like that whole kind of problem where because it's kind of metadata only changes, going back and reflecting off of what the table was, if that changes, then you're kind of stuck. And so, yeah, it is supposed to be used for short-term experiments as well. Youssef Mrini (09:02) And something I want to mention for the clones, can clone, like you can deep clone the same table multiple times. For example, you're to clone the latest version on, for example, the 1st of April, then the same table on the 2nd and 1st of June and so on. And then you can, for example, do describe history. You're to see all the different versions you've deep cloned, for example, for legal reasons or for as Scott mentioned. It's going to be, clones can be very, very good.
  • 4. options. Scott Haines (09:32) for sure. Yeah, given that you do have that transaction log. right, yeah, moving on. Take it back over. Youssef Mrini (09:39) Yeah, and I think this one also has a log retention duration. I think it's also a very important feature because people tend to confuse log retention and another table property that we will be talking about in a few minutes. So log retention, by default, it's fixed to an interval of 30 days, which means like every single modification you're going to be doing on this Delta table. you're going to keep the logs there. For example, you updated the table, you deleted the row, you inserted the new columns, you will be able to track those modifications, to track the logs for 30 days. Which means if you want to track, for example, to go back in time for 60 days, you need to make sure to modify this log retention duration and put it, for example, if you want to keep the logs for two months, then you need to put here intervals 60 days. Very, very important. And it's the only way to help you perform these time travel queries. So this is, but that's not the only requirement. We still have another requirement that we're gonna be checking a little bit later. Scott Haines (10:51) I think it is a kind of buried lead right there for that. Yeah. Do you want to move on? Well, we can circle back to this afterwards. Youssef Mrini (11:01) And also this one, this deleted file retention duration is very, very important. Why? Because you're going to front onto that. And this is something you also need to keep in mind. You need to run the vacuum because the system does not run the vacuum on our behalf. So all the old snapshots you created across time, I don't know for two years or three years are not dropped. So you need to make sure that you run this vacuum to get rid of all those files. But to do so, what you can do, you can specify this deleted file retention. So here you're mentioning that you need the minimal duration to keep those, those old snapshots will be seven days. It doesn't mean that the data that was created one week ago will be dropped. No, no, I'm talking just about the snapshots and what you can by default, it's set to interval one week, but you can, for example, put two weeks and whenever you're going to run the vacuum automatically, it's going to keep just the last two weeks of snapshot. So this is something you need to keep in mind. This is in the case you don't specify the option when running the vacuum. But since people do not do this more often, make sure to specify this interval and make it suitable for your project. And keep in mind that the
  • 5. vacuum is very important because at the end, if you're storing your data on a data lake which is in an ADLS or an S3 bucket or GCS, you're gonna end up paying some money for nothing. So that's why I think you should leverage this, the listed file retention. Scott Haines (12:44) I think, yeah, I think with that also kind of going back to like the log retention, it's an interesting thing. If you want to go back to a specific point in time as well, and you have a deleted file interval of one week and you have a log retention period of 30 days and you want to go back for 30 days and you've deleted files or certain files that you're not going to be able to get back to. And a lot of times like this happens because you don't have deletion vectors on you're removing specific files from older parquet. And then that's every time you have an operation that is changing or mutating a specific point in time, it's going to create another parquet file. And essentially like, you know, it's going to orphan the older parquet file with the new, you know, it's not going to be linked to the new current like snapshot. Like what does the current table look like? It no longer references that file. And so in a lot of cases, if people don't clean up like the older files, you'll have like this entire, you know, this time, entire life of this table that no one knows about anymore. And so, you in order to, you know, balance what you have in like your metadata to balance what you have in the transaction log and then to balance what you have in terms of kind of like orphan files that are no longer connected to the table. Like you have to play with both of these together. So it's another thing that can also be tested with deep cloning, with shallow cloning. Just testing the different properties. If you want to be able to go back and say, I want to try different settings for time travel. Because in that case, you can continuously just kind of create new tables to test with in a safe way, which is fun. Youssef Mrini (14:18) And for the next one, it's more about the optimization side. So over here, we have what we call the data skipping num index calls. This feature is, let's say, very, important because that's how Delta will capture statistics that will help the engine when running the query to improve the data skipping. And by default, it's fixed to the first 32 columns. But you can, for example, if you have, you don't need to, for example, suppose you have, I don't know, a table with 50 columns, but the most important ones are like, I don't know, the first three ones. So you can just, for example, modify and put data, skipping num index calls equal to three. So to make sure that you don't like to run some computes, just to capture statistics for some column that you want to leverage or benefit from. But this feature is very, very important. It's part of when you run optimize. And you've got to see that you have a bunch of other properties that are linked to this one in order to improve the efficiency of your queries.
  • 6. Scott Haines (15:30) Yeah, I think I'm glad you brought up the whole entire notion of like, if you only need three, use three. Like you can also turn off data skipping. If you're just doing things like append only tables and you don't need to do anything special with it. If you're just using a snap, you know, a snapshot mechanism saying this is the table as is, we're never going to do anything with it. It's kind of like, you know, here's a sealed data set. but I think it is really important to be able to think more like, you know, there is a cost associated with like 32 columns worth of statistics. especially if that is being calculated on every single micro batch as well, if you're doing things in stream and or every single kind of gigantic batch, if you're doing kind of large batch style operations. And so if you're never gonna use it, like there is that compute cost to that. There's also storage costs to be able to keep track of the statistics over time, which is important to know. So let's just actually come up with like, what's the cost of collecting statistics on too many columns? It's just time, right? Youssef Mrini (16:29) Larger optimize. So optimize will take more time because you need to capture like, I don't know, the mean, the max and so many other metrics. larger are the bigger or longer are the columns, longer the optimize will take. And if you're paying for the compute, then it's gonna cost a little bit more. Scott Haines (16:49) And I think one last thing before we move on from this one, the first 32 columns, if you have a nested table, will essentially do a depth first traversal. And so you'll be going down versus across within the table itself, which is something that's sort of a gotcha, depending on how people are creating their tables. So if people are enveloping specific data across a table with kind of deep subnested columns, you might be surprised at what actual columns, your indexing versus what you think you want to be indexing as well, which is just an important thing to keep in mind. Youssef Mrini (17:27) And I think we have another question. Does it affect querying as well? Of course, if you want to be using those columns, it doesn't make any sense to optimize on those columns because you want to be benefiting from those statistics. But of course, if you know that you're going to be, I don't know, running queries on the last column on your table, which is, I don't know, the column number 30. Then yeah, it can be a good option. We have other optimizations techniques that we will show you in a few that can help you reduce this number of columns that you will be leveraging. Scott Haines (18:08)
  • 7. Yeah, but I think I think as well for that, like so for the question of like, does it affect querying? It definitely affects querying as well. So there are certain cases where if the entire query can be just pushed down to the statistics themselves, you don't have to open files before actually having all files that can be returned for an individual query. So you think about like there's like partition push down, like predicate push down, anything that can be pushed down to the columns, like to the actual statistics themselves. is beneficial for speeding up your queries. And so for very large queries, if you don't have the stats columns, then that will be a lot slower of an operation because you have to scan a lot more data. And so it is very much something that will help speed up those queries as well. Youssef Mrini (19:03) And yeah, the nested one also accounts as an additional column and you need to pay attention to this. Scott Haines (19:12) Yep, 100%. Youssef Mrini (19:16) And the next one, yeah, I wanted to talk about this ordering, like updating the column order. It's important because you can link it to the previous one. So let's suppose you have like all the interesting columns that you're gonna be using are, I don't know, located in the last columns of the specific table. So what you can do, you can try to alter if you want, of course, alter the order of those columns to make sure to push them at the beginning of the table. So you can, for example, mention, hey, capture the statistics only for the first eight columns. But that's not the only thing. I've seen some customers and users who change the order of columns just for the sake of readability, because they put the date at the beginning, because this is how they query this table on their head. Like, I want to, for example, filter by country, for example, France, and then by... by city, then by date. So they order the columns, make sure it helps them read them. But of course, you can also change the order of columns to make sure that you put those, the one that are interesting at the beginning. When you mention the num index call, you specify two, three, four, depending of course on the number of columns that are important to your queries. Scott Haines (20:38) Yeah, and this is also, think for people who haven't used Delta before and are coming just from like standard Parquet, this is an operation that was unavailable, like on just Parquet itself, because you had like ordinal offsets for columns. And so if you were querying by specific table, like columns, and they reordered, then your older queries would fail because they would no longer be in the same spot, which is also not a thing that Parquet allows you to do. And so in this case, like with Delta, you have an overlay over that, which is just saying like,
  • 8. This human readable column name is associated with this index. And then over time, this is where column mapping comes into play. And I think we have another slide on column mapping as well. That's where this comes in. Youssef Mrini (21:26) And the next one, yeah, this one is also linked to what we have mentioned. It's the data skipping stats calls. So this functionality will overtake the old one about the number of columns that you will mention to capture statistics. This one can help you get, for example, if you don't want to order, reorder the, or the columns on your table, what you can do, you can just specify the columns where you want to capture statistics. So you will specify, for example, I want to capture column one, column three, column 10. So you're to target those columns and to make sure that you are not paying additional cost to optimize non-important tables. And you don't want to change the order of tables because maybe it's used in another tool that can break in case you change the order of columns. So this one can simplify your life. It's the Delta data skipping stats calls. And you can merge in the columns. And it will help you. optimize the query performance in a very simple and efficient way. And it's mainly used for wide tables because if it's small ones, you can use the number of columns, but this one can be very, very interesting. Scott Haines (22:38) Yeah, and I think the other interesting thing to hear is that so given that like the given that the columns also support nested column like dot notation. If you want to be able to specifically index on sub columns of a nest, you can do that as well. It really just depends on the style that you want to move forward with and then just make sure that you don't break that when you're changing the schema and so. You know, they're specific kind of, you know, in this case, like if those columns go away, it's just going to stop keeping statistics on it. But just something that's kind of worth noting as well. You think about data skipping stat columns being essentially like, you know, required, you know, required columns, which is awesome. Okay. Let's see. We're also, we'll take a break for just a sec. We're going to answer a couple of different questions as well. All right. So let's see. So yeah, the slides will be available at the end. We can post them after this is complete as well just in the comments. See you. OK, so does updating column order require rewriting the entire table in delta? And this is just a mapping. Youssef Mrini (23:55)
  • 9. Yeah, just the metadata behind the scenes. You're not rewriting the data, just updating the logs. Scott Haines (24:03) Yep, metadata only is a great way to reduce the overhead of a lot. Awesome, good question. Thank you. Awesome. Youssef Mrini (24:19) I think we have another question. Could you elaborate a little bit more on how to move such columns to positions beyond the index specified by this property? So if we go back to the previous slide, what you can do, can, for example, let's suppose you have, I don't know, your table is 50 columns. You have 50 columns in this specific table. If you run the optimized by default, you're gonna capture statistics for the first 32 columns. But what you can do, you can move the important columns to the left, to the left. And then you can just, for example, if you move five columns, and those are the important ones, you can just say, okay, capture statistics for the first five columns. And that's how you can reorder them. And you can use, for example, as I mentioned, alter table, alter column, and put this one first. than this one second, but of course it's going to take some time because you need to do it for every single column or you can replace multiple columns at the same time. So that's what we mentioned by reordering the columns. Scott Haines (25:30) And I think the other interesting thing there would be if you've already captured columnar statistics and you want to go back again in the future, like they do have optimized full as an option. So if there's been large changes made to the column mapping and you want to now go grab stats for the whole table, you just go and basically re optimize. There's also the analyze functionality as well. I know we don't have a slide for that as well. But you take a look at just Delta Lake analyze. ChatGPT will show you exactly what to do. Youssef Mrini (26:03) Yeah, and the analyze I've tested before, it's super interesting because you can try to, you can run describe extended or describe details on a specific column. You will see that you don't have any statistics, but once you run the analyze, you're gonna see, for example, for a specific date, the mean, the max, the number of types and stuff. And this analyze will help the Spark execution plan to be more efficient. So the skipping will help avoid having to read all those files and the analyze help the data skipping the Spark execution plan. So that's why analyze can also be very important to use. And the next one, I think this feature is very important. It was introduced, I think, one year ago. It's deletion vectors. And basically, it's introducing the merge on read feature. Basically, as you
  • 10. know, a parquet is immutable, which means whenever, for example, you modify anything, it's going to create another file. And if you are doing a lot of merge, like, insert and update, insert and update, you're going to end up spending a lot of time just rewriting the same file multiple and multiple times. So, deletion vectors work with the, let's say, smarter way because it introduces a bitmap marking the files that need to be deleted. It doesn't do it immediately, but it waits a little bit later when it's required, which means rather than operating, like, don't know, recreating the files four or five times can only be done once because it's marked and you don't need to rewrite those files. And this feature is very, very important in case you're doing, as I mentioned, a lot of frequent delete, update, and merge because you significantly reduce the execution of those specific workloads. So you need to make sure to enable deletion vectors. Scott Haines (28:14) Yeah, this also saves a lot of, so aside from the other benefits as well, if you're cost conscious as well, you could test both of this. But without deletion vectors, that whole number of files, there's just a lot more that needs to be vacuumed and cleaned up after the fact as well, because you're going to be orphaning a lot of individual files, especially if there's no z-ordering or any other kind of clustering involved with the table itself. a lot of the point deletes end up being extremely expensive. If I'm removing two or three columns per operation, it can significantly increase the cost of just operating a table in general. Hold on. So there is another question, it wasn't really a question. more just, optimized does not trigger analyze. And that's correct. So the analyze functionality itself will go back and capture the statistics as part of writing, like whenever we're doing like an add files or any kind of operation, like delete files, anything else like that within each individual transaction, the end of the transaction is where the statistics are being captured before commit occurs. So optimize is what we wanna run when we're going to remove orphan data or older snapshots of the table over time. and then analyzes only if we have missed running any kind of analysis. So for example, if we had no, know, if we weren't capturing any statistics and then we wanted to go and turn on statistic captures, like we have to go back and, you know, either get statistics from the entire table. So go back to the earliest point in time and then go analyze our table and or say, okay, it's fine. Like from this point forward, I want to start capturing stats. I can change my table property and then start capturing the stats from that point in time onward. Anytime a new batch is actually appended to the table or any other operation is applied to the table itself. So there's a couple of different ways of doing that. Thanks for that call out as well. And let's see, we have another one. Let's see. So we have, let's say we have a streaming job which continuously keeps adding the data to the Delta table as append only mode.
  • 11. So if the optimizer is happening automatically, then which computer will it be using? If it's a job cluster, then it's gonna be in the job cluster, like optimized. I'm guessing this is like the Databricks auto-optimize functionality for this one. And that's gonna happen, like if auto-optimize is running, it's gonna be doing compaction like every 10 jobs if it's running in streaming mode. So you'll see like an individual job that runs. So I'm thinking about this from a workflows perspective. There'll be essentially just a new job that's inserted into that flow that will be going to run like that compaction as well. It really depends on if you have the right optimizations on or just kind of auto optimize like auto-compat on as well. Youssef Mrini (31:22) I think the default one is 256 megabytes per file, if I'm not wrong. Scott Haines (31:28) And so that can be modified with different settings as well for that. yeah, and so the second part of the question too is like, so if it is running on the job cluster, then does it mean it will also increase the overall latency of the streaming job as they'll be running optimized after each micro batch execution? And it will reduce your ability to go as fast as possible. But on the kind of the... You're paying it forward, I guess is the best way to say it. So for anybody who's like downstream of that table, if they were to go back and have to either read, you know, suffer under like the small file problem or anything else like that, like depending on like how quickly the streaming, the streaming job is actually appending new batches, like it's better to be able to, you know, solve the small file problem once and then have everybody downstream of your streaming job actually, you know, benefit from that as well. So while it does increase the overall latency, it's only at one point within the lineage. Youssef Mrini (32:33) I think we have an interesting question about whether the vacuum also removes the Delta logs? So basically no, the vacuum will only remove the parquet files. It does not remove the logs. So the logs are deleted automatically and that's asynchronous, I always struggle with this word, synchronously. by the Delta Lake engine. And keep in mind that the logs, you specify how long you are to retain them. Keep in mind that as Scott mentioned at the beginning, whenever you write 10 commits, it creates a checkpoint. And then those files are deleted because think about every time you run a command, you run a modification. It's a new JSON file. And what if you run 10,000 files? It's going to be a long number of JSONs. So it's done automatically. And after each 10, it creates a checkpoint. And then it deletes all JSON files automatically. But the parquet are not dropped. You need to run the vacuum.
  • 12. Scott Haines (33:44) Awesome. I think we're good. We're good with questions right now, so we can move forward. Youssef Mrini (33:53) And yeah, and the next one, I think also an underestimated feature is the change data feed. So it's truly a simple way to run the CDC on Delta. And it's just to keep in mind that you enable this feature only and when only it's needed. Why? Because it will increase the size of the table because all the data it retains behind the scene. But then It's a very cool feature because if you don't enable this feature, you're going to see the different version for each table, but you cannot find what did change in the meantime. But with this one, you can just run the command select, and you specify the version. For example, starting version one and version three. And you're going to see two different rows, like the first row, example, the Youssef, Marini, V1. And then you're going to see Youssef, Hanes. V2. you know that, okay, what they changed is the last name and it simplifies and you know that this is V2. So, and you can also see the type of operation, whether it's an update or insert or delete. So you can filter by specific version. You can filter by specific operation to propagate those modifications to another table. And it's really straightforward, but just keep in mind, do not enable it everywhere, but make sure to enable it only when it's needed. Scott Haines (35:23) 100%. I think with a lot of these things, it's, you opt in, you have the ability to opt into these specific features, you know, when they're necessary. But for anybody who's had to do like CDC any other way, like the change data feed makes that a lot easier, you know, transactionally, given that, you know, the upstream table only has to enable it for the downstream to be able to consume any of those changes. And so for a lot of like those change feeds, There's a lot like others, there's a lot of other ways of tracking changes that become a lot more problematic. And so given this is built directly into like the protocol and it's also built directly into like, you know, the Spark ecosystem, for example, like I can go and read the change feed and that makes it a lot easier than having to just go read the table and then, you know, hope that somebody has, you know, enabled, you know, some sort of, you know, homespun tracking themselves. Like anytime something is at the protocol level, it's a lot easier for everybody because it's standard. And so it's very nice in general. Okay. Youssef Mrini (36:29) So we have a question about liquid clustering. I think that we're going to keep it till the end since we have a topic about it. Yeah. And this one, column mapping, I think when it was released, I think, I don't know, two years ago maybe, it's a very interesting feature. Why? Because it helps you avoid rewriting the tables.
  • 13. Now, it can give you a simple example. In the past, for example, if you wanted to rename a column or drop in a column, you had to rewrite all the tables. And what if the table is, I don't know, one terabyte, two terabytes big? So it's going to cost you a lot of resources to do so. But once you enable column mapping, you can do this and those modifications will be propagated at the metadata level, which makes them very easy to be modified. And this feature is also very important because it's also, I think, super important to enable other table properties. I think the list vectors require column mapping if I'm not wrong. It's a very important feature. And that's not the only thing. You can also use column mapping. Again, it's not recommended. But you can have space or special characters in the name of the columns. But again, keep in mind, like be cautious because if you have space in the middle of, I don't know, two names, don't know, auto feature, you know that you have a space, but what if this space is at the end of the column? So be cautious when you're using special characters. Spark does not allow them, but with this column mapping, you will have this opportunity. Scott Haines (38:18) Yeah, and that was also for this one. It was protocol version five where this came out and so this was I think this is like a Delta 23 around them. So it's been around for awhile and I know we covered it earlier like in the talk as well, but back like back in the day like with just you know native parquet. It was a lot harder to do these things just because you didn't have. You know you didn't have a map over. You know where the columns were so there was no. kind of interface between the offset of the column, like zero through N, and the actual human name that you would want to apply to that as well. So it just made our lives a lot easier. Awesome. Youssef Mrini (39:04) And this one was also, think, super, super interesting, which is the allow columns default values. So basically over here, you can create a delta table and you can specify for each column a default value, which means whenever, for example, someone is doing some insert and is not, I don't know, filling the specific column or those columns, then you're good to go. You don't need to fill them. And you can also think about it. In a way, for example, let's suppose you have a table and every time someone is doing an insert, it should capture, for example, the current date. So what you can do, you can just specify the current date as default value for this specific column. And then whenever someone is doing an insert, it's automatically captured this time date, this current date or timestamp. And it also avoids, for example, having to provide, I don't know, to have those empty columns that can maybe cause issues for some systems depending of course whether you're reading the data from Spark or any other other engine. So for example you can specify the age, don't know the median or default value for
  • 14. some car maker. So just for the sake of having this clean data on your tables, I really advise you to use it because it can also help you clean your data sets. And this one, think, this generated column, I think it was super, super cool. And I used it first when I was partitioning my tables. So basically, I did, I remember I have a birth date, and it was a timestamp. So what I did, I computed three additional columns. That was the month of birth and the year of birth. And those columns were computed from birth date. And then I was using those two columns to partition my tables. And you can have two options with those generated columns. You have the generated always, which means you're going to always generate this. Or what you can do, you can generate by default only if this column is not filled with data. So this one can be very interesting, and you can make sure to compute columns without having to add additional things behind the scenes. So I use it a lot, and I think we have two features. It's generated columns, and the next one are identity columns. But keep in mind that this feature is very efficient, and you can also use it to partition your tables. Scott Haines (41:53) I think the other thing that's really interesting too from the last two slides is that in a lot of cases, there's a lot of things that the protocol itself can do, like writing into a Delta table can do automatically for you, that removes a lot of the work that occurs in lot of ETL pipelines. It's like, am getting specific values. I want to expand on those values to create these subset columns. If you think about that being a typical a day in the life of a data engineer. If you can automate that, then there's no need to ever be like, what's our data quality metrics for this? How often are we missing month of birth, year of birth, or anything else like that? It's like, well, if you have that timestamp, who cares? You can take that timestamp and you can go and you can automatically always have those values you need. And so it really reduces a lot of that level of effort for utilizing Delta in general. Youssef Mrini (42:51) Exactly. And the next one, it's tightly linked to the first one, identity columns. And this one I find it pretty interesting because you can generate unique values. You can think about, as I showed in the example, I create like, let's say, primary keys. You can specify the rank, for example, starts with zero and increments by one, by two, or whatever you're thinking about. Just keep in mind that when you're using identity columns to create unique keys. It cannot be parallelized. It's only a single node because for this specific one, it needs to keep the track of the values. But trust me, I use it with 30,000 values and it was created the unique values. But it can be a little bit slow because it cannot be parallelized. That's because you need to make sure that those keys are...
  • 15. are unique, but you can use it as a primary key and you're sure that you have generated unique values. Scott Haines (43:56) Well, I think this is something that's really interesting too, because for a long time this wasn't a feature and this was like a lot of people asked about this, right? Where it's like, how do I treat Delta Lake or how do I treat my Delta Lake table like it would a traditional kind of transactional database? And in this case, you now have that ability. And, you know, the nice thing too, is that like, if you think about, you know, being able to call specifically, you know, what are my primary keys? Or what are these identity keys, like surrogate keys, etc. Being able to call those out specifically within the metadata also helps people attribute that to something that will always be here. It's always generated and it's something I can always refer to. And so it's like a kind of stronger key, right? It's like, I will always have this. It's always required. It's generated for me. And so then in the future, you can join easily off of those rows themselves. And then the nice thing too, going back to the stats columns is that being able to keep track of our min-max values across our primary keys gives us the ability to basically skip those offsets as well. So that really helps with retrieval of specific points where we have these identity columns as well. Youssef Mrini (45:10) And last but last and least, we have liquid clustering. So it's also a modern way to optimize your data layout. And I find it interesting because it can replace the traditional partitioning. In case you're not sure of which color you're to be using, maybe for example, now you partitioned by date, but you want to add later maybe country. you would need to... rewrite the underlying data. While with liquid clustering, can just alter table and add an alter table cluster by and add two, three, four columns. And I think also liquid clustering is very important because when you run the optimize, the optimize is to be running only on the incremental data. And this is super, super efficient. You want it to spend several times like waiting for the cluster to compute. the optimized specifically on super, super large table. can, of course, you can create like those liquid tables easily. You can alter an existing table and create liquid crash. But keep in mind if it's partition table, then you would need to use set as to create a new one. If your table is not partitioned, you don't need to rewrite the underlying data. And it's very efficient and the order of the clustering keys does not matter. But keep in mind that you won't be using five columns as the clustering keys. in mind that you just like using partitioning plus Z-ordering. You won't be partitioning by, I don't know, five columns. You're going to be using maybe three columns. And the last one can be Z-order. So those are the columns that can be good.
  • 16. when it comes to choosing which one can be a good option for the clustering. Scott Haines (47:13) And then we also had, like two other sessions that we had done before as well. If people want to deep dive further into liquid clustering, we had like the beyond partitioning. Then we had the hybrid clustering session also, where we went into a lot more detail on like, like the kind of the how's, why's, the do's, the don'ts with using liquid clustering itself. So with that said, we also have a ton more questions that are coming through. And so, Well, yeah, this is a perfect site for that too. All right, so let's go back down. Let's see, where do you want to pop into? Youssef Mrini (47:51) Hold on one sec. Okay, so the first one is very interesting. It's the ingestion time clustering. So for the ingestion time clustering, basically what happens is that whenever you're writing some specific data, this data is written on a specific time. So basically it's like that this table is clustered or optimized by default because it's ingested in a specific time. For example, I don't know, at 2 p.m., 3 p.m., 4 p.m., 5 p.m. It's like being, let's say, partitioned. And Spark in the past, because of some properties, it didn't respect the order of the insertion. But now they make sure to keep the same order as much as possible, including, I think, when you have the merge operation as well. But that's how it works. So since the data is ingested at a specific time, it's like it's ordered. It's like having the natural data skipping because if you specify, filter the data from, don't know where the hour is equal to four. So you know that you don't need to scan all the tables, just go to the specific index. So that's what we call this talk about ingestion time clustering. Scott Haines (49:18) All right, and there's also, yeah, so we already answered the data skipping stats leverage from the Spark joins and Delta table merges. We didn't talk about it as like a Spark join in general, but anytime that you have the index column, because you can easily prune what files need to be read, it does help with those joins themselves. The same thing with the merges, as long as you have that data in the actual, like in the Delta table stats. that speeds up that whole entire process. So think of that as it's kind of, it's a lighter weight, like it's a lighter weight index, but it still will help in the same way that you would typically have an index column that you're joining off of the speed of a join with like a traditional database as well. All right. Let's see. Do you know, there's the, so will the identity columns and generated columns as features, where exactly can we find the functionality being applied on the table? Like, so for this one, like for
  • 17. Rama, like if you just, if you go and you create a brand new table and you specify a column as generated always as, and then you're, you you could apply pretty much any kind of condition you want in there. Once you have the generate always as, anytime you add any row of data, you can do this all like within like a notebook, you could do it in like, know, a Jupyter notebook doesn't really matter where you're doing this. Just add a row of data and you're only filling in the columns that you need to fill in. Everything else will magically just show up. So for example, like a lot of times if you're reading data from Kafka, you have the timestamp from the Kafka broker from when new data was written. If you're using that to create a partition key, like with the old Heistyle partitioning, you can always generate a date to project that timestamp into a date value as well. And then that column will always show up with your table as well, essentially utilizing another known column that will always exist as well. So you don't have to explicitly do anything. You can ignore it and pretend that that column doesn't even exist and it'll always just show up for you. All right. Let's see. Youssef Mrini (51:47) I think another one is, is there any way to prevent optimizers from breaking some concurrency errors? So basically if your table is partitioned, you can make sure to optimize only some specific partitions. And if you are using a liquid clustering, you have the raw level concurrency, which is what was introduced with liquid clustering, which also reduces the possibility to face those concurrency errors. Scott Haines (52:17) Yep, but I mean, like in certain cases with that too, like we covered a little bit of this, like when we were talking about hybrid clustering before there's like, there's use cases, like there's kind of edge and corner cases for a lot of things too. So if the process is running and it's optimizing, like the way that Delta works is its multi-version concurrency control. So you'll be creating, essentially you'll be doing an optimized flow on a specific version of the table. And if new additional changes have been made, in the next version, then that next version being applied to the table won't actually take, like that won't be applied into that current optimization flow, but it will be the next time the optimized process is run because it's incrementally being applied. And so like, it's not gonna stop new changes from occurring while the current optimized flow is going, if that makes any sense. So yeah, there's no distributed lock, for example. I don't know what some of the other questions are. Let's see. So there is a longer one. We've got a couple minutes left. So it's like, having a complex job pipeline in Databricks where tasks are dependent on the upstream tasks. The task only runs after its proceeding tasks complete
Youssef Mrini (51:47)
I think another one is: is there any way to prevent OPTIMIZE from causing concurrency errors? Basically, if your table is partitioned, you can make sure to optimize only specific partitions. And if you're using liquid clustering, you have row-level concurrency, which was introduced with liquid clustering and also reduces the chance of hitting those concurrency errors.

Scott Haines (52:17)
Yep, but in certain cases, and we covered a little of this when we talked about hybrid clustering before, there are edge and corner cases for a lot of things. If the optimize process is running, remember that the way Delta works is multi-version concurrency control, so you're essentially running the optimize against a specific version of the table. If additional changes land in the next version, that next version won't be picked up by the current optimize run, but it will be the next time the optimize process runs, because it's applied incrementally. It's not going to stop new changes from occurring while the current optimize is going, if that makes sense. There's no distributed lock, for example. I don't know what some of the other questions are. Let's see.
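A short sketch of the partition-scoped OPTIMIZE that Youssef mentions, assuming the same illustrative table names as above; the predicate must reference partition columns.

```python
# Assumes an existing `spark` session.
# Scoping OPTIMIZE to one partition means concurrent writers touching other
# partitions are far less likely to conflict with it.
spark.sql("""
  OPTIMIZE events
  WHERE event_date = '2024-01-15'
""")

# A liquid-clustered table has no partition predicates to scope by; there,
# row-level concurrency (available when deletion vectors are enabled) is what
# reduces conflicts between a running OPTIMIZE and concurrent DML.
spark.sql("OPTIMIZE events_clustered")
```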
There's a longer one, and we've got a couple of minutes left. Someone has a complex job pipeline in Databricks where tasks depend on upstream tasks, so a task only runs after its preceding tasks complete successfully. They've got workflow one with a task called task A, and workflow two with task B, and workflow two depends on the successful completion of task A in workflow one. This is really about transitioning from an Airflow mindset to a Workflows mindset. If you have workflow one with task A in it, and workflow two with task B, where task A from workflow one is a dependency of task B, the question I would propose is: can task B exist within that same workflow? Because if task A is doing a lot of preparation for task B, I can keep that work in memory within the same cluster in that workflow, and the read is essentially free, since both tasks are processed within one workflow, as long as we don't run into memory constraints or anything else. So you might actually speed things up by having task B simply depend on task A finishing within a single workflow, without creating inter-workflow dependencies, which always add more complexity to the jobs themselves. There are other options too, like the different triggers in Workflows: if you're waiting for a specific file to exist, there are different ways of triggering different jobs to run. But unless there's a real need for the additional complexity, it's always easier to tightly couple that back into one workflow. That's my two cents.

Youssef Mrini (55:54)
I think we have a question, but I believe you might have answered it during the last session. In which scenarios should we use auto liquid clustering versus hybrid clustering to get optimal runs? And with hybrid clustering, are deletion vectors auto-enabled, or should we enable them manually?

Scott Haines (56:17)
Yeah, so the interesting thing is that hybrid clustering was more of a name for the thought experiment of how we connect Hive-style partitioning, which is a type of clustering that can lead to something like Z-Order clustering or traditional bin-packing OPTIMIZE, versus liquid clustering, which is more efficient than, say, a Z-Order-like index on a table. For hybrid workflows, it's essentially up to the user. We talked about it as an end-to-end lineage where we had specific tables with a close-of-books time period: if we're waiting for changes for a certain amount of time and we want write isolation within specific partitions, classic Hive-style partitioning makes a lot more sense, because the optimizations we can make are very specific to specific partitions. When we're ready to have a long-lived table that can benefit from liquid clustering, at that point the new table becomes liquid clustered. So we're essentially just changing the time-to-live for these different tables. We have that up on YouTube as well, the hybrid clustering session, which goes into a lot more detail. But you would only add deletion vectors if you need them for those use cases. In a case where you're overwriting a lot, you can have deletion vectors on the classically partitioned data. But if your liquid clustered table is essentially append-only, you might not need deletion vectors at all, because you're not actually deleting anything. That's just another thing to keep in mind: if you're only deleting data older than, say, 90 or 120 days, deletion vectors aren't going to help as much, because you're not modifying individual Parquet files, which is what causes a lot of the trouble; you're essentially just truncating the table to a specific point in time. A lot of it is contextual: which solution to pick and go with.
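To make that distinction concrete, here's a minimal sketch under the same illustrative names: deletion vectors enabled on an overwrite-heavy, partitioned staging table, versus a plain retention delete on an append-only table where they add little value. The `delta.enableDeletionVectors` table property is the standard switch; everything else (table names, the 90-day cutoff) is made up.

```python
# Assumes an existing `spark` session.
# Deletion vectors help when you frequently delete or rewrite individual rows
# inside large Parquet files, because rows are marked instead of files rewritten.
spark.sql("""
  ALTER TABLE events_staging
  SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
""")

# Row-level delete on the overwrite-heavy staging table: marked in a deletion
# vector rather than rewriting whole data files.
spark.sql("DELETE FROM events_staging WHERE device_id = 'bad-sensor-42'")

# Retention delete on the append-only table: effectively truncating old data,
# which gains little from deletion vectors.
spark.sql("""
  DELETE FROM events_clustered
  WHERE event_date < date_sub(current_date(), 90)
""")
```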
Youssef Mrini (58:28)
I think we can answer one more question and then we can close. It's related to liquid clustering again: is traditional clustering more effective for batch workloads, since liquid has a dynamic approach? And is liquid also a good option for streaming, given that dynamic approach?

Scott Haines (58:57)
I think for this one... Youssef, do you want me to pick this one up?

Youssef Mrini (59:05)
Yes, yes, please.

Scott Haines (59:07)
Cool. When we talk about a dynamic approach to anything, the question is: do we need to make a decision up front? Is there a reason to think things through beforehand? In a lot of cases with the old style, which isn't bad necessarily, with Hive-style partitioning you'd really think through what the layout of your table is going to look like. What does it look like now? What does it look like in six months? In a year? Because if it's a gigantic table, any change that requires re-laying out the entire table is costly. With liquid clustering, you don't have to cluster until your Delta table gets to, say, 20 gigabytes or 100 gigabytes. Then you can start thinking about the use cases in which you'd want to cluster differently so that your queries are faster. And that's something you can incrementally apply to the table, which helps. That's where OPTIMIZE FULL comes in: if you haven't made a decision yet, and later you want to change the CLUSTER BY columns and then go back and reorganize the files for those clusters, that can be done in one large run, with the caveat that if the table is gigantic, the job might take a long time. That's it. I guess it's dynamic in that you have a choice you can make later.

Youssef Mrini (1:00:43)
I think that's it. It was awesome to be here with you. For those who posted questions on LinkedIn, I'll make sure to share the documentation and the useful links with you. And make sure, of course, to download this awesome book written by Denny, Scott, Tristen, and also Prashanth, Delta Lake: The Definitive Guide. You can download your free copy: just scan the QR code or look it up on Google, I think you'll find it easily, or go to the Delta Lake documentation and you'll find it there.

Scott Haines (1:01:19)
Awesome. Thank you so much for coming on today, Youssef, and for sharing the tips and tricks.
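To close out, here is a minimal sketch of the liquid clustering flow Scott described in that last answer: pick clustering columns lazily, change them later, and re-lay out existing files in one large run. Table and column names are illustrative, and the CLUSTER BY / OPTIMIZE FULL syntax assumes a Delta Lake version or Databricks runtime recent enough to support liquid clustering.

```python
# Assumes an existing `spark` session.
# Start with an initial clustering choice; no Hive-style layout to commit to.
spark.sql("""
  CREATE TABLE IF NOT EXISTS events_clustered (
    event_id   STRING,
    device_id  STRING,
    event_time TIMESTAMP
  )
  USING DELTA
  CLUSTER BY (event_time)
""")

# Later, once the table has grown and query patterns are clearer,
# change the clustering columns...
spark.sql("ALTER TABLE events_clustered CLUSTER BY (device_id, event_time)")

# ...and rewrite already-written files to the new layout in one large run.
# A plain OPTIMIZE only clusters new or not-yet-clustered data incrementally.
spark.sql("OPTIMIZE events_clustered FULL")
```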