Initial implementation for DataLifecycleService #94012

Merged
@andreidan merged 21 commits into elastic:main on Feb 27, 2023

Conversation

@andreidan (Contributor) commented Feb 22, 2023:

This adds support for managing the lifecycle for data streams. It currently supports rollover and data retention.

Note that error collection and reporting will come in a follow-up PR.

Relates to #93596

@andreidan changed the title from "Add initial implementation for DataLifecycleService" to "Initial implementation for DataLifecycleService" Feb 23, 2023
@elasticsearchmachine (Collaborator):

Hi @andreidan, I've created a changelog YAML for you.

Comment on lines +435 to +453
@Override
public boolean equals(Object o) {
if (this == o) {
return true;
}
if (o == null || getClass() != o.getClass()) {
return false;
}
RolloverRequest that = (RolloverRequest) o;
return dryRun == that.dryRun
&& Objects.equals(rolloverTarget, that.rolloverTarget)
&& Objects.equals(newIndexName, that.newIndexName)
&& Objects.equals(conditions, that.conditions)
&& Objects.equals(createIndexRequest, that.createIndexRequest);
}

@Override
public int hashCode() {
return Objects.hash(rolloverTarget, newIndexName, dryRun, conditions, createIndexRequest);
}
@andreidan (author):

Needed as the requests are used as keys in the ResultDeduplicator
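
For context, a minimal sketch of how the deduplicator uses that equality (the call shape is taken from the usage quoted further down this thread; the listener and generics are assumptions):

// Requests are the deduplication keys: an equal RolloverRequest submitted
// while another is still in flight shares the running execution, which only
// works if RolloverRequest implements value-based equals()/hashCode().
transportActionsDeduplicator.executeOnce(
    rolloverRequest,
    ActionListener.noop(),
    (request, reqListener) -> client.execute(RolloverAction.INSTANCE, request, reqListener)
);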

Comment on lines +111 to +128
@Override
public boolean equals(Object o) {
if (this == o) {
return true;
}
if (o == null || getClass() != o.getClass()) {
return false;
}
DeleteIndexRequest that = (DeleteIndexRequest) o;
return Arrays.equals(indices, that.indices) && Objects.equals(indicesOptions, that.indicesOptions);
}

@Override
public int hashCode() {
int result = Objects.hash(indicesOptions);
result = 31 * result + Arrays.hashCode(indices);
return result;
}
@andreidan (author):

Needed as the requests are used as keys in the ResultDeduplicator

dataLifecycleInitialisationService.set(
new DataLifecycleService(
settings,
new OriginSettingClient(client, DLM_ORIGIN),
@andreidan (author):

DLM runs as superuser
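
For reference, a minimal sketch of what that means in practice (only the wrapper construction is from the snippet above; the execute calls and listener names are illustrative assumptions):

// OriginSettingClient stamps each request's thread context with DLM_ORIGIN,
// so these actions run with internal privileges rather than an end user's.
Client dlmClient = new OriginSettingClient(client, DLM_ORIGIN);
dlmClient.execute(RolloverAction.INSTANCE, rolloverRequest, rolloverListener);
dlmClient.execute(DeleteIndexAction.INSTANCE, deleteIndexRequest, deleteListener);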

@andreidan marked this pull request as ready for review February 23, 2023 13:48
@andreidan requested a review from gmarouli February 23, 2023 13:48
@elasticsearchmachine added the Team:Data Management label Feb 23, 2023
@elasticsearchmachine (Collaborator):

Pinging @elastic/es-data-management (Team:Data Management)

@andreidan requested a review from dakrone February 23, 2023 13:57
continue;
}

TimeValue indexLifecycleDate = getCreationOrRolloverDate(dataStream.getName(), backingIndex);
@gmarouli (Contributor) commented Feb 23, 2023:

Hm, I was wondering if it would be better to call this rolloverDate; at this point in the code we can only encounter rolled-over indices, right?

I believe that name is better because it is more explicit.

@andreidan (author):

I'd prefer not to make that assumption, as the modify data stream API could be used to bring any index into the data stream (e.g. one that was never rolled over, at which point we'd take that index's creation date into consideration).
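
A sketch of the fallback being described, assuming the method reads the per-data-stream rollover info from the index metadata (an illustration, not the PR's exact code):

// Prefer the rollover timestamp recorded for this data stream; fall back to
// the index creation date for indices that were never rolled over (e.g.
// indices added via the modify data stream API).
static TimeValue getCreationOrRolloverDate(String dataStreamName, IndexMetadata backingIndex) {
    RolloverInfo rolloverInfo = backingIndex.getRolloverInfos().get(dataStreamName);
    long millis = rolloverInfo != null ? rolloverInfo.getTime() : backingIndex.getCreationDate();
    return TimeValue.timeValueMillis(millis);
}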

@gmarouli (Contributor) left a review:

LGTM! 🚀 So cool to see it working!!!!

long nowMillis = nowSupplier.getAsLong();
if (nowMillis >= indexLifecycleDate.getMillis() + retention.getMillis()) {
// there's an opportunity here to batch the delete requests (i.e. delete 100 indices / request)
// let's start simple and reevaluate
@gmarouli (Contributor):

Nice remark!

if (this.isMaster) {
if (scheduler.get() == null) {
// don't create scheduler if the node is shutting down
if (isClusterServiceStoppedOrClosed() == false) {
@gmarouli (Contributor):

Should we also take into account the shutdown API here?

@dakrone (Member):

No, I don't think we should, because DLM should continue to work while a master node is marked as shutting down (which could be for hours)
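
For reference, a check like the one guarding scheduler creation can be written against the local service lifecycle (a sketch; the PR's exact body may differ):

// The local lifecycle of the ClusterService on this node; note this is
// distinct from the shutdown-API marker discussed above.
private boolean isClusterServiceStoppedOrClosed() {
    final Lifecycle.State state = clusterService.lifecycleState();
    return state == Lifecycle.State.STOPPED || state == Lifecycle.State.CLOSED;
}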

@andreidan mentioned this pull request Feb 23, 2023
@dakrone (Member) left a review:

Thanks for working on this, Andrei! It's super exciting to see it starting to actually do things :)

I left some comments; I think we need to be really defensive in our error handling (ILM has taught us that!), and we should try to factor as much as possible out into unit-testable, non-mock tests.


RolloverRequest rolloverRequest = defaultRolloverRequestSupplier.apply(dataStream.getName());
transportActionsDeduplicator.executeOnce(
rolloverRequest,
ActionListener.noop(),
@dakrone (Member):

Should we pass in some kind of listener for logging purposes so that DLM can log (at trace) that it's invoking rollover requests? Not sure if it'd be too much noise or whether it'd be useful, what do you think?

@andreidan (author):

++

I think we need a custom listener here that will collect the encountered errors. I am planning to add it as part of the next effort (follow-up PR) related to error reporting.
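
A rough sketch of what such an error-collecting listener might look like (the class name and error store are hypothetical placeholders for the planned follow-up, not code from this PR):

// Hypothetical: remembers the last failure per target so the follow-up
// error-reporting work can surface it; a success clears the entry.
class ErrorRecordingActionListener implements ActionListener<Void> {
    private final String targetName;
    private final Map<String, String> errorStore; // target -> last error message

    ErrorRecordingActionListener(String targetName, Map<String, String> errorStore) {
        this.targetName = targetName;
        this.errorStore = errorStore;
    }

    @Override
    public void onResponse(Void unused) {
        errorStore.remove(targetName);
    }

    @Override
    public void onFailure(Exception e) {
        errorStore.put(targetName, e.getMessage());
    }
}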

Comment on lines 270 to 272
rolloverRequest.addMaxIndexAgeCondition(TimeValue.timeValueDays(30));
rolloverRequest.addMaxPrimaryShardSizeCondition(ByteSizeValue.ofGb(50));
rolloverRequest.addMaxPrimaryShardDocsCondition(200_000_000);
@dakrone (Member):

I think we need to start a conversation (earlier is better) about what this default should be. I think we should perhaps aim for something a little shorter than 30 days, more in the 7-day range (or shorter, if we think we can get away with it).

@andreidan (author):

I changed it to 7 days, which is what we postulated in the design doc (not sure where my 30 days came from here).

Do you think we should discuss having it lower than 7 days?
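
With that change, the defaults quoted above become (reusing the setters from the snippet; assuming the size and doc-count conditions stay as shown):

// Default rollover conditions after review: max age lowered from 30 to 7 days.
rolloverRequest.addMaxIndexAgeCondition(TimeValue.timeValueDays(7));
rolloverRequest.addMaxPrimaryShardSizeCondition(ByteSizeValue.ofGb(50));
rolloverRequest.addMaxPrimaryShardDocsCondition(200_000_000);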

Comment on lines +288 to +289
if (scheduler.get() != null) {
scheduler.get().remove(DATA_LIFECYCLE_JOB_NAME);
@dakrone (Member):

One of these days we should just write a SetOnceOptional<T> class of our own that combines SetOnce and Optional, so we can dispense with the safety checks and do scheduler.set(...) and scheduler.ifPresent(s -> s.remove(DATA_LIFECYCLE_JOB_NAME));
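
A minimal sketch of that hypothetical helper (SetOnceOptional does not exist in the codebase; this just spells out the idea from the comment, built on org.apache.lucene.util.SetOnce):

// Hypothetical: combines SetOnce's set-once semantics with Optional-style access.
final class SetOnceOptional<T> {
    private final SetOnce<T> value = new SetOnce<>();

    void set(T t) {
        value.set(t); // throws AlreadySetException on a second set, like SetOnce
    }

    void ifPresent(Consumer<T> consumer) {
        T t = value.get();
        if (t != null) {
            consumer.accept(t);
        }
    }
}

With that, the call sites above become scheduler.set(...) and scheduler.ifPresent(s -> s.remove(DATA_LIFECYCLE_JOB_NAME)).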

Comment on lines +318 to +320
void setDefaultRolloverRequestSupplier(Function<String, RolloverRequest> defaultRolloverRequestSupplier) {
this.defaultRolloverRequestSupplier = defaultRolloverRequestSupplier;
}
@dakrone (Member):

We have this to make testing easier, right? I would like to get to the point where we parse the real rollover request out of the configuration, and then we can remove this entirely. What do you think?

@andreidan (author):

Absolutely. Mary is working on introducing the default rollover setting. This will go away once that's merged.

@andreidan (author):

@elasticmachine update branch

@andreidan requested a review from dakrone February 24, 2023 15:51
@dakrone (Member) left a review:

This LGTM, thanks for the changes, Andrei! I left one more comment that is more design-centric; it could be addressed in follow-up work if you agree (or ignored if you don't).

Comment on lines +191 to +204
List<Index> backingIndices = dataStream.getIndices();
// we'll look at the current write index in the next run if it's rolled over (and not the write index anymore)
for (int i = 0; i < backingIndices.size() - 1; i++) {
IndexMetadata backingIndex = state.metadata().index(backingIndices.get(i));
if (backingIndex == null || isManagedByDLM(dataStream, backingIndex) == false) {
continue;
}

if (isTimeToBeDeleted(dataStream.getName(), backingIndex, nowSupplier, retention)) {
// there's an opportunity here to batch the delete requests (i.e. delete 100 indices / request)
// let's start simple and reevaluate
DeleteIndexRequest deleteRequest = new DeleteIndexRequest(backingIndex.getIndex().getName()).masterNodeTimeout(
TimeValue.MAX_VALUE
);
@dakrone (Member):

Thinking about this a little bit, we could possibly push this logic into the DataStream itself, right? Something like dataStream.getIndicesPastRetention() returning the list. Especially since we're pushing the lifecycle information into the data stream, it lets the logic live next to the lifecycle configuration, and then this code here doesn't need to know anything about the write index. What do you think?

@andreidan (author):

++ I think that could work. Thanks for the suggestion Lee.

Will do this in a follow-up PR
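
A sketch of what the suggested DataStream method could look like (signature and body are assumptions for the agreed follow-up; it reuses a creation-or-rollover-date helper like the one discussed earlier in this thread):

// Hypothetical: returns the non-write backing indices whose origination date
// (rollover date, falling back to creation date) is older than the retention.
List<Index> getIndicesPastRetention(Function<Index, IndexMetadata> metadataSupplier, LongSupplier nowSupplier, TimeValue retention) {
    List<Index> pastRetention = new ArrayList<>();
    // skip the last backing index: it is the write index and must never be deleted
    for (int i = 0; i < indices.size() - 1; i++) {
        IndexMetadata metadata = metadataSupplier.apply(indices.get(i));
        if (metadata == null) {
            continue;
        }
        long originationDate = getCreationOrRolloverDate(getName(), metadata).getMillis();
        if (nowSupplier.getAsLong() >= originationDate + retention.getMillis()) {
            pastRetention.add(indices.get(i));
        }
    }
    return pastRetention;
}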

@andreidan merged commit 4760f00 into elastic:main Feb 27, 2023
Labels: >feature, Team:Data Management, v8.8.0

5 participants