GW DEVTrails Usecase Solution
Objective
This 45-day hackathon challenge invites participants to develop a model that can predict potential issues in Kubernetes
clusters (Phase 1) and recommend or automatically implement solutions to address those issues (Phase 2). The aim is to
optimize the efficiency of Guidewire solutions on Cloud Infrastructure. Participants may use only publicly available
resources and open-source solutions to achieve these goals.
Problem Statement (Phase 1)
Kubernetes clusters can encounter failures such as pod crashes, resource bottlenecks, and network issues. The
challenge in Phase 1 is to build an AI/ML model capable of predicting these issues before they occur by analysing
historical and real-time cluster metrics.
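Predicting an issue "before it occurs" is commonly framed as a supervised problem: each training example is a short window of past metric samples, and its label says whether a failure follows within some lead time. The sketch below shows one way to build such examples from a single metric series. The function name make_windows, the window and horizon parameters, and the sample values are illustrative assumptions, not part of the challenge brief.

```python
from typing import List, Sequence, Tuple


def make_windows(
    metric: Sequence[float],
    failed: Sequence[bool],
    window: int,
    horizon: int,
) -> Tuple[List[List[float]], List[int]]:
    """Turn one metric series into (features, labels) for early-warning training.

    Each feature row holds `window` consecutive samples ending at time t.
    The label is 1 if any failure is recorded in the next `horizon` samples
    (t+1 .. t+horizon), otherwise 0.
    """
    if len(metric) != len(failed):
        raise ValueError("metric and failed must have the same length")
    features: List[List[float]] = []
    labels: List[int] = []
    # t is the index of the last sample inside the window.
    for t in range(window - 1, len(metric) - horizon):
        features.append(list(metric[t - window + 1 : t + 1]))
        labels.append(int(any(failed[t + 1 : t + 1 + horizon])))
    return features, labels


if __name__ == "__main__":
    # Hypothetical CPU utilisation samples and failure flags (assumed data).
    cpu = [0.2, 0.3, 0.4, 0.9, 0.95, 0.5]
    crashed = [False, False, False, False, True, False]
    X, y = make_windows(cpu, crashed, window=2, horizon=2)
    for row, label in zip(X, y):
        print(row, label)
```

The resulting rows can be fed to any classifier. A longer horizon gives more warning time but usually makes the labels noisier, so it is worth treating as a tunable parameter.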
Key Objectives for Phase 1:
Model Design: Build a model capable of predicting the issues mentioned below as a minimum viable scope (more can also be
accommodated):
Prediction Accuracy: Focus on developing models that accurately forecast potential failures using techniques such as anomaly
detection, time-series analysis, and other applicable techniques.
(Optional) Consume K8s: Package all dependencies in K8s to execute the solution.
Codebase: Functional code, including data collection, model training, and evaluation scripts, uploaded to GitHub.
Documentation: Clear documentation explaining the approach, key metrics used, and model performance.
Presentation: A brief recorded presentation of the prediction model, including results and potential improvements, together with
a demo. Additionally, please upload the presentation file if applicable.
Test Data: The data used for training and testing the model (if applicable).
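As a concrete reference point for the anomaly-detection technique named under Prediction Accuracy, the sketch below flags metric samples that deviate strongly from a trailing window (a rolling z-score) and scores the flags against known incidents with precision and recall. This is only a simple baseline under assumed parameters (window of 5 samples, threshold of 3 standard deviations); the function names and the sample data are hypothetical, and the brief does not prescribe this method.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def rolling_zscore_anomalies(
    values: Sequence[float], window: int = 5, threshold: float = 3.0
) -> List[int]:
    """Return indices whose value deviates from the trailing-window mean
    by more than `threshold` population standard deviations."""
    anomalies: List[int] = []
    for i in range(window, len(values)):
        history = values[i - window : i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # Flat history: any change at all is treated as anomalous.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


def precision_recall(
    predicted: Sequence[int], actual: Sequence[int]
) -> Tuple[float, float]:
    """Compare flagged indices with known incident indices."""
    pred, act = set(predicted), set(actual)
    true_positives = len(pred & act)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(act) if act else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical memory usage (percent) with one spike at index 6.
    memory = [50, 51, 49, 50, 51, 50, 95, 50]
    flagged = rolling_zscore_anomalies(memory)
    print("flagged indices:", flagged)
    print("precision, recall:", precision_recall(flagged, actual=[6]))
```

Reporting precision and recall separately matters here because cluster failures are rare: a model that never raises an alert can still look accurate on overall accuracy alone.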
Problem Statement (Phase 2)
Once issues are predicted, the next step is to automate or recommend actions for remediation. The challenge in Phase
2 is to create an agent or system capable of responding to these predicted issues by suggesting or implementing actions
to mitigate potential failures in the Kubernetes cluster.
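One minimal way to picture the "recommend" side of such a system is a lookup from a predicted issue type to a suggested command, gated by the model's confidence. The sketch below only builds command strings and never executes them. The issue labels, the PLAYBOOK table, the confidence threshold, and the replica count are all hypothetical choices made for illustration; they are not the examples from the challenge brief. The command templates themselves are ordinary kubectl invocations.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Prediction:
    issue: str         # hypothetical label emitted by the Phase 1 model
    namespace: str
    deployment: str
    confidence: float  # model score in [0, 1]


# Hypothetical issue-to-action table (an assumption, not from the brief).
PLAYBOOK: Dict[str, str] = {
    "pod_crash": "kubectl rollout restart deployment/{deployment} -n {namespace}",
    "resource_bottleneck": (
        "kubectl scale deployment/{deployment} --replicas={replicas} -n {namespace}"
    ),
    "network_issue": "kubectl get events -n {namespace} --sort-by=.lastTimestamp",
}


def recommend(
    pred: Prediction, min_confidence: float = 0.7, replicas: int = 3
) -> List[str]:
    """Return recommended commands for a prediction, or [] if none applies.

    Nothing is executed here; the caller decides whether to run the commands.
    """
    if pred.confidence < min_confidence:
        return []
    template = PLAYBOOK.get(pred.issue)
    if template is None:
        return []
    return [
        template.format(
            deployment=pred.deployment,
            namespace=pred.namespace,
            replicas=replicas,
        )
    ]


if __name__ == "__main__":
    prediction = Prediction("pod_crash", "prod", "api", confidence=0.9)
    for command in recommend(prediction):
        print(command)
```

Keeping recommendation separate from execution makes it easy to start in a dry-run mode, review the suggested actions, and only later wire them to automatic execution.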
Key Objectives for Phase 2:
Remediation Actions: Based on predicted issues from Phase 1, develop a system that recommends or implements appropriate
remediation steps. Examples include:
Automation: Integrate the remediation system with the AI/ML agent to trigger automatic responses to predicted issues.
Evaluation of Effectiveness: Measure how effective the remediation actions are in mitigating or preventing cluster issues.
(Optional) Consume K8s: Package all dependencies in K8s to execute the solution.
Codebase: Functional code implementing the remediation logic and connected to the Phase 1 prediction agent, uploaded to GitHub.
Documentation: Detailed documentation describing how remediation actions are chosen or implemented.
Presentation: Final presentation of the complete solution, covering both the prediction and remediation phases, with an
emphasis on the integration of the two phases, together with a recorded demo of the end-to-end process. Please upload
the presentation files if applicable.
Deployed Application: If possible, please deploy the application on a cloud platform of your choice so that we can try the
agent live.
(Optional) Consume K8s: Package all dependencies in K8s to execute the solution.
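For the Evaluation of Effectiveness objective above, one simple (and admittedly rough) measure is to compare how often a predicted failure still occurs when a remediation was applied versus when it was not. The sketch below computes that as a relative risk reduction. The Outcome record, the function names, and the sample counts are hypothetical; a real evaluation would also need to account for differences between the two groups.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Outcome:
    remediated: bool        # was a remediation action applied after the prediction?
    failure_occurred: bool  # did the predicted failure still happen?


def failure_rate(outcomes: Iterable[Outcome], remediated: bool) -> Optional[float]:
    """Fraction of predictions in the chosen group that ended in failure."""
    group = [o for o in outcomes if o.remediated == remediated]
    if not group:
        return None
    return sum(o.failure_occurred for o in group) / len(group)


def relative_risk_reduction(outcomes: Iterable[Outcome]) -> Optional[float]:
    """1 - (failure rate with remediation / failure rate without).

    Returns None when either group is empty or the untreated rate is zero.
    """
    records: List[Outcome] = list(outcomes)
    treated = failure_rate(records, remediated=True)
    control = failure_rate(records, remediated=False)
    if treated is None or control is None or control == 0:
        return None
    return 1.0 - treated / control


if __name__ == "__main__":
    # Assumed sample: 4 remediated predictions (1 failure), 4 untreated (2 failures).
    history = (
        [Outcome(True, True)] + [Outcome(True, False)] * 3
        + [Outcome(False, True)] * 2 + [Outcome(False, False)] * 2
    )
    print("relative risk reduction:", relative_risk_reduction(history))
```

A value near 1 suggests the remediated predictions rarely ended in failure compared with the untreated ones, while a value near 0 or below suggests the actions made little difference.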
Skills Required:
Knowledge of Kubernetes and container orchestration.
Machine learning (e.g., time series forecasting, anomaly detection).
Experience with AI/ML libraries (e.g., TensorFlow, PyTorch, Scikit-learn).
GenAI tools - LLMs, LangSmith/LangGraph or any other Open Source Solution.
Python or relevant programming languages.
Familiarity with Kubernetes APIs and monitoring tools (e.g., Prometheus).
Scoring Criteria (100 points max)
- Novelty of Approach
- Overcoming Challenges
Resources:
• Kubernetes Documentation: Kubernetes Official Docs
• Prometheus Documentation: Prometheus Monitoring
• Grafana Documentation: Grafana
• AI/ML Tutorials:
- Scikit-learn Documentation
- TensorFlow Documentation
- PyTorch Documentation
• AI Agents:
- Agents
- Introducing ambient agents