
Kubernetes at CERN

Ricardo Rocha
CERN IT
Use Cases
Infrastructure Services: JIRA, WebLogic, EDH, …

Batch and Interactive Computing: Jupyter Notebooks, Spark, HTCondor, …

Machine Learning: Kubeflow

Reproducible Analysis: REANA

Experiment Tools: CMSWeb, Rucio, …

Kubernetes Grid Sites

Many others we don’t know about


Main Features
Heterogeneous Clusters with Node Groups

OpenStack cloud provider to interact with our private cloud

Identity, Cluster Auto Scaling, Load Balancing, …

CSI drivers for CVMFS (read-only filesystem) and CephFS (see the PVC sketch after this list)

EOS integration via a DaemonSet running eosxd (no CSI for now)

Central Logging and Metric collection, Alarming

Vulnerability Scanning and Image Signing, Security Reports
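
As an illustration of how the CSI storage integration surfaces to users, here is a minimal PersistentVolumeClaim sketch against the CephFS driver; the StorageClass name "csi-cephfs" and the requested size are assumptions, not the actual CERN configuration.

# Minimal sketch: claiming shared CephFS storage through the CSI driver.
# The StorageClass name and the size are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: analysis-scratch
spec:
  accessModes:
    - ReadWriteMany            # CephFS supports shared read-write mounts
  storageClassName: csi-cephfs
  resources:
    requests:
      storage: 100Gi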


Helm and GitOps

$ helm install fluxcd/flux \
    --namespace flux --name flux --values flux-values.yaml \
    --set git.pollInterval=1m \
    --set git.url=https://gitlab.cern.ch/.../hub

$ cat flux-values.yaml
rbac:
  create: true
helmOperator:
  create: true
  chartsSyncInterval: 5m
  configureRepositories:
    enable: true
    repositories:
      - name: jupyterhub
        url: https://charts.cern.ch/jupyterhub
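
With this configuration Flux polls the Git repository every minute (git.pollInterval=1m) and the Helm operator syncs chart releases every five minutes (chartsSyncInterval: 5m), so rolling out a change is just a git push.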
[Diagram: container images are pushed to the Registry (docker push); manifests flow between Git and the cluster (git push / git pull); FluxCD's Helm operator reconciles HelmRelease CRDs into Helm releases, grouped under a meta chart]
Helm and GitOps

Repository layout:

|-- charts
|   |-- hub
|       |-- Chart.yaml
|       |-- requirements.yaml
|       |-- values.yaml
|       |-- templates
|           |-- custom-manifest.yaml
|-- namespaces
|   |-- prod.yaml
|   |-- stg.yaml
|-- releases
|   |-- prod
|   |   |-- hub.yaml
|   |-- stg
|       |-- hub.yaml
|-- secrets
    |-- prod
    |   |-- secrets.yaml
    |-- stg
        |-- secrets.yaml

Example release (releases/prod/hub.yaml):

apiVersion: flux.weave.works/v1beta1
kind: HelmRelease
metadata:
  name: hub
  namespace: prod
spec:
  releaseName: hub
  chart:
    git: https://gitlab.cern.ch/.../hub.git
    path: charts/hub
    ref: master
  valuesFrom:
    - secretKeyRef:        # this is how we plug in our encrypted values data
        name: hub-secrets
        key: values.yaml
  values:
    binderhub:
      ...
[Diagram: 70 TB dataset processed by a cluster on GKE; job results are aggregated into an interactive visualization]

Max 25000 cores

Single region, 3 zones

25000 Kubernetes jobs (see the Job sketch below)
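
A hypothetical sketch of how such a fan-out can be expressed as a single Kubernetes Job; the image, script, and counts are illustrative assumptions, not the actual setup:

# Hypothetical fan-out of the reprocessing as one Kubernetes Job.
# Image, script and counts are illustrative assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: reprocess
spec:
  parallelism: 1000        # pods running concurrently
  completions: 25000       # total worker pods to run to completion
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: worker
          image: registry.example.org/analysis/worker:latest  # placeholder image
          command: ["/run-worker.sh"]  # hypothetical script, one partition per pod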
CERN → NL Region (via Zurich link)

Data was initially transferred to Zurich, then retransferred to the NL region for its higher capacity

The transfer to NL still went via Zurich


CERN → NL Region (via Zurich link)

Cluster Creation: 5 min
Image Pre-Pull:   4 min
Data Stage-In:    4 min
Process:          90 sec


CERN → Zurich → NL Region

Cluster Creation: 5 min
Image Pre-Pull:   4 min
Data Stage-In:    4 min
Process:          90 sec


GCP Pricing
Billing is updated daily, though there are APIs to query for details

Considering a ~10 minute run (1/6 of an hour), compute table prices for the NL region imply:

$1.043 * 1530 / 6 ≈ $266 (~5x cheaper if using pre-emptibles)

Parking storage cost for the dataset (monthly cost, lots of room for creativity):

$0.020 * 70000 = $1400

Total for the run under $300 USD

Running on credits, no Committed Use or Sustained Use discounts


Ongoing Work
Use Case: Notebooks, ML Pipelines

1. User Notebook: build, validate model
2. Distributed Compute: train at scale (see the TFJob sketch below)
3. Persistent Storage for feedback
4. Serving
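
Step 2 maps naturally onto Kubeflow's training operators; a minimal TFJob sketch, where the replica count, image, and training script are assumptions for illustration:

# Hypothetical Kubeflow TFJob for the "train at scale" step.
# Replicas, image and script are illustrative assumptions.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: train-at-scale
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 8
      template:
        spec:
          containers:
            - name: tensorflow   # the tf-operator expects this container name
              image: tensorflow/tensorflow:latest-gpu
              command: ["python", "/train.py"]
              resources:
                limits:
                  nvidia.com/gpu: 1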
Cloud Bursting

Recently deployed in our staging cluster

Burst to public clouds when needed


[Diagram: CERN hub cluster bursting to an external Kubernetes endpoint through a Virtual Kubelet node exposing additional CPUs and GPUs]
Especially interesting for ML: GPUs, TPUs

Transparent to CERN users

On demand - only pay for actual use (testing in a workshop soon)

External endpoint can be any Kubernetes cluster (see the pod sketch after this list)

Trying with Google Cloud (GKE)
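
As a sketch of what scheduling onto burst capacity could look like, assuming Virtual Kubelet's conventional provider taint and an illustrative node label:

# Hypothetical pod targeting the Virtual Kubelet node; label, taint
# and image are assumptions, not the actual CERN configuration.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training
spec:
  nodeSelector:
    type: virtual-kubelet              # assumed label on the virtual node
  tolerations:
    - key: virtual-kubelet.io/provider # conventional Virtual Kubelet taint
      operator: Exists
      effect: NoSchedule
  containers:
    - name: train
      image: tensorflow/tensorflow:latest-gpu
      resources:
        limits:
          nvidia.com/gpu: 1            # GPU provided by the external cloud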


Open Issues
Policy definition and enforcement for external resources

Accounting

Storage
    Parking costs
    Remote Access vs Replication (Hot Cache? Already done for CVMFS)
    Handling and cost of output data (egress)

Scale test for the network / gateway setup


Other Points of Interest
Kubeflow

Batch on GKE
    Quick call with the team
    Open Sourcing? Maybe. Tied to GKE? Yes, for now
    Fair Share relying on Budgets? They fit the unlimited-resources model better

KNative / Cloud Run

Anthos
