# Architecture Overview

# cnvrg Software Architecture

cnvrg is a Kubernetes-based deployment managed by a Kubernetes Operator. The platform consists of control plane nodes and worker nodes that run the ML workloads.

# Control Plane

  • cnvrg Application: Runs the main application; in charge of the Web UI, API services, cnvrg application logic, and cnvrg Scheduler.
  • cnvrg Sidekiq: Handles job orchestration, executes all cnvrg jobs, and monitors the lifecycle of each job. It also manages system metrics and sends alerts.
  • cnvrg Scheduler (when enabled): Picks app jobs submitted by users according to their submission time and priority.
  • PostgreSQL: A free and open-source relational database management system (RDBMS) that stores the cnvrg platform metadata. External PostgreSQL and managed solutions like Amazon RDS are supported.
  • Redis: A distributed in-memory key-value database, cache, and message broker used as Sidekiq’s database. It stores job executions, schedules, and cron-type jobs.

# Logging Stack

  • Elasticsearch: Stores cnvrg logs and endpoint logs, and indexes dataset metadata.
  • Kibana: A free and open-source interface that helps visualize Elasticsearch data, navigate the Elastic Stack, and provide a dashboard for viewing data.
  • Fluent Bit: Collects logs from the different cnvrg pods and forwards them to Elasticsearch.
  • ElastAlert: An alerting framework on top of Elasticsearch that monitors data and alerts on specific rules. Used to configure custom alerts on cnvrg Endpoints.

# Monitoring Stack

  • Prometheus: A time-series database used to store system metrics and custom metrics from cnvrg job exporters and other system exporters.
  • Grafana: A dashboard to view different metrics and visualizations from Prometheus and other sources.
  • Node Exporter: Provides hardware and OS-level system metrics exposed by *NIX kernels through metric collectors.

# cnvrg Storage

# Object Store

cnvrg uses S3-compatible storage components so that users can save their project files, artifacts, and datasets under managed, data science-oriented version control.

Supported Storage Types:

  • S3 Bucket: For EKS clusters.
  • Google Cloud Bucket: For GKE clusters.
  • Azure Blob Storage: For AKS clusters.
  • MinIO: For on-premise clusters.

cnvrg can connect to different storage solutions as long as they support S3-compatible object storage.

# NFS Server

cnvrg can connect to an external NFS server to enable “Dataset Caching”. With caching, a job starts immediately, without re-downloading the datasets and copying the files from storage to the pod on every run. If a PVC is already provisioned on the cluster and can serve as the NFS server (it allows read/write), cnvrg can use it to enable dataset caching.

# cnvrg Networking

Ai Studio supports different network configurations. The cluster can be installed and configured internally within the customer’s network or can be publicly available. Users will interact with the cnvrg application through HTTP (port 80) or HTTPS (port 443). For HTTPS, a trusted wildcard TLS certificate should be provided.

# Ingress Controller

Ai Studio supports different types of ingress controllers:

  • Istio (default)
  • Vanilla K8S Ingress Rules
  • OpenShift Routes
  • NodePort

# cnvrg Installation Requirements

This section describes the minimum and recommended resource requirements of a Kubernetes cluster. Ensure the following requirements are met for each orchestration.

# Compute Resources

| Node Type | CPU (Per Node) | Memory (Per Node) | Storage (Per Node) | Node Count |
|---|---|---|---|---|
| Kubernetes control plane | 4 CPU | 4GB | 100GB | 3 |
| cnvrg control plane | 8 CPU | 32GB | 100GB | 3 |
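
As a quick sanity check on cluster sizing, the per-node figures above can be multiplied out into per-pool totals. The sketch below only restates the table; the `NodePool` class and its field names are illustrative and are not part of cnvrg.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodePool:
    """One row of the compute resources table (names are illustrative)."""
    name: str
    cpu_per_node: int
    memory_gb_per_node: int
    storage_gb_per_node: int
    node_count: int

    def totals(self) -> dict:
        """Multiply the per-node figures by the node count."""
        return {
            "cpu": self.cpu_per_node * self.node_count,
            "memory_gb": self.memory_gb_per_node * self.node_count,
            "storage_gb": self.storage_gb_per_node * self.node_count,
        }


# Values copied from the table above.
POOLS = [
    NodePool("Kubernetes control plane", 4, 4, 100, 3),
    NodePool("cnvrg control plane", 8, 32, 100, 3),
]

for pool in POOLS:
    print(pool.name, pool.totals())
```

Under these figures, the cnvrg control plane pool adds up to 24 CPUs, 96GB of memory, and 300GB of node storage.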

# Storage Resources

| Workload | Size (Minimum) | Size (Recommended) | Type |
|---|---|---|---|
| PostgreSQL | 80GB | 200GB | CSI-compatible (preferably block) |
| Elasticsearch | 80GB | 200GB | CSI-compatible (preferably block) |
| Prometheus | 50GB | 100GB | CSI-compatible (preferably block) |
| ElastAlert | 30GB | 50GB | CSI-compatible (preferably block) |
| Object storage | 1TB | - | - |
| Notebooks/Experiments | 500GB | - | CSI-compatible (preferably block) |
| Dataset Caching | 500GB | - | CSI-compatible/NFS |
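
To estimate the overall storage footprint, the sizes above can be summed. The following is a minimal sketch that assumes 1TB is counted as 1000GB and that rows without a recommended size fall back to their minimum; both choices, along with the function name, are assumptions made for illustration.

```python
# (workload, minimum_gb, recommended_gb) copied from the storage table.
# None means no recommended size is listed. 1TB is treated as 1000GB (assumption).
STORAGE = [
    ("PostgreSQL", 80, 200),
    ("Elasticsearch", 80, 200),
    ("Prometheus", 50, 100),
    ("ElastAlert", 30, 50),
    ("Object storage", 1000, None),
    ("Notebooks/Experiments", 500, None),
    ("Dataset Caching", 500, None),
]


def total_storage_gb(rows, use_recommended=False):
    """Sum storage sizes, falling back to the minimum when no recommendation exists."""
    total = 0
    for _name, minimum, recommended in rows:
        if use_recommended and recommended is not None:
            total += recommended
        else:
            total += minimum
    return total


minimum_total = total_storage_gb(STORAGE)
recommended_total = total_storage_gb(STORAGE, use_recommended=True)
print(f"minimum: {minimum_total}GB, recommended: {recommended_total}GB")
```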

# Network Resources

  • An unused, allocatable IP address from the Kubernetes subnet, for kube-proxy in IPVS mode.
  • Domain name (internal or external) e.g., cnvrg.my-company.com
  • DNS A wildcard record e.g., *.cnvrg.my-company.com -> 192.168.1.2
  • Trusted wildcard TLS certificates for HTTPS e.g., *.cnvrg.my-company.com
  • User/Password for SMTP access (if enabled)
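
A wildcard TLS certificate such as `*.cnvrg.my-company.com` covers exactly one additional label to the left of the base domain. The sketch below shows that matching rule so candidate hostnames can be checked before a certificate is requested. The function name and the `app` hostname are hypothetical, and the code does not reflect how cnvrg itself validates certificates.

```python
def covered_by_wildcard(hostname: str, pattern: str) -> bool:
    """Return True if hostname matches a certificate name such as '*.example.com'.

    The wildcard stands for exactly one DNS label, so neither the bare
    domain nor deeper subdomains match.
    """
    host = hostname.lower()
    pat = pattern.lower()
    if not pat.startswith("*."):
        # No wildcard: require an exact match.
        return host == pat
    suffix = pat[1:]  # e.g. ".cnvrg.my-company.com"
    if not host.endswith(suffix):
        return False
    label = host[: -len(suffix)]
    return bool(label) and "." not in label


pattern = "*.cnvrg.my-company.com"
for name in ("app.cnvrg.my-company.com", "cnvrg.my-company.com", "a.b.cnvrg.my-company.com"):
    print(name, covered_by_wildcard(name, pattern))
```

Only the first hostname is covered; the bare domain and the two-level subdomain would each need their own certificate name.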

# Control Plane Pods Resource Requirements

The CPU and memory requirements below are expressed as:

  • Request: Minimum resources required to run the application.
  • Limit: Resources to allow components to burst under load.

cnvrg deploys a Horizontal Pod Autoscaler (HPA) to automatically scale the workload to match demand by increasing the pod count of a component.
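
For intuition about how an HPA picks a pod count, the sketch below implements the basic scaling rule from the Kubernetes documentation: the desired replica count is the current count scaled by the ratio of the observed metric to its target, rounded up. The metric targets and replica bounds that cnvrg configures are not stated here, so the numbers in the example are purely illustrative.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Basic HPA rule: ceil(current_replicas * current_metric / target_metric),
    clamped to the [min_replicas, max_replicas] range.

    The default bounds are illustrative, not cnvrg settings.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Two pods averaging 90% CPU against a 60% target scale out to three pods.
print(desired_replicas(2, 90, 60))
```

The real controller also applies a tolerance band and stabilization windows, which this sketch leaves out.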

| Workload | Replicas | CPU Request | CPU Limit | Memory Request | Memory Limit | Storage |
|---|---|---|---|---|---|---|
| webapp | 1 | 2000m | 4000m | 4Gi | 8Gi | - |
| sidekiq | 2 | 1000m | 2000m | 3750Mi | 8Gi | - |
| searchkiq | 1 | 750m | 2000m | 1Gi | 8Gi | - |
| systemkiq | 1 | 500m | 2000m | 1Gi | 8Gi | - |
| hyper | 1 | 100m | 2000m | 200Mi | 4Gi | - |
| postgres | 1 | 4000m | 12000m | 4Gi | 32Gi | 80Gi |
| redis | 1 | 100m | 1000m | 200Mi | 2Gi | 10Gi |
| elasticsearch | 1 | 2000m | 4000m | 4Gi | 8Gi | 80Gi |
| kibana | 1 | 100m | 1000m | 200Mi | 2Gi | - |
| elastalert | 1 | 100m | 400m | 200Mi | 800Mi | 30Gi |
| grafana | 1 | 100m | 200m | 100Mi | 200Mi | - |
| prometheus | 1 | 200m | 2000m | 500Mi | 4Gi | 50Gi |
| capsule | 1 | 200m | 1000m | 500Mi | 1Gi | 100Gi |
| cnvrg-operator | 1 | 500m | 1000m | 200Mi | 1Gi | - |
| config-reloader | 1 | 100m | 1000m | 200Mi | 1Gi | - |
| kube-state-metrics | 1 | 200m | 1000m | 200Mi | 1Gi | - |
| mpi-operator | 1 | 100m | 1000m | 100Mi | 1Gi | - |
| scheduler | 1 | 500m | 2000m | 100Mi | 4Gi | - |
| fluentbit | per node | 200m | 200m | 2000m | 2Gi | - |
| node-exporter | per node | 10m | 20m | 20Mi | 40Mi | - |
| dcgm-exporter | per NVIDIA GPU node | 100m | 500m | 100Mi | 1Gi | - |

TOTAL: 19 components, 12550m (~13 CPU) Request, 37600m (~38 CPU) Limit, 20Gi Request, 94Gi Limit, 350Gi Storage
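
The CPU request and storage totals can be reproduced by parsing the Kubernetes-style quantities in the table and summing them. The sketch below counts each workload once, regardless of its replica count, and leaves out the per-node rows (fluentbit, node-exporter, dcgm-exporter). That reading is an assumption, chosen because it matches the totals stated above. The helper names are illustrative.

```python
def cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '2000m' or '2' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def storage_gib(quantity: str) -> int:
    """Convert a storage quantity such as '80Gi' to GiB; '-' means none."""
    if quantity == "-":
        return 0
    if quantity.endswith("Gi"):
        return int(quantity[:-2])
    raise ValueError(f"unsupported storage quantity: {quantity!r}")


# (workload, CPU request, storage) copied from the table, per-node rows omitted.
WORKLOADS = [
    ("webapp", "2000m", "-"), ("sidekiq", "1000m", "-"),
    ("searchkiq", "750m", "-"), ("systemkiq", "500m", "-"),
    ("hyper", "100m", "-"), ("postgres", "4000m", "80Gi"),
    ("redis", "100m", "10Gi"), ("elasticsearch", "2000m", "80Gi"),
    ("kibana", "100m", "-"), ("elastalert", "100m", "30Gi"),
    ("grafana", "100m", "-"), ("prometheus", "200m", "50Gi"),
    ("capsule", "200m", "100Gi"), ("cnvrg-operator", "500m", "-"),
    ("config-reloader", "100m", "-"), ("kube-state-metrics", "200m", "-"),
    ("mpi-operator", "100m", "-"), ("scheduler", "500m", "-"),
]

total_cpu_request = sum(cpu_millicores(cpu) for _, cpu, _ in WORKLOADS)
total_storage = sum(storage_gib(size) for _, _, size in WORKLOADS)
print(f"CPU request: {total_cpu_request}m, storage: {total_storage}Gi")
```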

*Disclaimer: The table above covers the cnvrg control plane components only and does not include users' workloads (experiments, workspaces, etc.).

Last Updated: 8/22/2024, 6:52:49 PM