Notes

A collection of technical notes, reference materials, and things I’ve learned along the way. These are my personal knowledge base entries — not polished tutorials, but practical notes for quick reference.


## Cloud Native

Notes on Kubernetes, container orchestration, and cloud-native technologies.

Kubernetes Cluster Architecture

A Kubernetes cluster consists of a set of worker machines, called nodes, that run containerized applications. Every cluster has at least one worker node.

The worker node(s) host the Pods that are the components of the application workload. The control plane manages the worker nodes and the Pods in the cluster. In production environments, the control plane usually runs across multiple computers and a cluster usually runs multiple nodes, providing fault-tolerance and high availability.

Figure 1: Kubernetes Cluster Architecture (diagram of cluster components)

Control Plane Components

The control plane’s components make global decisions about the cluster (for example, scheduling), as well as detecting and responding to cluster events.

kube-apiserver

The API server is the front end for the Kubernetes control plane. It exposes the Kubernetes API and is designed to scale horizontally.

  • Role: Central communication hub; authenticates and authorizes requests.

etcd

Consistent and highly-available key value store used as Kubernetes’ backing store for all cluster data.

  • Role: Single source of truth for the entire cluster state.

kube-scheduler

Watches for newly created Pods with no assigned node, and selects a node for them to run on.

  • Role: Decides Pod placement based on resource requirements and constraints.

kube-controller-manager

Runs controller processes that maintain the desired state of the cluster.

  • Key Controllers: Node, Job, EndpointSlice, and ServiceAccount controllers.

cloud-controller-manager

Embeds cloud-specific control logic to link your cluster into your cloud provider’s API.

  • Role: Manages cloud-specific resources like load balancers and routes.

Node Components

Node components run on every node, maintaining running pods and providing the Kubernetes runtime environment.

kubelet

An agent that runs on each node in the cluster. It makes sure that containers are running in a Pod.

  • Role: Manages the lifecycle of containers within a Pod according to PodSpecs.

kube-proxy

A network proxy that runs on each node in your cluster, implementing part of the Kubernetes Service concept.

  • Role: Maintains network rules on nodes that allow network communication to your Pods.

Container Runtime

The software that is responsible for running containers.

  • Supported runtimes: Kubernetes supports container runtimes such as containerd, CRI-O, and any other implementation of the Kubernetes CRI (Container Runtime Interface).

Addons

Addons use Kubernetes resources (DaemonSet, Deployment, etc.) to implement cluster features.

  • DNS: Cluster DNS is a DNS server, in addition to the other DNS server(s) in your environment, which serves DNS records for Kubernetes services.
  • Web UI (Dashboard): A general purpose, web-based UI for Kubernetes clusters.
  • Container Resource Monitoring: Records generic time-series metrics about containers in a central database.
  • Cluster-level Logging: Responsible for saving container logs to a central log store with a search/browsing interface.

Last updated: 2026-02-18


Kubernetes Fundamentals

Quick reference for core Kubernetes concepts and common operations.

Core Concepts

Pod Lifecycle

  • Pending: Pod accepted but containers not created
  • Running: At least one container running
  • Succeeded: All containers terminated successfully
  • Failed: All containers terminated, at least one with failure
  • Unknown: State cannot be determined

Common Commands

# Get pod details with wide output
kubectl get pods -o wide

# Watch pods in real-time
kubectl get pods -w

# Get pod logs (follow)
kubectl logs -f <pod-name>

# Execute into a pod
kubectl exec -it <pod-name> -- /bin/bash

# Port forward
kubectl port-forward <pod-name> 8080:80

Resource Management

Resource Requests vs Limits

  • Requests: Guaranteed resources for scheduling
  • Limits: Maximum resources a container can use
resources:
  requests:
    memory: "64Mi"
    cpu: "250m"
  limits:
    memory: "128Mi"
    cpu: "500m"

Debugging

Execution Flow: kubectl apply

What happens when you execute kubectl apply -f deploy.yaml? (Reference: what-happens-when-k8s)

1. Client Side (kubectl)

  • Validation: Client-side linting and validation of the manifest.
  • Generators: Assembling the HTTP request (converting YAML to JSON).
  • API Discovery: Version negotiation to find the correct API group and version.
  • Authentication: Loading credentials from kubeconfig.

2. kube-apiserver

  • Authentication: Verifies “Who are you?” (Certs, Tokens, etc.).
  • Authorization: Verifies “Are you allowed to do this?” (RBAC).
  • Admission Control: Mutating/Validating admission controllers (e.g., setting defaults, checking quotas).
  • Persistence: The validated resource is stored in etcd.

3. Control Plane (Controllers & Scheduler)

  • Deployment Controller: Notices the new Deployment and creates a ReplicaSet.
  • ReplicaSet Controller: Notices the new ReplicaSet and creates Pods.
  • Scheduler: Watches for unscheduled Pods and assigns them to a healthy Node based on predicates and priorities.

4. Node Side (kubelet)

  • Pod Sync: The kubelet on the assigned Node notices the Pod.
  • CRI: Container Runtime Interface pulls images and starts containers.
  • CNI: Container Network Interface sets up Pod networking and IP allocation.
  • CSI: Container Storage Interface mounts requested volumes.

Common Issues

  1. ImagePullBackOff: Check image name, registry access, secrets
  2. CrashLoopBackOff: Check container logs, resource limits
  3. Pending: Check node resources, affinity rules, PVC binding

Last updated: 2026-02-09


Kubernetes Networking & CNI

Kubernetes networking is based on a set of fundamental principles that ensure every container can communicate with every other container in a flat, NAT-less network space.

The 4 Networking Problems

Kubernetes addresses four distinct networking challenges:

  1. Container-to-Container: Solved by Pods and localhost communications.
  2. Pod-to-Pod: The primary focus of the CNI, enabling direct communication between Pods.
  3. Pod-to-Service: Handled by Services (kube-proxy, iptables/IPVS).
  4. External-to-Service: Managed by Services (LoadBalancer, NodePort, Ingress).

The 3 “Golden Rules”

To be Kubernetes-compliant, any networking implementation (CNI plugin) must satisfy these three requirements:

  1. Pod-to-Pod: All Pods can communicate with all other Pods without NAT.
  2. Node-to-Pod: All Nodes can communicate with all Pods (and vice-versa) without NAT.
  3. Self-IP: The IP that a Pod sees itself as is the same IP that others see it as.

The CNI (Container Network Interface)

Kubernetes doesn’t implement networking itself; it offloads this to CNI plugins (like Calico, Flannel, Cilium).

CNI Lifecycle & The Flow of a Pod

When a Pod is scheduled, several components coordinate to ensure it gets networking. Here is the visual flow:

sequenceDiagram
    participant S as Scheduler
    participant K as Kubelet
    participant CRI as Container Runtime (CRI)
    participant CNI as CNI Plugin
    participant NS as Network Namespace

    S->>K: Assign Pod to Node
    K->>CRI: Create Pod Sandbox
    CRI->>NS: Create Network Namespace
    CRI->>CNI: Invoke ADD Command
    CNI->>CNI: Create veth pair
    CNI->>NS: Move eth0 to NS
    CNI->>CNI: IPAM (Assign IP)
    CNI->>NS: Configure Routing
    CNI-->>CRI: Success
    CRI-->>K: Pod Ready
    K->>CRI: Start App Containers

  1. Scheduling: The Scheduler assigns a Pod to a Node. This is updated in the API Server.
  2. Kubelet Action: The Kubelet on the assigned Node watches the API Server. When it sees a new Pod assigned to it, it starts the creation process.
  3. CRI Invocation: Kubelet calls the Container Runtime Interface (CRI) to create the Pod sandbox.
  4. Network Namespace Creation: The Container Runtime creates a Linux network namespace for the Pod. This isolates the Pod’s network stack from the host and other Pods.
  5. CNI Trigger: The CRI identifies the configured CNI plugin and invokes it with the ADD command.
  6. CNI Plugin Execution: The CNI Plugin performs the “Golden Rule” setup:
    • veth pair: It creates a virtual ethernet pair.
    • Plumbing: One end is kept in the host namespace, and the other is moved into the Pod’s namespace and renamed to eth0.
    • IPAM: It calls an IPAM (IP Address Management) plugin to assign a unique IP from the Node’s allocated CIDR range.
    • Routing: It configures the default gateway and routes inside the Pod so it can talk to the rest of the cluster.
  7. Success: The CNI returns success to the CRI, which then returns to the Kubelet.
  8. App Start: Finally, the Kubelet starts the actual application containers inside the now-networked sandbox.

Traffic leaves the Pod via eth0, enters the host via the other end of the veth pair, and is then handled by the CNI’s data plane (Bridge, Routing, or eBPF).

The Life of a Packet (Pod-to-Service)

Understanding how a packet travels from one Pod to another through a Service is key to mastering Kubernetes networking.

sequenceDiagram
    participant PodA as Pod A (Node 1)
    participant Node1 as Node 1 Kernel (kube-proxy)
    participant Net as Physical Network
    participant Node2 as Node 2 Kernel
    participant PodB as Pod B (Node 2)

    PodA->>Node1: Request to Service IP
    Note over Node1: Intercept & DNAT (Service IP -> Pod B IP)
    Note over Node1: Routing Decision (Pod B is on Node 2)
    Node1->>Net: Send via CNI (Overlay/Direct)
    Net->>Node2: Arrive at Node 2
    Node2->>PodB: Forward to Pod Namespace
    PodB-->>PodA: Response

Step-by-Step Journey:

  1. Request Initiation: Pod A (on Node 1) sends a request to a Service IP (ClusterIP).
  2. Kernel Interception: The packet leaves the Pod via the veth pair and hits the Node 1 Kernel. Rules programmed by kube-proxy (iptables or IPVS) intercept the packet in the nat PREROUTING chain (or OUTPUT, for host-originated traffic).
  3. Destination NAT (DNAT): The Kernel performs DNAT, rewriting the destination IP from the Service’s Virtual IP (VIP) to the real IP of a healthy backend Pod (e.g., Pod B on Node 2).
  4. Routing Decision: The Kernel makes a routing decision. It determines that Pod B’s IP is reachable via the CNI’s interface (e.g., an overlay network like vxlan or direct routing).
  5. CNI Transmit: The CNI plugin encapsulates (if overlay) or routes the packet across the physical network to Node 2.
  6. Node 2 Arrival: The packet arrives at Node 2, is decapsulated by its CNI, and the Kernel identifies it’s destined for a local Pod.
  7. Success: The packet is forwarded into Pod B’s network namespace via its veth pair. Pod B receives the request!

How Services match Pods

Services use a discovery mechanism to track which Pods should receive traffic. This process is driven by Label Selectors:

  • Label Selectors: Defined in the Service’s specification, these core identifiers tell the cluster exactly which Pods to target. A Service (the stable front door) selects any Pod whose labels match its selector to be its backend.
  • EndpointSlices: These are the dynamic list of targets (IPs and ports). The system automatically populates EndpointSlice resources with matching Pods. By splitting the list into smaller “slices,” Kubernetes can scale efficiently to thousands of Pods, avoiding the bottlenecks of the legacy Endpoints resource.
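As a minimal sketch of this matching (the name `web` and label `app: web` are hypothetical), a Service selects its backends purely by label match:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web              # hypothetical Service name
spec:
  selector:
    app: web             # any Pod carrying this label becomes a backend
  ports:
  - port: 80             # port exposed on the ClusterIP
    targetPort: 8080     # port the selected Pods actually listen on
```

The generated slices can then be inspected with `kubectl get endpointslices -l kubernetes.io/service-name=web`.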

Kubernetes Service Types

Kubernetes Services are built like building blocks, where each type typically adds a layer on top of the previous one:

  1. ClusterIP (Default): Exposes the Service on a cluster-internal IP. This is the foundation for almost all other Service types.
  2. NodePort: Exposes the Service on each Node’s IP at a static port (between 30000-32767). Critically: A NodePort Service automatically creates its own ClusterIP to route traffic to backend Pods.
  3. LoadBalancer: Exposes the Service externally using a cloud provider’s load balancer. This builds upon both NodePort and ClusterIP, configuring the cloud to route external traffic to NodePorts.
  4. ExternalName: Maps the Service to a DNS name (produces a CNAME record). It bypasses selectors and proxying entirely, allowing you to treat external services as internal ones.
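To make the layering concrete, a NodePort sketch (names and ports are placeholders) — the ClusterIP is still created underneath:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-nodeport     # hypothetical
spec:
  type: NodePort
  selector:
    app: web
  ports:
  - port: 80             # ClusterIP port (created automatically)
    targetPort: 8080     # container port on the backend Pods
    nodePort: 30080      # static port on every Node (must fall in 30000-32767)
```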

Headless Services

When you don’t need a single Virtual IP (VIP) to load balance traffic, you can create a Headless Service by setting .spec.clusterIP: None.

  • Instead of the DNS returning a single ClusterIP, a query for a headless service returns the direct A records (individual IPs) of all matching Pods.
  • This is essential for StatefulSets, where you need to reach specific Pod instances, or when implementing custom service discovery.
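A minimal headless Service sketch (names are hypothetical):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: db-headless      # hypothetical
spec:
  clusterIP: None        # headless: no VIP, DNS returns Pod IPs directly
  selector:
    app: db
  ports:
  - port: 5432
```

Combined with a StatefulSet, this also enables stable per-Pod DNS names such as db-0.db-headless.<namespace>.svc.cluster.local.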

DNS in Kubernetes (CoreDNS)

DNS serves as the cluster’s phonebook, translating service names into IP addresses. In modern clusters, this is handled by CoreDNS.

  • Architecture: CoreDNS runs as a Deployment (usually in the kube-system namespace) and is exposed via a Service named kube-dns.
  • Discovery: CoreDNS watches the Kubernetes API for new Services and EndpointSlices, dynamically generating DNS records.
  • Client Config: The Kubelet configures every Pod’s /etc/resolv.conf to point at the kube-dns Service IP.
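For orientation, a typical Corefile looks roughly like the following; this sketch mirrors the common kubeadm default, and the exact plugin list varies by distribution:

```text
.:53 {
    errors
    health
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf   # send non-cluster names upstream
    cache 30
    loop
    reload
}
```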

The Resolution Process

When a Pod queries a name like my-svc, the OS resolver iterates through the search domains defined in /etc/resolv.conf until it finds a match.

sequenceDiagram
    participant App as Application
    participant OS as OS Resolver (/etc/resolv.conf)
    participant DNS as CoreDNS (kube-dns Service)

    App->>OS: Resolve "my-svc"
    Note over OS: iterate search domains
    OS->>DNS: Query: my-svc.default.svc.cluster.local?
    DNS-->>OS: A Record: 10.96.0.100 (Success)
    OS-->>App: Return 10.96.0.100

    Note over App,DNS: Scenario: External Domain (ndots:5)
    App->>OS: Resolve "google.com"
    OS->>DNS: Query: google.com.default.svc.cluster.local?
    DNS-->>OS: NXDOMAIN
    Note over OS: ... more internal retries ...
    OS->>DNS: Query: google.com?
    DNS-->>OS: A Record: 142.250.x.x
    OS-->>App: Return IP

  • Record Types:
    • A Records: Resolve to a Service’s ClusterIP (Standard) or multiple Pod IPs (Headless).
    • SRV Records: Created for named ports (e.g., _http._tcp.my-svc.ns.svc.cluster.local), allowing for dynamic port discovery.
    • CNAME Records: Used for ExternalName services to point to external hostnames.

Performance & Scalability

As clusters grow, DNS can become a bottleneck or a source of latency.

  • The “ndots:5” Trap: By default, if a name has fewer than 5 dots, Kubernetes tries internal search domains first. For external names like api.github.com, this causes several failing internal queries (NXDOMAIN) before hitting the external resolver.
    • Pro Tip: Use a trailing dot (google.com.) for external names to bypass the search path.
  • NodeLocal DNSCache: Runs a DNS caching agent on every node as a DaemonSet. It drastically reduces latency and prevents conntrack exhaustion (UDP session tracking limits) in the Linux kernel during high DNS volume.
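For external-heavy workloads, ndots can also be lowered per Pod via dnsConfig. A sketch (the Pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: low-ndots            # hypothetical
spec:
  containers:
  - name: app
    image: busybox           # placeholder image
    command: ["sleep", "3600"]
  dnsConfig:
    options:
    - name: ndots
      value: "2"             # names with >= 2 dots now skip the search-domain walk
```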

Debugging Kubernetes Networking

When network issues arise, follow a Bottom-Up troubleshooting flow, starting from the source Pod and moving up the abstraction layers.

flowchart TD
    Start[Issue: Pod A cannot reach Service B] --> Net{1. Pod Networking OK?}
    Net -- No --> FixNet[Check CNI / Routes / NetPol]
    Net -- Yes --> DNS{2. DNS Resolution OK?}
    DNS -- No --> FixDNS[Check CoreDNS / Config]
    DNS -- Yes --> Svc{3. Service IP Reachable?}
    Svc -- No --> FixSvc[Check kube-proxy / Spec]
    Svc -- Yes --> EP{4. Endpoints Populated?}
    EP -- No --> FixEP[Check Selectors / Readiness]
    EP -- Yes --> App[5. Check Application Logs]

The Tool: Ephemeral Containers

Avoid installing debug tools in production images. Instead, use ephemeral containers to attach a “debug sidecar” (like netshoot) to a running Pod:

kubectl debug -it <pod-name> --image=nicolaka/netshoot

1. Pod Connectivity (The Foundation)

Verify the Pod can talk to the host and itself.

  • Check IPs: ip addr show (does eth0 match kubectl get pod -o wide?)
  • Check Routes: ip route show (is there a default gateway?)
  • Issue: If eth0 or routes are missing, the CNI plugin failed. Check CNI node logs (e.g., calico-node, cilium-agent).

2. DNS (The Phonebook)

If the Pod has an IP, check if it can resolve names.

  • Test Resolution: nslookup my-service
    • NXDOMAIN: Name doesn’t exist (check namespace/spelling).
    • Timeout: CoreDNS is unreachable (check CoreDNS pods and NetworkPolicies).
  • Check Config: cat /etc/resolv.conf (verify the nameserver is the kube-dns Service IP).

3. Services (The Virtual IP)

If DNS works, verify the Service and its endpoints.

  • Test Connectivity: nc -zv <service-ip> <port>
  • Check Endpoints: kubectl get endpointslices -l kubernetes.io/service-name=<service-name>
  • Common Issue: Hairpin Traffic: A Pod failing to reach itself via its own Service IP. Ensure the Kubelet is running with --hairpin-mode=hairpin-veth.

4. Packet Level (The Truth)

When logs aren’t enough, use tcpdump to see what’s on the wire.

  • Capture: tcpdump -i eth0 -w /tmp/capture.pcap
  • Analyze: Copy the file to your machine and open in Wireshark:
    kubectl cp <pod-name>:/tmp/capture.pcap ./capture.pcap -c <debug-container-name>
    

    Look for TCP Retransmissions (network drops), RST (closed ports), or sent SYNs with no SYN-ACK (firewall/NetworkPolicy drops).


Last updated: 2026-02-18


Kubernetes Storage: PV, PVC & CSI

Kubernetes Storage: A Deep Dive

Storage in Kubernetes is designed to decouple the physical storage implementation from the application’s request for it. This allows for portable, infrastructure-agnostic deployments.

Stateless vs. Stateful Workloads

Understanding the nature of your workload is the first step in deciding how to handle storage:

  • Stateless: Ephemeral, idempotent, and immutable. Containers can be replaced or rescheduled easily because they don’t store persistent state. Examples: Web servers, API gateways.
  • Stateful: Requires durability and persistence. Data must survive Pod restarts, node failures, and upgrades. Examples: Databases (PostgreSQL, MongoDB), Message Brokers.

The Abstraction Stack

Kubernetes uses several layers to manage storage, moving from high-level requests to low-level implementation.

graph TD
    PVC["PersistentVolumeClaim (PVC)"] -- requests --> SC["StorageClass"]
    SC -- provisions --> PV["PersistentVolume (PV)"]
    PV -- backed by --> Infra["Infrastructure Storage (EBS, Azure Disk, NFS)"]
    Pod["Pod"] -- volumes --> PVC

Storage Lifecycle Flow

The complete path from developer intent to a running application with storage.

sequenceDiagram
    participant User as Developer
    participant K8s as K8s Control Plane
    participant CSI_C as CSI Controller (Provisioner/Attacher)
    participant Sched as K8s Scheduler
    participant Kubelet as Node Kubelet (CSI Node Plugin)

    User->>K8s: Create PVC
    K8s->>CSI_C: Detect PVC (Provisioner)
    CSI_C->>CSI_C: CreateVolume (CSI)
    CSI_C-->>K8s: Create PV & Bind
    User->>K8s: Create Pod
    Sched->>K8s: Assign Pod to Node
    K8s->>CSI_C: Trigger Attachment (Attacher)
    CSI_C->>CSI_C: ControllerPublishVolume (CSI)
    K8s->>Kubelet: Start Pod
    Kubelet->>Kubelet: NodeStage & NodePublish (CSI)
    Kubelet-->>User: Container Started with Volume

1. Persistent Volumes (PV)

A cluster-scoped resource representing actual storage. It has a lifecycle independent of any individual Pod that uses it.

  • Phases: Available → Bound → Released → Failed.
  • Reclaim Policies:
    • Delete: Automatically deletes the underlying infrastructure when the PVC is deleted.
    • Retain: Keeps the storage for manual cleanup (safer for production).
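A statically provisioned PV sketch with the safer Retain policy (the NFS backend, server, and names are illustrative assumptions):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-data                            # hypothetical
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain    # keep the backing storage for manual cleanup
  nfs:                                     # example backend; any supported type works
    server: nfs.example.com
    path: /exports/data
```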

2. Persistent Volume Claims (PVC)

A namespace-scoped request for storage. It’s like a “voucher” that a Pod uses to get a PV.

  • Binds: A PVC binds to a matching PV based on size and access modes.
  • Access Modes:
    • ReadWriteOnce (RWO): One node can mount as read-write.
    • ReadOnlyMany (ROX): Many nodes can mount as read-only.
    • ReadWriteMany (RWX): Many nodes can mount as read-write.
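The “voucher” itself — a minimal PVC sketch (the name and storage class are hypothetical):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim           # hypothetical
spec:
  accessModes:
  - ReadWriteOnce            # RWO: mountable read-write by a single node
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard # assumed class; omit to use the cluster default
```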

3. StorageClasses

Policies for Dynamic Provisioning. Instead of manually creating PVs, an administrator defines a StorageClass. When a PVC request comes in, the cluster creates a PV on the fly.

  • Binding Modes:
    • Immediate: Create volume as soon as PVC is created.
    • WaitForFirstConsumer: Delay creation until the Pod is scheduled (best for multi-zone clusters).
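A hedged StorageClass example (the AWS EBS CSI provisioner and its parameters are illustrative assumptions; substitute your cluster’s driver):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd                           # hypothetical
provisioner: ebs.csi.aws.com               # example CSI driver; varies by cloud
parameters:
  type: gp3
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer    # provision in the zone the Pod lands in
```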

Container Storage Interface (CSI)

The CSI moved storage drivers “out-of-tree,” allowing storage vendors to develop plugins independently of the Kubernetes core.

sequenceDiagram
    participant K8s as K8s API Server
    participant ExtP as External Provisioner
    participant ExtA as External Attacher
    participant CSID as CSI Driver (Controller/Node)
    participant Kube as Kubelet

    K8s->>ExtP: Watch: New PVC
    ExtP->>CSID: CreateVolume (gRPC)
    Note over CSID: Provision Backend Disk
    ExtP-->>K8s: Create PersistentVolume (PV)
    
    K8s->>ExtA: Watch: Pod scheduled to Node
    ExtA->>CSID: ControllerPublishVolume (gRPC)
    Note over CSID: Attach Disk to VM/Host
    
    K8s->>Kube: Pod assigned to local node
    Kube->>CSID: NodeStageVolume (gRPC)
    Note over CSID: Format & Prep Global Mount
    Kube->>CSID: NodePublishVolume (gRPC)
    Note over CSID: Bind Mount into Pod Directory

  • Controller Plugin: Handles cluster-wide tasks like provisioning and attaching.
  • Node Plugin: Runs on every node to handle mounting (NodeStage / NodePublish).

StatefulSets & Storage

StatefulSets are uniquely designed for applications requiring stable identities and storage.

  • volumeClaimTemplates: Creates a unique PVC for each Pod ordinal (e.g., db-0, db-1).
  • Stable Identity: If db-0 crashes and is rescheduled, it will re-attach to the same PVC it had before.
  • PVC Retention Policy: (K8s 1.27+) Control if PVCs are deleted when a StatefulSet is scaled down.
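A trimmed StatefulSet sketch showing volumeClaimTemplates in context (names and the image are placeholders):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db                       # hypothetical
spec:
  serviceName: db-headless       # headless Service providing stable per-Pod DNS
  replicas: 2
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
      - name: db
        image: postgres:16       # example image
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:          # yields one PVC per ordinal: data-db-0, data-db-1
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
```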

Troubleshooting Guide (At a Glance)

When storage issues arise, use these specific flows to pinpoint the failure.

Case 1: PVC is stuck in Pending

This usually happens during the Provisioning phase.

flowchart TD
    Start[PVC stuck in Pending] --> SC{Default StorageClass?}
    SC -- No --> SetSC[Specify SC or set default]
    SC -- Yes --> Match{Matching PV?}
    Match -- Yes --> Bind[Wait for Binding]
    Match -- No --> Dynamic{SC allow dynamic?}
    Dynamic -- No --> CreatePV[Static Provisioning Required]
    Dynamic -- Yes --> FirstConsumer{"WaitForFirstConsumer?"}
    FirstConsumer -- Yes --> SchedulePod["Schedule Pod to Node first"]
    FirstConsumer -- No --> Events["Check describe PVC Events: Quota, Permissions"]

Case 2: Pod is stuck in ContainerCreating

This occurs during the Attachment or Mounting phases.

flowchart TD
    Start[Pod in ContainerCreating] --> Attached{Volume Attached?}
    Attached -- No --> MultiAttach{Multi-Attach Error?}
    MultiAttach -- Yes --> Detach[Force Detach or wait for Old Node]
    MultiAttach -- No --> CSIController[Check CSI Controller Logs]
    Attached -- Yes --> Mounted{Node Mounted?}
    Mounted -- No --> CSINode[Check CSI Node Plugin Logs]
    Mounted -- Yes --> SecretConfig{ConfigMap/Secret present?}
    SecretConfig -- No --> CreateResources[Create missing resources]
    SecretConfig -- Yes --> Permissions[Check SecurityContext & fsGroup]

Case 3: PVC is stuck in Terminating

This happens when you try to delete a volume that is still in use.

flowchart TD
    Start[PVC stuck in Terminating] --> Clean[Check for Pod consumers]
    Clean --> Finalizer{Finalizer: pvc-protection?}
    Finalizer -- Yes --> RunningPod{"Healthy Pod using it?"}
    RunningPod -- Yes --> DeletePod["Delete Pod first"]
    RunningPod -- No --> Zombie["Check Node for zombie mount"]
    Zombie -- Yes --> Unmount["Force Unmount from Node"]
    Zombie -- No --> Force["Remove Finalizer - AS LAST RESORT"]

Summary of Debug Commands

| Failure Layer | Primary Command | Search For |
| :--- | :--- | :--- |
| PVC | kubectl describe pvc <name> | Events section for provisioner errors. |
| CSI Control | kubectl logs csi-provisioner-... | gRPC CreateVolume failures. |
| Attachment | kubectl get volumeattachment | attached: false in the status. |
| Node/Mount | kubectl describe pod <name> | FailedMount or FailedAttach events. |
| Permissions | kubectl exec -it <pod> -- ls -l | Owner UID/GID of the mount point. |



Last updated: 2026-02-28


## GPU / HPC & AI Infrastructure

Deep dives into GPU computing, NVIDIA MIG/vGPU, DCGM monitoring, vLLM, and AI/ML/HPC infrastructure.

GPU Troubleshooting Fundamentals

Common GPU failure modes and diagnostics in high-performance computing (HPC) and AI infrastructure.

XID Errors

XID errors are error reports from the NVIDIA driver printed to the operating system’s kernel log or event log. They provide a high-level indication of where a failure occurred.

Common XID Codes

  • XID 31 (GPU Memory Page Fault): Typically indicates an application trying to access an invalid memory address. Often a software bug (illegal memory access) but can be triggered by faulty hardware.
  • XID 45 (Preemptive Cleanup): The driver has torn down GPU channels, usually as cleanup after a preceding error; applications running on the GPU are terminated.
  • XID 61 (Internal Microcontroller Error): Internal GPU firmware error, often requiring a node reboot or power cycle.
  • XID 79 (GPU has fallen off the bus): The most critical state, where the GPU is no longer communicating over PCIe.

Diagnostics:

dmesg | grep -i xid
# or
journalctl -k | grep -i xid

ECC Errors (Error Correction Code)

Modern data center GPUs (A100, H100) use ECC to detect and correct memory corruption.

Types of Errors

  1. Single-Bit Errors (SBE): Corrected automatically by hardware without data loss. High counts of SBEs can indicate aging hardware or impending failure.
  2. Double-Bit Errors (DBE): Uncorrectable errors. These lead to immediate application crashes (to prevent data corruption) and require a GPU reset.

Diagnostics:

nvidia-smi -q -d ECC

“Falling off the Bus”

A situation where the GPU becomes completely unresponsive to the host CPU via the PCIe interface. The device remains visible in lspci (usually), but nvidia-smi will report “No devices found” or “Unable to determine the device handle”.

Common Causes

  • Thermal Issues: GPU overheating triggers a survival shutdown.
  • Power Fluctuations: Transient voltage drops causing the GPU to drop its link.
  • PCIe Link Training Failure: Signal integrity issues on the motherboard or riser cards.
  • Firmware/Driver Bugs: Internal state machine lockups.

Recovery

  1. Soft Reset: nvidia-smi -r (if the driver can still talk to the GPU).
  2. Hard Reboot: Cold boot of the physical node.
  3. Firmware Reload: Using specialized tools like flshutil (for HGX systems).

Last updated: 2026-02-18


High-Performance Networking (RDMA, InfiniBand, RoCE)

High-Performance Networking

In GPU clusters and HPC (High-Performance Computing), standard TCP/IP networking often becomes a bottleneck due to high CPU overhead, latency, and frequent context switching. Technologies like RDMA, InfiniBand, and RoCE provide the low-latency, high-throughput interconnects required for distributed AI training.

RDMA (Remote Direct Memory Access)

RDMA allows a computer to access memory on another computer directly, bypassing the operating system kernel and the CPU of the remote machine.

graph LR
    subgraph Node A
        AppA[Application] -- "RDMA Write" --> NIC_A[HCA/NIC]
        MemA[Memory]
    end
    
    subgraph Node B
        AppB[Application]
        MemB[Memory]
        NIC_B[HCA/NIC]
    end
    
    NIC_A -- "Direct Data Transfer" --> NIC_B
    NIC_B -- "Write to Memory" --> MemB
    
    style AppA fill:#f9f,stroke:#333
    style AppB fill:#f9f,stroke:#333
    style NIC_A fill:#bbf,stroke:#333
    style NIC_B fill:#bbf,stroke:#333

  • Zero-Copy: Data is transferred directly into memory without being copied to intermediate buffers in the OS.
  • Kernel Bypass: Applications communicate directly with the network hardware (NIC), avoiding kernel system calls.
  • Lower CPU Utilization: The NIC handles the protocol logic, freeing up the CPU for compute tasks.

InfiniBand (IB)

InfiniBand is a lossless, credit-based network architecture designed from the ground up for high-performance computing.

  • Credit-Based Flow Control: Unlike Ethernet, which drops packets during congestion, IB uses a hardware-level credit system to ensure packets are only sent when the receiving buffer has space.
  • Subnet Manager (SM): A centralized control agent (running on a switch or host) that manages routing and network configuration.
  • Low Latency: Latency is typically measured in sub-microsecond ranges.
  • Speed Generations:
    • HDR: 200 Gbps
    • NDR: 400 Gbps (200 Gbps for NDR200); the following generation, XDR, reaches 800 Gbps

RoCE (RDMA over Converged Ethernet)

RoCE brings RDMA capabilities to standard Ethernet networks.

RoCE v1

  • Layer 2 Protocol: Encapsulated in the Ethernet link layer.
  • Limitation: Not routable beyond a single subnet (L2 only).

RoCE v2

  • Layer 3 Protocol: Encapsulated in UDP/IP.
  • Routable: Can cross router boundaries, making it more scalable for large data centers.

Lossless Requirement (Convergence)

Standard Ethernet is “lossy” (it drops packets). To support RDMA effectively, Ethernet must be made “lossless” using:

  • PFC (Priority Flow Control): Pauses traffic on specific priorities (queues) to prevent buffer overflows.
  • ECN (Explicit Congestion Notification): Informs the sender to slow down before buffers are full.

Comparison Table

| Feature | InfiniBand | RoCE v2 | TCP/IP |
| :--- | :--- | :--- | :--- |
| Transport | Native IB | UDP/IP (Ethernet) | TCP/IP |
| Flow Control | Credit-based (hardware) | PFC/ECN (network configuration) | Congestion avoidance (software) |
| Latency | Extremely low (< 1 µs) | Low (~2-5 µs) | Higher (> 10-20 µs) |
| CPU Overhead | Minimal (RDMA) | Low (RDMA) | High (protocol stack) |
| Deployment | Specialized infrastructure | Converged (standard switches) | Ubiquitous |

Last updated: 2026-03-02


GPU Monitoring with NVIDIA DCGM

Data Center GPU Manager (DCGM) is the industry standard for monitoring and managing NVIDIA GPUs in cluster environments.

DCGM Key Metrics

DCGM provides a wide range of metrics, classified into health, usage, and profiling categories.

| Metric | DCGM Field Name | Description |
| :--- | :--- | :--- |
| GPU Utilization | DCGM_FI_DEV_GPU_UTIL | Traditional activity percentage (see MIG section below) |
| Memory Used | DCGM_FI_DEV_FB_USED | Amount of frame buffer memory used |
| Temperature | DCGM_FI_DEV_GPU_TEMP | Core temperature in degrees Celsius |
| Power Usage | DCGM_FI_DEV_POWER_USAGE | Instantaneous power draw in Watts |
| PCIe Throughput | DCGM_FI_PROF_PCIE_TX_BYTES | Data transmitted over the PCIe bus |

Monitoring MIG (Multi-Instance GPU)

When using MIG (A100/H100), traditional utilization metrics like GPU_UTIL often fail or report incorrectly at the partition level.

GPU_UTIL vs GR_ENGINE_ACTIVE

> [!IMPORTANT]
> For MIG partitions, always use DCGM_FI_PROF_GR_ENGINE_ACTIVE instead of DCGM_FI_DEV_GPU_UTIL.

  • GPU_UTIL (DCGM_FI_DEV_GPU_UTIL): Reports if any kernel is executing. It doesn’t accurately reflect resource consumption within a MIG slice.
  • GR_ENGINE_ACTIVE (DCGM_FI_PROF_GR_ENGINE_ACTIVE): Measures the Graphics Engine activity. This provides a more precise utilization value for both graphics and compute workloads and is fully supported on individual MIG instances.

Other Profiling Metrics for MIG

  • DCGM_FI_PROF_SM_ACTIVE: SM (Streaming Multiprocessor) activity.
  • DCGM_FI_PROF_SM_OCCUPANCY: Ratio of active warps to maximum warps.
  • DCGM_FI_PROF_PIPE_TENSOR_ACTIVE: Utilization of Tensor Cores (critical for LLM/AI).
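
A hedged PromQL sketch of how these fields can be queried — dcgm-exporter exposes the DCGM field names verbatim as Prometheus metric names; the `pod` label assumes the exporter's Kubernetes mapping is enabled and may appear as `exported_pod` depending on your scrape config:

```promql
# Average graphics-engine activity per pod over 5 minutes (MIG-safe)
avg by (pod) (avg_over_time(DCGM_FI_PROF_GR_ENGINE_ACTIVE[5m]))

# Tensor Core activity, useful for spotting underutilized LLM jobs
avg by (pod) (DCGM_FI_PROF_PIPE_TENSOR_ACTIVE)
```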

Kubernetes Integration

In Kubernetes, monitoring is typically handled by dcgm-exporter.

Deployment with Helm

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install dcgm-exporter nvidia/dcgm-exporter \
  --namespace gpu-operator \
  --set "arguments={-f,/etc/dcgm-exporter/default-counters.csv}"

Scraping with Prometheus

dcgm-exporter exposes a /metrics endpoint. In Kubernetes, use a ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
  endpoints:
  - port: metrics
    interval: 15s

MIG Pod Metrics

When dcgm-exporter runs, it automatically appends Kubernetes metadata (pod name, namespace, container name) to the GPU metrics. For MIG, it adds instance labels (GPU_I_ID and GPU_I_PROFILE, or similar) to map specific partitions to the pods consuming them.


Last updated: 2026-02-18


GPU Sharing in Kubernetes

GPU Sharing in Kubernetes

Overview of GPU sharing technologies for maximizing GPU utilization in Kubernetes clusters.

Technologies Comparison

| Technology | Use Case | Isolation | Memory Sharing |
|---|---|---|---|
| MIG | Multi-tenant, inference | Hardware | No (partitioned) |
| vGPU | VMs, legacy apps | Full | No (allocated) |
| Time-slicing | Dev/test, burstable | None | Yes (shared) |
| MPS | CUDA streams | Partial | Yes |

NVIDIA MIG (Multi-Instance GPU)

MIG partitions A100/H100 GPUs into smaller instances with dedicated resources.

Supported Profiles (A100 80GB)

  • 1g.10gb - 1/7 GPU, 10GB memory
  • 2g.20gb - 2/7 GPU, 20GB memory
  • 3g.40gb - 3/7 GPU, 40GB memory
  • 7g.80gb - Full GPU

Configuration

# Enable MIG mode on GPU 0
nvidia-smi -i 0 -mig 1

# Create two 3g.40gb GPU instances (profile ID 9 on A100 80GB; only two fit per GPU)
# and their default compute instances
nvidia-smi mig -cgi 9,9 -i 0 -C

# List instances
nvidia-smi mig -lgi
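
With the device plugin's mixed MIG strategy, each profile is exposed as its own extended resource. A hedged sketch of a pod requesting a single 1g.10gb slice (image tag is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-example
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04  # illustrative tag
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1
```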

Time-Slicing

Share a single GPU across multiple pods with time-based multiplexing.

ConfigMap Example

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        replicas: 4
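
For the GPU Operator's device plugin to pick this up, the ClusterPolicy must reference the ConfigMap. A sketch assuming the operator's default `cluster-policy` name and `gpu-operator` namespace:

```shell
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  -n gpu-operator --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'
```

With `replicas: 4`, each physical GPU then advertises four `nvidia.com/gpu` resources; note there is no memory or fault isolation between the sharing pods.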

Last updated: 2026-02-09


GPU Operator, CDI, and DRA

GPU Operator, CDI, and DRA

Modern Kubernetes infrastructure for managing accelerator lifecycle, standardizing device access, and dynamic resource management.

NVIDIA GPU Operator

The NVIDIA GPU Operator automates the management of all NVIDIA software components needed to provision GPUs in Kubernetes.

Core Components (Operands)

  • NVIDIA Driver: Low-level kernel drivers (can be containerized).
  • NVIDIA Container Toolkit: Configures container runtimes (containerd/CRI-O) to mount GPU resources.
  • NVIDIA Device Plugin: Traditional mechanism for exposing GPUs as extended resources (nvidia.com/gpu).
  • GPU Feature Discovery (GFD): Labels nodes with GPU attributes (model, memory, capabilities).
  • DCGM Exporter: Exports GPU telemetry (utilization, power, temperature) for Prometheus.
  • MIG Manager: Manages Multi-Instance GPU (MIG) partitioning.

Common Configuration (Helm)

helm install gpu-operator nvidia/gpu-operator \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set psp.enabled=false

CDI (Container Device Interface)

CDI is an open specification for container runtimes (containerd, CRI-O) to standardize how third-party devices are made available to containers.

  • Standardization: Replaces runtime-specific hooks with a declarative JSON descriptor.
  • Mechanism: The device plugin returns a fully qualified device name (e.g., nvidia.com/gpu=0), and the runtime uses the CDI spec to inject device nodes, environment variables, and mounts.
  • Benefits: Simplifies the path from device plugin to low-level runtime (runc), moving complex logic out of the runtime itself.
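
A minimal sketch of what a CDI spec looks like on disk (values are illustrative; real specs are generated, e.g. by `nvidia-ctk cdi generate`):

```json
{
  "cdiVersion": "0.6.0",
  "kind": "nvidia.com/gpu",
  "devices": [
    {
      "name": "0",
      "containerEdits": {
        "deviceNodes": [
          { "path": "/dev/nvidia0" }
        ],
        "env": ["NVIDIA_VISIBLE_DEVICES=0"]
      }
    }
  ]
}
```

A runtime resolving `nvidia.com/gpu=0` applies these `containerEdits` (device nodes, env vars, mounts) to the OCI spec before handing it to runc.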

DRA (Dynamic Resource Allocation)

DRA is the next-generation resource management API in Kubernetes (introduced in v1.26, evolving in v1.31+), moving beyond the limitations of the Device Plugin API.

Key Concepts

  • ResourceClaim: A request for specific hardware resources (similar to PVC for storage).
  • DeviceClass: Defines categories of devices (e.g., “high-memory-gpus”) with specific filters.
  • ResourceSlice: Represents the actual hardware availability on nodes.

Benefits over Device Plugins

  1. Rich Filtering: Use CEL (Common Expression Language) to request specific attributes (e.g., device.memory >= 24Gi).
  2. Device Sharing: Better native support for sharing devices across multiple containers/pods.
  3. Hardware Topology: Improved awareness of PCIe/NVLink topologies for multi-GPU workloads.
  4. Decoupled Lifecycle: Allocation happens during scheduling, allowing for more complex “all-or-nothing” scheduling for multi-node jobs.

Example Claim

apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaim
metadata:
  name: gpu-claim
spec:
  devices:
    requests:
    - name: my-gpu
      deviceClassName: nvidia-h100
      selectors:
      - cel:
          expression: "device.memory >= 80Gi"
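
A pod consumes the claim by listing it in `spec.resourceClaims` and referencing it from a container. A sketch (field names have shifted across the DRA alpha versions; image is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  containers:
  - name: train
    image: nvcr.io/nvidia/pytorch:latest  # illustrative
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimName: gpu-claim
```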

Last updated: 2026-03-02


Parallel Filesystems for HPC & AI (Lustre, WEKA)

Parallel Filesystems for HPC & AI

High-performance AI training and simulation workloads require storage that can keep up with thousands of GPUs. Traditional NAS (NFS/SMB) often becomes a bottleneck due to metadata overhead and serial access patterns.

Why Parallel Filesystems?

Parallel filesystems distribute data and metadata across multiple servers, allowing clients to access data in parallel.

  • Striping: Files are broken into chunks (stripes) and spread across multiple storage targets.
  • Separation of Data and Metadata: Metadata operations (ls, open, stat) are handled by dedicated Metadata Servers (MDS), while data is served by Object Storage Servers (OSS).
  • Scalability: Performance scales linearly by adding more storage or metadata nodes.

Lustre

A veteran in the HPC world, powering many of the world’s largest supercomputers.

  • Architecture: Consists of Management Server (MGS), Metadata Servers (MDS), and Object Storage Servers (OSS).
  • Open Source: Widely adopted and well-understood in academic and research environments.
  • Performance: Capable of TB/s throughput but requires significant expertise to tune and manage.
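
Striping is configured per file or directory with the `lfs` tool; a common tuning sketch (stripe count and size are illustrative, and depend on file sizes and OST count):

```shell
# Stripe files created under this directory across 8 OSTs, 4 MiB stripe size
lfs setstripe -c 8 -S 4M /mnt/lustre/dataset

# Inspect the resulting layout
lfs getstripe /mnt/lustre/dataset
```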

WEKA (WekaFS)

A modern, software-defined parallel filesystem designed for NVMe and low-latency networking (InfiniBand/RoCE).

  • Flash-Native: Optimized specifically for NVMe, avoiding the legacy overhead of disk-based filesystems.
  • Zero-Copy: Uses DPDK to bypass the kernel, providing local-disk-like performance over the network.
  • AI-Focused: Excellent at handling the “small file problem” (millions of small images/tensors) common in deep learning.

GPUDirect Storage (GDS)

A critical technology for modern AI infrastructure that allows a direct DMA (Direct Memory Access) path between GPU memory and storage.

```mermaid
graph LR
    Storage[Parallel Storage] -- "Traditional" --> CPU[CPU/RAM]
    CPU -- "Bounce Buffer" --> GPU[GPU Memory]

    Storage -- "GPUDirect Storage" --> GPU
```

  • Benefit: Bypasses the CPU “bounce buffer,” reducing latency and CPU utilization.
  • Requirement: Supported by WEKA, Lustre (via NVIDIA’s client), and others.

| Feature | NFS | Lustre | WEKA |
|---|---|---|---|
| Architecture | Centralized | Distributed | Distributed (Software-Defined) |
| Media | Any | HDD/SSD | Optimized for NVMe |
| Metadata | Serial | Parallel (via MDS) | Distributed & Parallel |
| Complexity | Low | High | Medium |
| GDS Support | Limited | Yes | Yes (Native) |

Last updated: 2026-03-02


## Observability

OpenTelemetry, Prometheus, Grafana, and monitoring best practices.

Kubernetes Observability Design

Kubernetes observability is the process of collecting and analyzing metrics, logs, and traces (the “three pillars of observability”) to understand the internal state, performance, and health of a cluster.

1. Metrics

Kubernetes components emit metrics in Prometheus format via /metrics endpoints.

  • Key Components: kube-apiserver, kube-scheduler, kube-controller-manager, kubelet, and kube-proxy.
  • Kubelet Endpoints: Also exposes /metrics/cadvisor (container stats), /metrics/resource, and /metrics/probes.
  • Enrichment: Tools like kube-state-metrics add context about Kubernetes object status.
  • Pipeline: Metrics are typically scraped periodically and stored in a TSDB (e.g., Prometheus, Thanos, Cortex).
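
These endpoints can be spot-checked without Prometheus via the API server proxy (`<node-name>` is a placeholder; reading `/metrics` requires RBAC on the corresponding nonResourceURL):

```shell
# Scrape the kubelet's cAdvisor metrics through the API server proxy
kubectl get --raw /api/v1/nodes/<node-name>/proxy/metrics/cadvisor | head

# The API server's own metrics
kubectl get --raw /metrics | grep apiserver_request_total | head
```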

2. Logs

Logs provide a chronological record of events from applications, system components, and audit trails.

  • Application Logs: Captured by the container runtime from stdout/stderr. Standardized via CRI logging format and accessible via kubectl logs.
  • System Logs:
    • Host-level: kubelet and container runtimes (often write to journald or /var/log).
    • Containerized: kube-scheduler and kube-proxy (usually write to /var/log).
  • Pipeline: A node-level agent (e.g., Fluent Bit, Fluentd) tails logs and forwards them to a central store (e.g., Elasticsearch, Loki).

3. Traces

Traces capture the end-to-end flow of requests across components, linking latency and timing.

  • OTLP Support: Kubernetes components can export spans using the OpenTelemetry Protocol (OTLP).
  • Exporters: Spans can be sent directly via gRPC or through an OpenTelemetry Collector.
  • Backend: Traces are processed by the collector and stored in backends like Jaeger, Tempo, or Zipkin.
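
As an example, the kube-apiserver accepts a tracing configuration file (passed via `--tracing-config-file`); a sketch — the API group/version has moved through alpha and beta, so check your cluster's version:

```yaml
apiVersion: apiserver.config.k8s.io/v1beta1
kind: TracingConfiguration
endpoint: otel-collector.observability.svc:4317
samplingRatePerMillion: 10000
```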

Reference: Kubernetes Observability Documentation


## Programming

Go, Python, Rust, and software development practices.

Golang Fundamentals

Golang Fundamentals

A brief overview of the core concepts that define Go’s behavior and performance.

Typing & Data Structures

Arrays vs. Slices

  • Arrays: Fixed size, value types. Passing an array to a function copies the entire array.
    • var a [5]int
  • Slices: Dynamic size, reference types (descriptors). They point to an underlying array.
    • s := []int{1, 2, 3}
    • Modifying a slice element affects the underlying array and other slices sharing it.
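
A minimal sketch of slice aliasing, and of how append can silently break it by reallocating:

```go
package main

import "fmt"

func main() {
	base := []int{1, 2, 3, 4}

	// Slicing creates a new descriptor over the SAME backing array.
	head := base[:2]
	head[0] = 99
	fmt.Println(base[0]) // 99: the write through head is visible via base

	// append reallocates once capacity is exceeded (len 2 + 3 > cap 4),
	// so grown gets its own backing array and the aliasing stops.
	grown := append(head, 7, 8, 9)
	grown[0] = 1
	fmt.Println(base[0]) // still 99: base is unaffected now
}
```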

Maps

  • Hash tables for key-value pairs.
  • Reference types, initialized using make(map[keyType]valueType).
  • Not thread-safe for concurrent writes.

Interfaces

  • Implicit implementation (no implements keyword).
  • Defined by a set of methods. Any type that provides those methods satisfies the interface.
  • “Accept interfaces, return structs.”
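
A minimal sketch of implicit satisfaction — neither type mentions the interface:

```go
package main

import "fmt"

// Speaker is satisfied by any type with a Speak() string method;
// there is no "implements" declaration.
type Speaker interface {
	Speak() string
}

type Dog struct{}

func (Dog) Speak() string { return "Woof" }

type Robot struct{ ID int }

func (r Robot) Speak() string { return fmt.Sprintf("Unit %d online", r.ID) }

// announce accepts the interface; callers pass concrete structs.
func announce(s Speaker) string { return "-> " + s.Speak() }

func main() {
	fmt.Println(announce(Dog{}))        // -> Woof
	fmt.Println(announce(Robot{ID: 7})) // -> Unit 7 online
}
```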

Methods

  • Functions with a receiver.
  • Value Receiver (func (v Type) Method()): Works on a copy.
  • Pointer Receiver (func (p *Type) Method()): Can modify the original value and avoids copying large structs.
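
The difference in one sketch — the value receiver's increment is lost, the pointer receiver's sticks:

```go
package main

import "fmt"

type Counter struct{ n int }

// Value receiver: operates on a copy; the caller's Counter is unchanged.
func (c Counter) IncCopy() { c.n++ }

// Pointer receiver: mutates the original value.
func (c *Counter) Inc() { c.n++ }

func main() {
	c := Counter{}
	c.IncCopy()
	fmt.Println(c.n) // 0: the copy was incremented, then discarded
	c.Inc()
	fmt.Println(c.n) // 1
}
```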

Memory Management & GC

Go handles memory allocation and deallocation automatically.

Stack vs. Heap

  • Stack: Used for local variables with predictable lifetimes. Very fast allocation/deallocation.
  • Heap: Used for data that outlives the function call (escape analysis determines this). Slower, requires GC.
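
A sketch of what escape analysis decides; compiling with `go build -gcflags='-m'` prints the compiler's verdict (e.g. "moved to heap") for each variable:

```go
package main

import "fmt"

// newBuf returns a pointer to a local variable. Because the pointer
// outlives the call, escape analysis moves buf to the heap.
func newBuf() *[]byte {
	buf := make([]byte, 64)
	return &buf
}

// sum leaks nothing: its locals stay on the stack, no GC involvement.
func sum(xs []int) int {
	total := 0
	for _, x := range xs {
		total += x
	}
	return total
}

func main() {
	b := newBuf()
	fmt.Println(len(*b))          // 64
	fmt.Println(sum([]int{1, 2})) // 3
}
```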

Garbage Collector (GC)

  • Non-generational, concurrent, tri-color mark-and-sweep.
  • Focuses on low latency (minimizing Stop-The-World aka STW pauses).
  • Controlled by GOGC (target heap growth percentage).

Concurrency & Scheduling

Goroutines

  • Lightweight “threads” managed by the Go runtime, not the OS.
  • Start with ~2KB stack, grow/shrink as needed.
  • go myFunction()

Parallelism vs. Concurrency

  • Concurrency: Dealing with many things at once (structure).
  • Parallelism: Doing many things at once (execution on multi-core).

Golang Scheduler (GMP Model)

  • G (Goroutine): State of a goroutine.
  • M (Machine): OS Thread.
  • P (Processor): Resource required to execute Go code; the number of Ps sets the concurrency limit and equals GOMAXPROCS (default: number of CPUs).
  • Work Stealing: Idle Ps can steal Gs from other Ps’ local queues.

Race Conditions

  • Occur when multiple goroutines access the same memory concurrently and at least one access is a write.
  • Use the Race Detector: go test -race or go run -race.
  • Prevention: Use Channels (don’t communicate by sharing memory, share memory by communicating) or Mutexes (sync.Mutex).
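
A sketch of both prevention strategies side by side — a mutex guarding shared state, and a channel where a single owner goroutine serializes all writes. Run either under `go run -race` to confirm the detector stays quiet:

```go
package main

import (
	"fmt"
	"sync"
)

// countWithMutex guards the shared counter with a sync.Mutex.
func countWithMutex(n int) int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	total := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			total++
			mu.Unlock()
		}()
	}
	wg.Wait()
	return total
}

// countWithChannel gives one goroutine sole ownership of the counter;
// other goroutines communicate instead of sharing memory.
func countWithChannel(n int) int {
	events := make(chan struct{})
	done := make(chan int)
	go func() {
		total := 0
		for range events {
			total++
		}
		done <- total
	}()
	for i := 0; i < n; i++ {
		events <- struct{}{}
	}
	close(events)
	return <-done
}

func main() {
	fmt.Println(countWithMutex(1000))   // 1000
	fmt.Println(countWithChannel(1000)) // 1000
}
```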

Last updated: 2026-02-18


More notes coming soon. This is a living document that grows as I learn.