Notes

A collection of technical notes, reference materials, and things I’ve learned along the way. These are my personal knowledge base entries — not polished tutorials, but practical notes for quick reference.


## Cloud Native

Notes on Kubernetes, container orchestration, and cloud-native technologies.

Kubernetes Cluster Architecture

A Kubernetes cluster consists of a set of worker machines, called nodes, that run containerized applications. Every cluster has at least one worker node.

The worker node(s) host the Pods that are the components of the application workload. The control plane manages the worker nodes and the Pods in the cluster. In production environments, the control plane usually runs across multiple computers and a cluster usually runs multiple nodes, providing fault-tolerance and high availability.

Figure 1: Kubernetes Cluster Architecture (diagram of cluster components)

Control Plane Components

The control plane’s components make global decisions about the cluster (for example, scheduling), as well as detecting and responding to cluster events.

kube-apiserver

The API server is the front end for the Kubernetes control plane. It exposes the Kubernetes API and is designed to scale horizontally.

  • Role: Central communication hub; authenticates and authorizes requests.

etcd

Consistent and highly-available key value store used as Kubernetes’ backing store for all cluster data.

  • Role: Single source of truth for the entire cluster state.

kube-scheduler

Watches for newly created Pods with no assigned node, and selects a node for them to run on.

  • Role: Decides Pod placement based on resource requirements and constraints.

kube-controller-manager

Runs controller processes that maintain the desired state of the cluster.

  • Key Controllers: Node, Job, EndpointSlice, and ServiceAccount controllers.

cloud-controller-manager

Embeds cloud-specific control logic to link your cluster into your cloud provider’s API.

  • Role: Manages cloud-specific resources like load balancers and routes.

Node Components

Node components run on every node, maintaining running pods and providing the Kubernetes runtime environment.

kubelet

An agent that runs on each node in the cluster. It makes sure that containers are running in a Pod.

  • Role: Manages the lifecycle of containers within a Pod according to PodSpecs.

kube-proxy

A network proxy that runs on each node in your cluster, implementing part of the Kubernetes Service concept.

  • Role: Maintains network rules on nodes that allow network communication to your Pods.

Container Runtime

The software that is responsible for running containers.

  • Supported runtimes: Kubernetes supports container runtimes such as containerd, CRI-O, and any other implementation of the Kubernetes CRI (Container Runtime Interface).

Addons

Addons use Kubernetes resources (DaemonSet, Deployment, etc.) to implement cluster features.

  • DNS: Cluster DNS is a DNS server, in addition to the other DNS server(s) in your environment, which serves DNS records for Kubernetes services.
  • Web UI (Dashboard): A general purpose, web-based UI for Kubernetes clusters.
  • Container Resource Monitoring: Records generic time-series metrics about containers in a central database.
  • Cluster-level Logging: Responsible for saving container logs to a central log store with a search/browsing interface.

Last updated: 2026-02-18


Kubernetes Fundamentals

Quick reference for core Kubernetes concepts and common operations.

Core Concepts

Pod Lifecycle

  • Pending: Pod accepted but containers not created
  • Running: At least one container running
  • Succeeded: All containers terminated successfully
  • Failed: All containers terminated, at least one with failure
  • Unknown: State cannot be determined

Common Commands

# Get pod details with wide output
kubectl get pods -o wide

# Watch pods in real-time
kubectl get pods -w

# Get pod logs (follow)
kubectl logs -f <pod-name>

# Execute into a pod
kubectl exec -it <pod-name> -- /bin/bash

# Port forward
kubectl port-forward <pod-name> 8080:80

Resource Management

Resource Requests vs Limits

  • Requests: Guaranteed resources for scheduling
  • Limits: Maximum resources a container can use
resources:
  requests:
    memory: "64Mi"
    cpu: "250m"
  limits:
    memory: "128Mi"
    cpu: "500m"

Debugging

Execution Flow: kubectl apply

What happens when you execute kubectl apply -f deploy.yaml? (Reference: what-happens-when-k8s)

1. Client Side (kubectl)

  • Validation: Client-side linting and validation of the manifest.
  • Generators: Assembling the HTTP request (converting YAML to JSON).
  • API Discovery: Version negotiation to find the correct API group and version.
  • Authentication: Loading credentials from kubeconfig.

2. kube-apiserver

  • Authentication: Verifies “Who are you?” (Certs, Tokens, etc.).
  • Authorization: Verifies “Are you allowed to do this?” (RBAC).
  • Admission Control: Mutating/Validating admission controllers (e.g., setting defaults, checking quotas).
  • Persistence: The validated resource is stored in etcd.

3. Control Plane (Controllers & Scheduler)

  • Deployment Controller: Notices the new Deployment and creates a ReplicaSet.
  • ReplicaSet Controller: Notices the new ReplicaSet and creates Pods.
  • Scheduler: Watches for unscheduled Pods and assigns them to a healthy Node based on predicates and priorities.

4. Node Side (kubelet)

  • Pod Sync: The kubelet on the assigned Node notices the Pod.
  • CRI: Container Runtime Interface pulls images and starts containers.
  • CNI: Container Network Interface sets up Pod networking and IP allocation.
  • CSI: Container Storage Interface mounts requested volumes.

Common Issues

  1. ImagePullBackOff: Check image name, registry access, secrets
  2. CrashLoopBackOff: Check container logs, resource limits
  3. Pending: Check node resources, affinity rules, PVC binding

Last updated: 2026-02-09


Kubernetes Networking & CNI

Kubernetes networking is based on a set of fundamental principles that ensure every container can communicate with every other container in a flat, NAT-less network space.

The 4 Networking Problems

Kubernetes addresses four distinct networking challenges:

  1. Container-to-Container: Solved by Pods and localhost communications.
  2. Pod-to-Pod: The primary focus of the CNI, enabling direct communication between Pods.
  3. Pod-to-Service: Handled by Services (kube-proxy, iptables/IPVS).
  4. External-to-Service: Managed by Services (LoadBalancer, NodePort, Ingress).

The 3 “Golden Rules”

To be Kubernetes-compliant, any networking implementation (CNI plugin) must satisfy these three requirements:

  1. Pod-to-Pod: All Pods can communicate with all other Pods without NAT.
  2. Node-to-Pod: All Nodes can communicate with all Pods (and vice-versa) without NAT.
  3. Self-IP: The IP that a Pod sees itself as is the same IP that others see it as.

The CNI (Container Network Interface)

Kubernetes doesn’t implement networking itself; it offloads this to CNI plugins (like Calico, Flannel, Cilium).

CNI Lifecycle & The Flow of a Pod

When a Pod is scheduled, several components coordinate to ensure it gets networking. Here is the visual flow:

sequenceDiagram
    participant S as Scheduler
    participant K as Kubelet
    participant CRI as Container Runtime (CRI)
    participant CNI as CNI Plugin
    participant NS as Network Namespace

    S->>K: Assign Pod to Node
    K->>CRI: Create Pod Sandbox
    CRI->>NS: Create Network Namespace
    CRI->>CNI: Invoke ADD Command
    CNI->>CNI: Create veth pair
    CNI->>NS: Move eth0 to NS
    CNI->>CNI: IPAM (Assign IP)
    CNI->>NS: Configure Routing
    CNI-->>CRI: Success
    CRI-->>K: Pod Ready
    K->>CRI: Start App Containers

  1. Scheduling: The Scheduler assigns a Pod to a Node. This is updated in the API Server.
  2. Kubelet Action: The Kubelet on the assigned Node watches the API Server. When it sees a new Pod assigned to it, it starts the creation process.
  3. CRI Invocation: Kubelet calls the Container Runtime Interface (CRI) to create the Pod sandbox.
  4. Network Namespace Creation: The Container Runtime creates a Linux network namespace for the Pod. This isolates the Pod’s network stack from the host and other Pods.
  5. CNI Trigger: The CRI identifies the configured CNI plugin and invokes it with the ADD command.
  6. CNI Plugin Execution: The CNI Plugin performs the “Golden Rule” setup:
    • veth pair: It creates a virtual ethernet pair.
    • Plumbing: One end is kept in the host namespace, and the other is moved into the Pod’s namespace and renamed to eth0.
    • IPAM: It calls an IPAM (IP Address Management) plugin to assign a unique IP from the Node’s allocated CIDR range.
    • Routing: It configures the default gateway and routes inside the Pod so it can talk to the rest of the cluster.
  7. Success: The CNI returns success to the CRI, which then returns to the Kubelet.
  8. App Start: Finally, the Kubelet starts the actual application containers inside the now-networked sandbox.

Traffic leaves the Pod via eth0, enters the host via the other end of the veth pair, and is then handled by the CNI’s data plane (Bridge, Routing, or eBPF).

The Life of a Packet (Pod-to-Service)

Understanding how a packet travels from one Pod to another through a Service is key to mastering Kubernetes networking.

sequenceDiagram
    participant PodA as Pod A (Node 1)
    participant Node1 as Node 1 Kernel (kube-proxy)
    participant Net as Physical Network
    participant Node2 as Node 2 Kernel
    participant PodB as Pod B (Node 2)

    PodA->>Node1: Request to Service IP
    Note over Node1: Intercept & DNAT (Service IP -> Pod B IP)
    Note over Node1: Routing Decision (Pod B is on Node 2)
    Node1->>Net: Send via CNI (Overlay/Direct)
    Net->>Node2: Arrive at Node 2
    Node2->>PodB: Forward to Pod Namespace
    PodB-->>PodA: Response

Step-by-Step Journey:

  1. Request Initiation: Pod A (on Node 1) sends a request to a Service IP (ClusterIP).
  2. Kernel Interception: The packet leaves the Pod via the veth pair and hits the Node 1 Kernel. Rules programmed by kube-proxy (iptables or IPVS) intercept the packet in the nat PREROUTING chain (or OUTPUT, for host-originated traffic).
  3. Destination NAT (DNAT): The Kernel performs DNAT, rewriting the destination IP from the Service’s Virtual IP (VIP) to the real IP of a healthy backend Pod (e.g., Pod B on Node 2).
  4. Routing Decision: The Kernel makes a routing decision. It determines that Pod B’s IP is reachable via the CNI’s interface (e.g., an overlay network like vxlan or direct routing).
  5. CNI Transmit: The CNI plugin encapsulates (if overlay) or routes the packet across the physical network to Node 2.
  6. Node 2 Arrival: The packet arrives at Node 2, is decapsulated by its CNI, and the Kernel identifies it’s destined for a local Pod.
  7. Success: The packet is forwarded into Pod B’s network namespace via its veth pair. Pod B receives the request!

How Services match Pods

Services use a discovery mechanism to track which Pods should receive traffic. This process is driven by Label Selectors:

  • Label Selectors: Defined in the Service’s specification, these core identifiers tell the cluster exactly which Pods to target. A Service (the stable front door) selects any Pod whose labels match its selector to be its backend.
  • EndpointSlices: These are the dynamic list of targets (IPs and ports). The system automatically populates EndpointSlice resources with matching Pods. By splitting the list into smaller “slices,” Kubernetes can scale efficiently to thousands of Pods, avoiding the bottlenecks of the legacy Endpoints resource.
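As a minimal sketch of this matching (the name `web` and label `app: web` are hypothetical), a Service selects its backends purely by label match:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web              # hypothetical Service name
spec:
  selector:
    app: web             # any Pod carrying this label becomes a backend
  ports:
  - port: 80             # port exposed on the ClusterIP
    targetPort: 8080     # port the selected Pods actually listen on
```

The generated slices can then be inspected with `kubectl get endpointslices -l kubernetes.io/service-name=web`.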

Kubernetes Service Types

Kubernetes Services are built like building blocks, where each type typically adds a layer on top of the previous one:

  1. ClusterIP (Default): Exposes the Service on a cluster-internal IP. This is the foundation for almost all other Service types.
  2. NodePort: Exposes the Service on each Node’s IP at a static port (between 30000-32767). Critically: A NodePort Service automatically creates its own ClusterIP to route traffic to backend Pods.
  3. LoadBalancer: Exposes the Service externally using a cloud provider’s load balancer. This builds upon both NodePort and ClusterIP, configuring the cloud to route external traffic to NodePorts.
  4. ExternalName: Maps the Service to a DNS name (produces a CNAME record). It bypasses selectors and proxying entirely, allowing you to treat external services as internal ones.
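To make the layering concrete, a NodePort sketch (names and ports are placeholders) — the ClusterIP is still created underneath:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-nodeport     # hypothetical
spec:
  type: NodePort
  selector:
    app: web
  ports:
  - port: 80             # ClusterIP port (created automatically)
    targetPort: 8080     # container port on the backend Pods
    nodePort: 30080      # static port on every Node (must fall in 30000-32767)
```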

Headless Services

When you don’t need a single Virtual IP (VIP) to load balance traffic, you can create a Headless Service by setting .spec.clusterIP: None.

  • Instead of the DNS returning a single ClusterIP, a query for a headless service returns the direct A records (individual IPs) of all matching Pods.
  • This is essential for StatefulSets, where you need to reach specific Pod instances, or when implementing custom service discovery.
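A minimal headless Service sketch (names are hypothetical):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: db-headless      # hypothetical
spec:
  clusterIP: None        # headless: no VIP, DNS returns Pod IPs directly
  selector:
    app: db
  ports:
  - port: 5432
```

Combined with a StatefulSet, this also enables stable per-Pod DNS names such as db-0.db-headless.<namespace>.svc.cluster.local.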

DNS in Kubernetes (CoreDNS)

DNS serves as the cluster’s phonebook, translating service names into IP addresses. In modern clusters, this is handled by CoreDNS.

  • Architecture: CoreDNS runs as a Deployment (usually in the kube-system namespace) and is exposed via a Service named kube-dns.
  • Discovery: CoreDNS watches the Kubernetes API for new Services and EndpointSlices, dynamically generating DNS records.
  • Client Config: The Kubelet configures every Pod’s /etc/resolv.conf to point at the kube-dns Service IP.
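For orientation, a typical Corefile looks roughly like the following; this sketch mirrors the common kubeadm default, and the exact plugin list varies by distribution:

```text
.:53 {
    errors
    health
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf   # send non-cluster names upstream
    cache 30
    loop
    reload
}
```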

The Resolution Process

When a Pod queries a name like my-svc, the OS resolver iterates through the search domains defined in /etc/resolv.conf until it finds a match.

sequenceDiagram
    participant App as Application
    participant OS as OS Resolver (/etc/resolv.conf)
    participant DNS as CoreDNS (kube-dns Service)

    App->>OS: Resolve "my-svc"
    Note over OS: iterate search domains
    OS->>DNS: Query: my-svc.default.svc.cluster.local?
    DNS-->>OS: A Record: 10.96.0.100 (Success)
    OS-->>App: Return 10.96.0.100

    Note over App,DNS: Scenario: External Domain (ndots:5)
    App->>OS: Resolve "google.com"
    OS->>DNS: Query: google.com.default.svc.cluster.local?
    DNS-->>OS: NXDOMAIN
    Note over OS: ... more internal retries ...
    OS->>DNS: Query: google.com?
    DNS-->>OS: A Record: 142.250.x.x
    OS-->>App: Return IP

  • Record Types:
    • A Records: Resolve to a Service’s ClusterIP (Standard) or multiple Pod IPs (Headless).
    • SRV Records: Created for named ports (e.g., _http._tcp.my-svc.ns.svc.cluster.local), allowing for dynamic port discovery.
    • CNAME Records: Used for ExternalName services to point to external hostnames.

Performance & Scalability

As clusters grow, DNS can become a bottleneck or a source of latency.

  • The “ndots:5” Trap: By default, if a name has fewer than 5 dots, Kubernetes tries internal search domains first. For external names like api.github.com, this causes several failing internal queries (NXDOMAIN) before hitting the external resolver.
    • Pro Tip: Use a trailing dot (google.com.) for external names to bypass the search path.
  • NodeLocal DNSCache: Runs a DNS caching agent on every node as a DaemonSet. It drastically reduces latency and prevents conntrack exhaustion (UDP session tracking limits) in the Linux kernel during high DNS volume.
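For external-heavy workloads, ndots can also be lowered per Pod via dnsConfig. A sketch (the Pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: low-ndots            # hypothetical
spec:
  containers:
  - name: app
    image: busybox           # placeholder image
    command: ["sleep", "3600"]
  dnsConfig:
    options:
    - name: ndots
      value: "2"             # names with >= 2 dots now skip the search-domain walk
```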

Debugging Kubernetes Networking

When network issues arise, follow a Bottom-Up troubleshooting flow, starting from the source Pod and moving up the abstraction layers.

flowchart TD
    Start[Issue: Pod A cannot reach Service B] --> Net{1. Pod Networking OK?}
    Net -- No --> FixNet[Check CNI / Routes / NetPol]
    Net -- Yes --> DNS{2. DNS Resolution OK?}
    DNS -- No --> FixDNS[Check CoreDNS / Config]
    DNS -- Yes --> Svc{3. Service IP Reachable?}
    Svc -- No --> FixSvc[Check kube-proxy / Spec]
    Svc -- Yes --> EP{4. Endpoints Populated?}
    EP -- No --> FixEP[Check Selectors / Readiness]
    EP -- Yes --> App[5. Check Application Logs]

The Tool: Ephemeral Containers

Avoid installing debug tools in production images. Instead, use ephemeral containers to attach a “debug sidecar” (like netshoot) to a running Pod:

kubectl debug -it <pod-name> --image=nicolaka/netshoot

1. Pod Connectivity (The Foundation)

Verify the Pod can talk to the host and itself.

  • Check IPs: ip addr show (does eth0 match kubectl get pod -o wide?)
  • Check Routes: ip route show (is there a default gateway?)
  • Issue: If eth0 or routes are missing, the CNI plugin failed. Check CNI node logs (e.g., calico-node, cilium-agent).

2. DNS (The Phonebook)

If the Pod has an IP, check if it can resolve names.

  • Test Resolution: nslookup my-service
    • NXDOMAIN: Name doesn’t exist (check namespace/spelling).
    • Timeout: CoreDNS is unreachable (check CoreDNS pods and NetworkPolicies).
  • Check Config: cat /etc/resolv.conf (verify the nameserver is the kube-dns Service IP).

3. Services (The Virtual IP)

If DNS works, verify the Service and its endpoints.

  • Test Connectivity: nc -zv <service-ip> <port>
  • Check Endpoints: kubectl get endpointslices -l kubernetes.io/service-name=<service-name>
  • Common Issue: Hairpin Traffic: A Pod failing to reach itself via its own Service IP. Ensure the Kubelet is running with --hairpin-mode=hairpin-veth.

4. Packet Level (The Truth)

When logs aren’t enough, use tcpdump to see what’s on the wire.

  • Capture: tcpdump -i eth0 -w /tmp/capture.pcap
  • Analyze: Copy the file to your machine and open in Wireshark:
    kubectl cp <pod-name>:/tmp/capture.pcap ./capture.pcap -c <debug-container-name>
    

    Look for TCP Retransmissions (network drops), RST (closed ports), or sent SYNs with no SYN-ACK (firewall/NetworkPolicy drops).


Last updated: 2026-02-18


Kubernetes Storage: PV, PVC & CSI

Kubernetes Storage: A Deep Dive

Storage in Kubernetes is designed to decouple the physical storage implementation from the application’s request for it. This allows for portable, infrastructure-agnostic deployments.

Stateless vs. Stateful Workloads

Understanding the nature of your workload is the first step in deciding how to handle storage:

  • Stateless: Ephemeral, idempotent, and immutable. Containers can be replaced or rescheduled easily because they don’t store persistent state. Examples: Web servers, API gateways.
  • Stateful: Requires durability and persistence. Data must survive Pod restarts, node failures, and upgrades. Examples: Databases (PostgreSQL, MongoDB), Message Brokers.

The Abstraction Stack

Kubernetes uses several layers to manage storage, moving from high-level requests to low-level implementation.

graph TD
    PVC["PersistentVolumeClaim (PVC)"] -- requests --> SC["StorageClass"]
    SC -- provisions --> PV["PersistentVolume (PV)"]
    PV -- backed by --> Infra["Infrastructure Storage (EBS, Azure Disk, NFS)"]
    Pod["Pod"] -- volumes --> PVC

Storage Lifecycle Flow

The complete path from developer intent to a running application with storage.

sequenceDiagram
    participant User as Developer
    participant K8s as K8s Control Plane
    participant CSI_C as CSI Controller (Provisioner/Attacher)
    participant Sched as K8s Scheduler
    participant Kubelet as Node Kubelet (CSI Node Plugin)

    User->>K8s: Create PVC
    K8s->>CSI_C: Detect PVC (Provisioner)
    CSI_C->>CSI_C: CreateVolume (CSI)
    CSI_C-->>K8s: Create PV & Bind
    User->>K8s: Create Pod
    Sched->>K8s: Assign Pod to Node
    K8s->>CSI_C: Trigger Attachment (Attacher)
    CSI_C->>CSI_C: ControllerPublishVolume (CSI)
    K8s->>Kubelet: Start Pod
    Kubelet->>Kubelet: NodeStage & NodePublish (CSI)
    Kubelet-->>User: Container Started with Volume

1. Persistent Volumes (PV)

A cluster-scoped resource representing actual storage. It has a lifecycle independent of any individual Pod that uses it.

  • Phases: Available → Bound → Released → Failed.
  • Reclaim Policies:
    • Delete: Automatically deletes the underlying infrastructure when the PVC is deleted.
    • Retain: Keeps the storage for manual cleanup (safer for production).
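A statically provisioned PV sketch with the safer Retain policy (the NFS backend, server, and names are illustrative assumptions):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-data                            # hypothetical
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain    # keep the backing storage for manual cleanup
  nfs:                                     # example backend; any supported type works
    server: nfs.example.com
    path: /exports/data
```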

2. Persistent Volume Claims (PVC)

A namespace-scoped request for storage. It’s like a “voucher” that a Pod uses to get a PV.

  • Binds: A PVC binds to a matching PV based on size and access modes.
  • Access Modes:
    • ReadWriteOnce (RWO): One node can mount as read-write.
    • ReadOnlyMany (ROX): Many nodes can mount as read-only.
    • ReadWriteMany (RWX): Many nodes can mount as read-write.
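The “voucher” itself — a minimal PVC sketch (the name and storage class are hypothetical):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim           # hypothetical
spec:
  accessModes:
  - ReadWriteOnce            # RWO: mountable read-write by a single node
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard # assumed class; omit to use the cluster default
```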

3. StorageClasses

Policies for Dynamic Provisioning. Instead of manually creating PVs, an administrator defines a StorageClass. When a PVC request comes in, the cluster creates a PV on the fly.

  • Binding Modes:
    • Immediate: Create volume as soon as PVC is created.
    • WaitForFirstConsumer: Delay creation until the Pod is scheduled (best for multi-zone clusters).
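A hedged StorageClass example (the AWS EBS CSI provisioner and its parameters are illustrative assumptions; substitute your cluster’s driver):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd                           # hypothetical
provisioner: ebs.csi.aws.com               # example CSI driver; varies by cloud
parameters:
  type: gp3
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer    # provision in the zone the Pod lands in
```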

Container Storage Interface (CSI)

The CSI moved storage drivers “out-of-tree,” allowing storage vendors to develop plugins independently of the Kubernetes core.

sequenceDiagram
    participant K8s as K8s API Server
    participant ExtP as External Provisioner
    participant ExtA as External Attacher
    participant CSID as CSI Driver (Controller/Node)
    participant Kube as Kubelet

    K8s->>ExtP: Watch: New PVC
    ExtP->>CSID: CreateVolume (gRPC)
    Note over CSID: Provision Backend Disk
    ExtP-->>K8s: Create PersistentVolume (PV)
    
    K8s->>ExtA: Watch: Pod scheduled to Node
    ExtA->>CSID: ControllerPublishVolume (gRPC)
    Note over CSID: Attach Disk to VM/Host
    
    K8s->>Kube: Pod assigned to local node
    Kube->>CSID: NodeStageVolume (gRPC)
    Note over CSID: Format & Prep Global Mount
    Kube->>CSID: NodePublishVolume (gRPC)
    Note over CSID: Bind Mount into Pod Directory

  • Controller Plugin: Handles cluster-wide tasks like provisioning and attaching.
  • Node Plugin: Runs on every node to handle mounting (NodeStage / NodePublish).

StatefulSets & Storage

StatefulSets are uniquely designed for applications requiring stable identities and storage.

  • volumeClaimTemplates: Creates a unique PVC for each Pod ordinal (e.g., db-0, db-1).
  • Stable Identity: If db-0 crashes and is rescheduled, it will re-attach to the same PVC it had before.
  • PVC Retention Policy: (K8s 1.27+) Control if PVCs are deleted when a StatefulSet is scaled down.
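A trimmed StatefulSet sketch showing volumeClaimTemplates in context (names and the image are placeholders):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db                       # hypothetical
spec:
  serviceName: db-headless       # headless Service providing stable per-Pod DNS
  replicas: 2
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
      - name: db
        image: postgres:16       # example image
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:          # yields one PVC per ordinal: data-db-0, data-db-1
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
```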

Troubleshooting Guide (At a Glance)

When storage issues arise, use these specific flows to pinpoint the failure.

Case 1: PVC is stuck in Pending

This usually happens during the Provisioning phase.

flowchart TD
    Start[PVC stuck in Pending] --> SC{Default StorageClass?}
    SC -- No --> SetSC[Specify SC or set default]
    SC -- Yes --> Match{Matching PV?}
    Match -- Yes --> Bind[Wait for Binding]
    Match -- No --> Dynamic{SC allow dynamic?}
    Dynamic -- No --> CreatePV[Static Provisioning Required]
    Dynamic -- Yes --> FirstConsumer{"WaitForFirstConsumer?"}
    FirstConsumer -- Yes --> SchedulePod["Schedule Pod to Node first"]
    FirstConsumer -- No --> Events["Check describe PVC Events: Quota, Permissions"]

Case 2: Pod is stuck in ContainerCreating

This occurs during the Attachment or Mounting phases.

flowchart TD
    Start[Pod in ContainerCreating] --> Attached{Volume Attached?}
    Attached -- No --> MultiAttach{Multi-Attach Error?}
    MultiAttach -- Yes --> Detach[Force Detach or wait for Old Node]
    MultiAttach -- No --> CSIController[Check CSI Controller Logs]
    Attached -- Yes --> Mounted{Node Mounted?}
    Mounted -- No --> CSINode[Check CSI Node Plugin Logs]
    Mounted -- Yes --> SecretConfig{ConfigMap/Secret present?}
    SecretConfig -- No --> CreateResources[Create missing resources]
    SecretConfig -- Yes --> Permissions[Check SecurityContext & fsGroup]

Case 3: PVC is stuck in Terminating

This happens when you try to delete a volume that is still in use.

flowchart TD
    Start[PVC stuck in Terminating] --> Clean[Check for Pod consumers]
    Clean --> Finalizer{Finalizer: pvc-protection?}
    Finalizer -- Yes --> RunningPod{"Healthy Pod using it?"}
    RunningPod -- Yes --> DeletePod["Delete Pod first"]
    RunningPod -- No --> Zombie["Check Node for zombie mount"]
    Zombie -- Yes --> Unmount["Force Unmount from Node"]
    Zombie -- No --> Force["Remove Finalizer - AS LAST RESORT"]

Summary of Debug Commands

| Failure Layer | Primary Command | Search For |
| :--- | :--- | :--- |
| PVC | kubectl describe pvc <name> | Events section for provisioner errors. |
| CSI Control | kubectl logs csi-provisioner-... | gRPC CreateVolume failures. |
| Attachment | kubectl get volumeattachment | attached: false in the status. |
| Node/Mount | kubectl describe pod <name> | FailedMount or FailedAttach events. |
| Permissions | kubectl exec -it <pod> -- ls -l | Owner UID/GID of the mount point. |



Last updated: 2026-02-28


## GPU / HPC & AI Infrastructure

Deep dives into GPU computing, NVIDIA MIG/vGPU, DCGM monitoring, vLLM, and AI/ML/HPC infrastructure.

GPU Troubleshooting Fundamentals

Common GPU failure modes and diagnostics in high-performance computing (HPC) and AI infrastructure.

XID Errors

XID errors are error reports from the NVIDIA driver printed to the operating system’s kernel log or event log. They provide a high-level indication of where a failure occurred.

Common XID Codes

  • XID 31 (GPU Memory Page Fault): Typically indicates an application trying to access an invalid memory address. Often a software bug (illegal memory access) but can be triggered by faulty hardware.
  • XID 45 (Preemptive Cleanup): The driver has torn down GPU channels, usually as cleanup after a preceding error; applications running on the GPU are terminated.
  • XID 61 (Internal Microcontroller Error): Internal GPU firmware error, often requiring a node reboot or power cycle.
  • XID 79 (GPU has fallen off the bus): The most critical state, where the GPU is no longer communicating over PCIe.

Diagnostics:

dmesg | grep -i xid
# or
journalctl -k | grep -i xid

ECC Errors (Error Correction Code)

Modern data center GPUs (A100, H100) use ECC to detect and correct memory corruption.

Types of Errors

  1. Single-Bit Errors (SBE): Corrected automatically by hardware without data loss. High counts of SBEs can indicate aging hardware or impending failure.
  2. Double-Bit Errors (DBE): Uncorrectable errors. These lead to immediate application crashes (to prevent data corruption) and require a GPU reset.

Diagnostics:

nvidia-smi -q -d ECC

“Falling off the Bus”

A situation where the GPU becomes completely unresponsive to the host CPU via the PCIe interface. The device remains visible in lspci (usually), but nvidia-smi will report “No devices found” or “Unable to determine the device handle”.

Common Causes

  • Thermal Issues: GPU overheating triggers a survival shutdown.
  • Power Fluctuations: Transient voltage drops causing the GPU to drop its link.
  • PCIe Link Training Failure: Signal integrity issues on the motherboard or riser cards.
  • Firmware/Driver Bugs: Internal state machine lockups.

Recovery

  1. Soft Reset: nvidia-smi -r (if the driver can still talk to the GPU).
  2. Hard Reboot: Cold boot of the physical node.
  3. Firmware Reload: Using specialized tools like flshutil (for HGX systems).

Last updated: 2026-02-18


High-Performance Networking (RDMA, InfiniBand, RoCE)

High-Performance Networking

In GPU clusters and HPC (High-Performance Computing), standard TCP/IP networking often becomes a bottleneck due to high CPU overhead, latency, and frequent context switching. Technologies like RDMA, InfiniBand, and RoCE provide the low-latency, high-throughput interconnects required for distributed AI training.

RDMA (Remote Direct Memory Access)

RDMA allows a computer to access memory on another computer directly, bypassing the operating system kernel and the CPU of the remote machine.

graph LR
    subgraph Node A
        AppA[Application] -- "RDMA Write" --> NIC_A[HCA/NIC]
        MemA[Memory]
    end
    
    subgraph Node B
        AppB[Application]
        MemB[Memory]
        NIC_B[HCA/NIC]
    end
    
    NIC_A -- "Direct Data Transfer" --> NIC_B
    NIC_B -- "Write to Memory" --> MemB
    
    style AppA fill:#f9f,stroke:#333
    style AppB fill:#f9f,stroke:#333
    style NIC_A fill:#bbf,stroke:#333
    style NIC_B fill:#bbf,stroke:#333

  • Zero-Copy: Data is transferred directly into memory without being copied to intermediate buffers in the OS.
  • Kernel Bypass: Applications communicate directly with the network hardware (NIC), avoiding kernel system calls.
  • Lower CPU Utilization: The NIC handles the protocol logic, freeing up the CPU for compute tasks.

InfiniBand (IB)

InfiniBand is a lossless, credit-based network architecture designed from the ground up for high-performance computing.

  • Credit-Based Flow Control: Unlike Ethernet, which drops packets during congestion, IB uses a hardware-level credit system to ensure packets are only sent when the receiving buffer has space.
  • Subnet Manager (SM): A centralized control agent (running on a switch or host) that manages routing and network configuration.
  • Low Latency: Latency is typically measured in sub-microsecond ranges.
  • Speed Generations:
    • HDR: 200 Gbps
    • NDR: 400 Gbps (200 Gbps for NDR200); the following generation, XDR, reaches 800 Gbps

RoCE (RDMA over Converged Ethernet)

RoCE brings RDMA capabilities to standard Ethernet networks.

RoCE v1

  • Layer 2 Protocol: Encapsulated in the Ethernet link layer.
  • Limitation: Not routable beyond a single subnet (L2 only).

RoCE v2

  • Layer 3 Protocol: Encapsulated in UDP/IP.
  • Routable: Can cross router boundaries, making it more scalable for large data centers.

Lossless Requirement (Convergence)

Standard Ethernet is “lossy” (it drops packets). To support RDMA effectively, Ethernet must be made “lossless” using:

  • PFC (Priority Flow Control): Pauses traffic on specific priorities (queues) to prevent buffer overflows.
  • ECN (Explicit Congestion Notification): Informs the sender to slow down before buffers are full.

Comparison Table

| Feature | InfiniBand | RoCE v2 | TCP/IP |
| :--- | :--- | :--- | :--- |
| Transport | Native IB | UDP/IP (Ethernet) | TCP/IP |
| Flow Control | Credit-based (hardware) | PFC/ECN (network configuration) | Congestion avoidance (software) |
| Latency | Extremely low (< 1 µs) | Low (~2-5 µs) | Higher (> 10-20 µs) |
| CPU Overhead | Minimal (RDMA) | Low (RDMA) | High (protocol stack) |
| Deployment | Specialized infrastructure | Converged (standard switches) | Ubiquitous |

Last updated: 2026-03-02


GPU Monitoring with NVIDIA DCGM

Data Center GPU Manager (DCGM) is the industry standard for monitoring and managing NVIDIA GPUs in cluster environments.

DCGM Key Metrics

DCGM provides a wide range of metrics, classified into health, usage, and profiling categories.

| Metric | DCGM Field Name | Description |
| :--- | :--- | :--- |
| GPU Utilization | DCGM_FI_DEV_GPU_UTIL | Traditional activity percentage (see MIG section below) |
| Memory Used | DCGM_FI_DEV_FB_USED | Amount of frame buffer memory used |
| Temperature | DCGM_FI_DEV_GPU_TEMP | Core temperature in degrees Celsius |
| Power Usage | DCGM_FI_DEV_POWER_USAGE | Instantaneous power draw in Watts |
| PCIe Throughput | DCGM_FI_PROF_PCIE_TX_BYTES | Data transmitted over the PCIe bus |

Monitoring MIG (Multi-Instance GPU)

When using MIG (A100/H100), traditional utilization metrics like GPU_UTIL often fail or report incorrectly at the partition level.

GPU_UTIL vs GR_ENGINE_ACTIVE

> [!IMPORTANT]
> For MIG partitions, always use DCGM_FI_PROF_GR_ENGINE_ACTIVE instead of DCGM_FI_DEV_GPU_UTIL.

  • GPU_UTIL (DCGM_FI_DEV_GPU_UTIL): Reports if any kernel is executing. It doesn’t accurately reflect resource consumption within a MIG slice.
  • GR_ENGINE_ACTIVE (DCGM_FI_PROF_GR_ENGINE_ACTIVE): Measures the Graphics Engine activity. This provides a more precise utilization value for both graphics and compute workloads and is fully supported on individual MIG instances.

Other Profiling Metrics for MIG

  • DCGM_FI_PROF_SM_ACTIVE: SM (Streaming Multiprocessor) activity.
  • DCGM_FI_PROF_SM_OCCUPANCY: Ratio of active warps to maximum warps.
  • DCGM_FI_PROF_PIPE_TENSOR_ACTIVE: Utilization of Tensor Cores (critical for LLM/AI).
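
A hedged PromQL sketch of how these fields can be queried — dcgm-exporter exposes the DCGM field names verbatim as Prometheus metric names; the `pod` label assumes the exporter's Kubernetes mapping is enabled and may appear as `exported_pod` depending on your scrape config:

```promql
# Average graphics-engine activity per pod over 5 minutes (MIG-safe)
avg by (pod) (avg_over_time(DCGM_FI_PROF_GR_ENGINE_ACTIVE[5m]))

# Tensor Core activity, useful for spotting underutilized LLM jobs
avg by (pod) (DCGM_FI_PROF_PIPE_TENSOR_ACTIVE)
```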

Kubernetes Integration

In Kubernetes, monitoring is typically handled by dcgm-exporter.

Deployment with Helm

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install dcgm-exporter nvidia/dcgm-exporter \
  --namespace gpu-operator \
  --set "arguments={-f,/etc/dcgm-exporter/default-counters.csv}"

Scraping with Prometheus

dcgm-exporter exposes a /metrics endpoint. In Kubernetes, use a ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
  endpoints:
  - port: metrics
    interval: 15s

MIG Pod Metrics

When dcgm-exporter runs, it automatically appends Kubernetes metadata (pod name, namespace, container name) to the GPU metrics. For MIG, it adds instance labels (GPU_I_ID and GPU_I_PROFILE, or similar) to map specific partitions to the pods consuming them.


Last updated: 2026-02-18


GPU Sharing in Kubernetes

GPU Sharing in Kubernetes

Overview of GPU sharing technologies for maximizing GPU utilization in Kubernetes clusters.

Technologies Comparison

| Technology | Use Case | Isolation | Memory Sharing |
|---|---|---|---|
| MIG | Multi-tenant, inference | Hardware | No (partitioned) |
| vGPU | VMs, legacy apps | Full | No (allocated) |
| Time-slicing | Dev/test, burstable | None | Yes (shared) |
| MPS | CUDA streams | Partial | Yes |

NVIDIA MIG (Multi-Instance GPU)

MIG partitions A100/H100 GPUs into smaller instances with dedicated resources.

Supported Profiles (A100 80GB)

  • 1g.10gb - 1/7 GPU, 10GB memory
  • 2g.20gb - 2/7 GPU, 20GB memory
  • 3g.40gb - 3/7 GPU, 40GB memory
  • 7g.80gb - Full GPU

Configuration

# Enable MIG mode on GPU 0
nvidia-smi -i 0 -mig 1

# Create two 3g.40gb GPU instances (profile ID 9 on A100 80GB; only two fit per GPU)
# and their default compute instances
nvidia-smi mig -cgi 9,9 -i 0 -C

# List instances
nvidia-smi mig -lgi
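
With the device plugin's mixed MIG strategy, each profile is exposed as its own extended resource. A hedged sketch of a pod requesting a single 1g.10gb slice (image tag is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-example
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04  # illustrative tag
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1
```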

Time-Slicing

Share a single GPU across multiple pods with time-based multiplexing.

ConfigMap Example

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        replicas: 4
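
For the GPU Operator's device plugin to pick this up, the ClusterPolicy must reference the ConfigMap. A sketch assuming the operator's default `cluster-policy` name and `gpu-operator` namespace:

```shell
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  -n gpu-operator --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'
```

With `replicas: 4`, each physical GPU then advertises four `nvidia.com/gpu` resources; note there is no memory or fault isolation between the sharing pods.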

Last updated: 2026-02-09


GPU Operator, CDI, and DRA

GPU Operator, CDI, and DRA

Modern Kubernetes infrastructure for managing accelerator lifecycle, standardizing device access, and dynamic resource management.

NVIDIA GPU Operator

The NVIDIA GPU Operator automates the management of all NVIDIA software components needed to provision GPUs in Kubernetes.

Core Components (Operands)

  • NVIDIA Driver: Low-level kernel drivers (can be containerized).
  • NVIDIA Container Toolkit: Configures container runtimes (containerd/CRI-O) to mount GPU resources.
  • NVIDIA Device Plugin: Traditional mechanism for exposing GPUs as extended resources (nvidia.com/gpu).
  • GPU Feature Discovery (GFD): Labels nodes with GPU attributes (model, memory, capabilities).
  • DCGM Exporter: Exports GPU telemetry (utilization, power, temperature) for Prometheus.
  • MIG Manager: Manages Multi-Instance GPU (MIG) partitioning.

Common Configuration (Helm)

helm install gpu-operator nvidia/gpu-operator \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set psp.enabled=false

CDI (Container Device Interface)

CDI is an open specification for container runtimes (containerd, CRI-O) to standardize how third-party devices are made available to containers.

  • Standardization: Replaces runtime-specific hooks with a declarative JSON descriptor.
  • Mechanism: The device plugin returns a fully qualified device name (e.g., nvidia.com/gpu=0), and the runtime uses the CDI spec to inject device nodes, environment variables, and mounts.
  • Benefits: Simplifies the path from device plugin to low-level runtime (runc), moving complex logic out of the runtime itself.
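
A minimal sketch of what a CDI spec looks like on disk (values are illustrative; real specs are generated, e.g. by `nvidia-ctk cdi generate`):

```json
{
  "cdiVersion": "0.6.0",
  "kind": "nvidia.com/gpu",
  "devices": [
    {
      "name": "0",
      "containerEdits": {
        "deviceNodes": [
          { "path": "/dev/nvidia0" }
        ],
        "env": ["NVIDIA_VISIBLE_DEVICES=0"]
      }
    }
  ]
}
```

A runtime resolving `nvidia.com/gpu=0` applies these `containerEdits` (device nodes, env vars, mounts) to the OCI spec before handing it to runc.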

DRA (Dynamic Resource Allocation)

DRA is the next-generation resource management API in Kubernetes (introduced in v1.26, evolving in v1.31+), moving beyond the limitations of the Device Plugin API.

Key Concepts

  • ResourceClaim: A request for specific hardware resources (similar to PVC for storage).
  • DeviceClass: Defines categories of devices (e.g., “high-memory-gpus”) with specific filters.
  • ResourceSlice: Represents the actual hardware availability on nodes.

Benefits over Device Plugins

  1. Rich Filtering: Use CEL (Common Expression Language) to request specific attributes (e.g., device.memory >= 24Gi).
  2. Device Sharing: Better native support for sharing devices across multiple containers/pods.
  3. Hardware Topology: Improved awareness of PCIe/NVLink topologies for multi-GPU workloads.
  4. Decoupled Lifecycle: Allocation happens during scheduling, allowing for more complex “all-or-nothing” scheduling for multi-node jobs.

Example Claim

apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaim
metadata:
  name: gpu-claim
spec:
  devices:
    requests:
    - name: my-gpu
      deviceClassName: nvidia-h100
      selectors:
      - cel:
          expression: "device.memory >= 80Gi"
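
A pod consumes the claim by listing it in `spec.resourceClaims` and referencing it from a container. A sketch (field names have shifted across the DRA alpha versions; image is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  containers:
  - name: train
    image: nvcr.io/nvidia/pytorch:latest  # illustrative
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimName: gpu-claim
```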

Last updated: 2026-03-02


Parallel Filesystems for HPC & AI (Lustre, WEKA)

Parallel Filesystems for HPC & AI

High-performance AI training and simulation workloads require storage that can keep up with thousands of GPUs. Traditional NAS (NFS/SMB) often becomes a bottleneck due to metadata overhead and serial access patterns.

Why Parallel Filesystems?

Parallel filesystems distribute data and metadata across multiple servers, allowing clients to access data in parallel.

  • Striping: Files are broken into chunks (stripes) and spread across multiple storage targets.
  • Separation of Data and Metadata: Metadata operations (ls, open, stat) are handled by dedicated Metadata Servers (MDS), while data is served by Object Storage Servers (OSS).
  • Scalability: Performance scales linearly by adding more storage or metadata nodes.

Lustre

A veteran in the HPC world, powering many of the world’s largest supercomputers.

  • Architecture: Consists of Management Server (MGS), Metadata Servers (MDS), and Object Storage Servers (OSS).
  • Open Source: Widely adopted and well-understood in academic and research environments.
  • Performance: Capable of TB/s throughput but requires significant expertise to tune and manage.
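
Striping is configured per file or directory with the `lfs` tool; a common tuning sketch (stripe count and size are illustrative, and depend on file sizes and OST count):

```shell
# Stripe files created under this directory across 8 OSTs, 4 MiB stripe size
lfs setstripe -c 8 -S 4M /mnt/lustre/dataset

# Inspect the resulting layout
lfs getstripe /mnt/lustre/dataset
```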

WEKA (WekaFS)

A modern, software-defined parallel filesystem designed for NVMe and low-latency networking (InfiniBand/RoCE).

  • Flash-Native: Optimized specifically for NVMe, avoiding the legacy overhead of disk-based filesystems.
  • Zero-Copy: Uses DPDK to bypass the kernel, providing local-disk-like performance over the network.
  • AI-Focused: Excellent at handling the “small file problem” (millions of small images/tensors) common in deep learning.

GPUDirect Storage (GDS)

A critical technology for modern AI infrastructure that allows a direct DMA (Direct Memory Access) path between GPU memory and storage.

```mermaid
graph LR
    Storage[Parallel Storage] -- "Traditional" --> CPU[CPU/RAM]
    CPU -- "Bounce Buffer" --> GPU[GPU Memory]

    Storage -- "GPUDirect Storage" --> GPU
```

  • Benefit: Bypasses the CPU “bounce buffer,” reducing latency and CPU utilization.
  • Requirement: Supported by WEKA, Lustre (via NVIDIA’s client), and others.

| Feature | NFS | Lustre | WEKA |
|---|---|---|---|
| Architecture | Centralized | Distributed | Distributed (Software-Defined) |
| Media | Any | HDD/SSD | Optimized for NVMe |
| Metadata | Serial | Parallel (via MDS) | Distributed & Parallel |
| Complexity | Low | High | Medium |
| GDS Support | Limited | Yes | Yes (Native) |

Last updated: 2026-03-02


## Observability

OpenTelemetry, Prometheus, Grafana, and monitoring best practices.

Kubernetes Observability Design

Kubernetes observability is the process of collecting and analyzing metrics, logs, and traces (the “three pillars of observability”) to understand the internal state, performance, and health of a cluster.

1. Metrics

Kubernetes components emit metrics in Prometheus format via /metrics endpoints.

  • Key Components: kube-apiserver, kube-scheduler, kube-controller-manager, kubelet, and kube-proxy.
  • Kubelet Endpoints: Also exposes /metrics/cadvisor (container stats), /metrics/resource, and /metrics/probes.
  • Enrichment: Tools like kube-state-metrics add context about Kubernetes object status.
  • Pipeline: Metrics are typically scraped periodically and stored in a TSDB (e.g., Prometheus, Thanos, Cortex).
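
These endpoints can be spot-checked without Prometheus via the API server proxy (`<node-name>` is a placeholder; reading `/metrics` requires RBAC on the corresponding nonResourceURL):

```shell
# Scrape the kubelet's cAdvisor metrics through the API server proxy
kubectl get --raw /api/v1/nodes/<node-name>/proxy/metrics/cadvisor | head

# The API server's own metrics
kubectl get --raw /metrics | grep apiserver_request_total | head
```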

2. Logs

Logs provide a chronological record of events from applications, system components, and audit trails.

  • Application Logs: Captured by the container runtime from stdout/stderr. Standardized via CRI logging format and accessible via kubectl logs.
  • System Logs:
    • Host-level: kubelet and container runtimes (often write to journald or /var/log).
    • Containerized: kube-scheduler and kube-proxy (usually write to /var/log).
  • Pipeline: A node-level agent (e.g., Fluent Bit, Fluentd) tails logs and forwards them to a central store (e.g., Elasticsearch, Loki).

3. Traces

Traces capture the end-to-end flow of requests across components, linking latency and timing.

  • OTLP Support: Kubernetes components can export spans using the OpenTelemetry Protocol (OTLP).
  • Exporters: Spans can be sent directly via gRPC or through an OpenTelemetry Collector.
  • Backend: Traces are processed by the collector and stored in backends like Jaeger, Tempo, or Zipkin.
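
As an example, the kube-apiserver accepts a tracing configuration file (passed via `--tracing-config-file`); a sketch — the API group/version has moved through alpha and beta, so check your cluster's version:

```yaml
apiVersion: apiserver.config.k8s.io/v1beta1
kind: TracingConfiguration
endpoint: otel-collector.observability.svc:4317
samplingRatePerMillion: 10000
```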

Reference: Kubernetes Observability Documentation


## Programming

Go, Python, Rust, and software development practices.

Golang Fundamentals

Golang Fundamentals

A brief overview of the core concepts that define Go’s behavior and performance.

Typing & Data Structures

Arrays vs. Slices

  • Arrays: Fixed size, value types. Passing an array to a function copies the entire array.
    • var a [5]int
  • Slices: Dynamic size, reference types (descriptors). They point to an underlying array.
    • s := []int{1, 2, 3}
    • Modifying a slice element affects the underlying array and other slices sharing it.
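
A minimal sketch of slice aliasing, and of how append can silently break it by reallocating:

```go
package main

import "fmt"

func main() {
	base := []int{1, 2, 3, 4}

	// Slicing creates a new descriptor over the SAME backing array.
	head := base[:2]
	head[0] = 99
	fmt.Println(base[0]) // 99: the write through head is visible via base

	// append reallocates once capacity is exceeded (len 2 + 3 > cap 4),
	// so grown gets its own backing array and the aliasing stops.
	grown := append(head, 7, 8, 9)
	grown[0] = 1
	fmt.Println(base[0]) // still 99: base is unaffected now
}
```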

Maps

  • Hash tables for key-value pairs.
  • Reference types, initialized using make(map[keyType]valueType).
  • Not thread-safe for concurrent writes.

Interfaces

  • Implicit implementation (no implements keyword).
  • Defined by a set of methods. Any type that provides those methods satisfies the interface.
  • “Accept interfaces, return structs.”
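
A minimal sketch of implicit satisfaction — neither type mentions the interface:

```go
package main

import "fmt"

// Speaker is satisfied by any type with a Speak() string method;
// there is no "implements" declaration.
type Speaker interface {
	Speak() string
}

type Dog struct{}

func (Dog) Speak() string { return "Woof" }

type Robot struct{ ID int }

func (r Robot) Speak() string { return fmt.Sprintf("Unit %d online", r.ID) }

// announce accepts the interface; callers pass concrete structs.
func announce(s Speaker) string { return "-> " + s.Speak() }

func main() {
	fmt.Println(announce(Dog{}))        // -> Woof
	fmt.Println(announce(Robot{ID: 7})) // -> Unit 7 online
}
```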

Methods

  • Functions with a receiver.
  • Value Receiver (func (v Type) Method()): Works on a copy.
  • Pointer Receiver (func (p *Type) Method()): Can modify the original value and avoids copying large structs.
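
The difference in one sketch — the value receiver's increment is lost, the pointer receiver's sticks:

```go
package main

import "fmt"

type Counter struct{ n int }

// Value receiver: operates on a copy; the caller's Counter is unchanged.
func (c Counter) IncCopy() { c.n++ }

// Pointer receiver: mutates the original value.
func (c *Counter) Inc() { c.n++ }

func main() {
	c := Counter{}
	c.IncCopy()
	fmt.Println(c.n) // 0: the copy was incremented, then discarded
	c.Inc()
	fmt.Println(c.n) // 1
}
```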

Memory Management & GC

Go handles memory allocation and deallocation automatically.

Stack vs. Heap

  • Stack: Used for local variables with predictable lifetimes. Very fast allocation/deallocation.
  • Heap: Used for data that outlives the function call (escape analysis determines this). Slower, requires GC.
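
A sketch of what escape analysis decides; compiling with `go build -gcflags='-m'` prints the compiler's verdict (e.g. "moved to heap") for each variable:

```go
package main

import "fmt"

// newBuf returns a pointer to a local variable. Because the pointer
// outlives the call, escape analysis moves buf to the heap.
func newBuf() *[]byte {
	buf := make([]byte, 64)
	return &buf
}

// sum leaks nothing: its locals stay on the stack, no GC involvement.
func sum(xs []int) int {
	total := 0
	for _, x := range xs {
		total += x
	}
	return total
}

func main() {
	b := newBuf()
	fmt.Println(len(*b))          // 64
	fmt.Println(sum([]int{1, 2})) // 3
}
```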

Garbage Collector (GC)

  • Non-generational, concurrent, tri-color mark-and-sweep.
  • Focuses on low latency (minimizing Stop-The-World aka STW pauses).
  • Controlled by GOGC (target heap growth percentage).

Concurrency & Scheduling

Goroutines

  • Lightweight “threads” managed by the Go runtime, not the OS.
  • Start with ~2KB stack, grow/shrink as needed.
  • go myFunction()

Parallelism vs. Concurrency

  • Concurrency: Dealing with many things at once (structure).
  • Parallelism: Doing many things at once (execution on multi-core).

Golang Scheduler (GMP Model)

  • G (Goroutine): State of a goroutine.
  • M (Machine): OS Thread.
  • P (Processor): Resource required to execute Go code; the number of Ps sets the concurrency limit and equals GOMAXPROCS (default: number of CPUs).
  • Work Stealing: Idle Ps can steal Gs from other Ps’ local queues.

Race Conditions

  • Occur when multiple goroutines access the same memory concurrently and at least one access is a write.
  • Use the Race Detector: go test -race or go run -race.
  • Prevention: Use Channels (don’t communicate by sharing memory, share memory by communicating) or Mutexes (sync.Mutex).
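
A sketch of both prevention strategies side by side — a mutex guarding shared state, and a channel where a single owner goroutine serializes all writes. Run either under `go run -race` to confirm the detector stays quiet:

```go
package main

import (
	"fmt"
	"sync"
)

// countWithMutex guards the shared counter with a sync.Mutex.
func countWithMutex(n int) int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	total := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			total++
			mu.Unlock()
		}()
	}
	wg.Wait()
	return total
}

// countWithChannel gives one goroutine sole ownership of the counter;
// other goroutines communicate instead of sharing memory.
func countWithChannel(n int) int {
	events := make(chan struct{})
	done := make(chan int)
	go func() {
		total := 0
		for range events {
			total++
		}
		done <- total
	}()
	for i := 0; i < n; i++ {
		events <- struct{}{}
	}
	close(events)
	return <-done
}

func main() {
	fmt.Println(countWithMutex(1000))   // 1000
	fmt.Println(countWithChannel(1000)) // 1000
}
```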

Last updated: 2026-02-18


More notes coming soon. This is a living document that grows as I learn.