Background: Why Containers for AI Workloads
As machine learning and deep learning have become core sources of competitive advantage for enterprises, data scientists are building ever more complex and compute-intensive models. Training and serving these models efficiently demands powerful GPU-backed computing infrastructure.
Traditionally, ML engineers installed GPU drivers and libraries directly on bare-metal servers, which caused several problems:
- Complex environment setup: the driver stack (CUDA, cuDNN, and so on) is hard to install and manage. Version compatibility was a constant headache: a given framework release (TensorFlow, PyTorch) often supports only specific CUDA/cuDNN versions.
- Poor reproducibility: reproducing the same experiment environment on another system was difficult. "Well, it works on my machine" was a common refrain.
- Inefficient resource usage: expensive GPUs stayed pinned to particular users or projects, dragging utilization down (bare-metal GPU servers often ran at under 30% utilization).
- Limited scalability: scaling infrastructure out for large distributed training was hard, and every new GPU server meant repeating the same environment setup.
As virtualization technology matured, containers arrived, but their underlying mechanisms, cgroups (CPU/memory resource control) and namespaces (process isolation), have fundamental limitations when it comes to special hardware like GPUs:
- GPUs resist partitioning: unlike CPU cores or memory, early GPUs could not simply be split up physically and handed out to multiple containers.
- Driver complexity: GPU access runs through a complex stack of user-space libraries and kernel drivers.
- Device file access control: access to device files such as /dev/nvidia* has to be managed safely.
Evolution of GPU Usage in Container Environments (single GPU)
- Early days (2016-2018)
- At first, you had to mount GPU device files into the container yourself and share the required libraries as volumes.
- Below is an example of manually running Docker so that TensorFlow can reach an NVIDIA GPU through CUDA:
docker run --device=/dev/nvidia0:/dev/nvidia0 \
--device=/dev/nvidiactl:/dev/nvidiactl \
-v /usr/local/cuda:/usr/local/cuda \
tensorflow/tensorflow:latest-gpu
- Problems with this early approach
- Every device file had to be specified by hand
- Possible library version conflicts between host and container
- No mechanism for sharing a GPU across containers
- Hard to automate in orchestration environments
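The flag explosion is easy to feel by generating that command for a bigger host. Below is a minimal sketch (pure Python; the device file names follow the usual /dev/nvidia* convention and are illustrative) of how many flags the pre-runtime era required:

```python
def legacy_gpu_docker_cmd(num_gpus: int, image: str) -> str:
    """Build the pre-runtime-era `docker run` command: every GPU device
    file and the CUDA toolkit directory must be wired in by hand."""
    flags = [f"--device=/dev/nvidia{i}:/dev/nvidia{i}" for i in range(num_gpus)]
    flags.append("--device=/dev/nvidiactl:/dev/nvidiactl")
    flags.append("--device=/dev/nvidia-uvm:/dev/nvidia-uvm")
    flags.append("-v /usr/local/cuda:/usr/local/cuda")
    return " ".join(["docker run", *flags, image])

cmd = legacy_gpu_docker_cmd(8, "tensorflow/tensorflow:latest-gpu")
print(cmd)  # 8 GPUs already mean 10 hand-written --device flags
```

Every one of these flags had to stay in sync with the host's driver install, which is exactly what made the approach brittle.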
- NVIDIA Container Runtime 활용 (2018-2020)
- NVIDIA는 이러한 문제를 해결하기 위해 NVIDIA Container Runtime을 개발하였음. NVIDIA Container Runtime: Docker, CRI-O 등 컨테이너 기술에서 사용하는 Open Containers Initiative (OCI) 스펙과 호환되는 GPU 인식 컨테이너 런타임임.
- 이 런타임은 다음과 같은 기능을 자동화함:
- GPU 장치 파일 마운트
- NVIDIA 드라이버 라이브러리 주입
- CUDA 호환성 검사
- GPU 기능 감지 및 노출
# For Docker versions before 19.03
docker run --runtime=nvidia nvidia/cuda:11.0-base nvidia-smi
# From Docker 19.03 on, the simpler --gpus flag is enough
docker run --gpus '"device=0,1"' nvidia/cuda:11.0-base nvidia-smi
- Automated GPU detection and configuration
- Automatic driver compatibility handling between host and container
- Better container image portability
- Kubernetes device plugins (2020-present)
- The Device Plugin proposal was first made in the Kubernetes open-source project in September 2017
- NVIDIA Device Plugin: exposes and manages NVIDIA GPU resources in a Kubernetes cluster, so pods can request GPUs declaratively in their spec
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: gpu-container
      image: nvidia/cuda:11.0-base
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 2 # request 2 GPUs
- Declarative resource management
- Cluster-level GPU resource scheduling
- Automated GPU allocation and isolation
- Resource fairness in multi-tenant environments
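Conceptually, scheduling against nvidia.com/gpu is just integer accounting: the device plugin advertises a per-node capacity and the scheduler checks the remaining headroom against the pod's limit. A toy sketch of that counting logic (not the real scheduler code):

```python
def gpu_fit(node_capacity: int, allocated: int, pod_request: int) -> bool:
    """True if the node's remaining nvidia.com/gpu count covers the pod."""
    return node_capacity - allocated >= pod_request

# A node advertising 2 GPUs can host the gpu-pod above (limit: 2)...
assert gpu_fit(node_capacity=2, allocated=0, pod_request=2)
# ...but a second identical pod must wait until GPUs free up.
assert not gpu_fit(node_capacity=2, allocated=2, pod_request=2)
```

This is why the Pending pods appear later in the demo: without sharing, the counting simply runs out.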
Hands-on: Building a Simple EKS Cluster with eksdemo
Reference: GPU sharing on Amazon EKS with NVIDIA time-slicing and accelerated EC2 instances | Amazon Web Services (aws.amazon.com)
- Prerequisites
- eksdemo: an easy option for learning, experimenting with, and demoing Amazon EKS
- helm
- AWS CLI
- kubectl
- jq
- Terraform
- Install eksdemo
# Download eksdemo
wget https://github.com/awslabs/eksdemo/releases/download/v0.18.2/eksdemo_Linux_x86_64.tar.gz
# Extract the archive
tar -xvzf eksdemo_Linux_x86_64.tar.gz
# Move the eksdemo binary into /usr/local/bin
mv eksdemo /usr/local/bin/
# Verify the install
eksdemo version
- Create an EKS cluster with eksdemo
- Run with --dry-run first to review what will be created
- Command: eksdemo create cluster gpusharing-demo -i t3.large -N 2 --region us-west-2
$ eksdemo create cluster gpusharing-demo -i t3.large -N 2 --region us-west-2 --dry-run
Eksctl Resource Manager Dry Run:
eksctl create cluster -f -
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
name: gpusharing-demo
region: us-west-2
version: "1.32"
tags:
eksdemo.io/version: 0.18.2
addons:
- name: vpc-cni
version: latest
configurationValues: |-
enableNetworkPolicy: "true"
env:
ENABLE_PREFIX_DELEGATION: "false"
cloudWatch:
clusterLogging:
enableTypes: ["*"]
iam:
withOIDC: true
serviceAccounts:
- metadata:
name: aws-load-balancer-controller
namespace: awslb
roleName: eksdemo.us-west-2.gpusharing-demo.awslb.aws-load-balanc-e4dab3bd
roleOnly: true
attachPolicy:
Version: '2012-10-17'
Statement:
- Effect: Allow
Action:
- iam:CreateServiceLinkedRole
Resource: "*"
Condition:
StringEquals:
iam:AWSServiceName: elasticloadbalancing.amazonaws.com
- Effect: Allow
Action:
- ec2:DescribeAccountAttributes
- ec2:DescribeAddresses
- ec2:DescribeAvailabilityZones
- ec2:DescribeInternetGateways
- ec2:DescribeVpcs
- ec2:DescribeVpcPeeringConnections
- ec2:DescribeSubnets
- ec2:DescribeSecurityGroups
- ec2:DescribeInstances
- ec2:DescribeNetworkInterfaces
- ec2:DescribeTags
- ec2:GetCoipPoolUsage
- ec2:DescribeCoipPools
- elasticloadbalancing:DescribeLoadBalancers
- elasticloadbalancing:DescribeLoadBalancerAttributes
- elasticloadbalancing:DescribeListeners
- elasticloadbalancing:DescribeListenerCertificates
- elasticloadbalancing:DescribeSSLPolicies
- elasticloadbalancing:DescribeRules
- elasticloadbalancing:DescribeTargetGroups
- elasticloadbalancing:DescribeTargetGroupAttributes
- elasticloadbalancing:DescribeTargetHealth
- elasticloadbalancing:DescribeTags
- elasticloadbalancing:DescribeTrustStores
- elasticloadbalancing:DescribeListenerAttributes
Resource: "*"
- Effect: Allow
Action:
- cognito-idp:DescribeUserPoolClient
- acm:ListCertificates
- acm:DescribeCertificate
- iam:ListServerCertificates
- iam:GetServerCertificate
- waf-regional:GetWebACL
- waf-regional:GetWebACLForResource
- waf-regional:AssociateWebACL
- waf-regional:DisassociateWebACL
- wafv2:GetWebACL
- wafv2:GetWebACLForResource
- wafv2:AssociateWebACL
- wafv2:DisassociateWebACL
- shield:GetSubscriptionState
- shield:DescribeProtection
- shield:CreateProtection
- shield:DeleteProtection
Resource: "*"
- Effect: Allow
Action:
- ec2:AuthorizeSecurityGroupIngress
- ec2:RevokeSecurityGroupIngress
Resource: "*"
- Effect: Allow
Action:
- ec2:CreateSecurityGroup
Resource: "*"
- Effect: Allow
Action:
- ec2:CreateTags
Resource: arn:aws:ec2:*:*:security-group/*
Condition:
StringEquals:
ec2:CreateAction: CreateSecurityGroup
'Null':
aws:RequestTag/elbv2.k8s.aws/cluster: 'false'
- Effect: Allow
Action:
- ec2:CreateTags
- ec2:DeleteTags
Resource: arn:aws:ec2:*:*:security-group/*
Condition:
'Null':
aws:RequestTag/elbv2.k8s.aws/cluster: 'true'
aws:ResourceTag/elbv2.k8s.aws/cluster: 'false'
- Effect: Allow
Action:
- ec2:AuthorizeSecurityGroupIngress
- ec2:RevokeSecurityGroupIngress
- ec2:DeleteSecurityGroup
Resource: "*"
Condition:
'Null':
aws:ResourceTag/elbv2.k8s.aws/cluster: 'false'
- Effect: Allow
Action:
- elasticloadbalancing:CreateLoadBalancer
- elasticloadbalancing:CreateTargetGroup
Resource: "*"
Condition:
'Null':
aws:RequestTag/elbv2.k8s.aws/cluster: 'false'
- Effect: Allow
Action:
- elasticloadbalancing:CreateListener
- elasticloadbalancing:DeleteListener
- elasticloadbalancing:CreateRule
- elasticloadbalancing:DeleteRule
Resource: "*"
- Effect: Allow
Action:
- elasticloadbalancing:AddTags
- elasticloadbalancing:RemoveTags
Resource:
- arn:aws:elasticloadbalancing:*:*:targetgroup/*/*
- arn:aws:elasticloadbalancing:*:*:loadbalancer/net/*/*
- arn:aws:elasticloadbalancing:*:*:loadbalancer/app/*/*
Condition:
'Null':
aws:RequestTag/elbv2.k8s.aws/cluster: 'true'
aws:ResourceTag/elbv2.k8s.aws/cluster: 'false'
- Effect: Allow
Action:
- elasticloadbalancing:AddTags
- elasticloadbalancing:RemoveTags
Resource:
- arn:aws:elasticloadbalancing:*:*:listener/net/*/*/*
- arn:aws:elasticloadbalancing:*:*:listener/app/*/*/*
- arn:aws:elasticloadbalancing:*:*:listener-rule/net/*/*/*
- arn:aws:elasticloadbalancing:*:*:listener-rule/app/*/*/*
- Effect: Allow
Action:
- elasticloadbalancing:ModifyLoadBalancerAttributes
- elasticloadbalancing:SetIpAddressType
- elasticloadbalancing:SetSecurityGroups
- elasticloadbalancing:SetSubnets
- elasticloadbalancing:DeleteLoadBalancer
- elasticloadbalancing:ModifyTargetGroup
- elasticloadbalancing:ModifyTargetGroupAttributes
- elasticloadbalancing:DeleteTargetGroup
- elasticloadbalancing:ModifyListenerAttributes
Resource: "*"
Condition:
'Null':
aws:ResourceTag/elbv2.k8s.aws/cluster: 'false'
- Effect: Allow
Action:
- elasticloadbalancing:AddTags
Resource:
- arn:aws:elasticloadbalancing:*:*:targetgroup/*/*
- arn:aws:elasticloadbalancing:*:*:loadbalancer/net/*/*
- arn:aws:elasticloadbalancing:*:*:loadbalancer/app/*/*
Condition:
StringEquals:
elasticloadbalancing:CreateAction:
- CreateTargetGroup
- CreateLoadBalancer
'Null':
aws:RequestTag/elbv2.k8s.aws/cluster: 'false'
- Effect: Allow
Action:
- elasticloadbalancing:RegisterTargets
- elasticloadbalancing:DeregisterTargets
Resource: arn:aws:elasticloadbalancing:*:*:targetgroup/*/*
- Effect: Allow
Action:
- elasticloadbalancing:SetWebAcl
- elasticloadbalancing:ModifyListener
- elasticloadbalancing:AddListenerCertificates
- elasticloadbalancing:RemoveListenerCertificates
- elasticloadbalancing:ModifyRule
Resource: "*"
- metadata:
name: ebs-csi-controller-sa
namespace: kube-system
roleName: eksdemo.us-west-2.gpusharing-demo.kube-system.ebs-csi-c-937ae3a3
roleOnly: true
attachPolicyARNs:
- arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy
- metadata:
name: external-dns
namespace: external-dns
roleName: eksdemo.us-west-2.gpusharing-demo.external-dns.external-dns
roleOnly: true
attachPolicy:
Version: '2012-10-17'
Statement:
- Effect: Allow
Action:
- route53:ChangeResourceRecordSets
Resource:
- arn:aws:route53:::hostedzone/*
- Effect: Allow
Action:
- route53:ListHostedZones
- route53:ListResourceRecordSets
- route53:ListTagsForResource
Resource:
- "*"
- metadata:
name: karpenter
namespace: karpenter
roleName: eksdemo.us-west-2.gpusharing-demo.karpenter.karpenter
roleOnly: true
attachPolicy:
Version: "2012-10-17"
Statement:
- Sid: AllowScopedEC2InstanceAccessActions
Effect: Allow
Resource:
- arn:aws:ec2:us-west-2::image/*
- arn:aws:ec2:us-west-2::snapshot/*
- arn:aws:ec2:us-west-2:*:security-group/*
- arn:aws:ec2:us-west-2:*:subnet/*
Action:
- ec2:RunInstances
- ec2:CreateFleet
- Sid: AllowScopedEC2LaunchTemplateAccessActions
Effect: Allow
Resource: arn:aws:ec2:us-west-2:*:launch-template/*
Action:
- ec2:RunInstances
- ec2:CreateFleet
Condition:
StringEquals:
aws:ResourceTag/kubernetes.io/cluster/gpusharing-demo: owned
StringLike:
aws:ResourceTag/karpenter.sh/nodepool: "*"
- Sid: AllowScopedEC2InstanceActionsWithTags
Effect: Allow
Resource:
- arn:aws:ec2:us-west-2:*:fleet/*
- arn:aws:ec2:us-west-2:*:instance/*
- arn:aws:ec2:us-west-2:*:volume/*
- arn:aws:ec2:us-west-2:*:network-interface/*
- arn:aws:ec2:us-west-2:*:launch-template/*
- arn:aws:ec2:us-west-2:*:spot-instances-request/*
Action:
- ec2:RunInstances
- ec2:CreateFleet
- ec2:CreateLaunchTemplate
Condition:
StringEquals:
aws:RequestTag/kubernetes.io/cluster/gpusharing-demo: owned
aws:RequestTag/eks:eks-cluster-name: gpusharing-demo
StringLike:
aws:RequestTag/karpenter.sh/nodepool: "*"
- Sid: AllowScopedResourceCreationTagging
Effect: Allow
Resource:
- arn:aws:ec2:us-west-2:*:fleet/*
- arn:aws:ec2:us-west-2:*:instance/*
- arn:aws:ec2:us-west-2:*:volume/*
- arn:aws:ec2:us-west-2:*:network-interface/*
- arn:aws:ec2:us-west-2:*:launch-template/*
- arn:aws:ec2:us-west-2:*:spot-instances-request/*
Action: ec2:CreateTags
Condition:
StringEquals:
aws:RequestTag/kubernetes.io/cluster/gpusharing-demo: owned
aws:RequestTag/eks:eks-cluster-name: gpusharing-demo
ec2:CreateAction:
- RunInstances
- CreateFleet
- CreateLaunchTemplate
StringLike:
aws:RequestTag/karpenter.sh/nodepool: "*"
- Sid: AllowScopedResourceTagging
Effect: Allow
Resource: arn:aws:ec2:us-west-2:*:instance/*
Action: ec2:CreateTags
Condition:
StringEquals:
aws:ResourceTag/kubernetes.io/cluster/gpusharing-demo: owned
StringLike:
aws:ResourceTag/karpenter.sh/nodepool: "*"
StringEqualsIfExists:
aws:RequestTag/eks:eks-cluster-name: gpusharing-demo
ForAllValues:StringEquals:
aws:TagKeys:
- eks:eks-cluster-name
- karpenter.sh/nodeclaim
- Name
- Sid: AllowScopedDeletion
Effect: Allow
Resource:
- arn:aws:ec2:us-west-2:*:instance/*
- arn:aws:ec2:us-west-2:*:launch-template/*
Action:
- ec2:TerminateInstances
- ec2:DeleteLaunchTemplate
Condition:
StringEquals:
aws:ResourceTag/kubernetes.io/cluster/gpusharing-demo: owned
StringLike:
aws:ResourceTag/karpenter.sh/nodepool: "*"
- Sid: AllowRegionalReadActions
Effect: Allow
Resource: "*"
Action:
- ec2:DescribeImages
- ec2:DescribeInstances
- ec2:DescribeInstanceTypeOfferings
- ec2:DescribeInstanceTypes
- ec2:DescribeLaunchTemplates
- ec2:DescribeSecurityGroups
- ec2:DescribeSpotPriceHistory
- ec2:DescribeSubnets
Condition:
StringEquals:
aws:RequestedRegion: "us-west-2"
- Sid: AllowSSMReadActions
Effect: Allow
Resource: arn:aws:ssm:us-west-2::parameter/aws/service/*
Action:
- ssm:GetParameter
- Sid: AllowPricingReadActions
Effect: Allow
Resource: "*"
Action:
- pricing:GetProducts
- Sid: AllowInterruptionQueueActions
Effect: Allow
Resource: arn:aws:sqs:us-west-2:767397897074:karpenter-gpusharing-demo
Action:
- sqs:DeleteMessage
- sqs:GetQueueUrl
- sqs:ReceiveMessage
- Sid: AllowPassingInstanceRole
Effect: Allow
Resource: arn:aws:iam::767397897074:role/KarpenterNodeRole-gpusharing-demo
Action: iam:PassRole
Condition:
StringEquals:
iam:PassedToService:
- ec2.amazonaws.com
- ec2.amazonaws.com.cn
- Sid: AllowScopedInstanceProfileCreationActions
Effect: Allow
Resource: arn:aws:iam::767397897074:instance-profile/*
Action:
- iam:CreateInstanceProfile
Condition:
StringEquals:
aws:RequestTag/kubernetes.io/cluster/gpusharing-demo: owned
aws:RequestTag/eks:eks-cluster-name: gpusharing-demo
aws:RequestTag/topology.kubernetes.io/region: "us-west-2"
StringLike:
aws:RequestTag/karpenter.k8s.aws/ec2nodeclass: "*"
- Sid: AllowScopedInstanceProfileTagActions
Effect: Allow
Resource: arn:aws:iam::767397897074:instance-profile/*
Action:
- iam:TagInstanceProfile
Condition:
StringEquals:
aws:ResourceTag/kubernetes.io/cluster/gpusharing-demo: owned
aws:ResourceTag/topology.kubernetes.io/region: "us-west-2"
aws:RequestTag/kubernetes.io/cluster/gpusharing-demo: owned
aws:RequestTag/eks:eks-cluster-name: gpusharing-demo
aws:RequestTag/topology.kubernetes.io/region: "us-west-2"
StringLike:
aws:ResourceTag/karpenter.k8s.aws/ec2nodeclass: "*"
aws:RequestTag/karpenter.k8s.aws/ec2nodeclass: "*"
- Sid: AllowScopedInstanceProfileActions
Effect: Allow
Resource: arn:aws:iam::767397897074:instance-profile/*
Action:
- iam:AddRoleToInstanceProfile
- iam:RemoveRoleFromInstanceProfile
- iam:DeleteInstanceProfile
Condition:
StringEquals:
aws:ResourceTag/kubernetes.io/cluster/gpusharing-demo: owned
aws:ResourceTag/topology.kubernetes.io/region: "us-west-2"
StringLike:
aws:ResourceTag/karpenter.k8s.aws/ec2nodeclass: "*"
- Sid: AllowInstanceProfileReadActions
Effect: Allow
Resource: arn:aws:iam::767397897074:instance-profile/*
Action: iam:GetInstanceProfile
- Sid: AllowAPIServerEndpointDiscovery
Effect: Allow
Resource: arn:aws:eks:us-west-2:767397897074:cluster/gpusharing-demo
Action: eks:DescribeCluster
vpc:
cidr: 192.168.0.0/16
hostnameType: resource-name
managedNodeGroups:
- name: main
ami: ami-092590b6039cd49ed
amiFamily: AmazonLinux2
desiredCapacity: 2
iam:
attachPolicyARNs:
- arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
- arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
- arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
instanceType: t3.large
minSize: 0
maxSize: 10
volumeSize: 80
volumeType: gp3
overrideBootstrapCommand: |
#!/bin/bash
/etc/eks/bootstrap.sh gpusharing-demo
privateNetworking: true
spot: false
GPU Time-slicing on Amazon EKS - Adding a GPU Node Group via eksdemo
- Add a GPU node to the EKS cluster with eksdemo
- Run with --dry-run first to review what will be created
- GPU node type: g5.8xlarge
- Command: eksdemo create nodegroup gpu -i g5.8xlarge -N 1 -c gpusharing-demo
$ eksdemo create nodegroup gpu -i g5.8xlarge -N 1 -c gpusharing-demo --dry-run
Eksctl Resource Manager Dry Run:
eksctl create nodegroup -f - --install-nvidia-plugin=false --install-neuron-plugin=false
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
name: gpusharing-demo
region: us-west-2
version: "1.32"
tags:
eksdemo.io/version: 0.18.2
managedNodeGroups:
- name: gpu
ami: ami-0111ed894dc3ec059
amiFamily: AmazonLinux2
desiredCapacity: 1
iam:
attachPolicyARNs:
- arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
- arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
- arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
instanceType: g5.8xlarge
minSize: 0
maxSize: 10
volumeSize: 80
volumeType: gp3
overrideBootstrapCommand: |
#!/bin/bash
/etc/eks/bootstrap.sh gpusharing-demo
privateNetworking: true
spot: false
taints:
- key: nvidia.com/gpu
value: ""
effect: NoSchedule
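Note the taints block above: the GPU node is tainted nvidia.com/gpu=:NoSchedule, so only pods that tolerate that taint can be placed there (the pod describe output further down shows the matching nvidia.com/gpu:NoSchedule op=Exists toleration). A simplified sketch of the matching rule, assuming only key/operator/value/effect fields:

```python
def tolerates(taint: dict, tolerations: list) -> bool:
    """Simplified taint matching: an Exists toleration matches any value
    of the key; an empty effect on the toleration matches all effects."""
    for t in tolerations:
        if t.get("key") != taint["key"]:
            continue
        if t.get("effect") not in (None, taint["effect"]):
            continue
        if t.get("operator", "Equal") == "Exists":
            return True
        if t.get("value") == taint["value"]:
            return True
    return False

gpu_taint = {"key": "nvidia.com/gpu", "value": "", "effect": "NoSchedule"}
# A pod carrying the Exists toleration lands on the GPU node...
assert tolerates(gpu_taint, [{"key": "nvidia.com/gpu", "operator": "Exists",
                              "effect": "NoSchedule"}])
# ...while an ordinary pod without any toleration is repelled.
assert not tolerates(gpu_taint, [])
```

Keeping the taint on ensures CPU-only workloads never land on (and waste) the expensive g5.8xlarge node.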
GPU Time-slicing on Amazon EKS - Without Time-Slicing
$ kubectl get node --label-columns=node.kubernetes.io/instance-type,eks.amazonaws.com/capacityType,topology.kubernetes.io/zone
NAME STATUS ROLES AGE VERSION INSTANCE-TYPE CAPACITYTYPE ZONE
i-01edff2cd3456af51.us-west-2.compute.internal Ready <none> 43m v1.32.1-eks-5d632ec t3.large ON_DEMAND us-west-2d
i-072eb4b66c886d137.us-west-2.compute.internal Ready <none> 43m v1.32.1-eks-5d632ec t3.large ON_DEMAND us-west-2a
i-0af783eca345807e8.us-west-2.compute.internal Ready <none> 4m53s v1.32.1-eks-5d632ec g5.8xlarge ON_DEMAND us-west-2a
# Label the g5 GPU node: replace "i-0af783eca345807e8.us-west-2.compute.internal" with your node name!
$ kubectl label node i-0af783eca345807e8.us-west-2.compute.internal eks-node=gpu
# Note: scale the node group if needed (e.g., to 2, within your quota limits)
$ eksctl scale nodegroup --name gpu --cluster gpusharing-demo --nodes 2
# Download nvdp-values.yaml
$ curl -O https://raw.githubusercontent.com/sanjeevrg89/eks-gpu-sharing-demo/refs/heads/main/nvdp-values.yaml
# Install nvidia-device-plugin via Helm
$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
$ helm repo update
$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
--namespace kube-system \
-f nvdp-values.yaml \
--version 0.14.0
Release "nvdp" does not exist. Installing it now.
NAME: nvdp
LAST DEPLOYED: Sun Apr 13 00:50:38 2025
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
$ kubectl get daemonset -n kube-system | grep nvidia
nvdp-nvidia-device-plugin 1 1 1 1 1 eks-node=gpu 23s
# Without time-slicing enabled: the node exposes a single GPU
$ kubectl get nodes -o json | jq -r '.items[] | select(.status.capacity."nvidia.com/gpu" != null) | {name: .metadata.name, capacity: .status.capacity}'
{
"name": "i-0af783eca345807e8.us-west-2.compute.internal",
"capacity": {
"cpu": "32",
"ephemeral-storage": "83873772Ki",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "130502176Ki",
"nvidia.com/gpu": "1",
"pods": "234"
}
}
# Deploy the GPU training workload
$ kubectl create namespace gpu-demo
$ cat << EOF > cifar10-train-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-cifar10-deployment
  namespace: gpu-demo
  labels:
    app: tensorflow-cifar10
spec:
  replicas: 5
  selector:
    matchLabels:
      app: tensorflow-cifar10
  template:
    metadata:
      labels:
        app: tensorflow-cifar10
    spec:
      containers:
        - name: tensorflow-cifar10
          image: public.ecr.aws/r5m2h0c9/cifar10_cnn:v2
          resources:
            limits:
              nvidia.com/gpu: 1
EOF
$ kubectl apply -f cifar10-train-deploy.yaml
# After a while, pods move from ContainerCreating to Running
$ watch -d 'kubectl get pods -n gpu-demo'
# [Optional] Inspect a pod's status: the image is still being pulled
$ kubectl describe pod tensorflow-cifar10-deployment-7c6f89c8d6-kpzmb -n gpu-demo
Name: tensorflow-cifar10-deployment-7c6f89c8d6-kpzmb
Namespace: gpu-demo
Priority: 0
Service Account: default
Node: i-0af783eca345807e8.us-west-2.compute.internal/192.168.123.238
Start Time: Sun, 13 Apr 2025 00:53:46 +0900
Labels: app=tensorflow-cifar10
pod-template-hash=7c6f89c8d6
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/tensorflow-cifar10-deployment-7c6f89c8d6
Containers:
tensorflow-cifar10:
Container ID:
Image: public.ecr.aws/r5m2h0c9/cifar10_cnn:v2
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-z5fss (ro)
Conditions:
Type Status
PodReadyToStartContainers False
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-z5fss:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 78s default-scheduler Successfully assigned gpu-demo/tensorflow-cifar10-deployment-7c6f89c8d6-kpzmb to i-0af783eca345807e8.us-west-2.compute.internal
Normal Pulling 77s kubelet Pulling image "public.ecr.aws/r5m2h0c9/cifar10_cnn:v2"
# Now Running!
$ kubectl get pods -n gpu-demo
NAME READY STATUS RESTARTS AGE
tensorflow-cifar10-deployment-7c6f89c8d6-k7z77 0/1 Pending 0 2m6s
tensorflow-cifar10-deployment-7c6f89c8d6-kpzmb 1/1 Running 0 2m6s
tensorflow-cifar10-deployment-7c6f89c8d6-n6djg 0/1 Pending 0 2m6s
tensorflow-cifar10-deployment-7c6f89c8d6-x7zws 0/1 Pending 0 2m6s
tensorflow-cifar10-deployment-7c6f89c8d6-zx8dq 0/1 Pending 0 2m6s
# The remaining pods cannot be scheduled: not enough GPU resources
$ kubectl describe pod tensorflow-cifar10-deployment-7c6f89c8d6-k7z77 -n gpu-demo
Name: tensorflow-cifar10-deployment-7c6f89c8d6-k7z77
Namespace: gpu-demo
Priority: 0
Service Account: default
Node: <none>
Labels: app=tensorflow-cifar10
pod-template-hash=7c6f89c8d6
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/tensorflow-cifar10-deployment-7c6f89c8d6
Containers:
tensorflow-cifar10:
Image: public.ecr.aws/r5m2h0c9/cifar10_cnn:v2
Port: <none>
Host Port: <none>
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-x2jt4 (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
kube-api-access-x2jt4:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 2m46s default-scheduler 0/3 nodes are available: 3 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.
GPU Time-slicing on Amazon EKS - With Time-Slicing
# Enable time-slicing
$ cat << EOF > nvidia-device-plugin.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 10
EOF
$ kubectl apply -f nvidia-device-plugin.yaml
# Roll the plugin out with the new ConfigMap
$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
--namespace kube-system \
-f nvdp-values.yaml \
--version 0.14.0 \
--set config.name=nvidia-device-plugin \
--force
Release "nvdp" has been upgraded. Happy Helming!
NAME: nvdp
LAST DEPLOYED: Sun Apr 13 01:03:05 2025
NAMESPACE: kube-system
STATUS: deployed
REVISION: 2
TEST SUITE: None
# Query again after a moment: the GPU count has grown!
$ kubectl get nodes -o json | jq -r '.items[] | select(.status.capacity."nvidia.com/gpu" != null) | {name: .metadata.name, capacity: .status.capacity}'
{
"name": "i-0af783eca345807e8.us-west-2.compute.internal",
"capacity": {
"cpu": "32",
"ephemeral-storage": "83873772Ki",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "130502176Ki",
"nvidia.com/gpu": "10",
"pods": "234"
}
}
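The jump from 1 to 10 is exactly what replicas: 10 in the ConfigMap asks for: under time-slicing, the device plugin re-advertises each physical GPU as that many schedulable nvidia.com/gpu units. The arithmetic, as a sketch:

```python
def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    """Capacity the device plugin reports under time-slicing."""
    return physical_gpus * replicas

# The g5.8xlarge node above: 1 physical A10G x 10 replicas = 10 units.
assert advertised_gpus(1, 10) == 10
```

Keep in mind these are time-shared slices of the same GPU, not isolated devices: there is no compute or memory fencing between them.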
# All pods running now
$ kubectl get pods -n gpu-demo
NAME READY STATUS RESTARTS AGE
tensorflow-cifar10-deployment-7c6f89c8d6-k7z77 1/1 Running 1 (24s ago) 10m
tensorflow-cifar10-deployment-7c6f89c8d6-kpzmb 1/1 Running 0 10m
tensorflow-cifar10-deployment-7c6f89c8d6-n6djg 1/1 Running 1 (34s ago) 10m
tensorflow-cifar10-deployment-7c6f89c8d6-x7zws 1/1 Running 1 (35s ago) 10m
tensorflow-cifar10-deployment-7c6f89c8d6-zx8dq 1/1 Running 1 (34s ago) 10m
# ...but they keep restarting
$ kubectl get pods -n gpu-demo
NAME READY STATUS RESTARTS AGE
tensorflow-cifar10-deployment-7c6f89c8d6-k7z77 1/1 Running 4 (57s ago) 12m
tensorflow-cifar10-deployment-7c6f89c8d6-kpzmb 1/1 Running 0 12m
tensorflow-cifar10-deployment-7c6f89c8d6-n6djg 1/1 Running 4 (56s ago) 12m
tensorflow-cifar10-deployment-7c6f89c8d6-x7zws 0/1 Error 3 (91s ago) 12m
tensorflow-cifar10-deployment-7c6f89c8d6-zx8dq 1/1 Running 1 (2m50s ago) 12m
# Out-of-memory issue
$ kubectl logs tensorflow-cifar10-deployment-7c6f89c8d6-k7z77 -n gpu-demo | grep memory
2025-04-12 16:07:59.910576: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 196 MB memory: -> device: 0, name: NVIDIA A10G, pci bus id: 0000:00:1e.0, compute capability: 8.6
2025-04-12 16:07:59.924567: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:736] failed to allocate 196.62MiB (206176256 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out ofmemory
2025-04-12 16:08:01.219272: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:222] Failure to initialize cublas may be due to OOM (cublas needs some free memory when you initialize it, and your deep-learning framework may have preallocated more than its fair share), or may be because this binary was not built with support for the GPU in your machine.
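Why the restarts: time-slicing multiplexes compute but does not partition GPU memory, so all five TensorFlow processes compete for the A10G's single memory pool, and each tries to preallocate its own chunk. A rough arithmetic sketch (24 GiB is the A10G's device memory; the per-process preallocation figures are illustrative assumptions):

```python
def oversubscribed(total_mib: int, pods: int, prealloc_mib: int) -> bool:
    """True when the pods' combined preallocation exceeds device memory."""
    return pods * prealloc_mib > total_mib

A10G_MIB = 24 * 1024  # 24 GiB, shared by every time slice on the node
# If each TF process grabs, say, 6 GiB up front (illustrative figure),
# five replicas want 30 GiB > 24 GiB -> CUDA_ERROR_OUT_OF_MEMORY, restarts.
assert oversubscribed(A10G_MIB, pods=5, prealloc_mib=6 * 1024)
# Capping each process at 4 GiB would fit all five:
assert not oversubscribed(A10G_MIB, pods=5, prealloc_mib=4 * 1024)
```

The usual mitigations are capping per-process GPU memory in the framework (for example TensorFlow's logical device memory_limit setting) or using MIG, which, unlike time-slicing, does partition memory between tenants.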