Background: Why Containers for AI Workloads
As machine learning and deep learning have become core sources of competitive advantage for enterprises, data scientists are building ever more complex and compute-intensive models. Training and serving these models efficiently demands powerful GPU-backed computing infrastructure.
Traditionally, ML engineers installed GPU drivers and libraries directly on bare-metal servers, which caused several problems:
- Complex environment setup: the driver stack (CUDA, cuDNN, and so on) is hard to install and manage. Version compatibility was a constant headache: a given framework release (TensorFlow, PyTorch) often supports only specific CUDA/cuDNN versions.
- Poor reproducibility: reproducing the same experiment environment on another system was difficult. "Well, it works on my machine" was a common refrain.
- Inefficient resource usage: expensive GPUs stayed pinned to particular users or projects, dragging utilization down (bare-metal GPU servers often ran at under 30% utilization).
- Limited scalability: scaling infrastructure out for large distributed training was hard, and every new GPU server meant repeating the same environment setup.
As virtualization technology matured, containers arrived, but their underlying mechanisms, cgroups (CPU/memory resource control) and namespaces (process isolation), have fundamental limitations when it comes to special hardware like GPUs:
- GPUs resist partitioning: unlike CPU cores or memory, early GPUs could not simply be split up physically and handed out to multiple containers.
- Driver complexity: GPU access runs through a complex stack of user-space libraries and kernel drivers.
- Device file access control: access to device files such as /dev/nvidia* has to be managed safely.
Evolution of GPU Usage in Container Environments (single GPU)
- Early days (2016-2018)
- At first, you had to mount GPU device files into the container yourself and share the required libraries as volumes.
- Below is an example of manually running Docker so that TensorFlow can reach an NVIDIA GPU through CUDA:
docker run --device=/dev/nvidia0:/dev/nvidia0 \
--device=/dev/nvidiactl:/dev/nvidiactl \
-v /usr/local/cuda:/usr/local/cuda \
tensorflow/tensorflow:latest-gpu
- Problems with this early approach
- Every device file had to be specified by hand
- Possible library version conflicts between host and container
- No mechanism for sharing a GPU across containers
- Hard to automate in orchestration environments
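The flag explosion is easy to feel by generating that command for a bigger host. Below is a minimal sketch (pure Python; the device file names follow the usual /dev/nvidia* convention and are illustrative) of how many flags the pre-runtime era required:

```python
def legacy_gpu_docker_cmd(num_gpus: int, image: str) -> str:
    """Build the pre-runtime-era `docker run` command: every GPU device
    file and the CUDA toolkit directory must be wired in by hand."""
    flags = [f"--device=/dev/nvidia{i}:/dev/nvidia{i}" for i in range(num_gpus)]
    flags.append("--device=/dev/nvidiactl:/dev/nvidiactl")
    flags.append("--device=/dev/nvidia-uvm:/dev/nvidia-uvm")
    flags.append("-v /usr/local/cuda:/usr/local/cuda")
    return " ".join(["docker run", *flags, image])

cmd = legacy_gpu_docker_cmd(8, "tensorflow/tensorflow:latest-gpu")
print(cmd)  # 8 GPUs already mean 10 hand-written --device flags
```

Every one of these flags had to stay in sync with the host's driver install, which is exactly what made the approach brittle.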
- NVIDIA Container Runtime 활용 (2018-2020)
- NVIDIA는 이러한 문제를 해결하기 위해 NVIDIA Container Runtime을 개발하였음. NVIDIA Container Runtime: Docker, CRI-O 등 컨테이너 기술에서 사용하는 Open Containers Initiative (OCI) 스펙과 호환되는 GPU 인식 컨테이너 런타임임.
- 이 런타임은 다음과 같은 기능을 자동화함:
- GPU 장치 파일 마운트
- NVIDIA 드라이버 라이브러리 주입
- CUDA 호환성 검사
- GPU 기능 감지 및 노출
# For Docker versions before 19.03
docker run --runtime=nvidia nvidia/cuda:11.0-base nvidia-smi
# From Docker 19.03 on, the simpler --gpus flag is enough
docker run --gpus '"device=0,1"' nvidia/cuda:11.0-base nvidia-smi
- Automated GPU detection and configuration
- Automatic driver compatibility handling between host and container
- Better container image portability
- Kubernetes device plugins (2020-present)
- The Device Plugin proposal was first made in the Kubernetes open-source project in September 2017
- NVIDIA Device Plugin: exposes and manages NVIDIA GPU resources in a Kubernetes cluster, so pods can request GPUs declaratively in their spec
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: gpu-container
      image: nvidia/cuda:11.0-base
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 2 # request 2 GPUs
- Declarative resource management
- Cluster-level GPU resource scheduling
- Automated GPU allocation and isolation
- Resource fairness in multi-tenant environments
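Conceptually, scheduling against nvidia.com/gpu is just integer accounting: the device plugin advertises a per-node capacity and the scheduler checks the remaining headroom against the pod's limit. A toy sketch of that counting logic (not the real scheduler code):

```python
def gpu_fit(node_capacity: int, allocated: int, pod_request: int) -> bool:
    """True if the node's remaining nvidia.com/gpu count covers the pod."""
    return node_capacity - allocated >= pod_request

# A node advertising 2 GPUs can host the gpu-pod above (limit: 2)...
assert gpu_fit(node_capacity=2, allocated=0, pod_request=2)
# ...but a second identical pod must wait until GPUs free up.
assert not gpu_fit(node_capacity=2, allocated=2, pod_request=2)
```

This is why the Pending pods appear later in the demo: without sharing, the counting simply runs out.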
Hands-on: Building a Simple EKS Cluster with eksdemo
Reference: GPU sharing on Amazon EKS with NVIDIA time-slicing and accelerated EC2 instances | Amazon Web Services (aws.amazon.com)
- Prerequisites
- eksdemo: an easy option for learning, experimenting with, and demoing Amazon EKS
- helm
- AWS CLI
- kubectl
- jq
- Terraform
- Install eksdemo
# Download eksdemo
wget https://github.com/awslabs/eksdemo/releases/download/v0.18.2/eksdemo_Linux_x86_64.tar.gz
# Extract the archive
tar -xvzf eksdemo_Linux_x86_64.tar.gz
# Move the eksdemo binary into /usr/local/bin
mv eksdemo /usr/local/bin/
# Verify the install
eksdemo version
- Create an EKS cluster with eksdemo
- Run with --dry-run first to review what will be created
- Command: eksdemo create cluster gpusharing-demo -i t3.large -N 2 --region us-west-2
$ eksdemo create cluster gpusharing-demo -i t3.large -N 2 --region us-west-2 --dry-run
Eksctl Resource Manager Dry Run:
eksctl create cluster -f -
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
name: gpusharing-demo
region: us-west-2
version: "1.32"
tags:
eksdemo.io/version: 0.18.2
addons:
- name: vpc-cni
version: latest
configurationValues: |-
enableNetworkPolicy: "true"
env:
ENABLE_PREFIX_DELEGATION: "false"
cloudWatch:
clusterLogging:
enableTypes: ["*"]
iam:
withOIDC: true
serviceAccounts:
- metadata:
name: aws-load-balancer-controller
namespace: awslb
roleName: eksdemo.us-west-2.gpusharing-demo.awslb.aws-load-balanc-e4dab3bd
roleOnly: true
attachPolicy:
Version: '2012-10-17'
Statement:
- Effect: Allow
Action:
- iam:CreateServiceLinkedRole
Resource: "*"
Condition:
StringEquals:
iam:AWSServiceName: elasticloadbalancing.amazonaws.com
- Effect: Allow
Action:
- ec2:DescribeAccountAttributes
- ec2:DescribeAddresses
- ec2:DescribeAvailabilityZones
- ec2:DescribeInternetGateways
- ec2:DescribeVpcs
- ec2:DescribeVpcPeeringConnections
- ec2:DescribeSubnets
- ec2:DescribeSecurityGroups
- ec2:DescribeInstances
- ec2:DescribeNetworkInterfaces
- ec2:DescribeTags
- ec2:GetCoipPoolUsage
- ec2:DescribeCoipPools
- elasticloadbalancing:DescribeLoadBalancers
- elasticloadbalancing:DescribeLoadBalancerAttributes
- elasticloadbalancing:DescribeListeners
- elasticloadbalancing:DescribeListenerCertificates
- elasticloadbalancing:DescribeSSLPolicies
- elasticloadbalancing:DescribeRules
- elasticloadbalancing:DescribeTargetGroups
- elasticloadbalancing:DescribeTargetGroupAttributes
- elasticloadbalancing:DescribeTargetHealth
- elasticloadbalancing:DescribeTags
- elasticloadbalancing:DescribeTrustStores
- elasticloadbalancing:DescribeListenerAttributes
Resource: "*"
- Effect: Allow
Action:
- cognito-idp:DescribeUserPoolClient
- acm:ListCertificates
- acm:DescribeCertificate
- iam:ListServerCertificates
- iam:GetServerCertificate
- waf-regional:GetWebACL
- waf-regional:GetWebACLForResource
- waf-regional:AssociateWebACL
- waf-regional:DisassociateWebACL
- wafv2:GetWebACL
- wafv2:GetWebACLForResource
- wafv2:AssociateWebACL
- wafv2:DisassociateWebACL
- shield:GetSubscriptionState
- shield:DescribeProtection
- shield:CreateProtection
- shield:DeleteProtection
Resource: "*"
- Effect: Allow
Action:
- ec2:AuthorizeSecurityGroupIngress
- ec2:RevokeSecurityGroupIngress
Resource: "*"
- Effect: Allow
Action:
- ec2:CreateSecurityGroup
Resource: "*"
- Effect: Allow
Action:
- ec2:CreateTags
Resource: arn:aws:ec2:*:*:security-group/*
Condition:
StringEquals:
ec2:CreateAction: CreateSecurityGroup
'Null':
aws:RequestTag/elbv2.k8s.aws/cluster: 'false'
- Effect: Allow
Action:
- ec2:CreateTags
- ec2:DeleteTags
Resource: arn:aws:ec2:*:*:security-group/*
Condition:
'Null':
aws:RequestTag/elbv2.k8s.aws/cluster: 'true'
aws:ResourceTag/elbv2.k8s.aws/cluster: 'false'
- Effect: Allow
Action:
- ec2:AuthorizeSecurityGroupIngress
- ec2:RevokeSecurityGroupIngress
- ec2:DeleteSecurityGroup
Resource: "*"
Condition:
'Null':
aws:ResourceTag/elbv2.k8s.aws/cluster: 'false'
- Effect: Allow
Action:
- elasticloadbalancing:CreateLoadBalancer
- elasticloadbalancing:CreateTargetGroup
Resource: "*"
Condition:
'Null':
aws:RequestTag/elbv2.k8s.aws/cluster: 'false'
- Effect: Allow
Action:
- elasticloadbalancing:CreateListener
- elasticloadbalancing:DeleteListener
- elasticloadbalancing:CreateRule
- elasticloadbalancing:DeleteRule
Resource: "*"
- Effect: Allow
Action:
- elasticloadbalancing:AddTags
- elasticloadbalancing:RemoveTags
Resource:
- arn:aws:elasticloadbalancing:*:*:targetgroup/*/*
- arn:aws:elasticloadbalancing:*:*:loadbalancer/net/*/*
- arn:aws:elasticloadbalancing:*:*:loadbalancer/app/*/*
Condition:
'Null':
aws:RequestTag/elbv2.k8s.aws/cluster: 'true'
aws:ResourceTag/elbv2.k8s.aws/cluster: 'false'
- Effect: Allow
Action:
- elasticloadbalancing:AddTags
- elasticloadbalancing:RemoveTags
Resource:
- arn:aws:elasticloadbalancing:*:*:listener/net/*/*/*
- arn:aws:elasticloadbalancing:*:*:listener/app/*/*/*
- arn:aws:elasticloadbalancing:*:*:listener-rule/net/*/*/*
- arn:aws:elasticloadbalancing:*:*:listener-rule/app/*/*/*
- Effect: Allow
Action:
- elasticloadbalancing:ModifyLoadBalancerAttributes
- elasticloadbalancing:SetIpAddressType
- elasticloadbalancing:SetSecurityGroups
- elasticloadbalancing:SetSubnets
- elasticloadbalancing:DeleteLoadBalancer
- elasticloadbalancing:ModifyTargetGroup
- elasticloadbalancing:ModifyTargetGroupAttributes
- elasticloadbalancing:DeleteTargetGroup
- elasticloadbalancing:ModifyListenerAttributes
Resource: "*"
Condition:
'Null':
aws:ResourceTag/elbv2.k8s.aws/cluster: 'false'
- Effect: Allow
Action:
- elasticloadbalancing:AddTags
Resource:
- arn:aws:elasticloadbalancing:*:*:targetgroup/*/*
- arn:aws:elasticloadbalancing:*:*:loadbalancer/net/*/*
- arn:aws:elasticloadbalancing:*:*:loadbalancer/app/*/*
Condition:
StringEquals:
elasticloadbalancing:CreateAction:
- CreateTargetGroup
- CreateLoadBalancer
'Null':
aws:RequestTag/elbv2.k8s.aws/cluster: 'false'
- Effect: Allow
Action:
- elasticloadbalancing:RegisterTargets
- elasticloadbalancing:DeregisterTargets
Resource: arn:aws:elasticloadbalancing:*:*:targetgroup/*/*
- Effect: Allow
Action:
- elasticloadbalancing:SetWebAcl
- elasticloadbalancing:ModifyListener
- elasticloadbalancing:AddListenerCertificates
- elasticloadbalancing:RemoveListenerCertificates
- elasticloadbalancing:ModifyRule
Resource: "*"
- metadata:
name: ebs-csi-controller-sa
namespace: kube-system
roleName: eksdemo.us-west-2.gpusharing-demo.kube-system.ebs-csi-c-937ae3a3
roleOnly: true
attachPolicyARNs:
- arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy
- metadata:
name: external-dns
namespace: external-dns
roleName: eksdemo.us-west-2.gpusharing-demo.external-dns.external-dns
roleOnly: true
attachPolicy:
Version: '2012-10-17'
Statement:
- Effect: Allow
Action:
- route53:ChangeResourceRecordSets
Resource:
- arn:aws:route53:::hostedzone/*
- Effect: Allow
Action:
- route53:ListHostedZones
- route53:ListResourceRecordSets
- route53:ListTagsForResource
Resource:
- "*"
- metadata:
name: karpenter
namespace: karpenter
roleName: eksdemo.us-west-2.gpusharing-demo.karpenter.karpenter
roleOnly: true
attachPolicy:
Version: "2012-10-17"
Statement:
- Sid: AllowScopedEC2InstanceAccessActions
Effect: Allow
Resource:
- arn:aws:ec2:us-west-2::image/*
- arn:aws:ec2:us-west-2::snapshot/*
- arn:aws:ec2:us-west-2:*:security-group/*
- arn:aws:ec2:us-west-2:*:subnet/*
Action:
- ec2:RunInstances
- ec2:CreateFleet
- Sid: AllowScopedEC2LaunchTemplateAccessActions
Effect: Allow
Resource: arn:aws:ec2:us-west-2:*:launch-template/*
Action:
- ec2:RunInstances
- ec2:CreateFleet
Condition:
StringEquals:
aws:ResourceTag/kubernetes.io/cluster/gpusharing-demo: owned
StringLike:
aws:ResourceTag/karpenter.sh/nodepool: "*"
- Sid: AllowScopedEC2InstanceActionsWithTags
Effect: Allow
Resource:
- arn:aws:ec2:us-west-2:*:fleet/*
- arn:aws:ec2:us-west-2:*:instance/*
- arn:aws:ec2:us-west-2:*:volume/*
- arn:aws:ec2:us-west-2:*:network-interface/*
- arn:aws:ec2:us-west-2:*:launch-template/*
- arn:aws:ec2:us-west-2:*:spot-instances-request/*
Action:
- ec2:RunInstances
- ec2:CreateFleet
- ec2:CreateLaunchTemplate
Condition:
StringEquals:
aws:RequestTag/kubernetes.io/cluster/gpusharing-demo: owned
aws:RequestTag/eks:eks-cluster-name: gpusharing-demo
StringLike:
aws:RequestTag/karpenter.sh/nodepool: "*"
- Sid: AllowScopedResourceCreationTagging
Effect: Allow
Resource:
- arn:aws:ec2:us-west-2:*:fleet/*
- arn:aws:ec2:us-west-2:*:instance/*
- arn:aws:ec2:us-west-2:*:volume/*
- arn:aws:ec2:us-west-2:*:network-interface/*
- arn:aws:ec2:us-west-2:*:launch-template/*
- arn:aws:ec2:us-west-2:*:spot-instances-request/*
Action: ec2:CreateTags
Condition:
StringEquals:
aws:RequestTag/kubernetes.io/cluster/gpusharing-demo: owned
aws:RequestTag/eks:eks-cluster-name: gpusharing-demo
ec2:CreateAction:
- RunInstances
- CreateFleet
- CreateLaunchTemplate
StringLike:
aws:RequestTag/karpenter.sh/nodepool: "*"
- Sid: AllowScopedResourceTagging
Effect: Allow
Resource: arn:aws:ec2:us-west-2:*:instance/*
Action: ec2:CreateTags
Condition:
StringEquals:
aws:ResourceTag/kubernetes.io/cluster/gpusharing-demo: owned
StringLike:
aws:ResourceTag/karpenter.sh/nodepool: "*"
StringEqualsIfExists:
aws:RequestTag/eks:eks-cluster-name: gpusharing-demo
ForAllValues:StringEquals:
aws:TagKeys:
- eks:eks-cluster-name
- karpenter.sh/nodeclaim
- Name
- Sid: AllowScopedDeletion
Effect: Allow
Resource:
- arn:aws:ec2:us-west-2:*:instance/*
- arn:aws:ec2:us-west-2:*:launch-template/*
Action:
- ec2:TerminateInstances
- ec2:DeleteLaunchTemplate
Condition:
StringEquals:
aws:ResourceTag/kubernetes.io/cluster/gpusharing-demo: owned
StringLike:
aws:ResourceTag/karpenter.sh/nodepool: "*"
- Sid: AllowRegionalReadActions
Effect: Allow
Resource: "*"
Action:
- ec2:DescribeImages
- ec2:DescribeInstances
- ec2:DescribeInstanceTypeOfferings
- ec2:DescribeInstanceTypes
- ec2:DescribeLaunchTemplates
- ec2:DescribeSecurityGroups
- ec2:DescribeSpotPriceHistory
- ec2:DescribeSubnets
Condition:
StringEquals:
aws:RequestedRegion: "us-west-2"
- Sid: AllowSSMReadActions
Effect: Allow
Resource: arn:aws:ssm:us-west-2::parameter/aws/service/*
Action:
- ssm:GetParameter
- Sid: AllowPricingReadActions
Effect: Allow
Resource: "*"
Action:
- pricing:GetProducts
- Sid: AllowInterruptionQueueActions
Effect: Allow
Resource: arn:aws:sqs:us-west-2:767397897074:karpenter-gpusharing-demo
Action:
- sqs:DeleteMessage
- sqs:GetQueueUrl
- sqs:ReceiveMessage
- Sid: AllowPassingInstanceRole
Effect: Allow
Resource: arn:aws:iam::767397897074:role/KarpenterNodeRole-gpusharing-demo
Action: iam:PassRole
Condition:
StringEquals:
iam:PassedToService:
- ec2.amazonaws.com
- ec2.amazonaws.com.cn
- Sid: AllowScopedInstanceProfileCreationActions
Effect: Allow
Resource: arn:aws:iam::767397897074:instance-profile/*
Action:
- iam:CreateInstanceProfile
Condition:
StringEquals:
aws:RequestTag/kubernetes.io/cluster/gpusharing-demo: owned
aws:RequestTag/eks:eks-cluster-name: gpusharing-demo
aws:RequestTag/topology.kubernetes.io/region: "us-west-2"
StringLike:
aws:RequestTag/karpenter.k8s.aws/ec2nodeclass: "*"
- Sid: AllowScopedInstanceProfileTagActions
Effect: Allow
Resource: arn:aws:iam::767397897074:instance-profile/*
Action:
- iam:TagInstanceProfile
Condition:
StringEquals:
aws:ResourceTag/kubernetes.io/cluster/gpusharing-demo: owned
aws:ResourceTag/topology.kubernetes.io/region: "us-west-2"
aws:RequestTag/kubernetes.io/cluster/gpusharing-demo: owned
aws:RequestTag/eks:eks-cluster-name: gpusharing-demo
aws:RequestTag/topology.kubernetes.io/region: "us-west-2"
StringLike:
aws:ResourceTag/karpenter.k8s.aws/ec2nodeclass: "*"
aws:RequestTag/karpenter.k8s.aws/ec2nodeclass: "*"
- Sid: AllowScopedInstanceProfileActions
Effect: Allow
Resource: arn:aws:iam::767397897074:instance-profile/*
Action:
- iam:AddRoleToInstanceProfile
- iam:RemoveRoleFromInstanceProfile
- iam:DeleteInstanceProfile
Condition:
StringEquals:
aws:ResourceTag/kubernetes.io/cluster/gpusharing-demo: owned
aws:ResourceTag/topology.kubernetes.io/region: "us-west-2"
StringLike:
aws:ResourceTag/karpenter.k8s.aws/ec2nodeclass: "*"
- Sid: AllowInstanceProfileReadActions
Effect: Allow
Resource: arn:aws:iam::767397897074:instance-profile/*
Action: iam:GetInstanceProfile
- Sid: AllowAPIServerEndpointDiscovery
Effect: Allow
Resource: arn:aws:eks:us-west-2:767397897074:cluster/gpusharing-demo
Action: eks:DescribeCluster
vpc:
cidr: 192.168.0.0/16
hostnameType: resource-name
managedNodeGroups:
- name: main
ami: ami-092590b6039cd49ed
amiFamily: AmazonLinux2
desiredCapacity: 2
iam:
attachPolicyARNs:
- arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
- arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
- arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
instanceType: t3.large
minSize: 0
maxSize: 10
volumeSize: 80
volumeType: gp3
overrideBootstrapCommand: |
#!/bin/bash
/etc/eks/bootstrap.sh gpusharing-demo
privateNetworking: true
spot: false
GPU Time-slicing on Amazon EKS - Adding a GPU Node Group via eksdemo
- Add a GPU node to the EKS cluster with eksdemo
- Run with --dry-run first to review what will be created
- GPU node type: g5.8xlarge
- Command: eksdemo create nodegroup gpu -i g5.8xlarge -N 1 -c gpusharing-demo
$ eksdemo create nodegroup gpu -i g5.8xlarge -N 1 -c gpusharing-demo --dry-run
Eksctl Resource Manager Dry Run:
eksctl create nodegroup -f - --install-nvidia-plugin=false --install-neuron-plugin=false
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
name: gpusharing-demo
region: us-west-2
version: "1.32"
tags:
eksdemo.io/version: 0.18.2
managedNodeGroups:
- name: gpu
ami: ami-0111ed894dc3ec059
amiFamily: AmazonLinux2
desiredCapacity: 1
iam:
attachPolicyARNs:
- arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
- arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
- arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
instanceType: g5.8xlarge
minSize: 0
maxSize: 10
volumeSize: 80
volumeType: gp3
overrideBootstrapCommand: |
#!/bin/bash
/etc/eks/bootstrap.sh gpusharing-demo
privateNetworking: true
spot: false
taints:
- key: nvidia.com/gpu
value: ""
effect: NoSchedule
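Note the taints block above: the GPU node is tainted nvidia.com/gpu=:NoSchedule, so only pods that tolerate that taint can be placed there (the pod describe output further down shows the matching nvidia.com/gpu:NoSchedule op=Exists toleration). A simplified sketch of the matching rule, assuming only key/operator/value/effect fields:

```python
def tolerates(taint: dict, tolerations: list) -> bool:
    """Simplified taint matching: an Exists toleration matches any value
    of the key; an empty effect on the toleration matches all effects."""
    for t in tolerations:
        if t.get("key") != taint["key"]:
            continue
        if t.get("effect") not in (None, taint["effect"]):
            continue
        if t.get("operator", "Equal") == "Exists":
            return True
        if t.get("value") == taint["value"]:
            return True
    return False

gpu_taint = {"key": "nvidia.com/gpu", "value": "", "effect": "NoSchedule"}
# A pod carrying the Exists toleration lands on the GPU node...
assert tolerates(gpu_taint, [{"key": "nvidia.com/gpu", "operator": "Exists",
                              "effect": "NoSchedule"}])
# ...while an ordinary pod without any toleration is repelled.
assert not tolerates(gpu_taint, [])
```

Keeping the taint on ensures CPU-only workloads never land on (and waste) the expensive g5.8xlarge node.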
GPU Time-slicing on Amazon EKS - Without Time-Slicing
$ kubectl get node --label-columns=node.kubernetes.io/instance-type,eks.amazonaws.com/capacityType,topology.kubernetes.io/zone
NAME STATUS ROLES AGE VERSION INSTANCE-TYPE CAPACITYTYPE ZONE
i-01edff2cd3456af51.us-west-2.compute.internal Ready <none> 43m v1.32.1-eks-5d632ec t3.large ON_DEMAND us-west-2d
i-072eb4b66c886d137.us-west-2.compute.internal Ready <none> 43m v1.32.1-eks-5d632ec t3.large ON_DEMAND us-west-2a
i-0af783eca345807e8.us-west-2.compute.internal Ready <none> 4m53s v1.32.1-eks-5d632ec g5.8xlarge ON_DEMAND us-west-2a
# Label the g5 GPU node: replace "i-0af783eca345807e8.us-west-2.compute.internal" with your node name!
$ kubectl label node i-0af783eca345807e8.us-west-2.compute.internal eks-node=gpu
# Note: scale the node group if needed (e.g., to 2, within your quota limits)
$ eksctl scale nodegroup --name gpu --cluster gpusharing-demo --nodes 2
# Download nvdp-values.yaml
$ curl -O https://raw.githubusercontent.com/sanjeevrg89/eks-gpu-sharing-demo/refs/heads/main/nvdp-values.yaml
# Install nvidia-device-plugin via Helm
$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
$ helm repo update
$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
--namespace kube-system \
-f nvdp-values.yaml \
--version 0.14.0
Release "nvdp" does not exist. Installing it now.
NAME: nvdp
LAST DEPLOYED: Sun Apr 13 00:50:38 2025
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
$ kubectl get daemonset -n kube-system | grep nvidia
nvdp-nvidia-device-plugin 1 1 1 1 1 eks-node=gpu 23s
# Without time-slicing enabled: the node exposes a single GPU
$ kubectl get nodes -o json | jq -r '.items[] | select(.status.capacity."nvidia.com/gpu" != null) | {name: .metadata.name, capacity: .status.capacity}'
{
"name": "i-0af783eca345807e8.us-west-2.compute.internal",
"capacity": {
"cpu": "32",
"ephemeral-storage": "83873772Ki",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "130502176Ki",
"nvidia.com/gpu": "1",
"pods": "234"
}
}
# Deploy the GPU training workload
$ kubectl create namespace gpu-demo
$ cat << EOF > cifar10-train-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-cifar10-deployment
  namespace: gpu-demo
  labels:
    app: tensorflow-cifar10
spec:
  replicas: 5
  selector:
    matchLabels:
      app: tensorflow-cifar10
  template:
    metadata:
      labels:
        app: tensorflow-cifar10
    spec:
      containers:
        - name: tensorflow-cifar10
          image: public.ecr.aws/r5m2h0c9/cifar10_cnn:v2
          resources:
            limits:
              nvidia.com/gpu: 1
EOF
$ kubectl apply -f cifar10-train-deploy.yaml
# After a while, pods move from ContainerCreating to Running
$ watch -d 'kubectl get pods -n gpu-demo'
# [Optional] Inspect a pod's status: the image is still being pulled
$ kubectl describe pod tensorflow-cifar10-deployment-7c6f89c8d6-kpzmb -n gpu-demo
Name: tensorflow-cifar10-deployment-7c6f89c8d6-kpzmb
Namespace: gpu-demo
Priority: 0
Service Account: default
Node: i-0af783eca345807e8.us-west-2.compute.internal/192.168.123.238
Start Time: Sun, 13 Apr 2025 00:53:46 +0900
Labels: app=tensorflow-cifar10
pod-template-hash=7c6f89c8d6
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/tensorflow-cifar10-deployment-7c6f89c8d6
Containers:
tensorflow-cifar10:
Container ID:
Image: public.ecr.aws/r5m2h0c9/cifar10_cnn:v2
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-z5fss (ro)
Conditions:
Type Status
PodReadyToStartContainers False
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-z5fss:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 78s default-scheduler Successfully assigned gpu-demo/tensorflow-cifar10-deployment-7c6f89c8d6-kpzmb to i-0af783eca345807e8.us-west-2.compute.internal
Normal Pulling 77s kubelet Pulling image "public.ecr.aws/r5m2h0c9/cifar10_cnn:v2"
# Now Running!
$ kubectl get pods -n gpu-demo
NAME READY STATUS RESTARTS AGE
tensorflow-cifar10-deployment-7c6f89c8d6-k7z77 0/1 Pending 0 2m6s
tensorflow-cifar10-deployment-7c6f89c8d6-kpzmb 1/1 Running 0 2m6s
tensorflow-cifar10-deployment-7c6f89c8d6-n6djg 0/1 Pending 0 2m6s
tensorflow-cifar10-deployment-7c6f89c8d6-x7zws 0/1 Pending 0 2m6s
tensorflow-cifar10-deployment-7c6f89c8d6-zx8dq 0/1 Pending 0 2m6s
# The remaining pods cannot be scheduled: not enough GPU resources
$ kubectl describe pod tensorflow-cifar10-deployment-7c6f89c8d6-k7z77 -n gpu-demo
Name: tensorflow-cifar10-deployment-7c6f89c8d6-k7z77
Namespace: gpu-demo
Priority: 0
Service Account: default
Node: <none>
Labels: app=tensorflow-cifar10
pod-template-hash=7c6f89c8d6
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/tensorflow-cifar10-deployment-7c6f89c8d6
Containers:
tensorflow-cifar10:
Image: public.ecr.aws/r5m2h0c9/cifar10_cnn:v2
Port: <none>
Host Port: <none>
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-x2jt4 (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
kube-api-access-x2jt4:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 2m46s default-scheduler 0/3 nodes are available: 3 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.
GPU Time-slicing on Amazon EKS - With Time-Slicing
# Enable time-slicing
$ cat << EOF > nvidia-device-plugin.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 10
EOF
$ kubectl apply -f nvidia-device-plugin.yaml
# Roll the plugin out with the new ConfigMap
$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
--namespace kube-system \
-f nvdp-values.yaml \
--version 0.14.0 \
--set config.name=nvidia-device-plugin \
--force
Release "nvdp" has been upgraded. Happy Helming!
NAME: nvdp
LAST DEPLOYED: Sun Apr 13 01:03:05 2025
NAMESPACE: kube-system
STATUS: deployed
REVISION: 2
TEST SUITE: None
# Query again after a moment: the GPU count has grown!
$ kubectl get nodes -o json | jq -r '.items[] | select(.status.capacity."nvidia.com/gpu" != null) | {name: .metadata.name, capacity: .status.capacity}'
{
"name": "i-0af783eca345807e8.us-west-2.compute.internal",
"capacity": {
"cpu": "32",
"ephemeral-storage": "83873772Ki",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "130502176Ki",
"nvidia.com/gpu": "10",
"pods": "234"
}
}
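The jump from 1 to 10 is exactly what replicas: 10 in the ConfigMap asks for: under time-slicing, the device plugin re-advertises each physical GPU as that many schedulable nvidia.com/gpu units. The arithmetic, as a sketch:

```python
def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    """Capacity the device plugin reports under time-slicing."""
    return physical_gpus * replicas

# The g5.8xlarge node above: 1 physical A10G x 10 replicas = 10 units.
assert advertised_gpus(1, 10) == 10
```

Keep in mind these are time-shared slices of the same GPU, not isolated devices: there is no compute or memory fencing between them.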
# All pods running now
$ kubectl get pods -n gpu-demo
NAME READY STATUS RESTARTS AGE
tensorflow-cifar10-deployment-7c6f89c8d6-k7z77 1/1 Running 1 (24s ago) 10m
tensorflow-cifar10-deployment-7c6f89c8d6-kpzmb 1/1 Running 0 10m
tensorflow-cifar10-deployment-7c6f89c8d6-n6djg 1/1 Running 1 (34s ago) 10m
tensorflow-cifar10-deployment-7c6f89c8d6-x7zws 1/1 Running 1 (35s ago) 10m
tensorflow-cifar10-deployment-7c6f89c8d6-zx8dq 1/1 Running 1 (34s ago) 10m
# ...but they keep restarting
$ kubectl get pods -n gpu-demo
NAME READY STATUS RESTARTS AGE
tensorflow-cifar10-deployment-7c6f89c8d6-k7z77 1/1 Running 4 (57s ago) 12m
tensorflow-cifar10-deployment-7c6f89c8d6-kpzmb 1/1 Running 0 12m
tensorflow-cifar10-deployment-7c6f89c8d6-n6djg 1/1 Running 4 (56s ago) 12m
tensorflow-cifar10-deployment-7c6f89c8d6-x7zws 0/1 Error 3 (91s ago) 12m
tensorflow-cifar10-deployment-7c6f89c8d6-zx8dq 1/1 Running 1 (2m50s ago) 12m
# Out-of-memory issue
$ kubectl logs tensorflow-cifar10-deployment-7c6f89c8d6-k7z77 -n gpu-demo | grep memory
2025-04-12 16:07:59.910576: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 196 MB memory: -> device: 0, name: NVIDIA A10G, pci bus id: 0000:00:1e.0, compute capability: 8.6
2025-04-12 16:07:59.924567: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:736] failed to allocate 196.62MiB (206176256 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out ofmemory
2025-04-12 16:08:01.219272: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:222] Failure to initialize cublas may be due to OOM (cublas needs some free memory when you initialize it, and your deep-learning framework may have preallocated more than its fair share), or may be because this binary was not built with support for the GPU in your machine.
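Why the restarts: time-slicing multiplexes compute but does not partition GPU memory, so all five TensorFlow processes compete for the A10G's single memory pool, and each tries to preallocate its own chunk. A rough arithmetic sketch (24 GiB is the A10G's device memory; the per-process preallocation figures are illustrative assumptions):

```python
def oversubscribed(total_mib: int, pods: int, prealloc_mib: int) -> bool:
    """True when the pods' combined preallocation exceeds device memory."""
    return pods * prealloc_mib > total_mib

A10G_MIB = 24 * 1024  # 24 GiB, shared by every time slice on the node
# If each TF process grabs, say, 6 GiB up front (illustrative figure),
# five replicas want 30 GiB > 24 GiB -> CUDA_ERROR_OUT_OF_MEMORY, restarts.
assert oversubscribed(A10G_MIB, pods=5, prealloc_mib=6 * 1024)
# Capping each process at 4 GiB would fit all five:
assert not oversubscribed(A10G_MIB, pods=5, prealloc_mib=4 * 1024)
```

The usual mitigations are capping per-process GPU memory in the framework (for example TensorFlow's logical device memory_limit setting) or using MIG, which, unlike time-slicing, does partition memory between tenants.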