
AI Deployment Manager


"Homebrew for AI Infrastructure": a simple CLI/UI tool that helps companies deploy and manage their own GPU infrastructure (bare metal, cloud, or hybrid) without deep DevOps expertise.

Features

- Automated GPU cluster deployment (bare metal, cloud, hybrid)
- Workload scheduling and queue management
- Resource monitoring and optimization
- AI framework integrations
- Multi-cloud provider support
- Cost tracking and allocation

Installation

From Source

git clone https://github.com/dewitt4/ai-deployment-manager.git
cd ai-deployment-manager
make build
make install

Quick Start

# Initialize configuration
aidm init

# Deploy a GPU cluster
aidm deploy create

# Submit a workload
aidm schedule submit

# Monitor resources
aidm monitor resources

# Check costs
aidm cost report

Usage

Configuration

Initialize the AI Deployment Manager configuration:

aidm init

This creates a configuration file at ~/.aidm/config.yaml with default settings.

Deployment Commands

# Create a new GPU cluster deployment
aidm deploy create

# List all deployments
aidm deploy list

# Check deployment status
aidm deploy status

# Delete a deployment
aidm deploy delete
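The subcommands above imply a simple create/list/status/delete lifecycle for deployments. As an illustrative in-memory sketch (all type and field names are hypothetical, not the tool's actual API):

```go
package main

import (
	"fmt"
	"sort"
)

// Status of a GPU cluster deployment.
type Status string

const (
	StatusCreating Status = "creating"
	StatusRunning  Status = "running"
	StatusDeleted  Status = "deleted"
)

type Deployment struct {
	Name     string
	GPUCount int
	Status   Status
}

// Manager tracks deployments by name, mirroring
// `aidm deploy create/list/status/delete`.
type Manager struct {
	deployments map[string]*Deployment
}

func NewManager() *Manager {
	return &Manager{deployments: map[string]*Deployment{}}
}

func (m *Manager) Create(name string, gpus int) (*Deployment, error) {
	if _, ok := m.deployments[name]; ok {
		return nil, fmt.Errorf("deployment %q already exists", name)
	}
	d := &Deployment{Name: name, GPUCount: gpus, Status: StatusCreating}
	m.deployments[name] = d
	return d, nil
}

// List returns deployment names in stable (sorted) order.
func (m *Manager) List() []string {
	names := make([]string, 0, len(m.deployments))
	for n := range m.deployments {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

func (m *Manager) Delete(name string) error {
	d, ok := m.deployments[name]
	if !ok {
		return fmt.Errorf("no such deployment %q", name)
	}
	d.Status = StatusDeleted
	delete(m.deployments, name)
	return nil
}

func main() {
	m := NewManager()
	m.Create("train-cluster", 4)
	fmt.Println(m.List()) // [train-cluster]
}
```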

Workload Scheduling

# Submit a job to the queue
aidm schedule submit

# List all jobs
aidm schedule list

# Cancel a job
aidm schedule cancel <job-id>

# Check queue status
aidm schedule queue
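The scheduling commands suggest a queue that accepts submissions, lists pending jobs, and skips cancelled ones. A minimal FIFO sketch (hypothetical names; the real scheduler lives in `pkg/scheduler/`):

```go
package main

import (
	"errors"
	"fmt"
)

// Job is a queued workload.
type Job struct {
	ID        int
	Name      string
	Cancelled bool
}

// Queue is a simple FIFO mirroring `aidm schedule submit/list/cancel/queue`.
type Queue struct {
	nextID int
	jobs   []*Job
}

// Submit enqueues a job and assigns it an ID.
func (q *Queue) Submit(name string) *Job {
	q.nextID++
	j := &Job{ID: q.nextID, Name: name}
	q.jobs = append(q.jobs, j)
	return j
}

// Cancel marks a queued job cancelled so the scheduler will skip it.
func (q *Queue) Cancel(id int) error {
	for _, j := range q.jobs {
		if j.ID == id {
			j.Cancelled = true
			return nil
		}
	}
	return errors.New("job not found")
}

// Next pops the oldest job that has not been cancelled.
func (q *Queue) Next() (*Job, bool) {
	for len(q.jobs) > 0 {
		j := q.jobs[0]
		q.jobs = q.jobs[1:]
		if !j.Cancelled {
			return j, true
		}
	}
	return nil, false
}

func main() {
	q := &Queue{}
	a := q.Submit("finetune-llm")
	q.Submit("eval-run")
	q.Cancel(a.ID)
	if j, ok := q.Next(); ok {
		fmt.Println("running", j.Name) // running eval-run
	}
}
```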

Resource Monitoring

# View resource utilization
aidm monitor resources

# Check GPU status
aidm monitor gpu

# Run optimization
aidm monitor optimize
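`aidm monitor resources` presumably rolls raw readings up into per-GPU figures. A sketch of that rollup (the sample shape is hypothetical; real data would come from NVML or `nvidia-smi`):

```go
package main

import "fmt"

// GPUSample is one utilization reading for a GPU.
type GPUSample struct {
	GPU  string
	Util float64 // fraction in [0.0, 1.0]
}

// meanUtil averages utilization per GPU across samples.
func meanUtil(samples []GPUSample) map[string]float64 {
	sums := map[string]float64{}
	counts := map[string]int{}
	for _, s := range samples {
		sums[s.GPU] += s.Util
		counts[s.GPU]++
	}
	out := map[string]float64{}
	for gpu, sum := range sums {
		out[gpu] = sum / float64(counts[gpu])
	}
	return out
}

func main() {
	samples := []GPUSample{
		{"gpu0", 0.90}, {"gpu0", 0.70}, {"gpu1", 0.10},
	}
	for gpu, u := range meanUtil(samples) {
		fmt.Printf("%s: %.0f%%\n", gpu, u*100)
	}
}
```

A rollup like this is also the natural input for `aidm monitor optimize`: persistently under-utilized GPUs are candidates for packing more workloads or scaling down.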

Cost Management

# Generate cost report
aidm cost report

# Update cost tracking
aidm cost track

# View cost allocations
aidm cost allocate
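At its simplest, cost allocation multiplies tracked GPU-hours by an hourly rate and groups by owner. A sketch of that calculation (the record shape and flat rate are illustrative; real tracking would pull usage from the provider's billing data):

```go
package main

import "fmt"

// UsageRecord is one tracked slice of GPU usage.
type UsageRecord struct {
	Team     string
	GPUHours float64
}

// costReport totals spend per team at a flat hourly GPU rate, the kind of
// breakdown `aidm cost report` and `aidm cost allocate` might produce.
func costReport(records []UsageRecord, ratePerGPUHour float64) map[string]float64 {
	totals := map[string]float64{}
	for _, r := range records {
		totals[r.Team] += r.GPUHours * ratePerGPUHour
	}
	return totals
}

func main() {
	records := []UsageRecord{
		{"research", 120}, {"research", 30}, {"platform", 50},
	}
	for team, usd := range costReport(records, 2.50) {
		fmt.Printf("%s: $%.2f\n", team, usd)
	}
}
```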

Architecture

ai-deployment-manager/
├── cmd/
│   └── aidm/           # CLI entry point
├── pkg/
│   ├── deployment/     # GPU cluster deployment automation
│   ├── scheduler/      # Workload scheduling and queue management
│   ├── monitor/        # Resource monitoring and optimization
│   ├── integration/    # AI framework integrations
│   ├── cloud/          # Multi-cloud provider support
│   └── cost/           # Cost tracking and allocation
└── internal/
    ├── config/         # Configuration management
    └── utils/          # Utility functions

Supported Platforms

Cloud Providers

- AWS
- GCP
- Azure
- Local (bare metal)

AI Frameworks

- PyTorch

GPU Types

- NVIDIA

Configuration File

Example ~/.aidm/config.yaml:

provider: local
gpu_type: nvidia
framework: pytorch

cloud:
  aws:
    region: us-west-2
  gcp:
    project: ""
  azure:
    subscription: ""

deployment:
  cluster_size: 1
  gpu_count: 1

monitoring:
  enabled: true
  interval: 60s

cost:
  tracking_enabled: true
  currency: USD
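On the Go side, the YAML above maps naturally onto tagged structs. A partial sketch (field tags assume a decoder such as `gopkg.in/yaml.v3`; decoding itself is omitted, and only a few fields are shown). Note that the `interval: 60s` value is in Go duration syntax, so it parses directly with `time.ParseDuration`:

```go
package main

import (
	"fmt"
	"time"
)

// Config mirrors part of the example ~/.aidm/config.yaml above.
type Config struct {
	Provider   string `yaml:"provider"`
	GPUType    string `yaml:"gpu_type"`
	Framework  string `yaml:"framework"`
	Monitoring struct {
		Enabled  bool   `yaml:"enabled"`
		Interval string `yaml:"interval"`
	} `yaml:"monitoring"`
}

// pollInterval converts the monitoring interval string (e.g. "60s")
// into a time.Duration.
func pollInterval(c Config) (time.Duration, error) {
	return time.ParseDuration(c.Monitoring.Interval)
}

func main() {
	var c Config
	c.Provider = "local"
	c.Monitoring.Interval = "60s"
	d, err := pollInterval(c)
	if err != nil {
		panic(err)
	}
	fmt.Println(d) // 1m0s
}
```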

Development

Build

make build

Test

make test

Format Code

make fmt

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Support

For issues, questions, or contributions, please open an issue on GitHub.