Git for AI Projects: Version Control Patterns That Work


AI projects break typical Git workflows because of large files, rapid experimentation, and artifacts that don’t fit traditional version control patterns. Knowing how to adapt Git for AI development prevents common problems, enables effective collaboration, and underpins professional AI engineering work.

Why AI Projects Need Different Git Patterns

Standard Git workflows assume small text files that change incrementally. AI projects violate these assumptions constantly.

AI-specific challenges:

  • Model weights can be gigabytes or larger
  • Datasets don’t fit in repositories
  • Experiments create many short-lived branches
  • Notebooks have messy diffs
  • Configuration and code are tightly coupled

Addressing these challenges upfront prevents the repository from becoming unusable as projects grow. This foundation supports production-ready AI development.

Repository Structure for AI Projects

How you organize the repository affects every aspect of AI development workflow.

Organize AI projects with clear separation:

project/
├── src/                # Production code
├── notebooks/          # Experimentation notebooks
├── tests/              # Test suite
├── configs/            # Configuration files
├── scripts/            # Utility scripts
├── data/               # Data directory (git-ignored)
├── models/             # Model artifacts (git-ignored)
├── .gitignore          # Exclusion patterns
├── .gitattributes      # LFS and diff configs
├── requirements.txt    # Dependencies
└── README.md           # Documentation

This structure separates code (version controlled) from artifacts (tracked differently).

What to Track

Include in Git:

  • All source code
  • Configuration templates
  • Documentation
  • Test fixtures
  • Small reference datasets
  • Requirements and dependencies

Exclude from Git:

  • Large datasets
  • Model weights
  • Virtual environments
  • Generated outputs
  • Credentials and secrets
  • IDE-specific files

Gitignore for AI Projects

A comprehensive AI-focused .gitignore covers these common patterns:

  • Python artifacts (__pycache__/, *.pyc, etc.)
  • Environment directories (.venv, venv, env/)
  • Data and model directories
  • Jupyter checkpoints
  • IDE files
  • Credential files
  • OS-specific files

A thorough .gitignore prevents accidentally committing files that shouldn’t be tracked.
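
As a starting point, a .gitignore covering those patterns might look like this (extend it for your specific stack):

```gitignore
# Python artifacts
__pycache__/
*.pyc
*.egg-info/

# Environments
.venv/
venv/
env/

# Data and models (tracked outside Git)
data/
models/

# Jupyter
.ipynb_checkpoints/

# IDE and OS files
.idea/
.vscode/
.DS_Store

# Credentials
.env
*.pem
```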

Handling Large Files

Large files are unavoidable in AI development. Handle them properly rather than fighting Git’s limitations.

Git LFS

Git Large File Storage handles binary files:

Good LFS candidates:

  • Model checkpoints under ~1GB
  • Reference datasets for testing
  • Image or audio samples
  • Compiled artifacts

Configure tracking in .gitattributes:

  • Track specific extensions (.h5, .pkl, .pt)
  • Track specific directories
  • Keep host-side LFS storage and bandwidth quotas in mind

LFS keeps the repository fast while maintaining version history for large files.
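
As an illustration, a .gitattributes routing common model formats through LFS (the extensions and directory are examples; running `git lfs track "*.pt"` writes these lines for you):

```gitattributes
*.h5 filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
checkpoints/** filter=lfs diff=lfs merge=lfs -text
```

Each collaborator also needs to run `git lfs install` once per machine so the LFS filters are active.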

External Artifact Storage

For truly large files, use external storage:

Options:

  • S3 or cloud storage with versioning
  • DVC (Data Version Control)
  • Weights & Biases artifacts
  • MLflow model registry

Reference external artifacts in code using:

  • Configuration files with URLs
  • Environment variables for paths
  • Scripts that download when needed

This approach scales better than LFS for very large artifacts and integrates with MLOps workflows.
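
A minimal sketch of the download-when-needed pattern. The `MODEL_CACHE_DIR` variable, the registry dict, and the URL are illustrative assumptions, not a real API:

```python
import os
import urllib.request
from pathlib import Path

# Hypothetical registry mapping artifact names to URLs; in a real
# project this would live in a config file, not in code.
ARTIFACT_URLS = {
    "encoder-v2.pt": "https://example.com/artifacts/encoder-v2.pt",
}

def ensure_artifact(name: str) -> Path:
    """Return a local path to the artifact, downloading it if missing."""
    # Resolve the cache directory from an environment variable,
    # falling back to a local models/ directory.
    cache_dir = Path(os.environ.get("MODEL_CACHE_DIR", "models"))
    cache_dir.mkdir(parents=True, exist_ok=True)
    local_path = cache_dir / name
    if not local_path.exists():
        # Fetch only when the artifact is not already cached.
        urllib.request.urlretrieve(ARTIFACT_URLS[name], local_path)
    return local_path
```

Code then calls `ensure_artifact("encoder-v2.pt")` and never hardcodes where weights live on disk.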

DVC for Data and Model Versioning

DVC extends Git for data science:

Benefits:

  • Git-like commands for data versioning
  • Works with cloud storage backends
  • Tracks pipelines alongside data
  • Reproduces experiments

Workflow:

  • Store data in remote storage
  • Track metadata in Git with DVC
  • Share and reproduce through DVC commands

DVC bridges the gap between code versioning and data versioning.
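
The workflow above maps to a handful of commands (the remote name and bucket are placeholders):

```shell
dvc init                                     # set up DVC in an existing Git repo
dvc remote add -d storage s3://my-bucket/dvc # configure default remote storage
dvc add data/raw                             # track the dataset; creates data/raw.dvc
git add data/raw.dvc data/.gitignore         # commit the small metadata file
git commit -m "Track raw dataset with DVC"
dvc push                                     # upload the data to the remote
dvc pull                                     # teammates fetch the same version
```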

Branching Strategies for AI Development

AI experimentation creates branch patterns different from typical software development.

Experiment Branches

For rapid experimentation:

Create short-lived branches for each experiment:

  • Name with experiment identifier (exp/embedding-size-512)
  • Run experiments to completion
  • Extract successful approaches to main
  • Archive or delete unsuccessful branches

Don’t try to merge experimental code directly. Extract learnings and implement cleanly.
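
A typical experiment cycle, using the naming convention above:

```shell
git switch -c exp/embedding-size-512   # one branch per experiment
# ... run the experiment, commit configs and results ...
git switch main
git cherry-pick <commit>               # extract only the successful change
git branch -D exp/embedding-size-512   # or keep the branch as an archive
```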

Feature Branch Workflow

For production features:

Standard feature branch workflow works:

  1. Create branch from main
  2. Implement feature
  3. Test thoroughly
  4. Create pull request
  5. Review and merge

Production code follows normal software practices even in AI projects.

Long-Running Research Branches

For ongoing research threads:

Maintain parallel tracks:

  • main for production-ready code
  • research branches for longer investigations
  • Regular syncing to avoid divergence

Communicate clearly about branch purposes and lifecycle.

Commit Practices for AI Development

Meaningful commits make AI project history useful for debugging and reproduction.

What Makes a Good AI Commit

Effective commits:

  • Change one logical thing
  • Include context in the message
  • Reference experiment or issue numbers
  • Can be reverted independently

Avoid:

  • “WIP” commits to main
  • Mixing code changes with config changes
  • Committing broken code
  • Giant commits that change everything

Commit Message Format

Include relevant context:

Structure:

  • Summary line describing the change
  • Why the change was made
  • Results or metrics if applicable
  • References to experiments or issues

For experiments, include key metrics in commit messages. This makes history searchable for successful configurations.
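
Following that structure, an experiment commit might read (all numbers illustrative):

```text
Increase embedding size to 512

Larger embeddings improved retrieval quality on the validation set
with no latency regression. Recall@10: 0.71 -> 0.78.

Ref: exp/embedding-size-512
```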

Frequency Matters

Commit often during development:

  • After each working step
  • Before making risky changes
  • At logical stopping points

Squash before merging if history is messy. Clean history helps future debugging.

Handling Jupyter Notebooks

Notebooks create uniquely difficult version control challenges.

The Notebook Problem

Notebooks are JSON with embedded outputs:

  • Outputs create large, meaningless diffs
  • Execution counts change constantly
  • Merge conflicts are nearly impossible to resolve
  • Binary outputs (images, plots) bloat history

Solutions That Work

Option 1: Strip outputs before commit

Use nbstripout or pre-commit hooks:

  • Automatically removes outputs on commit
  • Keeps cell contents only
  • Dramatically reduces diff noise
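
A minimal .pre-commit-config.yaml entry for nbstripout (the `rev` shown is an assumption; pin whichever version you actually use):

```yaml
repos:
  - repo: https://github.com/kynan/nbstripout
    rev: 0.7.1
    hooks:
      - id: nbstripout
```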

Option 2: Paired formats with Jupytext

Sync notebooks with plain text formats:

  • .py percent format
  • .md markdown format
  • Review text files, keep notebooks
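
Pairing takes one command with Jupytext:

```shell
jupytext --set-formats ipynb,py:percent notebooks/analysis.ipynb  # create the pair
jupytext --sync notebooks/analysis.ipynb                          # keep both in sync
```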

Option 3: Separate notebooks from code

Keep notebooks for exploration only:

  • Production code in .py files
  • Notebooks import from modules
  • Only track notebook structure, not experiments

The Jupyter production patterns guide covers these approaches in detail.

Collaboration Patterns

AI teams need collaboration workflows that accommodate experimentation.

Pull Request Guidelines

For AI projects:

PRs should include:

  • What changed and why
  • How to test the changes
  • Performance or metric impacts
  • Configuration changes required

Review checklist:

  • Code quality and tests
  • No hardcoded values that should be config
  • No committed secrets or credentials
  • Appropriate documentation

Code Review for AI Code

AI code review specifics:

Check for:

  • Reproducibility (seeds, deterministic operations)
  • Error handling for model failures
  • Resource management (GPU memory, etc.)
  • Configuration externalization
  • Type hints for interfaces

AI-specific bugs often come from implicit assumptions that code review catches.

Shared Experiments

When multiple people work on related experiments:

Coordinate through:

  • Experiment tracking systems
  • Clear branch naming conventions
  • Regular syncs to share findings
  • Documentation of what’s been tried

Duplicated effort wastes time. Communication prevents running the same experiments.

CI/CD for AI Projects

Continuous integration adapts for AI development needs.

What to Test Automatically

Test on every commit:

  • Unit tests pass
  • Import checks succeed
  • Linting and formatting
  • Type checking if applicable
  • Small integration tests

Test periodically:

  • Full training runs (if fast enough)
  • Model inference benchmarks
  • Data pipeline validation
  • Deployment dry runs

Resource-intensive tests can run on schedule rather than every commit.

GitHub Actions for AI

Practical CI patterns:

Use caching aggressively:

  • Cache pip packages
  • Cache model weights for testing
  • Cache processed datasets

Configure GPU runners for tests that need them. Most CI can run on CPU with smaller models.
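
A sketch of a CI job with pip caching via actions/setup-python (action versions are assumptions; pin your own):

```yaml
name: tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: pip            # reuse downloaded packages between runs
      - run: pip install -r requirements.txt
      - run: pytest tests/
```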

The GitHub Actions deployment guide covers these patterns in depth.

Pre-commit Hooks

Catch problems before commit:

Useful hooks:

  • Format checking (Black, Ruff)
  • Import sorting
  • Notebook output stripping
  • Large file detection
  • Credential scanning

Pre-commit prevents common issues from entering the repository.
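
A .pre-commit-config.yaml sketch covering those hooks (the `rev` values are assumptions; pin current releases):

```yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.0
    hooks:
      - id: ruff            # linting and import sorting
      - id: ruff-format     # formatting
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: check-added-large-files
        args: ["--maxkb=5000"]   # block files over ~5 MB
      - id: detect-private-key
```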

Recovering from Problems

AI projects encounter Git problems that require specific solutions.

Accidentally Committed Large Files

When large files reach Git history:

Options:

  • git-filter-repo to rewrite history
  • BFG Repo-Cleaner for simpler cleanup
  • For less severe cases, remove the file and add it to .gitignore (note: the blob remains in past history)

Prevention is better: proper .gitignore and pre-commit hooks catch this before it happens.
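
For instance, purging a committed checkpoint from all history with git-filter-repo (the path is illustrative; this rewrites history, so coordinate with your team and force-push afterwards):

```shell
pip install git-filter-repo
git filter-repo --invert-paths --path models/checkpoint.pt
```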

Merge Conflicts in Notebooks

When notebooks conflict:

Options:

  • Regenerate notebook from one version
  • Use nbdime for notebook-aware merging
  • Resolve in plain text if using Jupytext

Notebook merge conflicts rarely have satisfying solutions. Prevention through workflow is better.

Diverged Experiment Branches

When branches diverge too far:

Approach:

  • Don’t try to merge directly
  • Identify valuable changes in each branch
  • Cherry-pick or manually apply changes
  • Create new clean branch with combined work

Forcing diverged branches together creates more problems than manual integration.

Advanced Patterns

Additional Git techniques for complex AI projects.

Git Worktrees

Run experiments in parallel:

Worktrees allow:

  • Multiple checkouts simultaneously
  • Different experiments without switching branches
  • Shared history with isolated working directories

Useful when experiments take time to set up and you want to work on other things.
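
The core worktree commands, assuming an experiment branch already exists:

```shell
git worktree add ../project-exp512 exp/embedding-size-512  # second checkout
git worktree list                                          # show all checkouts
git worktree remove ../project-exp512                      # clean up when done
```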

Submodules for Shared Code

When projects share components:

Submodules enable:

  • Shared libraries across projects
  • Version-pinned dependencies
  • Independent development

Complexity increases, so use only when benefits are clear.

Monorepo vs Multi-repo

For multiple related AI projects:

Monorepo benefits:

  • Easier cross-project changes
  • Shared tooling and configuration
  • Single source of truth

Multi-repo benefits:

  • Independent release cycles
  • Smaller repository size
  • Clearer ownership

The right choice depends on team size and project coupling.

Building Good Habits

Git practices that compound over time.

Daily practices:

  • Pull before starting work
  • Commit frequently
  • Push at end of day
  • Review diffs before commit

Project practices:

  • Set up .gitignore thoroughly at start
  • Configure pre-commit hooks
  • Document branch conventions
  • Regular repository maintenance

Team practices:

  • Consistent workflows across team
  • Code review for all changes
  • Clear communication about branches
  • Shared experiment tracking

Next Steps

Effective Git practices support the broader AI engineering toolkit that enables production AI development. Version control is foundational to everything else.

For practical workflows and team collaboration patterns, join the AI Engineering community where we share what works in real AI projects.

Watch demonstrations on YouTube to see these Git patterns applied to AI development workflows.

Zen van Riel


Senior AI Engineer at GitHub | Ex-Microsoft

I grew from intern to Senior Engineer at GitHub, previously working at Microsoft. Now I teach 22,000+ engineers on YouTube, reaching hundreds of thousands of developers with practical AI engineering tutorials. My blog posts are generated from my own video content, focusing on real-world implementation over theory.