Using the API to Find Free Hosted Models on NVIDIA Builder

Using the API to Find Free Hosted Models on NVIDIA Builder

The NVIDIA Developer Program provides access to a wide catalog of AI models through NVIDIA Inference Microservices (NIM), offering an OpenAI-compatible API. You can browse and discover available models at build.nvidia.com/explore/discover .

If you want to find models with free hosted endpoints in the browser, you can enable the “Free Endpoint” filter on the model catalog page. But what if you need that information programmatically – in a script, a CI pipeline, or as part of an automated workflow? The browser filter is not accessible through the API, and the /v1/models endpoint does not distinguish between free hosted models and everything else.

I wrote a Bash script that takes the API approach: probe every model in the catalog and find out which ones actually respond to chat completion requests.

The Problem

The NIM /v1/models endpoint returns every model in the catalog. When I ran the script on April 6, 2026, that was 189 models. But not all of those are LLM-style chat models, and not all of them are currently live. The catalog includes embedding models, vision models, reward models, and models that have been retired or are temporarily unavailable. There is no field in the API response that tells you “this model is currently hosted and accepts /v1/chat/completions requests.”

The browser’s “Free Endpoint” filter solves this for interactive use, but if you want a machine-readable list – or need to verify endpoint availability as part of an automated workflow – you need a different approach.

What the Script Does

The script, list-nvidia-hosted-models.sh, automates a three-step process:

  1. Fetches the full model catalog from the NIM API (/v1/models)
  2. Probes each model by sending a minimal chat completion request
  3. Classifies and reports each model based on the HTTP response

Each model gets a short test payload:

{
  "model": "meta/llama-3.1-8b-instruct",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Respond only with OK."}
  ],
  "max_tokens": 8
}

The max_tokens: 8 keeps costs and latency minimal. We don’t care about the response content, only whether the endpoint accepts the request.

How It Works

Let’s walk through the key parts of the script.

Setup and API Key

#!/usr/bin/env bash
set -euo pipefail
: "${NVIDIA_API_KEY?Must export NVIDIA_API_KEY}"

The script requires an NVIDIA_API_KEY environment variable. If it is not set, the script exits immediately with a clear error. See the Running It Yourself section below for how to get an API key.

Fetching the Model Catalog

BASE_URL="https://integrate.api.nvidia.com/v1"
curl -s --max-time "$TIMEOUT" "$BASE_URL/models" \
  -H "Authorization: Bearer $NVIDIA_API_KEY" \
  -H "Content-Type: application/json" \
  > "$MODELS_FILE"

The full model list is saved to a temporary file, and model IDs are extracted with jq:

jq -r '.data[].id // empty' "$MODELS_FILE"

Probing Each Model

For each model ID, the script sends a chat completion request and captures both the HTTP status code and response body:

raw_response=$(
  curl -s --max-time "$TIMEOUT" \
    -w 'STATUS:%{http_code}' \
    "$BASE_URL/chat/completions" \
    -H "Authorization: Bearer $NVIDIA_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$payload" 2>/dev/null || echo "STATUS:000"
)

status_code="${raw_response##*STATUS:}"
response_body="${raw_response%STATUS:*}"

The -w 'STATUS:%{http_code}' flag appends the HTTP status code to the response body. The script then splits on the STATUS: marker using Bash parameter expansion to extract both parts. If curl fails entirely (network error, DNS failure), the || echo "STATUS:000" fallback ensures we still get a parseable result.

Classifying Results

The script categorizes each model into one of four buckets:

StatusMeaning
HTTP 200 with valid JSON containing choices or modelModel is hosted and responding to chat completions
HTTP 429Rate-limited, but this confirms the model endpoint exists and is hosted
HTTP 404/403/401/500Model is not available (removed, not a chat model, or broken)
Everything else (400, 422, 000, timeouts)Ambiguous – could be a non-chat model, temporarily down, or requires a different request format
case "$status_code" in
  200)
    if jq -e 'has("choices") or has("model")' <<<"$response_body" >/dev/null 2>&1; then
      echo "model_name"        # Hosted and working
    fi
    ;;
  429)
    echo "model_name"          # Rate-limited but hosted
    ;;
  404|403|401|500)
    # Not available
    ;;
esac

Output Files

The script produces two timestamped output files:

  • nvidia_hosted_models_<timestamp>.txt – Just the model IDs that are confirmed hosted, one per line. Useful as input to other scripts or for programmatic consumption.
  • nvidia_test_results_<timestamp>.txt – The full test log with every model and its status. Useful for auditing and debugging.

What I Found

Running the script on April 6, 2026, against the 189 models in the catalog:

  • 101 models returned successful chat completions or were rate-limited (confirmed hosted)
  • 62 models returned HTTP 404 (removed or never hosted as chat endpoints)
  • 6 models returned HTTP 500 (server errors, likely temporarily broken)
  • 20 models returned ambiguous results (400, 422, timeouts, or no response)

Some interesting observations:

Retired models still in the catalog. Models like meta/llama2-70b, databricks/dbrx-instruct, and adept/fuyu-8b return 404. They appear in the catalog but are no longer hosted.

Embedding models fail gracefully. Models like nvidia/nv-embed-v1 and baai/bge-m3 return 404 because they only serve embedding endpoints (/v1/embeddings), not chat completions.

Large models tend to time out. Several large models like nvidia/llama-3.1-nemotron-ultra-253b-v1 (253B parameters) and qwen/qwen3.5-397b-a17b returned HTTP 000, meaning the request timed out within the 5-second window. These models may need longer to cold-start or may have limited capacity.

Duplicate entries exist. nvidia/nemotron-3-super-120b-a12b and openai/gpt-oss-120b each appeared twice in the results, suggesting duplicate entries in the catalog.

The catalog is diverse. Working models span providers including Meta, Google, Microsoft, Mistral, Qwen, DeepSeek, NVIDIA, and many others. The platform is not just for NVIDIA models.

Why This Is Useful

Programmatic access. The browser’s “Free Endpoint” filter is great for interactive browsing, but scripts and automation need a machine-readable list. This script produces clean text files you can pipe into other tools.

Model selection. If you are building an application on the NVIDIA Developer Program, you need to know which models are actually available before you write code against them.

Monitoring. Run this script periodically to track which models come and go. You can diff the output files to detect changes over time.

Cost optimization. By knowing which smaller models have free hosted endpoints, you can test whether a 7B or 8B parameter model is sufficient for your use case before paying for a larger model.

CI/CD validation. If your pipeline depends on a specific NIM model, you can add this script (or a simplified version targeting a single model) as a health check.

Running It Yourself

Getting an API Key

You will need a free NVIDIA API key from the NVIDIA Developer Program :

  1. Sign in (or create an account) at build.nvidia.com
  2. Click your avatar icon in the top-right corner
  3. Select “API Keys” from the menu
  4. Click “Generate API Key”

Your key will start with nvapi-. For more details, see the NVIDIA API Quickstart Guide .

Running the Script

Export your API key and run the script:

export NVIDIA_API_KEY="nvapi-xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
./list-nvidia-hosted-models.sh

The script takes a few minutes to complete since it tests each model sequentially with a 0.1-second delay and a 5-second timeout per request.

The Full Script

#!/usr/bin/env bash

set -euo pipefail

: "${NVIDIA_API_KEY?Must export NVIDIA_API_KEY}"

BASE_URL="https://integrate.api.nvidia.com/v1"
MODELS_FILE=$(mktemp /tmp/nvidia_models.XXXXXX.json)
trap 'rm -f "$MODELS_FILE"' EXIT
INTERVAL=0.1
TIMEOUT=5

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
SUCCESS_FILE="nvidia_hosted_models_${TIMESTAMP}.txt"
RESULTS_FILE="nvidia_test_results_${TIMESTAMP}.txt"

>&2 echo "Fetching models from $BASE_URL/models ..."

curl -s --max-time "$TIMEOUT" "$BASE_URL/models" \
  -H "Authorization: Bearer $NVIDIA_API_KEY" \
  -H "Content-Type: application/json" \
  > "$MODELS_FILE"

# Check for pagination
if jq -e '.has_more // false' "$MODELS_FILE" >/dev/null 2>&1; then
  >&2 echo "Warning: API response indicates additional pages exist."
  >&2 echo "   Results may be incomplete."
fi

>&2 echo "Extracting model IDs ..."

MODEL_IDS=()
while IFS= read -r id; do
  [[ -n "$id" ]] && MODEL_IDS+=("$id")
done < <(jq -r '.data[].id // empty' "$MODELS_FILE")

>&2 echo "Found ${#MODEL_IDS[@]} models; testing endpoints (timeout=${TIMEOUT}s)..."

{
  echo "# Potentially hosted and reachable models for key $NVIDIA_API_KEY"
  echo "# (LLM-style chat endpoints that accept /v1/chat/completions)"
  echo "# Tested: $(date)"
  echo ""
} > "$RESULTS_FILE"

> "$SUCCESS_FILE"

for model in "${MODEL_IDS[@]}"; do
  sleep "$INTERVAL"

  payload=$(jq -n --arg m "$model" '{
    model: $m,
    messages: [
      {role: "system", content: "You are a helpful assistant."},
      {role: "user",   content: "Respond only with OK."}
    ],
    max_tokens: 8
  }')

  raw_response=$(
    curl -s --max-time "$TIMEOUT" \
      -w 'STATUS:%{http_code}' \
      "$BASE_URL/chat/completions" \
      -H "Authorization: Bearer $NVIDIA_API_KEY" \
      -H "Content-Type: application/json" \
      -d "$payload" 2>/dev/null || echo "STATUS:000"
  )

  status_code="${raw_response##*STATUS:}"
  response_body="${raw_response%STATUS:*}"

  case "$status_code" in
    200)
      if jq -e 'has("choices") or has("model")' <<<"$response_body" \
           >/dev/null 2>&1; then
        echo "model_id"
        echo "model_id" >> "$RESULTS_FILE"
        echo "$model" >> "$SUCCESS_FILE"
      else
        echo "model_id (HTTP 200 but non-LLM or unexpected schema)"
        echo "model_id (HTTP 200 but unexpected schema)" >> "$RESULTS_FILE"
      fi
      ;;
    429)
      echo "model_id (HTTP 429 rate-limited -- likely hosted)"
      echo "model_id (HTTP 429 rate-limited)" >> "$RESULTS_FILE"
      echo "$model" >> "$SUCCESS_FILE"
      ;;
    404|403|401|500)
      echo "model_id (HTTP $status_code)"
      echo "model_id (HTTP $status_code)" >> "$RESULTS_FILE"
      ;;
    *)
      echo "model_id (HTTP $status_code or no response)"
      echo "model_id (HTTP $status_code or no response)" >> "$RESULTS_FILE"
      ;;
  esac
done

SUCCESS_COUNT=$(wc -l < "$SUCCESS_FILE")
>&2 echo ""
>&2 echo "Results saved to: $RESULTS_FILE"
>&2 echo "Successful models ($SUCCESS_COUNT) saved to: $SUCCESS_FILE"

Limitations and Future Improvements

  • Sequential execution. The script tests models one at a time. Parallelizing with xargs -P or GNU parallel would significantly reduce runtime.
  • No pagination support. If the catalog grows beyond a single page of results, the script will miss models. It does warn about this.
  • Chat-only testing. The script only tests /v1/chat/completions. Models that serve only embedding, reranking, or image generation endpoints will show as unavailable.
  • Timeout sensitivity. The 5-second timeout may be too aggressive for very large models that need cold-start time. You could increase TIMEOUT for a more complete picture, at the cost of longer runtime.

Frequently Asked Questions

How do I find free hosted models on NVIDIA Builder?

There are two ways. In the browser, go to the NVIDIA Builder model catalog and enable the “Free Endpoint” filter . For a programmatic approach, you can use the script described in this post, which probes each model’s /v1/chat/completions endpoint and produces a machine-readable list of hosted models.

Is the NVIDIA NIM API free to use?

NVIDIA provides free API credits through the NVIDIA Developer Program . You can generate a free API key and use it to access models that have free hosted endpoints (marked as “Free Endpoint” in the catalog). Rate limits apply to free-tier usage.

What is the NVIDIA NIM API base URL?

The base URL for NVIDIA NIM API requests is https://integrate.api.nvidia.com/v1. This endpoint is OpenAI-compatible, so you can use it with any client library that supports the OpenAI API format, including the official OpenAI Python and Node.js SDKs.

Why do some NVIDIA NIM models return 404 or 500 errors?

Models return HTTP 404 when they have been retired from hosted service or were never available as chat completion endpoints (such as embedding or vision-only models). HTTP 500 errors typically indicate temporary server-side issues. Models that time out (HTTP 000) may require longer cold-start times, especially for very large models with hundreds of billions of parameters.

How many models does NVIDIA NIM host for free?

The catalog changes frequently. As of April 2026, the /v1/models endpoint listed 189 models, of which 101 responded successfully to chat completion requests. This number fluctuates as NVIDIA adds new models and retires older ones.

Can I use the NVIDIA NIM API with the OpenAI Python SDK?

Yes. Since the NIM API is OpenAI-compatible, you can point the OpenAI Python SDK at the NIM base URL (https://integrate.api.nvidia.com/v1) and use your NVIDIA API key. This makes it straightforward to switch between OpenAI and NVIDIA-hosted models in your application code.

Conclusion

The NVIDIA Developer Program model catalog is a moving target. Models come and go, endpoints break and recover, and the catalog listing does not always reflect reality. The browser’s “Free Endpoint” filter is a quick way to check interactively, but when you need a programmatic, scriptable answer, this script gives you ground truth: a timestamped snapshot of which models are actually hosted and responding to chat completion requests right now.

How to build an upstream Fedora Kernel from source

How to build an upstream Fedora Kernel from source

I typically keep my Fedora system current, updating it once every week or two. More recently, I wanted to test the Idle Page Tracking feature, but this wasn’t enabled in the default kernel provided by Fedora.

# grep CONFIG_IDLE_PAGE_TRACKING /boot/config-$(uname -r)
# CONFIG_IDLE_PAGE_TRACKING is not set

To enable the feature, we need to build a custom kernel with the feature(s) we need. Thankfully, the process isn’t too difficult.

For this walk through, I’ll be building a customised version of the Fedora 32 kernel version I already have installed (5.8.7-200.fc32.x86_64), using some of the instructions from https://fedoraproject.org/wiki/Building_a_custom_kernel .

Read More
Programming Persistent Memory: A Comprehensive Guide for Developers Book

Programming Persistent Memory: A Comprehensive Guide for Developers Book

After many months of hard work by everyone involved, I’m very pleased to announce that the book “Programming Persistent Memory: A Comprehensive Guide for Developers” is now available for download in digital PDF & ePUB formats from https://pmem.io/book , and Kindle & paperback through Amazon .

Beginner and experienced programmers will use this comprehensive guide to persistent memory programming. You will understand how persistent memory brings together several new software/hardware requirements, and offers great promise for better performance and faster application startup times―a huge leap forward in byte-addressable capacity compared with current DRAM offerings.
This revolutionary new technology gives applications significant performance and capacity improvements over existing technologies. It requires a new way of thinking and developing, which makes this highly disruptive to the IT/computing industry. The full spectrum of industry sectors that will benefit from this technology include, but are not limited to, in-memory and traditional databases, AI, analytics, HPC, virtualization, and big data.   
Programming Persistent Memory describes the technology and why it is exciting the industry. It covers the operating system and hardware requirements as well as how to create development environments using emulated or real persistent memory hardware. The book explains fundamental concepts; provides an introduction to persistent memory programming APIs for C, C++, JavaScript, and other languages; discusses RMDA with persistent memory; reviews security features; and presents many examples. Source code and examples that you can run on your own systems are included.
What You’ll Learn
- Understand what persistent memory is, what it does, and the value it brings to the industry
- Become familiar with the operating system and hardware requirements to use persistent memory
- Know the fundamentals of persistent memory programming: why it is different from current programming methods, and what developers need to keep in mind when programming for persistence
- Look at persistent memory application development by example using the Persistent Memory Development Kit (PMDK)
- Design and optimize data structures for persistent memory
- Study how real-world applications are modified to leverage persistent memory
- Utilize the tools available for persistent memory programming, application performance profiling, and debugging
Who This Book Is For
C, C++, Java, and Python developers, but will also be useful to software, cloud, and hardware architects across a broad spectrum of sectors, including cloud service providers, independent software vendors, high performance compute, artificial intelligence, data analytics, big data, etc. 

Read More
Understanding Memory Usage with `smem`

Understanding Memory Usage with `smem`

Memory management is crucial for Linux administrators and developers, especially when optimizing performance for resource-intensive applications. While tools like top and htop are commonly used to monitor system performance, they often don’t provide enough detail regarding memory usage breakdown. This is where smem comes into play.

What is smem?

smem is a command-line tool that reports memory usage per process and provides better insight into shared memory than most traditional tools, taking shared memory pages into account. Unlike top or htop, which primarily display RSS (Resident Set Size), smem can also show USS (Unique Set Size), which is a better metric for understanding how much memory would be freed if a particular process were terminated. This blog will guide you through using smem, explaining these critical memory metrics and providing comparisons to more familiar tools.

Read More