Introduction
In this tutorial, you'll learn how to run and experiment with the Nemotron 3 Super model, a 120-billion-parameter open-source AI model developed by NVIDIA. The model is designed for complex multi-agent applications and offers significantly higher throughput than earlier models. We'll walk through setting up your environment, downloading the model, and running basic inference to understand how it works.
Prerequisites
- A computer with at least 16GB of RAM (32GB recommended) for working through the scripts; loading the full 120B model takes far more memory (see the note in the Summary)
- Python 3.8 or higher installed
- Basic understanding of command-line interfaces
- An NVIDIA GPU with CUDA support
Step-by-Step Instructions
1. Setting Up Your Environment
1.1 Install Required Dependencies
First, we need to install the necessary Python packages. Open your terminal and run:
pip install torch transformers accelerate
Why this step? These packages provide the core functionality needed to load and run large language models. PyTorch is the deep learning framework, transformers handles model loading, and accelerate helps with GPU management.
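Before moving on, you can confirm your interpreter version and that the packages installed cleanly. The helper below is a convenience sketch using only the standard library; environment_report and its default package list are this tutorial's own names, not part of any of these libraries:

```python
import importlib.util
import sys

def environment_report(packages=("torch", "transformers", "accelerate")):
    """Return a dict with the Python version check and which packages are importable."""
    report = {"python_ok": sys.version_info >= (3, 8)}
    for name in packages:
        # find_spec returns None when the package is not installed
        report[name] = importlib.util.find_spec(name) is not None
    return report

print(environment_report())
```

If any entry prints False, re-run the pip install command above before continuing.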
1.2 Create a Project Directory
Create a new folder for this project:
mkdir nemotron_project
cd nemotron_project
Why this step? Keeping all files in one directory makes it easier to manage your work and ensures everything stays organized.
2. Downloading the Nemotron Model
2.1 Access the Model Repository
Visit the official Nemotron repository on Hugging Face or NVIDIA's website to find the model files. For this tutorial, we'll use a simplified version of the model structure.
Why this step? The model files are large and need to be downloaded to your local machine to run locally.
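Before starting a download of this size, it's worth confirming you have enough disk space: an fp16 checkpoint stores 2 bytes per parameter, so 120B parameters come to roughly 240 GB on disk. A stdlib-only sketch (the 240 GB figure is this estimate, not an official number):

```python
import shutil

def free_gb(path="."):
    """Free disk space at `path`, in gigabytes."""
    return shutil.disk_usage(path).free / 1e9

# Rough checkpoint size: 120e9 parameters * 2 bytes (fp16) ~= 240 GB
required_gb = 240
print(f"Free: {free_gb():.1f} GB, required: ~{required_gb} GB")
```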
2.2 Create a Basic Model Loading Script
Create a file called model_loader.py with the following content:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model
model_name = "nvidia/Nemotron-3-Super-120B"
try:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    print("Model loaded successfully!")
except Exception as e:
    print(f"Error loading model: {e}")
Why this step? This script sets up the basic framework to load the model. The torch_dtype=torch.float16 reduces memory usage, and device_map="auto" automatically assigns the model to available GPU memory.
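As a sanity check on those settings, the weight footprint is easy to estimate by hand: fp16 stores each parameter in 2 bytes, so the weights alone of a 120B-parameter model need about 240 GB, spread across whatever devices device_map="auto" finds (activations and the KV cache add more on top). A back-of-the-envelope sketch:

```python
def fp16_weight_gb(num_params):
    """Rough memory needed just for fp16 weights: 2 bytes per parameter."""
    return num_params * 2 / 1e9

# 120 billion parameters in fp16
print(f"{fp16_weight_gb(120e9):.0f} GB")  # roughly 240 GB for weights alone
```

This is why the Summary stresses high-end GPUs or cloud resources for full-scale experimentation.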
3. Running Inference
3.1 Create an Inference Script
Create a new file called inference.py:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the model and tokenizer
model_name = "nvidia/Nemotron-3-Super-120B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# Define your prompt
prompt = "Explain how agentic AI works in simple terms."

# Tokenize the input
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)

# Generate output
with torch.no_grad():
    output = model.generate(input_ids, max_length=150, num_return_sequences=1)

# Decode and print the result
response = tokenizer.decode(output[0], skip_special_tokens=True)
print(response)
Why this step? This script takes a prompt, processes it through the model, and returns a generated response. It demonstrates how to interact with the model programmatically.
3.2 Run the Inference Script
In your terminal, run:
python inference.py
Why this step? This executes your script and shows how the model responds to your input prompt, giving you a real-world demonstration of the model's capabilities.
4. Understanding Model Performance
4.1 Measure Inference Time
Update your inference.py script to include timing:
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load the model and tokenizer
model_name = "nvidia/Nemotron-3-Super-120B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
# Define your prompt
prompt = "Explain how agentic AI works in simple terms."
# Tokenize the input
input_ids = tokenizer.encode(prompt, return_tensors='pt').to(model.device)
# Measure time
start_time = time.time()
# Generate output
with torch.no_grad():
output = model.generate(input_ids, max_length=150, num_return_sequences=1)
end_time = time.time()
print(f"Inference took {end_time - start_time:.2f} seconds")
# Decode and print the result
response = tokenizer.decode(output[0], skip_special_tokens=True)
print(response)
Why this step? Timing helps you understand how quickly the model processes information, which is crucial for applications requiring fast responses.
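Raw elapsed time depends on how many tokens were generated, so throughput in tokens per second is often the more comparable number. A small sketch using plain integers for the lengths; in the script above these would come from input_ids.shape[-1] and output.shape[-1]:

```python
def tokens_per_second(total_len, prompt_len, elapsed_s):
    """Generation throughput: newly generated tokens divided by wall-clock time."""
    new_tokens = total_len - prompt_len
    return new_tokens / elapsed_s

# e.g. a 150-token output from a 12-token prompt in 6.9 seconds
print(f"{tokens_per_second(150, 12, 6.9):.1f} tokens/s")  # about 20 tokens/s
```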
4.2 Experiment with Different Prompts
Try different prompts to see how the model responds:
prompts = [
    "What are the benefits of using hybrid Mamba-Attention models?",
    "How does MoE (Mixture of Experts) improve model efficiency?",
    "What makes Nemotron 3 Super different from other large language models?",
]

for prompt in prompts:
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(input_ids, max_length=100, num_return_sequences=1)
    response = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"Prompt: {prompt}")
    print(f"Response: {response}\n")
Why this step? Testing different prompts helps you understand the model's range of capabilities and how it handles various types of questions.
5. Optimizing Performance
5.1 Adjust Generation Parameters
Modify your generation parameters to see how they affect performance and output:
# Example of adjusting parameters
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_length=200,          # Increase max length
        num_return_sequences=1,  # Return one sequence
        temperature=0.7,         # Controls randomness
        top_p=0.9,               # Nucleus sampling
        do_sample=True,          # Enable sampling
    )
Why this step? Adjusting these parameters lets you control the creativity and specificity of the model's output, which is important for different applications.
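To build intuition for what temperature and top_p actually do, here is a toy, pure-Python sketch over a three-token vocabulary: temperature rescales the logits before softmax (values below 1 sharpen the distribution, above 1 flatten it), and nucleus filtering keeps the smallest set of tokens whose cumulative probability reaches top_p. This is illustrative only, not how transformers implements sampling internally:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus(probs, top_p=0.9):
    """Indices of the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

logits = [2.0, 1.0, 0.0]
print(softmax(logits, temperature=0.7))   # sharper than temperature=1.0
print(nucleus(softmax(logits), top_p=0.9))  # the least-likely token is dropped
```

Sampling then draws from the renormalized probabilities of the kept tokens, which is why low temperature plus low top_p gives focused output while high values give more varied text.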
Summary
In this tutorial, you've learned how to set up an environment for running the Nemotron 3 Super model, load the model using Python, and perform basic inference. You've also learned how to measure performance and adjust parameters to optimize output. This foundational knowledge gives you a starting point for exploring more advanced applications of this powerful open-source model.
Remember, working with 120 billion parameter models requires significant computational resources. For full-scale experimentation, you'll need access to high-end GPUs or cloud computing resources with sufficient memory.



