# unified-engine
**Repository Path**: htqs_admin/unified-engine
## Basic Information
- **Project Name**: unified-engine
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-06-04
- **Last Updated**: 2026-06-04
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
Contributors: Hasan Unlu, Siqin Liu, Tin Nguyen, Rohit Rao, Dave Wei, Hiruna Vishwamith, Yinuo Zhao
Contact:
hunlu@apexcompute.com,
siqin.liu@apexcompute.com,
tin.nguyen@apexcompute.com,
rohit@apexcompute.com,
dave.wei@apexcompute.com,
hiruna@apexcompute.com,
yinuo.zhao@apexcompute.com
⚙️ Hardware Architecture Update v1.2 (update_2461830.bin)

🛒 Purchase FPGA Board with Unified Engine IP Block for $49.99
Includes ongoing hardware design updates so you always have the latest architecture.
# XDMA Driver Setup and Usage Guide
This guide covers installation and usage of the Xilinx XDMA driver for PCIe-based FPGA communication.
## Prerequisites
- Kernel headers installed: `sudo apt install linux-headers-$(uname -r)`
## Installation
### 1. Install XDMA Driver from Xilinx Repository
Clone the official Xilinx DMA driver repository:
```bash
git clone https://github.com/Xilinx/dma_ip_drivers.git
cd dma_ip_drivers/XDMA/linux-kernel/xdma
sudo make install
```
> **Tip:** If `sudo make install` fails, you may need to disable Secure Boot in your BIOS settings.
### 2. Load the Driver
Load the XDMA driver with interrupt mode 0 (auto-detect):
```bash
sudo insmod /lib/modules/$(uname -r)/xdma/xdma.ko interrupt_mode=0
```
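To confirm that the module actually loaded, one option is to look for it in `/proc/modules`. The sketch below is only an illustration and is not part of this repository; the helper name `is_module_loaded` is hypothetical.

```python
from pathlib import Path


def is_module_loaded(name: str, proc_modules_text: str) -> bool:
    """Return True if `name` is the first field of any line in /proc/modules."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


if __name__ == "__main__":
    proc_modules = Path("/proc/modules")
    # /proc/modules only exists on Linux; treat a missing file as "not loaded".
    text = proc_modules.read_text() if proc_modules.exists() else ""
    print("xdma loaded:", is_module_loaded("xdma", text))
```

Running `lsmod | grep xdma` from a shell gives the same information.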
### 3. Load the Driver Every Boot Automatically (Recommended)
Run the following script to install a systemd service that loads the driver at every boot:
```bash
# 1. Remove any conflicting configs
sudo rm -f /etc/modprobe.d/blacklist-xdma.conf \
/etc/modprobe.d/xdma.conf \
/etc/modules-load.d/xdma.conf
# 2. Create systemd service
sudo tee /etc/systemd/system/xdma.service << 'EOF'
[Unit]
Description=Xilinx XDMA Driver
After=local-fs.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c '/sbin/insmod /lib/modules/$(uname -r)/xdma/xdma.ko || true'
ExecStartPost=/bin/sh -c 'chmod 666 /dev/xdma*'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
EOF
# 3. Enable and start
sudo systemctl daemon-reload
sudo systemctl enable xdma
sudo systemctl restart xdma
# 4. Verify
sudo systemctl status xdma
ls -la /dev/xdma* | head -5
```
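The final verification step can also be done from Python, for example at the start of a test script. This is a minimal sketch, assuming the driver exposes its device nodes as `/dev/xdma*` as in the script above; the helper names `find_xdma_nodes` and `unwritable` are hypothetical and not part of this repository.

```python
import glob
import os


def find_xdma_nodes(dev_dir: str = "/dev") -> list[str]:
    """Return the sorted device paths matching xdma* under dev_dir."""
    return sorted(glob.glob(os.path.join(dev_dir, "xdma*")))


def unwritable(paths: list[str]) -> list[str]:
    """Return the nodes the current user cannot open for both read and write."""
    return [p for p in paths if not os.access(p, os.R_OK | os.W_OK)]


if __name__ == "__main__":
    nodes = find_xdma_nodes()
    if not nodes:
        print("No /dev/xdma* nodes found; is the driver loaded?")
    for path in unwritable(nodes):
        print("No read/write permission:", path)
```

If nodes are present but reported as not accessible, the `chmod` in `ExecStartPost` may not have run; `sudo systemctl status xdma` should show why.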
### 4. Set Up Python Environment
```bash
python3 -m venv ~/my_torch_env
source ~/my_torch_env/bin/activate
pip install -r requirements.txt
```
### 5. Run Hardware Tests
```bash
python3 user_hw_test.py
```
### 6. Run Gemma3 Inference (requires Hugging Face)
The Gemma3 test downloads the gated [google/gemma-3-1b-it](https://huggingface.co/google/gemma-3-1b-it) model from Hugging Face. You need to:
1. Create a Hugging Face account at https://huggingface.co
2. Accept the Gemma license at https://huggingface.co/google/gemma-3-1b-it
3. Create an access token at https://huggingface.co/settings/tokens
4. Log in from the command line:
```bash
pip install huggingface-hub
huggingface-cli login
```
Then run:
```bash
python3 models/gemma3/gemma3_test.py --prompt "your prompt"
```
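If the model download fails with an authorization error, it can help to check whether a token is visible at all before re-running the test. The sketch below is an assumption-laden illustration: it relies on the `HF_TOKEN` and `HF_HOME` environment variables and on the default token location `~/.cache/huggingface/token`, which may differ on your system. The function name `find_hf_token` is hypothetical.

```python
import os
from pathlib import Path


def find_hf_token(env=None, home=None):
    """Return "env", "file", or None depending on where a token is found."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    # A token exported in the environment is checked first.
    if env.get("HF_TOKEN"):
        return "env"
    # Assumed default location of the token written by the login command.
    hf_home = Path(env.get("HF_HOME", home / ".cache" / "huggingface"))
    token_file = hf_home / "token"
    if token_file.is_file() and token_file.read_text().strip():
        return "file"
    return None


if __name__ == "__main__":
    print("Hugging Face token source:", find_hf_token())
```

A result of `None` suggests that the login step above has not been completed for the current user.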
### 7. Update the Hardware Bin File
```bash
python3 update_flash.py update_xxxxxxxx.bin
```
After the update completes, cold reboot the PC (power it off completely, then power it back on).
---
## Apex Compute Unified Engine v1.1 — Benchmark Results
All benchmarks were collected on RTL running on a Kintex UltraScale+ FPGA in real time.
### Benchmark Datasheet
📄 Download Benchmark Datasheet (PDF)
| Specification | Value |
|---|---|
| Engine frequency | 333 MHz |
| Theoretical peak (BF16) | 42 GFLOPS/s |
| Memory interface | DDR4 @ 1333 MHz, 32-bit |
| AXI Master Data Width | 256 bits |
| On-chip SRAM | 1.05 MB |
| Total power | 4.5 W |
| BF16 MatMul | 40.17 GFLOPS/s (95.6% utilization) |
| BF16 MatMul + Bias + Activation | 40.03 GFLOPS/s (95.3% utilization) |
| BF16 Softmax MatMul | 37.76 GFLOPS/s (89.9% utilization) |
| Memory-Efficient Attention | ~90% utilization |
| Quantized MatMul (BF16 × INT4/FP4) | 40.03 GFLOPS/s (95.3% utilization) |
| Quantized MatVec (Streaming matrix, decoding mode friendly) (BF16 × INT4/FP4) | 31.33 GFLOPS/s (74.6% utilization) |
| RMSNorm | 4.81 GFLOPS/s |
| LayerNorm | 5.90 GFLOPS/s |
| Quantize (BF16 → INT4/FP4) | 5.72 GFLOPS/s |
| Dequantize (INT4/FP4 → BF16) | 3.31 GFLOPS/s |
| Hardware trace buffer | 8,192 timestamps |
| Multi-engine tensor parallelism | Supported with Synchronization Flag instructions |
### FPGA Presilicon Prototype Setup
#### System Parameters
| Parameter | Value |
|---|---|
| Memory interface | DDR4 at 1333 MHz, 32-bit data path |
| Engine frequency | 333 MHz |
| Memory interface clock | Synchronized 1:1 with engine clock |
| Data width | 256 bits |
| Total power consumption | 4.5 W |
| Total on-chip SRAM | 1.05 MB |
#### Peak Operation Rate
At 333 MHz, the engine's theoretical peak throughput is approximately **42 GFLOPS/s**.
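As a rough sanity check, the peak figure can be converted into operations per clock cycle. The sketch below uses only the two numbers quoted above plus the convention from the FLOPS Definitions table that one FMA counts as 2 FLOPS. The resulting FMA count is an inference from these rounded figures, not a documented property of the engine.

```python
PEAK_FLOPS = 42e9   # theoretical peak quoted above
CLOCK_HZ = 333e6    # engine frequency quoted above

# Floating-point operations the engine must retire per cycle to reach the peak.
flops_per_cycle = PEAK_FLOPS / CLOCK_HZ

# One FMA counts as 2 FLOPS, so this is the implied number of FMAs per cycle.
fma_per_cycle = flops_per_cycle / 2

print(f"{flops_per_cycle:.1f} FLOPS per cycle, about {fma_per_cycle:.0f} FMAs per cycle")
```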
#### FPGA Resource Utilization
| Name | CLB LUTs | CLB Registers | Block RAM Tile | URAM | DSPs |
|---|---|---|---|---|---|
| unified_engine_top | 78,348 | 50,045 | 16 | 30 | 197 |
### FLOPS Definitions
| Operation | FLOPS |
|---|---|
| FMA (Fused Multiply-Add) | 2 |
| Addition / Multiplication | 1 |
| Exponent | 1 |
| Division | 1 |
### BF16 Operation Benchmarks
Engine speed: **333 MHz**; theoretical peak: **42 GFLOPS/s**. Metrics are based on **M=1024, K=1024, N=1024**. O denotes the output tensor. All matrix-matrix operations reach up to **95% FLOPS utilization**.
| Op | Operands | FLOPS | Cycles (latency) | Achieved GFLOPS/s |
|---|---|---|---|---|
| A Bᵀ | A[M,K], B[N,K] → O[M,N] | 2MKN | 17,820,455 (53.3 ms) | 40.17 |
| A Bᵀ + C | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN | 17,858,564 (53.5 ms) | 40.10 |
| GELU(A Bᵀ) | A[M,K], B[N,K] → O[M,N] | 2MKN + 4MN | 17,923,045 (53.7 ms) | 40.02 |
| GELU(A Bᵀ + C) | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN + 4MN | 17,927,850 (53.7 ms) | 40.03 |
| SiLU(A Bᵀ) | A[M,K], B[N,K] → O[M,N] | 2MKN + 4MN | 17,921,594 (53.7 ms) | 40.02 |
| SiLU(A Bᵀ + C) | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN + 4MN | 17,926,623 (53.7 ms) | 40.03 |
| softmax(A Bᵀ) | A[M,K], B[N,K] → O[M,N] | 2MKN + 5MN | 19,004,997 (57.01 ms) | 37.76 |
| softmax(A Bᵀ + C) | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN + 5MN | 19,051,310 (57.15 ms) | 37.68 |
| Aᵀ | A[M,N] → O[N,M] | 0 | 1,648,647 (4.9 ms) | N/A |
| A · scalar | A[M,N] → O[M,N] | MN | 180,500 (541 µs) | 1.94 |
| A + scalar | A[M,N] → O[M,N] | MN | 181,005 (543 µs) | 1.93 |
| A · B | A[M,N], B[M,N] → O[M,N] | MN | 263,580 (790 µs) | 1.33 |
| A + B | A[M,N], B[M,N] → O[M,N] | MN | 263,871 (791 µs) | 1.33 |
| RMSNorm(A) · γ | A[M,N], γ[N] → O[M,N] | 4MN | 290,945 (872 µs) | 4.81 |
| LayerNorm(A) · γ + β | A[M,N], γ[N], β[N] → O[M,N] | 7MN | 414,679 (1.24 ms) | 5.90 |
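
The achieved-throughput column can be reproduced from the FLOPS formula and the cycle count in each row. The sketch below is an illustration only. It assumes the quoted "333 MHz" is exactly 1/3 GHz (333.33 MHz), since that value reproduces the table's figures; the helper names are hypothetical.

```python
CLOCK_HZ = 1e9 / 3    # assumption: "333 MHz" taken as 333.33 MHz
PEAK_GFLOPS = 42.0    # theoretical peak from the table above


def matmul_flops(m, k, n, bias=False, activation_flops_per_element=0):
    """FLOPS for act(A Bᵀ + C): 2MKN, plus MN for bias, plus per-element activation cost."""
    flops = 2 * m * k * n
    if bias:
        flops += m * n
    return flops + activation_flops_per_element * m * n


def achieved_gflops(flops, cycles, clock_hz=CLOCK_HZ):
    """Convert a FLOPS count and a measured cycle count into GFLOPS/s."""
    seconds = cycles / clock_hz
    return flops / seconds / 1e9


# Plain A Bᵀ row: 17,820,455 cycles at M = K = N = 1024.
plain = achieved_gflops(matmul_flops(1024, 1024, 1024), 17_820_455)
print(f"{plain:.2f} GFLOPS/s, {100 * plain / PEAK_GFLOPS:.1f}% of peak")
```

The same two functions cover the fused rows: for example, GELU(A Bᵀ + C) uses `bias=True` and `activation_flops_per_element=4`, matching the `2MKN + MN + 4MN` entry.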
#### Memory-Efficient Attention
The following kernel computes the attention block for given query/key/value tensors and an optional mask or bias. It reaches almost **90% utilization** of theoretical FLOPS.
```
memory_efficient_attention(q, k, v, mask_or_bias)
```
Equivalent PyTorch reference:
```python
import math
import torch

def memory_efficient_attention(q, k, v, attn_bias=None):
    # head_dim is the size of the last dimension of q
    head_dim = q.shape[-1]
    scale = 1.0 / math.sqrt(head_dim)
    attn_weights = (q @ k.transpose(-2, -1)) * scale
    if attn_bias is not None:
        attn_weights = attn_weights + attn_bias
    scores = torch.softmax(attn_weights, dim=-1)
    return scores @ v
```
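
For readers unfamiliar with the term, "memory-efficient" attention generally means computing the same result without materializing the full score matrix. The pure-Python sketch below shows the usual streaming-softmax idea for a single query row. It is a generic illustration of the technique, not a description of how the engine implements its kernel, and the function names are hypothetical.

```python
import math


def naive_attention_row(q, keys, values):
    """Reference: softmax(q · kᵀ / sqrt(d)) · V for a single query row."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / total
            for j in range(len(values[0]))]


def streaming_attention_row(q, keys, values):
    """Same result, visiting each key/value pair once with a running max and sum."""
    d = len(q)
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        s = sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
        new_max = max(running_max, s)
        # Rescale what was accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        w = math.exp(s - new_max)
        running_sum = running_sum * correction + w
        acc = [a * correction + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

The streaming version keeps only one accumulator row, a running maximum, and a running sum, so its working memory does not grow with the number of keys.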

Flash attention benchmark — bias off

Flash attention benchmark — bias on
### Quantized Operation Benchmarks
Engine speed: 333 MHz; theoretical peak: 42 GFLOPS/s. In quantized mode, achieved FLOPS are the same **for any M**. In contrast, for tiled matrix-matrix multiplication, smaller M reduces FLOPS utilization. Here, fp4 refers to NVFP4 (NVIDIA's FP4 format).
Metrics based on M=1024, K=1024, N=1024.
| Op | Precision | Operands | FLOPS | Cycles (latency) | Achieved GFLOPS/s |
|---|---|---|---|---|---|
| A Bᵀ | A(bf16) B(int4/fp4) O(bf16) | A[M,K], B[N,K] → O[M,N] | 2MKN | 22,849,177 (68.5 ms) | 31.33 |
| A Bᵀ + C | A(bf16) B(int4/fp4) C(bf16) O(bf16) | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN | 23,073,635 (69.2 ms) | 31.04 |
| GELU(A Bᵀ) | A(bf16) B(int4/fp4) O(bf16) | A[M,K], B[N,K] → O[M,N] | 2MKN + 4MN | 22,850,336 (68.5 ms) | 31.39 |
| GELU(A Bᵀ + C) | A(bf16) B(int4/fp4) C(bf16) O(bf16) | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN + 4MN | 23,100,231 (69.3 ms) | 31.06 |
| SiLU(A Bᵀ) | A(bf16) B(int4/fp4) O(bf16) | A[M,K], B[N,K] → O[M,N] | 2MKN + 4MN | 22,850,243 (68.5 ms) | 31.39 |
| SiLU(A Bᵀ + C) | A(bf16) B(int4/fp4) C(bf16) O(bf16) | A[M,K], B[N,K], C[M,N] → O[M,N] | 2MKN + MN + 4MN | 23,104,094 (69.3 ms) | 31.06 |
#### Quantization / Dequantization (N=131,072)
| Op | Precision | Operands | FLOPS | Cycles (latency) | Achieved GFLOPS/s |
|---|---|---|---|---|---|
| Quantize(A) | A(bf16) O(int4/fp4) | A[N] → O[N] | 2N | 15,266 (45.8 µs) | 5.72 |
| Dequantize(A) | A(int4/fp4) O(bf16) | A[N] → O[N] | N | 13,193 (39.5 µs) | 3.31 |
### Trace Buffer and Tensor Parallelism
The engine includes a hardware trace buffer capable of recording **8,192 timestamps**, allowing cycle-accurate profiling of kernel execution. This is useful for experimenting with tensor parallelism across multiple engines.
The example below demonstrates splitting a 256×2048 @ 2048×1024 matrix multiplication across two engines:
- **Engine 0:** 192×2048 @ 2048×1024 (larger partition)
- **Engine 1:** 64×2048 @ 2048×1024 (smaller partition)
Because the two partitions have unequal workloads, the smaller partition finishes before the larger one. A hardware **synchronization flag** is used to hold the faster engine until both are complete before proceeding to the next stage. The trace visualization below shows this synchronization in action — the idle gap on Engine 1 is where it waits for Engine 0 to finish.

Trace buffer visualization — 256×2048 @ 2048×1024 split across two engines with hardware synchronization
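
The size of the idle gap can be estimated from the partition shapes alone. The sketch below counts the FLOPS in each partition using the 2MKN convention from the benchmark tables. It assumes both engines sustain the same throughput and ignores memory effects, so the measured trace may differ.

```python
K, N = 2048, 1024
# Rows of the 256-row left operand assigned to each engine in the example above.
partitions = {"engine0": 192, "engine1": 64}


def matmul_flops(m, k, n):
    """FLOPS for an [m, k] @ [k, n] matrix multiplication (one FMA = 2 FLOPS)."""
    return 2 * m * k * n


work = {name: matmul_flops(rows, K, N) for name, rows in partitions.items()}
slowest = max(work.values())

# Fraction of the stage each engine would spend waiting on the synchronization
# flag, under the equal-throughput assumption stated above.
idle_fraction = {name: 1 - flops / slowest for name, flops in work.items()}

for name in partitions:
    print(f"{name}: {work[name]:,} FLOPS, idle {idle_fraction[name]:.0%}")
```

Under these assumptions Engine 1 has one third of Engine 0's work, so it would wait for roughly two thirds of the stage; an even 128/128 split would remove the gap.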