The STM32N6 is STMicroelectronics' first STM32 product to integrate an in-house neural processing unit. Its heterogeneous "MCU + NPU" architecture redefines the compute envelope for edge AI and sits at the leading edge of ST's MCU technology stack. However, the platform is technically demanding and assumes both deep STM32 experience and a grounding in basic neural-network concepts, so the learning curve is steep.
Since the STM32N6 launched, I have been fortunate enough to get hold of an STM32N6570-DK development board and have been tinkering with it in my spare time. I will therefore be publishing a series of notes on using the STM32N6, for future users to reference.
In the previous post we deployed a YOLOv8n model on the STM32N6, quantized it, and implemented the post-processing. In this post we build on that work: using images and labels in YOLO format, we train a simple SSD-style detection model of our own, then deploy it and run detection with post-processing.
1. Building the model
First, prepare the YOLO-format images and their matching label files as training data, then design the model:
The network structure is deliberately simple: six large convolutional blocks, which extract image features effectively while mitigating problems such as vanishing gradients.
The output divides the image into a 20*20 grid of cells; each cell predicts five values: cx, cy, w, h, and conf.
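Before the full training script, the grid-target encoding described above can be sketched in isolation (pure NumPy; the sample box coordinates are illustrative assumptions):

```python
import numpy as np

GRID_SIZE = 20  # 20x20 output grid

def encode_label(x, y, w, h, grid_size=GRID_SIZE):
    """Place one normalized YOLO box (cx, cy, w, h in 0..1) into a
    grid_size x grid_size x 5 target tensor: (dx, dy, w, h, conf)."""
    target = np.zeros((grid_size, grid_size, 5), dtype=np.float32)
    gx = min(int(x * grid_size), grid_size - 1)  # cell column
    gy = min(int(y * grid_size), grid_size - 1)  # cell row
    target[gy, gx] = [x * grid_size - gx,  # x offset inside the cell
                      y * grid_size - gy,  # y offset inside the cell
                      w, h, 1.0]           # box size and objectness
    return target

# A box centered at (0.52, 0.48) lands in cell (gy=9, gx=10)
t = encode_label(0.52, 0.48, 0.3, 0.4)
```

This mirrors the target construction inside `load_dataset` below: only the cell containing the box center carries the regression targets and a confidence of 1.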
import tensorflow as tf
import numpy as np
import os
import cv2
IMG_SIZE = 320
GRID_SIZE = 20
BATCH_SIZE = 8
EPOCHS = 50
# =========================
# 1. Data loading
# =========================
def load_dataset(image_dir, label_dir):
    images = []
    targets = []
    for file in os.listdir(image_dir):
        if not (file.endswith(".jpg") or file.endswith(".png")):
            continue
        img_path = os.path.join(image_dir, file)
        label_path = os.path.join(label_dir, file.replace(".jpg", ".txt").replace(".png", ".txt"))
        img = cv2.imread(img_path)
        if img is None:
            continue
        img = cv2.resize(img, (IMG_SIZE, IMG_SIZE))
        img = img.astype(np.uint8)
        target = np.zeros((GRID_SIZE, GRID_SIZE, 5), dtype=np.float32)
        if os.path.exists(label_path):
            with open(label_path, "r") as f:
                for line in f.readlines():
                    line = line.strip()
                    if not line:
                        continue
                    parts = line.split()
                    if len(parts) != 5:
                        continue
                    cls, x, y, w, h = map(float, parts)
                    gx = int(x * GRID_SIZE)
                    gy = int(y * GRID_SIZE)
                    gx = np.clip(gx, 0, GRID_SIZE - 1)
                    gy = np.clip(gy, 0, GRID_SIZE - 1)
                    target[gy, gx, 0] = x * GRID_SIZE - gx
                    target[gy, gx, 1] = y * GRID_SIZE - gy
                    target[gy, gx, 2] = w
                    target[gy, gx, 3] = h
                    target[gy, gx, 4] = 1.0
                    break  # only the first label per file is used (single-object assumption)
        images.append(img)
        targets.append(target)
    return np.array(images), np.array(targets)
# =========================
# 2. Model (20x20 output)
# =========================
def build_model():
    inputs = tf.keras.Input(shape=(IMG_SIZE, IMG_SIZE, 3), dtype=tf.uint8)
    x = tf.cast(inputs, tf.float32) / 255.0  # normalization is baked into the graph
    for filters in [32, 64, 128, 256]:
        x = tf.keras.layers.Conv2D(filters, 3, strides=2, padding='same', activation='relu')(x)
    x = tf.keras.layers.Conv2D(256, 3, padding='same', activation='relu')(x)
    output = tf.keras.layers.Conv2D(5, 1)(x)  # raw logits; activations are applied in the loss / post-processing
    return tf.keras.Model(inputs, output)
# =========================
# 3. Loss
# =========================
def loss_fn(y_true, y_pred):
    obj_mask = tf.expand_dims(y_true[..., 4], axis=-1)
    pred_xy = tf.sigmoid(y_pred[..., 0:2])
    pred_wh = tf.sigmoid(y_pred[..., 2:4])
    pred_obj = tf.sigmoid(y_pred[..., 4:5])
    pred_wh = tf.clip_by_value(pred_wh, 0, 1)
    xy_loss = tf.reduce_sum(obj_mask * tf.square(y_true[..., 0:2] - pred_xy))
    wh_loss = tf.reduce_sum(obj_mask * tf.square(y_true[..., 2:4] - pred_wh))
    bce = tf.keras.losses.binary_crossentropy(obj_mask, pred_obj)
    bce = tf.expand_dims(bce, axis=-1)
    obj_loss = tf.reduce_sum(
        obj_mask * bce +
        0.1 * (1 - obj_mask) * bce
    )
    return xy_loss + wh_loss + obj_loss
# =========================
# 4. Validation
# =========================
def evaluate_fast(model, images, targets):
    preds = model(images, training=False)
    pred_xy = tf.sigmoid(preds[..., 0:2])
    pred_wh = tf.sigmoid(preds[..., 2:4])
    true_xy = targets[..., 0:2]
    true_wh = targets[..., 2:4]
    obj_mask = targets[..., 4:5]
    error_xy = tf.sqrt(tf.reduce_sum(tf.square(pred_xy - true_xy), axis=-1, keepdims=True))
    error_wh = tf.abs(pred_wh - true_wh)
    error = (error_xy + tf.reduce_sum(error_wh, axis=-1, keepdims=True)) / 2
    total_error = tf.reduce_sum(error * obj_mask)
    count = tf.reduce_sum(obj_mask)
    return (total_error / (count + 1e-6)).numpy()
# =========================
# 5. Sampled validation
# =========================
def evaluate_sample(model, images, targets, sample=10):
    if len(images) == 0:
        return 0
    idx = np.random.choice(len(images), min(sample, len(images)), replace=False)
    return evaluate_fast(model, images[idx], targets[idx])
# =========================
# 6. Training
# =========================
def train():
    images, targets = load_dataset("images", "labels")
    print("Dataset size:", len(images))
    split = int(len(images) * 0.8)
    train_images = images[:split]
    train_targets = targets[:split]
    val_images = images[split:]
    val_targets = targets[split:]
    dataset = tf.data.Dataset.from_tensor_slices((train_images, train_targets))
    dataset = dataset.shuffle(200).batch(BATCH_SIZE)
    model = build_model()
    print("Model output shape:", model.output_shape)
    optimizer = tf.keras.optimizers.Adam(1e-3)
    for epoch in range(EPOCHS):
        total_loss = 0
        for img, tgt in dataset:
            with tf.GradientTape() as tape:
                pred = model(img, training=True)
                loss = loss_fn(tgt, pred)
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
            total_loss += loss.numpy()
        if epoch % 5 == 0:
            val_error = evaluate_sample(model, val_images, val_targets)
            print(f"Epoch {epoch}: Loss={total_loss:.4f}, ValError={val_error:.4f}")
        else:
            print(f"Epoch {epoch}: Loss={total_loss:.4f}")
    model.save("ssd_model")
# =========================
# 7. TFLite quantization
# =========================
def representative_data_gen():
    images, _ = load_dataset("images", "labels")
    for i in range(min(100, len(images))):
        yield [np.expand_dims(images[i], axis=0)]

def export_tflite():
    model = tf.keras.models.load_model("ssd_model", compile=False)
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_data_gen
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.uint8
    converter.inference_output_type = tf.int8
    tflite_model = converter.convert()
    with open("model_int8.tflite", "wb") as f:
        f.write(tflite_model)

if __name__ == "__main__":
    train()
    export_tflite()
Training, quantization, and export code
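As a quick sanity check on the architecture, the 20x20 output grid follows directly from the four stride-2 convolutions in `build_model` (a sketch, assuming the 'same' padding used there):

```python
IMG_SIZE = 320

# Each stride-2 'same'-padded conv halves the spatial size (ceiling division).
size = IMG_SIZE
for _ in range(4):         # the four stride-2 Conv2D layers
    size = (size + 1) // 2  # ceil(size / 2)

print(size)  # 320 -> 160 -> 80 -> 40 -> 20
```

So each grid cell corresponds to a 16x16-pixel patch of the 320x320 input.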
2. Deployment and use on STM32
Next, in STM32CubeMX, import the quantized .tflite model into the X-CUBE-AI (CubeAI) middleware.
After quantization the model occupies 1 KB of Flash and 1.47 MB of RAM, and almost all of the computation is mapped onto the NPU:
Then prepare a test image, convert it into the corresponding C array, and save it.
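One way to generate such an array is a small host-side script (a sketch: the array name `gImage_images` matches the buffer memcpy'd in the firmware; the dummy input stands in for a real image loaded and resized to 320x320 with cv2):

```python
import numpy as np

def to_c_array(img, name="gImage_images"):
    """Serialize an HxWx3 uint8 image into a C source snippet."""
    flat = img.astype(np.uint8).flatten()
    body = ",".join(str(v) for v in flat)
    return f"const uint8_t {name}[{flat.size}] = {{{body}}};"

# Demo with a dummy 2x2 image; in practice use
# cv2.resize(cv2.imread(path), (320, 320)) as the input array.
snippet = to_c_array(np.zeros((2, 2, 3), dtype=np.uint8))
print(snippet[:40])
```

Note that cv2 loads images as BGR; the channel order must match whatever the training pipeline used (here `load_dataset` also reads with cv2, so the orders agree).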
uint32_t buff_in_len, buff_out_len;
LL_ATON_RT_RetValues_t ll_aton_rt_ret = LL_ATON_RT_DONE;
const LL_Buffer_InfoTypeDef *ibuffersInfos = NN_Interface_Default.input_buffers_info();
const LL_Buffer_InfoTypeDef *obuffersInfos = NN_Interface_Default.output_buffers_info();
buffer_in  = (uint8_t *)LL_Buffer_addr_start(&ibuffersInfos[0]);
buffer_out = (uint8_t *)LL_Buffer_addr_start(&obuffersInfos[0]);
buff_in_len  = ibuffersInfos->offset_end - ibuffersInfos->offset_start;
buff_out_len = obuffersInfos->offset_end - obuffersInfos->offset_start;
memcpy(buffer_in, gImage_images, 320 * 320 * 3);
SCB_CleanDCache_by_Addr((uint32_t *)buffer_in, buff_in_len);       /* write cached data back to memory */
SCB_InvalidateDCache_by_Addr((uint32_t *)buffer_in, buff_in_len);  /* force a re-read from memory */
LL_ATON_RT_RuntimeInit();
Before running inference, the input must be preprocessed; the model's input layout can be inspected in CubeMX:
The model accepts a 320*320*3 uint8 input. After a Transpose layer, a Cast layer in epoch_2 normalizes the values to the 0~1 range, so we do not have to write the normalization logic ourselves.
for (int inferenceNb = 0; inferenceNb < 1; ++inferenceNb) {
    LL_ATON_RT_Init_Network(&NN_Instance_Default);  /* Initialize passed network instance object */
    do {
        /* Execute first/next epoch block */
        ll_aton_rt_ret = LL_ATON_RT_RunEpochBlock(&NN_Instance_Default);
        /* Wait for next event */
        if (ll_aton_rt_ret == LL_ATON_RT_WFE) {
            LL_ATON_OSAL_WFE();
        }
    } while (ll_aton_rt_ret != LL_ATON_RT_DONE);
}
Then wait for the inference to complete.
float *floatout = (float *)buffer_out;
int valid_count = 0;
/* Walk the 20x20 output grid */
for (int gy = 0; gy < GRID_SIZE; ++gy) {
    for (int gx = 0; gx < GRID_SIZE; ++gx) {
        /* Flat index: (gy * GRID_SIZE + gx) * 5 + channel offset */
        int base_idx = (gy * GRID_SIZE + gx) * 5;
        float raw_conf = floatout[base_idx + 4];
        /* Sigmoid to turn the raw logit into a confidence */
        float conf = 1.0f / (1.0f + expf(-raw_conf));
        if (conf > 0.1f && valid_count < GRID_SIZE * GRID_SIZE) {
            /* 1. Normalized box center (0-1); note that training applied a sigmoid
               to these offsets, so applying it here too would match loss_fn exactly */
            float bx = (gx + floatout[base_idx + 0]) / (float)GRID_SIZE;
            float by = (gy + floatout[base_idx + 1]) / (float)GRID_SIZE;
            /* 2. Width/height: ReLU to force non-negative, then clip to 0-1 */
            float raw_w = floatout[base_idx + 2];
            float raw_h = floatout[base_idx + 3];
            float bw = raw_w > 0.0f ? raw_w : 0.0f;  /* ReLU-like */
            float bh = raw_h > 0.0f ? raw_h : 0.0f;  /* ReLU-like */
            if (bw > 1.0f) bw = 1.0f;
            if (bh > 1.0f) bh = 1.0f;
            /* Convert to x1, y1, x2, y2 */
            boxes[valid_count].x1 = bx - bw / 2.0f;
            boxes[valid_count].y1 = by - bh / 2.0f;
            boxes[valid_count].x2 = bx + bw / 2.0f;
            boxes[valid_count].y2 = by + bh / 2.0f;
            boxes[valid_count].conf = conf;
            boxes[valid_count].keep = 1;
            valid_count++;
        }
    }
}
After applying the post-processing step, we get:
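The `keep` flag on each box is intended for a non-maximum-suppression pass over the candidate detections. A minimal host-side sketch of that step (the 0.5 IoU threshold and the sample detections are assumptions for illustration):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def nms(boxes, iou_thresh=0.5):
    """boxes: list of (x1, y1, x2, y2, conf); keep the highest-confidence
    box in each group of strongly overlapping detections."""
    boxes = sorted(boxes, key=lambda b: b[4], reverse=True)
    kept = []
    for b in boxes:
        if all(iou(b, k) < iou_thresh for k in kept):
            kept.append(b)
    return kept

dets = [(0.10, 0.10, 0.50, 0.50, 0.9),
        (0.12, 0.10, 0.50, 0.52, 0.6),   # overlaps the first box
        (0.60, 0.60, 0.90, 0.90, 0.8)]
print(len(nms(dets)))  # the overlapping low-confidence box is suppressed
```

The same greedy loop ports directly to C on the target: sort by `conf`, then clear `keep` on any box whose IoU with a kept box exceeds the threshold.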
Later I also tried deploying a model with a 20*20*3 output that predicts only the XY coordinates: