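STM32N6 Development Diary (10): Training and Deploying a Flame Detection Model

The STM32N6 is the first STM32 product from STMicroelectronics to integrate ST's in-house neural processing unit. Its heterogeneous "MCU + NPU" architecture pushes the compute limits of edge AI and sits at the leading edge of ST's MCU technology stack. It is also demanding to work with: it calls for solid STM32 experience plus a grasp of basic neural network concepts, so the barrier to entry is high.

Since the STM32N6 was released, I have been lucky enough to get an STM32N6570-DK development board and have been tinkering with it in my spare time. I will be publishing my notes on using the STM32N6 in installments, as a reference for future users.

This installment covers deploying and testing a YOLOv8n flame detection model on the STM32N6.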

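1. Building the model

First, we shoot a video that contains the flame target and extract frames from it with a script, saving them as images. This gives a little over a hundred images in total.

Next, we annotate the images with a labeling tool and export the bounding-box labels in YOLO format.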
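For anyone who has not looked inside a YOLO label file: each line describes one box as a class index followed by the box center, width, and height, all normalized to the image size. The helper below is only a sketch of that conversion; the function name and the sample numbers are made up for illustration and are not part of the labeling tool.

```python
def to_yolo_line(class_id, x1, y1, x2, y2, img_w, img_h):
    """Convert a pixel-space corner box to one YOLO-format label line."""
    cx = (x1 + x2) / 2.0 / img_w   # normalized box center x
    cy = (y1 + y2) / 2.0 / img_h   # normalized box center y
    w = (x2 - x1) / img_w          # normalized box width
    h = (y2 - y1) / img_h          # normalized box height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# Hypothetical example: one flame box in a 640x480 frame, class 0
print(to_yolo_line(0, 100, 50, 300, 250, 640, 480))
# prints: 0 0.312500 0.312500 0.312500 0.416667
```

Each image gets one .txt file with the same base name, containing one such line per labeled box.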

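Note that for real-world use, the more data you collect, the better.

Once the labels are exported, we have an image folder and a label folder, and can move on to configuring training.

The dataset paths are configured as shown above, and the paths in dataset.yaml must match that layout.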
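As a rough illustration of what such a dataset.yaml contains, the sketch below builds the text of a minimal Ultralytics dataset file. The keys path, train, val, and names are the ones Ultralytics reads; the folder names and the class name "fire" are placeholders for illustration, not the actual values used in this project.

```python
def build_dataset_yaml(root, train_dir, val_dir, class_names):
    """Return the text of a minimal Ultralytics dataset.yaml."""
    lines = [
        f"path: {root}",        # dataset root directory
        f"train: {train_dir}",  # training images, relative to path
        f"val: {val_dir}",      # validation images, relative to path
        "names:",               # class index -> class name
    ]
    lines += [f"  {i}: {name}" for i, name in enumerate(class_names)]
    return "\n".join(lines) + "\n"


# Hypothetical layout: ./datasets/fire/images/{train,val} with one class
text = build_dataset_yaml("./datasets/fire", "images/train", "images/val", ["fire"])
print(text)
```

Ultralytics looks for the label files in a labels folder that mirrors the images folder, which is why the two folders have to be laid out consistently.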

The training script is as follows:

from ultralytics import YOLO


def run_embedded_optimization():
    # 1. Load the pretrained model
    model = YOLO('yolov8n.pt')
    # 2. Start training
    model.train(
        data='./dataset.yaml',
        epochs=300,
        imgsz=320,      # matches the 320x320 input used on the STM32N6
        batch=16,
        name='STM32N6_Opt_V8_End2End',
        patience=50,    # stop early after 50 epochs without improvement
        device=0,
        optimizer="AdamW",
        lr0=0.001,
        lrf=0.01,
    )


if __name__ == '__main__':
    run_embedded_optimization()

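Training produces best.pt, the best-performing checkpoint.

As an aside, the figure above actually shows that the model has overfitted, mainly because the dataset is too small.

The STM32 toolchain mainly supports three model formats: h5, onnx, and tflite. In my experience, tflite is the most comfortable to work with.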

import os
import sys
import subprocess
import shutil
import glob
import numpy as np
import cv2


def export_pt_to_onnx(pt_path, imgsz=320, simplify=True):
    from ultralytics import YOLO
    model = YOLO(pt_path)
    onnx_path = model.export(
        format='onnx',
        imgsz=imgsz,
        simplify=simplify,
        opset=12,
        dynamic=False,   # fixed input shape
    )
    return onnx_path


def onnx_to_tflite_uint8(onnx_path, output_dir, representative_images_dir=None, imgsz=320):
    import onnx
    from onnx_tf.backend import prepare
    import tensorflow as tf
    os.makedirs(output_dir, exist_ok=True)
    base_name = os.path.splitext(os.path.basename(onnx_path))[0]
    saved_model_path = os.path.join(output_dir, 'saved_model_temp')
    # ONNX -> TensorFlow SavedModel
    onnx_model = onnx.load(onnx_path)
    tf_rep = prepare(onnx_model)
    tf_rep.export_graph(saved_model_path)
    # SavedModel -> quantized TFLite with uint8 input and float32 output
    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_path)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.uint8
    converter.inference_output_type = tf.float32
    # Note: full-integer quantization also needs converter.representative_dataset
    # to be set before convert(); that part is not shown in this listing.
    tflite_model = converter.convert()
    # (the listing ends here)

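Here we use a Python script that goes through ONNX as an intermediate format to convert the .pt file to tflite, applying int8 quantization in the same step.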
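To make "int8 quantization" a bit more concrete: TFLite stores each tensor as 8-bit integers plus a scale and a zero point, with real_value = (q - zero_point) * scale. The snippet below is just that arithmetic in plain Python, not the converter's code; the scale of 1/256 and zero point of -128 are illustrative values I picked so the numbers come out exact.

```python
def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a real value to an int8 code using affine quantization."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))   # saturate to the int8 range


def dequantize(q, scale, zero_point):
    """Recover the approximate real value from an int8 code."""
    return (q - zero_point) * scale


# Illustrative parameters covering roughly the range 0..1
SCALE, ZERO_POINT = 1 / 256, -128
for value in (0.0, 0.5, 1.0):
    q = quantize(value, SCALE, ZERO_POINT)
    print(value, "->", q, "->", dequantize(q, SCALE, ZERO_POINT))
```

The converter picks the real scale and zero point per tensor from the value ranges it observes on sample inputs, which is why the calibration images should resemble what the camera will actually see.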

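2. Deployment and use on the STM32

Next, in STM32CubeMX, import the quantized tflite model into the CubeAI middleware.

After quantization, the model occupies 3.19 MB of Flash and 2.05 MB of RAM. After clicking Generate Code, switch to the user file:

Then prepare an image, convert it into a C array, and save it.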
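If you do not have a converter tool at hand, the array can also be generated with a few lines of Python. The sketch below turns raw bytes into a C array definition; it assumes you have already decoded and resized the picture to 320x320 RGB (for example with OpenCV) and only shows the formatting step. The function name is hypothetical, and the default array name simply matches the gImage_images symbol used in the firmware.

```python
def bytes_to_c_array(data, name="gImage_images", per_line=12):
    """Format raw bytes as a C array definition."""
    rows = []
    for i in range(0, len(data), per_line):
        chunk = data[i:i + per_line]
        rows.append("    " + ", ".join(f"0x{b:02X}" for b in chunk) + ",")
    body = "\n".join(rows)
    return f"const unsigned char {name}[{len(data)}] = {{\n{body}\n}};\n"


# Tiny demo with three bytes; a real 320x320 RGB image is 320*320*3 = 307200 bytes
print(bytes_to_c_array(bytes([0, 127, 255]), per_line=3))
```

Write the returned string into a .c or .h file and add it to the project.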

  uint32_t buff_in_len, buff_out_len;
  LL_ATON_RT_RetValues_t ll_aton_rt_ret = LL_ATON_RT_DONE;
  const LL_Buffer_InfoTypeDef *ibuffersInfos = NN_Interface_Default.input_buffers_info();
  const LL_Buffer_InfoTypeDef *obuffersInfos = NN_Interface_Default.output_buffers_info();
  /* Locate the network's input and output buffers and their sizes */
  buffer_in = (uint8_t *)LL_Buffer_addr_start(&ibuffersInfos[0]);
  buffer_out = (uint8_t *)LL_Buffer_addr_start(&obuffersInfos[0]);
  buff_in_len = ibuffersInfos->offset_end - ibuffersInfos->offset_start;
  buff_out_len = obuffersInfos->offset_end - obuffersInfos->offset_start;
  /* Copy the test image (320 x 320 x 3 bytes) into the input buffer */
  memcpy(buffer_in, gImage_images, 320*320*3);
  SCB_CleanDCache_by_Addr((uint32_t*)buffer_in, buff_in_len);       // write the cache back to memory
  SCB_InvalidateDCache_by_Addr((uint32_t*)buffer_in, buff_in_len);  // force later reads to come from memory
  LL_ATON_RT_RuntimeInit();

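Before running inference, the input has to be preprocessed. The model's input can be inspected in CubeMX:

The model takes a 320*320*3 uint8 input. After a Transpose layer, a Cast layer in epoch_2 normalizes the values to the range 0~1, so we do not have to write the normalization logic ourselves.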
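The input size also determines the output size. YOLOv8 predicts on three feature maps with strides 8, 16, and 32, so a 320x320 input gives 40x40 + 20x20 + 10x10 candidate boxes. That is where the 2100 in the post-processing loop comes from. A quick check (the helper name is mine):

```python
def yolov8_num_predictions(imgsz, strides=(8, 16, 32)):
    """Number of candidate boxes YOLOv8 outputs for a square input."""
    return sum((imgsz // s) ** 2 for s in strides)


print(yolov8_num_predictions(320))  # 40*40 + 20*20 + 10*10 = 2100
print(yolov8_num_predictions(640))  # 80*80 + 40*40 + 20*20 = 8400
```

With a single class, each candidate carries 4 box values plus 1 score, so the output tensor holds 5 x 2100 floats.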

for (int inferenceNb = 0; inferenceNb < 1; ++inferenceNb) {
    LL_ATON_RT_Init_Network(&NN_Instance_Default);  // Initialize passed network instance object
    do {
      /* Execute first/next step */
      ll_aton_rt_ret = LL_ATON_RT_RunEpochBlock(&NN_Instance_Default);
      /* Wait for next event */
      if (ll_aton_rt_ret == LL_ATON_RT_WFE) {
        LL_ATON_OSAL_WFE();
      }
    } while (ll_aton_rt_ret != LL_ATON_RT_DONE);
    /* (the original listing is cut off at this point) */
}

Then we wait for the model to finish running.

float *floatout = (float *)buffer_out;
int valid_count = 0;
/* Output layout is [5][2100]: cx, cy, w, h, conf for each of the 2100 candidates */
for (int i = 0; i < 2100; ++i) {
    float cx   = floatout[i + 0 * 2100];
    float cy   = floatout[i + 1 * 2100];
    float w    = floatout[i + 2 * 2100];
    float h    = floatout[i + 3 * 2100];
    float conf = floatout[i + 4 * 2100];
    if (conf > 0.1f && valid_count < 2100) {
        printf("[%d] [%f %f %f %f %f]\r\n", i, cx, cy, w, h, conf);
        /* Convert center/size to corner coordinates */
        boxes[valid_count].x1 = cx - w / 2.0f;
        boxes[valid_count].y1 = cy - h / 2.0f;
        boxes[valid_count].x2 = cx + w / 2.0f;
        boxes[valid_count].y2 = cy + h / 2.0f;
        boxes[valid_count].conf = conf;
        boxes[valid_count].keep = 1;
        valid_count++;
    }
}
/* Suppress overlapping boxes. Boxes are not sorted by confidence here,
   so of two overlapping boxes the one with the lower index is kept. */
for (int i = 0; i < valid_count; i++) {
    if (boxes[i].keep) {
        for (int j = i + 1; j < valid_count; j++) {
            if (boxes[j].keep) {
                float x1 = (boxes[i].x1 > boxes[j].x1) ? boxes[i].x1 : boxes[j].x1;
                float y1 = (boxes[i].y1 > boxes[j].y1) ? boxes[i].y1 : boxes[j].y1;
                float x2 = (boxes[i].x2 < boxes[j].x2) ? boxes[i].x2 : boxes[j].x2;
                float y2 = (boxes[i].y2 < boxes[j].y2) ? boxes[i].y2 : boxes[j].y2;
                float iw = x2 - x1;
                float ih = y2 - y1;
                /* Clamp each side separately: two negative sides would
                   otherwise multiply into a positive area */
                float intersection = (iw > 0 && ih > 0) ? iw * ih : 0;
                float area_i = (boxes[i].x2 - boxes[i].x1) * (boxes[i].y2 - boxes[i].y1);
                float area_j = (boxes[j].x2 - boxes[j].x1) * (boxes[j].y2 - boxes[j].y1);
                float union_area = area_i + area_j - intersection;
                float iou = (union_area > 0) ? (intersection / union_area) : 0;
                if (iou > 0.7f) {
                    boxes[j].keep = 0;
                }
            }
        }
    }
}

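This is the post-processing step: among the boxes that pass the confidence threshold, overlapping ones are suppressed based on IoU, and the result is then displayed via DMA.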
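For checking the firmware's output on a PC, the same idea can be sketched in Python. This is the textbook greedy NMS, which sorts by confidence first so the strongest box of each cluster survives; it is a reference sketch with made-up sample boxes, not a line-by-line port of the C code.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2, ...) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, iou_thresh=0.7):
    """Greedy NMS over (x1, y1, x2, y2, conf) tuples, highest confidence first."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        # Keep a box only if it does not overlap too much with a stronger one
        if all(iou(box, k) <= iou_thresh for k in kept):
            kept.append(box)
    return kept


# Hypothetical detections: two near-duplicates and one separate box
boxes = [(10, 10, 110, 110, 0.6), (12, 12, 112, 112, 0.9), (200, 200, 260, 260, 0.5)]
print(nms(boxes))
# prints: [(12, 12, 112, 112, 0.9), (200, 200, 260, 260, 0.5)]
```

The 0.7 threshold matches the one used on the board; lowering it removes more overlapping boxes.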
