TPC26 Tapis Tutorial GitHub public.tapis.io

Section 6: Ultralytics Fine-Tuning App

Lecture Slides

This application allows users to fine-tune Ultralytics YOLO26 models using Singularity containers in a batch processing environment. It is designed to run on High-Performance Computing (HPC) systems via Tapis, leveraging GPU acceleration for training tasks.

Note: This app is already registered for the tutorial and is available to run via the Tapis UI.


Locating the App and Configuring Job Submission

Log in to public.tapis.io using your TACC username and password.

Go to the App tab and find the app named yolo-finetuning-arm64.

App definition

Click the Submit Job button, and then click the USE GUIDED JOB LAUNCHER button.

USE GUIDED JOB LAUNCHER

Now we are in the job configuration interface. Click Continue on the job summary page.

In the Execution Options page, select the following:

  1. Execution System - vista-test-nairr
  2. Job Type - Batch
  3. Batch Logical Queue - gh

Click Continue

Click Continue

Click Continue

Click Continue

Four environment variables matter for the fine-tuning job.

  1. EPOCHS - the number of training epochs (full passes over the training dataset). 10 or 20 is a good number.
  2. YOLO_26_MODEL - the YOLO model name. Here we use yolo26n for the best trade-off between quality and speed.
  3. TWO_STAGE_FINE_TUNE - if true, fine-tuning runs in two stages. The first stage freezes the backbone and trains only the neck and head, allowing the detection layers to adapt to the new classes without disrupting pretrained features. The second stage unfreezes all layers and trains the full model with a lower learning rate to refine the backbone for the target domain.
  4. The freeze parameter accepts an integer. For example, freeze=10 freezes the first 10 layers (0 through 9, which correspond to the backbone in YOLO26). Freezing layers speeds up training and reduces overfitting when the dataset is small relative to the model capacity.
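
The two-stage process described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the app's actual training script: the FREEZE variable name, the even split of epochs between stages, the single-pass behaviour when TWO_STAGE_FINE_TUNE is false, the learning-rate values, and the `<model name>.pt` weights file are all hypothetical. The `freeze`, `lr0`, `optimizer`, and `epochs` arguments are standard Ultralytics training arguments; an explicit optimizer is set because Ultralytics picks its own learning rate when the optimizer is left on "auto".

```python
import os


def build_stage_plan(env=None):
    """Turn the job's environment variables into per-stage training settings."""
    env = os.environ if env is None else env
    epochs = int(env.get("EPOCHS", "10"))
    two_stage = env.get("TWO_STAGE_FINE_TUNE", "true").strip().lower() == "true"
    # FREEZE is a hypothetical variable name; the tutorial only calls it "freeze".
    freeze = int(env.get("FREEZE", "10"))

    if not two_stage:
        # Assumed single-pass behaviour: train once with the backbone frozen.
        return [{"epochs": epochs, "freeze": freeze}]

    first = max(1, epochs // 2)      # assumed even split between the stages
    second = max(1, epochs - first)  # each stage runs for at least one epoch
    return [
        # Stage 1: backbone frozen, only the neck and head are updated.
        {"epochs": first, "freeze": freeze, "optimizer": "SGD", "lr0": 0.01},
        # Stage 2: nothing frozen, lower learning rate (value is illustrative).
        {"epochs": second, "optimizer": "SGD", "lr0": 0.001},
    ]


def run_fine_tuning(data_yaml, env=None):
    """Run each stage in turn, starting every stage from the previous best weights."""
    from ultralytics import YOLO  # imported lazily so the plan can be built without it

    env = os.environ if env is None else env
    weights = env.get("YOLO_26_MODEL", "yolo26n") + ".pt"  # assumed weights file name
    for stage in build_stage_plan(env):
        model = YOLO(weights)
        model.train(data=data_yaml, **stage)
        weights = str(model.trainer.best)  # best.pt written by this stage
    return weights
```

Calling `build_stage_plan({"EPOCHS": "10"})` returns two stages of 5 epochs each, the first with `freeze=10` and the second with no frozen layers. `run_fine_tuning` needs the `ultralytics` package and a dataset YAML file, so it is only meaningful on a GPU node.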

Keep all of these settings as they are, and click Continue.

Expand the TACC Resource Allocation and Reservation Name options.

  1. For TACC Resource Allocation, type a space and then TRA24006 after -A, so the field reads -A TRA24006
  2. For Reservation Name, type a space and then your reservation code after --reservation

⚠️ Reservation Info
Allocation Code: TRA24006
Sunday sessions: Tapis+Tutorial-Sun
Monday sessions: Tapis+Tutorial-Mon
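
For readers who prefer scripting, the choices made in the guided launcher map onto a Tapis job request roughly as follows. This is a sketch, not a dump of what the UI actually sends: the job name is made up, and the app version is left as a parameter because the tutorial does not state it. The field names (`appId`, `execSystemId`, `jobType`, `execSystemLogicalQueue`, `parameterSet.schedulerOptions`) follow the Tapis v3 Jobs request schema.

```python
import json


def build_job_request(app_version, reservation, allocation="TRA24006"):
    """Assemble a Tapis v3 job request mirroring the guided launcher choices."""
    return {
        "name": "yolo-finetuning-tutorial",  # hypothetical job name
        "appId": "yolo-finetuning-arm64",
        "appVersion": app_version,           # not stated in the tutorial
        "execSystemId": "vista-test-nairr",
        "jobType": "BATCH",
        "execSystemLogicalQueue": "gh",
        "parameterSet": {
            "schedulerOptions": [
                # Note the space between the flag and its value.
                {"name": "TACC Resource Allocation", "arg": f"-A {allocation}"},
                {"name": "Reservation Name", "arg": f"--reservation {reservation}"},
            ],
        },
    }


if __name__ == "__main__":
    # "0.0.0" is a placeholder version, not the app's real version.
    print(json.dumps(build_job_request("0.0.0", "Tapis+Tutorial-Sun"), indent=2))
```

With an authenticated tapipy client `t`, a request like this could presumably be submitted with `t.jobs.submitJob(**request)`; the environment variables are omitted here because the tutorial keeps their defaults.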

Click Continue

Click Continue

Submit the job

Click Submit Job to submit your job.

The job takes roughly 5-10 minutes to run. Depending on how long it waits in the queue, the total time can be longer.
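
Because the total time depends on the queue, a script that submits the job would typically poll its status until it reaches a terminal state. The helper below is a small sketch of that loop. `wait_for_job` and its parameters are illustrative names, and the status source is passed in as a function so the loop can be tried without a Tapis connection. FINISHED, FAILED, and CANCELLED are the terminal job states in Tapis v3.

```python
import time

TERMINAL_STATES = {"FINISHED", "FAILED", "CANCELLED"}


def wait_for_job(get_status, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll get_status() until the job reaches a terminal state or the timeout passes."""
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
```

With tapipy, `get_status` could be something like `lambda: t.jobs.getJobStatus(jobUuid=job_uuid).status`, where `t` is an authenticated client and `job_uuid` comes from the submission response; both names are assumptions here.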

Once the job has finished, open the tapisjob.out file to view the output. At the end of the output, you should see a message indicating that the fine-tuned models have been saved to FlexServ’s private model pool ($SCRATCH/flexserv/models).

Up Next

In the prompt engineering section, we will use a coding LLM in FlexServ to generate Python code that calls the YOLO inference API in FlexServ to run object detection with both the base yolo26n model and the fine-tuned yolo26n model. We can then compare the accuracy of the two models.