Video Object Detection

Introduction

This project is an interactive object detection system designed for robotics applications. It recognizes specific command words such as “Start,” “Stop,” “Go,” “Turn Left,” “Turn Right,” and “Turn Back” from a live or recorded video stream. Once these commands are detected, bounding boxes with descriptive labels are overlaid on the video directly in the web browser.

Technology Stack

  • Python & YOLOv8: The model is trained on custom data using YOLOv8 and then converted to the ONNX format.
  • ONNX Runtime for Web: The converted model runs directly in the browser via ONNX Runtime Web's JavaScript API, enabling fast inference without a heavy backend.
  • JavaScript & Web Workers: Video capture, canvas drawing, and inference initiation are handled by JavaScript, while a dedicated Web Worker offloads the model inference.
  • Cloud Deployment: Deploying the application in the cloud minimizes the need for a costly Python-based backend, as the heavy lifting is done client-side.

Code Explanation

Main JavaScript

The main JavaScript code manages video capture, drawing on the canvas, and communication with the Web Worker for inference.

Video & Canvas Setup


  const video = document.querySelector("video");
  const canvas = document.querySelector("canvas");
  // Size the canvas to the video's intrinsic dimensions; videoWidth and
  // videoHeight are only known once the video's metadata has loaded
  video.addEventListener("loadedmetadata", () => {
    canvas.width = video.videoWidth;
    canvas.height = video.videoHeight;
  });

These lines select the video and canvas elements from the DOM and, once the video's metadata has loaded, size the canvas to match the video dimensions so that the bounding boxes line up with the video stream.
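
The write-up assumes the video element already has a source. For a live camera feed, a minimal sketch along these lines, using the standard getUserMedia API, would attach the stream (a recorded video would instead set the element's src attribute):

  // Attach the user's camera to the video element for live detection
  navigator.mediaDevices.getUserMedia({ video: true })
    .then((stream) => {
      video.srcObject = stream;
      video.play();
    })
    .catch((err) => console.error("Camera access was denied:", err));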

Frame Processing & Inference Trigger


  const worker = new Worker("model_processing.js");
  let boxes = [];
  let busy = false;
  let interval;

  video.addEventListener("play", () => {
    const context = canvas.getContext("2d");
    interval = setInterval(() => {
      // Draw the current frame, then capture it for the model BEFORE
      // overlaying the previous detection boxes, so the boxes are never
      // fed back into the model's input
      context.drawImage(video, 0, 0);
      const input = prepare_input(canvas);
      draw_boxes(canvas, boxes);
      if (!busy) {
        worker.postMessage(input);
        busy = true;
      }
    }, 30);
  });

  // Stop the processing loop when playback pauses
  video.addEventListener("pause", () => clearInterval(interval));

Every 30 milliseconds, the current frame is drawn onto the canvas and captured by prepare_input (which resizes and normalizes it) before the previous detections are overlaid, so the box graphics never leak into the model's input. The prepared frame is then sent to the Web Worker for inference, but only if the worker is not already busy with the previous frame; the pause handler stops the loop when playback ends.

Handling Inference Results


  worker.onmessage = (event) => {
    // Raw model output arrives from the worker; convert it to boxes
    const output = event.data;
    boxes = process_output(output, canvas.width, canvas.height);
    busy = false;
  };

When the worker returns the inference data, process_output() interprets the raw output and turns the predictions into bounding boxes with labels and confidence scores; the handler then clears the busy flag so the next frame can be submitted.

Utility Functions

  • prepare_input(img): Creates a temporary canvas, resizes the image to 640 × 640, extracts the pixel data, and normalizes it for the model.
  • process_output(output, img_width, img_height): Converts the raw inference output into bounding boxes, filtering by a confidence threshold and applying non-maximum suppression.
  • draw_boxes(canvas, boxes): Draws the detected bounding boxes and labels on the canvas for visual feedback. (Sketches of all three follow below.)
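
Since the write-up only summarizes these helpers, here is a minimal sketch of how they might look. It assumes the stock YOLOv8 export layout (a flattened [1, 4 + num_classes, 8400] output for 640 × 640 input); the class order, the 0.5 confidence threshold, and the 0.7 IoU cutoff are illustrative and must match the trained model:

  // Resize the frame to the model's 640x640 input and convert the
  // interleaved RGBA bytes to planar RGB floats in [0, 1]
  function prepare_input(img) {
    const tmp = document.createElement("canvas");
    tmp.width = tmp.height = 640;
    const ctx = tmp.getContext("2d");
    ctx.drawImage(img, 0, 0, 640, 640);
    const pixels = ctx.getImageData(0, 0, 640, 640).data;
    const red = [], green = [], blue = [];
    for (let i = 0; i < pixels.length; i += 4) {
      red.push(pixels[i] / 255);
      green.push(pixels[i + 1] / 255);
      blue.push(pixels[i + 2] / 255);
    }
    return [...red, ...green, ...blue]; // matches the [1, 3, 640, 640] tensor
  }

  // Illustrative class order; this must match the label order used in training
  const classes = ["Start", "Stop", "Go", "Turn Left", "Turn Right", "Turn Back"];

  function process_output(output, img_width, img_height) {
    let boxes = [];
    for (let i = 0; i < 8400; i++) {
      // Find the highest-scoring class for this candidate box
      let class_id = 0, prob = 0;
      for (let c = 0; c < classes.length; c++) {
        const score = output[8400 * (c + 4) + i];
        if (score > prob) { prob = score; class_id = c; }
      }
      if (prob < 0.5) continue;
      // Box center/size come back in 640x640 model space; rescale to the canvas
      const xc = output[i], yc = output[8400 + i];
      const w = output[2 * 8400 + i], h = output[3 * 8400 + i];
      boxes.push([
        (xc - w / 2) / 640 * img_width, (yc - h / 2) / 640 * img_height,
        (xc + w / 2) / 640 * img_width, (yc + h / 2) / 640 * img_height,
        classes[class_id], prob
      ]);
    }
    // Non-maximum suppression: keep the best box, drop heavily overlapping ones
    boxes.sort((a, b) => b[5] - a[5]);
    const result = [];
    while (boxes.length > 0) {
      result.push(boxes[0]);
      boxes = boxes.filter((box) => iou(boxes[0], box) < 0.7);
    }
    return result;
  }

  // Intersection-over-union of two [x1, y1, x2, y2, ...] boxes
  function iou(a, b) {
    const ix = Math.max(0, Math.min(a[2], b[2]) - Math.max(a[0], b[0]));
    const iy = Math.max(0, Math.min(a[3], b[3]) - Math.max(a[1], b[1]));
    const inter = ix * iy;
    const areaA = (a[2] - a[0]) * (a[3] - a[1]);
    const areaB = (b[2] - b[0]) * (b[3] - b[1]);
    return inter / (areaA + areaB - inter);
  }

  // Overlay boxes and labels on the canvas
  function draw_boxes(canvas, boxes) {
    const ctx = canvas.getContext("2d");
    ctx.strokeStyle = "#00FF00";
    ctx.lineWidth = 3;
    ctx.font = "18px serif";
    for (const [x1, y1, x2, y2, label] of boxes) {
      ctx.strokeRect(x1, y1, x2 - x1, y2 - y1);
      ctx.fillStyle = "#00FF00";
      ctx.fillRect(x1, y1, ctx.measureText(label).width + 10, 25);
      ctx.fillStyle = "#000000";
      ctx.fillText(label, x1 + 5, y1 + 18);
    }
  }

Note that prepare_input returns a plain array, which matches the Float32Array.from(input) conversion already performed in the worker shown below.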

Web Worker Script: model_processing.js

The Web Worker handles the heavy inference task: loading the ONNX model and running the input tensor through it.


  // Import ONNX Runtime for Web
  importScripts("https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.min.js");

  let model = null;

  onmessage = async (event) => {
    const input = event.data;
    const output = await run_model(input);
    // Send the raw predictions back to the main thread
    postMessage(output);
  };

  async function run_model(input) {
    // Lazily load the model on the first inference only
    if (!model) {
      model = await ort.InferenceSession.create("start_stop.onnx");
    }
    // Wrap the normalized pixels in a tensor of shape [1, 3, 640, 640]
    input = new ort.Tensor(Float32Array.from(input), [1, 3, 640, 640]);
    const outputs = await model.run({ images: input });
    return outputs["output0"].data;
  }

The worker loads the model lazily, only once on the first inference, ensuring efficient resource usage. The input is wrapped in an ONNX-compatible tensor, and the resulting predictions are sent back to the main thread for post-processing.

Results & Advantages

This system demonstrates real-time detection of command words in video streams processed entirely in the browser. The main advantages of this method include:

  • Cost Efficiency & Scalability: In-browser inference with ONNX Runtime removes the need for an expensive Python backend; hosting costs drop, and the system scales naturally because every client performs its own inference.
  • Responsive User Experience: Offloading model inference to a Web Worker allows the main UI thread to remain responsive, crucial for smooth video playback and real-time visual feedback.
  • Platform Agility: Converting the model to ONNX means the same model file runs across platforms, from desktop to mobile browsers, without retraining or per-platform ports.
  • Ease of Integration: Leveraging HTML, JavaScript, and cloud deployment offers a straightforward integration path for modern robotics dashboards and interactive applications.

This approach bridges the gap between powerful, custom-trained models and accessible, real-time web interfaces, enabling cost-effective, scalable robotics applications.