JavaScript: controlling a web page with gestures

Our experience implementing remote control and experimenting with different approaches, including Computer Vision. In this article, we share the results of our experiments with Google's MediaPipe library for Computer Vision.

Aleksey Andruschenko

Full-Stack Developer

02 Mar 2023
11 min read

During our work on Stardio, we were tasked with implementing remote control of the application. We explored different implementation options, and one of the approaches we experimented with was Computer Vision. In this article, I will share the results of our experiments with one of the well-known Computer Vision libraries, MediaPipe, which is developed by Google.

In the past, controlling the content of a web page using gestures was only seen in science fiction movies. But nowadays, all you need to make it a reality is a video camera, a browser, and a library from Google. In this tutorial, we will be demonstrating how to implement gesture control using pure JavaScript. To detect and track hand gestures, we will be using MediaPipe, and to manage dependencies, we will be using npm.

The sample code can be found in this repository.

Project preparation and setup

Step 1

Create a pure JS project with Vite using the vanilla template:

yarn create vite motion-controls --template vanilla

  • motion-controls - the project name
  • vanilla - the template name

Step 2

Go to the created directory, install the dependencies, and start the development server:

cd motion-controls
npm i
npm run dev

Step 3

Edit the content of the body in index.html:

<video></video>
<canvas></canvas>

<script type="module" src="/js/get-video-data.js"></script>

Getting video data and rendering it on the canvas

Create a js directory at the root of the project and a get-video-data.js file in it.

Get references to the video and canvas elements, as well as to the 2D drawing context:

const video$ = document.querySelector("video");
const canvas$ = document.querySelector("canvas");
const ctx = canvas$.getContext("2d");

Define the width and height of the canvas, as well as the requirements (constraints) for the video data stream:

const width = 640;
const height = 480;

canvas$.width = width;
canvas$.height = height;

const constraints = {
  audio: false,
  video: { width, height },
};

Get access to the user's video input device using the getUserMedia method and pass the stream to the video element via the srcObject property. Once the metadata has loaded, we start playing the video and call requestAnimationFrame, passing it the drawVideoFrame function as an argument:

navigator.mediaDevices
  .getUserMedia(constraints)
  .then((stream) => {
    video$.srcObject = stream;

    video$.onloadedmetadata = () => {
      video$.play();

      requestAnimationFrame(drawVideoFrame);
    };
  })
  .catch(console.error);

Finally, we define the function to draw the video frame on the canvas using the drawImage method:

function drawVideoFrame() {
  ctx.drawImage(video$, 0, 0, width, height);

  requestAnimationFrame(drawVideoFrame);
}

Note

Calling requestAnimationFrame both in the metadata handler and inside drawVideoFrame creates an infinite animation loop that runs at a device-specific frame rate, typically 60 frames per second (FPS). The frame rate can be adjusted using the timestamp argument passed to the requestAnimationFrame callback:

function drawVideoFrame(timestamp) {
  // ...
}
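
For instance, here is a minimal sketch of throttling using the timestamp (the 30 FPS target is an arbitrary value for illustration):

let lastDrawTime = 0;
const frameIntervalMs = 1000 / 30; // target roughly 30 FPS

function drawVideoFrame(timestamp) {
  // only draw if enough time has passed since the last drawn frame
  if (timestamp - lastDrawTime >= frameIntervalMs) {
    lastDrawTime = timestamp;
    ctx.drawImage(video$, 0, 0, width, height);
  }

  requestAnimationFrame(drawVideoFrame);
}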

Result

Hand detection and tracking

To detect and track the hand, we need a few additional dependencies:

yarn add @mediapipe/camera_utils @mediapipe/drawing_utils @mediapipe/hands

MediaPipe Hands first detects the hands and then determines 21 key points (3D landmarks) for each hand, corresponding to the hand's joints. Here's what it looks like:

Hand 3D landmarks

Create a track-hand-motions.js file in the js directory.

Import the dependencies:

import { Camera } from "@mediapipe/camera_utils";
import { drawConnectors, drawLandmarks } from "@mediapipe/drawing_utils";
import { Hands, HAND_CONNECTIONS } from "@mediapipe/hands";

The Camera constructor allows you to create instances to control a video camera and has the following signature:

export declare class Camera implements CameraInterface {
  constructor(video: HTMLVideoElement, options: CameraOptions);
  start(): Promise<void>;
  // We will not use this method
  stop(): Promise<void>;
}

The constructor takes a video element and the following settings:

export declare interface CameraOptions {
  // Callback for frame capture
  onFrame: () => Promise<void>| null;
  // which camera to use: front ('user') or rear ('environment')
  facingMode?: 'user'|'environment';
  // width of frame
  width?: number;
  // height of frame
  height?: number;
}

The start method starts the frame capture process.
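
To make the signature concrete, here is a minimal sketch of how these pieces fit together (the full wiring, with a real frame handler, appears later in the article):

const videoEl = document.querySelector("video");

const camera = new Camera(videoEl, {
  // called for every captured frame; this is where the frame will be handed to MediaPipe
  onFrame: async () => {},
  width: 640,
  height: 480,
});

camera.start(); // begins the frame capture process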

The Hands constructor allows you to create instances for detecting hands and has the following signature:

export declare class Hands implements HandsInterface {
  constructor(config?: HandsConfig);
  onResults(listener: ResultsListener): void;
  send(inputs: InputMap): Promise<void>;
  setOptions(options: Options): void;
  // other methods we will not use
}

The constructor takes the following config:

export interface HandsConfig {
  locateFile?: (path: string, prefix?: string) => string;
}

This callback tells the library where to load the additional files needed to create an instance:

hand_landmark_lite.tflite
hands_solution_packed_assets_loader.js
hands_solution_simd_wasm_bin.js
hands.binarypb
hands_solution_packed_assets.data
hands_solution_simd_wasm_bin.wasm
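
For example, a sketch of a locateFile implementation that loads these assets from the jsDelivr CDN instead of node_modules (this URL pattern follows MediaPipe's documentation; in production you would pin a specific package version):

const hands = new Hands({
  // resolve every requested asset against the CDN copy of @mediapipe/hands
  locateFile: (file) => `https://cdn.jsdelivr.net/npm/@mediapipe/hands/${file}`,
});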

The setOptions method allows you to set the following detection options:

export interface Options {
  selfieMode?: boolean;
  maxNumHands?: number;
  modelComplexity?: 0|1;
  minDetectionConfidence?: number;
  minTrackingConfidence?: number;
}

You can read more about these settings in the MediaPipe Hands documentation. We will set maxNumHands: 1 to detect only one hand and modelComplexity: 0 to improve performance at the expense of detection accuracy.
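
For instance, a sketch of a slightly stricter configuration (the confidence values are illustrative; both thresholds default to 0.5):

// assuming the `hands` instance created later in this article
hands.setOptions({
  maxNumHands: 1,              // track a single hand
  modelComplexity: 0,          // lighter model: faster, slightly less accurate
  minDetectionConfidence: 0.7, // illustrative: require more confident detections
  minTrackingConfidence: 0.7,  // illustrative: re-run detection sooner when tracking degrades
});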

The send method is used to process a single frame of data. It is called inside the onFrame callback passed to the Camera constructor.

The onResults method accepts a callback to handle the hand detection results.

The drawLandmarks method allows you to draw the hand keypoints and has the following signature:

export declare function drawLandmarks(
    ctx: CanvasRenderingContext2D, landmarks?: NormalizedLandmarkList,
    style?: DrawingOptions): void;

It accepts a drawing context, keypoints, and the following styles:

export declare interface DrawingOptions {
  color?: string|CanvasGradient|CanvasPattern|
      Fn<Data, string|CanvasGradient|CanvasPattern>;
  fillColor?: string|CanvasGradient|CanvasPattern|
      Fn<Data, string|CanvasGradient|CanvasPattern>;
  lineWidth?: number|Fn<Data, number>;
  radius?: number|Fn<Data, number>;
  visibilityMin?: number;
}
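
The color, fillColor, lineWidth, and radius fields can also be callbacks. Here is a hedged sketch in the style of Google's own demo code; we assume the callback receives a data object whose from field is the current landmark, and we use the lerp helper exported by @mediapipe/drawing_utils:

import { drawLandmarks, lerp } from "@mediapipe/drawing_utils";

// inside onResults, for one hand's landmarks:
drawLandmarks(ctx, landmarks, {
  color: "#FF0000",
  // points closer to the camera (smaller z) get a larger radius
  radius: (data) => lerp(data.from.z, -0.15, 0.1, 10, 1),
});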

The drawConnectors method allows you to draw connection lines between keypoints and has the following signature:

export declare function drawConnectors(
    ctx: CanvasRenderingContext2D, landmarks?: NormalizedLandmarkList,
    connections?: LandmarkConnectionArray, style?: DrawingOptions): void;

It takes a drawing context, keypoints, an array of index pairs that defines which keypoints to connect (HAND_CONNECTIONS), and styles.

Back to editing track-hand-motions.js:

const video$ = document.querySelector("video");
const canvas$ = document.querySelector("canvas");
const ctx = canvas$.getContext("2d");
const width = 640;
const height = 480;
canvas$.width = width;
canvas$.height = height;

We define the function for processing the results of hand detection:

function onResults(results) {
  // of the entire results object, we are only interested in the `multiHandLandmarks` property,
  // which contains an array of keypoints for each detected hand
  if (!results.multiHandLandmarks.length) return;

  // if, for example, 2 hands are detected, `multiHandLandmarks` will contain 2 arrays of keypoints
  console.log("@landmarks", results.multiHandLandmarks[0]);

  // draw a video frame
  ctx.save();
  ctx.clearRect(0, 0, width, height);
  ctx.drawImage(results.image, 0, 0, width, height);

  // iterate over the arrays of keypoints
  // (we could do without iteration since we only have one array,
  // but this solution is more flexible)
  for (const landmarks of results.multiHandLandmarks) {
    // draw keypoints
    drawLandmarks(ctx, landmarks, { color: "#FF0000", lineWidth: 2 });
    // draw lines
    drawConnectors(ctx, landmarks, HAND_CONNECTIONS, {
      color: "#00FF00",
      lineWidth: 4,
    });
  }

  ctx.restore();
}

Create an instance to detect the hand, set the settings, and register the result handler:

const hands = new Hands({
  locateFile: (file) => `../node_modules/@mediapipe/hands/${file}`,
});
hands.setOptions({
  maxNumHands: 1,
  modelComplexity: 0,
});
hands.onResults(onResults);

Finally, we create an instance to control the video camera, register the handler, set the settings and start the frame capture process:

const camera = new Camera(video$, {
  onFrame: async () => {
    await hands.send({ image: video$ });
  },
  facingMode: undefined,
  width,
  height,
});
camera.start();

Please note: by default, the facingMode setting is set to user, meaning the source of the video data is the laptop's front (user-facing) camera. Since in my case the source is a USB camera, I set this option to undefined.

Our handler logs the array of keypoints of the detected hand to the console; it contains 21 entries, one per hand joint.


The indexes correspond to the joints of the hand, as shown in the image above. For example, the index of the topmost joint of the index finger (just below the tip) is 7. For each keypoint, x and y are normalized to the range from 0 to 1 by the frame width and height, while z represents the depth relative to the wrist (the smaller the value, the closer the point is to the camera).
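
Since x and y are normalized, converting a keypoint to canvas pixel coordinates is a simple multiplication by the canvas size. A minimal sketch (toPixels is our own hypothetical helper, not part of MediaPipe):

// convert a normalized landmark to pixel coordinates on our 640x480 canvas
function toPixels(landmark) {
  return {
    x: landmark.x * width,
    y: landmark.y * height,
  };
}

// e.g. the tip of the index finger is landmark 8
// const { x, y } = toPixels(landmarks[8]);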

The result of executing the example code:

Defining the pinch gesture

As a gesture, a pinch is bringing the tips of the index finger and thumb close enough together.

You might ask: what exactly counts as a close enough distance? We decided to define this threshold as 0.08 for the x and y coordinates and 0.11 for the z coordinate. In code, it looks like this:

const distance = {
  x: Math.abs(fingerTip.x - thumbTip.x),
  y: Math.abs(fingerTip.y - thumbTip.y),
  z: Math.abs(fingerTip.z - thumbTip.z),
};

const areFingersCloseEnough =
  distance.x < 0.08 && distance.y < 0.08 && distance.z < 0.11;

A few more important things:

  • we want to register and handle the start, continuation, and end of a pinch (pinch_start, pinch_move, and pinch_stop, respectively);
  • to determine the transition of a pinch from one state to another (start -> stop, or vice versa), we need to save the previous state;
  • the transition detection must be performed with some delay, for example 250 ms.

Create a detect-pinch-gesture.js file in the js directory. The beginning of the code is identical to the code of the previous example:

import { Camera } from "@mediapipe/camera_utils";
import { Hands } from "@mediapipe/hands";

const video$ = document.querySelector("video");

const width = window.innerWidth;
const height = window.innerHeight;

const handParts = {
  wrist: 0,
  thumb: { base: 1, middle: 2, topKnuckle: 3, tip: 4 },
  indexFinger: { base: 5, middle: 6, topKnuckle: 7, tip: 8 },
  middleFinger: { base: 9, middle: 10, topKnuckle: 11, tip: 12 },
  ringFinger: { base: 13, middle: 14, topKnuckle: 15, tip: 16 },
  pinky: { base: 17, middle: 18, topKnuckle: 19, tip: 20 },
};

const hands = new Hands({
  locateFile: (file) => `../node_modules/@mediapipe/hands/${file}`,
});
hands.setOptions({
  maxNumHands: 1,
  modelComplexity: 0,
});
hands.onResults(onResults);

const camera = new Camera(video$, {
  onFrame: async () => {
    await hands.send({ image: video$ });
  },
  facingMode: undefined,
  width,
  height,
});
camera.start();

const getFingerCoords = (landmarks) =>
  landmarks[handParts.indexFinger.topKnuckle];

function onResults(handData) {
  if (!handData.multiHandLandmarks.length) return;

  updatePinchState(handData.multiHandLandmarks[0]);
}

Define event types, delay and pinch state:

const PINCH_EVENTS = {
  START: "pinch_start",
  MOVE: "pinch_move",
  STOP: "pinch_stop",
};

const OPTIONS = {
  PINCH_DELAY_MS: 250,
};

const state = {
  isPinched: false,
  pinchChangeTimeout: null,
};

Declare a pinch detection function:

function isPinched(landmarks) {
  const fingerTip = landmarks[handParts.indexFinger.tip];
  const thumbTip = landmarks[handParts.thumb.tip];
  if (!fingerTip || !thumbTip) return;

  const distance = {
    x: Math.abs(fingerTip.x - thumbTip.x),
    y: Math.abs(fingerTip.y - thumbTip.y),
    z: Math.abs(fingerTip.z - thumbTip.z),
  };

  const areFingersCloseEnough =
    distance.x < 0.08 && distance.y < 0.08 && distance.z < 0.11;

  return areFingersCloseEnough;
}

Define a function that creates a custom event using the CustomEvent constructor and dispatches it using the dispatchEvent method:

// the function takes the name of the event and the data - the coordinates of the finger
function triggerEvent({ eventName, eventData }) {
  const event = new CustomEvent(eventName, { detail: eventData });
  document.dispatchEvent(event);
}

Define a pinch state update function:

function updatePinchState(landmarks) {
  // determine the previous state
  const wasPinchedBefore = state.isPinched;
  // determine the beginning or end of the pinch
  const isPinchedNow = isPinched(landmarks);
  // define a state transition
  const hasPassedPinchThreshold = isPinchedNow !== wasPinchedBefore;
  // check whether a delayed state change is already pending
  const hasWaitStarted = !!state.pinchChangeTimeout;

  // if there is a state transition and we are not already waiting
  if (hasPassedPinchThreshold && !hasWaitStarted) {
    // call the corresponding event with a delay
    registerChangeAfterWait(landmarks, isPinchedNow);
  }

  // if the state remains the same
  if (!hasPassedPinchThreshold) {
    // cancel the pending state change
    cancelWaitForChange();

    // if the pinch continues
    if (isPinchedNow) {
      // trigger the corresponding event
      triggerEvent({
        eventName: PINCH_EVENTS.MOVE,
        eventData: getFingerCoords(landmarks),
      });
    }
  }
}

We define the functions for updating the state and canceling the wait:

function registerChangeAfterWait(landmarks, isPinchedNow) {
  state.pinchChangeTimeout = setTimeout(() => {
    state.isPinched = isPinchedNow;

    triggerEvent({
      eventName: isPinchedNow ? PINCH_EVENTS.START : PINCH_EVENTS.STOP,
      eventData: getFingerCoords(landmarks),
    });
  }, OPTIONS.PINCH_DELAY_MS);
}

function cancelWaitForChange() {
  clearTimeout(state.pinchChangeTimeout);
  state.pinchChangeTimeout = null;
}

We define the handlers for the beginning, continuation and end of the pinch (we simply print the coordinates of the upper joint of the index finger to the console):

function onPinchStart(eventInfo) {
  const fingerCoords = eventInfo.detail;
  console.log("Pinch started", fingerCoords);
}

function onPinchMove(eventInfo) {
  const fingerCoords = eventInfo.detail;
  console.log("Pinch moved", fingerCoords);
}

function onPinchStop(eventInfo) {
  const fingerCoords = eventInfo.detail;
  console.log("Pinch stopped", fingerCoords);
  // change the background color to a random one on STOP
  // (padStart ensures the hex string always has 6 digits)
  document.body.style.backgroundColor =
    "#" + Math.floor(Math.random() * 0xffffff).toString(16).padStart(6, "0");
}

And register them:

document.addEventListener(PINCH_EVENTS.START, onPinchStart);
document.addEventListener(PINCH_EVENTS.MOVE, onPinchMove);
document.addEventListener(PINCH_EVENTS.STOP, onPinchStop);

The result on video:

https://www.youtube.com/watch?v=KsLQRb6BhbI

Conclusion

Now that we have reached this point, we can interact with our web application however we desire. This includes changing the state and interacting with HTML elements, among other things.
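
For example, here is a minimal sketch of dragging an element with a pinch, built on the events from detect-pinch-gesture.js (the #cursor element and its absolute positioning are our own assumptions, not part of the example repository):

const cursor$ = document.querySelector("#cursor"); // hypothetical absolutely positioned element

document.addEventListener("pinch_move", (event) => {
  // event.detail contains the normalized coordinates of the index finger knuckle
  const { x, y } = event.detail;

  // mirror x so the element follows the hand like a mirror image
  // (drop the `1 -` if your video is not mirrored)
  cursor$.style.left = `${(1 - x) * window.innerWidth}px`;
  cursor$.style.top = `${y * window.innerHeight}px`;
});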

As you can see, the potential applications for this technology are virtually limitless, so feel free to explore and experiment with it.

That concludes what I wanted to share with you in this tutorial. I hope you found it informative and engaging, and that it has enriched your knowledge in some way. Thank you for your attention, and happy coding!