JavaScript: controlling a web page with gestures

We share our experience implementing remote control and experimenting with different approaches, including computer vision. This article presents the results of our experiments with MediaPipe, Google's computer vision library.

While working on Stardio, we were tasked with implementing remote control of the application. We explored several options, and one of the approaches we experimented with was computer vision. In this article, I will share the results of our experiments with MediaPipe, a well-known computer vision library developed by Google.

Not long ago, controlling the content of a web page with gestures was something seen only in science fiction movies. Today, all you need to make it a reality is a video camera, a browser, and a library from Google. In this tutorial, we will demonstrate how to implement gesture control in pure JavaScript. We will use MediaPipe to detect and track hand gestures and npm to manage dependencies.

The sample code can be found in this repository.

Project preparation and setup

Step 1

Create a pure JS project with Vite using the vanilla template:

  • motion-controls - name of project
  • vanilla - name of template

yarn create vite motion-controls --template vanilla

Step 2

Go to the created directory, install the dependencies, and start the development server:

cd motion-controls
npm i
npm run dev

Step 3

Edit the content of body in index.html:

<video></video>
<canvas></canvas>
<script type="module" src="/js/get-video-data.js"></script>

Getting video data and rendering it on the canvas

Create a js directory at the root of the project and a get-video-data.js file in it.

Get references to the video and canvas elements, as well as to the 2D drawing context:

const video$ = document.querySelector("video");
const canvas$ = document.querySelector("canvas");
const ctx = canvas$.getContext("2d");

Define the width and height of the canvas, as well as the requirements (constraints) for the video data stream:

const width = 640;
const height = 480;

canvas$.width = width;
canvas$.height = height;

const constraints = {
  audio: false,
  video: { width, height },
};

Get access to the user's video input device using the getUserMedia method and pass the stream to the video element through its srcObject property. Once the metadata has loaded, start playing the video and call the requestAnimationFrame method, passing it the drawVideoFrame function as an argument:

navigator.mediaDevices
  .getUserMedia(constraints)
  .then((stream) => {
    video$.srcObject = stream;
    video$.onloadedmetadata = () => {
      video$.play();
      requestAnimationFrame(drawVideoFrame);
    };
  })
  .catch(console.error);
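The catch handler above only logs the raw error. As a minimal sketch, the rejection could be turned into a readable message based on the error name. The helper name describeCameraError and the message texts are assumptions made for illustration; the error names are the ones getUserMedia commonly rejects with.

```javascript
// Hypothetical helper: turns a getUserMedia rejection into a readable message.
function describeCameraError(error) {
  switch (error && error.name) {
    case "NotAllowedError":
      return "Camera access was denied. Allow it in the browser and reload the page.";
    case "NotFoundError":
      return "No camera was found. Connect a camera and reload the page.";
    case "NotReadableError":
      return "The camera could not be read. It may be in use by another application.";
    case "OverconstrainedError":
      return "The camera cannot satisfy the requested constraints.";
    default:
      return "Could not start the camera.";
  }
}
```

It could then replace the plain logger, for example: .catch((error) => console.error(describeCameraError(error), error)).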

Finally, we define the function to draw the video frame on the canvas using the drawImage method:

function drawVideoFrame() {
  ctx.drawImage(video$, 0, 0, width, height);
  requestAnimationFrame(drawVideoFrame);
}

Note

Calling requestAnimationFrame twice, once after the metadata loads and again inside drawVideoFrame, starts an endless animation loop at a device-specific frame rate, typically 60 frames per second (FPS). The frame rate can be adjusted using the timestamp argument passed to the requestAnimationFrame callback (example):

function drawVideoFrame(timestamp) {
  // ...
}
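One possible way to use the timestamp is to skip frames that arrive too early. The sketch below is an illustration rather than the code from the repository: createThrottledLoop, draw, fps, and schedule are hypothetical names, and schedule is injectable only so that the loop can be exercised outside the browser.

```javascript
// Hypothetical helper: builds a requestAnimationFrame callback
// that calls `draw` at most `fps` times per second.
function createThrottledLoop(
  draw,
  fps,
  schedule = (callback) => requestAnimationFrame(callback)
) {
  const minInterval = 1000 / fps; // minimum time between two draws, in ms
  let lastDrawTime = -Infinity; // makes sure the very first frame is drawn

  function loop(timestamp) {
    // draw only if enough time has passed since the previous draw
    if (timestamp - lastDrawTime >= minInterval) {
      lastDrawTime = timestamp;
      draw(timestamp);
    }
    // queue the next iteration either way
    schedule(loop);
  }

  return loop;
}
```

In the browser, the loop could be started with requestAnimationFrame(createThrottledLoop(() => ctx.drawImage(video$, 0, 0, width, height), 30)), which limits drawing to roughly 30 FPS.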

Result

[Image: the video stream rendered on the canvas]

Hand detection and tracking

To detect and track the hand, we need a few additional dependencies:

yarn add @mediapipe/camera_utils @mediapipe/drawing_utils @mediapipe/hands

MediaPipe Hands first detects the hands, then determines 21 key points (3D landmarks), which are joints, for each hand. Here's what it looks like:

 

Hand 3D landmarks

Create a track-hand-motions.js file in the js directory.

Import the dependencies:

import { Camera } from "@mediapipe/camera_utils";
import { drawConnectors, drawLandmarks } from "@mediapipe/drawing_utils";
import { Hands, HAND_CONNECTIONS } from "@mediapipe/hands";

The Camera constructor allows you to create instances to control a video camera and has the following signature:

export declare class Camera implements CameraInterface {
  constructor(video: HTMLVideoElement, options: CameraOptions);
  start(): Promise<void>;
  // We will not use this method
  stop(): Promise<void>;
}

The constructor takes a video element and the following settings:

export declare interface CameraOptions {
  // Callback for frame capture
  onFrame: () => Promise<void>| null;
  // camera
  facingMode?: 'user'|'environment';
  // width of frame
  width?: number;
  // height of frame
  height?: number;
}

The start method starts the frame capture process.

The Hands constructor allows you to create instances for detecting hands and has the following signature:

export declare class Hands implements HandsInterface {
  constructor(config?: HandsConfig);
  onResults(listener: ResultsListener): void;
  send(inputs: InputMap): Promise<void>;
  setOptions(options: Options): void;
  // other methods that we do not use
}

The constructor accepts the following config:

export interface HandsConfig {
  locateFile?: (path: string, prefix?: string) => string;
}

This callback resolves the location of the additional files needed to create an instance:

hand_landmark_lite.tflite
hands_solution_packed_assets_loader.js
hands_solution_simd_wasm_bin.js
hands.binarypb
hands_solution_packed_assets.data
hands_solution_simd_wasm_bin.wasm

The setOptions method allows you to set the following detection options:

export interface Options {
  selfieMode?: boolean;
  maxNumHands?: number;
  modelComplexity?: 0|1;
  minDetectionConfidence?: number;
  minTrackingConfidence?: number;
}

You can read about these settings here. We will set maxNumHands: 1 to detect only one hand and modelComplexity: 0 to improve performance at the expense of detection accuracy.

The send method is used to process a single frame of data. It is called in the onFrame callback of the Camera instance.

The onResults method accepts a callback to handle the hand detection results.

The drawLandmarks function allows you to draw hand keypoints and has the following signature:

export declare function drawLandmarks(
    ctx: CanvasRenderingContext2D, landmarks?: NormalizedLandmarkList,
    style?: DrawingOptions): void;

It accepts a drawing context, keypoints, and the following styles:

export declare interface DrawingOptions {
  color?: string|CanvasGradient|CanvasPattern|
      Fn<Data, string|CanvasGradient|CanvasPattern>;
  fillColor?: string|CanvasGradient|CanvasPattern|
      Fn<Data, string|CanvasGradient|CanvasPattern>;
  lineWidth?: number|Fn<Data, number>;
  radius?: number|Fn<Data, number>;
  visibilityMin?: number;
}

The drawConnectors function allows you to draw connection lines between keypoints and has the following signature:

export declare function drawConnectors(
    ctx: CanvasRenderingContext2D, landmarks?: NormalizedLandmarkList,
    connections?: LandmarkConnectionArray, style?: DrawingOptions): void;

It accepts a drawing context, keypoints, an array of start and end keypoint index pairs (HAND_CONNECTIONS), and styles.

Back to editing track-hand-motions.js:

const video$ = document.querySelector("video");
const canvas$ = document.querySelector("canvas");
const ctx = canvas$.getContext("2d");

const width = 640;
const height = 480;

canvas$.width = width;
canvas$.height = height;

Define the function for processing the results of hand detection:

function onResults(results) {
  // of the entire result object, we are only interested in the `multiHandLandmarks` property,
  // which contains arrays of control points of detected hands
  if (!results.multiHandLandmarks.length) return;

  // when 2 hands are found, for example, `multiHandLandmarks` will contain 2 arrays of control points
  console.log("@landmarks", results.multiHandLandmarks[0]);

  // draw a video frame
  ctx.save();
  ctx.clearRect(0, 0, width, height);
  ctx.drawImage(results.image, 0, 0, width, height);

  // iterate over arrays of keypoints
  // we could do without iteration since we only have one array,
  // but this solution is more flexible
  for (const landmarks of results.multiHandLandmarks) {
    // draw keypoints
    drawLandmarks(ctx, landmarks, { color: "#FF0000", lineWidth: 2 });
    // draw lines
    drawConnectors(ctx, landmarks, HAND_CONNECTIONS, {
      color: "#00FF00",
      lineWidth: 4,
    });
  }

  ctx.restore();
}

Create an instance to detect the hand, set the settings, and register the result handler:

const hands = new Hands({
  locateFile: (file) => `../node_modules/@mediapipe/hands/${file}`,
});

hands.setOptions({
  maxNumHands: 1,
  modelComplexity: 0,
});

hands.onResults(onResults);

Finally, we create an instance to control the video camera, register the handler, set the settings and start the frame capture process:

const camera = new Camera(video$, {
  onFrame: async () => {
    await hands.send({ image: video$ });
  },
  facingMode: undefined,
  width,
  height,
});

camera.start();

Please note: by default, the facingMode setting is set to user, meaning the source of video data is the front-facing laptop camera. Since in my case the source is a USB camera, the value of this setting should be undefined.

The array of control points of the detected hand looks like this:

 

[Image: console output with the array of control points]

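The indexes correspond to the joints of the hand, as shown in the hand landmarks image above. For example, the index of the first index finger joint from the top is 7. Each control point has x, y, and z coordinates. The x and y values are normalized by the frame width and height to the range from 0 to 1, while z is the depth relative to the wrist on roughly the same scale as x, so it can be negative.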
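Because x and y are normalized to the frame size, they have to be scaled before they can be used as canvas coordinates. A minimal sketch, assuming the 640 by 480 canvas from this example; toCanvasPoint is a hypothetical helper and the sample landmark is made up for illustration, not real model output:

```javascript
// Hypothetical helper: maps a normalized landmark to canvas pixel coordinates.
function toCanvasPoint(landmark, canvasWidth, canvasHeight) {
  return {
    x: Math.round(landmark.x * canvasWidth),
    y: Math.round(landmark.y * canvasHeight),
  };
}

// Made-up landmark for illustration
const sampleLandmark = { x: 0.5, y: 0.25, z: -0.03 };

console.log(toCanvasPoint(sampleLandmark, 640, 480)); // { x: 320, y: 120 }
```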

The result of executing the example code:

 

Defining the pinch gesture

A pinch gesture is bringing the tips of the index finger and the thumb fairly close together.

 

You may ask: what exactly counts as a close enough distance?
We decided to set this distance to 0.08 for both the x and y coordinates and 0.11 for the z coordinate. Personally, I agree with these values. Here is how it looks in code:

const distance = {
  x: Math.abs(fingerTip.x - thumbTip.x),
  y: Math.abs(fingerTip.y - thumbTip.y),
  z: Math.abs(fingerTip.z - thumbTip.z),
};

const areFingersCloseEnough =
  distance.x < 0.08 && distance.y < 0.08 && distance.z < 0.11;
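The thresholds are expressed in normalized units, so 0.08 corresponds to about 51 pixels on a 640-pixel-wide frame. To get a feel for the check, here is a small worked sketch with made-up coordinates; areCloseEnough and PINCH_THRESHOLD are hypothetical names, and the values are the ones used in the snippet above:

```javascript
// Thresholds from the snippet above, collected in one (hypothetically named) object
const PINCH_THRESHOLD = { x: 0.08, y: 0.08, z: 0.11 };

// Returns true when the two points are closer than the threshold on every axis.
function areCloseEnough(a, b, threshold = PINCH_THRESHOLD) {
  return ["x", "y", "z"].every(
    (axis) => Math.abs(a[axis] - b[axis]) < threshold[axis]
  );
}

// Made-up coordinates for illustration
const sampleThumbTip = { x: 0.5, y: 0.43, z: -0.02 };
const nearFingerTip = { x: 0.52, y: 0.4, z: -0.05 };
const farFingerTip = { x: 0.7, y: 0.4, z: -0.05 };

console.log(areCloseEnough(nearFingerTip, sampleThumbTip)); // true
console.log(areCloseEnough(farFingerTip, sampleThumbTip)); // false
```

Keeping the thresholds in one object makes them easier to tune for a particular camera and distance from the screen.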

A few more important things:

  • we want to register and handle the start, continuation, and end of a pinch (pinch_start, pinch_move, and pinch_stop, respectively);
  • to determine the transition of a pinch from one state to another (beginning -> end, or vice versa), it is required to save the previous state;
  • the transition detection must be performed with some delay, for example 250 ms.

Create a detect-pinch-gesture.js file in the js directory.
The beginning of the code is similar to the code of the previous example:

import { Camera } from "@mediapipe/camera_utils";
import { Hands } from "@mediapipe/hands";

const video$ = document.querySelector("video");

const width = window.innerWidth;
const height = window.innerHeight;

const handParts = {
  wrist: 0,
  thumb: { base: 1, middle: 2, topKnuckle: 3, tip: 4 },
  indexFinger: { base: 5, middle: 6, topKnuckle: 7, tip: 8 },
  middleFinger: { base: 9, middle: 10, topKnuckle: 11, tip: 12 },
  ringFinger: { base: 13, middle: 14, topKnuckle: 15, tip: 16 },
  pinky: { base: 17, middle: 18, topKnuckle: 19, tip: 20 },
};

const hands = new Hands({
  locateFile: (file) => `../node_modules/@mediapipe/hands/${file}`,
});

hands.setOptions({
  maxNumHands: 1,
  modelComplexity: 0,
});

hands.onResults(onResults);

const camera = new Camera(video$, {
  onFrame: async () => {
    await hands.send({ image: video$ });
  },
  facingMode: undefined,
  width,
  height,
});

camera.start();

const getFingerCoords = (landmarks) =>
  landmarks[handParts.indexFinger.topKnuckle];

function onResults(handData) {
  if (!handData.multiHandLandmarks.length) return;

  updatePinchState(handData.multiHandLandmarks[0]);
}

Define event types, delay and pinch state:

const PINCH_EVENTS = {
  START: "pinch_start",
  MOVE: "pinch_move",
  STOP: "pinch_stop",
};

const OPTIONS = {
  PINCH_DELAY_MS: 250,
};

const state = {
  isPinched: false,
  pinchChangeTimeout: null,
};

Declare a pinch detection function:

function isPinched(landmarks) {
  const fingerTip = landmarks[handParts.indexFinger.tip];
  const thumbTip = landmarks[handParts.thumb.tip];
  // without both tips there is no pinch; return an explicit boolean
  // so that the comparison with the previous state stays reliable
  if (!fingerTip || !thumbTip) return false;

  const distance = {
    x: Math.abs(fingerTip.x - thumbTip.x),
    y: Math.abs(fingerTip.y - thumbTip.y),
    z: Math.abs(fingerTip.z - thumbTip.z),
  };

  const areFingersCloseEnough =
    distance.x < 0.08 && distance.y < 0.08 && distance.z < 0.11;

  return areFingersCloseEnough;
}

Define a function that creates a custom event using the CustomEvent constructor and dispatches it using the dispatchEvent method:

// the function takes the name of the event and the data - the coordinates of the finger
function triggerEvent({ eventName, eventData }) {
  const event = new CustomEvent(eventName, { detail: eventData });

  document.dispatchEvent(event);
}

Define a pinch state update function:

function updatePinchState(landmarks) {
  // determine the previous state
  const wasPinchedBefore = state.isPinched;
  // determine the beginning or end of the pinch
  const isPinchedNow = isPinched(landmarks);
  // define a state transition
  const hasPassedPinchThreshold = isPinchedNow !== wasPinchedBefore;
  // determine the state update delay
  const hasWaitStarted = !!state.pinchChangeTimeout;

  // if there is a state transition and we are not in idle mode
  if (hasPassedPinchThreshold && !hasWaitStarted) {
    // call the corresponding event with a delay
    registerChangeAfterWait(landmarks, isPinchedNow);
  }

  // if the state remains the same
  if (!hasPassedPinchThreshold) {
    // cancel standby mode
    cancelWaitForChange();

    // if the pinch continues
    if (isPinchedNow) {
      // trigger the corresponding event
      triggerEvent({
        eventName: PINCH_EVENTS.MOVE,
        eventData: getFingerCoords(landmarks),
      });
    }
  }
}

We define the functions for updating the state and canceling the wait:

function registerChangeAfterWait(landmarks, isPinchedNow) {
  state.pinchChangeTimeout = setTimeout(() => {
    state.isPinched = isPinchedNow;
    triggerEvent({
      eventName: isPinchedNow ? PINCH_EVENTS.START : PINCH_EVENTS.STOP,
      eventData: getFingerCoords(landmarks),
    });
  }, OPTIONS.PINCH_DELAY_MS);
}

function cancelWaitForChange() {
  clearTimeout(state.pinchChangeTimeout);
  state.pinchChangeTimeout = null;
}

Define the handlers for the beginning, continuation, and end of the pinch (we simply print the coordinates of the upper joint of the index finger to the console and change the background color when the pinch ends):

function onPinchStart(eventInfo) {
  const fingerCoords = eventInfo.detail;
  console.log("Pinch started", fingerCoords);
}

function onPinchMove(eventInfo) {
  const fingerCoords = eventInfo.detail;
  console.log("Pinch moved", fingerCoords);
}

function onPinchStop(eventInfo) {
  const fingerCoords = eventInfo.detail;
  console.log("Pinch stopped", fingerCoords);
  // change background color on STOP;
  // pad to 6 hex digits, otherwise small numbers produce an invalid color
  document.body.style.backgroundColor =
    "#" + Math.floor(Math.random() * 16777215).toString(16).padStart(6, "0");
}

And register them:

document.addEventListener(PINCH_EVENTS.START, onPinchStart);
document.addEventListener(PINCH_EVENTS.MOVE, onPinchMove);
document.addEventListener(PINCH_EVENTS.STOP, onPinchStop);
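Beyond logging, the event data can drive something visible on the page. The sketch below is an illustration under stated assumptions, not code from the repository: toViewportPosition, clamp, and createMoveHandler are hypothetical helpers, and the x axis is mirrored on the assumption that the camera image is not flipped (selfieMode is not set), so that the element follows the hand like a reflection.

```javascript
// Hypothetical helper: keeps a value inside the [min, max] range.
function clamp(value, min, max) {
  return Math.min(Math.max(value, min), max);
}

// Hypothetical helper: converts normalized pinch coordinates to a viewport position.
function toViewportPosition(coords, viewportWidth, viewportHeight, mirror = true) {
  // assumption: the frame is not mirrored, so x is flipped here
  const x = mirror ? 1 - coords.x : coords.x;

  return {
    left: clamp(x, 0, 1) * viewportWidth,
    top: clamp(coords.y, 0, 1) * viewportHeight,
  };
}

// Browser-only: builds a pinch_move handler that moves an absolutely positioned element.
function createMoveHandler(element) {
  return (event) => {
    const { left, top } = toViewportPosition(
      event.detail,
      window.innerWidth,
      window.innerHeight
    );
    element.style.left = `${left}px`;
    element.style.top = `${top}px`;
  };
}
```

With an absolutely positioned element on the page (a hypothetical #cursor, for instance), the handler could be registered as document.addEventListener("pinch_move", createMoveHandler(document.querySelector("#cursor"))).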

Result on the video:

https://www.youtube.com/watch?v=KsLQRb6BhbI

Conclusion

Now that we have reached this point, we can interact with our web application however we desire. This includes changing the state and interacting with HTML elements, among other things.

As you can see, the potential applications for this technology are virtually limitless, so feel free to explore and experiment with it.

That concludes what I wanted to share with you in this tutorial. I hope you found it informative and engaging, and that it has enriched your knowledge in some way. Thank you for your attention, and happy coding!
