- title: How PlayBook Processes User Input
- dated: Jan-Feb '26
The first half of this note is basically battle scars from the wrapper, and the second half is basically “don’t do React”.
PlayBook is a web app that can run in any browser, but we use it inside a special wrapper iPad app. The iPad app captures finger motions, pencil strokes, and other physical inputs with higher fidelity than the browser APIs allow, forwarding these inputs to a web view running PlayBook. This hybrid gives us the richness and capability of a native app with the portability and rapid iteration of a web app.
All inputs to the PlayBook web app pass through three processing stages: event capture, input enrichment, and gesture routing. The rest of this note will describe the function and design of these stages.
Terminology:
- Event — the measured or predicted position of a finger or pencil at one moment in time
- Touch — a series of events that represent the motion of a finger or the pencil
- Gesture — code that decides how to turn one or more touches into effects
Event capture
PlayBook renders at 60 Hz, but the browser or wrapper delivers events at up to 240 Hz. For performance reasons, the event capture system queues events to be processed on the next frame, rather than acting on them immediately. When PlayBook is run on a conventional computer, the event capture system also transforms mouse and keyboard input into simulated finger and pencil input.
Here’s the metadata included with each event:
type NativeEvent = {
  id: TouchId // Stable identity shared by all events that make up one touch
  type: "pencil" | "finger"
  phase: "hover" | "began" | "moved" | "ended" | "risen"
  predicted: boolean // true if the event is a hypothetical future position
  position: Position // position in screen space
  worldPos: Position // position in world space — added during input enrichment
  pressure: number
  altitude: number
  azimuth: number
  rollAngle: number
  radius: number
  z: number // Pencil hover height
  timestamp: number
}
The pencil uses all 5 phases (“risen” occurs when the pencil moves far enough away that we no longer count it as hovering), while fingers only use the middle three.
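As a rough illustration of the queueing described above, here's a minimal sketch of a per-frame event queue. The class name and its methods are assumptions made for the sake of the example, not PlayBook's actual code.

// Hypothetical queue: events arrive at up to 240 Hz, but are only drained
// once per rendered frame (60 Hz) and handed to input enrichment.
class EventQueue {
  private pending: NativeEvent[] = []

  // Called whenever the browser or wrapper delivers an event
  push(event: NativeEvent) {
    this.pending.push(event)
  }

  // Called at the start of each frame; returns the batch and clears the queue
  drain(): NativeEvent[] {
    const batch = this.pending
    this.pending = []
    return batch
  }
}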
Input enrichment
By the time the next frame begins, the queue usually contains 4-6 events per touch and 8-12 for the pencil. For a handful of reasons, these events arrive in an incoherent order, so the first step is to organize them — sorted by timestamp and grouped together by touch ID.
About half of the events are real, and the other half are predicted. The wrapper generates predicted events by estimating where the next real finger or pencil event will occur, and these predictions can be used to reduce perceptual latency. In the relatively long time between rendered frames, most predictions will already be obviated by later real events. The input enrichment system filters out stale and low-quality predictions, keeping at most one prediction per touch.
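Here's a minimal sketch of that reordering and filtering step. The function shape and names are illustrative, and the staleness rule shown (keep a prediction only if it's newer than the touch's last real event) is an assumption about what counts as "stale".

function organizeEvents(batch: NativeEvent[]): Map<TouchId, NativeEvent[]> {
  const byTouch = new Map<TouchId, NativeEvent[]>()
  // Sort by timestamp, then group by touch ID
  for (const event of [...batch].sort((a, b) => a.timestamp - b.timestamp)) {
    const events = byTouch.get(event.id) ?? []
    events.push(event)
    byTouch.set(event.id, events)
  }
  // Keep at most one prediction per touch: the freshest one, and only if it is
  // newer than the last real event we've seen for that touch
  for (const [id, events] of byTouch) {
    const real = events.filter(e => !e.predicted)
    const lastReal = real.at(-1)
    const freshestPrediction = events.filter(e => e.predicted).at(-1)
    if (freshestPrediction && (!lastReal || freshestPrediction.timestamp > lastReal.timestamp)) {
      real.push(freshestPrediction)
    }
    byTouch.set(id, real)
  }
  return byTouch
}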
This enrichment stage also fixes a few kinds of bogus data (eg: “hover” after “began” due to UIKit handling touches and hovers separately) and precomputes some commonly used derived values (eg: worldPos, which is position converted from screen space to world space).
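For example, the worldPos precomputation might look something like the following, assuming Position is an {x, y} pair and a simple camera model with a pan offset and a uniform zoom factor. The Camera type and the math are assumptions; PlayBook's actual camera isn't described in this note.

// Assumed camera model: a pan offset and a uniform zoom factor
type Camera = { pan: Position; zoom: number }

function screenToWorld(screen: Position, camera: Camera): Position {
  return {
    x: screen.x / camera.zoom + camera.pan.x,
    y: screen.y / camera.zoom + camera.pan.y,
  }
}

// During enrichment, each event's derived value is precomputed:
// event.worldPos = screenToWorld(event.position, camera)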
The most important job of the input enrichment system is to maintain a struct of continually evolving metadata for each active touch:
type TouchState = {
  beganEvent?: NativeEvent // The event when the touch first contacted the screen (absent while the pencil is only hovering)
  lastEvent: NativeEvent // The previous "real" (non-predicted) event for this touch
  dragDist: number // The distance between the beganEvent and lastEvent
  drag: boolean // Has the touch moved at least a tiny bit since it began?
  // The following relate to a special "firm press" pencil motion
  averageVel: Averager
  averagePressure: Averager
  pressureSpikeAt: number | null
  firmPress: boolean
}
This struct contains state that gestures would otherwise need to track for themselves. For instance, a gesture can:
- compare the beganEvent to the current event to measure the overall direction that the touch has moved
- compare lastEvent to the current event to measure instantaneous velocity
- check the drag boolean to decide whether the touch should count as a tap or a drag (sketched below)
- check if drag is true and dragDist is near zero, meaning that the touch has drawn a closed path
- perform a special action when the user does a firm press — a quick, click-like press with the pencil
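For example, a gesture's ended handler could classify a touch using just these fields. This is a hedged sketch: the real shape of EventContext isn't shown in this note, so the example declares its own minimal stand-in, and the threshold value is made up.

// Assumed shape — the real EventContext likely carries more than this
type ExampleEventContext = { event: NativeEvent; state: TouchState }

const CLOSED_PATH_THRESHOLD = 4 // screen-space px; an illustrative value

function classifyEndedTouch(ctx: ExampleEventContext): "tap" | "closed path" | "drag" {
  if (!ctx.state.drag) return "tap" // never moved appreciably
  if (ctx.state.dragDist < CLOSED_PATH_THRESHOLD) return "closed path" // moved, but ended near where it began
  return "drag"
}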
Gesture routing
Gestures are bundles of code that turn a sequence of events into specific actions. They are organized by a routing system that has a list of all the gestures supported by PlayBook — gestures for clicking, drawing, selecting, rotating, duplicating, panning, and so forth.
const gestureClasses = {
  finger: [Click, Interact, CloseSettings, OpenSettings, CloseDebug, OpenDebug, Pan, DuplicateSelection, RotateSelection],
  pencil: [Click, Interact, DuplicateSelection, MoveSelection, Draw]
}
interface GestureClass {
  // Offer an unclaimed touch to the class — it may return an instance to claim the touch
  offer?(ctx: EventContext): Gesture | void
}
Gesture classes are kept in a linear priority list, separately for finger and pencil. When a new touch begins, it’s offered to each of the gesture classes. The first to accept claims the touch and receives all its subsequent events. Gestures can do whatever they want internally — including morphing into a different gesture (eg: a pan becoming a pinch-zoom) — but they don’t need to know about each other. The only shared structure is the ordered list and the offer() protocol.
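To make the routing concrete, here's a minimal sketch of the claiming loop, assuming the router keeps a map from touch ID to the gesture instance that claimed it. The function name and the map are illustrative rather than PlayBook's actual code.

// Offer a brand-new touch to the gesture classes in priority order.
// The first class to return an instance claims the touch.
function routeNewTouch(
  touchId: TouchId,
  ctx: EventContext,
  classes: GestureClass[],
  activeGestures: Map<TouchId, Gesture>,
) {
  for (const gestureClass of classes) {
    const gesture = gestureClass.offer?.(ctx)
    if (gesture) {
      // All subsequent events for this touch go to the claiming gesture
      activeGestures.set(touchId, gesture)
      return
    }
  }
  // No class claimed the touch; assumed here that its events are simply dropped
}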
Inside a gesture, instance methods respond to the different phases that events pass through over the lifecycle of the touch.
// Instance methods for gestures
export interface Gesture {
  // Called for all events of the given phase
  hover?(ctx: EventContext): Gesture | void
  began?(ctx: EventContext): Gesture | void
  moved?(ctx: EventContext): Gesture | void
  ended?(ctx: EventContext): void
  risen?(ctx: EventContext): void
  // Called when a firm press is detected
  firmPress?(ctx: EventContext): Gesture | void
  // Called every frame
  tick?(ctx: PlaybookContext): void
  // Existing gestures get first dibs on all new touches
  offer?(ctx: EventContext): boolean | void
}
See that last instance method, offer? That’s similar to the static offer method shown previously. When a new touch begins, before offering it to the gesture classes, the system offers it to all existing gesture instances that implement this method. If any of them accept it (by returning true), that gesture becomes a de facto multitouch gesture, receiving the events from both touches.
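As a hedged illustration of that instance-level offer, here's roughly how a pan gesture might grow into a pinch-zoom by claiming a second finger. The assumption that EventContext exposes the triggering event as ctx.event, and the empty camera-manipulation bodies, are placeholders rather than PlayBook's real implementation.

class PanExample implements Gesture {
  private touchIds: TouchId[] = []

  began(ctx: EventContext) {
    this.touchIds.push(ctx.event.id) // assuming ctx.event is the triggering NativeEvent
  }

  // Called with new touches before the gesture classes see them;
  // returning true claims the second touch for this instance
  offer(ctx: EventContext): boolean {
    if (this.touchIds.length === 1 && ctx.event.type === "finger") {
      this.touchIds.push(ctx.event.id)
      return true // now a de facto two-finger pinch-zoom gesture
    }
    return false
  }

  moved(ctx: EventContext) {
    if (this.touchIds.length === 1) {
      // one finger: pan the camera by the finger's movement
    } else {
      // two fingers: zoom the camera by the change in distance between them
    }
  }

  ended(ctx: EventContext) {
    this.touchIds = this.touchIds.filter(id => id !== ctx.event.id)
  }
}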
Design
The way we handle user input has evolved considerably since our earlier experiments with Crosscut and Inkling, and through previous versions of PlayBook itself.
The current design has the following goals:
- You can write a new gesture without thinking about other gestures.
- Gestures can do anything they want internally.
- It’s easy to figure out what an existing gesture does by glancing at the code.
- It’s easy to figure out which gestures are active while using PlayBook.
- The design gently guides you toward writing gestures that satisfy the above goals.
We found it hard to satisfy these goals using the typical approaches of existing input / gesture handling systems.
- Retained Mode systems like the DOM, which use something like addEventListener(), encourage you to write gestures in terms of persistent objects that exist on screen and can be hit tested against, and force these objects to coordinate according to the rendering tree that events flow through (ie: in advanced use you end up leaning on capture / bubbling phases). This conflates rendering and input handling — we want those to be decoupled. You can do that by adding all your listeners to a top-level object like window, but then you've just bypassed the entire event system and need to come up with some bespoke solution.
- Immediate Mode systems like Dear ImGui and (arguably) React offer an attractive alternative to retained mode systems. They deeply conflate rendering and input handling, for instance by asking you to write a function that looks at the state of all input and then chooses what things to render. Immediate mode is even worse for our needs than retained mode. We want to write a new gesture without having to know what any other gestures are doing — we want to encapsulate gesture code. Immediate mode has input handling logic splayed out across the rendering code. The system is oriented along an entirely different axis than the one we want.
- There are many ways to use state machines to solve part of the problem. One popular approach is to have one or a few state machines that track the current mode(s) of the app, and use the input events to transition the machine or perform actions based on the current state. Given that one of the jobs of our gesture system is to route events, a state machine might make sense — it's a popular way to route navigation in web frameworks, after all. For gestures, you might have a state machine for viewport panning/zooming that looks something like inert <-> panning <-> pinch-zooming, where you move between these states by placing or removing fingers on the screen. Where state machines get tricky is that some of our gestures are mutually exclusive, some gestures are multi-touch, some gestures require a finger and the pencil, some gestures only happen within a specific spatial region, some gestures have behaviour that evolves over time, etc. There are a lot of potentially complex interactions between the gestures. Gestures coexist and overlap. Building one or more big state machines to describe how they relate to one another is a very top-down way to control what is possible. That makes sense if you already have a complete design for all the gestures and want to ensure that all of their possible interactions are fully modelled and understood. We're in the opposite position — we're trying new gestures all the time, and our design keeps evolving. One similar option we haven't tried is statecharts, which may improve composability, but they'd still require top-down enumeration of some state space.
- Another approach is to use a state machine within each gesture. We have a page turn gesture that moves through a number of internal modes. When you swipe your finger in from the edge of the screen, this gesture measures the direction and distance of the touch. When the touch has moved far enough from the edge, and in a nearly perpendicular direction, the gesture signals to the camera that it should now follow the finger. If the camera is looking at the last page in the notebook, and the user is swiping toward the next page (which doesn't exist), then after the camera crosses the halfway point a new page is created. When the user releases their finger, the gesture considers the velocity and direction of finger motion to determine whether to continue moving to the next page, or remain on the current page (removing any new page that might have been created). Each of these modes of the gesture could be modelled as states in a state machine. Our system encourages gestures to do anything they want internally, so while we don't use any state machine of this form, we could if we needed to, and that'd be harmonious with all the other gestures (which also do whatever they want internally).
- UIKit uses a hybrid of the above two state machine approaches, where each GestureRecognizer instance has internal states like "pending", "active", "ended", "cancelled", "failed", and so forth. When a new touch begins, all the views that pass a hit test instantiate all their GestureRecognizers, which all begin in a "pending" state. With each new event, they all have the opportunity to transition into an "active" state, at which point all the other GestureRecognizer instances are "failed" and no longer receive events, except when one uses the succinctly named method shouldRecognizeSimultaneouslyWith. This design has some nice qualities, like deferring the choice of gesture until (potentially) long after the touch has begun, and allowing multiple gestures to run in parallel. We achieve similar results, but without requiring gestures to use a prescribed set of states, and without the spooky action at a distance effect on other gestures. In our system, each gesture decides for itself what internal states to use, and if a few gestures need to coordinate among themselves, they can do that however they see fit. In practice we almost never need this — the default coordination mechanisms are straightforward and automatic — but when gestures do need special coordination, they can do it.
- Finally, one other approach worth comparing is the previous version of our own system. Each gesture implemented only one function: update(claimed: boolean, ctx: EventContext): Gesture | boolean | void. When a new touch began, we'd create an instance of every gesture in the system, and keep those in a list associated with the touch. Then for each event, we'd pass it to each gesture instance one by one, along with a claimed boolean. Any of those gestures could return true to indicate that they've now claimed the event. It was customary for each gesture to begin by checking the passed-in claimed boolean and, typically, abort their behaviour if it was true. This approach offered even more flexibility than our current approach — there was almost zero grain imposed by the system. The gesture(s) performed by each touch could change dynamically with zero coordination needed. But we found that the system itself wasn't doing enough to help. Gestures ended up needing to store a lot of state, and do some tricky logic to determine if (eg) they previously had a claim to a touch, but suddenly some gesture earlier in the list was now claiming and overriding their claim. When we switched to our current system (using offer()), we found that most gestures could be implemented almost identically, save for a bunch of oft-repeated bookkeeping code that could be deleted. Another downside of this old system is shared by our new system: gestures can do whatever they want internally. That's one of our values. Why is it a downside? Because it's in tension with another of our values: that it's easy to figure out what an existing gesture does by glancing at the code. In the old system, the extra internal state and the lack of discrete methods for each event phase (the old system just had update(), the new system has began(), moved(), ended(), etc) made it slightly too hard to understand what gestures did. The new system maintains the freedom for gestures to each do their own thing, but it gives just enough structure to make it easier to skim through gesture code and understand what will happen. Also, we made certain commonly-used special behaviours like tick() and firmPress() part of the system itself, so that gestures don't need to implement them, and that really helped tamp down duplication and inconsistency.
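To make the comparison concrete, here's a rough sketch of what a gesture might have looked like under that old protocol, including the bookkeeping that the offer()-based system later absorbed. The class and helper names are invented for illustration.

// Old protocol: every gesture instance saw every event, plus a `claimed` flag
class OldStyleDrawExample {
  private hasClaim = false

  update(claimed: boolean, ctx: EventContext): boolean | void {
    // Oft-repeated bookkeeping: bail out if another gesture already claimed
    // this event, and notice when an earlier gesture overrides our own claim
    if (claimed) {
      if (this.hasClaim) this.abandonStroke()
      this.hasClaim = false
      return
    }
    // Otherwise, claim the event and do the actual work
    this.hasClaim = true
    this.extendStroke()
    return true
  }

  private extendStroke() { /* draw ink along the touch (placeholder) */ }
  private abandonStroke() { /* undo any partial work (placeholder) */ }
}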