Invoke Features

Invoke offers a new way of working with spatial audio. As a spatial audio production tool, it explores different ways to embody the audio workflow. The main feature of the app is a voice-based drawing tool used to make trajectories for spatial audio objects.

Voice Drawing

The interaction dynamic for drawing trajectories is a new way to work with spatial audio. Combining input from the voice and hand provides a continuous space-time method of composition. Voice Drawing allows detailed production of trajectories to spatially and temporally mix tracks. As a user, you position a “pen” in virtual space, pull the controller trigger, make a sound with your voice, and build shapes in space. After creating a Voice Sketch, the line data is transformed into a control-point-based Bézier curve, a trajectory, that retains the volume information of the voice input. Then, by placing an audio object on a trajectory, the object’s volume is automated based on the recorded volume of the voice.
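The conversion step can be pictured with a minimal Python sketch (not Invoke's actual Unity code; all names here are illustrative) that evaluates a cubic Bézier trajectory while interpolating the recorded voice volume:

```python
def bezier_point(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter t via de Casteljau's algorithm."""
    lerp = lambda a, b, u: tuple(ai + (bi - ai) * u for ai, bi in zip(a, b))
    a, b, c = lerp(p0, p1, t), lerp(p1, p2, t), lerp(p2, p3, t)
    d, e = lerp(a, b, t), lerp(b, c, t)
    return lerp(d, e, t)

def trajectory_sample(control_points, volumes, t):
    """Return (position, volume) at t in [0, 1]. Volume is linearly
    interpolated between the voice levels recorded along the sketch."""
    pos = bezier_point(*control_points, t)
    idx = t * (len(volumes) - 1)
    i = min(int(idx), len(volumes) - 2)
    vol = volumes[i] + (volumes[i + 1] - volumes[i]) * (idx - i)
    return pos, vol
```

An audio object placed on the trajectory would then read both the position and the volume from the same parameter, which is the essence of the voice-to-automation mapping described above.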

When using Voice Drawing collaboratively, the volume information from both collaborators is used to draw each line. This means two lines could be drawn with the same volume information in two different places, and that when you or your partner draws a line, the resulting trajectory is not totally controlled by either of you.

Pen active before drawing
Voice drawing pre-trajectory render
Trajectory, a voice drawing converted to a Bézier line. Colour is taken from the audio source attached to it.

Object Interaction

Traditionally, object selection and manipulation in VR is considered either direct or indirect. Direct interaction uses a natural metaphor: you grab a thing close enough to touch, as you would a cup or a ball. Indirect interaction uses a form of mediation to allow action at a distance, like picking up a car with a crane. Each of these methods gives different sensations of embodiment but also changes how spaces for action are designed.

Direct Interaction
Indirect Interaction

Invoke uses both direct and indirect action; this allows precise control but also extended interaction spaces. For the user, laser-based object interaction sits on top of direct spatial selection and manipulation: you can either walk up to an object and grab it, or aim and grab it from a distance. When holding objects, you can pull them closer or push them further away using controls on the hardware VR controller.

The system is built on top of VRGrabber, a VR control system by Hecomi.


While it would be possible to put all functionality “in-the-world”, a set of menus is used to manage the various options and abstractions. There are three main menu types:

  • Mixer – a timeline and audio mixer to control gain, solo, mute functions and spatial parameters like Doppler, Reverb Send, and Volumetric Radius.
  • Trajectory Manager – a means to overview, toggle the visibility of, and delete trajectories.
  • Hand Menu – a way to manage world space menus and other global settings.
Invoke VR audio mixer panel

This system is built on top of some great work by Aesthetic Interactive, Hover UI Kit.


Avatar used for each player

As a shared VR experience, the mapping of your body to the virtual space is an important feature. Using an Inverse Kinematics (IK) system, a VR puppet maps to your movements. This system uses the HMD, the controllers, and a tracking puck attached to your waist. Sometimes the mapping can go a bit funny though…

An early experiment with IK and avatar scaling.

Assorted Features


The picture highlights that the level of opacity on an object has meaning. For instance, when an audio source is muted, the object has a see-through quality. Also, to manage the complexity of the space, the trajectory lines can be made semi-transparent; this also removes access to control points.

Transparency of objects and lines

Getting Around and Staying in Touch

As the interaction space provided is quite large, a teleport system was added. Also, as a shared experience, a spatialised voice communication system is available.

Non-realistic scaling

Given the size of the interaction space, objects change size depending on their distance from the user, getting bigger the farther away they are. This introduces a subtle set of issues but improves usability for selection and manipulation from a distance. One issue is the perceptual confusion of pushing something into the distance and watching it get bigger. The other is that each user has asymmetric perceptual information about space and objects.
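As a rough illustration of the idea, here is a Python sketch of distance-compensating scaling; the function name and constants are hypothetical tuning values, not Invoke's:

```python
def apparent_scale(distance, base_scale=1.0, near=1.0, gain=0.15):
    """Scale an object up with distance so its selectable size stays usable.
    Within `near` metres the object keeps its base scale; beyond that it
    grows linearly. `gain` is an illustrative tuning constant."""
    if distance <= near:
        return base_scale
    return base_scale * (1.0 + gain * (distance - near))
```

The asymmetry mentioned above follows directly: each user computes this scale from their own position, so two collaborators see the same object at different sizes.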

Spatial Music Making in VR

Do you create electronic music or sound design? Or are you a student or professional in Audio / Music Technology? If so, I am running a study over the next few weeks (August and September) and it would be great to have your participation!

You will be asked to collaboratively mix a short track using a shared VR spatial audio app. You will then be asked to complete a survey about your experience.

The study will take two hours to complete. All studies will be done in the Media and Arts Technology studios in the Engineering and Materials Science building of Queen Mary University of London, Mile End Road, Tower Hamlets.

Study slots were available from 13/08/19 to 28/08/19.

If you are interested in the context of the research I have some resources here:

Forest and Desert

I assisted RCA MA student Min Young Kim on the Forest and Desert work:

Apoptosis is a process of programmed cell death, a biochemical event that makes survival possible for every living organism. Forest and Desert, 2018, is a real-time simulation referring to this biological function. The concept of natural death becomes a core engine for a virtual body to sustain its life: prevention of overpopulation and overclocking, as well as the breakdown of the system. The null-and-void routine of the system renders surreal but live images that continually undergo change and challenge the self-contained capacity.


Over a series of meetings the following issues were encountered and worked through:

  • Establishing a workflow between artist and developer: deciding on a shared language, a set of design guide documents, regular meetings, and a project management structure.
  • Discussion of different ways to achieve a real-time simulation of an ecosystem of bacteria-like agents.
  • Developer assistance in setting up agent simulation behaviour; I designed scripts to be editable from the Unity Editor, allowing detailed tuning by the artist.
  • Addition of new agent features to represent different classes of predators and prey in the simulation, with group interactions (attack/evade/cohesion) and changes of agent state over time (growth/decay/transformation).
  • Optimization of the simulation by refactoring the agent interaction system into an Entity Component System architecture. Various tweaks to performance such as staggered agent updates.
  • Build support: e.g. forcing Unity to render in OpenGL modes compatible with the Mac Mini used for installation, and finding and adding lost shaders to builds.
  • Multiple display output from Unity to allow the projected image and the “Debug view” screen (see below), which displays information about the state of the simulation.
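The staggered-update optimisation mentioned above can be sketched like this in Python (a simplified stand-in for the Unity implementation; the agent data and the per-agent work are invented for illustration):

```python
def staggered_update(agents, frame, groups=4):
    """Update only the agents whose index falls in this frame's bucket,
    spreading the cost of expensive per-agent work across `groups` frames."""
    bucket = frame % groups
    updated = []
    for i, agent in enumerate(agents):
        if i % groups == bucket:
            agent["energy"] -= 1  # stand-in for an expensive behaviour update
            updated.append(i)
    return updated
```

Over `groups` consecutive frames every agent gets updated exactly once, so the simulation stays coherent while the per-frame cost drops.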


Getting my head around regression analysis Pt 1: Setting the problem

Regression analysis, and specifically mixed effect linear models (LMMs), is hard – harder than I thought based on what I learned in traditional statistics classes. ‘Modern’ mixed model approaches, although more powerful (they can handle more complex designs, lack of balance, crossed random factors, some kinds of non-normally distributed responses, etc.), also require a new set of conceptual tools.

This first post covers my process of understanding how to apply the magic of multiple regression to my experimental data. The next post will cover how the analysis was done using R.

The Basics

Before tackling my specific problem as a mixed effect linear model, it is important to review the basic building block: linear regression.

Linear regression is a standard way to build a model of your variables. You want to do this when [source]:

  • You have two variables: one dependent variable and one independent variable. Both variables are interval.
  • You want to express the relationship between the dependent variable and the independent variable in the form of a line. That is, you want to express the relationship as y = ax + b, where x and y are the independent and dependent variables, respectively.
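As a concrete example, the closed-form least-squares fit of y = ax + b takes only a few lines of Python:

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a*x + b (closed form):
    a = cov(x, y) / var(x), b = mean(y) - a * mean(x)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b
```

For a perfectly linear dataset such as `fit_line([0, 1, 2, 3], [1, 3, 5, 7])`, this recovers a = 2 and b = 1.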

Also, there are four key concepts in linear regression that should be clear before you attempt extended techniques like LMM or GLM [source]:

1. Understand what centring does to your variables: Intercepts are pretty important in multilevel models, so centring is often required to make intercepts meaningful.

2. Work with categorical and continuous predictors: You will want to use both dummy and effect coding in different situations.  Likewise, you want to be able to understand what it means if you make a variable continuous or categorical.  What different information do you get from it and what does it mean?  Even if you’re a regular ANOVA user, it may make sense to treat time as continuous, not categorical.

3. Interactions: Make sure you can interpret interactions regardless of how many categorical and continuous variables they contain.  And make sure you can interpret an interaction regardless of whether the variables in the interaction are both continuous, both categorical, or one of each.

4. Polynomial terms: Random slopes can be hard enough to grasp. Random curvature is worse, so be comfortable with polynomial functions if you have complex data (e.g. the Wundt curve, the bell-shaped relationship of positive affect and complexity in music).

Finally, understand how all these concepts fit together. This means understanding what the estimates in your model mean and how to interpret them.
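Concept 1 (centring) is easy to demonstrate: with a raw predictor, the intercept is the prediction at x = 0, which is often meaningless; after centring, it is the prediction at the mean of x. A small self-contained Python sketch with made-up numbers:

```python
def ols(xs, ys):
    """Closed-form least squares for y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def centre(xs):
    """Subtract the sample mean so that x = 0 means 'an average observation'."""
    m = sum(xs) / len(xs)
    return [x - m for x in xs]

# With raw ages, the intercept is the predicted score at age 0 (meaningless);
# after centring, it is the predicted score at the average age (here, mean(y)).
ages, scores = [20, 30, 40, 50], [5.0, 6.0, 8.0, 9.0]
_, raw_intercept = ols(ages, scores)
_, centred_intercept = ols(centre(ages), scores)
```

Note that centring changes the intercept's meaning but not the slope, which is why it matters so much once intercepts carry the between-group information in a multilevel model.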

What is a mixed effect linear model?

Simply, they are statistical models of parameters that vary at more than one level. They are a generalised form of linear regression that builds multiple linear models to describe how predictors relate to the response at each level.

Many kinds of data, including observational data collected in experiments, have a hierarchical or clustered structure. For example, children with the same parents tend to be more alike in their physical and mental characteristics than individuals chosen at random from the population at large. Individuals may be further nested within demographic and psychometric features. Multilevel data structures also arise in longitudinal studies where an individual’s responses over time are correlated with each other. In experimental data, LMM is a good way to account for individual differences between participants. For example, some participants may be more comfortable using touchscreens than others, and thus their performance in a task might be better. If we tried to represent this with linear regression, the model would try to represent the data with one line; this aggressively aggregates differences that may matter to making the results effective and contextually understood.

Multilevel regression, intuitively, allows us to have a model for each group represented in the within-subject factors. In this way, we can also consider the individual differences of the participants (they will be described as differences between the models). What multilevel regression actually does is something in between completely ignoring the within-subject factors (sticking with one model) and building a separate model for every single group (making n separate models for n participants). LMM controls for non-independence among the repeated observations for each individual by adding one or more random effects for individuals to the model. These take the form of additional residual terms, each of which has its own variance to be estimated. Roughly speaking, there are two strategies you can take for random effects: varying-intercept or varying-slope (or both). Varying-intercept means differences in random effects are described as differences in intercepts. Varying-slope means the coefficients of some factors change instead.
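Roughly, the varying-intercept idea can be sketched as shrinkage of group means toward the grand mean. This Python toy sits between the one-model and n-models extremes; the fixed `shrink` constant is my simplification, since a real LMM estimates the amount of shrinkage from the variance components:

```python
def varying_intercepts(groups, shrink=5.0):
    """Estimate a per-group intercept shrunk toward the grand mean.
    `groups` maps group id -> list of responses. With shrink=0 this is
    'a separate model per group'; as shrink grows it approaches
    'one model for everyone'."""
    all_ys = [y for ys in groups.values() for y in ys]
    grand = sum(all_ys) / len(all_ys)
    out = {}
    for g, ys in groups.items():
        n = len(ys)
        w = n / (n + shrink)  # more data -> trust the group's own mean more
        out[g] = w * (sum(ys) / n) + (1 - w) * grand
    return out
```

Groups with few observations are pulled strongly toward the grand mean, which is exactly the "partial pooling" behaviour that distinguishes an LMM from fitting each participant separately.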


Dependent/Response variable: the variable that you measure and expect to vary given experimental manipulation.

Independent/Explanatory/Exogenous variables and Fixed effects: all variables that we expect will have an effect on the dependent/response variable. These are factors whose levels are experimentally determined or whose interest lies in the specific effects of each level, such as effects of covariates, differences among treatments, and interactions.

Random effects: usually grouping factors for which we are trying to control. In repeated measures designs, they can be either crossed or hierarchical/nested (more on that later). Random effects are factors whose levels are sampled from a larger population, or whose interest lies in the variation among them rather than the specific effects of each level. The parameters of random effects are the standard deviations of variation at a particular level (e.g. among experimental blocks).

The precise definitions of ‘fixed’ and ‘random’ are controversial; the status of particular variables depends on experimental design and context.

My Research Problem

In an experiment comparing Desktop (DT) computer and VR interfaces in a collaborative music-making task, I think that individual users and the dyadic session dynamics affect the amount of speech when doing the task and that the amount of talk will also be affected by media (DT/VR). Basically, the mixture of people and experimental condition will both have effects, but I really want to know the specific effect of media on speech amount.

Data structure

The dependent variable is the frequency of coded speech per user, while demographic surveys produced multiple explanatory variables along with the independent variable of media. So, we also have a series of other variables that may affect the volume of communication. Altogether variables of interest for linear modelling include:

  • Media: media condition DT or VR.
  • User: repeated measure grouping by the participant ID.
  • Session: categorical dyad grouping e.g. A, B, C.
  • Utterance: a section of transcribed speech, a sentence or comparable unit. Frequencies of utterances are used.
  • Pam: Personal acquaintance measure, a psychometric method of evaluating how much you know another person.
  • VrScore: level of experience with VR, simple one to seven scores.
  • MsiPa: Musical sophistication index perceptual ability factor for each user.
  • MsiMt: Musical sophistication index musical training factor for each user.

Using the right tool

As I used a repeated measures design for the experiment, where each participant used both interfaces, Media is a within-subject factor. This means I need a statistical method that can account for it. A simple paired t-test or repeated measures ANOVA may be of use, but these lack the ability to include all of the explanatory variables; this leaves us with regression analysis. This decision tree highlights how to choose the right form of regression analysis:

  1. If you have one independent variable and do not have any within-subject factor, consider Linear regression. If your dependent variable is binomial, Logistic regression may be more appropriate.
  2. If you have multiple independent variables and do not have any within-subject factor, consider Multiple linear regression.
  3. If you have any within-subject factor, consider Multi-level linear regression (mixed-effect linear model).
  4. For some special cases, consider the Generalized Linear Model (GLM) or Generalized Linear Mixed Model (GLMM).

So, at first I chose to use a mixed-effect linear model (LMM), as I am trying to fit a model that has two random intercepts, e.g. two groups. As such, we are trying to fit a model with nested random effects.

Crossed or Nested random effects

As each User only appears once in each Session, the data can be treated as nested. For nested random effects, the factor appears only within a particular level of another factor; for crossed effects, a given factor appears in more than one level of another factor (Users appearing within more than one session). An easy rule of thumb is that if your random effects aren’t nested, then they are crossed!
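The rule of thumb can be checked mechanically. A small Python sketch (with hypothetical data) that classifies (User, Session) observations:

```python
def random_effects_structure(pairs):
    """Classify (user, session) observations as 'nested' or 'crossed'.
    Every user appearing in exactly one session -> nested; any user
    appearing in more than one session -> crossed."""
    sessions_per_user = {}
    for user, session in pairs:
        sessions_per_user.setdefault(user, set()).add(session)
    nested = all(len(s) == 1 for s in sessions_per_user.values())
    return "nested" if nested else "crossed"
```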

Special Cases…GLM

After a bit of further reading, I found out that my dependent variable meant a standard LMM was not suitable. As the response variable is count data of speech, it violates the assumptions of normal LMMs. When your dependent variable is not continuous, unbounded, and measured on an interval or ratio scale, your model will never meet the assumptions of linear mixed models (LMMs). In step the flexible, but highly sensitive, Generalised Linear Mixed Models (GLMMs). The difference between LMMs and GLMMs is that the response variables can come from different distributions besides the Gaussian; for count data this is often a Poisson distribution. There are a few issues to keep in mind, though.

  1. Rather than modelling the responses directly, some link function is often applied, such as a log link. For Poisson, the link function (the transformation of Y) is the natural log. So all parameter estimates are on the log scale and need to be transformed for interpretation; this means applying the inverse function of the link, which for log is the exponential.
  2. It is often necessary to include an offset parameter in the model to account for the amount of risk each individual had to the event; practically this is a normalising factor such as the total number of utterances across repeated conditions.
  3. One assumption of Poisson Models is that the mean and the variance are equal, but this assumption is often violated.  This can be dealt with by using a dispersion parameter if the difference is small or a negative binomial regression model if the difference is large.
  4. Sometimes there are many, many more zeros than even a Poisson Model would indicate. This generally means there are two processes going on: there is some threshold that needs to be crossed before an event can occur. A Zero-Inflated Poisson Model is a mixture model that simultaneously estimates the probability of crossing the threshold and, once crossed, how many events occur.
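Points 1 and 2 boil down to a little arithmetic. Here is a plain Python sketch of back-transforming a log-scale estimate and of how an offset turns a modelled rate into an expected count (the coefficient values in the test are made up for illustration):

```python
import math

def rate_ratio(log_coef):
    """A log-link model reports estimates on the log scale; exponentiate
    a coefficient to get a multiplicative rate ratio."""
    return math.exp(log_coef)

def expected_count(intercept, coef, x, exposure):
    """Expected count under a log link with an offset: adding log(exposure)
    to the linear predictor turns the modelled rate into a count for this
    observation's exposure (e.g. total utterances)."""
    return math.exp(intercept + coef * x + math.log(exposure))
```

So a coefficient of 0 corresponds to a rate ratio of 1 (no effect), and doubling the exposure doubles the expected count without changing the rate.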

Moving forward

In the next post, I will cover how this analysis is done in the R environment using the lme4 package.


Looking back at people looking forward

In 1995, Heath, Luff, & Sellen lamented the state of video conferencing, indicating that it had not at the time reached its promise. But looking back at this projection, the ubiquity of video systems for social and work communication can be seen, and research has subsequently gone about understanding it in a variety of HCI paradigms (CHI2010, CSCW2010, CHI2018). So, for my research, making projections on the use of VR for music collaboration, it might be that findings and insights do not reach fruition in a timely fashion, or in the domain of interest in which they were investigated, or ever! Though this could be touching on a form of hindsight bias.

Going back to the article that speculated on the unobtained promise of video conferencing technologies, Heath, Luff, and Sellen (1995) provide a piece of insight that can still be placed into perspective on design interventions for collaboration:

It becomes increasingly apparent, when you examine work and collaboration in more conventional environments, that the inflexible and restrictive views characteristic of even the most sophisticated media spaces, provide impoverished settings in which to work together. This is not to suggest that media space research should simply attempt to ‘replace’ co-present working environments, such ambitions are way beyond our current thinking and capabilities. Rather, we can learn a great deal concerning the requirements for the virtual office by considering how people work together and collaborate in more conventional settings. A more rigorous understanding of more conventional collaborative work, can not only provide resources with which to recognise how, in building technologies we are (inadvertently) changing the ways in which people work together, but also with ways in which demarcate what needs to be supported and what can be left to one side (at least for time being). Such understanding might also help us deploy these advanced technologies.

The bold section highlights the nub of what I’m interested in; for VR music collaboration systems. I break this down into how I’ve tackled framing collaboration in my research:

  • conventional collaborative work – ethnographies of current and developing practice. Even if you pitch a radical agenda of VR workspace, basic features of the domain of interest need to be understood for their contextual and technical practices.
  • building technology is changing practice – observing the impact of design interventions on how people collaborate in media production. Not only does a technology suggest new ways of working, it can enforce them! Observing and understanding this in domain-specific ways is important.
  • what needs to be supported – basic interactional requirements, we have to be able to make sense of each other, and the work, together, in an efficient manner.
  • what can be left to one side – the exact models and metaphors of how work is constructed in reality; in VR we can create work setups and perspectives that cannot exist in reality. For instance, shared spatial perspectives, i.e. seeing the same thing from the same perspective, which is impossible in reality as we have to occupy separate physical spaces. In repositioning basic features of spatial collaboration, the effects need to be understood in terms of interaction and domain requirements. But the value is in finding new ways of doing things not possible in face-to-face collaboration.

Overall, the key theme that should be taken away is that of humans’ need to communicate and collaborate. In this sense, any research that looks to make collaboration easier is provisioning for basic human understanding. That is quite nice to be a part of.

Software Architecture for Polyadic

The Polyadic interface enables collaborative composition of 16-step drum loops to accompany backing tracks in 4 different genres of electronic music, for two or more co-located participants, utilising two user interface media: Virtual Reality (VR) and Desktop (DT).

To accommodate the cross-platform development of the system, an overview of programming paradigms was made to determine an architecture for quick prototyping. The architecture design goals were:
  • Ease of feature development for testing multiple approaches.
  • Deterministic network interaction with interfaces.
  • Modular code structure to allow a Git-Flow style of development with parallel feature development not causing merge nightmares.

The candidate architectures included:

  • Entity-Component-System – the final winner; more on this later.
  • Dependency Injection / Inversion of Control – lots of supporters, but it all seemed a bit weird to set up and work with for this application. Initial tests were positive, but the style of the structure started to annoy me.
  • Model-View-Controller – a classic, solid design pattern. But scaling it to maintain a tidy feature set for the cross-platform network application felt dangerous. I saw it turning into a pseudo-pattern, where best intentions are kept but the flexibility and my laziness would turn it into illogical spaghetti.
  • Hack away – what I have done previously, a lot. The speed of just bringing functions together however you like is always appealing in the short term, like a really fatty burger, but it will shorten your life somehow.


Here, here, and here are some good introductions/discussions by Maxim Zaks, a major contributor to the Entitas ECS framework. To summarise, ECS, and specifically Entitas, reduces everything down to data, groups of data, and systems that act on data. This is very different from classic OOP design and required a little retraining of my process and thinking. I made many mistakes and introduced a lot of pseudo-dependencies along the way. Now, in the process of refactoring after doing some other projects with it, I am rooting out these pseudo-dependencies and reducing the reliance on wasteful LINQ operations. In the end, it mostly produced decoupled code that allowed very feature-driven development. As I’m working by myself on the project, I haven’t got into the unit testing possibilities, but these are said to be great.
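To make the data/groups/systems idea concrete, here is a toy ECS in Python; it sketches the pattern only and is not Entitas's actual API:

```python
class World:
    """A tiny entity-component-system: entities are bare ids, components are
    plain data keyed by entity, and systems are functions over matching groups."""
    def __init__(self):
        self.next_id = 0
        self.components = {}  # component name -> {entity id: data}

    def create(self, **comps):
        """Create an entity with the given components; return its id."""
        eid = self.next_id
        self.next_id += 1
        for name, data in comps.items():
            self.components.setdefault(name, {})[eid] = data
        return eid

    def group(self, *names):
        """Yield (entity, components...) for entities having all named components."""
        if not names:
            return
        ids = set(self.components.get(names[0], {}))
        for n in names[1:]:
            ids &= set(self.components.get(n, {}))
        for eid in sorted(ids):
            yield (eid,) + tuple(self.components[n][eid] for n in names)

def move_system(world, dt):
    """A system: mutate Position for every entity that also has Velocity."""
    for eid, pos, vel in world.group("Position", "Velocity"):
        pos["x"] += vel["x"] * dt

w = World()
mover = w.create(Position={"x": 0.0}, Velocity={"x": 2.0})
w.create(Position={"x": 5.0})  # no Velocity: ignored by move_system
move_system(w, 0.5)
```

The point of the pattern is visible even at this scale: `move_system` knows nothing about entity types, only about which components are present, which is what keeps features decoupled.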

So when not to use ECS?

Building frameworks for others or purely computational systems; see this for discussion. Though I am toying with a fully ECS-driven audio signal processing idea; it might be folly though…

Positive future

Also, the good news is that by choosing ECS, I have started to train myself in the current path that Unity is taking with multithreaded systems, so that’s nice! But as this is still in early development, I will stick with Entitas.

Polyadic update: changing hands

Managed to get the VR version of Polyadic scaled down: instead of a massive panel you have to stretch across to operate, the scaled-down version is roughly the width of an old MPC. This is important for visual pattern recognition in the music-making process, but the sizing also allows for alternate workspace configurations that are more ergonomic and can handle more toys being added!

To get the scaled-down features to work, a tool morphing process has been designed. The problem is that the Oculus Rift and HTC Vive controllers are quite large, especially in comparison to a mouse pointer. By using smaller hand models when you are in the proximity of the drum machine, you can have a higher ratio of control to display, in the sense that less of the hand model can physically touch features in the interface.

Control-display (C-D) ratio adaptation is an approach for facilitating target acquisition. For a mouse, the C-D ratio is a coefficient that maps the physical displacement of the pointing device to the resulting on-screen cursor movement (Blanch, 2004); for VR, it is the ratio between the amplitude of movements of the user’s real hand and the amplitude of movements of the virtual hand model. A low C-D ratio (high sensitivity) can save time when users are approaching a target, while a high C-D ratio (low sensitivity) can help a user with fine adjustment once they reach the target area. Adaptive control-display ratios, such as non-linear mappings, have been shown to benefit 3D rotation and 3D navigation tasks.
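A distance-adaptive C-D ratio of the kind described could look like this; a Python sketch with illustrative tuning constants, not Polyadic's actual values:

```python
def adaptive_cd_ratio(dist_to_target, near=0.1, far=0.5,
                      ratio_near=2.0, ratio_far=0.5):
    """Blend the control-display ratio by distance to the target: a low
    ratio (high sensitivity) far away for a fast approach, a high ratio
    (low sensitivity) near the target for fine adjustment. All constants
    are hypothetical tuning values."""
    t = max(0.0, min(1.0, (dist_to_target - near) / (far - near)))
    return ratio_near + (ratio_far - ratio_near) * t

def virtual_hand_step(real_step, dist_to_target):
    """Virtual hand displacement = real displacement / C-D ratio."""
    return real_step / adaptive_cd_ratio(dist_to_target)
```

Far from the target the hand movement is amplified; close to it, the same real movement produces a smaller, more precise virtual movement.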

But the consequence of this mapping change will be an expressive difference. In the original prototype with the oversized wall of buttons and sliders, the experience of physical exertion might have been quite enjoyable. By reducing this down, a very different body space is created; the effects of this remain to be tested. Subjectively, it did feel more precise and coherent as a VR interface, less toy-like and comical. As mentioned in the introduction, the sizing can have implications for pattern recognition. The smaller size allows you to overview the whole pattern while working on it, whereas previously the size meant stepping back or craning your neck to take it all in. It would be interesting to know how much effect the Gestalt principles of pattern recognition have on cognitive load in music-making situations, given the time-critical nature of the audiovisual interaction.

Blanch, R., Guiard, Y. & Beaudouin-Lafon, M., 2004. Semantic Pointing – Improving Target Acquisition with Control-display Ratio Adaptation. Proceedings of the International Conference on Human Factors in Computing Systems (CHI’04), 6(1), pp.519–526.

Adaptive landscapes

A little experiment on how to modulate a mesh using a video and a sound file.

I used the following to achieve the effect:


  • Add video file to your assets
  • Copy the Tessellation example from the Wireframe shader samples
  • Remove the animated control script that controls the material values
  • Create a render texture for the video frames to go to
  • Replace the displacement texture of the material with the render texture
  • Using Klak, grab the RMS of an audio source in the scene and map this to the displacement height of the material/shader.
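The last step is essentially a loudness-to-height mapping. Here is a Python sketch of the idea (Klak does this wiring inside Unity; the function names are illustrative):

```python
import math

def rms(samples):
    """Root-mean-square level of an audio sample buffer."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def displacement_height(samples, max_height=2.0):
    """Map buffer loudness to a shader displacement height, a stand-in for
    routing Klak's audio RMS into the material property. The clamp keeps
    loud transients from blowing the mesh apart."""
    return min(1.0, rms(samples)) * max_height
```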