Thoughts on Human-AI Collaboration

I recently built a voice todos app to probe how human-AI collaboration will evolve.
Todos show up on the screen as you talk: you can backtrack and self-correct, and the list updates as you speak.
It provides direct visual feedback that you can act upon.
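
To make the live-update idea concrete, here is a minimal sketch of how such a list could be modeled. It is not the app's actual code: the TodoList class, the add/replace/remove operations, and their dictionary shape are all assumptions made for illustration. The idea is that a model emits small edits while the user is still speaking, and each edit is applied right away so the screen always reflects the latest correction.

```python
from dataclasses import dataclass, field


@dataclass
class TodoList:
    """In-memory todo list that is updated one small edit at a time."""
    items: list[str] = field(default_factory=list)

    def apply(self, op: dict) -> None:
        # Each op is a hypothetical edit that a model could emit
        # while the user is still speaking.
        kind = op["kind"]
        if kind == "add":
            self.items.append(op["text"])
        elif kind == "replace":
            self.items[op["index"]] = op["text"]
        elif kind == "remove":
            del self.items[op["index"]]
        else:
            raise ValueError(f"unknown op kind: {kind}")


def replay(ops: list[dict]) -> list[list[str]]:
    """Apply ops in order and return a snapshot of the list after each one."""
    todos = TodoList()
    snapshots = []
    for op in ops:
        todos.apply(op)
        snapshots.append(list(todos.items))
    return snapshots


# "Buy milk... call the dentist... actually, oat milk... forget the dentist."
ops = [
    {"kind": "add", "text": "Buy milk"},
    {"kind": "add", "text": "Call the dentist"},
    {"kind": "replace", "index": 0, "text": "Buy oat milk"},  # self-correction
    {"kind": "remove", "index": 1},                            # backtracking
]
snapshots = replay(ops)
print(snapshots[-1])  # ['Buy oat milk']
```

Each snapshot corresponds to one state of the screen, so a self-correction shows up as a small visible change rather than a full re-render at the end of the recording.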

The magical feel of interacting with this app comes from a combination of:

Live interactions: you get live feedback on your input
Multimodal input/output

Live interaction

Typing/writing is inherently asynchronous.
Voice is the first modality for live interaction.
It’s both higher-throughput and messier than typing: voice unlocks the stream of thoughts.

So far, however, models are not made for live interactions.
People submit huge voice memos the same way they would paste in a huge block of text.
It gets processed as a whole, and you get feedback only once it’s processed.

Creating natural live interactions is hard for many reasons.
One big issue is that models cannot think and interact at the same time.
So each new piece of input is processed only after the previous piece has been fully processed.
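
A back-of-the-envelope sketch shows why this matters. It is a toy timing model, not a measurement of any real system: the function names, the chunk arrival times, and the fixed per-chunk processing cost are all assumptions. It compares when feedback arrives if a whole memo is processed at the end, versus chunk by chunk with each chunk waiting for the previous one to finish.

```python
def batch_feedback_time(arrivals: list[float], process_time: float) -> float:
    """The whole memo is processed at once, after the last chunk has arrived."""
    return arrivals[-1] + process_time * len(arrivals)


def sequential_feedback_times(arrivals: list[float], process_time: float) -> list[float]:
    """Each chunk is processed only once the previous chunk is done."""
    times = []
    free_at = 0.0  # time at which the model finishes its current chunk
    for arrived_at in arrivals:
        start = max(arrived_at, free_at)  # wait for the chunk AND for the model
        free_at = start + process_time
        times.append(free_at)
    return times


arrivals = [1.0, 2.0, 3.0, 4.0]  # seconds at which each spoken chunk ends

print(batch_feedback_time(arrivals, 0.5))        # 6.0
print(sequential_feedback_times(arrivals, 0.5))  # [1.5, 2.5, 3.5, 4.5]
print(sequential_feedback_times(arrivals, 1.5))  # [2.5, 4.0, 5.5, 7.0]
```

In this toy model, chunked processing feels live as long as processing is faster than speech. Once processing takes longer than the gap between chunks, the lag grows with every chunk (1.5 s after the first chunk, 3.0 s after the fourth), which is the kind of drift that thinking and interacting at the same time would avoid.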

But some labs are demonstrating that full-duplex models, which listen and respond at the same time, are possible.
Recently, Thinking Machines Lab showed what their Interaction Models would be capable of.

The first signal in that direction actually came from Kyutai, about two years ago.

Multimodal collaboration

Look at humans collaborating around a whiteboard:

They talk
They point
They draw
They gesture, and more…

By comparison, the voice todos app only edits the list live and provides one visual cue (flashing changed elements).
That’s very limited.
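
For illustration, here is one way the "flash changed elements" cue could be computed. This is a sketch under assumptions, not the app's implementation: the changed_indices helper is hypothetical, and it diffs two snapshots of the list with Python's standard difflib module to find which rows were added or rewritten.

```python
from difflib import SequenceMatcher


def changed_indices(before: list[str], after: list[str]) -> list[int]:
    """Indices in `after` that were inserted or replaced relative to `before`."""
    matcher = SequenceMatcher(a=before, b=after, autojunk=False)
    changed = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            changed.extend(range(j1, j2))
    return changed


before = ["Buy milk", "Call the dentist"]
after = ["Buy oat milk", "Call the dentist", "Book flights"]

# Rows 0 (rewritten) and 2 (new) would be the ones to flash.
print(changed_indices(before, after))  # [0, 2]
```

Deleted rows have no position in the new list, so this sketch ignores them. A real UI would need a separate cue for removals.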

The best example of what I believe collaboration will look like is provided by Google DeepMind:

You can point as you speak
The model has visual and audio understanding
The model provides audio feedback (but limited visual feedback)

Even though app UIs as we know them are likely to disappear, designing input and feedback across multiple modalities will be key to AI collaboration workflows.