Building a text-based conversational agent is one problem. Adding a face and a voice to it turns out to be an almost entirely different problem.
Most of the mental models I brought from text agents were wrong. The architecture is different. The failure modes are different. What counts as a good interaction is different.
A text agent can get away with answering correctly. A video conversational agent has to feel present.
That distinction has real engineering consequences.
What changes when you add a face
With a text chatbot, latency is an annoyance. With a video conversational agent, latency is a social signal.
If there is a 900 ms pause before a response, the user reads it as hesitation. If the avatar keeps speaking after the user has started talking, it reads as rudeness. If the lips do not sync with the audio, the whole experience falls apart even if the words are right.
This is not a UI problem. It is an architecture problem.
The moment you put a face and a voice on a conversational agent, the user brings all their social instincts to the interaction. Those instincts are finely tuned. Humans are very good at detecting when something is off in a conversation. They may not know why it feels wrong. But they will feel it.
So the engineering challenge shifts. You are no longer building a system that produces correct output. You are building a system that produces output at the right moment, in the right way, with the right state.
The components you have to think about
A video conversational agent has several layers that a text agent does not need:
Voice pipeline. You need speech-to-text on the input side and text-to-speech on the output side, or a direct speech-to-speech model if latency is critical. Each hop adds delay. Each hop also introduces failure modes. If the voice pipeline degrades, the whole experience degrades, regardless of how good the underlying model is.
Avatar rendering. Generating a realistic face in real time is computationally expensive. You need to decide early whether you are rendering locally, streaming from a provider, or using a hybrid approach. This decision affects latency, cost, and how much control you retain over the visual state.
Turn-taking logic. In text chat, the user submits a message and the agent responds. In voice, the boundary between user input and agent response is not clean. You need to decide when the user has finished speaking, when to start responding, and how to handle interruptions. Get this wrong and the conversation feels robotic or chaotic.
Interruption handling. Users interrupt. They correct themselves mid-sentence. They change the subject. A video agent that cannot handle interruptions gracefully will feel unnatural fast. This requires the backend to listen continuously, detect when the user is speaking, and stop the agent mid-output if needed. That stop needs to be clean, not a hard cut.
State synchronization. The visual state of the avatar must stay in sync with the conversation state. If the agent is thinking, the avatar should not be smiling and nodding. If the user said something difficult, the avatar should not look cheerful. This means the frontend and backend need a shared state model that updates fast enough to feel natural.
Session lifecycle. Real-time connections are fragile. Networks drop. Tabs go to background. Users return after two minutes and expect the conversation to continue. You need to think about reconnection, session persistence, and how gracefully the agent recovers from interruptions in the connection itself.
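To make the turn-taking decision above concrete, here is a minimal sketch of end-of-turn detection. It assumes something upstream (a VAD, for example) already labels each audio frame as speech or not. The class name, the frame size, and both thresholds are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EndpointDetector:
    """Decides when the user's turn starts and ends from per-frame speech flags."""
    frame_ms: int = 20          # duration of one audio frame (assumed)
    min_speech_ms: int = 100    # speech shorter than this is treated as a blip
    end_silence_ms: int = 600   # silence needed before the turn counts as over

    speech_ms: int = 0
    silence_ms: int = 0
    in_turn: bool = False

    def push(self, is_speech: bool) -> Optional[str]:
        """Feed one frame; returns "turn_start", "turn_end", or None."""
        if is_speech:
            self.speech_ms += self.frame_ms
            self.silence_ms = 0
            if not self.in_turn and self.speech_ms >= self.min_speech_ms:
                self.in_turn = True
                return "turn_start"
        else:
            self.silence_ms += self.frame_ms
            if not self.in_turn:
                self.speech_ms = 0      # a blip followed by silence is discarded
            elif self.silence_ms >= self.end_silence_ms:
                self.in_turn = False
                self.speech_ms = 0
                return "turn_end"
        return None
```

The silence threshold is the knob that trades responsiveness against cutting the user off: shorter values make the agent answer faster but treat a thinking pause as the end of the turn.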
Beneath all of this is a state machine. At any moment the agent is in one of a small number of states: listening, processing, speaking, interrupted, recovering. Each state has valid transitions and invalid ones. If you do not model this explicitly, you end up debugging the emergent behavior of a system that was never designed to handle all the edge cases it will encounter in production.
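A minimal sketch of modeling those states explicitly might look like the following. The five states come from the list above. The transition table is my assumption about which edges are valid, and your system may need different ones.

```python
from enum import Enum, auto


class AgentState(Enum):
    LISTENING = auto()
    PROCESSING = auto()
    SPEAKING = auto()
    INTERRUPTED = auto()
    RECOVERING = auto()


# Assumed transition table: which edges are valid is a design decision.
TRANSITIONS = {
    AgentState.LISTENING: {AgentState.PROCESSING, AgentState.RECOVERING},
    AgentState.PROCESSING: {AgentState.SPEAKING, AgentState.INTERRUPTED,
                            AgentState.RECOVERING},
    AgentState.SPEAKING: {AgentState.LISTENING, AgentState.INTERRUPTED,
                          AgentState.RECOVERING},
    AgentState.INTERRUPTED: {AgentState.LISTENING, AgentState.RECOVERING},
    AgentState.RECOVERING: {AgentState.LISTENING},
}


class InvalidTransition(Exception):
    """Raised when code attempts a transition the table does not allow."""


class Agent:
    def __init__(self):
        self.state = AgentState.LISTENING

    def transition(self, new_state):
        # Fail loudly instead of drifting into an undefined state.
        if new_state not in TRANSITIONS[self.state]:
            raise InvalidTransition(f"{self.state.name} -> {new_state.name}")
        self.state = new_state
```

Raising on an invalid transition turns a subtle production bug (an avatar that keeps talking while the backend thinks it is listening) into an error you can log and trace.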
A few specific things are worth building early. VAD (voice activity detection) on the input side, so you know when the user actually starts and stops speaking, not just when audio is present. Cancellation logic, so the agent can stop mid-response cleanly when interrupted, without leaving the pipeline in a broken state. Partial response streaming, so the first words reach the voice layer before the full response is generated. Reconnect handling at the session level, so a dropped connection does not reset the conversational context. And structured logging across the audio, model, and video layers, because without it you cannot tell which layer introduced a delay or a glitch.
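Cancellation and partial streaming fit together naturally. The sketch below uses Python's asyncio: a stand-in generator plays the role of a streaming model, each word is forwarded as soon as it arrives, and an interruption cancels the task with a cleanup step. `generate_tokens`, `speak`, the `"<flush>"` marker, and the timings are all hypothetical. A real system would send audio to a TTS service and reset the avatar instead of appending to a list.

```python
import asyncio


async def generate_tokens(text):
    """Stand-in for a streaming model: yields one word at a time."""
    for word in text.split():
        await asyncio.sleep(0.02)       # simulated generation delay
        yield word


async def speak(text, spoken):
    """Forward each word to the (hypothetical) voice layer as it arrives."""
    try:
        async for word in generate_tokens(text):
            spoken.append(word)         # real system: hand the chunk to TTS
    except asyncio.CancelledError:
        spoken.append("<flush>")        # cleanup hook: flush audio, reset avatar
        raise                           # re-raise so the task is marked cancelled


async def demo():
    spoken = []
    task = asyncio.create_task(
        speak("one two three four five six seven eight", spoken))
    while len(spoken) < 3:              # simulate the user interrupting after three words
        await asyncio.sleep(0.001)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass                            # expected: the response was cut short on purpose
    return spoken
```

Running `asyncio.run(demo())` returns the words spoken before the interruption followed by the cleanup marker. Doing the cleanup inside the `except` block and then re-raising is what keeps the stop clean rather than a hard cut.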
Full stack or modular
When you start building, there is an obvious shortcut: use a full-stack provider that handles the avatar, the voice, the model, and the conversation all in one. Pass in some context, get back a video stream. Done.
This works well for demos. It works less well once you have specific requirements.
The problem with a fully managed stack is that the conversational logic lives inside someone else’s system. If you want to change how the agent handles certain topics, how it manages context, what it remembers and what it ignores, you are constrained by what the platform exposes.
For a simple use case, that is fine. For a product where the conversation itself is the experience, it becomes limiting quickly.
The alternative is a modular approach: use a provider for the video rendering layer, a separate provider or model for inference, your own voice pipeline if needed, and your own logic for everything that defines how the agent actually behaves.
This is more work to set up. It is also more work to maintain. But it gives you control over the parts that matter most.
The rough split I would suggest thinking about: outsource rendering, transport, and inference. Keep conversation logic, context management, and memory in your own system. Those are the parts that define the character of the agent. Letting someone else own them is a risk.
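One way to picture that split is as a pair of narrow interfaces around logic you own. This is a sketch under assumptions: `InferenceProvider`, `Renderer`, `Conversation`, and the two stand-in classes are hypothetical names, and no real vendor API looks exactly like this.

```python
from dataclasses import dataclass, field
from typing import Protocol


class InferenceProvider(Protocol):
    """Outsourced: any model or vendor that turns a prompt into a reply."""
    def complete(self, prompt: str) -> str: ...


class Renderer(Protocol):
    """Outsourced: any avatar or video layer that can speak a reply."""
    def speak(self, text: str) -> None: ...


@dataclass
class Conversation:
    """Kept in-house: context, memory, and the rules for using them."""
    inference: InferenceProvider
    renderer: Renderer
    history: list = field(default_factory=list)
    max_turns: int = 6                  # illustrative context policy

    def handle(self, user_text: str) -> str:
        self.history.append(("user", user_text))
        recent = self.history[-self.max_turns:]
        prompt = "\n".join(f"{role}: {text}" for role, text in recent)
        reply = self.inference.complete(prompt)
        self.history.append(("agent", reply))
        self.renderer.speak(reply)
        return reply


class EchoInference:
    """Stand-in for a real model, useful for local tests."""
    def complete(self, prompt: str) -> str:
        return "echo " + prompt.splitlines()[-1]


class RecordingRenderer:
    """Stand-in for a real avatar provider: records what it was asked to say."""
    def __init__(self):
        self.spoken = []

    def speak(self, text: str) -> None:
        self.spoken.append(text)
```

Swapping a rendering or inference vendor then means replacing one small class, while the context policy in `Conversation` stays untouched.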
This is also the direction behind Prinsessa, a video-first conversational presence project that explores what happens when conversation, memory, voice, and a face become part of the same real-time experience.
Memory is not just storage
Most agent implementations treat memory as a context window problem. You summarize past conversations, store facts about the user, retrieve relevant history, and inject it into the prompt.
That works for task-oriented agents. It works less well when the agent is supposed to feel consistent over time.
The difference is this: for a task agent, memory helps it do its job. For a conversational agent, memory is part of what the user experiences. It is what makes the agent feel like the same presence from one session to the next.
Getting this right is harder than it sounds. Too much memory feels invasive. Too little feels like the agent has forgotten you. Memory retrieved at the wrong moment feels mechanical. Memory that surfaces the right detail at the right time feels human.
A few things I think matter here:
First, not everything should be stored. Storing everything the user says and retrieving it later can make the agent feel like it is running a surveillance operation on the user. Memory should serve the conversation, not demonstrate that you were paying attention.
Second, stale memory is worse than no memory. If someone mentioned a problem six months ago and the agent brings it up as if it just happened, the experience breaks. Memory needs to have a sense of time and relevance, not just recall.
Third, memory in this context is closer to continuity than storage. The goal is not to remember facts about the user. The goal is for each conversation to feel like it builds on the last.
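The second point, that memory needs a sense of time and relevance, can be sketched as a score that discounts relevance by age. Everything here is an assumption for illustration: the `Memory` fields, the 30-day half-life, the threshold, and the idea that a relevance value between 0 and 1 is already available from some retrieval step.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    age_days: float     # time since the memory was recorded
    relevance: float    # 0..1 match with the current topic (assumed given)


def score(m: Memory, half_life_days: float = 30.0) -> float:
    """Relevance discounted by age: halves every `half_life_days`."""
    decay = 0.5 ** (m.age_days / half_life_days)
    return m.relevance * decay


def select(memories, threshold=0.3, limit=2):
    """Return at most `limit` memories whose score clears the threshold."""
    ranked = sorted(memories, key=score, reverse=True)
    return [m for m in ranked if score(m) >= threshold][:limit]
```

With these numbers, a perfectly relevant memory from six months ago scores below the threshold and stays out of the prompt, while a moderately relevant one from today gets through. The threshold and the limit are what keep retrieval from surfacing everything it can find.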
What are you optimizing for?
Once the agent has a face and a voice, engagement metrics become harder to interpret. A longer session is not automatically a better session. More interruptions are not always bad. Faster responses are not always better. Sometimes the right behavior is to pause, clarify, or let the user leave the session cleanly.
That means you need to define what a good interaction looks like before you start optimizing the loop.
If the only metric is time spent or sessions per week, the product decisions will drift toward maximizing those numbers. That drift is subtle but consistent. It shapes response latency targets, interruption thresholds, how aggressively the agent tries to keep the user engaged, and what counts as a successful eval.
Building evaluation criteria that reflect your actual goals is an engineering decision as much as a product one. A session that ends early because the user got what they needed can be a better outcome than a long session that left them uncertain. Your metrics should be able to tell the difference.
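As a small illustration of metrics that can tell the difference, compare an engagement metric with an outcome metric on the same sessions. The `goal_met` field is a hypothetical signal (for example, a label from a review pass or a follow-up question), and both function names are mine.

```python
from dataclasses import dataclass


@dataclass
class Session:
    duration_s: float
    goal_met: bool      # assumed signal: did the user get what they needed?


def engagement_metric(sessions):
    """Mean session length: rewards long sessions regardless of outcome."""
    return sum(s.duration_s for s in sessions) / len(sessions)


def outcome_metric(sessions):
    """Share of sessions where the user got what they needed."""
    return sum(s.goal_met for s in sessions) / len(sessions)


short_and_useful = [Session(60, True), Session(90, True)]
long_and_unclear = [Session(600, False), Session(900, False)]
```

The two metrics rank these groups in opposite order, which is the drift described above: optimize only the first one and the long, unresolved sessions look like the better product.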
Where to start
If you are building something in this space and want to get a feel for the architecture before committing to anything, I would suggest starting with the turn-taking and interruption handling. It is the piece most developers underestimate, and it reveals a lot of other dependencies quickly.
Get the audio loop working first. Add the avatar layer second. Keep the conversational logic in your own hands from the start, even if the initial version is simple.
The complexity grows fast once you put a real user in front of it. Better to understand where the hard parts are early.

