Co-browsing AI: The Product Nobody Knew They Needed

Every high-intent web flow has the same problem. Someone lands on your loan application. They get to step 3. One field doesn't make sense. Nobody's there to answer. They leave.

You paid to acquire that user. You built the product. The form is fine. But there was a gap between "I want to do this" and "I know how to complete this"—and nobody closed it.

Live chat tried to solve it. The user describes what they're seeing. The agent guesses. The user explains again. The session ends with an "I'll look into this and get back to you." They don't come back.

Co-browsing AI changes the geometry of this entirely. The agent isn't guessing what's on the user's screen. It already knows. It can see the exact field the user is stuck on, highlight it, pre-fill it with data already in the system, and—if there's a voice call running—talk through it in real time while the form fills itself.

What co-browsing actually is (and what it isn't)

Co-browsing is not screen recording. It's not a support tool that captures sessions for review later. It's a live, bidirectional session layer that gives an AI agent contextual awareness of what the user is doing on the page right now.

The distinction matters technically.

Screen recording captures pixels. Co-browsing captures the DOM—the structured document tree of the page. This means the agent isn't reading an image of a form field. It knows the field's ID, its label, its current value, whether it's in an error state, and where it sits in the form's completion sequence. It can act on this programmatically, not visually.
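To make "captures the DOM" concrete, here is a minimal sketch of turning one form field into a structured record. The helper name `describeField` and the record shape are assumptions made for this example, not part of any SDK. It reads only properties that exist on a standard input element (`id`, `value`, `labels`, `validity.valid`, `validationMessage`), so the usage below can pass a plain stand-in object and run outside a browser.

```javascript
// Hypothetical helper: converts a field-like object into a structured record.
// The input mirrors a subset of HTMLInputElement properties, so the same
// function could be handed a real element in the browser.
function describeField(el, focusedId) {
  const label = el.labels && el.labels.length > 0
    ? el.labels[0].textContent.trim()
    : null;
  const hasError = el.validity ? !el.validity.valid : false;

  // Status values follow the ones used in this post: active, complete, pending.
  let status = 'pending';
  if (el.id === focusedId) status = 'active';
  else if (el.value !== '' && !hasError) status = 'complete';

  return {
    id: el.id,
    label,
    value: el.value,
    status,
    error: hasError ? el.validationMessage || 'invalid' : null,
  };
}

// Usage with a stand-in object instead of a live DOM node.
const pinField = {
  id: 'property_pin_code',
  value: '',
  labels: [{ textContent: ' PIN Code ' }],
  validity: { valid: true },
  validationMessage: '',
};
const record = describeField(pinField, 'property_pin_code');
```

None of this involves reading pixels. The agent gets the field's identity and state directly, which is what lets it act programmatically.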

// Simplified co-browse session initialization
const session = await CoBrowseSDK.init({
  sessionToken: await generateConsentToken(userId),
  maskedFields: ['[data-sensitive]', 'input[type="password"]', '#card-number'],
  allowedActions: ['highlight', 'autofill', 'scroll'],
  consentRequired: true,
  consentPrompt: "Allow our assistant to guide you through this form?",
  onConsent: (granted) => {
    if (granted) socket.emit('session:start', { userId, pageContext });
  }
});

Three things in that snippet matter:

maskedFields—PII and payment fields are masked before the session stream reaches the agent. The agent never sees raw card numbers or passwords. It sees a redacted placeholder. This is not optional for compliance.

allowedActions—The agent can highlight, fill, and scroll. It cannot click submit on behalf of the user. Actions that complete irreversible transactions stay with the user.

consentRequired: true—Co-browsing without explicit opt-in is a surveillance tool. The consent prompt has to be clear, dismissible, and revokable mid-session. Deployments that bury this in a terms-of-service checkbox will eventually create problems for the whole category.
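The allow-list idea can be sketched as a small gate that every agent action passes through before it touches the page. This is an illustration under assumptions: `authorizeAction` is a hypothetical name, and the rule that refuses autofill into masked fields is added for this example rather than taken from any SDK.

```javascript
// Hypothetical gate: only actions on the allow-list ever reach the page.
const ALLOWED_ACTIONS = new Set(['highlight', 'autofill', 'scroll']);

function authorizeAction(action, maskedFieldIds) {
  if (!ALLOWED_ACTIONS.has(action.type)) {
    return { ok: false, reason: `action "${action.type}" is not allowed` };
  }
  // Assumption for this sketch: the agent should not write to fields
  // it is not permitted to read.
  if (action.type === 'autofill' && maskedFieldIds.has(action.fieldId)) {
    return { ok: false, reason: 'target field is masked' };
  }
  return { ok: true };
}

const masked = new Set(['card_number']);
const submitAttempt = authorizeAction({ type: 'submit' }, masked);
const pinFill = authorizeAction({ type: 'autofill', fieldId: 'property_pin_code' }, masked);
```

Here `submitAttempt.ok` is false and `pinFill.ok` is true. A deny-by-default check like this keeps irreversible actions with the user even if the reasoning layer asks for them.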

The architecture: how the agent sees the page

Once a co-browsing session is live, a lightweight DOM observer serializes the page state and streams it to the agent's reasoning layer. The agent isn't running a browser internally. It's receiving a structured representation of the user's current view.

User's Browser                    Co-Browse Backend             AI Reasoning Layer
─────────────────                 ──────────────────            ──────────────────
DOM mutation observer             WebSocket relay               LLM with page context
      │                                  │                              │
      │── serialized DOM diff ──────────>│── structured page state ───>│
      │                                  │                              │
      │<── highlight instruction ────────│<── agent action payload ─────│
      │<── autofill payload ─────────────│                              │
      │<── scroll command ───────────────│                              │

The mutation observer sends diffs, not full page snapshots. A full-page serialization on every keypress would be prohibitive. The diff approach keeps the stream lean—typically 2–8KB per event depending on page complexity.
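A heavily simplified sketch of the diff idea follows. Real mutation-observer diffs describe changes to a tree; this version compares two flat `{ fieldId: value }` snapshots, which is enough to show why a diff stays small. The function name `diffFormState` is an assumption for this example.

```javascript
// Hypothetical diff: compares two { fieldId: value } snapshots and returns
// only the entries that changed, plus the ids that disappeared.
function diffFormState(prev, next) {
  const changed = {};
  for (const [id, value] of Object.entries(next)) {
    if (prev[id] !== value) changed[id] = value;
  }
  const removed = Object.keys(prev).filter((id) => !(id in next));
  return { changed, removed };
}

const before = { property_type: 'Apartment', property_pin_code: '' };
const after = { property_type: 'Apartment', property_pin_code: '5601' };
const delta = diffFormState(before, after);
// delta is { changed: { property_pin_code: '5601' }, removed: [] }
```

Only the field the user touched goes over the wire. The untouched fields cost nothing.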

On the reasoning side, the LLM receives a structured context window that looks like this:

{
  "page": {
    "url": "https://app.example.com/apply/step-3",
    "title": "Property Details—Step 3 of 5",
    "form_state": {
      "completion_pct": 42,
      "current_field_focus": "property_pin_code",
      "fields": [
        { "id": "property_type", "label": "Property Type", "value": "Apartment", "status": "complete" },
        { "id": "property_pin_code", "label": "PIN Code", "value": "", "status": "active", "error": null },
        { "id": "property_value", "label": "Approx. Value", "value": "", "status": "pending" }
      ]
    },
    "stall_duration_ms": 34200,
    "rage_clicks": 0,
    "scroll_depth_pct": 61
  },
  "user": {
    "id": "u_9182",
    "crm_data": { "city": "Bengaluru", "known_pin": "560103" },
    "active_call": true,
    "call_transcript_last_30s": "main pin code fill karna chahta hoon..."
  }
}

stall_duration_ms: 34200—the user has been on the PIN code field for 34 seconds without typing. That's the trigger. The agent now knows to intervene, knows the user is on a call, knows the CRM has a pin code on file, and can pre-fill it with one action while saying "I've got your PIN code—I'll fill that in for you."
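One way to express that decision in code, using the context shape shown above. This is a sketch, not the production logic: `decideIntervention`, the 20-second threshold constant, and the mapping from field IDs to CRM keys are all assumptions for illustration.

```javascript
// Hypothetical decision function over the context shape shown above.
const STALL_THRESHOLD_MS = 20000; // assumed threshold for this sketch

function decideIntervention(ctx) {
  const { form_state: form, stall_duration_ms: stalledMs } = ctx.page;
  if (stalledMs <= STALL_THRESHOLD_MS) return { type: 'none' };

  const fieldId = form.current_field_focus;
  // Assumed mapping from form field ids to CRM keys; a real deployment
  // would define its own.
  const crmKeyByField = { property_pin_code: 'known_pin' };
  const known = ctx.user.crm_data[crmKeyByField[fieldId]];

  if (known !== undefined) {
    // speak tells the caller whether to pair the autofill with a voice line.
    return { type: 'autofill', fieldId, value: known, speak: ctx.user.active_call };
  }
  return { type: 'prompt', fieldId, message: 'Need help with this field?' };
}

const decision = decideIntervention({
  page: {
    form_state: { current_field_focus: 'property_pin_code' },
    stall_duration_ms: 34200,
  },
  user: { crm_data: { city: 'Bengaluru', known_pin: '560103' }, active_call: true },
});
// decision.type is 'autofill' and decision.value is '560103'
```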

Where it's actually deployed and what the numbers say

Co-browsing AI is not a research demo. It's running in production on high-stakes form flows—loan applications, property registrations, insurance claims, university admissions. These are contexts where one confusing field causes abandonment, and abandonment has a direct, measurable revenue cost.

Use case	Baseline form completion	With co-browse AI	Drop-off recovery
Loan application (5-step)	31%	68%	~42% of abandoned sessions
Property registration	44%	79%	~38%
Insurance FNOL intake	39%	74%	~45%
University admissions form	52%	83%	~36%
KYC / document upload flow	27%	61%	~48%

The 42% drop-off recovery figure for loan applications deserves some unpacking. It doesn't mean 42% of people who abandon forms come back. It means: among users who stalled for > 20 seconds on a field and received an AI intervention, 42% completed the form vs. a control group that received no intervention.

The stall threshold and intervention trigger design matter a lot here. Intervening too early feels intrusive: the user has just finished typing and the agent jumps in. Too late, and they've already opened a new tab. The sweet spot in most deployments is a 15–25 second stall with a gentle, non-blocking prompt ("Need help with this field?") rather than an immediate autofill.

Voice + co-browse: the multimodal session

This is the architecture that changes what's possible. Not co-browsing in isolation—the combination of a live voice call and a co-browsing session, coordinated by the same reasoning layer.

The user is on a call with an AI telephony agent about their home loan application. They're also trying to fill out the application on the website. Without co-browse, the voice agent is guessing what they're looking at. With co-browse, the voice agent can see exactly which field is active, which ones are complete, and what data the CRM already has.

Voice Agent (Telephony)           Co-Browse Agent (Web)
──────────────────────            ──────────────────────
"Main address section             [Sees: address field
 pe hoon, PIN code                 is active, CRM has
 kya daalu?"                       560103 on file]
 (I'm on the address
 section. What PIN code
 should I enter?)
         │                                 │
         └──────── Shared reasoning ───────┘
                   layer (same session)
                         │
                    ┌────▼─────────────────────────────┐
                    │ Action: autofill PIN = 560103    │
                    │ Speech: "I've filled that. Now   │
                    │ what's the property value?"      │
                    └──────────────────────────────────┘

The voice response and the DOM action fire in the same response cycle. The user hears the answer and sees the field populate simultaneously. This coordination requires the telephony session and the co-browsing session to share context through the same orchestration layer—not two separate agents that happen to be talking to the same user.
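A minimal sketch of what "same response cycle" can mean in code: one planning step produces both the spoken line and the page action, and one dispatch step sends them together. The names `planTurn` and `dispatchTurn`, the channel interface, and the assumption that CRM data is keyed by field ID are all hypothetical.

```javascript
// Hypothetical orchestrator step: one decision yields both channel outputs.
function planTurn(context) {
  const fieldId = context.pageState.currentFieldFocus;
  const known = context.crmData[fieldId]; // assumes CRM data keyed by field id
  if (known === undefined) {
    return {
      speech: 'I do not have that on file. Could you type it in?',
      domAction: { type: 'highlight', fieldId },
    };
  }
  return {
    speech: "I've filled that in for you.",
    domAction: { type: 'autofill', fieldId, value: known },
  };
}

// Dispatch both halves together so neither channel runs ahead of the other.
async function dispatchTurn(turn, channels) {
  await Promise.all([
    channels.speak(turn.speech),
    channels.applyDomAction(turn.domAction),
  ]);
}
```

Because the speech and the DOM action come out of the same plan, they cannot disagree about which field is being discussed. Two separate agents would struggle to give you that.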

Session state sync between telephony and web channels:

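class SharedSessionContext:
    """Single source of truth shared by the telephony and co-browse agents."""

    def __init__(self, user_id: str):
        self.user_id = user_id
        self.voice_transcript = []
        self.page_state = {}
        self.crm_data = {}
        self.last_action = None

    async def sync_voice_turn(self, utterance: str, intent: dict) -> dict:
        self.voice_transcript.append(utterance)
        # Page state informs the voice response
        focused = self.page_state.get("current_field_focus")
        if focused:
            intent["active_field"] = focused
        return intent

    async def sync_page_event(self, dom_diff: dict) -> None:
        self.page_state.update(dom_diff)
        # A stall of more than 20 seconds hands off to the intervention logic
        if dom_diff.get("stall_duration_ms", 0) > 20000:
            await self.trigger_cobrowse_intervention()

    async def trigger_cobrowse_intervention(self) -> None:
        # Placeholder, not part of the original snippet: a real system would
        # notify the voice and co-browse agents here. This stub only records
        # that an intervention was triggered.
        self.last_action = {
            "type": "cobrowse_intervention",
            "field": self.page_state.get("current_field_focus"),
        }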

The key is SharedSessionContext—a single object that both the telephony agent and co-browse agent read from and write to. Without this, you get two agents with partial information about the same user, and the coordination breaks down.

Privacy and consent

I want to spend real time on privacy and consent because it's the part most implementation guides leave to the end.

Co-browsing, done wrong, is a surveillance product. An agent that can see what's on your screen, that can fill your forms, that can scroll your page—this has obvious and serious potential for misuse. The fact that it's AI-powered doesn't make the consent questions go away. It makes them more important.

What the agent should never see:

Field type	Handling
Passwords, PINs	Masked at DOM level—never in session stream
Payment card numbers	Replaced with `[MASKED]` before serialization
Government IDs (Aadhaar, PAN)	Masked unless user explicitly shares for KYC
OTPs	Excluded from session stream entirely
Sensitive health data	Masked by field attribute `data-sensitive`
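The table above can be sketched as a masking pass that runs before any field leaves the browser. This is a minimal illustration: `maskFields` and the field object shape are assumptions, with `sensitive` standing in for a `data-sensitive` attribute. The `cc-number` and `one-time-code` values are standard HTML `autocomplete` tokens, used here as one plausible way to spot card and OTP inputs.

```javascript
// Hypothetical masking pass, applied before any field is serialized.
const MASK = '[MASKED]';

function isSensitive(field) {
  return (
    field.type === 'password' ||
    field.sensitive === true || // stands in for a data-sensitive attribute
    field.autocomplete === 'cc-number'
  );
}

function maskFields(fields) {
  return fields
    // OTP fields are dropped entirely rather than masked.
    .filter((f) => f.autocomplete !== 'one-time-code')
    .map((f) => (isSensitive(f) ? { ...f, value: MASK } : f));
}

const outgoing = maskFields([
  { id: 'property_pin_code', type: 'text', value: '560103' },
  { id: 'card_number', type: 'text', autocomplete: 'cc-number', value: '4111111111111111' },
  { id: 'otp', type: 'text', autocomplete: 'one-time-code', value: '482913' },
  { id: 'password', type: 'password', value: 'hunter2' },
]);
```

After this pass, `outgoing` has three entries: the postal PIN code untouched, and the card number and password both replaced with `[MASKED]`. The OTP field is gone. The masking happens on the client, so the raw values never enter the session stream at all.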

Session data retention:

Co-browse session data should not be retained beyond the interaction unless explicitly required for compliance (and then it needs audit controls). The DOM diffs that power the live session are different from transcripts or recordings—they're operational data for the duration of the call, not a record to be stored.
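As a rough sketch of "operational data, not a record", the buffer below holds diffs in memory only and drops them the moment the session ends. The class name and its methods are assumptions for illustration. Any compliance-driven retention would need a separate, audited path that this sketch does not cover.

```javascript
// Hypothetical in-memory store: diffs live only as long as the session.
class EphemeralSessionBuffer {
  constructor() {
    this.diffs = [];
    this.ended = false;
  }

  append(diff) {
    if (this.ended) throw new Error('session already ended');
    this.diffs.push(diff);
  }

  // Called on hang-up, consent revocation, or timeout.
  end() {
    this.diffs.length = 0; // drop the operational data
    this.ended = true;
  }
}

const buffer = new EphemeralSessionBuffer();
buffer.append({ changed: { property_pin_code: '5601' } });
buffer.end();
// buffer.diffs is now empty, and further appends throw
```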

Consent architecture:

// Consent must be: explicit, specific, revocable
const consentFlow = {
  prompt: "Our assistant would like to see your current form to help you complete it.",
  detail: "We can see form fields and your progress—not passwords or payment info.",
  accept: "Yes, guide me",
  decline: "No thanks",
  revoke: "Stop sharing my screen", // always visible during session
  onRevoke: () => {
    session.terminate();
    notifyAgent("cobrowse_revoked");
  }
};

GDPR and India's DPDP Act both have implications for session data collected via co-browsing. "The user clicked Accept" is not sufficient documentation on its own. The specific capability consented to (seeing form state, but not filling without confirmation, etc.) needs to be recorded.
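Recording the specific capability might look something like the sketch below. The function name, the capability names, and the `promptVersion` field are all assumptions for this example. What a regulator actually requires is a question for counsel, not for this snippet.

```javascript
// Hypothetical consent record: captures which capabilities were granted,
// not just that "Accept" was clicked.
function buildConsentRecord({ userId, granted, promptVersion, now = new Date() }) {
  const known = ['view_form_state', 'highlight', 'autofill', 'scroll'];
  const unknown = granted.filter((c) => !known.includes(c));
  if (unknown.length > 0) {
    throw new Error(`unknown capability: ${unknown.join(', ')}`);
  }
  return Object.freeze({
    userId,
    capabilities: [...granted],
    promptVersion, // which wording of the prompt the user actually saw
    grantedAt: now.toISOString(),
    revokedAt: null,
  });
}

const consent = buildConsentRecord({
  userId: 'u_9182',
  granted: ['view_form_state', 'highlight'],
  promptVersion: 'v1',
  now: new Date('2024-01-01T00:00:00Z'),
});
```

In this example the user agreed to viewing and highlighting but not autofill, and the record says so. That is the difference between "the user clicked Accept" and documentation you can stand behind.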

Trigger logic: when the agent intervenes

The quality of the intervention trigger design determines whether co-browsing feels like a helpful colleague or an intrusive surveillance system. These are the event types that production systems use:

Trigger event	Threshold	Intervention type
Stall on field	> 20 seconds no input	Soft prompt: "Need help here?"
Field validation error	Error state persists > 10s	Direct guidance on correction
Rage click (3+ clicks same element)	Immediate	Proactive offer to assist
Scroll thrash (up/down > 3x)	Immediate	Navigation guidance
Session idle (no scroll/click)	> 45 seconds	Re-engagement: "Still there?"
Form abandon intent (back/close)	On `beforeunload`	Exit-intent recovery prompt
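The in-session rows of the table above translate fairly directly into a classifier. A minimal sketch, with assumptions flagged: the signal names are invented for this example, and the priority order (rage clicks first, idle last) is one plausible choice that the table itself does not specify. Exit intent is left out because it is event-driven rather than threshold-driven.

```javascript
// Hypothetical mapping from observed signals to the intervention types
// in the table above. Missing signals are treated as "not triggered".
function classifyTrigger(signals) {
  if (signals.rageClicks >= 3) return 'proactive_offer';
  if (signals.scrollReversals > 3) return 'navigation_guidance';
  if (signals.errorPersistMs > 10000) return 'correction_guidance';
  if (signals.fieldStallMs > 20000) return 'soft_prompt';
  if (signals.idleMs > 45000) return 're_engagement';
  return null; // nothing crossed a threshold, so stay quiet
}

const stalled = classifyTrigger({ fieldStallMs: 34200 });   // 'soft_prompt'
const frustrated = classifyTrigger({ rageClicks: 4, fieldStallMs: 34200 }); // 'proactive_offer'
const calm = classifyTrigger({ fieldStallMs: 8000 });       // null
```

Returning `null` by default matters as much as the thresholds. An agent that speaks up when nothing is wrong is the intrusive version of this product.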

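The exit-intent recovery on beforeunload is worth highlighting, with one caveat about what the browser allows. A beforeunload handler has to act synchronously, and all it can do is ask the browser to show its own generic "Leave site?" confirmation. Custom text is not rendered at that point, and most browsers ignore the request unless the user has already interacted with the page. The realistic pattern is to use that dialog to buy a moment, then show the agent's offer in-page if the user chooses to stay. This is still the highest-leverage intervention point: the user was close enough to complete the form that they spent time on it, and one question or one autofill might be enough to get them over the line.

// beforeunload handlers must run synchronously: the browser will not wait
// for a promise, so completion is cached ahead of time instead of awaited.
let cachedCompletionPct = 0;
async function refreshCompletion() {
  const completion = await session.getFormCompletion();
  cachedCompletionPct = completion.pct;
}
// Refresh the cached value whenever any field on the page changes.
document.addEventListener('input', refreshCompletion);

window.addEventListener('beforeunload', (e) => {
  if (cachedCompletionPct > 30 && cachedCompletionPct < 90) {
    // Ask the browser for its generic confirmation dialog.
    e.preventDefault();
    e.returnValue = '';
    // Fire-and-forget: the in-page offer is what the user sees if they stay.
    session.triggerExitIntervention({
      message: `You're ${cachedCompletionPct}% through. Want help finishing?`,
      action: 'pause_navigation'
    });
  }
});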

This only fires when the form is 30–90% complete. Below 30%, the user probably just bounced. Above 90%, they may have intentionally stopped. The middle range is where intervention pays off.

SDK integration—what dropping it into a page looks like

The client-side footprint is intentionally minimal. A co-browse widget shouldn't add 400KB to your page bundle or require a new build pipeline.

<!-- Drop into your page—async load, no blocking -->
<script>
  (function(w, d, s, c) {
    var f = d.createElement(s), x = d.getElementsByTagName(s)[0];
    f.async = true;
    f.src = 'https://cdn.arvo.ai/cobrowse/v2/widget.min.js';
    f.setAttribute('data-client-id', c);
    x.parentNode.insertBefore(f, x);
  })(window, document, 'script', 'YOUR_CLIENT_ID');
</script>

The widget script is < 18KB gzipped. It initializes a WebSocket connection only after user consent. Until consent, it collects nothing.
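"Collects nothing until consent" is easiest to guarantee structurally: don't open the connection until consent is granted. Below is a minimal sketch of that idea, not the widget's actual internals. `createConsentGatedConnection` is a hypothetical name, and the socket is injected as a factory so the example runs without a network.

```javascript
// Hypothetical wrapper: the socket does not exist until consent is granted.
function createConsentGatedConnection(openSocket) {
  let socket = null;
  return {
    grantConsent() {
      if (!socket) socket = openSocket();
      return socket;
    },
    revokeConsent() {
      if (socket) {
        socket.close();
        socket = null;
      }
    },
    isConnected() {
      return socket !== null;
    },
  };
}

// Usage with a stand-in socket; in a browser the factory could return
// new WebSocket(url) instead.
let opened = 0;
let closed = 0;
const connection = createConsentGatedConnection(() => {
  opened += 1;
  return { close: () => { closed += 1; } };
});
```

Before `grantConsent()` is called there is no socket to leak data through. Revoking tears it down again.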

For React applications, the SDK wraps into a provider pattern:

import { CoBrowseProvider, useCoBrowseSession } from '@arvo/cobrowse-react';

function LoanApplicationFlow() {
  const { sessionActive, triggerHelp } = useCoBrowseSession();

  return (
    <CoBrowseProvider
      clientId={process.env.NEXT_PUBLIC_COBROWSE_ID}
      maskedSelectors={['[data-pii]', 'input[type="password"]']}
      stallThresholdMs={20000}
      onStall={(fieldId) => triggerHelp(fieldId)}
    >
      <ApplicationForm />
      {sessionActive && <CoBrowseIndicator />}
    </CoBrowseProvider>
  );
}

CoBrowseIndicator is a small persistent badge showing the user that a session is active. Hiding this during a session—even with consent already given—is bad practice. Users should always know, at a glance, whether the agent can see their screen.

What it doesn't solve

Co-browsing AI is very good at closing the gap between "I want to complete this" and "I don't know how to complete this." It's not useful when the user doesn't want to complete it.

If someone landed on a loan application page from a retargeting ad, spent 90 seconds, and abandoned because they decided the interest rate is too high—no amount of form guidance brings them back. The product or pricing is the issue.

The deployments that over-rely on co-browse intervention to compensate for a confusing product tend to see diminishing returns after the first few months. The right diagnosis is: if your co-browse intervention rate on a specific field is above 30%, that field needs to be redesigned, not guided around.
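That 30% rule is easy to turn into a report. Here is a minimal sketch, where `fieldsNeedingRedesign` and the stats shape are assumptions, and the counts in the usage example are made up purely to exercise the function.

```javascript
// Hypothetical report: flags fields whose intervention rate exceeds 30%.
function fieldsNeedingRedesign(stats, threshold = 0.3) {
  return stats
    .filter((s) => s.sessions > 0 && s.interventions / s.sessions > threshold)
    .map((s) => s.fieldId);
}

// Made-up counts for illustration only, not from any deployment.
const flagged = fieldsNeedingRedesign([
  { fieldId: 'property_pin_code', sessions: 200, interventions: 80 }, // 40%
  { fieldId: 'property_type', sessions: 200, interventions: 10 },     // 5%
  { fieldId: 'property_value', sessions: 0, interventions: 0 },       // no traffic yet
]);
// flagged is ['property_pin_code']
```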

Co-browse analytics are also a product feedback loop. Where users stall, what fields generate the most interventions, which pages have the highest exit-intent triggers—this is the signal your product team should be reading every sprint. The AI agent is solving the immediate problem. The patterns it surfaces should be fixing the underlying one.

Part of the Arvo.ai technical blog series on conversation-led agentic AI for enterprises. www.auum.in