Benefits Bot
Turning an internal LLM into Oscar's first AI-powered member experience
Benefits Bot is Oscar's first member-facing AI assistant for insurance benefits. It provides immediate, reliable answers to coverage questions through the member messaging experience, even outside business hours when Care Guides are not available.
As a full-stack engineer on the Member Experience team, I led the integration of the Benefits Bot into secure messaging. I partnered with product, design, and the platform team behind Oscar's internal LLM to define how the model should behave for members, design the chat experience, and implement the backend and frontend changes needed to launch the pilot. The project bridged AI capabilities and user-focused design to make complex benefits information more accessible and trustworthy.
Background
Members needed immediate answers at critical moments
Even as we clarified the benefits experience, members still turned to messaging when they felt urgency or uncertainty. Messaging offered something the UI couldn't: immediate, personalized reassurance. Before appointments or during sudden health concerns, members reached out with similar questions shaped by different personal contexts, such as:
"Hello, does my plan cover massage therapy?"
"I have an ultrasound, pelvis and abdomen planned for Monday, Feb 17 at 8 AM at Pascack Valley Hospital. Please let me know if this is covered by insurance and what my cost will be. Thank you!"
"Hi, I'm experiencing extreme pain in my left shoulder. I'm worried that I might have a hair line fracture. Would I be covered to go to urgent care and get an x-ray?"
The Problem
With no live chat, members' only option was secure messaging with Care Guides, where responses could take up to one business day. That delay created friction at time-sensitive moments when members needed quick answers to proceed with their care.
Care Guides, meanwhile, spent significant time fielding repetitive questions instead of focusing on complex cases that required empathy or human judgement. Each message added operational cost, and answering accurately often meant navigating multiple internal systems spanning plan design, networks, and authorization rules.
The Opportunity
Oscar had built an internal LLM that Care Guides relied on to look up plan details and coverage rules. This presented a clear opportunity: adapt it for members so benefits questions could be answered safely in seconds.
Turning an internal expert tool into a member-facing system was more than a technical integration problem. We needed to adjust how the model answered, weave the experience into existing secure messaging, and reformat dense citations so they actually supported member understanding and next steps. This led us to ask:
How might we adapt Oscar's internal, expert-oriented LLM so it can deliver fast, trustworthy benefits answers to non-expert members, while keeping clear paths to human support when needed?
Insights & Experience Requirements
To shape the experience, I collaborated with product, design, UXR, Care Guides, and the AI platform team. We drew from secure messaging transcripts, member feedback, early usage of Benefits Bot as a Care Guide tool, and the current LLM workflow.
What we saw in member messages
  • Most members wrote in everyday language, not insurance terms: "Will my IUD be covered?", "I think I broke my arm, where can I go?"
  • Many wanted quick, concrete answers and felt frustrated by long response times, especially during sudden health concerns.
  • Members asked for links or documents they could save for later reference.
  • Some specifically appreciated the emotional support and reassurance from Care Guides.
What we learned from the internal pilot
Care Guides' feedback on the internal tool surfaced issues that would be magnified for members:
  • Citations were accurate but returned as dense, unstructured text that was hard to parse.
  • Response times sometimes stretched to several minutes, which would feel broken without visible feedback that a response was in progress.
From this, we knew the assistant would need to:
  • Answer in clear, concise language that matches how members naturally ask questions.
  • Provide source references and links members can revisit.
  • Easily escalate to a Care Guide should the member want to consult a human.
Before releasing to members, this meant:
  • Citations needed to be reformatted into scannable, member-friendly structures.
  • The chat UI needed clear signals (typing indicators, loading states) to show that the assistant was working on an answer.
Together, these inputs reinforced that we were not just "plugging in an LLM." We were designing a safe, guided layer around it: how it responded, how it appeared in messaging, and how it built trust.
Problem Framing and Approach
To capture and address these considerations, we organized the work into 3 design focus areas:
01
LLM Interaction Design
How the bot thinks: which sources it uses, how long its context window is, and when it escalates to a Care Guide.
02
User Journey and Chat Display
Where and how members encounter the bot in secure messaging, what they see while it is "thinking," and how the experience transitions between bot and human.
03
Source Citations and Actions
How the bot explains where its answer came from, and how it turns dense plan references into links and actions that help members move forward.
Ideation & Exploration
LLM Interaction Design
We started ideation by drafting a product roadmap for the LLM interaction flow: when the bot should participate, what context it should use, and when it should hand off to a human instead of guessing. Because this was a pilot in a high-stakes domain, we explored a range of ideas but landed on a tight scope and conservative choices. We designed the bot to behave as follows:
When the LLM is involved
  • Because we were embedding this into secure messaging, all members in the pilot started with Benefits Bot when opening a new secure messaging thread. From there, the LLM decided whether it could safely answer or should hand off to a Care Guide.
  • Benefits Bot only attempted to respond at the start of a new secure messaging thread.
  • The assistant responded only when a message met all of the following requirements:
  • The member is asking a benefits-related question.
  • The question is in English.
  • The model does not detect conflicting information while forming a suggested response.
  • If any of these checks failed, the bot did not reply and the conversation was handled by a Care Guide instead.
  • Benefits Bot also needed to accurately handle messages like "I want to talk to a human" and immediately transition the thread to be handled by a Care Guide.
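To make these rules concrete, here is a minimal sketch of how the gating checks could compose. The classifier fields and names below are illustrative assumptions, not Oscar's actual implementation:

```typescript
// Hypothetical sketch of the gating checks that decide whether Benefits Bot
// replies to the first message of a new secure messaging thread.

interface MessageSignals {
  isBenefitsQuestion: boolean;  // is the member asking a benefits-related question?
  language: string;             // detected language code, e.g. "en"
  requestsHuman: boolean;       // e.g. "I want to talk to a human"
  hasConflictingInfo: boolean;  // model flagged conflicts while drafting a response
}

type GateDecision =
  | { respond: true }
  | { respond: false; reason: "human_requested" | "not_benefits" | "not_english" | "conflicting_info" };

function gateFirstMessage(signals: MessageSignals): GateDecision {
  // An explicit request for a human always wins: hand the thread to a Care Guide.
  if (signals.requestsHuman) return { respond: false, reason: "human_requested" };

  // All three checks must pass; otherwise the bot stays silent
  // and a Care Guide handles the thread.
  if (!signals.isBenefitsQuestion) return { respond: false, reason: "not_benefits" };
  if (signals.language !== "en") return { respond: false, reason: "not_english" };
  if (signals.hasConflictingInfo) return { respond: false, reason: "conflicting_info" };

  return { respond: true };
}
```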
Decision Flow
How Benefits Bot handled a question
  • The bot was scoped to process one well-defined benefits question per thread. After the initial AI interaction, any follow-up questions were answered by a human.
  • It could ask clarifying questions (e.g. "Can you tell me which part of your body you want an MRI (for example, brain, knee, spine, or another area)?") before returning a single, well-informed answer.
  • The context window was limited to this Q&A exchange within the thread. Once it answered and a handoff happened, its "memory" for that interaction ended. The conversation history still remained available for Care Guides to review.
  • If any dependent AI service was unavailable (for example, an OpenAI outage), the thread was automatically moved to be handled by a Care Guide.
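The thread-level routing above reduces to a small decision. The sketch below uses assumed state fields, not the production logic:

```typescript
// Illustrative sketch of who handles the next message in a thread.
// Benefits Bot owns a single Q&A exchange at the start of the thread;
// everything after that, or anything during an outage, goes to a Care Guide.

type ThreadOwner = "benefits_bot" | "care_guide";

interface ThreadState {
  owner: ThreadOwner;
  botHasAnswered: boolean;      // true once the single, final answer was delivered
                                // (clarifying questions do not flip this flag)
  aiServiceAvailable: boolean;  // e.g. false during an OpenAI outage
}

function routeNextMessage(state: ThreadState): ThreadOwner {
  // A dependent AI service outage immediately moves the thread to a Care Guide.
  if (!state.aiServiceAvailable) return "care_guide";

  // Once the bot has delivered its answer, all follow-ups go to a human.
  if (state.botHasAnswered) return "care_guide";

  return state.owner;
}
```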
What it could use to answer
  • The bot only referenced Oscar's internal tools and data (plan documents, benefits rules, member plan and policy details) to generate answers.
  • It did not call external tools or sources, so all guidance stayed grounded in Oscar's source of truth.
Ideation & Exploration
User Journey and Chat Display
Even with a well-scoped LLM, the experience would fail if it felt confusing or misleading. We needed to decide how members first encounter Benefits Bot in secure messaging, what they see while it is formulating a response, and how the thread moves between bot and human.
We focused on 3 questions to map the user flow:
  1. How do we introduce the bot so it is clearly AI, not a human?
  2. What should members see while the LLM is "thinking," especially when responses take time?
  3. How do we handle follow-up questions and "I want a human" moments in the same thread?
Introducing Benefits Bot in secure messaging
We explored both adding a new "AI chat" surface and embedding the assistant where members already ask for help. To avoid fragmenting support and losing context upon Care Guide handoff, we chose to keep everything inside the existing secure messaging flow and make the AI identity explicit:
AI-specific identity
Use "Oscar AI Support" as the participant name with an AI gradient avatar so it looks distinct from Care Guides at a glance.
Upfront disclosure
Show a short notice on chat open that responses may be generated by AI.
Per-message labeling
Clearly mark each Benefits Bot answer as AI-generated in the thread, so members always know who is replying.
While the bot is working
Early internal tests showed that LLM responses could take time, and without feedback the experience felt broken. We tried static loading messages ("Your question is being processed…") and simple spinners, but they felt disconnected from the conversation. We decided to lean on existing chat patterns instead:
Typing indicator
Reuse the existing typing indicator pattern to show that the member's inquiry is being processed.
Chat-style follow-ups
When the bot needs more detail, it asks follow-up questions as individual chat messages rather than long forms, matching how members already use messaging.
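As a rough illustration of the client-side behavior, the chat surface keeps the typing indicator up while the LLM call is in flight and clears it even if the call fails. The ChatClient interface below is a hypothetical placeholder, not Oscar's actual messaging API:

```typescript
// Hypothetical sketch: show the existing typing indicator while the LLM
// response is in flight, and clear it whether the call succeeds or fails.

interface ChatClient {
  showTypingIndicator(threadId: string): void;
  hideTypingIndicator(threadId: string): void;
  appendMessage(threadId: string, text: string, isAiGenerated: boolean): void;
}

async function deliverBotReply(
  client: ChatClient,
  threadId: string,
  fetchBotAnswer: () => Promise<string>,
): Promise<void> {
  client.showTypingIndicator(threadId);
  try {
    const answer = await fetchBotAnswer();
    // Each Benefits Bot message is labeled as AI-generated in the thread.
    client.appendMessage(threadId, answer, true);
  } finally {
    client.hideTypingIndicator(threadId);
  }
}
```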
Follow-up questions and escalation within the thread
The bot was not suited to answer every message in a thread, and some members explicitly preferred a human.
We landed on a simple set of rules:
  • Bot handles the first in-scope question
  • Benefits Bot answers the initial benefits question at the start of the thread.
  • Subsequent questions in the same thread are routed to a Care Guide, with a clear note that responses may take up to one business day.
  • Respect explicit human requests
  • If members say things like "I want to talk to a human" or "no bot please," the bot accurately processes this and transitions the thread to a Care Guide.
  • Clear handoff messaging
  • In both out-of-scope and "human, please" cases, we tell members who will reply next and when.
Ideation & Exploration
Source Citation and Action Design
On the internal tool, the citations from Benefits Bot looked like this:
The information was technically accurate, but the unstructured format was overwhelming to read and act on. Improving how the bot displayed its sources quickly became one of our top priorities for the member pilot.
At a high level, we considered 3 options:
Given that our main goal was to provide transparency and trustworthy answers, we chose option 3: referenced resources. This meant we had to turn raw LLM outputs into structured, member-facing objects.
Understanding what the model returned
Together with design, I went through the following exercise:
what are all the sources Benefits Bot can return?
We began by mapping every citation type Benefits Bot could return:
  • benefit categories
  • CPT codes
  • providers and facilities
  • drugs
  • plan documents (Evidence of Coverage, Schedule of Benefits, etc.)
how do we want to display these sources to members?
For each type, I catalogued how the payloads varied in structure and length and worked with design to audit which ones were actually meaningful for members versus better kept behind the scenes.
For each, we defined:
  • what needed to be visible in the UI (names, labels, short descriptions)
  • what should become links or actions (for example, "View your full plan benefits," "View this provider")
  • what could be omitted to reduce cognitive load
Our brainstorming and exploration produced a first pass:
  • provider / facility / drug → link to the corresponding profile
  • benefits documents → link to the PDF at the relevant page
  • benefit category → plain-text label for quick context
  • CPT code → link to the cost estimates tool, along with the code & description
Filling gaps in the API
To support the designs we landed on, we needed structured fields from the API. When I reviewed the existing responses, all citation data was packed into a single text field, which made it hard to render clearly on the client side. We were missing basics like source type labels, human-readable benefit names, source IDs, and page references.
I documented these gaps, proposed a structured set of fields to the AI platform team, and validated the responses once changes were deployed. This gave us a cleaner schema we could reliably render on the member side and made the UI we designed actually possible.
proposed sources payload:
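As an illustration of the direction (field names here are assumptions, not the schema we shipped), a structured payload along these lines would carry the type labels, human-readable names, IDs, and page references the UI needed:

```typescript
// Illustrative shape for a structured citations payload
// (field names are assumptions, not the actual API schema).

type SourceType =
  | "benefit_category"
  | "cpt_code"
  | "provider"
  | "facility"
  | "drug"
  | "plan_document";

interface Citation {
  sourceType: SourceType;   // which kind of reference this is
  sourceId: string;         // stable ID for linking (provider ID, document ID, CPT code, ...)
  displayName: string;      // human-readable label, e.g. "Evidence of Coverage"
  description?: string;     // short member-facing description, when useful
  page?: number;            // page reference within a plan document
  url?: string;             // deep link to the profile, PDF page, or tool
}

interface BotResponseSources {
  citations: Citation[];
}
```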
Designing the citation component
With better API responses in place, I explored different ways to display citations without burying the main answer:
Always displayed sources
Citations displayed in full within the message thread
Detached Sources button
A 'Sources' button below the thread that opens a bottom sheet
Collapsible section
A compact, expandable sources section within the thread
In early prototype testing with members, we found that:
  • always-on detail made the thread feel noisy and harder to scan
  • a separated "Sources" button was easy to overlook, even for people who cared about trust
To balance transparency (for members who want to see the evidence) and focus (for those who just want a clear answer), we landed on a collapsible citation component that:
  • is collapsed by default to keep the main message readable
  • expands on tap/click to show sources when members want more detail
  • turns structured fields from the API into clear, task-oriented actions
What started as raw, internal-facing output became a member-facing experience that feels transparent without being overwhelming.
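To show how the structured fields become what members see, here is a small sketch of the citation-to-display mapping, following the first pass above; the link targets and label formats are illustrative assumptions:

```typescript
// Sketch of turning structured citations into member-facing display items.
// The shapes and link targets are illustrative, not the production mapping.

interface CitationLike {
  sourceType: "benefit_category" | "cpt_code" | "provider" | "facility" | "drug" | "plan_document";
  sourceId: string;
  displayName: string;
  page?: number;
  url?: string;
}

interface SourceDisplayItem {
  label: string;
  href?: string;  // omitted for plain-text labels like benefit categories
}

function toDisplayItem(c: CitationLike): SourceDisplayItem {
  switch (c.sourceType) {
    case "provider":
    case "facility":
    case "drug":
      // Link to the corresponding profile.
      return { label: c.displayName, href: c.url };
    case "plan_document":
      // Link to the PDF, opened at the relevant page.
      return { label: c.page ? `${c.displayName} (p. ${c.page})` : c.displayName, href: c.url };
    case "benefit_category":
      // Plain-text label for quick context, no link.
      return { label: c.displayName };
    case "cpt_code":
      // Link to the cost estimates tool, along with the code.
      return { label: `CPT ${c.sourceId}`, href: c.url };
  }
}
```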
Iteration
Once the main citation component was in place, testing the flow with real data surfaced edge cases that shaped the final experience.
Empty citations
Fallback: shown when no sources or extra information exist for a benefits-related question.
In some cases, the API returned no sources. Showing an AI-generated answer with nothing cited risked eroding trust. I designed a fallback state that surfaced a general link to the member's plan benefits page so there was always somewhere to verify or explore further.
We later chose to include this link in every response, not just when citations were missing. This way, we provided a consistent navigational entry point for members who wanted more detail.
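A tiny sketch of that rule, with a placeholder route for the plan benefits page:

```typescript
// Sketch: append a link to the member's plan benefits page to every
// response's sources list, so there is always somewhere to verify or
// explore further. The "/plan/benefits" route is a placeholder.

interface SourceLink {
  label: string;
  href?: string;
}

function withPlanBenefitsLink(items: SourceLink[]): SourceLink[] {
  return [...items, { label: "View your full plan benefits", href: "/plan/benefits" }];
}
```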
Long citations lists
The API often returned multiple references from the same document, creating long, repetitive lists that pushed key information out of view. During QA, I found that sometimes more than 8 source links were surfaced.
I raised this with the team and we debated two options:
  1. Deduplicate and only show the first reference.
  2. Show all to preserve completeness, even if the list was long.
We decided to keep all sources available, since each citation could contain the particular detail a member might need. To avoid overwhelming the UI, I partnered with design to implement a "Show X more sources" control: we show the first two citations, with an option to expand the rest. This keeps the view scannable while still supporting members who want the full list.
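The control itself reduces to a small piece of list logic, sketched here with the two-citation preview as an assumed default:

```typescript
// Sketch of the "Show X more sources" behavior: keep every citation available,
// but render only the first two until the member expands the list.

interface SourcesView<T> {
  visible: T[];
  hidden: T[];
  showMoreLabel?: string;  // e.g. "Show 6 more sources"
}

function collapseSources<T>(citations: T[], expanded: boolean, previewCount = 2): SourcesView<T> {
  if (expanded || citations.length <= previewCount) {
    return { visible: citations, hidden: [] };
  }
  const hidden = citations.slice(previewCount);
  return {
    visible: citations.slice(0, previewCount),
    hidden,
    showMoreLabel: `Show ${hidden.length} more sources`,
  };
}
```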
CPT code overload
With real member questions, we also saw answers that returned long lists of CPT codes. Rendering each code with its full description overloaded the sources component and made it harder to find the links that mattered.
In usability reviews, we noticed that members rarely had use for CPT descriptions. Instead, they copied the codes to share with Care Guides or to paste into the cost estimator tool. Based on this, we simplified the pattern to show a single, concatenated list of CPT codes and omitted the individual descriptions.
This kept the component compact, preserved the information members actually used, and reduced noise around the plan documents and provider links that were more meaningful for decision-making.
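A minimal sketch of that simplification, assuming CPT citations shaped like the structured payload above:

```typescript
// Sketch: collapse a long list of CPT citations into one concatenated line
// of codes, dropping the individual descriptions members rarely used.

interface CptCitation {
  code: string;          // e.g. "76700"
  description?: string;  // intentionally omitted from the member-facing UI
}

function summarizeCptCodes(citations: CptCitation[]): string {
  // e.g. "CPT codes: 76700, 76770, 74176"
  return `CPT codes: ${citations.map((c) => c.code).join(", ")}`;
}
```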
before
Long lists of CPT codes with full descriptions created visual clutter
after
Simplified to show concatenated codes, making sources easier to scan
Final Experience
A plan-aware AI assistant embedded in secure messaging
sources: before
Citations appeared as long text dumps with repeated source names, making it hard to scan or trust where answers came from.
sources: after
Structured labels and plan-document links help members understand an answer’s origin at a glance.
Impact & Reflection
Faster answers, reliable coverage guidance, and meaningful operational impact
With the member pilot rolled out to 10% of our members, we collected early metrics:
<1min
Median time to first response
Dropped from roughly 8 hours (within a 24-hour SLA) to under 1 minute for in-scope questions
52%
Full resolution rate
Benefits Bot fully resolved approximately 52% of in-scope benefits questions without requiring a Care Guide response
99%
Escalation accuracy
Human escalation was correct 99% of the time, with complex or ambiguous questions routed to Care Guides
92%
Completeness score
Internal evaluations showed the assistant's responses scored 92% vs 62% for Care Guide answers in completeness
99%
Relevance score
Benefits Bot scored 99% vs 82% for Care Guide answers in relevance
  • Faster answers for members
  • Median time to first response for common benefits questions dropped from roughly 8 hours (within a 24-hour SLA) to under 1 minute for in-scope questions.
  • Members could get coverage clarity even outside business hours, when Care Guides were unavailable.
  • Resolution and operational impact
  • Benefits Bot fully resolved approximately 52% of in-scope benefits questions without requiring a Care Guide response.
  • Human escalation was correct 99% of the time, with complex or ambiguous questions routed to Care Guides.
  • By offloading straightforward benefits questions, Care Guides could focus more on cases that required empathy and nuanced judgement.
  • Quality and trust
  • Internal evaluations showed the assistant's responses matched or exceeded Care Guide answers on key dimensions, scoring 92% vs 62% in completeness and 99% vs 82% in relevance.
Following the successful pilot with 10% of members, we plan to roll Benefits Bot out to 100% of eligible members.

What I learned
Working on Benefits Bot reaffirmed that building digital systems is not just about making them work, but about making them feel trustworthy. I saw how details often treated as engineering edge cases are, in fact, design decisions that shape how people feel: a typing indicator tells someone their question is being actively processed, and a collapsible sources section lets people see the evidence without drowning in it.
The project also showed me what it means to balance AI’s speed with the trust people expect from human guidance. Good AI service design isn’t only about delivering answers faster, but about making those answers clear, credible, and compassionate. This is the kind of problem I want to study more deeply in graduate school: how to design and evaluate AI-powered healthcare tools so they are technically sound and also easy to understand, honest about their limits, and safe to rely on when people are worried or unsure.