
AI Learns to Listen: TypeScript Client for OpenAI’s Realtime API

Developers can use the OpenAI Realtime API to build features that enable users to have back-and-forth voice conversations with an AI large language model. I’m working on a new startup with a friend, and our use case benefits significantly from voice interaction. Bringing AI and voice interaction together has created an incredibly fun experience; we literally smile and laugh with the assistants. We’ve made great progress in creating a truly delightful user experience. I’m also excited about this work because I see its potential to make AI accessible to people outside of tech through a more intuitive and natural form of interaction.

“Parrot singing Old Mcdonald had a Farm” by San Diego Shooter (https://www.flickr.com/people/nathaninsandiego/) is licensed under CC BY-NC-ND 2.0

Outline

In this post, I’ll share the following:

  • How to get perfect TypeScript types for the OpenAI Realtime API.
  • An SDK (client) for the OpenAI Realtime API using WebRTC. Use it to build voice interaction with AI into your applications.
  • A working demo of the SDK that you can use right now, available in an MIT-licensed, open-source GitHub repository.

If you want to, jump ahead to the code and demo.

OpenAI’s Realtime API

One of the key technologies driving our project forward is the OpenAI Realtime API. This API combines the power of LLM language-based interaction, speech-to-text, and audio analysis (like turn detection) into a neat package, and it integrates with Web media technologies, the WebRTC API in particular. While WebSockets can be used, the WebRTC API offers incredible capabilities for streaming audio. If you haven’t explored the WebRTC samples, I highly recommend doing so!
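
To make that concrete, here is a rough sketch of what a browser-side WebRTC connection to the Realtime API looks like. This is not the SDK’s code; the endpoint, model name, and ephemeral-key handling are assumptions based on OpenAI’s documentation, so verify them against the current docs:

// Rough sketch of a browser-side WebRTC connection to the Realtime API.
// Assumes an ephemeral key was minted server-side (never ship your real API
// key to the browser) and that the endpoint/model below match OpenAI's docs.
async function connectRealtime(ephemeralKey: string): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection()

  // Play the model's audio as it streams in.
  const audioEl = document.createElement("audio")
  audioEl.autoplay = true
  pc.ontrack = (event) => {
    audioEl.srcObject = event.streams[0]
  }

  // Send the user's microphone audio to the model.
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true })
  pc.addTrack(mic.getTracks()[0], mic)

  // Realtime events (JSON) flow over a data channel, not the audio tracks.
  const events = pc.createDataChannel("oai-events")
  events.onmessage = (msg) => console.log("server event", JSON.parse(msg.data))

  // Standard WebRTC offer/answer, with the answer SDP fetched over HTTPS.
  const offer = await pc.createOffer()
  await pc.setLocalDescription(offer)
  const response = await fetch(
    "https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
    {
      method: "POST",
      body: offer.sdp,
      headers: {
        Authorization: `Bearer ${ephemeralKey}`,
        "Content-Type": "application/sdp",
      },
    }
  )
  await pc.setRemoteDescription({ type: "answer", sdp: await response.text() })
  return pc
}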

OpenAI Realtime API Clients/SDKs

OpenAI provides the Reference Client (beta), and the official OpenAI SDK added support for the WebSocket API, released in beta on January 17, 2025. Initially, I found integrating with the Realtime API challenging due to the lack of comprehensive client-side support.

Getting Types for the Realtime API

When I started this project, OpenAI’s official SDK didn’t support the Realtime API. They’ve since added types in the beta/realtime folder of the official OpenAI SDK and added support for the WebSocket API, similar to their beta Reference Client. However, as of this writing, they don’t provide support for managing conversation state, audio streams, or WebRTC. Essentially, their clients provide an event stream of raw data. My exploration of forums and solutions (like this one) suggests I’m not alone in having to decipher the API’s real-time behavior. Even with the newly added types, I’m still using the technique described below, as it allows me to pull up types and make them more specific than their SDK or OpenAPI spec allows.

OpenAI maintains a dedicated repository for their OpenAPI Specification at https://github.com/openai/openai-openapi, which is kept current. Kudos to OpenAI for that! With that in mind, we can use openapi-typescript to process the full OpenAPI Spec and get TypeScript types for it.

The full command that I use in a shell script for this is below:

npx openapi-typescript "https://raw.githubusercontent.com/openai/openai-openapi/25be47865ea2df1179a46be045c2f6b65f38e982/openapi.yaml" -o "${parent_dir}/src/types/openai/openapi.d.ts"

This results in an openapi.d.ts file that is a bit unwieldy. For instance, the Realtime API uses 27 “Realtime Server Events,” each with a different schema. Having dedicated types for each is crucial. Here is an example, RealtimeServerEventConversationItemCreated, which is an essential event when using the Realtime API:

/** @description Returned when a conversation item is created. There are several scenarios that
 *     produce this event:
 *       - The server is generating a Response, which if successful will produce
 *         either one or two Items, which will be of type `message`
 *         (role `assistant`) or type `function_call`.
 *       - The input audio buffer has been committed, either by the client or the
 *         server (in `server_vad` mode). The server will take the content of the
 *         input audio buffer and add it to a new user message Item.
 *       - The client has sent a `conversation.item.create` event to add a new Item
 *         to the Conversation.
 *      */
RealtimeServerEventConversationItemCreated: {
  /** @description The unique ID of the server event. */
  event_id: string;
  /**
   * @description The event type, must be `conversation.item.created`.
   * @enum {string}
   */
  type: "conversation.item.created";
  /** @description The ID of the preceding item in the Conversation context, allows the
   *     client to understand the order of the conversation.
   *      */
  previous_item_id: string;
  item: components["schemas"]["RealtimeConversationItem"];
};

Here is what I’d like you to note:

  1. It includes documentation—good documentation. While I’ve found occasional inaccuracies, it’s generally very helpful.
  2. This event is somewhat buried in the OpenAPI schema, accessible in your code as components["schemas"]["RealtimeServerEventConversationItemCreated"].
  3. It references other important objects, like components["schemas"]["RealtimeConversationItem"], which carries the details of the item this event is about.

Cleaning this up is relatively straightforward. Create a new types.ts file and add something like this:

import type { components } from "./openapi"

export type RealtimeServerEventConversationItemCreated =
  components["schemas"]["RealtimeServerEventConversationItemCreated"]

export type RealtimeConversationItem =
  components["schemas"]["RealtimeConversationItem"]

Now you can refer to both of these types using a more familiar naming convention. However, there are 27 of these server events and 9 client events, and I certainly didn’t want to copy and paste definitions like this all over the place.

To get these events more efficiently, you can use TypeScript’s Turing-complete type system to do this quite neatly:

/**
 * All the keys of components["schemas"] that begin with "RealtimeServerEvent"
 */
type RealTimeServerEventKeys = Extract<
  keyof components["schemas"],
  `RealtimeServerEvent${string}`
>

This results in a type that looks like this:

type RealTimeServerEventKeys = "RealtimeServerEventConversationCreated" | "RealtimeServerEventConversationItemCreated" | "RealtimeServerEventConversationItemDeleted" | ...

With the keys in hand, you can get a type representing all of the server events like this:

/**
 * All of the RealtimeServerEvent messages/events.
 * These are received by the client from the OpenAI server.
 */
export type RealtimeServerEvent = components["schemas"][RealTimeServerEventKeys]
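
One nice payoff: because every server event schema carries a literal type field, this union behaves as a discriminated union. Here’s a minimal sketch (not code from the repository) of a handler that narrows on type, assuming events arrive as JSON strings, for example over the WebRTC data channel:

import type { RealtimeServerEvent } from "./types"

// A minimal sketch: because each server event schema carries a literal `type`,
// the union narrows inside the switch.
export function handleServerEvent(raw: string): void {
  const event = JSON.parse(raw) as RealtimeServerEvent
  switch (event.type) {
    case "conversation.item.created":
      // Narrowed to RealtimeServerEventConversationItemCreated here, so
      // previous_item_id and item are fully typed.
      console.log("item created after", event.previous_item_id, event.item)
      break
    default:
      console.log("unhandled server event", event.type)
  }
}

The same Extract trick should work for the 9 client events too, matching keys that start with RealtimeClientEvent.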

I have derived many more types, refining some beyond what the OpenAPI spec provides through trial and error and the documentation, but you get the idea. I have also found they get a bit unruly at times, and the Simplify type from type-fest has been useful (I find it a little funny that it’s needed, but it is great!).
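
For example, a hypothetical alias (not one from the repository) that flattens one of the derived types for friendlier editor tooltips:

import type { Simplify } from "type-fest"
import type { RealtimeServerEventConversationItemCreated } from "./types"

// Hypothetical: Simplify flattens the indexed/mapped type so hovers and error
// messages show concrete properties instead of nested schema lookups.
export type ItemCreatedEvent = Simplify<RealtimeServerEventConversationItemCreated>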

You can see the complete types that I have pulled out as I use the API in the open source repository below.

Code & Demo

I have published the code for the client and the demo under an MIT License at the following link:

GitHub - activescott/typescript-openai-realtime-api: TypeScript OpenAI Realtime API Client & Examples

If you like it, go star it!

There’s more

The package is doing a lot more than providing types and the event stream. It integrates the Realtime API’s streaming audio with your browser’s audio output, connects your microphone to OpenAI over WebRTC, manages conversation state, and much more. I’ll post more on that later.

If you’re interested in more info on AI & voice, let me know!
