The Future is ‘Her’: OpenAI’s New GPT-4o Mirrors Scarlett Johansson’s AI Role

Updated: May 13, 2024, 18:47


OpenAI has just unveiled its latest and most advanced AI model yet: GPT-4o. The "o" stands for "omni," hinting at the model's impressive capabilities across multiple modalities, including text, audio, and images. This groundbreaking new model represents a significant step towards more natural and seamless human-computer interaction. Here is the GPT-4o demo:


Google also just happened to give us a sneak peek of its new multimodal AI-powered camera feature, right before the highly anticipated Google I/O developer conference. Here is the teaser video from Google, shared on X (formerly Twitter). Which demo do you think is better?


After the event, OpenAI CEO Sam Altman posted just one word on X/Twitter: “her.” He has previously said that "Her" is his favorite movie. The assistant's voice in the demo was a fascinating echo of the character brought to life by Scarlett Johansson in the 2013 movie 'Her'. The film tells the story of a man who finds companionship in an advanced AI assistant, a narrative that this demo intriguingly mirrors. It's as if we've stepped into a scene from the movie itself!


Here are some excellent demos of GPT-4o's capabilities from the Spring Update announcement. I particularly like this voice variation and vision demo:



Sam Altman's Comments about GPT-4o

Sam Altman highlighted two major developments in his blog post right after the announcement. Firstly, the mission of OpenAI is to provide highly capable AI tools to people either for free or at a very affordable price. OpenAI takes pride in offering the world's best model, ChatGPT, for free, without any ads. Initially, OpenAI aimed to create AI and use it to benefit the world. However, the vision has evolved to creating AI that others can use to create amazing things that benefit everyone. As a business, OpenAI will find plenty of services to charge for, which will enable it to provide free, exceptional AI services to billions of people.

Secondly, the new voice and video mode feels like AI from the movies, and he is still surprised that it's real. Achieving human-level response times and expressiveness is a significant change. The original ChatGPT hinted at the possibilities with language interfaces, but this new development feels viscerally different. It's fast, smart, fun, natural, and helpful. Talking to a computer has never felt so natural. As OpenAI adds optional personalization, access to information, the ability to take actions on behalf of users, and more, he foresees an exciting future where computers can be used to do much more than ever before.

Key Capabilities of GPT-4o

GPT-4o is lightning fast, responding to audio input in as little as 232 milliseconds, matching the pace of human conversation. So what exactly can GPT-4o do? Here are some of the highlights:

  • Adaptable Input and Output: GPT-4o excels in interpreting and generating responses to not only text, but also audio and images. This adaptability paves the way for its use across various media formats.
  • Consolidated Model Structure: In contrast to its predecessors that used separate models for different types of interactions, GPT-4o employs a single, unified model. This streamlined approach enhances efficiency and performance.
  • Improved Multilingual Proficiency: GPT-4o showcases remarkable progress in processing non-English languages. This makes it a valuable asset for global applications, facilitating improved communication and comprehension across varied linguistic environments.
  • Exceptional Audio Processing: The model establishes new benchmarks in speech recognition and audio translation, making it extremely useful for voice-driven applications and multimedia content.
  • Progressive Visual Capabilities: GPT-4o’s capacity to perceive and interpret visual data enables groundbreaking applications in image analysis and more, narrowing the gap between AI and human-like visual comprehension.
  • Enhanced Efficiency: Thanks to its optimized tokenizer compression, GPT-4o needs fewer tokens to process data. This results in quicker response times and reduced computational expenses, making it more user-friendly for a range of applications.
  • Inherent Safety Features: OpenAI has incorporated strong safety measures into GPT-4o from the beginning. These precautions are designed to minimize risks and guarantee the model’s safe deployment and usage.
  • Affordability and Accessibility: The introduction of GPT-4o promises a more efficient service at a reduced cost.
  • API Access for Developers: Initially, GPT-4o will provide API access for text and vision modeling, with audio and video capabilities to be added later for trusted partners (like Jasper!). This presents fresh opportunities for developers to experiment and innovate with GPT-4o’s features.
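The API-access bullet above translates into very little code. Below is a minimal sketch of a GPT-4o text request, assuming the official `openai` Python client and an `OPENAI_API_KEY` environment variable; the `build_chat_request` helper is illustrative, not part of the SDK, and the actual network call is left commented out so the request body can be inspected offline:

```python
# Minimal sketch of a GPT-4o text request via the Chat Completions API.
# build_chat_request is an illustrative helper, not an SDK function.

def build_chat_request(prompt: str,
                       system: str = "You are a helpful assistant.") -> dict:
    """Assemble a Chat Completions request body targeting gpt-4o."""
    return {
        "model": "gpt-4o",
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
    }

if __name__ == "__main__":
    request = build_chat_request("Summarize GPT-4o's launch in one sentence.")
    print(request["model"])  # gpt-4o

    # Uncomment to send the request for real (requires OPENAI_API_KEY):
    # from openai import OpenAI
    # client = OpenAI()
    # response = client.chat.completions.create(**request)
    # print(response.choices[0].message.content)
```

Because the unified model handles text natively, the same request shape works whether the prompt is a question, a translation task, or a summarization job.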

GPT-4o's multimodal architecture, trained end-to-end on text, vision and audio, allows it to process and understand information more holistically. This unified model structure preserves more context and nuance compared to previous approaches that used separate specialized models for each modality. Below is a realtime translation demo of GPT-4o:

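The interpreting behavior shown in the translation demo can be approximated in text form with a carefully worded system prompt. A minimal sketch, assuming the Chat Completions message format; the `build_translation_turns` helper and the exact prompt wording are illustrative, not taken from OpenAI:

```python
# Sketch of a turn-by-turn interpreting prompt, mirroring the demo's
# English/Italian setup. The helper only builds the message list; it
# makes no network calls.

TRANSLATOR_SYSTEM_PROMPT = (
    "You are a realtime interpreter. When the user speaks English, "
    "repeat it in Italian. When the user speaks Italian, repeat it in "
    "English. Output only the translation, nothing else."
)

def build_translation_turns(utterances: list[str]) -> list[dict]:
    """Build the message list for a back-and-forth interpreting session."""
    messages = [{"role": "system", "content": TRANSLATOR_SYSTEM_PROMPT}]
    for utterance in utterances:
        messages.append({"role": "user", "content": utterance})
    return messages

if __name__ == "__main__":
    turns = build_translation_turns(["Hello, how are you?", "Sto bene, grazie."])
    print(len(turns))  # 3: one system message plus two user turns
```

In the live demo the same idea runs over speech end-to-end, which is what eliminates the transcribe-then-translate pipeline of earlier systems.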

Key Differences from the existing Voice Mode

The live demo above shows users interacting with GPT-4o using natural, spoken language. The real-time conversational speech feature is accessible through a small icon at the bottom right of the ChatGPT app. When activated, users can simply start speaking to the AI, and it will respond in real time, just like a human conversation. This new real-time conversational speech feature offers several key improvements:

  • Interruption: Users can now interrupt the AI mid-response, allowing for more natural, free-flowing conversations. There's no need to wait for the AI to finish speaking before you can start.
  • Real-time responsiveness: The new feature eliminates the awkward 2-3 second lag that was present in the previous voice mode. The AI now responds immediately, making the conversation feel more natural and seamless.
  • Emotion recognition: ChatGPT can now pick up on the user's emotions based on their tone of voice and speech patterns. In the demo, when Mark was breathing heavily due to nervousness, the AI recognized this and offered guidance to help him calm down.

The real-time conversational speech feature showcased in this demo represents a significant step forward in AI interaction by making the experience more natural, responsive, and emotionally intelligent.

Pushing the Boundaries of AI

On standard AI benchmarks, GPT-4o achieves state-of-the-art results, setting new records in reasoning, multilingual understanding, speech recognition, and visual perception. It outperforms GPT-4 on challenging tests like the M3Exam, which evaluates both multilingual and visual reasoning capabilities.


It also outperforms GPT-4 Turbo (2024-04-09), Gemini 1.0 Ultra, Gemini 1.5 Pro, and Claude Opus in visual perception benchmarks.




Thoughtful Deployment for Maximum Benefit

As with any powerful new technology, it's critical that GPT-4o is developed and deployed responsibly. OpenAI has put the model through extensive testing and red teaming to identify and mitigate potential risks, especially around the new audio modalities.

The model also incorporates the latest efficiency optimizations, allowing it to be both faster and more affordable than GPT-4 while maintaining the same level of performance. This improved efficiency is what enables OpenAI to make GPT-4o's capabilities available much more broadly.

OpenAI is taking an iterative approach to rolling out GPT-4o's capabilities, starting with text and image today, and carefully expanding to audio and video over time with appropriate safeguards in place. The goal is to harness GPT-4o's potential to benefit as many people as possible while proactively managing downside risks.

ChatGPT macOS App: Seamless Integration with Mac


The OpenAI ChatGPT macOS app is being rolled out to Plus users, with plans to make it more broadly available in the coming weeks. A Windows version is slated for release later this year. This phased rollout ensures that the app is thoroughly tested and optimized for the best possible user experience.

One of the standout features of the ChatGPT macOS app is its convenient accessibility. With a simple keyboard shortcut (Option + Space), you can instantly summon ChatGPT, regardless of what you're currently working on. Whether you're in the middle of writing an email, coding a project, or browsing the web, ChatGPT is just a keystroke away, ready to answer your questions and provide valuable insights.

The app allows you to take and discuss screenshots directly within the app. This feature is particularly useful when seeking guidance on visual elements, such as design mockups, code snippets, or data visualizations. Simply capture a screenshot, and ChatGPT will be ready to analyze and provide feedback, streamlining your workflow and enhancing your productivity.

In addition to screenshots, the ChatGPT macOS app enables you to start new conversations using photos from your computer or newly captured images. This opens up a world of possibilities for creative professionals, researchers, and anyone who relies on visual information. Whether you need help identifying an object, analyzing an image, or generating captions, ChatGPT is up to the task.

You can download the ChatGPT macOS app here: https://t.co/MhliG30zMa
ChatGPT macOS App Help: more details about the macOS app's functionality

Experience GPT-4o in ChatGPT and API

The great news is that GPT-4o is being made available in ChatGPT starting today! Free users will have access to GPT-4o level intelligence with some usage limits, while ChatGPT Plus subscribers get even higher limits. Developers can also start using GPT-4o's text and vision capabilities in the API immediately. GPT-4o is 2x faster, half the price, and has 5x higher rate limits compared to GPT-4 Turbo. OpenAI plans to launch support for GPT-4o's new audio and video capabilities to a small group of trusted partners in the API in the coming weeks.
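The vision capability mentioned above works by attaching an image to an ordinary chat message. A minimal sketch, assuming the documented `image_url` content part of the Chat Completions API; the `build_vision_request` helper and the stand-in image bytes are illustrative:

```python
# Sketch of a GPT-4o vision request: a text question paired with a
# local image inlined as a base64 data URL. The builder itself makes
# no network calls, so the request body can be inspected directly.

import base64

def build_vision_request(question: str, image_bytes: bytes,
                         mime: str = "image/png") -> dict:
    """Pair a text question with an inline image for gpt-4o."""
    data_url = "data:{};base64,{}".format(
        mime, base64.b64encode(image_bytes).decode("ascii"))
    return {
        "model": "gpt-4o",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    }

if __name__ == "__main__":
    fake_png = b"\x89PNG\r\n\x1a\n"  # stand-in bytes for illustration
    request = build_vision_request("What is in this screenshot?", fake_png)
    print(request["messages"][0]["content"][1]["type"])  # image_url
```

Sending the resulting body through `client.chat.completions.create(**request)` with the official `openai` client would return the model's answer about the image.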

Over the coming weeks and months, more of GPT-4o's advanced features like real-time voice conversations will be rolled out in ChatGPT. We're entering an exciting new era of AI assistants that can engage with us more naturally and helpfully than ever before. OpenAI is continuing to push the boundaries of what's possible while expanding access to these transformative tools.

Try on ChatGPT: ChatGPT with GPT-4o
Try in Playground: OpenAI Playground
Rewatch live demos: OpenAI Website
GPT-4o model capabilities: OpenAI samples (see the photo-to-caricature samples below)

Photo-to-caricature samples from GPT-4o



