Most of the commentary about GPT-4o is missing the main point. Yes, it's faster, cheaper, and feels like the movie Her. But there are much greater implications to consider.
Earlier this year, Yann LeCun, the head of AI at Meta, pointed out that the biggest LLMs are trained on text. But text is an extremely low-bandwidth way to learn how the world works compared to video. In fact, a 4-year-old child will have seen 50x more data than our largest LLMs.
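To put rough numbers on that claim: the figures below are approximations from LeCun's public talks (training-set size, bytes per token, optic nerve bandwidth), not official measurements from OpenAI or Meta, so treat this as a back-of-envelope sketch.

```python
# Back-of-envelope sketch of LeCun's "50x" claim. All figures are
# approximations from his public talks, not precise measurements.
llm_tokens = 1e13                  # training set of a large LLM, ~10^13 tokens (assumed)
bytes_per_token = 2                # ~2 bytes per token (assumed)
llm_bytes = llm_tokens * bytes_per_token            # ~2e13 bytes of text

awake_hours = 16_000               # a 4-year-old's waking hours, roughly
optic_nerve_bytes_per_sec = 20e6   # optic nerve bandwidth, ~20 MB/s (assumed)
child_bytes = awake_hours * 3600 * optic_nerve_bytes_per_sec  # ~1e15 bytes

print(f"child vs. LLM: ~{child_bytes / llm_bytes:.0f}x more data")  # ~58x, i.e. the "50x" ballpark
```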
Up until a couple of weeks ago, even our most sophisticated models were built around text. For example, GPT-4 handled audio by first transcribing it into text, then reasoning over that text. GPT-4o, however, was designed to understand video and audio natively. That has implications for how much more data future versions can be trained on. Consider that a 4-year-old will have experienced the equivalent of 16K hours of video. Compare that to YouTube, which alone contains over 150 million hours of video.
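The same kind of rough arithmetic shows the headroom: the ~11 waking hours per day is my assumption, and the 150 million hours is the YouTube figure quoted above.

```python
# Rough scale of available video data versus a child's lived visual experience.
waking_hours_per_day = 11                      # assumed average for a young child
child_hours = 4 * 365 * waking_hours_per_day   # ~16,000 hours by age four

youtube_hours = 150e6                          # >150 million hours, per the figure above

print(f"YouTube holds ~{youtube_hours / child_hours:,.0f}x a child's visual experience")
```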
How much smarter can AI get? With a natively multi-modal architecture, I suspect the answer is: much, much smarter.
Edit: Wow, this post really blew up. I want to acknowledge that we don't know with certainty how GPT-4o is architected (or Gemini 1.5, for that matter), what OpenAI means when it says "natively" multi-modal, or whether the model can be further trained on video. These are underlying assumptions in my commentary above, but reasonable ones, I think.