Current-gen models have gotten less accurate and hallucinate at a higher rate than the previous generation, both in my experience and according to OpenAI. I think it’s either because they’re trying to see how far they can squeeze the models, or because the models are starting to eat their own slop picked up while crawling.
https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf
That’s one example, but what about other models? What you just did is called cherry picking, or selective evidence.