OpenAI's new ChatGPT Plus search feature lets users ask complex questions, receive cited sources, and ask follow-up questions. While promising, hallucinations (AI fabrications) remain a problem. A new OpenAI research paper addresses this by creating a dataset of challenging questions and answers, revealing that larger AI models are more consistent and accurate, and, surprisingly, better at assessing their own confidence levels. This research is a significant step towards more reliable AI search.

But there is a problem here, and that problem is called hallucination. This is where AIs make things up. If you ask something, you are going to get an answer; whether that answer is correct or not, well, that depends. We are seeing fewer and fewer of those hallucinations now, but they still exist.

The new dataset contains challenging questions and answers with high correctness on a diverse set of topics. I mean, just look at that: many of these are a touch more complex than what you can just get through search. You see, even evaluating an AI model's answers to these questions is not without challenges. An answer may be flat-out incorrect, but being incorrect gets a bit more insidious. Look at this. Oh yes, fellow scholars, this is an AI that is not sure, and it is hedging. You know, if I say the stock market may go up or down, or it might go flat, I am completely right, but not very informative. Is that incorrect? Hard to say.

So what are the results? Well, hold on to your papers, fellow scholars, because when they compared their flagship models to their mini variants, oh goodness… that is a huge difference. They are still wrong a lot, and our question now is: okay, but are they like people? Do they know that they are likely wrong on many of these things? Are AIs aware of their limits? I would say the answer is, surprisingly, yes. When asking them how confident they are in their answers, whenever they feel more confident, they are more likely to be right; when less confident, less likely to be right.

Now, if we are using this for search, we are hoping that the answers are consistent. Are they? Let's have a look by asking the AI models the same question a hundred times and… are they consistent? I would say the flagship models kind of are. Not even close to perfect, so careful there, but once again, the flagship o1 reasoning and GPT-4o models are remarkably consistent. (Minimal code sketches of these evaluation ideas follow below.)

And this is where this new dataset shines: it helps AI models become more confident when they should be confident, and more consistent. This will lead to fewer hallucinations and more accurate information for us fellow scholars. It is still a long road…
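To make the grading discussion concrete, here is a toy sketch of sorting answers into the three outcomes the transcript describes: correct, incorrect, and hedged. This is not the paper's actual grader (the paper uses a model-based judge); the exact-match comparison and the hedge-phrase heuristic below are my own assumptions, purely for illustration.

```python
# A toy grader, NOT the paper's method. It uses exact matching plus
# a crude hedge heuristic to illustrate the three outcome categories:
# correct, incorrect, and hedged ("not attempted").
HEDGE_PHRASES = ("may or may not", "it depends", "hard to say",
                 "could go either way", "up or down")

def grade_answer(predicted: str, gold: str) -> str:
    text = predicted.strip().lower()
    if any(phrase in text for phrase in HEDGE_PHRASES):
        return "not attempted"  # technically safe, but uninformative
    return "correct" if text == gold.strip().lower() else "incorrect"

print(grade_answer("Paris", "paris"))                      # correct
print(grade_answer("The market may go up or down", "up"))  # not attempted
```

Note how the hedged stock-market answer from the transcript lands in its own bucket: it is not wrong, just not useful, which is exactly why grading it as simply "correct" or "incorrect" is hard.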
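Next, the calibration question: does a model's stated confidence track how often it is actually right? Here is a minimal sketch of one common way to check this, by binning answers by stated confidence and comparing each bin's confidence to its observed accuracy. The sample data and the helper name are assumptions for demonstration, not the paper's evaluation code.

```python
from collections import defaultdict

def calibration_by_bin(results, num_bins=10):
    """Group (stated_confidence, was_correct) pairs into confidence
    bins and report the observed accuracy in each bin. A well
    calibrated model's accuracy roughly matches each bin's level."""
    bins = defaultdict(list)
    for confidence, correct in results:
        # Map a confidence in [0, 1] to one of num_bins buckets.
        idx = min(int(confidence * num_bins), num_bins - 1)
        bins[idx].append(correct)
    report = {}
    for idx in sorted(bins):
        answers = bins[idx]
        lo, hi = idx / num_bins, (idx + 1) / num_bins
        report[f"{lo:.1f}-{hi:.1f}"] = sum(answers) / len(answers)
    return report

# Hypothetical data: (model's stated confidence, whether it was right).
sample = [(0.95, True), (0.9, True), (0.85, False), (0.4, False),
          (0.35, True), (0.3, False), (0.75, True), (0.8, True)]
print(calibration_by_bin(sample, num_bins=5))
```

If the model is well calibrated, answers it gives with roughly 80% confidence should be right roughly 80% of the time, which is the "aware of their limits" behavior the transcript describes.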
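Finally, the consistency check: ask the same question many times and see how often the most frequent answer comes back. The sketch below assumes a hypothetical `ask_model` callable standing in for a real API call; the stub model is invented so the example runs on its own.

```python
import random
from collections import Counter

def answer_consistency(ask_model, question, trials=100):
    """Ask the same question `trials` times and return the most
    frequent answer plus the fraction of runs that produced it.
    A fraction of 1.0 means the model is perfectly consistent."""
    answers = Counter(ask_model(question) for _ in range(trials))
    top_answer, top_count = answers.most_common(1)[0]
    return top_answer, top_count / trials

# Stub standing in for a real model call, just so this runs.
def flaky_model(question):
    return random.choice(["Paris", "Paris", "Paris", "Lyon"])

print(answer_consistency(flaky_model, "Capital of France?"))
```

This mirrors the transcript's hundred-question experiment: a flagship model would score near 1.0 on most questions, while a less consistent model would spread its answers across several alternatives.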