"Right. During the week I wrote about Reuters’ big exclusive that one of the reasons behind Sam Altman’s firing was a mysterious OpenAI breakthrough simply referred to as Q* (Q-Star).
Rumours (and they are only rumours) are rife that the breakthrough is a combination of Q-Learning and the A* search algorithm: a new model that uses reinforcement learning to “plan out” its next move within a neural network, giving it unique (albeit grade-school level) mathematical abilities. That is said to be a huge stride forward for AI systems that can proficiently use logic, reasoning and planning.
Q-Learning is a method used in AI where a computer program learns to make decisions by trying out actions and learning from the results. It keeps track of what it learns in a sort of scorecard (a table of values for each state and action), which helps it make better decisions over time, without needing a detailed map or instructions.
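To make that concrete, here’s a minimal sketch of tabular Q-Learning on a toy five-state corridor. Everything here (the environment, the +1 reward at the goal, the hyperparameters) is invented for illustration and has nothing to do with whatever Q* actually is:

```python
import random
from collections import defaultdict

# Toy corridor: states 0..4; reaching state 4 earns a reward of +1.
# Actions: 0 = step left, 1 = step right.
N_STATES, GOAL = 5, 4
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2  # learning rate, discount, exploration rate

q = defaultdict(float)  # the "scorecard": (state, action) -> expected value

def step(state, action):
    next_state = max(0, min(GOAL, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: mostly exploit the scorecard, sometimes explore.
        if random.random() < EPSILON:
            action = random.choice([0, 1])
        else:
            action = max([0, 1], key=lambda a: q[(state, a)])
        next_state, reward, done = step(state, action)
        # Core Q-Learning update: nudge the score towards
        # (reward now) + (discounted best score from the next state).
        best_next = max(q[(next_state, a)] for a in [0, 1])
        q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])
        state = next_state

# After training, the greedy policy should point right in every state.
print({s: max([0, 1], key=lambda a: q[(s, a)]) for s in range(N_STATES)})
```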
The A* algorithm is like a smart navigation tool used in computer programming to find the quickest way between two points. A* does this efficiently by weighing the cost of the route travelled so far against an estimate of the cost remaining, and as long as that estimate never overshoots, it is guaranteed to find the shortest possible path to its destination.
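And a minimal sketch of A* on a small invented grid, using Manhattan distance as the estimate of the cost remaining (again, purely illustrative):

```python
import heapq

def a_star(grid, start, goal):
    # Heuristic: Manhattan distance to the goal. It never overestimates the
    # true remaining cost, which is what guarantees the shortest path.
    def h(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    # Each frontier entry: (cost so far + estimate, cost so far, cell, path).
    frontier = [(h(start), 0, start, [start])]
    seen = set()
    while frontier:
        _, cost, cell, path = heapq.heappop(frontier)
        if cell == goal:
            return path
        if cell in seen:
            continue
        seen.add(cell)
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] == 0:
                heapq.heappush(
                    frontier,
                    (cost + 1 + h((nr, nc)), cost + 1, (nr, nc), path + [(nr, nc)]),
                )
    return None  # no route exists

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
print(a_star(grid, (0, 0), (2, 0)))  # routes around the wall via the right column
```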
One of the papers I read was published by OpenAI in May (with some heavy hitters as co-authors) and didn’t seem to garner much attention until this week: “Let’s Verify Step by Step”.
"Let’s Verify Step by Step" introduces Process Reward Models (PRMs) in AI learning. Contrary to traditional models that assign a single score to evaluate an AI's response, PRMs intricately assess and score each step of the AI’s reasoning process. This nuanced approach allows for a more detailed generation and assessment of AI outputs, especially in complex scenarios like mathematics.
Another element is the “Tree of Thoughts” (ToT) method of reasoning, combined with the principles of PRMs. This would allow an AI to navigate through a myriad of reasoning steps, much like a thinker contemplating different possibilities before arriving at a conclusion. The proposal (again, speculation) is that the Q* model uses PRMs to evaluate these various reasoning pathways and to search over them for the most promising one.
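Purely as a sketch of how such a combination might work (none of this is confirmed to be how Q* operates), here is a best-first search over partial reasoning chains, where a stand-in PRM scores each candidate step. Both generate_next_steps and prm_score are hypothetical placeholders for an LLM and a learned PRM:

```python
import heapq
import itertools

def generate_next_steps(chain):
    # Stand-in for an LLM proposing three candidate next reasoning steps.
    return [chain + [f"step {len(chain) + 1} (option {i})"] for i in range(3)]

def prm_score(chain):
    # Stand-in for a PRM; this invented scoring simply prefers "option 0" steps.
    return sum(1.0 if "option 0" in step else 0.5 for step in chain)

def tree_of_thoughts(max_depth=3, beam_width=2):
    counter = itertools.count()            # tie-breaker so the heap never compares lists
    frontier = [(0.0, next(counter), [])]  # (negated PRM score, tie-breaker, chain)
    while frontier:
        _, _, chain = heapq.heappop(frontier)  # expand the most promising chain
        if len(chain) == max_depth:
            return chain                       # best-scoring chain of full depth
        children = sorted(generate_next_steps(chain), key=prm_score, reverse=True)
        for child in children[:beam_width]:    # keep only the top-scoring branches
            heapq.heappush(frontier, (-prm_score(child), next(counter), child))
    return None

print(tree_of_thoughts())  # -> the chain built entirely from "option 0" steps
```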
The integration of active learning strategies into data collection stands out as a key factor in the efficiency and quality of training. The PRM800K dataset, developed specifically for this purpose, underpins the fine-tuning of base models like GPT-4, highlighting a meticulous and targeted approach to AI development.
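A rough sketch of the kind of selection loop that implies; the idea of surfacing convincing-but-wrong solutions for human labelling follows the paper’s description of its active learning, but the function and field names here are invented:

```python
def select_for_labelling(candidates, prm_score, n=5):
    # Active learning sketch: among solutions with a wrong final answer, pick
    # the ones the current PRM rates most highly - the most "convincing"
    # mistakes - since labelling those teaches the model the most.
    wrong = [c for c in candidates if not c["final_answer_correct"]]
    wrong.sort(key=lambda c: prm_score(c["steps"]), reverse=True)
    return wrong[:n]

# Hypothetical usage: `candidates` would be model-generated solutions and
# `prm_score` the current process reward model (here, a trivial stand-in).
candidates = [
    {"steps": ["step 1", "step 2"], "final_answer_correct": False},
    {"steps": ["step 1"], "final_answer_correct": False},
    {"steps": ["step 1", "step 2", "step 3"], "final_answer_correct": True},
]
print(select_for_labelling(candidates, prm_score=lambda steps: len(steps), n=1))
```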
PRM800K is a process supervision dataset of roughly 800,000 step-level human labels (signposts through worked solutions, in effect) for model-generated answers to the MATH dataset, a collection of 12,500 challenging competition mathematics problems.
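To show what step-level supervision looks like, here is a hypothetical record in that spirit; the field names and layout are invented, not PRM800K’s real schema:

```python
# Invented example of a step-labelled record; NOT PRM800K's actual format.
example_record = {
    "problem": "What is 7 * 8 + 5?",
    "steps": [
        {"text": "First, 7 * 8 = 56.", "label": 1},    # correct step
        {"text": "Then, 56 + 5 = 62.", "label": -1},   # wrong: 56 + 5 = 61
    ],
}

# Step-level labels like these let a PRM learn to score every step of a
# solution, rather than only judging the final answer.
for step in example_record["steps"]:
    print(step["label"], step["text"])
```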
The transition from broad, outcome-based assessments to a more detailed, step-by-step evaluation process represents a significant leap in AI's ability to understand and process complex problems - hence the furore."
https://www.linkedin.com/posts/activity-7134103435380977665--ak_?utm_source=share&utm_medium=member_desktop