I’ve been thinking of having a small model, like a long-context Qwen 4B, run a quick code review to check for these issues and then correct the main model.
It feels like a secondary model that exists only to validate that a task was actually completed could work.
Yeah, it can work, because it’ll trigger recall of different types of input data. But it’s not magic: if the model you’re using hallucinates 25% of the time and the reviewer only catches around two-thirds of those bad outputs, you probably still end up with an 8.5% chance of getting bullshit after doing this.
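As a rough sketch of what that generate → review → correct loop could look like: the Python below assumes both models sit behind an OpenAI-compatible endpoint (the kind vLLM or Ollama expose). The model names, base_url, and the PASS/fail protocol are all placeholders, not any particular setup.

```python
# Sketch of a generate -> review -> correct loop. Assumes an
# OpenAI-compatible server; model names and base_url are hypothetical.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

MAIN_MODEL = "main-coder"    # placeholder: the big model doing the work
REVIEW_MODEL = "qwen3-4b"    # placeholder: small long-context reviewer

def chat(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def generate_with_review(task: str, max_rounds: int = 2) -> str:
    draft = chat(MAIN_MODEL, task)
    for _ in range(max_rounds):
        # The reviewer only validates; it never writes the fix itself.
        verdict = chat(
            REVIEW_MODEL,
            "Review the code against the task. Reply PASS if the task "
            "is actually completed, otherwise list the concrete "
            f"problems.\n\nTask:\n{task}\n\nCode:\n{draft}",
        )
        if verdict.strip().startswith("PASS"):
            break
        # Feed the reviewer's findings back to the main model to correct.
        draft = chat(
            MAIN_MODEL,
            f"Task:\n{task}\n\nYour previous attempt:\n{draft}\n\n"
            f"A reviewer found these problems:\n{verdict}\n\nFix them.",
        )
    return draft
```

Note the reviewer never patches the code itself; it only produces a verdict, which keeps the small model in the role it's good at (spotting that something is off) rather than asking it to generate the fix.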