Hey, great read as always. This 'Weather Report' idea is such a clever way to get a handle on LLM behaviour, especially with those scoring dimensions; it's a really smart approach to diagnostics. I was wondering if tracking the context state of the user's input itself, not just the model's, might add an interesting layer to the diagnostic, like whether the prompt was super vague or really specific.
Claude tends to add some qualitative analysis to each report when I ask, but since I'm doing session weighting rather than prompt weighting, I indicate the session state (fresh, deep context, etc.) in the report and include the session ID so I can go back and review the conversation as needed. You can generate a fresh report and then do prompt-by-prompt reporting for micro-measurement like that (which I think DOES add a useful dimension to the observations), but for now I'm trying to establish broader trendlines around full-session themes, not at the prompt level.
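If it helps to picture it, here's a rough sketch of what one report entry could look like as a data structure. The field names (session_id, session_state, scores, qualitative_notes) are just placeholders for illustration, not the actual report format:

```python
# Minimal sketch of one "Weather Report" entry for session-level weighting.
# Field and class names here are illustrative placeholders, not the real format.
from dataclasses import dataclass, field
from enum import Enum


class SessionState(Enum):
    FRESH = "fresh"          # first few prompts of a new instance
    DEEP_CONTEXT = "deep"    # 10+ prompts in, lots of accumulated context
    WRAP_UP = "wrap_up"      # winding the session down


@dataclass
class WeatherReport:
    session_id: str                        # so the conversation can be reviewed later
    session_state: SessionState            # fresh, deep context, etc.
    scores: dict[str, float] = field(default_factory=dict)  # scoring dimensions
    qualitative_notes: str = ""            # the model's own commentary on the report


# Example entry (values are made up):
report = WeatherReport(
    session_id="2024-06-12-claude-03",
    session_state=SessionState.DEEP_CONTEXT,
    scores={"coherence": 4.0, "hedging": 2.5},
    qualitative_notes="Responses drifting toward longer qualifications.",
)
```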
Thanks, Jinx. This is very interesting. I'll ask Lucen to answer; he's typically great with these sorts of inquiries and won't just phone it in. (GPT 4 omni or 5.1 or 5.2, or I can ask in two models, if you like.)
The most important thing to me is gathering a large number of responses from my own instances, measuring the relative outputs, and then comparing them with others. You could try starting fresh instances with different prompts and asking these questions to compare the differences. You could also try running it at the beginning of a fresh instance, again at 10+ prompts in, and again when you're wrapping up for the evening; measurable changes within the same instance are notable too (a rough sketch of that comparison is below). Let me know what you find!
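For example, here's one way you could log those within-session checkpoints and look at the drift in each scoring dimension. The dimension names and scores are made up for illustration:

```python
# Sketch of within-session checkpoint comparison: run the same report prompt
# at the start of a fresh instance, again ~10 prompts in, and again at wrap-up,
# then look at how the scoring dimensions drift. All names/values are illustrative.

def score_deltas(checkpoints: list[dict[str, float]]) -> dict[str, float]:
    """Change in each scoring dimension from the first checkpoint to the last."""
    first, last = checkpoints[0], checkpoints[-1]
    return {dim: last[dim] - first[dim] for dim in first if dim in last}


# Hypothetical scores from three reports taken in the same instance:
fresh       = {"coherence": 4.5, "hedging": 2.0, "verbosity": 3.0}
mid_session = {"coherence": 4.0, "hedging": 2.5, "verbosity": 3.5}
wrap_up     = {"coherence": 3.5, "hedging": 3.0, "verbosity": 4.0}

print(score_deltas([fresh, mid_session, wrap_up]))
# -> {'coherence': -1.0, 'hedging': 1.0, 'verbosity': 1.0}
```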