thasan (Taimur)
User
Projects
- Group
User Details
- User Since
- Oct 21 2025, 6:43 PM (34 w, 1 d)
- Availability
- Available
- Review Queue
- 0
Recent Activity
Mon, Jun 15
Sat, Jun 13
Fri, Jun 12
Thu, Jun 11
Wed, Jun 10
Tue, Jun 9
I won't comment on the C++, but it's worth running the ML perftests on this. Because the first decode token now flushes eagerly, the perf harness records first-token arrival earlier, so the measured values will shift even though the metric definitions are unchanged. Expect FIRST_TOKEN_LATENCY to drop and DECODING_TOKEN_SPEED to drop toward realistic values. Both are measurement corrections, not regressions, so they shouldn't be triaged as a perf alert or backed out.
Looks good, I can accept. Side note regarding the UTF-16 code units: I traced what happens to an emoji through this tokenizer. It's not stripped and not split as punctuation; it gets swallowed into its surrounding word, which collapses to a single [UNK] token. So the metric stays internally consistent: emoji input produces both chars and tokens, so there is no char-without-work case.
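To illustrate why both numbers move in the same direction, here is a minimal sketch of the effect. It is not the perftest harness: the function name, the `flush_every` buffering model, and the formula for decode speed (tokens after the first, divided by the time between first arrival and the end) are all assumptions made for illustration.

```python
def observed_metrics(emit_times_ms, flush_every):
    """Return (first_token_latency_ms, decoding_tokens_per_second) as seen by a harness.

    emit_times_ms: when the engine produced each token, relative to request start.
    flush_every:   hypothetical buffering model; tokens reach the harness only
                   once this many have accumulated (1 means eager flushing).
    """
    # The harness first sees output when the first flush happens.
    first_flush_index = min(flush_every, len(emit_times_ms)) - 1
    first_arrival_ms = emit_times_ms[first_flush_index]
    end_ms = emit_times_ms[-1]

    # Assumed definition: decode speed covers the tokens after the first one,
    # measured over the window from first arrival to the end of generation.
    decoded_tokens = len(emit_times_ms) - 1
    window_s = (end_ms - first_arrival_ms) / 1000.0
    speed = decoded_tokens / window_s if window_s > 0 else float("inf")
    return first_arrival_ms, speed


# Ten tokens: 200 ms of prefill, then one token every 50 ms.
emit_times = [200 + 50 * i for i in range(10)]

buffered_latency, buffered_speed = observed_metrics(emit_times, flush_every=4)
eager_latency, eager_speed = observed_metrics(emit_times, flush_every=1)

print(f"buffered: latency={buffered_latency} ms, speed={buffered_speed:.1f} tok/s")
print(f"eager:    latency={eager_latency} ms, speed={eager_speed:.1f} tok/s")
```

In this toy model the engine emits tokens at the same times in both runs. Only the moment the harness first sees output changes, which shortens the measured latency and lengthens the measured decode window.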
This revision requires a Testing Policy Project Tag to be set before landing. Please apply one of , , , , . Tip: this Firefox add-on makes it easy!
Mon, Jun 8
Looks good to me. We are going to have to check the Glean telemetry to see what impact this makes.
Thu, Jun 4
Wed, Jun 3
Accepting. The best-onnx design is good. I'm going to note that it might be important to do a ./mach try run here to make sure we didn't break anything surrounding best-llama and smart tab.
Thanks for handling the feedback, this looks a lot better. Noting here that this path intentionally diverges from ONNX on throughput: tokensPerSecond/timePerOutputToken are computed over decodingTime (decode-only) rather than ONNX's inferenceTime (prefill+decode). This is a different generation engine, and I think the decode-window pattern here is more correct than what ONNX currently does.
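A small sketch of the difference between the two denominators, for reference. The field names mirror the ones mentioned above (decodingTime, inferenceTime as prefill plus decode), but the dataclass, the function names, and the sample numbers are hypothetical and are not taken from the patch.

```python
from dataclasses import dataclass


@dataclass
class GenerationTimings:
    prefill_ms: float    # time spent processing the prompt
    decoding_ms: float   # time spent generating output tokens
    output_tokens: int   # number of tokens produced during decoding


def decode_window_tokens_per_second(t: GenerationTimings) -> float:
    """Throughput over the decode window only (the pattern described above)."""
    return t.output_tokens / (t.decoding_ms / 1000.0)


def whole_inference_tokens_per_second(t: GenerationTimings) -> float:
    """Throughput over prefill + decode, for comparison."""
    return t.output_tokens / ((t.prefill_ms + t.decoding_ms) / 1000.0)


def time_per_output_token_ms(t: GenerationTimings) -> float:
    """Average decode time per generated token."""
    return t.decoding_ms / t.output_tokens


# Made-up sample: 500 ms prefill, 2000 ms decode, 100 output tokens.
sample = GenerationTimings(prefill_ms=500.0, decoding_ms=2000.0, output_tokens=100)
print(decode_window_tokens_per_second(sample))    # decode-only throughput
print(whole_inference_tokens_per_second(sample))  # diluted by prefill time
print(time_per_output_token_ms(sample))
```

With a long prompt the whole-inference figure falls even when generation speed is unchanged, which is the argument for keeping throughput on the decode window.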
Tue, Jun 2
Mon, Jun 1
LGTM, thanks for adding the remote perf run.
Thu, May 28
Thanks Joe, getting llama.cpp onto the structured metrics object is good progress, and the test is a good add.
Thanks for addressing all the feedback, the implementation looks good. Feedback for next time: this patch bundles several unrelated changes (drive-bys) into one bug. Going forward, splitting unrelated work into its own bugs would keep each patch scoped and let things land faster.
Tue, May 26
LGTM, no blocking feedback.
