Skip to content

feat: add live-update support#2

Merged
Embers-of-the-Fire merged 16 commits intomainfrom
feat/live-update
Mar 17, 2026
Merged

feat: add live-update support#2
Embers-of-the-Fire merged 16 commits intomainfrom
feat/live-update

Conversation

@Embers-of-the-Fire
Copy link
Member

@Embers-of-the-Fire Embers-of-the-Fire commented Mar 16, 2026

Live-Update 支持功能实现

概述

该 PR 为系统增加了 Live-Update(在线更新)能力,包含机器人端的指令处理、事件循环消息封装改动、服务端 API、前端控制面板与使用文档。实现支持从指定 artifact URL 下载新二进制、在可执行文件同目录创建临时文件并验证 ELF 魔数、设置可执行权限后原子替换当前可执行文件(rename)、并调度进程重启(通过 syscall.Exec,响应返回后异步触发)。支持单机与批量更新、每机结果上报与聚合。PR 主要为新增功能与文档,未破坏现有公共接口签名,但引入了出站消息封装与完成信号的语义变动,需注意兼容性。

主要变更

文档

  • 新增 docs/live-update.md:详述端到端流程、wire 协议(指令/响应)、REST API(单机与批量)、同目录临时文件与 ELF 验证、原子重命名、清理行为、重启语义(原地替换 via syscall.Exec,文档示例中提到短延迟)、以及批量更新的顺序性与重连语义。

机器人端(Go)

  • 指令注册:
    • packages/bot/eventloop/instructions/instructions.go:将 UpdateBinaryHandler 加入指令处理集合。
  • 更新指令实现:
    • packages/bot/eventloop/instructions/update_binary.go:新增 UpdateBinary 指令、请求/响应类型及处理流程(解析可执行路径并解析符号链接、在同目录创建临时文件、带超时下载工件、校验 ELF 魔数、chmod、原子 rename 替换、记录日志、异步调度 syscall.Exec 重启)。每一步包含错误处理并返回结构化响应。
  • 事件循环与消息封装改动:
    • packages/bot/eventloop/eventloop.go:eventloopSendJson 支持发送 lib.SendEnvelope;若为封装类型则提取 Payload 发送并在发送完成后关闭 Done 通道。
    • packages/bot/eventloop/share/action.go:WrapResponseAction 在上下文中创建并传递 WsSendDoneCtxKey 对应的 done 通道,并通过 WsWriter 发送 lib.SendEnvelope(包含 Payload 与 Done),以支持写完成通知。
    • packages/bot/lib/context.go:新增公开类型 SendEnvelope { Payload any; Done chan struct{} } 与上下文键类型 WsSendDoneCtxKey。

服务端 API(Rust)

  • 新增 API 端点与模块:
    • packages/service/src/api/action.rs:导出新模块 update_binary,新增两个端点:
      • POST /action/update_binary:向指定 robot_id 下发更新指令并返回单机 UpdateBinaryResponse { status, message }(含超时处理)。
      • POST /action/update_binary_all:对所有已连接机器人按序下发更新,聚合每台机器人的结果并返回总体状态("ok" 或 "partial_failure")及各机器人结果列表。
  • 新增数据结构:
    • packages/service/src/api/action/update_binary.rs:新增 UpdateBinaryRequest、UpdateBinaryResponse、UpdateBinaryAllRequest、RobotUpdateResult、UpdateBinaryAllResponse(用于序列化与 OpenAPI 文档)。
  • 服务端指令层:
    • packages/service/src/service/instructions.rs:在 Instruction / InstructionContent 中新增 UpdateBinary 变体与 UpdateBinaryMessage(artifact_url)。
    • packages/service/src/service/instructions/update_binary.rs:新增构造函数 update_binary,用于生成包含 artifact_url 的 Instruction 并将响应通过 oneshot 通道转发回调用者。

前端(TypeScript / Svelte)

  • 客户端 API:
    • packages/workstation/src/lib/api/action/update_binary.ts:新增 actionUpdateBinary(单机,30s 超时)与 actionUpdateBinaryAll(批量,60s 超时),使用 zod 校验请求/响应并暴露相应类型与常量。
  • UI 与路由:
    • packages/workstation/src/routes/dashboard/SidebarContent.svelte:在侧边栏新增 “Update” 菜单项。
    • packages/workstation/src/routes/dashboard/update/+page.svelte:新增更新页面,包含 artifact URL、目标机器人选择、更新单机与批量两种动作,展示加载态、错误与单机/批量结果(按状态着色)。
    • packages/workstation/src/routes/dashboard/update/+page.ts:新增 load 实现(禁用 prerender 与 SSR),获取 onlineRobots 作为页面数据。

错误处理、安全与实现要点

  • 下载写入与可执行文件置于同一目录以保证原子 rename(同文件系统)。
  • 校验 ELF 魔数以避免部署非 ELF 二进制。
  • 设置可执行权限并使用原子 rename 防止部分写入造成损坏。
  • 每关键步骤记录日志并返回明确的 status/message;失败返回 status="error" 与说明。
  • 重启采用 syscall.Exec 的原地替换方式,异步调度以确保响应先行返回。
  • 批量更新按顺序对每台机器人单独下发并记录结果,服务端聚合返回。

导出与兼容性影响

  • 新增导出项:
    • Go:InstructionUpdateBinary、UpdateBinaryRequest/Response、UpdateBinaryHandler;lib 新增 SendEnvelope 与 WsSendDoneCtxKey。
    • Rust:API 模块 update_binary、端点与请求/响应类型、Service 层的 Instruction 变体与消息类型。
    • 前端:actionUpdateBinary / actionUpdateBinaryAll、zod 类型、路由与组件。
  • 未对现有公共接口做破坏性修改(主要为新增),但引入了出站消息封装与 Done 通道的语义,需要确认 WsWriter 与消息消费者正确消费 Done 通道以保证兼容性。

影响范围与审查关注点

  • 跨语言端到端功能(Go 二进制替换与重启、Rust 服务端派发与超时聚合、前端触发与展示、文档),代码量与关注点较多,评审难度中等偏高。
  • 重点关注:
    • 下载与验证流程的安全性(artifact 源、TLS、超时、内容校验、路径注入风险)。
    • 原子替换与权限设置在不同平台/文件系统上的兼容性与回退策略。
    • syscall.Exec 重启的语义与副作用(环境变量、工作目录、打开的文件描述符、日志/监控/心跳影响)。
    • 批量更新的错误传播、超时策略、以及在部分失败情况下的运维可见性/重试策略。
    • 新的 SendEnvelope/Done 完成信号是否在现有 WebSocket 写入路径中被正确消费与处理,避免死锁或丢失通知。

Embers-of-the-Fire and others added 3 commits March 16, 2026 19:00
Add a new instruction handler that enables remote binary updates on edge
devices. The handler downloads a binary from a provided artifact URL,
validates ELF magic bytes, atomically replaces the current executable
via same-filesystem rename, and schedules a restart using syscall.Exec.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add UpdateBinary variant to the Instruction and InstructionContent enums
with PingPong-based session handling. Expose POST /action/update_binary
for single-bot updates and POST /action/update_binary_all for fleet-wide
binary updates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a dashboard page for remotely updating bot binaries. Includes API
client with zod-validated request/response schemas, a page loader that
fetches online robots, and a form UI with single-bot and fleet-wide
update actions. Adds an Update entry to the dashboard sidebar.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Embers-of-the-Fire Embers-of-the-Fire self-assigned this Mar 16, 2026
@coderabbitai
Copy link

coderabbitai bot commented Mar 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

新增跨前端/服务端/机器人端的二进制实时更新功能:包含文档、机器人端指令与下载/验证/原子替换实现、服务端单机与批量 API、前端客户端与 UI,以及 WebSocket 发送完成信号的协同与异步重启语义(无其他兼容性变更)。

Changes

Cohort / File(s) Summary
文档
docs/live-update.md
新增 Live-Update 文档,描述端到端流程、线协议、REST 端点、重启语义与批量行为。
机器人端 指令与实现(Go)
packages/bot/eventloop/instructions/..., packages/bot/eventloop/instructions/update_binary.go
新增 update_binary 指令常量、请求/响应类型与处理器;实现下载(HTTP)、ELF 验证、同目录临时文件、chmod +x、原子重命名并异步 syscall.Exec 重启;在 instructionHandlers 中注册新处理器。
事件循环与发送信号更改(Go)
packages/bot/eventloop/eventloop.go, packages/bot/eventloop/share/action.go, packages/bot/lib/context.go
新增 SendEnvelope{Payload,Done}WsSendDoneCtxKey;WrapResponseAction 在上下文中创建 Done 通道并通过 SendEnvelope 发送;eventloop 处理展开 SendEnvelope 并在写完成后关闭 Done 通道以实现写完成通知。
服务端 API(Rust)
packages/service/src/api/action.rs, packages/service/src/api/action/update_binary.rs, packages/service/src/service/instructions.rs, packages/service/src/service/instructions/update_binary.rs
新增 API 模块与端点 POST /action/update_binaryPOST /action/update_binary_all;添加请求/响应 schema、OpenAPI 注解、Instruction::UpdateBinary 变体、会话构造函数与批量结果聚合和超时处理。
前端 客户端(TypeScript)
packages/workstation/src/lib/api/action/update_binary.ts
新增客户端函数 actionUpdateBinaryactionUpdateBinaryAll、Zod schema、端点常量、请求超时与错误处理,返回单机或每机器人结果结构。
前端 UI(Svelte)
packages/workstation/src/routes/dashboard/SidebarContent.svelte, packages/workstation/src/routes/dashboard/update/+page.svelte, packages/workstation/src/routes/dashboard/update/+page.ts
在侧栏新增 “Update” 项;新增更新页面,包含 Artifact URL、目标机器人选择、单机与批量更新按钮、加载/结果显示;页面禁用 prerender 与 SSR 并加载在线机器人列表。

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Frontend as Frontend (Svelte)
    participant API as Service API (Rust)
    participant Bot as Bot (Go)

    User->>Frontend: 输入 artifact_url 并触发更新
    Frontend->>API: POST /action/update_binary(_all) {artifact_url, robot_id?}
    API->>Bot: 通过会话发送 Instruction::UpdateBinary {artifact_url}
    activate Bot
    Bot->>Bot: 解析可执行路径并在同目录创建临时文件
    Bot->>Bot: 下载工件(HTTP)并验证 ELF 魔数
    Bot->>Bot: chmod +x 并原子重命名替换可执行文件
    Bot->>Bot: 等待写完成通知(Done 通道)并异步 syscall.Exec 重启
    Bot-->>API: 返回 UpdateBinaryResponse {status, message}
    deactivate Bot
    API-->>Frontend: 返回单机或聚合结果
    Frontend->>User: 展示结果(含每机器人状态)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

诗歌

🐰 新工件飘到我门前,
轻嗅 ELF 闪着银色线,
暂存同目录悄作替,
原子置换不留悬念,
重启归来又是春天!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 23.53% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title accurately summarizes the primary feature being added: live-update support across the codebase including documentation, bot binary update functionality, service API endpoints, and UI components.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/live-update
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Document the end-to-end binary update flow, wire protocol JSON examples,
REST API endpoints, safety guarantees (atomic rename, ELF validation,
same-filesystem temp), and restart semantics via syscall.Exec.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (4)
packages/service/src/api/action/update_binary.rs (1)

4-8: 可选改进:考虑使用更强的类型约束。

robot_id 字段使用 String 类型是可行的,但如果 API 始终期望 UUID 格式,使用 uuid::Uuid 类型可以在编译时和反序列化时提供更强的验证。当前实现可以正常工作,这只是一个可选的类型安全增强建议。

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/service/src/api/action/update_binary.rs` around lines 4 - 8, The
robot_id field in UpdateBinaryRequest currently uses String; to enforce UUID
format, change UpdateBinaryRequest::robot_id to uuid::Uuid, add the uuid::Uuid
import, ensure serde support for Uuid is enabled (or add #[serde(with = "...")]
if needed), and update any code that constructs or reads UpdateBinaryRequest
(e.g., API handlers, tests) to pass/parse a Uuid instead of a raw String so
deserialization and compile-time typing enforce UUID validity.
packages/workstation/src/routes/dashboard/update/+page.ts (1)

7-13: 考虑为数据获取添加错误处理。

当前 fetchStatsOnlineRobots 调用失败时会直接导致页面加载失败。考虑添加 try-catch 以提供更友好的降级体验(例如返回空数组并在 UI 中显示提示)。

♻️ 可选的错误处理方案
 export const load: PageLoad = async ({ fetch }) => {
-  const onlineRobots = await fetchStatsOnlineRobots(fetch);
-
-  return {
-    onlineRobots,
-  };
+  try {
+    const onlineRobots = await fetchStatsOnlineRobots(fetch);
+    return { onlineRobots };
+  } catch (err) {
+    console.error('Failed to fetch online robots:', err);
+    return { onlineRobots: [] };
+  }
 };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/workstation/src/routes/dashboard/update/`+page.ts around lines 7 -
13, 在 PageLoad 的 load 函数中为 fetchStatsOnlineRobots 添加错误处理:把调用
fetchStatsOnlineRobots(fetch) 包在 try-catch 中(引用函数名 fetchStatsOnlineRobots 和返回字段
onlineRobots),在 catch 分支返回友好的降级值(例如 onlineRobots: []),并在 catch
中记录或上报错误以便调试;确保返回结构与原来一致以让页面接收降级数据。
packages/workstation/src/routes/dashboard/update/+page.svelte (1)

82-94: 可选改进:考虑分离加载状态。

当前两个按钮共享同一个 loading 状态,导致点击任一按钮时两个按钮都会显示 Spinner。虽然按钮已正确禁用,但如果希望更精确地指示哪个操作正在进行,可以考虑使用独立的加载状态。

♻️ 分离加载状态的可选方案
- let loading = $state(false);
+ let loadingSelected = $state(false);
+ let loadingAll = $state(false);
+ const loading = $derived(loadingSelected || loadingAll);

然后在各自的处理函数中使用对应的状态变量,并在对应按钮中检查具体的加载状态来决定是否显示 Spinner。

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/workstation/src/routes/dashboard/update/`+page.svelte around lines
82 - 94, Current shared loading state causes both buttons to show a Spinner;
introduce separate boolean states (e.g., loadingUpdateSelected and
loadingUpdateAll) and switch the bindings: in the handler for the selected-bot
update (the function bound to the first Button's onclick) set
loadingUpdateSelected true/false around the async work, and in handleUpdateAll
set loadingUpdateAll true/false; update the Buttons to disable based on their
specific loading state (disabled={loadingUpdateSelected || !artifactUrl} and
disabled={loadingUpdateAll || !artifactUrl}) and show Spinner conditionally
using {`#if` loadingUpdateSelected} and {`#if` loadingUpdateAll} respectively so
only the active action shows a spinner.
docs/live-update.md (1)

7-32: 为流程图代码块添加语言标识符。

静态分析工具提示此代码块缺少语言标识符。对于 ASCII 流程图,建议使用 textplaintext 作为语言标识符以满足 lint 规则。

📝 建议修复
-```
+```text
 Frontend                  Service                    Bot
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/live-update.md` around lines 7 - 32, 文档中的 ASCII 流程图代码块缺少语言标识符(在
docs/live-update.md 的三重反引号代码块),导致静态分析报警;请在该代码块的起始 ``` 后添加语言标识符(例如 ```text 或
```plaintext),以标明这是纯文本/ASCII 流程图并满足 lint 规则,确保只修改包含 “Frontend Service Bot”
流程图的代码块。
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/bot/eventloop/instructions/update_binary.go`:
- Around line 44-45: 不要把完整的预签名 URL 打到日志;在 UpdateBinaryAction 中,当记录
req.ArtifactUrl 时(函数名 UpdateBinaryAction、变量 req.ArtifactUrl、logger.Logger().Info
调用处),把 URL 的敏感部分掩码或只记录 host/path 摘要:解析 req.ArtifactUrl,移除或替换 query
string(或将其全部打码为 "***"),然后把处理后的安全字符串传给 logger,而不是原始 req.ArtifactUrl。
- Around line 72-109: The current flow only checks ELF magic and then replaces
the running binary; enforce transport and content integrity: (1) verify
req.ArtifactUrl uses HTTPS and return an error if not; (2) after io.Copy to
tmpPath (and after the ELF magic check), compute the file's SHA-256 and compare
it to an expected checksum provided by the update request (e.g., req.SHA256) or,
if the request supplies a signature (e.g., req.Signature and a known public
key), verify the signature against the downloaded bytes; (3) only proceed to
chmod/rename/replace the executable when the checksum or signature verification
succeeds; (4) keep existing cleanup paths (cleanup(), os.Remove(tmpPath)) on any
verification failure. Reference symbols to change: req.ArtifactUrl,
req.SHA256/req.Signature, tmpFile/tmpPath, io.Copy, elfMagic, the ELF validation
block, and the final chmod/rename replacement logic.
- Around line 73-78: 在 UpdateBinaryAction 中不要使用 http.Get 无上下文/超时;用 ctx 和显式超时的
http.Client 替换:创建一个带 Timeout 的 http.Client(例如 30s),用
http.NewRequestWithContext(ctx, http.MethodGet, req.ArtifactUrl, nil) 构造请求并通过
client.Do 发起请求,按原逻辑在请求构建或 Do 返回错误时调用 cleanup() 并返回 UpdateBinaryResponse,成功后继续使用
resp 并确保 resp.Body.Close() 被 defer 调用以释放资源。

In `@packages/service/src/api/action.rs`:
- Around line 136-140: CI is failing cargo fmt; run cargo fmt on the repository
and commit the formatting changes so the code around CONNECTIONS.get(...),
conn.value().send_instruction::<serde_json::Value>(Instruction::UpdateBinary {
artifact_url: request.artifact_url.clone(), ... }) and the similar block at the
later location (the 188-193 section) are formatted per rustfmt; ensure you stage
and push the formatted files before merging.
- Around line 187-217: The current loop in update_binary_all uses
CONNECTIONS.iter() and await on conn.value().send_instruction::<AnyDeserialize>
serially and treats any Ok(_) as success, which misrepresents bot-side failures
and causes linear latency; change this to either (A) fire-and-forget by spawning
tasks for each conn.send_instruction and immediately return UpdateBinaryResponse
with "sent" semantic, or (B) concurrently collect results by mapping each conn
to a spawned async task (or join_all) that deserializes into a strong response
type (replace AnyDeserialize with the concrete update response type), aggregate
per-robot statuses/errors, and return an overall UpdateBinaryResponse reflecting
failures. Ensure to reference CONNECTIONS.iter(),
send_instruction::<AnyDeserialize>(), and update_binary::UpdateBinaryResponse
when making the change.

In `@packages/workstation/src/lib/api/action/update_binary.ts`:
- Around line 39-44: 非 2xx 响应不应只抛出 status/statusText,要读取并包含服务端返回的错误正文来保留后端具体原因;在
update_binary.ts 的非 OK 分支(检查 response.ok 的地方)先 await response.text()(或尝试
response.json() 并回退到 text),然后将该文本包含到抛出的 Error 消息中(例如 "Error updating binary:
<status> <statusText>: <serverBody>"),并在文件中其它类似位置(提到的 62-67 区块)做相同处理;保持对正常分支使用
await response.json() 并传入 ActionUpdateBinaryResponse.parse(data) 不变。

---

Nitpick comments:
In `@docs/live-update.md`:
- Around line 7-32: 文档中的 ASCII 流程图代码块缺少语言标识符(在 docs/live-update.md
的三重反引号代码块),导致静态分析报警;请在该代码块的起始 ``` 后添加语言标识符(例如 ```text 或
```plaintext),以标明这是纯文本/ASCII 流程图并满足 lint 规则,确保只修改包含 “Frontend Service Bot”
流程图的代码块。

In `@packages/service/src/api/action/update_binary.rs`:
- Around line 4-8: The robot_id field in UpdateBinaryRequest currently uses
String; to enforce UUID format, change UpdateBinaryRequest::robot_id to
uuid::Uuid, add the uuid::Uuid import, ensure serde support for Uuid is enabled
(or add #[serde(with = "...")] if needed), and update any code that constructs
or reads UpdateBinaryRequest (e.g., API handlers, tests) to pass/parse a Uuid
instead of a raw String so deserialization and compile-time typing enforce UUID
validity.

In `@packages/workstation/src/routes/dashboard/update/`+page.svelte:
- Around line 82-94: Current shared loading state causes both buttons to show a
Spinner; introduce separate boolean states (e.g., loadingUpdateSelected and
loadingUpdateAll) and switch the bindings: in the handler for the selected-bot
update (the function bound to the first Button's onclick) set
loadingUpdateSelected true/false around the async work, and in handleUpdateAll
set loadingUpdateAll true/false; update the Buttons to disable based on their
specific loading state (disabled={loadingUpdateSelected || !artifactUrl} and
disabled={loadingUpdateAll || !artifactUrl}) and show Spinner conditionally
using {`#if` loadingUpdateSelected} and {`#if` loadingUpdateAll} respectively so
only the active action shows a spinner.

In `@packages/workstation/src/routes/dashboard/update/`+page.ts:
- Around line 7-13: 在 PageLoad 的 load 函数中为 fetchStatsOnlineRobots 添加错误处理:把调用
fetchStatsOnlineRobots(fetch) 包在 try-catch 中(引用函数名 fetchStatsOnlineRobots 和返回字段
onlineRobots),在 catch 分支返回友好的降级值(例如 onlineRobots: []),并在 catch
中记录或上报错误以便调试;确保返回结构与原来一致以让页面接收降级数据。

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 47a68a4c-37c7-4610-9144-011c350a89f3

📥 Commits

Reviewing files that changed from the base of the PR and between ffce678 and 3a96bc2.

📒 Files selected for processing (11)
  • docs/live-update.md
  • packages/bot/eventloop/instructions/instructions.go
  • packages/bot/eventloop/instructions/update_binary.go
  • packages/service/src/api/action.rs
  • packages/service/src/api/action/update_binary.rs
  • packages/service/src/service/instructions.rs
  • packages/service/src/service/instructions/update_binary.rs
  • packages/workstation/src/lib/api/action/update_binary.ts
  • packages/workstation/src/routes/dashboard/SidebarContent.svelte
  • packages/workstation/src/routes/dashboard/update/+page.svelte
  • packages/workstation/src/routes/dashboard/update/+page.ts

Embers-of-the-Fire and others added 4 commits March 16, 2026 19:45
Strip query string from artifact_url before logging to prevent
presigned URL credentials from leaking into log output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…load

Replace bare http.Get() with http.NewRequestWithContext and an
http.Client with a 30-second timeout to prevent the handler from
blocking indefinitely on slow or unresponsive artifact sources.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace AnyDeserialize with serde_json::Value to parse bot responses
and report per-robot status/message. Return a new UpdateBinaryAllResponse
type with individual RobotUpdateResult entries and an overall status of
"partial_failure" when any robot reports an error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Read response body text on non-OK responses instead of only showing
status/statusText, so the UI can display backend-provided error details.
Also update the update_binary_all types to match the new per-robot
aggregate response from the service.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/live-update.md`:
- Around line 7-32: The fenced diagram block in docs/live-update.md is missing a
language marker and triggers MD040; update the opening fence (the
triple-backtick that begins the diagram) to include a plain text language tag
such as "text" or "plain" (i.e., change ``` to ```text) so the code fence is
lint-compliant while preserving the diagram content.
- Around line 89-104: 文档中的 update_binary_all 响应示例和文字描述仍然使用旧的 {status, message}
结构,需要更新为与服务端实现一致的 {status, results[]};修改 live-update.md 中示例响应 JSON 为包含 results
数组(每项描述单个机器人更新结果和可能的 per-bot error 信息或状态),并调整说明文字为“服务端返回 status 和 results[],每个
results 项对应一个机器人,单机错误只记录在 results 中/服务器端日志,不会使整个请求失败”(参照 action.rs
中的实现以确保字段名和结构一致)。

In `@packages/bot/eventloop/instructions/update_binary.go`:
- Around line 143-150: The fixed 1s sleep before calling syscall.Exec is unsafe;
replace the sleep-based restart with a synchronization signal from the WebSocket
writer: create a restart/done channel (e.g., restartDone chan struct{}) and have
eventloopSendJson send a completion signal on it after the wsjson.Write path has
finished handling the response; change the goroutine that currently sleeps to
instead block on that channel (with an optional timeout fallback) and only call
syscall.Exec(execPath, os.Args, os.Environ()) after receiving the signal; update
references to eventloopSendJson and the exec goroutine so the writer emits the
done signal and the exec waiter consumes it.

In `@packages/service/src/api/action.rs`:
- Around line 147-156: The code extracts status/message from the JSON (via
info.get(...).and_then(...).unwrap_or(...)) and currently defaults
missing/invalid status to "unknown" and only treats "error" as failure, which
lets malformed responses be reported as success; change the parsing in the
status extraction (the variables created from info.get("status")) to treat
missing/invalid/unknown as a failure sentinel (e.g., default to "error" or set a
boolean parse_failed flag), and update the downstream check (the logic that
currently only treats status == "error") to consider parse failures/unrecognized
values as failure (i.e., treat anything other than explicit "ok" as failure) so
that functions/methods handling status in action.rs consistently mark
malformed/unknown responses as failed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9c1b26a2-8727-4c93-962b-8700fb559c5e

📥 Commits

Reviewing files that changed from the base of the PR and between 3a96bc2 and 5b03f7b.

📒 Files selected for processing (6)
  • docs/live-update.md
  • packages/bot/eventloop/instructions/update_binary.go
  • packages/service/src/api/action.rs
  • packages/service/src/api/action/update_binary.rs
  • packages/workstation/src/lib/api/action/update_binary.ts
  • packages/workstation/src/routes/dashboard/update/+page.svelte
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/workstation/src/routes/dashboard/update/+page.svelte

Embers-of-the-Fire and others added 3 commits March 16, 2026 20:25
…ation

Update the response schema from {status, message} to {status, results[]}
to match the current per-robot aggregation endpoint. Clarify batch
behavior description.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…restart

Introduce SendEnvelope with a done channel that the eventloop send
goroutine closes after wsjson.Write completes. WrapResponseAction now
sends responses through this envelope, and UpdateBinaryAction waits on
the done signal before calling syscall.Exec, eliminating the race
between the fixed sleep and actual WebSocket flush.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Default missing status to "error" instead of "unknown", and default
missing message to "invalid response: missing fields". In the
update_binary_all aggregation, check for the expected success status
("post_update") rather than only matching "error", so malformed or
unrecognised responses correctly trigger partial_failure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (1)
docs/live-update.md (1)

7-7: ⚠️ Potential issue | 🟡 Minor

给这个围栏代码块补上语言标记。

这里仍会触发 markdownlint 的 MD040;把 opening fence 改成带 text 语言标记的 fence 即可。

🛠️ 建议修改
-```
+```text
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/live-update.md` at line 7, The code fence in the documentation's bare
triple-backtick block should include a language tag to satisfy markdownlint
MD040; update the opening fence from ``` to ```text so the block becomes
language-marked (i.e., replace the bare opening fence with one that specifies
"text").
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/live-update.md`:
- Around line 98-115: Update the example response's top-level "status" from "ok"
to "partial_failure" to match the aggregation semantics implemented in action.rs
(the API's batch update aggregation) when any item in "results" has status
"error" — i.e., change the sample JSON to show "status": "partial_failure" to
reflect that one robot returned an error.

In `@packages/bot/eventloop/eventloop.go`:
- Around line 60-64: The current code closes the done channel regardless of
backend.SendJson outcome, causing callers (e.g., the restart goroutine in
update_binary.go) to treat failed writes as successful; change the logic in the
function that calls backend.SendJson so that close(done) is executed only when
err == nil (i.e., after a successful backend.SendJson), keep done untouched on
error, and if you still need to trigger a restart or fallback on send failures
implement a separate failure/timeout signal (e.g., a dedicated errChan or
context timeout) instead of reusing done.

In `@packages/bot/eventloop/instructions/update_binary.go`:
- Around line 110-128: 当前只检查 ELF 魔数(elfMagic)会把错误架构的二进制当作有效文件覆盖掉可执行文件;在执行
os.Rename 并返回 post_update 之前,请打开并解析该临时文件的 ELF 头(例如使用 debug/elf 包的 elf.Open 或读取
e_ident / e_type / e_machine 字段),并验证 ELF Class(32/64)和 Machine 与当前运行时匹配(将
elf.File.Class 与期望的 elf.ELFCLASS64/ELFCLASS32、elf.File.Machine 与对应
runtime.GOARCH 映射比对);如果不匹配,删除 tmpPath 并返回 UpdateBinaryResponse
错误(类似现有错误分支),只有在验证通过后再执行 os.Rename、调用 post_update 并允许 syscall.Exec 执行新二进制。

In `@packages/service/src/api/action.rs`:
- Around line 136-144: The HTTP handler is directly awaiting
conn.value().send_instruction(...) (called with Instruction::UpdateBinary) so a
stuck bot blocks the request; wrap each send_instruction call in a timeout
(e.g., tokio::time::timeout) at the API layer and treat a timeout as a per-robot
failure response instead of hanging the whole handler—apply this change for the
send_instruction usage at the CONNECTIONS.get(&request.robot_id) site and
similarly for the batch path (the update_binary_all usage around the other
occurrence), returning a clear timeout/failure result for that robot while
allowing others to proceed.

---

Duplicate comments:
In `@docs/live-update.md`:
- Line 7: The code fence in the documentation's bare triple-backtick block
should include a language tag to satisfy markdownlint MD040; update the opening
fence from ``` to ```text so the block becomes language-marked (i.e., replace
the bare opening fence with one that specifies "text").

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8be5e88b-c625-458e-befa-185c269baa5e

📥 Commits

Reviewing files that changed from the base of the PR and between 5b03f7b and 1ac6088.

📒 Files selected for processing (6)
  • docs/live-update.md
  • packages/bot/eventloop/eventloop.go
  • packages/bot/eventloop/instructions/update_binary.go
  • packages/bot/eventloop/share/action.go
  • packages/bot/lib/context.go
  • packages/service/src/api/action.rs

Align the live-update batch response example with the
service aggregation semantics so mixed outcomes report
partial_failure at the top level.
Only signal update completion after SendJson succeeds so
restart handlers do not treat failed websocket writes as
flushed responses.
Wrap update_binary instruction waits with a 60-second API
timeout so hung bots stop blocking HTTP handlers and batch
updates report timeouts as per-robot failures.
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
packages/service/src/api/action.rs (1)

178-198: 超时与指令失败的 HTTP 状态码不一致。

当前实现中:

  • 指令失败(Ok(Err(err)))返回 HTTP 400(Line 186-188)
  • 超时(Err(_))返回 HTTP 200 并在响应体中标记 status: "error"(Line 196)

这种不一致可能导致客户端处理逻辑复杂化。建议统一:要么都返回 HTTP 200 + 业务状态码,要么将超时也作为 HTTP 错误返回。

🛠️ 可选方案:统一返回 HTTP 200 + 业务错误
                Ok(Err(err)) => {
                    log::error!(
                        "Failed to update binary on robot {}: {:?}",
                        request.robot_id,
                        err
                    );
-                   Err(GenericResponse::BadRequest(PlainText(
-                       "failed to update binary".to_string(),
-                   )))
+                   Ok(Json(update_binary::UpdateBinaryResponse {
+                       status: "error".to_string(),
+                       message: format!("instruction failed: {}", err),
+                   }))
                }
packages/bot/eventloop/eventloop.go (1)

55-58: 考虑兼容 *lib.SendEnvelope,提高接口健壮性。

当前仓库中 SendEnvelope 仅以值类型构造并发送(见 packages/bot/eventloop/share/action.go:48),但在 packages/bot/eventloop/eventloop.go:55 的断言中只处理值类型。为避免未来上游代码意外传入指针时出现静默失效,建议同时处理指针类型。

♻️ 建议修改
 			payload := msg
 			var done chan struct{}
-			if env, ok := msg.(lib.SendEnvelope); ok {
-				payload = env.Payload
-				done = env.Done
-			}
+			switch env := msg.(type) {
+			case lib.SendEnvelope:
+				payload = env.Payload
+				done = env.Done
+			case *lib.SendEnvelope:
+				if env != nil {
+					payload = env.Payload
+					done = env.Done
+				}
+			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/bot/eventloop/eventloop.go` around lines 55 - 58, The type assertion
for lib.SendEnvelope in the msg handling only covers the value type and should
also accept a pointer to make the interface robust; update the branch around the
msg type check (the block that currently does `if env, ok :=
msg.(lib.SendEnvelope) { payload = env.Payload; done = env.Done }`) to
additionally check for `*lib.SendEnvelope` (or use a type switch) and extract
Payload and Done from the pointer case as well so both value and pointer
SendEnvelope instances are handled identically.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@packages/bot/eventloop/eventloop.go`:
- Around line 55-58: The type assertion for lib.SendEnvelope in the msg handling
only covers the value type and should also accept a pointer to make the
interface robust; update the branch around the msg type check (the block that
currently does `if env, ok := msg.(lib.SendEnvelope) { payload = env.Payload;
done = env.Done }`) to additionally check for `*lib.SendEnvelope` (or use a type
switch) and extract Payload and Done from the pointer case as well so both value
and pointer SendEnvelope instances are handled identically.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ec6b50e6-1074-4abf-86af-5582e25c9a98

📥 Commits

Reviewing files that changed from the base of the PR and between 1ac6088 and afa09a8.

📒 Files selected for processing (4)
  • docs/live-update.md
  • packages/bot/eventloop/eventloop.go
  • packages/bot/lib/context.go
  • packages/service/src/api/action.rs

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
docs/live-update.md (1)

7-7: ⚠️ Potential issue | 🟡 Minor

请为该围栏代码块补充语言标记(仍会触发 MD040)。

这里的流程图代码块起始围栏缺少语言标记,markdownlint-cli2 会持续告警。建议改为 ```text(或 ```plain)。

建议修改
-```
+```text
 Frontend                  Service                    Bot
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/live-update.md` at line 7, 当前文档中的流程图围栏代码块缺少语言标记,导致 markdownlint-cli2 的
MD040 报警;请在该起始围栏后添加语言标记(例如 将 ``` 改为 ```text 或
```plain),以便消除警告并保持渲染不变;查找文档中该流程图的起始 ``` 围栏并替换为 ```text(或 ```plain)。
🧹 Nitpick comments (1)
packages/service/src/api/action.rs (1)

228-281: 考虑并发执行以减少总耗时。

当前实现串行等待每个机器人的响应。在最坏情况下(N 台机器人都超时),总耗时为 N × 60 秒。如果机器人数量较多,可以考虑使用 futures::future::join_all 并发执行以改善响应时间。

♻️ 并发执行示例(可选)
use futures::future::join_all;

// 收集所有 futures
let futures: Vec<_> = CONNECTIONS
    .iter()
    .map(|conn| {
        let robot_id = conn.key().clone();
        let artifact_url = request.artifact_url.clone();
        let conn_value = conn.value().clone(); // 需要 Connection 实现 Clone
        async move {
            let result = timeout(
                UPDATE_BINARY_TIMEOUT,
                conn_value.send_instruction::<serde_json::Value>(
                    Instruction::UpdateBinary { artifact_url },
                ),
            )
            .await;
            (robot_id, result)
        }
    })
    .collect();

// 并发执行
let outcomes = join_all(futures).await;

注意:这需要确认 Connection 类型支持克隆或使用 Arc 包装。如果当前架构不支持,串行执行也是可接受的方案。

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/service/src/api/action.rs` around lines 228 - 281, The loop over
CONNECTIONS is serial and should be converted to concurrent execution: build a
Vec of futures by mapping CONNECTIONS.iter() to async blocks that capture
robot_id, request.artifact_url, and a cloned/Arc-wrapped conn.value (so
send_instruction can be called concurrently), each future performing the
timeout(UPDATE_BINARY_TIMEOUT,
conn_value.send_instruction::<serde_json::Value>(Instruction::UpdateBinary {
artifact_url })). Use futures::future::join_all to await all futures, then
iterate the outcomes and for each (robot_id, result) run the same match logic
that calls parse_update_binary_response,
update_binary_instruction_failure_response, or update_binary_timeout_response
and pushes update_binary::RobotUpdateResult while setting has_failure
appropriately; ensure Connection is Clone or wrap it in Arc before cloning.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@docs/live-update.md`:
- Line 7: 当前文档中的流程图围栏代码块缺少语言标记,导致 markdownlint-cli2 的 MD040 报警;请在该起始围栏后添加语言标记(例如
将 ``` 改为 ```text 或 ```plain),以便消除警告并保持渲染不变;查找文档中该流程图的起始 ``` 围栏并替换为 ```text(或
```plain)。

---

Nitpick comments:
In `@packages/service/src/api/action.rs`:
- Around line 228-281: The loop over CONNECTIONS is serial and should be
converted to concurrent execution: build a Vec of futures by mapping
CONNECTIONS.iter() to async blocks that capture robot_id, request.artifact_url,
and a cloned/Arc-wrapped conn.value (so send_instruction can be called
concurrently), each future performing the timeout(UPDATE_BINARY_TIMEOUT,
conn_value.send_instruction::<serde_json::Value>(Instruction::UpdateBinary {
artifact_url })). Use futures::future::join_all to await all futures, then
iterate the outcomes and for each (robot_id, result) run the same match logic
that calls parse_update_binary_response,
update_binary_instruction_failure_response, or update_binary_timeout_response
and pushes update_binary::RobotUpdateResult while setting has_failure
appropriately; ensure Connection is Clone or wrap it in Arc before cloning.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 425723e4-1ebd-4965-844b-456396344f7b

📥 Commits

Reviewing files that changed from the base of the PR and between afa09a8 and d42ceef.

📒 Files selected for processing (3)
  • docs/live-update.md
  • packages/bot/eventloop/eventloop.go
  • packages/service/src/api/action.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/bot/eventloop/eventloop.go

@Embers-of-the-Fire Embers-of-the-Fire merged commit c75972a into main Mar 17, 2026
4 checks passed
@Embers-of-the-Fire Embers-of-the-Fire deleted the feat/live-update branch March 20, 2026 13:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant