Merged
16 commits
0a55b02
feat(bot): add update_binary instruction handler for live self-update
Embers-of-the-Fire Mar 16, 2026
8f06ea8
feat(service): add update_binary instruction and API endpoints
Embers-of-the-Fire Mar 16, 2026
7cd2f04
feat(workstation): add binary update management UI
Embers-of-the-Fire Mar 16, 2026
252b3a1
docs: add live-update feature documentation
Embers-of-the-Fire Mar 16, 2026
9c6ba39
fix(bot): redact artifact URL query params from info log
Embers-of-the-Fire Mar 16, 2026
620c22e
fix(bot): add request timeout and context cancellation to binary down…
Embers-of-the-Fire Mar 16, 2026
c67657d
fix(service): properly aggregate per-robot results in update_binary_all
Embers-of-the-Fire Mar 16, 2026
5b03f7b
fix(workstation): include server error body in client error messages
Embers-of-the-Fire Mar 16, 2026
fa4bb73
docs: align update_binary_all response example with service implement…
Embers-of-the-Fire Mar 16, 2026
3ccef8f
fix(bot): replace fixed 1s sleep with write-completion signal before …
Embers-of-the-Fire Mar 16, 2026
1ac6088
fix(service): treat unknown or malformed bot responses as failures
Embers-of-the-Fire Mar 16, 2026
04403fd
docs: fix batch update status example
Embers-of-the-Fire Mar 16, 2026
fce49fe
fix: gate bot restart on successful ws writes
Embers-of-the-Fire Mar 16, 2026
afa09a8
fix: bound update binary api waits
Embers-of-the-Fire Mar 16, 2026
d42ceef
fix(live-update): chore fix job
Embers-of-the-Fire Mar 16, 2026
67c36cf
perf(service): run update_binary_all robot requests concurrently
Embers-of-the-Fire Mar 16, 2026
159 changes: 159 additions & 0 deletions docs/live-update.md
@@ -0,0 +1,159 @@
# Live-Update Feature

A remote binary update mechanism for edge bot devices. The service sends an update instruction with an artifact URL to a connected bot over WebSocket. The bot downloads the artifact, validates it, atomically replaces its own executable, and then restarts in place.

## End-to-End Flow

```
Frontend               Service                    Bot
│                         │                        │
│ POST /action/           │                        │
│ update_binary           │                        │
│────────────────────────>│                        │
│                         │ WS: update_binary      │
│                         │ instruction            │
│                         │───────────────────────>│
│                         │                        │ 1. Resolve exec path
│                         │                        │ 2. Download binary
│                         │                        │ 3. Validate ELF magic
│                         │                        │ 4. chmod 0755
│                         │                        │ 5. Atomic rename
│                         │                        │
│                         │ WS: response           │
│                         │<───────────────────────│
│ 200 OK                  │                        │
│<────────────────────────│                        │
│                         │                        │ 6. Wait for WS flush
│                         │                        │ 7. syscall.Exec
│                         │                        │    (process restarts)
│                         │                        │
│                         │ WS: reconnect          │
│                         │<───────────────────────│
```

## Wire Protocol

### Instruction (Service → Bot)

```json
{
  "instruction": "update_binary",
  "message": {
    "artifact_url": "https://artifacts.example.com/bot/v1.2.3/bot-linux-amd64"
  }
}
```

### Response (Bot → Service)

**Success:**

```json
{
  "status": "post_update",
  "message": "success, restarting..."
}
```

**Error:**

```json
{
  "status": "error",
  "message": "downloaded file is not a valid ELF binary"
}
```
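
The envelope and reply above can be decoded with plain `encoding/json`. The following is a minimal sketch, not the bot's actual code: the type names `Instruction`, `UpdateBinaryMessage`, and `Response` and the helper `decodeUpdateBinary` are hypothetical, and the real wire format may wrap these payloads in additional fields not shown in this document.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Instruction mirrors the service-to-bot envelope shown above.
// The type name is illustrative and not taken from the codebase.
type Instruction struct {
	Instruction string          `json:"instruction"`
	Message     json.RawMessage `json:"message"`
}

// UpdateBinaryMessage is the payload carried by an update_binary instruction.
type UpdateBinaryMessage struct {
	ArtifactURL string `json:"artifact_url"`
}

// Response mirrors the bot-to-service reply.
type Response struct {
	Status  string `json:"status"`
	Message string `json:"message"`
}

// decodeUpdateBinary parses an envelope and returns the artifact URL,
// rejecting any instruction other than update_binary.
func decodeUpdateBinary(raw []byte) (string, error) {
	var in Instruction
	if err := json.Unmarshal(raw, &in); err != nil {
		return "", fmt.Errorf("decode envelope: %w", err)
	}
	if in.Instruction != "update_binary" {
		return "", fmt.Errorf("unexpected instruction %q", in.Instruction)
	}
	var msg UpdateBinaryMessage
	if err := json.Unmarshal(in.Message, &msg); err != nil {
		return "", fmt.Errorf("decode message: %w", err)
	}
	if msg.ArtifactURL == "" {
		return "", fmt.Errorf("artifact_url is empty")
	}
	return msg.ArtifactURL, nil
}

func main() {
	raw := []byte(`{"instruction":"update_binary","message":{"artifact_url":"https://artifacts.example.com/bot/v1.2.3/bot-linux-amd64"}}`)
	url, err := decodeUpdateBinary(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(url)

	// Encode the success reply documented above.
	reply, _ := json.Marshal(Response{Status: "post_update", Message: "success, restarting..."})
	fmt.Println(string(reply))
}
```

Keeping `message` as `json.RawMessage` lets a dispatcher pick the handler from `instruction` first and decode the payload only once the instruction is known.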

## REST API Endpoints

### `POST /action/update_binary`

Update a single bot's binary.

**Request:**

```json
{
  "robot_id": "550e8400-e29b-41d4-a716-446655440000",
  "artifact_url": "https://artifacts.example.com/bot/v1.2.3/bot-linux-amd64"
}
```

**Response:**

```json
{
  "status": "post_update",
  "message": "success, restarting..."
}
```

If the bot reports an instruction failure or the operation times out, the
endpoint still responds with HTTP 200 and a business error payload:

```json
{
  "status": "error",
  "message": "instruction timed out after 60 seconds"
}
```

### `POST /action/update_binary_all`

Update all connected bots. Returns per-robot results with individual status and message fields.

**Request:**

```json
{
  "artifact_url": "https://artifacts.example.com/bot/v1.2.3/bot-linux-amd64"
}
```

**Response:**

```json
{
  "status": "partial_failure",
  "results": [
    {
      "robot_id": "550e8400-e29b-41d4-a716-446655440000",
      "status": "post_update",
      "message": "success, restarting..."
    },
    {
      "robot_id": "660e8400-e29b-41d4-a716-446655440001",
      "status": "error",
      "message": "downloaded file is not a valid ELF binary"
    }
  ]
}
```

The overall `status` is `"ok"` when every bot succeeds and `"partial_failure"` when any bot reports an error or returns an unrecognised response.
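
The aggregation rule can be expressed as a small function. This is a sketch under assumptions rather than the service's implementation: `RobotResult` and `overallStatus` are hypothetical names, and returning `"ok"` for an empty result list is a choice made here that the document does not specify.

```go
package main

import "fmt"

// RobotResult is one entry of the "results" array shown above.
type RobotResult struct {
	RobotID string `json:"robot_id"`
	Status  string `json:"status"`
	Message string `json:"message"`
}

// overallStatus returns "ok" only if every result is "post_update".
// "error" and any unrecognised status both count as failures.
// Assumption: an empty list yields "ok".
func overallStatus(results []RobotResult) string {
	for _, r := range results {
		if r.Status != "post_update" {
			return "partial_failure"
		}
	}
	return "ok"
}

func main() {
	results := []RobotResult{
		{RobotID: "550e8400-e29b-41d4-a716-446655440000", Status: "post_update", Message: "success, restarting..."},
		{RobotID: "660e8400-e29b-41d4-a716-446655440001", Status: "error", Message: "downloaded file is not a valid ELF binary"},
	}
	fmt.Println(overallStatus(results))
}
```

Comparing against the single known success value, instead of checking for `"error"`, is what makes malformed or unknown bot responses fail closed.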

## Safety Guarantees

| Mechanism | Purpose |
| ---------------------------- | --------------------------------------------------------------------------------------------------------------------------------- |
| **Same-directory temp file** | Ensures temp file and target are on the same filesystem, which is required for `os.Rename` to be atomic |
| **ELF magic validation** | Checks first 4 bytes (`\x7fELF`) to prevent replacing the binary with an HTML error page or other invalid content |
| **Atomic rename** | `os.Rename` on the same filesystem is an atomic operation at the VFS level — the old binary is fully replaced in a single syscall |
| **Temp file cleanup** | On any error path, the temp file is removed before returning |

## Restart Semantics

The bot uses `syscall.Exec(execPath, os.Args, os.Environ())` to restart:

- **In-place replacement**: The current process image is replaced with the new binary. The PID remains the same.
- **Write-completion gated**: A goroutine waits for the send-done signal from the eventloop (indicating the WebSocket response has been flushed) before calling `syscall.Exec`, instead of relying on a fixed delay.
- **Re-initialization**: The new binary runs `main()` from scratch, re-authenticates with the service, and re-establishes the WebSocket connection via the existing retry loop.
- **No rollback (v1)**: If the new binary fails to start, the bot stays down. Rollback is a future enhancement.
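
The write-completion gate can be illustrated without a WebSocket or a real `syscall.Exec`. In this sketch, `envelope` imitates the `lib.SendEnvelope` type added in this PR, while `runSender`, `restartAfterFlush`, and `runDemo` are hypothetical names, and the restart is replaced by a recorded event so the example is safe to run.

```go
package main

import (
	"fmt"
	"sync"
)

// envelope pairs a payload with a channel that the sender closes once the
// payload has been written. It imitates lib.SendEnvelope from the PR.
type envelope struct {
	Payload string
	Done    chan struct{}
}

// runSender drains the queue, writes each payload, then signals completion.
func runSender(queue <-chan envelope, write func(string)) {
	for env := range queue {
		write(env.Payload)
		if env.Done != nil {
			close(env.Done)
		}
	}
}

// restartAfterFlush blocks until done is closed and then calls restart.
// In the bot, restart would be the syscall.Exec call.
func restartAfterFlush(done <-chan struct{}, restart func()) {
	<-done
	restart()
}

// runDemo wires both goroutines together and returns the observed order.
func runDemo() []string {
	var mu sync.Mutex
	var events []string
	record := func(e string) {
		mu.Lock()
		defer mu.Unlock()
		events = append(events, e)
	}

	queue := make(chan envelope, 1)
	done := make(chan struct{})

	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		runSender(queue, func(p string) { record("sent " + p) })
	}()
	go func() {
		defer wg.Done()
		restartAfterFlush(done, func() { record("restart") })
	}()

	queue <- envelope{Payload: "post_update", Done: done}
	close(queue)
	wg.Wait()
	return events
}

func main() {
	fmt.Println(runDemo())
}
```

Because the channel is closed only after the write returns, the restart cannot overtake the response regardless of how the goroutines are scheduled. The flip side, visible in the PR's event loop, is that a failed write never closes the channel, so the restart does not happen in that case.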

## Batch Update Behavior

When using `update_binary_all`:

- The instruction is sent to all connected bots concurrently rather than one at a time.
- Per-bot results are collected and returned in the response, including individual status and error messages.
- Bot restarts will drop the WebSocket connection, which is expected — the bot reconnects automatically after restart.
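
Commit 67c36cf in this PR changes `update_binary_all` to issue the per-robot requests concurrently. A minimal sketch of such a fan-out follows, assuming a hypothetical `send` callback that performs one bot round trip. `updateAll` and `RobotResult` are illustrative names and not the service's code.

```go
package main

import (
	"fmt"
	"sync"
)

// RobotResult is one per-robot entry in the batch response.
type RobotResult struct {
	RobotID string
	Status  string
	Message string
}

// updateAll calls send once per robot concurrently and returns the results
// in the same order as ids. Each goroutine writes only its own slot, so no
// extra locking is needed.
func updateAll(ids []string, send func(id string) (status, message string)) []RobotResult {
	results := make([]RobotResult, len(ids))
	var wg sync.WaitGroup
	for i, id := range ids {
		wg.Add(1)
		go func(i int, id string) {
			defer wg.Done()
			status, message := send(id)
			results[i] = RobotResult{RobotID: id, Status: status, Message: message}
		}(i, id)
	}
	wg.Wait()
	return results
}

func main() {
	ids := []string{"bot-a", "bot-b"}
	results := updateAll(ids, func(id string) (string, string) {
		// Stand-in for the real WebSocket round trip.
		if id == "bot-b" {
			return "error", "downloaded file is not a valid ELF binary"
		}
		return "post_update", "success, restarting..."
	})
	for _, r := range results {
		fmt.Printf("%s: %s (%s)\n", r.RobotID, r.Status, r.Message)
	}
}
```

With this shape the total wait is governed by the slowest bot instead of the sum of all round trips, which matters when each request can take up to the instruction timeout.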
23 changes: 21 additions & 2 deletions packages/bot/eventloop/eventloop.go
@@ -39,6 +39,19 @@ func serveEventloop(ctx context.Context, backend *EventloopBackend) {
 	logger.Logger().Info("Event loop shutting down")
 }
 
+func unwrapSendEnvelope(msg any) (any, chan struct{}) {
+	switch env := msg.(type) {
+	case lib.SendEnvelope:
+		return env.Payload, env.Done
+	case *lib.SendEnvelope:
+		if env != nil {
+			return env.Payload, env.Done
+		}
+	}
+
+	return msg, nil
+}
+
 func eventloopSendJson(ctx context.Context, backend *EventloopBackend, send chan any) {
 	for {
 		select {
@@ -49,12 +62,18 @@ func eventloopSendJson(ctx context.Context, backend *EventloopBackend, send chan
 			if !ok {
 				return
 			}
-			logger.Logger().Debug("Sending JSON message", zap.Any("message", msg))
-			err := backend.SendJson(ctx, msg)
+
+			payload, done := unwrapSendEnvelope(msg)
+
+			logger.Logger().Debug("Sending JSON message", zap.Any("message", payload))
+			err := backend.SendJson(ctx, payload)
 			if err != nil {
 				logger.Logger().Error("Failed to send JSON", zap.Error(err))
 				return
 			}
+			if done != nil {
+				close(done)
+			}
 		}
 	}
 }
1 change: 1 addition & 0 deletions packages/bot/eventloop/instructions/instructions.go
@@ -18,6 +18,7 @@ type InstructionHandler struct {
 var instructionHandlers = []InstructionHandler{
 	SyncRobotNameHandler,
 	FetchNetworkHandler,
+	UpdateBinaryHandler,
 }
 
 var InstructionHandlers = func() map[string]InstructionHandler {
159 changes: 159 additions & 0 deletions packages/bot/eventloop/instructions/update_binary.go
@@ -0,0 +1,159 @@
package instructions

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"path/filepath"
	"syscall"
	"time"

	"github.com/Alliance-Algorithm/rmcs-actions/packages/bot/eventloop/share"
	"github.com/Alliance-Algorithm/rmcs-actions/packages/bot/lib"
	"github.com/Alliance-Algorithm/rmcs-actions/packages/bot/logger"
	"go.uber.org/zap"
)

const InstructionUpdateBinary = "update_binary"

// UpdateBinaryRequest is the request payload sent from the service.
type UpdateBinaryRequest struct {
	ArtifactUrl string `json:"artifact_url"`
}

// UpdateBinaryResponse is the response payload sent back to the service.
type UpdateBinaryResponse struct {
	Status  string `json:"status"`
	Message string `json:"message"`
}

// UpdateBinaryHandler registers the update_binary instruction using the
// ResponseAction pattern.
var UpdateBinaryHandler = InstructionHandler{
	Instruction: InstructionUpdateBinary,
	Action:      share.WrapResponseAction(UpdateBinaryAction),
}

// elfMagic is the first 4 bytes of any valid ELF binary.
var elfMagic = []byte{0x7f, 'E', 'L', 'F'}

// sanitizeURL returns a host/path summary with query parameters stripped to
// avoid leaking presigned URL credentials into logs.
func sanitizeURL(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return "<invalid-url>"
	}
	return u.Host + u.Path
}

// UpdateBinaryAction downloads a new binary from the given artifact URL,
// validates it as an ELF executable, atomically replaces the current
// executable, and schedules a restart via syscall.Exec.
func UpdateBinaryAction(ctx context.Context, req UpdateBinaryRequest) UpdateBinaryResponse {
	logger.Logger().Info("UpdateBinaryAction called", zap.String("artifact_url", sanitizeURL(req.ArtifactUrl)))

	execPath, err := os.Executable()
	if err != nil {
		return UpdateBinaryResponse{Status: "error", Message: fmt.Sprintf("failed to get executable path: %v", err)}
	}
	execPath, err = filepath.EvalSymlinks(execPath)
	if err != nil {
		return UpdateBinaryResponse{Status: "error", Message: fmt.Sprintf("failed to resolve symlinks: %v", err)}
	}

	execDir := filepath.Dir(execPath)

	// Create temp file in the same directory to ensure same-filesystem for
	// atomic rename.
	tmpFile, err := os.CreateTemp(execDir, ".update_binary_*")
	if err != nil {
		return UpdateBinaryResponse{Status: "error", Message: fmt.Sprintf("failed to create temp file: %v", err)}
	}
	tmpPath := tmpFile.Name()

	// Cleanup helper: removes the temp file on any error path.
	cleanup := func() {
		tmpFile.Close()
		os.Remove(tmpPath)
	}

	// Download the binary.
	httpClient := &http.Client{Timeout: 30 * time.Second}
	httpReq, err := http.NewRequestWithContext(ctx, http.MethodGet, req.ArtifactUrl, nil)
	if err != nil {
		cleanup()
		return UpdateBinaryResponse{Status: "error", Message: fmt.Sprintf("failed to create request: %v", err)}
	}
	resp, err := httpClient.Do(httpReq)
	if err != nil {
		cleanup()
		return UpdateBinaryResponse{Status: "error", Message: fmt.Sprintf("failed to download binary: %v", err)}
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		cleanup()
		return UpdateBinaryResponse{Status: "error", Message: fmt.Sprintf("download returned status %d", resp.StatusCode)}
	}

	_, err = io.Copy(tmpFile, resp.Body)
	if err != nil {
		cleanup()
		return UpdateBinaryResponse{Status: "error", Message: fmt.Sprintf("failed to write binary: %v", err)}
	}
	tmpFile.Close()

	// Validate ELF magic bytes.
	header := make([]byte, 4)
	f, err := os.Open(tmpPath)
	if err != nil {
		os.Remove(tmpPath)
		return UpdateBinaryResponse{Status: "error", Message: fmt.Sprintf("failed to open temp file for validation: %v", err)}
	}
	_, err = io.ReadFull(f, header)
	f.Close()
	if err != nil {
		os.Remove(tmpPath)
		return UpdateBinaryResponse{Status: "error", Message: fmt.Sprintf("failed to read header: %v", err)}
	}
	for i := 0; i < 4; i++ {
		if header[i] != elfMagic[i] {
			os.Remove(tmpPath)
			return UpdateBinaryResponse{Status: "error", Message: "downloaded file is not a valid ELF binary"}
		}
	}

	// Set executable permissions.
	if err := os.Chmod(tmpPath, 0755); err != nil {
		os.Remove(tmpPath)
		return UpdateBinaryResponse{Status: "error", Message: fmt.Sprintf("failed to chmod: %v", err)}
	}

	// Atomic replace via same-filesystem rename.
	if err := os.Rename(tmpPath, execPath); err != nil {
		os.Remove(tmpPath)
		return UpdateBinaryResponse{Status: "error", Message: fmt.Sprintf("failed to replace binary: %v", err)}
	}

	logger.Logger().Info("Binary replaced successfully, scheduling restart", zap.String("path", execPath))

	// Schedule restart after the WebSocket response has been flushed.
	// WsSendDoneCtxKey carries a channel that is closed by the eventloop
	// send goroutine once wsjson.Write completes for this response.
	done, _ := ctx.Value(lib.WsSendDoneCtxKey{}).(chan struct{})
	go func() {
		if done != nil {
			<-done
		}
		logger.Logger().Info("Restarting via syscall.Exec", zap.String("path", execPath))
		if err := syscall.Exec(execPath, os.Args, os.Environ()); err != nil {
			logger.Logger().Error("Failed to exec new binary", zap.Error(err))
		}
	}()

	return UpdateBinaryResponse{Status: "post_update", Message: "success, restarting..."}
}
10 changes: 8 additions & 2 deletions packages/bot/eventloop/share/action.go
@@ -40,8 +40,14 @@ func WrapResponseAction[T any, O any](action ResponseAction[T, O]) lib.SessionAc
 			}
 		}
 
-		response := (action)(ctx, req)
+		done := make(chan struct{})
+		actionCtx := context.WithValue(ctx, lib.WsSendDoneCtxKey{}, done)
+
+		response := (action)(actionCtx, req)
 		wrapped := NewMessage(ctx, NewResponse(response))
-		ctx.Value(lib.WsWriterCtxKey{}).(chan any) <- wrapped
+		ctx.Value(lib.WsWriterCtxKey{}).(chan any) <- lib.SendEnvelope{
+			Payload: wrapped,
+			Done:    done,
+		}
 	}
 }