mlflow + trackio logging #129

Merged
david-rfai merged 61 commits into main from feature/mlflow-trackio
Jan 6, 2026

Conversation

Collaborator

@humaira-rf humaira-rf commented Dec 12, 2025

Changes

  • Added mlflow logging to evals
  • Added trackio logging to both fit and evals
  • Deprecated the RF_TRACKING_BACKEND environment variable for selecting backends
  • Added environment variable RF_MLFLOW_ENABLED to enable MLflow tracking (defaults to true in non-Colab environments)
  • Added environment variable RF_TENSORBOARD_ENABLED to enable TensorBoard tracking (defaults to true in Colab environments)
  • Added environment variable RF_TRACKIO_ENABLED to enable TrackIO tracking (defaults to false in all environments)
  • Changed the --tracking-backend flag of rapidfireai start to --tracking-backends, which can now be given multiple times, each with one of the following: mlflow, tensorboard, trackio
  • Removed more version pinning of dependent modules
  • Renamed the mlflow_run_id column in the database to metric_run_id to be more generic for the metrics logger
  • Renamed the mlflow_experiment_id column in the database to metric_experiment_id to be more generic for the metrics logger
  • Consolidated mlflow_manager, tensorboard_manager, and trackio_manager for both evals and fit
  • Added a new default metrics manager, metric_rfmetric_manager, that accepts one or more metrics loggers
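
As a rough illustration of the environment-variable defaults listed above, here is a minimal sketch of how the enabled backends could be resolved. Only the three RF_*_ENABLED variable names and their stated defaults come from this PR description; the function names, the in_colab parameter, and the set of strings treated as "true" are assumptions made for the example and may not match the actual code.

```python
import os

# Variable names are taken from the PR description.
BACKEND_ENV_VARS = {
    "mlflow": "RF_MLFLOW_ENABLED",
    "tensorboard": "RF_TENSORBOARD_ENABLED",
    "trackio": "RF_TRACKIO_ENABLED",
}

# Assumption: which strings count as "enabled" is not stated in the PR.
_TRUTHY = {"1", "true", "yes", "on"}


def default_enabled(backend, in_colab):
    """Defaults as described: MLflow outside Colab, TensorBoard in Colab, TrackIO off."""
    if backend == "mlflow":
        return not in_colab
    if backend == "tensorboard":
        return in_colab
    return False


def resolve_backends(in_colab, env=None):
    """Return the enabled backend names (hypothetical helper, not the PR's API)."""
    env = os.environ if env is None else env
    enabled = []
    for backend, var in BACKEND_ENV_VARS.items():
        raw = env.get(var)
        if raw is None:
            # Variable unset: fall back to the environment-dependent default.
            is_on = default_enabled(backend, in_colab)
        else:
            is_on = raw.strip().lower() in _TRUTHY
        if is_on:
            enabled.append(backend)
    return enabled
```

Passing a plain dict as env keeps the helper easy to test; leaving it out reads the real process environment.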

Note

Introduces a unified metrics layer and migrates the codebase off MLflow-only assumptions.

  • Add RFMetricLogger with backends: MLflow, TensorBoard, TrackIO; implement per-backend managers and default selection via RF_MLFLOW_ENABLED, RF_TENSORBOARD_ENABLED, RF_TRACKIO_ENABLED
  • Replace MLflow-specific fields/logic across evals/fit with generic metrics: DB schema migrations (mlflow_* → metric_*), controller/worker/experiment now create/log/end runs via metric_manager
  • CLI: --tracking-backends (multi-select) replaces --tracking-backend; sets new env flags; add --force for non-interactive shutdown
  • Callbacks: rename to MetricLoggingCallback; Generation metrics use torch.amp.autocast("cuda"); trainer and configs updated to pass metric_run_id
  • Remove legacy MLflow-only helpers; add new metric managers (mlflow, tensorboard, trackio) and abstraction types
  • Update README (directory layout, env vars), start script and doctor output to new flags; loosen/select deps (add trackio, optionalize mlflow, set RF_TENSORBOARD_LOG_DIR default)
  • Evals/fit DB schema files and runtime add migrations; tests adjusted to new interfaces
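
The idea of one metrics manager that creates, logs to, and ends runs across several backends can be sketched as a simple fan-out. This is an illustration under stated assumptions, not the PR's RFMetricLogger: the class names, method signatures, and the in-memory stand-in backend are all hypothetical, and a real backend would call MLflow, TensorBoard, or TrackIO instead.

```python
class InMemoryLogger:
    """Hypothetical stand-in backend that records calls instead of contacting a tracker."""

    def __init__(self, name):
        self.name = name
        self.records = []  # (backend_run_id, key, value, step)
        self.ended = []    # backend run ids that were closed
        self._count = 0

    def create_run(self, run_name):
        self._count += 1
        return f"{self.name}-{self._count}"

    def log_metric(self, run_id, key, value, step=None):
        self.records.append((run_id, key, value, step))

    def end_run(self, run_id):
        self.ended.append(run_id)


class FanOutMetricManager:
    """Forwards each call to every configured logger (illustrative only)."""

    def __init__(self, loggers):
        self._loggers = list(loggers)
        self._runs = {}  # metric_run_id -> {backend name: backend run id}
        self._next_id = 1

    def create_run(self, run_name):
        # One generic id for callers; each backend keeps its own run id.
        backend_ids = {lg.name: lg.create_run(run_name) for lg in self._loggers}
        metric_run_id = f"run-{self._next_id}"
        self._next_id += 1
        self._runs[metric_run_id] = backend_ids
        return metric_run_id

    def log_metric(self, metric_run_id, key, value, step=None):
        backend_ids = self._runs[metric_run_id]
        for lg in self._loggers:
            lg.log_metric(backend_ids[lg.name], key, value, step)

    def end_run(self, metric_run_id):
        backend_ids = self._runs.pop(metric_run_id)
        for lg in self._loggers:
            lg.end_run(backend_ids[lg.name])
```

With this shape, callers such as a controller or worker only hold a single metric_run_id and never need to know which trackers are active.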

Written by Cursor Bugbot for commit eb6153f. This will update automatically on new commits.

  • Tested on GCP and Colab for both fit and evals

@humaira-rf humaira-rf changed the title from "Feature/mlflow trackio" to "mlflow + trackio logging" Dec 12, 2025
Collaborator

@david-rfai david-rfai left a comment

See comments, and also look into other ways of combining evals and fit functions/classes. Would also be nice if the database could abstract the metrics logger, so there does not need to be dedicated columns for different trackers.
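
One possible reading of the database suggestion above is a single table keyed by run and backend name, so that adding a tracker does not require a new column. The sketch below uses Python's built-in sqlite3 purely for illustration; the table name, column names, and helper functions are hypothetical, and the project's actual database layer may differ.

```python
import sqlite3


def init_schema(conn):
    # Hypothetical table: one row per (run, tracker backend) pair.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS tracker_runs (
            run_id INTEGER NOT NULL,
            backend TEXT NOT NULL,
            tracker_run_id TEXT NOT NULL,
            PRIMARY KEY (run_id, backend)
        )
        """
    )


def set_tracker_run(conn, run_id, backend, tracker_run_id):
    # Insert the mapping, or overwrite it if this backend already has one.
    conn.execute(
        "INSERT OR REPLACE INTO tracker_runs (run_id, backend, tracker_run_id) "
        "VALUES (?, ?, ?)",
        (run_id, backend, tracker_run_id),
    )


def get_tracker_runs(conn, run_id):
    # Returns {backend name: tracker run id} for one run.
    rows = conn.execute(
        "SELECT backend, tracker_run_id FROM tracker_runs WHERE run_id = ?",
        (run_id,),
    ).fetchall()
    return dict(rows)
```

For example, after storing ids for "mlflow" and "trackio" under the same run_id, get_tracker_runs returns both in one dictionary, and a run with no trackers yields an empty one.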

@cursor cursor bot left a comment

This PR is being reviewed by Cursor Bugbot

Base automatically changed from feature/unified-automl to main December 18, 2025 22:41
@david-rfai david-rfai self-requested a review December 24, 2025 00:39
Collaborator

@arun-rfai arun-rfai left a comment

Nice work!

@david-rfai david-rfai merged commit 09f6d89 into main Jan 6, 2026
3 checks passed
@david-rfai david-rfai deleted the feature/mlflow-trackio branch January 6, 2026 09:04
kamran-rapidfireAI pushed a commit that referenced this pull request Feb 4, 2026