-
Notifications
You must be signed in to change notification settings - Fork 132
CLI for measuring execute_cuda encoding perf #6381
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
0249a54 to
a4c923c
Compare
e52bb67 to
21537a6
Compare
| { | ||
| let reference = <T as From<u8>>::from(REFERENCE_VALUE); | ||
| let data: Vec<T> = (0..len) | ||
| .map(|i| <T as From<u8>>::from((i % 256) as u8) + reference) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was overflowing before?
052da59 to
249c24c
Compare
| let mut total_time = Duration::ZERO; | ||
| let mut cuda_ctx = CudaSession::create_execution_ctx(&VortexSession::empty()) | ||
| .vortex_expect("failed to create execution context") | ||
| .with_launch_strategy(Arc::new(timed)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see here: instead of replicating the full launch setup in benchmark code, we can just stub in a launcher that collects timing information across runs
| }}; | ||
| /// Implementations can add tracing, async callbacks, or other behavior | ||
| /// around kernel launches. | ||
| pub trait LaunchStrategy: Debug + Send + Sync + 'static { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is where LaunchStrategy is defined and impled
Signed-off-by: Andrew Duffy <andrew@a10y.dev> fixup Signed-off-by: Andrew Duffy <andrew@a10y.dev>
780efdb to
7b61bd6
Compare
| pub fn launch_kernel<'a, F>( | ||
| &'a mut self, | ||
| function: &'a CudaFunction, | ||
| len: usize, | ||
| build_args: F, | ||
| ) -> VortexResult<()> | ||
| where | ||
| F: FnOnce(&mut LaunchArgs<'a>), | ||
| { | ||
| let mut launcher = self.launch_builder(function); | ||
| build_args(&mut launcher); | ||
|
|
||
| let events = launch_cuda_kernel_impl(&mut launcher, self.strategy.event_flags(), len)?; | ||
| self.strategy.on_complete(&events, len)?; | ||
|
|
||
| drop(events); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is an odd place for a binary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved it
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Overview of changes
ergonomics/API focused changes
LaunchStrategyon the execution context. This by default will launch kernels and not track any timing information, but it is pluggable. For example in benchmarks we replace this with aTimedLaunchedStrategywhich executes the kernels in blocking mode and logs their execution time.ctx.launch_kernel()method, which accepts a closure that is used to populate kernel argumentsA lot of test and benchmark code needed to be updated to use the new launch methods.
Fused FOR + BPThis has been shelved for a FLUP since this was too big
* I've updated the BP kernel generator to generate bp as FFOR, i.e. fused bitpacking with FOR. In practice, this is just adding aconst T referenceparam. By default the execution for BitPackedArray passeszero, but there is a specialization in theForArrayexecution tree where if it detects one of its descendants is BP, it fuses itself with the bit unpackingGPU tracing tool
There's a new binary in
vortex-test-e2e-cuda-scanwhich takes as input a Vortex file.It will recompress the file using only GPU-supported encodings, scan it back, and collect timings for how long each column scan took. The results are printed as either pretty text, or as JSON to stdout, which can be piped into duckdb or similar for analysis
Example usage: