# Benchmark Tool
Band provides a simple C++ binary to benchmark runtime performance. The binary generates repetitive model requests based on a given config file and reports latency statistics afterwards.
## How to run
The `[root]/script/run_benchmark.py` script builds the `band_benchmark` binary and executes it with the specified config file. The built binary and the target config file can be found in `[root]/benchmark`.
### On Android
If you want to build the binary from a Docker container (refer to `[root]/script/docker_util.sh` for more details):

```
python .\script\run_benchmark.py -android -docker -c .\benchmark_config.json
```
If you want to build locally:

```
python .\script\run_benchmark.py -android -c .\benchmark_config.json
```
### On local desktop (Windows or Ubuntu)
```
python .\script\run_benchmark.py -c .\benchmark_config.json
```
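The script builds the binary, pushes it to the device (for `-android`), and launches it. If you prefer to run the built binary by hand, the steps look roughly like the sketch below; it assumes `band_benchmark` takes the config file path as its only argument and that both files are copied to `/data/local/tmp`:

```
# Hypothetical manual run on an Android device; run_benchmark.py automates this.
adb push benchmark/band_benchmark /data/local/tmp/band_benchmark
adb push benchmark_config.json /data/local/tmp/benchmark_config.json
adb shell chmod +x /data/local/tmp/band_benchmark
adb shell /data/local/tmp/band_benchmark /data/local/tmp/benchmark_config.json
```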
## Config file
### Structure
- `models`: Models to run. For each model, specify the following fields.
  - `graph`: Model path.
  - `period_ms`: *Optional* The delay between subsequent requests in ms. The argument is only effective with the `periodic` execution mode.
  - `batch_size`: The number of model requests in a frame. [default: 1]
  - `worker_id`: *Optional* Specify the worker id (int) to run on. The argument is only effective with the `fixed_worker` scheduler.
  - `slo_us` and `slo_scale`: *Optional* fields for specifying an SLO value for a model. Setting `slo_scale` will make the SLO = worst profiled latency of that model * `slo_scale`. `slo_scale` will be ignored if `slo_us` is given (i.e., there is no reason to specify both options).
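For instance, a model entry that pins requests to worker 0 (under the `fixed_worker` scheduler) and derives its SLO from profiling might look like the following sketch; the model path is illustrative, and `slo_us` would take precedence if both SLO fields were given:

```json
{
  "graph": "/data/local/tmp/model/my_model.tflite",
  "batch_size": 1,
  "worker_id": 0,
  "slo_scale": 1.5
}
```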
- `log_path`: The log file path. (e.g., `/data/local/tmp/model_execution_log.json`)
- `schedulers`: The scheduler types in `list[string]`. If N schedulers are specified, then N queues are generated.
  - `fixed_worker`
  - `round_robin`
  - `shortest_expected_latency`
  - `least_slack_time_first`
  - `heterogeneous_earliest_finish_time`
  - `heterogeneous_earliest_finish_time_reserved`
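For example, the following (illustrative) setting would create two queues, one per scheduler:

```json
"schedulers": ["fixed_worker", "least_slack_time_first"]
```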
- `minimum_subgraph_size`: Minimum subgraph size. If a candidate subgraph is smaller than `minimum_subgraph_size`, the subgraph will not be created. [default: 7]
- `subgraph_preparation_type`: For schedulers using fallback, determines how to generate candidate subgraphs. [default: `merge_unit_subgraph`]
  - `no_fallback_subgraph`: Generate subgraphs per worker. Explicit fallback subgraphs will not be generated.
  - `fallback_per_worker`: Generate fallback subgraphs for each worker.
  - `unit_subgraph`: Generate unit subgraphs considering the device support of every op. All ops in the same unit subgraph are supported by the same set of devices.
  - `merge_unit_subgraph`: Additionally add merged unit subgraphs to `unit_subgraph`.
- `execution_mode`: Specify an execution mode. Available execution modes are as follows:
  - `stream`: Consecutively run batches.
  - `periodic`: Invoke requests periodically.
  - `workload`: Execute a pre-defined sequence in `stream` manner based on a given workload file.
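As a sketch, `periodic` mode works together with each model's `period_ms` (the model path below is illustrative):

```json
{
  "execution_mode": "periodic",
  "models": [
    { "graph": "/data/local/tmp/model/my_model.tflite", "period_ms": 30 }
  ]
}
```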
- `cpu_masks`: CPU cluster mask to set CPU affinity. [default: `ALL`]
  - `ALL`: All clusters
  - `LITTLE`: LITTLE cluster only
  - `BIG`: Big cluster only
  - `PRIMARY`: Primary core only
- `num_threads`: Number of computing threads for CPU delegates. [default: -1]
- `planner_cpu_masks`: CPU cluster mask to set CPU affinity of the planner. [default: same value as the global `cpu_masks`]
- `workers`: A vector-like config for per-processor workers. For each worker, specify the following fields. The system creates one worker per device by default; the first value provided for a device overrides that worker's settings (i.e., `cpu_masks`, `num_threads`, `profile_copy_computation_ratio`, …), and each additional entry adds an extra worker for the device (see the sketch after this list).
  - `device`: Target device of the specific worker.
    - `CPU`
    - `GPU`
    - `DSP`
    - `NPU`
  - `cpu_masks`: CPU cluster mask to set CPU affinity of the specific worker. [default: same value as the global `cpu_masks`]
  - `num_threads`: Number of threads. [default: same value as the global `num_threads`]
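To make the override rule concrete, here is a sketch: the first `CPU` entry reconfigures the default CPU worker, and the second adds an extra CPU worker, leaving two CPU workers in total (other devices keep their default workers):

```json
"workers": [
  { "device": "CPU", "cpu_masks": "BIG" },
  { "device": "CPU", "cpu_masks": "LITTLE" }
]
```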
- `running_time_ms`: Experiment duration in ms. [default: 60000]
- `profile_smoothing_factor`: Ratio at which the current measurement is reflected into the profile: `updated_profile = profile_smoothing_factor * curr_profile + (1 - profile_smoothing_factor) * prev_profile`. [default: 0.1]
- `model_profile`: The path to a file with model profile results. [default: None]
- `profile_online`: Whether to profile online (`true`) or use offline profile results (`false`). [default: true]
- `profile_warmup_runs`: Number of warmup runs before profiling. [default: 1]
- `profile_num_runs`: Number of runs for profiling. [default: 1]
- `profile_copy_computation_ratio`: Ratio of computation to input/output copy, in `list[int]`, used for latency estimation for each device type (e.g., CPU, GPU, DSP, NPU). The length of the list should be equal to 4 (`GetSize<DeviceFlags>()`). [default: [30000, 30000, 30000, 30000]]
- `schedule_window_size`: The number of planning units.
- `workload`: The path to a file with workload information. [default: None]
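For example, with a `profile_smoothing_factor` of 0.1, a previous profile of 10 ms, and a new measurement of 20 ms, the updated profile becomes 0.1 × 20 + 0.9 × 10 = 11 ms; each new measurement nudges the estimate rather than replacing it.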
### Example

```json
{
"models": [
{
"graph": "/data/local/tmp/model/lite-model_efficientdet_lite0_int8_1.tflite",
"period_ms": 30,
"batch_size": 3
},
{
"graph": "/data/local/tmp/model/retinaface_mbv2_quant_160.tflite",
"period_ms": 30,
"batch_size": 3
},
{
"graph": "/data/local/tmp/model/ssd_mobilenet_v1_1_metadata_1.tflite",
"period_ms": 30,
"batch_size": 3
}
],
"log_path": "/data/local/tmp/log.json",
"schedulers": [
"heterogeneous_earliest_finish_time_reserved"
],
"minimum_subgraph_size": 7,
"subgraph_preparation_type": "merge_unit_subgraph",
"execution_mode": "stream",
"cpu_masks": "ALL",
"num_threads": 1,
"planner_cpu_masks": "PRIMARY",
"workers": [
{
"device": "CPU",
"num_threads": 1,
"cpu_masks": "BIG"
},
{
"device": "CPU",
"num_threads": 1,
"cpu_masks": "LITTLE"
},
{
"device": "GPU",
"num_threads": 1,
"cpu_masks": "ALL"
},
{
"device": "DSP",
"num_threads": 1,
"cpu_masks": "PRIMARY"
},
{
"device": "NPU",
"num_threads": 1,
"cpu_masks": "PRIMARY"
}
],
"running_time_ms": 10000,
"profile_smoothing_factor": 0.1,
"profile_data_path": "/data/local/tmp/profile.json",
"profile_online": true,
"profile_warmup_runs": 3,
"profile_num_runs": 50,
"profile_copy_computation_ratio": [
1000,
1000,
1000,
1000
],
"availability_check_interval_ms": 30000,
"schedule_window_size": 10
}
```