# Benchmark Tool
Band provides a simple C++ binary to benchmark runtime performance. The binary generates repetitive model requests based on a given config file and reports latency statistics afterwards.
## How to run

The `[root]/script/run_benchmark.py` script will build the `band_benchmark` binary and execute it with a specified config file. The built binary and the target config file can be found in `[root]/benchmark`.
### On Android

If you want to build the binary from a docker container (refer to `[root]/script/docker_util.sh` for more detail):

```
python .\script\run_benchmark.py -android -docker -c .\benchmark_config.json
```

If you want to build locally:

```
python .\script\run_benchmark.py -android -c .\benchmark_config.json
```

### On local desktop (Windows or Ubuntu)

```
python .\script\run_benchmark.py -c .\benchmark_config.json
```
## Config file

### Structure
- `models`: Models to run. For each model, specify the following fields (a minimal entry is sketched after this item).
  - `graph`: Model path.
  - `period_ms`: *Optional.* The delay between subsequent requests in ms. The argument is only effective with the `periodic` execution mode.
  - `batch_size`: The number of model requests in a frame. [default: 1]
  - `worker_id`: *Optional.* Specify the worker id (int) to run on. The argument is only effective with the `fixed_worker` scheduler.
  - `slo_us` and `slo_scale`: *Optional* fields for specifying an SLO value for a model. Setting `slo_scale` makes the SLO equal to the worst profiled latency of that model multiplied by `slo_scale`. `slo_scale` will be ignored if `slo_us` is given (i.e., there is no reason to specify both options).
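  A minimal sketch of a single `models` entry (the model path is hypothetical); here `slo_scale` would set the SLO to three times the model's worst profiled latency:

  ```json
  {
    "models": [
      {
        "graph": "/data/local/tmp/model/some_model.tflite",
        "batch_size": 1,
        "slo_scale": 3.0
      }
    ]
  }
  ```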
- `log_path`: The log file path (e.g., `/data/local/tmp/model_execution_log.json`).
- `schedulers`: The scheduler types in `list[string]`. If N schedulers are specified, then N queues are generated (see the sketch after this list of types).
  - `fixed_worker`
  - `round_robin`
  - `shortest_expected_latency`
  - `least_slack_time_first`
  - `heterogeneous_earliest_finish_time`
  - `heterogeneous_earliest_finish_time_reserved`
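  For instance, this sketch (values are illustrative) lists two schedulers, so two request queues would be generated:

  ```json
  {
    "schedulers": [
      "fixed_worker",
      "heterogeneous_earliest_finish_time"
    ]
  }
  ```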
- `minimum_subgraph_size`: Minimum subgraph size. If a candidate subgraph is smaller than `minimum_subgraph_size`, the subgraph will not be created. [default: 7]
- `subgraph_preparation_type`: For schedulers using fallback, determines how to generate candidate subgraphs (see the sketch after this item). [default: `merge_unit_subgraph`]
  - `no_fallback_subgraph`: Generate subgraphs per worker. Explicit fallback subgraphs will not be generated.
  - `fallback_per_worker`: Generate fallback subgraphs for each worker.
  - `unit_subgraph`: Generate unit subgraphs considering the supported devices of every op. All ops in the same unit subgraph share the same supported devices.
  - `merge_unit_subgraph`: Add merged unit subgraphs to `unit_subgraph`.
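  A sketch of the subgraph-related options, keeping the default minimum size but switching to per-worker fallback subgraphs:

  ```json
  {
    "minimum_subgraph_size": 7,
    "subgraph_preparation_type": "fallback_per_worker"
  }
  ```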
- `execution_mode`: Specify an execution mode. Available execution modes are as follows (see the sketch after this item):
  - `stream`: run batches consecutively.
  - `periodic`: invoke requests periodically.
  - `workload`: execute a pre-defined sequence in `stream` manner based on a given workload file.
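  For example, a sketch of a periodic run (the model path and period are illustrative), where each model's `period_ms` controls the delay between its requests:

  ```json
  {
    "execution_mode": "periodic",
    "models": [
      {
        "graph": "/data/local/tmp/model/some_model.tflite",
        "period_ms": 50
      }
    ]
  }
  ```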
- `cpu_masks`: CPU cluster mask to set CPU affinity. [default: `ALL`]
  - `ALL`: All clusters
  - `LITTLE`: LITTLE cluster only
  - `BIG`: Big cluster only
  - `PRIMARY`: Primary core only
- `num_threads`: Number of computing threads for CPU delegates. [default: -1]
- `planner_cpu_masks`: CPU cluster mask to set the planner's CPU affinity. [default: same value as global `cpu_masks`]
- `workers`: A vector-like config for per-processor workers. For each worker, specify the following fields. The system creates one worker per device by default; the first value provided for a device overrides that worker's settings (i.e., `cpu_masks`, `num_threads`, `profile_copy_computation_ratio`, ...), and each additional entry adds another worker for that device (see the sketch after this item).
  - `device`: Target device of the specific worker.
    - `CPU`
    - `GPU`
    - `DSP`
    - `NPU`
  - `cpu_masks`: CPU cluster mask to set the CPU affinity of the specific worker. [default: same value as global `cpu_masks`]
  - `num_threads`: Number of threads. [default: same value as global `num_threads`]
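  A sketch of a `workers` list under the reading above: the first `CPU` entry overrides the settings of the default CPU worker, and the second `CPU` entry adds one more CPU worker (mask and thread values are illustrative):

  ```json
  {
    "workers": [
      { "device": "CPU", "num_threads": 2, "cpu_masks": "BIG" },
      { "device": "CPU", "num_threads": 2, "cpu_masks": "LITTLE" },
      { "device": "GPU", "cpu_masks": "ALL" }
    ]
  }
  ```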
- `running_time_ms`: Experiment duration in ms. [default: 60000]
- `profile_smoothing_factor`: Ratio with which the current measurement is reflected in the profile: `updated_profile = profile_smoothing_factor * curr_profile + (1 - profile_smoothing_factor) * prev_profile` (a worked example follows this list). [default: 0.1]
- `model_profile`: The path to the file with model profile results. [default: None]
- `profile_online`: Online profiling or offline profiling. [default: true]
- `profile_warmup_runs`: Number of warmup runs before profiling. [default: 1]
- `profile_num_runs`: Number of runs for profiling. [default: 1]
- `profile_copy_computation_ratio`: Ratio of computation to input/output copy in `list[int]`, used for latency estimation for each device type (e.g., CPU, GPU, DSP, NPU). The length of the list should be equal to 4 (`GetSize<DeviceFlags>()`). [default: 30000, 30000, 30000, 30000]
- `schedule_window_size`: The number of planning units.
- `workload`: The path to the file with workload information. [default: None]
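For example, with the default `profile_smoothing_factor` of 0.1, a previous profile of 100 ms and a current measurement of 200 ms give an updated profile of 0.1 * 200 + 0.9 * 100 = 110 ms.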
### Example

```json
{
"models": [
{
"graph": "/data/local/tmp/model/lite-model_efficientdet_lite0_int8_1.tflite",
"period_ms": 30,
"batch_size": 3
},
{
"graph": "/data/local/tmp/model/retinaface_mbv2_quant_160.tflite",
"period_ms": 30,
"batch_size": 3
},
{
"graph": "/data/local/tmp/model/ssd_mobilenet_v1_1_metadata_1.tflite",
"period_ms": 30,
"batch_size": 3
}
],
"log_path": "/data/local/tmp/log.json",
"schedulers": [
"heterogeneous_earliest_finish_time_reserved"
],
"minimum_subgraph_size": 7,
"subgraph_preparation_type": "merge_unit_subgraph",
"execution_mode": "stream",
"cpu_masks": "ALL",
"num_threads": 1,
"planner_cpu_masks": "PRIMARY",
"workers": [
{
"device": "CPU",
"num_threads": 1,
"cpu_masks": "BIG"
},
{
"device": "CPU",
"num_threads": 1,
"cpu_masks": "LITTLE"
},
{
"device": "GPU",
"num_threads": 1,
"cpu_masks": "ALL"
},
{
"device": "DSP",
"num_threads": 1,
"cpu_masks": "PRIMARY"
},
{
"device": "NPU",
"num_threads": 1,
"cpu_masks": "PRIMARY"
}
],
"running_time_ms": 10000,
"profile_smoothing_factor": 0.1,
"profile_data_path": "/data/local/tmp/profile.json",
"profile_online": true,
"profile_warmup_runs": 3,
"profile_num_runs": 50,
"profile_copy_computation_ratio": [
1000,
1000,
1000,
1000
],
"availability_check_interval_ms": 30000,
"schedule_window_size": 10
}
```