[Papers] Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

The Halide DSL describes image processing pipelines in a simple functional style.
– A scheduling representation composably models a range of tradeoffs between locality, parallelism, and avoiding redundant work.
– The split representation separates the schedule from the underlying algorithm (sketched below).
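
A minimal sketch of that separation, using the two-stage blur the paper itself uses as a running example and Halide's C++ embedding (the particular schedule shown is just one possibility, not the only good one):

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    ImageParam input(UInt(16), 2);
    Var x("x"), y("y"), xi("xi"), yi("yi");

    // The algorithm: a 3x3 box blur as two separable stages, written in
    // pure functional style (what is computed, not how).
    Func blur_x("blur_x"), blur_y("blur_y");
    blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3;
    blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;

    // The schedule: stated separately. Changing these lines never changes
    // the result, only the loop nest that computes it.
    blur_y.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);
    blur_x.compute_at(blur_y, x).vectorize(x, 8);

    blur_y.compile_jit();
    return 0;
}
```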

– Functions are defined by expressions: arithmetic and logical operations, loads from external images, if-then-else expressions, references to named values, and calls to other functions, including external C ABI functions.
– Reduction functions: the programmer specifies the evaluation order by defining a reduction domain, bounded by minimum and maximum expressions for each dimension (see the sketch below).
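
A small sketch of a reduction using Halide's C++ API: a histogram defined by an initial value plus an update applied over a reduction domain (RDom). The function names and types here are illustrative, not from the paper:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    ImageParam input(UInt(8), 2);
    Var i("i");

    // Reduction domain: one (min, extent) pair per dimension; here it
    // spans the whole input image.
    RDom r(0, input.dim(0).extent(), 0, input.dim(1).extent());

    // A 256-bin histogram: an initial value, then an update definition
    // applied at every point of the reduction domain, in order.
    Func hist("hist");
    hist(i) = 0;
    hist(cast<int>(input(r.x, r.y))) += 1;

    hist.compile_jit();
    return 0;
}
```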

Scheduling: values must be computed before they are used, but several choices remain (a sketch of the corresponding scheduling knobs follows this list):
– when and where should the value at each coordinate in each function be computed?
– where should they be stored?
– how long are values cached and communicated across multiple consumers, and when are they independently recomputed by each?
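
A sketch of how a schedule answers these questions for a producer shared by two consumers; the function names here are made up for illustration:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    Var x("x"), y("y");

    // One producer consumed by two different stages.
    Func producer("producer"), cons_a("cons_a"), cons_b("cons_b"), out("out");
    producer(x, y) = x * y;
    cons_a(x, y) = producer(x, y) + 1;
    cons_b(x, y) = producer(x, y) * 2;
    out(x, y) = cons_a(x, y) + cons_b(x, y);

    // With no schedule, everything is inlined: producer's expression is
    // re-evaluated at every use, so each consumer recomputes it independently.

    // Computing it once up front stores the whole image and shares it
    // across both consumers (cached and communicated through memory):
    producer.compute_root();

    // Alternatively (instead of the line above), compute it per output
    // pixel into a small local buffer, trading storage for recomputation:
    // producer.compute_at(out, x);

    out.compile_jit();
    return 0;
}
```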

Motivation

As one example, scheduling choices can be divided into the granularity of storage and the granularity of computation; the classic strategies below trade off parallelism, locality, and redundant work (a schedule sketch for each follows the list).
Breadth first: compute all of the first stage (e.g. the horizontal blur) before any of the second –> lots of parallelism, poor locality (every intermediate value is stored to memory and reloaded later), no redundant work.
Loop fusion: compute both stages together for each output point –> lots of parallelism, good locality (producer values are computed right where they are consumed), but redundant work (shared values are recomputed for every output that needs them).
Sliding window: redundancy is removed by keeping recently computed values in a buffer –> limited parallelism (the window introduces a serial dependence along the sliding dimension), locality depends on the reuse distance of the window, no redundant work.
Overlapped tiling: split the output into tiles and recompute the overlapping producer region each tile needs –> plenty of parallelism (tiles are independent), good locality (intermediates stay tile-local), some redundant work along tile edges.
Sliding window tiling: …
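
Roughly how each strategy might be expressed as a Halide schedule for the two-stage blur from the earlier sketch; only one choice would be active at a time, and the tile sizes are arbitrary:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    ImageParam input(UInt(16), 2);
    Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");

    Func blur_x("blur_x"), blur_y("blur_y");
    blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3;
    blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;

    // Pick exactly one of the strategies below.

    // 1. Breadth first: compute and store all of blur_x before any blur_y.
    // blur_x.compute_root();

    // 2. Loop fusion (full inlining): recompute blur_x at every blur_y
    //    pixel; this is Halide's default when blur_x is given no schedule.

    // 3. Sliding window: keep blur_x in a buffer allocated at the root,
    //    computing only the new rows needed as blur_y's y loop advances;
    //    no redundant work, but a serial dependence along y.
    // blur_x.store_root().compute_at(blur_y, y);

    // 4. Overlapped tiles: process blur_y in parallel tiles and compute
    //    the (slightly overlapping) region of blur_x each tile needs.
    blur_y.tile(x, y, xo, yo, xi, yi, 256, 32).parallel(yo);
    blur_x.compute_at(blur_y, xo);

    blur_y.compile_jit();
    return 0;
}
```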

Model for scheduling choice space

The model covers the space of important choices in scheduling a function: the granularity at which to compute each of its inputs, the granularity at which to store each for reuse, and the order of traversal of its own domain for computation (the domain order).

Domain order: each dimension can be traversed sequentially or in parallel / constant-size dimensions can be unrolled or vectorized / dimensions can be reordered (e.g. column-major vs row-major) / dimensions can be split into inner and outer parts (these directives are sketched below).
Call schedule: breadth-first / fused / sliding window
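
A sketch of the domain-order directives on a single function, using Halide's C++ API; the traversal order, split factor, and the bound on c are arbitrary illustrative choices. The call-schedule options correspond to the strategies sketched in the Motivation section above:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    Var x("x"), y("y"), c("c"), xo("xo"), xi("xi");

    Func f("f");
    f(x, y, c) = x + y + c;
    f.bound(c, 0, 3);            // c is a small constant-size dimension (e.g. RGB)

    // Domain-order choices, expressed as schedule directives:
    f.split(x, xo, xi, 8)        // split x into an outer part and an inner part of size 8
     .reorder(xi, c, xo, y)      // choose the traversal order (innermost first)
     .vectorize(xi)              // the constant-size inner part becomes a vector
     .unroll(c)                  // the small constant-size dimension is unrolled
     .parallel(y);               // the outermost dimension runs in parallel

    f.compile_jit();
    return 0;
}
```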

Compiling

Autotuning: there are a huge number of possible schedules. Choosing a starting point by leveraging domain knowledge seems like a good thing to learn from!
