
[Papers] Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

Halide DSL describes image processing pipelines in a simple functional style.
– Scheduling representation composably models a range of tradeoffs between locality, parallelism, and avoiding redundant work.
– Split representation separates schedules from the underlying algorithm.

– Expressions: arithmetic and logical operations, loads from external images, if-then-else expressions, references to named values, and calls to other functions, including external C ABI functions.
– Reduction functions: the programmer specifies the evaluation order by defining a reduction domain, bounded by minimum and maximum expressions for each dimension.
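
As a rough illustration of a reduction over a bounded domain, here is a plain-Python histogram sketch (not Halide syntax; the image and bin count are made up): the domain bounds become the loop extents, and the loop order is the programmer-specified evaluation order.

```python
import numpy as np

# Plain-Python sketch of a reduction function: the reduction domain is
# bounded by explicit min/max extents per dimension, and the loop order
# below is the programmer-specified evaluation order.
im = np.random.randint(0, 256, size=(32, 32))   # illustrative 8-bit image
hist = np.zeros(256, dtype=np.int64)
for y in range(im.shape[0]):                    # domain: 0 <= y < height
    for x in range(im.shape[1]):                # domain: 0 <= x < width
        hist[im[y, x]] += 1                     # update applied over the domain
```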

Scheduling: values need to be computed before they can be used, but still some choices remain to be made…
– when and where should the value at each coordinate in each function be computed?
– where should they be stored?
– how long are values cached and communicated across multiple consumers, and when are they independently recomputed by each?

Motivation

As one example, scheduling choices can be divided into the granularity of storage and the granularity of computation (see the sketch after this list).
Breadth first: compute the entire x-axis stage first, then the y-axis stage –> lots of parallelism, poor locality (every intermediate value is written out to memory and later reloaded)
Loop fusion: compute the x-axis values on demand at each y-axis point –> lots of parallelism, good locality (values are produced right where they are consumed), but redundant work (each producer value is recomputed by several consumers)
Sliding window: redundancy is removed by keeping recently computed values in a buffer –> limited parallelism (the reuse imposes a sequential traversal order), locality depends on how well the window buffer fits in cache, no redundant work
Overlapped tiling: compute the producer per output tile, with overlap at tile boundaries –> plenty of parallelism across tiles, good locality within a tile, some redundant work in the overlapping regions
Sliding window tiling: …
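
A minimal plain-Python sketch of the breadth-first, fully fused, and overlapped-tiling versions of a two-stage separable blur (blurx then blury); the sizes, the 3-point stencil, and the tile height are illustrative:

```python
import numpy as np

H, W = 64, 64
inp = np.random.rand(H + 2, W + 2)    # padded input so every stencil read is in bounds

def blurx(img, y, x):                 # producer: 3-point horizontal blur
    return (img[y, x] + img[y, x + 1] + img[y, x + 2]) / 3

# 1) Breadth-first: materialize all of blurx, then consume it in blury.
#    Maximal parallelism per stage, but every intermediate value hits memory.
bx = np.empty((H + 2, W))
for y in range(H + 2):
    for x in range(W):
        bx[y, x] = blurx(inp, y, x)
out_bf = np.empty((H, W))
for y in range(H):
    for x in range(W):
        out_bf[y, x] = (bx[y, x] + bx[y + 1, x] + bx[y + 2, x]) / 3

# 2) Fully fused: recompute blurx on demand at each output point.
#    Great locality, but each blurx value is recomputed ~3 times.
out_fused = np.empty((H, W))
for y in range(H):
    for x in range(W):
        out_fused[y, x] = (blurx(inp, y, x) + blurx(inp, y + 1, x) + blurx(inp, y + 2, x)) / 3

# 3) Overlapped tiling: materialize blurx per tile (with 2 extra rows of
#    overlap), trading a little redundant work for locality + parallel tiles.
T = 16
out_tiled = np.empty((H, W))
for ty in range(0, H, T):
    bx_tile = np.empty((T + 2, W))            # tile-local scratch buffer
    for y in range(T + 2):
        for x in range(W):
            bx_tile[y, x] = blurx(inp, ty + y, x)
    for y in range(T):
        for x in range(W):
            out_tiled[ty + y, x] = (bx_tile[y, x] + bx_tile[y + 1, x] + bx_tile[y + 2, x]) / 3

assert np.allclose(out_bf, out_fused) and np.allclose(out_bf, out_tiled)
```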

Model for scheduling choice space

Model for the space of important choices in scheduling, specified per function: the granularity at which to compute each of the functions it calls, the granularity at which to store each of them for reuse, and the order of traversal of its own domain (the domain order).

Domain order: each dimension can be traversed sequentially or in parallel / constant-size dimensions can be unrolled or vectorized / dimensions can be reordered (e.g. column-major vs. row-major) / dimensions can be split into inner and outer parts (see the sketch below).
Call schedule: breadth-first / fused / sliding window
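
A tiny plain-Python sketch of the split primitive (the array, loop names, and factor of 8 are made up for illustration): the single loop over x is rewritten as an outer/inner pair, the constant-size inner dimension is handled as one vector operation, and the outer loop could additionally be run in parallel.

```python
import numpy as np

N = 128
a = np.random.rand(N)
out = np.empty(N)

# Original domain order: a single sequential loop over x.
for x in range(N):
    out[x] = 2 * a[x]

# Split x into (x_outer, x_inner) with a factor of 8, then treat the
# constant-size inner dimension as one vector operation (a stand-in for
# vectorization).
factor = 8
out_split = np.empty(N)
for xo in range(N // factor):
    xi = slice(xo * factor, (xo + 1) * factor)
    out_split[xi] = 2 * a[xi]

assert np.allclose(out, out_split)
```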

Compiling

Autotuning: there is a huge number of possible schedules. Choosing a starting point by leveraging domain knowledge seems like a good thing to learn from!

[Papers] Learning to Optimize Tensor Programs

Key characteristics of the problem: low experiment cost, domain-specific problem structure, and a large quantity of similar operators.

Polyhedral models are a popular choice for S_e (the search space); they model loop domains as integer linear constraints. An alternative approach, originating from Halide, defines the schedule space using a set of transformation primitives. –> This paper uses the latter, in TVM, to form S_e!
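As a rough sketch of what such transformation primitives look like (assuming TVM's te schedule API; the operator, split factor, and chosen primitives are illustrative, and signatures may differ across TVM versions), the set of possible choices for knobs like the split factor is what makes up S_e:

```python
import tvm
from tvm import te

# Define a simple vector-add operator, then apply schedule primitives.
n = 1 << 14
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

s = te.create_schedule(C.op)
xo, xi = s[C].split(C.op.axis[0], factor=64)   # one point in S_e: factor = 64
s[C].parallel(xo)                              # another choice: parallelize the outer loop
s[C].vectorize(xi)                             # and vectorize the inner loop
print(tvm.lower(s, [A, B, C], simple_mode=True))
```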

Learning to Optimize Tensor Programs

They use gradient boosted trees (GBTs, via XGBoost) and a TreeGRU as the statistical cost model.

Training objective function: the common choice is a regression loss, which encourages the model to predict the cost accurately. They instead use a rank loss, since only the relative order of candidate programs matters for picking good ones.
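
A minimal sketch of a pairwise logistic rank loss in plain Python (a generic formulation, not necessarily the paper's exact one):

```python
import numpy as np

def pairwise_rank_loss(pred_cost, true_cost):
    """For every pair where program i is truly faster than program j,
    penalize the model unless it also predicts a lower cost for i.
    Only the relative order matters, not the absolute cost value."""
    loss, n_pairs = 0.0, 0
    for i in range(len(true_cost)):
        for j in range(len(true_cost)):
            if true_cost[i] < true_cost[j]:
                margin = pred_cost[j] - pred_cost[i]   # want this to be positive
                loss += np.log1p(np.exp(-margin))
                n_pairs += 1
    return loss / max(n_pairs, 1)

# Correctly ordered predictions get a smaller loss than reversed ones.
true = np.array([1.0, 2.0, 3.0])
print(pairwise_rank_loss(np.array([0.1, 0.2, 0.3]), true))  # low
print(pairwise_rank_loss(np.array([0.3, 0.2, 0.1]), true))  # high
```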

Use simulated annealing, with the model's predicted cost as the energy function, to explore the space; the top-performing batch of candidates is then run on real hardware. Diversity-aware exploration is achieved by selecting the candidate set S from the top candidates by maximizing a diversity-aware objective (Eq. 3 in the paper). Uncertainty estimator… I don't know
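
A toy sketch of that exploration loop (the configuration knobs, the stand-in cost model, and the cooling schedule are all made up for illustration; in AutoTVM the energy would come from the learned GBT/TreeGRU model):

```python
import numpy as np

rng = np.random.default_rng(0)

def predicted_cost(config):
    # Stand-in for the learned cost model f(config) over a tuple of tile sizes.
    tx, ty = config
    return abs(np.log2(tx) - 4) + abs(np.log2(ty) - 3)

def neighbor(config):
    # Mutate one knob of the configuration (here: halve or double a tile size).
    tx, ty = config
    if rng.random() < 0.5:
        tx = int(np.clip(tx * rng.choice([0.5, 2]), 1, 64))
    else:
        ty = int(np.clip(ty * rng.choice([0.5, 2]), 1, 64))
    return (tx, ty)

config, temp = (1, 1), 1.0
visited = []
for step in range(200):
    cand = neighbor(config)
    delta = predicted_cost(cand) - predicted_cost(config)   # model output as "energy"
    if delta < 0 or rng.random() < np.exp(-delta / temp):
        config = cand
    visited.append((predicted_cost(config), config))
    temp *= 0.98                                            # cooling schedule

# Take the top-predicted batch to measure on real hardware.
top_batch = sorted(visited)[:8]
print(top_batch)
```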

Accelerating Optimization via Transfer Learning

In practice we want to optimize many tensor operators with different input shapes and data types, so apply transfer learning. The key is to create a transferable representation that is invariant across the source and target domains.

Common practice is to use the configuration directly as the model's input. However, the search space specification can change across workloads, or when the user specifies a new search space for the same workload, so the configuration representation is not invariant to changes in the search space.

The low-level loop AST is a shared representation of programs that is invariant to the search space. To leverage this invariance, use the low-level loop AST as the model input.
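
A toy illustration of the idea (the loop-nest encoding and the feature names here are assumptions for illustration, not the paper's actual feature set): features are read off the lowered loop nest itself, so they remain meaningful even if the search-space specification changes.

```python
# Flatten a (toy) lowered loop nest into a fixed feature vector. The point
# is that the features come from the low-level loop AST, not from the
# search-space configuration, so they transfer across workloads.
loop_nest = [
    {"var": "yo", "extent": 32, "parallel": True,  "vectorized": False},
    {"var": "xo", "extent": 32, "parallel": False, "vectorized": False},
    {"var": "xi", "extent": 8,  "parallel": False, "vectorized": True},
]

def featurize(nest):
    feats = []
    for depth, loop in enumerate(nest):
        feats += [depth, loop["extent"],
                  int(loop["parallel"]), int(loop["vectorized"])]
    return feats

print(featurize(loop_nest))
```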

[Papers] HyperNetworks

A hypernetwork is a network that generates the weights of another network (analogous to a genotype and a phenotype). The authors train the whole system end-to-end with backpropagation.

The hypernetworks used here are linear: the generated weights are the result of matrix multiplications plus biases. The input is an embedding z^j with N_z elements, and d (in the paper's equations) denotes the size of the hypernetwork's hidden layer (see the sketch after the architecture below).

For MNIST, model network was
conv1: 28x28x1 –> (16x7x7x1) –> 28x28x16
conv2: 14x14x16 –> (16x7x7x16) –> 14x14x16
fc: 7x7x16 –> (784×10) –> 10
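
A minimal NumPy sketch of such a linear hypernetwork, generating a 7x7, 16-in/16-out conv kernel from a small embedding (the factorization is simplified relative to the paper, and N_z = 4, d = 16 are illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear (static) hypernetwork: a small embedding z is mapped through
# matrix multiplies + biases to a full conv kernel.
N_z, d = 4, 16
f, c_in, c_out = 7, 16, 16

z = rng.standard_normal(N_z)                        # layer embedding z^j
W1 = rng.standard_normal((d, N_z)); b1 = np.zeros(d)
W2 = rng.standard_normal((f * f * c_in * c_out, d)); b2 = np.zeros(f * f * c_in * c_out)

a = W1 @ z + b1                                     # hidden projection of size d
kernel = (W2 @ a + b2).reshape(f, f, c_in, c_out)   # generated conv weights
print(kernel.shape)                                 # (7, 7, 16, 16)
```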

Using this seems to give comparable results on MNIST and on a WRN with CIFAR-10, but I would really need to check…

Only the RNN part remains in the final paper.

Interestingly, the authors pulled the CNN part out of the paper in the final draft through a revision… I believe this may be because it did not give the best results…

Reference

Original paper: https://arxiv.org/pdf/1609.09106.pdf
ICLR 2017 Review: https://openreview.net/forum?id=rkpACe1lx
ICLR 2017 Final paper: https://openreview.net/pdf?id=rkpACe1lx
Author’s blog: http://blog.otoro.net/2016/09/28/hyper-networks/

[Papers] Summary

[ICLRW 2018] To prune, or not to prune: exploring the efficacy of pruning for model compression (Michael Zhu, Suyog Gupta)
– large-sparse models consistently outperform small-dense models
– propose gradual pruning: sparsity is increased gradually during training, implemented with a mask-based method in TensorFlow (see the sketch below)
  . in higher-variance models, layerwise constant pruning seems to mitigate the degradation from pruning
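
A small sketch of the paper's cubic gradual-sparsity schedule together with a simple magnitude-based mask (the step counts and final sparsity are illustrative, and the mask function is a generic implementation rather than the paper's TF code):

```python
import numpy as np

def gradual_sparsity(step, s_initial=0.0, s_final=0.9,
                     begin_step=0, end_step=10000):
    """Cubic sparsity ramp: sparsity rises quickly at first (when there are
    many redundant weights) and flattens out as it approaches s_final."""
    step = np.clip(step, begin_step, end_step)
    progress = (step - begin_step) / (end_step - begin_step)
    return s_final + (s_initial - s_final) * (1.0 - progress) ** 3

def magnitude_mask(weights, sparsity):
    """Binary mask keeping the largest-magnitude weights (mask-based pruning)."""
    k = int(sparsity * weights.size)
    if k == 0:
        return np.ones_like(weights)
    threshold = np.sort(np.abs(weights).ravel())[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)

w = np.random.randn(256, 256)
for t in [0, 2500, 5000, 10000]:
    s = gradual_sparsity(t)
    print(t, round(float(s), 3), round(1 - magnitude_mask(w, s).mean(), 3))
```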

[CoRR abs 2017] MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
– built from depthwise convolutions plus pointwise (1×1) convolutions, greatly reducing model size and compute (see the cost sketch below)
  . pointwise convolution can be directly mapped to GEMM without im2col
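
A quick back-of-the-envelope comparison using the MobileNets cost model (the layer sizes below are illustrative): the depthwise-separable factorization costs roughly 1/N + 1/D_K^2 of a standard convolution.

```python
# Multiply-accumulate counts for one layer: standard conv vs.
# depthwise + pointwise (1x1) conv, following the MobileNets cost model.
D_F = 56          # spatial size of the feature map
D_K = 3           # kernel size
M, N = 128, 128   # input / output channels

standard  = D_K * D_K * M * N * D_F * D_F
depthwise = D_K * D_K * M * D_F * D_F            # one D_K x D_K filter per input channel
pointwise = M * N * D_F * D_F                    # 1x1 conv mixing channels (maps to GEMM)
separable = depthwise + pointwise

print(separable / standard)                      # ~= 1/N + 1/D_K**2 ~= 0.119
```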

[ECCV 2018] AMC: AutoML for Model Compression and Acceleration on Mobile Devices
– push-the-button compression pipeline for determining pruning rate for each layer using DDPG
  . penalizing accuracy loss while encouraging model shrinking and speedup
  . actor-critic structure helps reduce variance, facilitating more stable training
– structured pruning, pruned weights are regular
– result:
  . ResNet: 1×1 convolutions have less redundancy and can be pruned less; 3×3 convolutions have more redundancy and can be pruned more

[CoRR abs 2015] Distilling the Knowledge in a Neural Network
– using a higher temperature (T) produces a softer probability distribution over classes; use this to train the distilled model, then use a temperature of 1 at inference (see the sketch after this list)
– if the correct labels are known, either modify the soft targets or use a weighted average of the two objectives (soft targets and hard labels)…
– it works well for MNIST, ASR…
– it can act as a regularizer: training with soft targets on only a subset of the images already seems to prevent overfitting
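
A minimal NumPy sketch of the temperature-softened softmax and the combined distillation loss (T = 4 and alpha = 0.7 are illustrative values; the T^2 scaling of the soft term follows the paper):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)            # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.7):
    """Weighted average of (a) cross-entropy against the teacher's softened
    targets at temperature T and (b) ordinary cross-entropy against the hard
    label at T = 1. The T**2 factor keeps the gradient scales comparable."""
    soft_t = softmax(teacher_logits, T)
    soft_s = softmax(student_logits, T)
    soft_loss = -np.sum(soft_t * np.log(soft_s))
    hard_loss = -np.log(softmax(student_logits)[label])
    return alpha * (T ** 2) * soft_loss + (1 - alpha) * hard_loss

student = np.array([1.0, 2.0, 0.5])
teacher = np.array([1.2, 3.0, 0.1])
print(distillation_loss(student, teacher, label=1))
```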

[Papers] Viterbi-based Pruning for Sparse Matrix with Fixed and High Index Compression Ratio

Abstract

Conventional sparse matrix formats involve irregular index structures with large storage requirements and a sequential reconstruction process, resulting in inefficient use of highly parallel computing resources.

Numerous sparse matrix candidates are first generated by the Viterbi encoder, and the candidate that minimizes model accuracy degradation is then selected by the Viterbi algorithm.

Introduction

It has been shown that pruning can achieve a 9x to 13x reduction in the number of connections (Han et al., 2015).

A DNN after pruning relies heavily on sparse matrix-vector and sparse matrix-matrix multiplications (SpMV and SpMM, respectively). Despite the sparsity, the computation time for SpMM is longer than that of dense matrix multiplication on a modern GPU, due to the serialized index decoding process and irregular memory access patterns.
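
For reference, a plain CSR sparse matrix-vector multiply (a generic format, not the paper's): the per-row index walk is inherently serial, and the x[col_idx[k]] gather is the irregular, data-dependent memory access.

```python
import numpy as np

# CSR SpMV: decode row_ptr/col_idx to find nonzeros, then gather from x.
dense = np.array([[0., 2., 0., 0.],
                  [1., 0., 0., 3.],
                  [0., 0., 0., 0.],
                  [4., 0., 5., 0.]])
values  = dense[dense != 0]                          # nonzero values (row-major)
col_idx = np.nonzero(dense)[1]                       # per-nonzero column index
row_ptr = np.concatenate([[0], np.cumsum((dense != 0).sum(axis=1))])

x = np.arange(4, dtype=float)
y = np.zeros(4)
for row in range(4):
    for k in range(row_ptr[row], row_ptr[row + 1]):  # walk this row's nonzeros
        y[row] += values[k] * x[col_idx[k]]          # gather: data-dependent index
assert np.allclose(y, dense @ x)
```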

Pruning using Viterbi-based Approach

If the input sequence length and the total output sequence length of a VD are denoted p and q, respectively, then the index compression ratio can be calculated as q/p (with q >> p).

Viterbi Decompressor (VD) Design Considerations

The goal of VD is to act as a random number generator using the input sequence. It is interesting to note that such an effort has already been studied in ECC design.

A fixed encoding rate was proposed to simulate random coding with an acceptable decoding complexity.

The randomness of the VD outputs is determined by the number of flip-flops (FFs) and the configuration of the XOR gates.
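
As a toy illustration of that sentence only (essentially a linear-feedback shift register, not the paper's actual VD structure): a handful of flip-flops hold the state, and the chosen XOR taps determine which state bits are combined into each output bit.

```python
# Toy FF + XOR stream generator (LFSR-style), illustrative only.
def xor_stream(seed_bits, taps, n_out):
    state = list(seed_bits)                 # k flip-flops
    out = []
    for _ in range(n_out):
        bit = 0
        for t in taps:                      # XOR-gate configuration
            bit ^= state[t]
        out.append(bit)
        state = [bit] + state[:-1]          # shift-register update
    return out

print(xor_stream([1, 0, 0, 1], taps=(0, 3), n_out=16))
```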

Viterbi Algorithm for Pruning

Assign a cost function to each pruning case enabled by VD and evaluate all possible (2^p) pruning cases with a “Branch-and-Cut” algorithm.

The pruning case that has the optimal cost function should lead to minimal accuracy degradation.

The Viterbi algorithm computes the maximum-likelihood sequence in a hidden Markov model, and can be utilized as a fast and efficient pruning exploration technique for our pruning method.

  • Construct a trellis diagram, a time-indexed version of a state diagram.
    • the number of states is 2^k.
  • Compute the cost function for the possible transitions using the branch metric and the path metric (see the sketch after this list).
  • Some scheme I don’t understand
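
A generic Viterbi maximum-score search over a small random trellis (illustrative only; the states, branch metrics, and maximization direction here are not the paper's exact pruning formulation):

```python
import numpy as np

# At every time step, each state keeps only the best-scoring incoming path
# (path metric = best predecessor path metric + branch metric); the best
# final state is then traced back to recover the whole sequence.
n_states, n_steps = 4, 6
rng = np.random.default_rng(0)
branch = rng.random((n_steps, n_states, n_states))   # branch[t, s_prev, s_next]

path_metric = np.zeros(n_states)
backptr = np.zeros((n_steps, n_states), dtype=int)
for t in range(n_steps):
    scores = path_metric[:, None] + branch[t]        # all prev -> next transitions
    backptr[t] = scores.argmax(axis=0)               # survivor per next state
    path_metric = scores.max(axis=0)

# Traceback from the best final state.
state = int(path_metric.argmax())
best_path = [state]
for t in reversed(range(n_steps)):
    state = int(backptr[t, state])
    best_path.append(state)
best_path.reverse()
print(best_path, float(path_metric.max()))
```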

Result

Pruning quality is similar to magnitude-based pruning, but with a smaller matrix (storage) for indexing.