University of California San Diego
My Ph.D. will be an adventure exploring various ways to optimize systems 🙂
The following are the projects I work on:
My research was supported in part by DARPA, NIH, NSF, SRC, and generous gifts from Google, Samsung, Qualcomm, Microsoft, and Xilinx.
Machine Learning for Compiler Optimization
The success of Deep Neural Networks (DNNs) and their computational intensity have heralded a Cambrian explosion of DNN hardware. While hardware design has advanced significantly, optimizing code for these accelerators is still an open challenge. Recent research has moved past traditional compilation techniques toward stochastic search algorithms that blindly generate largely random candidate binaries and rely on real hardware measurements of them to guide the search.
Chameleon leverages reinforcement learning, whose solution takes fewer steps to converge, and develops an adaptive sampling algorithm that not only focuses the costly samples (real hardware measurements) on representative points but also uses domain-knowledge-inspired logic to improve the samples themselves.
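The adaptive sampling idea can be illustrated with a small, hypothetical sketch: cluster the candidate configurations proposed by the search (here assumed to be tuples of numeric knob values, such as tile sizes) and spend real hardware measurements only on the candidate closest to each cluster centroid. The plain k-means, its deterministic farthest-point initialization, and the names `kmeans` and `representative_samples` are assumptions for illustration rather than Chameleon's exact algorithm, and the domain-knowledge step that improves the samples is not shown.

```python
import math


def kmeans(points, k, iters=50):
    """Plain k-means with deterministic farthest-point initialization."""
    # Assumption: every point is a tuple of numeric knob values.
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from every centroid chosen so far.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        updated = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:
            break  # assignments have converged
        centroids = updated
    return centroids


def representative_samples(candidates, k):
    """Pick at most k candidates to measure: the one nearest each centroid."""
    reps = []
    for c in kmeans(candidates, k):
        nearest = min(candidates, key=lambda p: math.dist(p, c))
        if nearest not in reps:
            reps.append(nearest)
    return reps


# Hypothetical candidates forming two obvious groups of similar configurations.
candidates = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(representative_samples(candidates, 2))  # [(1, 1), (8, 8)]
```

Measuring one representative per cluster caps the costly hardware runs at k per round, while nearby and likely redundant configurations are skipped.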
Glimpse opens a new dimension by incorporating mathematical embeddings of the hardware specifications of GPU accelerators, dubbed Blueprint, to better guide the search algorithm toward sub-spaces with higher potential for yielding higher-performance binaries.
- Reinforcement Learning and Adaptive Sampling for Optimized DNN Compilation (2019.02 – 2020.02) ICLR’20, ICML-W’19
- Use of Mathematical Embeddings of Hardware Specifications for Faster Neural Compilation (2021.07 – 2022.02) DAC’22 Best Paper Nomination
Spatial Multi-Tenancy for DNN Accelerators
Accelerator-based INFerence-as-a-Service (INFaaS), built on Google’s TPU, NVIDIA T4, Microsoft Brainwave, etc., has become the backbone of many real-life applications. However, as demand for such services grows, merely scaling out the number of accelerators is not cost-effective. Although multi-tenancy has propelled datacenter scalability, it has not been a primary factor in designing DNN accelerators due to the arms race for higher speed and efficiency.
Planaria is the first work to explore spatial multi-tenancy in DNN accelerators. Dynamic architecture fission and its associated flexibility enable an extra degree of freedom for task scheduling, even allowing the accelerator to be broken up according to server load, DNN topology, and task priority. As such, Planaria can simultaneously co-locate DNNs to enhance utilization, throughput, QoS, and fairness.
- Dynamic Architecture Fission for Spatial Multi-Tenant Acceleration of DNNs (2020.02 – 2020.09) MICRO’20
- Accelerating Federated Learning with Attention NeurIPS Workshop’22
- Cross-Domain Multi-Acceleration IEEE Micro’22
- Microservices and Serverless Computing ISCA Workshop’22
- Neural Architecture Search on Resource-limited Platforms IEEE SMC’21 – with Mälardalen University in Sweden
Qualcomm AI Research
During my internships at Qualcomm, I worked on developing compiler optimization techniques to reduce the memory footprint of deep neural network execution. This compiler optimization research at Qualcomm forms an important part of my Ph.D. dissertation.
Compiler Optimization for Machine Learning
Recent advances demonstrate that irregularly wired neural networks from Neural Architecture Search (NAS) and Random Wiring can not only automate the design of deep neural networks but also emit models that outperform previous manual designs. These designs are especially effective when designing neural architectures under hard resource constraints (memory, MACs, etc.), which highlights the importance of this class of neural networks. However, this shift complicates the previously streamlined pattern of execution. In fact, one of the main challenges is that the execution order of the nodes in such networks significantly affects the memory footprint of the intermediate activations. Current compilers do not take activation memory footprint into account when scheduling, so the peak footprint can be significantly higher than the optimum, rendering them impractical for edge devices.
Serenity utilizes dynamic programming to find a schedule with the optimal peak memory footprint. Our solution also includes a graph rewriting technique that reduces the footprint even beyond this scheduling optimum.
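The effect of execution order on peak memory, and how dynamic programming can find the best order, can be illustrated with a small sketch. The Python below is not Serenity's implementation: it exhaustively runs a dynamic program over sets of already-scheduled nodes, assumes each node produces one activation that is freed once its last consumer has run, and assumes graph outputs stay resident. The names `optimal_schedule` and `peak_of_order` and the toy graph are hypothetical. Because the state space grows with the number of schedulable node subsets, this version only suits small graphs.

```python
from functools import lru_cache


def _build_graph(sizes, edges):
    # Predecessor and successor sets for every node in the DAG.
    preds = {n: set() for n in sizes}
    succs = {n: set() for n in sizes}
    for u, v in edges:
        preds[v].add(u)
        succs[u].add(v)
    return preds, succs


def _live_memory(sizes, succs, done):
    # Assumption: an activation stays resident until all of its consumers
    # have run; graph outputs (nodes without consumers) stay resident.
    return sum(sizes[u] for u in done if not succs[u] or not succs[u] <= done)


def peak_of_order(sizes, edges, order):
    """Peak activation memory when executing nodes in the given order."""
    _, succs = _build_graph(sizes, edges)
    done, peak = frozenset(), 0
    for v in order:
        # While v runs, the live activations plus its new output are resident.
        peak = max(peak, _live_memory(sizes, succs, done) + sizes[v])
        done = done | {v}
    return peak


def optimal_schedule(sizes, edges):
    """Return (peak, order) with the minimum peak over all topological orders."""
    preds, succs = _build_graph(sizes, edges)
    nodes = list(sizes)

    @lru_cache(maxsize=None)
    def best(done):
        # Minimum peak achievable for the nodes not yet in `done`.
        if len(done) == len(nodes):
            return 0, ()
        base = _live_memory(sizes, succs, done)
        result = None
        for v in nodes:
            if v in done or not preds[v] <= done:
                continue  # already scheduled, or some input is not ready
            rest_peak, rest_order = best(done | {v})
            candidate = (max(base + sizes[v], rest_peak), (v,) + rest_order)
            if result is None or candidate[0] < result[0]:
                result = candidate
        return result

    peak, order = best(frozenset())
    return peak, list(order)


# Toy graph: "a" feeds two branches that expand to 10 units, then shrink to 1.
sizes = {"a": 2, "b": 10, "b2": 1, "c": 10, "c2": 1, "d": 1}
edges = [("a", "b"), ("a", "c"), ("b", "b2"),
         ("c", "c2"), ("b2", "d"), ("c2", "d")]
print(optimal_schedule(sizes, edges))  # (13, ['a', 'b', 'b2', 'c', 'c2', 'd'])
print(peak_of_order(sizes, edges, ["a", "b", "c", "b2", "c2", "d"]))  # 22
```

On this toy graph, finishing one branch before starting the other peaks at 13 units, whereas interleaving the two branches keeps both 10-unit activations alive at once and peaks at 22.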
- Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices (2019.07 – 2019.09) MLSys’20
- AI-Enabled Memory Management Algorithms (2020.06 – 2020.09)