ECE Assistant Professor Tushar Krishna has been chosen as one of the recipients of the Facebook Research Faculty Award for AI System Hardware/Software Co-Design. Krishna was among nine winners selected from 132 submissions worldwide. This is the second year in a row that Krishna has won this award.
The title of Krishna’s award-winning project is “HW/SW co-design of next-generation training platforms for DLRMs.” DLRM stands for Deep Learning Recommendation Model; DLRMs are used in online recommendation systems, such as ranking search queries on Google, suggesting friends on Facebook, and serving job advertisements on LinkedIn. DLRMs differ substantially from the deep learning models used for computer vision and natural language processing because they involve both continuous (or dense) features and categorical (or sparse) features. For example, the date and time of a user’s clicks on a webpage can serve as dense features, while a representation of the user based on all the webpages they have visited in the past 48 hours can serve as sparse features for training recommendation models. The dense features are processed with multilayer perceptrons (MLPs), while the sparse features are processed using a technique called embeddings.
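To make the dense/sparse split concrete, here is a minimal sketch of a DLRM-style model in PyTorch. The layer sizes, embedding-table dimensions, and feature counts are illustrative assumptions for the sketch, not details of Krishna’s project or of Facebook’s production models.

import torch
import torch.nn as nn

class TinyDLRM(nn.Module):
    # Toy DLRM: all sizes below are assumed for illustration.
    def __init__(self, num_dense=4, embedding_rows=1000, embedding_dim=16):
        super().__init__()
        # Dense (continuous) features pass through a multilayer perceptron.
        self.bottom_mlp = nn.Sequential(
            nn.Linear(num_dense, 32), nn.ReLU(),
            nn.Linear(32, embedding_dim), nn.ReLU(),
        )
        # Sparse (categorical) features are looked up in an embedding table
        # and pooled per sample.
        self.embedding = nn.EmbeddingBag(embedding_rows, embedding_dim, mode="sum")
        # A top MLP combines both representations into a click probability.
        self.top_mlp = nn.Sequential(
            nn.Linear(2 * embedding_dim, 16), nn.ReLU(),
            nn.Linear(16, 1), nn.Sigmoid(),
        )

    def forward(self, dense, sparse_ids, offsets):
        d = self.bottom_mlp(dense)                     # dense path (MLP)
        s = self.embedding(sparse_ids, offsets)        # sparse path (embeddings)
        return self.top_mlp(torch.cat([d, s], dim=1))  # combine and predict

model = TinyDLRM()
dense = torch.randn(2, 4)                   # 2 samples, 4 dense features each
sparse_ids = torch.tensor([11, 42, 7])      # flattened categorical IDs
offsets = torch.tensor([0, 2])              # sample 0 owns ids[0:2], sample 1 owns ids[2:]
scores = model(dense, sparse_ids, offsets)  # shape [2, 1]

In production models, the single toy table above becomes many tables with up to billions of rows each, which is what drives the memory demands discussed next.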
Training DLRMs constitutes more than 50 percent of the training demand at companies like Facebook. This is because storing the embeddings requires significant memory capacity, on the order of hundreds of gigabytes to a few terabytes, which is more than the memory available on a single accelerator (GPU or TPU) node. DLRMs therefore require clever partitioning and distribution of the model across multiple accelerator nodes, which in turn makes it crucial to optimize the communication between these nodes to reduce overall training time.
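A back-of-the-envelope calculation in Python shows why even a single embedding table can outgrow one accelerator’s memory; the row count and vector width below are illustrative assumptions, not figures from Facebook’s workloads.

num_rows = 500_000_000   # one embedding row per categorical ID (assumed)
embedding_dim = 128      # width of each embedding vector (assumed)
bytes_per_value = 4      # 32-bit floating point

table_bytes = num_rows * embedding_dim * bytes_per_value
print(f"{table_bytes / 1e9:.0f} GB")  # 256 GB for this one table

At 256 GB, this single table already exceeds the 16-80 GB of memory on a typical GPU, and a model with several such tables reaches the terabyte range, hence the need to shard the model across many nodes.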
As part of the award, Krishna will explore mechanisms for efficient distributed training of recommendation models. The research will develop co-design techniques across software and hardware to enable scalability across hundreds to thousands of accelerator nodes. The effort will leverage ASTRA-sim, a distributed DL training simulator developed by Krishna and his Ph.D. student Saeed Rashidi in collaboration with Facebook and Intel.
Krishna is an assistant professor in the School of Electrical and Computer Engineering at Georgia Tech, where he also holds the ON Semiconductor Junior Professorship. He received a Ph.D. in Electrical Engineering and Computer Science from MIT (2014), an M.S.E. in Electrical Engineering from Princeton University (2009), and a B.Tech. in Electrical Engineering from the Indian Institute of Technology, Delhi (2007). Krishna’s research spans computer architecture, interconnection networks, networks-on-chip (NoC), and deep learning accelerators, with a focus on optimizing data movement in modern computing systems. Three of his papers have been selected for IEEE Micro’s Top Picks from Computer Architecture, one more received an honorable mention, and three have won best paper awards. He received the National Science Foundation CRII award in 2018 and both a Google Faculty Award and a Facebook Faculty Award in 2019.