Greg Pauloski
Computer Scientist // Software Engineer
Hello there!
I am a fifth-year Ph.D. student in Computer Science at the University of Chicago interested in high-performance computing, distributed systems, and deep learning frameworks.
I am a member of Globus Labs where I am co-advised by Ian Foster and Kyle Chard.
I completed my Bachelor's in Computer Science at the University of Texas at Austin and previously worked at Apple, Google, and the Texas Advanced Computing Center.
🎉 I am on the job market, seeking full-time opportunities post-graduation (Spring/Summer 2025)!
RESEARCH
DISSERTATIONS
Accelerating Communications in High-Performance Scientific Workflows [Apr 2025]
Abstract | Committee | Poster | Doctoral Dissertation (In Progress)
ABSTRACT: Advances in networks, accelerators, and cloud services encourage programmers to reconsider where to compute—such as when fast networks make it cost-effective to compute on remote accelerators despite added latency. Workflow and cloud-hosted serverless computing frameworks can manage multi-step computations spanning federated collections of cloud, high-performance computing, and edge systems, but passing data among computational steps remains a challenge when applications are compositions of multiple distinct software components with differing communication patterns.
This work introduces a new programming paradigm that decouples data flow from control flow by extending the pass-by-reference model to distributed applications. ProxyStore, developed here, implements this paradigm through object proxies that act as wide-area object references with just-in-time resolution. The proxy model enables producers to communicate data unilaterally, transparently, and efficiently to both local and remote consumers. This decoupling enables the dynamic selection of different data movement methods, depending on the producer, consumer, and object involved. The efficacy of the proxy paradigm is further explored through four high-level proxy-based programming patterns applied to real-world computational science applications. These high-level patterns—distributed futures, streaming, ownership, and stateful actors—make the power of the proxy paradigm accessible for more complex and dynamic distributed program structures. ProxyStore is evaluated through standardized benchmark suites, introduced here, and meaningful science applications, spanning bioinformatics, federated learning, and molecular design, in which substantial improvements in runtime, throughput, and memory usage are demonstrated.
Committee: Kyle Chard, Ian Foster, and Michael Franklin
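To make the proxy model concrete, the sketch below shows the core mechanism in plain Python: a proxy wraps a factory and defers the (possibly remote) fetch until first use. This is an illustrative toy, not ProxyStore's actual implementation; all names here are hypothetical.

```python
# Toy sketch of a transparent object proxy with just-in-time resolution.
# Illustrative only -- this is not ProxyStore's implementation.
from typing import Any, Callable

_UNRESOLVED = object()  # sentinel so factories may legally return None


class Proxy:
    """Wraps a factory that produces the target object on first use."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        object.__setattr__(self, '_factory', factory)
        object.__setattr__(self, '_target', _UNRESOLVED)

    def _resolve(self) -> Any:
        # The (possibly remote) fetch is deferred until a consumer
        # actually touches the object.
        if object.__getattribute__(self, '_target') is _UNRESOLVED:
            factory = object.__getattribute__(self, '_factory')
            object.__setattr__(self, '_target', factory())
        return object.__getattribute__(self, '_target')

    def __getattr__(self, name: str) -> Any:
        return getattr(self._resolve(), name)  # forward attribute access

    def __getitem__(self, key: Any) -> Any:
        return self._resolve()[key]            # forward item access


def fetch() -> dict:
    print('resolving...')  # stands in for a read from a remote store
    return {'result': 42}


p = Proxy(fetch)   # cheap to create and to pass between processes
print(p['result'])  # prints 'resolving...' then 42
```

A producer can hand out such references unilaterally; only consumers that actually use the data pay the cost of moving it.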
Scalable Deep Neural Network Training with Distributed K-FAC [Mar 2022]
Abstract | Committee | PDF | Slides | Master's Thesis
ABSTRACT: Scaling deep neural network training to more processors and larger batch sizes is key to reducing end-to-end training time; yet, maintaining comparable convergence and hardware utilization at larger scales is challenging. Increases in training scales have enabled natural gradient optimization methods as a reasonable alternative to stochastic gradient descent (SGD) and variants thereof. Kronecker-factored Approximate Curvature (K-FAC), a natural gradient method, has recently been shown to converge with fewer iterations in deep neural network (DNN) training than SGD; however, K-FAC's larger memory footprint and increased communication necessitate careful distribution of work for efficient usage. This thesis investigates scalable K-FAC algorithms to understand K-FAC's applicability in large-scale deep neural network training and presents KAISA, a K-FAC-enabled, Adaptable, Improved, and ScAlable second-order optimizer framework. Specifically, layer-wise distribution strategies, inverse-free second-order gradient evaluation, dynamic K-FAC update decoupling, and more are explored with the goal of preserving convergence while minimizing training time. KAISA can adapt the memory footprint, communication, and computation given specific models and hardware to improve performance and increase scalability, and this adaptable distribution scheme generalizes existing strategies while providing a framework for scaling second-order methods beyond K-FAC. Compared to the original optimizers, KAISA converges 18.1–36.3% faster across applications with the same global batch size. Under a fixed memory budget, KAISA converges 32.5% and 41.6% faster in ResNet-50 and BERT-Large, respectively. KAISA can balance memory and communication to achieve scaling efficiency equal to or better than the baseline optimizers.
Committee: Kyle Chard, Ian Foster, and Zhao Zhang
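The integration pattern at the heart of this work—a per-layer gradient preconditioner invoked between backward() and optimizer.step()—can be sketched in PyTorch as below. The toy preconditioner only rescales gradients; real K-FAC maintains Kronecker-factored curvature approximations per layer, so treat this as the shape of the pattern, not the released implementation.

```python
# Sketch of the training-loop integration studied in the thesis: a
# gradient preconditioner step between backward() and optimizer.step().
# The toy preconditioner below only rescales gradients; real K-FAC
# maintains Kronecker-factored curvature approximations per layer.
import torch


class ToyPreconditioner:
    def __init__(self, model: torch.nn.Module) -> None:
        self.model = model

    @torch.no_grad()
    def step(self) -> None:
        # Stand-in for K-FAC: precondition each gradient in place.
        for p in self.model.parameters():
            if p.grad is not None:
                p.grad /= p.grad.norm() + 1e-8  # placeholder math


model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
precon = ToyPreconditioner(model)

x, y = torch.randn(16, 8), torch.randint(0, 2, (16,))
for _ in range(3):
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    precon.step()      # precondition gradients (K-FAC in the real code)
    optimizer.step()   # then apply the standard first-order update
```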
PROJECTS
Check out all of my projects on GitHub.
SELECTED PUBLICATIONS
Ordered by most recent.
TaPS: A Performance Evaluation Suite for Task-based Execution Frameworks [Sep 2024]
J. Gregory Pauloski, Valerie Hayot-Sasson, Maxime Gonthier, Nathaniel Hudson, Haochen Pan, Sicheng Zhou, Ian Foster, Kyle Chard
eScience 2024 — Best Paper
TLDR | PDF | Website | Code | Slides | Publication | BibTeX
TLDR: Task-based execution frameworks, such as parallel programming libraries, computational workflow systems, and function-as-a-service platforms, enable the composition of distinct tasks into a single, unified application designed to achieve a computational goal. Research into these task executors has accelerated as computational sciences increasingly need to take advantage of parallel compute and/or heterogeneous hardware. However, the lack of evaluation standards makes it challenging to compare and contrast novel systems against existing implementations. Here, we introduce TaPS, the Task Performance Suite, to support continued research in parallel task executor frameworks. TaPS provides (1) a unified, modular interface for writing and evaluating applications using arbitrary execution frameworks and data management systems and (2) an initial set of reference synthetic and real-world science applications.
@inproceedings{pauloski2024taps, author = {Pauloski, J. Gregory and Hayot-Sasson, Valerie and Gonthier, Maxime and Hudson, Nathaniel and Pan, Haochen and Zhou, Sicheng and Foster, Ian and Chard, Kyle}, title = {{TaPS: A Performance Evaluation Suite for Task-based Execution Frameworks}}, address = {New York, NY, USA}, booktitle = {IEEE 20th International Conference on e-Science}, doi = {10.1109/e-Science62913.2024.10678702}, pages = {1-10}, publisher = {IEEE}, year = {2024} } |
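The unified-interface idea can be illustrated with nothing but the standard library: if an application is written against `concurrent.futures.Executor`, any framework offering an `Executor`-compatible engine can run it unchanged, which is what makes apples-to-apples comparisons possible. A hypothetical sketch of that idea (not the TaPS API itself):

```python
# Sketch of the executor-abstraction idea behind TaPS: write the app
# against concurrent.futures.Executor, then swap in any framework's
# executor. Illustrative only -- not the TaPS API itself.
import time
from concurrent.futures import Executor, ProcessPoolExecutor, as_completed


def task(n: int) -> int:
    return sum(i * i for i in range(n))


def run_app(executor: Executor, sizes: list[int]) -> float:
    """Run a bag-of-tasks app on any Executor and return its makespan."""
    start = time.perf_counter()
    futures = [executor.submit(task, n) for n in sizes]
    for f in as_completed(futures):
        f.result()
    return time.perf_counter() - start


if __name__ == '__main__':
    # Any Executor-compatible engine (Parsl, Dask, etc. provide them)
    # could be substituted here without changing run_app().
    with ProcessPoolExecutor(max_workers=4) as pool:
        print(f'makespan: {run_app(pool, [10**6] * 8):.3f}s')
```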
Object Proxy Patterns for Accelerating Distributed Applications [Jul 2024]
J. Gregory Pauloski, Valerie Hayot-Sasson, Logan Ward, Alexander Brace, André Bauer, Kyle Chard, Ian Foster
arXiv Preprint
TLDR | PDF | Website | Code | Preprint | BibTeX
TLDR: In prior work, we demonstrated the transparent object proxy, which provides wide-area references that can resolve to data regardless of location, as an effective low-level building block for data flow optimization in distributed application design. Here we propose three high-level proxy-based programming patterns (distributed futures, streaming, and ownership) that make the power of the proxy pattern usable for more complex and dynamic distributed program structures. We motivate these patterns via careful review of application requirements and describe implementations of each pattern. We evaluate our implementations through a suite of benchmarks and by applying them in three substantial scientific applications, in which we demonstrate substantial improvements in runtime, throughput, and memory usage.
@misc{pauloski2024proxystore, author = {J. Gregory Pauloski and Valerie Hayot-Sasson and Logan Ward and Alexander Brace and André Bauer and Kyle Chard and Ian Foster}, title = {{Object Proxy Patterns for Accelerating Distributed Applications}}, archiveprefix = {arXiv}, eprint = {2407.01764}, primaryclass = {cs.DC}, url = {https://arxiv.org/abs/2407.01764}, year = {2024} } |
Accelerating Communications in Federated Applications with Transparent Object Proxies [Nov 2023]
J. Gregory Pauloski, Valerie Hayot-Sasson, Logan Ward, Nathaniel Hudson, Charlie Sabino, Matt Baughman, Kyle Chard, Ian Foster
SC 2023
TLDR | PDF | Website | Code | Poster | Slides | Publication | BibTeX
TLDR: We describe ProxyStore, a system that decouples control flow from data flow by extending the pass-by-reference model to distributed applications using object proxies that act as wide-area object references with just-in-time resolution. This proxy model enables data producers to communicate data unilaterally, transparently, and efficiently to both local and remote consumers. We demonstrate the benefits of this model with synthetic benchmarks and real-world scientific applications, running across various computing platforms.
@inproceedings{pauloski2023proxystore, author = {Pauloski, J. Gregory and Hayot-Sasson, Valerie and Ward, Logan and Hudson, Nathaniel and Sabino, Charlie and Baughman, Matt and Chard, Kyle and Foster, Ian}, title = {{Accelerating Communications in Federated Applications with Transparent Object Proxies}}, address = {New York, NY, USA}, articleno = {59}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, doi = {10.1145/3581784.3607047}, isbn = {9798400701092}, location = {Denver, CO, USA}, numpages = {15}, publisher = {Association for Computing Machinery}, series = {SC '23}, url = {https://doi.org/10.1145/3581784.3607047}, year = {2023} } |
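For a sense of what the proxy interface looks like to a user, here is a minimal usage sketch based on the ProxyStore documentation; module paths and signatures may differ across versions, so treat it as an approximation rather than the authoritative API.

```python
# Approximate ProxyStore usage, per its documentation at the time of
# writing (exact module paths and signatures may vary by version).
from proxystore.connectors.file import FileConnector
from proxystore.store import Store

store = Store('example', FileConnector('/tmp/proxystore-demo'))

data = {'values': list(range(1_000_000))}
proxy = store.proxy(data)  # cheap, pickleable wide-area reference

# The proxy can be shipped to any consumer (e.g., inside a task).
# It resolves to the underlying object just in time, on first access.
assert proxy['values'][0] == 0

store.close()
```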
Deep Neural Network Training With Distributed K-FAC [Mar 2022]
J. Gregory Pauloski, Lei Huang, Weijia Xu, Kyle Chard, Ian Foster, Zhao Zhang
TPDS 2022
TLDR | PDF | Code | Publication | BibTeX
TLDR: We extend our SC 2020 paper to evaluate the convergence and scaling properties of our K-FAC gradient preconditioner for image classification, object detection, and language modeling applications. In all applications, our implementation converges to baseline performance targets in 9–25% less time than the standard first-order optimizers on GPU clusters across a variety of scales.
@article{pauloski2022kfac, author = {Pauloski, J. Gregory and Huang, Lei and Xu, Weijia and Chard, Kyle and Foster, Ian T. and Zhang, Zhao}, title = {{Deep Neural Network Training With Distributed K-FAC}}, doi = {10.1109/TPDS.2022.3161187}, journal = {IEEE Transactions on Parallel and Distributed Systems}, number = {12}, pages = {3616-3627}, volume = {33}, year = {2022} } |
KAISA: An Adaptive Second-Order Optimizer Framework for Deep Neural Networks [Nov 2021]
J. Gregory Pauloski, Qi Huang, Lei Huang, Shivaram Venkataraman, Kyle Chard, Ian Foster, Zhao Zhang
SC 2021
TLDR | PDF | Code | Slides | Publication | BibTeX
TLDR: We present KAISA, a K-FAC-enabled, Adaptable, Improved, and ScAlable second-order optimizer framework that adapts the memory footprint, communication, and computation given specific models and hardware to improve performance and increase scalability. Compared to the original optimizers, KAISA converges 18.1–36.3% faster across applications with the same global batch size.
@inproceedings{pauloski2021kaisa, author = {Pauloski, J. Gregory and Huang, Qi and Huang, Lei and Venkataraman, Shivaram and Chard, Kyle and Foster, Ian and Zhang, Zhao}, title = {{KAISA: An Adaptive Second-Order Optimizer Framework for Deep Neural Networks}}, abstract = {Kronecker-factored Approximate Curvature (K-FAC) has recently been shown to converge faster in deep neural network (DNN) training than stochastic gradient descent (SGD); however, K-FAC's larger memory footprint hinders its applicability to large models. We present KAISA, a K-FAC-enabled, Adaptable, Improved, and ScAlable second-order optimizer framework that adapts the memory footprint, communication, and computation given specific models and hardware to improve performance and increase scalability. We quantify the tradeoffs between memory and communication cost and evaluate KAISA on large models, including ResNet-50, Mask R-CNN, U-Net, and BERT, on up to 128 NVIDIA A100 GPUs. Compared to the original optimizers, KAISA converges 18.1--36.3% faster across applications with the same global batch size. Under a fixed memory budget, KAISA converges 32.5% and 41.6% faster in ResNet-50 and BERT-Large, respectively. KAISA can balance memory and communication to achieve scaling efficiency equal to or better than the baseline optimizers.}, address = {New York, NY, USA}, articleno = {13}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, doi = {10.1145/3458817.3476152}, isbn = {9781450384421}, keywords = {second-order optimization, machine learning, distributed computing, K-FAC, data-parallel algorithms}, location = {St. Louis, Missouri}, numpages = {14}, publisher = {Association for Computing Machinery}, series = {SC '21}, url = {https://doi.org/10.1145/3458817.3476152}, year = {2021} } |
Convolutional Neural Network Training with Distributed K-FAC [Nov 2020]
J. Gregory Pauloski, Zhao Zhang, Lei Huang, Weijia Xu, Ian Foster
SC 2020
TLDR | PDF | Code | Slides | Publication | BibTeX
TLDR: We study optimization techniques such as layer-wise distribution strategies, inverse-free second-order gradient evaluation, and dynamic K-FAC update decoupling to reduce training time while preserving convergence. Our distributed optimizer design trains ResNet-50 18–25% faster than SGD.
@inproceedings{pauloski2020kfac, author = {Pauloski, J. Gregory and Zhang, Zhao and Huang, Lei and Xu, Weijia and Foster, Ian T.}, title = {{Convolutional Neural Network Training with Distributed K-FAC}}, abstract = {Training neural networks with many processors can reduce time-to-solution; however, it is challenging to maintain convergence and efficiency at large scales. The Kronecker-factored Approximate Curvature (K-FAC) was recently proposed as an approximation of the Fisher Information Matrix that can be used in natural gradient optimizers. We investigate here a scalable K-FAC design and its applicability in convolutional neural network (CNN) training at scale. We study optimization techniques such as layer-wise distribution strategies, inverse-free second-order gradient evaluation, and dynamic K-FAC update decoupling to reduce training time while preserving convergence. We use residual neural networks (ResNet) applied to the CIFAR-10 and ImageNet-1k datasets to evaluate the correctness and scalability of our K-FAC gradient preconditioner. With ResNet-50 on the ImageNet-1k dataset, our distributed K-FAC implementation converges to the 75.9% MLPerf baseline in 18--25% less time than does the classic stochastic gradient descent (SGD) optimizer across scales on a GPU cluster.}, articleno = {94}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, doi = {10.5555/3433701.3433826}, isbn = {9781728199986}, keywords = {optimization methods, neural networks, high performance computing, scalability}, location = {Atlanta, Georgia}, numpages = {14}, publisher = {IEEE Press}, series = {SC '20}, year = {2020} } |
ALL PUBLICATIONS
Ordered by most recent and grouped by topic. BibTeX file available for download here.
Oct 2024 | Accelerating Python Applications with Dask and ProxyStore
TLDR | PDF | Authors | Code | Preprint | BibTeX | arXiv Preprint & HPPSS24 Demo
TLDR: Applications are increasingly written as dynamic workflows underpinned by an execution framework that manages asynchronous computations across distributed hardware. However, execution frameworks typically offer one-size-fits-all solutions for data flow management, which can restrict performance and scalability. ProxyStore, a middleware layer that optimizes data flow via an advanced pass-by-reference paradigm, has been shown to be an effective mechanism for addressing these limitations. Here, we investigate integrating ProxyStore with Dask Distributed, one of the most popular libraries for distributed computing in Python, with the goal of supporting scalable and portable scientific workflows. Dask provides an easy-to-use and flexible framework but is less optimized for scaling certain data-intensive workflows. We investigate these limitations, detail the technical contributions necessary to develop a robust solution for distributed applications, and demonstrate improved performance on synthetic benchmarks and real applications.
@misc{pauloski2024accelerating, author = {J. Gregory Pauloski and Klaudiusz Rydzy and Valerie Hayot-Sasson and Ian Foster and Kyle Chard}, title = {{Accelerating Python Applications with Dask and ProxyStore}}, archiveprefix = {arXiv}, eprint = {2410.12092}, primaryclass = {cs.DC}, url = {https://arxiv.org/abs/2410.12092}, year = {2024} } |
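From the user's perspective the integration is small: large arguments are proxied before submission so the Dask scheduler ships lightweight references instead of payloads. A rough sketch, under the same ProxyStore API assumptions as the sketch in the selected publications above:

```python
# Rough sketch: pass proxies through Dask so the scheduler ships
# references instead of large payloads. ProxyStore module paths as
# above; treat the exact API as an assumption, not a reference.
from dask.distributed import Client
from proxystore.connectors.file import FileConnector
from proxystore.store import Store


def mean(values: list[float]) -> float:
    # The proxy argument resolves transparently on first use here.
    return sum(values) / len(values)


if __name__ == '__main__':
    client = Client(processes=False)  # small local cluster for the demo
    store = Store('dask-demo', FileConnector('/tmp/proxystore-dask'))

    big = [float(i) for i in range(1_000_000)]
    future = client.submit(mean, store.proxy(big))  # submit a reference
    print(future.result())

    store.close()
    client.close()
```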
Sep 2024 | TaPS: A Performance Evaluation Suite for Task-based Execution Frameworks
TLDR | PDF | Authors | Website | Code | Slides | Publication | BibTeX | eScience 2024 — Best Paper
TLDR: Task-based execution frameworks, such as parallel programming libraries, computational workflow systems, and function-as-a-service platforms, enable the composition of distinct tasks into a single, unified application designed to achieve a computational goal. Research into these task executors has accelerated as computational sciences increasingly need to take advantage of parallel compute and/or heterogeneous hardware. However, the lack of evaluation standards makes it challenging to compare and contrast novel systems against existing implementations. Here, we introduce TaPS, the Task Performance Suite, to support continued research in parallel task executor frameworks. TaPS provides (1) a unified, modular interface for writing and evaluating applications using arbitrary execution frameworks and data management systems and (2) an initial set of reference synthetic and real-world science applications.
@inproceedings{pauloski2024taps, author = {Pauloski, J. Gregory and Hayot-Sasson, Valerie and Gonthier, Maxime and Hudson, Nathaniel and Pan, Haochen and Zhou, Sicheng and Foster, Ian and Chard, Kyle}, title = {{TaPS: A Performance Evaluation Suite for Task-based Execution Frameworks}}, address = {New York, NY, USA}, booktitle = {IEEE 20th International Conference on e-Science}, doi = {10.1109/e-Science62913.2024.10678702}, pages = {1-10}, publisher = {IEEE}, year = {2024} } |
Jul 2024 | Object Proxy Patterns for Accelerating Distributed Applications
TLDR | PDF | Authors | Website | Code | Preprint | BibTeX | arXiv Preprint
TLDR: In prior work, we demonstrated the transparent object proxy, which provides wide-area references that can resolve to data regardless of location, as an effective low-level building block for data flow optimization in distributed application design. Here we propose three high-level proxy-based programming patterns (distributed futures, streaming, and ownership) that make the power of the proxy pattern usable for more complex and dynamic distributed program structures. We motivate these patterns via careful review of application requirements and describe implementations of each pattern. We evaluate our implementations through a suite of benchmarks and by applying them in three substantial scientific applications, in which we demonstrate substantial improvements in runtime, throughput, and memory usage.
@misc{pauloski2024proxystore, author = {J. Gregory Pauloski and Valerie Hayot-Sasson and Logan Ward and Alexander Brace and André Bauer and Kyle Chard and Ian Foster}, title = {{Object Proxy Patterns for Accelerating Distributed Applications}}, archiveprefix = {arXiv}, eprint = {2407.01764}, primaryclass = {cs.DC}, url = {https://arxiv.org/abs/2407.01764}, year = {2024} } |
Nov 2023 | Accelerating Communications in Federated Applications with Transparent Object Proxies
TLDR | PDF | Authors | Website | Code | Poster | Slides | Publication | BibTeX | SC 2023
TLDR: We describe ProxyStore, a system that decouples control flow from data flow by extending the pass-by-reference model to distributed applications using object proxies that act as wide-area object references with just-in-time resolution. This proxy model enables data producers to communicate data unilaterally, transparently, and efficiently to both local and remote consumers. We demonstrate the benefits of this model with synthetic benchmarks and real-world scientific applications, running across various computing platforms.
@inproceedings{pauloski2023proxystore, author = {Pauloski, J. Gregory and Hayot-Sasson, Valerie and Ward, Logan and Hudson, Nathaniel and Sabino, Charlie and Baughman, Matt and Chard, Kyle and Foster, Ian}, title = {{Accelerating Communications in Federated Applications with Transparent Object Proxies}}, address = {New York, NY, USA}, articleno = {59}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, doi = {10.1145/3581784.3607047}, isbn = {9798400701092}, location = {Denver, CO, USA}, numpages = {15}, publisher = {Association for Computing Machinery}, series = {SC '23}, url = {https://doi.org/10.1145/3581784.3607047}, year = {2023} } |
Sep 2024 | Flight: A FaaS-Based Framework for Complex and Hierarchical Federated Learning
TLDR | PDF | Authors | Code | Preprint | BibTeX | arXiv Preprint
TLDR: Federated Learning (FL) is a decentralized machine learning paradigm where models are trained on distributed devices and are aggregated at a central server. Existing FL frameworks assume simple two-tier network topologies where end devices are directly connected to the aggregation server. While this is a practical mental model, it does not exploit the inherent topology of real-world distributed systems like the Internet-of-Things. We present Flight, a novel FL framework that supports complex hierarchical multi-tier topologies, asynchronous aggregation, and decouples the control plane from the data plane. We compare the performance of Flight against Flower, a state-of-the-art FL framework. Our results show that Flight scales beyond Flower, supporting up to 2048 simultaneous devices, and reduces FL makespan across several models. Finally, we show that Flight's hierarchical FL model can reduce communication overheads by more than 60%.
@misc{hudson2024flight, author = {Nathaniel Hudson and Valerie Hayot-Sasson and Yadu Babuji and Matt Baughman and J. Gregory Pauloski and Ryan Chard and Ian Foster and Kyle Chard}, title = {{Flight: A FaaS-Based Framework for Complex and Hierarchical Federated Learning}}, archiveprefix = {arXiv}, eprint = {2409.16495}, primaryclass = {cs.LG}, url = {https://arxiv.org/abs/2409.16495}, year = {2024} } |
Dec 2023 | Trillion Parameter AI Serving Infrastructure for Scientific Discovery: A Survey and Vision
TLDR | PDF | Authors | Publication | BibTeX | BDCAT 2023
TLDR: Deep learning methods are transforming research, enabling new techniques, and ultimately leading to new discoveries. As the demand for more capable AI models continues to grow, we are now entering an era of Trillion Parameter Models (TPM), or models with more than a trillion parameters, such as Huawei's PanGu-Σ. We describe a vision for the ecosystem of TPM users and providers that caters to the specific needs of the scientific community. We then outline the significant technical challenges and open problems in system design for serving TPMs to enable scientific research and discovery. Specifically, we describe the requirements of a comprehensive software stack and interfaces to support the diverse and flexible requirements of researchers.
@inproceedings{hudson2023trillion, author = {Hudson, Nathaniel C and Pauloski, J. Gregory and Baughman, Matt and Kamatar, Alok and Sakarvadia, Mansi and Ward, Logan and Chard, Ryan and Bauer, Andr\'{e} and Levental, Maksim and Wang, Wenyi and Engler, Will and Price Skelly, Owen and Blaiszik, Ben and Stevens, Rick and Chard, Kyle and Foster, Ian}, title = {{Trillion Parameter AI Serving Infrastructure for Scientific Discovery: A Survey and Vision}}, abstract = {Deep learning methods are transforming research, enabling new techniques, and ultimately leading to new discoveries. As the demand for more capable AI models continues to grow, we are now entering an era of Trillion Parameter Models (TPM), or models with more than a trillion parameters---such as Huawei's PanGu-Σ. We describe a vision for the ecosystem of TPM users and providers that caters to the specific needs of the scientific community. We then outline the significant technical challenges and open problems in system design for serving TPMs to enable scientific research and discovery. Specifically, we describe the requirements of a comprehensive software stack and interfaces to support the diverse and flexible requirements of researchers.}, address = {New York, NY, USA}, articleno = {15}, booktitle = {Proceedings of the IEEE/ACM 10th International Conference on Big Data Computing, Applications and Technologies}, doi = {10.1145/3632366.3632396}, isbn = {9798400704734}, keywords = {artificial intelligence, grid computing, deep learning applications, systems design, survey}, location = {Taormina (Messina), Italy}, numpages = {10}, publisher = {Association for Computing Machinery}, series = {BDCAT '23}, url = {https://doi.org/10.1145/3632366.3632396}, year = {2024} } |
Mar 2022 | Deep Neural Network Training With Distributed K-FAC
TLDR | PDF | Authors | Code | Publication | BibTeX | TPDS 2022
TLDR: We extend our SC 2020 paper to evaluate the convergence and scaling properties of our K-FAC gradient preconditioner for image classification, object detection, and language modeling applications. In all applications, our implementation converges to baseline performance targets in 9–25% less time than the standard first-order optimizers on GPU clusters across a variety of scales.
@article{pauloski2022kfac, author = {Pauloski, J. Gregory and Huang, Lei and Xu, Weijia and Chard, Kyle and Foster, Ian T. and Zhang, Zhao}, title = {{Deep Neural Network Training With Distributed K-FAC}}, doi = {10.1109/TPDS.2022.3161187}, journal = {IEEE Transactions on Parallel and Distributed Systems}, number = {12}, pages = {3616-3627}, volume = {33}, year = {2022} } |
Nov 2021 | KAISA: An Adaptive Second-Order Optimizer Framework for Deep Neural Networks
TLDR | PDF | Authors | Code | Slides | Publication | BibTeX | SC 2021
TLDR: We present KAISA, a K-FAC-enabled, Adaptable, Improved, and ScAlable second-order optimizer framework that adapts the memory footprint, communication, and computation given specific models and hardware to improve performance and increase scalability. Compared to the original optimizers, KAISA converges 18.1–36.3% faster across applications with the same global batch size.
@inproceedings{pauloski2021kaisa, author = {Pauloski, J. Gregory and Huang, Qi and Huang, Lei and Venkataraman, Shivaram and Chard, Kyle and Foster, Ian and Zhang, Zhao}, title = {{KAISA: An Adaptive Second-Order Optimizer Framework for Deep Neural Networks}}, abstract = {Kronecker-factored Approximate Curvature (K-FAC) has recently been shown to converge faster in deep neural network (DNN) training than stochastic gradient descent (SGD); however, K-FAC's larger memory footprint hinders its applicability to large models. We present KAISA, a K-FAC-enabled, Adaptable, Improved, and ScAlable second-order optimizer framework that adapts the memory footprint, communication, and computation given specific models and hardware to improve performance and increase scalability. We quantify the tradeoffs between memory and communication cost and evaluate KAISA on large models, including ResNet-50, Mask R-CNN, U-Net, and BERT, on up to 128 NVIDIA A100 GPUs. Compared to the original optimizers, KAISA converges 18.1--36.3% faster across applications with the same global batch size. Under a fixed memory budget, KAISA converges 32.5% and 41.6% faster in ResNet-50 and BERT-Large, respectively. KAISA can balance memory and communication to achieve scaling efficiency equal to or better than the baseline optimizers.}, address = {New York, NY, USA}, articleno = {13}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, doi = {10.1145/3458817.3476152}, isbn = {9781450384421}, keywords = {second-order optimization, machine learning, distributed computing, K-FAC, data-parallel algorithms}, location = {St. Louis, Missouri}, numpages = {14}, publisher = {Association for Computing Machinery}, series = {SC '21}, url = {https://doi.org/10.1145/3458817.3476152}, year = {2021} } |
Nov 2020 | Convolutional Neural Network Training with Distributed K-FAC
TLDR | PDF | Authors | Code | Slides | Publication | BibTeX | SC 2020
TLDR: We study optimization techniques such as layer-wise distribution strategies, inverse-free second-order gradient evaluation, and dynamic K-FAC update decoupling to reduce training time while preserving convergence. Our distributed optimizer design trains ResNet-50 18–25% faster than SGD.
@inproceedings{pauloski2020kfac, author = {Pauloski, J. Gregory and Zhang, Zhao and Huang, Lei and Xu, Weijia and Foster, Ian T.}, title = {{Convolutional Neural Network Training with Distributed K-FAC}}, abstract = {Training neural networks with many processors can reduce time-to-solution; however, it is challenging to maintain convergence and efficiency at large scales. The Kronecker-factored Approximate Curvature (K-FAC) was recently proposed as an approximation of the Fisher Information Matrix that can be used in natural gradient optimizers. We investigate here a scalable K-FAC design and its applicability in convolutional neural network (CNN) training at scale. We study optimization techniques such as layer-wise distribution strategies, inverse-free second-order gradient evaluation, and dynamic K-FAC update decoupling to reduce training time while preserving convergence. We use residual neural networks (ResNet) applied to the CIFAR-10 and ImageNet-1k datasets to evaluate the correctness and scalability of our K-FAC gradient preconditioner. With ResNet-50 on the ImageNet-1k dataset, our distributed K-FAC implementation converges to the 75.9% MLPerf baseline in 18--25% less time than does the classic stochastic gradient descent (SGD) optimizer across scales on a GPU cluster.}, articleno = {94}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, doi = {10.5555/3433701.3433826}, isbn = {9781728199986}, keywords = {optimization methods, neural networks, high performance computing, scalability}, location = {Atlanta, Georgia}, numpages = {14}, publisher = {IEEE Press}, series = {SC '20}, year = {2020} } |
May 2020 | Efficient I/O for Neural Network Training with Compressed Data
TLDR | PDF | Authors | Code | Publication | BibTeX | IPDPS 2020
TLDR: We investigate the tradeoff between runtime overhead and data compression ratio on real-world deep learning training datasets and applications. We show that storage can be reduced by 2–13x with minimal additional runtime overhead.
@inproceedings{zhang2020compressed, author = {Z. {Zhang} and L. {Huang} and J. G. {Pauloski} and I. T. {Foster}}, title = {{Efficient I/O for Neural Network Training with Compressed Data}}, booktitle = {2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)}, doi = {10.1109/IPDPS47924.2020.00050}, number = {}, pages = {409-418}, volume = {}, year = {2020} } |
Dec 2019 | Aggregating Local Storage for Scalable Deep Learning I/O
TLDR | PDF | Authors | Code | Publication | BibTeX | DLS 2019
TLDR: We develop a user-level transient object store that provides low-latency and scalable POSIX-compliant file access for deep learning training at scale.
@inproceedings{zhang2019aggregating, author = {Z. {Zhang} and L. {Huang} and J. G. {Pauloski} and I. {Foster}}, title = {{Aggregating Local Storage for Scalable Deep Learning I/O}}, booktitle = {2019 IEEE/ACM Third Workshop on Deep Learning on Supercomputers (DLS)}, doi = {10.1109/DLS49591.2019.00014}, number = {}, pages = {69-75}, volume = {}, year = {2019} } |
Oct 2024 | Employing Artificial Intelligence to Steer Exascale Workflows with Colmena
TLDR | PDF | Authors | Website | Code | Publication | BibTeX | IJHPCA 2024
TLDR: We created Colmena to leverage the massive parallelism of a supercomputer by using Artificial Intelligence (AI) to learn from and adapt a workflow as it executes. Colmena allows scientists to define how their application should respond to events (e.g., task completion) as a series of cooperative agents. In this paper, we describe the design of Colmena, the challenges we overcame while deploying applications on exascale systems, and the science workflows we have enhanced through interweaving AI.
@article{ward2024colmena, author = {Logan Ward and J. Gregory Pauloski and Valerie Hayot-Sasson and Yadu Babuji and Alexander Brace and Ryan Chard and Kyle Chard and Rajeev Thakur and Ian Foster}, title = {{Employing Artificial Intelligence to Steer Exascale Workflows with Colmena}}, doi = {10.1177/10943420241288242}, journal = {The International Journal of High Performance Computing Applications}, url = {https://doi.org/10.1177/10943420241288242}, year = {2024} } |
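The steering model described above—agents that react to events such as task completion—can be sketched generically with the standard library. This is a hypothetical illustration of the pattern, not Colmena's actual API.

```python
# Generic sketch of event-driven steering: a 'thinker' agent reacts to
# task-completion events and decides what to submit next. Hypothetical
# illustration of the pattern -- not Colmena's actual API.
import random
from concurrent.futures import ThreadPoolExecutor
from queue import Queue


def simulate(x: float) -> float:
    return -(x - 0.7) ** 2  # toy objective to maximize


def main() -> None:
    results: Queue = Queue()
    pool = ThreadPoolExecutor(max_workers=4)

    def submit(x: float) -> None:
        pool.submit(lambda: results.put((x, simulate(x))))

    for _ in range(4):                 # seed the campaign
        submit(random.random())

    best_x, best_y = 0.0, float('-inf')
    for _ in range(20):                # agent loop: react to completions
        x, y = results.get()           # event: a task finished
        if y > best_y:
            best_x, best_y = x, y
        submit(best_x + random.gauss(0, 0.1))  # steer near the best point

    pool.shutdown()
    print(f'best: x={best_x:.3f}, y={best_y:.3f}')


if __name__ == '__main__':
    main()
```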
Nov 2023 | DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies
TLDR | PDF | Authors | Website | Preprint | BibTeX | arXiv Preprint
TLDR: We present the DeepSpeed4Science initiative which aims to build unique capabilities through AI system technology innovations to help domain experts to unlock today's biggest science mysteries. By leveraging DeepSpeed's current technology pillars (training, inference and compression) as base technology enablers, DeepSpeed4Science will create a new set of AI system technologies tailored for accelerating scientific discoveries by addressing their unique complexity beyond the common technical approaches used for accelerating generic large language models.
@misc{song2023deepspeed4science, author = {Shuaiwen Leon Song and Bonnie Kruft and Minjia Zhang and Conglong Li and Shiyang Chen and Chengming Zhang and Masahiro Tanaka and Xiaoxia Wu and Jeff Rasley and Ammar Ahmad Awan and Connor Holmes and Martin Cai and Adam Ghanem and Zhongzhu Zhou and Yuxiong He and Pete Luferenko and Divya Kumar and Jonathan Weyn and Ruixiong Zhang and Sylwester Klocek and Volodymyr Vragov and Mohammed AlQuraishi and Gustaf Ahdritz and Christina Floristean and Cristina Negri and Rao Kotamarthi and Venkatram Vishwanath and Arvind Ramanathan and Sam Foreman and Kyle Hippe and Troy Arcomano and Romit Maulik and Maxim Zvyagin and Alexander Brace and Bin Zhang and Cindy Orozco Bohorquez and Austin Clyde and Bharat Kale and Danilo Perez-Rivera and Heng Ma and Carla M. Mann and Michael Irvin and J. Gregory Pauloski and Logan Ward and Valerie Hayot and Murali Emani and Zhen Xie and Diangen Lin and Maulik Shukla and Ian Foster and James J. Davis and Michael E. Papka and Thomas Brettin and Prasanna Balaprakash and Gina Tourassi and John Gounley and Heidi Hanson and Thomas E Potok and Massimiliano Lupo Pasini and Kate Evans and Dan Lu and Dalton Lunga and Junqi Yin and Sajal Dash and Feiyi Wang and Mallikarjun Shankar and Isaac Lyngaas and Xiao Wang and Guojing Cong and Pei Zhang and Ming Fan and Siyan Liu and Adolfy Hoisie and Shinjae Yoo and Yihui Ren and William Tang and Kyle Felker and Alexey Svyatkovskiy and Hang Liu and Ashwin Aji and Angela Dalton and Michael Schulte and Karl Schulz and Yuntian Deng and Weili Nie and Josh Romero and Christian Dallago and Arash Vahdat and Chaowei Xiao and Thomas Gibbs and Anima Anandkumar and Rick Stevens}, title = {{DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies}}, archiveprefix = {arXiv}, eprint = {2310.04610}, primaryclass = {cs.AI}, year = {2023} } |
May 2023 | The Diminishing Returns of Masked Language Models to Science
TLDR | PDF | Authors | Website | Publication | BibTeX | Findings of the Association for Computational Linguistics: ACL 2023
TLDR: We use 14 domain-specific transformer-based models (including ScholarBERT, a new 770M-parameter science-focused masked language model pretrained on up to 225B tokens) to evaluate the impact of training data, model size, and pretraining and finetuning time on 12 downstream scientific tasks. Interestingly, we find that increasing model size, training data, or compute time does not always lead to measurable improvements for scientific information extraction tasks.
@inproceedings{hong2023scholarbert, author = {Hong, Zhi and Ajith, Aswathy and Pauloski, J. Gregory and Duede, Eamon and Chard, Kyle and Foster, Ian}, title = {{The Diminishing Returns of Masked Language Models to Science}}, abstract = {Transformer-based masked language models such as BERT, trained on general corpora, have shown impressive performance on downstream tasks. It has also been demonstrated that the downstream task performance of such models can be improved by pretraining larger models for longer on more data. In this work, we empirically evaluate the extent to which these results extend to tasks in science. We use 14 domain-specific transformer-based models (including ScholarBERT, a new 770Mparameter science-focused masked language model pretrained on up to 225B tokens) to evaluate the impact of training data, model size, pretraining and finetuning time on 12 downstream scientific tasks. Interestingly, we find that increasing model size, training data, or compute time does not always lead to significant improvements (i.e., {\textgreater}1{\%} F1), if any, in scientific information extraction tasks. We offer possible explanations for this surprising result.}, address = {Toronto, Canada}, booktitle = {Findings of the Association for Computational Linguistics: ACL 2023}, doi = {10.18653/v1/2023.findings-acl.82}, editor = {Rogers, Anna and Boyd-Graber, Jordan and Okazaki, Naoaki}, month = {July}, pages = {1270--1283}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2023.findings-acl.82}, year = {2023} } |
Mar 2023 | Cloud Services Enable Efficient AI-Guided Simulation Workflows across Heterogeneous Resources
TLDR | PDF | Authors | Code | Publication | BibTeX | HCW @ IPDPS 2023
TLDR: We describe our experiences building and deploying AI-driven workflows across multiple computing sites, without networking hassles and without losing performance, using Colmena, Globus, FuncX, and ProxyStore.
@misc{ward2023colmena, author = {Ward, Logan and Pauloski, J. Gregory and Hayot-Sasson, Valerie and Chard, Ryan and Babuji, Yadu and Sivaraman, Ganesh and Choudhury, Sutanay and Chard, Kyle and Thakur, Rajeev and Foster, Ian}, title = {{Cloud Services Enable Efficient AI-Guided Simulation Workflows across Heterogeneous Resources}}, copyright = {arXiv.org perpetual, non-exclusive license}, doi = {10.48550/ARXIV.2303.08803}, keywords = {Distributed, Parallel, and Cluster Computing (cs.DC), Artificial Intelligence (cs.AI), FOS: Computer and information sciences, FOS: Computer and information sciences}, publisher = {arXiv}, url = {https://arxiv.org/abs/2303.08803}, year = {2023} } |
Oct 2022 | GenSLMs: Genome-scale Language Models Reveal SARS-CoV-2 Evolutionary Dynamics
TLDR | PDF | Authors | Code | Publication | BibTeX | IJHPCA — ACM Gordon Bell Special Prize for COVID-19 Research
TLDR: We build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pretraining on over 110 million prokaryotic gene sequences, and then finetuning a SARS-CoV-2 specific model on 1.5 million genomes, we show that GenSLM can accurately and rapidly identify variants of concern.
@article{zvyagin2022genslm, author = {Zvyagin, Maxim and Brace, Alexander and Hippe, Kyle and Deng, Yuntian and Zhang, Bin and Orozco Bohorquez, Cindy and Clyde, Austin and Kale, Bharat and Perez-Rivera, Danilo and Ma, Heng and Mann, Carla M. and Irvin, Michael and Pauloski, J. Gregory and Ward, Logan and Hayot, Valerie and Emani, Murali and Foreman, Sam and Xie, Zhen and Lin, Diangen and Shukla, Maulik and Nie, Weili and Romero, Josh and Dallago, Christian and Vahdat, Arash and Xiao, Chaowei and Gibbs, Thomas and Foster, Ian and Davis, James J. and Papka, Michael E. and Brettin, Thomas and Stevens, Rick and Anandkumar, Anima and Vishwanath, Venkatram and Ramanathan, Arvind}, title = {{GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics}}, abstract = {Our work seeks to transform how new and emergent variants of pandemic causing viruses, specially SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pretraining on over 110 million prokaryotic gene sequences, and then finetuning a SARS-CoV-2 specific model on 1.5 million genomes, we show that GenSLM can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLM represents one of the first whole genome scale foundation models which can generalize to other prediction tasks. We demonstrate the scaling of GenSLMs on both GPU-based supercomputers and AI-hardware accelerators, achieving over 1.54 zettaflops in training runs. We present initial scientific insights gleaned from examining GenSLMs in tracking the evolutionary dynamics of SARS-CoV-2, noting that its full potential on large biological data is yet to be realized.Competing Interest StatementThe authors have declared no competing interest.}, doi = {10.1101/2022.10.10.511571}, elocation-id = {2022.10.10.511571}, eprint = {https://www.biorxiv.org/content/early/2022/10/11/2022.10.10.511571.full.pdf}, journal = {bioRxiv}, publisher = {Cold Spring Harbor Laboratory}, url = {https://www.biorxiv.org/content/early/2022/10/11/2022.10.10.511571}, year = {2022} } |
Nov 2021 | Colmena: Scalable Machine-Learning-Based Steering of Ensemble Simulations for High Performance Computing
TLDR | PDF | Authors | Website | Code | Publication | BibTeX | MLHPC @ SC 2021
TLDR: We present Colmena, an open-source Python framework that allows users to steer massive computational campaigns by providing just the implementations of individual tasks plus the logic used to choose which tasks to execute when. We describe the design of Colmena and illustrate its capabilities by applying it to electrolyte design, where it both scales to 65536 CPUs and accelerates the discovery rate for high-performance molecules by a factor of 100 over unguided searches.
@inproceedings{ward2021colmena, author = {Ward, Logan and Sivaraman, Ganesh and Pauloski, J. Gregory and Babuji, Yadu and Chard, Ryan and Dandu, Naveen and Redfern, Paul C. and Assary, Rajeev S. and Chard, Kyle and Curtiss, Larry A. and Thakur, Rajeev and Foster, Ian}, title = {{Colmena: Scalable Machine-Learning-Based Steering of Ensemble Simulations for High Performance Computing}}, booktitle = {2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC)}, doi = {10.1109/MLHPC54614.2021.00007}, number = {}, pages = {9-20}, volume = {}, year = {2021} } |
Aug 2021 | Models and Processes to Extract Drug-like Molecules From Natural Language Text
TLDR | PDF | Authors | Publication | BibTeX | Frontiers in Molecular Biosciences
TLDR: We present (1) an iterative model-in-the-loop method that makes judicious use of scarce human expertise in generating training data for an NER model and (2) the application and evaluation of this method to identifying drug-like molecules in the COVID-19 Open Research Dataset Challenge (CORD-19) corpus of 198,875 papers.
@article{hong2021moleculesnlp, author = {Hong, Zhi and Pauloski, J. Gregory and Ward, Logan and Chard, Kyle and Blaiszik, Ben and Foster, Ian}, title = {{Models and Processes to Extract Drug-like Molecules From Natural Language Text}}, abstract = {Researchers worldwide are seeking to repurpose existing drugs or discover new drugs to counter the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). A promising source of candidates for such studies is molecules that have been reported in the scientific literature to be drug-like in the context of viral research. However, this literature is too large for human review and features unusual vocabularies for which existing named entity recognition (NER) models are ineffective. We report here on a project that leverages both human and artificial intelligence to detect references to such molecules in free text. We present 1) a iterative model-in-the-loop method that makes judicious use of scarce human expertise in generating training data for a NER model, and 2) the application and evaluation of this method to the problem of identifying drug-like molecules in the COVID-19 Open Research Dataset Challenge (CORD-19) corpus of 198,875 papers. We show that by repeatedly presenting human labelers only with samples for which an evolving NER model is uncertain, our human-machine hybrid pipeline requires only modest amounts of non-expert human labeling time (tens of hours to label 1778 samples) to generate an NER model with an F-1 score of 80.5%—on par with that of non-expert humans—and when applied to CORD’19, identifies 10,912 putative drug-like molecules. This enriched the computational screening team’s targets by 3,591 molecules, of which 18 ranked in the top 0.1% of all 6.6 million molecules screened for docking against the 3CLPro protein.}, doi = {10.3389/fmolb.2021.636077}, issn = {2296-889X}, journal = {Frontiers in Molecular Biosciences}, pages = {826}, url = {https://www.frontiersin.org/article/10.3389/fmolb.2021.636077}, volume = {8}, year = {2021} } |
Nov 2018 | Glioma Segmentation and a Simple Accurate Model for Overall Survival Prediction
TLDR | PDF | Authors | Publication | BibTeX | BrainLes 2018
TLDR: We develop a multi-stage pipeline for accurate patient survival prediction from brain tumor MRI scans. We segment tumor subvolumes using a multi-scale convolutional network, extract intensity and shape features, then use an ensemble of machine learning models to predict patient outcomes.
@inproceedings{gates2019glioma, author = {Gates, Evan and Pauloski, J. Gregory and Schellingerhout, Dawid and Fuentes, David}, title = {{Glioma Segmentation and a Simple Accurate Model for Overall Survival Prediction}}, abstract = {Brain tumor segmentation is a challenging task necessary for quantitative tumor analysis and diagnosis. We apply a multi-scale convolutional neural network based on the DeepMedic to segment glioma subvolumes provided in the 2018 MICCAI Brain Tumor Segmentation Challenge. We go on to extract intensity and shape features from the images and cross-validate machine learning models to predict overall survival. Using only the mean FLAIR intensity, nonenhancing tumor volume, and patient age we are able to predict patient overall survival with reasonable accuracy.}, address = {Cham}, booktitle = {Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries}, editor = {Crimi, Alessandro and Bakas, Spyridon and Kuijf, Hugo and Keyvan, Farahani and Reyes, Mauricio and van Walsum, Theo}, isbn = {978-3-030-11726-9}, pages = {476--484}, publisher = {Springer International Publishing}, year = {2019} } |
PRESENTATIONS
Ordered by most recent.
Sep 2024 | TaPS: A Performance Evaluation Suite for Task-based Execution Frameworks
Slides | IEEE International Conference on eScience (eScience)
Sep 2024 | TaPS: A Performance Evaluation Suite for Task-based Execution Frameworks
Slides | Video | ParslFest
Nov 2023 | Accelerating Communications in Federated Applications with Transparent Object Proxies
Slides | Supercomputing
Oct 2023 | ProxyStore: Decoupling Control and Data Flow in Workflows
Slides | Video | ParslFest
Apr 2023 | Accelerating Communications in Federated Applications with Transparent Object Proxies
Poster | Greater Chicago Area Systems Research Workshop (GCASR)
Sep 2022 | ProxyStore: a Data Fabric for Parsl and FuncX
Slides | Video | ParslFest
Nov 2021 | KAISA: An Adaptive Second-Order Optimizer Framework for Deep Neural Networks
Slides | Supercomputing
Nov 2020 | Convolutional Neural Network Training with Distributed K-FAC
Slides | Supercomputing
Sep 2018 | Optimizing Deep Learning Methods for Image Segmentation with Distributed Training
Poster | TACC Symposium for Texas Researchers (TACCSTER)