Greg Pauloski
Computer Scientist // Software Engineer
Hello there!
I am a fifth-year Ph.D. student in Computer Science at the University of Chicago interested in high-performance computing, distributed systems, and deep learning frameworks.
I am a member of Globus Labs where I am co-advised by Ian Foster and Kyle Chard.
I completed my Bachelor's in Computer Science at the University of Texas at Austin and previously worked at Apple, Google, and the Texas Advanced Computing Center.
🎉 I am on the job market! Seeking full-time opportunities post-graduation (Spring/Summer 2025).
RESEARCH
DISSERTATIONS
Programming the Continuum: Towards Better Techniques for Developing Distributed Science Applications [Apr 2025] |
Abstract | Committee | Poster | Doctoral Dissertation (In Progress) |
ABSTRACT: Advances in networks, accelerators, and cloud services encourage programmers to reconsider where to compute—such as when fast networks make it cost-effective to compute on remote accelerators despite added latency. Workflow and cloud-hosted serverless computing frameworks can manage multi-step computations spanning federated collections of cloud, high-performance computing, and edge systems, but rely on simple abstractions that pose challenges when building applications composed of multiple distinct software with differing communication and patterns. This dissertation introduces new techniques for programming distributed science applications deployed across the computing continuum. TaPS, a benchmarking suite for reliable evaluation of parallel execution frameworks, is developed and used to investigate limitations in existing solutions. This investigation motivates the design of ProxyStore, a library that extends the pass-by-reference model to distributed applications with the goal of decoupling data flow from control flow. ProxyStore's object proxy paradigm enables the dynamic selection of different data movement methods, depending on
Committee: Kyle Chard, Ian Foster, and Michael Franklin
Scalable Deep Neural Network Training with Distributed K-FAC [Mar 2022] |
Abstract | Committee | PDF | Slides | Masters Thesis |
ABSTRACT: Scaling deep neural network training to more processors and larger batch sizes is key to reducing end-to-end training time; yet, maintaining comparable convergence and hardware utilization at larger scales is challenging. Increases in training scales have enabled natural gradient optimization methods as a reasonable alternative to stochastic gradient descent (SGD) and variants thereof. Kronecker-factored Approximate Curvature (K-FAC), a natural gradient method, has recently been shown to converge with fewer iterations in deep neural network (DNN) training than SGD; however, K-FAC's larger memory footprint and increased communication necessitate careful distribution of work for efficient usage. This thesis investigates scalable K-FAC algorithms to understand K-FAC's applicability in large-scale deep neural network training and presents KAISA, a K-FAC-enabled, Adaptable, Improved, and ScAlable second-order optimizer framework. Specifically, layer-wise distribution strategies, inverse-free second-order gradient evaluation, dynamic K-FAC update decoupling, and more are explored with the goal of preserving convergence while minimizing training time. KAISA can adapt the memory footprint, communication, and computation given specific models and hardware to improve performance and increase scalability, and this adaptable distribution scheme generalizes existing strategies while providing a framework for scaling second-order methods beyond K-FAC. Compared to the original optimizers, KAISA converges 18.1–36.3% faster across applications with the same global batch size. Under a fixed memory budget, KAISA converges 32.5% and 41.6% faster in ResNet-50 and BERT-Large, respectively. KAISA can balance memory and communication to achieve scaling efficiency equal to or better than the baseline optimizers.
Committee: Kyle Chard, Ian Foster, and Zhao Zhang
PROJECTS
Check out all of my projects on GitHub.
SELECTED PUBLICATIONS
Ordered by most recent.
Object Proxy Patterns for Accelerating Distributed Applications [Dec 2024] |
J. Gregory Pauloski, Valerie Hayot-Sasson, Logan Ward, Alexander Brace, André Bauer, Kyle Chard, Ian Foster |
TPDS 2024 |
TLDR | PDF | Website | Code | Publication | BibTex |
TLDR: In prior work, we demonstrated the transparent object proxy, which provides wide-area references that can resolve to data regardless of location, as an effective low-level building block for data flow optimization in distributed application design. Here we propose three high-level proxy-based programming patterns (distributed futures, streaming, and ownership) that make the power of the proxy pattern usable for more complex and dynamic distributed program structures. We motivate these patterns via careful review of application requirements and describe implementations of each pattern. We evaluate our implementations through a suite of benchmarks and by applying them in three substantial scientific applications, in which we demonstrate substantial improvements in runtime, throughput, and memory usage.
@article{pauloski2024proxystore, title = {Object {P}roxy {P}atterns for {A}ccelerating {D}istributed {A}pplications}, author = {Pauloski, J. Gregory and Hayot-Sasson, Valerie and Ward, Logan and Brace, Alexander and Bauer, André and Chard, Kyle and Foster, Ian}, doi = {10.1109/TPDS.2024.3511347}, journal = {IEEE Transactions on Parallel and Distributed Systems}, number = {}, pages = {1-13}, volume = {}, year = {2024} } |
TaPS: A Performance Evaluation Suite for Task-based Execution Frameworks [Sep 2024] |
J. Gregory Pauloski, Valerie Hayot-Sasson, Maxime Gonthier, Nathaniel Hudson, Haochen Pan, Sicheng Zhou, Ian Foster, Kyle Chard |
eScience 2024 — Best Paper |
TLDR | PDF | Website | Code | Slides | Publication | BibTex |
TLDR: Task-based execution frameworks, such as parallel programming libraries, computational workflow systems, and function-as-a-service platforms, enable the composition of distinct tasks into a single, unified application designed to achieve a computational goal. Research into these task executors has accelerated as computational sciences increasingly need to take advantage of parallel compute and/or heterogeneous hardware. However, the lack of evaluation standards makes it challenging to compare and contrast novel systems against existing implementations. Here, we introduce TaPS, the Task Performance Suite, to support continued research in parallel task executor frameworks. TaPS provides (1) a unified, modular interface for writing and evaluating applications using arbitrary execution frameworks and data management systems and (2) an initial set of reference synthetic and real-world science applications.
@inproceedings{pauloski2024taps, title = {{TaPS}: {A} {P}erformance {E}valuation {S}uite for {T}ask-based {E}xecution {F}rameworks}, author = {Pauloski, J. Gregory and Hayot-Sasson, Valerie and Gonthier, Maxime and Hudson, Nathaniel and Pan, Haochen and Zhou, Sicheng and Foster, Ian and Chard, Kyle}, address = {New York, NY, USA}, booktitle = {IEEE 20th International Conference on e-Science}, doi = {10.1109/e-Science62913.2024.10678702}, pages = {1-10}, publisher = {IEEE}, year = {2024} } |
Accelerating Communications in Federated Applications with Transparent Object Proxies [Nov 2023] |
J. Gregory Pauloski, Valerie Hayot-Sasson, Logan Ward, Nathaniel Hudson, Charlie Sabino, Matt Baughman, Kyle Chard, Ian Foster |
SC 2023 |
TLDR | PDF | Website | Code | Poster | Slides | Publication | BibTex |
TLDR: We describe ProxyStore, a system that decouples control flow from data flow by extending the pass-by-reference model to distributed applications using object proxies that act as wide-area object references with just-in-time resolution. This proxy model enables data producers to communicate data unilaterally, transparently, and efficiently to both local and remote consumers. We demonstrate the benefits of this model with synthetic benchmarks and real-world scientific applications, running across various computing platforms.
@inproceedings{pauloski2023proxystore, title = {Accelerating {C}ommunications in {F}ederated {A}pplications with {T}ransparent {O}bject {P}roxies}, author = {Pauloski, J. Gregory and Hayot-Sasson, Valerie and Ward, Logan and Hudson, Nathaniel and Sabino, Charlie and Baughman, Matt and Chard, Kyle and Foster, Ian}, address = {New York, NY, USA}, articleno = {59}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, doi = {10.1145/3581784.3607047}, isbn = {9798400701092}, location = {Denver, CO, USA}, numpages = {15}, publisher = {Association for Computing Machinery}, series = {SC '23}, url = {https://doi.org/10.1145/3581784.3607047}, year = {2023} } |
Deep Neural Network Training With Distributed K-FAC [Mar 2022] |
J. Gregory Pauloski, Lei Huang, Weijia Xu, Kyle Chard, Ian Foster, Zhao Zhang |
TPDS 2022 |
TLDR | PDF | Code | Publication | BibTex |
TLDR: We extend our SC 2020 paper to evaluate the convergence and scaling properties of our K-FAC gradient preconditioner for image classification, object detection, and language modeling applications. In all applications, our implementation converges to baseline performance targets in 9–25% less time than the standard first-order optimizers on GPU clusters across a variety of scales.
@article{pauloski2022kfac, title = {Deep {N}eural {N}etwork {T}raining {W}ith {D}istributed {K}-{FAC}}, author = {Pauloski, J. Gregory and Huang, Lei and Xu, Weijia and Chard, Kyle and Foster, Ian T. and Zhang, Zhao}, doi = {10.1109/TPDS.2022.3161187}, journal = {IEEE Transactions on Parallel and Distributed Systems}, number = {12}, pages = {3616-3627}, volume = {33}, year = {2022} } |
KAISA: An Adaptive Second-Order Optimizer Framework for Deep Neural Networks [Nov 2021] |
J. Gregory Pauloski, Qi Huang, Lei Huang, Shivaram Venkataraman, Kyle Chard, Ian Foster, Zhao Zhang |
SC 2021 |
TLDR | PDF | Code | Slides | Publication | BibTex |
TLDR: We present KAISA, a K-FAC-enabled, Adaptable, Improved, and ScAlable second-order optimizer framework that adapts the memory footprint, communication, and computation given specific models and hardware to improve performance and increase scalability. Compared to the original optimizers, KAISA converges 18.1–36.3% faster across applications with the same global batch size.
@inproceedings{pauloski2021kaisa, title = {{KAISA}: {A}n {A}daptive {S}econd-{O}rder {O}ptimizer {F}ramework for {D}eep {N}eural {N}etworks}, author = {Pauloski, J. Gregory and Huang, Qi and Huang, Lei and Venkataraman, Shivaram and Chard, Kyle and Foster, Ian and Zhang, Zhao}, address = {New York, NY, USA}, articleno = {13}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, doi = {10.1145/3458817.3476152}, isbn = {9781450384421}, location = {St. Louis, Missouri}, numpages = {14}, publisher = {Association for Computing Machinery}, series = {SC '21}, url = {https://doi.org/10.1145/3458817.3476152}, year = {2021} } |
Convolutional Neural Network Training with Distributed K-FAC [Nov 2020] |
J. Gregory Pauloski, Zhao Zhang, Lei Huang, Weijia Xu, Ian Foster |
SC 2020 |
TLDR | PDF | Code | Slides | Publication | BibTex |
TLDR: We study optimization techniques such as layer-wise distribution strategies, inverse-free second-order gradient evaluation, and dynamic K-FAC update decoupling to reduce training time while preserving convergence. Our distributed optimizer design trains ResNet-50 18–25% faster than SGD.
@inproceedings{pauloski2020kfac, title = {Convolutional {N}eural {N}etwork {T}raining with {D}istributed {K}-{FAC}}, author = {Pauloski, J. Gregory and Zhang, Zhao and Huang, Lei and Xu, Weijia and Foster, Ian T.}, articleno = {94}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, doi = {10.5555/3433701.3433826}, isbn = {9781728199986}, location = {Atlanta, Georgia}, numpages = {14}, publisher = {IEEE Press}, series = {SC '20}, year = {2020} } |
ALL PUBLICATIONS
Ordered by most recent and grouped by topic. A BibTeX file is available for download here.
Dec 2024 | Object Proxy Patterns for Accelerating Distributed Applications |
TLDR | PDF | Authors | Website | Code | Publication | BibTex | TPDS 2024 | |
TLDR: In prior work, we demonstrated the transparent object proxy, which provides wide-area references that can resolve to data regardless of location, as an effective low-level building block for data flow optimization in distributed application design. Here we propose three high-level proxy-based programming patterns (distributed futures, streaming, and ownership) that make the power of the proxy pattern usable for more complex and dynamic distributed program structures. We motivate these patterns via careful review of application requirements and describe implementations of each pattern. We evaluate our implementations through a suite of benchmarks and by applying them in three substantial scientific applications, in which we demonstrate substantial improvements in runtime, throughput, and memory usage.
@article{pauloski2024proxystore, title = {Object {P}roxy {P}atterns for {A}ccelerating {D}istributed {A}pplications}, author = {Pauloski, J. Gregory and Hayot-Sasson, Valerie and Ward, Logan and Brace, Alexander and Bauer, André and Chard, Kyle and Foster, Ian}, doi = {10.1109/TPDS.2024.3511347}, journal = {IEEE Transactions on Parallel and Distributed Systems}, number = {}, pages = {1-13}, volume = {}, year = {2024} } |
Nov 2024 | Establishing a High-Per. and Productive Ecosystem for Dist. Execution of Python Functions Using Globus Compute |
TLDR | PDF | Authors | Website | Code | Slides | BibTex | HUST @ SC 2024 | |
TLDR: The research computing ecosystem is increasingly heterogeneous and diverse. Democratizing access to these essential resources is critical for accelerating research progress. However, the gap between a high-level workload, such as Python in a Jupyter notebook, and the resources and interfaces exposed by HPC systems is significant. Users must securely authenticate, manage network connections, deploy and manage software, provision and configure nodes, and manage workload execution. Globus Compute reduces these barriers by providing a managed, fire-and-forget model that enables execution of Python functions across any resource to which a user has access. However, while Globus Compute has relieved users from many of the challenges of remote computing, we have observed some inefficiencies that remain in terms of use. For example, many users wrap external applications, such as C/C++, Fortran, and even MPI applications, in Python functions and users must deploy many endpoints on a single computer to exploit different configurations. We describe enhancements to Globus Compute to address these barriers: an asynchronous, future-based executor interface for submitting and monitoring tasks, shell and MPI-based function types, and a multi-user endpoint that can be deployed by administrators and used by authorized users.
@inproceedings{ananthakrishnan2024compute, title = {Establishing a {H}igh-{P}erformance and {P}roductive {E}cosystem for {D}istributed {E}xecution of {P}ython {F}unctions {U}sing {G}lobus {C}ompute}, author = {Rachana Ananthakrishnan and Yadu Babuji and Josh Bryan and Kyle Chard and Ryan Chard and Ben Clifford and Ian Foster and Lev Gorenstein and Kevin Hunter Kesling and Chris Janidlo and Daniel Katz and Reid Mello and J. Gregory Pauloski and Lei Wang}, booktitle = {IEEE/ACM International Workshop on HPC User Support Tools (HUST)}, doi = {10.1109/SCW63240.2024.00083}, year = {2024} } |
Oct 2024 | Accelerating Python Applications with Dask and ProxyStore |
TLDR | PDF | Authors | Code | Slides | Preprint | BibTex | arXiv Preprint & HPPSS 2024 Demo | |
TLDR: Applications are increasingly written as dynamic workflows underpinned by an execution framework that manages asynchronous computations across distributed hardware. However, execution frameworks typically offer one-size-fits-all solutions for data flow management, which can restrict performance and scalability. ProxyStore, a middleware layer that optimizes data flow via an advanced pass-by-reference paradigm, has been shown to be an effective mechanism for addressing these limitations. Here, we investigate integrating ProxyStore with Dask Distributed, one of the most popular libraries for distributed computing in Python, with the goal of supporting scalable and portable scientific workflows. Dask provides an easy-to-use and flexible framework, but is less optimized for scaling certain data-intensive workflows. We investigate these limitations and detail the technical contributions necessary to develop a robust solution for distributed applications and demonstrate improved performance on synthetic benchmarks and real applications.
@misc{pauloski2024accelerating, title = {Accelerating {P}ython {A}pplications with {D}ask and {ProxyStore}}, author = {J. Gregory Pauloski and Klaudiusz Rydzy and Valerie Hayot-Sasson and Ian Foster and Kyle Chard}, archiveprefix = {arXiv}, eprint = {2410.12092}, primaryclass = {cs.DC}, url = {https://arxiv.org/abs/2410.12092}, year = {2024} } |
Sep 2024 | An Empirical Investigation of Container Building [...] to Reduce Cold Starts in Sci. Computing Serverless Functions |
TLDR | PDF | Authors | Publication | BibTex | eScience 2024 | |
TLDR: Serverless platforms dynamically create execution environments, often using containers. The cost to create and deploy these environments is known as "cold start" latency, and this cost can be particularly detrimental to scientific computing workloads characterized by sporadic and dynamic demands. We investigate methods to mitigate cold start issues in scientific computing applications by pre-installing Python packages in container images. Using data from Globus Compute and Binder, we empirically analyze cold start behavior and evaluate four strategies for building containers, including fully pre-built environments and dynamic, on-demand installations. Our results show that pre-installing all packages reduces initial cold start time but requires significant storage. Conversely, dynamic installation offers lower storage requirements but incurs repetitive delays. Additionally, we implemented a simulator and assessed the impact of different warm times, finding that moderate warm times significantly reduce cold starts without the excessive overhead of maintaining always-hot states.
@inproceedings{bauer2024containers, title = {An {E}mpirical {I}nvestigation of {C}ontainer {B}uilding {S}trategies and {W}arm {T}imes to {R}educe {C}old {S}tarts in {S}cientific {C}omputing {S}erverless {F}unctions}, author = {Bauer, André and Gonthier, Maxime and Pan, Haochen and Chard, Ryan and Grzenda, Daniel and Straesser, Martin and Pauloski, J. Gregory and Kamatar, Alok and Baughman, Matt and Hudson, Nathaniel and Foster, Ian and Chard, Kyle}, booktitle = {IEEE 20th International Conference on e-Science (e-Science)}, doi = {10.1109/e-Science62913.2024.10678668}, number = {}, pages = {1-10}, volume = {}, year = {2024} } |
Sep 2024 | TaPS: A Performance Evaluation Suite for Task-based Execution Frameworks |
TLDR | PDF | Authors | Website | Code | Slides | Publication | BibTex | eScience 2024 — Best Paper | |
TLDR: Task-based execution frameworks, such as parallel programming libraries, computational workflow systems, and function-as-a-service platforms, enable the composition of distinct tasks into a single, unified application designed to achieve a computational goal. Research into these task executors has accelerated as computational sciences increasingly need to take advantage of parallel compute and/or heterogeneous hardware. However, the lack of evaluation standards makes it challenging to compare and contrast novel systems against existing implementations. Here, we introduce TaPS, the Task Performance Suite, to support continued research in parallel task executor frameworks. TaPS provides (1) a unified, modular interface for writing and evaluating applications using arbitrary execution frameworks and data management systems and (2) an initial set of reference synthetic and real-world science applications.
@inproceedings{pauloski2024taps, title = {{TaPS}: {A} {P}erformance {E}valuation {S}uite for {T}ask-based {E}xecution {F}rameworks}, author = {Pauloski, J. Gregory and Hayot-Sasson, Valerie and Gonthier, Maxime and Hudson, Nathaniel and Pan, Haochen and Zhou, Sicheng and Foster, Ian and Chard, Kyle}, address = {New York, NY, USA}, booktitle = {IEEE 20th International Conference on e-Science}, doi = {10.1109/e-Science62913.2024.10678702}, pages = {1-10}, publisher = {IEEE}, year = {2024} } |
Nov 2023 | Accelerating Communications in Federated Applications with Transparent Object Proxies |
TLDR | PDF | Authors | Website | Code | Poster | Slides | Publication | BibTex | SC 2023 | |
TLDR: We describe ProxyStore, a system that decouples control flow from data flow by extending the pass-by-reference model to distributed applications using object proxies that act as wide-area object references with just-in-time resolution. This proxy model enables data producers to communicate data unilaterally, transparently, and efficiently to both local and remote consumers. We demonstrate the benefits of this model with synthetic benchmarks and real-world scientific applications, running across various computing platforms.
@inproceedings{pauloski2023proxystore, title = {Accelerating {C}ommunications in {F}ederated {A}pplications with {T}ransparent {O}bject {P}roxies}, author = {Pauloski, J. Gregory and Hayot-Sasson, Valerie and Ward, Logan and Hudson, Nathaniel and Sabino, Charlie and Baughman, Matt and Chard, Kyle and Foster, Ian}, address = {New York, NY, USA}, articleno = {59}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, doi = {10.1145/3581784.3607047}, isbn = {9798400701092}, location = {Denver, CO, USA}, numpages = {15}, publisher = {Association for Computing Machinery}, series = {SC '23}, url = {https://doi.org/10.1145/3581784.3607047}, year = {2023} } |
Sep 2024 | Flight: A FaaS-Based Framework for Complex and Hierarchical Federated Learning |
TLDR | PDF | Authors | Code | Preprint | BibTex | arXiv Preprint | |
TLDR: Federated Learning (FL) is a decentralized machine learning paradigm where models are trained on distributed devices and are aggregated at a central server. Existing FL frameworks assume simple two-tier network topologies where end devices are directly connected to the aggregation server. While this is a practical mental model, it does not exploit the inherent topology of real-world distributed systems like the Internet-of-Things. We present Flight, a novel FL framework that supports complex hierarchical multi-tier topologies and asynchronous aggregation, and that decouples the control plane from the data plane. We compare the performance of Flight against Flower, a state-of-the-art FL framework. Our results show that Flight scales beyond Flower, supporting up to 2048 simultaneous devices, and reduces FL makespan across several models. Finally, we show that Flight's hierarchical FL model can reduce communication overheads by more than 60%.
@misc{hudson2024flight, title = {Flight: {A} {FaaS}-{B}ased {F}ramework for {C}omplex and {H}ierarchical {F}ederated {L}earning}, author = {Nathaniel Hudson and Valerie Hayot-Sasson and Yadu Babuji and Matt Baughman and J. Gregory Pauloski and Ryan Chard and Ian Foster and Kyle Chard}, archiveprefix = {arXiv}, eprint = {2409.16495}, primaryclass = {cs.LG}, url = {https://arxiv.org/abs/2409.16495}, year = {2024} } |
Dec 2023 | Trillion Parameter AI Serving Infrastructure for Scientific Discovery: A Survey and Vision |
TLDR | PDF | Authors | Publication | BibTex | BDCAT 2023 | |
TLDR: Deep learning methods are transforming research, enabling new techniques, and ultimately leading to new discoveries. As the demand for more capable AI models continues to grow, we are now entering an era of Trillion Parameter Models (TPM), or models with more than a trillion parameters, such as Huawei's PanGu-Σ. We describe a vision for the ecosystem of TPM users and providers that caters to the specific needs of the scientific community. We then outline the significant technical challenges and open problems in system design for serving TPMs to enable scientific research and discovery. Specifically, we describe the requirements of a comprehensive software stack and interfaces to support the diverse and flexible requirements of researchers.
@inproceedings{hudson2023trillion, title = {Trillion {P}arameter {AI} {S}erving {I}nfrastructure for {S}cientific {D}iscovery: {A} {S}urvey and {V}ision}, author = {Hudson, Nathaniel C and Pauloski, J. Gregory and Baughman, Matt and Kamatar, Alok and Sakarvadia, Mansi and Ward, Logan and Chard, Ryan and Bauer, Andr\'{e} and Levental, Maksim and Wang, Wenyi and Engler, Will and Price Skelly, Owen and Blaiszik, Ben and Stevens, Rick and Chard, Kyle and Foster, Ian}, address = {New York, NY, USA}, articleno = {15}, booktitle = {Proceedings of the IEEE/ACM 10th International Conference on Big Data Computing, Applications and Technologies}, doi = {10.1145/3632366.3632396}, isbn = {9798400704734}, location = {Taormina (Messina), Italy}, numpages = {10}, publisher = {Association for Computing Machinery}, series = {BDCAT '23}, url = {https://doi.org/10.1145/3632366.3632396}, year = {2024} } |
Mar 2022 | Deep Neural Network Training With Distributed K-FAC |
TLDR | PDF | Authors | Code | Publication | BibTex | TPDS 2022 | |
TLDR: We extend our SC 2020 paper to evaluate the convergence and scaling properties of our K-FAC gradient preconditioner for image classification, object detection, and language modeling applications. In all applications, our implementation converges to baseline performance targets in 9–25% less time than the standard first-order optimizers on GPU clusters across a variety of scales.
@article{pauloski2022kfac, title = {Deep {N}eural {N}etwork {T}raining {W}ith {D}istributed {K}-{FAC}}, author = {Pauloski, J. Gregory and Huang, Lei and Xu, Weijia and Chard, Kyle and Foster, Ian T. and Zhang, Zhao}, doi = {10.1109/TPDS.2022.3161187}, journal = {IEEE Transactions on Parallel and Distributed Systems}, number = {12}, pages = {3616-3627}, volume = {33}, year = {2022} } |
Nov 2021 | KAISA: An Adaptive Second-Order Optimizer Framework for Deep Neural Networks |
TLDR | PDF | Authors | Code | Slides | Publication | BibTex | SC 2021 | |
TLDR: We present KAISA, a K-FAC-enabled, Adaptable, Improved, and ScAlable second-order optimizer framework that adapts the memory footprint, communication, and computation given specific models and hardware to improve performance and increase scalability. Compared to the original optimizers, KAISA converges 18.1–36.3% faster across applications with the same global batch size.
@inproceedings{pauloski2021kaisa, title = {{KAISA}: {A}n {A}daptive {S}econd-{O}rder {O}ptimizer {F}ramework for {D}eep {N}eural {N}etworks}, author = {Pauloski, J. Gregory and Huang, Qi and Huang, Lei and Venkataraman, Shivaram and Chard, Kyle and Foster, Ian and Zhang, Zhao}, address = {New York, NY, USA}, articleno = {13}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, doi = {10.1145/3458817.3476152}, isbn = {9781450384421}, location = {St. Louis, Missouri}, numpages = {14}, publisher = {Association for Computing Machinery}, series = {SC '21}, url = {https://doi.org/10.1145/3458817.3476152}, year = {2021} } |
Nov 2020 | Convolutional Neural Network Training with Distributed K-FAC |
TLDR | PDF | Authors | Code | Slides | Publication | BibTex | SC 2020 | |
TLDR: We study optimization techniques such as layer-wise distribution strategies, inverse-free second-order gradient evaluation, and dynamic K-FAC update decoupling to reduce training time while preserving convergence. Our distributed optimizer design trains ResNet-50 18–25% faster than SGD.
@inproceedings{pauloski2020kfac, title = {Convolutional {N}eural {N}etwork {T}raining with {D}istributed {K}-{FAC}}, author = {Pauloski, J. Gregory and Zhang, Zhao and Huang, Lei and Xu, Weijia and Foster, Ian T.}, articleno = {94}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, doi = {10.5555/3433701.3433826}, isbn = {9781728199986}, location = {Atlanta, Georgia}, numpages = {14}, publisher = {IEEE Press}, series = {SC '20}, year = {2020} } |
May 2020 | Efficient I/O for Neural Network Training with Compressed Data |
TLDR | PDF | Authors | Code | Publication | BibTex | IPDPS 2020 | |
TLDR: We investigate the tradeoff between runtime overhead and data compression ratio on real-world deep learning training datasets and applications. We show that storage can be reduced by 2–13x with minimal additional runtime overhead.
@inproceedings{zhang2020compressed, title = {Efficient {I/O} for {N}eural {N}etwork {T}raining with {C}ompressed {D}ata}, author = {Z. {Zhang} and L. {Huang} and J. G. {Pauloski} and I. T. {Foster}}, booktitle = {IEEE International Parallel and Distributed Processing Symposium (IPDPS)}, doi = {10.1109/IPDPS47924.2020.00050}, number = {}, pages = {409-418}, volume = {}, year = {2020} } |
Dec 2019 | Aggregating Local Storage for Scalable Deep Learning I/O |
TLDR | PDF | Authors | Code | Publication | BibTex | DLS 2019 | |
TLDR: We develop a user-level transient object store that provides low-latency, POSIX-compliant file access for scalable deep learning training.
@inproceedings{zhang2019aggregating, title = {Aggregating {L}ocal {S}torage for {S}calable {D}eep {L}earning {I/O}}, author = {Z. {Zhang} and L. {Huang} and J. G. {Pauloski} and I. {Foster}}, booktitle = {IEEE/ACM Third Workshop on Deep Learning on Supercomputers (DLS)}, doi = {10.1109/DLS49591.2019.00014}, number = {}, pages = {69-75}, volume = {}, year = {2019} } |
Oct 2024 | Employing Artificial Intelligence to Steer Exascale Workflows with Colmena |
TLDR | PDF | Authors | Website | Code | Publication | BibTex | IJHPCA 2024 | |
TLDR: We created Colmena to leverage the massive parallelism of a supercomputer by using Artificial Intelligence (AI) to learn from and adapt a workflow as it executes. Colmena allows scientists to define how their application should respond to events (e.g., task completion) as a series of cooperative agents. In this paper, we describe the design of Colmena, the challenges we overcame while deploying applications on exascale systems, and the science workflows we have enhanced through interweaving AI.
@article{ward2024colmena, title = {Employing {A}rtificial {I}ntelligence to {S}teer {E}xascale {W}orkflows with {C}olmena}, author = {Logan Ward and J. Gregory Pauloski and Valerie Hayot-Sasson and Yadu Babuji and Alexander Brace and Ryan Chard and Kyle Chard and Rajeev Thakur and Ian Foster}, doi = {10.1177/10943420241288242}, eprint = {https://doi.org/10.1177/10943420241288242}, journal = {The International Journal of High Performance Computing Applications}, number = {0}, pages = {10943420241288242}, url = {https://doi.org/10.1177/10943420241288242}, volume = {0}, year = {2024} } |
Nov 2023 | GenSLMs: Genome-scale Language Models Reveal SARS-CoV-2 Evolutionary Dynamics |
TLDR | PDF | Authors | Code | Publication | BibTex | IJHPCA — ACM Gordon Bell Special Prize for COVID-19 Research | |
TLDR: We build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pretraining on over 110 million prokaryotic gene sequences, and then finetuning a SARS-CoV-2 specific model on 1.5 million genomes, we show that GenSLM can accurately and rapidly identify variants of concern.
|
|
@article{zvyagin2023genslms, title = {{GenSLMs}: {G}enome-scale language models reveal {SARS}-{CoV}-2 evolutionary dynamics}, author = {Zvyagin, Maxim and Brace, Alexander and Hippe, Kyle and Deng, Yuntian and Zhang, Bin and Bohorquez, Cindy Orozco and Clyde, Austin and Kale, Bharat and Perez-Rivera, Danilo and Ma, Heng and others}, journal = {The International Journal of High Performance Computing Applications}, number = {6}, pages = {683--705}, publisher = {SAGE Publications Sage UK: London, England}, volume = {37}, year = {2023} } |
|
Nov 2023 | DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies link |
TLDR | PDF | Authors | Website | Preprint | BibTex | arXiv Preprint | |
TLDR: We present the DeepSpeed4Science initiative, which aims to build unique capabilities through AI system technology innovations to help domain experts unlock today's biggest science mysteries. By leveraging DeepSpeed's current technology pillars (training, inference, and compression) as base technology enablers, DeepSpeed4Science will create a new set of AI system technologies tailored for accelerating scientific discoveries by addressing their unique complexity beyond the common technical approaches used for accelerating generic large language models.
|
|
@misc{song2023deepspeed4science, title = {{DeepSpeed4Science} {I}nitiative: {E}nabling {L}arge-{S}cale {S}cientific {D}iscovery through {S}ophisticated {AI} {S}ystem {T}echnologies}, author = {Shuaiwen Leon Song and Bonnie Kruft and Minjia Zhang and Conglong Li and Shiyang Chen and Chengming Zhang and Masahiro Tanaka and Xiaoxia Wu and Jeff Rasley and Ammar Ahmad Awan and Connor Holmes and Martin Cai and Adam Ghanem and Zhongzhu Zhou and Yuxiong He and Pete Luferenko and Divya Kumar and Jonathan Weyn and Ruixiong Zhang and Sylwester Klocek and Volodymyr Vragov and Mohammed AlQuraishi and Gustaf Ahdritz and Christina Floristean and Cristina Negri and Rao Kotamarthi and Venkatram Vishwanath and Arvind Ramanathan and Sam Foreman and Kyle Hippe and Troy Arcomano and Romit Maulik and Maxim Zvyagin and Alexander Brace and Bin Zhang and Cindy Orozco Bohorquez and Austin Clyde and Bharat Kale and Danilo Perez-Rivera and Heng Ma and Carla M. Mann and Michael Irvin and J. Gregory Pauloski and Logan Ward and Valerie Hayot and Murali Emani and Zhen Xie and Diangen Lin and Maulik Shukla and Ian Foster and James J. Davis and Michael E. Papka and Thomas Brettin and Prasanna Balaprakash and Gina Tourassi and John Gounley and Heidi Hanson and Thomas E Potok and Massimiliano Lupo Pasini and Kate Evans and Dan Lu and Dalton Lunga and Junqi Yin and Sajal Dash and Feiyi Wang and Mallikarjun Shankar and Isaac Lyngaas and Xiao Wang and Guojing Cong and Pei Zhang and Ming Fan and Siyan Liu and Adolfy Hoisie and Shinjae Yoo and Yihui Ren and William Tang and Kyle Felker and Alexey Svyatkovskiy and Hang Liu and Ashwin Aji and Angela Dalton and Michael Schulte and Karl Schulz and Yuntian Deng and Weili Nie and Josh Romero and Christian Dallago and Arash Vahdat and Chaowei Xiao and Thomas Gibbs and Anima Anandkumar and Rick Stevens}, archiveprefix = {arXiv}, eprint = {2310.04610}, primaryclass = {cs.AI}, year = {2023} } |
|
May 2023 | The Diminishing Returns of Masked Language Models to Science link |
TLDR | PDF | Authors | Website | Publication | BibTex | Findings of the Association for Computational Linguistics: ACL 2023 | |
TLDR: We use 14 domain-specific transformer-based models (including ScholarBERT, a new 770M-parameter science-focused masked language model pretrained on up to 225B tokens) to evaluate the impact of training data, model size, and pretraining and finetuning time on 12 downstream scientific tasks. Interestingly, we find that increasing model size, training data, or compute time does not always lead to measurable improvements for scientific information extraction tasks.
|
|
@inproceedings{hong2023scholarbert, title = {The {D}iminishing {R}eturns of {M}asked {L}anguage {M}odels to {S}cience}, author = {Hong, Zhi and Ajith, Aswathy and Pauloski, J. Gregory and Duede, Eamon and Chard, Kyle and Foster, Ian}, address = {Toronto, Canada}, booktitle = {Findings of the Association for Computational Linguistics: ACL 2023}, doi = {10.18653/v1/2023.findings-acl.82}, editor = {Rogers, Anna and Boyd-Graber, Jordan and Okazaki, Naoaki}, month = {July}, pages = {1270--1283}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2023.findings-acl.82}, year = {2023} } |
|
Mar 2023 | Cloud Services Enable Efficient AI-Guided Simulation Workflows across Heterogeneous Resources link |
TLDR | PDF | Authors | Code | Publication | BibTex | HCW @ IPDPS 2023 | |
TLDR: We describe our experiences in building and deploying AI-driven workflows across multiple computing sites without networking hassles and without losing performance using Colmena, Globus, FuncX, and ProxyStore.
|
|
@inproceedings{ward2023colmena, title = {Cloud {S}ervices {E}nable {E}fficient {AI}-{G}uided {S}imulation {W}orkflows across {H}eterogeneous {R}esources}, author = {Ward, Logan and Pauloski, J. Gregory and Hayot-Sasson, Valerie and Chard, Ryan and Babuji, Yadu and Sivaraman, Ganesh and Choudhury, Sutanay and Chard, Kyle and Thakur, Rajeev and Foster, Ian}, booktitle = {IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)}, doi = {10.1109/IPDPSW59300.2023.00018}, number = {}, pages = {32-41}, volume = {}, year = {2023} } |
|
Nov 2021 | Colmena: Scalable Machine-Learning-Based Steering of Ensemble Simulations for High Performance Computing link |
TLDR | PDF | Authors | Website | Code | Publication | BibTex | MLHPC @ SC 2021 | |
TLDR: We present Colmena, an open-source Python framework that allows users to steer massive computational campaigns by providing just the implementations of individual tasks plus the logic used to choose which tasks to execute when. We describe the design of Colmena and illustrate its capabilities by applying it to electrolyte design, where it both scales to 65536 CPUs and accelerates the discovery rate for high-performance molecules by a factor of 100 over unguided searches.
|
|
@inproceedings{ward2021colmena, title = {Colmena: {S}calable {M}achine-{L}earning-{B}ased {S}teering of {E}nsemble {S}imulations for {H}igh {P}erformance {C}omputing}, author = {Ward, Logan and Sivaraman, Ganesh and Pauloski, J. Gregory and Babuji, Yadu and Chard, Ryan and Dandu, Naveen and Redfern, Paul C. and Assary, Rajeev S. and Chard, Kyle and Curtiss, Larry A. and Thakur, Rajeev and Foster, Ian}, booktitle = {IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC)}, doi = {10.1109/MLHPC54614.2021.00007}, number = {}, pages = {9-20}, volume = {}, year = {2021} } |
|
Aug 2021 | Models and Processes to Extract Drug-like Molecules From Natural Language Text link |
TLDR | PDF | Authors | Publication | BibTex | Frontiers in Molecular Biosciences | |
TLDR: We present (1) an iterative model-in-the-loop method that makes judicious use of scarce human expertise in generating training data for an NER model and (2) the application and evaluation of this method for identifying drug-like molecules in the COVID-19 Open Research Dataset Challenge (CORD-19) corpus of 198,875 papers.
|
|
@article{hong2021moleculesnlp, title = {Models and {P}rocesses to {E}xtract {D}rug-like {M}olecules {F}rom {N}atural {L}anguage {T}ext}, author = {Hong, Zhi and Pauloski, J. Gregory and Ward, Logan and Chard, Kyle and Blaiszik, Ben and Foster, Ian}, doi = {10.3389/fmolb.2021.636077}, issn = {2296-889X}, journal = {Frontiers in Molecular Biosciences}, pages = {826}, url = {https://www.frontiersin.org/article/10.3389/fmolb.2021.636077}, volume = {8}, year = {2021} } |
|
Nov 2018 | Glioma Segmentation and a Simple Accurate Model for Overall Survival Prediction link |
TLDR | PDF | Authors | Publication | BibTex | BrainLes 2018 | |
TLDR: We develop a multi-stage pipeline for accurate patient survival prediction from brain tumor MRI scans. We segment tumor subvolumes using a multi-scale convolutional network, extract intensity and shape features, then use an ensemble of machine learning models to predict patient outcomes.
|
|
@inproceedings{gates2019glioma, title = {Glioma {S}egmentation and a {S}imple {A}ccurate {M}odel for {O}verall {S}urvival {P}rediction}, author = {Gates, Evan and Pauloski, J. Gregory and Schellingerhout, Dawid and Fuentes, David}, address = {Cham}, booktitle = {Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries}, editor = {Crimi, Alessandro and Bakas, Spyridon and Kuijf, Hugo and Keyvan, Farahani and Reyes, Mauricio and van Walsum, Theo}, isbn = {978-3-030-11726-9}, pages = {476--484}, publisher = {Springer International Publishing}, year = {2019} } |
co_present PRESENTATIONS link
Ordered by most recent.
Nov 2024 | Distributed Execution of Python Functions Using Globus Compute link |
Slides | SC24 Workshop on HPC User Support Tools | |
Nov 2024 | Accelerating Python Applications with Dask and ProxyStore link |
Slides | SC24 Workshop on High Performance Python for Science at Scale | |
Nov 2024 | Accelerating Communications in High-Performance Scientific Workflows link |
Slides | Poster | Doctoral Showcase at Supercomputing 2024 | |
Sep 2024 | TaPS: A Performance Evaluation Suite for Task-based Execution Frameworks link |
Slides | Video | ParslFest | |
Sep 2024 | TaPS: A Performance Evaluation Suite for Task-based Execution Frameworks link |
Slides | IEEE International Conference on eScience (eScience) | |
Nov 2023 | Accelerating Communications in Federated Applications with Transparent Object Proxies link |
Slides | Supercomputing 2023 | |
Oct 2023 | ProxyStore: Decoupling Control and Data Flow in Workflows link |
Slides | Video | ParslFest | |
Apr 2023 | Accelerating Communications in Federated Applications with Transparent Object Proxies link |
Poster | Greater Chicago Area Systems Research Workshop (GCASR) | |
Sep 2022 | ProxyStore: a Data Fabric for Parsl and FuncX link |
Slides | Video | ParslFest | |
Nov 2021 | KAISA: An Adaptive Second-Order Optimizer Framework for Deep Neural Networks link |
Slides | Supercomputing 2021 | |
Nov 2020 | Convolutional Neural Network Training with Distributed K-FAC link |
Slides | Supercomputing 2020 | |
Sep 2018 | Optimizing Deep Learning Methods for Image Segmentation with Distributed Training link |
Poster | TACC Symposium for Texas Researchers (TACCSTER) |