Access to published material may be restricted.
(This document is open access)Abstract: Introduction: Open-source software (OSS) is a critical component of open science, but contributions to the OSS ecosystem are systematically undervalued in the current academic system. The Journal of Open Source Software (JOSS) contributes to addressing this by providing a venue (that is itself free, diamond open access, and all open-source, built in a layered structure using widely available elements/services of the scholarly publishing ecosystem) for publishing OSS, run in the style of OSS itself. A particularly distinctive element of JOSS is that it uses open peer review in a collaborative, iterative format, unlike most publishers. Additionally, all the components of the process—from the reviews to the papers to the software that is the subject of the papers to the software that the journal runs—are open.
Background: We describe JOSS’s history and its peer review process using an editorial bot, and we present statistics gathered from JOSS’s public review history on GitHub showing an increasing number of peer reviewed papers each year. We discuss the new JOSSCast and use it as a data source to understand reasons why interviewed authors decided to publish in JOSS.
Discussion and Outlook: JOSS’s process differs significantly from traditional journals, which has impeded JOSS's inclusion in indexing services such as Web of Science. In turn, this discourages researchers within certain academic systems, such as Italy's, which emphasize the importance of Web of Science and/or Scopus indexing for grant applications and promotions. JOSS is a fully diamond open-access journal with a cost of around US$5 per paper for the 401 papers published in 2023. The scalability of running JOSS with volunteers and financing JOSS with grants and donations is discussed.
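As a rough sense of scale, the two figures quoted above can be multiplied to give an implied total for 2023. This is only a back-of-the-envelope sketch: the abstract says "around" US$5, so the result is approximate, and the variable names are illustrative.

```python
# Figures taken from the abstract above; "around US$5" is approximate.
cost_per_paper_usd = 5
papers_published_2023 = 401

# Implied approximate total for the papers published in 2023.
approx_total_usd = cost_per_paper_usd * papers_published_2023
print(f"Approximate 2023 total: US${approx_total_usd:,}")
```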
(This document is open access)Abstract: Research software for simulating Earth processes enables the estimation of past, current, and future world states and guides policy. However, this modelling software is often developed by scientists with limited training, time, and funding, leading to software that is hard to understand, (re)use, modify, and maintain and that is, in this sense, non-sustainable. Here we evaluate the sustainability of global-scale impact models across 10 research fields. We use nine sustainability indicators for our assessment. Five of these indicators – documentation, version control, open-source license, provision of software in containers, and the number of active developers – are related to best practices in software engineering and characterize overall software sustainability. The remaining four – comment density, modularity, automated testing, and adherence to coding standards – contribute to code quality, an important factor in software sustainability. We found that 29% (32 out of 112) of the global impact models (GIMs) participating in the Inter-Sectoral Impact Model Intercomparison Project were accessible without contacting the developers. Regarding best practices in software engineering, 75% of the 32 GIMs have some kind of documentation, 81% use version control, and 69% have an open-source license. Only 16% provide the software in a containerized form, which can potentially limit result reproducibility. Four models had no active development after 2020. Regarding code quality, we found that models suffer from low code quality, which impedes model improvement, maintenance, reusability, and reliability. Key issues include a non-optimal comment density in 75% of the GIMs, insufficient modularity in 88% of the GIMs, and the absence of a testing suite in 72% of the GIMs. 
Furthermore, only 5 out of 10 models for which the source code, either in part or in its entirety, is written in Python show good compliance with PEP8 coding standards, with the rest showing low compliance. To improve the sustainability of GIMs and other research software, we recommend best practices for sustainable software development to the scientific community. As an example of implementing these best practices, we show how reprogramming a legacy model using best practices has improved software sustainability.
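The percentages in this abstract can be related back to model counts. The sketch below recomputes the 29% share and backs out the nearest whole number of models for each percentage reported for the 32 accessible GIMs. The counts are inferred here by rounding and are not stated in the abstract; the dictionary keys are shorthand labels chosen for illustration.

```python
# Numbers quoted in the abstract above.
total_models = 112
accessible_models = 32

# Share of ISIMIP global impact models accessible without contacting developers.
accessible_share = round(100 * accessible_models / total_models)

# Percentages reported for the 32 accessible models.
reported_percentages = {
    "documentation": 75,
    "version control": 81,
    "open-source license": 69,
    "containerized": 16,
    "non-optimal comment density": 75,
    "insufficient modularity": 88,
    "no testing suite": 72,
}

# Back out the nearest whole number of models for each percentage.
# These counts are inferred by rounding, not taken from the abstract.
implied_counts = {
    name: round(pct * accessible_models / 100)
    for name, pct in reported_percentages.items()
}

print(f"accessible share: {accessible_share}%")
for name, count in implied_counts.items():
    print(f"{name}: about {count} of {accessible_models} models")
```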
(This document is open access)Abstract:
Research software is increasingly recognized as critical infrastructure in contemporary science. It spans a broad spectrum, including source code files, algorithms, scripts, computational workflows, and executables, all created for or during research. While research funders have developed programs, initiatives, and policies to bolster research software’s role, there has been no empirical study of how these funders prioritize support for research software. Understanding their priorities is essential to clarify where current support is concentrated and to identify strategic gaps.
We conducted an online mixed methods survey of international research funders (n=36) to explore their priorities in supporting research software. The survey gathered data on the specific outcomes funders emphasize in their programs and initiatives for research software.
The survey revealed that funders place strong emphasis on developing skills, promoting software sustainability, embedding open science practices, building community and collaboration, advancing research software funding mechanisms, increasing software visibility and use, fostering innovation, and ensuring security.
The findings highlight opportunities to enhance research software’s role through increased funder attention on professional recognition for software contributions and the non-technical, social aspects of research software sustainability. Addressing these areas could lead to more effective support and development of research software, ultimately benefitting the entire research ecosystem.
(This document is not yet open access)Abstract: The Common Workflow Language (CWL) is a widely adopted language for defining and sharing computational workflows. It is designed to be independent of the execution engine on which workflows are executed. In this paper, we describe our experiences integrating CWL with Parsl, a Python-based parallel programming library designed to manage execution of workflows across diverse computing environments. We propose a new method that converts CWL CommandLineTool definitions into Parsl apps, enabling Parsl scripts to easily import and use tools represented in CWL. We describe a Parsl runner that is capable of executing a CWL CommandLineTool directly. We also describe a proof-of-concept extension to support inline Python in a CWL workflow definition, enabling seamless use in Parsl’s Python ecosystem. We demonstrate the benefits of this integration by presenting example CWL CommandLineTool definitions that show how they can be used in Parsl, and comparing performance of executing an image processing workflow using the Parsl integration and other CWL runners.
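To give a feel for what converting a CWL CommandLineTool into a Python callable can involve, here is a minimal, dependency-free sketch. It is not the method from the paper and does not use Parsl. The tool definition (a wrapper around `wc`), the function name `tool_to_command_builder`, and the simplification of treating a `File` input as a plain path string are all assumptions made for illustration. Only the standard CWL fields `baseCommand`, `inputs`, and `inputBinding` (`position`, `prefix`) are used.

```python
from typing import Any, Callable, Dict, List

# A CWL CommandLineTool, written as the dict a YAML parser would produce.
# The tool itself is a made-up example.
WC_TOOL: Dict[str, Any] = {
    "cwlVersion": "v1.2",
    "class": "CommandLineTool",
    "baseCommand": ["wc"],
    "inputs": {
        "count_lines": {
            "type": "boolean",
            "inputBinding": {"position": 1, "prefix": "-l"},
        },
        "text_file": {"type": "File", "inputBinding": {"position": 2}},
    },
    "outputs": {},
}


def tool_to_command_builder(tool: Dict[str, Any]) -> Callable[..., List[str]]:
    """Return a function that turns keyword arguments into an argv list."""
    base = tool["baseCommand"]
    base = [base] if isinstance(base, str) else list(base)
    # Keep only inputs that appear on the command line, ordered by position.
    bound = sorted(
        (
            (name, spec["inputBinding"])
            for name, spec in tool["inputs"].items()
            if "inputBinding" in spec
        ),
        key=lambda item: item[1].get("position", 0),
    )

    def build(**kwargs: Any) -> List[str]:
        argv = list(base)
        for name, binding in bound:
            value = kwargs.get(name)
            if value is None or value is False:
                continue  # missing or false-valued inputs add nothing
            prefix = binding.get("prefix")
            if prefix:
                argv.append(prefix)
            if value is not True:  # a true boolean contributes only its prefix
                argv.append(str(value))
        return argv

    return build


wc = tool_to_command_builder(WC_TOOL)
print(wc(count_lines=True, text_file="notes.txt"))
```

The resulting list could then be handed to whatever actually runs commands; that step is left out here to keep the sketch self-contained.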
(This document is not yet open access)Abstract: The research computing ecosystem is increasingly heterogeneous and diverse. Democratizing access to these essential resources is critical for accelerating research progress. However, the gap between a high-level workload, such as Python in a Jupyter notebook, and the resources and interfaces exposed by HPC systems is significant. Users must securely authenticate, manage network connections, deploy and manage software, provision and configure nodes, and manage workload execution. Globus Compute reduces these barriers by providing a managed, fire-and-forget model that enables execution of Python functions across any resource to which a user has access. However, while Globus Compute has relieved users from many of the challenges of remote computing, we have observed some inefficiencies that remain in terms of use. For example, many users wrap external applications, such as C/C++, Fortran, and even MPI applications, in Python functions and users must deploy many endpoints on a single computer to exploit different configurations. In this paper we describe enhancements to Globus Compute to address these barriers: an asynchronous, future-based executor interface for submitting and monitoring tasks, shell and MPI-based function types, and a multi-user endpoint that can be deployed by administrators and used by authorized users.
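The asynchronous, future-based submission pattern mentioned above can be illustrated with Python's standard library. In this sketch `ThreadPoolExecutor` is only a local stand-in for a remote executor, and `simulate` is a placeholder function; nothing here calls Globus Compute itself.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def simulate(step: int) -> int:
    """Placeholder for a function that would run on a remote resource."""
    return step * step


# ThreadPoolExecutor is a local stand-in for a remote executor.
with ThreadPoolExecutor(max_workers=4) as executor:
    # submit() returns immediately with a future for each task.
    futures = {executor.submit(simulate, n): n for n in range(5)}
    results = {}
    # Collect results as tasks finish, in whatever order they complete.
    for future in as_completed(futures):
        results[futures[future]] = future.result()

print(sorted(results.items()))
```

The point of the pattern is that submission and result collection are decoupled: the caller keeps a handle to each task and decides when, and in what order, to wait.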
(This document is open access)Abstract: Globus Compute implements a hybrid Function as a Service (FaaS) model in which a single cloud-hosted service is used by users to manage execution of Python functions on user-owned and -managed Globus Compute endpoints deployed on arbitrary compute resources. Here we describe a new multi-user and multi-configuration Globus Compute endpoint. This system, which can be deployed by administrators in a privileged account, enables dynamic creation of user endpoints that are forked as new processes in user space. The multi-user endpoint is designed to provide the security interfaces necessary for deployment on large, shared HPC clusters by, for example, restricting user endpoint configurations, enforcing various authorization policies, and via customizable identity-username mapping.
(This document is green open access)Abstract: This is a virtual dialog between Jeffrey C. Carver and Daniel S. Katz on how people learn programming languages. It's based on a talk Jeff gave at the first US-RSE Conference (US-RSE'23), which led Dan to think about human languages versus computer languages. Dan discussed this with Jeff at the conference, and this discussion continued asynchronously, with this column being a record of the discussion.
(This document is open access)Abstract: Computational provenance has many important applications, especially to reproducibility. System-level provenance collectors can track provenance data without requiring the user to change anything about their application. However, system-level provenance collectors have performance overheads, and, worse still, different works use different and incomparable benchmarks to assess their performance overhead. This work identifies user-space system-level provenance collectors in prior work, collates the benchmarks, and evaluates each collector on each benchmark. We use benchmark minimization to select a minimal subset of benchmarks, which can be used as goalposts for future work on system-level provenance collectors.
(This document is green open access)Abstract: This article focuses on training work carried out in artificial intelligence (AI) at the National Center for Supercomputing Applications (NCSA) at the University of Illinois Urbana-Champaign via a research experience for undergraduates (REU) program named FoDOMMaT. It also describes why we are interested in AI, and concludes by discussing what we've learned from running this program and its predecessor over six years.
(This document is open access)Abstract: This paper extends the FAIR (Findable, Accessible, Interoperable, Reusable) guidelines to provide criteria for assessing if software conforms to best practices in open source. By adding "USE" (User-Centered, Sustainable, Equitable), software development can adhere to open source best practice by incorporating user-input early on, ensuring front-end designs are accessible to all possible stakeholders, and planning long-term sustainability alongside software design. The FAIR-USE4OS guidelines will allow funders and researchers to more effectively evaluate and plan open-source software projects. There is good evidence of funders increasingly mandating that all funded research software is open source; however, even under the FAIR guidelines, this could simply mean software released on public repositories with a Zenodo DOI. By creating FAIR-USE software, best practice can be demonstrated from the very beginning of the design process and the software has the greatest chance of success by being impactful.
(This document is open access)Abstract: Group authorship (also known as corporate authorship, team authorship, consortium authorship) refers to attribution practices that use the name of a collective (be it team, group, project, corporation, or consortium) in the authorship byline. Data shows that group authorships are on the rise but thus far, in scholarly discussions about authorship, they have not gained much specific attention. Group authorship can minimize tensions within the group about authorship order and the criteria used for inclusion/exclusion of individual authors. However, current use of group authorships has drawbacks, such as ethical challenges associated with the attribution of credit and responsibilities, legal challenges regarding how copyrights are handled, and technical challenges related to the lack of persistent identifiers (PIDs), such as ORCID, for groups. We offer two recommendations: 1) Journals should develop and share context-specific and unambiguous guidelines for group authorship, for which they can use the four baseline requirements offered in this paper; 2) Using persistent identifiers for groups and consistent reporting of members’ contributions should be facilitated through devising PIDs for groups and linking these to the ORCIDs of their individual contributors and the Digital Object Identifier (DOI) of the published item.
(This document is open access)Abstract: The findable, accessible, interoperable, and reusable (FAIR) data principles provide a framework for examining, evaluating, and improving how data is shared to facilitate scientific discovery. Generalizing these principles to research software and other digital products is an active area of research. Machine learning (ML) models—algorithms that have been trained on data without being explicitly programmed—and more generally, artificial intelligence (AI) models, are an important target for this because of the ever-increasing pace with which AI is transforming scientific domains, such as experimental high energy physics (HEP). In this paper, we propose a practical definition of FAIR principles for AI models in HEP and describe a template for the application of these principles. We demonstrate the template's use with an example AI model applied to HEP, in which a graph neural network is used to identify Higgs bosons decaying to two bottom quarks. We report on the robustness of this FAIR AI model, its portability across hardware architectures and software frameworks, and its interpretability.
(This document is open access)Abstract: A burst buffer is a common method to bridge the performance gap between the I/O needs of modern supercomputing applications and the performance of the shared file system on large-scale supercomputers. However, existing I/O sharing methods require resource isolation, offline profiling, or repeated execution that significantly limit the utilization and applicability of these systems. Here we present ThemisIO, a policy-driven I/O sharing framework for a remote-shared burst buffer: a dedicated group of I/O nodes, each with a local storage device. ThemisIO preserves high utilization by implementing opportunity fairness so that it can reallocate unused I/O resources to other applications. ThemisIO accurately and efficiently allocates I/O cycles among applications, purely based on real-time I/O behavior without requiring user-supplied information or offline-profiled application characteristics. ThemisIO supports a variety of fair sharing policies, such as user-fair, size-fair, as well as composite policies, e.g., group-then-user-fair. All these features are enabled by its statistical token design. ThemisIO can alter the execution order of incoming I/O requests based on assigned tokens to precisely balance I/O cycles between applications via time slicing, thereby enforcing processing isolation. Experiments using I/O benchmarks show that ThemisIO sustains 13.5–13.7% higher I/O throughput and 19.5–40.4% lower performance variation than existing algorithms. For real applications, ThemisIO significantly reduces the slowdown by 59.1–99.8% caused by I/O interference.
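One way to picture token-based fair sharing is a weighted random draw among the applications that currently have requests queued. The sketch below is an assumption-laden illustration, not ThemisIO's design: the class `TokenScheduler`, its methods, and the reading of "statistical token" as a lottery-style draw are all hypothetical. It does show the opportunity-fairness idea from the abstract, since an idle application's share goes to the busy ones.

```python
import random
from collections import deque
from typing import Deque, Dict, Optional


class TokenScheduler:
    """Pick the next I/O request by a weighted draw among busy applications."""

    def __init__(self, tokens: Dict[str, float], seed: int = 0) -> None:
        self.tokens = dict(tokens)  # share assigned to each application
        self.queues: Dict[str, Deque[str]] = {app: deque() for app in tokens}
        self.rng = random.Random(seed)  # seeded for repeatable runs

    def submit(self, app: str, request: str) -> None:
        self.queues[app].append(request)

    def next_request(self) -> Optional[str]:
        # Only applications with queued requests take part in the draw, so
        # the share of an idle application is redistributed to the busy ones.
        active = [app for app, queue in self.queues.items() if queue]
        if not active:
            return None
        weights = [self.tokens[app] for app in active]
        chosen = self.rng.choices(active, weights=weights, k=1)[0]
        return self.queues[chosen].popleft()


scheduler = TokenScheduler({"app_a": 1.0, "app_b": 1.0})
for i in range(2000):
    scheduler.submit("app_a", f"a{i}")
    scheduler.submit("app_b", f"b{i}")

served = [scheduler.next_request() for _ in range(2000)]
share_a = sum(r.startswith("a") for r in served) / len(served)
print(f"app_a share while both are busy: {share_a:.2f}")
```

With equal tokens the two applications each receive roughly half of the service while both are busy; changing the token values changes the split, which is how different policies (for example per-user or per-size shares) could be expressed in this toy model.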
(This document is open access)Abstract: A foundational set of findable, accessible, interoperable, and reusable (FAIR) principles were proposed in 2016 as prerequisites for proper data management and stewardship, with the goal of enabling the reusability of scholarly data. The principles were also meant to apply to other digital assets, at a high level, and over time, the FAIR guiding principles have been re-interpreted or extended to include the software, tools, algorithms, and workflows that produce data. FAIR principles are now being adapted in the context of AI models and datasets. Here, we present the perspectives, vision, and experiences of researchers from different countries, disciplines, and backgrounds who are leading the definition and adoption of FAIR principles in their communities of practice, and discuss outcomes that may result from pursuing and incentivizing FAIR AI research. The material for this report builds on the FAIR for AI Workshop held at Argonne National Laboratory on June 7, 2022.
(This document is open access)Abstract: Research data is optimized when it can be freely accessed and reused. To maximize research equity, transparency, and reproducibility, policymakers should take concrete steps to ensure that research software is openly accessible and reusable.
(This document is open access)Abstract: This position paper describes the Parsl open source research software project and its various phases over seven years. It defines four types of research software engineers (RSEs) who have been important to the project in those phases; we believe this is also applicable to other research software projects.
(This document is open access)
(This document is not open access)Abstract: Workflows make it easier for scientists to assemble computational experiments consisting of many disparate components. However, those disparate components also increase the probability that the computational experiment fails to be reproducible. Even if software is reproducible today, it may become irreproducible tomorrow without the software itself changing at all, because of the constantly changing software environment in which the software is run.
To alleviate irreproducibility, workflow engines integrate with container engines. Additionally, communities that sprang up around workflow engines started to host registries for workflows that follow standards. These standards reduce the effort needed to make workflows automatically reproducible.
In this paper, we study automatic reproduction of workflows from two registries, focusing on non-crashing executions. The experimental data lets us analyze the upper bound to which workflow engines could achieve reproducibility. We identify lessons learned in achieving reproducibility in practice.
(This document is green open access)Abstract: Continuous integration (CI) has become a ubiquitous practice in modern software development, with major code hosting services offering free automation on popular platforms. CI offers major benefits, as it enables detecting bugs in code prior to committing changes. While high-performance computing (HPC) research relies heavily on software, HPC machines are not considered "common" platforms. This presents several challenges that hinder the adoption of CI in HPC environments, making it difficult to maintain bug-free HPC projects, and resulting in adverse effects on the research community. In this article, we explore the challenges that impede HPC CI, such as hardware diversity, security, isolation, administrative policies, and nonstandard authentication, environments, and job submission mechanisms. We propose several solutions that could enhance the quality of HPC software and the experience of developers. Implementing these solutions would require significant changes at HPC centers, but if these changes are made, it would ultimately enable faster and better science.
(This document is green open access)Abstract: As software has become more essential to research across disciplines, and as the recognition of this fact has grown, the importance of professionalizing the development and maintenance of this software has also increased. The community of software professionals who work on this software have come together under the title Research Software Engineer (RSE) over the last decade. This has led to the formalization of RSE roles and organized RSE groups in universities, national labs, and industry. This, in turn, has created the need to understand how RSEs come into this profession and into these groups, how to further promote this career path to potential members, as well as the need to understand what training gaps need to be filled for RSEs coming from different entry points. We have categorized three main classifications of entry paths into the RSE profession and identified key elements, both advantages and disadvantages, that should be acknowledged and addressed by the broader research community in order to attract and retain a talented and diverse pool of future RSEs.
(This document is green open access)Abstract: The Better Scientific Software Fellowship (BSSwF) was launched in 2018 to foster and promote practices, processes, and tools to improve developer productivity and software sustainability of scientific codes. The BSSwF vision is to grow the community with practitioners, leaders, mentors, and consultants to increase the visibility of scientific software. Over the last five years, many fellowship recipients and honorable mentions have identified as research software engineers (RSEs). Case studies from several of the program’s participants illustrate the diverse ways BSSwF has benefited both the RSE and scientific communities. In an environment where the contributions of RSEs are too often undervalued, we believe that programs such as BSSwF can help recognize and encourage community members to step outside of their regular commitments and expand on their work, collaborations and ideas for a larger audience.
(This document is open access)Abstract: As recognition of the vital importance of software for contemporary research is increasing, Research Software Engineering (RSE) is emerging as a discipline in its own right. We present an inventory of relevant research questions about RSE as a basis for future research and initiatives to advance the field, highlighting selected literature and initiatives. This work is the outcome of an RSE community workshop held as part of the 2020 International Series of Online Research Software Events (SORSE) which identified and prioritized key questions across three overlapping themes: people, policy and infrastructure. Almost half of the questions focus on the people theme, which addresses issues related to career paths, recognition and motivation; recruitment and retention; skills; and diversity, equity and inclusion. However, the people and policy themes have the same number of prioritized questions. We recommend that different types of stakeholders, such as RSE employers and policy makers, take responsibility for supporting or encouraging answering of these questions by organizations that have an interest. Initiatives such as the International Council of RSE Associations should also be engaged in this work.
(This document is open access)Abstract: Research software is a fundamental and vital part of research, yet significant challenges to discoverability, productivity, quality, reproducibility, and sustainability exist. Improving the practice of scholarship is a common goal of the open science, open source, and FAIR (Findable, Accessible, Interoperable and Reusable) communities and research software is now being understood as a type of digital object to which FAIR should be applied. This emergence reflects a maturation of the research community to better understand the crucial role of FAIR research software in maximising research value. The FAIR for Research Software (FAIR4RS) Working Group has adapted the FAIR Guiding Principles to create the FAIR Principles for Research Software (FAIR4RS Principles). The contents and context of the FAIR4RS Principles are summarised here to provide the basis for discussion of their adoption. Examples of implementation by organisations are provided to share information on how to maximise the value of research outputs, and to encourage others to amplify the importance and impact of this work.
(This document is not open access)Abstract: funcX is a distributed function as a service (FaaS) platform that enables flexible, scalable, and high performance remote function execution. Unlike centralized FaaS systems, funcX decouples the cloud-hosted management functionality from the edge-hosted execution functionality. funcX’s endpoint software can be deployed, by users or administrators, on arbitrary laptops, clouds, clusters, and supercomputers, in effect turning them into function serving systems. funcX’s cloud-hosted service provides a single location for registering, sharing, and managing both functions and endpoints. It allows for transparent, secure, and reliable function execution across the federated ecosystem of endpoints—enabling users to route functions to endpoints based on specific needs. funcX uses containers (e.g., Docker, Singularity, and Shifter) to provide common execution environments across endpoints. funcX implements various container management strategies to execute functions with high performance and efficiency on diverse funcX endpoints. funcX also integrates with an in-memory data store and Globus for managing data that may span endpoints. We motivate the need for funcX, present our prototype design and implementation, and demonstrate, via experiments on two supercomputers, that funcX can scale to more than 130 000 concurrent workers. We show that funcX’s container warming-aware routing algorithm can reduce the completion time for 3000 functions by up to 61% compared to a randomized algorithm and the in-memory data store can speed up data transfers by up to 3x compared to a shared file system.
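The idea behind warming-aware routing can be sketched in a few lines: prefer a worker that already has the required container running, and fall back to a random choice otherwise. This is an illustrative toy, not funcX's algorithm, which the abstract does not spell out. The function `route`, the worker names, and the bookkeeping dictionary are all hypothetical.

```python
import random
from typing import Dict, List, Set


def route(container: str, warm: Dict[str, Set[str]], rng: random.Random) -> str:
    """Choose a worker for a function that needs `container`."""
    # Prefer workers that already have the required container running.
    warm_workers: List[str] = sorted(w for w, c in warm.items() if container in c)
    # Fall back to any worker when no warm container exists yet.
    candidates = warm_workers or sorted(warm)
    chosen = rng.choice(candidates)
    # After the task runs, the chosen worker holds this container warm.
    warm[chosen].add(container)
    return chosen


# Hypothetical state: worker "w2" already has the "img" container warm.
state: Dict[str, Set[str]] = {"w1": set(), "w2": {"img"}}
rng = random.Random(0)

first = route("img", state, rng)       # goes to the warm worker
cold = route("other", state, rng)      # no warm worker yet, random pick
repeat = route("other", state, rng)    # now routed back to the same worker
print(first, cold, repeat)
```

A purely randomized router would skip the `warm_workers` preference, which is the comparison the abstract reports on.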
(This document is open access)Abstract: Research software is a fundamental and vital part of research worldwide, yet there remain significant challenges to software productivity, quality, reproducibility, and sustainability. Improving the practice of scholarship is a common goal of the open science, open source software and FAIR (Findable, Accessible, Interoperable and Reusable) communities, but improving the sharing of research software has not yet been a strong focus of the latter. To improve the FAIRness of research software, the FAIR for Research Software (FAIR4RS) Working Group has sought to understand how to apply the FAIR Guiding Principles for scientific data management and stewardship to research software, bringing together existing and new community efforts. Many of the FAIR Guiding Principles can be directly applied to research software by treating software and data as similar digital research objects. However, specific characteristics of software — such as its executability, composite nature, and continuous evolution and versioning — make it necessary to revise and extend the principles.
This document presents the first version of the FAIR Principles for Research Software (FAIR4RS Principles). It is an outcome of the FAIR for Research Software Working Group (FAIR4RS WG).
The FAIR for Research Software Working Group is jointly convened as an RDA Working Group, FORCE11 Working Group, and Research Software Alliance (ReSA) Task Force.
Going forward, the RDA Software Source Code Interest Group is the maintenance home for the principles. Concerns or queries about the principles can be raised at RDA plenary events organized by the SSC IG, where there may be opportunities for adopters to report back on progress. The full maintenance and retirement plan for the principles can be found on the RDA website.
(This paper is open access)Abstract: Research software is a critical component of contemporary scholarship. Yet, most research software is developed and managed in ways that are at odds with its long-term sustainability. This paper presents findings from a survey of 1,149 researchers, primarily from the United States, about sustainability challenges they face in developing and using research software. Some of our key findings include a repeated need for more opportunities and time for developers of research software to receive training. These training needs cross the software lifecycle and various types of tools. We also identified the recurring need for better models of funding research software and for providing credit to those who develop the software so they can advance in their careers. The results of this survey will help inform future infrastructure and service support for software developers and users, as well as national research policy aimed at increasing the sustainability of research software.
(This paper is open access)Abstract: To enable the reusability of massive scientific datasets by humans and machines, researchers aim to adhere to the principles of findability, accessibility, interoperability, and reusability (FAIR) for data and artificial intelligence (AI) models. This article provides a domain-agnostic, step-by-step assessment guide to evaluate whether or not a given dataset meets these principles. We demonstrate how to use this guide to evaluate the FAIRness of an open simulated dataset produced by the CMS Collaboration at the CERN Large Hadron Collider. This dataset consists of Higgs boson decays and quark and gluon background, and is available through the CERN Open Data Portal. We use additional available tools to assess the FAIRness of this dataset, and incorporate feedback from members of the FAIR community to validate our results. This article is accompanied by a Jupyter notebook to visualize and explore this dataset. This study marks the first in a planned series of articles that will guide scientists in the creation of FAIR AI models and datasets in high energy particle physics.
(This paper is open access)Abstract: Nearly all research today has a digital component, and typically, scholarly results are strongly dependent on software. For the research results to be fully understood, the software that is used must be uniquely identified. Research software is frequently developed by researchers themselves, often initially to solve a single problem, and then later generalized to solve additional problems. Ideally, the software is shared so that other researchers can also benefit and avoid the duplicate work required for development and maintenance. The researchers must expect and receive value for their contribution and sharing. Because publishing is a key element of our existing scholarly structures, the research that was done must be clearly explained in papers. This can be used to create incentives for researchers not only to share their software, but also to contribute to community software, in both cases through software citation. Contributors to software that is used in papers and is cited by those papers can become authors of the software as it is tracked by indexes, which also track how often the software is cited.
(This paper is open access)Abstract: Software now lies at the heart of scholarly research. Here we argue that as well as being important from a methodological perspective, software should, in many instances, be recognised as an output of research, equivalent to an academic paper. The article discusses the different roles that software may play in research and highlights the relationship between software and research sustainability and reproducibility. It describes the challenges associated with the processes of citing and reviewing software, which differ from those used for papers. We conclude that whilst software outputs do not necessarily fit comfortably within the current publication model, there is a great deal of positive work underway that is likely to make an impact in addressing this.
(This paper is green open access)Abstract: The landscape of workflow systems for scientific applications is notoriously convoluted with hundreds of seemingly equivalent workflow systems, many isolated research claims, and a steep learning curve. To address some of these challenges and lay the groundwork for transforming workflows research and development, the WorkflowsRI and ExaWorks projects partnered to bring the international workflows community together. This paper reports on discussions and findings from two virtual “Workflows Community Summits” (January and April, 2021). The overarching goals of these workshops were to develop a view of the state of the art, identify crucial research challenges in the workflows community, articulate a vision for potential community efforts, and discuss technical approaches for realizing this vision. To this end, participants identified six broad themes: FAIR computational workflows; AI workflows; exascale challenges; APIs, interoperability, reuse, and standards; training and education; and building a workflows community. We summarize discussions and recommendations for each of these themes.
(This paper is open access)Abstract: The long-term sustainability of the high-energy physics (HEP) research software ecosystem is essential to the field. With new facilities and upgrades coming online throughout the 2020s, this will only become increasingly important. Meeting the sustainability challenge requires a workforce with a combination of HEP domain knowledge and advanced software skills. The required software skills fall into three broad groups. The first is fundamental and generic software engineering (e.g., Unix, version control, C++, and continuous integration). The second is knowledge of domain-specific HEP packages and practices (e.g., the ROOT data format and analysis framework). The third is more advanced knowledge involving specialized techniques, including parallel programming, machine learning and data science tools, and techniques to maintain software projects at all scales. This paper discusses the collective software training program in HEP led by the HEP Software Foundation (HSF) and the Institute for Research and Innovation in Software in HEP (IRIS-HEP). The program equips participants with an array of software skills that serve as ingredients for the solution of HEP computing challenges. Beyond serving the community by ensuring that members are able to pursue research goals, the program serves individuals by providing intellectual capital and transferable skills important to careers in the realm of software and computing, inside or outside HEP.
(This paper is open access)Abstract: The FAIR Guiding Principles aim to improve findability, accessibility, interoperability and reusability for both humans and machines, initially aimed at scientific data, but also intended to apply to all sorts of research digital objects, with recent developments about their modification and application to software and computational workflows. In this position paper we argue that the FAIR principles also can apply to machine learning tools and models, though a direct application is not always possible as machine learning combines aspects of data and software. Here we discuss some of the elements of machine learning that lead to the need for some adaptation of the original FAIR principles, along with stakeholders that would benefit from this adaptation. We introduce the initial steps towards this adaptation, i.e., creating a community around it, some possible benefits beyond FAIR, and some of the open questions that such a community could tackle.
(This paper is green open access)Abstract: The Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST) will soon carry out an unprecedented wide, fast, and deep survey of the sky in multiple optical bands. The data from LSST will open up a new discovery space in astronomy and cosmology, simultaneously providing clues toward addressing burning issues of the day, such as the origin of dark energy and the nature of dark matter, while at the same time yielding data that will, in turn, pose fresh new questions. To prepare for the imminent arrival of this remarkable data set, it is crucial that the associated scientific communities be able to develop the software needed to analyze it. Computational power now available allows us to generate synthetic data sets that can be used as a realistic training ground for such an effort. This effort raises its own challenges: the need to generate very large simulations of the night sky, scaling up simulation campaigns to large numbers of compute nodes across multiple computing centers with different architectures, and optimizing the complex workload around memory requirements and widely varying wall clock times. We describe here a large-scale workflow that melds together Python code to steer the workflow, Parsl to manage the large-scale distributed execution of workflow components, and containers to carry out the image simulation campaign across multiple sites. Taking advantage of these tools, we developed an extreme-scale computational framework and used it to simulate five years of observations for 300 square degrees of sky area. We describe our experiences and lessons learned in developing this workflow capability, and highlight how the scalability and portability of our approach enabled us to efficiently execute it on up to 4000 compute nodes on two supercomputers.
(This paper is green open access)Abstract: The function-as-a-service (FaaS) paradigm aims to abstract the complexities of managing computing infrastructure for users. While adoption in industry has been swift, we have yet to see widespread adoption in academia. This is in part due to barriers such as the need to access large research data, diverse hardware requirements, monolithic code bases, and existing systems available to researchers. We describe funcX, a federated function-as-a-service platform that addresses important requirements for use of FaaS in research computing. We outline how funcX has been used in early science deployments.
(This paper is not open access)Abstract: The development of reusable artificial intelligence (AI) models for wider use and rigorous validation by the community promises to unlock new opportunities in multi-messenger astrophysics. Here we develop a workflow that connects the Data and Learning Hub for Science, a repository for publishing AI models, with the Hardware-Accelerated Learning (HAL) cluster, using funcX as a universal distributed computing service. Using this workflow, an ensemble of four openly available AI models can be run on HAL to process an entire month’s worth (August 2017) of advanced Laser Interferometer Gravitational-Wave Observatory data in just seven minutes, identifying all four binary black hole mergers previously identified in this dataset and reporting no misclassifications. This approach combines advances in AI, distributed computing and scientific data infrastructure to open new pathways to conduct reproducible, accelerated, data-driven discovery.
(This paper is not open access)Abstract: Parsl is a parallel programming library for Python that aims to make it easy to specify parallelism in programs and to realize that parallelism on arbitrary parallel and distributed computing systems. Parsl relies on developers annotating Python functions (wrapping either Python or external applications) to indicate that these functions may be executed concurrently. Developers can then link together functions via the exchange of data. Parsl establishes a dynamic dependency graph and sends tasks for execution on connected resources when dependencies are resolved. Parsl's runtime system enables different compute resources to be used, from laptops to supercomputers, without modification to the Parsl program.
(This paper is green open access)Abstract: Research software is essential to modern research, but it requires ongoing human effort to sustain: to continually adapt to changes in dependencies, to fix bugs, and to add new features. Software sustainability institutes, amongst others, develop, maintain, and disseminate best practices for research software sustainability, and build community around them. These practices can both reduce the amount of effort that is needed and create an environment where the effort is appreciated and rewarded. The UK SSI is such an institute, and the US URSSI and the Australian AuSSI are planning to become institutes. This extended abstract discusses them and the strengths and weaknesses of this approach.
(This paper is green open access)Abstract: Software citation contributes to achieving software sustainability in two ways: It provides an impact metric to incentivize stakeholders to make software sustainable. It also provides references to software used in research, which can be reused and adapted to become sustainable. While software citation faces a host of technical and social challenges, community initiatives have defined the principles of software citation and are working on implementing solutions.
(This paper is green open access)Abstract: Research software is a class of software developed to support research. Today a wealth of such software is created daily in universities, government, and commercial research enterprises worldwide. The sustainability of this software faces particular challenges due, at least in part, to the type of people who develop it. These Research Software Engineers (RSEs) face challenges in developing and sustaining software that differ from those faced by the developers of traditional software. As a result, professional associations have begun to provide support, advocacy, and resources for RSEs. These benefits are critical to sustaining RSEs, especially in environments where their contributions are often undervalued and not rewarded. This paper focuses on how professional associations, such as the United States Research Software Engineer Association (US-RSE), can provide this.
(This paper is open access)Abstract: Software is increasingly essential in most research, and much of this software is developed specifically for and during research. To make this research software findable, accessible, interoperable, and reusable (FAIR), we need to define exactly what FAIR means for research software and acknowledge that software is a living and complex object for which it is impossible to propose one solution that fits all software.
(This paper is open access)Abstract: Software is as integral as a research paper, monograph, or dataset in terms of facilitating the full understanding and dissemination of research. This article provides broadly applicable guidance on software citation for the communities and institutions publishing academic journals and conference proceedings. We expect those communities and institutions to produce versions of this document with software examples and citation styles that are appropriate for their intended audience. This article (and those community-specific versions) are aimed at authors citing software, including software developed by the authors or by others. We also include brief instructions on how software can be made citable, directing readers to more comprehensive guidance published elsewhere. The guidance presented in this article helps to support proper attribution and credit, reproducibility, collaboration and reuse, and encourages building on the work of others to further research.
(This paper is open access)Abstract: This document captures the discussion and deliberation of the FAIR for Research Software (FAIR4RS) subgroup that took a fresh look at the applicability of the FAIR Guiding Principles for scientific data management and stewardship for research software. We discuss the vision of research software as ideally reproducible, open, usable, recognized, sustained and robust, and then review both the characteristic and practiced differences of research software and data. This vision and understanding of initial conditions serves as a backdrop for an attempt at translating and interpreting the guiding principles to more fully align with research software. We have found that many of the principles remained relatively intact as written, as long as considerable interpretation was provided. This was particularly the case for the "Findable" and "Accessible" foundational principles. We found that "Interoperability" and "Reusability" are particularly prone to a broad and sometimes opposing set of interpretations as written. We propose two new principles modeled on existing ones, and provide modified guiding text for these principles to help clarify our final interpretation. A series of gaps in translation were captured during this process, and these remain to be addressed. We finish with a consideration of where these translated principles fall short of the vision laid out in the opening.
(This paper is open access)Abstract: This paper discusses why research software is important, and what sustainability means in this context. It then discusses how research software sustainability can be achieved, describes our experiences at NCSA through specific examples, and considers what we have learned from them and how these lessons can help others.
(This paper is open access)Abstract: Scientific software registries and repositories serve various roles in their respective disciplines. These resources improve software discoverability and research transparency, provide information for software citations, and foster preservation of computational methods that might otherwise be lost over time, thereby supporting research reproducibility and replicability. However, developing these resources takes effort, and few guidelines are available to help prospective creators of registries and repositories. To address this need, we present a set of nine best practices that can help managers define the scope, practices, and rules that govern individual registries and repositories. These best practices were distilled from the experiences of the creators of existing resources, convened by a Task Force of the FORCE11 Software Citation Implementation Working Group during the years 2019-2020. We believe that putting in place specific policies such as those presented here will help scientific software registries and repositories better serve their users and their disciplines.
(This paper is open access)Abstract: Understanding the characteristics of the rapidly evolving geospatial software ecosystem in the United States is critical to enable convergence research and education that are dependent on geospatial data and software. This paper describes a survey approach to better understand geospatial use cases, software and tools, and limitations encountered while using and developing geospatial software. The survey was broadcast through a variety of geospatial-related academic mailing lists and listservs. We report both quantitative responses and qualitative insights. As 42% of respondents indicated that they viewed their work as limited by inadequacies in geospatial software, ample room for improvement exists. In general, respondents expressed concerns about steep learning curves and insufficient time for mastering geospatial software, and often limited access to high-performance computing resources. If adequate efforts were taken to resolve software limitations, respondents believed they would be able to better handle big data, cover broader study areas, integrate more types of data, and pursue new research. Insights gained from this survey play an important role in supporting the conceptualization of a national geospatial software institute in the United States with the aim to drastically advance the geospatial software ecosystem to enable broad and significant research and education advances.
(This paper is open access)Abstract: Significant investments to upgrade and construct large-scale scientific facilities demand commensurate investments in R&D to design algorithms and computing approaches to enable scientific and engineering breakthroughs in the big data era. Innovative Artificial Intelligence (AI) applications have powered transformational solutions for big data challenges in industry and technology that now drive a multi-billion dollar industry, and which play an ever increasing role shaping human social patterns. As AI continues to evolve into a computing paradigm endowed with statistical and mathematical rigor, it has become apparent that single-GPU solutions for training, validation, and testing are no longer sufficient for computational grand challenges brought about by scientific facilities that produce data at a rate and volume that outstrip the computing capabilities of available cyberinfrastructure platforms. This realization has been driving the confluence of AI and high performance computing (HPC) to reduce time-to-insight, and to enable a systematic study of domain-inspired AI architectures and optimization schemes to enable data-driven discovery. In this article we present a summary of recent developments in this field, and describe specific advances that authors in this article are spearheading to accelerate and streamline the use of HPC platforms to design and apply accelerated AI algorithms in academia and industry.
(This paper is open access)Abstract: In recent years the importance of software in research has become increasingly recognized by the research community. This journey still has a long way to go. Research data is currently backed by a variety of efforts to implement and make FAIR principles a reality, complemented by Data Management Plans. Both FAIR data principles and management plans offer elements that could be useful for research software but none of them can be directly applied; in both cases there is a need for adaptation and then adoption. In this position paper we discuss current efforts around FAIR for research software that will also support the advancement of Software Management Plans (SMPs). In turn, use of SMPs encourages researchers to make their research software FAIR.
(This paper is open access)Abstract: New facilities of the 2020s, such as the High Luminosity Large Hadron Collider (HL-LHC), will be relevant through at least the 2030s. This means that their software efforts and those that are used to analyze their data need to consider sustainability to enable their adaptability to new challenges, longevity, and efficiency, over at least this period. This will help ensure that this software will be easier to develop and maintain, that it remains available in the future on new platforms, that it meets new needs, and that it is as reusable as possible. This report discusses a virtual half-day workshop on “Software Sustainability and High Energy Physics” that aimed 1) to bring together experts from HEP as well as those from outside to share their experiences and practices, and 2) to articulate a vision that helps the Institute for Research and Innovation in Software for High Energy Physics (IRIS-HEP) to create a work plan to implement elements of software sustainability. Software sustainability practices could lead to new collaborations, including elements of HEP software being directly used outside the field, and, as has happened more frequently in recent years, to HEP developers contributing to software developed outside the field rather than reinventing it. A focus on and skills related to sustainable software will give HEP software developers an important skill that is essential to careers in the realm of software, inside or outside HEP. The report closes with recommendations to improve software sustainability in HEP, aimed at the HEP community via IRIS-HEP and the HEP Software Foundation (HSF).
(This paper is open access)Abstract: Background: Software is now ubiquitous within research. In addition to the general challenges common to all software development projects, research software must also represent, manipulate, and provide data for complex theoretical constructs. Ensuring this process of theory-software translation is robust is essential to maintaining the integrity of the science resulting from it, and yet there has been little formal recognition or exploration of the challenges associated with it.
Methods: We thematically analyse the outputs of the discussion sessions at the Theory-Software Translation Workshop 2019, where academic researchers and research software engineers from a variety of domains, and with particular expertise in high performance computing, explored the process of translating between scientific theory and software.
Results: We identify a wide range of challenges to implementing scientific theory in research software and using the resulting data and models for the advancement of knowledge. We categorise these within the emergent themes of design, infrastructure, and culture, and map them to associated research questions.
Conclusions: Systematically investigating how software is constructed and its outputs used within science has the potential to improve the robustness of research software and accelerate progress in its development. We propose that this issue be examined within a new research area of theory-software translation, which would aim to significantly advance both knowledge and scientific practice.
(This paper is not open access)Abstract: Industrial production of graphene by chemical vapor deposition (CVD) requires more than the ability to synthesize large domain, high quality graphene in a lab reactor. The integration of graphene in the fabrication process of electronic devices requires the cost-effective and environmentally-friendly production of graphene on dielectric substrates, but current approaches can only produce graphene on metal catalysts. Sustainable manufacturing of graphene should also conserve the catalyst and reaction gases, but today the metal catalysts are typically dissolved after synthesis. Progress toward these objectives is hindered by the hundreds of coupled synthesis parameters that can strongly affect CVD of low-dimensional materials, and poor communication in the published literature of the rich experimental data that exists in individual laboratories. We report here on a platform "Graphene – Recipes for synthesis of high quality material" (Gr-ResQ: pronounced graphene rescue), that includes powerful new tools for data-driven graphene synthesis. At the core of Gr-ResQ is a crowd-sourced database of CVD synthesis recipes and associated experimental results. The database captures ~300 parameters ranging from synthesis conditions like catalyst material and preparation steps, to ambient lab temperature and reactor details, as well as resulting Raman spectra and microscopy images. These parameters are carefully selected to unlock the potential of machine-learning models to advance synthesis. A suite of associated tools enable fast, automated and standardized processing of Raman spectra and scanning electron microscopy images. To facilitate community-based efforts, Gr-ResQ provides tools for cyber-physical collaborations among research groups, allowing experiments to be designed, executed, and analyzed by different teams. 
Gr-ResQ also allows publication and discovery of recipes via the Materials Data Facility (MDF), which assigns each recipe a unique identifier when published and collects parameters in a search index. We envision that this holistic approach to data-driven synthesis can accelerate CVD recipe discovery and production control, and open opportunities for advancing not only graphene, but also many other 1D and 2D materials.
(This paper is not open access)Abstract: The user-facing components of the Cyberinfrastructure (CI) ecosystem, science gateways and scientific workflow systems, share a common need of interfacing with physical resources (storage systems and execution environments) to manage data and execute codes (applications). However, there is no uniform, platform-independent way to describe either the resources or the applications. To address this, we propose uniform semantics for describing resources and applications that will be relevant to a diverse set of stakeholders. We sketch a solution to the problem of a common description and catalog of resources: we describe an approach to implementing a resource registry for use by the community and discuss potential approaches to some long-term challenges. We conclude by looking ahead to the application description language.
(This paper is green open access)Abstract: Building software that can support the huge growth in data and computation required by modern research needs individuals with increasingly specialist skill sets that take time to develop and maintain. The Research Software Engineering movement, which started in the UK and has been built up over recent years, aims to recognise and support these individuals. Why does research software matter to professional software development practitioners outside the research community? Research software can have great impact on the wider world and recent progress means the area can now be considered as a more realistic option for a professional software development career. In this article we present a structure, along with supporting evidence of real-world activities, that defines four elements that we believe are key to providing comprehensive and sustainable support for Research Software Engineering. We also highlight ways that the wider developer community can learn from, and engage with, these activities.
(This paper is open access)Abstract: The Theory-Software Translation Workshop, held in New Orleans in February 2019, explored in depth the process of both instantiating theory in software – for example, implementing a mathematical model in code as part of a simulation – and using the outputs of software – such as the behavior of a simulation – to advance knowledge. As computation within research is now ubiquitous, the workshop provided a timely opportunity to reflect on the particular challenges of research software engineering – the process of developing and maintaining software for scientific discovery. In addition to the general challenges common to all software development projects, research software additionally must represent, manipulate, and provide data for complex theoretical constructs. Ensuring this process is robust is essential to maintaining the integrity of the science resulting from it, and the workshop highlighted a number of areas where the current approach to research software engineering would benefit from an evidence base that could be used to inform best practice.
The workshop brought together expert research software engineers and academics to discuss the challenges of Theory-Software Translation over a two-day period. This report provides an overview of the workshop activities, and a synthesis of the discussion that was recorded. The body of the report presents a thematic analysis of the challenges of Theory-Software Translation as identified by workshop participants, summarises these into a set of research areas, and provides recommendations for the future direction of this work.
(This paper is not open access)Abstract: Multi-messenger astrophysics is a fast-growing, interdisciplinary field that combines data, which vary in volume and speed of data processing, from many different instruments that probe the Universe using different cosmic messengers: electromagnetic waves, cosmic rays, gravitational waves and neutrinos. In this Expert Recommendation, we review the key challenges of real-time observations of gravitational wave sources and their electromagnetic and astroparticle counterparts, and make a number of recommendations to maximize their potential for scientific discovery. These recommendations refer to the design of scalable and computationally efficient machine learning algorithms; the cyber-infrastructure to numerically simulate astrophysical sources, and to process and interpret multi-messenger astrophysics data; the management of gravitational wave detections to trigger real-time alerts for electromagnetic and astroparticle follow-ups; a vision to harness future developments of machine learning and cyber-infrastructure resources to cope with the big-data requirements; and the need to build a community of experts to realize the goals of multi-messenger astrophysics.
(This paper is open access)Abstract: The main output of the FORCE11 Software Citation working group was a paper on software citation principles published in September 2016. This paper laid out a set of six high-level principles for software citation (importance, credit and attribution, unique identification, persistence, accessibility, and specificity) and discussed how they could be used to implement software citation in the scholarly community. In a series of talks and other activities, we have promoted software citation using these increasingly accepted principles. At the time the initial paper was published, we also provided the following (old) guidance and examples on how to make software citable, though we now realize there are unresolved problems with that guidance. The purpose of this document is to provide an explanation of current issues impacting scholarly attribution of research software, organize updated implementation guidance, and identify where best practices and solutions are still needed.
(This paper is not open access)Abstract: The use of deep learning (DL) on HPC resources has become common as scientists explore and exploit DL methods to solve domain problems. On the other hand, in the coming exascale computing era, a high error rate is expected to be problematic for most HPC applications. The impact of errors on DL applications, especially DL training, remains unclear given their stochastic nature. In this paper, we focus on understanding DL training applications on HPC in the presence of silent data corruption. Specifically, we design and perform a quantification study with three representative applications by manually injecting silent data corruption errors (SDCs) across the design space and compare training results with the error-free baseline. The results show only 0.61-1.76% of SDCs cause training failures, and taking the SDC rate in modern hardware into account, the actual chance of a failure is one in thousands to millions of executions. With this quantitatively measured impact, computing centers can make rational design decisions based on their application portfolio, the acceptable failure rate, and financial constraints; for example, they might determine their confidence in the correctness of training results performed on processors without error correction code (ECC) RAM. We also discover that over 75-90% of the SDCs that cause catastrophic errors can be easily detected by a training loss in the next iteration. Thus we propose this error-aware software solution to correct catastrophic errors, as it has significantly lower time and space overhead compared to algorithm-based fault-tolerance (ABFT) and ECC.
(This paper is green open access)Abstract: Modern research in the sciences, engineering, humanities, and other fields depends on software, and specifically, research software. Much of this research software is developed in universities, by faculty, postdocs, students, and staff. In this paper, we focus on the role of university staff. We examine three different, independently-developed models under which these staff are organized and perform their work, and comparatively analyze these models and their consequences on the staff and on the software, considering how the different models support software engineering practices and processes. This information can be used by software engineering researchers to understand the practices of such organizations and by universities who want to set up similar organizations and to better produce and maintain research software.
(This paper is not open access)Abstract: Python is increasingly the lingua franca of scientific computing. It is used as a higher level language to wrap lower-level libraries and to compose scripts from various independent components. However, scaling and moving Python programs from laptops to supercomputers remains a challenge. Here we present Parsl, a parallel scripting library for Python. Parsl makes it straightforward for developers to implement parallelism in Python by annotating functions that can be executed asynchronously and in parallel, and to scale analyses from a laptop to thousands of nodes on a supercomputer or distributed system. We examine how Parsl is implemented, focusing on syntax and usage. We describe two scientific use cases in which Parsl’s intuitive and scalable parallelism is used.
(This paper is open access)Abstract: Bioinformatics research is frequently performed using complex workflows with multiple steps, fans, merges, and conditionals. This complexity makes management of the workflow difficult on a computer cluster, especially when running in parallel on large batches of data: hundreds or thousands of samples at a time. Scientific workflow management systems could help with that. Many are now being proposed, but is there yet the "best" workflow management system for bioinformatics? Such a system would need to satisfy numerous, sometimes conflicting requirements: from ease of use, to seamless deployment at peta- and exa-scale, and portability to the cloud. We evaluated Swift/T as a candidate for such a role by implementing a primary genomic variant calling workflow in the Swift/T language, focusing on workflow management, performance and scalability issues that arise from production-grade big data genomic analyses. In the process we introduced novel features into the language, which are now part of its open repository. Additionally, we formalized a set of design criteria for quality, robust, maintainable workflows that must function at-scale in a production setting, such as a large genomic sequencing facility or a major hospital system. The use of Swift/T conveys two key advantages. (1) It operates transparently in multiple cluster scheduling environments (PBS Torque, SLURM, Cray aprun environment, etc.), thus a single workflow is trivially portable across numerous clusters. (2) The leaf functions of Swift/T permit developers to easily swap executables in and out of the workflow, which makes it easy to maintain and to request resources optimal for each stage of the pipeline. While Swift/T’s data-level parallelism eliminates the need to code parallel analysis of multiple samples, it does make debugging more difficult, as is common for implicitly parallel code.
Nonetheless, the language gives users a powerful and portable way to scale up analyses in many computing architectures. The code for our implementation of a variant calling workflow using Swift/T can be found on GitHub at, with full documentation provided at
(This paper is green open access)Abstract: High-level programming languages such as Python are increasingly used to provide intuitive interfaces to libraries written in lower-level languages and for assembling applications from various components. This migration towards orchestration rather than implementation, coupled with the growing need for parallel computing (e.g., due to big data and the end of Moore's law), necessitates rethinking how parallelism is expressed in programs. Here, we present Parsl, a parallel scripting library that augments Python with simple, scalable, and flexible constructs for encoding parallelism. These constructs allow Parsl to construct a dynamic dependency graph of components that it can then execute efficiently on one or many processors. Parsl is designed for scalability, with an extensible set of executors tailored to different use cases, such as low-latency, high-throughput, or extreme-scale execution. We show, via experiments on the Blue Waters supercomputer, that Parsl executors can allow Python scripts to execute components with as little as 5 ms of overhead, scale to more than 250 000 workers across more than 8000 nodes, and process upward of 1200 tasks per second. Other Parsl features simplify the construction and execution of composite programs by supporting elastic provisioning and scaling of infrastructure, fault-tolerant execution, and integrated wide-area data management. We show that these capabilities satisfy the needs of many-task, interactive, online, and machine learning applications in fields such as biology, cosmology, and materials science.
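The pattern described in the abstract above, where decorated functions return futures and passing those futures to other functions builds a dependency graph, can be sketched with Python's standard library alone. This is an analogue for illustration, not Parsl's API or implementation: the `app` decorator, the shared thread pool, and the example functions are all hypothetical names introduced here.

```python
import functools
from concurrent.futures import Future, ThreadPoolExecutor

# Shared worker pool (illustrative; a real system would make this configurable).
_pool = ThreadPoolExecutor(max_workers=4)


def app(func):
    """Make `func` run asynchronously and return a Future immediately.

    Any positional argument that is itself a Future is treated as a
    dependency: its result is awaited before `func` is called.
    """
    @functools.wraps(func)
    def wrapper(*args):
        def run():
            # Simplification: the worker blocks until its inputs are ready.
            # A production system would more likely defer launching the task.
            resolved = [a.result() if isinstance(a, Future) else a for a in args]
            return func(*resolved)
        return _pool.submit(run)
    return wrapper


@app
def square(x):
    return x * x


@app
def add(a, b):
    return a + b


def sum_of_squares(values):
    """Build a small task graph: independent squares, then a chain of adds.

    `values` must be non-empty. Returns a Future for the final sum.
    """
    futures = [square(v) for v in values]   # independent, can run in parallel
    total = futures[0]
    for f in futures[1:]:
        total = add(total, f)               # each add depends on two futures
    return total


if __name__ == "__main__":
    print(sum_of_squares([1, 2, 3]).result())
```

The graph is never declared up front. It emerges as the script runs and futures are handed from one call to the next, which is what makes the dependency graph dynamic.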
(This paper is open access)Abstract: A growing number of largely uncoordinated initiatives focus on research software sustainability. A comprehensive mapping of the research software sustainability space can help identify gaps in their efforts, track results, and avoid duplication of work. To this end, this paper suggests enhancing an existing schematic of activities in research software sustainability, and formalizing it in a directed graph model. Such a model can be further used to define a classification schema which, applied to research results in the field, can drive the identification of past activities and the planning of future efforts.
(This paper is open access)Abstract: The profile of research software engineering has been greatly enhanced by developments at institutions around the world to form groups and communities that can support effective, sustainable development of research software. We observe, however, that there is still a long way to go to build a clear understanding about what approaches provide the best support for research software developers in different contexts, and how such understanding can be used to suggest more formal structures, models or frameworks that can help to further support the growth of research software engineering. This short paper provides an overview of some preliminary thoughts and proposes an initial high-level framework based on discussions between the authors around the concept of a set of pillars representing key activities and processes that form the core structure of a successful research software engineering offering.
(This paper is open access)Abstract: This paper uses the accepted submissions from the Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE5.1) held in Manchester, UK in September 2017 and the speed blogs written during the event to examine the state of research software. It presents a schematic of the space, then examines coverage in terms of topics, actors, actees, and themes by both the submissions and the blogs.
(This paper is open access)Abstract: Particle physics has an ambitious and broad experimental programme for the coming decades. This programme requires large investments in detector hardware, either to build new facilities and experiments, or to upgrade existing ones. Similarly, it requires commensurate investment in the R&D of software to acquire, manage, process, and analyse the sheer amounts of data to be recorded. In planning for the HL-LHC in particular, it is critical that all of the collaborating stakeholders agree on the software goals and priorities, and that the efforts complement each other. In this spirit, this white paper describes the R&D activities required to prepare for this software upgrade.
(This paper is open access)Abstract: To improve the quality and efficiency of research, groups within the scientific community seek to exploit the value of data sharing. Funders, institutions, and specialist organizations are developing and implementing strategies to encourage or mandate data sharing within and across disciplines, with varying degrees of success. Academic journals in ecology and evolution have adopted several types of public data archiving policies requiring authors to make data underlying scholarly manuscripts freely available. The effort to increase data sharing in the sciences is one part of a broader “data revolution” that has prompted discussion about a paradigm shift in scientific research. Yet anecdotes from the community and studies evaluating data availability suggest that these policies have not obtained the desired effects, both in terms of quantity and quality of available datasets. We conducted a qualitative, interview-based study with journal editorial staff and other stakeholders in the academic publishing process to examine how journals enforce data archiving policies. We specifically sought to establish who editors and other stakeholders perceive as responsible for ensuring data completeness and quality in the peer review process. Our analysis revealed little consensus with regard to how data archiving policies should be enforced and who should hold authors accountable for dataset submissions. Themes in interviewee responses included hopefulness that reviewers would take the initiative to review datasets and trust in authors to ensure the completeness and quality of their datasets. We highlight problematic aspects of these thematic responses and offer potential starting points for improvement of the public data archiving process.
(This paper is open access)Abstract: The identification of the electromagnetic counterpart of the gravitational wave event, GW170817, and discovery of neutrinos and gamma-rays from TXS 0506+056 heralded the new era of multi-messenger astrophysics. As the number of multi-messenger events rapidly grows over the next decade, the cyberinfrastructure requirements to handle the increase in data rates, data volume, need for event follow up, and analysis across the different messengers will also grow explosively. The cyberinfrastructure requirements to enhance multi-messenger astrophysics will be both a major challenge and an opportunity for astronomers, physicists, computer scientists and cyberinfrastructure specialists. Here we outline some of these requirements and argue for a distributed cyberinfrastructure institute for multi-messenger astrophysics to meet these challenges.
(This paper is open access)Abstract: Many science advances have been possible thanks to the use of research software, which has become essential to advancing virtually every Science, Technology, Engineering and Mathematics (STEM) discipline and many non-STEM disciplines including social sciences and humanities. And while much of it is made available under open source licenses, work is needed to develop, support, and sustain it, as underlying systems and software as well as user needs evolve. In addition, the changing landscape of high-performance computing (HPC) platforms, where performance and scaling advances are ever more reliant on software and algorithm improvements as we hit hardware scaling barriers, is causing renewed tension between sustainability of software and its performance. We must do more to highlight the trade-off between performance and sustainability, and to emphasize the need for sustainability, given that complex software stacks don't survive without frequent maintenance, which is made more difficult as a generation of developers of established and heavily-used research software retires. Several HPC forums are doing this, and it has become an active area of funding as well. In response, the authors organized and ran a panel at the SC18 conference. The objectives of the panel were to highlight the importance of sustainability, to illuminate the tension between pure performance and sustainability, and to steer SC community discussion toward understanding and addressing this issue and this tension. The outcome of the discussions, as presented in this paper, can inform choices of advanced compute and data infrastructures to positively impact future research software and future research.
(This paper is not open access)Abstract: The advent of experimental science facilities (instruments and observatories, such as the Large Hadron Collider, the Laser Interferometer Gravitational Wave Observatory, and the upcoming Large Synoptic Survey Telescope) has brought about challenging, large-scale computational and data processing requirements. Traditionally, the computing infrastructure to support these facilities' requirements was organized into separate infrastructures: those that supported their high-throughput needs and those that supported their high-performance computing needs. We argue that to enable and accelerate scientific discovery at the scale and sophistication that is now needed, this separation between high-performance computing and high-throughput computing must be bridged and an integrated, unified infrastructure provided. In this paper, we discuss several case studies where such infrastructure has been implemented. These case studies span different science domains, software systems, and application requirements as well as levels of sustainability. A further aim of this paper is to provide a basis to determine the common characteristics and requirements of such infrastructure, as well as to begin a discussion of how best to support the computing requirements of existing and future experimental science facilities.
(This paper is not open access)Abstract: Science gateways, virtual laboratories and virtual research environments are all terms used to refer to community-developed digital environments that are designed to meet a set of needs for a research community. Specifically, they refer to integrated access to research community resources including software, data, collaboration tools, workflows, instrumentation and high-performance computing, usually via Web and mobile applications. Science gateways, virtual laboratories and virtual research environments are enabling significant contributions to many research domains, facilitating more efficient, open, reproducible research in bold new ways. This paper explores the global impact achieved by the sum effects of these programs in increasing research impact, demonstrates their value in the broader digital landscape and discusses future opportunities. This is evidenced through examination of national and international programs in this field.
(This paper is open access)Abstract: In the 21st Century, research is increasingly data- and computation-driven. Researchers, funders, and the larger community today emphasize the traits of openness and reproducibility. In March 2017, 13 mostly early-career research leaders who are building their careers around these traits came together with ten university leaders (presidents, vice presidents, and vice provosts), representatives from four funding agencies, and eleven organizers and other stakeholders in an NIH- and NSF-funded one-day, invitation-only workshop titled "Imagining Tomorrow's University." Workshop attendees were charged with launching a new dialog around open research - the current status, opportunities for advancement, and challenges that limit sharing.
The workshop examined how the internet-enabled research world has changed, and how universities need to change to adapt commensurately, aiming to understand how universities can and should make themselves competitive and attract the best students, staff, and faculty in this new world. During the workshop, the participants re-imagined scholarship, education, and institutions for an open, networked era, to uncover new opportunities for universities to create value and serve society. They expressed the results of these deliberations as a set of 22 principles of tomorrow's university across six areas: credit and attribution, communities, outreach and engagement, education, preservation and reproducibility, and technologies.
Activities that follow on from workshop results take one of three forms. First, since the workshop, a number of workshop authors have further developed and published their white papers to make their reflections and recommendations more concrete. These authors are also conducting efforts to implement these ideas, and to make changes in the university system. Second, we plan to organise a follow-up workshop that focuses on how these principles could be implemented. Third, we believe that the outcomes of this workshop support and are connected with recent theoretical work on the position and future of open knowledge institutions.
(This paper is green open access)Abstract: Software is the key crosscutting technology that enables advances in mathematics, computer science, and domain-specific science and engineering to achieve robust simulations and analysis for science, engineering, and other research fields. However, software itself has not traditionally received focused attention from research communities; rather, software has evolved organically and inconsistently, with its development largely as by-products of other initiatives. Moreover, challenges in scientific software are expanding due to disruptive changes in computer hardware, increasing scale and complexity of data, and demands for more complex simulations involving multiphysics, multiscale modeling and outer-loop analysis. In recent years, community members have established a range of grass-roots organizations and projects to address these growing technical and social challenges in software productivity, quality, reproducibility, and sustainability. This article provides an overview of such groups and discusses opportunities to leverage their synergistic activities while nurturing work toward emerging software ecosystems.
(This paper is open access)Abstract: To improve the quality and efficiency of research, groups within the scientific community seek to exploit the value of data sharing. Funders, institutions, and specialist organizations are developing and implementing strategies to encourage or mandate data sharing within and across disciplines, with varying degrees of success. Academic journals in ecology and evolution have adopted several types of public data archiving policies requiring authors to make data underlying scholarly manuscripts freely available. Yet anecdotes from the community and studies evaluating data availability suggest that these policies have not obtained the desired effects, both in terms of quantity and quality of available datasets. We conducted a qualitative, interview-based study with journal editorial staff and other stakeholders in the academic publishing process to examine how journals enforce data archiving policies. We specifically sought to establish who editors and other stakeholders perceive as responsible for ensuring data completeness and quality in the peer review process. Our analysis revealed little consensus with regard to how data archiving policies should be enforced and who should hold authors accountable for dataset submissions. Themes in interviewee responses included hopefulness that reviewers would take the initiative to review datasets and trust in authors to ensure the completeness and quality of their datasets. We highlight problematic aspects of these thematic responses and offer potential starting points for improvement of the public data archiving process.
(This paper is not open access)Abstract: Computational and data-driven research practices have significantly changed over the past decade to encompass new analysis models such as interactive and online computing. Science gateways are simultaneously evolving to support this transforming landscape with the aim to enable transparent, scalable execution of a variety of analyses. Science gateways often rely on workflow management systems to represent and execute analyses efficiently and reliably. However, integrating workflow systems in science gateways can be challenging, especially as analyses become more interactive and dynamic, requiring sophisticated orchestration and management of applications and data, and customization for specific execution environments. Parsl (Parallel Scripting Library), a Python library for programming and executing data-oriented workflows in parallel, addresses these problems. Developers simply annotate a Python script with Parsl directives wrapping either Python functions or calls to external applications. Parsl manages the execution of the script on clusters, clouds, grids, and other resources; orchestrates required data movement; and manages the execution of Python functions and external applications in parallel. The Parsl library can be easily integrated into Python-based gateways, allowing for simple management and scaling of workflows.
(This paper is open access)Abstract: Software is essential for the bulk of research today. It appears in the research cycle as infrastructure (both inputs and outputs, software obtained from others before the research is performed and software provided to others after the research is complete), as well as being part of the research itself (e.g., new software development). To measure and give credit for software contributions, the simplest path appears to be to overload the current paper citation system so that it also can support citations of software. A multidisciplinary working group built a set of principles for software citation in late 2016. Now, in ACAT 2017 and its proceedings, we want to experimentally encourage those principles to be followed, both to provide credit to the software developers and maintainers in the ACAT community and to try out the process, potentially finding flaws and places where it needs to be improved.
(This paper is open access)Abstract: In this chapter of the High Energy Physics Software Foundation Community Whitepaper, we discuss the current state of infrastructure, best practices, and ongoing developments in the area of data and software preservation in high energy physics. A re-framing of the motivation for preservation to enable re-use is presented. A series of research and development goals in software and other cyberinfrastructure that will aid in the enabling of reuse of particle physics analyses and production software are presented and discussed.
(This paper is open access)Abstract: Advances in sequencing techniques have led to exponential growth in biological data, demanding the development of large-scale bioinformatics experiments. Because these experiments are computation- and data-intensive, they require high-performance computing techniques and can benefit from specialized technologies such as Scientific Workflow Management Systems and databases. In this work, we present BioWorkbench, a framework for managing and analyzing bioinformatics experiments. This framework automatically collects provenance data, including both performance data from workflow execution and data from the scientific domain of the workflow application. Provenance data can be analyzed through a web application that abstracts a set of queries to the provenance database, simplifying access to provenance information. We evaluate BioWorkbench using three case studies: SwiftPhylo, a phylogenetic tree assembly workflow; SwiftGECKO, a comparative genomics workflow; and RASflow, a RASopathy analysis workflow. We analyze each workflow from both computational and scientific domain perspectives, by using queries to a provenance and annotation database. Some of these queries are available as a pre-built feature of the BioWorkbench web application. Through the provenance data, we show that the framework is scalable and achieves high performance, reducing the case studies' execution time by up to 98%. We also show how the application of machine learning techniques can enrich the analysis process.
(This paper is green open access)Abstract: In most fields, computational models and data analysis have become a significant part of how research is performed, in addition to the more traditional theory and experiment. Mathematics is no exception to this trend. While the system of publication and credit for theory and experiment (journals and books, often monographs) has developed and become an expected part of the culture, shaping how research is shared and how candidates for hiring and promotion are evaluated, software (and data) do not have the same history. A group working as part of the FORCE11 community developed a set of principles for software citation that fit software into the journal citation system and allow software to be published and then cited; there are now over 50,000 DOIs that have been issued for software. However, some challenges remain, including: promoting the idea of software citation to developers and users; collaborating with publishers to ensure that systems collect and retain required metadata; ensuring that the rest of the scholarly infrastructure, particularly indexing sites, include software; working with communities so that software efforts “count”; and understanding how best to cite software that has not been published.
(This paper is open access)Abstract: This article summarizes motivations, organization, and activities of the Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE5.1) held in Manchester, UK in September 2017. The WSSSPE series promotes sustainable research software by positively impacting principles and best practices, careers, learning, and credit. This article discusses the Code of Conduct, idea papers, position papers, experience papers, demos, and lightning talks presented during the workshop. The main part of the article discusses the speed-blogging groups that formed during the meeting, along with the outputs of those sessions.
(This paper is open access)Abstract: The rapid evolution of technology and the parallel increasing complexity of algorithmic analysis in HEP require developers to acquire a much larger portfolio of programming skills. Young researchers graduating from universities worldwide currently do not receive adequate preparation in the very diverse fields of modern computing to respond to growing needs of the most advanced experimental challenges. There is a growing consensus in the HEP community on the need for training programmes to bring researchers up to date with new software technologies, in particular in the domains of concurrent programming and artificial intelligence. We review some of the initiatives under way for introducing new training programmes and highlight some of the issues that need to be taken into account for these to be successful.
(This paper is open access)Abstract: At the heart of experimental high energy physics (HEP) is the development of facilities and instrumentation that provide sensitivity to new phenomena. Our understanding of nature at its most fundamental level is advanced through the analysis and interpretation of data from sophisticated detectors in HEP experiments. The goal of data analysis systems is to realize the maximum possible scientific potential of the data within the constraints of computing and human resources in the least time. To achieve this goal, future analysis systems should empower physicists to access the data with a high level of interactivity, reproducibility, and throughput capability. As part of the HEP Software Foundation’s Community White Paper process, a working group on Data Analysis and Interpretation was formed to assess the challenges and opportunities in HEP data analysis and develop a roadmap for activities in this area over the next decade. In this report, the key findings and recommendations of the Data Analysis and Interpretation Working Group are presented.
(This paper is open access)Abstract: This article summarizes motivations, organization, and activities of the Fourth Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE4). The WSSSPE series promotes sustainable research software by positively impacting principles and best practices, careers, learning, and credit. This article discusses the code of conduct; the mission and vision statements that were drafted at the workshop and finalized shortly after it; the keynote and idea papers, position papers, experience papers, demos, and lightning talks presented during the workshop; and a panel discussion on best practices. The main part of the article discusses the set of working groups that formed during the meeting, along with contact information for readers who may want to join a group. Finally, it discusses a survey of the workshop attendees.
(This paper is open access)Abstract: This article describes the motivation, design, and progress of the Journal of Open Source Software (JOSS). JOSS is a free and open-access journal that publishes articles describing research software. It has the dual goals of improving the quality of the software submitted and providing a mechanism for research software developers to receive credit. While designed to work within the current merit system of science, JOSS addresses the dearth of rewards for key contributions to science made in the form of software. JOSS publishes articles that encapsulate scholarship contained in the software itself, and its rigorous peer review targets the software components: functionality, documentation, tests, continuous integration, and the license. A JOSS article contains an abstract describing the purpose and functionality of the software, references, and a link to the software archive. The article is the entry point of a JOSS submission, which encompasses the full set of software artifacts. Submission and review proceed in the open, on GitHub. Editors, reviewers, and authors work collaboratively and openly. Unlike other journals, JOSS does not reject articles requiring major revision; while not yet accepted, articles remain visible and under review until the authors make adequate changes (or withdraw, if unable to meet requirements). Once an article is accepted, JOSS gives it a digital object identifier (DOI), deposits its metadata in Crossref, and the article can begin collecting citations on indexers like Google Scholar and other services. Authors retain copyright of their JOSS article, releasing it under a Creative Commons Attribution 4.0 International License. In its first year, starting in May 2016, JOSS published 111 articles, with more than 40 additional articles under review. JOSS is a sponsored project of the nonprofit organization NumFOCUS and is an affiliate of the Open Source Initiative (OSI).
(This paper is open access)Abstract: Advances in sequencing techniques have led to exponential growth in biological data, demanding the development of large-scale bioinformatics experiments. Because these experiments are computation- and data-intensive, they require high-performance computing (HPC) techniques and can benefit from specialized technologies such as Scientific Workflow Management Systems (SWfMS) and databases. In this work, we present BioWorkbench, a framework for managing and analyzing bioinformatics experiments. This framework automatically collects provenance data, including both performance data from workflow execution and data from the scientific domain of the workflow application. Provenance data can be analyzed through a web application that abstracts a set of queries to the provenance database, simplifying access to provenance information. We evaluate BioWorkbench using three case studies: SwiftPhylo, a phylogenetic tree assembly workflow; SwiftGECKO, a comparative genomics workflow; and RASflow, a RASopathy analysis workflow. We analyze each workflow from both computational and scientific domain perspectives, by using queries to a provenance and annotation database. Some of these queries are available as a pre-built feature of the BioWorkbench web application. Through the provenance data, we show that the framework is scalable and achieves high performance, reducing the case studies' execution time by up to 98%. We also show how the application of machine learning techniques can enrich the analysis process.
(This paper is open access)Abstract: The High Energy Physics community has developed and needs to maintain many tens of millions of lines of code and to integrate effectively the work of thousands of developers across large collaborations. Software needs to be built, validated, and deployed across hundreds of sites. Software also has a lifetime of many years, frequently beyond that of the original developer, so it must be developed with sustainability in mind. Adequate recognition of software development as a critical task in the HEP community needs to be fostered and an appropriate publication and citation strategy needs to be developed. As part of the HEP Software Foundation's Community White Paper process a working group on Software Development, Deployment and Validation was formed to examine all of these issues, identify best practice and to formulate recommendations for the next decade. Its report is presented here.
(This paper is not open access)Abstract: We present a novel computational framework that connects Blue Waters, the NSF-supported, leadership-class supercomputer operated by NCSA, to the Laser Interferometer Gravitational-Wave Observatory (LIGO) Data Grid via Open Science Grid technology. To enable this computational infrastructure, we configured, for the first time, a LIGO Data Grid Tier-1 Center that can submit heterogeneous LIGO workflows using Open Science Grid facilities. In order to enable a seamless connection between the LIGO Data Grid and Blue Waters via Open Science Grid, we utilize Shifter to containerize LIGO’s workflow software. This work represents the first time Open Science Grid, Shifter, and Blue Waters are unified to tackle a scientific problem and, in particular, it is the first time a framework of this nature is used in the context of large scale gravitational wave data analysis. This new framework has been used in the last several weeks of LIGO’s second discovery campaign to run the most computationally demanding gravitational wave search workflows on Blue Waters, and accelerate discovery in the emergent field of gravitational wave astrophysics. We discuss the implications of this novel framework for a wider ecosystem of Higher Performance Computing users.
(This paper is green open access)Abstract: Resource selection and task placement for distributed execution pose conceptual and implementation difficulties. Although resource selection and task placement are at the core of many tools and workflow systems, the methods are ad hoc rather than being based on models. Consequently, partial and non-interoperable implementations proliferate. We address both the conceptual and implementation difficulties by experimentally characterizing diverse modalities of resource selection and task placement. We compare the architectures and capabilities of two systems: the AIMES middleware and Swift workflow scripting language and runtime. We integrate these systems to enable the distributed execution of Swift workflows on Pilot-Jobs managed by the AIMES middleware. Our experiments characterize and compare alternative execution strategies by measuring the time to completion of heterogeneous uncoupled workloads executed at diverse scale and on multiple resources. We measure the adverse effects of pilot fragmentation and early binding of tasks to resources and the benefits of backfill scheduling across pilots on multiple resources. We then use this insight to execute a multi-stage workflow across five production-grade resources. We discuss the importance and implications for other tools and workflow systems.
(This paper is not open access)Abstract: This lightning talk paper discusses an initial data set that has been gathered to understand the use of software in research, and is intended to spark wider interest in gathering more data. The initial data analyzes three months of articles in the journal Nature for software mentions. The wider activity that we seek is a community effort to analyze a wider set of articles, including both a longer timespan of Nature articles as well as articles in other journals. Such a collection of data could be used to understand how the role of software has changed over time and how it varies across fields.
(This paper is open access)Abstract: This paper reports on the results of a 2017 survey conducted by email and web of members of the US National Postdoctoral Association regarding their use of software in research and their training regarding software development. The responses show that 95% of respondents use research software. Of all the respondents, 63% state they could not do their research without research software, 31% could do it but with more effort, and 6% would not find a significant difference in their research without research software. In addition, 54% of respondents have not received any training in software development, though all respondents who develop software for researchers have received either self-taught or formal software development training.
Peer review of research articles is a core part of our scholarly communication system. In spite of its importance, the status and purpose of peer review are often contested. What is its role in our modern digital research and communications infrastructure? Does it perform to the high standards with which it is generally regarded? Studies of peer review have shown that it is prone to bias and abuse in numerous dimensions, frequently unreliable, and can fail to detect even fraudulent research. With the advent of web technologies, we are now witnessing a phase of innovation and experimentation in our approaches to peer review. These developments prompted us to examine emerging models of peer review from a range of disciplines and venues, and to ask how they might address some of the issues with our current systems of peer review. We examine the functionality of a range of social Web platforms, and compare these with the traits underlying a viable peer review system: quality control, quantified performance metrics as engagement incentives, and certification and reputation. Ideally, any new systems will demonstrate that they out-perform and reduce the biases of existing models as much as possible. We conclude that there is considerable scope for new peer review initiatives to be developed, each with their own potential issues and advantages. We also propose a novel hybrid platform model that could, at least partially, resolve many of the socio-technical issues associated with peer review, and potentially disrupt the entire scholarly communication system. Success for any such development relies on reaching a critical threshold of research community engagement with both the process and the platform, and therefore cannot be achieved without a significant change of incentives in research environments.
(This paper is open access)Abstract: Scientific research relies on computer software, yet software is not always developed following practices that ensure its quality and sustainability. This manuscript does not aim to propose new software development best practices, but rather to provide simple recommendations that encourage the adoption of existing best practices. Software development best practices promote better quality software, and better quality software improves the reproducibility and reusability of research. These recommendations are designed around Open Source values, and provide practical suggestions that contribute to making research software and its source code more discoverable, reusable and transparent. This manuscript is aimed at developers, but also at organisations, projects, journals and funders that can increase the quality and sustainability of research software by encouraging the adoption of these recommendations.
(This paper is open access)Abstract: Software is often a critical component of scientific research. It can be a component of the academic research methods used to produce research results, or it may itself be an academic research result. Software, however, has rarely been considered to be a citable artifact in its own right. With the advent of open-source software, artifact evaluation committees of conferences, and journals that include source code and running systems as part of the published artifacts, we foresee that software will increasingly be recognized as part of the academic process. The quality and sustainability of this software must be accounted for, both a priori and a posteriori.
The Dagstuhl Perspectives Workshop on "Engineering Academic Software" has examined the strengths, weaknesses, risks, and opportunities of academic software engineering. A key outcome of the workshop is this Dagstuhl Manifesto, serving as a roadmap towards future professional software engineering for software-based research instruments and other software produced and used in an academic context. The manifesto is expressed in terms of a series of actionable "pledges" that users and developers of academic research software can take as concrete steps towards improving the environment in which that software is produced.
(This paper is open access)Abstract: Multi-scale models can facilitate whole plant simulations by linking gene networks, protein synthesis, metabolic pathways, physiology, and growth. Whole plant models can be further integrated with ecosystem, weather, and climate models to predict how various interactions respond to environmental perturbations. These models have the potential to fill in missing mechanistic details and generate new hypotheses to prioritize directed engineering efforts. Outcomes will potentially accelerate improvement of crop yield, sustainability, and increase future food security. It is time for a paradigm shift in plant modeling, from largely isolated efforts to a connected community that takes advantage of advances in high performance computing and mechanistic understanding of plant processes. Tools for guiding future crop breeding and engineering, understanding the implications of discoveries at the molecular level for whole plant behavior, and improved prediction of plant and ecosystem responses to the environment are urgently needed. The purpose of this perspective is to introduce Crops in silico (, an integrative and multi-scale modeling platform, as one solution that combines isolated modeling efforts toward the generation of virtual crops, which is open and accessible to the entire plant biology community. The major challenges involved both in the development and deployment of a shared, multi-scale modeling platform, which are summarized in this prospectus, were recently identified during the first Crops in silico Symposium and Workshop.
(This paper is open access)Abstract: This report records and discusses the Fourth Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE4). The report includes a description of the keynote presentation of the workshop, the mission and vision statements that were drafted at the workshop and finalized shortly after it, a set of idea papers, position papers, experience papers, demos, and lightning talks, and a panel discussion. The main part of the report covers the set of working groups that formed during the meeting, and for each, discusses the participants, the objective and goal, and how the objective can be reached, along with contact information for readers who may want to join the group. Finally, we present results from a survey of the workshop attendees.
(This paper is not open access)Abstract: A common feature across many science and engineering applications is the amount and diversity of data and computation that must be integrated to yield insights. Datasets are growing larger and becoming distributed; their location, availability, and properties are often time-dependent. Collectively, these characteristics give rise to dynamic distributed data-intensive applications. While "static" data applications have received significant attention, the characteristics, requirements, and software systems for the analysis of large volumes of dynamic, distributed data, and data-intensive applications have received relatively less attention. This paper surveys several representative dynamic distributed data-intensive application scenarios, provides a common conceptual framework to understand them, and examines the infrastructure used in support of applications.
(This paper is open access)Abstract: This document is an open response to the NIH Request for Information (RFI): Strategies for NIH Data Management, Sharing, and Citation, Notice Number: NOT-OD-17-015, written by the leaders of the FORCE11 Software Citation Working Group from its inception in mid-2015 through today. This group produced a set of Software Citation Principles and related discussion, which are the basis for this document. Here, we describe research software, summarize the software citation principles, discuss open issues related to software citation, and make recommendations to the NIH.
(This paper is open access)Abstract: Software is data, but it is not just data. While "data" in computing and information science can refer to anything that can be processed by a computer, software is a special kind of data that can be a creative, executable tool that operates on data. However, software and data are similar in that they both traditionally have not been cited in publications. This paper discusses the differences between software and data in the context of citation, by providing examples and referring to evidence in the form of citations.
(This paper is green open access)
(This paper is not open access)Abstract: Exascale systems promise the potential for computation at unprecedented scales and resolutions, but achieving exascale by the end of this decade presents significant challenges. A key challenge is due to the very large number of cores and components and the resulting mean time between failures (MTBF) in the order of hours or minutes. Since the typical run times of target scientific applications are longer than this MTBF, fault tolerance techniques will be essential. An important class of failures that must be addressed is process or node failures. While checkpoint/restart (C/R) is currently the most widely accepted technique for addressing processor failures, coordinated, stable-storage-based global C/R might be unfeasible at exascale when the time to checkpoint exceeds the expected MTBF.
This paper explores transparent recovery via implicitly coordinated, diskless, application-driven checkpointing as a way to tolerate process failures in MPI applications at exascale. The discussed approach leverages User Level Failure Mitigation (ULFM), which is being proposed as an MPI extension to allow applications to create policies for tolerating process failures. Specifically, this paper demonstrates how different implementations of application-driven in-memory checkpoint storage and recovery compare in terms of performance and scalability. We also experimentally evaluate the effectiveness and scalability of the Fenix online global recovery framework on a production system (the Titan Cray XK7 at ORNL) and demonstrate the ability of Fenix to tolerate dynamically injected failures using the execution of four benchmarks and mini-applications with different behaviors.
(This paper is open access)Abstract: Software is a critical part of modern research and yet there is little support across the scholarly ecosystem for its acknowledgement and citation. Inspired by the activities of the FORCE11 working group focused on data citation, this document summarizes the recommendations of the FORCE11 Software Citation Working Group and its activities between June 2015 and April 2016. Based on a review of existing community practices, the goal of the working group was to produce a consolidated set of citation principles that may encourage broad adoption of a consistent policy for software citation across disciplines and venues. Our work is presented here as a set of software citation principles, a discussion of the motivations for developing the principles, reviews of existing community practice, and a discussion of the requirements these principles would place upon different stakeholders. Working examples and possible technical solutions for how these principles can be implemented will be discussed in a separate paper.
(This paper is open access)Abstract: This report records and discusses the Third Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE3). The report includes a description of the keynote presentation of the workshop, which served as an overview of sustainable scientific software. It also summarizes a set of lightning talks in which speakers highlighted to-the-point lessons and challenges pertaining to sustaining scientific software. The final and main contribution of the report is a summary of the discussions, future steps, and future organization for a set of self-organized working groups on topics including developing pathways to funding scientific software; constructing useful common metrics for crediting software stakeholders; identifying principles for sustainable software engineering design; reaching out to research software organizations around the world; and building communities for software sustainability. For each group, we include a point of contact and a landing page that can be used by those who want to join that group’s future activities. The main challenge left by the workshop is to see if the groups will execute these activities that they have scheduled, and how the WSSSPE community can encourage this to happen.
(This paper is not open access)Abstract: We are in the midst of a scientific data explosion in which the rate of data growth is rapidly increasing. While large-scale research projects have developed sophisticated data distribution networks to share their data with researchers globally, there is no such support for the many millions of research projects generating data of interest to much smaller audiences (as exemplified by the long tail scientist). In data-oriented research, every aspect of the research process is influenced by data access. However, sharing and accessing data efficiently as well as lowering access barriers are difficult. In the absence of dedicated large-scale storage, many have noted that there is an enormous storage capacity available via connected peers, none more so than the storage resources of many research groups. With widespread usage of the content delivery network model for disseminating web content, we believe a similar model can be applied to distributing, sharing, and accessing long tail research data in an e-Science context. We describe the vision and architecture of a social content delivery network - a model that leverages the social networks of researchers to automatically share and replicate data on peers’ resources based upon shared interests and trust. Using this model, we describe a simulator and investigate how aspects such as user activity, geographic distribution, trust, and replica selection algorithms affect data access and storage performance. From these results, we show that socially informed replication strategies are comparable with more general strategies in terms of availability and outperform them in terms of spatial efficiency.
(This paper is open access)Abstract: Application Skeleton is a simple and powerful tool to build simplified synthetic science and engineering applications (for example, modeling and simulation, data analysis) with runtime and I/O close to that of the real applications. It is intended for applied computer scientists who need to use science and engineering applications to verify the effectiveness of new systems designed to efficiently run such applications, so that they can bypass obstacles that they often encounter when accessing and building real science and engineering applications. Using the applications generated by Application Skeleton guarantees that the CS systems' effectiveness on synthetic applications will apply to the real applications.
Application Skeleton can generate bag-of-task, (iterative) map-reduce, and (iterative) multistage workflow applications. These applications are represented as a set of tasks, a set of input files, and a set of dependencies. These applications can be generally considered many-task applications, and once created, can be run on single-core, single-node, multi-core, or multi-node (distributed or parallel) computers, depending on what workflow system is used to run them. The generated applications are compatible with workflow systems such as Swift and Pegasus, as well as the ubiquitous UNIX shell. The application can also be created as a generic JSON object that can be used by other systems such as the AIMES middleware.
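As a rough illustration of the representation described above (a set of tasks, a set of input files, and a set of dependencies), the following Python sketch builds a toy map-reduce application as a plain dictionary, derives a valid execution order from the dependencies, and serializes the description to JSON. The field names (`input_files`, `tasks`, `depends_on`) and the task and file names are hypothetical; they are not the Application Skeleton tool's actual format.

```python
import json
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def build_app(num_inputs=3):
    """Build a toy map-then-reduce application description (hypothetical schema)."""
    input_files = [f"in_{i}.dat" for i in range(num_inputs)]
    tasks = {}
    # One independent "map" task per input file.
    for i, name in enumerate(input_files):
        tasks[f"map_{i}"] = {
            "inputs": [name],
            "outputs": [f"mid_{i}.dat"],
            "depends_on": [],
        }
    map_names = sorted(tasks)
    # A single "reduce" task that depends on every map task.
    tasks["reduce_0"] = {
        "inputs": [f"mid_{i}.dat" for i in range(num_inputs)],
        "outputs": ["out.dat"],
        "depends_on": map_names,
    }
    return {"input_files": input_files, "tasks": tasks}


def execution_order(app):
    """Return one order in which tasks can run without violating dependencies."""
    graph = {name: set(task["depends_on"]) for name, task in app["tasks"].items()}
    return list(TopologicalSorter(graph).static_order())


app = build_app()
order = execution_order(app)
as_json = json.dumps(app, indent=2)  # generic JSON form of the same description
```

Keeping the description as plain data means the same object could, in principle, be handed to a shell script generator or to a workflow system; the dependency graph is the only part a scheduler needs to decide what can run concurrently.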
(This paper is open access)Abstract: A "Ten Simple Rules" guide to git and GitHub. We describe and provide examples on how to use these tools to track projects, as users, teams and organizations. We document collaborative development using branching and forking, interaction between collaborators using issues and continuous integration and automation using, for example, Travis CI and Codecov. We also describe dissemination and social aspects of GitHub such as GitHub pages, following and watching repositories, and give advice on how to make code citable.
(This paper is open access)
(This paper is not open access)Abstract: Computer scientists who work on tools and systems to support eScience (a variety of parallel and distributed) applications usually use actual applications to prove that their systems will benefit science and engineering (e.g., improve application performance). Accessing and building the applications and necessary data sets can be difficult because of policy or technical issues, and it can be difficult to modify the characteristics of the applications to understand corner cases in the system design. In this paper, we present the Application Skeleton, a simple yet powerful tool to build synthetic applications that represent real applications, with runtime and I/O close to those of the real applications. This allows computer scientists to focus on the system they are building; they can work with the simpler skeleton applications and be sure that their work will also be applicable to the real applications. In addition, skeleton applications support simple reproducible system experiments since they are represented by a compact set of parameters.
Our Application Skeleton tool (available as open source at https:// currently can create easy-to-access, easy-to-build, and easy-to-run bag-of-task, (iterative) map-reduce, and (iterative) multistage workflow applications. The tasks can be serial, parallel, or a mix of both. The parameters to represent the tasks can either be discovered through a manual profiling of the applications or through an automated method. We select three representative applications (Montage, BLAST, CyberShake Postprocessing), then describe and generate skeleton applications for each. We show that the skeleton applications have identical (or close) performance to that of the real applications. We then show examples of using skeleton applications to verify system optimizations such as data caching, I/O tuning, and task scheduling, as well as the system resilience mechanism, in some cases modifying the skeleton applications to emphasize some characteristic, and thus show that using skeleton applications simplifies the process of designing, implementing, and testing these optimizations.
(This paper is not open access)Abstract: Scripting languages such as Python and R have been widely adopted as tools for the development of scientific software because of the expressiveness of the languages and their available libraries. However, deploying scripted applications on large-scale parallel computer systems such as the IBM Blue Gene/Q or Cray XE6 is a challenge because of issues including operating system limitations, interoperability challenges, and parallel filesystem overheads due to the small file system accesses common in scripted approaches. We present a new approach to these problems in which the Swift scripting system is used to integrate high-level scripts written in Python, R, and Tcl with native code developed in C, C++, and Fortran, by linking Swift to the library interfaces to the script interpreters. We present a technique to efficiently launch scripted applications on supercomputers, and we demonstrate high performance, such as invoking 14M Python interpreters per second on Blue Waters.
(This paper is open access)Abstract: Science and engineering research increasingly relies on activities that facilitate research but are not currently rewarded or recognized, such as: data sharing; developing common data resources, software and methodologies; and annotating data and publications. To promote and advance these activities, we must develop mechanisms for assigning credit, facilitate the appropriate attribution of research outcomes, devise incentives for activities that facilitate research, and allocate funds to maximize return on investment. In this article, we focus on addressing the issue of assigning credit for both direct and indirect contributions, specifically by using JSON-LD to implement a prototype transitive credit system.
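A minimal sketch of crediting direct and indirect contributions follows. It assumes that each product lists its contributors (people or other products) with fractional weights that sum to 1, and that credit assigned to a product is passed on to that product's own contributors. The product names, people, and weights are invented for illustration, and plain Python dictionaries stand in for the JSON-LD documents mentioned in the abstract.

```python
def transitive_credit(product, credit_maps, share=1.0, totals=None):
    """Distribute `share` of credit for `product` down to individual contributors.

    `credit_maps` maps a product name to {contributor: weight}. A contributor
    that is itself a key of `credit_maps` is treated as a product, and its
    share is split again among its own contributors. The graph is assumed
    to be acyclic (an assumption of this sketch).
    """
    if totals is None:
        totals = {}
    for contributor, weight in credit_maps[product].items():
        if contributor in credit_maps:
            # Indirect contribution: recurse with the reduced share.
            transitive_credit(contributor, credit_maps, share * weight, totals)
        else:
            # Direct contribution: accumulate credit for this person.
            totals[contributor] = totals.get(contributor, 0.0) + share * weight
    return totals


# Hypothetical data: a paper that credits two authors and a software library.
EXAMPLE_CREDIT = {
    "paper": {"alice": 0.5, "bob": 0.25, "library": 0.25},
    "library": {"carol": 0.8, "alice": 0.2},
}

credit = transitive_credit("paper", EXAMPLE_CREDIT)
```

In this example, the library's developers receive a portion of the paper's credit even though they are not authors, which is the kind of indirect contribution the abstract refers to.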
(This paper is not open access)Abstract: Efficiently porting ordinary applications to Blue Gene/Q supercomputers is a significant challenge. Codes are often originally developed without considering advanced architectures and related tool chains. Science needs frequently lead users to want to run large numbers of relatively small jobs (often called many-task computing, an ensemble, or a workflow), which can conflict with supercomputer configurations. In this paper, we discuss techniques developed to execute ordinary applications over leadership class supercomputers. We use the high-performance Swift parallel scripting framework and build two workflow execution techniques – sub-jobs and main-wrap. The sub-jobs technique, built on top of the IBM Blue Gene/Q resource manager Cobalt’s sub-block jobs, lets users submit multiple, independent, repeated smaller jobs within a single larger resource block. The main-wrap technique is a scheme that enables C/C++ programs to be defined as functions that are wrapped by a high-performance Swift wrapper and that are invoked as a Swift script. We discuss the needs, benefits, technicalities, and current limitations of these techniques. We further discuss the real-world science enabled by these techniques and the results obtained.
(This paper is not open access)Abstract: Scripting languages such as Python and R have been widely adopted as tools for the productive development of scientific software because of the power and expressiveness of the languages and available libraries. However, deploying scripted applications on large-scale parallel computer systems such as the IBM Blue Gene/Q or Cray XE6 is a challenge because of issues including operating system limitations, interoperability challenges, parallel filesystem overheads due to the small file system accesses common in scripted approaches, and other issues. We present here a new approach to these problems in which the Swift scripting system is used to integrate high-level scripts written in Python, R, and Tcl, with native code developed in C, C++, and Fortran, by linking Swift to the library interfaces to the script interpreters. In this approach, Swift handles data management, movement, and marshaling among distributed-memory processes without direct user manipulation of low-level communication libraries such as MPI. We present a technique to efficiently launch scripted applications on large-scale supercomputers using a hierarchical programming model.
(This paper is open access)Abstract: This technical report records and discusses the Second Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE2). The report includes a description of the alternative, experimental submission and review process, two workshop keynote presentations, a series of lightning talks, a discussion on sustainability, and five discussions from the topic areas of exploring sustainability; software development experiences; credit & incentives; reproducibility & reuse & sharing; and code testing & code review. For each topic, the report includes a list of tangible actions that were proposed and that would lead to potential change. The workshop recognized that reliance on scientific software is pervasive in all areas of world-leading research today. The workshop participants then proceeded to explore different perspectives on the concept of sustainability. Key enablers and barriers of sustainable scientific software were identified from their experiences. In addition, recommendations with new requirements such as software credit files and software prize frameworks were outlined for improving practices in sustainable software engineering. There was also broad consensus that formal training in software development or engineering was rare among the practitioners. Significant strides need to be made in building a sense of community via training in software and technical practices, on increasing their size and scope, and on better integrating them directly into graduate education programs. Finally, journals can define and publish policies to improve reproducibility, whereas reviewers can insist that authors provide sufficient information and access to data and software to allow them to reproduce the results in the paper. Hence a list of criteria is compiled for journals to provide to reviewers so as to make it easier to review software submitted for publication as a "Software Paper."
(This paper is not open access)Abstract: This article evaluates the potential gains a workflow-aware storage system can bring. Two observations make us believe such a storage system is crucial to efficiently support workflow-based applications: First, workflows generate irregular and application-dependent data access patterns. These patterns render existing generic storage systems unable to harness all optimization opportunities as this often requires enabling conflicting optimizations or even conflicting design decisions at the storage system level. Second, most workflow runtime engines make suboptimal scheduling decisions as they lack the detailed data location information that is generally hidden by the storage system. This paper presents a limit study that evaluates the potential gains from building a workflow-aware storage system that supports per-file access optimizations and exposes data location. Our evaluation using synthetic benchmarks and real applications shows that a workflow-aware storage system can bring significant performance gains: up to 3x performance gains compared to a vanilla distributed storage system deployed on the same resources yet unaware of the possible file-level optimizations.
(This paper is not open access)Abstract: Application resilience is a key challenge that must be addressed in order to realize the exascale vision. Process/node failures, an important class of failures, are typically handled today by terminating the job and restarting it from the last stored checkpoint. This approach is not expected to scale to exascale. In this paper we present Fenix, a framework for enabling recovery from process/node/blade/cabinet failures for MPI-based parallel applications in an online (i.e., without disrupting the job) and transparent manner. Fenix provides mechanisms for transparently capturing failures, re-spawning new processes, fixing failed communicators, restoring application state, and returning execution control back to the application. To enable automatic data recovery, Fenix relies on application-driven, diskless, implicitly coordinated checkpointing. Using the S3D combustion simulation running on the Titan Cray-XK7 production system at ORNL, we experimentally demonstrate Fenix’s ability to tolerate high failure rates (e.g., more than one per minute) with low overhead and while sustaining performance.
(This paper is not open access)Abstract: Research data is experiencing a seemingly endless increase in both volume and production rate. At the same time, efficiently transferring, storing, and analyzing large scale research data have become major research foci. In this paper, we expand on our approach to sharing data for e-Science: a Social Content Delivery Network (S-CDN). An S-CDN leverages the social networks of researchers to automatically share data and place replicas on peers' resources based upon the premises of trust and interest in shared data. We denote a consumer of shared data as a data follower, similar to the notion of Twitter followers, except we add the element of bilateral authorization to capture a notion of trust. We describe a prototypical implementation for an S-CDN that captures an efficient asynchronous transfer mechanism for data management and replication. In addition, we study via simulation the interplay of user behavior with different replication strategies that capture social as well as more general premises for data sharing. Our results illustrate the opportunities and pitfalls of various replication and data access management strategies. Specifically, we show that socially-informed replication strategies are competitive with more general strategies in terms of availability, and outperform them in terms of spatial efficiency.
(This paper is not open access)Abstract: Computer scientists who work on tools and systems to support eScience (a variety of parallel and distributed) applications usually use actual applications to prove that their systems will benefit science and engineering (e.g., improve application performance). Accessing and building the applications and necessary data sets can be difficult because of policy or technical issues, and it can be difficult to modify the characteristics of the applications to understand corner cases in the system design. In this paper, we present the Application Skeleton, a simple yet powerful tool to build synthetic applications that represent real applications, with runtime and I/O close to those of the real applications. This allows computer scientists to focus on the system they are building; they can work with the simpler skeleton applications and be sure that their work will also be applicable to the real applications. In addition, skeleton applications support simple reproducible system experiments since they are represented by a compact set of parameters.
Our Application Skeleton tool can currently create easy-to-access, easy-to-build, and easy-to-run bag-of-task, (iterative) map-reduce, and (iterative) multistage workflow applications. The tasks can be serial or parallel or a mix of both. We select three representative applications (Montage, BLAST, CyberShake Postprocessing), then describe and generate skeleton applications for each. We show that the skeleton applications have identical (or close) performance to that of the real applications. We then show examples of using skeleton applications to verify system optimizations such as data caching, I/O tuning, and task scheduling, as well as the system resilience mechanism, in some cases modifying the skeleton applications to emphasize some characteristic, and thus show that using skeleton applications simplifies the process of designing, implementing, and testing these optimizations.
(This paper is open access)Abstract: Science and engineering research increasingly relies on activities that facilitate research but are not currently rewarded or recognized, such as: data sharing; developing common data resources, software and methodologies; and annotating data and publications. To promote and advance these activities, we must develop mechanisms for assigning credit, facilitate the appropriate attribution of research outcomes, devise incentives for activities that facilitate research, and allocate funds to maximize return on investment. In this article, we focus on addressing the issue of assigning credit for both direct and indirect contributions, specifically by using JSON-LD to implement a prototype transitive credit system.
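As a rough illustration of how credit for direct and indirect contributions might be propagated, the sketch below assumes that each product carries a "credit map" of fractional weights for its direct contributors, and that a contributor may itself be another product with its own map. The product names, people, and weights are hypothetical, plain Python dictionaries stand in for the JSON-LD documents used by the prototype, and the dependency graph is assumed to be acyclic.

```python
from collections import defaultdict

# Hypothetical credit maps: each product assigns fractional credit to its
# direct contributors, which may be people or other products.
CREDIT_MAPS = {
    "paper": {"alice": 0.5, "bob": 0.3, "library": 0.2},
    "library": {"carol": 0.75, "dave": 0.25},
}


def transitive_credit(product, credit_maps, weight=1.0, totals=None):
    """Push `weight` units of credit for `product` down to individual people.

    Assumes the product graph has no cycles.
    """
    if totals is None:
        totals = defaultdict(float)
    for contributor, share in credit_maps[product].items():
        if contributor in credit_maps:
            # The contributor is itself a product: pass its share onward.
            transitive_credit(contributor, credit_maps, weight * share, totals)
        else:
            # The contributor is a person: accumulate the credit.
            totals[contributor] += weight * share
    return dict(totals)


credit = transitive_credit("paper", CREDIT_MAPS)
for name, value in sorted(credit.items()):
    print(f"{name}: {value:.2f}")
```

In this made-up example, the authors of the library receive a share of the paper's credit (0.2 split 75/25) even though they did not write the paper, which is the indirect contribution the abstract refers to.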
(This paper is open access)Abstract: Challenges related to development, deployment, and maintenance of reusable software for science are becoming a growing concern. Many scientists’ research increasingly depends on the quality and availability of software upon which their works are built. To highlight some of these issues and share experiences, the First Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE1) was held in November 2013 in conjunction with the SC13 Conference. The workshop featured keynote presentations and a large number (54) of solicited extended abstracts that were grouped into three themes and presented via panels. A set of collaborative notes of the presentations and discussion was taken during the workshop.
Unique perspectives were captured about issues such as comprehensive documentation, development and deployment practices, software licenses and career paths for developers. Attribution systems that account for evidence of software contribution and impact were also discussed. These include mechanisms such as Digital Object Identifiers, publication of “software papers”, and the use of online systems, for example source code repositories like GitHub. This paper summarizes the issues and shared experiences that were discussed, including cross-cutting issues and use cases. It joins a nascent literature seeking to understand what drives software work in science, and how it is impacted by the reward systems of science. These incentives can determine the extent to which developers are motivated to build software for the long-term, for the use of others, and whether to work collaboratively or separately. It also explores community building, leadership, and dynamics in relation to successful scientific software.
(This paper is open access)Abstract: The pursuit of science and engineering research increasingly relies on activities that facilitate research but are not currently rewarded or recognized, such as development of products and infrastructure. In research publications, citations are used to credit previous works. This paper suggests that a modified citation system that includes the technological idea of transitive credit could be used to recognize the developers of products other than research publications and that if this were done in a systematic manner, it would lead to social and cultural changes that would provide incentives for the further development of such products, accelerating overall scientific and engineering advances.
(This paper is open access)Abstract: e-Research infrastructure is increasingly important in the conduct of science and engineering research, and in many disciplines has become an essential part of the research infrastructure. However, this e-Research infrastructure does not appear from a vacuum; it needs both intent and effort first to be created and then to be sustained over time. Research cultures and practices in many disciplines have not adapted to this new paradigm, due in part to the absence of a deep understanding of the elements of e-Research infrastructure and the characteristics that influence their sustainability. This paper outlines a set of contexts in which e-Research infrastructure can be discussed, proposes characteristics that must be considered to sustain infrastructure elements, and highlights models that may be used to create and sustain e-Research infrastructure. We invite feedback on the proposed characteristics and models presented herein.
(This paper is not open access)Abstract: We present the design and first performance and usability evaluation of GeMTC, a novel execution model and runtime system that enables accelerators to be programmed with many concurrent and independent tasks of potentially short or variable duration. With GeMTC, a broad class of such "many-task" applications can leverage the increasing number of accelerated and hybrid high-end computing systems. GeMTC overcomes the obstacles to using GPUs in a many-task manner by scheduling and launching independent tasks on hardware designed for SIMD-style vector processing. We demonstrate the use of a high-level MTC programming model (the Swift parallel dataflow language) to run tasks on many accelerators and thus provide a high-productivity programming model for the growing number of supercomputers that are accelerator-enabled. While still in an experimental stage, GeMTC can already support tasks of fine (subsecond) granularity and execute concurrent heterogeneous tasks on 86,000 independent GPU warps spanning 2.7M GPU threads on the Blue Waters supercomputer.
(This paper is not open access)Abstract: Infrastructure-as-a-Service (IaaS) clouds are an appealing resource for scientific computing. However, the bare-bones presentation of raw Linux virtual machines leaves much to the application developer. For many cloud applications, effective data handling is critical to efficient application execution. This paper investigates the capabilities of a variety of POSIX-accessible distributed storage systems to manage data access patterns resulting from workflow application executions in the cloud. We leverage the expressivity of the Swift parallel scripting framework to benchmark the performance of a number of storage systems using synthetic workloads and three real-world applications. We characterize two representative commercial storage systems (Amazon S3 and HDFS, respectively) and two emerging research-based storage systems (Chirp/Parrot and MosaStore). We find the use of aggregated node-local resources effective and economical compared with remotely located S3 storage. Our experiments show that applications run at scale with MosaStore show up to 30% improvement in makespan time compared with those run with S3. We also find that storage-system driven application deployments in the cloud result in better runtime performance compared with an on-demand data-staging driven approach.
(This paper is not open access)Abstract: Scripting is often used in science to create applications via the composition of existing programs. Parallel scripting systems allow the creation of such applications, but each system introduces the need to adopt a somewhat specialized programming model. We present an alternative scripting approach, AMFS Shell, that lets programmers express parallel scripting applications via minor extensions to existing sequential scripting languages, such as Bash, and then execute them in-memory on large-scale computers. We define a small set of commands between the scripts and a parallel scripting runtime system, so that programmers can compose their scripts in a familiar scripting language. The underlying AMFS implements both collective (fast file movement) and functional (transformation based on content) file management. Tasks are handled by AMFS's built-in execution engine. AMFS Shell is expressive enough for a wide range of applications, and the framework can run such applications efficiently on large-scale computers.
(This paper is open access)Abstract: Scripting is often used in science to create applications via the composition of existing programs. Parallel scripting systems allow the creation of such applications, but each system introduces the need to adopt a somewhat specialized programming model. We present an alternative scripting approach, AMFS Shell, that lets programmers express parallel scripting applications via minor extensions to existing sequential scripting languages, such as Bash, and then execute them in-memory on large-scale computers. We define a small set of commands between the scripts and a parallel scripting runtime system, so that programmers can compose their scripts in a familiar scripting language. The underlying AMFS implements both collective (fast file movement) and functional (transformation based on content) file management. Tasks are handled by AMFS's built-in execution engine. AMFS Shell is expressive enough for a wide range of applications, and the framework can run such applications efficiently on large-scale computers.
(This paper is open access)Abstract: This technical report discusses the submission and peer-review process used by the First Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE) and the results of that process. It is intended to record both this alternative model as well as the papers associated with the workshop that resulted from that process.
(This paper is open access)Abstract: The pursuit of science increasingly relies on activities that facilitate science but are not currently rewarded or recognized. Of particular concern are the sharing of data; development of common data resources, software, and methodologies; and annotation of data and publications. This situation has been documented in a number of recent reports that focus on changing needs and mechanisms for attribution and citation of digital products, from the use of alternative metrics that track popularity, to work on data.
To promote such activities, we must develop mechanisms for assigning credit, facilitate the appropriate attribution of research outcomes, devise incentives for activities that facilitate research, and allocate funds to maximize return on investment. In this article, I introduce the idea of transitive credit, which addresses the issue of crediting indirect contributions, and discuss potential solutions to these other problems.
(This paper is open access)
(This paper is open access)Abstract: Effective use of parallel and distributed computing in science depends upon multiple interdependent entities and activities that form an ecosystem. Active engagement between application users and technology catalysts is a crucial activity that forms an integral part of this ecosystem. Technology catalysts play a crucial role benefiting communities beyond a single user group. Effective user engagement, together with the use and reuse of tools and techniques, has a broad impact on software sustainability. From our experience, we sketch a life-cycle for user-engagement activity in a scientific computational environment and posit that application-level reusability promotes software sustainability. We describe our experience in engaging two user groups from different scientific domains reusing common software and configuration on different computational infrastructures.
(This paper is not open access)Abstract: Efficiently utilizing the rapidly increasing concurrency of multi-petaflop computing systems is a significant programming challenge. One approach is to structure applications with an upper layer of many loosely coupled coarse-grained tasks, each comprising a tightly-coupled parallel function or program. “Many-task” programming models such as functional parallel dataflow may be used at the upper layer to generate massive numbers of tasks, each of which generates significant tightly coupled parallelism at the lower level through multithreading, message passing, and/or partitioned global address spaces. At large scales, however, the management of task distribution, data dependencies, and intertask data movement is a significant performance challenge. In this work, we describe Turbine, a new highly scalable and distributed many-task dataflow engine. Turbine executes a generalized many-task intermediate representation with automated self-distribution and is scalable to multi-petaflop infrastructures. We present here the architecture of Turbine and its performance on highly concurrent systems.
(This paper is not open access)Abstract: Increases in the size of research data and the move towards citizen science, in which everyday users contribute data and analyses, have resulted in a research data deluge. Researchers must now carefully determine how to store, transfer and analyze “Big Data” in collaborative environments. This task is even more complicated when considering budget and locality constraints on data storage and access. In this paper we investigate the potential to construct a Social Content Delivery Network (S-CDN) based upon the social networks that exist between researchers. The S-CDN model builds upon the incentives of collaborative researchers within a given scientific community to address their data challenges collaboratively and in proven trusted settings. In this paper we present a prototype implementation of an S-CDN and investigate the performance of the data transfer mechanisms (using Globus Online) and the potential cost advantages of this approach.
(This paper is not open access)Abstract: Many scientific applications can be efficiently expressed with the parallel scripting (many-task computing, MTC) paradigm. These applications are typically composed of several stages of computation, with tasks in different stages coupled by a shared file system abstraction. However, we often see poor performance when running these applications on large scale computers due to the applications' frequency and volume of filesystem I/O and the absence of appropriate optimizations in the context of parallel scripting applications. In this paper, we show the capability of existing large scale computers to run parallel scripting applications by first defining the MTC envelope and then evaluating the envelope by benchmarking a suite of shared filesystem performance metrics. We also seek to determine the origin of the performance bottleneck by profiling the parallel scripting applications' I/O behavior and mapping the I/O operations to the MTC envelope. We show an example shared filesystem envelope and present a method to predict the I/O performance given the applications' level of I/O concurrency and I/O amount. This work is instrumental in guiding the development of parallel scripting applications to make efficient use of existing large scale computers, and to evaluate performance improvements in the hardware/software stack that will better facilitate parallel scripting applications.
(This paper is not open access)Abstract: Many-task computing is a well-established paradigm for implementing loosely coupled applications (tasks) on large-scale computing systems. However, few of the model’s existing implementations provide efficient, low-latency support for executing tasks that are tightly coupled multiprocessing applications. Thus, a vast array of parallel applications cannot readily be used effectively within many-task workloads. In this work, we present JETS, a middleware component that provides high performance support for many-parallel-task computing (MPTC). JETS is based on a highly concurrent approach to parallel task dispatch and on new capabilities now available in the MPICH2 MPI implementation and the ZeptoOS Linux operating system. JETS represents an advance over the few known examples of multilevel many-parallel-task scheduling systems: it more efficiently schedules and launches many short-duration parallel application invocations; it overcomes the challenges of coupling the user processes of each multiprocessing application invocation via the messaging fabric; and it concurrently manages many application executions in various stages. We report here on the JETS architecture and its performance on both synthetic benchmarks and an MPTC application in molecular dynamics.
(This paper is not open access)Abstract: It is generally accepted that the ability to develop large-scale distributed applications has lagged seriously behind other developments in cyberinfrastructure. In this paper, we provide insight into how such applications have been developed and an understanding of why developing applications for distributed infrastructure is hard. Our approach is unique in the sense that it is centered around half a dozen existing scientific applications; we posit that these scientific applications are representative of the characteristics, requirements, as well as the challenges of the bulk of current distributed applications on production cyberinfrastructure (such as the US TeraGrid). We provide a novel and comprehensive analysis of such distributed scientific applications. Specifically, we survey existing models and methods for large-scale distributed applications and identify commonalities, recurring structures, patterns and abstractions. We find that there are many ad hoc solutions employed to develop and execute distributed applications, which result in a lack of generality and the inability of distributed applications to be extensible and independent of infrastructure details. In our analysis, we introduce the notion of application vectors: a novel way of understanding the structure of distributed applications. Important contributions of this paper include identifying patterns that are derived from a wide range of real distributed applications, as well as an integrated approach to analyzing applications, programming systems and patterns, resulting in the ability to provide a critical assessment of the current practice of developing, deploying and executing distributed applications. Gaps and omissions in the state of the art are identified, and directions for future research are outlined.
(This paper is not open access)Abstract: We seek to enable efficient large-scale parallel execution of applications in which a shared filesystem abstraction is used to couple many tasks. Such parallel scripting (many-task computing, MTC) applications suffer poor performance and utilization on large parallel computers because of the volume of filesystem I/O and a lack of appropriate optimizations in the shared filesystem. Thus, we design and implement a scalable MTC data management system that uses aggregated compute node local storage for more efficient data movement strategies. We co-design the data management system with the data-aware scheduler to enable dataflow pattern identification and automatic optimization. The framework reduces the time to solution of parallel stages of an astronomy data analysis application, Montage, by 83.2% on 512 cores; decreases the time to solution of a seismology application, CyberShake, by 7.9% on 2,048 cores; and delivers BLAST performance better than mpiBLAST at various scales up to 32,768 cores, while preserving the flexibility of the original BLAST application.
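To put the reported reductions in time to solution on a common footing with speedup figures quoted elsewhere in this list, a reduction of r percent corresponds to a speedup of 1 / (1 - r/100). The short sketch below applies that conversion to the two percentages in the abstract above; the function name is ours, and the calculation assumes the percentages refer to wall-clock time for the measured stages only.

```python
def speedup_from_reduction(reduction_percent):
    """Convert a percentage reduction in time to solution into a speedup factor."""
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction must be in the range [0, 100)")
    return 1.0 / (1.0 - reduction_percent / 100.0)


# Figures taken from the abstract above.
montage = speedup_from_reduction(83.2)     # parallel stages of Montage, 512 cores
cybershake = speedup_from_reduction(7.9)   # CyberShake, 2,048 cores

print(f"Montage parallel stages: {montage:.2f}x")   # roughly 5.95x
print(f"CyberShake: {cybershake:.2f}x")             # roughly 1.09x
```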
(This paper is not open access)Abstract: Data volumes have increased so significantly that we need to carefully consider how we interact with, share, and analyze data to avoid bottlenecks. In contexts such as eScience and scientific computing, a large emphasis is placed on collaboration, resulting in many well-known challenges in ensuring that data is in the right place at the right time and accessible by the right users. Yet these simple requirements create substantial challenges for the distribution, analysis, storage, and replication of potentially "large" datasets. Additional complexity is added through constraints such as budget, data locality, usage, and available local storage. In this paper, we propose a "socially driven" approach to address some of the challenges within (academic) research contexts by defining a Social Data Cloud and underpinning Content Delivery Network: a Social CDN (S-CDN). Our approach leverages digitally encoded social constructs via social network platforms that we use to represent (virtual) research communities. Ultimately, the S-CDN builds upon the intrinsic incentives of members of a given scientific community to address their data challenges collaboratively and in proven trusted settings. We define the design and architecture of an S-CDN and investigate its feasibility via a coauthorship case study as first steps to illustrate its usefulness.
(This paper is not open access)Abstract: The availability of a large number of separate clusters has given rise to the field of multicluster systems in which these resources are coupled to obtain their combined benefits to solve large-scale compute-intensive applications. However, it is challenging to achieve automatic load balancing of the jobs across these participating autonomic systems. We developed a novel user space execution model named DA-TC to address the workload allocation techniques for applications with a large number of sequential jobs in multicluster systems. Through this model, we can achieve dynamic load balancing for task assignment, and slower resources become beneficial factors rather than bottlenecks for application execution. The effectiveness of this strategy is demonstrated through theoretical analysis. This model is also evaluated through extensive experimental studies, and the results show that when compared with the traditional method, the proposed DA-TC model can significantly improve the performance of application execution in terms of application turnaround time and system reliability in multicluster circumstances.
(This paper is not open access)
(This paper is not open access)Abstract: Scientific experiments in a variety of domains are producing increasing amounts of data that need to be processed efficiently. Distributed Computing Infrastructures are increasingly important in fulfilling these large-scale computational requirements.
(This paper is not open access)Abstract: We present here the ExM (extreme-scale many-task) programming and execution model as a practical solution to the challenges of programming the higher-level logic of complex parallel applications on current petascale and future exascale computing systems. ExM provides an expressive, high-level functional programming model that yields massive concurrency through implicit, automated parallelism. It comprises a judicious integration of dataflow constructs, highly parallel function evaluation, and extremely scalable task generation. It directly addresses the intertwined programmability and scalability requirements of systems with massive concurrency, while providing a programming model that may be attractive and feasible for systems of much lower scale. We describe here the benefits of the ExM programming and execution model, its potential applications, and the performance of its current implementation.
(This paper is not open access)Abstract: Efficiently utilizing the rapidly increasing concurrency of multi-petaflop computing systems is a significant programming challenge. One approach is to structure applications with an upper layer of many loosely coupled coarse-grained tasks, each comprising a tightly coupled parallel function or program. "Many-task" programming models such as functional parallel dataflow may be used at the upper layer to generate massive numbers of tasks, each of which generates significant tightly coupled parallelism at the lower level via multithreading, message passing, and/or partitioned global address spaces. At large scales, however, the management of task distribution, data dependencies, and inter-task data movement is a significant performance challenge. In this work, we describe Turbine, a new highly scalable and distributed many-task dataflow engine. Turbine executes a generalized many-task intermediate representation with automated self-distribution, and is scalable to multi-petaflop infrastructures. We present here the architecture of Turbine and its performance on highly concurrent systems.
(This paper is not open access)Abstract: In this paper, we address the challenges of reducing the time-to-solution of the data intensive earthquake simulation workflow "CyberShake" by supplementing the high-performance parallel computing (HPC) resources on which it typically runs with distributed, heterogeneous resources that can be obtained opportunistically from grids and clouds. We seek to minimize time to solution by maximizing the amount of work that can be efficiently done on the distributed resources. We identify data movement as the main bottleneck in effectively utilizing the combined local and distributed resources. We address this by analyzing the I/O characteristics of the application, processor acquisition rate (from a pilot-job service), and the data movement throughput of the infrastructure. With these factors in mind, we explore a combination of strategies including partitioning of computation (over HPC and distributed resources) and job clustering. We validate our approach with a theoretical study and with preliminary measurements on the Ranger HPC system and distributed Open Science Grid resources.
(This paper is not open access)Abstract: This paper evaluates the potential gains a workflow-aware storage system can bring. Two observations make us believe such a storage system is crucial to efficiently support workflow-based applications: First, workflows generate irregular and application-dependent data access patterns. These patterns render existing storage systems unable to harness all optimization opportunities as this often requires conflicting optimization options or even conflicting design decisions at the level of the storage system. Second, when scheduling, workflow runtime engines make suboptimal decisions as they lack detailed data location information.
This paper discusses the feasibility of, and evaluates the potential performance benefits brought by, building a workflow-aware storage system that supports per-file access optimizations and exposes data location. To this end, this paper presents approaches to determine the application-specific data access patterns, and evaluates experimentally the performance gains of a workflow-aware storage approach. Our evaluation using synthetic benchmarks shows that a workflow-aware storage system can bring significant performance gains: up to 7x performance gain compared to the MosaStore distributed storage system and up to 16x compared to a central, well-provisioned NFS server.
(This paper is not open access)Abstract: Many-Task Computing (MTC) is a new application category that encompasses increasingly popular applications in biology, economics, and statistics. The high inter-task parallelism and data-intensive processing capabilities of these applications pose new challenges to existing supercomputer hardware-software stacks. These challenges include resource provisioning; task dispatching, dependency resolution, and load balancing; data management; and resilience.
This paper examines the characteristics of MTC applications which create these challenges, and identifies related gaps in the middleware that supports these applications on extreme-scale systems. Based on this analysis, we propose AME, an Anyscale MTC Engine, which addresses the scalability aspects of these gaps. We describe the AME framework and present performance results for both synthetic benchmarks and real applications. Our results show that AME's dispatching performance linearly scales up to 14,120 tasks/second on 16,384 cores with high efficiency. The overhead of the intermediate data management scheme does not increase significantly up to 16,384 cores. AME eliminates 73% of the file transfer between compute nodes and the global filesystem for the Montage astronomy application running on 2,048 cores. Our results indicate that AME scales well on today's petascale machines, and is a strong candidate for exascale machines.
(This paper is not open access)Abstract: Scientists, engineers, and statisticians must execute domain-specific application programs many times on large collections of file-based data. This activity requires complex orchestration and data management as data is passed to, from, and among application invocations. Distributed and parallel computing resources can accelerate such processing, but their use further increases programming complexity. The Swift parallel scripting language reduces these complexities by making file system structures accessible via language constructs and by allowing ordinary application programs to be composed into powerful parallel scripts that can efficiently utilize parallel and distributed resources. We present Swift’s implicitly parallel and deterministic programming model, which applies external applications to file collections using a functional style that abstracts and simplifies distributed parallel execution.
(This paper is not open access)Abstract: This paper is intended to explain how the TeraGrid would like to be able to measure "usage modalities." We would like to (and are beginning to) measure these modalities to understand what objectives our users are pursuing, how they go about achieving them, and why, so that we can make changes in the TeraGrid to better support them.
(This paper is not open access)Abstract: The TeraGrid is an advanced, integrated, nationally-distributed, open, user-driven, US cyberinfrastructure that enables and supports leading edge scientific discovery and promotes science and technology education. It comprises supercomputing resources, storage systems, visualization resources, data collections, software, and science gateways, integrated by software systems and high bandwidth networks, coordinated through common policies and operations, and supported by technology experts. This paper discusses the TeraGrid itself, examples of the science that is occurring on the TeraGrid today, and applications that are being developed to perform science in the future.
(This paper is not open access)Abstract: Achieving high performance for distributed I/O on a wide-area network continues to be an elusive holy grail. Despite enhancements in network hardware as well as software stacks, achieving high-performance remains a challenge. In this paper, our worldwide team took a completely new and non-traditional approach to distributed I/O, called ParaMEDIC: Parallel Metadata Environment for Distributed I/O and Computing, by utilizing application-specific transformation of data to orders of magnitude smaller metadata before performing the actual I/O. Specifically, this paper details our experiences in deploying a large-scale system to facilitate the discovery of missing genes and constructing a genome similarity tree by encapsulating the mpiBLAST sequence-search algorithm into ParaMEDIC. The overall project involved nine computational sites spread across the U.S. and generated more than a petabyte of data that was 'teleported' to a large-scale facility in Tokyo for storage.
(This paper is open access in the US)Abstract: Montage is a portable software toolkit for constructing custom, science-grade mosaics by composing multiple astronomical images. The mosaics constructed by Montage preserve the astrometry (position) and photometry (intensity) of the sources in the input images. The mosaic to be constructed is specified by the user in terms of a set of parameters, including dataset and wavelength to be used, location and size on the sky, coordinate system and projection, and spatial sampling rate. Many astronomical datasets are massive, and are stored in distributed archives that are, in most cases, remote with respect to the available computational resources. Montage can be run on both single- and multi-processor computers, including clusters and grids. Standard grid tools are used to run Montage in the case where the data or computers used to construct a mosaic are located remotely on the Internet. This paper describes the architecture, algorithms, and usage of Montage as both a software toolkit and as a grid portal. Timing results are provided to show how Montage performance scales with number of processors on a cluster computer. In addition, we compare the performance of two methods of running Montage in parallel on a grid.
(This paper is not open access)Abstract: It is generally accepted that the ability to develop large-scale distributed applications that are extensible and independent of infrastructure details has lagged seriously behind other developments in cyberinfrastructure. As the sophistication and scale of distributed infrastructure increases, the complexity of successfully developing and deploying distributed applications increases both quantitatively and in qualitatively newer ways. In this paper we trace the evolution of a representative set of "state-of-the-art" distributed applications and production infrastructure; in doing so we aim to provide insight into the evolving sophistication of distributed applications – from simple generalizations of legacy static high-performance to applications composed of multiple loosely-coupled and dynamic components. The ultimate aim of this work is to highlight that even accounting for the fact that developing applications for distributed infrastructure is a difficult undertaking, there are suspiciously few novel and interesting distributed applications that utilize production Grid infrastructure. Along the way, we aim to provide an appreciation for the fact that developing distributed applications and the theory and practise of production Grid infrastructure have often not progressed in phase. Progress in the next phase and generation of distributed applications will require stronger coupling between the design and implementation of production infrastructure and the theory of distributed applications, including but not limited to explicit support for distributed application usage modes and advances that enable distributed applications to scale-out.
(This paper is not open access)Abstract: Louisiana researchers and universities are leading a concentrated, collaborative effort to advance statewide e-Research through a new cyberinfrastructure: computing systems, data storage systems, advanced instruments and data repositories, visualization environments, and people, all linked together by software and high performance networks. This effort has led to a set of interlinked projects that have started making a significant difference to the state and has created an environment that encourages increased collaboration, leading to new e-Research. This paper describes the overall effort, the new projects and environment, and the results to-date.
(This paper is not open access)Abstract: The problems of scheduling a single parallel job across a large scale distributed system are well known and surprisingly difficult to solve. In addition, because of the issues involved with distributed submission like co-reserving resources, managing accounts and certificates simultaneously on multiple machines etc., the vast number of HPC-application users have been happy to remain restricted to submitting jobs to single machines. Meanwhile, the need to simulate larger and more complex physical systems continues to grow, with a concomitant increase in the number of cores required to solve the resulting scientific problems. One might reduce the demand on load per machines, and eventually the wait-time on queue by decomposing the problem to utilise two resources in such circumstances, even though there might be a reduction in the peak performance. This motivates the question: can otherwise monolithic jobs running on single resources be distributed over more than one machine such that there is an overall reduction in the time-to-solution? In this paper, we briefly discuss the development and performance of a parallel molecular dynamics code and its generalisation to work on multiple distributed machines (using MPICH-G2). We benchmark and validate the performance of our simulations over multiple input-data sets of varying size. The primary aim of this work however, is to show that the time-to-solution can be reduced by sacrificing some peak performance and distributing over multiple machines.
(This paper is not open access)Abstract: Achieving high performance for distributed I/O on a wide-area network continues to be an elusive holy grail. Despite enhancements in network hardware as well as software stacks, achieving high-performance remains a challenge. In this paper, our worldwide team took a completely new and non-traditional approach to distributed I/O, called ParaMEDIC: Parallel Metadata Environment for Distributed I/O and Computing, by utilizing application-specific transformation of data to orders-of-magnitude smaller meta-data before performing the actual I/O. Specifically, this paper details our experiences in deploying a large-scale system to facilitate the discovery of missing genes and constructing a genome similarity tree by encapsulating the mpiBLAST sequence-search algorithm into ParaMEDIC. The overall project involved nine different computational sites spread across the U.S. generating more than a petabyte of data, that was "teleported" to a large-scale facility in Tokyo for storage.
(This paper is not open access)Abstract: Many scientific workflows are composed of fine computational granularity tasks, yet they are composed of thousands of them and are data intensive in nature, thus requiring resources such as the TeraGrid to execute efficiently. In order to improve the performance of such applications, we often employ task clustering techniques to increase the computational granularity of workflow tasks. The goal is to minimize the completion time of the workflow by reducing the impact of queue wait times. In this paper, we examine the performance impact of the clustering techniques using the Pegasus workflow management system. Experiments performed using an astronomy workflow on the NCSA TeraGrid cluster show that clustering can achieve a significant reduction in the workflow completion time (up to 97%).
(This paper is not open access)Abstract: Many emerging high performance applications require distributed infrastructure that is significantly more powerful and flexible than traditional Grids. Such applications require the optimization, close integration, and control of all Grid resources, including networks. The EnLIGHTened (ENL) Computing Project has designed an architectural framework that allows Grid applications to dynamically request (in-advance or on-demand) any type of Grid resource: computers, storage, instruments, and deterministic, high-bandwidth network paths, including lightpaths. Based on application requirements, the ENL middleware communicates with Grid resource managers and, when availability is verified, co-allocates all the necessary resources. ENL’s Domain Network Manager controls all network resource allocations to dynamically setup and delete dedicated circuits using Generalized Multiprotocol Label Switching (GMPLS) control plane signaling. In order to make optimal brokering decisions, the ENL middleware uses near-real-time performance information about Grid resources. A prototype of this architectural framework on a national-scale testbed implementation has been used to demonstrate a small number of applications. Based on this, a set of changes for the middleware have been laid out and are being implemented.
(This paper is not open access)Abstract: In this paper we examine the issue of optimizing disk usage and scheduling large-scale scientific workflows onto distributed resources where the workflows are data-intensive, requiring large amounts of data storage, and the resources have limited storage resources. Our approach is two-fold: we minimize the amount of space a workflow requires during execution by removing data files at runtime when they are no longer needed and we demonstrate that workflows may have to be restructured to reduce the overall data footprint of the workflow. We show the results of our data management and workflow restructuring solutions using a Laser Interferometer Gravitational-Wave Observatory (LIGO) application and an astronomy application, Montage, running on a large-scale production grid, the Open Science Grid. We show that although reducing the data footprint of Montage by 48% can be achieved with dynamic data cleanup techniques, LIGO Scientific Collaboration workflows require additional restructuring to achieve a 56% reduction in data space usage. We also examine the cost of the workflow restructuring in terms of the application's runtime.
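The dynamic cleanup idea in the abstract above (delete a file once the last task that reads it has finished) can be illustrated with a minimal sketch. This is not the paper's implementation: the function `peak_storage`, the three-task workflow, and the file sizes are all hypothetical, chosen only to show how cleanup lowers the peak disk footprint of a workflow run in a fixed order.

```python
def peak_storage(tasks, sizes, cleanup):
    """Return the peak disk usage of a workflow executed in list order.

    tasks:   list of (name, input_files, output_files), already in a valid
             execution order (hypothetical representation).
    sizes:   dict mapping file name -> size in arbitrary units.
    cleanup: if True, delete each file right after its last reader finishes.
    """
    # Index of the last task that reads each file.
    last_use = {}
    for i, (_, inputs, _) in enumerate(tasks):
        for f in inputs:
            last_use[f] = i

    # Files that are read but never produced are staged in before the run.
    produced = {f for _, _, outputs in tasks for f in outputs}
    on_disk = {f for _, inputs, _ in tasks for f in inputs if f not in produced}

    peak = sum(sizes[f] for f in on_disk)
    for i, (_, inputs, outputs) in enumerate(tasks):
        # A task's outputs coexist with its inputs while it runs.
        on_disk.update(outputs)
        peak = max(peak, sum(sizes[f] for f in on_disk))
        if cleanup:
            for f in inputs:
                if last_use[f] == i:
                    on_disk.discard(f)
    return peak


# Hypothetical mosaic-like workflow: two reprojections, then one co-addition.
sizes = {"raw1": 10, "raw2": 10, "proj1": 10, "proj2": 10, "mosaic": 15}
tasks = [
    ("project1", ["raw1"], ["proj1"]),
    ("project2", ["raw2"], ["proj2"]),
    ("add", ["proj1", "proj2"], ["mosaic"]),
]

print(peak_storage(tasks, sizes, cleanup=False))
print(peak_storage(tasks, sizes, cleanup=True))
```

Files that no task reads, such as the final mosaic here, are treated as products and never deleted. The sketch assumes a fixed sequential order; with parallel execution the peak also depends on which tasks overlap, which is one reason restructuring the workflow can change the footprint.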
(This paper is not open access)Abstract: Astronomy has a rich heritage of discovery using image data sets that cover the full range of the electromagnetic spectrum. Image data sets in one frequency range have often been studied in isolation from those in other frequency ranges. This is mostly a consequence of the diverse properties of the data collections themselves. Images are delivered in different coordinate systems, map projections, spatial samplings, and image sizes, and the pixels themselves are rarely co-registered on the sky. Moreover, the spatial extent of many astronomically important structures, such as clusters of galaxies and star formation regions, is often substantially greater than that of individual images.
(This paper is not open access)Abstract: As is becoming commonly known, there is an explosion happening in the amount of scientific data that is publicly available. One challenge is how to make productive use of this data. This talk will discuss some parallel and distributed computing projects, centered around virtual astronomy, but also including other scientific data-oriented realms. It will look at some specific projects from the past, including Montage, Grist, OurOcean, and SCOOP, and will discuss the distributed computing, Grid, and Web-service technologies that have successfully been used in these projects.
(This paper is not open access)Abstract: Montage is a portable toolkit for constructing custom, science-grade mosaics by composing multiple astronomical images. The mosaics constructed by Montage preserve the astrometry (position) and photometry (intensity) of the sources in the input images. The mosaic to be constructed is specified by the user in terms of a set of parameters, including dataset and wavelength to be used, location and size on the sky, coordinate system and projection, and spatial sampling rate. Many astronomical datasets are massive, and are stored in distributed archives that are, in most cases, remote with respect to the available computational resources. The paper describes scientific applications of Montage by NASA projects and researchers, who run the software on both single- and multi-processor computers, including clusters and grids. Standard grid tools are used to run Montage in the case where the data or computers used to construct a mosaic are located remotely on the Internet. This paper describes the architecture, algorithms, and performance of Montage as both a software toolkit and as a grid portal.
(This paper is not open access)Abstract: The Common Component Architecture (CCA) provides a means for software developers to manage the complexity of large-scale scientific simulations and to move toward a plug-and-play environment for high-performance computing. In the scientific computing context, component models also promote collaboration using independently developed software, thereby allowing particular individuals or groups to focus on the aspects of greatest interest to them. The CCA supports parallel and distributed computing as well as local high-performance connections between components in a language-independent manner. The design places minimal requirements on components and thus facilitates the integration of existing code into the CCA environment. The CCA model imposes minimal overhead to minimize the impact on application performance. The focus on high performance distinguishes the CCA from most other component models. The CCA is being applied within an increasing range of disciplines, including combustion research, global climate simulation, and computational chemistry.
(This paper is not open access)Abstract: The Montage mosaic engine supplies on-request image mosaic services for the NVO astronomical community. A companion paper describes scientific applications of Montage. This paper describes one application in detail: the generation at SDSC of a mosaic of the 2MASS All-sky Image Atlas on the NSF TeraGrid. The goals of the project are: to provide a value-added 2MASS product that combines overlapping images to improve sensitivity; to demonstrate applicability of computing at-scale to astronomical missions and surveys, especially projects such as LSST; and to demonstrate the utility of the NVO Hyperatlas format. The numerical processing of an 8-TB 32-bit survey to produce a 64-bit 20-TB output atlas presented multiple scalability and operational challenges. An MPI Python module, MYMPI, was used to manage the alternately sequential and parallel steps of the Montage process. This allowed us to parallelize all steps of the mosaic process: that of many, sequential steps executing simultaneously for independent mosaics and that of a single MPI parallel job executing on many CPUs for a single mosaic. The Storage Resource Broker (SRB) developed at SDSC has been used to archive the output results in the Hyperatlas. The 2MASS mosaics are now being assessed for scientific quality. The input images consist of 4,121,440 files, each 2 MB in size. The input files that fall on mosaic boundaries are opened, read, and used multiple times in the processing of adjacent mosaics, so that a total of 14 TB in 6,275,494 files are actually opened and read in the creation of mosaics across the entire survey. Around 130,000 CPU-hours were used to complete the mosaics. The output consists of 1734 6-degree plates for each of 3 bands. Each of the 5202 mosaics is roughly 4 GB in size, and each has been tiled into a 12x12 array of 26-MB files for ease of handling. The total size is about 20 TB in 750,000 tiles.
(This paper is not open access)Abstract: This paper describes the Pegasus framework that can be used to map complex scientific workflows onto distributed resources. Pegasus enables users to represent the workflows at an abstract level without needing to worry about the particulars of the target execution systems. The paper describes general issues in mapping applications and the functionality of Pegasus. We present the results of improving application performance through workflow restructuring.
(This chapter is not open access)Abstract: This chapter discusses some grid experiences in solving the problem of generating large astronomical image mosaics by composing multiple small images, from the team that has developed Montage. The problem of generating these mosaics is complex in that individual images must be projected into a common coordinate space, overlaps between images calculated, the images processed so that the backgrounds match, and images composed while using a variety of techniques to handle the presence of multiple pixels in the same output space. To accomplish these tasks, a suite of software tools called Montage has been developed. The modules in this suite can be run on a single processor computer using a simple shell script, and can additionally be run using a combination of parallel approaches. These include running MPI versions of some modules, and using standard grid tools. In the latter case, processing workflows are automatically generated, and appropriate data sources are located and transferred to a variety of parallel processing environments for execution. As a result, it is now possible to generate large-scale mosaics on-demand in timescales that support iterative, scientific exploration. In this chapter, we describe Montage, how it was modified to execute in the grid environment, the tools that were used to support its execution, as well as performance results.
(This paper is open access in the US)Abstract: This paper compares two methods for running an application composed of a set of modules on a grid. The set of modules (collectively called Montage) generates large astronomical image mosaics by composing multiple small images. The workflow that describes a particular run of Montage can be expressed as a directed acyclic graph (DAG), or as a short sequence of parallel (MPI) and sequential programs. In the first case, Pegasus can be used to run the workflow. In the second case, a short shell script that calls each program can be run. In this paper, we discuss the Montage modules, the workflow run for a sample job, and the two methods of actually running the workflow. We examine the run time for each method and compare the portions that differ between the two methods.
(This paper is not open access)Abstract: Pegasus is a planning framework for mapping abstract workflows for execution on the Grid. This paper presents the implementation of a web-based portal for submitting workflows to the Grid using Pegasus. The portal also includes components for generating abstract workflows based on a metadata description of the desired data products and application-specific services. We describe our experiences in using this portal for two Grid applications. A major contribution of our work is in introducing several components that can be useful for Grid portals and hence should be included in Grid portal development toolkits.
(This paper is not open access)Abstract: Montage is a software system for generating astronomical image mosaics according to user-specified size, rotation, WCS-compliant projection and coordinate system, with background modeling and rectification capabilities. Its architecture has been described in the proceedings of ADASS XII and XIII. It has been designed as a toolkit, with independent modules for image reprojection, background rectification and coaddition, and will run on workstations, clusters and grids. The primary limitation of Montage thus far has been in the projection algorithm. It uses a spherical trigonometry approach that is general at the expense of speed. The reprojection algorithm has now been made 30 times faster for commonly used tangent plane to tangent plane reprojections that cover up to several square degrees, through modification of a custom algorithm first derived by the Spitzer Space Telescope. This focus session will describe this algorithm, demonstrate the generation of mosaics in real time, and describe applications of the software. In particular, we will highlight one case study which shows how Montage is supporting the generation of science-grade mosaics of images measured with the Infrared Array Camera aboard the Spitzer Space Telescope.
(This paper is open access in the US)Abstract: The Grist project is developing a grid-technology based system as a research environment for astronomy with massive and complex datasets. This knowledge extraction system will consist of a library of distributed grid services controlled by a workflow system, compliant with standards emerging from the grid computing, web services, and virtual observatory communities. This new technology is being used to find high redshift quasars, study peculiar variable objects, search for transients in real time, and fit SDSS QSO spectra to measure black hole masses. Grist services are also a component of the "hyperatlas" project to serve high-resolution multi-wavelength imagery over the Internet. In support of these science and outreach objectives, the Grist framework will provide the enabling fabric to tie together distributed grid services in the areas of data access, federation, mining, subsetting, source extraction, image mosaicking, statistics, and visualization.
(This paper is not open access)Abstract: This paper describes the design of a grid-enabled version of Montage, an astronomical image mosaic service, suitable for large scale processing of the sky. All the re-projection jobs can be added to a pool of tasks and performed by as many processors as are available, exploiting the parallelization inherent in the Montage architecture. We show how we can describe the Montage application in terms of an abstract workflow so that a planning tool such as Pegasus can derive an executable workflow that can be run in the Grid environment. The execution of the workflow is performed by the workflow manager DAGMan and the associated Condor-G. The grid processing will support tiling of images to a manageable size when the input images can no longer be held in memory. Montage will ultimately run operationally on the Teragrid. We describe science applications of Montage, including its application to science product generation by Spitzer Legacy Program teams and large-scale, all-sky image processing projects.
(This paper is open access in the US)Abstract: This paper discusses work done by JPL's Parallel Applications Technologies Group in helping scientists access and visualize very large data sets through the use of multiple computing resources, such as parallel supercomputers, clusters, and grids. These tools do one or more of the following tasks: visualize local data sets for local users, visualize local data sets for remote users, and access and visualize remote data sets. The tools are used for various types of data, including remotely sensed image data, digital elevation models, astronomical surveys, etc. The paper attempts to pull some common elements out of these tools that may be useful for others who have to work with similarly large data sets.
(This paper is open access)Abstract: Montage is an Earth Science Technology Office (ESTO) Computational Technologies (CT) Round III Grand Challenge project that will deploy a portable, compute-intensive, custom astronomical image mosaicking service for the National Virtual Observatory (NVO). Although Montage is developing a compute- and data-intensive service for the astronomy community, we are also helping to address a problem that spans both Earth and space science: how to efficiently access and process multi-terabyte, distributed datasets. In both communities, the datasets are massive, and are stored in distributed archives that are, in most cases, remote with respect to the available computational resources. Therefore, use of state-of-the-art computational grid technologies is a key element of the Montage portal architecture. This paper describes the aspects of the Montage design that are applicable to both the Earth and space science communities.
(This paper is not open access)Abstract: The architecture of Montage, which delivers custom science-grade astronomical images, was presented at ADASS XII. That architecture has been tested on 2MASS images computed on single processor Linux machines that hold all image data in memory. This year, we describe the design of a grid-enabled version of Montage, suitable for large scale processing of the sky. It exploits to the maximum the parallelization inherent in the Montage architecture, whereby image re-projections are performed in parallel. All the re-projection jobs can be added to a pool of tasks and performed by as many processors as are available. We show how we can describe the Montage application in terms of an abstract workflow so that a planning tool such as Pegasus can derive an executable workflow that can be run in the Grid environment. The execution of the workflow is performed by the workflow manager DAGMan and the associated Condor-G. The grid processing will support tiling of images to a manageable size when the input images can no longer be held in memory. When fully tested, Montage will ultimately run operationally on the Teragrid. We will present processing metrics and describe how Montage is being used, including its application to science product generation by SIRTF Legacy Program teams and large-scale image processing projects such as Atlasmaker (this conference).
(This paper is not open access)Abstract: Systems that operate in extremely volatile environments, such as orbiting satellites, must be designed with a strong emphasis on fault tolerance. Rather than rely solely on the system hardware, it may be beneficial to entrust some of the fault handling to software at the application level, which can utilize semantic information and software communication channels to achieve fault tolerance with considerably less power and performance overhead. This paper details the implementation and evaluation of such a software-level approach, Application-Level Fault Tolerance and Detection (ALFTD) into the Orbital Thermal Imaging Spectrometer (OTIS).
(This paper is not open access)Abstract: Earth system and environmental models present the scientist/programmer with multiple challenges in software design, development, and maintenance, overall system integration, and performance. We describe how work in the industrial sector of software engineering - namely component-based software engineering - can be brought to bear to address issues of software complexity. We explain how commercially developed component solutions are inadequate to address the performance needs of the Earth system modeling community. We describe a component-based approach called the Common Component Architecture that has as its goal the creation of a component paradigm that is compatible with the requirements of high-performance computing applications. We outline the relationship and ongoing collaboration between CCA and major Climate/Weather/Ocean community software projects. We present examples of work in progress that uses CCA, and discuss long-term plans for the CCA-climate/weather/ocean collaboration.
(This paper is open access in the US)Abstract: We describe and test a software approach to fault detection in common numerical algorithms. Such result checking or algorithm-based fault tolerance (ABFT) methods may be used, for example, to overcome single-event upsets in computational hardware or to detect errors in complex, high-efficiency implementations of the algorithms. Following earlier work, we use checksum methods to validate results returned by a numerical subroutine operating subject to unpredictable errors in data. We consider common matrix and Fourier algorithms which return results satisfying a necessary condition having a linear form: the checksum tests compliance with this condition. We discuss the theory and practice of setting numerical tolerances to separate errors caused by a fault from those inherent in finite-precision numerical calculations. We concentrate on comprehensively defining and evaluating tests having various accuracy/computation burden tradeoffs, and emphasize average-case algorithm behavior rather than using worst-case upper bounds on error.
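As a rough illustration of the checksum idea in the abstract above, the sketch below checks a matrix product C = AB against the linear necessary condition Cw = A(Bw) for a checksum vector w. It is not the paper's test: the all-ones checksum vector, the relative tolerance `rel_tol=1e-9`, the scaling rule, and every function name are assumptions made for this example.

```python
import random


def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(v * xj for v, xj in zip(row, x)) for row in m]


def matmul(a, b):
    """Plain matrix product; stands in for the routine being checked."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def checksum_ok(a, b, c, rel_tol=1e-9):
    """Test the necessary condition C w == A (B w) with w = all ones.

    Passing does not prove C == A B; failing indicates a fault larger than
    the assumed rounding tolerance.
    """
    w = [1.0] * len(b[0])
    lhs = matvec(c, w)
    rhs = matvec(a, matvec(b, w))
    # Scale the tolerance by the magnitude of the terms summed in each row,
    # so rounding error in large products is not mistaken for a fault.
    abs_a = [[abs(v) for v in row] for row in a]
    abs_bw = [sum(abs(v) for v in row) for row in b]
    scale = matvec(abs_a, abs_bw)
    return all(abs(l - r) <= rel_tol * max(s, 1.0)
               for l, r, s in zip(lhs, rhs, scale))


random.seed(0)
n = 4
a = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
b = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
c = matmul(a, b)

# Inject a single corrupted entry to mimic a fault in the result.
c_bad = [row[:] for row in c]
c_bad[1][2] += 1e-3

print(checksum_ok(a, b, c))      # fault-free result
print(checksum_ok(a, b, c_bad))  # corrupted result
```

The check costs three matrix-vector products, which is cheap next to the matrix product itself. The choice of tolerance governs the tradeoff the abstract mentions: too tight and rounding error triggers false alarms, too loose and small faults go undetected.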
(This paper is open access in the US)Abstract: NASA's successful exploration of space has uncovered vast amounts of new knowledge about the Earth, the solar system and its other planets, and the stellar spaces beyond. To continue gaining new knowledge has required - and will continue to require - new capabilities in onboard processing hardware, system software, and applications such as autonomy.
For example, initial robotic space exploration missions functioned, for the most part, as large flying cameras. These instruments have evolved over time to include more sophisticated imaging radar, multispectral imagers, spectrometers, gravity wave detectors, a host of prepositioned sensors and, most recently, rovers.
(This paper is open access in the US)Abstract: The development of large-scale multi-disciplinary scientific applications for high-performance computers today involves managing the interaction between portions of the application developed by different groups. The CCA (Common Component Architecture) Forum is developing a component architecture specification to address high-performance scientific computing, emphasizing scalable (possibly-distributed) parallel computations. This paper presents an examination of the CCA software in sequential and parallel electromagnetics applications using unstructured adaptive mesh refinement (AMR). The CCA learning curve and the process for modifying Fortran 90 code (a driver routine and an AMR library) into two components are described. The performance of the original applications and the componentized versions are measured and shown to be comparable.
(This paper is not open access)Abstract: The National Virtual Observatory (NVO) will provide on-demand access to data collections, data fusion services and compute intensive applications. The paper describes the development of a framework that will support two key aspects of these objectives: a compute engine that will deliver custom image mosaics, and a "request management system," based on an e-business applications server, for job processing, including monitoring, failover and status reporting. We will develop this request management system to support a diverse range of astronomical requests, including services scaled to operate on the emerging computational grid infrastructure. Data requests will be made through existing portals to demonstrate the system: the NASA/IPAC Extragalactic Database (NED); the On-Line Archive Science Information Services (OASIS) at the NASA/IPAC Infrared Science Archive (IRSA); the Virtual Sky service at Caltech's Center for Advanced Computing Research (CACR); and the yourSky mosaic server at the Jet Propulsion Laboratory (JPL).
(This paper is open access)