

Software: UltraScan Solution
    Tuesday July 17, 2012 10:00am - 10:30am @ Toledo 5th Floor

    Software: UltraScan Solution Modeler: Integrated Hydrodynamic Parameter and Small Angle Scattering Computation and Fitting Tools

    Abstract: UltraScan Solution Modeler (US-SOMO) processes atomic and lower-resolution bead model representations of biological and other macromolecules to compute various hydrodynamic parameters, such as the sedimentation and diffusion coefficients, relaxation time and intrinsic viscosity, and small angle scattering curves that contribute to our understanding of molecular structure in solution. Knowledge of biological macromolecules’ structure aids researchers in understanding their function as a path to disease prevention and therapeutics for conditions such as cancer, thrombosis, Alzheimer’s disease and others. US-SOMO provides a convergence of experimental, computational, and modeling techniques, in which detailed molecular structure and properties are determined from data obtained in a range of experimental techniques that, by themselves, give incomplete information. Our goal in this work is to develop the infrastructure and user interfaces that will enable a wide range of scientists to carry out complicated experimental data analysis techniques on XSEDE. Our user community consists of biophysics and structural biology researchers. A recent PubMed search returned 9,205 papers from the past decade referencing the techniques we support. We believe our software will provide these researchers a convenient and unique framework to determine structure, and thus advance their research.

    The computed hydrodynamic parameters and scattering curves are screened against experimental data, effectively pruning potential structures into equivalence classes. Experimental methods may include analytical ultracentrifugation, dynamic light scattering, small angle X-ray scattering, NMR, fluorescence spectroscopy, and others. One source of macromolecular models is X-ray crystallographic studies. A molecule’s behavior in solution may not match that observed in the crystal form. Using computational techniques, an initial fixed model can be expanded into a search space utilizing high-temperature molecular dynamics approaches or stochastic methods such as Brownian dynamics. The number of structures produced can vary greatly, ranging from hundreds to tens of thousands or more. This introduces a number of cyberinfrastructure challenges. Computing hydrodynamic parameters and small angle scattering curves can be computationally intensive for each structure, and therefore cluster compute resources are essential for timely results. Input and output data sizes can vary greatly from less than 1 MB to 2 GB or more. Although the parallelization is trivial, along with data size variability there is a large range of compute sizes, ranging from one to potentially thousands of cores with compute times of minutes to hours.
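    The screening step described above can be sketched roughly as follows. This is an illustrative sketch only, not US-SOMO's actual interface: the function name, parameter names, values, and tolerances are all hypothetical.

    ```python
    # Hypothetical sketch of screening candidate structures against experimental
    # data: models whose computed hydrodynamic parameters fall within tolerance
    # of the measured values are kept, pruning the pool of potential structures.

    def screen_structures(models, experiment, tolerances):
        """Keep models whose computed parameters match experiment within tolerance."""
        accepted = []
        for name, params in models.items():
            ok = all(
                abs(params[key] - experiment[key]) <= tolerances[key]
                for key in experiment
            )
            if ok:
                accepted.append(name)
        return accepted

    # Toy values: sedimentation coefficient and intrinsic viscosity per model.
    models = {
        "structure_a": {"sed_coeff": 4.1, "intrinsic_viscosity": 3.4},
        "structure_b": {"sed_coeff": 5.9, "intrinsic_viscosity": 3.5},
    }
    experiment = {"sed_coeff": 4.0, "intrinsic_viscosity": 3.5}
    tolerances = {"sed_coeff": 0.3, "intrinsic_viscosity": 0.2}

    print(screen_structures(models, experiment, tolerances))  # → ['structure_a']
    ```

    In practice each model's parameters come from an expensive per-structure computation, which is why the abstract emphasizes cluster resources for this stage.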

    In addition to the distributed computing infrastructure challenges, an important concern was how to allow a user to conveniently submit, monitor and retrieve results from within the C++/Qt GUI application while maintaining a method for authentication, approval and throttling of usage. Middleware supporting these design goals has been integrated into the application with assistance from the Open Gateway Computing Environments (OGCE) collaboration team. The approach was tested on various XSEDE clusters and local compute resources. This paper reviews current US-SOMO functionality and implementation with a focus on the newly deployed cluster integration.



    Type Software and Software Environments Track


Software: Trinity RNA-Seq
    Tuesday July 17, 2012 10:30am - 11:00am @ Toledo 5th Floor

    Software: Trinity RNA-Seq Assembler Performance Optimization

    Abstract: RNA-sequencing is a technique to study RNA expression in biological material. It is quickly gaining popularity in the field of transcriptomics. Trinity is a software tool that was developed for efficient de novo reconstruction of transcriptomes from RNA-Seq data. In this paper we first conduct a performance study of Trinity and compare it to previously published data from 2011. We examine the runtime behavior of Trinity as a whole as well as its individual components and then optimize the most performance-critical parts. We find that standard best practices for HPC applications can also be applied to Trinity, especially on systems with large amounts of memory. When combining best practices for HPC applications along with our specific performance optimizations, we can decrease the runtime of Trinity by a factor of 3.9. This brings the runtime of Trinity in line with other de novo assemblers while maintaining superior quality.



    Type Software and Software Environments Track


Software: Exploring Similarities
    Tuesday July 17, 2012 11:00am - 11:30am @ Toledo 5th Floor

    Software: Exploring Similarities Among Many Species Distributions

    Abstract: Collecting species presence data and then building models to predict species distribution has been long practiced in the field of ecology for the purpose of improving our understanding of species relationships with each other and with the environment. Due to limitations of computing power as well as limited means of using modeling software on HPC facilities, past species distribution studies have been unable to fully explore diverse data sets. We build a system that can, for the first time to our knowledge, leverage HPC to support effective exploration of species similarities in distribution as well as their dependencies on common environmental conditions. Our system can also compute and reveal uncertainties in the modeling results enabling domain experts to make informed judgments about the data. Our work was motivated by and centered around data collection efforts within the Great Smoky Mountains National Park that date back to the 1940s. Our findings present new research opportunities in ecology and produce actionable field-work items for biodiversity management personnel to include in their planning of daily management activities.



    Type Software and Software Environments Track


Software: The Prickly Pear
    Tuesday July 17, 2012 11:30am - 12:00pm @ Toledo 5th Floor

    Software: The Prickly Pear Archive: A Portable Hypermedia for Scholarly Publication

    Abstract: An executable paper is a hypermedia for publishing, reviewing, and reading scholarly papers which include a complete HPC software development or scientific code. A hypermedia is an integrated interface to multimedia including text, figures, video, and executables, on a subject of interest. Results within the executable paper include numeric output, graphs, charts, tables, equations, and the underlying codes which generated them. These results are dynamically regenerated and included in the paper upon recompilation and re-execution of the code. This enables a scientifically enriched environment which functions not only as a journal but as a laboratory in itself, in which readers and reviewers may interact with and validate the results.
    The Prickly Pear Archive (PPA) is such a system [1]. One distinguishing feature of the PPA is the inclusion of an underlying component-based simulation framework, Cactus [3], which simplifies the process of composing, compiling, and executing simulation codes. Code creation is simplified using common bits of infrastructure; each paper adds to the functionality of the framework. Other distinguishing features include the portability and reproducibility of the archive, which allow researchers to re-create the software environment in which the simulation code was created.
    A PPA production system hosted on HPC resources (e.g. an XSEDE machine) unifies the computational scientific process with the publication process. A researcher may use the production archive to test simulations; and upon arriving at a scientifically meaningful result, the user may then incorporate the result in an executable paper on the very same resource on which the simulation was conducted. Housed within a virtual machine, the PPA allows multiple accounts within the same production archive, enabling users across campuses to bridge their efforts in developing scientific codes.
    The executable paper incorporates into its markup code references to manipulable parameters, including symbolic equations, which the simulation executable accepts as input. An interface to this markup code enables authors, readers, and reviewers to control these parameters and re-generate the paper, potentially arriving at a novel result. Thus, the executable paper functions not only as a publication in itself, but also as an interactive laboratory from which novel science may be extracted. One can imagine the executable paper environment, encapsulated and safeguarded by a virtual machine, as a portable laboratory in which the computational scientist arrived at the result. The notion of an executable paper is particularly useful in the context of computer and computational science, where the code underlying a (scientific) software development is of interest to the wider development community.
    Why are executable papers to be preferred over traditional papers? As Gavish et al. have observed [2], the current workflow in the life cycle of a traditional paper may be summarized in the following five steps:
    1. Store a private copy of the original data to be processed on the local machine.
    2. Write a script or computer program, with hard-coded tuning parameters, to load the data from the local file, analyze it, and output selected results (a few graphical figures, tables, etc.) to the screen or to local files.
    3. Withhold the source code that was executed, and the copy of the original data that was used, and keep them in the local file system in a directory called e.g. “code-final.”
    4. Copy and paste the results into a publication manuscript containing a textual description of the computational process that presumably took place.
    5. Submit the manuscript for publication as a package containing the word processor source file and the graphical figure files.
    Three issues with this workflow lie with the communication, validation, and reproducibility of the method used to arrive at the result. The novel execution-based publication paradigm for computer and computational science has a few advantages over the traditional paper-based paradigm which allow it to overcome these issues.
    The first advantage is the enhanced explanatory power offered by multimedia such as graphs, charts, and numeric values which are generated from manipulable parameters. The reader and reviewer may view the parameters to see how such figures were arrived at. They may also vary the parameters, re-execute the code, and witness changes to the media themselves. This level of interactivity offered by the executable archive allows the audience to understand the simulation by allowing them to conduct it first-hand.
    The second advantage is the increased level of validation and peer review of the experimental method, owing to the inclusion of code and logs used to arrive at a result, which is essential for computational science to self-correct. Though a computational scientist may reproduce a simulation code according to a natural-language or pseudocode description, implementation differences (minutiae such as the order of nested loops, the order of computations, and the language constructs used) may have a dramatic impact on the result. Were these details subject to review, inefficiencies and errors could be more easily corrected. Fortunately, since the experimental apparatus in computational science is digitized, it is potentially easier for it to achieve this level of validation relative to paper-based media.
    The third is reproducibility, which supports efforts to modify or extend the computational method. In other sciences, the experimental method is fleshed out in the traditional paper in such detail that anyone with the proper equipment may repeat it and expect the same results. Perhaps the computational scientist may be able to supply pseudocode in the traditional paper for others to implement. However, reproducing codes (especially simulation codes) from pseudocode is a laborious and time-consuming process. Even when such reproduction is completed, gaining access to the proper equipment is often as problematic for the computational scientist as for any other, as it requires allocations on high-security supercomputers and time invested in building the software dependencies necessary to run the code.
    Given the digital nature of computational science, it seems that communicating, validating, and reproducing experimental work should be easier. To facilitate the process of extending or further investigating a computational model, ideally modifiable code would come with the paper, along with the process of building and executing it (in the form of a script) and the total environment in which the code was written, built, and executed. In addition, the paper results would be dynamically generated upon recompilation and execution following any modifications to the code.
    The latter scenario may seem like a distant possibility, but the production of many executable archives is already underway. One such archive is the Prickly Pear Archive, an executable paper journal which uses the Cactus Computational Toolkit as its underlying framework. The award-winning Cactus framework is used for general problem-solving on regular meshes, and is used across disciplines including numerical relativity, astrophysics, and coastal science.
    [Figure 1: The PPA workflow integrates several components through its PHP interface, including Cactus, Kranc, SimFactory and LaTeX.]
    In addition to Cactus, the PPA interfaces with several other components, including: SimFactory, a set of utilities for building and running Cactus applications and managing Cactus jobs; Kranc, a script for creating Cactus thorns from equations rendered in a Mathematica-style language; the Piraha parsing expression grammar for parsing parameter and configuration files; the LaTeX markup language; and an encapsulating virtual machine which enables portability and reproducibility.



    Type Software and Software Environments Track


Software: Invited Talk: Building your


Software: Roadmaps, Not Blueprints
    Tuesday July 17, 2012 3:15pm - 3:45pm @ Toledo 5th Floor

    Software: Roadmaps, Not Blueprints: Paving the Way to Science Gateway Success

    Abstract: As science today grows ever more digital, it poses exciting challenges and opportunities for researchers. The existence of science gateways—and the advanced cyberinfrastructure (CI) tools and resources behind the accessible web interfaces—can significantly improve the productivity of researchers facing the most difficult challenges, but designing the most effective tools requires an investment of time, effort, and money. Because not all gateways can be funded in the long term, it is important to identify the characteristics of successful gateways and make early efforts to incorporate whatever strategies will set up new gateways for success. Our research seeks to identify why some gateway projects change the way science is conducted in a given community while other gateways do not. Through a series of five full-day, iterative, multidisciplinary focus groups, we have gathered input and insights from sixty-six participants representing a diverse array of gateways and portals, funding organizations, research institutions, and industrial backgrounds. In this paper, we describe the key factors for success as well as the situational enablers of these factors. These findings are grouped into five main topical areas—the builders, the users, the roadmaps, the gateways, and the support systems—but we find that many of these factors and enablers are intertwined and inseparable, and there is no easy prescription for success.



    Type Software and Software Environments Track


Software: Offline Parallel
    Tuesday July 17, 2012 3:45pm - 4:15pm @ Toledo 5th Floor

    Software: Offline Parallel Debugging: A Case Study Report

    Abstract: Debugging is difficult; debugging parallel programs at large scale is particularly so. Interactive debugging tools continue to improve in ways that mitigate the difficulties, and the best such systems will continue to be mission critical. Such tools have their limitations, however. They are often unable to operate across many thousands of cores. Even when they do function correctly, mining and analyzing the right data from the results of thousands of processes can be daunting, and it is not easy to design interfaces that are useful and effective at large scale. One additional challenge goes beyond the functionality of the tools themselves. Leadership class systems typically operate in a batch mode intended to maximize utilization and throughput. It is generally unrealistic to expect to schedule a large block of time to operate interactively across a substantial fraction of such a system. Even when large scale interactive sessions are possible, they can be expensive, and can impact system access for others.

    Given these challenges, there is potential value in research into other non-traditional debugging models. Here we describe our progress with one such approach: offline debugging. In Section \ref{Concept} we describe the concept of offline debugging in general terms. We then provide in Section \ref{Implementation} an overview of GDBase, a prototype offline debugger. Section \ref{Case Studies} describes proof-of-concept demonstrations of GDBase, and focuses on the first attempts to deploy GDBase in large-scale debugging efforts of major research codes. Section \ref{Conclusions} highlights lessons learned and recommendations.



    Type Software and Software Environments Track


BOF: Big Data
    Tuesday July 17, 2012 4:45pm - 5:45pm @ Toledo 5th Floor

    Abstract: “Big Data” is a major force in the current scientific environment. In March 2012, President Barack Obama released the details of the administration’s big data strategy. Dr. John P. Holdren, Assistant to the President and Director of the White House Office of Science and Technology Policy announced, “the initiative we are launching today promises to transform our ability to use Big Data for scientific discovery, environmental and biomedical research, education, and national security.” 

    The datasets used by XSEDE researchers are getting larger as well. In the 2012 survey of researchers using RDAV’s Nautilus system at NICS, over two-thirds of respondents indicated that their data will grow in size over the next year. As more powerful resources, such as the upcoming Stampede system at TACC, become available to XSEDE researchers, simulation sizes will continue to grow. Furthermore, the XSEDE Campus Champions represent college and university researchers who are confronted by large datasets. 

    There are many unsolved problems associated with analyzing, moving, storing, and understanding large scale data sets. At this BoF, researchers and XSEDE staff can discuss their challenges and successes with working with large datasets as well as the hardware, software, and support resources that XSEDE service providers can offer to researchers working with big data. 


    Type BOF



Software: The CIPRES
    Wednesday July 18, 2012 10:00am - 10:30am @ Toledo 5th Floor

    Software: The CIPRES Science Gateway: Enabling High-Impact Science for Phylogenetics Researchers with Limited Resources

    Abstract: The CIPRES Science Gateway (CSG) provides browser-based access to computationally demanding phylogenetic codes run on large HPC resources. Since its release in December 2009, there has been sustained, near-linear growth in the rate of CSG use, both in the number of users submitting jobs each month and in the number of jobs submitted. The average amount of computational time used per month by the CSG increased more than 5-fold over that time period. As of April 2012, more than 4,000 unique users have run parallel tree inference jobs on TeraGrid/XSEDE resources using the CSG. The steady growth in resource use suggests that the CSG is meeting an important need for computational resources in the Systematics/Evolutionary Biology community. To ensure that XSEDE resources accessed through the CSG are used effectively, policies for resource consumption were developed, and an advanced set of management tools was implemented. Studies of usage trends show that these new management tools helped in distributing XSEDE resources across a large user population that has low-to-moderate computational needs. In the last quarter of 2011, 29% of all active XSEDE users accessed computational resources through the CSG, while the analyses conducted by these users accounted for 0.7% of all allocatable XSEDE computational resources. User survey results showed that the easy access to XSEDE/TeraGrid resources through the CSG had a critical and measurable scientific impact: at least 300 scholarly publications spanning all major groups within the Tree of Life have been enabled by the CSG since 2009. The same users reported that 82% of these publications would not have been possible without access to computational resources available through the CSG. The results indicate that the CSG is a critical and cost-effective enabler of science for phylogenetic researchers with limited resources.



    Type Software and Software Environments Track


Software: Mojave
    Wednesday July 18, 2012 10:30am - 11:00am @ Toledo 5th Floor

    Software: Mojave: A Development Environment for the Cactus Computational Framework

    Abstract: This paper presents “Mojave,” a set of plug-ins for the Eclipse Integrated Development Environment (IDE) which provides a unified interface for HPC code development and job management. Mojave facilitates code creation, refactoring, building, and running of a set of HPC scientific codes based on the Cactus Computational Toolkit, a computational framework for general problem-solving on regular meshes. The award-winning Cactus framework has been used in numerous fields including numerical relativity, cosmology, and coastal science. Cactus, like many high-level frameworks, leverages DSLs and generated distributed data structures. Mojave facilitates the development of Cactus applications and the submission of Cactus runs to high-end resources (e.g. XSEDE systems) using built-in Eclipse features, C/C++ Development Tooling (CDT), Parallel Tools Platform (PTP) plug-ins [6], and SimFactory (a Cactus-specific set of command-line utilities) [5].
    Numerous quality and productivity gains can be achieved using integrated development environments (IDEs) [2, 3, 4]. IDEs offer advanced search mechanisms that give developers a broader range of utilities to manage their code base; refactoring capabilities, which enable developers to easily perform tedious and error-prone code transformations and keep their code more maintainable [1]; static analysis tools, which help locate and correct bugs; and many other advantages.
    In order for Eclipse to provide these features, however, it needs to be able to index a Cactus codebase, and to do this it needs to understand how Cactus organizes its generated files. Originally, Cactus used per-directory macro definitions in its generated code to enable individual modules to access module-specific code; as a result, displaying or analyzing the content of many header files depended on which file included them, making it impossible for any IDE to properly render or index the code. Part of the Mojave development effort includes refactoring Cactus to use per-directory include mechanisms instead of defines, generating multiple versions of the same header files in different directories. These changes are invisible to codes that use Cactus, but required teaching Mojave to correctly and automatically configure (or dynamically reconfigure) the numerous module directories within a Cactus build directory.
    Mojave also leverages the Parallel Tools Platform’s (PTP) feature-rich JAXB-based resource management interface. This new extensible resource management component enables Mojave to add its own specialized resource manager, which describes the commands used to interact with the remote resource manager and provides a means to display workload distribution on remote machines graphically. Thus users may view their jobs graphically in the context of, e.g., an XSEDE resource’s workload.
    Mojave provides specialized integration points for SimFactory, a command-line tool for Cactus job submission and monitoring. These integration points enable scientists to (1) develop new thorns, and to (2) create, (3) manage, and (4) monitor simulations or sets of simulations on remote machines. Mojave offers a menu-driven interface to SimFactory, allowing users to add their own commands under the menu or attach an action script for Cactus simulation management operations triggered by the matching of regular expressions on the console output stream. This allows for flexible responses to job monitoring information.
    Because the Cactus SimFactory tools are designed to enable remote submissions to multiple resources and manage jobs on many machines at once, Mojave introduces a new job information sharing feature which enables a research group to view and monitor a set of jobs running on diverse resources submitted by multiple scientists. The research group may then be better informed about its existing runs. In addition, information shared about the results of the runs enables a quick response from the community, conveniently from within the same environment in which the code was developed. Job sharing capabilities, combined with the productivity gains offered by Eclipse and the CDT and PTP plug-ins as well as the job submission and monitoring capabilities offered by SimFactory, lend flexibility to Mojave as a unified computational science interface for Cactus, thereby bridging scientific efforts across campuses around the globe.
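    The regex-triggered action mechanism described above can be sketched in a few lines. This is not Mojave's actual code; the rule table, action names, and sample console lines are all illustrative.

    ```python
    # Illustrative sketch: scan a console output stream and trigger named
    # actions whenever a line matches one of the registered regular expressions,
    # roughly as a job-monitoring hook might react to simulation output.

    import re

    RULES = [
        (re.compile(r"Job (\d+) completed"), "fetch_results"),
        (re.compile(r"ERROR|FATAL"), "notify_user"),
    ]

    def dispatch(line):
        """Return the actions triggered by one line of console output."""
        return [action for pattern, action in RULES if pattern.search(line)]

    console = [
        "Step 100 of 1000 done",
        "Job 42 completed",
        "FATAL: disk quota exceeded",
    ]
    for line in console:
        for action in dispatch(line):
            print(action, "<-", line)
    ```

    In a real integration the action would invoke a script rather than print, but the core pattern (a rule table of compiled regexes scanned per output line) is the same.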

    REFERENCES
    [1] D. Dig, F. Kjolstad, and M. Snir. Bringing the HPC Programmer’s IDE into the 21st Century through Refactoring. In SPLASH 2010 Workshop on Concurrency for the Application Programmer (CAP’10). Association for Computing Machinery (ACM), Oct. 2010.
    [2] A. Frazer. CASE and its contribution to quality. In IEE Colloquium on Layman’s Guide to Software Quality, pages 6/1-6/4, Dec. 1993.
    [3] M. J. Granger and R. A. Pick. Computer-aided software engineering’s impact on the software development process: An experiment. In Proceedings of the 24th Hawaii International Conference on System Sciences, pages 28-35, Jan. 1991.
    [4] P. H. Luckey and R. M. Pittman. Improving software quality utilizing an integrated CASE environment. In Proceedings of the IEEE 1991 National Aerospace and Electronics Conference (NAECON 1991), pages 665-671 vol. 2, May 1991.
    [5] M. W. Thomas and E. Schnetter. Simulation Factory: Taming Application Configuration and Workflow on High-End Resources. ArXiv e-prints, Aug. 2010.
    [6] G. R. Watson, C. E. Rasmussen, and B. R. Tibbitts. An integrated approach to improving the parallel application development process. In IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2009), pages 1-8, May 2009.


    Type Software and Software Environments Track


Software: Enabling Large-scale
    Wednesday July 18, 2012 11:00am - 11:30am @ Toledo 5th Floor

    Software: Enabling Large-scale Scientific Workflows on Petascale Resources Using MPI Master/Worker

    Abstract: Computational scientists often need to execute large, loosely-coupled parallel applications such as workflows and bags of tasks in order to do their research. These applications are typically composed of many short-running serial tasks, but frequently demand large amounts of computation and storage. In order to produce results in a reasonable time, scientists would like to execute these applications using petascale resources. In the past this has been a challenge because petascale systems are not designed to execute such workloads efficiently. In this paper we describe a new approach to executing large, fine-grained workflows on distributed petascale systems. Our solution involves partitioning the workflow into independent subgraphs, and then submitting each subgraph as a self-contained MPI job to remote resources. We describe how the partitioning and job management has been implemented in the Pegasus Workflow Management System. We also explain how this approach provides an end-to-end solution for challenges related to system architecture, queue policies and priorities, and application reuse and development. Finally, we describe how the system is being used to enable the execution of a very large seismic hazard analysis application on XSEDE resources.
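    The partitioning idea in the abstract can be illustrated with a small sketch. This is not the Pegasus implementation; it simply shows one way of splitting a workflow DAG into independent subgraphs (here, weakly connected components), each of which could then be submitted as a self-contained job.

    ```python
    # Hedged sketch: partition a workflow DAG into independent subgraphs by
    # finding weakly connected components. Task names and edges are toy data.

    from collections import defaultdict

    def partition_workflow(edges, tasks):
        """Group tasks into independent subgraphs (weakly connected components)."""
        neighbors = defaultdict(set)
        for parent, child in edges:
            neighbors[parent].add(child)
            neighbors[child].add(parent)
        seen, parts = set(), []
        for task in tasks:
            if task in seen:
                continue
            component, stack = [], [task]
            while stack:
                node = stack.pop()
                if node in seen:
                    continue
                seen.add(node)
                component.append(node)
                stack.extend(neighbors[node] - seen)
            parts.append(sorted(component))
        return parts

    tasks = ["a", "b", "c", "d", "e"]
    edges = [("a", "b"), ("b", "c"), ("d", "e")]  # a->b->c and d->e
    print(partition_workflow(edges, tasks))  # → [['a', 'b', 'c'], ['d', 'e']]
    ```

    Real workflow partitioners also balance subgraph sizes against queue limits and respect dependency ordering within each subgraph, which this sketch omits.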



    Type Software and Software Environments Track


Software: The Eclipse Parallel Tools
    Wednesday July 18, 2012 11:30am - 12:00pm @ Toledo 5th Floor

    Software: The Eclipse Parallel Tools Platform: Toward an Integrated Development Environment for XSEDE Resources

    Abstract: Eclipse is a widely used, open source integrated development environment that includes support for C, C++, Fortran, and Python. The Parallel Tools Platform (PTP) extends Eclipse to support development on high performance computers. PTP allows the user to run Eclipse on her laptop, while the code is compiled, run, debugged, and profiled on a remote HPC system. PTP provides development assistance for MPI, OpenMP, and UPC; it allows users to submit jobs to the remote batch system and monitor the job queue; and it provides a visual parallel debugger.

    In this paper, we will describe the capabilities we have added to PTP to support XSEDE resources. These capabilities include submission and monitoring of jobs on systems running Sun/Oracle Grid Engine, support for GSI authentication and MyProxy logon, support for environment modules, and integration with compilers from Cray and PGI. We will describe ongoing work and directions for future collaboration, including OpenACC support and parallel debugger integration



    Type Software and Software Environments Track


Science: Monte Carlo strategies
    Wednesday July 18, 2012 1:15pm - 1:45pm @ Toledo 5th Floor

    Science: Monte Carlo strategies for first-principles simulations of elemental systems

    Abstract: We discuss the application of atomistic Monte Carlo simulation based on electronic structure calculations to elemental systems such as metals and alloys. As in prior work in this area [1,2], an approximate "pre-sampling" potential is used to generate large moves with a high probability of acceptance. Even with such a scheme, however, such simulations are extremely expensive and may benefit from algorithmic developments that improve acceptance rates and/or enable additional parallelization.

    Here we consider various such developments. The first of these is a three-level hybrid algorithm in which two pre-sampling potentials are used. The lowest level is an empirical potential, and the "middle" level uses a low-quality density functional theory. The efficiency of the multistage algorithm is analyzed and an example application is given.

    Two other schemes for reducing overall run-time are also considered. In the first, the Multiple-try Monte Carlo algorithm [4], a series of moves are attempted in parallel, with the choice of the next state in the chain made by using all the information gathered. This is found to be a poor choice for simulations of this type. In the second scheme, "tree sampling," multiple trial moves are made in parallel such that if the first is rejected, the second is ready and can be considered immediately. Performance of this scheme is shown to be quite effective under certain reasonable run parameters.
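    The "tree sampling" idea above can be sketched as speculative trial generation: several moves from the current state are proposed and evaluated up front (in the real application these would be expensive electronic-structure energies computed in parallel), so a rejection does not stall the chain. The function names, toy energy, and parameters here are illustrative, not the authors' code.

    ```python
    # Hedged sketch of speculative ("tree sampling") Metropolis trials: evaluate
    # several candidate moves from the current state eagerly; if the first is
    # rejected the next is already available. The quadratic energy is a toy
    # stand-in for a first-principles calculation.

    import math
    import random

    def toy_energy(x):
        return x * x  # placeholder for an expensive energy evaluation

    def metropolis_accept(e_old, e_new, beta, rng):
        """Standard Metropolis criterion at inverse temperature beta."""
        return e_new <= e_old or rng.random() < math.exp(-beta * (e_new - e_old))

    def tree_sampling_step(x, beta, rng, depth=4, step=0.5):
        # Speculatively generate and evaluate `depth` trials from state x.
        # Because a rejection leaves the state at x, all trials are valid
        # candidates, and their energies can be computed in parallel.
        trials = [x + rng.uniform(-step, step) for _ in range(depth)]
        energies = [toy_energy(t) for t in trials]
        e_old = toy_energy(x)
        for t, e in zip(trials, energies):
            if metropolis_accept(e_old, e, beta, rng):
                return t  # first accepted trial; remaining speculation discarded
        return x  # all speculative trials rejected; state unchanged

    rng = random.Random(42)
    x = 2.0
    for _ in range(200):
        x = tree_sampling_step(x, beta=2.0, rng=rng)
    print("final position:", round(x, 3))
    ```

    The trade-off noted in the abstract shows up here: speculation past the first accepted trial is wasted work, so the useful depth depends on the acceptance rate.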

    [1] S. Wang et al., Comp. Mater. Sci. 29 (2004) 145-151.

    [2] M. J. McGrath et al., Comp. Phys. Comm. 169 (2005) 289-294.

    [3] L. D. Gelb and T. Carnahan, Chem. Phys. Lett. 417 (2006) 283-287.

    [4] J. S. Liu, Monte Carlo Strategies in Scientific Computing (2001), Springer, New York.



    Type Science Track
    Session Titles Statistical Methods/Weather


Science: Computational challenges
    Wednesday July 18, 2012 1:45pm - 2:15pm @ Toledo 5th Floor

    Science: Computational challenges in nanoparticle partition function calculation

    Abstract: Bottom-up building block assembly is a useful technique for determining thermodynamically stable configurations of certain physical particles. This paper provides a description of the computational bottlenecks encountered when generating large configurations of particles. We identify two components, cluster pairing and shape matching, that dominate the run time. We present scaling data for a simple example particle and discuss opportunities for enhancing implementations of bottom-up building block assembly for studying larger or more complex systems.



    Type Science Track
    Session Titles Statistical Methods/Weather


Science: Ensemble modeling
    Wednesday July 18, 2012 2:15pm - 2:45pm @ Toledo 5th Floor

    Science: Ensemble modeling of storm interaction with XSEDE

    Abstract: We applied TG/XSEDE HPCs to an ensemble modeling study of how thunderstorm severity depends on the proximity of nearby storms. The Weather Research and Forecasting model was used to investigate 52 idealized thunderstorm scenarios, changing the position of a nearby convective cell while another was developing. We found a large impact from having any other storm cell nearby, as well as very high sensitivity to where that cell was placed. This represents a significant change from forecast thinking that currently relies on expected storm behavior guidance for a quiescent environment. We are also studying a new, tornado-scale simulation based on the above findings.

    In carrying out this study over the last 3 years we utilized five (perhaps 6-7 by July) XSEDE HPCs, and made use of both traditional batch capabilities and grid-based computing. We found our greatest challenges were less in high-performance computing than in data and storage. Moving, storing, and analyzing multi-TB data sets proved to be challenging, and we found our data processing could significantly degrade the performance of high-performance disk systems. In addition to the details of our study, we will discuss these experiences, how we hope to make the most of XSEDE resources, and some of the near-term and long-term challenges we expect to encounter in our numerical research.



    Type Science Track
    Session Titles Statistical Methods/Weather


Science: Improving Tornado
    Wednesday July 18, 2012 2:45pm - 3:15pm @ Toledo 5th Floor

    Science: Improving Tornado Prediction Using Data Mining

    Abstract: Responsible for over four hundred deaths in the United States in 2011, tornadoes are a major source of preventable deaths and property damage. One major cause of the high number of fatalities from tornadoes each year is the high False Alarm Ratio (FAR). FAR is a forecasting metric describing the probability that no tornado actually exists, given that a tornado warning has been issued.

    While tornado forecasting has improved dramatically over the past decades, the FAR has consistently remained between seventy and eighty percent. This indicates that as much as eighty percent of all tornado warnings are false alarms. Consequently, the public has gradually become desensitized to tornado warnings and thus does not seek shelter when it is appropriate [1]. If the number of fatalities caused by tornadoes is to be reduced, the FAR must first be improved.
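    In contingency-table terms, the FAR is simply the number of unverified warnings divided by the total number of warnings issued. This small helper (illustrative, not from the paper) makes the arithmetic behind the eighty-percent figure concrete:

    ```python
    def false_alarm_ratio(hits, false_alarms):
        """FAR = FA / (hits + FA): the fraction of issued tornado warnings
        for which no tornado actually occurred."""
        return false_alarms / (hits + false_alarms)

    # Illustrative numbers only: 30 verified warnings and 120 false alarms
    # give FAR = 120 / 150 = 0.8, i.e. the ~80% figure discussed above.
    ```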

    The reasons for the high FAR are complex and manifold. Arguably the most pragmatic is simple caution. Faced with the decision of whether or not to issue a warning, a meteorologist can err in one of two ways. The first, a type I error, occurs when the meteorologist predicts a tornado but none occurs. The second, a type II error, occurs when the meteorologist does not predict a tornado but one does occur. Because the former is merely inconvenient and the latter is potentially fatal, meteorologists tend to err on the side of caution and typically issue a tornado warning for any storm that shows signs of being tornadic.

    In addition to caution, a critical cause of the high FAR is a limited understanding of tornadogenesis, the process by which tornadoes form. Fortunately, tornadoes are a relatively infrequent phenomenon; however, this scarcity, coupled with the limitations of current radar technology, has resulted in a limited amount of real-world data.

    Because of the extraordinary complexity of tornadogenesis and the limited amount of real-world data, tornadogenesis has stubbornly resisted complete understanding for centuries. However, for the first time in history, technology has matured sufficiently to begin solving this ancient problem.

    The scarcity of real-world data can be addressed by running highly sophisticated and computationally intensive mathematical models on Kraken, a high-performance computer provided by XSEDE, to numerically simulate supercells. By using simulated data in lieu of real-world data, datasets of arbitrary resolution and scale can be generated as needed [2].

    Unfortunately, while the use of a simulator mitigates the issue of data scarcity, it also creates a new one. If the simulations are of sufficiently high resolution to be useful, the size of the dataset becomes astronomical. At present our dataset consists of approximately fifty individual simulations, each approximately one terabyte in size, totaling over fifty terabytes of storage. Because the dataset is too large for any individual to analyze and understand, new techniques capable of autonomously analyzing it had to be developed.

    Specifically, Spatiotemporal Relational Probability Trees (SRPTs) were developed. SRPTs are an augmentation of a classic data mining algorithm, the Probability Tree (PT). Probability Trees were chosen for their strong predictive ability, efficiency, and ability to scale to large datasets [3]. Additionally, unlike many data mining algorithms, a Probability Tree is human-readable. Consequently, after the PT has been grown using the simulated dataset, further insights can be drawn from its structure by domain scientists.

    SRPTs differ from PTs in several ways. The most important difference is that SRPTs can create spatial, temporal, and relational distinctions within the tree, giving them the ability to reason about spatiotemporal and relational data. Because the natural world is inherently spatiotemporal and relational, and many scientific datasets share these properties, this greatly enhances the strength and applicability of SRPTs.

    However, while SRPTs are powerful predictors, they do suffer from one major weakness: overfitting. Overfitting occurs when the SRPTs cease to discover generalizable, meaningful patterns within the data, and instead begin fitting to minutia and noise within the dataset. Overfitting in SRPTs can largely be mitigated by limiting the depth to which the tree grows. However, this also limits the predictive ability of an individual SRPT. 

    To address these issues, Spatiotemporal Relational Random Forests (SRRFs) were developed. Much like a traditional random forest, an SRRF is an ensemble of SRPTs. By growing hundreds of individual SRPTs from bootstrap samples of the dataset, then combining each individual tree’s prediction in an intelligent way, SRRFs are capable of discovering far more complex patterns within the data [4]. Additionally, because each tree within the SRRF is limited in size, overfitting is far less of an issue.
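    The bagging idea behind SRRFs can be illustrated with ordinary (non-spatiotemporal) trees. The toy sketch below, with all names illustrative, uses depth-1 trees ("stumps") as a stand-in for depth-limited SRPTs: each is grown on a bootstrap resample and predictions are combined by majority vote.

    ```python
    import random
    from collections import Counter

    def fit_stump(X, y):
        """Depth-1 'tree': the single feature/threshold split that minimizes
        misclassifications, a stand-in for a depth-limited SRPT."""
        best = None
        for j in range(len(X[0])):
            for t in sorted({row[j] for row in X}):
                left = [yi for row, yi in zip(X, y) if row[j] <= t]
                right = [yi for row, yi in zip(X, y) if row[j] > t]
                if not left or not right:
                    continue  # degenerate split
                lv, lc = Counter(left).most_common(1)[0]
                rv, rc = Counter(right).most_common(1)[0]
                errors = (len(left) - lc) + (len(right) - rc)
                if best is None or errors < best[0]:
                    best = (errors, j, t, lv, rv)
        _, j, t, lv, rv = best
        return lambda row: lv if row[j] <= t else rv

    def random_forest(X, y, n_trees=25, rng=random.Random(0)):
        """Bootstrap-aggregated ensemble: each stump sees a resampled dataset;
        predictions are combined by majority vote, as SRRFs do with SRPTs."""
        trees = []
        for _ in range(n_trees):
            idx = [rng.randrange(len(X)) for _ in range(len(X))]
            trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
        return lambda row: Counter(t(row) for t in trees).most_common(1)[0][0]
    ```

    Because every tree in the ensemble is shallow, no single tree can overfit badly, while the vote across many resamples recovers predictive power.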

    By training SRRFs on the simulated tornado dataset, we hope to discover salient patterns and conditions necessary for the formation of tornadoes. By discovering and understanding these patterns, traditional tornado forecasting may be improved dramatically. Specifically, these patterns may aid in differentiating storms which will produce a tornado from those which will not. This ability would have the immediate effect of reducing the False-Alarm Ratio, and ultimately aid in restoring the public’s confidence in tornado warnings. By doing so, we may ultimately reduce the number of preventable fatalities caused by tornadoes.


    1. Rosendahl, D. H., 2008: Identifying precursors to strong low-level rotation within numerically simulated supercell thunderstorms: A data mining approach. Master's thesis, University of Oklahoma, School of Meteorology.

    2. Xue, M., K. Droegemeier, and V. Wong, 2000: The Advanced Regional Prediction System (ARPS) - A multiscale nonhydrostatic atmospheric simulation and prediction model. Part I: Model dynamics and verification. Meteorology and Atmospheric Physics, 75, 161-193.

    3. McGovern, A., N. Hiers, M. Collier, D. J. Gagne II, and R. A. Brown, 2008: Spatiotemporal relational probability trees. Proceedings of the 2008 IEEE International Conference on Data Mining, Pisa, Italy, 15-19 December 2008, 935-940.

    4. Supinie, T., A. McGovern, J. Williams, and J. Abernethy, 2009: Spatiotemporal relational random forests. Proceedings of the 2009 IEEE International Conference on Data Mining (ICDM) Workshop on Spatiotemporal Data Mining, electronically published.



    Type Science Track
    Session Titles Statistical Methods/Weather


Panel: Security for Science Gateways and Campus Bridging
    Wednesday July 18, 2012 3:45pm - 5:15pm @ Toledo 5th Floor

    Abstract: The XSEDE science gateway and campus bridging programs share a mission to expand access to cyberinfrastructure for scientific communities and campus researchers. Since the TeraGrid science gateway program began in 2003, science gateways have served researchers in a wide range of scientific disciplines, from astronomy to seismology. In its 2011 report, the NSF ACCI Task Force on Campus Bridging identified the critical need for seamless integration of cyberinfrastructure from the scientist’s desktop to the local campus, to other campuses, and to regional, national, and international cyberinfrastructure.

    To effectively expand access to cyberinfrastructure across communities and campuses, XSEDE must address security challenges in areas such as identity/access management, accounting, risk assessment, and incident response. Interoperable authentication, as provided by the InCommon federation, enables researchers to conveniently "sign on" to access cyberinfrastructure across campus and across the region/nation/world. Coordinated operational protection and response, as provided by REN-ISAC, maintains the availability and integrity of highly connected cyberinfrastructure. Serving large communities of researchers across many campuses requires security mechanisms, processes, and policies to scale to new levels. 

    This panel will discuss the security challenges introduced by science gateways and campus bridging, potential approaches for addressing these challenges (for example, leveraging InCommon and REN-ISAC), and plans for the future. Panelists will solicit requirements and recommendations from attendees as input to future work.

    Panel Moderator:
    - Jim Basney, University of Illinois

    Panel participants:
    - Randy Butler, University of Illinois
    - Dan Fraser, Argonne National Laboratory
    - Suresh Marru, Indiana University
    - Craig Stewart, Indiana University



    Type Panel Session


BOF: Cloud Computing for Science: Challenges and Opportunities
    Wednesday July 18, 2012 5:30pm - 6:30pm @ Toledo 5th Floor

    BOF: Cloud Computing for Science: Challenges and Opportunities

    Abstract: Outsourcing compute infrastructure and services has many potential benefits for scientific projects: it offers access to sophisticated resources that may be beyond the means of a single institution to acquire, allows for more flexible usage patterns, creates potential for access to economies of scale via consolidation, and eliminates the overhead of system acquisition and operation for an institution, allowing it to focus on its scientific mission. Cloud computing recently emerged as a promising paradigm for realizing such outsourcing: it offers on-demand, short-term access, which allows users to flexibly manage peaks in demand; a pay-as-you-go model, which helps save costs for bursty usage patterns (i.e., helps manage “valleys” in demand); and convenience, as users and institutions no longer have to maintain specialized IT departments. However, cloud computing also brings challenges as we seek to understand how best to leverage the paradigm.

    Many scientific communities are experimenting with this new model, using, among other platforms, FutureGrid resources as a testbed for initial exploration. The objective of this BOF is to focus discussion on experiences to date as well as define challenges and priorities in understanding how cloud computing can be best leveraged in the scientific context. We plan to discuss application patterns as well as highlight and discuss the priority of the current challenges and open issues in cloud computing for science. Specifically, we will discuss the following challenges. What types of applications are currently considered suitable for the cloud, and what are the obstacles to enlarging that set? What is the state of the art of cloud computing performance relative to scientific applications, and how is it likely to change in the future? How would programming models have to change (or what new programming models need to be developed) to support scientific applications in the clouds? Given the current cloud computing offerings, what middleware needs to be developed to enable scientific communities to leverage clouds? How does cloud computing change the potential for new attacks, and what new security tools and mechanisms will be needed to support it? How can we facilitate the transition to this new paradigm for the scientific community, and what needs to be done or established first? Depending on the profile of attendance, we expect the last question in particular to form a substantial part of the discussion.

    The BOF will be structured as follows. We will begin with a short structured talk session, led by the organizers, that will summarize and update several previous discussions on this topic, notably the MAGIC meetings in September, April, and May, as well as several parallel developments that took place in the scientific context, such as the Magellan report, the status of cloud-related experimentation on the FutureGrid project, and application activity. The second session of the BOF will be devoted to the discussion, elaboration, and prioritization of the challenges listed above. Finally, we will address the prioritization and shape of concrete transition measures. The time allocated to the last two issues will depend on the structure of the attendance: if we can get feedback from XSEDE users, we will emphasize the transition measures; if we attract CS practitioners, we will focus on technical challenges.


    Kate Keahey is a Scientist in the Distributed Systems Lab at...

    Type BOF



Software: What Is Campus Bridging
    Thursday July 19, 2012 8:45am - 9:15am @ Toledo 5th Floor

    Software: What Is Campus Bridging and What is XSEDE Doing About It?

    Abstract: The term “campus bridging” was first used in the creation of an NSF Advisory Committee for Cyberinfrastructure task force. That task force eventually arrived at the following description of campus bridging: 

    “Campus bridging is the seamlessly integrated use of cyberinfrastructure operated by a scientist or engineer with other cyberinfrastructure on the scientist’s campus, at other campuses, and at the regional, national, and international levels as if they were proximate to the scientist, and when working within the context of a Virtual Organization (VO) make the ‘virtual’ aspect of the organization irrelevant (or helpful) to the work of the VO.” 

    That definition and the task force report detail many things that could conceivably be done under the rubric of campus bridging. 

    But unlike other topics such as software or data, there is little ability to point to something and say, “Aha, there is a campus bridge.” Campus bridging is more a viewpoint and a set of usability, software, and information concerns that should inform everything done within XSEDE and the more general NSF strategy Cyberinfrastructure for 21st Century Innovation. 

    In this paper we outline several specific use cases of campus bridging technologies that have been identified as priorities for XSEDE in the next four years, ranging from documentation to software used entirely outside of XSEDE to software that helps bridge from individual researcher to campus to XSEDE cyberinfrastructure. 




    Type Software and Software Environments Track



Software: The Anatomy
    Thursday July 19, 2012 9:45am - 10:15am @ Toledo 5th Floor

    Software: The Anatomy of Successful ECSS Projects: Lessons of Supporting High-Throughput High-Performance Ensembles on XSEDE

    Abstract: The Extended Collaborative Support Service (ECSS) of XSEDE is a program to provide support for advanced user requirements that cannot and should not be supported via a regular ticketing system. Recently, two ECSS projects have been awarded by XSEDE management to support the high-throughput of high-performance (HTHP) molecular dynamics (MD) simulations; both of these ECSS projects use a SAGA-based Pilot-Jobs approach as the technology required to support the HTHP scenarios. Representative of the underlying ECSS philosophy, these projects were envisioned as three-way collaborations between the application stake-holders, advanced/research software development team, and the resource providers. In this paper, we describe the aims and objectives of these ECSS projects, how the deliverables have been met, and some preliminary results obtained. We also describe how SAGA has been deployed on XSEDE in a Community Software Area as a necessary precursor for these projects.


    Type Software and Software Environments Track

