Toward a General Purpose Partitioned Global Virtual Address Specification for Heterogeneous Exascale Architectures

John Leidel

Given the recent advent of highly heterogeneous computing platforms, the HPC user community is now forced to cope with a further bifurcated address space and programming environment. The age of distributed memory cluster computing architectures forced programmers to explicitly manage data locality at a massive scale. Few platform vendors offered truly shared memory platforms at the scale required by applications of sufficient size and complexity. This migration from implicit shared memory to explicit distributed memory yielded programming models such as PVM and MPI.

With the advent of multicore processors, we have yet again begun a transition back to shared memory environments. Multicore and multi-socket/multicore systems have provided the ability to perform intra-node shared memory operations with occasional explicit data movement between similar computing units. This, coupled with research into partitioned global address space languages and memory models such as UPC, Chapel and OpenSHMEM, has yielded an overall increase in programmer productivity. However, the newfound focus on truly heterogeneous computing architectures has slowed this move back to shared memory programming.

Heterogeneous computing platforms have, yet again, forced the applications programmer to manage memory at a micro scale. Explicit data movement between a suitable host and accelerators/coprocessors has unfortunately become the modus operandi of those crafting code for heterogeneous platforms. The issue is further compounded in systems whose computational resources are heterogeneous in multiple dimensions. In this case, the user is forced to orchestrate explicit data transfers using heterogeneous methods.

This work shall explore recent research and development efforts associated with creating a global virtual address specification that encapsulates both local virtual address capabilities and partitioned, global communication and locality data. This virtual address specification shall provide a mechanism to significantly simplify heterogeneous and multi-node runtime models by encapsulating the locality data necessary to access individual byte-level payloads with simple address manipulation, thereby giving system software architects and application designers the ability to express and utilize global memory locality, regardless of the scale and heterogeneity of the target platform, with very low instruction overhead.
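The low-overhead address manipulation described above can be illustrated with a short sketch. The field widths and layout below are purely illustrative assumptions, not the specification this work proposes; the point is that locality can be recovered from a global address with a few shifts and masks.

```python
# Hypothetical sketch: packing locality metadata into a 64-bit global
# virtual address. Field widths are illustrative assumptions only.
NODE_BITS, DEVICE_BITS, OFFSET_BITS = 16, 8, 40

def encode(node, device, offset):
    """Pack a (node, device, byte-offset) triple into one 64-bit address."""
    assert node < (1 << NODE_BITS)
    assert device < (1 << DEVICE_BITS)
    assert offset < (1 << OFFSET_BITS)
    return (node << (DEVICE_BITS + OFFSET_BITS)) | (device << OFFSET_BITS) | offset

def decode(addr):
    """Recover the locality fields with shifts and masks -- a handful of instructions."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    device = (addr >> OFFSET_BITS) & ((1 << DEVICE_BITS) - 1)
    node = addr >> (DEVICE_BITS + OFFSET_BITS)
    return node, device, offset
```

A runtime holding such an address can decide locally, without a lookup table, whether a payload is on the current node or device.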

This work shall present initial architectural data and design results associated with the aforementioned research. This shall include the initial virtual address specification, memory model and the effect on library-driven PGAS programming models such as UPC, Chapel and SHMEM. Furthermore, this work shall call for future research in a wider, community-driven specification such that cross-vendor platforms can be constructed providing support for the community address specification.


Programming infrastructure for heterogeneous computing based on OpenCL and its applications

Anton Efremov and Pavel Bogdanov

Download a video of the talk

Heterogeneous HPC systems based on highly parallel computing accelerators are becoming more and more popular nowadays. The graphics processing unit (GPU) is currently the most popular such accelerator. However, the novel Intel Many Integrated Core (Intel MIC) architecture looks very promising. The first supercomputer based on Intel accelerators, with around 20 PFlop/s performance, is announced to be set up in 2012.

There are two major programming models for hybrid machines: CUDA and OpenCL. The first is a proprietary product of NVidia; the second is an open standard developed by the Khronos group. The OpenCL model operates on an abstract representation of a computing device that allows programming for CPU architectures as well as for GPUs and some other architectures. OpenCL is supported by most of the main hardware developers, including Intel, AMD, NVidia and IBM.
Our main goal is to develop a specific system-level service, a programming infrastructure for heterogeneous computing, which on the one hand would have a simple and clear programming model and, on the other, would effectively use all devices present in the system. It could then be used to write numerical libraries with standard interfaces which would provide a significant acceleration of applications on heterogeneous systems with a minor (ideally zero) redesign effort for these applications.

The first version of such an infrastructure has been implemented. It is based on the following common logical model: generally, an arbitrary program is represented as a dependency graph which is then parsed and executed by several schedulers. This version was tested on several tasks, described below.
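The logical model above can be sketched as follows. The names and structure are illustrative only; a real scheduler would dispatch ready tasks to OpenCL devices rather than run them inline.

```python
# Illustrative sketch of a dependency-graph executor: a task runs once all
# of its prerequisites have completed.
from collections import deque

def run_graph(tasks, deps):
    """tasks: {name: callable}; deps: {name: set of prerequisite names}."""
    indegree = {t: len(deps.get(t, set())) for t in tasks}
    dependents = {t: [] for t in tasks}
    for t, prereqs in deps.items():
        for p in prereqs:
            dependents[p].append(t)
    ready = deque(t for t, d in indegree.items() if d == 0)
    order = []
    while ready:
        t = ready.popleft()
        tasks[t]()                  # a real scheduler would offload this to a device
        order.append(t)
        for s in dependents[t]:     # release tasks whose dependencies are now met
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    return order
```

With several such schedulers pulling from the ready queue, independent tasks can execute concurrently on different devices.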

All tests were launched on a machine with the following configuration: 2x AMD Opteron 6176 SE, 8x AMD Radeon 7970, 64 GB RAM, running Ubuntu 12.04. The total performance is 7.5 TFlop/s in double precision.
The first case tested the approach on operations used intensively in HPL (the High Performance Linpack benchmark used for the TOP500 ranking): DGEMM and DTRSM from BLAS. They were optimized within the OpenCL standard for arbitrary matrix sizes and numbers of streaming accelerators. On 8 GPUs, over 60% efficiency was achieved on large matrices.
Another operation that has been implemented and optimized with OpenCL is the sparse matrix-vector product (SpMV), which is the basic operation for the sparse iterative solvers widely used in CFD applications. Large sparse linear systems with unstructured matrices are very common, for instance, in finite-volume algorithms on unstructured meshes. This SpMV implementation uses the Sliced ELLPACK matrix storage format, which was developed for highly parallel architectures and makes it possible to achieve more than 50% of the theoretical peak performance estimate for the SpMV operation on a streaming accelerator.
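As a rough illustration of the storage format, the following sketch groups rows into slices and pads each slice only to its own longest row, which is the key saving of Sliced ELLPACK over plain ELLPACK. Serial Python stands in for the data-parallel GPU kernel; names are assumptions.

```python
# Illustrative Sliced ELLPACK layout and SpMV kernel (serial stand-in).
def to_sliced_ell(rows, slice_size):
    """rows: list of sparse rows, each a list of (col, val). Returns padded slices."""
    slices = []
    for s in range(0, len(rows), slice_size):
        chunk = rows[s:s + slice_size]
        width = max(len(r) for r in chunk)  # pad only to this slice's longest row
        cols = [[r[j][0] if j < len(r) else 0 for j in range(width)] for r in chunk]
        vals = [[r[j][1] if j < len(r) else 0.0 for j in range(width)] for r in chunk]
        slices.append((cols, vals))
    return slices

def spmv(slices, x):
    """y = A*x over the sliced layout; on a GPU, one thread would own each row."""
    y = []
    for cols, vals in slices:
        for i in range(len(cols)):
            y.append(sum(v * x[c] for c, v in zip(cols[i], vals[i])))
    return y
```

Padding entries carry a zero value, so they contribute nothing to the dot product while keeping memory accesses regular.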

The third case is a simplified but representative example that reproduces a typical computational load. The dimensionless Euler equations are taken as the mathematical model, and a supersonic flow around a sphere (M=2.75) as the test problem. Discretization in space is done with a finite-volume approach using a cell-centered explicit scheme and an unstructured tetrahedral mesh of up to 40 billion cells. Fluxes through control-volume faces are computed with the Roe scheme, and a first-order Runge-Kutta scheme is used for time integration. The algorithm is memory-bound, and the maximum theoretical acceleration on 8 GPUs compared to the CPU was achieved.
At the moment, we are in the middle of researching and implementing a much more complicated CFD model. We are also working to run the infrastructure on a distributed system.


Modelling large-scale seismic problems using high-performance computing

Nikolay I. Khokhlov, Iakovos Panourgias, Michele Weiland and Igor B. Petrov

Download a PDF of the talk slides

The numerical simulation of seismic wave propagation represents an essential part of the exploration work the oil industry has to address on a day-to-day basis. The application presented here allows the simulation of the propagation of dynamic wave disturbances in various geological media, including layered media and media with different inhomogeneities (e.g. cracks or cavities). The type of numerical problem solved in these simulations is extremely compute-intensive and mandates large-scale HPC systems. The area of the ground medium simulated in these computations is typically a seismic cube with an edge length on the order of 1 km to 10 km. At the same time, inhomogeneities can be as small as a few metres in dimension. Therefore, the computational grid spacing needs to be sufficiently small to properly capture all inhomogeneities. Obtaining sufficient accuracy and representing a large number of irregularities requires the use of very large computational grids; real-world computations use grids with several billion nodes.

This paper describes SEISMO, a new numerical seismology code that models the propagation of seismic waves in ground media with inhomogeneities on 2D and 3D structured and unstructured grids, and outlines its performance on a range of current HPC platforms. The mathematical model used in SEISMO is based on the theory of linear
elasticity. The finite-difference and finite-volume methods are then used to solve the resulting hyperbolic partial differential equations (PDEs). SEISMO implements a Total Variation Diminishing (TVD) discretisation scheme.

The code is parallelised using MPI, with low-level optimisations of the algorithm on SIMD CPUs using SSE and AVX instructions; it currently runs at 70% parallel efficiency when scaled up to 16k cores. The aim is to improve on this by exploring the benefits of using hybrid and accelerator programming approaches. The core algorithm of the application was ported to GPU accelerators using NVidia CUDA, OpenACC and HMPP compiler directives. The best
performance was achieved with the CUDA implementation, which yields a 50x speed-up over the serial CPU version. Multi-GPU, hybrid CPU and Maxeler FPGA system implementations are currently under development. The motivation driving the work is the need to evaluate and quantify the impact of various programming models on the parallel performance of this particular implementation, and on the productivity of exploring new algorithms. Directives-based accelerator programming enables rapid prototyping of new functionality, an invaluable tool when designing an algorithm that can exploit the power of GPUs. However, unsurprisingly, the CUDA implementation delivers the best performance once implementation choices have been made. Decisions regarding the further development of the application for future generations of massively parallel HPC systems are informed by the experience gathered in this exercise. At this stage, GPU acceleration looks to be a promising avenue for allowing SEISMO to scale up and address problems of massive scale.

The work described in this paper was undertaken as a collaboration between the Moscow Institute of Physics and Technology (MIPT) and EPCC, the supercomputing centre at the University of Edinburgh, as part of the APOS project, which is funded by the European Commission under the FP7 programme and by the Ministry of Education and Science of the Russian Federation.


The EPCC OpenACC Benchmark Suite

Nick Johnson, Adrian Jackson

OpenACC is a relatively new standard, first released in November 2011, which brings directives-based programming to accelerators such as NVidia or AMD GPUs, Intel MICs, or other devices which can accelerate computational kernels. The use of accelerators as part of a heterogeneous multi-core system is one likely path towards exascale technologies. Programming models for such systems are highly likely to be directives-based, as programmers will expect system tools and compilers to make sensible choices about when and where to offload certain computational kernels based on knowledge of the system.

In this work, we present a benchmark suite for OpenACC to compare systems and compilers. This suite provides three levels of benchmark ranging from low level data movement such as time taken to copy data to and from a device, through basic BLAS-type mathematical kernels such as DGEMM, ATAX and 2DCONV, to demonstration applications such as SEISMO, a 2D seismological simulation code. Each benchmark included is representative of operations commonly found in scientific computing.

To demonstrate the operation of the benchmark suite, we provide results obtained from the most recently released compilers from industry (CAPS, PGI, Cray) and academia (ULL). Our results show how a programmer can make an informed choice about which compiler would give the best performance on their system. Alternatively, a programmer could use a single compiler but run the benchmarks on various systems to get an idea of which system will best suit their application. This latter point is especially important where a choice of hardware is available: it may not be obvious which device is the best choice for a particular application without testing a range of devices.

We also envisage usage by compiler writers who can use our benchmarks to provide an independent reference source for the code they produce. Given the relative youth of the standard and implementations of that standard, it is important for compiler writers to be able to benchmark their compilers using externally provided codes.


Paving the path to exascale computing with CRESTA development environment

Jing Gong, Alistair Hart, David Henty, Stefano Markidis, Michael Schliephake, Paul Fischer and Katherine Heisey.

Download a PDF of the talk slides

The development and implementation of efficient computer codes for exascale supercomputers will require the combined advancement of all development environment components: compilers, automatic tuning frameworks, run-time systems, debuggers, and performance monitoring and analysis tools. The exascale era poses unprecedented challenges. Because accelerators are more and more common among the fastest supercomputers and will play a role in exascale computing, compilers will need to support hybrid computer architectures and generate efficient code that hides the complexity of programming accelerators. Hand optimization of code will be very difficult on exascale machines and will be increasingly assisted by automatic tuners. Application tuning will focus more on the parallel aspects of the computation because of the large amount of available parallelism. The application workload will be distributed over millions of processes, and implementing ad-hoc strategies directly in the application will probably be unfeasible, whereas an adaptive run-time system can provide automatic load balancing. Debuggers and performance monitoring tools will have to deal with millions of processes and with huge amounts of data from application and hardware counters, yet they will still be required to minimize overhead and retain scalability. In this talk, we present how the development environment of the CRESTA exascale EC project meets all these challenges by advancing the state of the art in the field.

An investigation of compiler support for hybrid GPU programming, the design concepts, and the main characteristics of the alpha prototype implementation of the CRESTA development environment components for exascale computing are presented. A performance study of OpenACC compiler directives has been carried out, showing very promising results and indicating OpenACC as a viable approach for programming hybrid exascale supercomputers. A new Domain-Specific Language (DSL) has been defined for the expression of parallel auto-tuning at very large scale. The focus is on the extension of the auto-tuning approach into the parallel domain, to enable tuning of the communication-related aspects of applications. A new adaptive run-time system has been designed to schedule processes depending on resource availability, on the workload, and on run-time analysis of the application's performance. The Allinea DDT debugger and the Dresden University of Technology MUST MPI correctness checker are being extended to provide a unified interface, to improve scalability, and to include new disruptive technology based on statistical analysis of the run-time behavior of the application for anomaly detection. The new exascale prototypes of the Dresden University of Technology Vampir, VampirTrace and Score-P performance monitoring and analysis tools have been released. The new features include the possibility of applying filtering techniques before loading performance data, to drastically reduce memory needs during performance analysis. The initial evaluation study of the development environment targets the CRESTA project applications, to determine how the development environment could be coupled into a production suite for exascale computing.


Scaling Soft Matter Physics to Thousands of GPUs in Parallel

Alan Gray, Alistair Hart, and Kevin Stratford

Download a PDF of the talk slides


PCJ - a Partitioned Global Address Space Approach for Parallel Computations in Java

Marek Nowicki and Piotr Bala

Download a PDF of the talk slides

In this paper we present PCJ - a library for parallel computations in Java.
The PCJ is motivated by the partitioned global address space (PGAS) approach
represented by Co-Array Fortran, Unified Parallel C or Titanium (a scientific computing dialect of Java).

In the design of PCJ we stress compliance with Java standards. PCJ takes the form of a Java library which can be used without any modification of the language; the programmer does not have to use extensions or libraries which are not part of the standard Java distribution. It suits the needs of Java programmers much better than other existing solutions. PCJ offers methods for partitioning work, synchronizing tasks, and getting and putting values by means of asynchronous one-sided communication. The library also provides methods for broadcasting, creating groups of tasks, and monitoring changes to variables.

In the PCJ library each task runs its own calculations and has its own local memory. By default, variables are stored and accessed locally. Some variables can be shared between tasks, so in PCJ they are called shared variables. One task is intended to be the Manager, which starts calculations on the other tasks. It takes responsibility for assigning unique identifiers to tasks, creating and assigning tasks to groups, and synchronizing tasks within a group.

All shareable variables are stored in a special Storage class. Each task has one and only one Storage instance.
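As a toy, single-process analogue of this model (not the actual PCJ API, which is a Java library with asynchronous network communication), each task below owns a Storage of shared variables, and one-sided put/get operations address another task's storage by task identifier.

```python
# Toy analogue of the PCJ storage model: one Storage per task, with
# one-sided put/get between task storages. Illustrative names only.
class Storage:
    def __init__(self):
        self.vars = {}

class Task:
    registry = {}                       # stands in for the runtime's task table

    def __init__(self, task_id):
        self.id = task_id
        self.storage = Storage()        # exactly one Storage per task
        Task.registry[task_id] = self

    def put(self, target_id, name, value):
        """One-sided put: write into another task's storage without its cooperation."""
        Task.registry[target_id].storage.vars[name] = value

    def get(self, target_id, name):
        """One-sided get: read from another task's storage."""
        return Task.registry[target_id].storage.vars[name]
```

In real PCJ the target task may live in a different JVM on a different node; the programming model, however, looks the same.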

The PCJ library makes it possible to run tasks within a single instance of the Java Virtual Machine as well as across different physical nodes running different JVMs. The communication details between tasks are totally hidden and, from the user's point of view, do not depend on the computer architecture. With PCJ, the user can focus on the implementation of the algorithm rather than on thread or network programming. The PCJ library was created to help develop parallel applications which require significant amounts of memory, bandwidth or processing power.

The PCJ library offers a new approach for the development of parallel, distributed applications in the Java language. It exploits the newest features of Java and can therefore be a good base for new parallel applications. In contrast to other available solutions, it does not require any additional libraries, applications or a modified version of the JVM.

The results show good performance and scalability (up to several hundred cores and beyond) when compared to native MPI implementations. It is noteworthy that the PCJ library shows great promise for scientific applications.



Developing the multi-level parallelisms for Fluidity-ICOM -- Paving the way to exascale for the next generation geophysical fluid modelling technology

Xiaohu Guo, Gerard Gorman, Michael Lange, Lawrence Mitchell and Michele Weiland

The major challenges caused by the increasing scale and complexity of the
current petascale and the future exascale systems are cross-cutting concerns of
the whole software ecosystem. The trend for compute nodes is towards greater
numbers of lower power cores, with a decreasing memory to core ratio. This is
imposing a strong evolutionary pressure on numerical algorithms and software to
efficiently utilise the available memory and network bandwidth.

Unstructured finite element codes have been effectively parallelised using
domain decomposition methods, implemented using libraries such as the Message
Passing Interface (MPI) for a long time. However, there are many algorithmic and
implementation optimisation opportunities when threading is used for intra-node
parallelisation for the latest multi-core/many-core platforms. The benefits include
reduced memory requirements, cache sharing, reduced number of partitions and
less MPI communication. While OpenMP is promoted as being easy to use and
allows incremental parallelisation of codes, naive implementations frequently
yield poor performance. In practice, as with MPI, the same care and attention
should be exercised over algorithm and hardware details when programming with OpenMP.

In this paper, we report progress in implementing a hybrid OpenMP-MPI version of the
unstructured finite element application Fluidity. In the matrix
assembly kernels, the OpenMP parallel algorithm uses graph colouring to identify
independent sets of elements that can be assembled simultaneously with no race
conditions. The sparse linear systems defined by the various equations are solved using threaded PETSc and HYPRE, the latter utilised as a threaded preconditioner through the PETSc interface. Since unstructured finite element codes are well known to be memory-bound, particular attention is paid to ccNUMA architectures, where data locality is particularly important to achieve good intra-node scaling characteristics. We also demonstrate that utilising non-blocking algorithms and libraries is critical for the mixed-mode application to achieve better parallel performance than the pure MPI version.
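The colouring step can be sketched as follows: elements that share a mesh node conflict when accumulating into the global matrix, so a greedy colouring of the element conflict graph yields independent sets that can be assembled in parallel. This is an illustrative sketch, not the Fluidity implementation.

```python
# Greedy graph colouring: elements of the same colour share no mesh nodes,
# so they can be assembled simultaneously without race conditions.
def greedy_colouring(adjacency):
    """adjacency: {element: set of conflicting elements}. Returns {element: colour}."""
    colour = {}
    for e in sorted(adjacency):
        used = {colour[n] for n in adjacency[e] if n in colour}
        c = 0
        while c in used:          # smallest colour unused by any neighbour
            c += 1
        colour[e] = c
    return colour
```

Assembly then loops over colours: within one colour, an `omp parallel for` over elements needs no atomics or critical sections.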



A hybrid approach for extreme scalability when solving linear systems

Nick Brown, Mark Bull and Iain Bethune

Download a PDF of the talk slides

Iterative methods for solving large sparse systems of linear equations are widely used in many HPC applications. Extreme scaling of these methods can be difficult, however, since global synchronisation to form dot products is typically required at every iteration.

To try to overcome this limitation we propose a hybrid approach, where the solution space is partitioned into blocks. The solving of these blocks occurs at two levels: interblock communication is performed synchronously and intrablock communication asynchronously. Following this approach it is possible to completely separate the concerns involved at the block and intra-block levels, allowing one to choose a highly optimised (parallel) internal block solver and an asynchronous method operating at the global level. Using this approach one can achieve extreme scalability: when the limits of conventional solvers start to be reached, the system can be partitioned into one or more blocks, each operating with the same conventional solver but at a higher level employing asynchronous block Jacobi or some other multisplitting technique.
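A minimal serial sketch of the two-level structure is given below, with the inner block solve reduced to a Jacobi sweep for brevity; a real deployment would use an optimised parallel solver per block, and the interblock sweeps could run asynchronously.

```python
# Illustrative two-level block Jacobi: blocks mark which unknowns an inner
# solver would own; here the inner solve is itself a single Jacobi sweep.
def block_jacobi(A, b, blocks, sweeps=100):
    """A: dense row-major list of lists; blocks: list of lists of row indices."""
    n = len(b)
    x = [0.0] * n
    for _ in range(sweeps):
        x_old = x[:]                      # interblock data exchanged once per sweep
        for idx in blocks:                # each block could run on its own solver
            for i in idx:
                s = sum(A[i][j] * x_old[j] for j in range(n) if j != i)
                x[i] = (b[i] - s) / A[i][i]
    return x
```

For strictly diagonally dominant systems the sweep converges; the point of the hybrid scheme is that relaxing the synchronisation of `x_old` across blocks does not destroy this convergence.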

Our block framework has been built to use PETSc, a popular scientific suite for solving sparse linear systems, as the synchronous interblock solver, and we demonstrate results on up to 32768 cores of a Cray XE6 system.


GPU implementation of some many-body potentials for molecular dynamics simulations

Andrey Knizhnik, Alexander Minkin and Boris Potapkin

Download a PDF of the talk slides

Molecular dynamics (MD) is a computationally intensive problem, but it is extremely amenable to parallel architectures of different types. The GPU is one of the architectures suitable for atomistic simulations based on MD. Using a GPU is a way to improve the average performance of algorithms: the part of the code that requires the most computational power is offloaded to the GPU or accelerator.

Mechanical and transport properties of nanostructures are computed from the results of MD simulations. For an efficient MD code, we need to adapt all data structures and algorithms to the GPU. To accelerate the neighbor list computation we use a GPU implementation of the radix sort algorithm. All integrators are also adapted to the GPU.

In general, a dedicated technology is needed for GPGPU computation on SIMD-like hardware. The most popular of these is CUDA, but it works only with NVidia GPUs. We use OpenCL, a free technology offering heterogeneous computing across various types of parallel hardware. It is largely similar to CUDA but supports CPUs, GPUs and accelerators from different vendors. By means of OpenCL, largely the same code can be used for parallel computing on different hardware, which reduces programming effort.

The central task of molecular dynamics simulation is force calculation. It requires a great deal of computation, especially for many-body potentials, and we use a SIMD-like approach for acceleration. Many-body potentials are used for modeling carbon and metallic nanostructures; some features of the parallel algorithms are discussed. We use the family of Tersoff potentials for carbon nanostructures and the embedded-atom potential for metallic nanostructures. A straightforward approach to implementing parallel algorithms for many-body potentials is based on critical sections, but it is not always efficient, and we consider different realizations for different types of potentials. In general, critical sections and atomic operations reduce parallel efficiency. That is true for the embedded-atom algorithm, but not for Tersoff: the parallel algorithm for the Tersoff bond-order potential becomes faster with atomics, as it uses fewer operations and registers.
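The accumulation issue can be illustrated with a toy pairwise force loop: a scatter formulation writes to two atoms per pair (requiring atomics or critical sections in parallel), while a gather formulation lets each atom sum only its own forces race-free at the cost of visiting each pair twice. The force function below is a hypothetical antisymmetric stand-in, not a Tersoff or embedded-atom evaluation.

```python
# Scatter vs. gather force accumulation (serial stand-ins for GPU kernels).
def forces_scatter(n, pairs, f):
    """Each pair visited once; the two writes per pair race in parallel."""
    F = [0.0] * n
    for i, j in pairs:
        fij = f(i, j)
        F[i] += fij      # would need an atomic add on a GPU
        F[j] -= fij      # Newton's third law: equal and opposite
    return F

def forces_gather(n, neighbours, f):
    """One thread per atom; each thread writes only F[i], so no races,
    but every pair's force is computed twice."""
    F = [0.0] * n
    for i in range(n):
        F[i] = sum(f(i, j) for j in neighbours[i])
    return F
```

Which variant wins depends on the potential: as the abstract notes, the cheaper register footprint of the atomic (scatter) version pays off for Tersoff but not for the embedded-atom method.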

We estimate the efficiency of the different algorithm types for some many-body potentials. The performance of different GPUs is compared to that of their host processor counterparts.


Domain Decomposition Methods for Exascale Computing

Frederic Magoules

Existing numerical algorithms face their limits when running on hundreds of thousands of cores. For instance, parallel iterative methods meet serious scalability limitations due to the synchronization procedure occurring between the processors at the end of each iteration. The traditional scheme for parallel iterative algorithms is synchronous iterations, in which a new iteration is only started when all the data from the previous one has been received. Initially this scheme was used with synchronous communication, meaning that data can only be exchanged once the source and destination processes have finished their computation. A well-known improvement, which allows some of the communication time to be overlapped with computation, is to use asynchronous communication: the only difference is that data can be received by a process while it is still computing the previous iteration. All these synchronous iterations use the same mathematical model and have the same convergence as their sequential counterparts. They have been widely studied and are often simply called parallel iterative algorithms, synchronous being omitted. On Exascale systems, the synchronization between the processors at the end of each iteration represents a real bottleneck to scalability. Another kind of iterative algorithm, called chaotic iterations, can help solve these scalability problems.

In this talk, chaotic iterations are considered and extended to optimized domain decomposition methods. Domain decomposition methods are well suited to parallel computation: the division of a problem into smaller subproblems, through artificial subdivisions of the domain, is a means of introducing parallelism. Domain decomposition strategies include, in one way or another, the following ingredients: (i) a decomposer to split a mesh into subdomains; (ii) local solvers to find solutions for the subdomains under specific boundary conditions on the interface; (iii) interface conditions enforcing compatibility and equilibrium between overlapping or non-overlapping subdomains; (iv) a solution strategy for the interface problem. The differences between the methods lie in how those ingredients are actually put to work and how they are combined to form an efficient solution strategy for the problem at hand.

This talk shows how domain decomposition methods have evolved efficiently over the years by using specially designed boundary conditions on the interface between the subdomains, leading to optimized domain decomposition methods. In order to use such methods on massively parallel computers, the iterative scheme must be modified; chaotic iterations are proposed here for the solution strategy of the interface problem, which introduces some convergence difficulties. After the presentation of the method, numerical experiments are performed on large-scale engineering problems to illustrate the robustness and efficiency of the proposed method.
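A toy illustration of the chaotic scheme (not the optimized domain decomposition method itself): components are relaxed in an arbitrary order, each using whatever values are currently available, with no synchronization point marking the end of an iteration. For a contracting fixed-point map, such as relaxation of a strictly diagonally dominant system, convergence survives the reordering.

```python
# Chaotic (asynchronous) relaxation sketch: no global iteration barrier.
import random

def chaotic_solve(A, b, updates=600, seed=1):
    """Relax random components of Ax=b, always using the freshest values."""
    rng = random.Random(seed)
    n = len(b)
    x = [0.0] * n
    for _ in range(updates):
        i = rng.randrange(n)                              # any component, any time
        s = sum(A[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (b[i] - s) / A[i][i]                       # one-sided, unsynchronized update
    return x
```

In a distributed setting each processor would play the role of the random index draw, updating its own unknowns with whatever neighbour data has most recently arrived.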



Investigation into the optimal 3D FFT strategy for Quantum Mechanical Codes

Toni Collis and Adrian Jackson

We will present an analysis of 3D FFT methods, scaling from matrices of 128³ to 1024³. We will show how performance is limited by memory, so that larger systems can only be run on multiple nodes, necessitating the use of the network interconnect; we will also discuss the implications of using small FFTs limited to single nodes, and whether it is profitable to perform the same calculation multiple times to avoid network communication. This is of particular importance to GPU implementations of software packages, in order to reduce the requirement of moving data on and off GPU accelerators.

We will discuss our investigation into the optimal construction of 3D FFTs, using either a 3D FFT library function call or multiple 1D FFTs. We will also consider the equivalent GPU implementation using cuFFT library calls. Critically, the performance of 3D FFTs relies heavily on the all-to-all communication performance of a particular architecture.
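The composition of a 3D transform from 1D transforms can be sketched as below, with a naive O(n²) DFT standing in for a library FFT call. Each axis-by-axis pass is where a distributed implementation would incur an all-to-all transpose, which is why the interconnect dominates.

```python
# 3D DFT built from 1D transforms applied along each axis in turn.
import cmath

def dft1(v):
    """Naive 1D DFT; a library 1D FFT call would replace this."""
    n = len(v)
    return [sum(v[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def dft3(cube):
    """In-place 3D DFT of an n*n*n nested list: transform z, then y, then x."""
    n = len(cube)
    for i in range(n):
        for j in range(n):
            col = dft1([cube[i][j][k] for k in range(n)])
            for k in range(n):
                cube[i][j][k] = col[k]
    for i in range(n):
        for k in range(n):
            col = dft1([cube[i][j][k] for j in range(n)])
            for j in range(n):
                cube[i][j][k] = col[j]
    for j in range(n):
        for k in range(n):
            col = dft1([cube[i][j][k] for i in range(n)])
            for i in range(n):
                cube[i][j][k] = col[i]
    return cube
```

The separability of the transform is what makes both strategies (one 3D library call versus three sets of 1D calls) mathematically equivalent; the choice is purely one of data layout and communication cost.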

Finally, we will discuss a case study of FFT performance in CASTEP, both in its current form and using data replication to reduce the communication of FFT routines across nodes. Performance can be boosted by distributing small FFT calculations over processes (16 cores on HECToR), then nodes (two processes, with a total of 32 cores), and finally blades (4 nodes with 128 cores). We discuss the benefits for various simulations in CASTEP of using fewer processes and minimising the use of HECToR's interconnect.

The work is still ongoing and we would be happy to update the abstract to reflect our results closer to the conference date.


Level of Maturity Assessment for Exascale Software Components

Lee Margetts, Bernd Mohr, Andrew Jones and Francois Bodin

The EU FP7 funded European Exascale Software Initiative was set up to build a European vision and roadmap to address the challenges posed in using the next generation of massively parallel systems. These systems will comprise millions of heterogeneous cores, providing multi-Petaflop performance in the next few years and Exaflop performance in 2020. One of the recommendations made in the first stage of the project (EESI-1) was the establishment of a European Exascale Software Centre to coordinate the research, development, testing and validation of HPC Exascale software ecosystem components. In the second stage of the project (EESI-2) a work package was defined to investigate the feasibility of doing this and to prepare for the operation of such a centre. This paper will report on the first task of the work package: the development of a methodology for estimating the level of maturity of Exascale software components. The methodology proposes which tests should be performed, by whom and how. It also defines how the outcomes of the tests relate to a given level of software maturity.


DEUS Full Observable Universe Simulation: the numerical challenge

Jean-Michel Alimi, Vincent Bouillot, Yann Rasera, Vincent Reverdy, Pier-Stéfano Corasaniti, Irène Balmes, Stéphane Requena, Xavier Delaruelle and Jean-Noël Richet.

We have performed the first-ever numerical N-body simulations of the full observable universe with Dark Energy (DEUS "Dark Energy Universe Simulation" FUR "Full Universe Run"). Three different Dark Energy models have been simulated. Each evolved 550 billion particles on an Adaptive Mesh Refinement grid with more than two and a half trillion computing points, along the entire evolutionary history of the universe and across six orders of magnitude in length scale, from the size of the Milky Way to that of the whole observable Universe. To date, these are the largest and most advanced complete cosmological simulations ever run. They provide unique information on the formation and evolution of the largest structures in the universe, and exceptional support for future observational programs dedicated to mapping the distribution of matter and galaxies in the universe. The simulations ran on 4752 (of 5040) thin nodes of Bull's supercomputer CURIE, using more than 300 TB of memory, for 10 million hours of computing time each. About 150 PBytes of data were generated throughout the runs. Using an advanced and innovative reduction workflow, the amount of useful stored data was reduced to 1.5 PBytes.


Disruptive Technologies

Iain Duff

Download a PDF of the talk slides

The European Exascale Software Initiative (EESI-2) started in September 2012. It is
an EU supported project following on from the successful EESI project that terminated
in autumn 2010. The main aim is to advise the European Union on the steps necessary to prepare Europe for exploiting the Exascale computers that are anticipated around
the end of this decade.

One of the buzzwords of EESI-2 is disruptive technologies. We will discuss what is meant by this term and give examples, particularly in the domain of numerical algorithms.


Benchmark and Profiling of NEURON in Strong Scaling Configuration

Fabien Delalondre, Michael Lee Hines, Michael Knobloch, Judit Gimenez, Jesus Labarta, Damian Alvarez Mallon and Felix Schürmann

The goal of the Blue Brain Project (BBP) is to develop a unifying model to support reverse engineering the mammalian brain. Development of the unifying model is through the implementation of software and workflows capable of building, simulating, analyzing and visualizing the brain.
In the past few years, part of the BBP development effort has focused on increasing the weak-scaling capability of the simulation software stack [1, 2] in order to simulate large portions of the brain. However, multiple applications require fast result delivery and thus strong scaling. To support this requirement, a method that distributes parts of a neuron over multiple processes and solves the resulting distributed algebraic system has been implemented [3]. The latter proved to support strong scaling when distributing large neurons, represented by hundreds of compartments, over up to 8 processes. As new applications such as physical robotic system coupling or brain plasticity require real-time or faster-than-real-time result delivery, a careful analysis of the present implementation becomes necessary to determine whether it can scale over a larger number of processes.

In an attempt to answer this question, an analysis of NEURON's on-core, intra-node and inter-node performance has been carried out using the Scalasca [4] and Extrae [5] tools as part of the DEEP project [6]. It led to the following conclusions:
1) The application is memory bound and memory latency driven.
2) The application does not yet take full advantage of the large registers supported by the latest supercomputing architectures.
3) The resolution of a small tridiagonal system assembled on a per-neuron basis does not scale when distributed over multiple processes. This loss of performance is due to the MPI implementation of the Schur Complement method [3], which serializes data transfer.
4) While the communication pattern supporting the resolution of the algebraic system is structured, the one used to synchronize neurons by spike exchange is unstructured.
5) The size of the packets being exchanged is small (less than 512 bytes) for the structured part of the communication and extremely small (less than 13 bytes) for the unstructured part.
Taking those conclusions into account, the following development roadmap has been defined to optimize:
1) On core performance: Increasing code vectorization and cache usage.
2) Intra-node performance: Implementing the structured part of the communication using MPI persistent communication functionality and investigating the use of a hybrid MPI/OmpSs [7] model to reduce the load imbalance of the Schur Complement method.
3) Inter-node performance: Investigating the use of predictor/corrector methods to better structure the communication pattern.
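Conclusion 3 concerns the per-neuron tridiagonal solve. A minimal serial sketch of such a solve (the classic Thomas algorithm; an illustration only, not NEURON's actual implementation) shows why distributing its rows serializes communication: both sweeps carry a dependency from one row to the next, so each rank must wait for its neighbour's partial result.

```python
def thomas_solve(sub, diag, sup, rhs):
    """Thomas algorithm for a tridiagonal system (sub-, main and
    super-diagonals; sub[0] and sup[-1] are unused).  Both sweeps
    below depend on the previous row's result, which is the source
    of the serialization when rows are split across MPI ranks."""
    n = len(diag)
    sup_p = [0.0] * n   # modified super-diagonal
    rhs_p = [0.0] * n   # modified right-hand side
    sup_p[0] = sup[0] / diag[0]
    rhs_p[0] = rhs[0] / diag[0]
    # Forward elimination: row i needs the result of row i-1.
    for i in range(1, n):
        m = diag[i] - sub[i] * sup_p[i - 1]
        sup_p[i] = sup[i] / m
        rhs_p[i] = (rhs[i] - sub[i] * rhs_p[i - 1]) / m
    # Backward substitution: row i needs the result of row i+1.
    x = [0.0] * n
    x[-1] = rhs_p[-1]
    for i in range(n - 2, -1, -1):
        x[i] = rhs_p[i] - sup_p[i] * x[i + 1]
    return x
```

The Schur Complement approach [3] works around exactly this dependency by eliminating interior compartments independently and coupling only the interface unknowns.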

[1] N.T. Carnevale, M.L. Hines, The NEURON Book, Cambridge, UK: Cambridge University Press, 2006
[2] J.G. King, M.L. Hines, S. Hill, P.H. Goodman, H. Markram, F. Schürmann, A Component-Based Extension Framework for Large-Scale Parallel Simulations in NEURON, Front. Neuroinformatics 2009; 3:10.
[3] M.L. Hines, H. Markram, F. Schurmann, Fully Implicit Parallel Simulation of Single Neurons, Journal of Computational Neuroscience, 25:439-448, 2008
[4] Scalasca website, http://www.scalasca.org
[5] Extrae website, http://www.bsc.es/computer-sciences/extrae
[6] DEEP Project website, http://www.deep-project.eu/deep-project/EN/Home/home_node.html
[7] OmpSs website, http://pm.bsc.es/ompss


A Pre-Processing Interface for Steering Exascale Simulations by Intermediate Result Analysis through In-Situ Post-Processing

Gregor Matura, Achim Basermann, Fang Chen, Markus Flatken, Andreas Gerndt, James Hetherington, Timm Krueger and Rupert Nash

Download a PDF of the talk slides

Today's simulations are typically not a single application but an entire tool chain: a tool for initial data creation, one for partitioning this data, one for the actual solve, one for subsequent analysis, and one for visualisation. Each of these tools is separate, and so is the data flow. This approach becomes infeasible in an exascale environment: the penalty for every data movement is extreme, and a load balance optimised for the solver most likely does not hold for the complete run time. Rather than tweaking each part, our solution is more disruptive: merge the tools and thereby improve overall simulation performance. In this paper, we provide a first step. We combine pre-processing, the simulation core and post-processing. We outline our idea of a general pre-processing interface steering the overall simulation and demonstrate its applicability with the sparse-geometry lattice-Boltzmann code HemeLB, intended for hemodynamic simulations.

The tasks of the interface start with partitioning and distributing simulation data. The main aspect is to balance the load and communication costs of the entire simulation using the costs of each simulation part. Measuring these costs for each simulation cycle enables further performance improvement: data can be redistributed between cycles in order to achieve a better load balance according to these costs. Additionally, our interface offers the possibility of other easy-to-integrate extensions covering, e.g., automated mesh refinement or fault-tolerance awareness. Finally, we investigate the applicability of our interface within HemeLB. We exploit the integrated partitioning methods and especially consider the latest HemeLB-specific in-situ result analysis and visualisation methods.
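Cost-driven redistribution between cycles can be illustrated with a simple greedy longest-processing-time heuristic (a hypothetical sketch with invented names, not the interface's actual algorithm): given measured per-partition costs from the last cycle, the most expensive work items are handed, one by one, to the currently least-loaded process.

```python
import heapq

def rebalance(costs, nprocs):
    """Greedy LPT heuristic: assign work items, in decreasing order of
    measured cost, to the process with the smallest current load.
    Returns one list of item indices per process."""
    heap = [(0.0, p) for p in range(nprocs)]  # (current load, process)
    heapq.heapify(heap)
    assignment = [[] for _ in range(nprocs)]
    for item, cost in sorted(enumerate(costs), key=lambda t: -t[1]):
        load, p = heapq.heappop(heap)          # least-loaded process
        assignment[p].append(item)
        heapq.heappush(heap, (load + cost, p))
    return assignment
```

A production interface would of course also weigh the cost of moving the data itself against the projected gain, which this sketch ignores.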


A PGAS implementation of the ECMWF Integrated Forecasting System (IFS)

George Mozdzynski, Mats Hamrud and Nils Wedi

ECMWF is a partner in the Collaborative Research into Exascale Systemware, Tools and Applications (CRESTA) project, funded by a recent EU call for projects in Exascale computing, software and simulation (ICT-2011.9.13). The significance of the research carried out within the CRESTA project is that it will demonstrate techniques required to scale the current generation of Petascale capable simulation codes towards the performance levels required for running on future Exascale systems. Today the European Centre for Medium-Range Weather Forecasts (ECMWF) runs a 16 km global T1279 operational weather forecast model using 1,536 cores of an IBM Power7. Following the historical evolution in resolution upgrades, ECMWF could expect to be running a 2.5 km global forecast model by 2030 on an Exascale system that should be available and hopefully affordable by then. To achieve this would require IFS to run efficiently on about 1000 times the number of cores it uses today. This is a significant challenge, one that we are addressing within the CRESTA project.

After implementing an initial set of improvements, ECMWF has now demonstrated IFS running a 10 km global model efficiently on over 50,000 cores of HECToR, a Cray XE6 at EPCC, Edinburgh. Of course, getting to over a million cores remains a formidable challenge, and many scalability improvements have yet to be implemented. Within CRESTA, ECMWF is exploring the use of Fortran 2008 coarrays; in particular, it is possibly the first time that coarrays have been used in a world-leading production application within the context of OpenMP parallel regions. The purpose of these optimizations is primarily to allow the overlap of computation and communication and, further, in the semi-Lagrangian advection scheme, to reduce the volume of data communicated. The importance of this research is such that if these developments are successful then the IFS model may continue to use the spectral method to 2030 and beyond on an Exascale-sized system. The current status of the coarray scalability developments to IFS will be presented in this talk, including an outline of planned developments.


Attacking Visual and System Scalability in Performance Profilers

Mark O'Connor and Jonathan Byrd

The bandwidth of the human eye, estimated at 10 Mbit/s, has already been far surpassed by the rate at which performance data can be produced from even a moderately-sized HPC run, let alone the parallel torrent from a petascale or exascale job. We now face two problems: the technical problem of collecting, processing and storing vast quantities of performance data without meaningfully impacting the performance of the application, and the design problem of displaying a vanishingly small subset of this data to a human in a meaningful, comprehensible way. Allinea has spent the past year attacking both of these problems with a novel approach to HPC performance profiling that focuses on throwing away redundant data and producing statistical summaries of relevant information instead of collecting and processing as much as possible.

The technical data-collection problem will be addressed with a review of the strengths and weaknesses of current state-of-the-art performance profilers and an examination of the benefits of working with two distinct kinds of performance profile: traditional deep-mining approaches and the optimal-yield approach employed in Allinea MAP. Real-world examples at a variety of scales will be used to highlight the strengths and weaknesses of both approaches and to explore the fertile ground for research, interoperability and collaboration between commercial and open-source tools in mapping out the performance landscape of any given program in detail.

We will also present a range of novel performance data visualizations for
displaying and interacting with a rich array of performance metrics
including techniques such as time-based statistical summary and parallel
heat maps, and discuss how the choice of visualization may greatly affect
the type and quantity of data necessary for collection. As an example of
this we will look at a method for low-overhead statistical estimation of
CPU floating-point performance and memory bandwidth contention. Of
particular interest is the means of interaction between different metrics -
both source-based and time-based - and strategies for relating CPU-related
metrics to individual lines of source code.

Finally, these techniques will be demonstrated at petascale with Allinea
MAP and the efficacy of profiling at this scale discussed with particular
reference to wall-clock overhead, measurement skew and profiling jitter. We
will conclude by outlining the challenges involved in extending this
approach to deliver exascale-class performance profiling.


Weather prediction and climate modeling at Exascale: Introducing the Gung Ho project

C. M. Maynard, M. J. Glover, N. Wood, R. Ford, S. Pickles, D. Ham, E. Mueller and G. Riley

Download a PDF of the talk slides

The Met Office's numerical weather prediction and climate model code, the Unified Model (UM), is almost 25 years old. It is a unified model in the sense that the same model is used to predict both short-term weather and longer-term climate change. More accurate and timely predictions of extreme weather events and a better understanding of the consequences of climate change require bigger computers. Some of the underlying assumptions in the UM will inhibit scaling to the number of processor cores likely to be deployed at Exascale. In particular, the longitude-latitude discretisation presently used in the UM would result in a 12 m grid resolution at the poles for a 10 km resolution at mid-latitudes. The Gung-Ho project is a research collaboration between the Met Office, NERC-funded researchers and STFC Daresbury to design a new dynamical core that will exhibit very large data parallelism. The accompanying software design and development project to implement this new core and enable it to be coupled to physics parameterisations, ocean models, etc., is presented, and some of the computational science issues are discussed.


From Particle to Continuum Physics - Performance Analysis Tools in the Exascale Age

Bernd Mohr and Felix Wolf

On the path to exascale, performance-tool builders will have to deal with million-fold concurrency (~100M-1B), challenging the scalability of performance data collection, analysis, and visualization alike. The situation is aggravated by the fact that limited memory bandwidth will make it harder to extract performance data from the system. Asynchronous programming models and spontaneous system reconfiguration in response to component failures or power management, on the other hand, introduce the problem of limited reproducibility. Finally, performance tools are still limited in their ability to identify the root cause of a performance problem, often making it hard to translate their findings into effective optimization strategies. In this talk, we will discuss our ideas on how Scalasca, a scalable performance tool designed with high-level feedback on the nature of performance bottlenecks in mind, will address these challenges in the future. Key elements include the in-situ processing of performance data, the notion of resilient performance metrics, and lightweight predictive capabilities.


Hybrid Strategies for the NEMO Ocean Model on Multi-core Processors

Andrew Porter, Stephen Pickles, Mike Ashworth, Giovanni Aloisio, Italo Epicoco and Silvia Movavero

Download a PDF of the talk slides

We present the results of an investigation into strategies for the introduction of OpenMP to the NEMO oceanographic code. At present NEMO is parallelized using only the Message Passing Interface (MPI) with a two-dimensional domain decomposition in latitude/longitude. This version performs well on core counts of up to a few thousand.

However, the number of cores on CPUs, and consequently on supercomputer nodes, continues to increase. This trend is exaggerated by the introduction of many-core accelerators such as GPUs alongside traditional CPUs. Therefore, hybrid or mixed-mode parallelization is an attractive option for improving application scaling on large core counts. In this approach, perhaps just one or two MPI processes are placed on each compute node, resulting in a significant reduction in the quantity of off-node message exchanges. The MPI processes are then themselves parallelized using OpenMP to make effective use of the multiple cores on the node. As of version 4.0, the OpenMP standard will provide support for many-core accelerators, and this hybrid approach should therefore be applicable to the increasing number of machines with GPU or Intel Phi co-processors.

We have implemented OpenMP parallelism using both loop-level and tiling/coarse-grained approaches and report on their effectiveness. Of these, the loop-level approach is the simplest, with each doubly- or triply-nested loop individually parallelised using the OMP DO...OMP END DO directives, all enclosed within a single PARALLEL region. An extension to this approach is to flatten each loop nest into a single, large loop with the previously rank-three arrays converted to rank-one vectors. This exposes finer-grained parallelism, possibly at the expense of the SIMD auto-vectorisation performed on the loop by the compiler. In the tiling approach, the sub-domain assigned to each MPI process is further sub-divided into overlapping 'tiles', and it is the loop over tiles that is parallelised over the OpenMP threads. The dimensions and shape of the tiles are key factors in better exploiting the processor cache hierarchy and hence strongly influence the overall performance.
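The tiling decomposition can be sketched as follows (an illustrative Python sketch with hypothetical names, not NEMO's Fortran implementation): each MPI sub-domain is cut into tiles carrying a halo of overlap, and it is the loop over the resulting tile list that would be shared out across OpenMP threads.

```python
def make_tiles(nx, ny, tile_nx, tile_ny, halo=1):
    """Cut an nx x ny MPI sub-domain into overlapping tiles.

    Each tile is (i_lo, i_hi, j_lo, j_hi), padded by `halo` points so a
    stencil can be applied to the tile interior independently; iterating
    over the returned list is the parallel (thread-level) loop."""
    tiles = []
    for j0 in range(0, ny, tile_ny):
        for i0 in range(0, nx, tile_nx):
            tiles.append((max(i0 - halo, 0),
                          min(i0 + tile_nx + halo, nx),
                          max(j0 - halo, 0),
                          min(j0 + tile_ny + halo, ny)))
    return tiles
```

The tile dimensions (`tile_nx`, `tile_ny`) are the tuning knobs referred to above: small tiles fit cache but pay more halo overhead, large tiles the reverse.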

We have also examined the effect of array-index ordering. In NEMO, the 3D arrays have the level/depth index outermost. The outer loop for the vast majority of loop nests is therefore over this index. It is this loop that is parallelized in the loop-level approach to using OpenMP. We investigate the implications of this choice by applying the various OpenMP approaches to a version of NEMO that has been adapted so as to have the level index innermost. The loop over this index is then normally the best candidate for auto-vectorisation by the compiler.
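The consequence of the index reordering can be made concrete by writing out the flat memory offset of each layout (an illustrative sketch with hypothetical function names; it models stride conventions only and is not NEMO code): with the level index k innermost, consecutive k values sit at unit stride, which is precisely what the compiler needs to auto-vectorise the level loop.

```python
def offset_level_outer(k, j, i, nk, nj, ni):
    # NEMO default layout: level index k outermost, i fastest-varying.
    # Stepping k jumps a whole nj*ni plane in memory.
    return (k * nj + j) * ni + i

def offset_level_inner(k, j, i, nk, nj, ni):
    # Adapted layout: level index k innermost, hence fastest-varying.
    # Stepping k moves to the adjacent memory location (unit stride).
    return (j * ni + i) * nk + k
```

For a typical NEMO configuration (say 31 levels on a 100 x 100 horizontal tile) the k stride drops from 10,000 elements to 1, turning the level loop into the natural candidate for SIMD auto-vectorisation.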

The proposed approaches have been applied in two different forms of the tracer advection kernel (MUSCL and TVD) from NEMO and evaluated on an IBM Power6 cluster at CMCC, Italy, an IBM iDataPlex with Intel Westmere CPUs at CINECA, Italy, a dual-socket Intel Sandy Bridge system at STFC Daresbury, UK and on a Cray XE6 (HECToR), UK.


Remote Hybrid Rendering of Exascale Data for Immersive Virtual Environments

Martin Aumüller

Download a PDF of the talk slides

We aim to bring data from exascale simulations to immersive virtual environments connected to the exascale system via the Internet. As the simulated data is too large for rendering on a single computer and for transferring from compute to display system, this data has to be rendered remotely on or close to the compute resource using scalable methods. In order to avoid motion sickness and to provide a high feeling of presence, head-tracked virtual environments such as CAVEs impose tight bounds on input-to-display latency.

To improve on what is possible with naive remote rendering, we separate mostly static context information, such as bounding geometry, from the varying simulation results. We render the fixed part locally and composite remotely and locally rendered images, taking into account depth information, similar to sort-last parallel rendering. We can reduce the latency even further without imposing additional communication overhead because 2.5D data is available for the remote images. This allows us to reproject the remote images according to updated view points before overlaying context information rendered locally for this view point.
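The depth-aware compositing step can be sketched per pixel (a minimal sketch, assuming one remote and one local layer and flat per-pixel lists): whichever fragment lies closer to the viewer wins, exactly as in sort-last compositing.

```python
def depth_composite(remote_rgb, remote_depth, local_rgb, local_depth):
    """Per-pixel depth test over two image layers: keep the fragment
    with the smaller depth value, i.e. the one closer to the viewer."""
    return [rc if rd <= ld else lc
            for rc, rd, lc, ld in zip(remote_rgb, remote_depth,
                                      local_rgb, local_depth)]
```

The 2.5D encoding mentioned above ships exactly the per-pixel depths this test needs, which is what makes client-side reprojection and compositing possible without extra round trips.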

Because of its extensibility, its widespread use and the possibility of backwards compatibility of our system with existing clients, we chose to base our protocol on the Remote Frame Buffer (RFB) protocol as used in all Virtual Network Computing (VNC) based systems. This protocol is used for the communication between the head node of a visualisation cluster or the exascale system and the head node of a display system. We define extensions to the RFB protocol for handling multi-touch and 6DOF input data, frame synchronisation across several display surfaces, encodings for 2.5D and opacity images, as well as an image back-channel.
Based on OpenCOVER, the virtual reality renderer of the distributed visualisation system COVISE, we built a prototype of the system for evaluating and demonstrating the feasibility of our approach. We will describe our experience of using a head-tracked 3D high-resolution display for the visualisation of simulated flow and will present first results.


Space Weather Applications for Exascale Computing

Giovanni Lapenta, Stefano Markidis, Alec Johnson, Alex Vapirev, Jan Deca, Maria Elena Innocenti, Jorge Amaya and Slavik Olshevskyi

Space weather refers to the dynamically changing space environment dominated by solar activity and the response of interplanetary space and the planetary magnetospheres. First-principles modelling of space requires meeting two fundamental challenges: multiple scales and multiple physics. Advanced mathematical and numerical methods need to be used, implicit in time and adaptive in space, linking fluid with kinetic methods. We present three arguments.
First, we demonstrate that the required accuracy needed to model the Earth space environment (magnetosphere) requires exascale computing and that exascale computing is not only necessary but also sufficient.
Second, we adapt our application to new hybrid architectures using GPUs or MICs alongside CPUs and demonstrate that the particle-in-cell first-principles approach to space plasma modelling can be modified effectively in a co-design spirit: the algorithm can be made to conform to hybrid architectures, where vectorization as well as parallelization is key to performance.
Third, we present new developments in the coupling of kinetic and fluid approaches on different meshes in an interlocked simulation approach.

Nu-FuSE Exascale Simulations and the XMP programming language

Taisuke Boku

Download a PDF of the talk slides

As a project in the G8 Research Councils Initiative on Exascale Simulations, we have been carrying out an international collaboration named NuFuSE (Nuclear Fusion Simulation toward Exascale) over the past two and a half years. In Japan, several large-scale nuclear fusion simulation codes have been developed, together with several international collaborations to port and develop the codes on multiple platforms around the world.

In this talk, I will focus on the speed-up and scale-up of our largest NuFuSE code in Japan: GT5D, a five-dimensional plasma fluid simulation code developed at JAERI, which runs on the K Computer at AISC-Riken and on several large-scale clusters, including the HA-PACS GPU cluster at the University of Tsukuba. I will also report the status of other code developments.

Another effort in this project is to develop a programming framework and environment to describe post-petascale simulation codes in a high-level and effective way. We are working on the XMP (XcalableMP) PGAS language and its special implementation for large-scale GPU clusters, named XMP-dev. I will also introduce this project and the current status of porting our codes to this language.

Strategies for I/O-Bound Applications at Scale

Manish Modani and Andrew Porter

In the environmental sciences, the useful product of running a simulation code is usually one or more files containing four-dimensional 'history data': time-series of various three-dimensional fields on the model domain (e.g. temperature, pressure, etc.). Writing this data to disk is increasingly seen as the major bottleneck in running these types of simulation, and the situation is getting worse: as machine core counts increase, scientists take the opportunity to run higher-resolution models, and thus each 'frame' of a time-series consists of more data.
In this work we investigate the I/O performance of a typical I/O-bound code, the Weather Research and Forecast Model (WRF).

As machine manufacturers continue the push towards exascale, minimising power consumption is a major requirement. We are therefore seeing a proliferation of relatively low-power processing cores, often sharing connections to high-performance parallel file systems. IBM's BlueGene/Q is typical of this approach, with a few hundred cores sharing a separate I/O node; actual numbers depend upon machine configuration. It is also the case that the number of processing cores per node is increasing faster than the quantity of memory on a compute node.

WRF has a well-defined API for its I/O layer, which has enabled several different implementations. In the default implementation, all data is gathered to the master process before being written via the standard, serial NetCDF library. Although simple, this approach is limited by the amount of memory available to the master process and makes poor use of any parallel file system. In contrast, in the layer based on pNetCDF (parallel NetCDF), every process writes its own part of the data. In addition to these layers, WRF has the ability to reserve a set of processes for performing parallel I/O. Once data is offloaded to these processes, the computation is able to proceed without blocking on the I/O operations themselves. All of these layers generate NetCDF files, a popular format within the scientific community. However, WRF is also capable of outputting data in the alternative GRIB2 format, which results in considerably smaller data files. We investigate the performance of all of these options on the BlueGene/Q architecture, and consider the implications as I/O-bound applications continue to be pushed towards the exascale.
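The reserved-I/O-process layer can be caricatured with a producer/consumer sketch (a hypothetical illustration, not WRF code: the in-process queue stands in for the MPI transfer to the reserved processes, and the list append for the slow NetCDF write). Once a frame is handed off, the compute loop continues immediately instead of blocking on the write.

```python
import queue
import threading

def io_server(inbox, sink):
    # Dedicated writer: drain frames until a None sentinel arrives.
    while True:
        frame = inbox.get()
        if frame is None:
            break
        sink.append(frame)  # stands in for the (slow) NetCDF write

inbox = queue.Queue()
written = []
server = threading.Thread(target=io_server, args=(inbox, written))
server.start()

for step in range(3):
    frame = {"step": step, "temperature": [288.0 + step]}
    inbox.put(frame)  # hand off; compute proceeds without blocking
    # ... the next timestep's computation would run here ...

inbox.put(None)       # sentinel: shut the writer down
server.join()
```

The benefit is exactly the overlap sketched here: output latency is hidden behind computation, at the cost of dedicating resources (here a thread, in WRF whole processes) to I/O.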


Programming model and application porting to the Dynamical Exascale Entry Platform (DEEP)

Damian Alvarez Mallon, Norbert Eicker, Estela Suarez and Wolfgang Guerich

Download a PDF of the talk slides

On the way towards Exascale, supercomputing hardware is struggling to keep delivering increasing performance. To provide it without an unsustainable increase in power consumption, HPC systems have to rely on extreme parallelism. However, leveraging massive amounts of parallelism is a complex task that cannot be easily accomplished with the programming models existing today. Furthermore, in the application landscape few algorithms expose enough parallelism, and rewriting thousands or millions of lines of existing code is not a trivial task.

The DEEP project proposes a new architectural concept to enable the use of massively parallel hardware for a wider range of scientific applications. In DEEP, applications are distributed according to their scalability characteristics. Code parts implementing algorithms that are not highly scalable will run on a Cluster, whereas code parts that are ready to extract performance from massively parallel hardware will run on a so-called Booster. In the Booster, nodes containing coprocessors acting as stand-alone processors (Intel® Xeon Phi™ without host CPUs) are connected in a 3D torus using an additional low-latency network called EXTOLL.

DEEP has a hybrid programming environment. It allows the use of traditional paradigms such as MPI and OpenMP. However, to expose as much parallelism as possible, DEEP also relies on OmpSs, a novel programming model that dynamically detects interdependencies between user-annotated tasks. In DEEP, OmpSs has two functionalities: it schedules tasks in processes running inside the Booster Nodes (to leverage the use of the massive number of cores and hardware-supported threads present there) and it will allow offloading tasks to the Booster. Such tasks can be individually offloaded by every process, as well as be collectively offloaded, to allow communication within the Booster independently from the Cluster.

Developing and porting applications in DEEP implies some challenges, such as the tuning and refactoring of code to take advantage of the wide vector units and high core count of the Booster. For the applications in the project, the initial decision on which parts of the codes will run on the Cluster and which on the Booster has already been taken. Walkers in Monte Carlo simulations for high-temperature superconductivity, particles in plasma physics, signal reconstruction and partial imaging in seismic exploration, and physicochemical interactions in the atmosphere are examples of code with heavy computing-power requirements that will run on the Booster. Non-scalable atmospheric models, image stacking, electromagnetic field simulation and I/O management are examples of code that will run on the Cluster. Another aspect of the DEEP project is that it enables cooperative research in application development. As an example of such cooperation, brain researchers involved in DEEP foresee future implementations of their code, based on multiscale and multimodel brain simulation, taking full advantage of DEEP's architecture.

The DEEP project makes an effort to enlarge the portfolio of applications able to exploit the benefits of Exascale computing hardware. To achieve this goal, a restructuring of the system architecture, the programming models, and the applications themselves is proposed and investigated.


Making the case for reforming the I/O software stack of extreme-scale systems

Florin Isaila, Javier García, Jesús Carretero, Rob Ross and Dries Kimpe

Download a PDF of the talk slides

The ever-increasing data needs of scientific and engineering applications require novel approaches to manage and explore huge amounts of information in order to advance scientific discovery [1]. To achieve this goal, one of the main priorities of the international scientific community is addressing the challenges of performing scientific computing on Exascale machines by 2020. Future Exascale platforms will likely be characterized by a three to four orders of magnitude increase in concurrency, a substantially larger storage capacity, and a deepening of the storage hierarchy [2]. In the face of these foreseeable evolutions, researchers agree that the current development model of the storage I/O stack for High End Computing (HEC) platforms will not scale to the new levels of concurrency, storage hierarchy, and capacity [3]. In this talk we identify some of the problems that need to be addressed in order to overcome these challenges:
1. As the depth of the storage hierarchy increases one of the biggest concerns is the programmability and performance of the I/O software stack. I/O system optimizations are often applied independently at each system layer. However, this approach can cause mismatches between the requirements of different layers (for instance in terms of load, locality, consistency). Due to this uncoordinated development, understanding the interplay between optimizations at different layers has become difficult even for parallel I/O researchers.
2. The storage I/O path through the I/O software stack lacks mechanisms for adapting to unexpected events such as sudden data volume variations and failures. However, managing data volume variability and failures requires cross-layer adaptive control mechanisms, which are unfortunately not available in the state-of-the-art I/O software stack for HEC platforms.
3. As the system scale increases, exposing and exploring data locality will become a key factor for improving both application performance and energy efficiency. The expected deepening of the storage hierarchy brings novel opportunities to exploit locality at different layers of the system architecture (e.g.: compute node, I/O node, storage nodes). However, existing standard storage I/O interfaces such as POSIX and MPI-IO lack the capability of exposing and exploiting data locality.

In order to address these challenges in a timely manner, we advocate radically new approaches to reforming the storage I/O stack, including cross-layer optimizations, novel interfaces for exploiting and exploring locality, run-times for adapting to unexpected events, and in-system storage for absorbing bursts of storage I/O demand. We believe that I/O researchers should focus on improving the data path from applications to storage through data models that will smooth the evolution from the currently still-popular POSIX towards novel approaches that better map complex data to the storage system.


[1] Tony Hey, Stewart Tansley, and Kristin Tolle (Eds.).The Fourth Paradigm: Data-Intensive Scientific Discovery. Redmond, Washington: Microsoft Research (2009).
[2] "The International Exascale Software Roadmap," Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications.
[3] M. Bancroft et al. HEC FSIO 2011 Workshop Document. http://institute.lanl.gov/hec-fsio/docs/.


Paving the path to exascale computing with CRESTA development environment

Stefano Markidis, Michael Schliephake, Xavier Aguilar, David Henty, Harvey Richardson, Alistair Hart, Alan Gray, David Lecomber, Tobias Hilbrich, Jens Doleschal and Erwin Laure

Download a PDF of the talk slides

The development and implementation of efficient computer codes for exascale supercomputers will require the combined advancement of all development-environment components: compilers, automatic tuning frameworks, run-time systems, debuggers, and performance monitoring and analysis tools. The exascale era poses unprecedented challenges. Because accelerators are more and more common among the fastest supercomputers and will play a role in exascale computing, compilers will need to support hybrid computer architectures and generate efficient code that hides the complexity of programming accelerators. Hand optimization of code will be very difficult on exascale machines and will be increasingly assisted by automatic tuners. Application tuning will focus more on the parallel aspects of the computation because of the large amount of available parallelism. The application workload will be distributed over millions of processes, and implementing ad-hoc strategies directly in the application will probably be infeasible, whereas an adaptive run-time system will provide automatic load balancing. Debuggers and performance monitoring tools will have to deal with millions of processes and with huge amounts of data from application and hardware counters, yet they will still be required to minimize overhead and retain scalability. In this talk, we present how the development environment of the CRESTA exascale EC project meets all these challenges by advancing the state of the art in the field.

An investigation of compiler support for hybrid GPU programming, the design concepts, and the main characteristics of the alpha prototype implementation of the CRESTA development environment components for exascale computing are presented. A performance study of OpenACC compiler directives has been carried out, showing very promising results and indicating OpenACC as a viable approach for programming hybrid exascale supercomputers. A new Domain-Specific Language (DSL) has been defined for the expression of parallel auto-tuning at very large scale; the focus is on extending the auto-tuning approach into the parallel domain to enable tuning of communication-related aspects of applications. A new adaptive run-time system has been designed to schedule processes depending on resource availability, on the workload, and on run-time analysis of application performance. The Allinea DDT debugger and the Dresden University of Technology MUST MPI correctness checker are being extended to provide a unified interface, to improve scalability, and to include new disruptive technology based on statistical analysis of the run-time behavior of the application for anomaly detection. The new exascale prototypes of the Dresden University of Technology Vampir, VampirTrace and Score-P performance monitoring and analysis tools have been released. The new features include the possibility of applying filtering techniques before loading performance data, to drastically reduce memory needs during performance analysis. The initial evaluation study of the development environment targets the CRESTA project applications, to determine how the development environment could be coupled into a production suite for exascale computing.


Application-Centric Benchmarking and Modeling for Co-Design

Torsten Hoefler


The development of Exascale hardware architectures, compilers, and runtime systems or middleware requires a detailed understanding of the applications to be run on the system. In this work, we summarize the challenges and common pitfalls of current design strategies. We discuss how to characterize applications by their requirements in order to design future systems.


Fault Tolerance in MPI

Stephen Sachs


By raising computational performance through increased parallelism, single-core failures in modern supercomputers have become a more important and more expensive issue. This is because the overall number of components within a system rises with the core count. As the mean time to failure (MTTF) of each individual component does not grow as fast as the number of components in a supercomputer, the overall MTTF of the system shrinks. A study [1] covering 22 systems at LANL over a span of 9 years has shown that the average MTTF of some current systems is less than 8 hours.
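The arithmetic behind the shrinking MTTF can be made concrete. Under the common simplifying assumption of independent, exponentially distributed component failures, failure rates add, so the system MTTF falls inversely with the component count (the numbers below are illustrative, not taken from [1]):

```python
def system_mttf(component_mttf_hours: float, n_components: int) -> float:
    """System MTTF assuming independent components with exponentially
    distributed failures: the system fails when any component fails,
    so the component failure rates simply add up."""
    return component_mttf_hours / n_components

# A single node with a 5-year MTTF looks very reliable ...
node_mttf = 5 * 365 * 24  # hours
# ... but 10,000 such nodes fail, on average, every few hours.
cluster_mttf = system_mttf(node_mttf, 10_000)
```

With these hypothetical figures the cluster fails roughly every 4.4 hours, the same order of magnitude as the sub-8-hour averages reported in [1].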

Failure prevention is used in many parts of the machine (e.g. RAID, exception handling, adaptive routing), but a standard MPI implementation will abort the entire computation if any of its ranks encounters a failure. These failures are traditionally handled using checkpoint/restart techniques. However, as the overhead of these implementations grows with core count, their application becomes increasingly inefficient.
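A back-of-the-envelope model shows why checkpoint/restart overhead grows as the MTTF shrinks. The sketch below uses Young's classic first-order approximation for the optimal checkpoint interval; the checkpoint cost and MTTF values are hypothetical:

```python
import math

def young_interval(checkpoint_cost_s: float, mttf_s: float) -> float:
    """Young's first-order approximation of the optimal interval
    between checkpoints: sqrt(2 * checkpoint cost * MTTF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mttf_s)

def overhead_fraction(checkpoint_cost_s: float, mttf_s: float) -> float:
    """Rough fraction of wall time spent writing checkpoints at the
    optimal interval (ignoring the rework lost after each failure)."""
    return checkpoint_cost_s / young_interval(checkpoint_cost_s, mttf_s)

# As the MTTF shrinks with growing core counts, the fraction of time
# lost to a (hypothetical) 10-minute checkpoint rises sharply:
for mttf_h in (100, 10, 1):
    frac = overhead_fraction(600, mttf_h * 3600)
    print(f"MTTF {mttf_h:>3} h -> {100 * frac:.1f}% of time in checkpoints")
```

At a 100-hour MTTF the overhead is a few percent; at a 1-hour MTTF it approaches a third of the machine's time, which is the inefficiency the abstract refers to.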

Fault tolerant MPI (FT MPI) enables a code to recover from failures and continue execution even though some parts of the system have been lost indefinitely. Although not yet part of the MPI standard, there is an active working group around FT MPI. This talk will give an overview of current and past developments in FT MPI. Furthermore, simple test applications utilizing the most promising approach demonstrate the principal functionality of this implementation. Finally, an assessment of the advantages as well as the current obstacles is given.

[1] Bianca Schroeder and Garth A. Gibson. "A large scale study of failures in high-performance-computing systems." IEEE Transactions on Dependable and Secure Computing, 7(4):337-350, Oct.-Dec. 2010.


Fast and scalable Elliptic Solvers in Numerical Weather and Climate-Prediction on CPUs and GPUs

Eike Mueller and Robert Scheichl


Semi-implicit time stepping is very popular and widely used in numerical weather- and climate prediction models for various reasons, particularly since it allows for larger time steps and thus for better efficiency. However, the bottleneck in semi-implicit schemes is the need for a three dimensional elliptic solve in each time step. With increasing model resolution this elliptic PDE can only be solved on operational timescales if highly efficient algorithms are used and their scalability and performance on modern computer architectures can be guaranteed.

We have studied a typical model equation arising from semi-implicit semi-Lagrangian time stepping and carried out extensive parallel scalability tests with different Krylov-subspace and multigrid solvers. We demonstrated the scalability and robustness of all these solvers for systems with more than $10^{10}$ degrees of freedom on up to 65536 CPU cores of the HECToR supercomputer. In particular we wrote our own bespoke, matrix-free geometric multigrid solver which exploits the strong vertical anisotropy of the problem and avoids matrix- and preconditioner setup costs.
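The matrix-free idea can be illustrated on a 1D model problem: rather than assembling the (tridiagonal) discrete Laplacian, the stencil is applied on the fly, so no matrix or preconditioner setup cost is incurred. This is a minimal sketch of the technique, not the authors' multigrid solver:

```python
def apply_laplacian_1d(u, h):
    """Matrix-free application of the 1D negative Laplacian (-u'')
    with homogeneous Dirichlet boundaries: the stencil
    (-1, 2, -1) / h^2 is applied on the fly, so the tridiagonal
    matrix is never stored."""
    n = len(u)
    out = [0.0] * n
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        out[i] = (2.0 * u[i] - left - right) / (h * h)
    return out

# For u(x) = x(1-x), -u'' = 2 and the second-order stencil is exact,
# so every entry of the result should equal 2.
n, h = 9, 0.1
u = [x * (1 - x) for x in (h * (i + 1) for i in range(n))]
r = apply_laplacian_1d(u, h)
```

In a real solver the same operator application is used inside the smoother and residual computation on every multigrid level; only the stencil coefficients per level need storing.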

To investigate the performance on non-standard chip architectures, we ported the solver to a GPU using the CUDA-C programming model and showed that significant performance gains can be achieved on a single GPU by using the matrix-free approach and minimising memory access through loop fusion. We also investigated the bottlenecks encountered when extending the implementation to multiple GPUs, in particular the limited bandwidth of host-device data transfers, and their impact on the performance of the elliptic solver.


MDMP: Managed Data Message Passing

Adrian Jackson and Par Strand


MDMP is a new parallel programming approach that aims to provide users with an easy way to add parallelism to programs, to optimise the message-passing costs of traditional scientific simulation algorithms, and to enable existing MPI-based parallel programs to be optimised and extended without requiring the whole code to be re-written from scratch. MDMP utilises a directives-based approach that lets users specify what communications should take place in the code; it then implements those communications for the user in an optimal manner, using both the information provided by the user and data collected by instrumenting the code and gathering information on the data to be communicated. This work will present the basic concepts and functionality of MDMP and discuss the performance that can be achieved using our prototype implementation on some model scientific simulation applications.


Resolution Limits of Continuous Media Models and their Mathematical Formulations

Boris Chetverushkin


Massively parallel systems are especially useful for detailed simulations of physical processes with complex spatial and temporal characteristics. They relax the restrictions on the degree of spatiotemporal resolution of the physical problem. In exchange, the problems of data validation, correctness of numerical algorithms and validity of mathematical models become very topical for high-performance computing, and maintaining the stability condition is very significant when explicit schemes are used.
In this work, new mathematical formulations for a number of continuous media models are presented. The developed formulations account for physical constraints on the resolution level of the media description. Compared to classical formulations, the presented ones introduce additional terms which allow the construction of numerical schemes well suited to efficient implementation on high-performance computer systems.


Design, implementation and use of mampicl, the multi-algorithm MPI collective library

Michael Schliephake, Erwin Laure, Katherine Heisey and Paul Fischer


Benchmarking mixed-mode PETSc performance on high-performance architectures

Michael Lange, Gerard Gorman, Michele Weiland, Lawrence Mitchell, Xiaohu Guo and James Southern


The trend towards highly parallel multi-processing is ubiquitous in all modern computer architectures, ranging from handheld devices to large-scale HPC systems; yet many applications are struggling to fully utilise the multiple levels of parallelism exposed in modern high-performance platforms. In order to realise the full potential of recent hardware advances, a mixed mode between shared-memory programming techniques and inter-node message passing can be adopted, which provides high levels of parallelism with minimal overheads. For scientific applications this means that not only the simulation code itself but the whole software stack needs to evolve.
In this paper, we evaluate the mixed-mode performance of PETSc, a widely used scientific library for the scalable solution of partial differential equations. We describe the addition of OpenMP threaded functionality to the library, focussing on matrix multiplication in Compressed Sparse Row (CSR) and Block CSR format. We highlight key challenges in achieving good parallel performance of threaded applications on modern multi-core processors, before demonstrating that the overall performance of the mixed-mode implementation is superior to that of the pure-MPI version.
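The threaded CSR matrix-vector kernel discussed above can be sketched as follows. This is a pure-Python illustration of the row-partitioning pattern; the real PETSc kernels are C with OpenMP, and CPython's GIL means the sketch demonstrates the structure rather than any speedup:

```python
import threading

def csr_spmv_threaded(indptr, indices, data, x, nthreads=4):
    """y = A @ x for a matrix in Compressed Sparse Row format, with
    rows split into contiguous chunks, one chunk per thread. Because
    each thread writes a disjoint slice of y, no locking or atomics
    are needed, which is what makes row partitioning attractive."""
    n = len(indptr) - 1
    y = [0.0] * n

    def worker(lo, hi):
        for row in range(lo, hi):
            s = 0.0
            for k in range(indptr[row], indptr[row + 1]):
                s += data[k] * x[indices[k]]
            y[row] = s

    chunk = (n + nthreads - 1) // nthreads
    threads = [threading.Thread(target=worker, args=(i, min(i + chunk, n)))
               for i in range(0, n, chunk)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return y

# 3x3 example: [[2, 0, 1], [0, 3, 0], [4, 0, 5]] times [1, 1, 1]
y = csr_spmv_threaded([0, 2, 3, 5], [0, 2, 1, 0, 2],
                      [2.0, 1.0, 3.0, 4.0, 5.0], [1.0, 1.0, 1.0])
```

A NUMA-aware version would additionally pin each chunk's rows to memory local to the executing thread, one of the key challenges the paper highlights.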
Using a set of matrices extracted from Fluidity, a CFD application code which uses the library as its linear solver engine, we then benchmark the parallel performance of mixed-mode PETSc across multiple nodes on several modern HPC architectures. We evaluate the parallel scalability on uniform memory access (UMA) platforms, such as the Fujitsu FX10 and IBM BlueGene/Q, as well as non-uniform memory access (NUMA) systems including the Cray XE6. We furthermore analyse the threaded performance of the Intel Xeon Phi coprocessor using the devised benchmarks. A detailed comparison is performed which highlights the characteristics of each particular architecture, before discussing the suitability of the provided benchmark suite to draw comparisons between HPC hardware vendors.


POLARIS(MD): an innovative and scalable method for high throughput molecule targeting

Michel Masella (CEA/DSV), Othman Bouizi (Intel, Exascale Lab Paris)

Molecular dynamics is an instance of the N-body problem. With the increasing computational capabilities of supercomputers, scientists in biochemistry can study larger molecular systems. Such systems are of the size of the smallest viruses and can comprise hundreds of thousands of atoms.
The best known molecular dynamics codes are NAMD and GROMACS. These treat the atoms and their interactions explicitly, which requires communication between the processes during a parallel run. When modelling larger biological systems, these communications eventually limit scalability.
We present a different way of computing molecular systems with the innovative application POLARIS(MD), for POLARIzation and Simulations (Molecular Dynamics). The first key point of POLARIS(MD) lies in its capacity to handle large biological systems on very few CPU cores compared to its peers. This is made possible by the hierarchical modelling of the solvent surrounding the protein: the effects of the most distant solvent molecules are lumped into a pseudo-particle, making the computation faster. The second key point is the accuracy of the microscopic interaction model that POLARIS(MD) considers (this code is, to our knowledge, the only one taking microscopic polarization effects into account). These combined features make POLARIS(MD) fast and reliable.
The relatively low number of CPU cores needed for a biochemistry simulation allows us to explore the dimension of molecular complexity. The goal of the biochemists is to target the molecule that best fits their needs. Once a promising molecule is targeted comes the time to synthesize it. Reducing the time, and therefore the cost, of targeting the best candidate on the bench is key. Targeting through computation is efficient only if, for each molecule tested, the code uses a reasonable number of CPU cores and delivers the result within one day. To be efficient, POLARIS(MD) must be aware of the data organization at the level of parallelism as well as of vectorization.


Type oriented parallel programming for exascale

Nick Brown


Whilst there have been great advances in HPC hardware and software in recent years, the languages and models that we use to program these machines have remained much more static. This is not from a lack of effort, but by virtue of the fact that the foundation many programming languages are built on is not sufficient for the level of expressivity required for parallel work. The result is an all too familiar implicit trade-off between programmability and performance, made worse by the fact that, whilst many scientific users are experts within their own fields, they are not HPC experts. Moving towards exascale, where the programmer must make efficient use of a vast amount of resource, this problem will be exacerbated, and the lack of a suitable parallel programming model will limit the amount of scientific benefit that we can gain from this new generation of machines.

Type oriented programming looks to address this by encoding the complexity of a language via the type system. Most of the language functionality is contained within a loosely coupled type library that can be flexibly used to control many aspects such as parallelism. Due to the high level nature of this approach, there is much information available during compilation which can be used for optimisation and, in the absence of type information, the compiler can apply sensible default options thus supporting both the expert programmer and novice alike. This is especially applicable to the HPC domain, as it allows the programmer to get parallel codes working using safe, simple default behaviour and then further tune if required. By abstracting much of the complexity away into a well-documented type library, critical aspects which would usually be hard coded from the start, such as the method of data decomposition, can be modified easily by simply changing the type once the programmer has a feeling for how their code runs. These types, constructed by experts, can be built to address issues such as scalability and then used without the end programmer ever having to know the low level details of why their code works at exascale.
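Mesham's actual syntax is not reproduced here, but the central idea, that changing a type changes the data decomposition without touching the algorithm, can be mimicked in plain Python (all class and function names below are illustrative, not Mesham's):

```python
class BlockDist:
    """Contiguous block decomposition: rank r owns the r-th block
    of roughly n/p consecutive indices."""
    def __init__(self, n, p):
        self.n, self.p = n, p
    def owner(self, i):
        return i * self.p // self.n

class CyclicDist:
    """Round-robin decomposition: rank r owns the indices
    congruent to r modulo p."""
    def __init__(self, n, p):
        self.n, self.p = n, p
    def owner(self, i):
        return i % self.p

def local_indices(dist, rank):
    """The 'algorithm' is written once against the distribution's
    interface; swapping the distribution type swaps the
    decomposition with no other code change."""
    return [i for i in range(dist.n) if dist.owner(i) == rank]

block = local_indices(BlockDist(8, 2), 0)    # rank 0 gets the first half
cyclic = local_indices(CyclicDist(8, 2), 0)  # rank 0 gets the even indices
```

In a type-oriented language the compiler sees the distribution statically and can specialise communication and layout at compile time, which is where the optimisation opportunities mentioned above come from.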

A parallel programming language, Mesham, has been created which is centred around the novel type oriented programming paradigm. This language has been used as a vehicle to investigate the approach of developing parallel codes in this manner. A number of existing codes, including one of NASA’s parallel benchmarks and aspects of the cosmological simulation package Gadget-2, have been ported into the language, which allowed us to evaluate how these might benefit from this novel programming model.

We have demonstrated that, at no performance penalty, codes written in this type oriented manner provide improved programmability. The programmer is able to write simple, implicitly parallel HPC code at a high level and then explicitly tune it by adding additional type information if required. In some cases, due to the high level optimisation opportunities available at compile time, the Mesham code outperforms the existing C-MPI code.


Optimising Performance Through Unbalanced Decompositions

Adrian Jackson, Joachim Hein and Colin Roach


GS2 is an initial value gyrokinetic simulation code developed to study low-frequency turbulence in magnetized plasma. It is parallelised using MPI with the simulation domain decomposed across tasks. The optimal domain decomposition is non-trivial, and complicated by the different requirements of the linear and non-linear parts of the calculations. GS2 users currently choose a data layout, and are guided towards processor counts that are efficient for linear calculations. These choices can, however, lead to data decompositions that are relatively inefficient for the non-linear calculations. We have analysed the performance impact of the data decompositions on the non-linear calculation, and the communications required for those calculations. This has helped us to optimise the decomposition algorithm by using slightly imbalanced data layouts for the non-linear calculations whilst maintaining the existing decompositions for the linear calculations. With the imbalanced layouts we completely eliminate communications for parts of the non-linear simulation.
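The underlying trade-off can be shown with a toy partitioner: rounding block boundaries to multiples of the non-linear phase's natural work unit accepts a small load imbalance in exchange for eliminating the communication that straddling blocks would require. This is a sketch of the idea only, not GS2's actual layout code:

```python
def balanced_bounds(n, p):
    """Even split: block sizes differ by at most one, but block
    boundaries may cut through a unit of non-linear work."""
    return [i * n // p for i in range(p + 1)]

def aligned_bounds(n, p, unit):
    """Split whose interior boundaries are rounded to multiples of
    `unit`, so no unit of non-linear work straddles two tasks: the
    communication those straddles would require disappears, at the
    cost of a slightly uneven load (assumes unit divides n)."""
    bounds = [min(round(i * n / p / unit) * unit, n)
              for i in range(p + 1)]
    bounds[0], bounds[-1] = 0, n
    return bounds

# 96 work items, 5 tasks, non-linear units of width 8:
n, p, unit = 96, 5, 8
a = aligned_bounds(n, p, unit)
even = balanced_bounds(n, p)
imbalance = max(hi - lo for lo, hi in zip(a, a[1:]))
```

Here the largest aligned block holds 24 items against 20 for the even split, a modest imbalance traded for zero cross-task traffic within a unit.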


Reliable Scalable Symbolic Computation: The Design of SymGridPar2

Patrick Maier, Rob Stewart and Phil Trinder


Symbolic computation has underpinned key advances in Mathematics and Computer Science, for example in number theory, cryptography, and coding theory. Many symbolic problems are large, and the algorithms often exhibit a high degree of parallelism. However, parallelising symbolic computations poses challenges, as symbolic algorithms tend to employ complex data and control structures. Moreover, the parallelism is often both dynamically generated and highly irregular, e.g. the number and sizes of subtasks may vary by several orders of magnitude.
The SCIEnce project developed SymGridPar as a standard framework for executing symbolic computations on small-scale parallel architectures. SymGridPar uses dynamic load management (work stealing) for handling dynamic and irregular parallelism.
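The work-stealing scheme can be sketched with per-worker deques: each worker pops from the tail of its own queue and, when idle, steals from the head of a random victim's queue. This is a minimal illustration only (a single global lock keeps the sketch simple, whereas real schedulers use lock-free deques):

```python
import collections
import random
import threading

def run_work_stealing(tasks, nworkers=4, seed=0):
    """Minimal work-stealing scheduler: irregular task sizes are
    balanced dynamically rather than by the static initial split."""
    rng = random.Random(seed)
    deques = [collections.deque() for _ in range(nworkers)]
    for i, t in enumerate(tasks):              # static initial split ...
        deques[i % nworkers].append(t)
    lock = threading.Lock()                    # guards all deques (sketch only)
    results = []

    def worker(wid):
        while True:
            with lock:
                if deques[wid]:
                    task = deques[wid].pop()   # ... own tail first
                else:
                    victims = [d for d in deques if d]
                    if not victims:
                        return                 # nothing left anywhere
                    task = rng.choice(victims).popleft()  # steal a head
            results.append(task())

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(nworkers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Task costs varying by orders of magnitude, as in symbolic computation;
# every task still runs exactly once.
out = run_work_stealing([lambda k=k: sum(range(k))
                         for k in (10, 10_000, 10, 10_000)])
```

Stealing from the opposite end of the victim's deque is the classic refinement that keeps large, older subtasks moving to idle workers while owners keep cache-warm recent work.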

SymGridPar is not, however, designed for parallel architectures with large numbers of cores. The multicore revolution is driving the number of cores along an exponential curve, but interconnection technology does not scale that fast. Hence many anticipate that processor architectures will have ever deeper memory hierarchies, with memory
access latencies varying by several orders of magnitude. The expectation is similar for exascale computing systems, where an increasing number of cores will lead to deeper interconnection networks, with relatively high communication latency between distant cores. Related to the exponential growth in the number of cores is a predicted exponential growth in core failures, as core reliability will remain constant, at best. These trends exacerbate the challenges of exploiting exascale architectures because they require the programmer to pay attention to locality and to guard against failures.

We present the design and initial evaluation of SymGridPar2 (SGP2), a successor to SymGridPar that is designed to scale onto 10$^6$ cores by providing the programmer with high-level abstractions for locality
control and fault tolerance. SGP2 is being developed as part of the UK EPSRC HPC-GAP project (EP/G05553X), which aims to scale the GAP computer algebra system to large scale clusters and HPC architectures.

We present the SGP2 design goals, principles and architecture. We demonstrate how scalability is achieved by allowing the programmer to control task placement on a distance-based abstraction of the communication topology of large architectures. We outline how SGP2 provides fault tolerance by supervising remote computations, and we sketch higher-level fault tolerance abstractions like supervised workpools and skeletons.

SGP2 is still under development. We will survey the status of the implementation and present preliminary scalability and fault tolerance results. Specifically, we will investigate the scalability and efficiency of a layered task placement strategy on up to 64K cores on HECToR, and we will evaluate the overheads of a fault tolerant work pool on a Beowulf cluster, both in the presence and absence of faults.


Exaflop computations and the fusion energy devices simulation prospects

Boris Chetverushkin, Vladimir Gasilov, Vladimir Novikov, Olga Olkhovskaya, Gennady Bagdasarov, Alexey Boldarev, Elizaveta Dorofeeva, Irina Gasilova and Ilya Vichev


Controlled nuclear fusion is a promising way to produce, without pollution, a substantial share of the world's energy consumption. Commercial power plants are expected by 2050, but the creation of a fusion reactor requires solving a number of intricate scientific and engineering problems. Simulations using supercomputers accelerate the progress of fusion.
Today the most important task is a critical analysis of the available computational tools (algorithms, codes) with regard to the possibility of scaling to new generations of multi-petaflop and exaflop computers. The Keldysh team in the NuFuSE project has developed edge plasma simulations via modification of radiative plasma dynamics models to include the effects of non-equilibrium plasma and radiation. This full-scale model will be used to integrate the code for the numerical analysis of edge plasma effects with models of the plasma in the central part of the tokamak.
We have carried out a number of computations on the supercomputers Helios (IFERC, Rokkasho, Japan), LOMONOSOV (Moscow State University, Russia), MVS-100K (JSCC, Moscow, Russia), and K-100 (KIAM RAS, Moscow, Russia) concerning 3D simulations of plasma flows in the ITER divertor by means of our MHD code MARPLE. We hope to continue this work. MARPLE has now been enriched with a turbulence model, which is important for comprehensive divertor simulations. MARPLE's performance and scalability in these computations were measured; speedup, efficiency and cost of computations were examined with a view to enabling our applications for exascale.


PyOP2: a framework for fast, unstructured mesh computations

Lawrence Mitchell, Florian Rathgeber, Graham Markall, Nicolas Loriant, Carlo Bertolli, Gihan R. Mudalige, Mike B. Giles, David A. Ham and Paul H. J. Kelly


With the move towards exascale-lite HPC systems, computational scientists are required to exploit ever more layers of parallelism to obtain reasonable performance for their simulations. As well as writing efficient, weak-scaling code, they must ensure that on-node performance is some reasonable fraction of peak (be it memory-bandwidth or flop). The latter problem is often addressed by micro-optimisations specific to a single architecture, or even chip.
Most scientists, who have relatively little say over what system their code will run on, will shy away from this approach even if the optimisations are possible: the maintenance burden is too great.

We propose that the solution to this problem is to separate the abstract numerics for any particular numerical problem from their implementation, and use generative approaches to ensure efficient execution on a wide range of platforms. This allows domain scientists to efficiently express the problems they wish to address, and computer
scientists and HPC experts to lend their efforts to translating these representations into optimised code.

PyOP2 is a high-performance runtime library for the execution of unstructured mesh codes. It supports distributed memory parallelism through MPI and targets both CPU shared memory parallelism and GPU accelerators on a node using either OpenMP, OpenCL or CUDA as appropriate. The user writes Kernels to be executed over every entity in a mesh (for example nodes, facets or elements) and expresses this execution in an explicitly data
parallel fashion. On execution, efficient code is generated to perform the user's intent. The runtime library takes care of data dependencies where appropriate through halo swaps and redundant computation for MPI parallelism and either by colouring for safe concurrent access, or use of atomic instructions in shared memory.
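The colouring strategy for safe concurrent access can be sketched on a toy mesh: elements sharing a node get different colours, so all elements of one colour may update nodal data concurrently without atomics. A greedy illustration only, not PyOP2's implementation:

```python
def colour_elements(elements):
    """Greedy colouring: give each element the smallest colour not
    already used by any element it shares a node with. Elements of
    equal colour touch disjoint nodes, so their kernel invocations
    can run concurrently without atomic instructions."""
    node_colours = {}   # node -> set of colours already touching it
    colours = []
    for nodes in elements:
        used = set()
        for n in nodes:
            used |= node_colours.setdefault(n, set())
        c = 0
        while c in used:
            c += 1
        colours.append(c)
        for n in nodes:
            node_colours[n].add(c)
    return colours

# Four 1D elements in a line: neighbours share a node and must
# alternate colours, giving two colour classes of two elements each.
mesh = [(0, 1), (1, 2), (2, 3), (3, 4)]
cols = colour_elements(mesh)
```

Execution then proceeds colour by colour, with a barrier between colours; within one colour every element's contribution may be scattered to the nodes in parallel.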

Built on top of this layer, we present a high-performance implementation of the Unified Form Language (UFL) -- an abstraction for expressing the weak forms of PDEs -- that automatically generates high-performance code for solving finite element problems on both CPUs and GPUs. This implementation has been embedded within the computational fluid dynamics package Fluidity. Our generative approach results in significant performance improvements over the existing code for identical functionality. On CPUs, we see a 5x speedup over the Fortran code in Fluidity, and a factor of 2.5x over the FEniCS UFL implementation, DOLFIN. Our GPU implementation outperforms Fluidity and DOLFIN by a further factor of 5x.

In addition to increased performance, by retaining high-level abstractions until the point of execution we can easily implement computational methods that have previously been considered intractable. For finite element problems the ability to automatically generate optimal adjoint models is perhaps the most exciting.

In the past, high level languages have been shied away from in HPC, since there has been a (justified) belief that one loses performance when gaining expressivity. We believe our approach demonstrates that you can both have your cake, and eat it.

Development of PyOP2 is funded by EPSRC (EP/I00677X/1, EP/I006079/1) and the European Union (FP7/277481).


Plasma Generation Modelling – A Challenge for Exascale

Mikhail Markov, Andrey Berezin, Sergey Parotkin and Alexander Vorontsov

Two main approaches are used for plasma modeling: kinetics and hydrodynamics. The majority of critical results are obtained with the second approach. However, the hydrodynamic approximation in problems of practical interest cannot be valid in the whole phase space and at all times; for example, plasma generation cannot be considered. The kinetic approach becomes impossible when the electron density reaches some critical level. It is unlikely that even on an exascale system kinetics will fully substitute for hydrodynamics. But kinetic modeling of plasma generation, up to the point where the distribution function is Maxwellized, is a good challenge for exascale.

This paper presents an attempt at numerical, kinetics-based modeling of an industrial discharge-produced plasma source, which consists of metallic volumes connected by a dielectric capillary, and outlines the limitations of the model used on contemporary supercomputers.

Initially, free electrons, which exist due to natural background ionization, are in thermal equilibrium with the gas. These electrons are accelerated by the external electric field up to the threshold of molecular excitation and ionization.

The inhomogeneous kinetic equation describes electron 3D transport under the action of the Lorentz force. Elastic scattering on neutral molecules, molecular excitation and impact ionization are described by the linear collision integral. The Maxwell equations describe the electromagnetic field. Ion motion and interactions are neglected.

The solution is constructed by the method of successive generations. The sum of the zeroth- and first-generation distributions defines the solution to first order of accuracy in terms of generalized functions with random parameters. A combination of Monte Carlo and particle methods is used for numerical modeling. An explicit difference scheme approximates the Maxwell equations. The computer code was developed for a classic cluster computer; parallelisation across the processors with dynamic load balancing is used.

The results of the modeling depend on the gas pressure in the device. At pressures of 1-2.5 Torr, only an electron beam is generated in the capillary. Plasma generation begins in a small compressing region inside the capillary when the pressure reaches 3 Torr. The space charge becomes significant.

The simulation was performed on the hybrid cluster computer K-100 at KIAM. Over 2 billion particles were considered on a uniform computational mesh containing 13 million spatial cells with an edge of 0.01 cm, with a time step of 10^-13 s. 0.5 TB of RAM was used, and 50 hours of run time were expended on 176 CPUs. Hydrodynamic conditions were not reached. The electron density reaches a value that corresponds to a Debye shielding radius of 10^-4 cm. Therefore, to continue the simulation one needs a mesh two orders of magnitude denser and a time step about three orders of magnitude smaller. This is feasible on computers with SIMD (single instruction, multiple data) devices. Since only homogeneous instructions execute efficiently on SIMD hardware, appropriate tuning of the algorithms was carried out. Preliminary tests show that using CUDA technology gives a 20-fold speedup.

Electron-ion collisions should be taken into account when the electron density becomes comparable with that of the neutrals. This is impossible for the existing version of the particle method without assuming a Maxwellian distribution of ions.