Hybrid-parallel algorithms for 2D Green’s functions

Alejandro Álvarez-Melcón, Domingo Giménez, Fernando D. Quesada and Tomás Ramírez
alejandro.alvarez@upct.es; domingo@um.es
Universidad Politécnica de Cartagena / Universidad de Murcia
ETSI Telecomunicación / Facultad de Informática
Dpto. Tecnologías de la Información y las Comunicaciones / Dpto. de Informática y Sistemas

International Conference on Computational Science (ICCS 2013), June 5-7, 2013

Outline
1 Introduction and motivation
2 Computation of Green’s functions on hybrid systems
3 Experimental results
4 Autotuning
5 Conclusions and perspectives

Introduction and motivation

Motivation of the work
1 High interest in the development of full-wave techniques for the analysis of microwave components and antennas.
2 Need for efficient software tools that allow optimization of complex devices in real time.

Calculation of Green’s functions inside waveguides
The execution time increases because of:
1 Slow convergence of the series (images, modes).
2 Large number of pairs of points.

Objectives of the work
1 Increase efficiency using parallel computing.
2 Apply several hybrid-heterogeneous parallelism strategies in this context.

Strategies explored
1 Parameterized hybrid parallelism (MPI+OpenMP+CUDA) for the computation of Green’s functions in rectangular waveguides.
2 Autotuning strategies based on the parameterized code.

Computation of Green’s functions on hybrid systems

Hybrid parallelism
MPI+OpenMP, OpenMP+CUDA and MPI+OpenMP+CUDA routines are developed to accelerate the calculation of 2D waveguide Green’s functions.

For each MPI process P_k, 0 ≤ k < p:
    omp_set_num_threads(h + g)
    for i = k·m/p to (k + 1)·m/p − 1 do
        node = omp_get_thread_num()
        if node < h then
            Compute with OpenMP thread
        else
            Call to CUDA kernel
        end if
    end for

p MPI processes are started, and h + g threads run inside each process. Threads 0 to h − 1 work on the CPU (OpenMP, OMP). The remaining threads, h to h + g − 1, work on the GPU by calling CUDA kernels.

Routines developed

h+g \ p    1           p
1+0        SEQ         MPI
h+0        OMP         MPI+OMP
0+g        CUDA        MPI+CUDA
h+g        OMP+CUDA    MPI+OMP+CUDA

Experimental results

Computational systems
Saturno is a NUMA system with 24 Intel Xeon cores at 1.87 GHz and 32 GB of shared memory, plus an NVIDIA Tesla C2050 with a total of 448 CUDA cores, 2.8 GB and 1.15 GHz.
Marte and Mercurio are AMD Phenom II X6 1075T (hexa-core) systems at 3 GHz, with 15 GB (Marte) and 8 GB (Mercurio), plus NVIDIA GeForce GTX 590 with two devices of 512 CUDA cores each. They are connected in a homogeneous cluster.
Luna is an Intel Core 2 Quad Q6600 at 2.4 GHz with 4 GB, plus an NVIDIA GeForce 9800 GT with a total of 112 CUDA cores.
All of them are connected in a heterogeneous cluster.

Use of GPU
Comparison between the use of one kernel and the use of several kernels.
The plot shows S = T(#kernels=1) / T(#kernels=X) as a function of the problem size (#images, #points).
Three kernels give satisfactory results.

Comparison between use of CPU versus use of GPU
Test of computational speed when CPUs or GPUs are used. The CPU version uses a number of threads equal to the number of cores.
S = T(#threads=#cores) / T(#kernels=3). S > 1 means the GPU is preferred over the CPU.

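The CPU-only and GPU-only runs compared here are particular settings of the (h, g) parameters of the hybrid routine (h = #cores, g = 0 versus h = 0, g = 3). As a minimal sketch, not the authors' code, the Python fragment below shows how the loop range of each MPI process and the role of each thread could follow from p, h and g. The function names `block_range` and `thread_roles` are hypothetical, m is assumed to be the number of loop iterations, and the integer block split is one possible reading of the loop bounds k·m/p and (k + 1)·m/p − 1.

```python
def block_range(k, p, m):
    """Half-open range [start, stop) of loop indices owned by MPI process k of p.

    Assumption: an integer block split of m iterations; when p does not
    divide m, the blocks differ in size by at most one iteration.
    """
    if not 0 <= k < p:
        raise ValueError("process rank k must satisfy 0 <= k < p")
    return (k * m) // p, ((k + 1) * m) // p


def thread_roles(h, g):
    """Role of each of the h + g threads of one process.

    Thread ids 0..h-1 compute on the CPU; ids h..h+g-1 launch CUDA kernels.
    """
    return ["cpu" if t < h else "gpu" for t in range(h + g)]


if __name__ == "__main__":
    m, p = 10, 3
    for k in range(p):
        print(k, block_range(k, p, m))
    print(thread_roles(h=2, g=3))
```

With h = 0 every thread drives the GPU (the CUDA and MPI+CUDA routines), and with g = 0 every thread computes on the CPU (the OMP and MPI+OMP routines).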
Improvement with MPI+GPU
Test of computational speed: several kernels in one process versus several processes with one kernel each.
S = T(#proc=2; #kernels=X) / T(#proc=2*X; #kernels=1). S > 1 means it is preferable to start each kernel inside its own MPI process.

Comparison between GPU and optimum parameters
Selecting the optimum values of p, h and g produces lower execution times than blind GPU use.
S = T(#kernels=3) / T(lowest). S > 1 means the GPU-only run is slower than the lowest time.
A speed-up of two is obtained for large problems using the optimum parameters.

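The comparison with T(lowest) presupposes that the (p, h, g) combination with the lowest execution time is known. One simple way to find it, sketched here under assumptions, is an exhaustive search over a small grid of values. `measure` is a hypothetical callable that returns the execution time of one run, and `synthetic_time` is an invented function used only to make the example runnable; it does not reproduce any measured result.

```python
from itertools import product


def best_configuration(measure, p_values, h_values, g_values):
    """Return ((p, h, g), time) with the lowest time reported by measure.

    Combinations with h + g == 0 are skipped because no thread would run.
    """
    best = None
    for p, h, g in product(p_values, h_values, g_values):
        if h + g == 0:
            continue
        t = measure(p, h, g)
        if best is None or t < best[1]:
            best = ((p, h, g), t)
    if best is None:
        raise ValueError("no valid (p, h, g) combination to test")
    return best


def speed_up(t_reference, t_lowest):
    """S = T(reference) / T(lowest); S > 1 means the reference is slower."""
    return t_reference / t_lowest


def synthetic_time(p, h, g):
    # Invented stand-in for a real timing run (illustration only).
    return abs(p - 2) + abs(h - 4) + abs(g - 3) + 1.0


if __name__ == "__main__":
    config, t_low = best_configuration(
        synthetic_time, [1, 2, 4], [0, 2, 4, 6], [0, 1, 3]
    )
    print(config, t_low, speed_up(synthetic_time(1, 0, 3), t_low))
```

An exhaustive search is only practical when the grid is small; the autotuning techniques discussed later in the talk aim to avoid running every combination.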
Comparison homogeneous - heterogeneous cluster
Combining nodes with different computational speeds, different numbers of cores and different GPUs produces an additional reduction of the execution time. Different values of p, h and g are used for different nodes.
S = T(#kernels=3*#nodes) / T(lowest).
Important reduction of the execution time with the heterogeneous cluster. Execution time closer to the lowest experimental time.

Experiments allow satisfactory results with some heuristics
Three CUDA kernels per GPU.
Kernel calls inside MPI processes.
Do not include Luna in the heterogeneous cluster.

Work to be done
Further improvement.
What about a different computational system?
What about a user who is not an expert in parallelism?

Autotuning parallel codes

Autotuning strategies
Today’s hybrid, heterogeneous and hierarchical parallel systems are highly complex, so it is difficult to estimate the optimum parameters that lead to the lowest execution times.
The solution is to develop codes with autotuning engines, which try to ensure execution times close to the optimum, independently of the particular problem and of the characteristics of the computing system.

Types of autotuning techniques
Empirical autotuning.
Modeling of the execution time.

Model based autotuning
Execution time when the computation is distributed among c OpenMP threads or MPI processes:
Fine grained: $m n \left( \frac{nmod}{c} S_1 + \frac{2\,mimag + 1}{c} (2\,nimag + 1) S_2 + R(c) + M(c) \right)$
Coarse grained: $\left\lceil \frac{m}{c} \right\rceil n \left( nmod \, S_1 + (2\,mimag + 1)(2\,nimag + 1) S_2 + R(c) + M(c) \right)$
R(c) is the cost of the reduction and M(c) is the management cost; both depend on the number of threads or processes.

Satisfactory predictions are obtained in Marte and in Marte+Mercurio, from which a satisfactory selection can be made. But how to model hybrid systems and GPUs?

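As a sketch of how such a model could drive the selection, the fragment below evaluates the coarse-grained expression, read here as ceil(m/c) · n · (nmod·S1 + (2·mimag + 1)(2·nimag + 1)·S2 + R(c) + M(c)), for several values of c and keeps the one with the lowest predicted time. Assumptions: c is the number of threads or processes; R and M are passed in as functions because the slides do not give their form; `select_c` is a hypothetical helper; all numeric values in the usage lines are placeholders, not measured costs.

```python
def coarse_grained_time(c, m, n, nmod, mimag, nimag, S1, S2, R, M):
    """Predicted time of the coarse-grained distribution with c workers.

    ceil(m / c) * n * (nmod*S1 + (2*mimag+1)*(2*nimag+1)*S2 + R(c) + M(c))
    """
    if c < 1:
        raise ValueError("c must be a positive integer")
    blocks = (m + c - 1) // c  # integer ceiling of m / c
    inner = nmod * S1 + (2 * mimag + 1) * (2 * nimag + 1) * S2 + R(c) + M(c)
    return blocks * n * inner


def select_c(candidates, **model_args):
    """Return (c, predicted_time) for the candidate with the lowest prediction."""
    times = {c: coarse_grained_time(c, **model_args) for c in candidates}
    best = min(times, key=times.get)
    return best, times[best]


if __name__ == "__main__":
    args = dict(
        m=100, n=100, nmod=50, mimag=10, nimag=10,
        S1=1e-8, S2=1e-8,                             # placeholder costs
        R=lambda c: 0.0 if c == 1 else 1e-7 * c,      # placeholder form
        M=lambda c: 0.0 if c == 1 else 2e-7 * c,      # placeholder form
    )
    print(select_c([1, 2, 4, 6, 12, 24], **args))
```

In a model-based autotuner the costs S1 and S2 and the functions R and M would be estimated for each system at installation time, and the selection would then be made without running the routine for every value of c.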
Empirical autotuning
Run some test executions during the initial installation phase of the routine (installation set; keep the installation set small). This information is used at run time when a particular problem is being solved (validation set).

images-points   AUTO-TUNING   LOWEST   DEVIATION
1000-25               0.155    0.114      35.96%
100000-25             5.012    5.012          0%
1000-100              1.706    1.656       3.02%
100000-100           87.814   79.453      10.52%

Waveguide GF: different problem sizes (number of images, number of points). Execution times with the autotuning technique and with the optimum parameters (lowest).
The autotuning routine performs well for the problem sizes investigated.

Conclusions
Combining several parallelism paradigms allows the efficient solution of electromagnetic problems on today’s computational systems, which are hybrid, heterogeneous and hierarchical.
The calculation of Green’s functions inside waveguides has been adapted for heterogeneous clusters with CPUs and GPUs of different speeds.
Parameterized algorithms make it easier to adapt the code to the characteristics of the computational system.
Autotuning techniques can be incorporated so that users who are not experts in parallelism can use the routines efficiently on complex computational systems.

Perspectives
More optimized versions of the codes can be developed, especially for GPU.
Empirical autotuning techniques for large heterogeneous systems must be studied in more depth.
A model of the execution time of the hybrid routines needs to be developed.
Inclusion of the routines in higher-level electromagnetism codes, for example the analysis of finite microstrip structures using the Volume Integral Equation solved by the Method of Moments.