
1 Introduction

The capabilities of modern high-performance computers and the maturity of problem-oriented predictive modeling software allow users to increase the resolution of numerical grids to the order of billions of nodes and beyond. Simulation results on such meshes are represented by big data arrays, especially when unsteady processes are modeled. Investigations show a growing trend in data size [1]. The size of the retrieved data leads to low speed of scientific visualization and analysis of the results.

While the predictive modeling software and hardware available to a wide range of scientists allow fluid dynamics computations on meshes of up to 10 billion cells, visualization tools do not provide the desired efficiency. One of the problems is low visualization speed, and it is not caused by rendering alone. Childs et al. [2] showed that input/output time can be two orders of magnitude higher than the time of rendering and computation. One way to reduce input/output time is to develop special data processing algorithms that do not involve supercomputer technologies; this approach was pursued in [3,4,5]. In this paper the authors take a different approach, which assumes the use of a supercomputer.

The market of visualization systems lacks good tools capable of handling extra-large grids. The best-known scientific visualization packages, such as ParaView, TecPlot, COVISE, and TechViz, provide no information about effective presentation of results on meshes as large as a billion nodes. At the same time, this problem persists for aircraft designers.

The scientific problem considered in this article is achieving effective and rapid interactive visualization and analysis of the results of predictive modeling of fluid dynamics problems for modern aircraft on superlarge meshes. It is assumed that the software will be used by engineers without special IT knowledge, so it should display the results in the most convenient and understandable way. The software should visualize fields in real or near-real time to support a comfortable workflow.

The key idea underlying the developed software package is the use of distributed big data analysis tools, namely Apache Hadoop in conjunction with Apache Spark [6]. They provide distributed retrieval of data from cluster nodes, which can substantially reduce reading time compared to the traditional sequential approach. Hadoop is used mainly to support the Hadoop Distributed File System (HDFS), while a server built on top of the Spark framework processes queries for retrieving the required dataset from the cluster in a distributed manner. A ParaView plugin developed by the authors acts as the client of the Spark server: it sends queries and is integrated into the server version of ParaView. The user's computer runs the client version of ParaView, which receives the final rendering results from the server ParaView.

The application of Apache Hadoop to building packages for visualization of finite element modeling results is considered in [7]. An example of effective usage of HDFS is presented in [8]. An approach similar to the authors' was investigated in [9], with the difference that Apache Hive was used there instead of Apache Spark. The article [10] should also be mentioned: it offers a hybrid approach that combines HDFS with Kitware's ParaView as the user interface. In [11] and [12] the Hadoop and Spark frameworks were applied to the analysis and visualization of atmospheric phenomena modeling in Earth science problems, with attention paid mainly to the analysis tasks.

2 Architecture

This section describes the software architecture and how its components interact during a typical visualization session.

The software environment follows a client-server scheme and has the structure shown in Fig. 1. Its basic elements are:

  • Client ParaView. The client version of Kitware's well-known scientific visualization package ParaView is installed on the local computer and provides direct interaction with the user. It displays the rendering results received from the server ParaView and is the only software component that runs on the local machine.

  • Server ParaView. The server version of ParaView, installed on the cluster. It provides efficient parallel rendering based on the data retrieved by the authors' ParaView plugin.

  • Plugin to ParaView. The plugin, developed by the authors, is integrated into the server ParaView and is intended for efficient data reading. Instead of reading files directly, it sends SQL queries to a data server that runs on the same or another cluster and receives the requested data in response.

  • Data server. The server is written in Python and uses the Apache Thrift framework. It receives queries from the ParaView plugin and returns data blocks. The server forwards the queries to Spark, which retrieves the data in a distributed manner from the Hadoop Distributed File System (a minimal sketch of this component follows Fig. 1).

Fig. 1. The scheme of interaction of the software components
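The paper describes the data server only in outline (Python, Apache Thrift, queries forwarded to Spark), so the sketch below is merely a minimal illustration of that role. The service name DataService, its single fetch method, the HDFS path, and the pickle-based serialization are assumptions made here for concreteness, not the authors' actual interface.

```python
import pickle

from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from thrift.server import TServer
from pyspark.sql import SparkSession

# Generated by `thrift --gen py data_service.thrift` from a hypothetical IDL:
#   service DataService { binary fetch(1: string sqlQuery) }
from data_service import DataService


class DataServiceHandler:
    """Receives SQL queries from the ParaView plugin and forwards them to Spark."""

    def __init__(self):
        self.spark = SparkSession.builder.appName("viz-data-server").getOrCreate()
        # Register the Parquet frames as a SQL view (path is illustrative).
        self.spark.read.parquet("hdfs:///viz/frames").createOrReplaceTempView("frames")

    def fetch(self, sql_query):
        # Spark reads from HDFS in a distributed manner; collect() gathers
        # the result rows on the driver so they can be serialized and returned.
        rows = self.spark.sql(sql_query).collect()
        return pickle.dumps([row.asDict() for row in rows])


if __name__ == "__main__":
    server = TServer.TSimpleServer(
        DataService.Processor(DataServiceHandler()),
        TSocket.TServerSocket(host="0.0.0.0", port=9090),
        TTransport.TBufferedTransportFactory(),
        TBinaryProtocol.TBinaryProtocolFactory(),
    )
    server.serve()
```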

The interaction of the client and server parts of ParaView is the traditional way of using ParaView for parallel model rendering [13]. Therefore, the development of the ParaView plugin and the data server is the main direction of the authors' efforts. From the ParaView point of view, the plugin is a regular reader plugin for model data defined on a structured grid. At present, the plugin produces a vtkMultiBlockDataSet VTK object. However, instead of reading data directly, the plugin forms an SQL query to the data server and retrieves the data in response. The server passes the query to Apache Spark, which performs distributed reading, gathers the result (the collect operation), and sends the data back as the response to the SQL query. The advantage of this scheme over direct file reading is that reading via Spark is performed in parallel on several cluster nodes, which can increase speed, especially on large files.
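The plugin itself is not listed in the paper; the following sketch only illustrates its final assembly step, in which retrieved block arrays are packed into the vtkMultiBlockDataSet that ParaView expects from a reader. The dictionary layout of `blocks` and the field name "pressure" are assumptions for illustration.

```python
import numpy as np
import vtk
from vtk.util import numpy_support


def build_output(blocks):
    """Pack retrieved grid blocks into the vtkMultiBlockDataSet handed to
    ParaView. Each element of `blocks` is assumed to be a dict holding the
    node dimensions, flat coordinate arrays in structured-grid order, and
    one field array (here hypothetically called "pressure")."""
    output = vtk.vtkMultiBlockDataSet()
    for i, blk in enumerate(blocks):
        grid = vtk.vtkStructuredGrid()
        grid.SetDimensions(*blk["dims"])  # (ni, nj, nk) nodes per block

        points = vtk.vtkPoints()
        xyz = np.column_stack((blk["x"], blk["y"], blk["z"]))
        points.SetData(numpy_support.numpy_to_vtk(xyz, deep=True))
        grid.SetPoints(points)

        field = numpy_support.numpy_to_vtk(blk["pressure"], deep=True)
        field.SetName("pressure")
        grid.GetPointData().AddArray(field)

        output.SetBlock(i, grid)
    return output
```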

The reading speed essentially depends on the file format used. Initially, the data is stored in the Tecplot text format and occupies 1.27 GB per frame. This format is not suitable for holding large data, so it has to be converted. The target format must meet the following requirements. First, the file size must be as small as possible. Second, it must provide relatively fast access to individual blocks of data, which is important when there is no need to read the entire file. It must also support distributed storage and be readable by Spark. Based on these requirements, Apache Parquet [14] was chosen as the data format. It meets all of the above: being a binary format, it is smaller than Tecplot (after conversion one frame of data takes about 400 MB), it provides fast access and distributed storage, and it can be read by Spark. It is also flexible enough to express the required block structure of data storage. The format itself is a set of columns organized into a hierarchical structure governed by a schema. The schema is chosen by whoever writes the file; it is stored in the file's metadata and can be restored on reading. Thus, a Parquet file can hold a highly complex hierarchical structure. Although Apache does not describe the details of the format implementation, it provides an open-source API for reading and writing. The basic ideas of working with Parquet are taken from [15].
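These properties can be illustrated with a small sketch using the pyarrow API (which the authors also employ for reading); the column names and sizes are hypothetical. One block's arrays are written as separate Parquet columns, and a single column is then read back without scanning the whole file.

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

n = 100_000  # nodes in one block (illustrative)
x, y, z = (np.random.rand(n) for _ in range(3))
pressure = np.random.rand(n)

# Each coordinate and field component becomes a separate Parquet column;
# the schema and the column offsets are stored in the file's metadata.
table = pa.table({"x": x, "y": y, "z": z, "pressure": pressure})
pq.write_table(table, "frame_000.parquet", compression="snappy")

# A single column can later be read back without scanning the whole file.
pressure_only = pq.read_table("frame_000.parquet", columns=["pressure"])
```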

As the method of storing data on a structured grid, a block structure was chosen: the initial index parallelepiped of grid elements is divided by parallel index planes in three directions. Each block spatially occupies a region in the form of a curvilinear hexahedron and is itself a structured grid. Inside the Parquet file, each block is stored as a set of separate Parquet columns, one per node coordinate and field component. Because Parquet keeps column addresses in the file's metadata, each column can be accessed directly without reading the entire file, which solves the problem of selectively reading the required grid blocks.
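The exact column layout is not given in the paper; a minimal sketch of selective block extraction might look as follows, where the flat naming scheme b{i}_{j}_{k}_{name} is a purely illustrative stand-in for the hierarchical scheme described above.

```python
import pyarrow.parquet as pq


def read_block(path, bi, bj, bk, fields=("x", "y", "z", "pressure")):
    """Read only the columns belonging to one grid block.

    The naming scheme b{i}_{j}_{k}_{name} is hypothetical; Parquet keeps
    column-chunk offsets in the footer metadata, so only the requested
    columns are actually fetched from storage.
    """
    cols = [f"b{bi}_{bj}_{bk}_{name}" for name in fields]
    return pq.read_table(path, columns=cols)


# e.g., extract block (2, 0, 1) of a frame without reading the entire file
block = read_block("frame_000.parquet", 2, 0, 1)
```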

At the moment, the solution does not adapt to the way the simulation output is distributed, so the data must be redistributed whenever the number of compute nodes changes. Making the solution adaptive to such changes is a topic for further research.

3 Usage Example

An example of visualization of unsteady gas dynamics simulation results on a structured grid of hexahedra is considered. The source data is written as time layers, each layer initially stored in a separate file in Tecplot format. The model contains about \(5\cdot 10^6\) nodes, and each Tecplot file has a size of 1.27 GB.

Processing the individual frame files during direct visualization of the frame sequence on a standalone computer in ParaView takes about 30 s, which is unacceptably slow for interactive mode. Such low speed can be attributed to the fact that the Tecplot text format is poorly suited to holding big data.

The same data is visualized with the developed software. All components except the client ParaView are installed on a cluster: the nodes of the supercomputer "RSC Tornado" of the Saint Petersburg Polytechnic University Supercomputer Center host the ParaView and data servers. Each node has two Intel Xeon E5-2697 v3 CPUs (14 cores, 2.6 GHz) and 64 GB of DDR4 RAM. Simply reading a Tecplot data file located on one node with ParaView takes about 60 s.

Before visualization in the developed environment, the data files were converted to Parquet format, which reduced the size of one frame to about 400 MB. There were 20 frames, about 7.8 GB in total. Data of this rather modest size is used only to illustrate the approach; much larger files are planned for further research. During the conversion, the data was transformed as described above: the initial index parallelepiped of the structured grid was divided by mutually orthogonal index planes into smaller parallelepipeds, and inside each block every coordinate and field component became a separate Parquet column. Such a structure is efficient for extracting selected blocks because it does not require reading the entire file. Converting one frame takes about 40–45 s; the whole dataset is converted in 90 s using 10 MPI processes.
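The conversion driver is not listed in the paper; since the frames are independent, the parallelization over MPI processes could look like the following sketch (file names and the converter stub are assumptions made here).

```python
from mpi4py import MPI


def tecplot_frame_to_parquet(src, dst):
    """Placeholder for the actual Tecplot-to-Parquet converter
    (parsing and block splitting omitted; see the sketches above)."""
    ...


comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

frames = [f"frame_{i:03d}.dat" for i in range(20)]  # 20 frames, as in the text

# Round-robin distribution of frames over MPI processes; each frame is
# converted independently, so no inter-process communication is needed.
for i in range(rank, len(frames), size):
    tecplot_frame_to_parquet(frames[i], frames[i].replace(".dat", ".parquet"))
```

Run as, e.g., `mpiexec -n 10 python convert_frames.py` to match the 10 MPI processes mentioned above.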

Apache Spark is launched on the cluster under the control of the Slurm workload manager. Parquet files are read in Spark with the pyarrow library, as it is faster and requires significantly less memory than Spark's built-in Parquet reader. Data in Spark is represented as a flat, non-hierarchical structure.
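The authors' Spark reading code is not published; the sketch below shows one way such pyarrow-based distributed reading might look. The file paths, the block-column naming, and the HDFS configuration name "default" are assumptions.

```python
from pyspark import SparkContext

sc = SparkContext(appName="distributed-parquet-read")

# Hypothetical work list: one task per requested grid block.
tasks = [
    ("/viz/frames/frame_000.parquet",
     ["b0_0_0_x", "b0_0_0_y", "b0_0_0_z", "b0_0_0_pressure"]),
    ("/viz/frames/frame_000.parquet",
     ["b1_0_0_x", "b1_0_0_y", "b1_0_0_z", "b1_0_0_pressure"]),
]


def read_block(task):
    # Runs on the executors: pyarrow reads only this block's column
    # chunks from HDFS, bypassing Spark's built-in Parquet reader.
    import pyarrow.parquet as pq
    from pyarrow import fs
    path, cols = task
    table = pq.read_table(path, columns=cols,
                          filesystem=fs.HadoopFileSystem("default"))
    return {c: table.column(c).to_numpy() for c in cols}


# Blocks are read in parallel across the cluster; collect() gathers the
# flat (non-hierarchical) column arrays on the driver for the data server.
blocks = sc.parallelize(tasks, numSlices=len(tasks)).map(read_block).collect()
```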

Table 1. Time of reading and displaying separate frames on 8 nodes of the "RSC Tornado" cluster

The time to display each of the first five frames using 8 cluster nodes is shown in Table 1. The second column gives the time of direct reading of the Parquet file in Spark using pyarrow. The third column is the total time to display the frame, and the last column is the number of frames per second, i.e., the inverse of the total time. Most of the time is spent transferring data from the data server to the client ParaView, which indicates the need for a faster network. Reading and transmitting the first frame takes longer than the subsequent ones, primarily because the grid must be read. Since the grid remains unchanged from frame to frame, it is read only once, with the first frame; starting from the second frame, the system reads only the values of the displayed field, so the reading time no longer changes.

Increasing the number of nodes involved does not lead to the expected decrease in Spark reading time. Finding the cause of this behavior is one of the tasks for further research.

4 Conclusions

A software environment for interactive visualization has been developed. It provides visualization of simulation results computed on large numerical grids. The environment consists of a client ParaView, a server ParaView, and a data server that forwards SQL queries to Apache Spark. The latter increases the speed of reading large data by providing distributed access to it.

The experiments have shown the effectiveness of keeping the data in Parquet format. Compared to the Tecplot text format, Parquet yields smaller files, is directly readable by Apache Spark, and allows individual data blocks to be extracted without reading the entire file.

The experiments also showed that the data reading speed in Spark does not scale with the number of nodes and that the overhead of transferring data from Spark to the server ParaView is high. Addressing these problems is a task for further development.

Another challenge is to use more specific SQL queries. These can be requests for the data corresponding only to the visible part of the model, or requests for data organized into levels of detail, where the level to extract depends on the camera position: a closer view, corresponding to higher detail, requires a level with more nodes. A hypothetical sketch of such a view-dependent query is given below.
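Since this is future work, the following sketch is only one conceivable shape for such a query; the table and column names (frames, lod, the bbox_* bounds) are invented here for illustration.

```python
def visible_blocks_query(frame_no: int, lod: int, bounds) -> str:
    """Build a hypothetical SQL query that fetches only the grid blocks
    intersecting the camera's axis-aligned view bounds at a given level
    of detail. All table and column names are illustrative only."""
    xmin, xmax, ymin, ymax, zmin, zmax = bounds
    return f"""
        SELECT block_id, x, y, z, pressure
        FROM frames
        WHERE frame = {frame_no} AND lod = {lod}
          AND bbox_xmax >= {xmin} AND bbox_xmin <= {xmax}
          AND bbox_ymax >= {ymin} AND bbox_ymin <= {ymax}
          AND bbox_zmax >= {zmin} AND bbox_zmin <= {zmax}
    """


# e.g., a coarse level of detail for a distant camera position
sql = visible_blocks_query(frame_no=0, lod=2, bounds=(-1, 1, -1, 1, 0, 5))
```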