1 Introduction

Graph-structured data appear in many applications, including scientific discovery, social network analysis, web search (Inokuchi et al., 2003), recommender systems, and geographical data (Miller & Han, 2001). Many techniques have been successfully developed to exploit both the information captured by the graph structure and the features of nodes and edges. Most notably, neural network approaches (Kipf & Welling, 2016; Hamilton et al., 2017; Veličković et al., 2017) and network embedding approaches (Perozzi et al., 2014; Tang et al., 2015b; Grover & Leskovec, 2016; Wang et al., 2016; Cao et al., 2015; Goyal & Ferrara, 2018) have continuously set the state of the art in a wide range of problems such as node classification, graph classification, and link prediction. However, the methods mentioned above often share a common homogeneity assumption and are therefore not well suited to graph data with multiple node types and edge types.

In real-world applications such as recommender systems or search engines, the graph data usually contain multiple types of objects (nodes) and relations (edges). Although it is reasonable to use a homogeneous graph learning model with node types and edge types (relations) encoded into node attributes, doing so would compromise the smoothness assumption inherent in many graph neural networks (NT & Maehara, 2019). In fact, node types (and edge types) are discrete values and often do not share the same structure as node features. Therefore, heterogeneous graphs, also known as heterogeneous information networks (HINs) (Sun & Han, 2013), can capture data in real-world applications more faithfully than homogeneous graphs.

HINs are designed to capture rich semantics and comprehensive information, which is useful for various data mining tasks such as similarity search (Sun et al., 2011), recommendation (Liu et al., 2014), clustering (Sun et al., 2012), and classification (Kong et al., 2012). The heterogeneity of HINs poses a challenge in graph mining, namely how to learn from multiple types of nodes and edges. Since the state of the art for homogeneous graph representation learning is neural-based graph embedding (Perozzi et al., 2014; Tang et al., 2015b; Grover & Leskovec, 2016; Wang et al., 2016; Cao et al., 2015; Kipf & Welling, 2016; Veličković et al., 2017), it is natural to extend these methods to HINs. Early works (Tang et al., 2015a; Dong et al., 2017) support multiple node types and relations, but their node embeddings do not consider target relations. To address this problem, some subsequent works (Ty et al., 2017; Shi et al., 2018b) implicitly consider the target relation through edge or metapath vectors. However, they do not consider neighbor information, which is crucial to high performance in homogeneous graphs (Kipf & Welling, 2016; Hamilton et al., 2017).

Fig. 1 a A snapshot of the Amazon data, where the node of interest (the laptop) is circled; b example of our proposed single-level aggregation scheme; c example of the bi-level aggregation scheme, where attention is required. Our model introduces relation projections and infomax learning, which replace the attention mechanism in state-of-the-art approaches

Most recently, graph neural networks designed for HINs have extended ideas from the homogeneous-data literature to efficiently solve problems on heterogeneous data, setting state-of-the-art results. In general, these networks use an approach called bi-level aggregation (Fig. 1c) to learn heterogeneous node embeddings. The first level aggregates information from neighbors that share the same node type, relation, or meta-path. The second level then employs averaging or an attention mechanism to aggregate the outputs of the first level. Notable bi-level aggregations include RGCN (Schlichtkrull et al., 2018), HAN (Wang et al., 2019), GATNE (Cen et al., 2019), and HGT (Hu et al., 2020b).

In this paper, we find that the bi-level approach may overlook individual node information, especially when the numbers of different types of relations are imbalanced. Take Fig. 1c as an example: if there are many more user interactions than also-buy or also-view relations, the bi-level aggregation scheme tends to down-weight the information from an individual user. This harms its performance for HIN embedding.

To further investigate this problem, we create a toy HIN consisting of three node types, namely "user", "item", and "tag", and two relations, namely "user-item" (U-I) and "item-tag" (I-T). To simulate the preferences of a user and the features of an item, we randomly assign four tags to each user and three tags to each item. We then connect an item and a tag if the item is associated with the tag, and we add a link between a user and an item, with 25% probability, if they have two associated tags in common. The resulting graph contains 1,000 users, 100 items, and 10 tags, with 8,025 U-I edges and 300 I-T edges, representing the majority and minority relations, respectively. We generate 10 node attributes according to the associated tags and another 20 unrelated noisy attributes drawn from a binary distribution. Table 1 shows the link prediction results of various methods (the experimental settings are described in the experiment section). Surprisingly, we find that bi-level aggregations (RGCN, HAN) are even inferior to traditional aggregations that do not consider heterogeneity (GraphSAGE (Hamilton et al., 2017) and GAT (Veličković et al., 2017)), especially for the I-T relation. The reason is that bi-level aggregations down-weight the U-I relation and overfit to the noisy features. More evidence, explanations, and results on the down-weighting issue are provided in the experiment section, in particular Sect. 4.3 (Bi-level vs Single-level Aggregation).
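To make the construction concrete, the sketch below shows one way such a toy HIN could be generated with NumPy; the seed, the one-hot tag encoding of the first 10 attributes, and the helper names are illustrative assumptions rather than our exact generator.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_tags = 1000, 100, 10

# Preferences and features: 4 tags per user, 3 tags per item.
user_tags = [set(rng.choice(n_tags, 4, replace=False)) for _ in range(n_users)]
item_tags = [set(rng.choice(n_tags, 3, replace=False)) for _ in range(n_items)]

# I-T edges: an item is connected to each of its associated tags.
it_edges = [(i, t) for i, tags in enumerate(item_tags) for t in tags]

# U-I edges: with 25% probability, link a user and an item sharing >= 2 tags.
ui_edges = [(u, i)
            for u in range(n_users) for i in range(n_items)
            if len(user_tags[u] & item_tags[i]) >= 2 and rng.random() < 0.25]

def node_attributes(tags):
    """10 tag-derived attributes followed by 20 unrelated noisy binary ones."""
    x = np.zeros(30)
    x[list(tags)] = 1.0                 # tag-based attributes
    x[10:] = rng.integers(0, 2, size=20)  # non-related noisy attributes
    return x
```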

Table 1 Link prediction results (MRR) in the toy HIN

In this paper, we propose a simple yet effective single-level aggregation scheme with infomax encoding, named HIME, for unsupervised HIN embedding (Fig. 1b). The key point is that we perform a relation-specific transformation to obtain homogeneous embeddings before aggregating information from neighbors of multiple types. We thereby emphasize the "equal" contribution of each neighbor and do not suffer from the down-weighting issue when the numbers of different types of relations are imbalanced. Our final embeddings are learned with a loss that encourages closeness between neighbors and an infomax encoder (Fig. 2) that augments graph smoothing in the homogeneous embeddings. As shown in Table 1, HIME outperforms the bi-level aggregation approaches on the toy HIN, especially by a large margin for the I-T relation. We conduct extensive experiments on ten benchmark datasets and demonstrate that HIME consistently outperforms state-of-the-art approaches in link prediction, node classification, and node clustering tasks. We also show that HIME is scalable and able to handle a large HIN containing 12.8 million edges.

Contribution—We make the following contributions: (1) To the best of our knowledge, we are the first to raise the down-weighting issue of bi-level aggregations, which hinders their effectiveness for HIN embedding, and to provide concrete empirical evidence for it. (2) We introduce the heterogeneous single-level aggregation scheme with infomax embedding, a simple yet effective HIN embedding method. (3) We show that our implementation outperforms the latest HIN embedding models on many practical tasks.

2 Preliminaries

Definition 1

(Heterogeneous Information Network) A Heterogeneous Information Network (HIN) is a tuple \((G, {\mathcal {T}}, \varphi , {\mathcal {R}})\), where \(G = ({\mathcal {V}}, {\mathcal {E}})\) is an undirected graph with node set \({\mathcal {V}}\) and edge set \({\mathcal {E}} \subseteq {\mathcal {V}} \times {\mathcal {R}} \times {\mathcal {V}}\), and \({\mathcal {R}}\) is a set of relations. \({\mathcal {T}}\) is the set of node types, and the function \(\varphi : {\mathcal {V}} \mapsto {\mathcal {T}}\) maps each node to a single type. Optionally, some datasets include node attributes \(x': {\mathcal {V}} \mapsto {\mathbb {R}}^p\), where p is the number of attribute dimensions. Note that HINs require \(\vert {\mathcal {T}} \vert > 1\) or \(\vert {\mathcal {R}} \vert > 1\).
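For readers who prefer code to notation, the following is a minimal container mirroring Definition 1; the class name, field layout, and undirected edge storage are illustrative choices, not part of the formal definition.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

@dataclass
class HIN:
    """Minimal HIN container (sketch): varphi as a dict, typed edges as triples."""
    node_types: Dict[int, str]                                      # varphi(v)
    edges: Set[Tuple[int, str, int]] = field(default_factory=set)   # (u, relation, v)
    attrs: Dict[int, List[float]] = field(default_factory=dict)     # optional x'(v)

    def add_edge(self, u: int, relation: str, v: int) -> None:
        # store both directions since G is undirected
        self.edges.add((u, relation, v))
        self.edges.add((v, relation, u))

    def neighbors(self, u: int) -> Set[Tuple[int, str]]:
        """Typed neighborhood of u as (node, relation) pairs."""
        return {(v, r) for (a, r, v) in self.edges if a == u}
```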

Problem 1

(Relation-aware embedding for HIN) Given a HIN and a target relation \(t \in {\mathcal {R}}\), the problem is to generate a low-dimensional representation vector \({\mathbf {x}}^{t}_{i} \in {\mathbb {R}}^d\) for each node \(v_i \in {\mathcal {V}}\) according to the relation t, with \(d \ll \vert {\mathcal {V}} \vert\), such that the representations preserve the structure of the given HIN.

Our defined problem is common in unsupervised HIN embedding. To embed a HIN comprehensively, we do not generate the embedding for a single specific relation, but solve the relation-aware embedding problem across all relations in the given HIN. The main challenge is maintaining high-quality representations across relations while dealing with the scalability issue. In the following, we first discuss related work on graph embedding and HIN embedding.

2.1 Graph embedding

Recently, homogeneous graph embedding methods (Perozzi et al., 2014; Tang et al., 2015b; Grover & Leskovec, 2016; Wang et al., 2016; Cao et al., 2015) have emerged as scalable ways to learn low-dimensional vector representations for graph nodes. These node representations encode semantic information transcribed from the graph and can be used directly as node features for downstream machine learning tasks. Traditionally, node representations can be obtained by performing the eigendecomposition of the graph Laplacian (Ng et al., 2002). However, due to the complexity of such an operation, other heuristics were proposed. The most notable is DeepWalk (Perozzi et al., 2014), which generates a series of random walks to capture the graph semantics and learns node representations using a skip-gram model (Mikolov et al., 2013). Applications of the skip-gram model can also be found in recommender systems (Grbovic et al., 2015; Bianchi et al., 2020), context prediction (Lazaridou et al., 2015), etc. Models subsequent to DeepWalk include LINE (Tang et al., 2015b) and node2vec (Grover & Leskovec, 2016). Certain works (Wu et al., 2018; Yang et al., 2019) introduce hashing techniques to reduce training time and improve scalability. For a more comprehensive view, we refer to the survey article on graph embedding (Goyal & Ferrara, 2018).

More recently, researchers proposed Graph Neural Networks (GNNs) as a new class of graph embedding models  (Hamilton et al., 2017; Kipf & Welling, 2016; Veličković et al., 2017). While not learning the node embedding explicitly, these models implicitly learn the node embeddings by combining node attributes and graph structures in neural-based models. We refer to the graph neural network survey article (Wu et al., 2019) for more details in this literature.

Since real-world data such as the Amazon data (Fig. 1a) are intrinsically heterogeneous, it is not trivial to extend graph embedding methods to work with heterogeneous data. Many HIN embedding models have been successful in bridging this gap.

2.2 HIN embedding

In early works on mining HINs, many methods (Sun et al., 2011; Liu et al., 2014) use meta-paths as semantic information to underline the difference between HINs and homogeneous networks. As random-walk-based models became popular, there were many attempts to combine the concept of meta-paths with skip-gram models. Notable works are metapath2vec and HIN2vec. metapath2vec (Dong et al., 2017) formalizes meta-path-based random walks and then utilizes a heterogeneous skip-gram model. HIN2vec uses a neural network model to capture and differentiate the information of meta-paths. In the meantime, another line of work (Xu et al., 2017; Tang et al., 2015a; Shi et al., 2018a) treats the graph under each relation as a subgraph and learns from those subgraphs jointly.

Researchers have recognized the problem of incompatibility among heterogeneous relations (Shi et al., 2018b; Chen et al., 2018). To alleviate this problem, they propose relation-specific projections or implicitly model the relation as a relation embedding. With the success of GNNs, many works follow these architectures and adapt them to HINs. HAN (Wang et al., 2019), GATNE (Cen et al., 2019), and GTN (Yun et al., 2019) apply an attention mechanism after aggregating at the subgraph level, separated by node types, relations, or meta-paths. We refer to these methods as bi-level aggregations. However, none of these works considers the contribution of each individual neighbor to the final node representations.

3 Proposed method

We present a simple yet effective relation-aware heterogeneous graph embedding algorithm called Heterogeneous graph InfoMax Encoder (HIME), which keeps the awareness of relations in heterogeneous graphs without suffering from the down-weighting issue. We first describe our encoder for learning node representations, and then introduce a mechanism to maximize the mutual information inside the encoder.

3.1 Relation-aware node embedding

Initially, each node \(v_i\) is associated with a feature vector \({\mathbf {x}}^{(0)}_i \in {\mathbb {R}}^{d}\) that is shared across relations. If node attributes are available, we project them to a d-dimensional space by \({\mathbf {x}}^{(0)}_i = {\mathbf {W}}_{a}{\mathbf {x}}'_i\), where \({\mathbf {W}}_a \in {\mathbb {R}}^{d \times p}\) is a learnable projection matrix and \({\mathbf {x}}'_i \in {\mathbb {R}}^{p}\) denotes the attributes of \(v_i\). To obtain the final representation of node \(v_i\) for relation t, we combine the information of node \(v_i\), its neighbors \({\mathcal {N}}_i\), and relation t. This is done by a single instance, or a stack, of our proposed heterogeneous single-level aggregators.

3.1.1 Heterogeneous single-level aggregator

As shown in Fig. 1b, given an object of interest \(v_i\), we transform each neighbor's features according to its relation to \(v_i\): \({\mathbf {e}}_{j,r} = h ( {\mathbf {x}}_j, r)\), where \(v_j\) is a node connected to \(v_i\) by relation r and \(h: {\mathbb {R}}^d \times {\mathcal {R}} \rightarrow {\mathbb {R}}^d\) is a differentiable function. h allows a node to pass distinct features for different relations. The transformation can be a hyperplane translation (Wang et al., 2014) or a linear transformation. We call \({\mathbf {e}}_{\cdot ,\cdot }\) an edge vector. We then aggregate the edge vectors using the mean function: \({\mathbf {n}}_i = \frac{1}{\vert {\mathcal {N}}_i \vert } \sum _{(j, r) \in {\mathcal {N}}_i} {\mathbf {e}}_{j,r}.\) Notice that we perform the relation-specific transformation of each neighbor before the aggregation to emphasize each neighbor's contribution, rather than diluting a node's information by fusing it with neighbors of the same relation. Furthermore, we found that the transformation should be selected carefully. For example, the hyperplane translation assumes that all edge vectors after the transformation still belong to the same feature space but lie in different hyperplanes. This can be beneficial for certain heterogeneous graphs where the information between relations is highly inclusive, so that knowing other relations helps to inform the structure of a target relation.
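A minimal PyTorch sketch of this step is given below, assuming a simple per-relation linear map as the transformation h (the hyperplane translation is an alternative); the class and function names are ours and only illustrate the single-level aggregation of edge vectors.

```python
import torch
import torch.nn as nn

class RelationTransform(nn.Module):
    """Relation-specific transformation h(x_j, r) producing edge vectors e_{j,r}.
    Sketch with one linear map per relation; a TransH-style hyperplane
    translation could be used instead."""
    def __init__(self, num_relations: int, dim: int):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(dim, dim, bias=False)
                                   for _ in range(num_relations)])

    def forward(self, x_j: torch.Tensor, r: int) -> torch.Tensor:
        return self.proj[r](x_j)


def aggregate_neighbors(h: RelationTransform, x: torch.Tensor, neighbors):
    """Single-level aggregation: mean of the edge vectors of *all* typed
    neighbors, so every neighbor contributes equally regardless of relation.
    `neighbors` is a list of (node index, relation index) pairs for node v_i."""
    edge_vectors = torch.stack([h(x[j], r) for j, r in neighbors])
    return edge_vectors.mean(dim=0)   # n_i in the text
```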

Next, we combine the information of node \(v_i\) (\({\mathbf {x}}_i\)) and its neighbors (\({\mathbf {n}}_i\)) using a gated recurrent unit (GRU) (Cho et al., 2014): \({\mathbf {x}}_i \leftarrow {{\,\mathrm{GRU}\,}}({\mathbf {x}}_i, {\mathbf {n}}_i)\). We treat \({\mathbf {x}}_i\) as the hidden state of a recurrent model. Alternatively, simplified versions of the GRU (Chairatanakul et al., 2019) can be used to reduce model complexity and possibly improve performance on sparse datasets. The motivation behind using GRUs stems from the oversmoothing issue of GNNs, that is, the degradation of GNN performance as the number of layers increases because the node representations become increasingly similar (Li et al., 2018; Oono & Suzuki, 2019). Note that the GRU is commonly used for learning from sequential data where retaining past information is crucial, although it requires heavy computational resources. We hypothesize that the GRU can extract new information at the current layer while retaining useful information from the previous layer. Any homogeneous GNN can also be used after obtaining the edge vectors by treating them as neighbors in a homogeneous graph. Note that a slight modification of bi-level aggregation can turn it into single-level aggregation; however, doing so would contradict the purpose of the second level, which aims to reassign weights based on relations.
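Continuing the sketch above (and reusing RelationTransform and aggregate_neighbors), one layer of the aggregator could look as follows; treating x_i as the hidden state of an nn.GRUCell is our reading of the update rule, and the per-node loop is kept for clarity rather than efficiency. In the last layer (described next), the single self.gru would be replaced by one GRU cell per target relation.

```python
import torch
import torch.nn as nn

class SingleLevelAggregator(nn.Module):
    """One layer of the heterogeneous single-level aggregator (sketch)."""
    def __init__(self, num_relations: int, dim: int):
        super().__init__()
        self.h = RelationTransform(num_relations, dim)   # from the previous sketch
        self.gru = nn.GRUCell(input_size=dim, hidden_size=dim)

    def forward(self, x: torch.Tensor, neighbor_lists) -> torch.Tensor:
        # x: [num_nodes, dim]; neighbor_lists[i]: (j, r) pairs of node v_i
        updated = []
        for i, neighbors in enumerate(neighbor_lists):
            n_i = aggregate_neighbors(self.h, x, neighbors)
            # GRU update: n_i is the input, the node's own vector is the hidden state
            x_i = self.gru(n_i.unsqueeze(0), x[i].unsqueeze(0)).squeeze(0)
            updated.append(x_i)
        return torch.stack(updated)
```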

Fig. 2 Architecture of Heterogeneous Graph Infomax Encoding. Following Fig. 1, nodes and relations on the left are neighbors with correct relations (positive) to the center node, while nodes and relations on the right are randomly sampled (negative)

In the last layer, we employ a distinct GRU for each target relation t: \({\mathbf {x}}^t_i = {{\,\mathrm{GRU}\,}}_t({\mathbf {x}}_i, {\mathbf {n}}_i)\). The motivation is to allow the node representation to carry both information shared across all relations and the node's relation-specific information. This can be seen from the node-vector update of \({{\,\mathrm{GRU}\,}}_t\): \({\mathbf {x}}^t_i = (1-{\mathbf {z}}^t_i) \cdot {\mathbf {x}}_i + ({\mathbf {z}}^t_i) \cdot \hat{{\mathbf {x}}}^t_i\), where \({\mathbf {z}}^t_i\) is the update gate vector and \(\hat{{\mathbf {x}}}^t_i\) is the relation-specific candidate vector.

3.1.2 Objective function

To preserve the structure of the HIN, we encourage closeness of the embeddings of nodes connected by an edge, while enforcing separation of the embeddings of unconnected nodes. Therefore, we minimize the following loss:

$$\begin{aligned} {\mathcal {L}}_G = \frac{1}{\vert {\mathcal {E}} \vert } \sum _{v_i \in {\mathcal {V}}} \sum _{(j, t) \in {\mathcal {N}}_i} \sum _{(k, t) \notin {\mathcal {N}}_i} -{{\,\mathrm{log}\,}}\, \sigma ( \langle {\mathbf {x}}_i^t , {\mathbf {x}}_j^t \rangle - \langle {\mathbf {x}}_i^t , {\mathbf {x}}_k^t \rangle ), \end{aligned}$$
(1)

where \(\langle \cdot , \cdot \rangle\) denotes the inner product. The loss is derived from Bayesian Personalized Ranking (Rendle et al., 2009). It is commonly used in recommender systems (Ricci et al., 2010) and is similar to the margin-based ranking loss (Bordes et al., 2013). For scalability, we use negative sampling, which is described later in Sect. 3.3.
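A hedged sketch of this objective for a mini-batch of triples is shown below; it assumes one sampled negative per positive edge (the negative sampling of Sect. 3.3) and uses the numerically stable logsigmoid.

```python
import torch
import torch.nn.functional as F

def graph_context_loss(x_i_t: torch.Tensor,
                       x_j_t: torch.Tensor,
                       x_k_t: torch.Tensor) -> torch.Tensor:
    """BPR-style loss of Eq. (1) for a batch (sketch). Rows of x_i_t / x_j_t
    are embeddings of nodes linked by the target relation t; rows of x_k_t
    are embeddings of sampled negatives of the same node type."""
    pos = (x_i_t * x_j_t).sum(dim=-1)       # <x_i^t, x_j^t>
    neg = (x_i_t * x_k_t).sum(dim=-1)       # <x_i^t, x_k^t>
    return -F.logsigmoid(pos - neg).mean()  # -log sigma(pos - neg), averaged
```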

3.2 Heterogeneous graph infomax encoding

While the aggregator proposed in the previous section is flexible and powerful for learning local structures, using multiple transformations can result in high heterogeneity, that is, the edge vectors may conflict with one another. This impairs graph smoothing, which is an essential characteristic of GNNs (Chen et al., 2020; Xie et al., 2020). We want to avoid this scenario and, at the same time, encourage the model to capture unique local features. Therefore, we aim to maximize the mutual information between the edge vectors and the output of the layer. Similar to DGI (Veličković et al., 2018) and DIM (Hjelm et al., 2018), we use the binary cross-entropy between samples from the joint distribution (positive) and the product of marginals (negative):

$$\begin{aligned} {\mathcal {L}}_{{\mathcal {I}}}^{(l)} = \frac{1}{E_{{\mathcal {I}}}} \sum _{v_i \in {\mathcal {V}}} \Big ( \sum _{(j, r) \in {\mathcal {N}}_i} -{{\,\mathrm{log}\,}}\, {\mathcal {D}}^{(l)} ({\mathbf {e}}^{(l-1)}_{j, r}, {\mathbf {x}}^{(l)}_i) + \sum _{(k, q) \in \tilde{{\mathcal {N}}}_i} -{{\,\mathrm{log}\,}}\, (1- {\mathcal {D}}^{(l)} ({\mathbf {e}}^{(l-1)}_{k,q}, {\mathbf {x}}^{(l)}_i)) \Big ) , \end{aligned}$$
(2)

where \(E_{{\mathcal {I}}}= \sum _{v_i \in {\mathcal {V}}} ( \vert {\mathcal {N}}_i \vert + \vert \tilde{{\mathcal {N}}}_i \vert )\), the superscript (l) denotes the l-th layer of the aggregator, \({\mathcal {D}}^{(l)}: {\mathbb {R}}^d \times {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) is a discriminator for the l-th layer, and \(\tilde{{\mathcal {N}}}_i\) is the set of corrupted neighbors of node \(v_i\). In analogy to DIM, the vicinity centered around \(v_i\) in the graph is treated as a single image. To increase the mutual information, the encoder needs to consider every neighbor and aggregate the information on which most neighbors agree. Notice that we apply the loss to each layer separately, since this makes mini-batch training easier.

Following DGI, we adopt its discriminator: \({\mathcal {D}} ({\mathbf {x}}, {\mathbf {y}}) = {\mathbf {x}}^\intercal {\mathbf {U}} {\mathbf {y}}\), where \({\mathbf {U}}\) is a weight matrix of the discriminator. We refer to the process of optimizing Eq. (2) via a discriminator inside each layer of the single-level aggregator as infomax encoding. Note that our motivation for infomax encoding differs from that of the graph infomax in DGI, DMGI (Park et al., 2020), and HDGI (Ren et al., 2019). In those works, the main objective is to preserve the mutual information between local patches and a global graph summary. We refer to these approaches as global infomax.
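The sketch below illustrates the discriminator and the per-node infomax term of Eq. (2); wrapping the bilinear form in a sigmoid so the term can be treated as a probability is our assumption, and all names are illustrative.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """DGI-style bilinear discriminator D(x, y) = sigma(x^T U y) (sketch)."""
    def __init__(self, dim: int):
        super().__init__()
        self.U = nn.Parameter(torch.empty(dim, dim))
        nn.init.xavier_uniform_(self.U)

    def forward(self, e: torch.Tensor, x_i: torch.Tensor) -> torch.Tensor:
        # e: [n, dim] edge vectors; x_i: [dim] layer output of the center node
        return torch.sigmoid(e @ self.U @ x_i)


def infomax_loss(disc: Discriminator,
                 pos_edges: torch.Tensor,
                 neg_edges: torch.Tensor,
                 x_i: torch.Tensor,
                 eps: float = 1e-8) -> torch.Tensor:
    """Per-node, per-layer infomax term: binary cross-entropy between true
    neighbors (positive) and corrupted neighbors (negative)."""
    p_pos = disc(pos_edges, x_i)
    p_neg = disc(neg_edges, x_i)
    loss = -(torch.log(p_pos + eps).sum() + torch.log(1.0 - p_neg + eps).sum())
    return loss / (p_pos.numel() + p_neg.numel())
```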

The advantages of the proposed infomax encoding over global infomax are two-fold. First, infomax encoding is compatible with mini-batch sampling, whereas global infomax needs to compute the embeddings of the whole graph to obtain the global summary. This makes infomax encoding scalable to large graphs without any further modification. Second, infomax encoding promotes homogeneity between neighbors, i.e., between edge vectors, in the graph. To make a comparison, we optimize the model from the previous section via either infomax encoding or global infomax while measuring the homogeneity between neighbors. To measure homogeneity, we use the cosine similarity between neighbors of the same node and between random edges, following Chen et al., (2020). Intuitively, we want a high value between neighbors, indicating that the neighbors contain similar information, and a low value between random edges, since a high value there would indicate oversmoothing. The results are plotted in Fig. 3. We can clearly see the large gap between the neighbor value and the random value in Fig. 3b. This indicates that infomax encoding encourages homogeneity while keeping oversmoothing under control. On the other hand, Fig. 3a shows that global infomax does not deliver the same desirable behavior. More benefits of infomax encoding are presented in Sect. 4.4 (Study of Infomax Encoding).

Fig. 3 Average cosine similarity of edge vectors between neighbors of the same node (green line) and between random edges (grey line). The green area shows the gap between them; the larger the gap, the better the homogeneity and the less prone the model is to oversmoothing

3.3 Model optimization

For the final objective, we jointly optimize the graph-context loss in Eq. (1) and the infomax loss in Eq. (2):

$$\begin{aligned} {\mathcal {L}} = {\mathcal {L}}_{G} + \alpha \sum _{l=1}^L {\mathcal {L}}_{{\mathcal {I}}}^{(l)}, \end{aligned}$$
(3)

where \(\alpha\) is a hyper-parameter controlling the importance of the infomax loss. Since it is computationally expensive to minimize the above loss directly, we adopt the negative sampling technique. Specifically, we uniformly sample an edge \((v_i, t, v_j)\) from the graph as a positive sample, and then uniformly sample K negative nodes that have the same type as \(v_j\) and do not have relation t to \(v_i\). For further improvement of the negative sampling, "hard-samples" negative sampling (Zhu et al., 2021) or adversarial negative sampling (Hu et al., 2019; Sun et al., 2019) can be considered. The effect of negative sampling in homogeneous graphs has also been investigated by Qiu et al., (2018); Yang et al., (2020).
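A minimal sketch of this sampling step follows; the graph interface (nodes_of_type, edges) is hypothetical, and rejection sampling is just one simple way to enforce the "not connected by relation t" constraint.

```python
import numpy as np

def sample_negatives(graph, v_i, relation, tgt_type, K, rng=None):
    """Uniformly sample K negatives of the same node type as v_j that are not
    linked to v_i by `relation` (sketch; the `graph` interface is hypothetical)."""
    rng = rng or np.random.default_rng()
    candidates = graph.nodes_of_type(tgt_type)
    negatives = []
    while len(negatives) < K:
        v_k = candidates[rng.integers(len(candidates))]
        if (v_i, relation, v_k) not in graph.edges:
            negatives.append(v_k)
    return negatives
```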

Note that the model can operate in a semi-supervised manner by changing the loss function \({\mathcal {L}}_G\) to a loss function for multiclass classification, such as the cross-entropy loss. However, we found that the semi-supervised loss usually converges significantly faster than the infomax loss. The slower convergence of the infomax loss suggests that the model may underutilize the infomax encoder in such a setting.

To make the model applicable to large graphs, we follow the architecture of GraphSAGE (Hamilton et al., 2017) so that node representations can be generated individually and mini-batch training can be utilized. In particular, to generate the representation of a node \(v_{i}\), we uniformly sample up to n neighbors of \(v_{i}\), where n is a hyper-parameter. Subsequently, we perform the same sampling process for each sampled neighbor, and repeat the process L times. The aggregation considers only these samples and \(v_{i}\) for generating the representation of \(v_{i}\). In this way, we can limit the memory usage of the aggregation process to \(O(n^{L}T)\), where T is the space complexity of the transformation h, compared with \(O( \vert {\mathcal {E}} \vert T )\) when using the whole graph. In practice, L is usually set to a small number due to the oversmoothing effect. We obtain corrupted neighbors by shuffling the neighbors of nodes within the same mini-batch. Specifically, \(\tilde{{\mathcal {N}}}_{B_i} = {\mathcal {N}}_{B_{\pi (i)}}\) such that \(T_{i} = T_{\pi (i)}\), where \(\pi\) is a permutation function and B and T denote the arrays of node indexes and target relations of a mini-batch, respectively. The reason is to reduce the computation cost: since we need to perform \(l-1\) encoding layers to obtain \({\mathbf {e}}^{(l-1)}_{k, q}\) in Eq. (2), by shuffling the neighbors inside a mini-batch, \({\mathbf {e}}_{k, q}^{(l-1)}\) has already been calculated on the positive side and can be reused on the negative side.
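The two routines below sketch this procedure: GraphSAGE-style neighbor sampling for L hops and in-batch corruption that permutes neighborhoods among examples sharing the same target relation. The graph interface and the hop-wise return format are illustrative assumptions.

```python
import numpy as np

def sample_neighborhood(graph, v, n, L, rng):
    """GraphSAGE-style sampling (sketch): up to n typed neighbors per node,
    repeated for L hops, so memory per target node stays O(n^L)."""
    hops, frontier = [], [v]
    for _ in range(L):
        sampled = []
        for u in frontier:
            nbrs = list(graph.neighbors(u))            # (node, relation) pairs
            if len(nbrs) > n:
                idx = rng.choice(len(nbrs), size=n, replace=False)
                nbrs = [nbrs[i] for i in idx]
            sampled.append((u, nbrs))
        hops.append(sampled)
        frontier = [j for _, nbrs in sampled for j, _ in nbrs]
    return hops

def corrupt_in_batch(batch_neighbors, batch_relations, rng):
    """Corrupted neighborhoods (N-tilde): permute neighbor sets among
    mini-batch examples that share the same target relation, so edge vectors
    already computed on the positive side can be reused (sketch)."""
    corrupted = list(batch_neighbors)
    groups = {}
    for idx, rel in enumerate(batch_relations):
        groups.setdefault(rel, []).append(idx)
    for idxs in groups.values():
        for src, dst in zip(idxs, rng.permutation(idxs)):
            corrupted[src] = batch_neighbors[dst]
    return corrupted
```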

4 Experiments

We conduct extensive experiments to compare HIME with the state-of-the-art models on link prediction, node clustering, and node classification tasks. We also analyze the benefits of the proposed single-level aggregation over bi-level aggregation. Moreover, we provide a thorough investigation into the different aspects of HIME, including the effect of infomax encoding, the scalability, hyper-parameter sensitivity, and its adaptability to other frameworks.

4.1 Datasets

Aside from our synthetic toy dataset, we use ten publicly available real-world HIN datasets, which we divide into three groups. The first group consists of datasets with more than one node type and more than one relation: DBLP (Hu et al., 2019), Yelp (Hu et al., 2019), Douban Movie (Shi et al., 2019), Douban Book (Shi et al., 2019), and Amazon-Large (Ni et al., 2019). The second group consists of multiplex networks (\(\vert {\mathcal {T}} \vert = 1\) and \(\vert {\mathcal {R}} \vert > 1\)): Amazon, YouTube, and Twitter; we obtain the data from Cen et al., (2019). The last group consists of multiplex networks with node labels: ACM and IMDB, for which we obtain node labels and attributes from Park et al., (2020) for node clustering and classification evaluation. Basic statistics of the datasets are summarized in Table 2. Additional details can be found in Sect. A of the appendix.

Table 2 Statistics of datasets

4.2 Compare HIME with baselines

We first compare our proposed HIME to various unsupervised models which can be potentially applied to HINs:

  • Traditional homogeneous graph embedding: LINE (Tang et al., 2015b), DeepWalk (Perozzi et al., 2014)

  • HIN-based embedding: BHIN2vec (Lee et al., 2019b), metapath2vec (Dong et al., 2017), HEER (Shi et al., 2018b)

  • Knowledge graph embedding: TransE (Bordes et al., 2013), DistMult (Yang et al., 2014), ComplEx (Trouillon et al., 2016), RotatE (Sun et al., 2019)

  • GNN for homogeneous graph: GraphSAGE (Hamilton et al., 2017), DGI (Veličković et al., 2018)

  • GNN for multiplex network: DMGI (Park et al., 2020)

  • GNN for HIN: RGCN (Schlichtkrull et al., 2018), HAN (Wang et al., 2019), GATNE (Cen et al., 2019), and HGT (Hu et al., 2020b).

Note that all GNNs for HINs in the list are bi-level aggregations. Because the core ideas of DMGI (Park et al., 2020) and HDGI (Ren et al., 2019) are similar but DMGI has improved regularization, we select DMGI to represent the global infomax approach for multiplex networks. GATNE in the link prediction experiments refers to GATNE-T, while GATNE in the node clustering and classification experiments refers to GATNE-I. GATNE* refers to a variant of GATNE obtained by changing its initialization and loss function to Eq. (1). HAN* refers to a variant of HAN in which the semantic attention is replaced by an independent learnable parameter. For a fair comparison, we fix the embedding dimension d to 128. Unless otherwise specified, all models are trained in an unsupervised manner. Please see Sect. B in the appendix for additional details about the implementation and hyper-parameter settings.

4.2.1 Link prediction

To evaluate the quality of embedding methods for preserving the information of a HIN, we conduct link prediction experiments following Shi et al., (2018b). For each HIN, we split its edges into training, validation, and test sets containing 85%, 5%, and 10% of the total, respectively. An evaluated model is trained on the training set, while the validation set is used for the stopping criterion and for finding suitable parameters. The task is to predict the edges in the test set using the embeddings learned from the training set. We sample 10 negatives instead of using all possible candidates because of the computation cost. We rank positive edges among both positives and negatives and report the mean reciprocal rank (MRR) by micro- and macro-average. In particular, the micro-average MRR averages all reciprocal ranks without considering relations, whereas the macro-average MRR averages the per-relation means of the reciprocal ranks. Mathematically,

$$\begin{aligned} \text {MRR}_{micro}&= \frac{1}{\vert {\mathcal {E}}_{test} \vert } \sum _{(i,t,j) \in {\mathcal {E}}_{test}} \frac{1}{2} ( \frac{1}{rank_i} + \frac{1}{rank_j} ), \\ \text {MRR}_{macro}&= \frac{1}{\vert {\mathcal {R}} \vert } \sum _{t \in {\mathcal {R}}} \frac{1}{c^t} \sum _{\begin{array}{c} (i,r,j) \in {\mathcal {E}}_{test} \\ r=t \end{array}} \frac{1}{2} ( \frac{1}{rank_i} + \frac{1}{rank_j} ), \end{aligned}$$

where \({\mathcal {E}}_{test}\) denotes the test set and \(c^t\) denotes the number of test edges with relation t. We average over relations whose share of the test edges exceeds 5%. Table 3 lists the results. We observe that HIME achieves the best performance in all cases. In particular, HIME significantly outperforms the baselines, with p-value \(< 0.01\).
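For reference, the two metrics can be computed as in the sketch below, given the rank of each true endpoint among one positive and the sampled negatives (variable names are ours).

```python
def mrr(test_edges, head_ranks, tail_ranks):
    """Micro- and macro-average MRR as defined above (sketch).
    test_edges[k] = (i, t, j); head_ranks[k] / tail_ranks[k] are the ranks of
    the true i / j among one positive and the sampled negatives."""
    rr = [0.5 * (1.0 / head_ranks[k] + 1.0 / tail_ranks[k])
          for k in range(len(test_edges))]
    micro = sum(rr) / len(rr)

    per_relation = {}
    for k, (_, t, _) in enumerate(test_edges):
        per_relation.setdefault(t, []).append(rr[k])
    macro = sum(sum(v) / len(v) for v in per_relation.values()) / len(per_relation)
    return micro, macro
```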

Table 3 Micro- and macro-average MRR results of each model in link prediction task in HINs. Bold texts indicate the best result in each case. Underline texts indicate the best among baselines. \(\star\) indicates the significant improvement of HIME over baselines, with p-value \(< 0.01\)

4.2.2 Link prediction on multiplex networks

In multiplex networks, nodes of the same type are connected to each other by multiple relations. We conduct link prediction on multiplex networks to see whether our model can handle this scenario effectively. We obtain the data from Cen et al., (2019) and conduct experiments following their protocol, using their source code for evaluation and reporting the performance.

Table 4 Performance for link prediction in multiplex networks. The results marked [‡] are taken from the corresponding papers

We report a summary of the results in Table 4. We observe that HIME shows superior performance over the other models. The largest performance gap is on the YouTube dataset, which has the largest number of relations among the three datasets. This indicates the model's effectiveness in distinguishing and handling multiple relations. Although DMGI and DGI, which aim to capture global properties, perform well for node clustering and classification, they are inferior to models with graph-context optimization (first- and second-order proximity in graphs) for the link prediction task.

4.2.3 Node clustering and classification

Node clustering aims to group nodes belonging to the same class, while node classification aims to identify their classes, which can be considered either multi-class or multi-label classification (Tsoumakas and Katakis, 2007; Do et al., 2019). We conduct experiments for both tasks following Park et al., (2020), with the same data split for training, validation, and test sets. For methods that ignore node attributes, we concatenate the raw attributes with the learned node representations, following Park et al., (2020). Since our method generates multiple node representations for different edge types, for a fair comparison we select the node representations of the edge type that yields the best results on the validation set and then evaluate the performance on the test set. In practice, one could combine these different node representations and use fast, highly scalable feature selection methods, including ensemble techniques, to enhance the performance. We report normalized mutual information (NMI) for node clustering, and macro- (MaF1) and micro-average F1 (MiF1) for node classification. For comparison, we include the bi-level aggregations RGCN, HAN, and HGT trained in an unsupervised manner with Eq. (1) as the objective function.

Table 5 Node classification results (MiF1, MaF1) and node clustering results (NMI) on multiplex networks. [\(\dagger\)] refers to semi-supervised training. Bold texts indicate the best result in each case. Underline texts indicate the best among baselines

The results are summarized in Table 5. HIME demonstrates the best performance, with DMGI second best. For node classification, we observe that our unsupervised model HIME even outperforms the semi-supervised model HAN [\(\dagger\)]. For node clustering on the ACM dataset, HIME achieves a performance gain of up to 4.9% over DMGI, which itself is significantly superior to the other baselines.

4.3 Bi-level vs single-level aggregation

In this section, we explain why bi-level aggregation performs worse than single-level aggregation for relation-aware HIN embedding. First, we look at where single-level aggregation improves over bi-level aggregation. We conduct further link prediction experiments on the Douban Book dataset, which has the largest number of relations, switching between graph convolutional methods while keeping the same objective function as defined in Sect. 3.1. For bi-level aggregation, we include well-known methods such as RGCN (Schlichtkrull et al., 2018), HAN (Wang et al., 2019), and HGT (Hu et al., 2020b); for single-level aggregation, we include commonly used methods such as GraphSAGE (Hamilton et al., 2017) and GAT (Veličković et al., 2017), as well as our proposed HIME. We also include our proposed single-level aggregator without infomax encoding as "HIME (-IM)".

Fig. 4 Link prediction performance of models for each relation in the Douban Book dataset when optimizing relation-aware embedding. The number in round brackets indicates the percentage of the relation

Figure 4 presents the link prediction results for each relation. We can see that single-level aggregations significantly outperform bi-level aggregations on minority relations. To explain why bi-level aggregations perform worse, we investigate the importance of each relation to the node representations for a minority relation. Intuitively, an appropriate aggregator should transfer abundant information from majority relations to the minority. One could simply inspect the attention scores of the bi-level aggregation; however, this ignores the magnitude of the features and the transformations along the aggregation path.

To provide a more concrete analysis, we borrow an idea from the neural network pruning technique SNIP (Lee et al., 2019a), which can also be considered an application of the attribution method gradient*input (Shrikumar et al., 2017). SNIP uses the derivative of the loss function with respect to an auxiliary variable c representing the connectivity of a parameter w. The purpose is to find important connections based on the change in c, called the saliency score. To measure the importance of a node \(v_i\), the saliency score \(s_i\) of the node can be calculated as:

$$\begin{aligned} s_i = {{\,\mathrm{abs}\,}}\Big ( \frac{ \partial {\mathcal {L}}_{G}({\mathbf {c}}_i \odot {\mathbf {x}}^{(0)}_i; \theta , G) }{ \partial {\mathbf {c}}_i} \Big |_{ {\mathbf {c}}_i = {\mathbf {1}}} \Big )^\intercal {\mathbf {1}}, \end{aligned}$$
(4)

where \(\theta\) is the set of the model's parameters, \({\mathbf {x}}^{(0)}_i\) denotes the initial feature vector of node \(v_i\) before the aggregation, and \({{\,\mathrm{abs}\,}}\) denotes the element-wise absolute value. The higher the saliency score, the more significant the node.
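A sketch of Eq. (4) with automatic differentiation is shown below; the model(features, graph) interface and the loss wrapper are hypothetical stand-ins for the trained aggregator and the graph-context loss.

```python
import torch

def node_saliency(model, x0, graph, graph_loss):
    """Saliency scores of Eq. (4) (sketch): gate every initial node feature
    with an all-ones connectivity vector c, differentiate the graph loss
    with respect to c, and sum the absolute gradients per node."""
    c = torch.ones_like(x0, requires_grad=True)   # c_i = 1 for every feature
    loss = graph_loss(model(c * x0, graph))       # L_G(c * x^(0); theta, G)
    grads, = torch.autograd.grad(loss, c)
    return grads.abs().sum(dim=-1)                # one score s_i per node
```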

Fig. 5 Average saliency score of nodes connected to output nodes via each relation (x-axis) for generating the node representations for a target relation (y-axis) on the Douban Book dataset

Figure 5 shows the average saliency score of nodes of each relation with respect to a target relation's node representations. Note that Fig. 5a–c are from bi-level aggregations, while Fig. 5d–f are from single-level aggregations. RGCN has very high saliency scores on minority relations because it uses the mean across relations in the second level. HAN* uses an attention mechanism in the second level that can slightly shift the significance toward majority relations and improve the performance. HGT instead deploys an attention mechanism that considers each message from a neighbor. This can alleviate the down-weighting problem and makes its results closer to those of single-level aggregations than the others. We observe, however, that for most target relations HGT has very high saliency scores on the same relation, suggesting that it tends to underutilize the information of the HIN. Moreover, the saliency scores of the majority relations, "U-G" and "U-U", in bi-level aggregations are still much lower than those of minority relations in most cases. This demonstrates that the bi-level aggregation scheme down-weights the information of the majority. On the other hand, Fig. 5d–f support that single-level aggregations suffer much less from the down-weighting issue.

However, the disadvantage of both GraphSAGE and GAT is the lack of relation awareness in the aggregation, which reduces their power on HINs, for example, when we directly optimize link prediction for each relation instead of relation-aware embedding. Figure 6 shows that bi-level aggregation is better than single-level aggregation (excluding HIME) in all such cases, in line with findings by other researchers. This suggests that there is room for a method that considers relations in the aggregation without suffering from the down-weighting issue. By considering relations in the aggregation alone, HIME (-IM) already shows increased performance while still performing well for minority relations in Fig. 4, unlike bi-level aggregations. Finally, by incorporating infomax encoding to encourage graph smoothing, HIME outperforms HIME (-IM) in all cases in Fig. 6 and in most cases in Fig. 4, implying that it can utilize more information from the graph structure. We can conclude that HIME is effective and powerful for HIN embedding and does not suffer from the down-weighting issue, satisfying both requirements.

Fig. 6 Link prediction performance of models for each relation in the Douban Book dataset when optimizing the link prediction task for each relation. The number in round brackets indicates the percentage of the relation

4.4 Study of infomax encoding

In this section, we study and analyze the effect of the infomax encoding. For this purpose, we conduct further experiments in three settings: insusceptibility to attacks on node features, horizontal improvement, and vertical improvement.

4.4.1 Insusceptibility to attack on node features

When introducing the infomax encoding, we hypothesized that it augments graph smoothness. Feng et al., (2019) show that enforcing graph smoothness at the prediction stage can increase performance and robustness against attacks on node input features in node classification, and Jin & Zhang (2019) obtain a similar result by perturbing the latent representation in GCN. We demonstrate that the infomax encoding follows the same conjecture and improves robustness.

Table 6 Node classification performance of HIME on IMDB graph with node feature attacks. Bold values indicate the best result in each case

We choose the node classification task as in Sect. 4.2.3. We perform node feature attacks on either 5% uniformly sampled nodes or the top 5% of nodes ranked by node degree. The attacks are (1) Noise: injecting Gaussian noise \({\mathcal {N}}(0, 0.01)\); (2) Shuffle: row shuffling among the attacked nodes; and (3) Zero: substituting a constant for the features. We then train HIME and evaluate the performance on the attacked graph. The results are presented in Table 6. We can see that HIME with infomax encoding outperforms its variant without infomax encoding (-IM) regardless of the attack type and the attacked nodes.
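The three attacks can be sketched as follows; we read the Gaussian noise parameter 0.01 as a variance (standard deviation 0.1), and the function and argument names are illustrative.

```python
import numpy as np

def attack_features(x, attacked_idx, mode, rng=None):
    """Node-feature attacks used in Table 6 (sketch): Noise, Shuffle, Zero.
    x: [num_nodes, p] feature matrix; attacked_idx: indices of attacked nodes."""
    rng = rng or np.random.default_rng()
    x = x.copy()
    if mode == "noise":       # inject Gaussian noise N(0, 0.01); 0.01 read as variance
        x[attacked_idx] += rng.normal(0.0, np.sqrt(0.01), size=x[attacked_idx].shape)
    elif mode == "shuffle":   # row shuffling among the attacked nodes
        x[attacked_idx] = x[rng.permutation(attacked_idx)]
    elif mode == "zero":      # replace features with a constant
        x[attacked_idx] = 0.0
    return x
```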

4.4.2 Horizontal improvement

The second setting is to see its benefit when going wide (horizontal) in the graph. We set the number of layers L to 1, gradually increase the number of sampled neighbors N from 5 to 100, and run the model with and without the infomax encoding. We plot the performance of the model against the number of sampled neighbors in Fig. 7. We can see that the model achieves higher performance with infomax encoding than without it when N is high enough. This follows our assumption that infomax encoding can improve the usage of each neighbor’s information.

Fig. 7 Performance of HIME with/without infomax encoding w.r.t. the number of sampled neighbors

4.4.3 Vertical improvement

In the last setting, we investigate whether the infomax encoding can improve the performance when going deep (vertical) in the graph. We conduct additional experiments varying the number of layers L from 1 to 5, which can reach up to five-hop neighbors, and run the model with and without infomax encoding in all layers. Due to memory limitations, we set the number of neighbors to 50, 20, 8, 5, and 3 for L from 1 to 5.

The results are listed in Table 7. As can be seen, the model with infomax encoding outperforms the one without it in all cases. The performance gain decreases as the number of layers increases, because (1) the number of sampled neighbors becomes lower, and (2) the smoothing effect of the GNN becomes stronger, already leading to more homogeneity even when infomax encoding is not used.

Table 7 Performance (Micro-avg. MRR) of the model with and without infomax encoding by varying the number of layers

4.5 Additional analysis

In this section, we provide additional studies on (1) the oversmoothing issue, (2) training time, scalability, and convergence, (3) an ablation study, (4) hyper-parameter sensitivity, and (5) the adaptability of HIME to other frameworks.

4.5.1 Oversmoothing issue

First, we investigate whether HIME suffers from the oversmoothing issue, which is common in GNNs. In addition, we aim to provide empirical evidence to support our motivation behind using GRUs. We therefore introduce two variants of the single-level aggregator by replacing the GRU (\({\mathbf {x}}_i \leftarrow {{\,\mathrm{GRU}\,}}({\mathbf {x}}_i, {\mathbf {n}}_i)\)) with GCN- and RGCN-style updates: \({\mathbf {x}}_i \leftarrow \sigma (\omega _i {\mathbf {W}}{\mathbf {x}}_i + (1-\omega _i) {\mathbf {n}}_i)\), where \(\sigma\) denotes the rectified linear unit (ReLU), \({\mathbf {W}}\) is a learnable transformation matrix, and \(\omega _i\) is a factor balancing the information of a node and its neighbors, equal to \(\frac{1}{\vert {\mathcal {N}}_i \vert + 1}\) for the GCN style and \(\frac{1}{\vert {\mathcal {R}} \vert + 1}\) for the RGCN style. We conduct additional experiments for the GCN and RGCN styles by varying the number of layers L from 2 to 5 with the same setting as in Sect. 4.4.3.
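As a point of reference, the GCN-style replacement for the GRU update can be sketched as below (the RGCN style only changes the balancing factor); this is an illustrative reading of the formula above, not our exact implementation.

```python
import torch
import torch.nn.functional as F

def gcn_style_update(x_i: torch.Tensor, n_i: torch.Tensor,
                     W: torch.Tensor, num_neighbors: int) -> torch.Tensor:
    """GCN-style replacement for the GRU update (sketch):
    x_i <- ReLU(w * W x_i + (1 - w) * n_i), with w = 1 / (|N_i| + 1).
    For the RGCN-style variant, w = 1 / (|R| + 1) instead."""
    w = 1.0 / (num_neighbors + 1)
    return F.relu(w * (W @ x_i) + (1.0 - w) * n_i)
```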

The results are listed in Table 8. We observe that all variants achieve comparable results when \(L=2\). However, as the number of layers increases, the performance gap becomes larger on both datasets, and the GRU variant retains its performance the best among them. The results demonstrate the potential of using a GRU to alleviate the oversmoothing issue. To further investigate the difference among the variants, we use the Mean Average Distance (MAD) between node representations (Chen et al., 2020) to measure the smoothness of the representations. The distance is defined as one minus the cosine similarity between the representations of a pair of nodes. A lower MAD value indicates smoother node representations in a graph, with a value of zero meaning that all node representations have become indistinguishable. The results are shown in Fig. 8. We observe significant drops in the MAD values of the GCN and RGCN variants on the Yelp dataset, indicating possible oversmoothing. In contrast, the MAD values of the GRU variant decrease only slightly as the number of layers increases on both datasets. This supports that HIME with the GRU (default) does not suffer from the oversmoothing issue.
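A sketch of the MAD computation (one minus cosine similarity averaged over a set of node pairs) follows; the function signature is illustrative and the choice of which pairs to average over follows Chen et al., (2020).

```python
import torch
import torch.nn.functional as F

def mean_average_distance(x: torch.Tensor, pairs) -> torch.Tensor:
    """MAD over a set of node pairs (sketch): 1 - cosine similarity, averaged.
    Lower values mean smoother (more similar) node representations."""
    src = x[[i for i, _ in pairs]]
    dst = x[[j for _, j in pairs]]
    return (1.0 - F.cosine_similarity(src, dst, dim=-1)).mean()
```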

Table 8 Performance (Micro-avg. MRR) of the model with different aggregation styles by varying the number of layers. Bold values indicate the best result in each case
Fig. 8 MAD values of HIME with different aggregation styles and layers. A lower MAD value indicates smoother node representations in the graph. The GRU variant is the least prone to the oversmoothing issue

4.5.2 Training time, scalability, and convergence analysis

We analyze the training time of HIME and compare it with bi-level aggregation models. We conduct experiments by training the models on Douban Movie while varying the number of training edges. For clarity, we include the training time of variants of HIME with different transformations h and with/without infomax encoding (IM). We plot the results in Fig. 9. As can be seen, the characteristics of the training time vary depending on the model. For HIME, we find that its training time grows linearly with the number of edges, but the slope is smaller compared with GATNE, HAN, and HGT because fewer training steps are needed until convergence. The faster convergence is probably due to the simplicity of single-level aggregation compared with bi-level aggregations that employ attention mechanisms. On the other hand, RGCN exhibits training time similar to HIME because RGCN employs mean pooling instead.

Fig. 9 Training time until convergence of each model with respect to the number of training edges. The number of layers is set to 1 for all models

Furthermore, to investigate the scalability of HIME, we run HIME on the Amazon-Large dataset consisting of 12.8 million edges. We compare the link prediction performance of HIME to its variant without the infomax encoding. Table 9 shows the performance and training time on a single GPU, an NVIDIA Tesla V100. HIME with the infomax encoding performs better while costing only a small amount of additional time. We plot the convergence curve of HIME in Fig. 10. We observe that the training loss converges steadily and is inversely related to the performance on the validation set.

Table 9 Performance (MRR) and time analysis in the link prediction task on the Amazon-Large dataset. "t/step" and "T" denote time per step and total training time, respectively
Fig. 10 Convergence curve for HIME on the Amazon-Large dataset. The plotted performances are on the validation set

4.5.3 Ablation study

We provide an ablation study to clarify the methodological contributions of our work over RGCN. Our contributions are as follows.

  (i) We proposed single-level aggregation (Sect. 3.1), which uses mean pooling over all types of neighbors instead of averaging over each type of neighbors as in RGCN.

  (ii) We introduced the transformation h (Sect. 3.1), which should be carefully selected and fine-tuned, unlike RGCN, which uses only a linear transformation.

  (iii) We proposed infomax encoding (Sect. 3.2), which reduces heterogeneity and promotes homogeneity between neighbors in the graph.

To demonstrate the effect of these three contributions, we perform an ablation study of HIME considering three variants, each illustrating the contribution of one point on top of the previous ones. Note that for (ii), we use the hyperplane transformation to distinguish it from the linear transformation in RGCN.

The results are listed in Table 10. We observe that the effect of each contribution differs considerably depending on the dataset. Specifically, the first contribution shows the highest performance gain over the other contributions on the Douban Movie and Douban Book datasets, where the average degree of the graph is high and the edges are heavily dominated by the majority relations (please see Table 12 in Appendix A for the statistics). Such a situation causes a severe down-weighting issue in bi-level aggregations, which explains the performance gain from the first contribution. Conversely, it performs significantly worse than RGCN on DBLP, where the average degree is much lower compared with the Douban Movie and Douban Book datasets. This empirically supports the use of RGCN, which was originally proposed for knowledge graphs that usually have a small average degree. The second contribution yields significant improvements in all cases. Recall that we thoroughly investigated the effect of the third contribution in Sect. 4.4.

Table 10 Ablation study of HIME in comparison to RGCN in the link prediction task. Bold values indicate the best result for each dataset and metric

4.5.4 Hyper-parameter sensitivity

We conduct a parameter sensitivity analysis by adjusting the important hyper-parameters d and \(\alpha\) with \(L=1\). We plot the model performance against them in Fig. 11a, b. As d increases, the performance rises until it plateaus when \(d > 100\); beyond that point, the performance is only slightly affected by d. On the other hand, the performance improves as \(\alpha\) increases from 0.001 to 0.1 and then declines. HIME achieves the best results at around \(\alpha =0.1\), where the infomax loss contributes positively to the performance.

Fig. 11 Parameter sensitivity

4.5.5 Adaptability of HIME to other frameworks

We investigate whether HIME can be used in other frameworks with a similar setting. We select GPT-GNN (Hu et al., 2020a), a framework for pretraining graph neural networks that is applicable to HINs. GPT-GNN aims to generate or reconstruct an input graph, which is similar to the objective of HIME. The main difference between GPT-GNN and HIME is that GPT-GNN operates on subgraph sampling, whereas HIME samples an observed edge and then generates the relevant node representations in the training stage.

Table 11 Performance of HGT and HIME* on multiple tasks with and without pretrain. Bold values indicate the best result for each downstream task and metric

We conduct experiments based on the GPT-GNN framework, comparing an augmented HIME (HIME*) to HGT. HIME* uses relative temporal encoding, as in HGT, in the relation-specific transformation to generate edge vectors. However, to keep the concept of single-level aggregation, HIME* does not use any attention. We use the same hyper-parameter setting as provided, without tuning. We use the Open Academic Graph (OAG) data on computer science (CS) provided by the GPT-GNN authors. We follow the original setting for time-transfer and then evaluate the models on three downstream tasks: prediction of Paper–Field (PF), Paper–Venue (PV), and Author Disambiguation (AD). Please see Sect. 4 of the GPT-GNN paper for more details.

Table 11 shows the performance of both models on the downstream tasks. We observe that HIME* shows its superiority over HGT, with and without pretraining, in most cases. This demonstrates that the concept of HIME, a heterogeneous single-level aggregator with infomax encoding, is applicable to other frameworks.

5 Conclusion

In this work, we proposed a single-level aggregation scheme and an application of infomax to learn node embeddings for heterogeneous information networks. The single-level aggregation scheme is not only simpler than the bi-level scheme adopted by state-of-the-art methods but also achieves higher performance across many benchmark tests. The proposed infomax encoding helps bridge heterogeneous embeddings to homogeneous embeddings by encouraging graph smoothness, and it allows scalability. We conducted extensive experiments to verify and compare the performance of our model with the state-of-the-art methods.

Our results with single-level aggregation show that the bi-level aggregation scheme down-weights some popular node types and edge types (such as users and user interactions) by design. In light of this, it is beneficial to use our single-level aggregation scheme as a benchmark method in future studies.

As a future direction, we aim to investigate ways to combine existing homogeneous GNNs with HIN embedding frameworks efficiently, so as to bring a variety of homogeneous GNNs to the HIN domain.