Deep Graph Contrastive Learning
Self-supervised learning (SSL) has been studied extensively for alleviating the label scarcity problem of deep models. Recent SSL techniques are converging around the central theme of contrastive learning (CL), which aims to maximize the agreement of representations under multiple views of input data. However, the development of CL for graph-structured data remains nascent. In this blog post, I will discuss recent progress in the field of graph CL. Specifically, I will introduce a general framework for graph CL and describe our recent work on adaptive augmentation for graph CL. I will also share my thoughts on the potential of this approach and outline future directions for research in this area.
Self-Supervised Learning Revisited
Deep graph representation learning, which aims to learn a low-dimensional dense vector that encodes node structures and attributes, enables efficient feature learning for graph-structured data.
Graph neural networks (GNNs) have become a popular approach for learning graph representations. However, most GNN models are trained in a (semi-)supervised manner, which requires a large amount of labeled data. In many real-world scenarios, labeled data may not be available, and collecting and labeling data can be time-consuming and labor-intensive. Furthermore, supervised methods that use labeled data may not be able to learn generic knowledge that can be reused across different tasks.
Here I include two quotes:
“Labels are the opium of the machine learning researcher.”
--- Jitendra Malik

“The future is self-supervised!”
--- Yann LeCun
They advocate that label supervision is not always necessary. Considering that the amount of unlabeled data substantially exceeds that of labeled data, it is natural to exploit the supervision signals that come with the data for free, an approach known as self-supervised learning (SSL). Nowadays, self-supervised graph representation learning has attracted a lot of research attention. I maintain a curated list of must-read papers, surveys, and talks; you may refer to it if you are interested in reading further.
Taxonomy of Self-Supervised Learning
In essence, self-supervised methods employ proxy tasks (aka pretext tasks) to guide representation learning. These tasks are typically framed as predicting one part of the input from another observed part. For example, we might rotate images at random and train a model to predict the rotation angle of each input image. Since the prediction task is artificial, we usually do not care about its performance per se, but rather about whether the learned representations carry semantic or structural meaning. You may refer to [Jing and Tian, 2020] for a comprehensive survey of self-supervised visual representation learning techniques.
Self-supervised learning methods usually fall into two lines of development: generative/predictive approaches and contrastive approaches.
- Generative/predictive methods usually train the model in a supervised manner, where the labels are self-generated from the data.
- Contrastive learning, aka instance discrimination, requires data-data pairs, and performs discrimination between positive and negative pairs.
The Contrastive Learning Paradigm
Contrastive learning aims to maximize the agreement of latent representations under stochastic data augmentation. SimCLR [Chen et al., 2020] sets a paradigm for contrastive learning. Specifically, it derives two augmented versions of each sample, pulls the embeddings of the two versions of the same sample close to each other, and pushes those of different samples apart.
There are three main components:
- Data augmentation pipeline $\mathcal{T}$
- Encoder $f$ and representation extractor $g$
- Contrastive mode and objective $\mathcal{L}$
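To make these components concrete, here is a minimal, hedged sketch of one SimCLR-style training step in PyTorch; `augmentations`, `encoder`, `projector`, and `info_nce` are illustrative placeholders standing in for $\mathcal{T}$, $f$, $g$, and $\mathcal{L}$, not any particular library's API.

```python
import random

def contrastive_step(x, augmentations, encoder, projector, info_nce, optimizer):
    """One SimCLR-style step: two augmented views, a shared encoder, and a contrastive loss."""
    t1 = random.choice(augmentations)  # sample two transforms from T
    t2 = random.choice(augmentations)
    z1 = projector(encoder(t1(x)))     # view 1: representation f, then projection g
    z2 = projector(encoder(t2(x)))     # view 2
    loss = info_nce(z1, z2)            # maximize agreement between the two views
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```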
Contrastive Learning Objectives
Formally, for any data point $\boldsymbol{x}$ (also commonly referred to as anchor sample in metric learning literature), contrastive learning aims to learn an encoder $f$ such that
\[s(f(\boldsymbol{x}), f(\boldsymbol{x}^+)) \gg s(f(\boldsymbol{x}), f(\boldsymbol{x}^-)),\]where $\boldsymbol{x}^+$ is a positive sample congruent to $\boldsymbol{x}$, $\boldsymbol{x}^-$ is a negative sample dissimilar to $\boldsymbol{x}$, and $s(\cdot,\cdot)$ measures similarity between two embeddings. The score function $s$ is encouraged to assign large values to positive examples and small values to negative examples.
We can optimize this objective with an $N$-way softmax classifier:
\[\mathcal{L} = -\mathbb{E}_X \left[ \log \frac{\exp(s(\boldsymbol{x}, \boldsymbol{x}^+))}{\exp(s(\boldsymbol{x}, \boldsymbol{x}^+)) + \sum_{j = 1}^{N - 1} \exp(s(\boldsymbol{x}, \boldsymbol{x}_j))} \right],\]which is referred to as the InfoNCE loss [Oord et al., 2018]. It distinguishes the pair of representations obtained from two augmentations of the same sample (positives) from the $(N - 1)$ pairs of representations from different samples (negatives). The critic function $s$ can simply be implemented as cosine similarity.
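As a concrete instance (and one possible choice for the `info_nce` placeholder sketched earlier), the following PyTorch snippet implements an InfoNCE loss with a cosine critic, treating row $i$ of the two view batches as a positive pair and every other row as a negative. This is a common simplification, not the exact formulation of any specific paper.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.5):
    """InfoNCE with a cosine critic: row i of z1 and row i of z2 form the positive pair."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                        # [N, N] temperature-scaled similarities
    labels = torch.arange(z1.size(0), device=z1.device)  # positive sits on the diagonal
    return F.cross_entropy(logits, labels)            # softmax over N candidates per anchor
```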
Graph Contrastive Learning
Unlike visual representation learning, traditional network embedding work inherently follows a contrastive paradigm, which originated from the skip-gram model. Specifically, nodes appearing on the same random walk are considered positive samples. For example, node2vec [Grover and Leskovec, 2016] first samples short random walks and then enforces neighboring nodes on the same walk to share similar embeddings by contrasting them with other nodes. These traditional node embedding approaches can be seen as implicitly factorizing a preset graph proximity matrix, and they have difficulty leveraging node attributes [Qiu et al., 2018].
Modern GNNs employ more powerful encoders for learning representations by aggregating information from neighborhood. However, GNN-based contrastive learning methods are in their infancy. A growing body of graph CL literature has investigated different contrastive architectures and data augmentation techniques.
Graph Contrastive Modes
Contrastive modes define which embeddings to pull together or push apart. Mainstream work involves two modes: global-local and local-local contrastive learning.
- Global-local contrastive learning: DGI [Veličković et al., 2019] and MVGRL [Hassani and Khasahmadi, 2020] maximize the agreement between node- and graph-level representations. The global-local mode can be seen as a proxy of the local-local mode, but the graph readout function should be injective [Xu et al., 2019] to distill enough information from node-level embeddings.
- Local-local contrastive learning: Follow-up work GCC [Qiu et al., 2020], GRACE [Zhu et al., 2020], and GraphCL [You et al., 2020] eschew the need of an injective readout function and directly maximize the agreement of node embeddings (or subgraph-level representations) across two augmented views.
Graph Data Augmentation
Another critical design consideration is data augmentation for graph-structured data, which transforms the original graphs to congruent counterparts. Most existing work adopts a bi-level augmentation scheme, consisting of both attribute- and structure-level augmentation.
- Attribute-level augmentation
- Dropping / masking features [You et al., 2020; Zhu et al., 2020; Zhu et al., 2021]
- Adding Gaussian noise
- Structure-level augmentation
- Adding / dropping edges [You et al., 2020; Zhu et al., 2020; Zhu et al., 2021]
- Sampling subgraphs [Hassani and Khasahmadi, 2020; Qiu et al., 2020; You et al., 2020]
- Generating global view via diffusion kernels [Hassani and Khasahmadi, 2020]
We notice that most work adopts a dual-branch architecture following SimCLR [Chen et al., 2020], which augments the original graph twice to form two views and designates positive samples across the two views. Some global-local CL methods, such as DGI [Veličković et al., 2019] and GMI [Peng et al., 2020], instead employ a single-branch architecture, in which negative samples are obtained by corrupting the original graph. Unlike the aforementioned augmentation schemes, which generate congruent pairs and thus model the joint distribution of positive pairs, we refer to these transformations as corruption functions, since they approximate the product of marginals.
Deep Graph Contrastive Learning: GRACE
Having introduced the background of CL, let us present a general graph CL framework, GRACE [Zhu et al., 2020].
The Contrastive Learning Framework
The proposed GRACE framework mainly consists of two stages: data augmentation and contrastive learning. In each iteration, we first sample two augmentation functions $t$ and $t'$ from the set of all possible augmentation functions $\mathcal{T}$. For data augmentation on graphs, we conduct hybrid augmentation at both the topology and attribute levels to construct diverse node contexts; a code sketch of both schemes follows the list below.
- Removing edges: randomly remove a portion of edges in the original graph. Specifically, we sample a random masking matrix $\widetilde{\boldsymbol{R}} \in \{ 0, 1 \}^{N \times N}$, whose entry is drawn from a Bernoulli distribution $\widetilde{\boldsymbol{R}}_{ij} \sim \operatorname{Bern}(1 - p_r)$ if $\boldsymbol{A}_{ij} = 1$, and $p_r$ is the probability of each edge being removed. The resulting adjacency matrix can be computed as $\widetilde{\boldsymbol{A}} = \boldsymbol{A} \circ \widetilde{\boldsymbol{R}}$.
- Masking node features: randomly mask a fraction of dimensions with zeros in node features. Similarly, we sample a random vector $\widetilde{\boldsymbol{m}} \in \{ 0, 1 \}^F$, whose entries are independently drawn from a Bernoulli distribution with probability $(1 - p_m)$. Then, the generated node features are computed by $\widetilde{\boldsymbol{X}} = [ \boldsymbol{x}_1 \circ \widetilde{\boldsymbol{m}}; \boldsymbol{x}_2 \circ \widetilde{\boldsymbol{m}}; \cdots; \boldsymbol{x}_N \circ \widetilde{\boldsymbol{m}} ]^\top$.
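Below is a minimal PyTorch sketch of the two augmentation functions, assuming `edge_index` is a `[2, E]` edge list and `x` an `[N, F]` feature matrix; the names are illustrative and this is not the released GRACE code.

```python
import torch

def remove_edges(edge_index, p_r: float):
    """Drop each edge with probability p_r (Bernoulli mask R over existing edges)."""
    # for undirected graphs stored with both directions, one would drop both copies together
    keep = torch.rand(edge_index.size(1), device=edge_index.device) >= p_r
    return edge_index[:, keep]

def mask_features(x, p_m: float):
    """Zero out each feature dimension with probability p_m; the mask is shared by all nodes."""
    m = (torch.rand(x.size(1), device=x.device) >= p_m).float()
    return x * m  # broadcasts the [F] mask over the [N, F] feature matrix
```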
Then, we generate two correlated graph views by applying the augmentation functions over the structure and features and feed them into a shared GNN, where their node embeddings are denoted as $\boldsymbol{U} = f(\widetilde{\boldsymbol{X}}_1, \widetilde{\boldsymbol A}_1)$ and $\boldsymbol{V} = f(\widetilde{\boldsymbol{X}}_2, \widetilde{\boldsymbol A}_2)$ respectively.
To learn node representations in an unsupervised fashion, we train the model with a contrastive loss that maximizes the agreement between node embeddings in the latent space. Particularly, we first define a pairwise contrastive objective, which takes the form of the NT-Xent (normalized temperature-scaled cross entropy) loss [Chen et al., 2020], given by
\[\ell (\boldsymbol{u}_i, \boldsymbol{v}_i) = \log \frac {e^{\theta\left(\boldsymbol{u}_i, \boldsymbol{v}_{i} \right) / \tau}} {\underbrace{e^{\theta\left(\boldsymbol{u}_i, \boldsymbol{v}_{i} \right) / \tau}}_{\text{positives}} + \underbrace{\displaystyle\sum_{k \neq i} e^{\theta\left(\boldsymbol{u}_i, \boldsymbol{v}_{k} \right) / \tau}}_{\text{inter-view negatives}} + \underbrace{\displaystyle\sum_{k \neq i}e^{\theta\left(\boldsymbol{u}_i, \boldsymbol{u}_k \right) / \tau}}_{\text{intra-view negatives}}},\]where $\tau$ is a temperature parameter. We define the critic function as
\[\theta(\boldsymbol{u}, \boldsymbol{v}) = s(g(\boldsymbol{u}), g(\boldsymbol{v})),\]where $s(\cdot, \cdot)$ is the cosine similarity and $g(\cdot)$ is a non-linear projection that enhances the expressive power of the critic function.
Note that embeddings of the same node across two views constitute the positive pairs. We do not explicitly generate negative samples; all other node embeddings are negative samples.
The final objective is then defined as the average over all positive node pairs
\[\mathcal{J} = \frac{1}{2N} \sum_{i = 1}^{N} \left[\ell(\boldsymbol{u}_i, \boldsymbol{v}_i) + \ell(\boldsymbol{v}_i, \boldsymbol{u}_i)\right].\]
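The objective above can be implemented compactly. The following PyTorch sketch (a simplified illustration, not the authors' reference implementation) computes the negative of $\mathcal{J}$ as a loss to minimize, assuming the projection $g$ has already been applied to the rows of $\boldsymbol{U}$ and $\boldsymbol{V}$.

```python
import torch
import torch.nn.functional as F

def nt_xent(u, v, tau=0.5):
    """Per-node -l(u_i, v_i): inter- and intra-view negatives, theta(u_i, u_i) excluded."""
    u = F.normalize(u, dim=1)
    v = F.normalize(v, dim=1)
    inter = torch.exp(u @ v.t() / tau)   # [N, N] cross-view similarities
    intra = torch.exp(u @ u.t() / tau)   # [N, N] within-view similarities
    pos = inter.diag()                   # positives: same node across the two views
    # denominator: positive + inter-view negatives + intra-view negatives
    denom = inter.sum(dim=1) + intra.sum(dim=1) - intra.diag()
    return -torch.log(pos / denom)

def grace_loss(u, v, tau=0.5):
    """Negative of the final objective J, averaged symmetrically over both views."""
    return 0.5 * (nt_xent(u, v, tau) + nt_xent(v, u, tau)).mean()
```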
Theoretical Analysis
The reasons behind the success of GRACE may seem vague at this point. In this section, I try to connect the contrastive objective to several well-known learning objectives in the machine learning literature. First, our loss can be seen as a lower bound of the mutual information between node features and the embeddings in the two views. To show this, we first give formal definitions of mutual information and the InfoMax principle.
Definition 1 (Mutual information). Mutual information (MI) $I(X; Y)$ is a measure of the mutual dependence between two random variables $X$ and $Y$, measuring how different the joint distribution $P(X, Y)$ is from the product of marginals $P(X)P(Y)$.
Definition 2 (InfoMax principle) [Linsker, 1988]. A function that maps a set of input values $I$ to a set of output values $O$ should be learned so as to maximize the MI between $I$ and $O$.
In representation learning literature, the InfoMax principle is a guideline for learning good representations by maximizing the mutual information between the input and output of a neural network.
Theorem 1. Let $\boldsymbol{X}_i = \{ \boldsymbol{x}_k \}_{k \in \mathcal{N}(i)}$ be the neighborhood of node $v_i$ that collectively maps to its output embedding, where $\mathcal{N}(i)$ denotes the set of neighbors of node $v_i$ specified by GNN architectures, and $\boldsymbol{X}$ be the corresponding random variable with a uniform distribution $p(\boldsymbol{X}_i) = \frac{1}{N}$. Given two random variables $\boldsymbol{U}, \boldsymbol{V} \in \mathbb{R}^{F'}$ being the embeddings in the two views, with their joint distribution denoted as $p(\boldsymbol{U}, \boldsymbol{V})$, our objective $\mathcal{J}$ is a lower bound of the MI between the encoder input $\boldsymbol{X}$ and the node representations in the two graph views $\boldsymbol{U}, \boldsymbol{V}$. Formally, \(\mathcal{J} \leq I(\boldsymbol{X}; \boldsymbol{U}, \boldsymbol{V}).\)
Theorem 1 reveals that maximizing $\mathcal{J}$ is equivalent to explicitly maximizing a lower bound of the MI $I(\boldsymbol{X}; \boldsymbol{U}, \boldsymbol{V})$ between input node features and learned node representations. Recent work further provides empirical evidence that optimizing a stricter bound of MI may not lead to better downstream performance on visual representation learning [Tschannen et al., 2020], which further highlights the importance of the design of data augmentation strategies. However, as the objective is not defined specifically on negative samples generated by the augmentation function, it remains challenging to derive the relationship between specific augmentation functions and the lower bound of MI. We shall leave it for future work.
Alternatively, we may view the objective from the metric learning perspective, where the pairwise objective coincides with the traditional triplet loss.
Theorem 2. When the projection function $g$ is the identity function and we measure embedding similarity by simply taking the inner product, i.e. $s(\boldsymbol{u}, \boldsymbol{v}) = \boldsymbol{u}^\top \boldsymbol{v}$, and further assuming that positive pairs are far more aligned than negative pairs, i.e. $\boldsymbol{u}_i^\top \boldsymbol{v}_k \ll \boldsymbol{u}_i^\top \boldsymbol{v}_i$ and $\boldsymbol{u}_i^\top \boldsymbol{u}_k \ll \boldsymbol{u}_i^\top \boldsymbol{v}_i$, minimizing the pairwise objective $\ell(\boldsymbol{u}_i, \boldsymbol{v}_i)$ coincides with maximizing the triplet loss, as given in the sequel
\[- \ell (\boldsymbol{u}_i, \boldsymbol{v}_i) \propto 4 \tau + \sum_{j \neq i}\left( 2\| {\boldsymbol{u}_i} - {\boldsymbol{v}_i} \|^2 - \| {\boldsymbol{u}_i} - {\boldsymbol{v}_j} \|^2 - \| {\boldsymbol{u}_i} - {\boldsymbol{u}_j} \|^2\right).\]In this way, we highlight the importance of appropriate data augmentation schemes, which is often neglected in previous InfoMax-based methods. Specifically, as the objective pulls together representation of each node in the two corrupted views, the model is enforced to encode information in the input graph that is insensitive to perturbation. Since the proposed adaptive augmentation schemes tend to keep important link structures and node attributes intact in the perturbation, the model is guided to encode essential structural and semantic information into the representation, which improves the quality of embeddings.
Graph Contrastive Learning with Adaptive Augmentation: GCA
Augmentation serves as a crux of CL, but how to augment graph-structured data in graph CL is still largely an empirical choice. In essence, CL seeks to learn representations that are insensitive to the perturbations induced by augmentation schemes [Wu et al., 2020; Xiao et al., 2020]. The transformations therefore aim to produce a view that is distinct from the input yet only imperceptibly perturbed, i.e. the transformation should not fundamentally alter the sample's identity [Jovanović et al., 2021]. Considering that nodes and edges differ in their impact, we argue that augmentation should identify important structural and attribute information of graphs and preserve node identities.
Adaptive Augmentation on Graphs
More specifically, we propose to keep important structures and attributes unchanged and perturb possibly unimportant links and features by setting the removal probability inversely proportional to centrality scores of edges or attributes. From an amortized perspective, we emphasize important structures and attributes over randomly corrupted views.
At the topology level, we sample a modified edge subset $\widetilde{\mathcal{E}}$, where each edge is removed with probability $p_{uv}^e$. This probability should reflect the importance of that edge in the graph topology. In the network science literature, node centrality $\varphi_c(\cdot)$ is a widely used measure to quantify the influence of a node. We derive the edge importance $w_{uv}^e$ from the centrality scores of the nodes at its two ends. For undirected graphs, we take the average centrality score of the two endpoints, i.e. $w_{uv}^e = (\varphi_c(u) + \varphi_c(v))/2$, while for directed graphs, we simply use $w_{uv}^e = \varphi_c(v)$.
To alleviate the impact of nodes with heavily dense connections and thus overly high removal probabilities, we further log-transform the edge importance and then normalize it:
\[\begin{align} s_{uv}^e & = \log w_{uv}^e, \\ p_{uv}^e & = \min\left(\frac{s_\max^e - s_{uv}^e}{s_\max^e - \mu_s^e} \cdot p_e, \enspace p_\tau \right), \end{align}\]- $p_e$ is a hyperparameter that controls the overall removing probability.
- $s_\max^e$ and $\mu_s^e$ are the maximum and the average of $s_{uv}^e$, respectively.
- $p_\tau < 1$ is a cut-off probability.
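As an illustration, the sketch below computes the removal probabilities $p_{uv}^e$ using degree centrality (one of the centrality choices evaluated next); variable names and default values are mine, not taken from the GCA release.

```python
import torch

def edge_drop_probs(edge_index, num_nodes, p_e=0.3, p_tau=0.7):
    """Adaptive edge-removal probabilities p_uv derived from degree centrality."""
    # assumes both directions of every undirected edge appear in edge_index
    deg = torch.zeros(num_nodes, device=edge_index.device)
    deg.scatter_add_(0, edge_index[0], torch.ones(edge_index.size(1), device=edge_index.device))
    w = (deg[edge_index[0]] + deg[edge_index[1]]) / 2       # mean centrality of the endpoints
    s = torch.log(w)                                        # dampen hub-dominated scores
    p = (s.max() - s) / (s.max() - s.mean() + 1e-8) * p_e   # normalize and scale by p_e
    return p.clamp(max=p_tau)                               # cut off at p_tau

def drop_edges_adaptive(edge_index, num_nodes, p_e=0.3, p_tau=0.7):
    """Sample the augmented edge set: each edge is kept with probability 1 - p_uv."""
    p = edge_drop_probs(edge_index, num_nodes, p_e, p_tau)
    keep = torch.rand_like(p) >= p
    return edge_index[:, keep]
```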
In GCA, we consider and evaluate three well-known centrality measures: degree centrality, eigenvector centrality, and PageRank centrality.
Here we visualize the obtained edge removal probabilities on the Karate club dataset [Zachary, 1977]. All three measures highlight the connections around the two central nodes (the instructor and the club administrator). Further experiments also show negligible performance differences among the choices of centrality measure.
At the attribute level, we randomly mask a fraction of the dimensions of node attributes with zeros, where the masking probability of dimension $i$ is denoted as $p_i^f$. Assuming that important feature dimensions appear frequently in influential nodes, we calculate a centrality-weighted frequency for each dimension:
\[w_i^f = \sum_{u \in \mathcal{V}} x_{ui} \cdot \varphi_c(u),\]where $x_{ui} \in \{0, 1\}$ indicates the occurrence of dimension $i$ in node $u$. Then, analogous to the topology-level augmentation, we transform the importance scores $w_i^f$ into masking probabilities $p_i^f$ via log transformation and normalization.
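A corresponding sketch for the attribute level, assuming binary (bag-of-words style) features `x` and a precomputed node-centrality vector `centrality`; again, this is an illustration rather than the reference implementation.

```python
import torch

def feature_mask_probs(x, centrality, p_f=0.3, p_tau=0.7):
    """Adaptive masking probabilities: less central feature dimensions are masked more often."""
    w = x.float().t() @ centrality            # w_i^f: centrality-weighted frequency of dim i
    s = torch.log(w + 1e-8)                   # log transform (epsilon guards empty dimensions)
    p = (s.max() - s) / (s.max() - s.mean() + 1e-8) * p_f
    return p.clamp(max=p_tau)

def mask_features_adaptive(x, centrality, p_f=0.3, p_tau=0.7):
    """Mask each feature dimension i with probability p_i^f, shared across all nodes."""
    p = feature_mask_probs(x, centrality, p_f, p_tau)
    mask = (torch.rand_like(p) >= p).float().to(x.device)
    return x * mask
```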
Experiments
Datasets
For comprehensive comparison, we use five widely-used datasets, including Wiki-CS, Amazon-Computers, Amazon-Photo, Coauthor-CS, and Coauthor-Physics, to study the performance of transductive node classification. The datasets are collected from real-world networks from different domains.
- Wiki-CS is a reference network constructed from Wikipedia. The nodes correspond to articles about computer science and the edges are hyperlinks between the articles. Nodes are labeled with ten classes, each representing a branch of the field.
- Amazon-Computers and Amazon-Photo are two co-purchase networks constructed from Amazon, where nodes are goods and two goods are connected when they are frequently bought together.
- Coauthor-CS and Coauthor-Physics are two academic networks containing co-authorship graphs based on the Microsoft Academic Graph. Nodes represent authors and edges indicate co-authorship relationships. The label of an author corresponds to their most active research field.
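For reference, all five datasets are also available through PyTorch Geometric. The snippet below is a hedged loading example; the exact files used in the experiments are the ones listed under the dataset sources at the end of this post, which may differ slightly from the PyG copies.

```python
from torch_geometric.datasets import Amazon, Coauthor, WikiCS

# Each loader returns a dataset whose first element is a single graph (a Data object).
datasets = {
    'Wiki-CS':          WikiCS(root='data/WikiCS'),
    'Amazon-Computers': Amazon(root='data/Amazon', name='Computers'),
    'Amazon-Photo':     Amazon(root='data/Amazon', name='Photo'),
    'Coauthor-CS':      Coauthor(root='data/Coauthor', name='CS'),
    'Coauthor-Physics': Coauthor(root='data/Coauthor', name='Physics'),
}
data = datasets['Wiki-CS'][0]  # node features data.x, edges data.edge_index, labels data.y
```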
Baselines
We include a broad range of methods as baselines, including traditional network embedding methods, unsupervised GNN models, and supervised GNNs.
- Network embedding methods
- DeepWalk [Perozzi et al., 2014]
- node2vec [Grover and Leskovec, 2016]
- Unsupervised GNNs
- Reconstruction-based methods
- GAE, VGAE [Kipf and Welling, 2016]
- GraphSAGE [Hamilton et al., 2017]
- Contrastive learning methods
- DGI [Veličković et al., 2019]
- GMI [Peng et al., 2020]
- MVGRL [Hassani and Khasahmadi, 2020]
- Supervised GNNs
- GCN [Kipf and Welling, 2017]
- GAT [Veličković et al., 2018]
Experimental Configurations
We follow the linear evaluation scheme in previous studies, where the model is trained in an unsupervised manner, and the learned node embeddings are fed into a simple $\ell_2$-regularized logistic regression model.
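Concretely, the linear evaluation step can be a plain scikit-learn logistic regression on the frozen embeddings; the sketch below assumes an embedding matrix `Z`, labels `y`, and train/test index arrays, which are illustrative names rather than fixed conventions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import normalize

def linear_evaluation(Z, y, train_idx, test_idx, C=1.0):
    """Fit an l2-regularized logistic regression on frozen embeddings; report test accuracy."""
    Z = normalize(Z)                                       # row-wise l2 normalization
    clf = LogisticRegression(penalty='l2', C=C, max_iter=1000)
    clf.fit(Z[train_idx], y[train_idx])
    return clf.score(Z[test_idx], y[test_idx])
```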
For all baselines, we employ a two-layer GCN model as the encoder and report the performance in terms of classification accuracy.
\[\begin{align} \operatorname{GC}_i (\boldsymbol{X}, \boldsymbol{A}) & = \sigma\left(\hat{\boldsymbol{D}}^{-\frac{1}{2}}\hat{\boldsymbol{A}}\hat{\boldsymbol{D}}^{-\frac{1}{2}}\boldsymbol{XW}_i\right), \\ f(\boldsymbol{X}, \boldsymbol{A}) & = \operatorname{GC}_2(\operatorname{GC}_1(\boldsymbol{X}, \boldsymbol{A}), \boldsymbol{A}). \end{align}\]
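In PyTorch Geometric, this encoder amounts to a two-layer GCN. The sketch below mirrors the equation above, with the activation and hidden size left as assumptions; `GCNConv` implements the symmetrically normalized propagation $\hat{\boldsymbol{D}}^{-1/2}\hat{\boldsymbol{A}}\hat{\boldsymbol{D}}^{-1/2}$.

```python
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class Encoder(nn.Module):
    """Two-layer GCN encoder f(X, A) shared by both augmented views."""
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, out_dim)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))    # GC_1 with sigma = ReLU (assumed)
        return F.relu(self.conv2(x, edge_index)) # GC_2
```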
Overall Performance
The overall performance is summarized in this table. We observe that GCA consistently outperforms the unsupervised baselines by considerable margins. Notably, GCA is competitive with models trained with supervision on all five datasets.
Ablation Studies
We also conduct ablation studies on the adaptive augmentation module. We replace the topology and attribute augmentation function with a uniform sampling function respectively.
- GCA–T–A (GRACE): uniform augmentation.
- GCA–T and GCA–A: substitute the topology and the attribute augmentation scheme with uniform sampling respectively.
The results are shown in this table, from which we see that both the topology-level and the attribute-level adaptive augmentation schemes consistently improve model performance on all datasets.
Concluding Remarks
- Graph self-supervised learning (SSL) is a promising way to learn embeddings without human annotations. Stemming from traditional network embedding approaches, graph CL has established a new paradigm for unsupervised representation learning on graphs.
- We have developed a novel graph CL framework GRACE and its extension GCA with adaptive augmentation. The two approaches employ the local-local contrastive mode, which enables better distillation of node-level representations.
- We also find that augmentation schemes at both the structural and attribute levels are critical for graph CL, and that important nodes/attributes should be preserved during augmentation to force the model to learn intrinsic patterns of graphs. Specifically, in GCA, we set the removal probability inversely proportional to the centrality scores of edges and attributes to reflect their importance.
- Our proposed method achieves state-of-the-art performance and bridges the gap between unsupervised and supervised learning.
- Though promising performance has been achieved, the development of graph CL remains nascent and calls for a more principled understanding.
Bibliographies
- [Chen et al., 2020] T. Chen, S. Kornblith, M. Norouzi, and G. E. Hinton, A Simple Framework for Contrastive Learning of Visual Representations, in ICML, 2020.
- [Grover and Leskovec, 2016] A. Grover and J. Leskovec, node2vec: Scalable Feature Learning for Networks, in KDD, 2016.
- [Hamilton et al., 2017] W. L. Hamilton, Z. Ying, and J. Leskovec, Inductive Representation Learning on Large Graphs, in NIPS, 2017.
- [Hassani and Khasahmadi, 2020] K. Hassani and A. H. Khasahmadi, Contrastive Multi-View Representation Learning on Graphs, in ICML, 2020.
- [Jing and Tian, 2020] L. Jing and Y. Tian, Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey, TPAMI, 2020.
- [Kipf and Welling, 2016] T. N. Kipf and M. Welling, Variational Graph Auto-Encoders, in BDL@NIPS, 2016.
- [Kipf and Welling, 2017] T. N. Kipf and M. Welling, Semi-Supervised Classification with Graph Convolutional Networks, in ICLR, 2017.
- [Linsker, 1988] R. Linsker, Self-Organization in a Perceptual Network, IEEE Computer, 1988.
- [Newman, 2018] M. E. J. Newman, Networks: An Introduction (Second Edition), Oxford University Press, 2018.
- [Jovanović et al., 2021] N. Jovanović, Z. Meng, L. Faber, and R. Wattenhofer, Towards Robust Graph Contrastive Learning, arXiv preprint, 2021.
- [Oord et al., 2018] A. van den Oord, Y. Li, and O. Vinyals, Representation Learning with Contrastive Predictive Coding, arXiv preprint, 2018.
- [Peng et al., 2020] Z. Peng, W. Huang, M. Luo, Q. Zheng, Y. Rong, T. Xu, and J. Huang, Graph Representation Learning via Graphical Mutual Information Maximization, in WWW, 2020.
- [Perozzi et al., 2014] B. Perozzi, R. Al-Rfou, and S. Skiena, DeepWalk: Online Learning of Social Representations, in KDD, 2014.
- [Qiu et al., 2018] J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang, Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec, in WSDM, 2018.
- [Qiu et al., 2020] J. Qiu, Q. Chen, Y. Dong, J. Zhang, H. Yang, M. Ding, K. Wang, and J. Tang, GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training, in KDD, 2020.
- [Tschannen et al., 2020] M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic, On Mutual Information Maximization for Representation Learning, in ICLR, 2020.
- [Veličković et al., 2018] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, Graph Attention Networks, in ICLR, 2018.
- [Veličković et al., 2019] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm, Deep Graph Infomax, in ICLR, 2019.
- [Wu et al., 2020] M. Wu, C. Zhuang, M. Mosse, D. Yamins, and N. Goodman, On Mutual Information in Contrastive Learning for Visual Representations, arXiv preprint, 2020.
- [Xiao et al., 2020] T. Xiao, X. Wang, A. A. Efros, and T. Darrell, What Should Not Be Contrastive in Contrastive Learning, arXiv preprint, 2020.
- [Xu et al., 2019] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, How Powerful are Graph Neural Networks?, in ICLR, 2019.
- [You et al., 2020] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen, Graph Contrastive Learning with Augmentations, in NeurIPS, 2020.
- [Zachary, 1977] W. W. Zachary, An Information Flow Model for Conflict and Fission in Small Groups, Journal of Anthropological Research, 1977.
- [Zhu et al., 2020] Y. Zhu, Y. Xu, F. Yu, Q. Liu, S. Wu, and L. Wang, Deep Graph Contrastive Representation Learning, in GRL+@ICML, 2020.
- [Zhu et al., 2021] Y. Zhu, Y. Xu, F. Yu, Q. Liu, S. Wu, and L. Wang, Graph Contrastive Learning with Adaptive Augmentation, in WWW, 2021.
Citation
Please cite our paper should you find our work relevant to yours:
@inproceedings{Zhu:2020vf,
author = {Zhu, Yanqiao and Xu, Yichen and Yu, Feng and Liu, Qiang and Wu, Shu and Wang, Liang},
title = {{Deep Graph Contrastive Representation Learning}},
booktitle = {ICML Workshop on Graph Representation Learning and Beyond},
year = {2020},
url = {https://arxiv.org/abs/2006.04131}
}
@inproceedings{Zhu:2021wh,
author = {Zhu, Yanqiao and Xu, Yichen and Yu, Feng and Liu, Qiang and Wu, Shu and Wang, Liang},
title = {{Graph Contrastive Learning with Adaptive Augmentation}},
year = 2021,
isbn = {9781450370233},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
booktitle = {Proceedings of The Web Conference 2021},
location = {Ljubljana, Slovenia},
month = apr,
series = {WWW '21},
doi = {10.1145/3442381.3449802},
url = {https://doi.org/10.1145/3442381.3449802},
pages = {2069--2080},
numpages = {12}
}
Dataset sources:
1. Wiki-CS: https://github.com/pmernyei/wiki-cs-dataset/raw/master/dataset
2. Amazon-Computers: https://github.com/shchur/gnn-benchmark/raw/master/data/npz/amazon_electronics_computers.npz
3. Amazon-Photo: https://github.com/shchur/gnn-benchmark/raw/master/data/npz/amazon_electronics_photo.npz
4. Coauthor-CS: https://github.com/shchur/gnn-benchmark/raw/master/data/npz/ms_academic_cs.npz
5. Coauthor-Physics: https://github.com/shchur/gnn-benchmark/raw/master/data/npz/ms_academic_phy.npz