Unverified Commit c0701fb0 authored by Fize Jacques's avatar Fize Jacques Committed by GitHub

Merge pull request #5 from Jacobe2169/graph_cython

Add new version : Graph Extension, Parallelization, Graph Embedding, Performance enhanced
parents c569c36e c054fc2d
......@@ -6,8 +6,9 @@ notifications:
email: false
install:
- pip install -q cython numpy networkx scipy scikit-learn pandas
- python setup.py build_ext --inplace
- pip install cython numpy networkx scipy scikit-learn pandas gensim joblib psutil --upgrade
- pip install .
script:
- pytest gmatch4py/test/test.py
\ No newline at end of file
- echo "1"
![](logo.png)
[![Build Status](https://travis-ci.com/Jacobe2169/GMatch4py.svg?branch=master)](https://travis-ci.com/Jacobe2169/GMatch4py)
# GMatch4py a graph matching library for Python
GMatch4py is a library dedicated to graph matching. Graph structures are stored in NetworkX graph objects.
GMatch4py algorithms were implemented with Cython to enhance performance.
## Requirements
* Python 3.x
* Cython
* networkx
* numpy
* scikit-learn
* Python 3
* Numpy and Cython installed (if not : `(sudo) pip(3) install numpy cython`)
## Installation
......@@ -19,7 +20,7 @@ To install `GMatch4py`, run the following commands:
```bash
git clone https://github.com/Jacobe2169/GMatch4py.git
cd GMatch4py
(sudo) python3 setup.py install
(sudo) pip(3) install .
```
## Get Started
......@@ -28,7 +29,7 @@ cd GMatch4py
In `GMatch4py`, algorithms manipulate `networkx.Graph`, a complete graph model that
comes with a wide range of parsers to load your graph from various inputs: `*.graphml`, `*.gexf`, ... (check [here](https://networkx.github.io/documentation/stable/reference/readwrite/index.html) to see all the accepted formats)
### Use Gmatch4py
### Use GMatch4py
If you want to use algorithms like *graph edit distances*, here is an example:
```python
......@@ -44,7 +45,7 @@ g1=nx.complete_bipartite_graph(5,4)
g2=nx.complete_bipartite_graph(6,4)
```
All graph matching algorithms in `Gmatch4py work this way:
All graph matching algorithms in `Gmatch4py` work this way:
* Each algorithm is associated with an object, each object having its specific parameters. In this case, the parameters are the edit costs (delete a vertex, add a vertex, ...)
* Each object provides a `compare()` function with two parameters. The first parameter is **a list of the graphs** you want to **compare**, i.e. measure the distance/similarity (depending on the algorithm). Then, you can specify a sample of graphs to be compared to all the other graphs. To this end, the second parameter should be **a list containing the indices** of these graphs (based on the first parameter list). If you would rather compute the distance/similarity **between all graphs**, just use the `None` value.
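The selection semantics above can be sketched in plain Python. `pairs_to_compare` is a hypothetical helper, not part of the GMatch4py API; it only illustrates which (row, column) entries of the matrix get computed:

```python
# Sketch of the compare() selection semantics: `selected` lists the indices
# of the graphs that are compared to all the others; None means every graph
# is compared against every other graph. (Hypothetical helper.)
def pairs_to_compare(n_graphs, selected=None):
    sel = set(range(n_graphs)) if selected is None else set(selected)
    return [(i, j) for i in range(n_graphs) for j in range(n_graphs) if i in sel]

# Compare graphs 0 and 2 against all four graphs:
print(pairs_to_compare(4, selected=[0, 2]))
# [(0, 0), (0, 1), (0, 2), (0, 3), (2, 0), (2, 1), (2, 2), (2, 3)]
```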
......@@ -68,15 +69,22 @@ ged.similarity(result)
ged.distance(result)
```
## Exploit nodes and edges attributes
In this latest version, we added the possibility to exploit graph attributes! To do so, the `base.Base` class is extended with the `set_attr_graph_used(node_attr,edge_attr)` method.
```python
import networkx as nx
import gmatch4py as gm
ged = gm.GraphEditDistance(1,1,1,1)
ged.set_attr_graph_used("theme","color") # Edge colors and node themes attributes will be used.
```
## List of algorithms
* DeltaCon and DeltaCon0 (*debug needed*) [1]
* Vertex Ranking [2]
* Vertex Edge Overlap [2]
* Bag of Nodes (a bag of words model using nodes as vocabulary)
* Bag of Cliques (a bag of words model using cliques as vocabulary)
* Graph Embedding
* Graph2Vec [1]
* DeepWalk [7]
* Graph kernels
* Random Walk Kernel (*debug needed*) [3]
* Geometrical
......@@ -84,23 +92,27 @@ ged.distance(result)
* Shortest Path Kernel [3]
* Weisfeiler-Lehman Kernel [4]
* Subtree Kernel
* Edge Kernel
* Graph Edit Distance [5]
* Approximated Graph Edit Distance
* Hausdorff Graph Edit Distance
* Bipartite Graph Edit Distance
* Greedy Edit Distance
* Vertex Ranking [2]
* Vertex Edge Overlap [2]
* Bag of Nodes (a bag of words model using nodes as vocabulary)
* Bag of Cliques (a bag of words model using cliques as vocabulary)
* MCS [6]
## Publications associated
* [1] Koutra, D., Vogelstein, J. T., & Faloutsos, C. (2013, May). Deltacon: A principled massive-graph similarity function. In Proceedings of the 2013 SIAM International Conference on Data Mining (pp. 162-170). Society for Industrial and Applied Mathematics.
* [1] Narayanan, Annamalai and Chandramohan, Mahinthan and Venkatesan, Rajasekar and Chen, Lihui and Liu, Yang. Graph2vec: Learning distributed representations of graphs. MLG 2017, 13th International Workshop on Mining and Learning with Graphs (MLGWorkshop 2017).
* [2] Papadimitriou, P., Dasdan, A., & Garcia-Molina, H. (2010). Web graph similarity for anomaly detection. Journal of Internet Services and Applications, 1(1), 19-30.
* [3] Vishwanathan, S. V. N., Schraudolph, N. N., Kondor, R., & Borgwardt, K. M. (2010). Graph kernels. Journal of Machine Learning Research, 11(Apr), 1201-1242.
* [4] Shervashidze, N., Schweitzer, P., Leeuwen, E. J. V., Mehlhorn, K., & Borgwardt, K. M. (2011). Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep), 2539-2561.
* [5] Fischer, A., Riesen, K., & Bunke, H. (2017). Improved quadratic time approximation of graph edit distance by combining Hausdorff matching and greedy assignment. Pattern Recognition Letters, 87, 55-62.
* [6] A graph distance metric based on the maximal common subgraph, H. Bunke and K. Shearer, Pattern Recognition Letters, 1998
* [7] Perozzi, B., Al-Rfou, R., & Skiena, S. (2014, August). Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 701-710). ACM.
## Author(s)
......@@ -109,6 +121,26 @@ Jacques Fize, *jacques[dot]fize[at]cirad[dot]fr*
Some algorithms from other projects were integrated into GMatch4py. **Be assured that
each piece of code is associated with a reference to the original.**
## CHANGELOG
### 05.03.2019
* Add Graph Embedding algorithms
* Remove depreciated methods and classes
* Add logo
* Update documentation
### 25.02.2019
* Add new graph class. Features: Cython extensions, precomputed values (degrees, neighbour info), hash representation of edges and nodes for faster comparison
* Some algorithms, such as the graph edit distances and Jaccard, are now parallelized
## TODO List
* Debug algorithms --> Random Walk Kernel, Deltacon
* Optimize algorithms --> Vertex Ranking
## Improvements
GMatch4py is going through some heavy changes to reduce the execution time of each algorithm. You may find an alpha version in the `graph_cython` branch.
......@@ -118,4 +150,3 @@ As of today, the results are promising (up to ![](https://latex.codecogs.com/gif
## TODO List
* Debug algorithms --> :runner: (almost done !)
* Write the documentation :runner:
name = "gmatch4py"
\ No newline at end of file
......@@ -8,9 +8,14 @@ from .ged.hausdorff_edit_distance import *
# Kernels algorithms import
from .kernels.weisfeiler_lehman import *
from .kernels.shortest_path_kernel import *
# Graph Embedding import
from .embedding.graph2vec import *
from .embedding.deepwalk import *
# Helpers import
from .helpers.reader import *
from .helpers.general import *
# Basic algorithms import
from .bag_of_cliques import *
......
# coding = utf-8
from enum import Enum
class AlgorithmType(Enum):
similarity = 0
distance = 1
\ No newline at end of file
......@@ -9,7 +9,7 @@ cimport numpy as np
from scipy.sparse import csr_matrix,lil_matrix
import sys
from .base cimport Base,intersection
from .base cimport Base
cdef class BagOfCliques(Base):
......
......@@ -4,12 +4,16 @@ cdef class Base:
## Attribute(s)
cdef int type_alg
cdef bint normalized
cdef int cpu_count
cdef str node_attr_key
cdef str edge_attr_key
## Methods
cpdef np.ndarray compare(self,list graph_list, list selected)
cpdef np.ndarray compare_old(self,list listgs, list selected)
cpdef np.ndarray distance(self, np.ndarray matrix)
cpdef np.ndarray similarity(self, np.ndarray matrix)
cpdef bint isAccepted(self,G,index,selected)
cpdef np.ndarray get_selected_array(self,selected,size_corpus)
cpdef set_attr_graph_used(self, str node_attr_key, str edge_attr_key)
cpdef intersection(G,H)
cpdef union_(G,H)
......@@ -3,6 +3,10 @@
import numpy as np
cimport numpy as np
import networkx as nx
cimport cython
import multiprocessing
cpdef np.ndarray minmax_scale(np.ndarray matrix):
"""
......@@ -17,85 +21,6 @@ cpdef np.ndarray minmax_scale(np.ndarray matrix):
return x/(max_)
cpdef intersection(G, H):
"""
Return a new graph that contains only the edges and nodes that exist in
both G and H.
The node sets of H and G must be the same.
Parameters
----------
G,H : graph
A NetworkX graph. G and H must have the same node sets.
Returns
-------
GH : A new graph with the same type as G.
Notes
-----
Attributes from the graph, nodes, and edges are not copied to the new
graph. If you want a new graph of the intersection of G and H
with the attributes (including edge data) from G use remove_nodes_from()
as follows
>>> G=nx.path_graph(3)
>>> H=nx.path_graph(5)
>>> R=G.copy()
>>> R.remove_nodes_from(n for n in G if n not in H)
Modified so it can be used with two graphs with different node sets
"""
# create new graph
R = nx.create_empty_copy(G)
if not G.is_multigraph() == H.is_multigraph():
raise nx.NetworkXError('G and H must both be graphs or multigraphs.')
if G.number_of_edges() <= H.number_of_edges():
if G.is_multigraph():
edges = G.edges(keys=True)
else:
edges = G.edges()
for e in edges:
if H.has_edge(*e):
R.add_edge(*e)
else:
if H.is_multigraph():
edges = H.edges(keys=True)
else:
edges = H.edges()
for e in edges:
if G.has_edge(*e):
R.add_edge(*e)
nodes_g=set(G.nodes())
nodes_h=set(H.nodes())
R.remove_nodes_from(list(nodes_g - nodes_h))
return R
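The edge/node logic of `intersection` above can be condensed into a stdlib-only sketch. Here graphs are modelled as `(nodes, edges)` tuples rather than networkx objects, and `graph_intersection` is a hypothetical name used only for illustration:

```python
# Keep only the edges present in both graphs whose endpoints survive the
# node intersection — the same outcome as building R from G's edges and
# then removing the nodes of G that are not in H.
def graph_intersection(g, h):
    nodes_g, edges_g = g
    nodes_h, edges_h = h
    nodes = nodes_g & nodes_h
    edges = {e for e in edges_g if e in edges_h and e[0] in nodes and e[1] in nodes}
    return nodes, edges

g = ({1, 2, 3}, {(1, 2), (2, 3)})
h = ({2, 3, 4}, {(2, 3), (3, 4)})
print(graph_intersection(g, h))  # ({2, 3}, {(2, 3)})
```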
cpdef union_(G, H):
"""
Return a graph that contains nodes and edges from both graph G and H.
Parameters
----------
G : networkx.Graph
First graph
H : networkx.Graph
Second graph
Returns
-------
networkx.Graph
A new graph with the same type as G.
"""
R = nx.create_empty_copy(G)
R.add_nodes_from(H.nodes(data=True))
R.add_edges_from(G.edges(data=True))
R.add_edges_from(H.edges(data=True))
return R
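`union_` above keeps every node and edge from both graphs; the same operation in a stdlib-only sketch (graphs again as hypothetical `(nodes, edges)` tuples, not networkx objects):

```python
# Union of two graphs: all nodes and all edges from both inputs.
def graph_union(g, h):
    return g[0] | h[0], g[1] | h[1]

print(graph_union(({1, 2}, {(1, 2)}), ({2, 3}, {(2, 3)})))
# ({1, 2, 3}, {(1, 2), (2, 3)})
```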
cdef class Base:
"""
This class define the common methods to all Graph Matching algorithm.
......@@ -115,7 +40,7 @@ cdef class Base:
self.type_alg=0
self.normalized=False
def __init__(self,type_alg,normalized):
def __init__(self,type_alg,normalized,node_attr_key="",edge_attr_key=""):
"""
Constructor of Base
......@@ -136,6 +61,66 @@ cdef class Base:
else:
self.type_alg=type_alg
self.normalized=normalized
self.cpu_count=multiprocessing.cpu_count()
self.node_attr_key=node_attr_key
self.edge_attr_key=edge_attr_key
cpdef set_attr_graph_used(self, str node_attr_key, str edge_attr_key):
"""
Set graph attribute used by the algorithm to compare graphs.
Parameters
----------
node_attr_key : str
key of the node attribute
edge_attr_key: str
key of the edge attribute
"""
self.node_attr_key=node_attr_key
self.edge_attr_key=edge_attr_key
cpdef np.ndarray get_selected_array(self,selected,size_corpus):
"""
Return an array defining which graphs will be compared by the algorithms.
Parameters
----------
selected : list
indices of graphs you wish to compare
size_corpus :
size of your dataset
Returns
-------
np.ndarray
selected vector (1 -> selected, 0 -> not selected)
"""
cdef double[:] selected_test = np.zeros(size_corpus)
if selected is not None:
for ix in range(len(selected)):
selected_test[selected[ix]]=1
return np.array(selected_test)
else:
return np.array(selected_test)+1
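The masking performed by `get_selected_array` can be illustrated without NumPy (`selected_mask` is a hypothetical helper mirroring the logic above):

```python
# Build a 0/1 mask over the corpus: 1 for selected indices, 0 otherwise;
# a None selection means "everything is selected" (all ones).
def selected_mask(selected, size_corpus):
    if selected is None:
        return [1.0] * size_corpus
    mask = [0.0] * size_corpus
    for ix in selected:
        mask[ix] = 1.0
    return mask

print(selected_mask([0, 3], 5))  # [1.0, 0.0, 0.0, 1.0, 0.0]
print(selected_mask(None, 3))    # [1.0, 1.0, 1.0]
```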
cpdef np.ndarray compare_old(self,list listgs, list selected):
"""
Will soon be deprecated! Stores the old version of an algorithm.
Parameters
----------
listgs : list
list of graphs
selected
selected graphs
Returns
-------
np.ndarray
distance/similarity matrix
"""
pass
@cython.boundscheck(False)
cpdef np.ndarray compare(self,list graph_list, list selected):
"""
Return the similarity/distance matrix using the current algorithm.
......@@ -153,7 +138,7 @@ cdef class Base:
the None value
Returns
-------
np.array
np.ndarray
distance/similarity matrix
"""
......@@ -164,12 +149,12 @@ cdef class Base:
Return a normalized distance matrix
Parameters
----------
matrix : np.array
Similarity/distance matrix you want to transform
matrix : np.ndarray
Similarity/distance matrix you wish to transform
Returns
-------
np.array
np.ndarray
distance matrix
"""
if self.type_alg == 1:
......@@ -186,8 +171,8 @@ cdef class Base:
Return a normalized similarity matrix
Parameters
----------
matrix : np.array
Similarity/distance matrix you want to transform
matrix : np.ndarray
Similarity/distance matrix you wish to transform
Returns
-------
......@@ -201,30 +186,12 @@ cdef class Base:
matrix=np.ma.getdata(minmax_scale(matrix))
return 1-matrix
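The distance-to-similarity conversion above (scale to [0, 1], then flip) can be sketched without NumPy. This mirrors `minmax_scale`, which divides by the global maximum; `distance_to_similarity` is a hypothetical name:

```python
# Convert a distance matrix into a similarity matrix: divide every entry
# by the global maximum, then take 1 minus the scaled value.
def distance_to_similarity(matrix):
    hi = max(v for row in matrix for v in row) or 1.0  # guard against all-zero input
    return [[1.0 - v / hi for v in row] for row in matrix]

print(distance_to_similarity([[0.0, 2.0], [4.0, 0.0]]))
# [[1.0, 0.5], [0.0, 1.0]]
```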
def mcs(self, G, H):
"""
Return the Maximal Common Subgraph of G and H.
Parameters
----------
G : networkx.Graph
First Graph
H : networkx.Graph
Second Graph
Returns
-------
networkx.Graph
Maximal Common Subgraph
"""
R=G.copy()
R.remove_nodes_from(n for n in G if n not in H)
return R
cpdef bint isAccepted(self,G,index,selected):
"""
Indicate whether the graph will be compared to the others. A graph is "accepted" if:
* G exists(!= None) and not empty (|vertices(G)| >0)
* If selected graph to compare were indicated, check if G exists in selected
* G exists(!= None) and not empty (|vertices(G)| >0)
* If selected graph to compare were indicated, check if G exists in selected
Parameters
----------
......@@ -244,7 +211,7 @@ cdef class Base:
if not G:
f=False
elif len(G)== 0:
f=False
f=False
if selected:
if not index in selected:
f=False
......
......@@ -11,7 +11,7 @@ cdef class BagOfNodes(Base):
We could call this algorithm Bag of nodes
"""
def __init__(self):
Base.__init__(self,0,True)
Base.__init__(self,0,True)
cpdef np.ndarray compare(self,list graph_list, list selected):
nodes = list()
......
# coding = utf-8
import networkx as nx
import numpy as np
import scipy.sparse
class DeltaCon0():
__type__ = "sim"
@staticmethod
def compare(list_gs,selected):
n=len(list_gs)
comparison_matrix = np.zeros((n,n))
for i in range(n):
for j in range(i,n):
g1,g2=list_gs[i],list_gs[j]
f=True
if not list_gs[i] or not list_gs[j]:
f=False
elif len(list_gs[i])== 0 or len(list_gs[j]) == 0:
f=False
if selected:
if not i in selected:
f=False
if f:
# S1
epsilon = 1/(1+DeltaCon0.maxDegree(g1))
D, A = DeltaCon0.degreeAndAdjacencyMatrix(g1)
S1 = np.linalg.inv(np.identity(len(g1))+(epsilon**2)*D -epsilon*A)
# S2
D, A = DeltaCon0.degreeAndAdjacencyMatrix(g2)
epsilon = 1 / (1 + DeltaCon0.maxDegree(g2))
S2 = np.linalg.inv(np.identity(len(g2))+(epsilon**2)*D -epsilon*A)
comparison_matrix[i,j] = 1/(1+DeltaCon0.rootED(S1,S2))
comparison_matrix[j,i] = comparison_matrix[i,j]
else:
comparison_matrix[i, j] = 0.
comparison_matrix[j, i] = comparison_matrix[i, j]
return comparison_matrix
@staticmethod
def rootED(S1,S2):
return np.sqrt(np.sum((S1-S2)**2)) # Long live numpy !
@staticmethod
def degreeAndAdjacencyMatrix(G):
"""
Return the degree matrix (D) and adjacency matrix (A) of a graph G.
Inspired by the nx.laplacian_matrix(G,nodelist,weight) code from networkx
:param G:
:return:
"""
A = nx.to_scipy_sparse_matrix(G, nodelist=list(G.nodes), weight="weight",
format='csr')
n, m = A.shape
diags = A.sum(axis=1)
D = scipy.sparse.spdiags(diags.flatten(), [0], m, n, format='csr')
return D, A
@staticmethod
def maxDegree(G):
degree_sequence = sorted(dict(nx.degree(G)).values(), reverse=True)  # dict() needed: networkx >= 2.0 returns a DegreeView
dmax = max(degree_sequence)
return dmax
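The affinity matrix S = (I + ε²D − εA)⁻¹ computed in `DeltaCon0.compare` above can be checked by hand on a two-node graph with a single edge (max degree 1, so ε = 1/2); the 2×2 inverse is done analytically. `affinity_2x2` is a hypothetical helper for this worked example only:

```python
# S = (I + eps^2 * D - eps * A)^(-1) for a 2-node graph, inverted analytically.
# d1, d2 are node degrees; a12 is the (symmetric) adjacency entry.
def affinity_2x2(eps, d1, d2, a12):
    m11 = 1 + eps**2 * d1
    m22 = 1 + eps**2 * d2
    m12 = -eps * a12
    det = m11 * m22 - m12 * m12
    return [[m22 / det, -m12 / det], [-m12 / det, m11 / det]]

S = affinity_2x2(0.5, 1, 1, 1)
print(S)  # entries ~0.952 on the diagonal, ~0.381 off-diagonal
```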
class DeltaCon():
__type__ = "sim"
@staticmethod
def relabel_nodes(graph_list):
label_lookup = {}
label_counter = 0
n= len(graph_list)
# label_lookup is an associative array, which will contain the
# mapping from multiset labels (strings) to short labels
# (integers)
for i in range(n):
nodes = list(graph_list[i].nodes)
for j in range(len(nodes)):
if not (nodes[j] in label_lookup):
label_lookup[nodes[j]] = label_counter
label_counter += 1
graph_list[i] = nx.relabel_nodes(graph_list[i], label_lookup)
return graph_list
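The relabelling above maps arbitrary node labels to consecutive integers, with the lookup table shared across all graphs. A stdlib sketch of just the lookup construction (hypothetical `build_label_lookup` helper):

```python
# Assign each distinct node label an integer id, in order of first appearance
# across the whole graph list.
def build_label_lookup(graphs_nodes):
    lookup, counter = {}, 0
    for nodes in graphs_nodes:
        for n in nodes:
            if n not in lookup:
                lookup[n] = counter
                counter += 1
    return lookup

print(build_label_lookup([["a", "b"], ["b", "c"]]))  # {'a': 0, 'b': 1, 'c': 2}
```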
@staticmethod
def compare(list_gs, g=3):
n=len(list_gs)
list_gs=DeltaCon.relabel_nodes(list_gs)
comparison_matrix = np.zeros((n,n))
for i in range(n):
for j in range(i,n):
g1,g2=list_gs[i],list_gs[j]
V = list(g1.nodes)
V.extend(list(g2.nodes))
V=np.unique(V)
partitions=V.copy()
np.random.shuffle(partitions)
if len(partitions)< g:
partitions=np.array([partitions])
else:
partitions=np.array_split(partitions,g)
partitions_e_1 = DeltaCon.partitions2e(partitions, list(g1.nodes))
partitions_e_2 = DeltaCon.partitions2e(partitions, list(g2.nodes))
S1,S2=[],[]
for k in range(len(partitions)):
s0k1,s0k2=partitions_e_1[k],partitions_e_2[k]
# S1
epsilon = 1/(1+DeltaCon0.maxDegree(g1))
D, A = DeltaCon0.degreeAndAdjacencyMatrix(g1)
s1k = np.linalg.inv(np.identity(len(g1))+(epsilon**2)*D -epsilon*A)
s1k=np.linalg.solve(s1k,s0k1).tolist()
# S2
D, A = DeltaCon0.degreeAndAdjacencyMatrix(g2)
epsilon = 1 / (1 + DeltaCon0.maxDegree(g2))
s2k= np.linalg.inv(np.identity(len(g2))+(epsilon**2)*D -epsilon*A)
s2k = np.linalg.solve(s2k, s0k2).tolist()
S1.append(s1k)
S2.append(s2k)
comparison_matrix[i,j] = 1/(1+DeltaCon0.rootED(np.array(S1),np.array(S2)))
comparison_matrix[j,i] = comparison_matrix[i,j]
return comparison_matrix
@staticmethod
def partitions2e( partitions, V):
e = [ [] for i in range(len(partitions))]
for p in range(len(partitions)):
e[p] = []
for i in range(len(V)):
if i in partitions[p]:
e[p].append(1.0)
else:
e[p].append(0.0)
return e
\ No newline at end of file
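`partitions2e` above builds one 0/1 indicator vector per partition, marking which vertex indices fall into that partition. A stdlib sketch (hypothetical `partitions_to_indicators` helper):

```python
# One indicator vector per partition: entry i is 1.0 iff vertex index i
# belongs to that partition.
def partitions_to_indicators(partitions, n_vertices):
    return [[1.0 if i in part else 0.0 for i in range(n_vertices)]
            for part in partitions]

print(partitions_to_indicators([[0, 2], [1]], 3))
# [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
```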
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import os
import sys
import random
from io import open
from argparse import ArgumentParser, FileType, ArgumentDefaultsHelpFormatter
from collections import Counter
from concurrent.futures import ProcessPoolExecutor
import logging
from multiprocessing import cpu_count
import networkx as nx
import numpy as np
cimport numpy as np
from six import text_type as unicode
from six import iteritems
from six.moves import range
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity
from joblib import Parallel, delayed
import psutil