diff --git a/.travis.yml b/.travis.yml index 459b5f325fad751405ef2276351820b5cfbe4c52..ae341ffd32025e66889a5ce9ffdacf42a7680668 100644 --- a/.travis.yml +++ b/.travis.yml @@ -6,8 +6,9 @@ notifications: email: false install: - - pip install -q cython numpy networkx scipy scikit-learn pandas - - python setup.py build_ext --inplace + - pip install cython numpy networkx scipy scikit-learn pandas gensim joblib gensim psutil --upgrade + - pip install . script: - - pytest gmatch4py/test/test.py \ No newline at end of file + - echo "1" + diff --git a/README.md b/README.md index 2779374d5b74879c222b88828afa65257ff01b82..da8debdbca8dcf7056a4ee6a4e2904f77001699c 100644 --- a/README.md +++ b/README.md @@ -1,16 +1,17 @@ + + + [](https://travis-ci.com/Jacobe2169/GMatch4py) # GMatch4py a graph matching library for Python + GMatch4py is a library dedicated to graph matching. Graph structure are stored in NetworkX graph objects. GMatch4py algorithms were implemented with Cython to enhance performance. ## Requirements - - * Python 3.x - * Cython - * networkx - * numpy - * scikit-learn + + * Python 3 + * Numpy and Cython installed (if not : `(sudo) pip(3) install numpy cython`) ## Installation @@ -19,7 +20,7 @@ To install `GMatch4py`, run the following commands: ```bash git clone https://github.com/Jacobe2169/GMatch4py.git cd GMatch4py -(sudo) python3 setup.py install +(sudo) pip(3) install . 
``` ## Get Started @@ -28,7 +29,7 @@ cd GMatch4py In `GMatch4py`, algorithms manipulate `networkx.Graph`, a complete graph model that comes with a large spectrum of parser to load your graph from various inputs : `*.graphml,*.gexf,..` (check [here](https://networkx.github.io/documentation/stable/reference/readwrite/index.html) to see all the format accepted) -### Use Gmatch4py +### Use GMatch4py If you want to use algorithms like *graph edit distances*, here is an example: ```python @@ -44,7 +45,7 @@ g1=nx.complete_bipartite_graph(5,4) g2=nx.complete_bipartite_graph(6,4) ``` -All graph matching algorithms in `Gmatch4py work this way: +All graph matching algorithms in `Gmatch4py` work this way: * Each algorithm is associated with an object, each object having its specific parameters. In this case, the parameters are the edit costs (delete a vertex, add a vertex, ...) * Each object is associated with a `compare()` function with two parameters. First parameter is **a list of the graphs** you want to **compare**, i.e. measure the distance/similarity (depends on the algorithm). Then, you can specify a sample of graphs to be compared to all the other graphs. To this end, the second parameter should be **a list containing the indices** of these graphs (based on the first parameter list). If you rather compute the distance/similarity **between all graphs**, just use the `None` value. @@ -68,15 +69,22 @@ ged.similarity(result) ged.distance(result) ``` +## Exploit nodes and edges attributes +In this latest version, we add the possibility to exploit graph attributes ! To do so, the `base.Base` is extended with the `set_attr_graph_used(node_attr,edge_attr)` method. + +```python +import networkx as nx +import gmatch4py as gm +ged = gm.GraphEditDistance(1,1,1,1) +ged.set_attr_graph_used("theme","color") # Edge colors and node themes attributes will be used. 
+``` ## List of algorithms - * DeltaCon and DeltaCon0 (*debug needed*) [1] - * Vertex Ranking [2] - * Vertex Edge Overlap [2] - * Bag of Nodes (a bag of words model using nodes as vocabulary) - * Bag of Cliques (a bag of words model using cliques as vocabulary) + * Graph Embedding + * Graph2Vec [1] + * DeepWalk [7] * Graph kernels * Random Walk Kernel (*debug needed*) [3] * Geometrical @@ -84,23 +92,27 @@ ged.distance(result) * Shortest Path Kernel [3] * Weisfeiler-Lehman Kernel [4] * Subtree Kernel - * Edge Kernel * Graph Edit Distance [5] * Approximated Graph Edit Distance * Hausdorff Graph Edit Distance * Bipartite Graph Edit Distance * Greedy Edit Distance + * Vertex Ranking [2] + * Vertex Edge Overlap [2] + * Bag of Nodes (a bag of words model using nodes as vocabulary) + * Bag of Cliques (a bag of words model using cliques as vocabulary) * MCS [6] ## Publications associated - * [1] Koutra, D., Vogelstein, J. T., & Faloutsos, C. (2013, May). Deltacon: A principled massive-graph similarity function. In Proceedings of the 2013 SIAM International Conference on Data Mining (pp. 162-170). Society for Industrial and Applied Mathematics. + * [1] Narayanan, Annamalai and Chandramohan, Mahinthan and Venkatesan, Rajasekar and Chen, Lihui and Liu, Yang. Graph2vec: Learning distributed representations of graphs. MLG 2017, 13th International Workshop on Mining and Learning with Graphs (MLGWorkshop 2017). * [2] Papadimitriou, P., Dasdan, A., & Garcia-Molina, H. (2010). Web graph similarity for anomaly detection. Journal of Internet Services and Applications, 1(1), 19-30. * [3] Vishwanathan, S. V. N., Schraudolph, N. N., Kondor, R., & Borgwardt, K. M. (2010). Graph kernels. Journal of Machine Learning Research, 11(Apr), 1201-1242. * [4] Shervashidze, N., Schweitzer, P., Leeuwen, E. J. V., Mehlhorn, K., & Borgwardt, K. M. (2011). Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep), 2539-2561. * [5] Fischer, A., Riesen, K., & Bunke, H. (2017). 
Improved quadratic time approximation of graph edit distance by combining Hausdorff matching and greedy assignment. Pattern Recognition Letters, 87, 55-62. * [6] A graph distance metric based on the maximal common subgraph, H. Bunke and K. Shearer, Pattern Recognition Letters, 1998 + * [7] Perozzi, B., Al-Rfou, R., & Skiena, S. (2014, August). Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 701-710). ACM. ## Author(s) @@ -109,6 +121,26 @@ Jacques Fize, *jacques[dot]fize[at]cirad[dot]fr* Some algorithms from other projects were integrated to Gmatch4py. **Be assured that each code is associated with a reference to the original.** + +## CHANGELOG + +### 05.03.2019 + + * Add Graph Embedding algorithms + * Remove depreciated methods and classes + * Add logo + * Update documentation + + +### 25.02.2019 + * Add New Graph Class. Features : Cython Extensions, precomputed values (degrees, neighbor info), hash representation of edges and nodes for a faster comparison + * Some algorithms are parallelized such as graph edit distances or Jaccard + +## TODO List + + * Debug algorithms --> Random Walk Kernel, Deltacon + * Optimize algorithms --> Vertex Ranking +======= ## Improvements GMatch4py is going through some heavy changes to diminish the time execution of each algorithm. You may found an alpha version available in the branch `graph_cython`. 
@@ -118,4 +150,3 @@ As of today, the results are promising (up to  - * Write the documentation :runner: diff --git a/__init__.py b/__init__.py deleted file mode 100644 index 2149ff3590de1a34761513da1842dd061fa8af3f..0000000000000000000000000000000000000000 --- a/__init__.py +++ /dev/null @@ -1 +0,0 @@ -name = "gmatch4py" \ No newline at end of file diff --git a/gmatch4py/__init__.py b/gmatch4py/__init__.py index 0d1e527e3f64061d8fedff30ca5a79402f412f53..18e096c6616f9b3eb3d3aecb4d03f59d853c8023 100644 --- a/gmatch4py/__init__.py +++ b/gmatch4py/__init__.py @@ -8,9 +8,14 @@ from .ged.hausdorff_edit_distance import * # Kernels algorithms import from .kernels.weisfeiler_lehman import * +from .kernels.shortest_path_kernel import * +# Graph Embedding import +from .embedding.graph2vec import * +from .embedding.deepwalk import * # Helpers import from .helpers.reader import * +from .helpers.general import * # Basic algorithms import from .bag_of_cliques import * diff --git a/gmatch4py/alg_types.pyx b/gmatch4py/alg_types.pyx deleted file mode 100644 index 83ae12b41548a7802f08a8449120236ab440e0b2..0000000000000000000000000000000000000000 --- a/gmatch4py/alg_types.pyx +++ /dev/null @@ -1,7 +0,0 @@ -# coding = utf-8 -from enum import Enum - - -class AlgorithmType(Enum): - similarity = 0 - distance = 1 \ No newline at end of file diff --git a/gmatch4py/bag_of_cliques.pyx b/gmatch4py/bag_of_cliques.pyx index f418683eaa1ebf0a9a3482b208979f5a05ef26e5..4381672103d56a446fb713673737966d31332821 100644 --- a/gmatch4py/bag_of_cliques.pyx +++ b/gmatch4py/bag_of_cliques.pyx @@ -9,7 +9,7 @@ cimport numpy as np from scipy.sparse import csr_matrix,lil_matrix import sys -from .base cimport Base,intersection +from .base cimport Base cdef class BagOfCliques(Base): diff --git a/gmatch4py/base.pxd b/gmatch4py/base.pxd index f36f2a903ab4a1de9a13f189be4187972bed8903..9b03236261ba55b450e58ab7dc2be95c8437e9f7 100644 --- a/gmatch4py/base.pxd +++ b/gmatch4py/base.pxd @@ -4,12 +4,16 @@ cdef class Base: 
## Attribute(s) cdef int type_alg cdef bint normalized - + cdef int cpu_count + cdef str node_attr_key + cdef str edge_attr_key ## Methods cpdef np.ndarray compare(self,list graph_list, list selected) + cpdef np.ndarray compare_old(self,list listgs, list selected) cpdef np.ndarray distance(self, np.ndarray matrix) cpdef np.ndarray similarity(self, np.ndarray matrix) cpdef bint isAccepted(self,G,index,selected) + cpdef np.ndarray get_selected_array(self,selected,size_corpus) + + cpdef set_attr_graph_used(self, str node_attr_key, str edge_attr_key) -cpdef intersection(G,H) -cpdef union_(G,H) diff --git a/gmatch4py/base.pyx b/gmatch4py/base.pyx index 2ab8723af364f53ceff539ab2b44f9eb54e228ee..bd49edbaeac45e8ecd3a7ade40c6eb8b8cc94e2b 100644 --- a/gmatch4py/base.pyx +++ b/gmatch4py/base.pyx @@ -3,6 +3,10 @@ import numpy as np cimport numpy as np import networkx as nx +cimport cython +import multiprocessing + + cpdef np.ndarray minmax_scale(np.ndarray matrix): """ @@ -17,85 +21,6 @@ cpdef np.ndarray minmax_scale(np.ndarray matrix): return x/(max_) - -cpdef intersection(G, H): - """ - Return a new graph that contains only the edges and nodes that exist in - both G and H. - - The node sets of H and G must be the same. - - Parameters - ---------- - G,H : graph - A NetworkX graph. G and H must have the same node sets. - - Returns - ------- - GH : A new graph with the same type as G. - - Notes - ----- - Attributes from the graph, nodes, and edges are not copied to the new - graph. 
If you want a new graph of the intersection of G and H - with the attributes (including edge data) from G use remove_nodes_from() - as follows - - >>> G=nx.path_graph(3) - >>> H=nx.path_graph(5) - >>> R=G.copy() - >>> R.remove_nodes_from(n for n in G if n not in H) - - Modified so it can be used with two graphs with different nodes set - """ - # create new graph - R = nx.create_empty_copy(G) - - if not G.is_multigraph() == H.is_multigraph(): - raise nx.NetworkXError('G and H must both be graphs or multigraphs.') - if G.number_of_edges() <= H.number_of_edges(): - if G.is_multigraph(): - edges = G.edges(keys=True) - else: - edges = G.edges() - for e in edges: - if H.has_edge(*e): - R.add_edge(*e) - else: - if H.is_multigraph(): - edges = H.edges(keys=True) - else: - edges = H.edges() - for e in edges: - if G.has_edge(*e): - R.add_edge(*e) - nodes_g=set(G.nodes()) - nodes_h=set(H.nodes()) - R.remove_nodes_from(list(nodes_g - nodes_h)) - return R - -cpdef union_(G, H): - """ - Return a graph that contains nodes and edges from both graph G and H. - - Parameters - ---------- - G : networkx.Graph - First graph - H : networkx.Graph - Second graph - - Returns - ------- - networkx.Graph - A new graph with the same type as G. - """ - R = nx.create_empty_copy(G) - R.add_nodes_from(H.nodes(data=True)) - R.add_edges_from(G.edges(data=True)) - R.add_edges_from(H.edges(data=True)) - return R - cdef class Base: """ This class define the common methods to all Graph Matching algorithm. 
@@ -115,7 +40,7 @@ cdef class Base: self.type_alg=0 self.normalized=False - def __init__(self,type_alg,normalized): + def __init__(self,type_alg,normalized,node_attr_key="",edge_attr_key=""): """ Constructor of Base @@ -136,6 +61,66 @@ cdef class Base: else: self.type_alg=type_alg self.normalized=normalized + self.cpu_count=multiprocessing.cpu_count() + self.node_attr_key=node_attr_key + self.edge_attr_key=edge_attr_key + + cpdef set_attr_graph_used(self, str node_attr_key, str edge_attr_key): + """ + Set the graph attributes used by the algorithm to compare graphs. + Parameters + ---------- + node_attr_key : str + key of the node attribute + edge_attr_key : str + key of the edge attribute + + """ + self.node_attr_key=node_attr_key + self.edge_attr_key=edge_attr_key + + cpdef np.ndarray get_selected_array(self,selected,size_corpus): + """ + Return an array that defines which graphs will be compared by the algorithm. + Parameters + ---------- + selected : list + indices of the graphs you wish to compare + size_corpus : int + size of your dataset + + Returns + ------- + np.ndarray + selected vector (1 -> selected, 0 -> not selected) + """ + cdef double[:] selected_test = np.zeros(size_corpus) + if selected is not None: + for ix in range(len(selected)): + selected_test[selected[ix]]=1 + return np.array(selected_test) + else: + return np.array(selected_test)+1 + + + cpdef np.ndarray compare_old(self,list listgs, list selected): + """ + Will soon be deprecated! Keeps the old version of an algorithm. + Parameters + ---------- + listgs : list + list of graphs + selected : list + indices of the selected graphs + + Returns + ------- + np.ndarray + distance/similarity matrix + """ + pass + + @cython.boundscheck(False) cpdef np.ndarray compare(self,list graph_list, list selected): """ Return the similarity/distance matrix using the current algorithm. 
@@ -153,7 +138,7 @@ cdef class Base: the None value Returns ------- - np.array + np.ndarray distance/similarity matrix """ @@ -164,12 +149,12 @@ cdef class Base: Return a normalized distance matrix Parameters ---------- - matrix : np.array - Similarity/distance matrix you want to transform + matrix : np.ndarray + Similarity/distance matrix you wish to transform Returns ------- - np.array + np.ndarray distance matrix """ if self.type_alg == 1: @@ -186,8 +171,8 @@ cdef class Base: Return a normalized similarity matrix Parameters ---------- - matrix : np.array - Similarity/distance matrix you want to transform + matrix : np.ndarray + Similarity/distance matrix you wish to transform Returns ------- @@ -201,30 +186,12 @@ cdef class Base: matrix=np.ma.getdata(minmax_scale(matrix)) return 1-matrix - def mcs(self, G, H): - """ - Return the Most Common Subgraph of - Parameters - ---------- - G : networkx.Graph - First Graph - H : networkx.Graph - Second Graph - - Returns - ------- - networkx.Graph - Most common Subgrah - """ - R=G.copy() - R.remove_nodes_from(n for n in G if n not in H) - return R cpdef bint isAccepted(self,G,index,selected): """ Indicate if the graph will be compared to the other. 
A graph is "accepted" if : - * G exists(!= None) and not empty (|vertices(G)| >0) - * If selected graph to compare were indicated, check if G exists in selected + * G exists(!= None) and not empty (|vertices(G)| >0) + * If selected graph to compare were indicated, check if G exists in selected Parameters ---------- @@ -244,7 +211,7 @@ cdef class Base: if not G: f=False elif len(G)== 0: - f=False + f=False if selected: if not index in selected: f=False diff --git a/gmatch4py/bon.pyx b/gmatch4py/bon.pyx index 396231e7f98677ce6c3ce5db2281979fa8e768c5..0cc7e0e14061251150739f8adbbf76d333b88aaf 100644 --- a/gmatch4py/bon.pyx +++ b/gmatch4py/bon.pyx @@ -11,7 +11,7 @@ cdef class BagOfNodes(Base): We could call this algorithm Bag of nodes """ def __init__(self): - Base.__init__(self,0,True) + Base.__init__(self,0,True) cpdef np.ndarray compare(self,list graph_list, list selected): nodes = list() diff --git a/gmatch4py/deltacon.pyx b/gmatch4py/deltacon.pyx deleted file mode 100644 index a4d01b05816bb3bacf08818f42cf954cb80dc524..0000000000000000000000000000000000000000 --- a/gmatch4py/deltacon.pyx +++ /dev/null @@ -1,153 +0,0 @@ -# coding = utf-8 - -import networkx as nx -import numpy as np -import scipy.sparse - - -class DeltaCon0(): - __type__ = "sim" - - @staticmethod - def compare(list_gs,selected): - n=len(list_gs) - - comparison_matrix = np.zeros((n,n)) - for i in range(n): - for j in range(i,n): - g1,g2=list_gs[i],list_gs[j] - f=True - if not list_gs[i] or not list_gs[j]: - f=False - elif len(list_gs[i])== 0 or len(list_gs[j]) == 0: - f=False - if selected: - if not i in selected: - f=False - if f: - # S1 - epsilon = 1/(1+DeltaCon0.maxDegree(g1)) - D, A = DeltaCon0.degreeAndAdjacencyMatrix(g1) - S1 = np.linalg.inv(np.identity(len(g1))+(epsilon**2)*D -epsilon*A) - - # S2 - D, A = DeltaCon0.degreeAndAdjacencyMatrix(g2) - epsilon = 1 / (1 + DeltaCon0.maxDegree(g2)) - S2 = np.linalg.inv(np.identity(len(g2))+(epsilon**2)*D -epsilon*A) - - - comparison_matrix[i,j] = 
1/(1+DeltaCon0.rootED(S1,S2)) - comparison_matrix[j,i] = comparison_matrix[i,j] - else: - comparison_matrix[i, j] = 0. - comparison_matrix[j, i] = comparison_matrix[i, j] - - - return comparison_matrix - - @staticmethod - def rootED(S1,S2): - return np.sqrt(np.sum((S1-S2)**2)) # Long live numpy ! - - @staticmethod - def degreeAndAdjacencyMatrix(G): - """ - Return the Degree(D) and Adjacency Matrix(A) from a graph G. - Inspired of nx.laplacian_matrix(G,nodelist,weight) code proposed by networkx - :param G: - :return: - """ - A = nx.to_scipy_sparse_matrix(G, nodelist=list(G.nodes), weight="weight", - format='csr') - n, m = A.shape - diags = A.sum(axis=1) - D = scipy.sparse.spdiags(diags.flatten(), [0], m, n, format='csr') - - return D, A - @staticmethod - def maxDegree(G): - degree_sequence = sorted(nx.degree(G).values(), reverse=True) # degree sequence - # print "Degree sequence", degree_sequence - dmax = max(degree_sequence) - return dmax - -class DeltaCon(): - __type__ = "sim" - - @staticmethod - def relabel_nodes(graph_list): - label_lookup = {} - label_counter = 0 - n= len(graph_list) - # label_lookup is an associative array, which will contain the - # mapping from multiset labels (strings) to short labels - # (integers) - for i in range(n): - nodes = list(graph_list[i].nodes) - - for j in range(len(nodes)): - if not (nodes[j] in label_lookup): - label_lookup[nodes[j]] = label_counter - label_counter += 1 - - graph_list[i] = nx.relabel_nodes(graph_list[i], label_lookup) - return graph_list - @staticmethod - def compare(list_gs, g=3): - n=len(list_gs) - list_gs=DeltaCon.relabel_nodes(list_gs) - comparison_matrix = np.zeros((n,n)) - for i in range(n): - for j in range(i,n): - g1,g2=list_gs[i],list_gs[j] - - V = list(g1.nodes) - V.extend(list(g2.nodes)) - V=np.unique(V) - - partitions=V.copy() - np.random.shuffle(partitions) - if len(partitions)< g: - partitions=np.array([partitions]) - else: - partitions=np.array_split(partitions,g) - partitions_e_1 = 
DeltaCon.partitions2e(partitions, list(g1.nodes)) - partitions_e_2 = DeltaCon.partitions2e(partitions, list(g2.nodes)) - S1,S2=[],[] - for k in range(len(partitions)): - s0k1,s0k2=partitions_e_1[k],partitions_e_2[k] - - # S1 - epsilon = 1/(1+DeltaCon0.maxDegree(g1)) - D, A = DeltaCon0.degreeAndAdjacencyMatrix(g1) - s1k = np.linalg.inv(np.identity(len(g1))+(epsilon**2)*D -epsilon*A) - s1k=np.linalg.solve(s1k,s0k1).tolist() - - # S2 - D, A = DeltaCon0.degreeAndAdjacencyMatrix(g2) - epsilon = 1 / (1 + DeltaCon0.maxDegree(g2)) - s2k= np.linalg.inv(np.identity(len(g2))+(epsilon**2)*D -epsilon*A) - s2k = np.linalg.solve(s2k, s0k2).tolist() - - - - S1.append(s1k) - S2.append(s2k) - - comparison_matrix[i,j] = 1/(1+DeltaCon0.rootED(np.array(S1),np.array(S2))) - comparison_matrix[j,i] = comparison_matrix[i,j] - - return comparison_matrix - - - @staticmethod - def partitions2e( partitions, V): - e = [ [] for i in range(len(partitions))] - for p in range(len(partitions)): - e[p] = [] - for i in range(len(V)): - if i in partitions[p]: - e[p].append(1.0) - else: - e[p].append(0.0) - return e \ No newline at end of file diff --git a/gmatch4py/embedding/__init__.py b/gmatch4py/embedding/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/gmatch4py/embedding/deepwalk.pyx b/gmatch4py/embedding/deepwalk.pyx new file mode 100644 index 0000000000000000000000000000000000000000..5e91f6f87f87d33d836528e9fa3fc98ced3e2cdf --- /dev/null +++ b/gmatch4py/embedding/deepwalk.pyx @@ -0,0 +1,173 @@ +#! 
/usr/bin/env python +# -*- coding: utf-8 -*- + +import os +import sys +import random + +from io import open +from argparse import ArgumentParser, FileType, ArgumentDefaultsHelpFormatter +from collections import Counter +from concurrent.futures import ProcessPoolExecutor +import logging +from multiprocessing import cpu_count + +import networkx as nx +import numpy as np +cimport numpy as np +from six import text_type as unicode +from six import iteritems +from six.moves import range + +from gensim.models import Word2Vec +from sklearn.metrics.pairwise import cosine_similarity +from joblib import Parallel, delayed +import psutil + +cimport cython +from ..base cimport Base +import graph as graph2 +import walks as serialized_walks +from skipgram import Skipgram + + +p = psutil.Process(os.getpid()) +try: + p.set_cpu_affinity(list(range(cpu_count()))) +except AttributeError: + try: + p.cpu_affinity(list(range(cpu_count()))) + except AttributeError: + pass + + +def process(gr, number_walks = 10, walk_length = 40, window_size = 5, vertex_freq_degree = False, workers = 1, representation_size = 64, max_memory_data_size = 1000000000, seed = 0): + """ + Return a DeepWalk embedding for a graph + + Parameters + ---------- + gr : nx.Graph + graph + number_walks : int, optional + Number of walk (the default is 10) + walk_length : int, optional + Length of the random walk started at each node (the default is 40) + window_size : int, optional + Window size of skipgram model. (the default is 5) + vertex_freq_degree : bool, optional + Use vertex degree to estimate the frequency of nodes (the default is False) + workers : int, optional + Number of parallel processes (the default is 1) + representation_size : int, optional + Number of latent dimensions to learn for each node (the default is 64) + max_memory_data_size : int, optional + 'Size to start dumping walks to disk, instead of keeping them in memory. 
(the default is 1000000000) + seed : int, optional + Seed for random walk generator (the default is 0) + + Returns + ------- + np.array + DeepWalk embedding + """ + + if len(gr.edges())<1: + return np.zeros((1,representation_size)) + G = graph2.from_networkx(gr.copy(), undirected=gr.is_directed()) + num_walks = len(G.nodes()) * number_walks + + data_size = num_walks * walk_length + + #print("Data size (walks*length): {}".format(data_size)) + + if data_size < max_memory_data_size: + #print("Walking...") + walks = graph2.build_deepwalk_corpus(G, num_paths=number_walks, + path_length=walk_length, alpha=0, rand=random.Random(seed)) + #print("Training...") + model = Word2Vec(walks, size=representation_size, + window=window_size, min_count=0, sg=1, hs=1, workers=workers) + else: + #print("Data size {} is larger than limit (max-memory-data-size: {}). Dumping walks to disk.".format( + # data_size, max_memory_data_size)) + #print("Walking...") + + walks_filebase = "temp.walks" + walk_files = serialized_walks.write_walks_to_disk(G, walks_filebase, num_paths=number_walks, + path_length=walk_length, alpha=0, rand=random.Random(seed), + num_workers=workers) + + #print("Counting vertex frequency...") + if not vertex_freq_degree: + vertex_counts = serialized_walks.count_textfiles( + walk_files, workers) + else: + # use degree distribution for frequency in tree + vertex_counts = G.degree(nodes=G.iterkeys()) + + #print("Training...") + walks_corpus = serialized_walks.WalksCorpus(walk_files) + model = Skipgram(sentences=walks_corpus, vocabulary_counts=vertex_counts, + size=representation_size, + window=window_size, min_count=0, trim_rule=None, workers=workers) + + return model.wv.vectors + + +cdef class DeepWalk(Base): + """ + Based on : + @inproceedings{Perozzi:2014:DOL:2623330.2623732, + author = {Perozzi, Bryan and Al-Rfou, Rami and Skiena, Steven}, + title = {DeepWalk: Online Learning of Social Representations}, + booktitle = {Proceedings of the 20th ACM SIGKDD International 
Conference on Knowledge Discovery and Data Mining}, + series = {KDD '14}, + year = {2014}, + isbn = {978-1-4503-2956-9}, + location = {New York, New York, USA}, + pages = {701--710}, + numpages = {10}, + url = {http://doi.acm.org/10.1145/2623330.2623732}, + doi = {10.1145/2623330.2623732}, + acmid = {2623732}, + publisher = {ACM}, + address = {New York, NY, USA}, + keywords = {deep learning, latent representations, learning with partial labels, network classification, online learning, social networks}, + } + + Original code : https://github.com/phanein/deepwalk + + Modified by : Jacques Fize + """ + + def __init__(self): + Base.__init__(self,0,True) + + def extract_embedding(self, listgs): + """ + Extract the DeepWalk embedding of each graph in `listgs` + + Parameters + ---------- + listgs : list + list of graphs + + Returns + ------- + list + list of embeddings + """ + + from tqdm import tqdm + models = Parallel(n_jobs = cpu_count())(delayed(process)(nx.Graph(g)) for g in tqdm(listgs,desc="Extracting Embeddings...")) + return models + + @cython.boundscheck(False) + cpdef np.ndarray compare(self,list listgs, list selected): + # `selected` is ignored + models = self.extract_embedding(listgs) + vector_matrix = np.array([mod.mean(axis=0) for mod in models]) # Average the node representations of each graph + cs = cosine_similarity(vector_matrix) + return cs + diff --git a/gmatch4py/embedding/graph.pyx b/gmatch4py/embedding/graph.pyx new file mode 100644 index 0000000000000000000000000000000000000000..7a87c9f4a9a96f70d88cb74955a15cfbfe2ce99a --- /dev/null +++ b/gmatch4py/embedding/graph.pyx @@ -0,0 +1,313 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +"""Graph utilities.""" + +import logging +import sys +from io import open +from os import path +from time import time +from glob import glob +from six.moves import range, zip, zip_longest +from six import iterkeys +from collections import defaultdict, Iterable +import random +from random import shuffle 
+from itertools import 
product,permutations +from scipy.io import loadmat +from scipy.sparse import issparse + +logger = logging.getLogger("deepwalk") + + +__author__ = "Bryan Perozzi" +__email__ = "bperozzi@cs.stonybrook.edu" + +LOGFORMAT = "%(asctime).19s %(levelname)s %(filename)s: %(lineno)s %(message)s" + +class Graph(defaultdict): + """Efficient basic implementation of nx `Graph' – Undirected graphs with self loops""" + def __init__(self): + super(Graph, self).__init__(list) + + def nodes(self): + return self.keys() + + def adjacency_iter(self): + return self.iteritems() + + def subgraph(self, nodes={}): + subgraph = Graph() + + for n in nodes: + if n in self: + subgraph[n] = [x for x in self[n] if x in nodes] + + return subgraph + + def make_undirected(self): + + t0 = time() + + for v in self.keys(): + for other in self[v]: + if v != other: + self[other].append(v) + + t1 = time() + logger.info('make_directed: added missing edges {}s'.format(t1-t0)) + + self.make_consistent() + return self + + def make_consistent(self): + t0 = time() + for k in iterkeys(self): + self[k] = list(sorted(set(self[k]))) + + t1 = time() + logger.info('make_consistent: made consistent in {}s'.format(t1-t0)) + + self.remove_self_loops() + + return self + + def remove_self_loops(self): + + removed = 0 + t0 = time() + + for x in self: + if x in self[x]: + self[x].remove(x) + removed += 1 + + t1 = time() + + logger.info('remove_self_loops: removed {} loops in {}s'.format(removed, (t1-t0))) + return self + + def check_self_loops(self): + for x in self: + for y in self[x]: + if x == y: + return True + + return False + + def has_edge(self, v1, v2): + if v2 in self[v1] or v1 in self[v2]: + return True + return False + + def degree(self, nodes=None): + if isinstance(nodes, Iterable): + return {v:len(self[v]) for v in nodes} + else: + return len(self[nodes]) + + def order(self): + "Returns the number of nodes in the graph" + return len(self) + + def number_of_edges(self): + "Returns the number of nodes in the 
graph" + return sum([self.degree(x) for x in self.keys()])/2 + + def number_of_nodes(self): + "Returns the number of nodes in the graph" + return self.order() + + def random_walk(self, path_length, alpha=0, rand=random.Random(), start=None): + """ Returns a truncated random walk. + + path_length: Length of the random walk. + alpha: probability of restarts. + start: the start node of the random walk. + """ + G = self + if start: + path = [start] + else: + # Sampling is uniform w.r.t V, and not w.r.t E + path = [rand.choice(list(G.keys()))] + + while len(path) < path_length: + cur = path[-1] + if len(G[cur]) > 0: + if rand.random() >= alpha: + path.append(rand.choice(G[cur])) + else: + path.append(path[0]) + else: + break + return [str(node) for node in path] + +# TODO add build_walks in here + +def build_deepwalk_corpus(G, num_paths, path_length, alpha=0, + rand=random.Random(0)): + walks = [] + + nodes = list(G.nodes()) + + for cnt in range(num_paths): + rand.shuffle(nodes) + for node in nodes: + walks.append(G.random_walk(path_length, rand=rand, alpha=alpha, start=node)) + + return walks + +def build_deepwalk_corpus_iter(G, num_paths, path_length, alpha=0, + rand=random.Random(0)): + walks = [] + + nodes = list(G.nodes()) + + for cnt in range(num_paths): + rand.shuffle(nodes) + for node in nodes: + yield G.random_walk(path_length, rand=rand, alpha=alpha, start=node) + + +def clique(size): + return from_adjlist(permutations(range(1,size+1))) + + +# http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python +def grouper(n, iterable, padvalue=None): + "grouper(3, 'abcdefg', 'x') --> ('a','b','c'), ('d','e','f'), ('g','x','x')" + return zip_longest(*[iter(iterable)]*n, fillvalue=padvalue) + +def parse_adjacencylist(f): + adjlist = [] + for l in f: + if l and l[0] != "#": + introw = [int(x) for x in l.strip().split()] + row = [introw[0]] + row.extend(set(sorted(introw[1:]))) + adjlist.extend([row]) + + return adjlist + +def 
parse_adjacencylist_unchecked(f): + adjlist = [] + for l in f: + if l and l[0] != "#": + adjlist.extend([[int(x) for x in l.strip().split()]]) + + return adjlist + +def load_adjacencylist(file_, undirected=False, chunksize=10000, unchecked=True): + + if unchecked: + parse_func = parse_adjacencylist_unchecked + convert_func = from_adjlist_unchecked + else: + parse_func = parse_adjacencylist + convert_func = from_adjlist + + adjlist = [] + + t0 = time() + + total = 0 + with open(file_) as f: + for idx, adj_chunk in enumerate(map(parse_func, grouper(int(chunksize), f))): + adjlist.extend(adj_chunk) + total += len(adj_chunk) + + t1 = time() + + logger.info('Parsed {} edges with {} chunks in {}s'.format(total, idx, t1-t0)) + + t0 = time() + G = convert_func(adjlist) + t1 = time() + + logger.info('Converted edges to graph in {}s'.format(t1-t0)) + + if undirected: + t0 = time() + G = G.make_undirected() + t1 = time() + logger.info('Made graph undirected in {}s'.format(t1-t0)) + + return G + + +def load_edgelist(file_, undirected=True): + G = Graph() + with open(file_) as f: + for l in f: + x, y = l.strip().split()[:2] + x = int(x) + y = int(y) + G[x].append(y) + if undirected: + G[y].append(x) + + G.make_consistent() + return G + + +def load_matfile(file_, variable_name="network", undirected=True): + mat_varables = loadmat(file_) + mat_matrix = mat_varables[variable_name] + + return from_numpy(mat_matrix, undirected) + + +def from_networkx(G_input, undirected=True): + G = Graph() + + for _, x in enumerate(G_input): + for y in iterkeys(G_input[x]): + G[x].append(y) + + if undirected: + G.make_undirected() + + return G + + +def from_numpy(x, undirected=True): + G = Graph() + + if issparse(x): + cx = x.tocoo() + for i,j,v in zip(cx.row, cx.col, cx.data): + G[i].append(j) + else: + raise Exception("Dense matrices not yet supported.") + + if undirected: + G.make_undirected() + + G.make_consistent() + return G + + +def from_adjlist(adjlist): + G = Graph() + + for row in 
adjlist: + node = row[0] + neighbors = row[1:] + G[node] = list(sorted(set(neighbors))) + + return G + + +def from_adjlist_unchecked(adjlist): + G = Graph() + + for row in adjlist: + node = row[0] + neighbors = row[1:] + G[node] = neighbors + + return G + + diff --git a/gmatch4py/embedding/graph2vec.pyx b/gmatch4py/embedding/graph2vec.pyx new file mode 100644 index 0000000000000000000000000000000000000000..461efaf8a15e75476d5db15e83a28980502f78d2 --- /dev/null +++ b/gmatch4py/embedding/graph2vec.pyx @@ -0,0 +1,184 @@ +import hashlib +import json +import glob + +import pandas as pd +import networkx as nx +from tqdm import tqdm +cimport numpy as np +import numpy.distutils.system_info as sysinfo + +from joblib import Parallel, delayed +from gensim.models.doc2vec import Doc2Vec, TaggedDocument +from sklearn.metrics.pairwise import cosine_similarity + +from ..base cimport Base +cimport cython + + +class WeisfeilerLehmanMachine: + """ + Weisfeiler Lehman feature extractor class. + """ + def __init__(self, graph, features, iterations): + """ + Initialization method which executes feature extraction. + + Parameters + ---------- + graph : nx.Graph + graph + features : dict + Feature hash table. + iterations : int + number of WL iteration + + """ + + self.iterations = iterations + self.graph = graph + self.features = features + self.nodes = self.graph.nodes() + self.extracted_features = [str(v) for k,v in features.items()] + self.do_recursions() + + def do_a_recursion(self): + """ + The method does a single WL recursion. + + Returns + ------- + dict + The hash table with extracted WL features. 
+ """ + + new_features = {} + for node in self.nodes: + nebs = self.graph.neighbors(node) + degs = [self.features[neb] for neb in nebs] + features = "_".join([str(self.features[node])]+list(set(sorted([str(deg) for deg in degs])))) + hash_object = hashlib.md5(features.encode()) + hashing = hash_object.hexdigest() + new_features[node] = hashing + self.extracted_features = self.extracted_features + list(new_features.values()) + return new_features + + def do_recursions(self): + """ + The method does a series of WL recursions. + """ + for iteration in range(self.iterations): + self.features = self.do_a_recursion() + + +def dataset_reader(graph): + """ + Function to extract features from a networkx graph + + Parameters + ---------- + graph : nx.Graph + graph + + Returns + ------- + dict + Features hash table. + """ + + features = dict(nx.degree(graph)) + + features = {k:v for k,v, in features.items()} + return graph, features + + +def feature_extractor(graph, ix, rounds): + """ + Function to extract WL features from a graph + + Parameters + ---------- + graph : nx.Graph + graph + ix : int + index of the graph in the dataset + rounds : int + number of WL iterations + + Returns + ------- + TaggedDocument + random walks + """ + + graph, features = dataset_reader(graph) + machine = WeisfeilerLehmanMachine(graph,features,rounds) + doc = TaggedDocument(words = machine.extracted_features , tags = ["g_{0}".format(ix)]) + return doc + + + +def generate_model(graphs, iteration = 2, dimensions = 64, min_count = 5, down_sampling = 0.0001, learning_rate = 0.0001, epochs = 10, workers = 4 ): + """ + Main function to read the graph list, extract features, learn the embedding and save it. 
+ + Parameters + ---------- + graphs : nx.Graph + Input graph + iteration : int, optional + number of iteration (the default is 2) + dimensions : int, optional + output vector dimension (the default is 64) + min_count : int, optional + min count parameter of Doc2vec model (the default is 5) + down_sampling : float, optional + Down sampling rate for frequent features. (the default is 0.0001) + learning_rate : float, optional + Initial learning rate (the default is 0.0001, which [default_description]) + epochs : int, optional + Number of epochs (the default is 10) + workers : int, optional + Number of workers (the default is 4) + + Returns + ------- + [type] + [description] + """ + + document_collections = Parallel(n_jobs = workers)(delayed(feature_extractor)(g, ix,iteration) for ix,g in tqdm(enumerate(graphs),desc="Extracting Features...")) + graphs=[nx.relabel_nodes(g,{node:str(node) for node in list(g.nodes)},copy=True) for g in graphs] + model = Doc2Vec(document_collections, + vector_size = dimensions, + window = 0, + min_count = min_count, + dm = 0, + sample = down_sampling, + workers = workers, + epochs = epochs, + alpha = learning_rate) + return model + +cdef class Graph2Vec(Base): + """ + Based on : + graph2vec: Learning distributed representations of graphs. 
+ Narayanan, Annamalai and Chandramohan, Mahinthan and Venkatesan, Rajasekar and Chen, Lihui and Liu, Yang + MLG 2017, 13th International Workshop on Mining and Learning with Graphs (MLGWorkshop 2017) + + Original Code : https://github.com/benedekrozemberczki/graph2vec + + Modified by : Jacques Fize + """ + + def __init__(self): + Base.__init__(self,0,True) + + @cython.boundscheck(False) + cpdef np.ndarray compare(self,list listgs, list selected): + # Selected is ignored + model = generate_model(listgs) + vector_matrix = model.docvecs.vectors_docs + cs = cosine_similarity(vector_matrix) + return cs diff --git a/gmatch4py/embedding/skipgram.pyx b/gmatch4py/embedding/skipgram.pyx new file mode 100644 index 0000000000000000000000000000000000000000..42c0770472b49db1f54324c1e1216924560d5aca --- /dev/null +++ b/gmatch4py/embedding/skipgram.pyx @@ -0,0 +1,30 @@ +from collections import Counter, Mapping +from concurrent.futures import ProcessPoolExecutor +import logging +from multiprocessing import cpu_count +from six import string_types + +from gensim.models import Word2Vec +from gensim.models.word2vec import Vocab + +logger = logging.getLogger("deepwalk") + +class Skipgram(Word2Vec): + """A subclass to allow more customization of the Word2Vec internals.""" + + def __init__(self, vocabulary_counts=None, **kwargs): + + self.vocabulary_counts = None + + kwargs["min_count"] = kwargs.get("min_count", 0) + kwargs["workers"] = kwargs.get("workers", cpu_count()) + kwargs["size"] = kwargs.get("size", 128) + kwargs["sentences"] = kwargs.get("sentences", None) + kwargs["window"] = kwargs.get("window", 10) + kwargs["sg"] = 1 + kwargs["hs"] = 1 + + if vocabulary_counts != None: + self.vocabulary_counts = vocabulary_counts + + super(Skipgram, self).__init__(**kwargs) diff --git a/gmatch4py/embedding/walks.pyx b/gmatch4py/embedding/walks.pyx new file mode 100644 index 0000000000000000000000000000000000000000..3da902341833293126bdf076548f8bbed1965d9e --- /dev/null +++ 
b/gmatch4py/embedding/walks.pyx @@ -0,0 +1,103 @@ +import logging +from io import open +from os import path +from time import time +from multiprocessing import cpu_count +import random +from concurrent.futures import ProcessPoolExecutor +from collections import Counter + +from six.moves import zip + +from . import graph + +logger = logging.getLogger("deepwalk") + +__current_graph = None + +# speed up the string encoding +__vertex2str = None + +def count_words(file): + """ Counts the word frequencies in a list of sentences. + + Note: + This is a helper function for parallel execution of `Vocabulary.from_text` + method. + """ + c = Counter() + with open(file, 'r') as f: + for l in f: + words = l.strip().split() + c.update(words) + return c + + +def count_textfiles(files, workers=1): + c = Counter() + with ProcessPoolExecutor(max_workers=workers) as executor: + for c_ in executor.map(count_words, files): + c.update(c_) + return c + + +def count_lines(f): + if path.isfile(f): + num_lines = sum(1 for line in open(f)) + return num_lines + else: + return 0 + +def _write_walks_to_disk(args): + num_paths, path_length, alpha, rand, f = args + G = __current_graph + t_0 = time() + with open(f, 'w') as fout: + for walk in graph.build_deepwalk_corpus_iter(G=G, num_paths=num_paths, path_length=path_length, + alpha=alpha, rand=rand): + fout.write(u"{}\n".format(u" ".join(v for v in walk))) + logger.debug("Generated new file {}, it took {} seconds".format(f, time() - t_0)) + return f + +def write_walks_to_disk(G, filebase, num_paths, path_length, alpha=0, rand=random.Random(0), num_workers=cpu_count(), + always_rebuild=True): + global __current_graph + __current_graph = G + files_list = ["{}.{}".format(filebase, str(x)) for x in list(range(num_paths))] + expected_size = len(G) + args_list = [] + files = [] + + if num_paths <= num_workers: + paths_per_worker = [1 for x in range(num_paths)] + else: + paths_per_worker = [len(list(filter(lambda z: z!= None, [y for y in x]))) + for x in 
graph.grouper(int(num_paths / num_workers)+1, range(1, num_paths+1))] + + with ProcessPoolExecutor(max_workers=num_workers) as executor: + for size, file_, ppw in zip(executor.map(count_lines, files_list), files_list, paths_per_worker): + if always_rebuild or size != (ppw*expected_size): + args_list.append((ppw, path_length, alpha, random.Random(rand.randint(0, 2**31)), file_)) + else: + files.append(file_) + + with ProcessPoolExecutor(max_workers=num_workers) as executor: + for file_ in executor.map(_write_walks_to_disk, args_list): + files.append(file_) + + return files + +class WalksCorpus(object): + def __init__(self, file_list): + self.file_list = file_list + def __iter__(self): + for file in self.file_list: + with open(file, 'r') as f: + for line in f: + yield line.split() + +def combine_files_iter(file_list): + for file in file_list: + with open(file, 'r') as f: + for line in f: + yield line.split() diff --git a/gmatch4py/ged/abstract_graph_edit_dist.pxd b/gmatch4py/ged/abstract_graph_edit_dist.pxd index 9a4b92de1195a188808a4d5223509d59a3be1922..ee01fe9518f21970f26ff2e1a5cba84fca556e8c 100644 --- a/gmatch4py/ged/abstract_graph_edit_dist.pxd +++ b/gmatch4py/ged/abstract_graph_edit_dist.pxd @@ -16,3 +16,4 @@ cdef class AbstractGraphEditDistance(Base): cdef double insert_cost(self, int i, int j, nodesH, H) cdef double delete_cost(self, int i, int j, nodesG, G) cpdef double substitute_cost(self, node1, node2, G, H) + diff --git a/gmatch4py/ged/abstract_graph_edit_dist.pyx b/gmatch4py/ged/abstract_graph_edit_dist.pyx index 0dfec73747d95879983f80a794e7bfbc599d5244..95ba8d42e082c1d00b914cbe9c00028175e20862 100644 --- a/gmatch4py/ged/abstract_graph_edit_dist.pyx +++ b/gmatch4py/ged/abstract_graph_edit_dist.pyx @@ -2,11 +2,23 @@ from __future__ import print_function import sys +import warnings import numpy as np -from scipy.optimize import linear_sum_assignment cimport numpy as np +import networkx as nx +from cython.parallel cimport prange,parallel + +try: + from 
munkres import munkres +except ImportError: + warnings.warn("To obtain optimal results install the Cython 'munkres' module at https://github.com/jfrelinger/cython-munkres-wrapper") + from scipy.optimize import linear_sum_assignment as munkres + from ..base cimport Base +from ..helpers.general import parsenx2graph + + cdef class AbstractGraphEditDistance(Base): @@ -22,8 +34,19 @@ cdef class AbstractGraphEditDistance(Base): cpdef double distance_ged(self,G,H): """ - Return the distance between G and H - :return: + Return the distance value between G and H + + Parameters + ---------- + G : gmatch4py.Graph + graph + H : gmatch4py.Graph + graph + + Returns + ------- + int + distance """ cdef list opt_path = self.edit_costs(G,H) return np.sum(opt_path) @@ -32,12 +55,21 @@ cdef class AbstractGraphEditDistance(Base): cdef list edit_costs(self, G, H): """ Return the optimal path edit cost list, to transform G into H - :return: + + Parameters + ---------- + G : gmatch4py.Graph + graph + H : gmatch4py.Graph + graph + + Returns + ------- + np.array + edit path """ cdef np.ndarray cost_matrix = self.create_cost_matrix(G,H).astype(float) - row_ind,col_ind = linear_sum_assignment(cost_matrix) - cdef int f=len(row_ind) - return [cost_matrix[row_ind[i]][col_ind[i]] for i in range(f)] + return cost_matrix[munkres(cost_matrix)].tolist() cpdef np.ndarray create_cost_matrix(self, G, H): """ @@ -52,9 +84,26 @@ cdef class AbstractGraphEditDistance(Base): delete | delete -> delete The delete -> delete region is filled with zeros + + Parameters + ---------- + G : gmatch4py.Graph + graph + H : gmatch4py.Graph + graph + + Returns + ------- + np.array + cost matrix """ - cdef int n = G.number_of_nodes() - cdef int m = H.number_of_nodes() + cdef int n,m + try: + n = G.number_of_nodes() + m = H.number_of_nodes() + except: + n = G.size() + m = H.size() cdef np.ndarray cost_matrix = np.zeros((n+m,n+m)) cdef list nodes1 = list(G.nodes()) cdef list nodes2 = list(H.nodes()) @@ -74,26 +123,55 @@ cdef 
class AbstractGraphEditDistance(Base): return cost_matrix cdef double insert_cost(self, int i, int j, nodesH, H): + """ + Return the insert cost of the i-th node in H + + Returns + ------- + float + insert cost + """ raise NotImplementedError cdef double delete_cost(self, int i, int j, nodesG, G): + """ + Return the delete cost of the i-th node in G + + Returns + ------- + float + delete cost + """ raise NotImplementedError cpdef double substitute_cost(self, node1, node2, G, H): + """ + Return the substitution cost between node1 in G and node2 in H + + Returns + ------- + float + substitution cost + """ raise NotImplementedError + cpdef np.ndarray compare(self,list listgs, list selected): cdef int n = len(listgs) - cdef np.ndarray comparison_matrix = np.zeros((n, n)).astype(float) + cdef double[:,:] comparison_matrix = np.zeros((n, n)) + listgs=parsenx2graph(listgs,self.node_attr_key,self.edge_attr_key) + cdef long[:] n_nodes = np.array([g.size() for g in listgs]) + cdef double[:] selected_test = np.array(self.get_selected_array(selected,n)) cdef int i,j - for i in range(n): - for j in range(n): - g1,g2=listgs[i],listgs[j] - f=self.isAccepted(g1,i,selected) - if f: - comparison_matrix[i, j] = self.distance_ged(g1, g2) - else: - comparison_matrix[i, j] = np.inf + cdef float inf=np.inf + + with nogil, parallel(num_threads=self.cpu_count): + for i in prange(n,schedule='static'): + for j in range(n): + if n_nodes[i]>0 and n_nodes[j]>0 and selected_test[i] == 1 : + with gil: + comparison_matrix[i][j] = self.distance_ged(listgs[i],listgs[j]) + else: + comparison_matrix[i][j] = inf #comparison_matrix[j, i] = comparison_matrix[i, j] - np.fill_diagonal(comparison_matrix,0) - return comparison_matrix + return np.array(comparison_matrix) diff --git a/gmatch4py/ged/bipartite_graph_matching_2.pyx b/gmatch4py/ged/bipartite_graph_matching_2.pyx index 59e33e08781c00775e36269ca056b48b771f90ca..a23a5ae27912c402d1243a6fb96a2d603d442edd 100644 --- 
a/gmatch4py/ged/bipartite_graph_matching_2.pyx +++ b/gmatch4py/ged/bipartite_graph_matching_2.pyx @@ -2,6 +2,9 @@ import numpy as np cimport numpy as np from ..base cimport Base +from cython.parallel cimport prange,parallel +from ..helpers.general import parsenx2graph +cimport cython cdef class BP_2(Base): @@ -32,21 +35,28 @@ cdef class BP_2(Base): self.edge_del = edge_del self.edge_ins = edge_ins + + @cython.boundscheck(False) cpdef np.ndarray compare(self,list listgs, list selected): cdef int n = len(listgs) - cdef np.ndarray comparison_matrix = np.zeros((n, n)).astype(float) + cdef list new_gs=parsenx2graph(listgs) + cdef double[:,:] comparison_matrix = np.zeros((n, n)) + cdef double[:] selected_test = self.get_selected_array(selected,n) cdef int i,j - for i in range(n): - for j in range(i, n): - g1,g2=listgs[i],listgs[j] - f=self.isAccepted(g1,i,selected) - if f: - comparison_matrix[i, j] = self.bp2(g1, g2) - else: - comparison_matrix[i, j] = np.inf - comparison_matrix[j, i] = comparison_matrix[i, j] + cdef long[:] n_nodes = np.array([g.size() for g in new_gs]) + cdef long[:] n_edges = np.array([g.density() for g in new_gs]) + + with nogil, parallel(num_threads=self.cpu_count): + for i in prange(n,schedule='static'): + for j in range(i,n): + if n_nodes[i] > 0 and n_nodes[j] > 0 and selected_test[i] == 1: + with gil: + comparison_matrix[i, j] = self.bp2(new_gs[i], new_gs[j]) + else: + comparison_matrix[i, j] = 0 + comparison_matrix[j, i] = comparison_matrix[i, j] - return comparison_matrix + return np.array(comparison_matrix) cdef double bp2(self, g1, g2): @@ -55,9 +65,9 @@ cdef class BP_2(Base): Parameters ---------- - g1 : networkx.Graph + g1 : gmatch4py.Graph First Graph - g2 : networkx.Graph + g2 : gmatch4py.Graph Second Graph Returns @@ -100,8 +110,8 @@ cdef class BP_2(Base): list containing costs from the optimal edit path """ cdef list psi_=[] - cdef list nodes1 = list(g1.nodes) - cdef list nodes2 = list(g2.nodes) + cdef list nodes1 = list(g1.nodes()) + 
cdef list nodes2 = list(g2.nodes()) for u in nodes1: v=None for w in nodes2: @@ -118,33 +128,25 @@ cdef class BP_2(Base): return psi_ - cdef float sum_fuv(self, g1, g2): - """ - Compute Nearest Neighbour Distance between G1 and G2 - :param g1: First Graph - :param g2: Second Graph - :return: - """ - cdef np.ndarray min_sum = np.zeros(len(g1)) - nodes1 = list(g1.nodes) - nodes2 = list(g2.nodes) - nodes2.extend([None]) - cdef np.ndarray min_i - for i in range(len(nodes1)): - min_i = np.zeros(len(nodes2)) - for j in range(len(nodes2)): - min_i[j] = self.fuv(g1, g2, nodes1[i], nodes2[j]) - min_sum[i] = np.min(min_i) - return np.sum(min_sum) - cdef float fuv(self, g1, g2, n1, n2): + cdef float fuv(self, g1, g2, str n1, str n2): """ Compute the Node Distance function - :param g1: first graph - :param g2: second graph - :param n1: node of the first graph - :param n2: node of the second graph - :return: + Parameters + ---------- + g1 : gmatch4py.Graph + First graph + g2 : gmatch4py.Graph + Second graph + n1 : int or str + identifier of the first node + n2 : int or str + identifier of the second node + + Returns + ------- + float + node distance """ if n2 == None: # Del return self.node_del + ((self.edge_del / 2.) 
* g1.degree(n1)) @@ -155,31 +157,51 @@ cdef class BP_2(Base): return 0 return (self.node_del + self.node_ins + self.hed_edge(g1, g2, n1, n2)) / 2 - cdef float hed_edge(self, g1, g2, n1, n2): + cdef float hed_edge(self, g1, g2, str n1, str n2): """ Compute HEDistance between edges of n1 and n2, respectively in g1 and g2 - :param g1: first graph - :param g2: second graph - :param n1: node of the first graph - :param n2: node of the second graph - :return: + Parameters + ---------- + g1 : gmatch4py.Graph + First graph + g2 : gmatch4py.Graph + Second graph + n1 : int or str + identifier of the first node + n2 : int or str + identifier of the second node + + Returns + ------- + float + HEDistance between g1 and g2 """ return self.sum_gpq(g1, n1, g2, n2) + self.sum_gpq(g1, n1, g2, n2) - cdef float sum_gpq(self, g1, n1, g2, n2): + cdef float sum_gpq(self, g1, str n1, g2, str n2): """ Compute Nearest Neighbour Distance between edges around n1 in G1 and edges around n2 in G2 - :param g1: first graph - :param n1: node in the first graph - :param g2: second graph - :param n2: node in the second graph - :return: + Parameters + ---------- + g1 : gmatch4py.Graph + First graph + g2 : gmatch4py.Graph + Second graph + n1 : int or str + identifier of the first node + n2 : int or str + identifier of the second node + + Returns + ------- + float + Nearest Neighbour Distance """ #if isinstance(g1, nx.MultiDiGraph): - cdef list edges1 = list(g1.edges(n1)) if n1 else [] - cdef list edges2 = list(g2.edges(n2)) if n2 else [] + cdef list edges1 = g1.get_edges_no(n1) if n1 else [] + cdef list edges2 = g2.get_edges_no(n2) if n2 else [] cdef np.ndarray min_sum = np.zeros(len(edges1)) edges2.extend([None]) @@ -191,13 +213,21 @@ cdef class BP_2(Base): min_sum[i] = np.min(min_i) return np.sum(min_sum) - cdef float gpq(self, tuple e1, tuple e2): + cdef float gpq(self, str e1, str e2): """ Compute the edge distance function - :param e1: edge1 - :param e2: edge2 - :return: + Parameters + ---------- 
+ e1 : str + first edge identifier + e2 : str + second edge identifier + Returns + ------- + float + edge distance """ + if e2 == None: # Del return self.edge_del if e1 == None: # Insert diff --git a/gmatch4py/ged/graph_edit_dist.pxd b/gmatch4py/ged/graph_edit_dist.pxd index 975f39b1ab05df1b55d8d571afc4867d34b63fcf..18020dab052797e7bef10bc24837b4d8802e3f0a 100644 --- a/gmatch4py/ged/graph_edit_dist.pxd +++ b/gmatch4py/ged/graph_edit_dist.pxd @@ -4,6 +4,7 @@ from .abstract_graph_edit_dist cimport AbstractGraphEditDistance cdef class GraphEditDistance(AbstractGraphEditDistance): + cpdef object relabel_cost(self, node1, node2, G, H) cpdef double substitute_cost(self, node1, node2, G, H) cdef double delete_cost(self, int i, int j, nodesG, G) cdef double insert_cost(self, int i, int j, nodesH, H) \ No newline at end of file diff --git a/gmatch4py/ged/graph_edit_dist.pyx b/gmatch4py/ged/graph_edit_dist.pyx index 331d0bb05fa53831b001f25742bf76818846e21e..7dd400fea21b93b0fd8c19b3e6e63df95791e896 100644 --- a/gmatch4py/ged/graph_edit_dist.pyx +++ b/gmatch4py/ged/graph_edit_dist.pyx @@ -6,7 +6,7 @@ import networkx as nx import numpy as np cimport numpy as np from .abstract_graph_edit_dist cimport AbstractGraphEditDistance -from ..base cimport intersection,union_ + cdef class GraphEditDistance(AbstractGraphEditDistance): @@ -14,52 +14,43 @@ cdef class GraphEditDistance(AbstractGraphEditDistance): def __init__(self,node_del,node_ins,edge_del,edge_ins,weighted=False): AbstractGraphEditDistance.__init__(self,node_del,node_ins,edge_del,edge_ins) self.weighted=weighted + cpdef double substitute_cost(self, node1, node2, G, H): return self.relabel_cost(node1, node2, G, H) - def add_edges(self,node1,node2,G): - R=nx.create_empty_copy(G) - try: - R.add_edges_from(G.edges(node1,node2)) - except Exception as e: - # To counter bug with a None for attribute... weird ?? 
- arr_=G.edges(node1,node2) - new_list=[] - for item in arr_: - new_list.append((item[0],item[1])) - R.add_edges_from(new_list) - return R - - def relabel_cost(self, node1, node2, G, H): + cpdef object relabel_cost(self, node1, node2, G, H): ## If the two nodes are equal if node1 == node2 and G.degree(node1) == H.degree(node2): return 0.0 elif node1 == node2 and G.degree(node1) != H.degree(node2): - R = self.add_edges(node1,node2,G) - R2 = self.add_edges(node1,node2,H) - inter_=intersection(R,R2).number_of_edges() - add_diff=abs(R2.number_of_edges()-inter_) - del_diff=abs(R.number_of_edges()-inter_) + #R = Graph(self.add_edges(node1,node2,G),G.get_node_key(),G.get_egde_key()) + #R2 = Graph(self.add_edges(node1,node2,H),H.get_node_key(),H.get_egde_key()) + #inter_= R.size_edge_intersect(R2) + R=set(G.get_edges_no(node1)) + R2=set(H.get_edges_no(node2)) + inter_=R.intersection(R2) + add_diff=abs(len(R2)-len(inter_))#abs(R2.density()-inter_) + del_diff=abs(len(R)-len(inter_))#abs(R.density()-inter_) return (add_diff*self.edge_ins)+(del_diff*self.edge_del) # If the two nodes are connected - if (node1,node2) in G.edges() or (node2,node1) in G.edges(): + if G.has_edge(node1,node2) or G.has_edge(node2,node1): return self.node_ins+self.node_del - if not node2 in G: - nodesH=list(H.nodes()) - index=nodesH.index(node2) + if not node2 in G.nodes(): + nodesH=H.nodes() + index=list(nodesH).index(node2) return self.node_del+self.node_ins+self.insert_cost(index,index,nodesH,H) return sys.maxsize cdef double delete_cost(self, int i, int j, nodesG, G): if i == j: - return self.node_del+(G.degree(nodesG[i],weight=("weight" if self.weighted else None))*self.edge_del) # Deleting a node implicate to delete in and out edges + return self.node_del+(G.degree(nodesG[i],weight=True)*self.edge_del) # Deleting a node implies deleting its in and out edges return sys.maxsize cdef double insert_cost(self, int i, int j, nodesH, H): if i == j: - deg=H.degree(nodesH[j],weight=("weight" if 
self.weighted else 
None)) + deg=H.degree(nodesH[j],weight=True) if isinstance(deg,dict):deg=0 return self.node_ins+(deg*self.edge_ins) else: diff --git a/gmatch4py/ged/greedy_edit_distance.pyx b/gmatch4py/ged/greedy_edit_distance.pyx index b4908cb2337400eec20fa5d20fa6a704b0a61c36..9bdd2c47c9103a2c1a07d5845bbede52beb47b72 100644 --- a/gmatch4py/ged/greedy_edit_distance.pyx +++ b/gmatch4py/ged/greedy_edit_distance.pyx @@ -4,6 +4,7 @@ import sys from .graph_edit_dist cimport GraphEditDistance import numpy as np cimport numpy as np +from cython.parallel cimport prange,parallel cdef class GreedyEditDistance(GraphEditDistance): """ @@ -20,15 +21,6 @@ cdef class GreedyEditDistance(GraphEditDistance): cdef list edit_costs(self, G, H): cdef np.ndarray cost_matrix=self.create_cost_matrix(G,H) - """ - cdef np.ndarray cost_matrix_2=cost_matrix.copy() - cdef list psi=[] - for i in range(len(cost_matrix)): - phi_i=np.argmin((cost_matrix[i])) - cost_matrix=np.delete(cost_matrix,phi_i,1) - psi.append([i,phi_i+i]) #+i to compensate the previous column deletion - return [cost_matrix_2[psi[i][0]][psi[i][1]] for i in range(len(psi))] - """ cdef np.ndarray cost_matrix_2=cost_matrix.copy().astype(np.double) cdef list psi=[] for i in range(len(cost_matrix)): diff --git a/gmatch4py/ged/hausdorff_edit_distance.pyx b/gmatch4py/ged/hausdorff_edit_distance.pyx index d2327f6b35da425c0c302eb12b5964a7f9a81450..67d3484512c3054dba34eb5d085521480d785eab 100644 --- a/gmatch4py/ged/hausdorff_edit_distance.pyx +++ b/gmatch4py/ged/hausdorff_edit_distance.pyx @@ -3,6 +3,9 @@ import numpy as np cimport numpy as np from ..base cimport Base +from cython.parallel cimport prange,parallel +from ..helpers.general import parsenx2graph +cimport cython cdef class HED(Base): """ @@ -19,7 +22,20 @@ cdef class HED(Base): cdef int edge_ins def __init__(self, int node_del=1, int node_ins=1, int edge_del=1, int edge_ins=1): - """Constructor for HED""" + """ + HED Constructor + + Parameters + ---------- + node_del :int + Node deletion 
cost + node_ins : int + Node insertion cost + edge_del : int + Edge Deletion cost + edge_ins : int + Edge Insertion cost + """ Base.__init__(self,1,False) self.node_del = node_del self.node_ins = node_ins @@ -27,59 +43,93 @@ cdef class HED(Base): self.edge_ins = edge_ins + @cython.boundscheck(False) cpdef np.ndarray compare(self,list listgs, list selected): cdef int n = len(listgs) - cdef np.ndarray comparison_matrix = np.zeros((n, n)).astype(float) + cdef list new_gs=parsenx2graph(listgs,self.node_attr_key,self.edge_attr_key) + cdef double[:,:] comparison_matrix = np.zeros((n, n)) + cdef double[:] selected_test = np.array(self.get_selected_array(selected,n)) cdef int i,j - for i in range(n): - for j in range(i, n): - g1,g2=listgs[i],listgs[j] - f=self.isAccepted(g1,i,selected) - if f: - comparison_matrix[i, j] = self.hed(g1, g2) - else: - comparison_matrix[i, j] = np.inf - comparison_matrix[j, i] = comparison_matrix[i, j] + cdef long[:] n_nodes = np.array([g.size() for g in new_gs]) + cdef long[:] n_edges = np.array([g.density() for g in new_gs]) - return comparison_matrix + with nogil, parallel(num_threads=self.cpu_count): + for i in prange(n,schedule='static'): + for j in range(i,n): + if n_nodes[i] > 0 and n_nodes[j] > 0 and selected_test[i] == True: + with gil: + comparison_matrix[i, j] = self.hed(new_gs[i], new_gs[j]) + else: + comparison_matrix[i, j] = 0 + comparison_matrix[j, i] = comparison_matrix[i, j] + + return np.array(comparison_matrix) cdef float hed(self, g1, g2): """ - Compute de Hausdorff Edit Distance - :param g1: first graph - :param g2: second graph - :return: + Compute the HED similarity value between two `gmatch4py.Graph` + + Parameters + ---------- + g1 : gmatch4py.Graph + First Graph + g2 : gmatch4py.Graph + Second Graph + + Returns + ------- + float + similarity value """ return self.sum_fuv(g1, g2) + self.sum_fuv(g2, g1) cdef float sum_fuv(self, g1, g2): """ Compute Nearest Neighbour Distance between G1 and G2 - :param g1: First Graph - 
:param g2: Second Graph - :return: + Parameters + ---------- + g1 : gmatch4py.Graph + First graph + g2 : gmatch4py.Graph + Second graph + + Returns + ------- + float + Nearest Neighbour Distance """ - cdef np.ndarray min_sum = np.zeros(len(g1)) - nodes1 = list(g1.nodes) - nodes2 = list(g2.nodes) + + cdef np.ndarray min_sum = np.zeros(g1.size()) + cdef list nodes1 = list(g1.nodes()) + cdef list nodes2 = list(g2.nodes()) nodes2.extend([None]) cdef np.ndarray min_i - for i in range(len(nodes1)): - min_i = np.zeros(len(nodes2)) - for j in range(len(nodes2)): + for i in range(g1.size()): + min_i = np.zeros(g2.size()) + for j in range(g2.size()): min_i[j] = self.fuv(g1, g2, nodes1[i], nodes2[j]) min_sum[i] = np.min(min_i) return np.sum(min_sum) - cdef float fuv(self, g1, g2, n1, n2): + cdef float fuv(self, g1, g2, str n1, str n2): """ Compute the Node Distance function - :param g1: first graph - :param g2: second graph - :param n1: node of the first graph - :param n2: node of the second graph - :return: + Parameters + ---------- + g1 : gmatch4py.Graph + First graph + g2 : gmatch4py.Graph + Second graph + n1 : int or str + identifier of the first node + n2 : int or str + identifier of the second node + + Returns + ------- + float + node distance """ if n2 == None: # Del return self.node_del + ((self.edge_del / 2.) 
* g1.degree(n1)) @@ -90,31 +140,51 @@ cdef class HED(Base): return 0 return (self.node_del + self.node_ins + self.hed_edge(g1, g2, n1, n2)) / 2 - cdef float hed_edge(self, g1, g2, n1, n2): + cdef float hed_edge(self, g1, g2, str n1, str n2): """ Compute HEDistance between edges of n1 and n2, respectively in g1 and g2 - :param g1: first graph - :param g2: second graph - :param n1: node of the first graph - :param n2: node of the second graph - :return: + Parameters + ---------- + g1 : gmatch4py.Graph + First graph + g2 : gmatch4py.Graph + Second graph + n1 : int or str + identifier of the first node + n2 : int or str + identifier of the second node + + Returns + ------- + float + HEDistance between g1 and g2 """ return self.sum_gpq(g1, n1, g2, n2) + self.sum_gpq(g1, n1, g2, n2) - cdef float sum_gpq(self, g1, n1, g2, n2): + cdef float sum_gpq(self, g1, str n1, g2, str n2): """ Compute Nearest Neighbour Distance between edges around n1 in G1 and edges around n2 in G2 - :param g1: first graph - :param n1: node in the first graph - :param g2: second graph - :param n2: node in the second graph - :return: + Parameters + ---------- + g1 : gmatch4py.Graph + First graph + g2 : gmatch4py.Graph + Second graph + n1 : int or str + identifier of the first node + n2 : int or str + identifier of the second node + + Returns + ------- + float + Nearest Neighbour Distance """ #if isinstance(g1, nx.MultiDiGraph): - cdef list edges1 = list(g1.edges(n1)) if n1 else [] - cdef list edges2 = list(g2.edges(n2)) if n2 else [] + cdef list edges1 = g1.get_edges_no(n1) if n1 else [] # rename method ... 
+ cdef list edges2 = g2.get_edges_no(n2) if n2 else [] cdef np.ndarray min_sum = np.zeros(len(edges1)) edges2.extend([None]) @@ -126,12 +196,19 @@ cdef class HED(Base): min_sum[i] = np.min(min_i) return np.sum(min_sum) - cdef float gpq(self, tuple e1, tuple e2): + cdef float gpq(self, str e1, str e2): """ Compute the edge distance function - :param e1: edge1 - :param e2: edge2 - :return: + Parameters + ---------- + e1 : str + first edge identifier + e2 : str + second edge identifier + Returns + ------- + float + edge distance """ if e2 == None: # Del return self.edge_del diff --git a/gmatch4py/graph.pxd b/gmatch4py/graph.pxd new file mode 100644 index 0000000000000000000000000000000000000000..e1ef555680d968f2f84d573b08754cac0749706e --- /dev/null +++ b/gmatch4py/graph.pxd @@ -0,0 +1,122 @@ +cimport numpy as np + +cdef class Graph: + ################################## + # ATTRIBUTES + ################################## + + # GRAPH PROPERTY ATTRIBUTES + ########################### + cdef bint is_directed # If the graph is directed + cdef bint is_multi # If the graph is a Multi-Graph + cdef bint is_node_attr + cdef bint is_edge_attr + + # ATTR VAL ATTRIBUTES + ##################### + cdef str node_attr_key # Key that contains the main attr value for a node + cdef str edge_attr_key # Key that contains the main attr value for an edge + cdef set unique_node_attr_vals # set + cdef set unique_edge_attr_vals # set + + + ## NODE ATTRIBUTES + ################# + + cdef list nodes_list # list of node ids + cdef list nodes_attr_list # list of attr value for each node (following nodes_list order) + cdef list nodes_hash # hash representation of every node + cdef set nodes_hash_set # hash representation of every node (set version for intersection and union operation) + cdef dict nodes_idx # index of each node in `nodes_list` + cdef list nodes_weight # list that contains each node's weight (following nodes_list order) + cdef long[:] nodes_degree # degree list + cdef long[:] 
nodes_degree_in # in degree list + cdef long[:] nodes_degree_out # out degree list + cdef long[:] nodes_degree_weighted #weighted vers. of nodes_degree + cdef long[:] nodes_degree_in_weighted #weighted vers. of nodes_degree_in + cdef long[:] nodes_degree_out_weighted #weighted vers. of nodes_degree_out + cdef dict degree_per_attr # degree information per attr val + cdef dict degree_per_attr_weighted # weighted degree information per attr val + cdef list attr_nodes # list of attr(dict) values for each node + cdef dict edges_of_nodes # hashes of the edges connected to each node (keyed by node id) + + # EDGES ATTRIBUTES + ################## + + cdef list edges_list # edge list + cdef list edges_attr_list # list of attr value for each edge (following edges_list order) + cdef dict edges_hash_idx # index of hash in edges_list and edges_attr_list + cdef list edges_hash # hash representation of every edge ## TODO: to be reviewed + cdef set edges_hash_set # set of hash representation of every edge (set version for intersection and union operation) + cdef dict edges_weight # weight of each edge, keyed by edge hash + cdef dict edges_hash_map #[id1,[id2,hash]] + cdef list attr_edges # list of attr(dict) values for each edge + + # SIZE ATTRIBUTES + ############### + + cdef long number_of_nodes # number of nodes + cdef long number_of_edges # number of edges + + cdef dict number_of_edges_per_attr # number of edges per attr value + cdef dict number_of_nodes_per_attr # number of nodes per attr value + + cdef object nx_g + + ################################## + # METHODS + ################################## + + # DIMENSION GETTER + ################## + cpdef long size(self) + cpdef int size_attr(self, attr_val) + + cpdef long density(self) + cpdef int density_attr(self, str attr_val) + + # HASH FUNCTION + ############### + cpdef str hash_node(self,str n1) + cpdef str hash_edge(self,str n1,str n2) + cpdef str hash_node_attr(self,str n1, str attr_value) + cpdef str hash_edge_attr(self,str n1,str n2, 
str attr_value) + + ## EXIST FUNCTION + ############### + cpdef bint has_node(self,str n_id) + cpdef bint has_edge(self,str n_id1,str n_id2) + + ## LEN FUNCTION + ############### + cpdef int size_node_intersect(self,Graph G) + cpdef int size_node_union(self,Graph G) + + cpdef int size_edge_intersect(self,Graph G) + cpdef int size_edge_union(self,Graph G) + + # DEGREE FUNCTION + ################# + cpdef int degree(self,str n_id, bint weight=*) + cpdef int in_degree(self,str n_id, bint weight=*) + cpdef int out_degree(self,str n_id, bint weight=*) + + cpdef int in_degree_attr(self,str n_id,str attr_val, bint weight=*) + cpdef int out_degree_attr(self,str n_id,str attr_val, bint weight=*) + cpdef int degree_attr(self,str n_id,str attr_val, bint weight=*) + + ## GETTER + ######### + + cpdef list get_edges_ed(self,str e1, str e2) + cpdef list get_edges_no(self,str n) + cpdef set get_edges_hash(self) + cpdef set get_nodes_hash(self) + + cpdef str get_node_key(self) + cpdef str get_egde_key(self) + + cpdef dict get_edge_attrs(self,edge_hash) + cpdef dict get_node_attrs(self, node_hash) + cpdef dict get_node_attr(self, node_hash) + cpdef dict get_edge_attr(self,edge_hash) \ No newline at end of file diff --git a/gmatch4py/graph.pyx b/gmatch4py/graph.pyx new file mode 100644 index 0000000000000000000000000000000000000000..f3f59c960cf2499f0e8323b13959c55f49d5d52e --- /dev/null +++ b/gmatch4py/graph.pyx @@ -0,0 +1,389 @@ +from libcpp.map cimport map +from libcpp.utility cimport pair +from libcpp.string cimport string +from libcpp.vector cimport vector +import numpy as np +cimport numpy as np +import networkx as nx + +cdef class Graph: + + def __init__(self,G, node_attr_key="",edge_attr_key=""): + self.nx_g=G + + #GRAPH PROPERTY INIT + self.is_directed = G.is_directed() + self.is_multi = G.is_multigraph() + self.is_node_attr=(True if node_attr_key else False) + self.is_edge_attr=(True if edge_attr_key else False) + if self.is_multi and not self.is_edge_attr: + if not 
len(nx.get_edge_attributes(G,"id")) == len(G.edges(data=True)): + i=0 + for id1 in G.adj: + for id2 in G.adj[id1]: + for id3 in G.adj[id1][id2]: + G._adj[id1][id2][id3]["id"]=str(i) + i+=1 + self.is_edge_attr = True + edge_attr_key = "id" + + # for ed in + + #len(nx.get_edge_attributes(G1,"id")) == len(G1.edges(data=True)) + + if len(G) ==0: + self.__init_empty__() + + else: + a,b=list(zip(*list(G.nodes(data=True)))) + self.nodes_list,self.attr_nodes=list(a),list(b) + if G.number_of_edges()>0: + e1,e2,d=zip(*list(G.edges(data=True))) + self.attr_edges=list(d) + self.edges_list=list(zip(e1,e2)) + else: + self.edges_list=[] + self.attr_edges=[] + + if self.is_node_attr: + self.node_attr_key = node_attr_key + self.nodes_attr_list = [attr_dict[node_attr_key] for attr_dict in self.attr_nodes] + self.unique_node_attr_vals=set(self.nodes_attr_list) + + if self.is_edge_attr: + self.edge_attr_key = edge_attr_key + self.edges_attr_list = [attr_dict[edge_attr_key] for attr_dict in self.attr_edges] + self.unique_edge_attr_vals=set(self.edges_attr_list) + + # NODE Information init + ####################### + + self.nodes_hash=[self.hash_node_attr(node,self.nodes_attr_list[ix]) if self.is_node_attr else self.hash_node(node) for ix, node in enumerate(self.nodes_list) ] + self.nodes_hash_set=set(self.nodes_hash) + self.nodes_idx={node:ix for ix, node in enumerate(self.nodes_list)} + self.nodes_weight=[attr_dict["weight"] if "weight" in attr_dict else 1 for attr_dict in self.attr_nodes] + degree_all=[] + degree_in=[] + degree_out=[] + + degree_all_weighted=[] + degree_in_weighted=[] + degree_out_weighted=[] + if self.is_edge_attr: + self.degree_per_attr={attr_v:{n:{"in":0,"out":0} for n in self.nodes_list} for attr_v in self.unique_edge_attr_vals} + self.degree_per_attr_weighted={attr_v:{n:{"in":0,"out":0} for n in self.nodes_list} for attr_v in self.unique_edge_attr_vals} + # Retrieving Degree Information + self.edges_of_nodes={} + for n in self.nodes_list: + 
self.edges_of_nodes[n]=[self.hash_edge_attr(e1,e2,attr_dict[self.edge_attr_key]) if self.is_edge_attr else self.hash_edge(e1,e2) for e1,e2,attr_dict in G.edges(n,data=True)] + degree_all.append(G.degree(n)) + degree_all_weighted.append(G.degree(n,weight="weight")) + if self.is_directed: + degree_in.append(G.in_degree(n)) + degree_in_weighted.append(G.in_degree(n,weight="weight")) + degree_out.append(G.out_degree(n)) + degree_out_weighted.append(G.out_degree(n,weight="weight")) + else: + degree_in.append(degree_all[-1]) + degree_in_weighted.append(degree_all_weighted[-1]) + degree_out.append(degree_all[-1]) + degree_out_weighted.append(degree_all_weighted[-1]) + if self.is_edge_attr: + if self.is_directed: + in_edge=list(G.in_edges(n,data=True)) + out_edge=list(G.out_edges(n,data=True)) + for n1,n2,attr_dict in in_edge: + self.degree_per_attr[attr_dict[self.edge_attr_key]][n]["in"]+=1 + self.degree_per_attr_weighted[attr_dict[self.edge_attr_key]][n]["in"]+=1*(attr_dict["weight"] if "weight" in attr_dict else 1 ) + + for n1,n2,attr_dict in out_edge: + self.degree_per_attr[attr_dict[self.edge_attr_key]][n]["out"]+=1 + self.degree_per_attr_weighted[attr_dict[self.edge_attr_key]][n]["out"]+=1*(attr_dict["weight"] if "weight" in attr_dict else 1 ) + + else: + edges=G.edges(n,data=True) + for n1,n2,attr_dict in edges: + self.degree_per_attr[attr_dict[self.edge_attr_key]][n]["in"]+=1 + self.degree_per_attr[attr_dict[self.edge_attr_key]][n]["out"]+=1 + self.degree_per_attr_weighted[attr_dict[self.edge_attr_key]][n]["in"]+=1*(attr_dict["weight"] if "weight" in attr_dict else 1 ) + self.degree_per_attr_weighted[attr_dict[self.edge_attr_key]][n]["out"]+=1*(attr_dict["weight"] if "weight" in attr_dict else 1 ) + + + self.nodes_degree=np.array(degree_all) + self.nodes_degree_in=np.array(degree_in) + self.nodes_degree_out=np.array(degree_out) + + self.nodes_degree_weighted=np.array(degree_all_weighted) + self.nodes_degree_in_weighted=np.array(degree_in_weighted) + 
self.nodes_degree_out_weighted=np.array(degree_out_weighted) + + # EDGE INFO INIT + ################# + + self.edges_hash=[] + self.edges_hash_map = {} + self.edges_hash_idx = {} + for ix, ed in enumerate(self.edges_list): + e1,e2=ed + if not e1 in self.edges_hash_map:self.edges_hash_map[e1]={} + + hash_=self.hash_edge_attr(e1,e2,self.edges_attr_list[ix]) if self.is_edge_attr else self.hash_edge(e1,e2) + if self.is_multi and self.is_edge_attr: + if not e2 in self.edges_hash_map[e1]:self.edges_hash_map[e1][e2]={} + self.edges_hash_map[e1][e2][self.edges_attr_list[ix]]=hash_ + else: + self.edges_hash_map[e1][e2]=hash_ + self.edges_hash_idx[hash_]=ix + self.edges_hash.append(hash_) + self.edges_hash_set=set(self.edges_hash) + + self.edges_weight={} + for e1,e2,attr_dict in list(G.edges(data=True)): + hash_=self.hash_edge_attr(e1,e2,attr_dict[self.edge_attr_key]) if self.is_edge_attr else self.hash_edge(e1,e2) + self.edges_weight[hash_]=attr_dict["weight"] if "weight" in attr_dict else 1 + + self.number_of_edges = len(self.edges_list) + self.number_of_nodes = len(self.nodes_list) + + if self.is_edge_attr and self.number_of_edges >0: + self.number_of_edges_per_attr={attr:0 for attr in self.unique_edge_attr_vals} + for _,_,attr_dict in list(G.edges(data=True)): + self.number_of_edges_per_attr[attr_dict[self.edge_attr_key]]+=1 + + if self.is_node_attr and self.number_of_nodes >0: + self.number_of_nodes_per_attr={attr:0 for attr in self.unique_node_attr_vals} + for _,attr_dict in list(G.nodes(data=True)): + self.number_of_nodes_per_attr[attr_dict[self.node_attr_key]]+=1 + + + # HASH FUNCTION + cpdef str hash_node(self,str n1): + return "{0}".format(n1) + + cpdef str hash_edge(self,str n1,str n2): + if not self.is_directed: + return "_".join(sorted([n1,n2])) + return "_".join([n1,n2]) + + cpdef str hash_node_attr(self,str n1, str attr_value): + return "_".join([n1,attr_value]) + + cpdef str hash_edge_attr(self,str n1,str n2, str attr_value): + if self.is_directed: + return 
"_".join([n1,n2,attr_value]) + ed=sorted([n1,n2]) + ed.extend([attr_value]) + return "_".join(ed) + + ## EXIST FUNCTION + cpdef bint has_node(self,str n_id): + if n_id in self.nodes_list: + return True + return False + + cpdef bint has_edge(self,str n_id1,str n_id2): + if self.number_of_edges == 0: + return False + if self.is_directed: + if n_id1 in self.edges_hash_map and n_id2 in self.edges_hash_map[n_id1]: + return True + else: + if n_id1 in self.edges_hash_map and n_id2 in self.edges_hash_map[n_id1]: + return True + if n_id2 in self.edges_hash_map and n_id1 in self.edges_hash_map[n_id2]: + return True + return False + + ## LEN FUNCTION + cpdef int size_node_intersect(self,Graph G): + if self.number_of_nodes == 0: + return 0 + return len(self.nodes_hash_set.intersection(G.nodes_hash_set)) + cpdef int size_node_union(self,Graph G): + return len(self.nodes_hash_set.union(G.nodes_hash_set)) + + cpdef int size_edge_intersect(self,Graph G): + if self.number_of_edges == 0: + return 0 + return len(self.edges_hash_set.intersection(G.edges_hash_set)) + cpdef int size_edge_union(self,Graph G): + return len(self.edges_hash_set.union(G.edges_hash_set)) + + ## GETTER + + def get_nx(self): + return self.nx_g + + def nodes(self,data=False): + if data: + if self.number_of_nodes == 0: + return [],[] + return self.nodes_list,self.attr_nodes + + if self.number_of_nodes == 0: + return [] + return self.nodes_list + + + def edges(self,data=False): + if data: + if self.number_of_edges == 0: + return [],[] + return self.edges_list,self.attr_edges + + if self.number_of_edges == 0: + return [] + return self.edges_list + + cpdef list get_edges_ed(self,str e1,str e2): + if self.is_edge_attr: + hashes=self.edges_hash_map[e1][e2] + return [(e1,e2,self.edges_attr_list[self.edges_hash_idx[hash_]])for hash_ in hashes] + + return [(e1,e2,None)] + + cpdef list get_edges_no(self,str n): + return self.edges_of_nodes[n] + + cpdef dict get_edge_attr(self,edge_hash): + return 
self.edges_attr_list[self.edges_hash_idx[edge_hash]] + + cpdef dict get_node_attr(self, node_hash): + return self.edges_attr_list[self.edges_hash_idx[node_hash]] + + cpdef dict get_edge_attrs(self,edge_hash): + return self.attr_edges[self.edges_hash_idx[edge_hash]] + + cpdef dict get_node_attrs(self, node_hash): + return self.attr_nodes[self.edges_hash_idx[node_hash]] + + cpdef set get_edges_hash(self): + return self.edges_hash_set + + cpdef set get_nodes_hash(self): + return self.nodes_hash_set + + cpdef str get_node_key(self): + return self.node_attr_key + + cpdef str get_egde_key(self): + return self.edge_attr_key + ##### + + cpdef long size(self): + return self.number_of_nodes + + cpdef int size_attr(self, attr_val): + return self.number_of_nodes_per_attr[attr_val] + + cpdef long density(self): + return self.number_of_edges + + cpdef int density_attr(self, str attr_val): + return self.number_of_edges_per_attr[attr_val] + + cpdef int degree(self,str n_id, bint weight=False): + if weight: + return self.nodes_degree_weighted[self.nodes_idx[n_id]] + return self.nodes_degree[self.nodes_idx[n_id]] + + cpdef int in_degree(self,str n_id, bint weight=False): + if weight: + return self.nodes_degree_in_weighted[self.nodes_idx[n_id]] + return self.nodes_degree_in[self.nodes_idx[n_id]] + + cpdef int out_degree(self,str n_id, bint weight=False): + if weight: + return self.nodes_degree_out_weighted[self.nodes_idx[n_id]] + return self.nodes_degree_out[self.nodes_idx[n_id]] + + cpdef int in_degree_attr(self,str n_id,str attr_val, bint weight=False): + if not self.is_edge_attr and not self.is_directed: + raise AttributeError("No edge attribute have been defined") + if weight: + return self.degree_per_attr_weighted[attr_val][n_id]["in"] + return self.degree_per_attr[attr_val][n_id]["in"] + + cpdef int out_degree_attr(self,str n_id,str attr_val, bint weight=False): + if not self.is_edge_attr and not self.is_directed: + raise AttributeError("No edge attribute have been defined") + 
if weight: + return self.degree_per_attr_weighted[attr_val][n_id]["out"] + return self.degree_per_attr[attr_val][n_id]["out"] + + cpdef int degree_attr(self,str n_id,str attr_val, bint weight=False): + if not self.is_edge_attr: + raise AttributeError("No edge attribute have been defined") + if not self.is_directed: + if weight: + return self.degree_per_attr_weighted[attr_val][n_id]["out"] + return self.degree_per_attr[attr_val][n_id]["out"] + if weight: + return self.degree_per_attr_weighted[attr_val][n_id]["in"] + self.degree_per_attr_weighted[attr_val][n_id]["out"] + return self.degree_per_attr[attr_val][n_id]["out"] + self.degree_per_attr[attr_val][n_id]["in"] + + #GRAPH SETTER + def add_node(self,str id_,**kwargs): + if not self.node_attr_key in kwargs: + print("Node not added because information lacks") + return self + if id_ in self.nodes_idx: + print("Already in G") + return self + G=self.nx_g.copy() + G.add_node(id_,**kwargs) + return Graph(G,self.node_attr_key,self.edge_attr_key) + + + def add_edge(self,str n1,str n2,**kwargs): + G=self.nx_g.copy() + G.add_edge(n1,n2,**kwargs) + return Graph(G,self.node_attr_key,self.edge_attr_key) + + def remove_node(self,str id_): + if not id_ in self.nodes_idx: + print("Already removed in G") + return self + G=self.nx_g.copy() + G.remove_node(id_) + return Graph(G,self.node_attr_key,self.edge_attr_key) + + def remove_edge(self,str n1,str n2,**kwargs): + G=self.nx_g.copy() + edges=G.edges([n1,n2],data=True) + if len(edges) == 0: + return self + elif len(edges)<2: + G.remove_edge(n1,n2) + else: + if not self.edge_attr_key in kwargs: + for i in range(len(edges)): + G.remove_edge(n1,n2,i) + else: + key,val,i=self.edge_attr_key, kwargs[self.edge_attr_key],0 + for e1,ed2,attr_dict in edges: + if attr_dict[key] == val: + G.remove_edge(n1,n2,i) + break + i+=1 + + return Graph(G,self.node_attr_key,self.edge_attr_key) + + def __init_empty__(self): + 
self.nodes_list,self.nodes_attr_list,self.nodes_hash,self.nodes_weight,self.attr_nodes=[],[],[],[],[] + self.nodes_degree,self.nodes_degree_in,self.nodes_degree_out,self.nodes_degree_weighted,self.nodes_degree_in_weighted,self.nodes_degree_out_weighted=np.array([],dtype=np.long),np.array([],dtype=np.long),np.array([],dtype=np.long),np.array([],dtype=np.long),np.array([],dtype=np.long),np.array([],dtype=np.long) + self.nodes_idx,self.degree_per_attr,self.degree_per_attr_weighted={},{},{} + self.nodes_hash_set=set([]) + self.number_of_nodes = 0 + + self.number_of_edges = 0 + self.edges_list=[] + self.edges_attr_list =[] + self.edges_hash_idx = {} + self.edges_hash = [] + self.edges_hash_set= set([]) + self.edges_weight={} + self.edges_hash_map={} + self.attr_edges=[] + + \ No newline at end of file diff --git a/gmatch4py/helpers/compute_similarity_matrix.pyx b/gmatch4py/helpers/compute_similarity_matrix.pyx deleted file mode 100644 index fa627820a771b8b6f642a6f04392c1d58fd205dd..0000000000000000000000000000000000000000 --- a/gmatch4py/helpers/compute_similarity_matrix.pyx +++ /dev/null @@ -1,167 +0,0 @@ -# coding = utf-8 - - -# coding = utf-8 -import glob -from gmatch4py import * -from gmatch4py.helpers.reader import import_dir -from gmatch4py import GraphEditDistance as GED2 -from gmatch4py.base import Base - -import argparse, os, sys, re, json, logging -import threading -from queue import Queue -import datetime - -from functools import wraps - - -def objectify(func): - """Mimic an object given a dictionary. - - Given a dictionary, create an object and make sure that each of its - keys are accessible via attributes. - If func is a function act as decorator, otherwise just change the dictionary - and return it. - :param func: A function or another kind of object. - :returns: Either the wrapper for the decorator, or the changed value. 
- - Example:: - - >>> obj = {'old_key': 'old_value'} - >>> oobj = objectify(obj) - >>> oobj['new_key'] = 'new_value' - >>> print oobj['old_key'], oobj['new_key'], oobj.old_key, oobj.new_key - - >>> @objectify - ... def func(): - ... return {'old_key': 'old_value'} - >>> obj = func() - >>> obj['new_key'] = 'new_value' - >>> print obj['old_key'], obj['new_key'], obj.old_key, obj.new_key - - """ - - def create_object(value): - """Create the object. - - Given a dictionary, create an object and make sure that each of its - keys are accessible via attributes. - Ignore everything if the given value is not a dictionary. - :param value: A dictionary or another kind of object. - :returns: Either the created object or the given value. - - """ - if isinstance(value, dict): - # Build a simple generic object. - class Object(dict): - def __setitem__(self, key, val): - setattr(self, key, val) - return super(Object, self).__setitem__(key, val) - - # Create that simple generic object. - ret_obj = Object() - # Assign the attributes given the dictionary keys. - for key, val in value.items(): - if isinstance(val,dict): - ret_obj[key] = objectify(val) - else: - ret_obj[key] = val - setattr(ret_obj, key, val) - return ret_obj - else: - return value - - # If func is a function, wrap around and act like a decorator. - if hasattr(func, '__call__'): - @wraps(func) - def wrapper(*args, **kwargs): - """Wrapper function for the decorator. - - :returns: The return value of the decorated function. - - """ - value = func(*args, **kwargs) - return create_object(value) - - return wrapper - - # Else just try to objectify the value given. 
- else: - return create_object(func) - - -logging.basicConfig( -filename="{0}.csv".format(datetime.datetime.now().strftime("%Y_%m_%d__%H_%M_%S")), -format="%(message)s,%(asctime)s", -level=logging.DEBUG -) - -def compute_matrix(config,graphs,selected,dir): - - for class_ in config.algorithms_selected: - class_=eval(class_) - logging.info(msg="C_S,BEG,\"{0}\"".format(class_.__name__)) - print("Computing the Similarity Matrix for {0}".format(class_.__name__)) - - if class_ in (GraphEditDistance, BP_2, GreedyEditDistance, HED): - comparator = class_(1, 1, 1, 1) - elif class_ == GED2: - comparator = class_(1, 1, 1, 1,weighted=True) - elif class_ == WeisfeleirLehmanKernel: - comparator = class_(h=2) - else: - comparator=class_() - matrix = comparator.compare(graphs, selected) - matrix = comparator.similarity(matrix) - - logging.info(msg="C_S,DONE,\"{0}\"".format(class_.__name__)) - output_fn="{0}/{1}_{2}_{3}.npy".format( - config.output_dir.rstrip("/"), - class_.__name__,os.path.basename(dir), - config.experiment_name.replace(" ","_").lower() - ) - print(output_fn) - logging.info(msg="M_S,BEG,\"{0}\"".format(class_.__name__)) - np.save(output_fn,matrix) - logging.info(msg="M_S,DONE,\"{0}\"".format(class_.__name__)) - print("Matrix Saved") - -def run(config_filename): - - config=objectify(json.load(open(config_filename))) - - - if not os.path.exists(config.input_graph_dir): - print("Input graph directory doesn't exist!") - sys.exit(1) - - if not os.path.exists(config.output_dir): - print("Output matrix directory doesn't exist!") - print("Creating directory") - os.makedirs(config.output_dir) - print("Directory created") - - selected=None - if config.selected_graphs: - selected=json.load(open(config.selected_graph_input_filename)) - - if config.input_graph_sub_dirs: - dirs=[os.path.join(config.input_graph_dir,sub) for sub in config.input_graph_sub_dirs] - else: - dirs=[config.input_graph_dir] - for dir in dirs: - logging.info(msg="L_G,BEGIN,\"\"") - graphs = 
import_dir(dir) - logging.info(msg="L_G,DONE,\"\"") - threading.Thread(target=compute_matrix,args=(config,graphs,selected,dir)).start() - - - #json.dump(mapping_files_to_graphs,open("{0}/{1}".format(args.matrix_output_dir.rstrip("/"),"metadata.json"))) - print("Done") - -if __name__ == "__main__": - parser = argparse.ArgumentParser() - parser.add_argument("configuration_file") - args = parser.parse_args() - run(args.configuration_file) \ No newline at end of file diff --git a/gmatch4py/helpers/general.pyx b/gmatch4py/helpers/general.pyx new file mode 100644 index 0000000000000000000000000000000000000000..0afce55a4524e7e6eaf42795cdc4a6c3d52de175 --- /dev/null +++ b/gmatch4py/helpers/general.pyx @@ -0,0 +1,23 @@ +from ..graph cimport Graph +import networkx as nx + +def parsenx2graph(list_gs,node_attr_key="",edge_attr_key=""): + """ + Parse list of Networkx graphs into Gmatch4py graph format + Parameters + ---------- + list_gs : list + list of graph + node_attr_key : str + node attribute used for the hash + edge_attr_key: str + edge attribute used for the hash + + Returns + ------- + list + list of gmatch4py.Graph + """ + new_gs=[nx.relabel_nodes(g,{node:str(node) for node in list(g.nodes)},copy=True) for g in list_gs] + new_gs=[Graph(g,node_attr_key,edge_attr_key) for g in new_gs] + return new_gs diff --git a/gmatch4py/helpers/generate_config.pyx b/gmatch4py/helpers/generate_config.pyx deleted file mode 100644 index ede0cbc15eb7850e57f711a21a684247c7cef365..0000000000000000000000000000000000000000 --- a/gmatch4py/helpers/generate_config.pyx +++ /dev/null @@ -1,196 +0,0 @@ -# -*- coding: utf-8 -*- - -# Form implementation generated from reading ui file 'mainwindow.ui' -# -# Created by: PyQt5 UI code generator 5.10 -# -# WARNING! All changes made in this file will be lost! 
- -from PyQt5 import QtCore, QtGui, QtWidgets -import os, json,glob -from gmatch4py import * - -class Ui_MainWindow(object): - - def setupUi(self, MainWindow): - - self.graph_input_dir="" - self.selected_input_fn="" - self.output_dir="" - - self.available_algs=['BP_2','BagOfCliques','BagOfNodes','GraphEditDistance','GreedyEditDistance','HED','Jaccard','MCS','VertexEdgeOverlap','VertexRanking', 'WeisfeleirLehmanKernel'] - - MainWindow.setObjectName("MainWindow") - MainWindow.resize(1000, 661) - self.centralWidget = QtWidgets.QWidget(MainWindow) - self.centralWidget.setObjectName("centralWidget") - self.textBrowser = QtWidgets.QTextBrowser(self.centralWidget) - self.textBrowser.setGeometry(QtCore.QRect(405, 31, 591, 551)) - self.textBrowser.setObjectName("textBrowser") - self.label = QtWidgets.QLabel(self.centralWidget) - self.label.setGeometry(QtCore.QRect(410, 10, 100, 16)) - self.label.setObjectName("label") - self.graph_dir_but = QtWidgets.QPushButton(self.centralWidget) - self.graph_dir_but.setGeometry(QtCore.QRect(280, 90, 113, 32)) - self.graph_dir_but.setObjectName("graph_dir_but") - self.label_2 = QtWidgets.QLabel(self.centralWidget) - self.label_2.setGeometry(QtCore.QRect(10, 70, 200, 16)) - self.label_2.setObjectName("label_2") - self.selected_fn_but = QtWidgets.QPushButton(self.centralWidget) - self.selected_fn_but.setGeometry(QtCore.QRect(280, 160, 113, 32)) - self.selected_fn_but.setObjectName("selected_fn_but") - self.label_3 = QtWidgets.QLabel(self.centralWidget) - self.label_3.setGeometry(QtCore.QRect(10, 140, 300, 16)) - self.label_3.setObjectName("label_3") - self.generate_button = QtWidgets.QPushButton(self.centralWidget) - self.generate_button.setGeometry(QtCore.QRect(20, 540, 113, 32)) - self.generate_button.setObjectName("generate_button") - self.label_4 = QtWidgets.QLabel(self.centralWidget) - self.label_4.setGeometry(QtCore.QRect(10, 210, 131, 16)) - self.label_4.setObjectName("label_4") - self.ouptut_dir_but = 
QtWidgets.QPushButton(self.centralWidget) - self.ouptut_dir_but.setGeometry(QtCore.QRect(280, 230, 113, 32)) - self.ouptut_dir_but.setObjectName("ouptut_dir_but") - self.all_alg = QtWidgets.QCheckBox(self.centralWidget) - self.all_alg.setGeometry(QtCore.QRect(10, 500, 200, 20)) - self.all_alg.setObjectName("all_alg") - self.save_button = QtWidgets.QPushButton(self.centralWidget) - self.save_button.setGeometry(QtCore.QRect(150, 540, 200, 32)) - self.save_button.setObjectName("save_button") - self.label_5 = QtWidgets.QLabel(self.centralWidget) - self.label_5.setGeometry(QtCore.QRect(10, 30, 121, 31)) - self.label_5.setObjectName("label_5") - self.experiment_name = QtWidgets.QLineEdit(self.centralWidget) - self.experiment_name.setGeometry(QtCore.QRect(130, 30, 231, 31)) - self.experiment_name.setObjectName("experiment_name") - self.graph_dir_label = QtWidgets.QLineEdit(self.centralWidget) - self.graph_dir_label.setGeometry(QtCore.QRect(10, 90, 261, 31)) - self.graph_dir_label.setObjectName("graph_dir_label") - self.selected_file_label = QtWidgets.QLineEdit(self.centralWidget) - self.selected_file_label.setGeometry(QtCore.QRect(10, 160, 261, 31)) - self.selected_file_label.setObjectName("selected_file_label") - self.output_dir_label = QtWidgets.QLineEdit(self.centralWidget) - self.output_dir_label.setGeometry(QtCore.QRect(10, 230, 261, 31)) - self.output_dir_label.setText("") - self.output_dir_label.setObjectName("output_dir_label") - self.alg_selector = QtWidgets.QListWidget(self.centralWidget) - self.alg_selector.setGeometry(QtCore.QRect(10, 300, 256, 192)) - self.alg_selector.setObjectName("alg_selector") - self.alg_selector.setSelectionMode(QtWidgets.QListWidget.MultiSelection) - self.label_6 = QtWidgets.QLabel(self.centralWidget) - self.label_6.setGeometry(QtCore.QRect(10, 280, 221, 16)) - self.label_6.setObjectName("label_6") - MainWindow.setCentralWidget(self.centralWidget) - self.menuBar = QtWidgets.QMenuBar(MainWindow) - 
self.menuBar.setGeometry(QtCore.QRect(0, 0, 1000, 22)) - self.menuBar.setObjectName("menuBar") - MainWindow.setMenuBar(self.menuBar) - self.mainToolBar = QtWidgets.QToolBar(MainWindow) - self.mainToolBar.setObjectName("mainToolBar") - MainWindow.addToolBar(QtCore.Qt.TopToolBarArea, self.mainToolBar) - self.statusBar = QtWidgets.QStatusBar(MainWindow) - self.statusBar.setObjectName("statusBar") - MainWindow.setStatusBar(self.statusBar) - - self.retranslateUi(MainWindow) - QtCore.QMetaObject.connectSlotsByName(MainWindow) - - for item in self.available_algs: - self.alg_selector.addItem(item) - - self.graph_dir_but.clicked.connect(self.get_graph_input_dir) - self.ouptut_dir_but.clicked.connect(self.get_res_output_dir) - self.selected_fn_but.clicked.connect(self.get_selected_file) - - self.generate_button.clicked.connect(self.generate_conf) - self.save_button.clicked.connect(self.file_save) - - def openDirNameDialog(self,title): - fileName = QtWidgets.QFileDialog.getExistingDirectory(None,title) - if not fileName: - return "" - return str(fileName) - - def openFileNameDialog(self,title): - options = QtWidgets.QFileDialog.Options() - options |= QtWidgets.QFileDialog.DontUseNativeDialog - filename,_ = QtWidgets.QFileDialog.getOpenFileNames(None) - if filename: - return filename - else: - return "" - - def get_graph_input_dir(self): - fn = self.openDirNameDialog("Graph Input Dir") - self.graph_dir_label.setText(fn) - self.graph_input_dir=fn - - def get_res_output_dir(self): - fn=self.openDirNameDialog("Results Output Dir") - self.output_dir_label.setText(fn) - self.output_dir=fn - - def get_selected_file(self): - fn=self.openFileNameDialog("SelectGraph File") - self.selected_file_label.setText(fn[0]) - self.selected_input_fn=fn[0] - - def retranslateUi(self, MainWindow): - _translate = QtCore.QCoreApplication.translate - MainWindow.setWindowTitle(_translate("MainWindow", "Générateur de Configuration pour Gmatch4py")) - self.label.setText(_translate("MainWindow", 
"Configuration:")) - self.graph_dir_but.setText(_translate("MainWindow", "Parcourir")) - self.label_2.setText(_translate("MainWindow", "Dossier contenant les graphes")) - self.selected_fn_but.setText(_translate("MainWindow", "Parcourir")) - self.label_3.setText(_translate("MainWindow", "Fichier contenant les graphes sélectionnés")) - self.generate_button.setText(_translate("MainWindow", "Générer")) - self.save_button.setText(_translate("MainWindow", "Sauvegarder la configuration")) - self.label_4.setText(_translate("MainWindow", "Dossier de Sortie")) - self.ouptut_dir_but.setText(_translate("MainWindow", "Parcourir")) - self.all_alg.setText(_translate("MainWindow", "Selectionnez tout ?")) - self.label_5.setText(_translate("MainWindow", "Nom de l'expérimentation")) - self.label_6.setText(_translate("MainWindow", "Sélectionnez les algorithmes :")) - - - - def file_save(self): - name,_ = QtWidgets.QFileDialog.getSaveFileName(None, 'Sauvegarder la configuration') - print(name) - if name: - file = open(name,'w') - text = self.generate_conf()[1] - file.write(text) - file.close() - msg=QtWidgets.QMessageBox() - msg.setText("Sauvegarde") - msg.setInformativeText("Sauvegarde Réussie") - msg.setWindowTitle("Sauvegarde") - msg.setStandardButtons(QtWidgets.QMessageBox.Ok) - msg.exec_() - - def generate_conf(self): - conf= { - "experiment_name":self.experiment_name.text(), - "input_graph_dir":self.graph_input_dir, - "input_graph_sub_dirs":[dir_ for dir_ in next(os.walk(self.graph_input_dir))[1] if next(os.walk(self.graph_input_dir))[1]], - "selected_graphs":(True if self.selected_input_fn else False), - "selected_graph_input_filename":self.selected_input_fn, - "algorithms_selected": [item.text() for item in self.alg_selector.selectedItems()] if not self.all_alg.isChecked() else self.available_algs, - "execute_all_algorithms": self.all_alg.isChecked() - } - str_conf=json.dumps(conf,indent=2) - self.textBrowser.setPlainText(str_conf) - return conf,str_conf - -def 
run_conf_generator(): - import sys - app = QtWidgets.QApplication(sys.argv) - MainWindow = QtWidgets.QMainWindow() - ui = Ui_MainWindow() - ui.setupUi(MainWindow) - MainWindow.show() - sys.exit(app.exec_()) - -if __name__ == "__main__": - run_conf_generator() - diff --git a/gmatch4py/jaccard.pyx b/gmatch4py/jaccard.pyx index a987192ab183d27f438f1920edaec774830f29a8..6b0bfe74b131c8c573fd14b47677a2a594593903 100644 --- a/gmatch4py/jaccard.pyx +++ b/gmatch4py/jaccard.pyx @@ -4,38 +4,49 @@ import numpy as np cimport numpy as np from .base cimport Base -from .base cimport intersection,union_ - +from .helpers.general import parsenx2graph +from cython.parallel cimport prange,parallel +cimport cython cdef class Jaccard(Base): def __init__(self): Base.__init__(self,0,True) + + @cython.boundscheck(False) cpdef np.ndarray compare(self,list listgs, list selected): cdef int n = len(listgs) - cdef np.ndarray comparison_matrix = np.zeros((n, n)) + cdef list new_gs=parsenx2graph(listgs,self.node_attr_key,self.edge_attr_key) + cdef double[:,:] comparison_matrix = np.zeros((n, n)) + cdef long[:] n_nodes = np.array([g.size() for g in new_gs]) + cdef long[:] n_edges = np.array([g.density() for g in new_gs]) cdef int i,j - for i in range(n): - for j in range(i,n): - g1,g2=listgs[i],listgs[j] - f=self.isAccepted(g1,i,selected) - if f: - inter_g=intersection(g1,g2) - union_g=union_(g1,g2) - if union_g.number_of_nodes() == 0 or union_g.number_of_edges()== 0: - comparison_matrix[i, j] = 0. - else: - comparison_matrix[i,j]=\ - ((inter_g.number_of_nodes())/(union_g.number_of_nodes()))\ - *\ - ((union_g.number_of_edges())/(union_g.number_of_edges())) - else: - comparison_matrix[i, j] = 0. 
- - comparison_matrix[j, i] = comparison_matrix[i, j] - - return comparison_matrix - + cdef double[:] selected_test = np.array(self.get_selected_array(selected,n)) + cdef double[:,:] intersect_len_nodes = np.zeros((n, n)) + cdef double[:,:] intersect_len_edges = np.zeros((n, n)) + cdef double[:,:] union_len_nodes = np.zeros((n, n)) + cdef double[:,:] union_len_edges = np.zeros((n, n)) + for i in range(n): + for j in range(i,n): + intersect_len_nodes[i][j]=new_gs[i].size_node_intersect(new_gs[j]) + intersect_len_edges[i][j]=new_gs[i].size_edge_intersect(new_gs[j])#len(set(hash_edges[i]).intersection(hash_edges[j])) + union_len_nodes[i][j]=new_gs[i].size_node_union(new_gs[j]) + union_len_edges[i][j]=new_gs[i].size_edge_union(new_gs[j]) + with nogil, parallel(num_threads=self.cpu_count): + for i in prange(n,schedule='static'): + for j in range(i,n): + if n_nodes[i] > 0 and n_nodes[j] > 0 and selected_test[i] == 1: + if union_len_edges[i][j] >0 and union_len_nodes[i][j] >0: + comparison_matrix[i][j]= \ + (intersect_len_edges[i][j]/union_len_edges[i][j])*\ + (intersect_len_nodes[i][j]/union_len_nodes[i][j]) + + else: + comparison_matrix[i][j] = 0. + + comparison_matrix[j][i] = comparison_matrix[i][j] + + return np.array(comparison_matrix) diff --git a/gmatch4py/kernels/adjacency.pyx b/gmatch4py/kernels/adjacency.pyx new file mode 100644 index 0000000000000000000000000000000000000000..67b3cb53af49335538b5b2781e19e0c7729547d6 --- /dev/null +++ b/gmatch4py/kernels/adjacency.pyx @@ -0,0 +1,52 @@ +import networkx as nx +import numpy as np + +def get_adjacency(G1,G2): + """ + Return the adjacency matrices of two graphs, both indexed over the union of their nodes.
+ + Parameters + ---------- + G1 : nx.Graph + first graph + G2 : nx.Graph + second graph + + Returns + ------- + tuple of np.array + adjacency matrices of G1 and G2 + """ + + # Extract nodes + nodes_G1=list(G1.nodes()) + nodes_G2=list(G2.nodes()) + + # Get Adjacency Matrix for each graph + adj_original_G1 = nx.convert_matrix.to_numpy_matrix(G1,nodes_G1) + adj_original_G2 = nx.convert_matrix.to_numpy_matrix(G2,nodes_G2) + + # Get old index + index_node_G1={node: ix for ix,node in enumerate(nodes_G1)} + index_node_G2={node: ix for ix,node in enumerate(nodes_G2)} + + # Building new indices + nodes_unique = list(set(nodes_G1).union(nodes_G2)) + new_node_index = {node:i for i,node in enumerate(nodes_unique)} + + n=len(nodes_unique) + + #Generate new adjacent matrices + new_adj_G1= np.zeros((n,n)) + new_adj_G2= np.zeros((n,n)) + + # Filling old values + for n1 in nodes_unique: + for n2 in nodes_unique: + if n1 in G1.nodes() and n2 in G1.nodes(): + new_adj_G1[new_node_index[n1],new_node_index[n2]]=adj_original_G1[index_node_G1[n1],index_node_G1[n2]] + if n1 in G2.nodes() and n2 in G2.nodes(): + new_adj_G2[new_node_index[n1],new_node_index[n2]]=adj_original_G2[index_node_G2[n1],index_node_G2[n2]] + + return new_adj_G1,new_adj_G2 + diff --git a/gmatch4py/kernels/shortest_path_kernel.pyx b/gmatch4py/kernels/shortest_path_kernel.pyx index e7e7444f49afd0a2af30cfff17a9ef3b3a2209d9..351b5bc1aa686e5063b3da4c6009f886c926c8e6 100644 --- a/gmatch4py/kernels/shortest_path_kernel.pyx +++ b/gmatch4py/kernels/shortest_path_kernel.pyx @@ -12,15 +12,22 @@ Modified by : Jacques Fize import networkx as nx import numpy as np +cimport numpy as np +from scipy.sparse.csgraph import floyd_warshall +from .adjacency import get_adjacency +from cython.parallel cimport prange,parallel +from ..helpers.general import parsenx2graph +from ..base cimport Base +cimport cython - -class ShortestPathGraphKernel: +cdef class ShortestPathGraphKernel(Base): """ Shorthest path graph kernel. 
""" - __type__ = "sim" - @staticmethod - def compare( g_1, g_2, verbose=False): + def __init__(self): + Base.__init__(self,0,False) + + def compare_two(self,g_1, g_2): """Compute the kernel value (similarity) between two graphs. Parameters ---------- @@ -34,15 +41,18 @@ class ShortestPathGraphKernel: """ # Diagonal superior matrix of the floyd warshall shortest # paths: - fwm1 = np.array(nx.floyd_warshall_numpy(g_1)) - fwm1 = np.where(fwm1 == np.inf, 0, fwm1) - fwm1 = np.where(fwm1 == np.nan, 0, fwm1) + if isinstance(g_1,nx.Graph) and isinstance(g_2,nx.Graph): + g_1,g_2= get_adjacency(g_1,g_2) + + fwm1 = np.array(floyd_warshall(g_1)) + fwm1[np.isinf(fwm1)] = 0 + fwm1[np.isnan(fwm1)] = 0 fwm1 = np.triu(fwm1, k=1) bc1 = np.bincount(fwm1.reshape(-1).astype(int)) - fwm2 = np.array(nx.floyd_warshall_numpy(g_2)) - fwm2 = np.where(fwm2 == np.inf, 0, fwm2) - fwm2 = np.where(fwm2 == np.nan, 0, fwm2) + fwm2 = np.array(floyd_warshall(g_2)) + fwm2[np.isinf(fwm2)] = 0 + fwm2[np.isnan(fwm2)] = 0 fwm2 = np.triu(fwm2, k=1) bc2 = np.bincount(fwm2.reshape(-1).astype(int)) @@ -56,9 +66,8 @@ class ShortestPathGraphKernel: return np.sum(v1 * v2) - - @staticmethod - def compare_list(graph_list, verbose=False): + @cython.boundscheck(False) + cpdef np.ndarray compare(self,list graph_list, list selected): """Compute the all-pairs kernel values for a list of graphs. This function can be used to directly compute the kernel matrix for a list of graphs. The direct computation of the @@ -73,16 +82,68 @@ class ShortestPathGraphKernel: K: numpy.array, shape = (len(graph_list), len(graph_list)) The similarity matrix of all graphs in graph_list. 
""" - n = len(graph_list) - k = np.zeros((n, n)) + cdef int n = len(graph_list) + cdef double[:,:] k = np.zeros((n, n)) + cdef int cpu_count = self.cpu_count + cdef int i,j + cdef list adjacency_matrices = [[None for i in range(n)]for j in range(n)] + for i in range(n): for j in range(i, n): - k[i, j] = ShortestPathGraphKernel.compare(graph_list[i], graph_list[j]) - k[j, i] = k[i, j] + adjacency_matrices[i][j] = get_adjacency(graph_list[i],graph_list[j]) + adjacency_matrices[j][i] = adjacency_matrices[i][j] + + with nogil, parallel(num_threads=cpu_count): + for i in prange(n,schedule='static'): + for j in range(i, n): + with gil: + if len(graph_list[i]) > 0 and len(graph_list[j]) >0: + a,b=adjacency_matrices[i][j] + k[i][j] = self.compare_two(a,b) + k[j][i] = k[i][j] - k_norm = np.zeros(k.shape) - for i in range(k.shape[0]): - for j in range(k.shape[1]): - k_norm[i, j] = k[i, j] / np.sqrt(k[i, i] * k[j, j]) + k_norm = np.zeros((n,n)) + for i in range(n): + for j in range(i,n): + k_norm[i, j] = k[i][j] / np.sqrt(k[i][i] * k[j][j]) + k_norm[j, i] = k_norm[i, j] - return k_norm \ No newline at end of file + return np.nan_to_num(k_norm) + + + +cdef class ShortestPathGraphKernelDotCostMatrix(ShortestPathGraphKernel): + """ + Instead of just multiply the count of distance values fou,d between nodes of each graph, this version propose to multiply the node distance matrix generated from each graph. + """ + def __init__(self): + ShortestPathGraphKernel.__init__(self) + + def compare_two(self,g_1, g_2): + """Compute the kernel value (similarity) between two graphs. + Parameters + ---------- + g1 : networkx.Graph + First graph. + g2 : networkx.Graph + Second graph. + Returns + ------- + k : The similarity value between g1 and g2. 
+ """ + # Diagonal superior matrix of the floyd warshall shortest + # paths: + if isinstance(g_1,nx.Graph) and isinstance(g_2,nx.Graph): + g_1,g_2= get_adjacency(g_1,g_2) + + fwm1 = np.array(floyd_warshall(g_1)) + fwm1[np.isinf(fwm1)] = 0 + fwm1[np.isnan(fwm1)] = 0 + fwm1 = np.triu(fwm1, k=1) + + fwm2 = np.array(floyd_warshall(g_2)) + fwm2[np.isinf(fwm2)] = 0 + fwm2[np.isnan(fwm2)] = 0 + fwm2 = np.triu(fwm2, k=1) + + return np.sum(fwm1 * fwm2) \ No newline at end of file diff --git a/gmatch4py/kernels/weisfeiler_lehman.pyx b/gmatch4py/kernels/weisfeiler_lehman.pyx index 93a78cbbc36b49d0f328198fa7f38a59ea1fa22a..e0e4c0edf80b6012f95784bb178620100a2f64c9 100644 --- a/gmatch4py/kernels/weisfeiler_lehman.pyx +++ b/gmatch4py/kernels/weisfeiler_lehman.pyx @@ -105,7 +105,6 @@ cdef class WeisfeleirLehmanKernel(Base): # cdef np.ndarray[np.float64_t] k k = np.dot(phi.transpose(), phi) - print(1) # MAIN LOOP cdef int it = 0 diff --git a/gmatch4py/mcs.pyx b/gmatch4py/mcs.pyx index d2742e4f762477587335b3f39021b182a2de4a69..574b5a7f7284d80adbd9859826249515a4ca0b9e 100644 --- a/gmatch4py/mcs.pyx +++ b/gmatch4py/mcs.pyx @@ -1,8 +1,11 @@ # coding = utf-8 import numpy as np cimport numpy as np - +from .graph cimport Graph from .base cimport Base +from cython.parallel cimport prange,parallel +from .helpers.general import parsenx2graph +cimport cython cdef class MCS(Base): """ @@ -12,34 +15,31 @@ cdef class MCS(Base): def __init__(self): Base.__init__(self,0,True) + @cython.boundscheck(False) cpdef np.ndarray compare(self,list listgs, list selected): cdef int n = len(listgs) - cdef np.ndarray comparison_matrix = np.zeros((n, n)) + cdef double [:,:] comparison_matrix = np.zeros((n, n)) + cdef double[:] selected_test = np.array(self.get_selected_array(selected,n)) + cdef list new_gs=parsenx2graph(listgs,self.node_attr_key,self.edge_attr_key) + cdef long[:] n_nodes = np.array([g.size() for g in new_gs]) + cdef double [:,:] intersect_len_nodes = np.zeros((n, n)) + cdef int i,j for i in 
range(n): - for j in range(i, n): - g1,g2=listgs[i],listgs[j] - f=self.isAccepted(g1,i,selected) - if f: - comparison_matrix[i, j] = self.s_mcs(g1,g2) - else: - comparison_matrix[i, j] = 0. - comparison_matrix[j, i] = comparison_matrix[i, j] - return comparison_matrix - - def s_mcs(self,G, H): - """ - Return the MCS measure value between - Parameters - ---------- - G : networkx.Graph - First Graph - H : networkx.Graph - Second Graph - - Returns - ------- - - """ + for j in range(i,n): + intersect_len_nodes[i][j]=new_gs[i].size_node_intersect(new_gs[j]) + + with nogil, parallel(num_threads=self.cpu_count): + for i in prange(n,schedule='static'): + for j in range(i, n): + if n_nodes[i] > 0 and n_nodes[j] > 0 and selected_test[i] == 1: + comparison_matrix[i][j] = intersect_len_nodes[i][j]/max(n_nodes[i],n_nodes[j]) + else: + comparison_matrix[i][j] = 0. + if i==j: + comparison_matrix[i][j]=1 + comparison_matrix[j][i] = comparison_matrix[i][j] + + + return np.array(comparison_matrix) - return len(self.mcs(G, H)) / float(max(len(G), len(H))) diff --git a/gmatch4py/vertex_edge_overlap.pyx b/gmatch4py/vertex_edge_overlap.pyx index c270c94b327fe6953a4c9afe33b0221812ca6302..e9fd66a1f1422bf3e9937549e96899fc8981b3d2 100644 --- a/gmatch4py/vertex_edge_overlap.pyx +++ b/gmatch4py/vertex_edge_overlap.pyx @@ -2,7 +2,12 @@ import numpy as np cimport numpy as np -from .base cimport Base,intersection + +from .graph cimport Graph +from cython.parallel cimport prange,parallel +from .helpers.general import parsenx2graph +cimport cython +from .base cimport Base cdef class VertexEdgeOverlap(Base): @@ -14,27 +19,39 @@ cdef class VertexEdgeOverlap(Base): Code Author : Jacques Fize """ def __init__(self): - Base.__init__(self,0,True) + Base.__init__(self,0,True) + @cython.boundscheck(False) cpdef np.ndarray compare(self,list listgs, list selected): - n = len(listgs) - cdef np.ndarray comparison_matrix = np.zeros((n, n)) - cdef list inter_ver,inter_ed + cdef int n = len(listgs) + cdef list 
new_gs=parsenx2graph(listgs,self.node_attr_key,self.edge_attr_key) + cdef double[:,:] comparison_matrix = np.zeros((n, n)) cdef int denom,i,j + cdef long[:] n_nodes = np.array([g.size() for g in new_gs]) + cdef long[:] n_edges = np.array([g.density() for g in new_gs]) + + cdef double[:] selected_test = np.array(self.get_selected_array(selected,n)) + + cdef double[:,:] intersect_len_nodes = np.zeros((n, n)) + cdef double[:,:] intersect_len_edges = np.zeros((n, n)) for i in range(n): for j in range(i,n): - g1,g2 = listgs[i],listgs[j] - f=self.isAccepted(g1,i,selected) - if f: - inter_g= intersection(g1,g2) - denom=g1.number_of_nodes()+g2.number_of_nodes()+\ - g1.number_of_edges()+g2.number_of_edges() - if denom == 0: - continue - comparison_matrix[i,j]=(2*(inter_g.number_of_nodes() - +inter_g.number_of_edges()))/denom # Data = True --> For nx.MultiDiGraph - comparison_matrix[j, i] = comparison_matrix[i, j] - return comparison_matrix + intersect_len_nodes[i][j]=new_gs[i].size_node_intersect(new_gs[j]) + intersect_len_edges[i][j]=new_gs[i].size_edge_intersect(new_gs[j])#len(set(hash_edges[i]).intersection(hash_edges[j])) + + with nogil, parallel(num_threads=self.cpu_count): + for i in prange(n,schedule='static'): + for j in range(i,n): + if n_nodes[i] > 0 and n_nodes[j] > 0 and selected_test[i] == 1: + denom=n_nodes[i]+n_nodes[j]+\ + n_edges[i]+n_edges[j] + if denom > 0: + comparison_matrix[i][j]=(2*(intersect_len_nodes[i][j] + +intersect_len_edges[i][j]))/denom # Data = True --> For nx.MultiDiGraph + if i==j: + comparison_matrix[i][j]=1 + comparison_matrix[j][i] = comparison_matrix[i][j] + return np.array(comparison_matrix) diff --git a/logo.png b/logo.png new file mode 100644 index 0000000000000000000000000000000000000000..04e9d636d0b3f2377451919835ac1dc4153f6c00 Binary files /dev/null and b/logo.png differ diff --git a/setup.py b/setup.py index b2fc383c619d2d2612e9fbe7c3084e6848e07871..16c333471014e20750403006906dfe5e0b107fae 100644 --- a/setup.py +++ b/setup.py @@ 
-42,7 +42,9 @@ def makeExtension(extName): return Extension( extName, - [extPath],include_dirs=[np.get_include()],language='c++',libraries=libs + [extPath],include_dirs=[np.get_include()],language='c++',libraries=libs, + #extra_compile_args = ["-O0", "-fopenmp"],extra_link_args=['-fopenmp'] + ) # get the list of extensions @@ -56,6 +58,7 @@ this_directory = path.abspath(path.dirname(__file__)) with open(path.join(this_directory, 'README.md'), encoding='utf-8') as f: long_description = f.read() +requirements=["numpy","networkx","scipy",'scikit-learn','tqdm','pandas',"joblib","gensim","psutil"] setup( name="GMatch4py", author="Jacques Fize", @@ -63,12 +66,12 @@ setup( long_description=long_description, long_description_content_type='text/markdown', url="https://github.com/Jacobe2169/GMatch4py", - packages=["gmatch4py","gmatch4py.helpers"], + packages=["gmatch4py"], ext_modules=extensions, cmdclass={'build_ext': build_ext}, - setup_requires=["numpy","networkx","scipy",'scikit-learn'], - install_requires=["numpy","networkx","scipy",'scikit-learn'], - version="0.2.2", + setup_requires=requirements, + install_requires=requirements, + version="0.2.4.3beta", classifiers=[ "Programming Language :: Python :: 3", "License :: OSI Approved :: MIT License", diff --git a/test/gmatch4py_performance_test.py b/test/gmatch4py_performance_test.py new file mode 100644 index 0000000000000000000000000000000000000000..c5646b2df297225eae74d48c67f089d0bcea0c2f --- /dev/null +++ b/test/gmatch4py_performance_test.py @@ -0,0 +1,32 @@ +import os +os.chdir(os.environ["HOME"]) + +def test_mesure(): + import gmatch4py as gm + import networkx as nx + import time + from tqdm import tqdm + import pandas as pd + + + max_=100 + size_g=10 + graphs_all=[nx.random_tree(size_g) for i in range(max_)] + result_compiled=[] + for size_ in tqdm(range(50,max_,50)): + graphs=graphs_all[:size_] + comparator=None + for class_ in [gm.BagOfNodes,gm.WeisfeleirLehmanKernel, gm.GraphEditDistance, gm.GreedyEditDistance, 
gm.HED, gm.BP_2, gm.Jaccard, gm.MCS, gm.VertexEdgeOverlap]: + deb=time.time() + if class_ in (gm.GraphEditDistance, gm.BP_2, gm.GreedyEditDistance, gm.HED): + comparator = class_(1, 1, 1, 1) + elif class_ == gm.WeisfeleirLehmanKernel: + comparator = class_(h=2) + else: + comparator=class_() + matrix = comparator.compare(graphs,None) + print([class_.__name__,size_,time.time()-deb]) + result_compiled.append([class_.__name__,size_,time.time()-deb]) + + df = pd.DataFrame(result_compiled,columns="algorithm size_data time_exec_s".split()) + df.to_csv("new_gmatch4py_res_{0}graphs_{1}size.csv".format(max_,size_g)) \ No newline at end of file diff --git a/test/test.py b/test/test.py new file mode 100644 index 0000000000000000000000000000000000000000..c8aec492aab9cf20940d400173ffaa867bdcd0b5 --- /dev/null +++ b/test/test.py @@ -0,0 +1,216 @@ +import pytest +import os +import networkx as nx + +def __import(): + # Gmatch4py use networkx graph + import networkx as nx + import gmatch4py as gm + + +def test_import(): + os.chdir(os.environ["HOME"] ) + __import() + +def test_graph(): + os.chdir(os.environ["HOME"]) + import networkx as nx + import gmatch4py as gm + + # Simple Graph + G1 = nx.Graph() + G2 = nx.Graph() + G1.add_edge("1","2") + G1.add_edge("1","3") + + gm.graph.Graph(G1) + + # Digraph Graph + G1 = nx.DiGraph() + G1.add_edge("1","2") + G1.add_edge("1","3") + assert list(G1.edges()) == gm.graph.Graph(G1).edges() + + G1 = nx.DiGraph() + G1.add_edge("1","2",color="blue") + G1.add_edge("1","2",color="red") + G1.add_edge("1","3",color="green") + assert gm.graph.Graph(G1,edge_attr_key="color").density() == 2 + assert gm.graph.Graph(G1).density() == 2 + + # Multi Graph + G1 = nx.MultiGraph() + G1.add_edge("1","2",color="blue") + G1.add_edge("1","3",color="green") + assert list(G1.edges()) == gm.graph.Graph(G1).edges() + G1 = nx.MultiGraph() + G1.add_edge("1","2",color="blue") + G1.add_edge("1","3",color="green") + assert 
len(set([gm.graph.Graph(G1).hash_edge_attr(ed[0],ed[1],ed[2]["color"]) for ed in list(G1.edges(data=True))]).intersection(gm.graph.Graph(G1,edge_attr_key="color").get_edges_hash())) == 2 + + G1 = nx.MultiGraph() + G1.add_edge("1","2",color="blue") + G1.add_edge("1","2",color="red") + G1.add_edge("1","3",color="green") + assert gm.graph.Graph(G1,edge_attr_key="color").density() == len(G1.edges(data=True)) + assert gm.graph.Graph(G1).density() == len(G1.edges(data=True)) + + # Multi DiGraph + G1 = nx.MultiDiGraph() + G1.add_edge("1","2",color="blue") + G1.add_edge("1","2",color="red") + G1.add_edge("1","3",color="green") + assert gm.graph.Graph(G1,edge_attr_key="color").density() == len(G1.edges(data=True)) + assert gm.graph.Graph(G1).density() == len(G1.edges(data=True)) + +def test_hash(): + os.chdir(os.environ["HOME"]) + import networkx as nx + import gmatch4py as gm + + # Basic HASH + G1 = nx.Graph() + G_gm = gm.graph.Graph(G1) + assert G_gm.hash_edge("1","2") == "1_2" + assert G_gm.hash_edge("2","1") == "1_2" + + # IF directed + G1 = nx.DiGraph() + G1.add_edge("1","2") + G_gm = gm.graph.Graph(G1) + assert G_gm.hash_edge("3","2") == "3_2" + assert G_gm.hash_edge("2","1") == "2_1" + + # IF color and directed + G1 = nx.DiGraph() + G1.add_edge("1","2",color="blue") + G_gm = gm.graph.Graph(G1,edge_attr_key="color") + assert G_gm.hash_edge_attr("3","2","blue") == "3_2_blue" + assert G_gm.get_edges_hash() == {"1_2_blue"} + + # if color and not directed + G1 = nx.Graph() + G1.add_edge("1","2",color="blue") + G_gm = gm.graph.Graph(G1,edge_attr_key="color") + assert G_gm.hash_edge_attr("3","2","blue") == "2_3_blue" + +def test_intersect_union(): + os.chdir(os.environ["HOME"]) + import networkx as nx + import gmatch4py as gm + + # Basic + G1 = nx.Graph() + G1.add_edge("1","2") + G1.add_edge("1","3") + G2 = G1.copy() + G2.add_edge("3","4") + GM1 = gm.graph.Graph(G1) + GM2 = gm.graph.Graph(G2) + + assert GM1.size_edge_union(GM2) == 3 + assert GM1.size_node_union(GM2) == 4 + 
+ assert GM1.size_edge_intersect(GM2) == 2 + assert GM1.size_node_intersect(GM2) == 3 + + # Basic, with a reversed edge to exercise the edge hash + G1 = nx.Graph() + G1.add_edge("1","2") + G1.add_edge("1","3") + G2 = nx.Graph() + G2.add_edge("1","2") + G2.add_edge("3","1") # Reversed edge: must hash identically in an undirected graph + G2.add_edge("3","4") + GM1 = gm.graph.Graph(G1) + GM2 = gm.graph.Graph(G2) + + assert GM1.size_edge_union(GM2) == 3 + assert GM1.size_node_union(GM2) == 4 + + assert GM1.size_edge_intersect(GM2) == 2 + assert GM1.size_node_intersect(GM2) == 3 + + + # Directed + G1 = nx.DiGraph() + G1.add_edge("1","2") + G1.add_edge("1","3") + G2 = nx.DiGraph() + G2.add_edge("1","2") + G2.add_edge("3","1") # Reversed edge: counts as a distinct edge in a directed graph + G2.add_edge("3","4") + GM1 = gm.graph.Graph(G1) + GM2 = gm.graph.Graph(G2) + + assert GM1.size_edge_union(GM2) == 4 + assert GM1.size_node_union(GM2) == 4 + + assert GM1.size_edge_intersect(GM2) == 1 + assert GM1.size_node_intersect(GM2) == 3 + + + # IF COLOR + G1 = nx.DiGraph(); G1.add_node("1",color="blue") + G2 = nx.DiGraph(); G2.add_node("1",color="red") + + GM1,GM2 = gm.graph.Graph(G1),gm.graph.Graph(G2) + assert GM1.size_node_intersect(GM2) == 1 + GM1,GM2 = gm.graph.Graph(G1,node_attr_key="color"),gm.graph.Graph(G2,node_attr_key="color") + assert GM1.size_node_intersect(GM2) == 0 + + + G1 = nx.DiGraph(); G1.add_edge("1","2",color="blue") + G2 = nx.DiGraph(); G2.add_edge("1","2",color="red") + + GM1,GM2 = gm.graph.Graph(G1),gm.graph.Graph(G2) + assert GM1.size_edge_intersect(GM2) == 1 + assert GM1.size_edge_union(GM2) == 1 + GM1,GM2 = gm.graph.Graph(G1,edge_attr_key="color"),gm.graph.Graph(G2,edge_attr_key="color") + assert GM1.size_edge_intersect(GM2) == 0 + assert GM1.size_edge_union(GM2) == 2 + +def test_degree(): + os.chdir(os.environ["HOME"]) + import networkx as nx + import gmatch4py as gm + + # Not DIRECTED and no attr + G1 = nx.Graph() + G1.add_edge("1","2") + G1.add_edge("1","3") + GM1 = gm.graph.Graph(G1) + assert
GM1.degree('1') == 2 + + G1 = nx.DiGraph() + G1.add_edge("1","2") + G1.add_edge("3","1") + GM1 = gm.graph.Graph(G1) + assert GM1.degree('1') == 2 + assert GM1.in_degree('1') == 1 + assert GM1.out_degree('1') == 1 + + G1 = nx.MultiGraph() + G1.add_edge("1","2",color="blue") + G1.add_edge("1","2",color="red") + G1.add_edge("1","3",color="blue") + GM1 = gm.graph.Graph(G1,edge_attr_key ="color") + + assert GM1.degree_attr('1',"blue") == 2 + assert GM1.degree('1') == 3 + + G1 = nx.MultiDiGraph() + G1.add_edge("1","2",color="blue") + G1.add_edge("1","2",color="red") + G1.add_edge("1","3",color="green") + GM1 = gm.graph.Graph(G1,edge_attr_key ="color") + assert GM1.in_degree_attr('2','red') == 1 + assert GM1.in_degree('2') == 2 + + + + + + + + \ No newline at end of file
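
Reviewer note: the rewritten `Jaccard`, `MCS` and `VertexEdgeOverlap` kernels above all reduce to node and edge set sizes computed ahead of the `prange` loop. The sketch below restates those three formulas in plain Python so they can be checked by hand against the graphs used in `test_intersect_union`. It is only an illustration under stated assumptions: graphs are undirected and given as edge lists, and every function name here (`edge_keys`, `node_set`, `jaccard_similarity`, `mcs_similarity`, `veo_similarity`) is hypothetical and not part of the GMatch4py API.

```python
def edge_keys(edges):
    """Order-independent keys for undirected edges, so (u, v) equals (v, u)."""
    return {tuple(sorted(edge)) for edge in edges}


def node_set(edges):
    """All nodes that appear in at least one edge."""
    return {node for edge in edges for node in edge}


def jaccard_similarity(edges_a, edges_b):
    """Edge Jaccard ratio times node Jaccard ratio, 0 when a union is empty."""
    ea, eb = edge_keys(edges_a), edge_keys(edges_b)
    na, nb = node_set(edges_a), node_set(edges_b)
    if not (ea | eb) or not (na | nb):
        return 0.0
    return (len(ea & eb) / len(ea | eb)) * (len(na & nb) / len(na | nb))


def mcs_similarity(edges_a, edges_b):
    """Shared node count divided by the size of the larger graph."""
    na, nb = node_set(edges_a), node_set(edges_b)
    if not na or not nb:
        return 0.0
    return len(na & nb) / max(len(na), len(nb))


def veo_similarity(edges_a, edges_b):
    """Twice the shared nodes and edges over the total nodes and edges."""
    ea, eb = edge_keys(edges_a), edge_keys(edges_b)
    na, nb = node_set(edges_a), node_set(edges_b)
    denom = len(na) + len(nb) + len(ea) + len(eb)
    if denom == 0:
        return 0.0
    return 2 * (len(na & nb) + len(ea & eb)) / denom


# Same shapes as the "Basic, with a reversed edge" case in test_intersect_union.
g1 = [("1", "2"), ("1", "3")]
g2 = [("1", "2"), ("3", "1"), ("3", "4")]

print(jaccard_similarity(g1, g2))
print(mcs_similarity(g1, g2))
print(veo_similarity(g1, g2))
```

With these inputs the node intersection and union have sizes 3 and 4, and the edge intersection and union have sizes 2 and 3, matching the assertions in `test_intersect_union`. The sketch leaves out attribute hashing, directed edges and the `selected` filter, all of which the Cython versions handle.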