Commit 828e76f5 authored by Fize Jacques

Cleaning + Readme Modif.

parent 5a936e7f
Showing 677 additions and 385 deletions
# STR

This repository contains all the work on STR, or Spatial Textual Representation. The file
hierarchy is divided into multiple modules:
* **config** contains the configuration file and a dedicated class for loading and interacting with it
* **gmatch4py** is a module which contains implementations of various graph matching algorithms
* **gui_grap_viewer** contains a webapp used to visualize graphs and their top-k similar graphs using specific graph matching algorithms.
* **helpers** is a module which contains various helper methods for requesting the geo database (geodict), computing collisions between polygons, etc.
* **models** contains the STR structure and its variations.
* **nlp** contains all the implementations or interfaces of NLP methods such as NER, POS tagging, toponym disambiguation, ...
* **tt4py** is a module dedicated to finding and annotating tokens in a tokenized text.
## Generate STR

To generate STRs, use `generate_str.py`.

```
usage: generate_str.py [-h] [-n {spacy,polyglot,stanford}]
                       [-d {occwiki,most_common,shareprop}] [-t {gen,ext}]
                       [-o OUTPUT]
                       input_pkl

positional arguments:
  input_pkl             Filename of your input. Must be in Pickle format with
                        the following columns:
                          - filename : original filename that contains the text in `content`
                          - id_doc   : id of your document
                          - content  : text data associated to the document
                          - lang     : language of your document

optional arguments:
  -h, --help            show this help message and exit
  -n {spacy,polyglot,stanford}, --ner {spacy,polyglot,stanford}
                        The Named Entity Recognizer you wish to use
  -d {occwiki,most_common,shareprop}, --disambiguator {occwiki,most_common,shareprop}
                        The Named Entity disambiguator you wish to use
  -t {gen,ext}, --transform {gen,ext}
                        Transformation to apply
  -o OUTPUT, --output OUTPUT
                        Output filename
```

There are three ways of generating a STR:

* **Normal** Generate a STR without modifications
* **Generalisation** Generate a STR with a generalisation transformation applied to it
* **Extension** Generate a STR with an extension transformation applied to it
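Whatever the mode, the input is the same pickle file. A minimal sketch of how such an input might be built with pandas (the filenames, ids and texts below are illustrative):

```python
import pandas as pd

# Each row describes one document; the four columns below are the ones
# generate_str.py expects in its input pickle.
df = pd.DataFrame([
    {"filename": "doc_1.txt", "id_doc": 1,
     "content": "Montpellier is a city in the south of France.", "lang": "en"},
    {"filename": "doc_2.txt", "id_doc": 2,
     "content": "La Normandie est une région française.", "lang": "fr"},
])
df.to_pickle("corpus.pkl")
# then, e.g.: python generate_str.py corpus.pkl -n spacy -d most_common
```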
### Generalisation
It is possible to generate **generalised** STRs. A **generalised** STR is a STR where
all spatial entities are generalised (Paris --> France) using one of two hypotheses:

* **All**: all spatial entities are generalised *h* times. If *h* = 2, Paris becomes Europe
(Paris --> France --> Europe).
* **Bounded**: all spatial entities are generalised until they reach a defined spatial
scale. For example, if we set the spatial scale to "country", all spatial entities that are
towns, regions, villages, etc. are generalised until the resulting spatial entities are countries.
A concrete example with Normandy and Montpellier:
 1. Normandy --> France and Montpellier --> Hérault
 2. France stays France and Hérault --> Occitanie
 3. France stays France and Occitanie --> France
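A minimal sketch of the **Bounded** hypothesis, assuming a `get_inclusion_chain(id_, prop)` helper like the one defined in `strpython/models/str.py` and a hypothetical `spatial_scale(id_)` lookup (not the repository's actual implementation):

```python
def generalise_bounded(se_id, target_scale="country"):
    """Climb the inclusion chain (P131: administrative unit) until the
    entity sits at the requested spatial scale."""
    current = se_id
    while spatial_scale(current) != target_scale:  # spatial_scale is assumed here
        parents = get_inclusion_chain(current, "P131")
        if not parents:
            break  # no parent left (e.g. already a continent): stop climbing
        current = parents[0]
    return current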
```
usage: generate_data.py texts_input_dir graphs_output_dir metadata_output_fn generalisation
                        [-h] [-t TYPE_GEN] [-n N] [-b BOUND]

optional arguments:
  -h, --help            show this help message and exit
  -t TYPE_GEN, --type_gen TYPE_GEN
                        Type of generalisation
  -n N                  Language
  -b BOUND, --bound BOUND
                        If generalisation is bounded, this arg. corresponds to
                        the maximal spatial scale
```
### Extension
Another way of transforming a STR is to extend a part of its spatial entities. The extension
of a STR works this way:

* We select entities that are towns with a low probability of appearance in the corpus.
* Then, we search for their neighbours within a radius *d* around them.
* Finally, we add to the STR those neighbours that fit these conditions (see the sketch after the usage block below):
    * belong to the same country
    * have a probability superior to the median score over all the spatial entities in the STR
    * are a capital or a town
```
usage: generate_data.py texts_input_dir graphs_output_dir metadata_output_fn extension
                        [-h] [-d DISTANCE] [-u UNIT] [-a ADJACENT_COUNT]

optional arguments:
  -h, --help            show this help message and exit
  -d DISTANCE, --distance DISTANCE
                        radius distance
  -u UNIT, --unit UNIT  unit used for the radius distance
  -a ADJACENT_COUNT, --adjacent_count ADJACENT_COUNT
                        number of adjacent SE added to the STR
```
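A minimal sketch of the neighbour-filtering step described above; `neighbours_within`, `country_of`, `probability`, and `class_of` are hypothetical stand-ins for the repository's actual geodict queries:

```python
import statistics

def extend_candidates(se_id, str_entities, d=100):
    """Return neighbours of se_id that satisfy the three extension conditions."""
    median_score = statistics.median(probability(se) for se in str_entities)
    kept = []
    for n in neighbours_within(se_id, radius=d):   # hypothetical geodict query
        if (country_of(n) == country_of(se_id)     # same country
                and probability(n) > median_score  # above the median probability
                and class_of(n) in {"capital", "town"}):
            kept.append(n)
    return kept
```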
# coding: utf-8
\ No newline at end of file
-import sys, os, re, argparse, warnings
+import sys, os, re, argparse, warnings, json
import logging

logger = logging.getLogger("elasticsearch")
@@ -24,8 +24,10 @@ from strpython.nlp.disambiguator.wikipedia_cooc import WikipediaDisambiguator as
from strpython.nlp.disambiguator.geodict_gaurav import GauravGeodict as shared_geo_d
from strpython.nlp.disambiguator.most_common import MostCommonDisambiguator as most_common_d

-from mytoolbox.text.clean import clean_text
+from mytoolbox.text.clean import *
+from mytoolbox.exception.inline import safe_execute
+from stop_words import get_stop_words

import logging
logger = logging.getLogger("elasticsearch")
@@ -33,6 +35,7 @@ logger.setLevel(logging.ERROR)
logger = logging.getLogger("Fiona")
logger.setLevel(logging.ERROR)

disambiguator_dict = {
    "occwiki" : wiki_d,
    "most_common" : most_common_d,
@@ -94,13 +97,50 @@ pipelines={
    lang: Pipeline(lang=lang, ner=ner_dict[args.ner](lang=lang), tagger=Tagger(), disambiguator=disambiguator_dict[args.disambiguator]())
    for lang in tqdm(languages, desc="Load Pipelines model")
}
def matcher_agrovoc(lang):
    """
    Return the terminology labels from the Agrovoc vocabulary for a language.

    Parameters
    ----------
    lang : str
        language of the terms

    Returns
    -------
    list
        Agrovoc labels in the requested language
    """
    agrovoc_vocab = pd.read_csv("../thematic_str/data/terminology/agrovoc/agrovoc_cleaned.csv")
    agrovoc_vocab["preferred_label_new"] = agrovoc_vocab["preferred_label_new"].apply(
        lambda x: safe_execute({}, Exception, json.loads, x.replace("\'", "\"")))
    agrovoc_vocab["label_lang"] = agrovoc_vocab["preferred_label_new"].apply(
        lambda x: str(resolv_a(x[lang]) if lang in x else np.nan).strip().lower())
    agrovoc_vocab = agrovoc_vocab[~pd.isna(agrovoc_vocab["label_lang"])]
    return agrovoc_vocab["label_lang"].values.tolist()
stopwords = {
    lang: matcher_agrovoc(lang)
    for lang in tqdm(languages, desc="Load stopwords")
}
for lang in stopwords:
    stopwords[lang].extend(get_stop_words(lang))
print("Clean input content ...") print("Clean input content ...")
df["content"]= df.content.progress_apply(lambda x :clean_text(x)) if not "entities" in df:
df["content"]= df.content.progress_apply(lambda x :clean_text(x))
count_error=0 count_error=0
def build(pipelines,x): def build(pipelines,x):
global count_error global count_error
try:
if "entities" in x:
return pipelines[x.lang].build(x.content,toponyms=x.entities,stop_words=stopwords[x.lang])
except Exception as e:
print(e)
try: try:
return pipelines[x.lang].build(x.content) return pipelines[x.lang].build(x.content)
except Exception as e: except Exception as e:
......
{
"cells": [],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 2
}
This source diff could not be displayed because it is too large.
Flask_Session==0.3.1
Shapely==1.5.17.post1
matplotlib==2.0.2
termcolor==1.1.0
networkx==2.1
requests==2.18.4
numpy==1.14.0
gensim==1.0.1
elasticsearch==5.2.0
geopandas==0.2.1
SQLAlchemy==1.1.14
pycorenlp==0.3.0
Flask_Login==0.4.0
pandas==0.19.2
scipy==0.19.1
Flask==0.12
ipython==6.2.1
python_bcrypt==0.3.2
extractor==0.5
progressbar2==3.35.0
scikit_bio==0.5.1
scikit_learn==0.19.1
typing==3.6.4
plotly
folium
\ No newline at end of file
# coding: utf-8
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import argparse
parser=argparse.ArgumentParser()
parser.add_argument("result_csv")
args=parser.parse_args()
data=pd.read_csv(args.result_csv,index_col=0)
data=data[data.mesure != "BP"]
def pareto_frontier_multi(myArray):
    # Sort on first dimension, remembering the original row order so the
    # returned indices can be used against the unsorted dataframe
    # (the original version returned positions in the *sorted* array,
    # which made data.iloc[index] below pick the wrong rows).
    order = myArray[:, 0].argsort()
    myArray = myArray[order]
    # Add first row to pareto_frontier
    pareto_frontier = myArray[0:1, :]
    indices, i = [order[0]], 1
    # Test each next row against the last row in pareto_frontier
    for row in myArray[1:, :]:
        if sum([row[x] >= pareto_frontier[-1][x]
                for x in range(len(row))]) == len(row):
            # If it is better on all features, add the row to pareto_frontier
            pareto_frontier = np.concatenate((pareto_frontier, [row]))
            indices.append(order[i])
        i += 1
    return indices, pareto_frontier
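For illustration, a toy run of `pareto_frontier_multi` (not part of the original script):

```python
# With this toy data the frontier keeps rows 0 and 3.
toy = np.array([[0.2, 0.9], [0.5, 0.5], [0.8, 0.1], [0.9, 0.95]])
idx, front = pareto_frontier_multi(toy)
print(idx)    # indices of the kept rows, relative to `toy`
print(front)  # the frontier rows themselves
```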
def highlight_max(s):
    '''
    highlight the maximum in a Series yellow.
    '''
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]

def highlight_min(s):
    '''
    highlight the minimum in a Series red.
    '''
    is_min = s == s.min()
    return ['background-color: #d64541;color:white;' if v else '' for v in is_min]
def colorize(df,fields):
return df.style.apply(highlight_max,subset=fields).apply(highlight_min,subset=fields)
to_colorize="c1 c2 c3 c4".split()
print("Table for {0}".format(args.result_csv))
print("Average Measure Precision")
print(data.groupby("mesure").mean().to_csv())
print("")
index,data_pa=pareto_frontier_multi(data["c1 c2 c3 c4".split()].values)
print("PARETO c1 c2 c3 c4")
print(data.iloc[index].to_csv(index=False))
print("")
index,data_pa=pareto_frontier_multi(data["c1 c4".split()].values)
print("PARETO c1 c4")
print(data.iloc[index].to_csv(index=False))
print("")
index,data_pa=pareto_frontier_multi(data["c2 c3".split()].values)
print("PARETO c2 c3")
print(data.iloc[index].to_csv(index=False))
@@ -15,6 +15,7 @@ def jsonKeys2int(x):
    return x

__cache__crit = {}

if os.path.exists("cache.json"):
    try:
        __cache__crit = json.load(open("cache.json"))
...
import json
-import os, re
+import os
import warnings

-import psycopg2
from shapely.geometry import Point

from ..config.configuration import config
@@ -10,7 +10,8 @@ import geopandas as gpd

__cache = {}
__cache_adjacency = {}
-__limit_cache = 2000
+__limit_cache = 10000
+__cache_frequency = {}

def add_cache(id_, hull):
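The body of `add_cache` is collapsed in this diff. Given the new `__cache_frequency` counter, a plausible frequency-aware eviction sketch (an assumption, not necessarily the repository's actual implementation):

```python
def add_cache(id_, hull):
    # Hypothetical sketch: evict the least-frequently-used entry once the
    # cache reaches __limit_cache, then store the new hull.
    if len(__cache) >= __limit_cache:
        coldest = min(__cache_frequency, key=__cache_frequency.get)
        __cache.pop(coldest, None)
        __cache_frequency.pop(coldest, None)
    __cache[id_] = hull
    __cache_frequency[id_] = 1
```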
@@ -80,12 +81,10 @@ def getGEO(id_se):
    data = data[0]
    if "path" in data:
-        return int(re.findall("\d+", data.other["path"])[-1])
-        #return explode(gpd.read_file(os.path.join(config.osm_boundaries_directory, data.other["path"]))).convex_hull
+        return explode(gpd.read_file(os.path.join(config.osm_boundaries_directory, data.other["path"]))).convex_hull
    elif "coord" in data:
-        return data.coord.lon, data.coord.lat
-        #return gpd.GeoDataFrame(gpd.GeoSeries([Point(data.coord.lon, data.coord.lat).buffer(1.0)])).rename(
-        #    columns={0: 'geometry'})
+        return gpd.GeoDataFrame(gpd.GeoSeries([Point(data.coord.lon, data.coord.lat).buffer(1.0)])).rename(
+            columns={0: 'geometry'})
    return None
@@ -107,29 +106,6 @@ def getGEO2(id_se):
        return "C", gpd.GeoDataFrame(gpd.GeoSeries([Point(data.coord.lon, data.coord.lat).buffer(1.0)])).rename(
            columns={0: 'geometry'})
    return None
-def is_collision_psql_poly(id_1, id_2):
-    conn = psycopg2.connect("dbname='postgis_geodict'host='localhost'")
-    cur = conn.cursor()
-    cur.execute("""select a.id, b.id, st_intersects(st_convexhull(a.geom), st_convexhull(b.geom))
-    from boundary as a, boundary as b
-    where a.id = {id1} and b.id = {id2}; """.format(id1=id_1, id2=id_2))
-    listpoly = cur.fetchall()
-    if not listpoly:
-        warnings.warn("No results found in DATABASE")
-    return listpoly[0][-1]
-
-def is_collision_psql_poly_and_point(poly_id, data_point):
-    conn = psycopg2.connect("dbname='postgis_geodict'host='localhost'")
-    cur = conn.cursor()
-    cur.execute("""SELECT b.id,
-    st_within(st_buffer(ST_GeomFromText('POINT({lon} {lat})',4326),1), st_setsrid(b.geom,4326)) FROM boundary as b
-    WHERE id = {poly_id};""".format(lon=data_point[0], lat=data_point[0], poly_id=poly_id))
-    listpoly = cur.fetchall()
-    if not listpoly:
-        warnings.warn("No results found in DATABASE")
-    return listpoly[0][-1]
def collide(se1, se2):
    """
@@ -138,38 +114,30 @@ def collide(se1, se2):
    :param se2: id of the second spatial entity
    :return:
    """
+    global __cache_frequency
    try:
        if se1 in __cache:
            data_se1 = __cache[se1]
            __cache_frequency[se1] += 1
-        # else:
-        #     data_se1 = getGEO(se1)
-        #     add_cache(se1, data_se1)
+        else:
+            data_se1 = getGEO(se1)
+            add_cache(se1, data_se1)
        if se2 in __cache:
            data_se2 = __cache[se2]
            __cache_frequency[se2] += 1
-        # else:
-        #     data_se2 = getGEO(se2)
-        #     add_cache(se2, data_se2)
+        else:
+            data_se2 = getGEO(se2)
+            add_cache(se2, data_se2)
-    except:
-        return False
+    except Exception as e:
+        warnings.warn(e)
+        return False
+
+    if not type(data_se1) == gpd.GeoDataFrame or not type(data_se2) == gpd.GeoDataFrame:
+        return False
-    data_se1 = getGEO(se1)
-    data_se2 = getGEO(se2)
-    if type(data_se1) == int and type(data_se2) == int:
-        return is_collision_psql_poly(data_se1, data_se2)
-    if type(data_se1) == tuple and type(data_se2) == tuple:
-        return Point(*data_se1).buffer(1).intersects(Point(*data_se2).buffer(1))
-    if type(data_se1) == tuple and type(data_se2) == int:
-        return is_collision_psql_poly_and_point(data_se2, data_se1)
-    if type(data_se1) == int and type(data_se2) == tuple:
-        return is_collision_psql_poly_and_point(data_se1, data_se2)
-    # try:
-    #     if data_se1.intersects(data_se2):
-    #         return True
-    # except:
-    #     if data_se1.intersects(data_se2).any():
-    #         return True
+    try:
+        if data_se1.intersects(data_se2):
+            return True
+    except:
+        if data_se1.intersects(data_se2).any():
+            return True
    return False
@@ -195,4 +163,4 @@ def collisionTwoSEBoundaries(id_se1, id_se2):
            __cache_adjacency[id_se1][id_se2] = True
            return True
    __cache_adjacency[id_se1][id_se2] = False
    return False
\ No newline at end of file
@@ -23,6 +23,7 @@ def get_most_common_id_v3(label, lang='fr'):
    :param lang:
    :return:
    """
    label = label.strip()
    id_, score = None, -1
    data = gazetteer.get_by_label(label, lang)
    if data:
@@ -31,11 +32,11 @@ def get_most_common_id_v3(label, lang='fr'):
        if data2 and data2[0].score > data[0].score:
            data2 = data2[0]
            id_, score = data2.id, data2.score
-        simi = gazetteer.get_n_label_similar(label, lang, n=5)
-        if simi:
-            id_3, score3 = simi[0].id, simi[0].score
-            if id_3 and score3 > score:
-                id_, score = id_3, score3
+        # simi = gazetteer.get_n_label_similar(label, lang, n=5)
+        # if simi:
+        #     id_3, score3 = simi[0].id, simi[0].score
+        #     if id_3 and score3 > score:
+        #         id_, score = id_3, score3
        return gazetteer.get_by_id(id_)[0]
@@ -44,13 +45,13 @@ def get_most_common_id_v3(label, lang='fr'):
    if data:
        return data[0]  # data[0].id, data[0].score
-    similar_label = gazetteer.get_n_label_similar(label, lang, n=5)
-    if similar_label:
-        return similar_label[0]  # similar_label[0].id, similar_label[0].score
-    similar_alias = gazetteer.get_n_alias_similar(label, lang, n=5)
-    if similar_alias:
-        return similar_alias[0]  # similar_alias[0].id, similar_alias[0].score
+    # similar_label = gazetteer.get_n_label_similar(label, lang, n=5)
+    # if similar_label:
+    #     return similar_label[0]  # similar_label[0].id, similar_label[0].score
+    # similar_alias = gazetteer.get_n_alias_similar(label, lang, n=5)
+    # if similar_alias:
+    #     return similar_alias[0]  # similar_alias[0].id, similar_alias[0].score
    return None
...
@@ -5,6 +5,7 @@ import os
import time
import warnings

from tqdm import tqdm
import folium
import geopandas as gpd
import networkx as nx
@@ -21,6 +22,7 @@ import numpy as np

# logging.basicConfig(filename=config.log_file,level=logging.INFO)

def get_inclusion_chain(id_, prop):
    """
    For an entity, return its geographical inclusion tree using a property.
@@ -40,10 +42,28 @@ class STR(object):
""" """
Str basic structure Str basic structure
""" """
__cache_inclusion = {} __cache_inclusion = {} # Store inclusion relations found between spaital entities
__cache_adjacency = {} # Store adjacency relations found between spaital entities
__cache_entity_data = {} # Store data about entity requested
def __init__(self, tagged_text, spatial_entities): def __init__(self, tagged_text, spatial_entities):
"""
Constructir
Parameters
----------
tagged_text : list
Text in forms of token associated with tag (2D array 2*t where t == |tokens| )
spatial_entities : dict
spatial entities associated with a text. Follow this structure {"<id>: <label>"}
"""
self.tagged_text = tagged_text self.tagged_text = tagged_text
self.spatial_entities = spatial_entities self.spatial_entities = spatial_entities
for k in list(spatial_entities.keys()):
if not k[:2] == "GD":
del spatial_entities[k]
self.adjacency_relationships = {} self.adjacency_relationships = {}
self.inclusion_relationships = {} self.inclusion_relationships = {}
@@ -51,11 +71,21 @@ class STR(object):
    @staticmethod
    def from_networkx_graph(g: nx.Graph, tagged_: list = []):
        """
        Build a STR from a networkx graph.

        Parameters
        ----------
        g : nx.Graph
            input graph
        tagged_ : list, optional
            tagged text (the default is []). A 2D array 2*t where t == |tokens|.

        Returns
        -------
        STR
            resulting STR
        """
        sp_en = {}
        for nod in g:
            try:
@@ -72,10 +102,19 @@ class STR(object):
    @staticmethod
    def from_dict(spat_ent: dict, tagged_: list = []):
        """
        Build a STR from a dict of spatial entities.

        Parameters
        ----------
        spat_ent : dict
            dict of spatial entities associated with a text. Follows this structure: {"<id>": "<label>"}
        tagged_ : list, optional
            tagged text (the default is []). A 2D array 2*t where t == |tokens|.

        Returns
        -------
        STR
            resulting STR
        """
        sp_en = {}
        for id_, label in spat_ent.items():
@@ -87,16 +126,59 @@ class STR(object):
    @staticmethod
    def from_pandas(dataf: pd.DataFrame, tagged: list = []):
        """
        Build a STR from a pandas DataFrame with two columns: id and label.

        Parameters
        ----------
        dataf : pd.DataFrame
            dataframe containing the spatial entities
        tagged : list, optional
            tagged text (the default is []). A 2D array 2*t where t == |tokens|.

        Returns
        -------
        STR
            resulting STR
        """
        return STR.from_dict(pd.Series(dataf.label.values, index=dataf.id).to_dict(), tagged)
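For illustration, a minimal usage sketch of these constructors (the ids below are made up; real ones come from the geodict gazetteer):

```python
# Build a STR from an {id: label} mapping, then from an equivalent DataFrame.
entities = {"GD123": "Montpellier", "GD456": "France"}  # hypothetical geodict ids
str_a = STR.from_dict(entities)

df = pd.DataFrame({"id": list(entities.keys()), "label": list(entities.values())})
str_b = STR.from_pandas(df)
```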
    def set_graph(self, g):
        """
        Apply changes to the current STR based on a networkx graph.

        Parameters
        ----------
        g : networkx.Graph
            input graph
        """
        self.graph = g
        rel_ = self.graph.edges(data=True)
        for edge in rel_:
            id1, id2 = edge[0], edge[1]
            if edge[2]["color"] == "green":
                self.add_adjacency_rel(edge[0], edge[1])
                self.add_cache__adjacency(id1, id2, True)
            elif edge[2]["color"] == "red":
                self.add_inclusion_rel(edge[0], edge[1])
                self.add_cache_inclusion(id1, id2, True)
    def add_spatial_entity(self, id, label=None, v=True):
        """
        Add a spatial entity to the current STR.

        Parameters
        ----------
        id : str
            identifier of the spatial entity in Geodict
        label : str, optional
            label to use if not available in Geodict (the default is None)
        """
-        data_ = gazetteer.get_by_id(id)
+        data_ = self.get_data(id)
        if not data_:
            warnings.warn("{0} wasn't found in Geo-Database".format(id))
            return False
@@ -110,9 +192,14 @@ class STR(object):
    def add_spatial_entities(self, ids: list, labels: list = []):
        """
        Add spatial entities to the current STR.

        Parameters
        ----------
        ids : list
            list of identifiers of each spatial entity
        labels : list, optional
            list of labels of each spatial entity
        """
        if not labels:
            warnings.warn("Labels list is empty. @en labels from Geo-Database will be used by default")
@@ -125,28 +212,120 @@ class STR(object):
            self.add_spatial_entity(id, label, False)
        # print(self.graph.nodes(data=True))
-    def add_adjacency_rel(self, se1, se2, v=True):
-        if not se1 in self.adjacency_relationships:
-            self.adjacency_relationships[se1] = {}
-        self.adjacency_relationships[se1][se2] = v
+    def add_adjacency_rel(self, se1, se2):
+        """
+        Add an adjacency relationship to the current STR.
+
+        Parameters
+        ----------
+        se1 : str
+            Identifier of the first spatial entity
+        se2 : str
+            Identifier of the second spatial entity
+        """
+        if not se1 in self.adjacency_relationships: self.adjacency_relationships[se1] = {}
+        if not se2 in self.adjacency_relationships: self.adjacency_relationships[se2] = {}
+        self.adjacency_relationships[se1][se2], self.adjacency_relationships[se2][se1] = True, True
+        self.add_cache__adjacency(se1, se2, True)
-    def add_inclusion_rel(self, se1, se2, v=True):
+    def add_inclusion_rel(self, se1, se2):
+        """
+        Add an inclusion relationship to the current STR.
+
+        Parameters
+        ----------
+        se1 : str
+            Identifier of the first spatial entity
+        se2 : str
+            Identifier of the second spatial entity
+        """
        if not se1 in self.inclusion_relationships:
            self.inclusion_relationships[se1] = {}
-        self.inclusion_relationships[se1][se2] = v
+        self.inclusion_relationships[se1][se2] = True
+        self.add_cache_inclusion(se1, se2, True)
    def add_cache_inclusion(self, id1, id2, v=True):
        """
        Record an inclusion relation in a cache variable.

        Parameters
        ----------
        id1 : str
            id of the first spatial entity
        id2 : str
            id of the second spatial entity
        v : bool, optional
            whether the relation exists between the two spatial entities. Default is True
        """
        if not id1 in STR.__cache_inclusion:
            STR.__cache_inclusion[id1] = {}
        STR.__cache_inclusion[id1][id2] = v

    def add_cache__adjacency(self, se1, se2, v=True):
        """
        Record an adjacency relation in a cache variable.

        Parameters
        ----------
        se1 : str
            id of the first spatial entity
        se2 : str
            id of the second spatial entity
        v : bool, optional
            whether the relation exists between the two spatial entities. Default is True
        """
        if not se1 in STR.__cache_adjacency:
            STR.__cache_adjacency[se1] = {}
        if not se2 in STR.__cache_adjacency:
            STR.__cache_adjacency[se2] = {}
        STR.__cache_adjacency[se1][se2] = v
        STR.__cache_adjacency[se2][se1] = v

    def get_data(self, id_se):
        """
        Return a gazpy.Element object containing information about a spatial entity.

        Parameters
        ----------
        id_se : str
            Identifier of the spatial entity

        Returns
        -------
        gazpy.Element
            data
        """
        if id_se in STR.__cache_entity_data:
            return STR.__cache_entity_data[id_se]
        data = gazetteer.get_by_id(id_se)
        if len(data) > 0:
            STR.__cache_entity_data[id_se] = data[0]
            return STR.__cache_entity_data[id_se]  # without this return, the first lookup always yielded None
-    def transform_spatial_entities(self, transform_map):
+    def transform_spatial_entities(self, transform_map: dict):
        """
        Replace or delete certain spatial entities based on a transformation map.

        Parameters
        ----------
        transform_map : dict
            new mapping for the spatial entities in the current STR. Required format: {"<id of the old spatial entity>": "<id of the new spatial entity>"}
        """
        final_transform_map = {}
        # Erase old spatial entities
        new_label = {}
        to_del = set([])
        for old_se, new_se in transform_map.items():
-            data = gazetteer.get_by_id(new_se)
+            data = self.get_data(new_se)
            to_del.add(old_se)
            if data:
                data = data[0]
@@ -159,7 +338,9 @@ class STR(object):
                new_label[new_se] = data.label.en
            else:
                warnings.warn("{0} doesn't exist in the geo database!".format(new_se))

        self.graph = nx.relabel_nodes(self.graph, final_transform_map)
        for es in to_del:
            if es in self.graph._node:
                self.graph.remove_node(es)
@@ -169,9 +350,9 @@ class STR(object):
    def update(self):
        """
        Update the relationships between spatial entities in the STR. Used when transforming the STR.
        """
        nodes = copy.deepcopy(self.graph.nodes(data=True))
        self.graph.clear()
        self.graph.add_nodes_from(nodes)
@@ -194,25 +375,29 @@ class STR(object):
                    self.graph.add_edge(se1, se2, key=0, color="green")

-    def add_cache_inclusion(self, id1, id2):
-        if not id1 in STR.__cache_inclusion:
-            STR.__cache_inclusion[id1] = set([])
-        STR.__cache_inclusion[id1].add(id2)

    def is_included_in(self, se1_id, se2_id):
-        global __cache_inclusion
        """
        Return True if a spatial entity is included within another one.

        Parameters
        ----------
        se1_id : str
            id of the contained entity
        se2_id : str
            id of the container entity

        Returns
        -------
        bool
            True if se1 is included in se2
        """
        if se1_id in self.inclusion_relationships:
            if se2_id in self.inclusion_relationships[se1_id]:
                return self.inclusion_relationships[se1_id][se2_id]
-        if se1_id in STR.__cache_inclusion:
-            if se2_id in STR.__cache_inclusion[se1_id]:
-                return True

        inc_chain_P131 = get_inclusion_chain(se1_id, "P131")
        inc_chain_P706 = get_inclusion_chain(se1_id, "P706")
@@ -220,18 +405,120 @@ class STR(object):
        inc_chain.extend(inc_chain_P706)
        inc_chain = set(inc_chain)
        if se2_id in inc_chain:
-            self.add_cache_inclusion(se1_id, se2_id)
+            self.add_cache_inclusion(se1_id, se2_id, True)
            return True
        return False
    def is_adjacent_cache(self, se1, se2):
        """
        Return True if the two spatial entities were previously found to be adjacent.

        Parameters
        ----------
        se1 : str
            id of the first spatial entity
        se2 : str
            id of the second spatial entity

        Returns
        -------
        bool
            True if se1 is adjacent to se2
        """
        if se1 in STR.__cache_adjacency:
            if se2 in STR.__cache_adjacency[se1]:
                return STR.__cache_adjacency[se1][se2]
        if se2 in STR.__cache_adjacency:
            if se1 in STR.__cache_adjacency[se2]:
                return STR.__cache_adjacency[se2][se1]
        return False

    def is_included_cache(self, se1, se2):
        """
        Return True if a spatial entity was previously found to be included in another one.

        Parameters
        ----------
        se1 : str
            id of the first spatial entity
        se2 : str
            id of the second spatial entity

        Returns
        -------
        bool
            True if se1 is included in se2
        """
        if se1 in STR.__cache_inclusion:
            if se2 in STR.__cache_inclusion[se1]:
                return STR.__cache_inclusion[se1][se2]
        return False
    def is_adjacent(self, se1, se2, datase1=None, datase2=None):
        """
        Return True if se1 is adjacent to se2.

        Parameters
        ----------
        se1 : str
            id of the first spatial entity
        se2 : str
            id of the second spatial entity
        datase1 : gazpy.Element, optional
            if given, cached data concerning the spatial entity with id = se1 (the default is None)
        datase2 : gazpy.Element, optional
            if given, cached data concerning the spatial entity with id = se2 (the default is None)

        Returns
        -------
        bool
            True if adjacent
        """
        stop_class = set(["A-PCLI", "A-ADM1"])

        def get_p47_adjacency_data(data):
            # local helper; called directly, not through self
            p47se1 = []
            for el in data.other.P47:
                d = gazetteer.get_by_other_id(el, "wikidata")
                if not d: continue
                p47se1.append(d[0].id)
            return p47se1

        if self.is_adjacent_cache(se1, se2):
            return False
        if self.is_included_in(se1, se2) or self.is_included_in(se2, se1):
            return False

        data_se1, data_se2 = self.get_data(se1), self.get_data(se2)

        if "P47" in data_se2 and se1 in get_p47_adjacency_data(data_se2):
            return True
            # print("P47")
        elif "P47" in data_se1 and se2 in get_p47_adjacency_data(data_se1):
            return True
            # print("P47")

        if collisionTwoSEBoundaries(se1, se2):
            return True

        if "coord" in data_se1 and "coord" in data_se2:
            if Point(data_se1.coord.lon, data_se1.coord.lat).distance(
                    Point(data_se2.coord.lon, data_se2.coord.lat)) < 1 and len(
                    set(data_se1.class_) & stop_class) < 1 and len(set(data_se2.class_) & stop_class) < 1:
                return True
        return False
    def get_inclusion_relationships(self):
        """
        Find all the inclusion relationships between the spatial entities declared in the current STR.
        """
-        inclusions_ = []
-        for se_ in self.spatial_entities:
+        for se_ in tqdm(self.spatial_entities, desc="Extract Inclusion"):
            inc_chain_P131 = get_inclusion_chain(se_, "P131")
            inc_chain_P706 = get_inclusion_chain(se_, "P706")
@@ -242,61 +529,19 @@ class STR(object):
            for se2_ in self.spatial_entities:
                if se2_ in inc_chain:
                    self.add_inclusion_rel(se_, se2_)
-        return inclusions_
-    def getP47AdjacencyData(self, data):
-        p47se1 = []
-        for el in data.other.P47:
-            d = gazetteer.get_by_other_id(el, "wikidata")
-            if not d: continue
-            p47se1.append(d[0].id)
-        return p47se1
-
-    def is_adjacent(self, se1, se2, datase1=None, datase2=None):
-        f = False
-        stop_class = set(["A-PCLI", "A-ADM1"])
-        if self.is_included_in(se1, se2):
-            return f
-        elif self.is_included_in(se2, se1):
-            return f
-        data_se1 = gazetteer.get_by_id(se1)[0] if not datase1 else datase1  # avoids reloading it every time
-        data_se2 = gazetteer.get_by_id(se2)[0] if not datase2 else datase2
-        # print("testP47")
-        if "P47" in data_se2:
-            if se1 in self.getP47AdjacencyData(data_se2):
-                return True
-                # print("P47")
-        if not f:
-            if "P47" in data_se1:
-                if se2 in self.getP47AdjacencyData(data_se1):
-                    return True
-                    # print("P47")
-        if not f:
-            # print("test collision")
-            if collisionTwoSEBoundaries(se1, se2):
-                return True
-        if not f:
-            if "coord" in data_se1 and "coord" in data_se2:
-                if Point(data_se1.coord.lon, data_se1.coord.lat).distance(
-                        Point(data_se2.coord.lon, data_se2.coord.lat)) < 1 and len(
-                        set(data_se1.class_) & stop_class) < 1 and len(set(data_se2.class_) & stop_class) < 1:
-                    return True
-        return f
    def get_adjacency_relationships(self):
        """
        Find all the adjacency relationships between the spatial entities declared in the current STR.
        """
-        data = {se: gazetteer.get_by_id(se)[0] for se in self.spatial_entities}
-        for se1 in self.spatial_entities:
+        data = {se: self.get_data(se) for se in self.spatial_entities}
+        for se1 in tqdm(self.spatial_entities, desc="Extract Adjacency Relationship"):
            data_se1 = data[se1]
            for se2 in self.spatial_entities:
                if se1 == se2: continue
-                # print("test adjacency")
                if se1 in self.adjacency_relationships:
                    if se2 in self.adjacency_relationships[se1]:
                        continue
@@ -311,11 +556,22 @@ class STR(object):
    def build(self, inc=True, adj=True, verbose=False):
        """
        Build the STR.

        Parameters
        ----------
        inc : bool, optional
            whether inclusion relationships have to be included in the STR (the default is True)
        adj : bool, optional
            whether adjacency relationships have to be included in the STR (the default is True)
        verbose : bool, optional
            verbose mode activated (the default is False)

        Returns
        -------
        networkx.Graph
            graph representing the STR
        """
        nodes = []
        for k, v in self.spatial_entities.items():
            nodes.append((k, {"label": v}))
@@ -332,7 +588,7 @@ class STR(object):
                    graph.add_edge(se1, se2, key=0, color="green")
                    graph.add_edge(se2, se1, key=0, color="green")
        logging.info("Extract Adjacency Rel\t{0}".format(time.time() - debut))
        if inc:
            debut = time.time()
            self.get_inclusion_relationships()
@@ -340,18 +596,20 @@ class STR(object):
                for se2 in self.inclusion_relationships[se1]:
                    if self.inclusion_relationships[se1][se2]:
                        graph.add_edge(se1, se2, key=0, color="red")
        logging.info("Extract Inclusion Rel\t{0}".format(time.time() - debut))

        self.graph = graph
        return graph
    def save_graph_fig(self, output_fn, format="svg"):
        """
        Save the graphviz representation of the STR graph.

        Parameters
        ----------
        output_fn : string
            Output filename
        format : str
            Output format (svg or pdf)
        """
        try:
@@ -364,22 +622,27 @@ class STR(object):
            print("Error while saving STR to {0}".format(format))

    def getUndirected(self):
        """
        Return the undirected form of the STR graph.

        Returns
        -------
        networkx.Graph
            undirected graph
        """
        return nx.Graph(self.graph)

-    def set_graph(self, g):
-        self.graph = g
-        rel_ = self.graph.edges(data=True)
-        for edge in rel_:
-            id1, id2 = edge[0], edge[1]
-            if edge[2]["color"] == "green":
-                self.add_adjacency_rel(edge[0], edge[1])
-                add_cache_adjacency(id1, id2)
-            elif edge[2]["color"] == "red":
-                self.add_inclusion_rel(edge[0], edge[1])
-                self.add_cache_inclusion(id1, id2)
    def get_geo_data_of_se(self):
        """
        Return geographical information for each spatial entity in the STR.

        Returns
        -------
        geopandas.GeoDataFrame
            dataframe containing geographical information for each entity in the STR
        """
        points, label, class_ = [], [], []
        for se in self.spatial_entities:
            data = gazetteer.get_by_id(se)[0]
@@ -396,6 +659,20 @@ class STR(object):
        return df
    def get_cluster(self, id_=None):
        """
        Return the clusters detected using the spatial entities' positions.

        Parameters
        ----------
        id_ : str, optional
            temp file id of a cached version of the geo info (the default is None)

        Returns
        -------
        gpd.GeoDataFrame
            cluster geometry
        """
        if os.path.exists("./temp_cluster/{0}.geojson".format(id_)):
            return gpd.read_file("./temp_cluster/{0}.geojson".format(id_))
@@ -412,22 +689,6 @@ class STR(object):
        samples, labels = dbscan(X)
        data["cluster"] = labels
"""
# deuxième découpe en cluster
c=data['cluster'].value_counts().idxmax()
X=data[data["cluster"] == c]
X=X[["x","y"]]
bandwidth = estimate_bandwidth(X.values)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X.values)
X["cluster"]=ms.labels_+(data['cluster'].max()+1)
lab=ms.labels_
lab+=data['cluster'].max()+1
data["cluster"][data["cluster"] == c]=X["cluster"]
"""
        geo = data.groupby("cluster").apply(to_Polygon)
        cluster_polybuff = gpd.GeoDataFrame(geometry=geo)
        if id_:
@@ -436,6 +697,15 @@ class STR(object):
    def to_folium(self):
        """
        Use the folium package to project the STR on a map.

        Returns
        -------
        folium.Map
            folium map instance
        """
        points = []
        for se in self.spatial_entities:
            data = gazetteer.get_by_id(se)[0]
@@ -485,6 +755,20 @@ class STR(object):
    def map_projection(self, plt=False):
        """
        Return a matplotlib figure of the STR.

        Parameters
        ----------
        plt : bool, optional
            whether to call plt.show() (the default is False)

        Returns
        -------
        plt.Figure
            matplotlib figure instance
        """
        import matplotlib.pyplot as plt
        world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
        base = world.plot(color='white', edgecolor='black', figsize=(16, 9))
@@ -527,11 +811,39 @@ class STR(object):
        plt.show()
-def to_Multipoints(x):
-    # print(x[["x","y"]].values)
-    return Polygon([Point(z) for z in x[["x","y"]].values]).buffer(1)
+# def to_Multipoints(x):
+#     """
+#     Return a polygon buffered representation for a set of points.
+#
+#     Parameters
+#     ----------
+#     x : pandas.Series
+#         coordinates columns
+#
+#     Returns
+#     -------
+#     shapely.geometry.Polygon
+#         polygon
+#     """
+#     # print(x[["x","y"]].values)
+#     return Polygon([Point(z) for z in x[["x","y"]].values]).buffer(1)
def to_Polygon(x):
    """
    Return a polygon buffered representation for a set of points.

    Parameters
    ----------
    x : pandas.Series
        coordinates columns

    Returns
    -------
    shapely.geometry.Polygon
        polygon
    """
    points = [Point(z) for z in x[["x", "y"]].values]
    if len(points) > 2:
        coords = [p.coords[:][0] for p in points]
...
@@ -57,3 +57,6 @@ class Disambiguator(object):
    def disambiguate(self, ner_result):
        pass

    def disambiguate_list(self, toponyms, lang):
        pass
\ No newline at end of file
@@ -42,6 +42,14 @@ class MostCommonDisambiguator(Disambiguator):
        return new_count, selected_en

    def disambiguate_list(self, toponyms, lang):
        result = {}
        for toponym in toponyms:
            id_, _ = self.disambiguate_(toponym, lang)
            if id_:
                result[id_] = toponym
        return result

    def disambiguate_(self, label, lang='fr'):
        if re.match("^\d+$", label):
            return 'O', -1
...
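A hypothetical usage of the new `disambiguate_list` (it requires a running geodict/gazetteer backend; the ids in the comment are made up):

```python
# Sketch: map recognised toponym strings to gazetteer ids.
d = MostCommonDisambiguator()
id_to_label = d.disambiguate_list(["Paris", "Montpellier"], lang="fr")
# e.g. {"GD115": "Paris", "GD204": "Montpellier"}
```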
...@@ -31,7 +31,9 @@ class WikipediaDisambiguator(Disambiguator): ...@@ -31,7 +31,9 @@ class WikipediaDisambiguator(Disambiguator):
return new_count, selected_en return new_count, selected_en
def disambiguate_list(self,toponyms,lang):
result=self.disambiguate_wiki(toponyms,lang)
return {k:v for k,v in result.items() if v}
def disambiguate_wiki(self, entities, lang): def disambiguate_wiki(self, entities, lang):
......
@@ -11,7 +11,8 @@ from .nlp.ner.ner import NER
from .nlp.ner.stanford_ner import StanfordNER
from .nlp.pos_tagger.tagger import Tagger
from .nlp.pos_tagger.treetagger import TreeTagger

-import json
+import json, re

class Pipeline(object):
@@ -96,10 +97,16 @@ class Pipeline(object):
        cooc = kwargs.get("cooc", False)
        adj = kwargs.get("adj", True)
        inc = kwargs.get("inc", True)
-        if not se_identified:
+        toponyms = kwargs.get("toponyms", None)
+        stop_words = kwargs.get("stop_words", [])
+        if isinstance(toponyms, list):
+            se_identified = self.disambiguator.disambiguate_list([top for top in toponyms if not top.lower() in stop_words and not len(re.findall("\d+", top)) != 0 and len(top) > 3], self.lang)
+            count, output = {}, text
+            #print(se_identified)
+        elif not se_identified:
            count, output, se_identified = self.parse(text)
        else:
-            count, output, tt = self.parse(text)
+            count, output, _ = self.parse(text)
        str_ = STR(output, se_identified)
        str_.build(adj=adj, inc=inc)
        str_ = self.transform(str_, **kwargs)  # TODO: Add count
...
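The added toponym filter above is a dense one-liner; an equivalent, more readable formulation of the same logic (a sketch, not a change to the pipeline):

```python
import re

def filter_toponyms(toponyms, stop_words):
    # Keep toponyms that are not stop words, contain no digits,
    # and are longer than three characters.
    return [
        top for top in toponyms
        if top.lower() not in stop_words
        and not re.findall(r"\d+", top)
        and len(top) > 3
    ]
```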
import spacy