Commit 828e76f5 authored by Fize Jacques's avatar Fize Jacques

Cleaning + Readme Modif.

parent 5a936e7f
Showing with 677 additions and 385 deletions
# STR

This repository contains all the work on STR, or Spatial Textual Representation. The file
hierarchy is divided into multiple modules:

* **config** contains the configuration file and a dedicated class for loading and
interacting with it
* **gmatch4py** is a module which contains implementations of various graph matching
algorithms
* **gui_grap_viewer** contains a webapp used to visualize graphs and their top-k similar graphs
using specific graph matching algorithms.
* **helpers** is a module which contains various helper methods, e.g. for requesting the geo database
(geodict) or computing collisions between polygons.
* **models** contains the STR structure and its variations.
* **nlp** contains all the implementations or interfaces of NLP methods such as NER, POS tagging,
toponym disambiguation, ...
* **tt4py** is a module dedicated to finding and annotating tokens in a tokenized text.
## Generate STR

To generate a STR, use `generate_str.py`. There are three ways to generate one:

* **Normal**: generate a STR without modifications
* **Generalisation**: generate a STR with a generalisation transformation applied to it
* **Extension**: generate a STR with an extension transformation applied to it

```
usage: generate_str.py [-h] [-n {spacy,polyglot,stanford}]
                       [-d {occwiki,most_common,shareprop}] [-t {gen,ext}]
                       [-o OUTPUT]
                       input_pkl

positional arguments:
  input_pkl             Filename of your input. Must be in Pickle format with the following columns :
                         - filename : original filename that contains the text in `content`
                         - id_doc : id of your document
                         - content : text data associated to the document
                         - lang : language of your document

optional arguments:
  -h, --help            show this help message and exit
  -n {spacy,polyglot,stanford}, --ner {spacy,polyglot,stanford}
                        The Named Entity Recognizer you wish to use
  -d {occwiki,most_common,shareprop}, --disambiguator {occwiki,most_common,shareprop}
                        The Named Entity disambiguator you wish to use
  -t {gen,ext}, --transform {gen,ext}
                        Transformation to apply
  -o OUTPUT, --output OUTPUT
                        Output Filename
```
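For reference, here is a minimal sketch of how such an input pickle could be prepared with pandas. The column names come from the help text above; the file names, document contents and the `input.pkl`/`output.pkl` names are only illustrative:

```python
import pandas as pd

# Build a DataFrame with the columns expected by generate_str.py
df = pd.DataFrame([
    {"filename": "doc_001.txt", "id_doc": 1,
     "content": "Montpellier is a city in the south of France.", "lang": "en"},
    {"filename": "doc_002.txt", "id_doc": 2,
     "content": "La Normandie est une région française.", "lang": "fr"},
])

# Save it in Pickle format, then run for example:
#   python generate_str.py -n spacy -d most_common -o output.pkl input.pkl
df.to_pickle("input.pkl")
```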
### Generalisation

It is possible to generate **generalised** STRs. A **generalised** STR is a STR where
all spatial entities are generalised (Paris --> France) using one of two hypotheses:

* **All**: every spatial entity is generalised *h* times. If *h* = 2, Paris becomes Europe
(Paris --> France --> Europe).
* **Bounded**: every spatial entity is generalised until it reaches a defined spatial
scale. For example, if we set the spatial scale to "country", all spatial entities that are
towns, regions, villages, etc. are generalised until the resulting spatial entities are countries.

A concrete example with Normandy and Montpellier:

 1. Normandy --> France and Montpellier --> Hérault
 2. France stays France and Hérault --> Occitanie
 3. France stays France and Occitanie --> France
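The bounded hypothesis can be pictured as a loop that climbs the spatial hierarchy until the requested scale is reached. Below is a minimal sketch of that idea; `get_parent` and `scale_of` are hypothetical helpers standing in for the geodict lookups, not functions of this repository:

```python
def generalise_bounded(entity, target_scale, get_parent, scale_of):
    """Climb the spatial hierarchy (town -> region -> country -> ...)
    until the entity reaches the requested spatial scale."""
    while scale_of(entity) != target_scale:
        parent = get_parent(entity)
        if parent is None:  # nothing above this entity, stop climbing
            break
        entity = parent
    return entity

# e.g. generalise_bounded("Montpellier", "country", get_parent, scale_of) -> "France"
```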
```
usage: generate_data.py texts_input_dir graphs_output_dir metadata_output_fn generalisation
                        [-h] [-t TYPE_GEN] [-n N] [-b BOUND]

optional arguments:
  -h, --help            show this help message and exit
  -t TYPE_GEN, --type_gen TYPE_GEN
                        Type of generalisation
  -n N                  Language
  -b BOUND, --bound BOUND
                        If Generalisation is bounded, this arg. corresponds to
                        the maximal
```
### Extension

Another way of transforming a STR is to extend part of its spatial entities. The extension
of a STR works this way:

* We select entities that are towns with a low probability of appearance in the corpus.
* Then, we search for their neighbours within a radius *d* around them.
* Finally, we add to the STR the neighbours that fit these conditions:
    * they belong to the same country;
    * they have a probability higher than the median score over all spatial entities in the STR;
    * they are a capital or a town.
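A rough sketch of this selection step is shown below. The helpers `neighbours_within`, `probability`, `same_country` and `entity_class` are hypothetical stand-ins for the geodict queries used by the real script:

```python
from statistics import median

def extend_candidates(str_entities, candidate, radius_km,
                      neighbours_within, probability, same_country, entity_class):
    """Return the neighbours of `candidate` that satisfy the three conditions above."""
    median_score = median(probability(e) for e in str_entities)
    selected = []
    for neighbour in neighbours_within(candidate, radius_km):
        if (same_country(neighbour, candidate)
                and probability(neighbour) > median_score
                and entity_class(neighbour) in ("capital", "town")):
            selected.append(neighbour)
    return selected
```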
```
usage: generate_data.py texts_input_dir graphs_output_dir metadata_output_fn extension
                        [-h] [-d DISTANCE] [-u UNIT] [-a ADJACENT_COUNT]

optional arguments:
  -h, --help            show this help message and exit
  -d DISTANCE, --distance DISTANCE
                        radius distance
  -u UNIT, --unit UNIT  unit used for the radius distance
  -a ADJACENT_COUNT, --adjacent_count ADJACENT_COUNT
                        number of adjacent SE added to the STR
```
# coding: utf-8
import sys, os, re, argparse, warnings, json
import logging

logger = logging.getLogger("elasticsearch")
@@ -24,8 +24,10 @@ from strpython.nlp.disambiguator.wikipedia_cooc import WikipediaDisambiguator as
from strpython.nlp.disambiguator.geodict_gaurav import GauravGeodict as shared_geo_d
from strpython.nlp.disambiguator.most_common import MostCommonDisambiguator as most_common_d

from mytoolbox.text.clean import *
from mytoolbox.exception.inline import safe_execute
from stop_words import get_stop_words

import logging

logger = logging.getLogger("elasticsearch")
@@ -33,6 +35,7 @@ logger.setLevel(logging.ERROR)
logger = logging.getLogger("Fiona")
logger.setLevel(logging.ERROR)

disambiguator_dict = {
    "occwiki": wiki_d,
    "most_common": most_common_d,
@@ -94,13 +97,50 @@ pipelines={
    lang: Pipeline(lang=lang, ner=ner_dict[args.ner](lang=lang), tagger=Tagger(), disambiguator=disambiguator_dict[args.disambiguator]())
    for lang in tqdm(languages, desc="Load Pipelines model")
}
def matcher_agrovoc(lang):
    """
    Return the list of Agrovoc terminology labels for a language.

    Parameters
    ----------
    lang : str
        language of the terms

    Returns
    -------
    list
        lowercased labels of the Agrovoc vocabulary in `lang`
    """
    agrovoc_vocab = pd.read_csv("../thematic_str/data/terminology/agrovoc/agrovoc_cleaned.csv")
    agrovoc_vocab["preferred_label_new"] = agrovoc_vocab["preferred_label_new"].apply(
        lambda x: safe_execute({}, Exception, json.loads, x.replace("\'", "\"")))
    agrovoc_vocab["label_lang"] = agrovoc_vocab["preferred_label_new"].apply(
        lambda x: str(resolv_a(x[lang]) if lang in x else np.nan).strip().lower())
    agrovoc_vocab = agrovoc_vocab[~pd.isna(agrovoc_vocab["label_lang"])]
    return agrovoc_vocab["label_lang"].values.tolist()
stopwords = {
    lang: matcher_agrovoc(lang)
    for lang in tqdm(languages, desc="Load stopwords")
}
for lang in stopwords:
    stopwords[lang].extend(get_stop_words(lang))
print("Clean input content ...") print("Clean input content ...")
df["content"]= df.content.progress_apply(lambda x :clean_text(x)) if not "entities" in df:
df["content"]= df.content.progress_apply(lambda x :clean_text(x))
count_error=0 count_error=0
def build(pipelines,x): def build(pipelines,x):
global count_error global count_error
try:
if "entities" in x:
return pipelines[x.lang].build(x.content,toponyms=x.entities,stop_words=stopwords[x.lang])
except Exception as e:
print(e)
try: try:
return pipelines[x.lang].build(x.content) return pipelines[x.lang].build(x.content)
except Exception as e: except Exception as e:
......
{
"cells": [],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 2
}
Flask_Session==0.3.1
Shapely==1.5.17.post1
matplotlib==2.0.2
termcolor==1.1.0
networkx==2.1
requests==2.18.4
numpy==1.14.0
gensim==1.0.1
elasticsearch==5.2.0
geopandas==0.2.1
SQLAlchemy==1.1.14
pycorenlp==0.3.0
Flask_Login==0.4.0
pandas==0.19.2
scipy==0.19.1
Flask==0.12
ipython==6.2.1
python_bcrypt==0.3.2
extractor==0.5
progressbar2==3.35.0
scikit_bio==0.5.1
scikit_learn==0.19.1
typing==3.6.4
plotly
folium
# coding: utf-8
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("result_csv")
args = parser.parse_args()

data = pd.read_csv(args.result_csv, index_col=0)
data = data[data.mesure != "BP"]

def pareto_frontier_multi(myArray):
    # Sort on first dimension
    myArray = myArray[myArray[:, 0].argsort()]
    # Add first row to pareto_frontier
    pareto_frontier = myArray[0:1, :]
    indices, i = [], 1
    # Test next row against the last row in pareto_frontier
    for row in myArray[1:, :]:
        if sum([row[x] >= pareto_frontier[-1][x]
                for x in range(len(row))]) == len(row):
            # If it is better on all features, add the row to pareto_frontier
            pareto_frontier = np.concatenate((pareto_frontier, [row]))
            indices.append(i)
        i += 1
    return indices, pareto_frontier

def highlight_max(s):
    '''
    highlight the maximum in a Series yellow.
    '''
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]

def highlight_min(s):
    '''
    highlight the minimum in a Series red.
    '''
    is_min = s == s.min()
    return ['background-color: #d64541;color:white;' if v else '' for v in is_min]

def colorize(df, fields):
    return df.style.apply(highlight_max, subset=fields).apply(highlight_min, subset=fields)

to_colorize = "c1 c2 c3 c4".split()

print("Table for {0}".format(args.result_csv))
print("Average Measure Precision")
print(data.groupby("mesure").mean().to_csv())
print("")

index, data_pa = pareto_frontier_multi(data["c1 c2 c3 c4".split()].values)
print("PARETO c1 c2 c3 c4")
print(data.iloc[index].to_csv(index=False))
print("")

index, data_pa = pareto_frontier_multi(data["c1 c4".split()].values)
print("PARETO c1 c4")
print(data.iloc[index].to_csv(index=False))
print("")

index, data_pa = pareto_frontier_multi(data["c2 c3".split()].values)
print("PARETO c2 c3")
print(data.iloc[index].to_csv(index=False))
@@ -15,6 +15,7 @@ def jsonKeys2int(x):
    return x

__cache__crit = {}
if os.path.exists("cache.json"):
    try:
        __cache__crit = json.load(open("cache.json"))
...
import json
import os
import warnings
import psycopg2

from shapely.geometry import Point

from ..config.configuration import config
@@ -10,7 +10,8 @@ import geopandas as gpd

__cache = {}
__cache_adjacency = {}
__limit_cache = 10000
__cache_frequency = {}
def add_cache(id_, hull):
@@ -80,12 +81,10 @@ def getGEO(id_se):
        data = data[0]
        if "path" in data:
            return explode(gpd.read_file(os.path.join(config.osm_boundaries_directory, data.other["path"]))).convex_hull
        elif "coord" in data:
            return gpd.GeoDataFrame(gpd.GeoSeries([Point(data.coord.lon, data.coord.lat).buffer(1.0)])).rename(
                columns={0: 'geometry'})
    return None
@@ -107,29 +106,6 @@ def getGEO2(id_se):
        return "C", gpd.GeoDataFrame(gpd.GeoSeries([Point(data.coord.lon, data.coord.lat).buffer(1.0)])).rename(
            columns={0: 'geometry'})
    return None
def is_collision_psql_poly(id_1, id_2):
    conn = psycopg2.connect("dbname='postgis_geodict' host='localhost'")
    cur = conn.cursor()
    cur.execute("""select a.id,b.id, st_intersects(st_convexhull(a.geom),st_convexhull(b.geom))
    from boundary as a, boundary as b
    where a.id = {id1} and b.id = {id2}; """.format(id1=id_1, id2=id_2))
    listpoly = cur.fetchall()
    if not listpoly:
        warnings.warn("No results found in DATABASE")
    return listpoly[0][-1]

def is_collision_psql_poly_and_point(poly_id, data_point):
    conn = psycopg2.connect("dbname='postgis_geodict' host='localhost'")
    cur = conn.cursor()
    cur.execute("""SELECT b.id,
    st_within(st_buffer(ST_GeomFromText('POINT({lon} {lat})',4326),1),st_setsrid(b.geom,4326)) FROM boundary as b
    WHERE id = {poly_id};""".format(lon=data_point[0], lat=data_point[1], poly_id=poly_id))
    listpoly = cur.fetchall()
    if not listpoly:
        warnings.warn("No results found in DATABASE")
    return listpoly[0][-1]
def collide(se1, se2):
    """
@@ -138,38 +114,30 @@ def collide(se1, se2):
    :param se2: id of the second spatial entity
    :return:
    """
    global __cache_frequency
    try:
        if se1 in __cache:
            data_se1 = __cache[se1]
            __cache_frequency[se1] += 1
        else:
            data_se1 = getGEO(se1)
            add_cache(se1, data_se1)
        if se2 in __cache:
            data_se2 = __cache[se2]
            __cache_frequency[se2] += 1
        else:
            data_se2 = getGEO(se2)
            add_cache(se2, data_se2)
    except Exception as e:
        warnings.warn(e)
        return False
    if not type(data_se1) == gpd.GeoDataFrame or not type(data_se2) == gpd.GeoDataFrame:
        return False
    try:
        if data_se1.intersects(data_se2):
            return True
    except:
        if data_se1.intersects(data_se2).any():
            return True
    return False
@@ -195,4 +163,4 @@ def collisionTwoSEBoundaries(id_se1, id_se2):
            __cache_adjacency[id_se1][id_se2] = True
            return True
    __cache_adjacency[id_se1][id_se2] = False
    return False
@@ -23,6 +23,7 @@ def get_most_common_id_v3(label, lang='fr'):
    :param lang:
    :return:
    """
    label = label.strip()
    id_, score = None, -1
    data = gazetteer.get_by_label(label, lang)
    if data:
@@ -31,11 +32,11 @@ def get_most_common_id_v3(label, lang='fr'):
        if data2 and data2[0].score > data[0].score:
            data2 = data2[0]
            id_, score = data2.id, data2.score
        # simi = gazetteer.get_n_label_similar(label, lang, n=5)
        # if simi:
        #     id_3, score3 = simi[0].id, simi[0].score
        #     if id_3 and score3 > score:
        #         id_, score = id_3, score3
        return gazetteer.get_by_id(id_)[0]
@@ -44,13 +45,13 @@ def get_most_common_id_v3(label, lang='fr'):
    if data:
        return data[0]  # data[0].id, data[0].score
    # similar_label = gazetteer.get_n_label_similar(label, lang, n=5)
    # if similar_label:
    #     return similar_label[0]  # similar_label[0].id, similar_label[0].score
    # similar_alias = gazetteer.get_n_alias_similar(label, lang, n=5)
    # if similar_alias:
    #     return similar_alias[0]  # similar_alias[0].id, similar_alias[0].score
    return None
...
@@ -57,3 +57,6 @@ class Disambiguator(object):
    def disambiguate(self, ner_result):
        pass

    def disambiguate_list(self, toponyms, lang):
        pass
@@ -42,6 +42,14 @@ class MostCommonDisambiguator(Disambiguator):
        return new_count, selected_en

    def disambiguate_list(self, toponyms, lang):
        result = {}
        for toponym in toponyms:
            id_, _ = self.disambiguate_(toponym, lang)
            if id_:
                result[id_] = toponym
        return result

    def disambiguate_(self, label, lang='fr'):
        if re.match("^\d+$", label):
            return 'O', -1
...
@@ -31,7 +31,9 @@ class WikipediaDisambiguator(Disambiguator):
        return new_count, selected_en

    def disambiguate_list(self, toponyms, lang):
        result = self.disambiguate_wiki(toponyms, lang)
        return {k: v for k, v in result.items() if v}

    def disambiguate_wiki(self, entities, lang):
...
@@ -11,7 +11,8 @@ from .nlp.ner.ner import NER
from .nlp.ner.stanford_ner import StanfordNER
from .nlp.pos_tagger.tagger import Tagger
from .nlp.pos_tagger.treetagger import TreeTagger
import json, re

class Pipeline(object):
@@ -96,10 +97,16 @@ class Pipeline(object):
        cooc = kwargs.get("cooc", False)
        adj = kwargs.get("adj", True)
        inc = kwargs.get("inc", True)
        toponyms = kwargs.get("toponyms", None)
        stop_words = kwargs.get("stop_words", [])
        if isinstance(toponyms, list):
            se_identified = self.disambiguator.disambiguate_list(
                [top for top in toponyms if not top.lower() in stop_words and not len(re.findall("\d+", top)) != 0 and len(top) > 3],
                self.lang)
            count, output = {}, text
            #print(se_identified)
        elif not se_identified:
            count, output, se_identified = self.parse(text)
        else:
            count, output, _ = self.parse(text)
        str_ = STR(output, se_identified)
        str_.build(adj=adj, inc=inc)
        str_ = self.transform(str_, **kwargs)  # TODO : Add count
...
import spacy