Commit 828e76f5 authored by Fize Jacques

Cleaning + Readme Modif.

parent 5a936e7f

Showing with 677 additions and 385 deletions
# STR
This repository contains all the work on STR, or Spatial Textual Representation. The file
hierarchy is divided into multiple modules:
* **config** contains the configuration file and a dedicated class for loading and interacting with it
* **gmatch4py** is a module containing implementations of various graph matching algorithms
* **gui_grap_viewer** contains a webapp used to visualize graphs and their top-k similar graphs according to specific graph matching algorithms
* **helpers** is a module containing various helper methods, e.g. for querying the geo database (geodict) or computing collisions between polygons
* **models** contains the STR structure and its variations
* **nlp** contains the implementations or interfaces of NLP methods such as NER, POS tagging, toponym disambiguation, ...
* **tt4py** is a module dedicated to finding and annotating tokens in a tokenized text
## Generate STR
To generate STRs, use `generate_str.py`:
```
usage: generate_str.py [-h] [-n {spacy,polyglot,stanford}]
                       [-d {occwiki,most_common,shareprop}] [-t {gen,ext}]
                       [-o OUTPUT]
                       input_pkl

positional arguments:
  input_pkl             Filename of your input. Must be in Pickle format with
                        the following columns :
                          - filename : original filename that contains the text in `content`
                          - id_doc : id of your document
                          - content : text data associated to the document
                          - lang : language of your document

optional arguments:
  -h, --help            show this help message and exit
```
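As an illustration, an input pickle with these columns could be produced with pandas (a minimal sketch; the file names and texts are hypothetical):
```
import pandas as pd

# Hypothetical two-document corpus with the four required columns.
df = pd.DataFrame([
    {"filename": "doc_0.txt", "id_doc": 0,
     "content": "Montpellier is a city in the south of France.", "lang": "en"},
    {"filename": "doc_1.txt", "id_doc": 1,
     "content": "La Normandie est une région française.", "lang": "fr"},
])

# generate_str.py reads this DataFrame back from a pickle file.
df.to_pickle("corpus.pkl")
```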
There are three ways of generating an STR:
* **Normal**: the STR is generated without any modification
* **Generalisation**: the STR is generated, then a generalisation transformation is applied to it
* **Extension**: the STR is generated, then an extension transformation is applied to it
### Generalisation
It is possible to generate a **generalised** STR, i.e. an STR in which every spatial entity is generalised (Paris --> France) using one of two hypotheses (both are sketched in the code example below):
* **All**: every spatial entity is generalised *h* times. If *h* = 2, Paris becomes Europe (Paris --> France --> Europe).
* **Bounded**: every spatial entity is generalised until it reaches a defined spatial scale. For example, if the spatial scale is set to "country", every spatial entity that is a town, region, village, etc. is generalised until the resulting entity is a country.

As a concrete example of the bounded strategy, with Normandy and Montpellier we would have:
1. Normandy --> France and Montpellier --> Hérault
2. France stays France and Hérault --> Occitanie
3. France stays France and Occitanie --> France
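A minimal sketch of the two hypotheses, assuming hypothetical `get_inclusion_parent` and `spatial_scale` helpers backed by the gazetteer (they are not part of the documented API):
```
def generalise_all(entity, h, get_inclusion_parent):
    # "All" hypothesis: climb the inclusion hierarchy h times
    # (e.g. Paris -> France -> Europe for h = 2).
    for _ in range(h):
        parent = get_inclusion_parent(entity)
        if parent is None:  # nothing more general is known
            break
        entity = parent
    return entity


def generalise_bounded(entity, target_scale, get_inclusion_parent, spatial_scale):
    # "Bounded" hypothesis: climb until the entity reaches the requested
    # spatial scale (e.g. "country").
    while spatial_scale(entity) != target_scale:
        parent = get_inclusion_parent(entity)
        if parent is None:
            break
        entity = parent
    return entity
```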
```
usage: generate_data.py texts_input_dir graphs_output_dir metadata_output_fn generalisation
                        [-h] [-t TYPE_GEN] [-n N] [-b BOUND]

optional arguments:
  -h, --help            show this help message and exit
  -t TYPE_GEN, --type_gen TYPE_GEN
                        Type of generalisation
  -n N                  Language
  -b BOUND, --bound BOUND
                        If Generalisation is bounded, this arg. corresponds to
                        the maximal
```
### Extension
Another way of transforming an STR is to extend part of its spatial entities. The extension works as follows (a sketch of the selection step is given after this list):
* We select the entities that are towns with a low probability of appearance in the corpus.
* Then, we search for their neighbours within a radius *d* around them.
* Finally, we add to the STR the neighbours that satisfy all of the following conditions:
  * they belong to the same country;
  * they have a probability higher than the median score over all spatial entities in the STR;
  * they are a capital or a town.
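A hedged sketch of this selection step, with hypothetical helpers (`neighbours_within`, `probability`, `entity_class`, `same_country`) standing in for the gazetteer calls actually used by the script:
```
from statistics import median

def extend_str(str_entities, radius, neighbours_within, probability,
               entity_class, same_country):
    # Median appearance score over all spatial entities already in the STR.
    median_score = median(probability(se) for se in str_entities)
    # Seed entities: towns with a low probability of appearance in the corpus
    # (here approximated as "below the median").
    seeds = [se for se in str_entities
             if entity_class(se) == "town" and probability(se) < median_score]
    selected = []
    for seed in seeds:
        for neighbour in neighbours_within(seed, radius):
            if (same_country(seed, neighbour)
                    and probability(neighbour) > median_score
                    and entity_class(neighbour) in ("town", "capital")):
                selected.append(neighbour)
    return selected
```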
```
usage: generate_data.py texts_input_dir graphs_output_dir metadata_output_fn extension
                        [-h] [-d DISTANCE] [-u UNIT] [-a ADJACENT_COUNT]

optional arguments:
  -h, --help            show this help message and exit
  -d DISTANCE, --distance DISTANCE
                        radius distance
  -u UNIT, --unit UNIT  unit used for the radius distance
  -a ADJACENT_COUNT, --adjacent_count ADJACENT_COUNT
                        number of adjacent SE added to the STR
```
The remaining optional arguments of `generate_str.py` select the NLP components, the transformation and the output file:
```
  -n {spacy,polyglot,stanford}, --ner {spacy,polyglot,stanford}
                        The Named Entity Recognizer you wish to use
  -d {occwiki,most_common,shareprop}, --disambiguator {occwiki,most_common,shareprop}
                        The Named Entity disambiguator you wish to use
  -t {gen,ext}, --transform {gen,ext}
                        Transformation to apply
  -o OUTPUT, --output OUTPUT
                        Output Filename
```
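Putting it together, a run over a prepared corpus could look like this (a sketch; `corpus.pkl` and `str_output.pkl` are hypothetical file names):
```
import subprocess

# Generate STRs with the spaCy NER, the most-common-sense disambiguator
# and a generalisation transformation, using the options documented above.
subprocess.run(
    ["python", "generate_str.py", "corpus.pkl",
     "-n", "spacy", "-d", "most_common", "-t", "gen",
     "-o", "str_output.pkl"],
    check=True,
)
```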
# coding: utf-8
import sys, os, re, argparse, warnings, json
import logging
logger = logging.getLogger("elasticsearch")
......@@ -24,8 +24,10 @@ from strpython.nlp.disambiguator.wikipedia_cooc import WikipediaDisambiguator as
from strpython.nlp.disambiguator.geodict_gaurav import GauravGeodict as shared_geo_d
from strpython.nlp.disambiguator.most_common import MostCommonDisambiguator as most_common_d
from mytoolbox.text.clean import *
from mytoolbox.exception.inline import safe_execute
from stop_words import get_stop_words
import logging
logger = logging.getLogger("elasticsearch")
......@@ -33,6 +35,7 @@ logger.setLevel(logging.ERROR)
logger = logging.getLogger("Fiona")
logger.setLevel(logging.ERROR)
disambiguator_dict = {
    "occwiki": wiki_d,
    "most_common": most_common_d,
......@@ -94,13 +97,50 @@ pipelines={
    lang: Pipeline(lang=lang, ner=ner_dict[args.ner](lang=lang), tagger=Tagger(), disambiguator=disambiguator_dict[args.disambiguator]())
    for lang in tqdm(languages, desc="Load Pipelines model")
}
def matcher_agrovoc(lang):
    """
    Return a terminology matcher using the Agrovoc vocabulary.

    Parameters
    ----------
    lang : str
        language of the terms

    Returns
    -------
    list of str
        Agrovoc labels in the requested language
    """
    agrovoc_vocab = pd.read_csv("../thematic_str/data/terminology/agrovoc/agrovoc_cleaned.csv")
    agrovoc_vocab["preferred_label_new"] = agrovoc_vocab["preferred_label_new"].apply(
        lambda x: safe_execute({}, Exception, json.loads, x.replace("\'", "\"")))
    agrovoc_vocab["label_lang"] = agrovoc_vocab["preferred_label_new"].apply(
        lambda x: str(resolv_a(x[lang]) if lang in x else np.nan).strip().lower())
    agrovoc_vocab = agrovoc_vocab[~pd.isna(agrovoc_vocab["label_lang"])]
    return agrovoc_vocab["label_lang"].values.tolist()
stopwords = {
    lang: matcher_agrovoc(lang)
    for lang in tqdm(languages, desc="Load stopwords")
}
for lang in stopwords:
    stopwords[lang].extend(get_stop_words(lang))

print("Clean input content ...")
if not "entities" in df:
    df["content"] = df.content.progress_apply(lambda x: clean_text(x))
count_error = 0

def build(pipelines, x):
    global count_error
    try:
        if "entities" in x:
            return pipelines[x.lang].build(x.content, toponyms=x.entities, stop_words=stopwords[x.lang])
    except Exception as e:
        print(e)
    try:
        return pipelines[x.lang].build(x.content)
    except Exception as e:
......
{
"cells": [],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 2
}
This diff is collapsed.
Flask_Session==0.3.1
Shapely==1.5.17.post1
matplotlib==2.0.2
termcolor==1.1.0
networkx==2.1
requests==2.18.4
numpy==1.14.0
gensim==1.0.1
elasticsearch==5.2.0
geopandas==0.2.1
SQLAlchemy==1.1.14
pycorenlp==0.3.0
Flask_Login==0.4.0
pandas==0.19.2
scipy==0.19.1
Flask==0.12
ipython==6.2.1
python_bcrypt==0.3.2
extractor==0.5
progressbar2==3.35.0
scikit_bio==0.5.1
scikit_learn==0.19.1
typing==3.6.4
plotly
folium
# coding: utf-8
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import argparse
parser=argparse.ArgumentParser()
parser.add_argument("result_csv")
args=parser.parse_args()
data=pd.read_csv(args.result_csv,index_col=0)
data=data[data.mesure != "BP"]
def pareto_frontier_multi(myArray):
    # Sort on first dimension
    myArray = myArray[myArray[:, 0].argsort()]
    # Add first row to pareto_frontier
    pareto_frontier = myArray[0:1, :]
    indices, i = [], 1
    # Test each next row against the last row in pareto_frontier
    for row in myArray[1:, :]:
        if sum([row[x] >= pareto_frontier[-1][x]
                for x in range(len(row))]) == len(row):
            # If it is better on all features, add the row to pareto_frontier
            pareto_frontier = np.concatenate((pareto_frontier, [row]))
            indices.append(i)
        i += 1
    return indices, pareto_frontier
def highlight_max(s):
    '''
    highlight the maximum in a Series yellow.
    '''
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]

def highlight_min(s):
    '''
    highlight the minimum in a Series red.
    '''
    is_min = s == s.min()
    return ['background-color: #d64541;color:white;' if v else '' for v in is_min]

def colorize(df, fields):
    return df.style.apply(highlight_max, subset=fields).apply(highlight_min, subset=fields)
to_colorize="c1 c2 c3 c4".split()
print("Table for {0}".format(args.result_csv))
print("Average Measure Precision")
print(data.groupby("mesure").mean().to_csv())
print("")
index,data_pa=pareto_frontier_multi(data["c1 c2 c3 c4".split()].values)
print("PARETO c1 c2 c3 c4")
print(data.iloc[index].to_csv(index=False))
print("")
index,data_pa=pareto_frontier_multi(data["c1 c4".split()].values)
print("PARETO c1 c4")
print(data.iloc[index].to_csv(index=False))
print("")
index,data_pa=pareto_frontier_multi(data["c2 c3".split()].values)
print("PARETO c2 c3")
print(data.iloc[index].to_csv(index=False))
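# Illustration (not part of the original script): a tiny, hypothetical check of
# pareto_frontier_multi, which keeps a row only when it is >= the previously
# kept row on every column of the array sorted on its first column.
import numpy as np
toy = np.array([[0.1, 0.2],
                [0.3, 0.1],
                [0.4, 0.5],
                [0.9, 0.7]])
idx, front = pareto_frontier_multi(toy)
print(idx)    # [2, 3] : positions (in the sorted array) of the rows kept after the first one
print(front)  # rows [0.1 0.2], [0.4 0.5], [0.9 0.7]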
......@@ -15,6 +15,7 @@ def jsonKeys2int(x):
    return x

__cache__crit = {}
if os.path.exists("cache.json"):
    try:
        __cache__crit = json.load(open("cache.json"))
......
import json
import os
import warnings
import psycopg2
from shapely.geometry import Point
from ..config.configuration import config
......@@ -10,7 +10,8 @@ import geopandas as gpd
__cache = {}
__cache_adjacency = {}
__limit_cache = 10000
__cache_frequency = {}
def add_cache(id_, hull):
......@@ -80,12 +81,10 @@ def getGEO(id_se):
    data = data[0]
    if "path" in data:
        return explode(gpd.read_file(os.path.join(config.osm_boundaries_directory, data.other["path"]))).convex_hull
    elif "coord" in data:
        return gpd.GeoDataFrame(gpd.GeoSeries([Point(data.coord.lon, data.coord.lat).buffer(1.0)])).rename(
            columns={0: 'geometry'})
    return None
......@@ -107,29 +106,6 @@ def getGEO2(id_se):
return "C",gpd.GeoDataFrame(gpd.GeoSeries([Point(data.coord.lon, data.coord.lat).buffer(1.0)])).rename(
columns={0: 'geometry'})
return None
def is_collision_psql_poly(id_1, id_2):
    conn = psycopg2.connect("dbname='postgis_geodict' host='localhost'")
    cur = conn.cursor()
    cur.execute("""select a.id, b.id, st_intersects(st_convexhull(a.geom), st_convexhull(b.geom))
        from boundary as a, boundary as b
        where a.id = {id1} and b.id = {id2};""".format(id1=id_1, id2=id_2))
    listpoly = cur.fetchall()
    if not listpoly:
        warnings.warn("No results found in DATABASE")
    return listpoly[0][-1]

def is_collision_psql_poly_and_point(poly_id, data_point):
    conn = psycopg2.connect("dbname='postgis_geodict' host='localhost'")
    cur = conn.cursor()
    cur.execute("""SELECT b.id,
        st_within(st_buffer(ST_GeomFromText('POINT({lon} {lat})', 4326), 1), st_setsrid(b.geom, 4326))
        FROM boundary as b
        WHERE id = {poly_id};""".format(lon=data_point[0], lat=data_point[1], poly_id=poly_id))
    listpoly = cur.fetchall()
    if not listpoly:
        warnings.warn("No results found in DATABASE")
    return listpoly[0][-1]
def collide(se1, se2):
"""
......@@ -138,38 +114,30 @@ def collide(se1, se2):
    :param se2: id of the second spatial entity
    :return:
    """
    global __cache_frequency
    try:
        if se1 in __cache:
            data_se1 = __cache[se1]
            __cache_frequency[se1] += 1
        else:
            data_se1 = getGEO(se1)
            add_cache(se1, data_se1)
        if se2 in __cache:
            data_se2 = __cache[se2]
            __cache_frequency[se2] += 1
        else:
            data_se2 = getGEO(se2)
            add_cache(se2, data_se2)
    except Exception as e:
        warnings.warn(e)
        return False

    if not type(data_se1) == gpd.GeoDataFrame or not type(data_se2) == gpd.GeoDataFrame:
        return False

    try:
        if data_se1.intersects(data_se2):
            return True
    except:
        if data_se1.intersects(data_se2).any():
            return True
    return False
......@@ -195,4 +163,4 @@ def collisionTwoSEBoundaries(id_se1, id_se2):
        __cache_adjacency[id_se1][id_se2] = True
        return True
    __cache_adjacency[id_se1][id_se2] = False
    return False
......@@ -23,6 +23,7 @@ def get_most_common_id_v3(label, lang='fr'):
    :param lang:
    :return:
    """
    label = label.strip()
    id_, score = None, -1
    data = gazetteer.get_by_label(label, lang)
    if data:
......@@ -31,11 +32,11 @@ def get_most_common_id_v3(label, lang='fr'):
        if data2 and data2[0].score > data[0].score:
            data2 = data2[0]
            id_, score = data2.id, data2.score
    # simi = gazetteer.get_n_label_similar(label, lang, n=5)
    # if simi:
    #     id_3, score3 = simi[0].id, simi[0].score
    #     if id_3 and score3 > score:
    #         id_, score = id_3, score3
    return gazetteer.get_by_id(id_)[0]
......@@ -44,13 +45,13 @@ def get_most_common_id_v3(label, lang='fr'):
    if data:
        return data[0]  # data[0].id, data[0].score

    # similar_label = gazetteer.get_n_label_similar(label, lang, n=5)
    # if similar_label:
    #     return similar_label[0]  # similar_label[0].id, similar_label[0].score

    # similar_alias = gazetteer.get_n_alias_similar(label, lang, n=5)
    # if similar_alias:
    #     return similar_alias[0]  # similar_alias[0].id, similar_alias[0].score

    return None
......
This diff is collapsed.
......@@ -57,3 +57,6 @@ class Disambiguator(object):
    def disambiguate(self, ner_result):
        pass

    def disambiguate_list(self, toponyms, lang):
        pass
......@@ -42,6 +42,14 @@ class MostCommonDisambiguator(Disambiguator):
        return new_count, selected_en

    def disambiguate_list(self, toponyms, lang):
        result = {}
        for toponym in toponyms:
            id_, _ = self.disambiguate_(toponym, lang)
            if id_:
                result[id_] = toponym
        return result

    def disambiguate_(self, label, lang='fr'):
        if re.match("^\d+$", label):
            return 'O', -1
......
......@@ -31,7 +31,9 @@ class WikipediaDisambiguator(Disambiguator):
        return new_count, selected_en

    def disambiguate_list(self, toponyms, lang):
        result = self.disambiguate_wiki(toponyms, lang)
        return {k: v for k, v in result.items() if v}

    def disambiguate_wiki(self, entities, lang):
......
......@@ -11,7 +11,8 @@ from .nlp.ner.ner import NER
from .nlp.ner.stanford_ner import StanfordNER
from .nlp.pos_tagger.tagger import Tagger
from .nlp.pos_tagger.treetagger import TreeTagger
import json, re
class Pipeline(object):
......@@ -96,10 +97,16 @@ class Pipeline(object):
        cooc = kwargs.get("cooc", False)
        adj = kwargs.get("adj", True)
        inc = kwargs.get("inc", True)
        toponyms = kwargs.get("toponyms", None)
        stop_words = kwargs.get("stop_words", [])
        if isinstance(toponyms, list):
            se_identified = self.disambiguator.disambiguate_list(
                [top for top in toponyms
                 if not top.lower() in stop_words
                 and not len(re.findall("\d+", top)) != 0
                 and len(top) > 3],
                self.lang)
            count, output = {}, text
            #print(se_identified)
        elif not se_identified:
            count, output, se_identified = self.parse(text)
        else:
            count, output, _ = self.parse(text)

        str_ = STR(output, se_identified)
        str_.build(adj=adj, inc=inc)
        str_ = self.transform(str_, **kwargs)  # TODO : Add count
......
import spacy