Fize Jacques authored

STR

This repository contains all the work on STR (Spatial Textual Representation). The file hierarchy is divided into multiple modules:

  • config contains the configuration file and a dedicated class for loading and interacting with it.
  • gmatch4py contains implementations of various graph matching algorithms.
  • gui_grap_viewer contains a web app used to visualize a graph and its top-k most similar graphs according to a chosen graph matching algorithm.
  • helpers contains various helper methods, for example for querying the geographic database (geodict) or detecting collisions between polygons.
  • models contains the STR structure and its variations.
  • nlp contains the implementations of, or interfaces to, NLP methods such as NER, POS tagging, and toponym disambiguation.
  • tt4py is dedicated to finding and annotating tokens in a tokenized text.

Generate STR

To generate STRs, run the generate_data.py script.

usage: generate_data.py [-h]
                        texts_input_dir graphs_output_dir metadata_output_fn
                        {normal,generalisation,extension} ...

positional arguments:
  texts_input_dir
  graphs_output_dir
  metadata_output_fn
  {normal,generalisation,extension}
                        commands
    normal              Basic STR generation. No argument are necessary !
    generalisation      Apply a generalisation transformation on the generated
                        STRs
    extension           Apply a extension transformation on the generated STRs

optional arguments:
  -h, --help            show this help message and exit
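
The help text above has the shape of output produced by Python's argparse module with sub-commands. As a rough sketch only (this is not the actual content of generate_data.py), a parser exposing the same arguments could be declared as follows. The argument names and help strings are copied from the usage text in this README; the missing types and defaults, and the example values "bounded" and "country", are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Sketch of a parser matching the usage text (not the real script)."""
    parser = argparse.ArgumentParser(prog="generate_data.py")
    parser.add_argument("texts_input_dir")
    parser.add_argument("graphs_output_dir")
    parser.add_argument("metadata_output_fn")

    # One sub-command per generation mode.
    subparsers = parser.add_subparsers(dest="command", help="commands", required=True)
    subparsers.add_parser("normal", help="Basic STR generation")

    gen = subparsers.add_parser(
        "generalisation",
        help="Apply a generalisation transformation on the generated STRs",
    )
    gen.add_argument("-t", "--type_gen", help="Type of generalisation")
    gen.add_argument("-n", help="Language")  # help string as printed by the script
    gen.add_argument("-b", "--bound")

    ext = subparsers.add_parser(
        "extension",
        help="Apply a extension transformation on the generated STRs",
    )
    ext.add_argument("-d", "--distance", help="radius distance")
    ext.add_argument("-u", "--unit", help="unit used for the radius distance")
    ext.add_argument("-a", "--adjacent_count")
    return parser


# Hypothetical invocation: the accepted values of -t and -b are assumptions.
args = build_parser().parse_args(
    ["texts/", "graphs/", "metadata.json", "generalisation", "-t", "bounded", "-b", "country"]
)
print(args.command, args.type_gen, args.bound)
```

The positional arguments come first and the sub-command name follows them, which is why the mode-specific options (-t, -b, -d, ...) must be given after the sub-command on the command line.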

There are three ways to generate an STR:

  • Normal: generates an STR without modifications.
  • Generalisation: generates an STR with a generalisation transformation applied to it.
  • Extension: generates an STR with an extension transformation applied to it.

Generalisation

It is also possible to generate a generalised STR. A generalised STR is an STR in which all entities are generalised (Paris --> France) using one of two hypotheses:

  • All: all spatial entities are generalised h times. If h = 2, Paris becomes Europe (Paris --> France --> Europe).
  • Bounded: all spatial entities are generalised until they reach a defined spatial scale. For example, if the spatial scale is set to "country", all spatial entities that are towns, regions, villages, etc. are generalised until the resulting spatial entities are countries. As a concrete example, with Normandy and Montpellier we would have:
    1. Normandy --> France and Montpellier --> Hérault
    2. France stays France and Hérault --> Occitanie
    3. France stays France and Occitanie --> France

usage: generate_data.py texts_input_dir graphs_output_dir metadata_output_fn generalisation
       [-h] [-t TYPE_GEN] [-n N] [-b BOUND]

optional arguments:
  -h, --help            show this help message and exit
  -t TYPE_GEN, --type_gen TYPE_GEN
                        Type of generalisation
  -n N                  Language
  -b BOUND, --bound BOUND
                        If Generalisation is bounded, this arg. correspondto
                        the maximal
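
The two generalisation hypotheses described in this section can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the code used in this repository: the PARENT and SCALE tables, the ordering of scales, the "department" scale, and the function names are all hypothetical, and a real run would query the geographic database instead.

```python
# Hypothetical toy hierarchy built from the examples in this README.
PARENT = {
    "Paris": "France",
    "Montpellier": "Hérault",
    "Hérault": "Occitanie",
    "Occitanie": "France",
    "Normandy": "France",
    "France": "Europe",
}

# Assumed spatial scale of each entity, and an assumed ordering of scales.
SCALE = {
    "Paris": "town",
    "Montpellier": "town",
    "Hérault": "department",
    "Occitanie": "region",
    "Normandy": "region",
    "France": "country",
    "Europe": "continent",
}
SCALE_ORDER = ["town", "department", "region", "country", "continent"]


def generalise_all(entity, h):
    """'All' hypothesis: replace the entity by its parent, h times."""
    for _ in range(h):
        entity = PARENT.get(entity, entity)  # top-level entities stay as they are
    return entity


def generalise_bounded(entity, bound):
    """'Bounded' hypothesis: climb until the entity reaches the bound scale."""
    target = SCALE_ORDER.index(bound)
    while SCALE_ORDER.index(SCALE[entity]) < target and entity in PARENT:
        entity = PARENT[entity]
    return entity


print(generalise_all("Paris", 2))                      # Paris --> France --> Europe
print(generalise_bounded("Montpellier", "country"))    # three steps up to France
print(generalise_bounded("Normandy", "country"))       # one step up to France
```

Entities that are already at or above the bound (France with a "country" bound, for instance) are left unchanged, which matches the "France stays France" steps in the example above.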

Extension

Another way of transforming an STR is to extend some of its spatial entities. The extension of an STR works as follows:

  • We select the entities that are towns with a low probability of appearance in the corpus.
  • Then, for each selected entity, we search for its neighbors within a radius (set with the -d option) around it.
  • Finally, we add to the STR the neighbors that meet these conditions:
    • They belong to the same country.
    • Their probability is higher than the median score over all spatial entities in the STR.
    • They are a capital or a town.

usage: generate_data.py texts_input_dir graphs_output_dir metadata_output_fn extension
       [-h] [-d DISTANCE] [-u UNIT] [-a ADJACENT_COUNT]

optional arguments:
  -h, --help            show this help message and exit
  -d DISTANCE, --distance DISTANCE
                        radius distance
  -u UNIT, --unit UNIT  unit used for the radius distance
  -a ADJACENT_COUNT, --adjacent_count ADJACENT_COUNT
                        number of adjacent SE add to the STR
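
The extension procedure described in this section can be sketched as follows. This is an illustrative sketch, not the repository's implementation: the gazetteer layout, the scores, the low_threshold parameter, the use of kilometres with a haversine distance, and the choice to apply adjacent_count per selected entity are all assumptions, since this README does not specify them.

```python
import math
from statistics import median


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def extend(str_entities, gazetteer, distance_km, adjacent_count, low_threshold):
    """Return the STR entities plus the neighbors that meet the three conditions."""
    med = median(gazetteer[e]["score"] for e in str_entities)
    added = []
    for name in str_entities:
        ent = gazetteer[name]
        # Step 1: only towns with a low probability of appearance are extended.
        if ent["type"] != "town" or ent["score"] >= low_threshold:
            continue
        # Steps 2 and 3: neighbors inside the radius that pass the filters.
        near = sorted(
            (haversine_km(ent["coord"], other["coord"]), cand)
            for cand, other in gazetteer.items()
            if cand not in str_entities and cand not in added
            and other["country"] == ent["country"]
            and other["type"] in ("town", "capital")
            and other["score"] > med
            and haversine_km(ent["coord"], other["coord"]) <= distance_km
        )
        added.extend(cand for _, cand in near[:adjacent_count])
    return list(str_entities) + added


# Hypothetical gazetteer: coordinates are approximate, scores are made up.
GAZETTEER = {
    "Lunel": {"coord": (43.675, 4.135), "type": "town", "country": "FR", "score": 0.01},
    "Paris": {"coord": (48.857, 2.352), "type": "capital", "country": "FR", "score": 0.90},
    "Montpellier": {"coord": (43.611, 3.877), "type": "town", "country": "FR", "score": 0.60},
    "Nîmes": {"coord": (43.837, 4.360), "type": "town", "country": "FR", "score": 0.50},
    "Mauguio": {"coord": (43.617, 4.008), "type": "town", "country": "FR", "score": 0.02},
    "Barcelona": {"coord": (41.385, 2.173), "type": "town", "country": "ES", "score": 0.80},
}

extended = extend(["Lunel", "Paris"], GAZETTEER, distance_km=50, adjacent_count=2, low_threshold=0.05)
print(extended)
```

In this toy data, Lunel is the only low-probability town, so only its neighborhood is searched. Mauguio is within the radius but its score is below the median, and Barcelona is in another country, so neither is added.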