Commit e99f23e6 authored by Decoupes Remy's avatar Decoupes Remy

building readme.md

# Populate Data Lake
Populating is done with two scripts, one written in Python and one in R. Both languages are needed because the functionality offered by their libraries is of unequal quality: Python offers an excellent library for interacting with HDFS, while R has interesting modules for managing ISO 19115 metadata. To reduce the complexity created by using these two languages together, the R script has been encapsulated inside the Python script, so the administrator only needs to run the Python script.
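The encapsulation boils down to calling the R script as a subprocess. A minimal sketch of that pattern follows; the `/usr/bin/Rscript` path and the `R -f` fallback are the ones used in `src/main.py`, while the helper function names are ours:

```python
import subprocess

def r_command(script):
    """Primary invocation, as used by src/main.py."""
    return ["/usr/bin/Rscript", script]

def r_fallback_command(script):
    """Fallback invocation (shell string), as used by src/main.py."""
    return "R -f " + script

def run_r_script(script="addServicesToGN.R"):
    """Run the R script from Python, retrying with `R -f` on failure,
    mirroring the try/except in src/main.py (the first failure is
    typically caused by an OSM lookup inside the R script)."""
    try:
        return subprocess.call(r_command(script))
    except OSError:
        return subprocess.call(r_fallback_command(script), shell=True)
```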
## Python: collect data and insert it into the Data Lake's data zone (HDFS cluster)
### Prerequisites
Has to be run with Python 3.6, with the requirements listed in python-requirements.txt.
If you are using pip, you can install all requirements with this command:
```shell
pip3 install -r requirements.txt
```
### Files
* Main script: **src/main.py**
* This script reads information from a CSV file: **input/datasources.csv**
* It produces two kinds of output:
  + **/output/data/**: the downloaded files
  + **/output/meta/meta.json**: a JSON file containing metadata collected on the Internet
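The exact schema of meta.json is not documented here; judging by how the R script consumes it (an `idCSV` key linking an entry back to its CSV row, and a `data` key listing the inserted files), each entry might look like this minimal sketch, in which every value is illustrative:

```python
import json

# Hypothetical meta.json content: idCSV and data are the keys the R script
# reads (node$idCSV, node$data); the paths below are made-up examples.
meta = [
    {
        "idCSV": 1,  # row id from input/datasources.csv
        "data": [
            "/datalake/opendata3m/file1.csv",  # paths of the inserted files
            "/datalake/opendata3m/file2.csv",
        ],
    }
]

serialized = json.dumps(meta)  # what main.py writes to output/meta/meta.json
```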
## R script: insert metadata into the Data Lake's metadata management system (GeoNetwork)
### Prerequisites
This script uses GeoNetwork's API, which you have to enable in your GeoNetwork settings.
cf: https://github.com/geonetwork/core-geonetwork/blob/9310d0ba85e6a35f48dbfa5d6168ba7088609724/web/src/main/webapp/WEB-INF/config/config-service-xml-api.xml#L83
*Be aware that if GeoNetwork is updated you may lose this setting, especially if you deploy it as a Tomcat WAR (which is the case for Geosur).*
### Files
* Main script: **src/addServicesToGN.R**
* This script reads information from 2 files:
  + **input/tetis-services.csv**
  + **output/meta/meta.json**: the JSON file generated by the main Python script, containing the information needed to build the HDFS path links
### Dependencies
It only uses 2 libraries from FAO:
* geonapi: an R library to insert/update/delete metadata directly in our GeoNetwork: https://github.com/eblondel/geonapi
* geometa: an R library to create ISO 19115 XML (INSPIRE compliant): https://github.com/eblondel/geometa
### Install
* System library dependencies:
```shell
sudo apt-get install libssl-dev libxml2-dev
```
* System dependencies related to the R software (adds the CRAN Ubuntu repository, then installs R):
```shell
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
sudo add-apt-repository 'deb [arch=amd64,i386] https://cran.rstudio.com/bin/linux/ubuntu xenial/'
sudo apt-get install r-base
```
* Additional R libs, to be installed from an R session
(see https://github.com/eblondel/geonapi/wiki#install_guide, https://github.com/eblondel/geometa/wiki#install_guide):
```r
install.packages("devtools")
install.packages("XML")
install.packages("uuid")
install.packages("osmdata")
install.packages("rjson")
require("devtools")
install_github("eblondel/geometa")
install_github("eblondel/geonapi")
```
Pinned Python requirements:
```
certifi==2019.11.28
chardet==3.0.4
docopt==0.6.2
hdfs==2.5.8
idna==2.8
numpy==1.17.4
pandas==0.25.3
python-dateutil==2.8.1
pytz==2019.3
pywebhdfs==0.4.1
requests==2.22.0
six==1.13.0
urllib3==1.25.7
```
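The `hdfs` package pinned above (HdfsCLI) is the library the README credits for HDFS interaction. A minimal upload sketch follows; the WebHDFS endpoint, the `/datalake` base directory, and the helper names are all assumptions, not values taken from the scripts:

```python
import posixpath

def hdfs_target_path(base_dir, local_filename):
    """Build the HDFS destination path for one downloaded file.

    base_dir is a hypothetical Data Lake root, e.g. "/datalake"."""
    return posixpath.join(base_dir, posixpath.basename(local_filename))

def upload_file(client, base_dir, local_path):
    """Upload one downloaded file with an HdfsCLI client.

    client is expected to be an hdfs.InsecureClient created by the caller,
    e.g. InsecureClient("http://namenode:50070", user="hdfs");
    overwrite=True keeps re-runs of the ingestion idempotent."""
    target = hdfs_target_path(base_dir, local_path)
    client.upload(target, local_path, overwrite=True)
    return target
```

Only the path-building part is pure; the actual upload requires a running WebHDFS endpoint.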
Diff of **src/addServicesToGN.R** (debug prints removed or commented out):
```diff
@@ -7,8 +7,6 @@ library(osmdata)
 library(rjson)
 working_dir = getwd()
-print("working on :")
-print(working_dir)
 # Connection to geonetwork
 gn <- GNManager$new(
@@ -17,7 +15,6 @@ gn <- GNManager$new(
 user = "admin",
 pwd = "admin"
 )
-print("ici")
 ## Read input
 services <- read.csv(file=paste(working_dir, "../input/datasources.csv", sep = "/"), sep =";")
@@ -62,7 +59,7 @@ for (service in services$id) {
 for (node in json_data){
 if(node$idCSV == iterator){
 ressources = basename(node$data)
-print(str(ressources))
+# print(str(ressources))
 for (ressource in ressources){
 newURL <- ISOOnlineResource$new()
 newURL$setName(paste0(ressource))
```
Diff of **src/main.py** (download re-enabled, R fallback invocation changed):
```diff
@@ -102,7 +102,7 @@ if __name__ == '__main__':
 jsonfile.write(json.dumps(opendata3mDataMetada))
 jsonfile.close()
 """Download File"""
-# nboffiledl = downloadOpendata3MFiles(opendata3mDataMetada, pathToSaveDownloadedData)
+nboffiledl = downloadOpendata3MFiles(opendata3mDataMetada, pathToSaveDownloadedData)
 """Insert files inside HDFS and store file"""
 # connect to HDFS
@@ -121,7 +121,7 @@ if __name__ == '__main__':
 subprocess.call("/usr/bin/Rscript addServicesToGN.R")
 except :
 print("R error due to OSM ? Try re-launched")
-subprocess.call("/usr/bin/Rscript addServicesToGN.R", shell=True)
+subprocess.call("R -f addServicesToGN.R", shell=True)
 print(str(nboffiledl)+" files downloaded in : " + pathToSaveDownloadedData)
 print("AIDMOIt ingestion module ends")
```