Commit e99f23e6 authored by Decoupes Remy's avatar Decoupes Remy

building readme.md

# Populate Data Lake
Populating is done with two scripts, one written in Python and one in R. Both languages are needed because the functionality offered by their libraries is of unequal quality: Python offers an excellent library for interacting with HDFS, while R has interesting modules for managing ISO 19115 metadata. To reduce the complexity created by using these two languages together, the R script has been encapsulated inside the Python script, so the administrator only needs to run the Python script.
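The encapsulation boils down to calling the R script as a subprocess. A minimal sketch of that pattern follows; the `/usr/bin/Rscript` path and the `R -f` fallback are the ones used in `src/main.py`, while the helper function names are ours:

```python
import subprocess

def r_command(script):
    """Primary invocation, as used by src/main.py."""
    return ["/usr/bin/Rscript", script]

def r_fallback_command(script):
    """Fallback invocation (shell string), as used by src/main.py."""
    return "R -f " + script

def run_r_script(script="addServicesToGN.R"):
    """Run the R script from Python, retrying with `R -f` on failure,
    mirroring the try/except in src/main.py (the first failure is
    typically caused by an OSM lookup inside the R script)."""
    try:
        return subprocess.call(r_command(script))
    except OSError:
        return subprocess.call(r_fallback_command(script), shell=True)
```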
## Python: collect data and insert it into the Data Lake's data zone (HDFS cluster)
### Prerequisites
Has to be run with Python 3.6, with the requirements listed in python-requirements.txt.
If you are using pip, you can install all requirements with this command:
```shell
pip3 install -r requirements.txt
```
### Files
* Main script: **src/main.py**
* This script reads information from a CSV file: **input/datasources.csv**
* It produces two kinds of output:
  + **/output/data/**: the downloaded files
  + **/output/meta/meta.json**: a JSON file containing metadata collected on the Internet
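The exact schema of meta.json is not documented here; judging by how the R script consumes it (an `idCSV` key linking an entry back to its CSV row, and a `data` key listing the inserted files), each entry might look like this minimal sketch, in which every value is illustrative:

```python
import json

# Hypothetical meta.json content: idCSV and data are the keys the R script
# reads (node$idCSV, node$data); the paths below are made-up examples.
meta = [
    {
        "idCSV": 1,  # row id from input/datasources.csv
        "data": [
            "/datalake/opendata3m/file1.csv",  # paths of the inserted files
            "/datalake/opendata3m/file2.csv",
        ],
    }
]

serialized = json.dumps(meta)  # what main.py writes to output/meta/meta.json
```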
## R script: insert metadata into the Data Lake's metadata management system (GeoNetwork)
### Prerequisites
This script uses GeoNetwork's API, which you have to enable in your GeoNetwork settings.
cf: https://github.com/geonetwork/core-geonetwork/blob/9310d0ba85e6a35f48dbfa5d6168ba7088609724/web/src/main/webapp/WEB-INF/config/config-service-xml-api.xml#L83
*Be aware that if GeoNetwork is updated you may lose this setting, especially if you deploy it as a Tomcat WAR (which is the case for Geosur).*
### Files
* Main script: **src/addServicesToGN.R**
* This script reads information from 2 files:
  + **input/tetis-services.csv**
  + **output/meta/meta.json**: the JSON file generated by the main Python script, containing the information needed to build the HDFS path links
### Dependencies
It only uses 2 libraries from FAO:
* geonapi: an R library to insert/update/delete metadata directly in our GeoNetwork: https://github.com/eblondel/geonapi
* geometa: an R library to create ISO 19115 XML (INSPIRE compliant): https://github.com/eblondel/geometa
### Install
* System library dependencies:
```shell
sudo apt-get install libssl-dev libxml2-dev
```
* System dependencies related to the R software (adds the CRAN Ubuntu repository, then installs R):
```shell
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
sudo add-apt-repository 'deb [arch=amd64,i386] https://cran.rstudio.com/bin/linux/ubuntu xenial/'
sudo apt-get install r-base
```
* Additional R libs, to be installed from an R session
(see https://github.com/eblondel/geonapi/wiki#install_guide, https://github.com/eblondel/geometa/wiki#install_guide):
```r
install.packages("devtools")
install.packages("XML")
install.packages("uuid")
install.packages("osmdata")
install.packages("rjson")
require("devtools")
install_github("eblondel/geometa")
install_github("eblondel/geonapi")
```
Pinned Python requirements:
```
certifi==2019.11.28
chardet==3.0.4
docopt==0.6.2
hdfs==2.5.8
idna==2.8
numpy==1.17.4
pandas==0.25.3
python-dateutil==2.8.1
pytz==2019.3
pywebhdfs==0.4.1
requests==2.22.0
six==1.13.0
urllib3==1.25.7
```
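The `hdfs` package pinned above (HdfsCLI) is the library the README credits for HDFS interaction. A minimal upload sketch follows; the WebHDFS endpoint, the `/datalake` base directory, and the helper names are all assumptions, not values taken from the scripts:

```python
import posixpath

def hdfs_target_path(base_dir, local_filename):
    """Build the HDFS destination path for one downloaded file.

    base_dir is a hypothetical Data Lake root, e.g. "/datalake"."""
    return posixpath.join(base_dir, posixpath.basename(local_filename))

def upload_file(client, base_dir, local_path):
    """Upload one downloaded file with an HdfsCLI client.

    client is expected to be an hdfs.InsecureClient created by the caller,
    e.g. InsecureClient("http://namenode:50070", user="hdfs");
    overwrite=True keeps re-runs of the ingestion idempotent."""
    target = hdfs_target_path(base_dir, local_path)
    client.upload(target, local_path, overwrite=True)
    return target
```

Only the path-building part is pure; the actual upload requires a running WebHDFS endpoint.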
Diff of **src/addServicesToGN.R** (debug prints removed or commented out):
```diff
@@ -7,8 +7,6 @@ library(osmdata)
 library(rjson)
 working_dir = getwd()
-print("working on :")
-print(working_dir)
 # Connection to geonetwork
 gn <- GNManager$new(
@@ -17,7 +15,6 @@ gn <- GNManager$new(
 user = "admin",
 pwd = "admin"
 )
-print("ici")
 ## Read input
 services <- read.csv(file=paste(working_dir, "../input/datasources.csv", sep = "/"), sep =";")
@@ -62,7 +59,7 @@ for (service in services$id) {
 for (node in json_data){
 if(node$idCSV == iterator){
 ressources = basename(node$data)
-print(str(ressources))
+# print(str(ressources))
 for (ressource in ressources){
 newURL <- ISOOnlineResource$new()
 newURL$setName(paste0(ressource))
```
Diff of **src/main.py** (download re-enabled, R fallback invocation changed):
```diff
@@ -102,7 +102,7 @@ if __name__ == '__main__':
 jsonfile.write(json.dumps(opendata3mDataMetada))
 jsonfile.close()
 """Download File"""
-# nboffiledl = downloadOpendata3MFiles(opendata3mDataMetada, pathToSaveDownloadedData)
+nboffiledl = downloadOpendata3MFiles(opendata3mDataMetada, pathToSaveDownloadedData)
 """Insert files inside HDFS and store file"""
 # connect to HDFS
@@ -121,7 +121,7 @@ if __name__ == '__main__':
 subprocess.call("/usr/bin/Rscript addServicesToGN.R")
 except :
 print("R error due to OSM ? Try re-launched")
-subprocess.call("/usr/bin/Rscript addServicesToGN.R", shell=True)
+subprocess.call("R -f addServicesToGN.R", shell=True)
 print(str(nboffiledl)+" files downloaded in : " + pathToSaveDownloadedData)
 print("AIDMOIt ingestion module ends")
```