Documentation of Sentinel-2 files for LULC using Positive Unlabeled Learning

Introduction

An automated workflow for Sentinel-2 data extraction and preprocessing is available at https://gitlab.irstea.fr/johann.desloires/one_class_stis/-/tree/master/Sentinel2Theia. It constructs a dataset from an input vector file in the UTM projection.

The objective of this document is to describe the input and output files used in this pipeline, stored remotely on the machine 172.16.10.242 under the source path /media/DATA/johann/PUL/TileHG/ (root path).

Dataset preparation

Input files

Vector data description

The final input vector file is available at ./FinalDBPreprocessed/DATABASE_SAMPLED.

The following steps were performed to create this file from the initial input ./FinalDBPreprocessed/DATABASE_READY (dense annotation):

  1. Split large geometries using the intersection with a regular 1*1 km grid and keep only objects with a surface larger than 250 m2; the output is ./FinalDBPreprocessed/DATABASE_GEOM_SPLIT
  2. Random sampling over a regular grid of 5*5 km (K cells) (VectorsDPP/Sampling.py): given a fixed sampling rate r (e.g. 10%), we randomly select n objects out of the N available per cell, with n equal to (N*r/K). The sampling rate values are the following :
    • 2.5% for Cereals/Oilseeds, Meadows/Uncultivated and Forest
    • 5% for Built class
    • 10% for Water, Market Gardening and Fodder
    • 20% for Orchards
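The per-cell random sampling in step 2 can be sketched as follows. This is a minimal illustration with hypothetical object IDs and a simplified rule (each cell keeps roughly a fraction r of its N objects), not the actual VectorsDPP/Sampling.py code:

```python
import random

def sample_per_cell(cells, rate, seed=0):
    """Randomly keep a fraction `rate` of the objects in each grid cell.

    `cells` maps a cell id to the list of object ids falling in it.
    At least one object is kept per non-empty cell.
    """
    rng = random.Random(seed)
    sampled = {}
    for cell_id, objects in cells.items():
        n = max(1, round(len(objects) * rate))  # number of objects to keep
        sampled[cell_id] = rng.sample(objects, n)
    return sampled

# Example: 10% sampling over two hypothetical 5x5 km cells
cells = {"cell_0": list(range(40)), "cell_1": list(range(20))}
picked = sample_per_cell(cells, rate=0.10)
print({k: len(v) for k, v in picked.items()})  # {'cell_0': 4, 'cell_1': 2}
```

Sampling per cell rather than over the whole tile keeps the selected objects spatially spread across the study area.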

Sentinel-2 Download configuration

The code to download and preprocess a tile for Land Use and Land Cover analysis is summarized in the script Sentinel2Theia/main.py.

You must provide the following paths to execute the code :

  • Folder theia_download obtained from the pulled git repository (cf. prerequisites)

  • A vector folder where you have :

    • Ground truth shapefile : object IDs and labels to use for land cover training. The projection should be your UTM zone (e.g. 'epsg:32631' for the tile TC31J (France)).
    • Polygon of the Area of Interest : shapefile that will be used to crop the images to your study zone.

  • Folder where Orfeo Toolbox is installed

By default, the output files are saved in the folder GEOTIFFS under the path /media/DATA/johann/PUL/TileHG/Sentinel2.

Output files

Scripts for Sentinel-2 data download and preprocessing

The steps are chained together in the script Sentinel2Theia/main.py, which proceeds as follows :

  1. Download the folders (Flat Reflectance) containing the raw data images by specifying the tile name, relative orbit and date range, using the script Sentinel2Theia/unzip_data.py. Outputs are saved into the git repository path theia_download (/media/DATA/johann/PUL/TileHG/Sentinel2/theia_download).
  2. Stack the temporally distributed data into GEOTIFF files per band and filter out cloudy images given a threshold, using the script Sentinel2Theia/stack_data.py
  3. Apply gap filling to interpolate cloudy pixels using linear interpolation (Sentinel2Theia/GapFilling.py).
  4. Spatially interpolate the 20-meter bands (B5, B6, B7, B8A, B9, B11, B12) to 10 meters (Sentinel2Theia/GFSuperImpose.py)
  5. Compute NDVI and NDWI vegetation indices (Sentinel2Theia/VegetationIndices.py)
  6. Build a training dataset and compute descriptive statistics ("meta info"), given a list of feature names, and a vector file (Sentinel2Theia/training_set.py)
  7. Scale the data between 0 and 1 and split it into Positive and Unlabeled (50%) and Testing (50%) sets given a class of interest. Repeat this 10 times independently, with different positive sample sizes as input (e.g. [20, 40, 60, 80, 100]) and a window (e.g. 20) (PUL/Experiments.py)
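The temporal gap filling in step 3 can be illustrated with a minimal numpy sketch. This is an assumption about the method (per-pixel linear interpolation over acquisition dates), not the actual Sentinel2Theia/GapFilling.py code:

```python
import numpy as np

def gapfill_pixel(values, cloudy, dates):
    """Linearly interpolate the cloudy acquisitions of one pixel's time series.

    values : reflectance per date, cloudy : boolean cloud mask, dates : day of year.
    """
    values = np.asarray(values, dtype=float)
    valid = ~np.asarray(cloudy)
    # np.interp fills the masked dates from the surrounding valid acquisitions
    return np.interp(dates, np.asarray(dates)[valid], values[valid])

dates = np.array([10, 20, 30, 40])
values = np.array([0.2, 0.0, 0.0, 0.5])            # dates 20 and 30 are cloudy
cloudy = np.array([False, True, True, False])
print(gapfill_pixel(values, cloudy, dates))        # [0.2 0.3 0.4 0.5]
```

Valid acquisitions are left unchanged; only the cloud-masked dates are replaced by values interpolated between the nearest clear observations.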

Description of the final output files

Given a year (here, 2019) and a root path (/media/DATA/johann/PUL/TileHG/), the final output raster files are cropped to the extent computed from the input vector files. GEOTIFF files can be opened as np.array using the gdal library : gdal.Open(path).ReadAsArray()
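Once bands are loaded as arrays, the vegetation indices from step 5 can be computed elementwise. A minimal sketch with synthetic arrays standing in for the gdal-loaded B8 (NIR) and B4 (red) bands; NDWI is computed analogously from B3 and B8:

```python
import numpy as np

def ndvi(b8, b4, eps=1e-8):
    """Normalized Difference Vegetation Index: (NIR - red) / (NIR + red)."""
    return (b8 - b4) / (b8 + b4 + eps)  # eps avoids division by zero

# Synthetic 2x2 reflectance patches in place of real Sentinel-2 bands
b8 = np.array([[0.6, 0.5], [0.4, 0.3]])
b4 = np.array([[0.2, 0.1], [0.2, 0.3]])
print(ndvi(b8, b4))  # values in [-1, 1], higher for denser vegetation
```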

  • Ground truth data. A binary erosion has been performed, which means small objects may have been removed. The file names match the corresponding column names.
    • ./Sentinel2/GEOTIFFS/Object_ID_crop_2019.tif (int16) : unique id per object, rasterized from the column "Object_ID" in the input vector file. The ids are not necessarily consecutive, as some objects may have been deleted during erosion.
    • ./Sentinel2/GEOTIFFS/Class_ID_crop_2019.tif (int16) : unique id per land cover class, rasterized from the column "Class_ID" in the input vector file.
  • Sentinel-2 data, filtered during step 2 (Sentinel2Theia/stack_data.py) using a given cloud percentage (50% by default) :
    • ./Sentinel2/GEOTIFFS/GFStack_B2_crop_2019.tif (float32) : Band B2 gap filled, cropped to the extent and subset to the year 2019 (same for B3, B4, B8)
    • ./Sentinel2/GEOTIFFS/GFStack_SI_B5_crop_2019.tif (float32) : Band B5 gap filled, cropped to the extent, subset to the year 2019 and interpolated to 10 meters (superimposed; same for the other 20-meter bands B6, B7, B8A, B11 and B12).
    • ./Sentinel2/GEOTIFFS/stack_10m_crop_2019.tif (int16) : Binary cloud mask, cropped to the extent and subset to the year 2019
    • ./Sentinel2/dates.csv : CSV file listing the acquisition dates of the stacked time series.
  • Meta information (descriptive statistics per band, dates) : (./Sentinel2/dictionary_meta_info.pickle)
  • Training set (./FinalDBPreprocessed/training_set.csv) : pixels matched with the objects. The data has not been scaled, but it can be scaled using the meta information file.
  • Experiments (./FinalDBPreprocessed/Experiments) : different splitting scenarios, repeated 10 times, given a class of interest (e.g. ./CerealsOilseeds/60_P/ folders for the class Cereals/Oilseeds with 60 positive objects)
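The splitting behind the Experiments folders (step 7) can be sketched as follows. This is a simplified illustration with hypothetical labels, not the actual PUL/Experiments.py logic: half of the objects are held out for testing, and within the training half only a fixed number of objects of the class of interest are kept as labeled positives, the rest becoming unlabeled:

```python
import numpy as np

def pu_split(object_ids, labels, positive_class, n_positive, seed=0):
    """Split objects into Positive, Unlabeled (50%) and Testing (50%) sets."""
    rng = np.random.default_rng(seed)
    ids = rng.permutation(object_ids)
    half = len(ids) // 2
    train, test = ids[:half], ids[half:]
    lab = dict(zip(object_ids, labels))
    # Keep at most n_positive labeled objects of the class of interest
    positives = [i for i in train if lab[i] == positive_class][:n_positive]
    unlabeled = [i for i in train if i not in positives]
    return positives, unlabeled, test

ids = list(range(100))
labels = [i % 2 for i in ids]            # two hypothetical classes 0/1
pos, unl, test = pu_split(ids, labels, positive_class=1, n_positive=20)
print(len(pos), len(unl), len(test))     # positives capped at 20; 50/50 train/test
```

Re-running with different seeds and different `n_positive` values reproduces the "repeated 10 times, given different sample sizes" structure of the Experiments folders.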