diff --git a/README.md b/README.md
index 5a64ea0ab40da6a98a5ab664aefaa406e5886a9a..eeab1a8b102dd632bac5429e4a6a21a6341e5432 100644
--- a/README.md
+++ b/README.md
@@ -33,7 +33,7 @@ be retrieved individually, without downloading the whole archive contents:
 ```python
 from theia_picker import TheiaCatalog
 
-# Download band 6 from a Sentinel-2 Level 2A product
+# Download bands 4 and 8 from a Sentinel-2 Level 2A product
 cat = TheiaCatalog("credentials.json")
 feats = cat.search(tile_name="T31TEJ", start_date="14/01/2021", level="LEVEL2A")
 for f in feats:
diff --git a/doc/index.md b/doc/index.md
index c4ba6a9d1274623e233157e142ccef2720ef333b..7df5a1c915c6cecf3fda3d489cd8b3a39540776e 100644
--- a/doc/index.md
+++ b/doc/index.md
@@ -10,18 +10,111 @@ pip install theia-picker
 ```
 
 
-## Quickstart
+## Credentials
 
-``` py
+The credentials must be stored in a JSON file. It should look like this:
+
+*credentials.json*
+```json
+{
+    "ident": "username1234",
+    "pass": "thisisnotmypassword"
+}
+```
+
+## Searching products
+
+The `TheiaCatalog` class is the top-level API to access the products. It uses
+the credentials stored in the *credentials.json* file.
+
+```python
 from theia_picker import TheiaCatalog
 
-theia = TheiaCatalog("/path/to.credentials.json")
-features = theia.search(...)
+theia = TheiaCatalog("credentials.json")
+features = theia.search(
+    start_date="01/01/2022",
+    end_date="01/01/2023",
+    bbox=[4.01, 42.99, 4.9, 44.05],
+    level="LEVEL2A"
+)
+```
+
+The `end_date` is optional. If not provided, the products are searched only for
+`start_date`. The `tile_name` parameter restricts the search to products in a
+specific tile.
+
+Here is another example of search without `end_date` and `bbox`, using
+`tile_name`:
+
+```python
+features = theia.search(
+    start_date="01/01/2022",
+    tile_name="T31TEJ",
+    level="LEVEL2A"
+)
+```
+
+`search()` returns a `list` of `Feature` instances. For each `Feature`, one
+can retrieve its information.
+
+```python
 for f in features:
-    # download the entire archive
-    f.download_archive(output_dir="...")
-    # or... download only the file you want
-    files = f.list_files_in_archive()
-    some_file = files[0] # pick one file name
-    f.download_single_file(some_file, output_dir="...")
+    print(f.properties.product_identifier)
+```
+
+And the most interesting thing is how we can download files from one `Feature`.
+
+## Downloading products
+
+Theia-picker can download **archives** or **individual files** from the
+remote archive. When individual files are downloaded, only the bytes relative
+to the compressed file in the remote archive are downloaded. Then they are
+decompressed and written as the file. This is particularly interesting when a
+few files are needed. No need to download the entire archive!
+
+### Archives
+
+The following will download the entire archive.
+
+```python
+f.download_archive(download_dir="/tmp")
+```
+
+When the archive already exists in the download directory, the md5sum is
+computed and compared with the one in the catalog, in order to determine if it
+has to be downloaded again. If the file is already downloaded and is complete
+according to the md5sum, its download is skipped.
+
+### Individual files
+
+The **list of files** in the remote archive can be retrieved:
+
+```python
+files = f.list_files_in_archive()
+```
+
+Filenames are returned from `list_files_in_archive()` as a `list` of `str`.
+
+Then one can **download a specific file** using `download_single_file()`:
+
+```python
+f.download_single_file(filename="S.../MASKS/...EDG_R1.tif", download_dir="/tmp")
+```
+
+Here the theia token is renewed automatically when a request fails. If you
+prefer, you can force the token renewal prior to downloading the file with
+`renew_token=True`.
+
+Finally, you can **download multiple files** matching a set of patterns.
+The following example shows how to download only files containing *FRE_B4.tif*
+or *FRE_B8.tif* expressions.
+
+```python
+f.download_files(matching=["FRE_B4.tif", "FRE_B8.tif"], download_dir="/tmp")
 ```
+Theia-picker downloads only the part of the remote archive containing the
+compressed bytes of files, and decompresses the data. The CRC32 checksum is
+computed to check that the files are correctly downloaded. If not, the download
+is retried. When the destination file already exists, the CRC32 is computed and
+compared with the CRC32 of the file in the remote archive. If both checksums
+match, the download is skipped.
diff --git a/theia_picker/download.py b/theia_picker/download.py
index e4482fdea28465bdd21edd86698fe94c711224a1..16384c02d058d578d88e0f7da81d03264a2d9959 100644
--- a/theia_picker/download.py
+++ b/theia_picker/download.py
@@ -2,73 +2,6 @@
 """
 This module handles the download of Theia products.
 
-The `TheiaCatalog` uses Theia credentials, stored in a JSON file:
-
-```json
-{
-    "ident": "remi.cresson@inrae.fr",
-    "pass": "thisisnotmyrealpassword"
-}
-```
-
-To instantiate the `TheiaCatalog`:
-``` py
-theia = TheiaCatalog("/path/to.credentials.json")
-```
-
-# Search
-
-The following example shows how to use the `TheiaCatalog` to search
-all Sentinel-2 images in a bounding box within a temporal range.
-
-``` py
-features = theia.search(
-    bbox=[4.317, 43.706, 4.420, 43.708],
-    start_date="01/01/2020",
-    end_date="O1/O1/2022",
-    level="LEVEL2A"
-)
-```
-
-Here `features` is a `list` of `Feature` instances.
-
-# Entire archive download
-
-Each feature can be downloaded entirely (meaning, the whole archive):
-
-``` py
-for feature in features:
-    feature.download_archive("/path/to/download_dir/")
-```
-When files already exist, the md5sum is computed and compared with the one
-in the catalog, in order to determine if it has to be downloaded again.
-If the file is already downloaded and is complete according to the md5sum,
-its download is skipped.
-To force the download, call `download_archive(..., overwrite=True)`.
-
-# Partial archive download
-
-## List files in the archive
-
-The remote archive **is not downloaded** to perform this action.
-
-``` py
-for f in features:
-    f.list_files_in_archive()
-```
-
-## Download and unzip a specific file
-
-Only **a subset** of the remote archive will be downloaded.
-
-``` py
-for f in features:
-    f.download_single_file(
-        "SENTINEL2A_..._V3-0/SENTINEL2A_..._QKL_ALL.jpg",
-        output_dir="/path/to/downloads/"
-    )
-```
-
 """
 import datetime
 import hashlib
@@ -484,24 +417,21 @@ class RemoteZip:
             self,
             filename: str,
             output_dir: str,
-            overwrite: bool = False,
             renew_token: bool = False
     ):
         """
         Download a single file from the remote archive.
 
-        If the destination file already exists in the download directory, and
-        overwrite is set to False, the CRC32 checksum is computed and compared
-        with the CRC32 of the compressed file in the remote archive. If they
-        match, the download is skipped.
-        After the download, the CRC32 checksum is computed and compared with
-        the CRC32 of the compressed file in the remote archive. If they don't
-        match, the download is retried.
+        If the destination file already exists in the download directory, the
+        CRC32 checksum is computed and compared with the CRC32 of the
+        compressed file in the remote archive. If they match, the download is
+        skipped. After the download, the CRC32 checksum is computed and
+        compared with the CRC32 of the compressed file in the remote archive.
+        If they don't match, the download is retried.
 
         Args:
             filename: file path in the remote archive
             output_dir: output directory
-            overwrite: overwrite existing downloaded file
             renew_token: can be used to force the token renewal
 
         """
@@ -529,15 +459,12 @@ class RemoteZip:
         output_file = os.path.join(output_dir, os.path.basename(filename))
 
         # Check if file already exist
-        if not overwrite:
-            if os.path.isfile(output_file):
-                crc_out = compute_crc32(output_file)
-                log.debug("CRC32 (existing file): %s", crc_out)
-                if crc_out == crc:
-                    log.info(
-                        "File %s already downloaded. Skipping.", output_file
-                    )
-                    return
+        if os.path.isfile(output_file):
+            crc_out = compute_crc32(output_file)
+            log.debug("CRC32 (existing file): %s", crc_out)
+            if crc_out == crc:
+                log.info("File %s already downloaded. Skipping.", output_file)
+                return
 
         start = info["header_offset"] + sizeof_localhdr + fnlen + extralen
         self._get_range(start=start, length=size, output_file=output_file)
@@ -653,12 +580,12 @@ class Feature(BaseModel, extra=Extra.allow):
         """
         Download the entire archive.
 
-        If the destination file already exists in the download directory, and
-        overwrite is set to False, the MD5 checksum is computed and compared
-        with the MD5 of the remote archive. If they match, the download is
-        skipped. After the download, the MD5 checksum is computed and compared
-        with the MD5 of the compressed file in the remote archive. If they
-        don't match, the download is retried.
+        If the destination file already exists in the download directory, the
+        MD5 checksum is computed and compared with the MD5 of the remote
+        archive. If they match, the download is skipped. After the download,
+        the MD5 checksum is computed and compared with the MD5 of the
+        compressed file in the remote archive. If they don't match, the
+        download is retried.
 
         Args:
             download_dir: download directory
@@ -727,7 +654,6 @@ class Feature(BaseModel, extra=Extra.allow):
             self,
             filename: str,
             download_dir: str,
-            overwrite: bool = False,
             renew_token: bool = False
     ):
         """
@@ -737,7 +663,6 @@ class Feature(BaseModel, extra=Extra.allow):
             filename: file path of the file to download/extract from the
                 remote archive
             download_dir: download directory
-            overwrite: overwrite
             renew_token: can be used to force the token renewal
 
         """
@@ -760,7 +685,6 @@ class Feature(BaseModel, extra=Extra.allow):
         remote_zip.download_single_file(
             filename=filename,
             output_dir=output_dir,
-            overwrite=overwrite,
             renew_token=renew_token
         )
 
@@ -768,7 +692,6 @@ class Feature(BaseModel, extra=Extra.allow):
             self,
             download_dir: str,
             matching: List[str],
-            overwrite: bool = False,
             renew_token: bool = False
     ):
         """
@@ -777,7 +700,6 @@ class Feature(BaseModel, extra=Extra.allow):
         Args:
             download_dir: download directory
             matching: list of string to match filenames
-            overwrite: overwrite
             renew_token: force the token renewal prior to download each file
 
         """
@@ -786,7 +708,6 @@ class Feature(BaseModel, extra=Extra.allow):
             self.download_single_file(
                 filename=filename,
                 download_dir=download_dir,
-                overwrite=overwrite,
                 renew_token=renew_token
             )
 
diff --git a/theia_picker/utils.py b/theia_picker/utils.py
index d8715d8dc35a9b1d9bc72e8fa481b83d2c20838b..0ce816827835b14d9c7ea9fbcb24f267026c3c5f 100644
--- a/theia_picker/utils.py
+++ b/theia_picker/utils.py
@@ -1,5 +1,5 @@
 """
-Module containing helpers.
+This module contains some helpers.
 """
 import os
 import logging
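
With `overwrite` removed, the docstrings in this diff describe a single rule: an existing file is kept only when its CRC32 matches the one recorded in the remote archive. The sketch below illustrates that rule in isolation. It is not the project's code: `crc32_of_file` and `needs_download` are hypothetical names, and the chunked `zlib.crc32` loop is an assumption, since the diff does not show the body of `compute_crc32`.

```python
import os
import zlib


def crc32_of_file(path: str, chunk_size: int = 1 << 20) -> int:
    """Return the CRC32 of a file, reading it in chunks to bound memory use.

    Hypothetical helper; the project's own compute_crc32 may differ.
    """
    crc = 0
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def needs_download(path: str, expected_crc: int) -> bool:
    """Return True when the file is missing or its CRC32 differs from expected_crc."""
    if not os.path.isfile(path):
        return True
    return crc32_of_file(path) != expected_crc
```

Under this rule a caller who wants a fresh copy would delete the local file first, because no flag bypasses the checksum comparison any more.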