Commit 86f9d84a authored by Remi Cresson

DOC: update documentation

Showing with 123 additions and 109 deletions
......@@ -33,7 +33,7 @@ be retrieved individually, without downloading the whole archive contents:
```python
from theia_picker import TheiaCatalog
# Download band 6 from a Sentinel-2 Level 2A product
# Download bands 4 and 8 from a Sentinel-2 Level 2A product
cat = TheiaCatalog("credentials.json")
feats = cat.search(tile_name="T31TEJ", start_date="14/01/2021", level="LEVEL2A")
for f in feats:
......
......@@ -10,18 +10,111 @@
pip install theia-picker
```
## Quickstart
## Credentials
``` py
The credentials must be stored in a JSON file. It should look like this:
*credentials.json*
```json
{
"ident": "username1234",
"pass": "thisisnotmypassword"
}
```
## Searching products
The `TheiaCatalog` class is the top level API to access the products. It uses
the credentials stored in the *credentials.json* file.
```python
from theia_picker import TheiaCatalog
theia = TheiaCatalog("/path/to.credentials.json")
features = theia.search(...)
theia = TheiaCatalog("credentials.json")
features = theia.search(
start_date="01/01/2022",
end_date="01/01/2023",
bbox=[4.01, 42.99, 4.9, 44.05],
level="LEVEL2A"
)
```
The `end_date` is optional. If not provided, the products are searched only for
`start_date`. The `tile_name` parameter restricts the search to a specific
tile.
Here is another example of search without `end_date` and `bbox`, using
`tile_name`:
```python
features = theia.search(
start_date="01/01/2022",
tile_name="T31TEJ",
level="LEVEL2A"
)
```
`search()` returns a `list` of `Feature` instances. The information attached
to each `Feature` can then be retrieved.
```python
for f in features:
# download the entire archive
f.download_archive(output_dir="...")
# or... download only the file you want
files = f.list_files_in_archive()
some_file = files[0] # pick one file name
f.download_single_file(some_file, output_dir="...")
print(f.properties.product_identifier)
```
The most interesting part is downloading files from a `Feature`.
## Downloading products
Theia-picker can download **archives** or **individual files** from the
remote archive. When individual files are downloaded, only the bytes of the
corresponding compressed file in the remote archive are fetched. They are then
decompressed and written to the output file. This is particularly useful when
only a few files are needed: there is no need to download the entire archive!
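To make the idea concrete, here is a minimal sketch of reading one member of a ZIP archive from its byte range only. It is not the Theia-picker implementation: the archive lives in memory, the slices on `archive_bytes` stand in for HTTP range requests, and `extract_member_by_range` is a hypothetical helper name.

```python
import io
import struct
import zipfile
import zlib

# Fixed part of a ZIP local file header: signature, 5 shorts, 3 longs,
# then the file name length and the extra field length.
LOCAL_HEADER_FMT = "<4s5H3L2H"
LOCAL_HEADER_SIZE = struct.calcsize(LOCAL_HEADER_FMT)  # 30 bytes


def extract_member_by_range(archive_bytes: bytes, name: str) -> bytes:
    """Decompress one member using only its own byte range (hypothetical helper)."""
    # The central directory gives the offset and sizes of each member.
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        info = zf.getinfo(name)

    # Read the local header to find where the compressed data starts.
    header = archive_bytes[info.header_offset:info.header_offset + LOCAL_HEADER_SIZE]
    fields = struct.unpack(LOCAL_HEADER_FMT, header)
    fnlen, extralen = fields[9], fields[10]
    start = info.header_offset + LOCAL_HEADER_SIZE + fnlen + extralen

    # Here a slice; against a remote archive this would be a range request.
    compressed = archive_bytes[start:start + info.compress_size]

    if info.compress_type == zipfile.ZIP_DEFLATED:
        data = zlib.decompress(compressed, -zlib.MAX_WBITS)  # raw deflate stream
    else:
        data = compressed  # assumes a stored (uncompressed) member

    # Verify the payload against the CRC32 recorded in the archive.
    if zlib.crc32(data) != info.CRC:
        raise ValueError(f"CRC32 mismatch for {name}")
    return data


# Build a small archive in memory to exercise the helper.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a/FRE_B4.tif", b"band four " * 100)
    zf.writestr("a/FRE_B8.tif", b"band eight " * 100)
archive = buffer.getvalue()

band8 = extract_member_by_range(archive, "a/FRE_B8.tif")
```

Only the local header and the compressed payload of the requested member are read, which is why the cost depends on the file size rather than on the archive size.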
### Archives
The following will download the entire archive.
```python
f.download_archive(download_dir="/tmp")
```
When the archive already exists in the download directory, the md5sum is
computed and compared with the one in the catalog, in order to determine if it
has to be downloaded again. If the file is already downloaded and is complete
according to the md5sum, its download is skipped.
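The check described above can be sketched as follows. This is an illustration under assumptions, not the library's code: `md5_of_file` and `needs_download` are hypothetical helper names, and the expected checksum is assumed to be available as a hexadecimal string.

```python
import hashlib
import os


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex MD5 digest of a file, read in chunks to keep memory usage flat."""
    digest = hashlib.md5()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path: str, expected_md5: str) -> bool:
    """True unless an existing local file matches the expected checksum."""
    return not (os.path.isfile(path) and md5_of_file(path) == expected_md5)
```

A missing file and a file with a different checksum are treated the same way: both trigger a download.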
### Individual files
The **list of files** in the remote archive can be retrieved:
```python
files = f.list_files_in_archive()
```
Filenames are returned from `list_files_in_archive()` as a `list` of `str`.
Then one can **download a specific file** using `download_single_file()`:
```python
f.download_single_file(filename="S.../MASKS/...EDG_R1.tif", download_dir="/tmp")
```
Here the Theia token is renewed automatically when a request fails. If you
prefer, you can force the token renewal before downloading the file with
`renew_token=True`.
Finally, you can **download multiple files** matching a set of patterns.
The following example shows how to download only the files whose names contain
*FRE_B4.tif* or *FRE_B8.tif*.
```python
f.download_files(matching=["FRE_B4.tif", "FRE_B8.tif"], download_dir="/tmp")
```
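As an illustration of the selection step, the sketch below filters a list of file names by substring. Plain substring matching is an assumption based on the wording above, and `select_matching` and the file names are hypothetical.

```python
from typing import List


def select_matching(filenames: List[str], patterns: List[str]) -> List[str]:
    """Keep the names containing at least one of the patterns (assumed semantics)."""
    return [name for name in filenames if any(p in name for p in patterns)]


# Hypothetical file names, standing in for the output of list_files_in_archive()
names = ["X/X_FRE_B4.tif", "X/X_FRE_B8.tif", "X/MASKS/X_EDG_R1.tif"]
selected = select_matching(names, ["FRE_B4.tif", "FRE_B8.tif"])
```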
Theia-picker downloads only the part of the remote archive containing the
compressed bytes of files, and decompresses the data. The CRC32 checksum is
computed to check that the files are correctly downloaded. If not, the download
is retried. When the destination file already exists, the CRC32 is computed and
compared with the CRC32 of the file in the remote archive. If both checksums
match, the download is skipped.
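The skip-and-retry logic can be sketched like this. It is a simplified illustration: `fetch` is a hypothetical callable that writes the output file, and the number of attempts is an assumption, since the documentation does not state how many retries are made.

```python
import os
import zlib


def crc32_of_file(path: str, chunk_size: int = 1 << 20) -> int:
    """CRC32 of a file, read in chunks to keep memory usage flat."""
    crc = 0
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc


def fetch_with_crc_check(fetch, output_file: str, expected_crc: int,
                         max_attempts: int = 3) -> bool:
    """
    Return False when an intact copy already exists (download skipped),
    True when the file was fetched and verified.
    """
    # Skip when the existing file already matches the expected checksum.
    if os.path.isfile(output_file) and crc32_of_file(output_file) == expected_crc:
        return False
    # Otherwise fetch, verify, and retry on mismatch (attempt count is assumed).
    for _ in range(max_attempts):
        fetch(output_file)
        if crc32_of_file(output_file) == expected_crc:
            return True
    raise IOError(f"CRC32 mismatch for {output_file} after {max_attempts} attempts")
```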
......@@ -2,73 +2,6 @@
"""
This module handles the download of Theia products.
The `TheiaCatalog` uses Theia credentials, stored in a JSON file:
```json
{
"ident": "remi.cresson@inrae.fr",
"pass": "thisisnotmyrealpassword"
}
```
To instantiate the `TheiaCatalog`:
``` py
theia = TheiaCatalog("/path/to.credentials.json")
```
# Search
The following example shows how to use the `TheiaCatalog` to search
all Sentinel-2 images in a bounding box within a temporal range.
``` py
features = theia.search(
bbox=[4.317, 43.706, 4.420, 43.708],
start_date="01/01/2020",
end_date="01/01/2022",
level="LEVEL2A"
)
```
Here `features` is a `list` of `Feature` instances.
# Entire archive download
Each feature can be downloaded entirely (meaning, the whole archive):
``` py
for feature in features:
feature.download_archive("/path/to/download_dir/")
```
When files already exist, the md5sum is computed and compared with the one
in the catalog, in order to determine if it has to be downloaded again.
If the file is already downloaded and is complete according to the md5sum,
its download is skipped.
To force the download, call `download_archive(..., overwrite=True)`.
# Partial archive download
## List files in the archive
The remote archive **is not downloaded** to perform this action.
``` py
for f in features:
f.list_files_in_archive()
```
## Download and unzip a specific file
Only **a subset** of the remote archive will be downloaded.
``` py
for f in features:
f.download_single_file(
"SENTINEL2A_..._V3-0/SENTINEL2A_..._QKL_ALL.jpg",
output_dir="/path/to/downloads/"
)
```
"""
import datetime
import hashlib
......@@ -484,24 +417,21 @@ class RemoteZip:
self,
filename: str,
output_dir: str,
overwrite: bool = False,
renew_token: bool = False
):
"""
Download a single file from the remote archive.
If the destination file already exists in the download directory, and
overwrite is set to False, the CRC32 checksum is computed and compared
with the CRC32 of the compressed file in the remote archive. If they
match, the download is skipped.
After the download, the CRC32 checksum is computed and compared with
the CRC32 of the compressed file in the remote archive. If they don't
match, the download is retried.
If the destination file already exists in the download directory, the
CRC32 checksum is computed and compared with the CRC32 of the
compressed file in the remote archive. If they match, the download is
skipped. After the download, the CRC32 checksum is computed and
compared with the CRC32 of the compressed file in the remote archive.
If they don't match, the download is retried.
Args:
filename: file path in the remote archive
output_dir: output directory
overwrite: overwrite existing downloaded file
renew_token: can be used to force the token renewal
"""
......@@ -529,15 +459,12 @@ class RemoteZip:
output_file = os.path.join(output_dir, os.path.basename(filename))
# Check if file already exist
if not overwrite:
if os.path.isfile(output_file):
crc_out = compute_crc32(output_file)
log.debug("CRC32 (existing file): %s", crc_out)
if crc_out == crc:
log.info(
"File %s already downloaded. Skipping.", output_file
)
return
if os.path.isfile(output_file):
crc_out = compute_crc32(output_file)
log.debug("CRC32 (existing file): %s", crc_out)
if crc_out == crc:
log.info("File %s already downloaded. Skipping.", output_file)
return
start = info["header_offset"] + sizeof_localhdr + fnlen + extralen
self._get_range(start=start, length=size, output_file=output_file)
......@@ -653,12 +580,12 @@ class Feature(BaseModel, extra=Extra.allow):
"""
Download the entire archive.
If the destination file already exists in the download directory, and
overwrite is set to False, the MD5 checksum is computed and compared
with the MD5 of the remote archive. If they match, the download is
skipped. After the download, the MD5 checksum is computed and compared
with the MD5 of the compressed file in the remote archive. If they
don't match, the download is retried.
If the destination file already exists in the download directory, the
MD5 checksum is computed and compared with the MD5 of the remote
archive. If they match, the download is skipped. After the download,
the MD5 checksum is computed and compared with the MD5 of the
compressed file in the remote archive. If they don't match, the
download is retried.
Args:
download_dir: download directory
......@@ -727,7 +654,6 @@ class Feature(BaseModel, extra=Extra.allow):
self,
filename: str,
download_dir: str,
overwrite: bool = False,
renew_token: bool = False
):
"""
......@@ -737,7 +663,6 @@ class Feature(BaseModel, extra=Extra.allow):
filename: file path of the file to download/extract from the
remote archive
download_dir: download directory
overwrite: overwrite
renew_token: can be used to force the token renewal
"""
......@@ -760,7 +685,6 @@ class Feature(BaseModel, extra=Extra.allow):
remote_zip.download_single_file(
filename=filename,
output_dir=output_dir,
overwrite=overwrite,
renew_token=renew_token
)
......@@ -768,7 +692,6 @@ class Feature(BaseModel, extra=Extra.allow):
self,
download_dir: str,
matching: List[str],
overwrite: bool = False,
renew_token: bool = False
):
"""
......@@ -777,7 +700,6 @@ class Feature(BaseModel, extra=Extra.allow):
Args:
download_dir: download directory
matching: list of string to match filenames
overwrite: overwrite
renew_token: force the token renewal prior to download each file
"""
......@@ -786,7 +708,6 @@ class Feature(BaseModel, extra=Extra.allow):
self.download_single_file(
filename=filename,
download_dir=download_dir,
overwrite=overwrite,
renew_token=renew_token
)
......
"""
Module containing helpers.
This module contains some helpers.
"""
import os
import logging
......