Standard Element 1

Download source data

Author: Pietari Pöykkö, MSc (tech)

Affiliation: University of Oulu

Published: 2026-05-12

Keywords: drought indices, SGI, drought prediction

Although central to this analysis, the Finnish natural-state groundwater data are not available as a single, self-contained, packaged dataset. Instead, the data must be downloaded manually through Syke's Hertta open data portal, with the stations selected by hand. A link to the portal is provided in the README. In addition, I have trimmed the data from the natural-state stations to contain only the longest available time series, and I have quality-controlled the result. The QC procedure is described in Pöykkö et al. 2026. I have not sought permission to share this customized dataset publicly.

Running this notebook also requires an API key for the European Centre for Medium-Range Weather Forecasts (ECMWF) services. The key can be obtained by registering an ECMWF account at https://www.ecmwf.int/ and viewing the key in your personal profile. The script asks for the key automatically if none is stored; alternatively, it can be set beforehand by running ecmwfr::wf_set_key().

The raw data are stored in the project subfolder inputs. This folder is excluded from the repository through .gitignore.

1 Data description

This notebook reads the groundwater (GW) data from CSV files placed in the inputs/manual/ folder. Notes on obtaining these data and the relevant links are provided in the README. Monthly drought index time series (SPI and SPEI) are also downloaded here from the ECMWF Climate Data Store; the relevant links are in the README, and the API call parameters are documented in the code below. More details are provided in the other notebooks.

2 How to run this notebook

This notebook can be run stand-alone, but it is intended to be executed as part of the whole project, which is run with run_reproducibility.R. Running this notebook writes the drought index NetCDF files into inputs/auto/ecmwfr_spi_spei_monthly/ and loads the GW dataset into memory.

3 Storage requirements

The total storage required for running this notebook is 1.1 GB:

- 330 MB will be downloaded
- 650 MB for the unzipped downloads
- 50 MB for manually accessed GW data

The largest individual downloaded file is around 5 MB. The largest of all files is the main GW data file at 13 MB.

3.1 Make sure you have sufficient storage available to download all data required

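# Query only the disk that holds the working directory, so the condition has length one
free_bytes <- ps::ps_disk_usage(paths = getwd())$available
if ((free_bytes / 2^30) < 2) { # GiB
    stop("Insufficient disk space for the automatically downloaded datasets")
}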
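The check above compares a byte count against a threshold of 2 gigabytes. A minimal sketch of the same comparison with the unit conversion made explicit, using base R only; the helper name `has_enough_space()` and its arguments are hypothetical and not part of the project:

```r
# Hypothetical helper: TRUE when the free space (in bytes) covers the
# required amount (in GiB, i.e. multiples of 2^30 bytes).
has_enough_space <- function(available_bytes, required_gib = 2) {
    stopifnot(is.numeric(available_bytes), length(available_bytes) == 1L)
    (available_bytes / 2^30) >= required_gib
}

has_enough_space(5 * 2^30)   # 5 GiB free
has_enough_space(1 * 2^30)   # 1 GiB free
```

In the notebook, the byte count would come from ps::ps_disk_usage(); requiring a single value here guards against passing one value per mounted disk into an if() condition.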

4 Project set-up

here::i_am("SE1_data_access.qmd")

here() starts at C:/Users/ppoykko19/Koodi/waterdigi-drought-pred

root = here::here()

raw_data_dir = fs::path(root, "inputs") # Define raw data dir var

fs::dir_create(raw_data_dir, c("manual","auto")) # Ensure folder structure

if (length(fs::dir_ls(fs::path(raw_data_dir, "manual"))) < 4L) { # TODO: set precise number of required manual dls
    stop("Datasets listed in README to require manual inputs are not found")
}

5 Read in manually accessed GW data

# keepLeadingZeros = TRUE reads zero-padded values as character instead of stripping the zeros
G <- data.table::fread(fs::path(raw_data_dir,"manual/G_reg.csv"), keepLeadingZeros = TRUE)
stations <- data.table::fread(fs::path(raw_data_dir,"manual/asemat.csv"), keepLeadingZeros = TRUE)
gwas <- data.table::fread(fs::path(raw_data_dir,"manual/pv_alueet.csv"), keepLeadingZeros = TRUE)
pipes <- data.table::fread(fs::path(raw_data_dir,"manual/paikat.csv"), keepLeadingZeros = TRUE)
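The set-up step counts the files in the manual folder, which does not tell which file is missing. A possible alternative, sketched here with base R only, checks the four CSV files read above by name. The helper `missing_manual_files()` and the demo folder are hypothetical; only the four file names come from this notebook.

```r
# The four manually obtained CSV files read in this section
required_manual <- c("G_reg.csv", "asemat.csv", "pv_alueet.csv", "paikat.csv")

# Hypothetical helper: returns the names of required files not found in manual_dir
missing_manual_files <- function(manual_dir, required = required_manual) {
    required[!file.exists(file.path(manual_dir, required))]
}

# Demo on a temporary folder that holds only two of the four files
demo_dir <- file.path(tempdir(), "manual_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("G_reg.csv", "asemat.csv"))))
missing_manual_files(demo_dir)
```

A non-empty result could then be pasted into the stop() message, so the user sees exactly which downloads are still needed.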

6 Download SPI & SPEI timeseries data

Set behaviour variables.

# ===== USER SETTINGS FOR SPI & SPEI DATA =====
OUTPUT_DIR <- fs::dir_create(raw_data_dir,"auto","ecmwfr_spi_spei_monthly")
TIMEOUT <- 2000 # Seconds before req timeout
# =============================================

Download appropriately subsetted drought index datasets. (This step also checks for the ECMWF API key.)

tryCatch(invisible(ecmwfr::wf_get_key()), # Check for ECMWF API key
    error=\(e){ 
        if (nzchar(Sys.getenv("ECMWF_KEY"))) { # Check for env var (supplied to gh rendering)
            ecmwfr::wf_set_key(Sys.getenv("ECMWF_KEY"))
        } else if (interactive()) {            # Ask user if interactive
            warning("Add ECMWF API key (ecmwfr::wf_set_key())")
            ecmwfr::wf_set_key()
        } else { stop("ECMWF API key not found") }
    }
)
# For more on ecmwfr datasets: ecmwfr::wf_dataset_info(), ecmwfr::wf_datasets()

ecmwf_spi_spei_req_year <- function(y) {
    list(dataset_short_name = "derived-drought-historical-monthly",
        variable = c("standardised_precipitation_index",
                    "standardised_precipitation_evapotranspiration_index"),
        accumulation_period = c("1","3","6","12","24","36","48"),
        version = "1_0", product_type = "reanalysis",
        dataset_type = "consolidated_dataset",
        year = as.character(y),
        month = c("01","02","03","04","05","06","07","08","09","10","11","12"),
        area = c(70, 18, 59, 32), # Nmax, Emin, Nmin, Emax, WGS84
        target = paste0("ecmwfr_spi_spei_",y,".nc")  # Output filename & format
        # ("target" potentially converts to zip if the req would return many files)
    )
}
# test_dl <- ecmwfr::wf_request(ecmwf_spi_spei_req_year(1966), path=fs::dir_create(raw_data_dir,"auto","test"), time_out = 1000)

out_dir_dls <- fs::dir_create(OUTPUT_DIR, "downloads") # Dir for dls from API
out_dir_final <- fs::dir_create(OUTPUT_DIR, "unzipped") # Dir for final usable files

all_ecmwf_reqs <- lapply(1965:2024, ecmwf_spi_spei_req_year)

# Drop the API calls whose output files already exist
wanted_req_files <- fs::path_ext_remove(sapply(all_ecmwf_reqs, \(req) req$target))
existing_req_files <- fs::path_ext_remove(fs::path_file(fs::dir_ls(out_dir_dls, type="file")))
needed_ecmwf_reqs <- all_ecmwf_reqs[which(!(wanted_req_files %in% existing_req_files))]

if (!rlang::is_empty(needed_ecmwf_reqs)) {
    message("Downloading monthly SPI and SPEI...")
    message("This can take around half an hour")
    sent_reqs <- ecmwfr::wf_request_batch(
        request_list = needed_ecmwf_reqs,
        workers = min(20, length(needed_ecmwf_reqs)), # Max 20?
        time_out = TIMEOUT, #s
        retry = 20, # sec to wait between api calls
        path = out_dir_dls # Into subfolder of downloads
    )
    # Unzip all downloads into subfolder
    all_dl_zips <- fs::dir_ls(out_dir_dls, type="file", glob="*.zip")
    stopifnot(all(fs::is_file(all_dl_zips)))
    for (f in all_dl_zips) { unzip(f, exdir = out_dir_final) }
    rm(all_dl_zips, sent_reqs)
    # Possible to download also quality criteria for SPI & SPEI
    # Similar req to above, but variables "probability_of_zero_precipitation_spi", "test_for_normality_spi", "test_for_normality_spei"
    # All three cover all years in the dataset (no year needed in req)
    message("SPI and SPEI datasets downloaded.")
} else { message("All requested SPI & SPEI original files found") }

All requested SPI & SPEI original files found
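The download step skips requests whose output already exists by comparing file names without their extensions, so that a download delivered as a .zip still matches a .nc target. A minimal base R sketch of that filter on toy file names; the vectors and the helper `strip_ext()` are illustrative assumptions, and tools::file_path_sans_ext() stands in for the fs functions used in the notebook:

```r
# Toy stand-ins for the request targets and for files already downloaded
wanted   <- c("ecmwfr_spi_spei_1965.nc", "ecmwfr_spi_spei_1966.nc",
              "ecmwfr_spi_spei_1967.nc")
existing <- c("ecmwfr_spi_spei_1965.zip", "ecmwfr_spi_spei_1967.nc")

# Hypothetical helper: file name without folder and without extension
strip_ext <- function(x) tools::file_path_sans_ext(basename(x))

# Keep only the targets with no matching download, whatever its extension
needed <- wanted[!(strip_ext(wanted) %in% strip_ext(existing))]
needed
```

Here only the 1966 target remains, because 1965 is matched by its .zip and 1967 by its .nc file.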

rm(wanted_req_files, existing_req_files, needed_ecmwf_reqs, all_ecmwf_reqs, out_dir_dls)

message("Finished")

Finished