{
"cells": [
{
"cell_type": "markdown",
"id": "de6fe43b",
"metadata": {
"tags": []
},
"source": [
"# Convert \"raw\" Hugonnet et al (2021) geodetic per glacier dataset to a corrected dataset without outliers and missing glaciers\n",
"**Workflow:**\n",
"- reindex glaciers with GLIMS-ids to RGI-ids\n",
"- missing glaciers from the RGI are added to the dataset with their area and filled up with NaN-values\n",
"- Glaciers with connectivity level 2 are removed from the dataset (these are those glaciers that are strongly connected to the Greenland Ice Sheet). We don't want to do projections with them because they are already included in the GIS projections!\n",
"- the specific MB and std of all glaciers with MB standard deviations larger than a threshold (i.e. mean error plus three standard deviations of the error distribution) are also replaced by NaN-values (=outliers, in total 1532 glaciers). \n",
" - First an overall global threshold over all regions is computed, then this threshold is computed as well for every RGI region.\n",
" - If the regional threshold is smaller than the global threshold, we use the global threshold (to not penalize regions that are well-measured), otherwise we use the regional threshold\n",
"- For each region a mean specific MB and a mean specific MB standard deviation is estimated. All RGI glaciers (connectivity level < 2) that have NaN-values (no measurements or outlier) are filled up with the regional means according to their RGI region. \n",
"\n",
"**In total, 8597 of 215547 glaciers are filled up with regional mean data, hence 4.0% of all RGI glaciers. This corresponds to only 0.25% of the total glacier area as mostly small glaciers had no geodetic data or are outliers. 1532 from them are \"outlier\" glaciers!** ([click here to get to a figure visualizing this](#id-dmdtda-stats-plot))\n",
"\n",
"This method only fills up the dmdtda and err_dmdtda values. We are working at the moment to also fill up dhdt, err_dhdt, dvoldt and err_dvoldt ([see discussions at the end of this notebook](#id-analysis-problems-dhdt-dvoldt)), but this is not yet incorporated!\n",
"\n",
"---"
]
},
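{
"cell_type": "markdown",
"id": "sketch-filter-fill",
"metadata": {},
"source": [
"The outlier rule and the regional fill from the workflow above can be sketched on a toy table. This is only an illustration under stated assumptions, not the code used further down: the table `toy`, its values and the helper `filter_and_fill` are made up for this example; only the column names `reg`, `dmdtda` and `err_dmdtda` follow the dataset.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data: 2 regions x 200 glaciers with small errors\n",
"n = 200\n",
"toy = pd.DataFrame({\n",
"    'reg': np.repeat([1, 2], n),\n",
"    'dmdtda': np.tile(np.linspace(-1.0, 0.0, n), 2),\n",
"    'err_dmdtda': np.tile(np.linspace(0.1, 0.3, n), 2),\n",
"})\n",
"toy.loc[0, ['dmdtda', 'err_dmdtda']] = np.nan       # glacier without data\n",
"toy.loc[n, ['dmdtda', 'err_dmdtda']] = [5.0, 10.0]  # glacier with a huge error\n",
"\n",
"\n",
"def filter_and_fill(df, n_sigma=3):\n",
"    # Mask outliers by their error, then fill all gaps with regional means\n",
"    out = df.copy()\n",
"    out['is_cor'] = False\n",
"    err = df['err_dmdtda']\n",
"    global_thr = err.mean() + n_sigma * err.std()\n",
"    for reg, sub in df.groupby('reg'):\n",
"        reg_thr = sub['err_dmdtda'].mean() + n_sigma * sub['err_dmdtda'].std()\n",
"        # the global threshold acts as a lower bound for the regional one\n",
"        thr = max(reg_thr, global_thr)\n",
"        outliers = sub.index[sub['err_dmdtda'] > thr]\n",
"        out.loc[outliers, ['dmdtda', 'err_dmdtda']] = np.nan\n",
"        in_reg = out['reg'] == reg\n",
"        missing = in_reg & out['dmdtda'].isnull()\n",
"        valid = in_reg & out['dmdtda'].notnull()\n",
"        out.loc[missing, 'dmdtda'] = out.loc[valid, 'dmdtda'].mean()\n",
"        out.loc[missing, 'err_dmdtda'] = out.loc[valid, 'err_dmdtda'].mean()\n",
"        out.loc[missing, 'is_cor'] = True\n",
"    return out\n",
"\n",
"\n",
"toy_filled = filter_and_fill(toy)\n",
"print(toy_filled['is_cor'].sum())  # number of filled glaciers\n",
"```\n",
"\n",
"In this toy case the injected glacier with an error of 10 is masked and, like the glacier without data, filled with the mean of the remaining glaciers of its region."
]
},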
{
"cell_type": "code",
"execution_count": 1,
"id": "74981034",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import oggm\n",
"from oggm import utils\n",
"import pickle\n",
"pickle.HIGHEST_PROTOCOL = 4"
]
},
{
"cell_type": "markdown",
"id": "d4a736fe",
"metadata": {},
"source": [
"## Convert csv to hdf file"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "fa61bd0d",
"metadata": {},
"outputs": [],
"source": [
"fpath = 'https://cluster.klima.uni-bremen.de/~oggm/geodetic_ref_mb/hugonnet_2021_ds_rgi60_pergla_rates_10_20_worldwide.csv'\n",
"fpath = 'hugonnet_2021_ds_rgi60_pergla_rates_10_20_worldwide.csv'\n",
"df = pd.read_csv(fpath, index_col=0)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "58b89297",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
" period area dhdt err_dhdt dvoldt \\\n",
"G039564E39446N 2010-01-01_2020-01-01 8000.0 NaN NaN NaN \n",
"G039564E39446N 2000-01-01_2020-01-01 8000.0 NaN NaN NaN \n",
"G039564E39446N 2000-01-01_2010-01-01 8000.0 NaN NaN NaN \n",
"G039579E39451N 2000-01-01_2020-01-01 8000.0 NaN NaN NaN \n",
"G039579E39451N 2000-01-01_2010-01-01 8000.0 NaN NaN NaN \n",
"... ... ... ... ... ... \n",
"G052226E35769N 2010-01-01_2020-01-01 0.0 NaN NaN NaN \n",
"G052226E35769N 2000-01-01_2020-01-01 0.0 NaN NaN NaN \n",
"G052227E35770N 2000-01-01_2010-01-01 0.0 NaN NaN NaN \n",
"G052227E35770N 2010-01-01_2020-01-01 0.0 NaN NaN NaN \n",
"G052227E35770N 2000-01-01_2020-01-01 0.0 NaN NaN NaN \n",
"\n",
" err_dvoldt dmdt err_dmdt dmdtda err_dmdtda \\\n",
"G039564E39446N NaN NaN NaN NaN NaN \n",
"G039564E39446N NaN NaN NaN NaN NaN \n",
"G039564E39446N NaN NaN NaN NaN NaN \n",
"G039579E39451N NaN NaN NaN NaN NaN \n",
"G039579E39451N NaN NaN NaN NaN NaN \n",
"... ... ... ... ... ... \n",
"G052226E35769N NaN NaN NaN NaN NaN \n",
"G052226E35769N NaN NaN NaN NaN NaN \n",
"G052227E35770N NaN NaN NaN NaN NaN \n",
"G052227E35770N NaN NaN NaN NaN NaN \n",
"G052227E35770N NaN NaN NaN NaN NaN \n",
"\n",
" perc_area_meas perc_area_res valid_obs valid_obs_py reg \n",
"G039564E39446N NaN NaN 0.0 0.0 12 \n",
"G039564E39446N NaN NaN 0.0 0.0 12 \n",
"G039564E39446N NaN NaN 0.0 0.0 12 \n",
"G039579E39451N NaN NaN 0.0 0.0 12 \n",
"G039579E39451N NaN NaN 0.0 0.0 12 \n",
"... ... ... ... ... ... \n",
"G052226E35769N NaN NaN 0.0 0.0 12 \n",
"G052226E35769N NaN NaN 0.0 0.0 12 \n",
"G052227E35770N NaN NaN 0.0 0.0 12 \n",
"G052227E35770N NaN NaN 0.0 0.0 12 \n",
"G052227E35770N NaN NaN 0.0 0.0 12 \n",
"\n",
"[6330 rows x 15 columns]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"not_rgi = np.array(['RGI60' not in s for s in df_new.index])\n",
"df_new.loc[not_rgi]"
]
},
{
"cell_type": "markdown",
"id": "a3249b3e",
"metadata": {},
"source": [
"those glaciers only come from RGI region 12"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "61072484",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([12])"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_new.loc[not_rgi].reg.unique()"
]
},
{
"cell_type": "markdown",
"id": "0043e983",
"metadata": {},
"source": [
"Let's only use glaciers with RGI-index:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "47804fd4",
"metadata": {},
"outputs": [],
"source": [
"df_new = df_new.loc[~ not_rgi]"
]
},
{
"cell_type": "markdown",
"id": "760f86c5",
"metadata": {},
"source": [
"We create a template for the missing glaciers"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "9ac38d9b",
"metadata": {},
"outputs": [],
"source": [
"template = df_new.copy(deep=True).iloc[0:3]"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "a2e7dc73",
"metadata": {},
"outputs": [],
"source": [
"template[template.columns[1:]] = np.NaN\n",
"template['reg'] = 12"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "11721685",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
period
\n",
"
area
\n",
"
dhdt
\n",
"
err_dhdt
\n",
"
dvoldt
\n",
"
err_dvoldt
\n",
"
dmdt
\n",
"
err_dmdt
\n",
"
dmdtda
\n",
"
err_dmdtda
\n",
"
perc_area_meas
\n",
"
perc_area_res
\n",
"
valid_obs
\n",
"
valid_obs_py
\n",
"
reg
\n",
"
\n",
" \n",
" \n",
"
\n",
"
RGI60-01.00001
\n",
"
2000-01-01_2020-01-01
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
12
\n",
"
\n",
"
\n",
"
RGI60-01.00001
\n",
"
2010-01-01_2020-01-01
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
12
\n",
"
\n",
"
\n",
"
RGI60-01.00001
\n",
"
2000-01-01_2010-01-01
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
12
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" period area dhdt err_dhdt dvoldt \\\n",
"RGI60-01.00001 2000-01-01_2020-01-01 NaN NaN NaN NaN \n",
"RGI60-01.00001 2010-01-01_2020-01-01 NaN NaN NaN NaN \n",
"RGI60-01.00001 2000-01-01_2010-01-01 NaN NaN NaN NaN \n",
"\n",
" err_dvoldt dmdt err_dmdt dmdtda err_dmdtda \\\n",
"RGI60-01.00001 NaN NaN NaN NaN NaN \n",
"RGI60-01.00001 NaN NaN NaN NaN NaN \n",
"RGI60-01.00001 NaN NaN NaN NaN NaN \n",
"\n",
" perc_area_meas perc_area_res valid_obs valid_obs_py reg \n",
"RGI60-01.00001 NaN NaN NaN NaN 12 \n",
"RGI60-01.00001 NaN NaN NaN NaN 12 \n",
"RGI60-01.00001 NaN NaN NaN NaN 12 "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"template"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "46eaccfb",
"metadata": {},
"outputs": [],
"source": [
"path_rgi = utils.file_downloader('https://cluster.klima.uni-bremen.de/~oggm/rgi/rgi62_stats.h5')\n",
"rgi = pd.read_hdf(path_rgi)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "de12d1d1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"glaciers outside RGI region 12 inside geodetic dataset : 214614\n",
"glaciers outside RGI region 12 in total from rgi : 214614\n",
"glaciers in total in rgi : 216502\n"
]
}
],
"source": [
"n = len(df_new.loc[rgi['O1Region']!='12'][df_new.loc[rgi['O1Region']!='12'].period == '2000-01-01_2020-01-01'])\n",
"print('glaciers outside RGI region 12 inside geodetic dataset : {}'.format(n))\n",
"print('glaciers outside RGI region 12 in total from rgi : {}'.format(len(rgi.loc[rgi['O1Region']!='12'])))\n",
"print('glaciers in total in rgi : {}'.format(len(rgi)))"
]
},
{
"cell_type": "markdown",
"id": "256c1567",
"metadata": {},
"source": [
"**OK so all glaciers are already in the data except the ones without lookup in region 12.**\n",
"\n",
"First, add missing glaciers and the area of them to the geodetic dataset. "
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "a0c879ee",
"metadata": {},
"outputs": [],
"source": [
"to_add = []\n",
"for rid, s in rgi.loc[rgi['O1Region']=='12'].iterrows():\n",
" if rid not in df_new.index:\n",
" new_id = template.copy(deep=True)\n",
" new_id.index = [rid]*3\n",
" new_id['area'] = s['Area'] * 1e6\n",
" to_add.append(new_id)\n",
"to_add = pd.concat(to_add)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "4e1ded8e",
"metadata": {},
"outputs": [],
"source": [
"df_new = pd.concat([df_new, to_add]).sort_index()"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "dbef2a0a",
"metadata": {},
"outputs": [],
"source": [
"assert len(df_new) / 3 == len(rgi)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "044b730c",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
period
\n",
"
area
\n",
"
dhdt
\n",
"
err_dhdt
\n",
"
dvoldt
\n",
"
err_dvoldt
\n",
"
dmdt
\n",
"
err_dmdt
\n",
"
dmdtda
\n",
"
err_dmdtda
\n",
"
perc_area_meas
\n",
"
perc_area_res
\n",
"
valid_obs
\n",
"
valid_obs_py
\n",
"
reg
\n",
"
\n",
"
\n",
"
rgiid
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
" \n",
"
\n",
"
RGI60-01.00001
\n",
"
2000-01-01_2010-01-01
\n",
"
360000.0
\n",
"
0.0255
\n",
"
0.5059
\n",
"
9175.0
\n",
"
182132.0
\n",
"
0.000008
\n",
"
0.000155
\n",
"
0.0217
\n",
"
0.4300
\n",
"
1.000
\n",
"
1.000
\n",
"
8.11
\n",
"
4.78
\n",
"
1
\n",
"
\n",
"
\n",
"
RGI60-01.00001
\n",
"
2000-01-01_2020-01-01
\n",
"
360000.0
\n",
"
-0.0150
\n",
"
0.2559
\n",
"
-5414.0
\n",
"
92139.0
\n",
"
-0.000005
\n",
"
0.000078
\n",
"
-0.0128
\n",
"
0.2176
\n",
"
1.000
\n",
"
1.000
\n",
"
26.41
\n",
"
11.11
\n",
"
1
\n",
"
\n",
"
\n",
"
RGI60-01.00001
\n",
"
2010-01-01_2020-01-01
\n",
"
360000.0
\n",
"
-0.0556
\n",
"
0.4645
\n",
"
-20003.0
\n",
"
167248.0
\n",
"
-0.000017
\n",
"
0.000142
\n",
"
-0.0472
\n",
"
0.3949
\n",
"
1.000
\n",
"
1.000
\n",
"
18.30
\n",
"
6.32
\n",
"
1
\n",
"
\n",
"
\n",
"
RGI60-01.00002
\n",
"
2000-01-01_2010-01-01
\n",
"
558000.0
\n",
"
-0.1980
\n",
"
0.3267
\n",
"
-110465.0
\n",
"
182728.0
\n",
"
-0.000094
\n",
"
0.000155
\n",
"
-0.1683
\n",
"
0.2792
\n",
"
1.000
\n",
"
1.000
\n",
"
11.04
\n",
"
8.28
\n",
"
1
\n",
"
\n",
"
\n",
"
RGI60-01.00002
\n",
"
2000-01-01_2020-01-01
\n",
"
558000.0
\n",
"
-0.2695
\n",
"
0.1653
\n",
"
-150361.0
\n",
"
93741.0
\n",
"
-0.000128
\n",
"
0.000080
\n",
"
-0.2290
\n",
"
0.1460
\n",
"
1.000
\n",
"
1.000
\n",
"
26.21
\n",
"
16.23
\n",
"
1
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
RGI60-19.02751
\n",
"
2000-01-01_2020-01-01
\n",
"
11000.0
\n",
"
-2.1611
\n",
"
0.8691
\n",
"
-23772.0
\n",
"
21979.0
\n",
"
-0.000020
\n",
"
0.000019
\n",
"
-1.8369
\n",
"
2.2891
\n",
"
1.000
\n",
"
1.000
\n",
"
4.00
\n",
"
4.00
\n",
"
19
\n",
"
\n",
"
\n",
"
RGI60-19.02751
\n",
"
2010-01-01_2020-01-01
\n",
"
11000.0
\n",
"
-2.1904
\n",
"
1.5831
\n",
"
-24095.0
\n",
"
26564.0
\n",
"
-0.000020
\n",
"
0.000023
\n",
"
-1.8619
\n",
"
2.5755
\n",
"
1.000
\n",
"
1.000
\n",
"
4.00
\n",
"
4.00
\n",
"
19
\n",
"
\n",
"
\n",
"
RGI60-19.02752
\n",
"
2000-01-01_2010-01-01
\n",
"
528000.0
\n",
"
0.1427
\n",
"
0.6371
\n",
"
75344.0
\n",
"
336536.0
\n",
"
0.000064
\n",
"
0.000286
\n",
"
0.1213
\n",
"
0.5421
\n",
"
0.981
\n",
"
0.981
\n",
"
2.74
\n",
"
1.70
\n",
"
19
\n",
"
\n",
"
\n",
"
RGI60-19.02752
\n",
"
2000-01-01_2020-01-01
\n",
"
528000.0
\n",
"
-0.0454
\n",
"
0.3407
\n",
"
-23981.0
\n",
"
179916.0
\n",
"
-0.000020
\n",
"
0.000153
\n",
"
-0.0386
\n",
"
0.2897
\n",
"
0.981
\n",
"
0.981
\n",
"
6.91
\n",
"
5.15
\n",
"
19
\n",
"
\n",
"
\n",
"
RGI60-19.02752
\n",
"
2010-01-01_2020-01-01
\n",
"
528000.0
\n",
"
-0.2335
\n",
"
0.5643
\n",
"
-123306.0
\n",
"
298432.0
\n",
"
-0.000105
\n",
"
0.000254
\n",
"
-0.1985
\n",
"
0.4814
\n",
"
0.981
\n",
"
0.981
\n",
"
4.17
\n",
"
3.44
\n",
"
19
\n",
"
\n",
" \n",
"
\n",
"
649506 rows × 15 columns
\n",
"
"
],
"text/plain": [
" period area dhdt err_dhdt dvoldt \\\n",
"rgiid \n",
"RGI60-01.00001 2000-01-01_2010-01-01 360000.0 0.0255 0.5059 9175.0 \n",
"RGI60-01.00001 2000-01-01_2020-01-01 360000.0 -0.0150 0.2559 -5414.0 \n",
"RGI60-01.00001 2010-01-01_2020-01-01 360000.0 -0.0556 0.4645 -20003.0 \n",
"RGI60-01.00002 2000-01-01_2010-01-01 558000.0 -0.1980 0.3267 -110465.0 \n",
"RGI60-01.00002 2000-01-01_2020-01-01 558000.0 -0.2695 0.1653 -150361.0 \n",
"... ... ... ... ... ... \n",
"RGI60-19.02751 2000-01-01_2020-01-01 11000.0 -2.1611 0.8691 -23772.0 \n",
"RGI60-19.02751 2010-01-01_2020-01-01 11000.0 -2.1904 1.5831 -24095.0 \n",
"RGI60-19.02752 2000-01-01_2010-01-01 528000.0 0.1427 0.6371 75344.0 \n",
"RGI60-19.02752 2000-01-01_2020-01-01 528000.0 -0.0454 0.3407 -23981.0 \n",
"RGI60-19.02752 2010-01-01_2020-01-01 528000.0 -0.2335 0.5643 -123306.0 \n",
"\n",
" err_dvoldt dmdt err_dmdt dmdtda err_dmdtda \\\n",
"rgiid \n",
"RGI60-01.00001 182132.0 0.000008 0.000155 0.0217 0.4300 \n",
"RGI60-01.00001 92139.0 -0.000005 0.000078 -0.0128 0.2176 \n",
"RGI60-01.00001 167248.0 -0.000017 0.000142 -0.0472 0.3949 \n",
"RGI60-01.00002 182728.0 -0.000094 0.000155 -0.1683 0.2792 \n",
"RGI60-01.00002 93741.0 -0.000128 0.000080 -0.2290 0.1460 \n",
"... ... ... ... ... ... \n",
"RGI60-19.02751 21979.0 -0.000020 0.000019 -1.8369 2.2891 \n",
"RGI60-19.02751 26564.0 -0.000020 0.000023 -1.8619 2.5755 \n",
"RGI60-19.02752 336536.0 0.000064 0.000286 0.1213 0.5421 \n",
"RGI60-19.02752 179916.0 -0.000020 0.000153 -0.0386 0.2897 \n",
"RGI60-19.02752 298432.0 -0.000105 0.000254 -0.1985 0.4814 \n",
"\n",
" perc_area_meas perc_area_res valid_obs valid_obs_py reg \n",
"rgiid \n",
"RGI60-01.00001 1.000 1.000 8.11 4.78 1 \n",
"RGI60-01.00001 1.000 1.000 26.41 11.11 1 \n",
"RGI60-01.00001 1.000 1.000 18.30 6.32 1 \n",
"RGI60-01.00002 1.000 1.000 11.04 8.28 1 \n",
"RGI60-01.00002 1.000 1.000 26.21 16.23 1 \n",
"... ... ... ... ... ... \n",
"RGI60-19.02751 1.000 1.000 4.00 4.00 19 \n",
"RGI60-19.02751 1.000 1.000 4.00 4.00 19 \n",
"RGI60-19.02752 0.981 0.981 2.74 1.70 19 \n",
"RGI60-19.02752 0.981 0.981 6.91 5.15 19 \n",
"RGI60-19.02752 0.981 0.981 4.17 3.44 19 \n",
"\n",
"[649506 rows x 15 columns]"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_new = df_new.rename_axis('rgiid').sort_values(by = ['rgiid', 'period'])\n",
"df_new"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "32d4ffd1",
"metadata": {},
"outputs": [],
"source": [
"df_new.to_hdf('hugonnet_2021_ds_rgi60_pergla_rates_10_20_worldwide_reg12cor.hdf', key='df')"
]
},
{
"cell_type": "markdown",
"id": "efab9a60",
"metadata": {},
"source": [
"## Remove glaciers with connectivity level 2\n",
"\n",
"Glaciers with connectivity level 2 are those glaciers that are strongly connected to the Greenland Ice Sheet. We don't want to do projections with them because they are already included in the GIS projections!"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "30827edf",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"955"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# there are some glaciers on connectivity level 2\n",
"len(rgi.loc[rgi['Connect'] == 2])"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "7bd89766",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['05'], dtype=object)"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check: they are only in RGI region 05 (Greenland)\n",
"rgi.loc[rgi['Connect'] == 2]['O1Region'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "81e21879",
"metadata": {},
"outputs": [],
"source": [
"valid_ids = rgi.loc[rgi['Connect'] != 2].index\n",
"c2_ids = rgi.loc[rgi['Connect'] == 2].index\n",
"df_new = df_new.loc[valid_ids].copy(deep=True)\n",
"rgi = rgi.loc[valid_ids].copy(deep=True)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "17507f49",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"215547"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(rgi)"
]
},
{
"cell_type": "markdown",
"id": "f6a15d7f",
"metadata": {},
"source": [
"## Remove outliers and fill missing "
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "88ea4ac0",
"metadata": {},
"outputs": [],
"source": [
"dfs = df_new.loc[df_new.period == '2000-01-01_2020-01-01'].copy(deep=True)[['area', 'dmdtda', 'err_dmdtda', 'reg']]"
]
},
{
"cell_type": "markdown",
"id": "977e3aa3",
"metadata": {},
"source": [
"statistics on the error of the specific mass balance from the geodetic data:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "7d92450c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.001 0.075294\n",
"0.010 0.103500\n",
"0.500 0.222300\n",
"0.990 0.879962\n",
"0.999 2.051393\n",
"Name: err_dmdtda, dtype: float64"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dfs.err_dmdtda.quantile([0.001, 0.01, 0.5, 0.99, 0.999])"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "5f7937d5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"13.7913"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dfs.err_dmdtda.max()"
]
},
{
"cell_type": "markdown",
"id": "d12f0172",
"metadata": {},
"source": [
"**we remove all glaciers where the error is larger than the mean error plus three standard deviations of the error distribution (i.e. all sigma threshold)** "
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "19759d52",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"all sigma threshold (global estimate): 0.83\n"
]
}
],
"source": [
"# SHOULD IT BE AREA WEIGHTED???? Maybe not\n",
"all_sigma_mean = dfs['err_dmdtda'].mean()\n",
"all_sigma_std = dfs['err_dmdtda'].std()\n",
"all_sigma_threshold = all_sigma_mean + 3 * all_sigma_std\n",
"print('all sigma threshold (global estimate):', np.round(all_sigma_threshold, 2))"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "22844156",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 27108 182 0.83\n",
"2 18855 190 0.83\n",
"3 4556 14 0.83\n",
"4 7415 8 0.83\n",
"5 19306 27 0.83\n",
"6 568 1 0.83\n",
"7 1615 1 0.83\n",
"8 3417 30 0.83\n",
"9 1069 0 0.83\n",
"10 5151 77 1.11\n",
"11 3927 57 1.19\n",
"12 1888 19 0.83\n",
"13 54429 200 0.83\n",
"14 27988 238 0.83\n",
"15 13119 128 0.83\n",
"16 2939 46 0.92\n",
"17 15908 228 1.53\n",
"18 3537 29 1.08\n",
"19 2752 47 1.45\n"
]
}
],
"source": [
"periods = ['2000-01-01_2010-01-01', '2000-01-01_2020-01-01', '2010-01-01_2020-01-01']\n",
"# We filter and fill all periods based on the 20 year threshold\n",
"df_filled = df_new.copy()[['period', 'area', 'dmdtda', 'err_dmdtda', 'reg']]\n",
"df_filled['is_cor'] = False\n",
"for reg in range(1, 20):\n",
" \n",
" dfs_subset = dfs.loc[dfs['reg'] == reg]\n",
" \n",
" # Too high of sigma causes large issues for model\n",
" # compute regional threshold for every region\n",
" reg_sigma_mean = dfs_subset['err_dmdtda'].mean()\n",
" reg_sigma_std = dfs_subset['err_dmdtda'].std()\n",
" reg_sigma_threshold = reg_sigma_mean + 3 * reg_sigma_std\n",
" \n",
" # Don’t penalize regions that are well-measured, so use all threshold as minimum:\n",
" # if the regional threshold is smaller than the global threshold,\n",
" # we use the global threshold, otherwise we use the regional threshold\n",
" if reg_sigma_threshold < all_sigma_threshold:\n",
" reg_sigma_threshold = all_sigma_threshold\n",
" \n",
" to_replace = dfs_subset.loc[dfs_subset['err_dmdtda'] > reg_sigma_threshold]\n",
" to_keep = dfs_subset.loc[dfs_subset['err_dmdtda'] <= reg_sigma_threshold]\n",
" \n",
" print(reg, len(dfs_subset), len(to_replace), np.round(reg_sigma_threshold, 2))\n",
" \n",
" df_filled.loc[to_replace.index, 'dmdtda'] = np.NaN\n",
" df_filled.loc[to_replace.index, 'err_dmdtda'] = np.NaN\n",
" \n",
" # Replace nan values - SHOULD BE AREA WEIGHTED????\n",
" for period in periods:\n",
" \n",
" # indices of glaciers without dmdtda data (because missing or outlier)\n",
" loc_no = (df_filled['reg'] == reg) & (df_filled['dmdtda'].isnull()) & (df_filled['period'] == period)\n",
" # indices of glaciers with dmdtda data\n",
" loc_yes = (df_filled['reg'] == reg) & (~ df_filled['dmdtda'].isnull()) & (df_filled['period'] == period)\n",
" \n",
" # replace the nan-values from the missing and outlier glaciers with the regional mean and standard deviation mean \n",
" # Note that every glacier without \"usable\" geodetic data will get the same dmdtda and err_dmdtda \n",
" # (independent of the glacier characteristics such as size, elevation) !!!\n",
" df_filled.loc[loc_no, 'dmdtda'] = df_filled.loc[loc_yes, 'dmdtda'].mean()\n",
" df_filled.loc[loc_no, 'err_dmdtda'] = df_filled.loc[loc_yes, 'err_dmdtda'].mean()\n",
" df_filled.loc[loc_no, 'is_cor'] = True"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "a1e9cfdd",
"metadata": {},
"outputs": [],
"source": [
"dfs_filled = df_filled.loc[df_new.period == '2000-01-01_2020-01-01']"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "f6cbd55b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.001 0.075754\n",
"0.010 0.104000\n",
"0.500 0.226100\n",
"0.990 0.739012\n",
"0.999 1.154537\n",
"Name: err_dmdtda, dtype: float64"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dfs_filled.err_dmdtda.quantile([0.001, 0.01, 0.5, 0.99, 0.999])"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "43ba2497",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.5258"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dfs_filled.err_dmdtda.max()"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "6bacb3cd",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"8.034"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dfs_filled.dmdtda.max()"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "fcee7665",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"fig, ax = plt.subplots()\n",
"dfs_filled.dmdtda.plot(kind='hist', bins=100, bottom=0.1);\n",
"ax.set_yscale('log')\n",
"plt.title(f'Distribution of the specific mass balance after correction for all glaciers (n={len(dfs_filled)})')\n",
"plt.xlabel(r'specific mass balance (kg m$^{-2}$)');"
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "e4516368",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.131646168795644"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# relative amount of glacier area with a positive specific mass balance\n",
"dfs_filled.loc[dfs_filled.dmdtda > 0].area.sum() / dfs_filled.area.sum()"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "127a0919",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"fig, ax = plt.subplots(ncols=3, figsize=(18,6))\n",
"plt.suptitle(f'Distribution of the specific mass balance after correction for all glaciers (n={len(dfs_filled)})')\n",
"for k,period in enumerate([periods[1], periods[0], periods[2]]):\n",
" for _,period2 in enumerate([periods[1], periods[0], periods[2]]):\n",
" dfs_filled = df_filled.loc[df_new.period == period2]\n",
" dfs_filled.dmdtda.plot(kind='hist', bins=100, bottom=0.1, alpha=0.3, ax = ax[k], color = 'grey', label='other periods');\n",
" dfs_filled = df_filled.loc[df_new.period == period]\n",
" dfs_filled.dmdtda.plot(kind='hist', bins=100, bottom=0.1, alpha=1, ax = ax[k], label = 'actual period');\n",
" ax[k].set_yscale('log')\n",
" ax[k].set_title('period: {},\\n max error: {:0.2f}, max dmdtda: {:0.2f},\\n glacier area with dmdtda above zero : {:0.2f}%'.format(period,\n",
" dfs_filled.err_dmdtda.max(),\n",
" dfs_filled.dmdtda.max(),\n",
" dfs_filled.loc[dfs_filled.dmdtda > 0].area.sum()*100 / dfs_filled.area.sum()))\n",
" ax[k].legend()\n",
" ax[k].set_xlabel(r'specific mass balance (kg m$^{-2}$)');\n",
"plt.tight_layout()"
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "c7305ecb",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
period
\n",
"
area
\n",
"
dmdtda
\n",
"
err_dmdtda
\n",
"
reg
\n",
"
is_cor
\n",
"
\n",
"
\n",
"
rgiid
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: [period, area, dmdtda, err_dmdtda, reg, is_cor]\n",
"Index: []"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check: are there any glaciers left with nan-values?\n",
"df_filled.loc[df_filled['dmdtda'].isnull()]\n",
"# No!"
]
},
{
"cell_type": "markdown",
"id": "15b97412",
"metadata": {},
"source": [
"save the corrected dataset to use it then to calibrate the mass-balance model"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "4eb00b51",
"metadata": {},
"outputs": [],
"source": [
"df_filled.to_hdf('hugonnet_2021_ds_rgi60_pergla_rates_10_20_worldwide_filled.hdf', key='df')"
]
},
{
"cell_type": "markdown",
"id": "42a1b44f",
"metadata": {},
"source": [
"## Some more statistics"
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "42eb6c66",
"metadata": {},
"outputs": [],
"source": [
"# first repeat the filling by using the medians instead of means (just for a check)!\n",
"df_filled_med = df_new.copy()[['period', 'area', 'dmdtda', 'err_dmdtda', 'reg']]\n",
"df_filled_med['is_cor'] = False\n",
"for reg in range(1, 20):\n",
" \n",
" dfs_subset = dfs.loc[dfs['reg'] == reg]\n",
" \n",
" # Too high of sigma causes large issues for model\n",
" # compute regional threshold for every region\n",
" reg_sigma_mean = dfs_subset['err_dmdtda'].mean()\n",
" reg_sigma_std = dfs_subset['err_dmdtda'].std()\n",
" reg_sigma_threshold = reg_sigma_mean + 3 * reg_sigma_std\n",
" \n",
" # Don’t penalize regions that are well-measured, so use all threshold as minimum:\n",
" # if the regional threshold is smaller than the global threshold,\n",
" # we use the global threshold, otherwise we use the regional threshold\n",
" if reg_sigma_threshold < all_sigma_threshold:\n",
" reg_sigma_threshold = all_sigma_threshold\n",
" \n",
" to_replace = dfs_subset.loc[dfs_subset['err_dmdtda'] > reg_sigma_threshold]\n",
" to_keep = dfs_subset.loc[dfs_subset['err_dmdtda'] <= reg_sigma_threshold]\n",
" \n",
" #print(reg, len(dfs_subset), len(to_replace), np.round(reg_sigma_threshold, 2))\n",
" \n",
" df_filled_med.loc[to_replace.index, 'dmdtda'] = np.NaN\n",
" df_filled_med.loc[to_replace.index, 'err_dmdtda'] = np.NaN\n",
" \n",
" # Replace nan values - SHOULD BE AREA WEIGHTED????\n",
" for period in periods:\n",
" \n",
" # indices of glaciers without dmdtda data (because missing or outlier)\n",
" loc_no = (df_filled_med['reg'] == reg) & (df_filled_med['dmdtda'].isnull()) & (df_filled_med['period'] == period)\n",
" # indices of glaciers with dmdtda data\n",
" loc_yes = (df_filled_med['reg'] == reg) & (~ df_filled_med['dmdtda'].isnull()) & (df_filled_med['period'] == period)\n",
" \n",
" # replace the nan-values from the missing and outlier glaciers with the regional median and standard deviation median \n",
" # Note that every glacier without \"usable\" geodetic data will get the same dmdtda and err_dmdtda \n",
" # (independent of the glacier characteristics such as size, elevation) !!!\n",
" df_filled_med.loc[loc_no, 'dmdtda'] = df_filled_med.loc[loc_yes, 'dmdtda'].median() \n",
" df_filled_med.loc[loc_no, 'err_dmdtda'] = df_filled_med.loc[loc_yes, 'err_dmdtda'].median()\n",
" df_filled_med.loc[loc_no, 'is_cor'] = True"
]
},
{
"cell_type": "markdown",
"id": "909b0774-2821-4df1-a91d-bbee0d504291",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "d3775010",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"8597 out of 215547 glaciers were filled up with regional mean data, hence 3.99%.\n",
"0.25% of the glacier area was refilled with regional mean data\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"df_filled_per = df_filled[df_filled['period']=='2000-01-01_2020-01-01']\n",
"n_cor = len(df_filled_per.loc[df_filled_per['is_cor']])\n",
"n_total = len(df_filled_per)\n",
"area_cor = df_filled_per.loc[df_filled_per['is_cor']].area\n",
"area_cor_sum = area_cor.sum()\n",
"area_cor_med = area_cor.median()\n",
"print(f'{n_cor} out of {n_total} glaciers were filled up with regional mean data, hence {np.round(n_cor*100/n_total,2)}%.')\n",
"print(f'{np.round(area_cor_sum*100/df_filled_per.area.sum(),2)}% of the glacier area was refilled with regional mean data')\n",
"\n",
"# same but using instead regional medians for corrections\n",
"df_filled_per_med = df_filled_med[df_filled_med['period']=='2000-01-01_2020-01-01']\n",
"\n",
"plt.figure(figsize=(20,10))\n",
"plt.subplot(121)\n",
"plt.plot(df_filled_per.loc[~df_filled_per['is_cor']].area,df_filled_per.loc[~df_filled_per['is_cor']].dmdtda, '.', label='no correction necessary')\n",
"plt.plot(df_filled_per.loc[df_filled_per['is_cor']].area,df_filled_per.loc[df_filled_per['is_cor']].dmdtda, '.', \n",
" label='corrected by regional means')\n",
"plt.plot(df_filled_per_med.loc[df_filled_per_med['is_cor']].area,\n",
" df_filled_per_med.loc[df_filled_per['is_cor']].dmdtda, 'x',alpha=0.4, ms=4,\n",
" label='corrected by regional medians')\n",
"plt.legend()\n",
"plt.xlabel('area (m2)')\n",
"plt.ylabel('dmdtda')\n",
"\n",
"plt.subplot(122)\n",
"plt.plot(df_filled_per.loc[~df_filled_per['is_cor']].area,df_filled_per.loc[~df_filled_per['is_cor']].err_dmdtda, '.', label='no correction necessary')\n",
"plt.plot(df_filled_per.loc[df_filled_per['is_cor']].area,df_filled_per.loc[df_filled_per['is_cor']].err_dmdtda, '.', \n",
" label='corrected by regional means')\n",
"plt.plot(df_filled_per_med.loc[df_filled_per_med['is_cor']].area,df_filled_per_med.loc[df_filled_per_med['is_cor']].err_dmdtda,\n",
" 'x', alpha=0.4, ms=4,label='corrected by regional medians')\n",
"plt.legend()\n",
"plt.xlabel('area (m2)')\n",
"plt.ylabel('err_dmdtda');\n"
]
},
{
"cell_type": "markdown",
"id": "13bd72f5",
"metadata": {},
"source": [
"- Glaciers that needed to be filled up with regional means are small.\n",
"- There is no direct relationship between area and dmdtda. Small glaciers that do not need any correction can have either very negative or very positive dmdtda. In addition, the standard deviation (err_dmdtda) increases with decreasing glacier area, although its variability is large for small glaciers.\n",
"- Shouldn't we increase the standard deviation of the corrected small glaciers?\n",
"    - Maybe yes, but just using the medians instead even decreases the standard deviation!\n",
"    - This is because there are many more small glaciers than large ones. As we do not weight by area while averaging, it is mostly the small glaciers that determine the regional mean dmdtda and err_dmdtda, which is what we want! So maybe this simple method is just ok.\n"
]
},
{
"cell_type": "markdown",
"id": "a3fcffd6-8abf-4fbb-8b35-a8cadfcf383e",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "a4e9f65a",
"metadata": {
"tags": []
},
"source": [
"## How can we best fill up dhdt, err_dhdt and dvoldt, err_dvoldt ?!\n",
"\n",
"- We have not yet found the best solution for that\n",
"- in the following there are just some plots and a possible way to interpolate dvoldt and err_dvoldt using an area-dependency. However, this is not yet incorporated anywhere and needs further tuning!"
]
},
{
"cell_type": "markdown",
"id": "f2c80d12-4fe0-40c8-bb22-a1ee5bd2de9d",
"metadata": {},
"source": [
"**Some further thoughts about that:**\n",
"- For dhdt and its error, the same method as used for dmdtda is probably ok.\n",
"- For dvoldt and err_dvoldt, we really should take into account that the missing glaciers are small.\n",
"    - E.g. take those glaciers with data from each region that are within the 95% range of the glacier area distribution of the missing glaciers, and then use their mean / median?\n",
"    - But if we do this, we could also apply it to all variables to keep things simple ...\n",
"    - Probably this is still bad. So we rather need to take those glaciers within the area range of the missing glaciers, and use the relation with the area to fit the missing dvoldt & err_dvoldt values (because dvoldt depends strongly on the area).\n",
"        - Globally this works more or less, but it would be better to repeat this regionally.\n",
"- But of course there are other methods. E.g. Loris Compagno (2021) uses estimates from the nearest glacier with similar area to fill up the data. Whether this is better is not clear.\n",
"- At the global scale it is probably not important. But for smaller-scale studies, it can become important to think about which filling method to apply! In the end, for all filling approaches, one could do some kind of \"small\" cross-validation to check how well the method performs."
]
},
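{
"cell_type": "markdown",
"id": "c4d1a9e2-dvoldt-area-relation",
"metadata": {},
"source": [
"A short note on why a linear fit against area is a plausible choice. This rests on an assumption that is not checked in this notebook, namely that dvoldt is (approximately) the glacier-wide mean elevation change multiplied by the glacier area:\n",
"\n",
"$$\\mathrm{dvoldt} \\approx \\mathrm{dhdt} \\cdot \\mathrm{area}$$\n",
"\n",
"If that holds, then fitting $\\mathrm{dvoldt} \\approx a \\cdot \\mathrm{area} + b$ amounts to estimating a typical dhdt as the slope $a$ (with $b$ expected to be close to zero), and filling dhdt with a regional mean and multiplying by each glacier's area would be a closely related alternative."
]
},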
{
"cell_type": "markdown",
"id": "d2324ab4-c531-417e-aeb8-4d6d6dd41c94",
"metadata": {},
"source": [
"----"
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "dcd9c277",
"metadata": {},
"outputs": [],
"source": [
"import seaborn as sns\n",
"import sklearn.metrics  # imported explicitly: sklearn.metrics.r2_score is used further below\n",
"from sklearn.linear_model import LinearRegression"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "d930e163",
"metadata": {},
"outputs": [],
"source": [
"# Let's only look at the 20-year period\n",
"df_new_per = df_new[df_new['period']=='2000-01-01_2020-01-01']"
]
},
{
"cell_type": "markdown",
"id": "3e0b37e4-1392-4d54-9e09-c9343e1c8122",
"metadata": {},
"source": [
"### Let's look at dhdt and err_dhdt first\n",
"- It looks quite similar to dmdtda,\n",
"- so we can probably apply the same approach as for dmdtda."
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "d2603950-d73c-4021-a5a2-36199a2818b9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0, 0.5, 'dhdt')"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(15,3))\n",
"plt.plot(df_new_per.area, df_new_per.dhdt, '.')\n",
"plt.axvline(np.quantile(df_filled[~df_filled.is_cor].area,0.025), ls='--')\n",
"plt.axvline(df_filled[df_filled.is_cor].area.median())\n",
"plt.axvline(np.quantile(df_filled[~df_filled.is_cor].area,0.975), ls='--')\n",
"plt.xlabel('area (m2)')\n",
"plt.ylabel('dhdt')"
]
},
{
"cell_type": "markdown",
"id": "930b0e69-c478-46b3-a0fc-aa93990d90df",
"metadata": {},
"source": [
"**Let's only look at the area range of the glaciers that need to be corrected:**"
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "eb8faf8b-c465-4115-8b7f-19e8d2a81ec6",
"metadata": {},
"outputs": [],
"source": [
"# Only keep glaciers that are in the same area range as the glaciers that need corrections\n",
"df_new_per_small = df_new_per[(df_new_per.area >= df_filled_per[df_filled_per.is_cor].area.min()) & (df_new_per.area <= df_filled_per[df_filled_per.is_cor].area.max())]\n",
"# Drop rows with any NaN so that the regression below only sees complete measurements\n",
"df_new_per_small = df_new_per_small.dropna()"
]
},
{
"cell_type": "code",
"execution_count": 51,
"id": "1cbbd8dc-27bc-4c25-bf5c-9c3a68bb2c0f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0, 0.5, 'dhdt')"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(15,3))\n",
"plt.plot(df_new_per.area, df_new_per.dvoldt, '.')\n",
"plt.axvline(np.quantile(df_filled[~df_filled.is_cor].area, 0.025), ls='--')\n",
"plt.axvline(df_filled[df_filled.is_cor].area.median())\n",
"plt.axvline(np.quantile(df_filled[~df_filled.is_cor].area, 0.975), ls='--')\n",
"plt.xlabel('area (m2)')\n",
"plt.ylabel('dvoldt (m3)')\n"
]
},
{
"cell_type": "markdown",
"id": "2d15ee37-2604-4a5a-93ed-be67f33384b7",
"metadata": {},
"source": [
"- For large glaciers (area above the 97.5\\% quantile), there is no relationship between dvoldt and area. However, the glaciers that need to be filled up are rather small.\n",
"\n",
"**Let's only look at the area range of the glaciers that need to be corrected:**"
]
},
{
"cell_type": "code",
"execution_count": 54,
"id": "222e3047",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(15,10))\n",
"plt.subplot(121)\n",
"reg = LinearRegression().fit(np.atleast_2d(df_new_per_small.area.values).T, df_new_per_small.dvoldt)\n",
"missing_area_glaciers = np.atleast_2d(df_filled_per[df_filled_per.is_cor].area.values).T\n",
"dvoldt_pred_cor = reg.predict(missing_area_glaciers)\n",
"dvoldt_pred_test = reg.predict(np.atleast_2d(df_new_per.loc[df_new_per_small.index].area.values).T)\n",
"plt.plot(df_new_per_small.area, df_new_per_small.dvoldt, '.', label='world-wide RGI glaciers')\n",
"plt.plot(missing_area_glaciers, dvoldt_pred_cor, 'o', label='filled up missing glaciers')\n",
"plt.axvline(np.quantile(df_filled_per[df_filled_per.is_cor].area,0.0025), ls='--', label='00.25%-area of missing glaciers')\n",
"plt.axvline(df_filled_per[df_filled_per.is_cor].area.median(), label='median of missing glaciers')\n",
"plt.axvline(np.quantile(df_filled_per[df_filled_per.is_cor].area,0.9975), ls='--', label='99.75%-area of missing glaciers')\n",
"plt.xlabel('area (m2)')\n",
"plt.ylabel('dvoldt')\n",
"plt.plot(df_new_per.loc[df_new_per_small.index].area.values, dvoldt_pred_test, '.', color = 'grey', ms=3, alpha = 0.2, label='fitted')\n",
"plt.title('R2-value: {:0.2f}'.format(sklearn.metrics.r2_score(df_new_per_small.dvoldt,\n",
" dvoldt_pred_test)))\n",
"plt.legend()\n",
"\n",
"plt.subplot(122)\n",
"reg2 = LinearRegression().fit(np.atleast_2d(df_new_per_small.area.values).T, df_new_per_small.err_dvoldt)\n",
"missing_area_glaciers = np.atleast_2d(df_filled_per[df_filled_per.is_cor].area.values).T\n",
"err_dvoldt_pred_cor = reg2.predict(missing_area_glaciers)\n",
"err_dvoldt_pred_test = reg2.predict(np.atleast_2d(df_new_per.loc[df_new_per_small.index].area.values).T)\n",
"\n",
"\n",
"plt.plot(df_new_per_small.area, df_new_per_small.err_dvoldt, '.', label='world-wide RGI glaciers')\n",
"plt.plot(missing_area_glaciers, err_dvoldt_pred_cor, 'o', label='filled up missing glaciers')\n",
"plt.axvline(np.quantile(df_filled_per[df_filled_per.is_cor].area,0.0025), ls='--', label='00.25%-area of missing glaciers')\n",
"plt.axvline(df_filled_per[df_filled_per.is_cor].area.median(), label='median of missing glaciers')\n",
"plt.axvline(np.quantile(df_filled_per[df_filled_per.is_cor].area,0.9975), ls='--', label='99.75%-area of missing glaciers')\n",
"plt.xlabel('area (m2)')\n",
"plt.ylabel('err_dvoldt')\n",
"plt.plot(df_new_per.loc[df_new_per_small.index].area.values, err_dvoldt_pred_test, '.', color = 'grey', ms=3, alpha = 0.2, label='fitted')\n",
"plt.title('R2-value: {:0.2f}'.format(sklearn.metrics.r2_score(df_new_per_small.err_dvoldt,\n",
" err_dvoldt_pred_test)))\n",
"plt.legend()\n",
"#plt.savefig('dvoldt_missing_fit_global.png')"
]
},
{
"cell_type": "markdown",
"id": "a3b6200a-1593-4e58-8314-6da99de3e060",
"metadata": {},
"source": [
"- We could use a linear relationship with area to fill up **dvoldt** and **err_dvoldt** (instead of the regional mean/median).\n",
"    - However, among the small glaciers there are quite a few outliers with larger uncertainties that would be neglected in this case.\n",
"    - Maybe we should repeat this for each RGI region?"
]
},
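{
"cell_type": "markdown",
"id": "e7b2f6a1-regional-fit-sketch",
"metadata": {},
"source": [
"The regional variant suggested above could look roughly like the following. This is only a minimal sketch on synthetic data, not something that is incorporated anywhere: the helper `fill_dvoldt_by_region`, the column name `reg` and the `min_samples` cut-off are assumptions made for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"\n",
"def fill_dvoldt_by_region(df, min_samples=10):\n",
"    # Hypothetical helper: fill NaN dvoldt per region from a linear fit against area.\n",
"    # Assumes the columns 'reg' (region id), 'area' and 'dvoldt' exist.\n",
"    out = df.copy()\n",
"    for reg, grp in df.groupby('reg'):\n",
"        known = grp.dropna(subset=['dvoldt'])\n",
"        missing = grp[grp['dvoldt'].isna()]\n",
"        if len(known) < min_samples or missing.empty:\n",
"            continue  # too few measurements or nothing to fill: leave the region as is\n",
"        model = LinearRegression().fit(known[['area']].values, known['dvoldt'].values)\n",
"        out.loc[missing.index, 'dvoldt'] = model.predict(missing[['area']].values)\n",
"    return out\n",
"\n",
"\n",
"# Synthetic toy data (not real glacier data): dvoldt roughly proportional to area\n",
"rng = np.random.default_rng(0)\n",
"n = 200\n",
"toy = pd.DataFrame({'reg': rng.integers(1, 3, n),\n",
"                    'area': rng.uniform(1e4, 1e6, n)})\n",
"toy['dvoldt'] = -0.5 * toy['area'] + rng.normal(0, 1e4, n)\n",
"toy.loc[toy.sample(20, random_state=0).index, 'dvoldt'] = np.nan\n",
"\n",
"filled = fill_dvoldt_by_region(toy)\n",
"print(filled['dvoldt'].isna().sum(), 'glaciers still without dvoldt')\n",
"```\n",
"\n",
"Regions with too few measured glaciers are left untouched here; for those, falling back to the global fit or to the regional mean would be one option."
]
},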
{
"cell_type": "markdown",
"id": "4556323f-0c95-445c-9a34-b59f786b7e33",
"metadata": {},
"source": [
"**Problem: what do we do with rather large glaciers that need to be filled up?**\n",
"- These are the largest glaciers that need to be corrected!\n",
"- Maybe it would be good to treat them separately and to think about another way to correct them?"
]
},
{
"cell_type": "code",
"execution_count": 55,
"id": "085e1ba4",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"