07: Combining data from multiple NetCDF files#

In some cases, people choose to break their data down into multiple smaller NetCDF files that they publish together as a single data collection. There are a number of good reasons to do this.

  • The data user can access only the data they are interested in.

  • Each file can be simpler, with potentially fewer dimensions and fewer missing values. Imagine you have 10 depth profiles that each sample a different set of depths. If these profiles were included in a single NetCDF file, the file would most likely have a single depth dimension and coordinate variable, which would need to account for all 10 depth profiles, padding each profile with missing values at the depths it did not sample. Alternatively, 10 separate depth dimensions and coordinate variables could be included. (A minimal sketch of the padding problem follows this list.)

  • Each individual file can be assigned a separate set of global attributes that describe the data more precisely. For example, each file could have global attributes for the exact coordinates and timestamp of its profile. If multiple depth profiles are stored in a single file, only the minimum and maximum coordinates and timestamps can be encoded in the global attributes.

  • Imagine you are looking for data in a data centre and want to find depth profiles in a certain area of interest on a map. Files that include a single depth profile can be presented as points on the map. Files that include multiple depth profiles will be presented as a bounding box, and without opening the file it could be unclear whether it includes data for your area of interest.
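
To make the padding problem mentioned above concrete, here is a minimal sketch with made-up numbers, combining two hypothetical profiles that sampled different sets of depths:

import xarray as xr

# Two hypothetical profiles that sampled different sets of depths
profile_a = xr.Dataset({'TEMP': ('DEPTH', [3.2, 3.1, 2.9])}, coords={'DEPTH': [5, 10, 20]})
profile_b = xr.Dataset({'TEMP': ('DEPTH', [2.8, 2.6, 2.5])}, coords={'DEPTH': [5, 15, 25]})

# A shared DEPTH coordinate must cover the union of all sampled depths,
# so each profile is padded with NaN at the depths it did not sample
combined = xr.concat([profile_a, profile_b], dim='profile')
print(combined['TEMP'].values)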

A common reaction to learning that data are divided into multiple files is that extracting the data will involve more work. However, if the files are similar (and they should be if they follow the CF and ACDD conventions), this is not necessarily the case.

In this notebook we will look at how to combine data from multiple NetCDF files into a single object (e.g. a dataframe or a multi-dimensional array) that you can use.

from IPython.display import YouTubeVideo
YouTubeVideo('AbLRV5YUW2g')

Introducing the data#

The link below is an OPeNDAP access point to CTD data collected as part of the Nansen Legacy project. The data are grouped together by cruise, and each CF-NetCDF file contains data from a single depth profile. https://opendap1.nodc.no/opendap/physics/point/cruise/nansen_legacy-single_profile/

Each file has its own access point, like this: https://opendap1.nodc.no/opendap/physics/point/cruise/nansen_legacy-single_profile/NMDC_Nansen-Legacy_PR_CT_58US_2021708/CTD_station_ISG_SVR1_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc.html

From that page, anyone can download some or all of the data from the CF-NetCDF file as ASCII. This makes data access a lot easier for people who don’t yet know how to work with NetCDF files.
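
OPeNDAP servers also let you request the ASCII form programmatically by replacing the .html suffix with .ascii, optionally followed by a list of variables. Here is a minimal sketch using requests; picking TEMP and PSAL (and the output filename) is purely an illustration:

import requests

# Request only the TEMP and PSAL variables from this profile as ASCII
url = ('https://opendap1.nodc.no/opendap/physics/point/cruise/'
       'nansen_legacy-single_profile/NMDC_Nansen-Legacy_PR_CT_58US_2021708/'
       'CTD_station_ISG_SVR1_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc.ascii?TEMP,PSAL')

response = requests.get(url)
response.raise_for_status()

# Write the plain-text response to a local file
with open('profile.txt', 'w') as f:
    f.write(response.text)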

We can load the data into Python by removing the .html suffix and using the rest of the URL as our filepath. You don’t have to download the data!

import xarray as xr

xrds = xr.open_dataset('https://opendap1.nodc.no/opendap/physics/point/cruise/nansen_legacy-single_profile/NMDC_Nansen-Legacy_PR_CT_58US_2021708/CTD_station_ISG_SVR1_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc')
xrds
<xarray.Dataset>
Dimensions:        (PRES: 259)
Coordinates:
  * PRES           (PRES) float32 2.0 3.0 4.0 5.0 ... 257.0 258.0 259.0 260.0
Data variables: (12/33)
    PRES_QC        (PRES) float32 ...
    TEMP           (PRES) float32 ...
    PSAL           (PRES) float32 ...
    FLU2           (PRES) float32 ...
    CNDC           (PRES) float32 ...
    DENS           (PRES) float32 ...
    ...             ...
    OXYOCPVL-1_QC  (PRES) float32 ...
    SPAR_QC        (PRES) float32 ...
    PAR_QC         (PRES) float32 ...
    PSAL-2_QC      (PRES) float32 ...
    TEMP-2_QC      (PRES) float32 ...
    ATTNZS01_QC    (PRES) float32 ...
Attributes: (12/73)
    qc_manual:                       Recommendations for in-situ data Near Re...
    contact:                         datahjelp@hi.no
    distribution_statement:          These data are public and free of charge...
    naming_authority:                no.unis
    license:                         https://creativecommons.org/licenses/by/...
    data_assembly_center:            IMR
    ...                              ...
    geospatial_vertical_resolution:  1 dbar
    pi_email:                        jannes@unis.no
    pi_institution:                  University Centre in Svalbard
    station_name:                    ISG/SVR1
    metadata_link:                   https://doi.org/10.21335/NMDC-2085836005...
    _NCProperties:                   version=2,netcdf=4.6.3,hdf5=1.10.5

Let’s look at a quick example of how we can extract the data into numpy arrays.

temperature = xrds['TEMP'].values
salinity = xrds['PSAL'].values
temperature, salinity
(array([3.226, 3.211, 3.202, 3.205, 3.213, 3.215, 3.21 , 3.215, 3.191,
        3.162, 3.144, 3.118, 3.081, 2.993, 2.913, 2.861, 2.837, 2.826,
        2.809, 2.776, 2.743, 2.703, 2.687, 2.652, 2.587, 2.544, 2.508,
        2.483, 2.472, 2.434, 2.399, 2.355, 2.298, 2.277, 2.252, 2.225,
        2.201, 2.173, 2.113, 2.073, 2.06 , 2.04 , 2.022, 2.006, 2.004,
        2.008, 2.067, 2.126, 2.145, 2.143, 2.248, 2.288, 2.232, 2.212,
        2.136, 1.997, 1.748, 1.594, 1.657, 1.623, 1.468, 1.385, 1.374,
        1.383, 1.378, 1.38 , 1.367, 1.351, 1.38 , 1.376, 1.391, 1.396,
        1.403, 1.398, 1.413, 1.393, 1.371, 1.248, 1.159, 1.063, 0.997,
        0.981, 0.944, 0.933, 0.928, 0.964, 1.   , 1.013, 1.044, 1.095,
        1.241, 1.216, 1.154, 1.281, 1.362, 1.4  , 1.378, 1.353, 1.237,
        1.069, 0.993, 0.989, 1.048, 1.078, 1.07 , 1.05 , 1.016, 0.946,
        0.928, 0.984, 1.004, 0.942, 0.892, 0.828, 0.691, 0.641, 0.635,
        0.632, 0.625, 0.615, 0.566, 0.542, 0.518, 0.493, 0.52 , 0.457,
        0.426, 0.416, 0.407, 0.4  , 0.388, 0.389, 0.387, 0.387, 0.373,
        0.368, 0.363, 0.362, 0.36 , 0.359, 0.361, 0.373, 0.395, 0.425,
        0.443, 0.446, 0.449, 0.449, 0.45 , 0.45 , 0.448, 0.455, 0.454,
        0.464, 0.485, 0.63 , 0.753, 0.873, 0.84 , 0.793, 0.754, 0.764,
        0.741, 0.544, 0.496, 0.413, 0.337, 0.341, 0.346, 0.378, 0.435,
        0.468, 0.514, 0.554, 0.484, 0.42 , 0.447, 0.554, 0.639, 0.747,
        0.991, 1.174, 1.382, 1.472, 1.527, 1.52 , 1.521, 1.523, 1.626,
        1.694, 1.718, 1.778, 1.815, 1.937, 2.004, 2.018, 2.126, 2.203,
        2.284, 2.384, 2.454, 2.49 , 2.514, 2.506, 2.464, 2.436, 2.413,
        2.368, 2.369, 2.534, 2.579, 2.572, 2.583, 2.598, 2.618, 2.629,
        2.648, 2.669, 2.676, 2.682, 2.657, 2.641, 2.576, 2.503, 2.447,
        2.428, 2.447, 2.501, 2.546, 2.574, 2.743, 2.776, 2.754, 2.794,
        2.799, 2.8  , 2.802, 2.806, 2.805, 2.805, 2.788, 2.754, 2.762,
        2.779, 2.786, 2.792, 2.792, 2.793, 2.728, 2.486, 2.424, 2.386,
        2.353, 2.304, 2.302, 2.268, 2.221, 2.206, 2.189], dtype=float32),
 array([34.233, 34.24 , 34.243, 34.24 , 34.239, 34.241, 34.248, 34.246,
        34.251, 34.254, 34.255, 34.262, 34.265, 34.271, 34.274, 34.276,
        34.277, 34.279, 34.282, 34.284, 34.287, 34.289, 34.29 , 34.293,
        34.297, 34.302, 34.305, 34.311, 34.315, 34.319, 34.321, 34.321,
        34.321, 34.325, 34.329, 34.33 , 34.33 , 34.328, 34.334, 34.341,
        34.343, 34.344, 34.346, 34.35 , 34.353, 34.358, 34.369, 34.378,
        34.379, 34.38 , 34.398, 34.413, 34.415, 34.415, 34.411, 34.401,
        34.385, 34.383, 34.386, 34.387, 34.389, 34.399, 34.403, 34.404,
        34.419, 34.42 , 34.42 , 34.419, 34.422, 34.421, 34.422, 34.422,
        34.423, 34.422, 34.424, 34.426, 34.429, 34.431, 34.43 , 34.427,
        34.425, 34.425, 34.425, 34.425, 34.428, 34.433, 34.437, 34.441,
        34.447, 34.453, 34.474, 34.475, 34.474, 34.488, 34.499, 34.507,
        34.506, 34.504, 34.497, 34.486, 34.48 , 34.481, 34.488, 34.494,
        34.495, 34.5  , 34.503, 34.501, 34.503, 34.512, 34.518, 34.516,
        34.514, 34.511, 34.505, 34.502, 34.501, 34.502, 34.502, 34.502,
        34.502, 34.501, 34.5  , 34.5  , 34.504, 34.502, 34.501, 34.501,
        34.502, 34.502, 34.503, 34.505, 34.506, 34.508, 34.511, 34.512,
        34.513, 34.514, 34.515, 34.518, 34.519, 34.521, 34.526, 34.532,
        34.536, 34.537, 34.537, 34.537, 34.538, 34.538, 34.538, 34.538,
        34.539, 34.54 , 34.543, 34.56 , 34.573, 34.587, 34.587, 34.584,
        34.58 , 34.58 , 34.578, 34.563, 34.558, 34.553, 34.548, 34.548,
        34.55 , 34.558, 34.569, 34.572, 34.578, 34.584, 34.58 , 34.574,
        34.577, 34.588, 34.597, 34.608, 34.633, 34.653, 34.677, 34.688,
        34.695, 34.694, 34.693, 34.692, 34.705, 34.714, 34.718, 34.725,
        34.729, 34.743, 34.752, 34.754, 34.768, 34.778, 34.787, 34.8  ,
        34.809, 34.814, 34.818, 34.818, 34.814, 34.81 , 34.808, 34.803,
        34.804, 34.825, 34.832, 34.832, 34.833, 34.836, 34.838, 34.839,
        34.842, 34.845, 34.846, 34.847, 34.846, 34.844, 34.838, 34.831,
        34.825, 34.823, 34.824, 34.831, 34.838, 34.842, 34.864, 34.869,
        34.867, 34.872, 34.873, 34.873, 34.873, 34.873, 34.873, 34.873,
        34.871, 34.867, 34.868, 34.87 , 34.871, 34.871, 34.871, 34.872,
        34.866, 34.84 , 34.833, 34.834, 34.827, 34.824, 34.821, 34.821,
        34.816, 34.813, 34.812], dtype=float32))

Or into a pandas dataframe that we can export as a CSV or XLSX file, for example.

df = xrds[['TEMP','PSAL']].to_dataframe()
#df.to_csv('/path/to/file.csv')
df
TEMP PSAL
PRES
2.0 3.226 34.233002
3.0 3.211 34.240002
4.0 3.202 34.243000
5.0 3.205 34.240002
6.0 3.213 34.238998
... ... ...
256.0 2.302 34.820999
257.0 2.268 34.820999
258.0 2.221 34.816002
259.0 2.206 34.813000
260.0 2.189 34.812000

259 rows × 2 columns

Looping through multiple files#

Now let’s look at how we can easily loop through all the files (depth profiles) from one cruise.

from siphon.catalog import TDSCatalog

base_url = 'https://opendap1.nodc.no/opendap/physics/point/cruise/nansen_legacy-single_profile/NMDC_Nansen-Legacy_PR_CT_58US_2021708'

# Path to the catalog we can loop through
catalog_url = base_url + '/catalog.xml'

# Access the THREDDS catalog
catalog = TDSCatalog(catalog_url)

# Traverse through the catalog and print a list of the NetCDF files
catalog.datasets
['CTD_station_ISG_SVR1_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG02_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG02_1_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG02_2_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG02_3_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG02_4_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG02_5_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG03_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG05_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG05_01_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG05_02_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG06_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG06_01_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG06_02_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG07_01_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG08_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG08_1_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG09_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG09_1_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG10_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG10_1_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG11_1_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG11_2_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG12_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG14_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG15_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG16_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG17_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG18_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG19_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG20_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG22_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG23_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_NLEG24_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P1_NLEG01-1_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P1_NLEG01-2_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P1_NLEG01-3_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P2_NLEG04-1_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P2_NLEG04-2_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P2_NLEG04-3_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P2_NLEG04-4_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P3_NLEG07-1_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P3_NLEG07-2_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P3_NLEG07-3_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P3_NLEG07-4_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P3_NLEG07-5_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P4_NLEG11-1_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 
'CTD_station_P4_NLEG11-2_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P4_NLEG11-3_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P5_NLEG13-1_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P5_NLEG13-2_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P5_NLEG13-3_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P5_NLEG13-4_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P5_NLEG13-5_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P5_NLEG13_1_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P6_NLEG21_NPAL15-1_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P6_NLEG21_NPAL15-2_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P6_NLEG21_NPAL15-3_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P6_NLEG21_NPAL15-4_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P7_NLEG25_NPAL16-1_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P7_NLEG25_NPAL16-2_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc', 'CTD_station_P7_NLEG25_NPAL16-3_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc']

Now let’s loop through this list, opening each file in turn and printing its time_coverage_start attribute.

base_url = 'https://opendap1.nodc.no/opendap/physics/point/cruise/nansen_legacy-single_profile/NMDC_Nansen-Legacy_PR_CT_58US_2021708'
catalog_url = base_url + '/catalog.xml'
catalog = TDSCatalog(catalog_url)

for dataset in catalog.datasets:
    profile_url = base_url + '/' + dataset
    xrds = xr.open_dataset(profile_url)
    print(xrds.attrs['time_coverage_start'])
    
2021-07-12T19:05:04Z
2021-07-14T23:59:51Z
2021-07-15T02:01:30Z
2021-07-15T03:35:31Z
2021-07-15T04:45:26Z
2021-07-15T05:55:33Z
2021-07-15T07:25:12Z
2021-07-15T08:44:11Z
2021-07-16T17:22:02Z
2021-07-16T20:34:00Z
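
As an aside, if you would rather end up with a single multi-dimensional object than loop yourself, xarray can open and combine all the files in one call. This is a sketch rather than the approach used in the rest of this notebook; it assumes dask is installed, and the differing PRES coordinates are aligned with an outer join (padded with NaN):

import xarray as xr
from siphon.catalog import TDSCatalog

base_url = 'https://opendap1.nodc.no/opendap/physics/point/cruise/nansen_legacy-single_profile/NMDC_Nansen-Legacy_PR_CT_58US_2021708'
catalog = TDSCatalog(base_url + '/catalog.xml')
urls = [base_url + '/' + dataset for dataset in catalog.datasets]

# Stack the profiles along a new 'profile' dimension
combined = xr.open_mfdataset(urls, combine='nested', concat_dim='profile')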

Combining data from all the files into a CSV or XLSX file#

import pandas as pd
base_url = 'https://opendap1.nodc.no/opendap/physics/point/cruise/nansen_legacy-single_profile/NMDC_Nansen-Legacy_PR_CT_58US_2021708'
catalog_url = base_url + '/catalog.xml'
catalog = TDSCatalog(catalog_url)

# Initialize an empty list to store individual DataFrames
dataframes_list = []

for dataset in catalog.datasets:
    profile_url = base_url + '/' + dataset
    xrds = xr.open_dataset(profile_url)
    profile_df = xrds[['TEMP','PSAL']].to_dataframe()
    
    # Let's add some more columns from the global attributes that will help
    profile_df['latitude'] = xrds.attrs['geospatial_lat_min']
    profile_df['longitude'] = xrds.attrs['geospatial_lon_min']
    profile_df['timestamp'] = xrds.attrs['time_coverage_start']
    
    # Append the current DataFrame to the list
    dataframes_list.append(profile_df)

# Concatenate all DataFrames in the list into a master DataFrame
master_df = pd.concat(dataframes_list)

# Reset index of the master DataFrame
master_df.reset_index(inplace=True)

master_df

#master_df.to_csv('/path/to/file.csv', index=False)
#master_df.to_excel('/path/to/file.xlsx', index=False)  # Set index=False to exclude the index from being written to the Excel file
PRES TEMP PSAL latitude longitude timestamp
0 2.0 3.226 34.233002 78.128197 14.0032 2021-07-12T19:05:04Z
1 3.0 3.211 34.240002 78.128197 14.0032 2021-07-12T19:05:04Z
2 4.0 3.202 34.243000 78.128197 14.0032 2021-07-12T19:05:04Z
3 5.0 3.205 34.240002 78.128197 14.0032 2021-07-12T19:05:04Z
4 6.0 3.213 34.238998 78.128197 14.0032 2021-07-12T19:05:04Z
... ... ... ... ... ... ...
32028 3362.0 -0.722 34.945000 82.003799 30.0427 2021-07-25T03:02:53Z
32029 3363.0 -0.722 34.945000 82.003799 30.0427 2021-07-25T03:02:53Z
32030 3364.0 -0.722 34.945000 82.003799 30.0427 2021-07-25T03:02:53Z
32031 3365.0 -0.722 34.945000 82.003799 30.0427 2021-07-25T03:02:53Z
32032 3366.0 -0.722 34.945000 82.003799 30.0427 2021-07-25T03:02:53Z

32033 rows × 6 columns

Or maybe you prefer to have separate columns for each depth profile. Let’s create a dataframe with pressure as the index and an individual column for the temperature data from each profile.

# Create an empty dictionary to store profile data
profile_data = {}

for dataset in catalog.datasets:
    profile_url = base_url + '/' + dataset
    xrds = xr.open_dataset(profile_url)
    
    timestamp = xrds.attrs['time_coverage_start']
    
    profile_data[timestamp] = xrds['TEMP'].to_dataframe()['TEMP']

# Creating a dataframe that includes all profiles
master_df = pd.DataFrame(profile_data.values()).transpose()

# Assigning the timestamps as column headers (currently TEMP for all)
master_df.columns = profile_data.keys()

# Display the resulting DataFrame
master_df
2021-07-12T19:05:04Z 2021-07-14T23:59:51Z 2021-07-15T02:01:30Z 2021-07-15T03:35:31Z 2021-07-15T04:45:26Z 2021-07-15T05:55:33Z 2021-07-15T07:25:12Z 2021-07-15T08:44:11Z 2021-07-16T17:22:02Z 2021-07-16T20:34:00Z ... 2021-07-19T14:29:47Z 2021-07-20T13:58:18Z 2021-07-20T19:58:26Z 2021-07-21T17:35:04Z 2021-07-22T13:57:51Z 2021-07-23T06:12:20Z 2021-07-23T09:25:55Z 2021-07-24T01:04:37Z 2021-07-24T14:31:40Z 2021-07-25T03:02:53Z
PRES
2.0 3.226 3.782 3.035 2.504 1.960 1.735 1.694 1.966 NaN 1.382 ... NaN 8.814 NaN -1.551 -1.582 -1.535 -1.528 NaN -1.620 -1.643
3.0 3.211 3.783 3.036 2.456 1.960 1.736 1.693 1.958 0.693 1.376 ... -1.499 5.513 NaN -1.560 -1.585 -1.541 -1.524 NaN -1.630 -1.643
4.0 3.202 3.783 3.036 2.456 1.950 1.735 1.691 1.954 0.662 1.362 ... -1.498 2.829 NaN -1.554 -1.588 -1.542 -1.547 NaN -1.632 -1.641
5.0 3.205 3.783 3.037 2.440 1.956 1.732 1.696 1.973 0.677 1.243 ... -1.497 0.875 -0.452 -1.552 -1.588 -1.549 -1.550 1.913 -1.632 -1.639
6.0 3.213 3.792 3.035 2.443 1.961 1.733 1.695 1.976 0.708 1.359 ... -1.499 1.871 -0.176 -1.552 -1.588 -1.550 -1.550 2.181 -1.627 -1.639
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3366.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN -0.719 NaN -0.722
3367.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN -0.719 NaN NaN
3368.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN -0.719 NaN NaN
3369.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN -0.719 NaN NaN
3370.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN -0.719 NaN NaN

3370 rows × 62 columns

Maybe you want the coordinates as well as the timestamp in the column headers.

import pandas as pd
import xarray as xr
from siphon.catalog import TDSCatalog

base_url = 'https://opendap1.nodc.no/opendap/physics/point/cruise/nansen_legacy-single_profile/NMDC_Nansen-Legacy_PR_CT_58US_2021708'
catalog_url = base_url + '/catalog.xml'
catalog = TDSCatalog(catalog_url)

# Create an empty dictionary to store profile data
profile_data = {}

for dataset in catalog.datasets:
    profile_url = base_url + '/' + dataset
    xrds = xr.open_dataset(profile_url)
    
    timestamp = xrds.attrs['time_coverage_start']
    latitude = xrds.attrs['geospatial_lat_min']
    longitude = xrds.attrs['geospatial_lon_min']
    
    key = (timestamp, latitude, longitude)
    profile_data[key] = xrds['TEMP'].to_dataframe()['TEMP']

# Create a MultiIndex DataFrame
index = pd.MultiIndex.from_tuples(profile_data.keys(), names=['timestamp', 'latitude', 'longitude'])
master_df = pd.DataFrame(profile_data.values(), index=index).transpose()

# Display the resulting DataFrame
master_df
timestamp 2021-07-12T19:05:04Z 2021-07-14T23:59:51Z 2021-07-15T02:01:30Z 2021-07-15T03:35:31Z 2021-07-15T04:45:26Z 2021-07-15T05:55:33Z 2021-07-15T07:25:12Z 2021-07-15T08:44:11Z 2021-07-16T17:22:02Z 2021-07-16T20:34:00Z ... 2021-07-19T14:29:47Z 2021-07-20T13:58:18Z 2021-07-20T19:58:26Z 2021-07-21T17:35:04Z 2021-07-22T13:57:51Z 2021-07-23T06:12:20Z 2021-07-23T09:25:55Z 2021-07-24T01:04:37Z 2021-07-24T14:31:40Z 2021-07-25T03:02:53Z
latitude 78.128197 76.499802 76.595802 76.693001 76.764503 76.812202 76.914703 77.000198 78.000000 78.400002 ... 80.512199 80.480003 80.732498 81.547798 81.546700 81.542503 81.542702 82.000999 81.981697 82.003799
longitude 14.003200 31.219801 31.756800 32.303699 32.617298 32.897999 33.516300 34.002300 33.999699 33.999802 ... 33.814499 33.202499 33.121498 30.858700 30.795000 30.891500 30.846001 29.982201 29.979000 30.042700
PRES
2.0 3.226 3.782 3.035 2.504 1.960 1.735 1.694 1.966 NaN 1.382 ... NaN 8.814 NaN -1.551 -1.582 -1.535 -1.528 NaN -1.620 -1.643
3.0 3.211 3.783 3.036 2.456 1.960 1.736 1.693 1.958 0.693 1.376 ... -1.499 5.513 NaN -1.560 -1.585 -1.541 -1.524 NaN -1.630 -1.643
4.0 3.202 3.783 3.036 2.456 1.950 1.735 1.691 1.954 0.662 1.362 ... -1.498 2.829 NaN -1.554 -1.588 -1.542 -1.547 NaN -1.632 -1.641
5.0 3.205 3.783 3.037 2.440 1.956 1.732 1.696 1.973 0.677 1.243 ... -1.497 0.875 -0.452 -1.552 -1.588 -1.549 -1.550 1.913 -1.632 -1.639
6.0 3.213 3.792 3.035 2.443 1.961 1.733 1.695 1.976 0.708 1.359 ... -1.499 1.871 -0.176 -1.552 -1.588 -1.550 -1.550 2.181 -1.627 -1.639
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3366.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN -0.719 NaN -0.722
3367.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN -0.719 NaN NaN
3368.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN -0.719 NaN NaN
3369.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN -0.719 NaN NaN
3370.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN -0.719 NaN NaN

3370 rows × 62 columns

Accessing a subset of the files#

We can use an if statement within our for loop to work with only certain files. Let’s say you are looking for data within a certain longitude range; you can select files based on the relevant global attribute.

filtered_datasets = {}

for dataset in catalog.datasets:
    profile_url = base_url + '/' + dataset
    xrds = xr.open_dataset(profile_url)

    # Keep only profiles within the longitude range of interest
    longitude = xrds.attrs['geospatial_lon_min']
    if 31 < longitude < 31.5:
        # Store the dataset, keyed by its (rounded) latitude
        filtered_datasets[round(float(xrds.attrs['geospatial_lat_min']), 4)] = xrds
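
To see which profiles were selected, we can print the dictionary keys (the latitudes), which the plotting code below also sorts on:

# Latitudes of the profiles within the chosen longitude range
print(sorted(filtered_datasets.keys()))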

Plotting data from multiple files together#

How you do this will depend very much on the data and your requirements. However, this demonstrates that it is possible to create a single plot of data from multiple NetCDF files.

import matplotlib.pyplot as plt
import xarray as xr

fig, axs = plt.subplots(2, len(filtered_datasets), figsize=(13, 10), sharey=True)

# Sorting the datasets in order of latitude
sorted_latitudes = sorted(filtered_datasets.keys())

for i, latitude in enumerate(sorted_latitudes):
    xrds = filtered_datasets[latitude]
    pressure = xrds['PRES']
    temperature = xrds['TEMP']
    salinity = xrds['PSAL']

    axs[0,i].plot(temperature, pressure)
    axs[0,i].set_title(f'Lat = {latitude}')
    axs[0,i].set_ylim([150,0])
    axs[0,i].set_xlim([-2,5])
    axs[0,i].set_xlabel('Temperature')
    if i == 0:
        axs[0,i].set_ylabel('Pressure')

    axs[1,i].plot(salinity, pressure)
    axs[1,i].set_title(f'Lat = {latitude}')
    axs[1,i].set_ylim([150,0])
    axs[1,i].set_xlim([33.5,35])
    axs[1,i].set_xlabel('Salinity')
    if i == 0:
        axs[1,i].set_ylabel('Pressure')

plt.tight_layout()
plt.show()
[Figure: a grid of profile plots, one column per selected latitude, with temperature against pressure in the top row and salinity against pressure in the bottom row.]