Accessing remote files with earthaccess
When we search for data with earthaccess, we get back a list of results from NASA's Common Metadata Repository (CMR). These results contain all the information
we need to access the files that the metadata describes. earthaccess offers two access methods that operate on these results. The first is the familiar download(),
which copies the files from their location to our local disk; if we run the code in AWS, say on a JupyterHub, the files are copied to the disk of the local VM.
The other method is open(), which uses fsspec to open remote files as if they were local. open() has advantages and some disadvantages that we should know about before using it.
The main advantage of open() is that we don't have to download the file; we can stream it into memory. Depending on how we do that, however, we may run into network performance issues. Again, if we run the code next to the data, access will be fast; if we run it locally on our laptops, it will be slow.
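As a rough sketch of the two access paths described above, the hypothetical helper below either streams granules with earthaccess.open() or copies them to disk with earthaccess.download(). The helper name get_granules, its stream flag, the client parameter (there only so the function can be exercised without network access), and the default ./data directory are assumptions made for illustration; none of them are part of earthaccess.

```python
def get_granules(results, stream=True, local_path="./data", client=None):
    """Stream or download search results (illustrative helper, not part of earthaccess)."""
    if client is None:
        # Imported lazily so the helper can be defined without earthaccess installed.
        import earthaccess as client
    if stream:
        # open() returns file-like objects backed by fsspec; nothing is written to disk.
        return client.open(results)
    # download() copies the files into local_path and returns the local file paths.
    return client.download(results, local_path)
```

With real search results, get_granules(results) would hand back file-like objects that can be passed to a reader such as xarray, while get_granules(results, stream=False) would leave copies on disk.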
import earthaccess
auth = earthaccess.login()
results = earthaccess.search_data(
    short_name="ATL06",
    cloud_hosted=False,
    temporal=("2019-01", "2019-02"),
    polygon=[(-100, 40), (-110, 40), (-105, 38), (-100, 40)],
)
results[0]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[2], line 7
      1 results = earthaccess.search_data(
      2     short_name="ATL06",
      3     cloud_hosted=False,
      4     temporal=("2019-01", "2019-02"),
      5     polygon=[(-100, 40), (-110, 40), (-105, 38), (-100, 40)],
      6 )
----> 7 results[0]

IndexError: list index out of range
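The IndexError above is what Python raises when a search returns an empty list and we index into it. A minimal sketch of a guard, assuming a hypothetical helper named first_result (not part of earthaccess), could look like this:

```python
def first_result(results):
    """Return the first search result, or None when the search came back empty."""
    if not results:
        # An empty list means no granules matched the query.
        print("No granules matched the search; try relaxing the filters.")
        return None
    return results[0]
```

Calling first_result(results) instead of results[0] turns the failure into a message and a None value that later cells can check.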
nsidc_url = "https://n5eil01u.ecs.nsidc.org/DP7/ATLAS/ATL06.005/2019.02.21/ATL06_20190221121851_08410203_005_01.h5"
lpcloud_url = "https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/EMITL2ARFL.001/EMIT_L2A_RFL_001_20220903T163129_2224611_012/EMIT_L2A_RFL_001_20220903T163129_2224611_012.nc"
session = earthaccess.get_requests_https_session()
headers = {"Range": "bytes=0-100"}
r = session.get(lpcloud_url, headers=headers)
r
<Response [502]>
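HTTP byte ranges are inclusive on both ends, so the header bytes=0-100 used above asks for 101 bytes. The sketch below builds a Range header for a given offset and length and classifies the response status. The helper names range_header and describe_status are hypothetical and only illustrate the idea. A server that honors the range normally answers 206 Partial Content, a 200 means it sent the whole file, and a 5xx code such as the 502 shown above points to a gateway or upstream error.

```python
def range_header(start, length):
    """Build a Range header covering `length` bytes starting at `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    # The end offset is inclusive, hence the -1.
    return {"Range": f"bytes={start}-{start + length - 1}"}


def describe_status(status_code):
    """Map an HTTP status code to a short description of a ranged request."""
    if status_code == 206:
        return "partial content: the server honored the range"
    if status_code == 200:
        return "full content: the server ignored the range"
    if 500 <= status_code < 600:
        return "server error: the request failed upstream"
    return "unexpected status"


print(range_header(0, 100))   # {'Range': 'bytes=0-99'}
print(describe_status(502))   # server error: the request failed upstream
```

The dictionary returned by range_header can be passed as headers= to session.get(), just like the hand-written header above.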
fs = earthaccess.get_fsspec_https_session()
with fs.open(lpcloud_url) as f:
    data = f.read(100)
data
b'\x89HDF\r\n\x1a\n\x00\x00\x00\x00\x00\x08\x08\x00\x04\x00\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xff\xff\xff\xff\xff\xff\xff\xd7HUn\x00\x00\x00\x00\xff\xff\xff\xff\xff\xff\xff\xff\x00\x00\x00\x00\x00\x00\x00\x00`\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00OHDR'
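The first eight bytes returned above, \x89HDF\r\n\x1a\n, are the HDF5 format signature, which suggests the ranged read reached the actual file rather than an error page or a login form. A small sketch of that check, assuming a hypothetical helper named looks_like_hdf5:

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(header):
    """Return True when the bytes start with the 8-byte HDF5 signature."""
    # Only offset 0 is checked; files with a user block keep the signature later.
    return header[:8] == HDF5_SIGNATURE


print(looks_like_hdf5(b"\x89HDF\r\n\x1a\n\x00\x00"))  # True
print(looks_like_hdf5(b"<!DOCTYPE html>"))             # False
```

Passing the data variable from the cell above to looks_like_hdf5 is a cheap sanity check before handing the remote file to a heavier reader.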
%%time
import xarray as xr
files = earthaccess.open(results[0:2])
ds = xr.open_dataset(files[0], group="/gt1r/land_ice_segments")
ds
CPU times: user 344 ms, sys: 61.9 ms, total: 406 ms Wall time: 304 ms
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[7], line 1
----> 1 get_ipython().run_cell_magic('time', '', '\nimport xarray as xr\n\nfiles = earthaccess.open(results[0:2])\n\nds = xr.open_dataset(files[0], group="/gt1r/land_ice_segments")\nds\n')

File ~/checkouts/readthedocs.org/user_builds/earthaccess/envs/1047/lib/python3.11/site-packages/IPython/core/interactiveshell.py:2572, in InteractiveShell.run_cell_magic(self, magic_name, line, cell)
   2570 with self.builtin_trap:
   2571     args = (magic_arg_s, cell)
-> 2572     result = fn(*args, **kwargs)
   2574 # The code below prevents the output from being displayed
   2575 # when using magics with decorator @output_can_be_silenced
   2576 # when the last Python token in the expression is a ';'.
   2577 if getattr(fn, magic.MAGIC_OUTPUT_CAN_BE_SILENCED, False):

File ~/checkouts/readthedocs.org/user_builds/earthaccess/envs/1047/lib/python3.11/site-packages/IPython/core/magics/execution.py:1447, in ExecutionMagics.time(self, line, cell, local_ns)
   1445 if interrupt_occured:
   1446     if exit_on_interrupt and captured_exception:
-> 1447         raise captured_exception
   1448     return
   1449 return out

File ~/checkouts/readthedocs.org/user_builds/earthaccess/envs/1047/lib/python3.11/site-packages/IPython/core/magics/execution.py:1411, in ExecutionMagics.time(self, line, cell, local_ns)
   1409 st = clock2()
   1410 try:
-> 1411     exec(code, glob, local_ns)
   1412     out = None
   1413     # multi-line %%time case

File <timed exec>:5

IndexError: list index out of range