Part 1 FINAL
March 5, 2024
About ARSET
• Trainings include a variety of applications of satellite data and are
tailored to audiences with a variety of experience levels.
CAPACITY BUILDING
CLIMATE & RESILIENCE
DISASTERS
ECOLOGICAL CONSERVATION
WATER RESOURCES
NASA ARSET – Large Scale Applications of Machine Learning using Remote Sensing for Building Agriculture Solutions 3
About ARSET Trainings
• Online or in-person
• Live and instructor-led or asynchronous and self-paced
• Cost-free
• Bilingual and multilingual options
• Only use open-source software and data
• Accommodate differing levels of expertise
Large Scale Applications of Machine Learning using
Remote Sensing for Building Agriculture Solutions
Overview
Motivation for Training
Training Learning Objectives
• Use recommended techniques to download and process remote sensing data from Sentinel-2 and
the Cropland Data Layer (CDL) at large scale (> 5GB) with cloud tools (Amazon Web Services [AWS]
Simple Storage Service [S3], Databricks, Spark/Pyspark, Parquet)
• Produce interactive plots of maps, tables, time series, etc. for investigation & verification of data and
models
• Filter data from both the measured (satellite images) and target (CDL) domains to serve modeling
objectives based on quality factors, land classification, area of interest (AOI) overlap, and
geographical location.
• Build training pipelines in TensorFlow to train machine learning algorithms on large scale remote
sensing/geospatial datasets for agricultural monitoring
• Utilize random sampling techniques to build robustness into a predictive algorithm while avoiding
information leakage across training/validation/testing splits
Prerequisites
Training Outline
Homework
Opens March 19 – Due April 1 – Posted on Training Webpage
A certificate of completion will be awarded to those who attend all live sessions and
complete the homework assignment(s) before the given due date.
How to Ask Questions
• Please put your questions in the Questions box and we will address them at the
end of the webinar.
• Feel free to enter your questions as we go. We will try to get to all the questions
during the Q&A session after the webinar.
• The remainder of the questions will be answered in the Q&A document, which will
be posted to the training website about a week after the training.
Part 1 – Trainers
Large Scale Applications of Machine Learning using
Remote Sensing for Building Agriculture Solutions
Part 1: Data Preparation of Imagery & Labels for
Large-Scale ML Modeling
Part 1 Objectives
• Submit lists of boundaries to the NASS API and retrieve CDL rasters back.
• Subsample and visualize retrieved data from CDL with interactive spatial images and
other statistical plots.
• Obtain Sentinel-2 raster files for a given area and timeframe corresponding to the
retrieved CDL data and manipulate the Sentinel-2 rasters into tables in preparation for
analysis and model training.
• Verify correct processing of data via various interactive plots (e.g. time series of pixels of
various land covers).
Part 1 Section 1:
Irregularly Spaced Time Series Modeling
[Figure annotation: irregular spacing/timing]
Motivation for this Example
• The CDL algorithm is already using the same (or similar) satellite
data sources and irregular spacing/timing to make predictions
(documented example of success)
• Labels are readily available via API calls (highly
scalable/available).
• Accuracy is well studied & documented
• Resulting code & methods are highly transferable to other
problems/use cases
Predicting the CDL in Real-Time
According to the CDL FAQs,
• “The CDL Program uses medium spatial
resolution (30 meter) satellite imagery
because it’s too costly to use higher
resolution satellites to perform crop acreage
estimation over large areas.”
• The CDL is considered confidential and
market sensitive during the growing season
and cannot be released until after the
official NASS year end area county estimates
are published in late January/early February
following the end of the typical US growing
season
• The CDL only gives estimates of the types of crops, but not their sequence or timing (e.g.
for double crops).
However… do we really need to wait until the following year for accurate estimates?
“By early July we probably are pretty confident in the crop type just from NDVI in this area
(probably even more so by using multispectral time series). By late August we are really
confident (6 months prior to when CDL is typically released).”
Generalizable Approach
• Robust models require large-scale data management tools and approaches.
Leveraging multi-core and multi-machine parallel computing is a necessary
step to scale. We demonstrate these tools & approaches with this series.
• Note that a similar approach to crop modeling with the time-series of imagery
could be used for estimating the crop health or other time-dependent factors
as well (simply substitute the label/target).
Irregularly-Spaced Time-Series Modeling
• There is a dearth of statistical theory around unevenly spaced time series, and thus
few out-of-the-box methods that apply directly to such situations.
– The most common solution is to manipulate the data into a regularly spaced time
series, then apply standard methods, e.g. interpolation or interval binning (the
latter being what the CDL algorithm does).
– Note that the resulting data format from this Part 1 demo will support any modeling
approach.
– For simplicity, we follow a similar approach to the CDL (binned intervals) for the
data loaders & model training in Parts 2 & 3 of this demo.
• Newer ML sequence models such as transformers (“self-attention”) accept
positional encodings of inputs/outputs and learn meaningful absolute and relative
input (and output) information. This could facilitate direct modeling of unevenly-
spaced satellite data, but due to increased complexity we do not incorporate it
here.
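The interval-binning idea above can be sketched in a few lines of Python. This is a minimal illustration, not the CDL algorithm or the demo code: the function name, the 14-day bin width, the choice to average values within a bin, and the sample values are all assumptions made for the example.

```python
from datetime import date
from statistics import mean


def bin_time_series(observations, start, bin_days=14, n_bins=26):
    """Average irregularly spaced (date, value) pairs into fixed-width bins.

    Bins with no observations are returned as None so the gap stays explicit.
    """
    buckets = [[] for _ in range(n_bins)]
    for obs_date, value in observations:
        idx = (obs_date - start).days // bin_days
        if 0 <= idx < n_bins:  # ignore observations outside the binned window
            buckets[idx].append(value)
    return [mean(b) if b else None for b in buckets]


# Hypothetical NDVI-like values for one pixel, captured at uneven intervals.
obs = [
    (date(2021, 1, 3), 0.20),
    (date(2021, 1, 8), 0.30),
    (date(2021, 1, 20), 0.25),
    (date(2021, 2, 25), 0.40),
]
binned = bin_time_series(obs, start=date(2021, 1, 1), bin_days=14, n_bins=5)
print(binned)
```

Two observations fall in the first bin and are averaged, while bins with no usable scene stay empty, which is the kind of gap a downstream model or interpolation step has to handle.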
Part 1 Section 1:
Cropland Data Layer (CDL)
Cropland Data Layer (Contiguous United States)
The best place to find information about the CDL is the USDA NASS FAQs & metadata.
Some relevant info:
• Model: decision tree classifier (handles missing, non-continuous, non-normal,
nonlinear data, efficient computation). Probabilistic output (argmax to class)
• Input: Landsat 8 and 9 OLI/TIRS, ISRO ResourceSat-2 LISS-3, and ESA SENTINEL-2A
and -2B. Imagery is downloaded daily with the objective of obtaining at least one
cloud-free usable image every two weeks throughout the growing season
• Ground truth: FSA Common Land Unit (CLU) for ag/crops & National Land Cover
Database (NLCD) for non-ag areas
• Accuracy: Generally, 85% to 95% correct for the major crop-specific land cover
categories. 30m resolution
CDL Accuracy
As noted earlier, the CDL accuracy is well studied and documented. Below is an excerpt from the USDA
quality checks for Arkansas 2022.
• Accuracy reports for all states and all years can be found in the USDA NASS Cropland metadata.
• Some crops like sorghum (in this case) may have low accuracy, but they also represent a tiny
proportion of the farmland. Training a model specifically focused on certain classes could boost
accuracy for those classes, at the expense of others.
*Correct Pixels represents the total number of independent validation pixels correctly identified in the error matrix.
**The Overall Accuracy represents only the FSA row crops and annual fruit and vegetables
Rough calculations of data size for CDL on Contiguous US
Even though only about 20% of the US is specifically used for cropland, any given
model must run across the entire US land area to classify it.
Assuming:
• Use Sentinel-2 for land classification at 10 m resolution (100 m² per pixel)
• 1 cloud-free image every 2 weeks (26 images total per 100 m² pixel)
• 12 bands at 2 bytes (16 bits) per band value per pixel
• ~7,500,000 km² of land
≈44 TB of data post-processed to run a predictive model. While not trivial, 10 TB drives only
cost $200. Using 30 m resolution drops this down to ~5 TB to work with.
If only using Sentinel-2, roughly 28 TB of raster data must be downloaded and processed over
that land, since Sentinel-2 tiles are 110 km x 110 km and have 12 bands, amounting to roughly
640 MB per scene, and scenes occur every 5 days.
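The arithmetic behind these estimates can be reproduced with a short script. This is a back-of-the-envelope sketch using only the assumptions listed above; the function names are made up for the example, sizes are in decimal terabytes, and tile overlap is ignored. It gives roughly 47 TB for the 10 m table (about 43 TiB in binary units), in line with the ≈44 TB figure, about 5 TB at 30 m, and about 29 TB of raw scenes.

```python
LAND_AREA_KM2 = 7_500_000   # approximate land area from the assumptions above
BANDS = 12
BYTES_PER_VALUE = 2         # 16-bit band values
IMAGES_PER_YEAR = 26        # one cloud-free image every two weeks


def table_size_tb(resolution_m):
    """Decimal terabytes for one year of per-pixel band values."""
    pixels = LAND_AREA_KM2 * 1_000_000 / resolution_m ** 2
    return pixels * IMAGES_PER_YEAR * BANDS * BYTES_PER_VALUE / 1e12


def raw_download_tb(tile_km=110, scene_mb=640, revisit_days=5):
    """Decimal terabytes for one year of raw Sentinel-2 scenes (no tile overlap)."""
    tiles = LAND_AREA_KM2 / tile_km ** 2
    scenes_per_tile = 365 / revisit_days
    return tiles * scenes_per_tile * scene_mb / 1e6


size_10m = table_size_tb(10)
size_30m = table_size_tb(30)
raw = raw_download_tb()
print(f"10 m table: {size_10m:.1f} TB, 30 m table: {size_30m:.1f} TB, raw scenes: {raw:.1f} TB")
```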
Part 1 Section 2:
Sentinel-2 Optical Data
Factors Affecting Quality & Temporal Spacing of Satellite Data
Example factors affecting data:
• Irregularity
– Orbit path overlap – closer to the poles leads to increased coverage
– Orbit path/image capture repeatability – the exact position of the image can vary east/west,
so areas on the edge of scenes have more uncertain coverage.
– Thick cloud cover – when present and identified (thus ignored) there is a gap in coverage.
– SCL (scene classification layer) errors – when we ignore data from a scene because of its SCL
category but that category is wrong, we introduce unnecessary gaps in coverage.
• Quality
– Thin cloud haze – cloud cover isn’t Boolean, it’s a gradient. Sometimes hard to identify when
thin (it alters the reflectance values and may not be caught)
– Tiling system overlap – at the edge of tiles (which is how the data is stored & queried) there is
overlap and slight differences between the values for the same scene and location in
different tiles.
– Geolocation/georeferencing – location of pixels can be incorrect and vary by more than the
size of a pixel (resulting in wrong information for a point location)
Sentinel-2 Orbit Path Overlap
The number of tracks increases further from the equator. Thus, certain parts of the US on
the borders of orbits get up to 2x coverage per satellite, which results in intervals of
2 or 3 days instead of 5.
Reference: https://sentinels.copernicus.eu/web/sentinel/user-guides/sentinel-2-msi/revisit-coverage
Orbit Path Repeatability
Within a distance of 1km, the same nominal orbit path has several actual
orbit path variations. This can again affect availability of imagery.
Sentinel-2 Tile System Overview
Some considerations to be aware of when processing S2 data for analysis & modeling:
• Scenes captured from Sentinel-2 are processed and made available in a unique tiling system
that is a slightly modified version of the military grid reference system (MGRS).
The tiles overlap and can result in values from the same scene in up to four different tiles.
There also exist “joints” in the grid to tie it together due to overlaying a grid on a sphere.
[Figure: tiling grid around St. Louis, MO (reference city), with a “joint” labeled]
Reference: https://maps.eatlas.org.au/
Sentinel-2 Orbit Paths VS Tiling Grid
Tile “Joint”
Reference: Sentinelhub EO Browser
Scene Classification Layer (SCL)
The SCL band is useful for rapid identification of data of interest. While not a proper land-cover
classifier like the CDL, it facilitates rapid classification of per-scene pixels into 12 [mostly potentially
transient] categories. Some highlights:
• The most common use case is identification of cloud cover, and there is a separate cloud mask
available with probabilities (using the Sen2Cor algorithm). We also use the SCL to identify
vegetation in this demo
• 60m resolution, using single pixel from a single scene for prediction
• Can be error prone
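As a rough illustration of SCL-based filtering, the sketch below drops observations whose SCL code indicates clouds, cloud shadow, or bad data, and then picks out vegetation pixels. The code-to-label table follows the published Sentinel-2 Level-2A scene classification legend as we understand it (label wording varies slightly between processing baselines, so check the product documentation). The choice of which classes to discard, the row layout, and the band values are assumptions for this example, not the demo's actual settings.

```python
# SCL codes for Sentinel-2 L2A products (verify against your processing baseline).
SCL_CLASSES = {
    0: "no data",
    1: "saturated or defective",
    2: "dark area / cast shadows",
    3: "cloud shadows",
    4: "vegetation",
    5: "not vegetated",
    6: "water",
    7: "unclassified",
    8: "cloud medium probability",
    9: "cloud high probability",
    10: "thin cirrus",
    11: "snow or ice",
}

# Assumed set of classes to discard; adjust to the modeling objective.
UNUSABLE = {0, 1, 3, 8, 9, 10}
VEGETATION = 4


def filter_observations(rows):
    """Keep only rows whose SCL code is not in the unusable set."""
    return [r for r in rows if r["scl"] not in UNUSABLE]


# Hypothetical per-scene observations for a single pixel.
rows = [
    {"date": "2021-06-01", "scl": 4, "b08": 3100},
    {"date": "2021-06-06", "scl": 9, "b08": 7800},
    {"date": "2021-06-11", "scl": 5, "b08": 2500},
]
clear_rows = filter_observations(rows)
vegetation_dates = [r["date"] for r in clear_rows if r["scl"] == VEGETATION]
print(clear_rows, vegetation_dates)
```

Because the SCL can be wrong, a filter like this trades some lost valid observations for fewer contaminated ones.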
Common Issues/Limitations
Geolocation/georeferencing of pixels can be inconsistent, and default scene classification labels
from providers (e.g., the SCL layer from Sentinel-2) aren’t always accurate.
• Inconsistent geolocation (10-15 m disagreement between subsequent images 5 days apart)
• Poor scene classification, mislabeling clouds as bare ground and water as snow due to cloud
haze, etc.
Part 1 Section 3:
Databricks Procedural Demo (Run the code)
Databricks Community Edition Overview
Link to instructions to sign up for Databricks Community Edition
• Jupyter Notebook style coding
• Databricks Community Edition allows up to 10GB persistent storage on the “FileStore”
– Can store generic files, tables, and code.
– Notebooks are stored in the “workspace” area
• Can spin up small instances with 2 CPUs, 15 GB RAM, and 130 GB local storage, with Spark
enabled out of the box
• Anything stored on the local machine is lost when the instance shuts down
• Running notebook code for longer than ~60 minutes will cause the node to shut down.
However, as long as you are interacting with the notebook (writing and running
code manually), it will usually stay up longer
Demo Information and Notes
Available materials for this demo:
• Three data processing scripts for Part-1 (CDL acquisition, Sentinel-2 acquisition,
final data manipulation)
• Generalization: The CDL table could be any ground truth label + point location +
timeframe, and the rest of the data acquisition & modeling can remain the same.
E.g., crop health, vegetative stage, or other types of land cover classes.
• Any systematic error from the CDL will likely pass to models trained on it.
• To get data as it would exist after running the scripts in this training demonstration,
download the zip files located with the other training materials.
How to download the resulting files from your Databricks account FileStore:
• This is somewhat unintuitive. To download a file, navigate to the path of YOUR
file using the format below.
– https://community.cloud.databricks.com/files/path/to/folder/filename.extension
• ‘path/to/folder’ is the directory path where your file exists on the FileStore.
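A small helper can build the download URL from a FileStore path. This is a sketch under the assumption that files stored under /FileStore are served under the /files/ prefix shown in the format above; the function name and the example path are hypothetical, and your workspace may need additional URL parameters.

```python
from urllib.parse import quote

BASE_URL = "https://community.cloud.databricks.com/files/"


def filestore_download_url(dbfs_path):
    """Map a FileStore path to the browser download URL format shown above."""
    path = dbfs_path
    # Accept both "dbfs:/FileStore/..." and "/dbfs/FileStore/..." spellings.
    for prefix in ("dbfs:", "/dbfs"):
        if path.startswith(prefix):
            path = path[len(prefix):]
    path = path.lstrip("/")
    if not path.startswith("FileStore/"):
        raise ValueError("only files under /FileStore can be downloaded this way")
    relative = path[len("FileStore/"):]
    return BASE_URL + quote(relative)  # quote() leaves "/" intact by default


# Hypothetical file written by one of the processing scripts.
url = filestore_download_url("dbfs:/FileStore/arset/cdl/part-0000.parquet")
print(url)
```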
Code Steps
APIs Brief Overview
In place of manually retrieving data, APIs make data acquisition & processing
significantly more scalable by providing a consistent interface to search for and
retrieve large amounts of data via web requests. We primarily rely on two APIs in this
demo for data acquisition.
• CDL API from NASS geo data
• AWS STAC API for Sentinel-2 imagery searches.
– Sentinel-2 image raster data can be downloaded via web URL download links
that we can access directly once known from the imagery search. Since these
are large though and slow down processing, we will do our best to minimize
the downloading of any superfluous or low-impact scenes.
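To make the imagery search concrete, the sketch below builds the JSON body of a STAC item search that limits results by area, date range, and cloud cover, which is one way to avoid downloading low-value scenes. The slides do not name the endpoint, so the URL and collection name (Element 84 Earth Search on AWS) are assumptions, as are the function name, the bounding box, and the cloud-cover threshold. The code only constructs the request; sending it with an HTTP client is left out.

```python
import json

# Assumed public endpoint and collection name; verify before use.
STAC_SEARCH_URL = "https://earth-search.aws.element84.com/v1/search"
COLLECTION = "sentinel-2-l2a"


def build_search_body(bbox_lonlat, start_date, end_date, max_cloud_pct=50, limit=100):
    """Build a STAC item-search body for one AOI and date range."""
    return {
        "collections": [COLLECTION],
        "bbox": list(bbox_lonlat),  # (west, south, east, north) in lon/lat degrees
        "datetime": f"{start_date}T00:00:00Z/{end_date}T23:59:59Z",
        "limit": limit,
        # Query-extension filter to skip very cloudy scenes (assumed threshold).
        "query": {"eo:cloud_cover": {"lt": max_cloud_pct}},
    }


# Hypothetical AOI near St. Louis, MO for the 2021 calendar year.
body = build_search_body((-90.4, 38.5, -90.1, 38.8), "2021-01-01", "2021-12-31")
payload = json.dumps(body)
print(payload)
```

The search response lists matching scenes with links to each band's raster file, so the download step can fetch only the bands and scenes that passed the filter.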
Area of Interest (AOI) & Boundary Creation
Typically, the only step needed before this part is to define the AOIs. For this work we used the
nassgeodata web GUI to draw 7 boxes and export them as ESRI shapefiles, then converted them to
bounds (left, bottom, right, top) in the EPSG:5070 CRS (as required by the nassgeodata API). We
provide the bounds already in the CDL acquisition code.
Example Python code to get bounds from [zipped] ESRI shapefiles from NASSGEO:
import geopandas as gpd
from pyproj import CRS
import pandas as pd

root_path = 'C:/Users/myname/Downloads/'
# List of file paths for ESRI shapefile boundaries (exported into zips from nassgeo)
paths = [root_path + 'CDL_12345.zip', root_path + 'CDL_6789.zip']
gdf_list = []  # Create a list to hold the GeoPandas dataframes
# Read each shapefile into a GeoPandas dataframe and append it to the list
for path in paths:
    gdf = gpd.read_file("zip://" + path)
    gdf_list.append(gdf)
# Concatenate all the dataframes into a single GeoPandas dataframe
combined_gdf = gpd.GeoDataFrame(pd.concat(gdf_list, ignore_index=True))
target_crs = CRS("EPSG:5070")  # Define the target CRS (EPSG:5070)
gdf_5070 = combined_gdf.to_crs(target_crs)  # Convert the GeoDataFrame to the target CRS
# Print one "left,bottom,right,top" line of integer bounds per boundary
print(gdf_5070.bounds.apply(lambda row: ','.join(map(str, map(int, row))),
                            axis=1).to_string(index=False))
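Once the bounds are known, each (year, bounds) pair can be turned into a CDL request. The sketch below only builds the request URL. The endpoint and the `year`/`bbox` parameter names are taken from our reading of the CropScape developer documentation and should be verified before use; the function name and the example bounds are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint from the CropScape developer documentation; verify before use.
CDL_FILE_ENDPOINT = "https://nassgeodata.gmu.edu/axis2/services/CDLService/GetCDLFile"


def cdl_request_url(year, bounds):
    """Build a CDL request URL for one year and one EPSG:5070 bounding box."""
    left, bottom, right, top = bounds
    if not (left < right and bottom < top):
        raise ValueError("bounds must be ordered as (left, bottom, right, top)")
    bbox = ",".join(str(int(v)) for v in bounds)
    # Keep commas unescaped so the bbox reads as left,bottom,right,top.
    return CDL_FILE_ENDPOINT + "?" + urlencode({"year": year, "bbox": bbox}, safe=",")


# Hypothetical bounds in EPSG:5070 meters.
url = cdl_request_url(2021, (130783, 2203171, 153923, 2217961))
print(url)
```

Looping this over a list of AOI bounds and years is what makes the label acquisition scale.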
CDL Acquisition Code Summary
The result of this first part is a spatially down-sampled version of the CDL for the user-specified
AOIs and years.
• This part of the code executes quite rapidly (a few minutes) and results in a Parquet table with a
single 30 m pixel/year per row (with the associated CDL estimate).
• The table below summarizes the top 5 CDL categories across all AOIs, and the % of the entire dataset
per year that each CDL category represents. E.g., in 2021 soybeans represented ~36.6% of the land
cover for that year.
Sentinel-2 Acquisition Code Results
Summary: For each pixel/year from the CDL acquisition code, this code acquires the associated
Sentinel-2 data for that entire year and saves it in a Parquet table. Note that this part takes a
long time to execute. It will time out the free version of Databricks after an hour.
“Duplicate” data has the same date & tile but slightly different values, perhaps due to updated
processing. These are left in the data for this work and removed in the data loader, but it is
probably best to keep only the one with the largest number at the end of the tile (latest processing?).
When duplicate values are due to tile overlap, choosing one randomly can be fine.
…123 rows total available for this particular pixel/year. Each row includes the
band values for that location from a single scene/date.
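The suggested handling of “duplicate” rows, keeping the one with the largest trailing number for each date and tile, can be sketched as follows. The scene ID layout (a sequence number just before the final suffix), the field names, and the values are assumptions for illustration; the actual identifiers in your table may be structured differently.

```python
def scene_sequence(scene_id):
    """Return the integer before the trailing suffix, e.g. 1 in 'S2A_15SYC_20210605_1_L2A'."""
    return int(scene_id.split("_")[-2])


def drop_duplicates(rows):
    """Keep one row per (date, tile): the one with the highest sequence number."""
    best = {}
    for row in rows:
        key = (row["date"], row["tile"])
        current = best.get(key)
        if current is None or scene_sequence(row["scene_id"]) > scene_sequence(current["scene_id"]):
            best[key] = row
    return sorted(best.values(), key=lambda r: (r["date"], r["tile"]))


# Hypothetical rows for one pixel; the first two share a date and tile.
rows = [
    {"date": "2021-06-05", "tile": "15SYC", "scene_id": "S2A_15SYC_20210605_0_L2A", "b04": 812},
    {"date": "2021-06-05", "tile": "15SYC", "scene_id": "S2A_15SYC_20210605_1_L2A", "b04": 815},
    {"date": "2021-06-10", "tile": "15SYC", "scene_id": "S2B_15SYC_20210610_0_L2A", "b04": 640},
]
deduped = drop_duplicates(rows)
print(deduped)
```

Duplicates that come from tile overlap have different tile values, so they pass through this step untouched and can be resolved separately, for example by picking one at random.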
Sentinel-2 Acquisition Code (Plotted)
Example data from a single pixel/year of corn. Includes duplicated data.
Example “duplicate” data (slightly different band values for same scene)
Clearly an error in classification (only happens once, and during the middle of the growing season).
Part 1:
Summary
• APIs allow us to automate and scale very large data processing pipelines in
preparation for analysis and model building.
• Storing data in Parquet format and using Spark/Databricks to query/pivot or
manipulate the data enables rapid investigation and transformation
– The Parquet format has useful abstractions like partitions, which are also
directories
• A convenient form for modeling time-series imagery data involves storing in
parquet table format, with each row representing a pixel for a given time interval
and having columns of:
– Band values, scene dates, scene classification values over that time interval
– Scalars for lat, lon representing the center point of the pixel (could substitute
an Uber H3 hex or Google S2 cell instead)
– A prediction target (ground truth)
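One possible shape for such a row is sketched below as a Python dataclass. The field names, the choice of bands, and the values are illustrative assumptions rather than the schema used in the demo; the target uses CDL code 5 (soybeans) as an example.

```python
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class PixelIntervalRow:
    """One table row: a pixel over one time interval, plus its target."""
    lat: float               # center of the pixel
    lon: float
    year: int
    interval: int            # index of the time bin within the year
    scene_dates: List[str]   # one entry per scene in the interval
    scl: List[int]           # scene classification value per scene
    b04: List[int]           # red band value per scene
    b08: List[int]           # near-infrared band value per scene
    cdl_class: int           # prediction target (ground truth)


# Hypothetical row with two scenes falling inside the same interval.
row = PixelIntervalRow(
    lat=38.63, lon=-90.20, year=2021, interval=11,
    scene_dates=["2021-06-05", "2021-06-10"],
    scl=[4, 4], b04=[815, 640], b08=[3100, 3350],
    cdl_class=5,
)
record = asdict(row)
print(record)
```

Keeping the per-scene values as parallel lists preserves the original irregular timing, so the same table can feed either a binned model or one that consumes the raw dates.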
Looking Ahead to Part 2
Homework and Certificates
• Homework:
– One homework assignment
– Opens on March 19
– Access from the training webpage
– Answers must be submitted via Google Forms
– Due by April 1
• Certificate of Completion:
– Attend all three live webinars (attendance is recorded automatically)
– Complete the homework assignment by the deadline
– You will receive a certificate via email approximately two months after
completion of the course.
Contact Information
• Sean McCartney
– [email protected]
– [email protected]
Visit our Sister Programs:
• DEVELOP
• SERVIR
Questions?
https://earthobservatory.nasa.gov/images/6034/pothole-lakes-in-siberia
Thank You!