OlmoEarth: export custom embeddings for EO | Keryc
OlmoEarth Studio now lets you compute and export custom embeddings for Earth observation. What does that mean in practice? It means compact vectors that capture landscape features from satellite images, ready for similarity search, few-shot segmentation, change detection and unsupervised exploration.
What embeddings are and why they're useful to you
An embedding is simply a numeric vector that represents the local information of a pixel or patch: surface types, texture, spectral signals and more. OlmoEarth produces these vectors with open foundation models and publishes both the code and the weights, so you can inspect exactly how they are generated.
Why use embeddings instead of a model trained for each task? Because they're fast, cheap and very versatile. You generate the representation once and then apply simple methods (dot product, linear regression, PCA) to different tasks without retraining the whole network.
How to compute embeddings in OlmoEarth Studio
The Studio workflow is the same as any inference: configure a model, run it and download the result. These are the key parameters you can control:
Area of interest: draw or upload any polygon; Studio handles acquisition and tiling of images.
Time span: from 1 to 12 monthly periods, so you can capture seasonal variation.
Spatial resolution: 10 m, 20 m, 40 m or 80 m per pixel.
Imagery sources: Sentinel-2 L2A, Sentinel-1 RTC or both.
Studio delivers a cloud-optimized GeoTIFF (COG) with one band per embedding dimension. The vectors are stored as signed 8-bit integers (int8) in the range -127 to +127, with -128 reserved for nodata. To recover floating-point vectors, use the dequantize_embeddings function from the olmoearth_pretrain repo.
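For intuition, here is a minimal sketch of what dequantization typically looks like, assuming a symmetric linear scheme with scale 1/127 and -128 as the nodata sentinel. The actual dequantize_embeddings function in olmoearth_pretrain may use a different scale or API, so treat this as an illustration, not a drop-in replacement:

```python
import numpy as np

def dequantize(q: np.ndarray, scale: float = 1.0 / 127.0) -> np.ndarray:
    """Map int8 values in [-127, 127] back to floats; -128 marks nodata."""
    out = q.astype(np.float32) * scale
    out[q == -128] = np.nan  # nodata pixels become NaN
    return out

q = np.array([[-128, -127, 0, 127]], dtype=np.int8)
v = dequantize(q)
# v ≈ [[nan, -1.0, 0.0, 1.0]]
```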
One important point: embeddings are computed on demand, not pulled from a globally precomputed file. That means they reflect exactly the temporal period and conditions you requested — for example monthly snapshots to detect seasonality.
Practical examples and code snippets
The examples use OlmoEarth-v1-Tiny (192 dim) at 40 m with Sentinel-2 L2A composites, unless stated otherwise. Tiny is lightweight and surprisingly capable; if you need more fidelity you can try Base at the cost of more compute and storage.
Similarity search
You pick a query pixel, extract its embedding and compute cosine similarity against all other pixels. The result is a map showing where the landscape resembles your query. Useful to find repeated patterns like irrigated fields, urban strips or similar wetlands.
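This step can be sketched in a few lines of NumPy. The function name and the random test data are illustrative; in practice `emb` would come from a dequantized Studio export:

```python
import numpy as np

def similarity_map(emb: np.ndarray, qy: int, qx: int) -> np.ndarray:
    """Cosine similarity of every pixel's embedding to the query pixel's.

    emb: (C, H, W) float embeddings; (qy, qx): query pixel coordinates.
    """
    C, H, W = emb.shape
    flat = emb.reshape(C, -1)                   # (C, H*W)
    flat = flat / np.linalg.norm(flat, axis=0)  # unit-normalize each pixel
    q = flat[:, qy * W + qx]                    # query vector (already unit)
    return (q @ flat).reshape(H, W)             # dot of unit vectors = cosine

rng = np.random.default_rng(0)
emb = rng.normal(size=(192, 64, 64)).astype(np.float32)
sim = similarity_map(emb, 10, 20)
# the query pixel matches itself: sim[10, 20] is ~1.0
```

Thresholding or sorting the resulting map surfaces the pixels most similar to your query.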
Few-shot segmentation
With very few labels you can train a linear classifier on embeddings and produce wall-to-wall maps. In one experiment, logistic regression on 60 labeled pixels in a mangrove area reached a weighted F1 of 0.84.
Minimal Python example to train and predict over an embeddings COG:
import rasterio
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Load the COG exported from Studio
with rasterio.open('embeddings.tif') as ds:
    emb = ds.read().astype(np.float32)  # (C, H, W)

C, H, W = emb.shape
X = emb.reshape(C, -1).T  # (H*W, C): one row per pixel

# train_idx: flat indices of the sampled pixels; labels: their class labels
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
clf.fit(X[train_idx], labels)

# Predict every pixel and reshape back to the image grid
prediction = clf.predict(X).reshape(H, W)
That flow is a linear probe. If a linear classifier already separates classes well, it means the learned representations contain the needed structure.
Change detection
Generate embeddings for two periods (for example September 2023 and September 2024) and measure per-pixel cosine distance. Strong changes like burn scars or floods show up with high distance without needing labels.
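A sketch of the per-pixel distance computation, using synthetic arrays in place of two real exports (the function name is illustrative):

```python
import numpy as np

def change_map(emb_t0: np.ndarray, emb_t1: np.ndarray) -> np.ndarray:
    """Per-pixel cosine distance between two (C, H, W) embedding stacks."""
    def unit(e):
        return e / np.linalg.norm(e, axis=0, keepdims=True)
    cos_sim = (unit(emb_t0) * unit(emb_t1)).sum(axis=0)
    return 1.0 - cos_sim  # 0 = identical, up to 2 = opposite

rng = np.random.default_rng(1)
a = rng.normal(size=(192, 32, 32)).astype(np.float32)
b = a.copy()
b[:, :8, :8] = rng.normal(size=(192, 8, 8))  # simulate change in one corner
dist = change_map(a, b)
# unchanged pixels have distance ~0; the altered corner stands out
```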
Unsupervised exploration
Apply PCA to reduce dimensions to three and map to R/G/B. It's a quick way to visualize what the model learned: crops, water and urban areas tend to color differently.
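One way to implement that visualization, assuming scikit-learn is available (the function name and random test data are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_rgb(emb: np.ndarray) -> np.ndarray:
    """Project (C, H, W) embeddings onto 3 principal components and
    rescale each component to [0, 1] for display as an RGB image."""
    C, H, W = emb.shape
    X = emb.reshape(C, -1).T                      # (H*W, C)
    comps = PCA(n_components=3).fit_transform(X)  # (H*W, 3)
    lo, hi = comps.min(axis=0), comps.max(axis=0)
    rgb = (comps - lo) / (hi - lo)                # per-channel min-max stretch
    return rgb.reshape(H, W, 3)

rng = np.random.default_rng(2)
emb = rng.normal(size=(192, 16, 16)).astype(np.float32)
img = pca_rgb(emb)
# img has shape (16, 16, 3) with values in [0, 1], ready for plt.imshow
```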
From export to reproducible analysis
Exported COGs work with QGIS, GDAL, rasterio or your scripts. OlmoEarth publishes code and weights, and the original post includes tutorials and a Colab notebook to run the examples without local setup.
If you need more performance for a specific task, Studio also supports supervised fine-tuning (SFT): you train a task head on your labels and typically outperform the linear probe over frozen features.
Limitations and practical recommendations
Input quality matters: persistent clouds, atmospheric artifacts or missing data in the composite affect embeddings.
Validate with a small reference set: 30 to 100 labels will give you a quick sense of quality.
Keep the int8 quantization in mind: use dequantize_embeddings to recover floats when you need numeric precision.
For time series, generate monthly embeddings instead of an annual summary if you want to detect seasonal phenomena or abrupt events.
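The validation recommendation above can be done with a quick cross-validated linear probe. Here synthetic, deliberately separable data stands in for real embeddings and labels; only the evaluation pattern is the point:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 60 labeled 192-dim embeddings, two classes
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 192))
y = np.repeat([0, 1], 30)
X[y == 1] += 1.0  # shift class 1 so the classes are separable

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_weighted")
print(f"weighted F1: {scores.mean():.2f} +/- {scores.std():.2f}")
```

If the cross-validated F1 on your reference set is poor, revisit the composite quality or label placement before scaling up.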
How to access
Custom embeddings are already available in OlmoEarth Studio. If you need export access contact the Studio team. The model code and weights are public, and the original post includes instructions to compute embeddings on your own using the open models.
Think of this as a toolbox: embeddings give you a dense, shareable representation of the territory, and with a few classic operations (similarity, linear regression, PCA) you can solve many practical tasks without training a model from scratch.