# Raster Catalogs
While interesting processing can be done on a @ref:[single raster file](raster-read.md#single-raster), RasterFrames shines when _catalogs_ of raster data are to be processed. In its simplest form, a _catalog_ is a list of @ref:[URLs referencing raster files](raster-read.md#uri-formats). This list can be a Spark DataFrame, Pandas DataFrame, CSV file or CSV string. The _catalog_ is input into the `raster` DataSource described in the @ref:[next page](raster-read.md), which creates _tiles_ from the rasters at the referenced URLs.
A _catalog_ can have one or two dimensions:
* One-D: A single column contains raster URLs across the rows. All referenced rasters represent the same @ref:[band](concepts.md#band). For example, a column of URLs to Landsat 8 near-infrared rasters covering Europe. Each row represents different places and times.
* Two-D: Many columns contain raster URLs. Each column references the same band, and each row represents the same place and time. For example, red-, green-, and blue-band columns for scenes covering Europe. Each row represents a single @ref:[scene](concepts.md#scene) with the same resolution, extent, [_CRS_][CRS], etc. across the row.
## Creating a Catalog
This section provides some examples of _catalog_ creation and introduces some experimental _catalogs_ built into RasterFrames. Reading raster data represented by a _catalog_ is covered in more detail in the @ref:[next page](raster-read.md).
```python, setup, echo=False
from pyrasterframes.utils import create_rf_spark_session
from pyrasterframes.rasterfunctions import *
import pyrasterframes.rf_ipython
import pandas as pd
spark = create_rf_spark_session()
```
### One-D
A single URL is the simplest form of a catalog.
```python, oned_onerow_catalog
file_uri = "/data/raster/myfile.tif"
# Pandas DF
my_cat = pd.DataFrame({'B01': [file_uri]})
# equivalent Spark DF
from pyspark.sql import Row
my_cat = spark.createDataFrame([Row(B01=file_uri)])
# equivalent CSV string
my_cat = "B01\n{}".format(file_uri)
```
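As a quick sanity check of the claim that these forms are equivalent, the CSV string can be parsed with pandas and compared to the pandas catalog. This is a minimal sketch using only pandas and the standard library; the names `cat_pd`, `cat_csv`, and `cat_from_csv` are illustrative and not part of RasterFrames.

```python
import io
import pandas as pd

file_uri = "/data/raster/myfile.tif"

# The pandas and CSV-string forms of the same one-row catalog
cat_pd = pd.DataFrame({'B01': [file_uri]})
cat_csv = "B01\n{}".format(file_uri)

# Parse the CSV string and compare its header and rows with the pandas catalog
cat_from_csv = pd.read_csv(io.StringIO(cat_csv))
same_columns = list(cat_from_csv.columns) == list(cat_pd.columns)
same_rows = cat_from_csv['B01'].tolist() == cat_pd['B01'].tolist()
print(same_columns, same_rows)
```

The first line of the CSV string acts as the header, so it becomes the column name `B01`, and each later line becomes one row.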
A single column represents the same content type with different observations along the rows. In this example it is band 1 of MODIS surface reflectance, which is visible red. Both images cover the same location, indicated by the granule identifier `h04v09`, but the dates differ: 2018185 (July 4, 2018) and 2018188 (July 7, 2018).
```python, oned_tworow_catalog
scene1_B01 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/MCD43A4.A2018185.h04v09.006.2018194032851_B01.TIF"
scene2_B01 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018188/MCD43A4.A2018188.h04v09.006.2018198232008_B01.TIF"
# a pandas DF
one_d_cat_pd = pd.DataFrame({'B01': [scene1_B01, scene2_B01]})
# equivalent spark DF
one_d_cat_df = spark.createDataFrame([Row(B01=scene1_B01), Row(B01=scene2_B01)])
# equivalent CSV string
one_d_cat_csv = '\n'.join(['B01', scene1_B01, scene2_B01])
```
This is what it looks like in DataFrame form:
```python, view_spark_cat_1
one_d_cat_df
```
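The date component of the URLs above, such as `2018185`, is a four-digit year followed by a three-digit day of year. The following sketch decodes it with the standard library; `modis_day_to_date` is a hypothetical helper name, not a RasterFrames function.

```python
from datetime import datetime

def modis_day_to_date(year_doy):
    """Convert a 'YYYYDDD' string (year plus day of year) to a date."""
    return datetime.strptime(year_doy, "%Y%j").date()

# The two acquisition dates used in the catalog above
for token in ["2018185", "2018188"]:
    print(token, modis_day_to_date(token).isoformat())
```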
### Two-D
In this example, multiple columns represent multiple content types (bands) across multiple scenes. In each row, the scene is the same: granule id `h04v09` on July 4 or July 7, 2018. The first column is band 1, red, and the second is band 2, near infrared.
```python, twod_catalog
scene1_B01 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/MCD43A4.A2018185.h04v09.006.2018194032851_B01.TIF"
scene1_B02 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/MCD43A4.A2018185.h04v09.006.2018194032851_B02.TIF"
scene2_B01 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018188/MCD43A4.A2018188.h04v09.006.2018198232008_B01.TIF"
scene2_B02 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018188/MCD43A4.A2018188.h04v09.006.2018198232008_B02.TIF"
# Pandas DF: one dict per row, with a plain string (not a list) in each cell
two_d_cat_pd = pd.DataFrame([
    {'B01': scene1_B01, 'B02': scene1_B02},
    {'B01': scene2_B01, 'B02': scene2_B02}
])
# equivalent Spark DF
two_d_cat_df = spark.createDataFrame([
    Row(B01=scene1_B01, B02=scene1_B02),
    Row(B01=scene2_B01, B02=scene2_B02)
])
# equivalent CSV string
two_d_cat_csv = '\n'.join(['B01,B02', scene1_B01 + "," + scene1_B02, scene2_B01 + "," + scene2_B02])
```
This is what it looks like in DataFrame form:
```python, view_spark_cat_2
two_d_cat_df
```
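For catalogs with more columns, joining strings by hand becomes error-prone. One alternative, sketched here with pandas only, is to build the pandas catalog first and let `to_csv` produce the CSV text. The names `catalog_pd` and `catalog_csv` are illustrative, and the sketch only shows that the output matches the hand-built string.

```python
import pandas as pd

prefix = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/"
scene1 = prefix + "2018185/MCD43A4.A2018185.h04v09.006.2018194032851"
scene2 = prefix + "2018188/MCD43A4.A2018188.h04v09.006.2018198232008"

# One dict per scene (row), one key per band (column)
catalog_pd = pd.DataFrame([
    {'B01': scene1 + "_B01.TIF", 'B02': scene1 + "_B02.TIF"},
    {'B01': scene2 + "_B01.TIF", 'B02': scene2 + "_B02.TIF"},
])

# Serialize without the index column; normalize line endings and drop the trailing newline
catalog_csv = '\n'.join(catalog_pd.to_csv(index=False).splitlines())
print(catalog_csv)
```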
## Using External Catalogs
The concept of a _catalog_ is much more powerful when we consider examples beyond constructing the DataFrame, and instead read the data from an external source. Here's an extended example of reading a cloud-hosted CSV file containing MODIS scene metadata and transforming it into a _catalog_. The metadata describing the content of each URL is an important aspect of processing raster data.
```python, remote_csv
from pyspark import SparkFiles
from pyspark.sql import functions as F
spark.sparkContext.addFile("https://modis-pds.s3.amazonaws.com/MCD43A4.006/2018-07-04_scenes.txt")
scene_list = spark.read \
    .format("csv") \
    .option("header", "true") \
    .load(SparkFiles.get("2018-07-04_scenes.txt"))
scene_list
```
Observe that the scene list file has URIs to `index.html` files in the `download_url` column. The image URIs are in the same directory. The filenames are of the form `${gid}_B${band}.TIF`. The next code chunk builds these URIs, which completes our catalog.
```python, show_remote_catalog
modis_catalog = scene_list \
    .withColumn('base_url',
        # strip the trailing index.html, then append the scene's gid
        F.concat(F.regexp_replace('download_url', r'index\.html$', ''), 'gid')
    ) \
    .withColumn('B01', F.concat('base_url', F.lit("_B01.TIF"))) \
    .withColumn('B02', F.concat('base_url', F.lit("_B02.TIF"))) \
    .withColumn('B03', F.concat('base_url', F.lit("_B03.TIF")))
modis_catalog
```
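The same string transformation can be traced on a single row in plain Python. This is only a sketch: `band_uri` is a hypothetical helper, and the `download_url` and `gid` values are assumed, inferred from the scene URLs used earlier on this page rather than read from the scene list.

```python
import re

def band_uri(download_url, gid, band):
    """Build a band URI of the form <directory>/<gid>_B<band>.TIF."""
    # Drop the trailing index.html to get the scene directory, then append the gid
    base_url = re.sub(r'index\.html$', '', download_url) + gid
    return "{}_B{:02d}.TIF".format(base_url, band)

# Assumed example values (not taken from the scene list itself)
download_url = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/index.html"
gid = "MCD43A4.A2018185.h04v09.006.2018194032851"

print(band_uri(download_url, gid, 1))
```

With these inputs, the result for band 1 is the `scene1_B01` URL used in the hand-built catalogs above.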
## Using Built-in Catalogs
RasterFrames comes with two experimental catalogs over the AWS PDS [Landsat 8][Landsat] and [MODIS][MODIS] repositories. They are created by downloading the latest scene lists and building up the appropriate band URI columns as in the prior example.
> Note: The first time you run these, it may take some time, as the catalogs are large and have to be downloaded. However, they are cached, and subsequent invocations should be faster.
### MODIS
```python, evaluate=False
modis_catalog = spark.read.format('aws-pds-modis-catalog').load()
modis_catalog.printSchema()
```
```
root
|-- product_id: string (nullable = false)
|-- acquisition_date: timestamp (nullable = false)
|-- granule_id: string (nullable = false)
|-- gid: string (nullable = false)
|-- B01: string (nullable = true)
|-- B01qa: string (nullable = true)
|-- B02: string (nullable = true)
|-- B02qa: string (nullable = true)
|-- B03: string (nullable = true)
|-- B03qa: string (nullable = true)
|-- B04: string (nullable = true)
|-- B04qa: string (nullable = true)
|-- B05: string (nullable = true)
|-- B05qa: string (nullable = true)
|-- B06: string (nullable = true)
|-- B06qa: string (nullable = true)
|-- B07: string (nullable = true)
|-- B07qa: string (nullable = true)
```
### Landsat 8
The Landsat 8 catalog includes a richer set of metadata describing the contents of each scene.
```python, evaluate=False
l8 = spark.read.format('aws-pds-l8-catalog').load()
l8.printSchema()
```
```
root
|-- product_id: string (nullable = false)
|-- entity_id: string (nullable = false)
|-- acquisition_date: timestamp (nullable = false)
|-- cloud_cover_pct: float (nullable = false)
|-- processing_level: string (nullable = false)
|-- path: short (nullable = false)
|-- row: short (nullable = false)
|-- bounds_wgs84: struct (nullable = false)
| |-- minX: double (nullable = false)
| |-- maxX: double (nullable = false)
| |-- minY: double (nullable = false)
| |-- maxY: double (nullable = false)
|-- B1: string (nullable = true)
|-- B2: string (nullable = true)
|-- B3: string (nullable = true)
|-- B4: string (nullable = true)
|-- B5: string (nullable = true)
|-- B6: string (nullable = true)
|-- B7: string (nullable = true)
|-- B8: string (nullable = true)
|-- B9: string (nullable = true)
|-- B10: string (nullable = true)
|-- B11: string (nullable = true)
|-- BQA: string (nullable = true)
```
[MODIS]: https://docs.opendata.aws/modis-pds/readme.html
[Landsat]: https://docs.opendata.aws/landsat-pds/readme.html
[CRS]: https://en.wikipedia.org/wiki/Spatial_reference_system