# Raster Catalogs

While interesting processing can be done on a @ref:[single raster file](raster-read.md#single-raster), RasterFrames shines when _catalogs_ of raster data are to be processed. In its simplest form, a _catalog_ is a list of @ref:[URLs referencing raster files](raster-read.md#uri-formats). This list can be a Spark DataFrame, Pandas DataFrame, CSV file or CSV string. The _catalog_ is input into the `raster` DataSource described in the @ref:[next page](raster-read.md), which creates _tiles_ from the rasters at the referenced URLs.

A _catalog_ can have one or two dimensions:

* One-D: A single column contains raster URLs across the rows. All referenced rasters represent the same @ref:[band](concepts.md#band). For example, a column of URLs to Landsat 8 near-infrared rasters covering Europe. Each row represents a different place and time.

* Two-D: Many columns contain raster URLs. All rasters in a column represent the same band, and all rasters in a row represent the same place and time. For example, red-, green-, and blue-band columns for scenes covering Europe. Each row represents a single @ref:[scene](concepts.md#scene) with the same resolution, extent, [_CRS_][CRS], etc. across the row.

## Creating a Catalog

This section provides some examples of _catalog_ creation and introduces some experimental _catalogs_ built into RasterFrames. Reading raster data represented by a _catalog_ is covered in more detail in the @ref:[next page](raster-read.md).

```python, setup, echo=False
from pyrasterframes.utils import create_rf_spark_session
from pyrasterframes.rasterfunctions import *
import pyrasterframes.rf_ipython
import pandas as pd
spark = create_rf_spark_session()
```

### One-D

A single URL is the simplest form of a catalog.

```python, oned_onerow_catalog
file_uri = "/data/raster/myfile.tif"
# Pandas DF
my_cat = pd.DataFrame({'B01': [file_uri]})

# equivalent Spark DF
from pyspark.sql import Row
my_cat = spark.createDataFrame([Row(B01=file_uri)])

# equivalent CSV string
my_cat = "B01\n{}".format(file_uri)
```

A single column represents the same content type, with different observations along the rows. In this example the column is band 1 of MODIS surface reflectance, which is visible red. The location of the images is the same, indicated by the granule identifier `h04v09`, but the dates differ: 2018185 (July 4, 2018) and 2018188 (July 7, 2018).

```python, oned_tworow_catalog
scene1_B01 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/MCD43A4.A2018185.h04v09.006.2018194032851_B01.TIF"
scene2_B01 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018188/MCD43A4.A2018188.h04v09.006.2018198232008_B01.TIF"

# a pandas DF
one_d_cat_pd = pd.DataFrame({'B01': [scene1_B01, scene2_B01]})

# equivalent spark DF
one_d_cat_df = spark.createDataFrame([Row(B01=scene1_B01), Row(B01=scene2_B01)])

# equivalent CSV string
one_d_cat_csv = '\n'.join(['B01', scene1_B01, scene2_B01])
```

This is what it looks like in DataFrame form:

```python, view_spark_cat_1

one_d_cat_df

```
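
The seven-digit dates in the MODIS URLs above combine a four-digit year with a three-digit day of year. As a minimal sketch, they can be decoded with the Python standard library. The helper name `parse_modis_date` is hypothetical and not part of RasterFrames.

```python
from datetime import date, datetime

def parse_modis_date(year_doy: str) -> date:
    """Convert a 'YYYYDDD' string (year plus day of year) to a calendar date."""
    return datetime.strptime(year_doy, "%Y%j").date()

scene1_date = parse_modis_date("2018185")
scene2_date = parse_modis_date("2018188")
print(scene1_date, scene2_date)  # 2018-07-04 2018-07-07
```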

### Two-D

In this example, multiple columns represent multiple content types (bands) across multiple scenes. In each row, the scene is the same: granule id `h04v09` on July 4 or July 7, 2018. The first column is band 1, red, and the second is band 2, near infrared.

```python, twod_catalog
scene1_B01 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/MCD43A4.A2018185.h04v09.006.2018194032851_B01.TIF"
scene1_B02 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/MCD43A4.A2018185.h04v09.006.2018194032851_B02.TIF"
scene2_B01 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018188/MCD43A4.A2018188.h04v09.006.2018198232008_B01.TIF"
scene2_B02 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018188/MCD43A4.A2018188.h04v09.006.2018198232008_B02.TIF"

# Pandas DF: one dict per row, with plain string values so each cell holds a URL
two_d_cat_pd = pd.DataFrame([
    {'B01': scene1_B01, 'B02': scene1_B02},
    {'B01': scene2_B01, 'B02': scene2_B02}
])

# or as a Spark DF
two_d_cat_df = spark.createDataFrame([
    Row(B01=scene1_B01, B02=scene1_B02),
    Row(B01=scene2_B01, B02=scene2_B02)
])

# As CSV string
two_d_cat_csv = '\n'.join(['B01,B02', scene1_B01 + "," + scene1_B02, scene2_B01 + "," + scene2_B02])
```

This is what it looks like in DataFrame form:

```python, view_spark_cat_2

two_d_cat_df

```
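
For more than a handful of scenes, writing each URL by hand becomes tedious. The following is a minimal sketch, using only the standard library, of generating an equivalent two-D CSV catalog from a list of scenes and band names. The `band_url` helper and the `scenes` list are illustrative assumptions, not part of RasterFrames; the URL layout simply mirrors the MODIS URLs used above.

```python
import csv
import io

BASE = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09"

# (acquisition day, scene identifier) pairs taken from the URLs above
scenes = [
    ("2018185", "MCD43A4.A2018185.h04v09.006.2018194032851"),
    ("2018188", "MCD43A4.A2018188.h04v09.006.2018198232008"),
]
bands = ["B01", "B02"]

def band_url(day: str, gid: str, band: str) -> str:
    """Hypothetical helper: build the URL of one band of one scene."""
    return f"{BASE}/{day}/{gid}_{band}.TIF"

# One header row, then one row per scene with one URL per band
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(bands)
for day, gid in scenes:
    writer.writerow([band_url(day, gid, band) for band in bands])
generated_csv = buffer.getvalue().strip()

# Reading it back shows the two-D shape: rows are scenes, columns are bands
rows = list(csv.DictReader(io.StringIO(generated_csv)))
print(len(rows), list(rows[0].keys()))
```

Because each row is produced from a single scene, every URL in a row refers to the same place and time, which is the property a two-D catalog relies on.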

## Using External Catalogs

A _catalog_ becomes much more powerful when, instead of constructing the DataFrame by hand, we read it from an external source. Here's an extended example of reading a cloud-hosted CSV file containing MODIS scene metadata and transforming it into a _catalog_. The metadata describing the content of each URL is an important aspect of processing raster data.

```python, remote_csv
from pyspark import SparkFiles
from pyspark.sql import functions as F

spark.sparkContext.addFile("https://modis-pds.s3.amazonaws.com/MCD43A4.006/2018-07-04_scenes.txt")

scene_list = spark.read \
    .format("csv") \
    .option("header", "true") \
    .load(SparkFiles.get("2018-07-04_scenes.txt"))
scene_list
```

Observe that the scene list file has URIs to `index.html` files in the `download_url` column. The image URIs are in the same directory, and the filenames are of the form `${gid}_B${band}.TIF`. The next code chunk builds these URIs, which completes our catalog.

```python, show_remote_catalog
modis_catalog = scene_list \
    .withColumn('base_url',
        # strip the trailing "index.html" and append the scene's gid
        F.concat(F.regexp_replace('download_url', r'index\.html$', ''), 'gid')
    ) \
    .withColumn('B01', F.concat('base_url', F.lit("_B01.TIF"))) \
    .withColumn('B02', F.concat('base_url', F.lit("_B02.TIF"))) \
    .withColumn('B03', F.concat('base_url', F.lit("_B03.TIF")))
modis_catalog
```
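
To make the string manipulation concrete, here is the same URI construction in plain Python for a single row. This is only a sketch: the `download_url` and `gid` values are assumed sample values, chosen to be consistent with the MODIS URLs shown earlier, and `band_uri` is a hypothetical helper rather than a RasterFrames function.

```python
import re

def band_uri(download_url: str, gid: str, band: str) -> str:
    """Replace the trailing 'index.html' with '<gid>_<band>.TIF'."""
    base_url = re.sub(r"index\.html$", "", download_url) + gid
    return f"{base_url}_{band}.TIF"

# Assumed sample row from the scene list
download_url = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/index.html"
gid = "MCD43A4.A2018185.h04v09.006.2018194032851"

b01 = band_uri(download_url, gid, "B01")
print(b01)
```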

## Using Built-in Catalogs

RasterFrames comes with two experimental catalogs over the AWS PDS [Landsat 8][Landsat] and [MODIS][MODIS] repositories. They are created by downloading the latest scene lists and building up the appropriate band URI columns as in the prior example.

> Note: The first run may take some time, as the catalogs are large and have to be downloaded. However, they are cached, so subsequent invocations should be faster.

### MODIS

```python, evaluate=False
modis_catalog = spark.read.format('aws-pds-modis-catalog').load()
modis_catalog.printSchema()
```

```
root
 |-- product_id: string (nullable = false)
 |-- acquisition_date: timestamp (nullable = false)
 |-- granule_id: string (nullable = false)
 |-- gid: string (nullable = false)
 |-- B01: string (nullable = true)
 |-- B01qa: string (nullable = true)
 |-- B02: string (nullable = true)
 |-- B02qa: string (nullable = true)
 |-- B03: string (nullable = true)
 |-- B03aq: string (nullable = true)
 |-- B04: string (nullable = true)
 |-- B04qa: string (nullable = true)
 |-- B05: string (nullable = true)
 |-- B05qa: string (nullable = true)
 |-- B06: string (nullable = true)
 |-- B06qa: string (nullable = true)
 |-- B07: string (nullable = true)
 |-- B07qa: string (nullable = true)
```

### Landsat 8

The Landsat 8 catalog includes a richer set of metadata describing the contents of each scene.

```python, evaluate=False
l8 = spark.read.format('aws-pds-l8-catalog').load()
l8.printSchema()
```

```
root
 |-- product_id: string (nullable = false)
 |-- entity_id: string (nullable = false)
 |-- acquisition_date: timestamp (nullable = false)
 |-- cloud_cover_pct: float (nullable = false)
 |-- processing_level: string (nullable = false)
 |-- path: short (nullable = false)
 |-- row: short (nullable = false)
 |-- bounds_wgs84: struct (nullable = false)
 |    |-- minX: double (nullable = false)
 |    |-- maxX: double (nullable = false)
 |    |-- minY: double (nullable = false)
 |    |-- maxY: double (nullable = false)
 |-- B1: string (nullable = true)
 |-- B2: string (nullable = true)
 |-- B3: string (nullable = true)
 |-- B4: string (nullable = true)
 |-- B5: string (nullable = true)
 |-- B6: string (nullable = true)
 |-- B7: string (nullable = true)
 |-- B8: string (nullable = true)
 |-- B9: string (nullable = true)
 |-- B10: string (nullable = true)
 |-- B11: string (nullable = true)
 |-- BQA: string (nullable = true)
```

[MODIS]: https://docs.opendata.aws/modis-pds/readme.html

[Landsat]: https://docs.opendata.aws/landsat-pds/readme.html

[CRS]: https://en.wikipedia.org/wiki/Spatial_reference_system
